
Automating safe, hands-off deployments

Production deployments

Our #1 goal for production deployments at AWS is to prevent negative impact to multiple Regions at the same time and to multiple Availability Zones in the same Region. Limiting the scope of each deployment limits the potential impact on customers from failed production deployments and prevents impact across multiple Availability Zones or Regions. To limit the scope of automatic deployments, we split the production phase of the pipeline into many stages and many deployments to individual Regions. Teams split Regional deployments into even smaller-scoped deployments by deploying to individual Availability Zones or to their service's individual internal shards (called cells) in their pipeline, to further limit the scope of potential impact from a failed production deployment.

Staggered deployments

Each team needs to balance the safety of small-scoped deployments with the speed at which we can deliver changes to customers in all Regions. Deploying changes through the pipeline to 24 Regions or 76 Availability Zones one at a time has the lowest risk of causing broad impact, but it could take weeks for the pipeline to deliver a change to customers globally. We have found that grouping deployments into "waves" of increasing size, as seen in the previous sample prod pipeline, helps us achieve a good balance between deployment risk and deployment speed. Each wave's stage in the pipeline orchestrates deployments to a group of Regions, with changes being promoted from wave to wave. New changes can enter the production phase of the pipeline at any time. After a set of changes is promoted from the first step to the second step in wave 1, the next set of changes from gamma is promoted into the first step of wave 1, so we don't end up with large bundles of changes waiting to be deployed to production.

The first two waves in the pipeline build the most confidence in the change: The first wave deploys to a Region with a low number of requests, to limit the possible impact of the first production deployment of the new change. The wave deploys to only one Availability Zone (or cell) at a time within that Region to cautiously roll the change out across the Region. The second wave then deploys to one Availability Zone (or cell) at a time in a Region with a high number of requests, where it is highly likely that customers will exercise all the new code paths and where we get good validation of the changes.

After we have higher confidence in the safety of the change from the initial pipeline waves, we can deploy to more and more Regions in parallel in the same wave. For example, the previous sample pipeline deploys to three Regions in wave 3, then to up to 12 Regions in wave 4, then to the remaining Regions in wave 5. The exact number and choice of Regions in each of these waves, and the number of waves in a service team's pipeline, depend on the individual service's usage patterns and scale. The later waves in the pipeline still help us achieve our goal of preventing negative impact to multiple Availability Zones in the same Region. When a wave deploys to multiple Regions in parallel, it follows the same cautious rollout behavior for each Region that was used in the initial waves. Each step in the wave only deploys to a single Availability Zone or cell from each Region in the wave.
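As a rough Python sketch of how a wave structure like this might be laid out (the Region names and the parallel-deploy helper are illustrative placeholders, not our internal pipeline definition):

```python
# A minimal sketch of the staggered wave structure described above.
# Region names are placeholders; a real pipeline would pull them from
# a centralized Region list (see "Pipelines as code" below).

def deploy_in_parallel(change: str, regions: list[str]) -> None:
    """Placeholder for a wave stage that deploys to its Regions in parallel."""
    print(f"deploying {change} to {regions}")

WAVES = [
    ["low-traffic-region-1"],                      # wave 1: one low-traffic Region
    ["high-traffic-region-1"],                     # wave 2: one high-traffic Region
    ["region-a", "region-b", "region-c"],          # wave 3: three Regions
    [f"region-{i:02d}" for i in range(1, 13)],     # wave 4: up to 12 Regions
    ["remaining-region-1", "remaining-region-2"],  # wave 5: the rest
]

def promote_through_waves(change: str) -> None:
    for regions in WAVES:
        deploy_in_parallel(change, regions)        # each wave gates the next
```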

One-box and rolling deployments

Each wave's production deployment starts with a one-box stage. As in the gamma one-box stage, each prod one-box stage deploys the latest code to one box (a single virtual machine, a single container, or a small percentage of Lambda function invocations) in each of the wave's Regions or Availability Zones. The prod one-box deployment minimizes the potential impact of changes on the wave by initially limiting the requests that are served by the new code in that wave. Typically, the one-box serves at most ten percent of overall requests for the Region or Availability Zone. If the change causes negative impact in the one-box, the pipeline automatically rolls back the change and does not promote it to the rest of the prod stages.

After the one-box stage, most teams use rolling deployments to deploy to the wave's main production fleet. A rolling deployment ensures that the service has enough capacity to serve the production load throughout the deployment. It controls the rate at which the new code is put into service (that is, when it starts serving production traffic) to limit the impact of changes. In a typical rolling deployment in a Region, at most 33 percent of the service's boxes in that Region (containers, Lambda invocations, or software running on virtual machines) are replaced with the new code at a time.

During a deployment, the deployment system first chooses an initial batch of up to 33 percent of boxes to replace with the new code. During the replacement, at least 66 percent of overall capacity is healthy and serving requests. All services are scaled to withstand losing an Availability Zone in the Region, so we know that the service can still serve the production load at this capacity. After the deployment system determines that a box from the initial batch of boxes is passing health checks, a box from the remaining fleet can be replaced with the new code, and so on. Meanwhile, we still maintain a minimum of 66 percent of capacity to serve requests at all times. To further limit the impact of changes, some teams' pipelines deploy to only five percent of their boxes at a time. They do fast rollbacks, however, where the system replaces 33 percent of the boxes at a time with the previous code, to speed up rollback.
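The following Python sketch illustrates that batching behavior; the fleet operations are stand-in stubs rather than a real deployment system:

```python
import random

def remove_from_load_balancer(box): print(f"draining {box}")
def replace_with_new_code(box): print(f"replacing {box} with the new code")
def passes_health_checks(box): return random.random() > 0.2  # stub health check
def add_to_load_balancer(box): print(f"back in service: {box}")

def rolling_deploy(boxes: list[str], max_fraction: float = 0.33) -> None:
    """Replace boxes batch by batch, keeping at least 66 percent in service."""
    batch = max(1, int(len(boxes) * max_fraction))
    pending = list(boxes)
    in_flight: list[str] = []
    while pending or in_flight:
        # Fill the in-flight batch, never exceeding 33 percent of the fleet.
        while pending and len(in_flight) < batch:
            box = pending.pop(0)
            remove_from_load_balancer(box)
            replace_with_new_code(box)
            in_flight.append(box)
        # As each box passes health checks, return it to service, which
        # frees a slot for the next box from the remaining fleet.
        for box in list(in_flight):
            if passes_health_checks(box):
                add_to_load_balancer(box)
                in_flight.remove(box)

rolling_deploy([f"box-{i}" for i in range(9)])
```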

The following diagram shows the state of a production environment in the middle of a rolling deployment. The new code has been deployed to the one-box stage and to the first batch of the main prod fleet. Another batch has been removed from the load balancer and is being shut down for replacement.

Metric monitoring and automatic rollback

Automated deployments don't have a developer who actively watches each deployment to prod, checks the metrics, and manually rolls back if they see problems. These deployments are completely hands-off. The deployment system actively monitors an alarm to determine whether it needs to automatically roll back a deployment. A rollback reverts the environment back to the container image, AWS Lambda function deployment package, or internal deployment package that was previously deployed. Our internal deployment packages are similar to container images in that the packages are immutable and use a checksum to verify their integrity.

Each microservice in each Region typically has a high-severity alarm that triggers on thresholds for the metrics that impact the service's customers (like error rates and high latency) and on system health metrics (like CPU utilization), as illustrated in the following example. This high-severity alarm is used to page the on-call engineer and to automatically roll back the service if a deployment is in progress. Often, the rollback is already in progress by the time the on-call engineer has been paged and starts engaging.

Example of a high-severity microservice alarm
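As an illustration only (the internal alarm system itself isn't shown in this article), a comparable high-severity alarm could be expressed with Amazon CloudWatch through boto3. The namespace, metric, threshold, and SNS topic below are all assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Sketch of a high-severity alarm on a customer-facing metric.
cloudwatch.put_metric_alarm(
    AlarmName="my-microservice-us-east-1-high-severity",
    Namespace="MyMicroservice",           # hypothetical metric namespace
    MetricName="ErrorRate",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1.0,                        # e.g., alarm above a 1% error rate
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",         # a vanished metric is itself a bad sign
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:page-oncall"],  # hypothetical topic
)
```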

Changes introduced by a deployment can impact upstream and downstream microservices, so the deployment system needs to monitor the high-severity alarm for the microservice under deployment and to monitor the high-severity alarms for the team's other microservices, to determine when to roll back. Deployed changes can also affect the metrics of the continuous canary testing, so the deployment system additionally needs to monitor for failing canaries. To automatically roll back on all of these possible areas of impact, teams create high-severity aggregate alarms for the deployment system to monitor. High-severity aggregate alarms roll up the state of all of the team's individual microservice high-severity alarms and the state of the canary alarms into a single aggregate state, as in the following example. If any of the high-severity alarms for the team's microservices go into the alarm state, all of the team's ongoing deployments across all of their microservices in that Region roll back automatically.

Example of a high-severity aggregate rollback alarm
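One way to picture the roll-up is a CloudWatch composite alarm, which combines the states of other alarms with a rule expression. This is a hedged sketch with invented alarm names, not the internal implementation:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Sketch of an aggregate rollback alarm: it fires if any of the team's
# high-severity microservice alarms or the canary alarm is in ALARM state.
cloudwatch.put_composite_alarm(
    AlarmName="team-us-east-1-aggregate-rollback",
    AlarmRule=(
        'ALARM("service-a-us-east-1-high-severity") OR '
        'ALARM("service-b-us-east-1-high-severity") OR '
        'ALARM("canary-us-east-1")'
    ),
)
```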

A one-box stage serves a small percentage of overall traffic, so issues introduced by a one-box deployment might not trip the service's high-severity rollback alarm. To catch and roll back changes that cause issues in the one-box stage before they reach the rest of the prod stages, one-box stages additionally roll back on metrics that are scoped to only the one-box. For example, they roll back on the error rate of the requests that were served specifically by the one-box, which makes up only a small percentage of the overall number of requests.

Example of a one-box rollback alarm
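A one-box-scoped rollback alarm can be sketched in the same style by pointing the alarm at a metric that covers only the one-box's requests. The "Stage" dimension below is a hypothetical convention for tagging one-box traffic:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Sketch of a rollback alarm scoped to only requests served by the one-box,
# so problems there don't need to move the service-wide metrics to be caught.
cloudwatch.put_metric_alarm(
    AlarmName="my-microservice-us-east-1-one-box-rollback",
    Namespace="MyMicroservice",                          # hypothetical namespace
    MetricName="ErrorRate",
    Dimensions=[{"Name": "Stage", "Value": "one-box"}],  # hypothetical dimension
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1.0,
    ComparisonOperator="GreaterThanThreshold",
)
```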

In addition to rolling back on alarms defined by the service team, our deployment system can also detect and automatically roll back on anomalies in common metrics emitted by our internal web service framework. Most of our microservices emit metrics such as request count, request latency, and error count in a standard format. Using these standard metrics, the deployment system can roll back automatically if there are anomalies in the metrics during a deployment. Examples of this are when the request count suddenly drops to zero, or when the latency or the number of errors becomes much higher than normal.
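A toy version of those anomaly checks might look like the following; the multipliers are invented for illustration, and the real detection is more sophisticated:

```python
# Toy anomaly checks over the standard metrics most services emit.

def should_roll_back(request_count: float, latency_ms: float, error_count: float,
                     baseline_latency_ms: float, baseline_error_count: float) -> bool:
    if request_count == 0:                              # traffic suddenly dropped to zero
        return True
    if latency_ms > 3 * baseline_latency_ms:            # latency much higher than normal
        return True
    if error_count > 3 * max(baseline_error_count, 1):  # errors much higher than normal
        return True
    return False
```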

Bake time

Sometimes a negative impact caused by a deployment is not readily apparent. It is slow burning. That is, it doesn't show up immediately during the deployment, especially if the service is under light load at the time. Promoting the change to the next pipeline stage immediately after the deployment completes can end up impacting multiple Regions by the time the impact surfaces in the first Region. Before promoting a change to the next production stage, each production stage in the pipeline has bake time, which is when the pipeline continues to monitor the team's high-severity aggregate alarm for any slow-burning impact after a deployment is completed and before moving on to the next stage.

When calculating the amount of time we spend baking a deployment, we need to balance the risk of causing a broad impact if we promote changes to multiple Regions too quickly against the speed at which we can deliver changes to customers globally. We have found that a good way to balance these risks is for earlier waves in the pipeline to have a longer bake time while we build confidence in the safety of the change, and then for later waves to have a shorter bake time. Our goal is to minimize the risk of an impact that affects multiple Regions. Because most deployments are not actively watched by a team member, the typical pipeline's default bake times are conservative and deploy a change to all Regions in about four or five business days. Services that are larger or highly critical have even more conservative bake times, and correspondingly longer times for their pipelines to deploy a change globally.

A typical pipeline waits at least one hour after each one-box stage, at least 12 hours after the first Regional wave, and at least two to four hours after each of the remaining Regional waves, with additional bake time for individual Regions, Availability Zones, and cells within each wave. The bake time includes the requirement to wait for a specific number of data points in the team's metrics (for example, "wait for at least 100 requests to the Create API") to ensure that enough requests have occurred to make it likely that the new code has been fully exercised. During the entire bake time, the deployment is automatically rolled back if the team's high-severity aggregate alarm goes into the alarm state.
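A bake-time loop might be sketched like this, where both the minimum wait and the minimum number of data points must be satisfied before promotion; the two helpers are stubs for the team's alarm and metrics:

```python
import time

def alarm_is_firing() -> bool:
    return False   # stub: query the team's high-severity aggregate alarm here

def datapoint_count() -> int:
    return 100     # stub: e.g., count of requests observed since the deployment

def bake(min_seconds: int, min_datapoints: int) -> bool:
    """Return True if the deployment baked cleanly, False to roll back."""
    start = time.time()
    while time.time() - start < min_seconds or datapoint_count() < min_datapoints:
        if alarm_is_firing():   # aggregate alarm went into the alarm state
            return False        # trigger automatic rollback
        time.sleep(60)          # keep watching for slow-burning impact
    return True                 # safe to promote to the next stage
```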

While this is extremely rare, in some cases an urgent change (like a security fix or a mitigation for a large-scale event affecting service availability) needs to be delivered to customers more quickly than the time the pipeline usually takes to bake and deploy changes. In these cases, we can dial down the pipeline's bake time to speed up the deployment, but we require a high level of scrutiny on the change to do so. For these cases we require the scrutiny of the organization's Principal Engineers. The team must review the code change, as well as its urgency and risk of impact, with very experienced developers who are experts at operational safety. The change still goes through the same steps in the pipeline as usual, but gets promoted to the next stage more quickly. We manage the risk of a faster deployment by limiting the changes in flight in the pipeline during this time to only the minimal code changes needed to address the current issue, and by actively watching the deployments.

Alarm and time window blockers

The pipeline prevents automatic deployments to production when there is a higher risk of causing a negative impact. The pipeline uses a set of "blockers" that evaluate deployment risk. For example, automatically deploying a new change to prod when an issue is ongoing in the environment could make the impact worse or prolong it. Before starting a new deployment to any prod stage, the pipeline checks the team's high-severity aggregate alarm to determine whether there are any active issues. If the alarm is currently in the alarm state, the pipeline prevents the change from moving forward. Pipelines can also check organization-wide alarms, like a large-scale event alarm that indicates whether there is broad impact in another team's systems, and prevent starting a new deployment that could add to the overall impact. These deployment blockers can be overridden by developers when a change needs to be deployed to prod to recover from a high-severity issue.
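An alarm blocker check could be sketched as follows, assuming for illustration that the team's aggregate alarm is a CloudWatch composite alarm with an invented name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def deployment_blocked(aggregate_alarm_name: str) -> bool:
    """Before starting a deployment, check whether the aggregate alarm is firing."""
    response = cloudwatch.describe_alarms(
        AlarmNames=[aggregate_alarm_name],
        AlarmTypes=["CompositeAlarm"],
    )
    return any(alarm["StateValue"] == "ALARM"
               for alarm in response.get("CompositeAlarms", []))
```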

The pipeline is also configured with a set of time windows that define when a deployment is allowed to start. When we configure time windows, we need to balance two causes of deployment risk. On the one hand, very small time windows can cause changes to pile up in the pipeline while the time window is closed, which increases the likelihood that any one of those changes will have an impact in the next deployment when the time window opens. On the other hand, very large time windows that go beyond regular business hours increase the risk of prolonging the impact from a failed deployment. Outside of business hours, it takes longer to engage the on-call engineer than during the day, when the on-call engineer and other team members are working. During regular business hours, the team can be engaged more quickly after a failed deployment in case any manual recovery steps are needed.

Most deployments are not actively watched by a team member, so we optimize the timing of deployments to minimize the time it takes to engage an on-call engineer in case manual action is needed to recover after an automatic rollback. It typically takes longer to engage on-call engineers at night, on office holidays, and on weekends, so those times are excluded from the time windows. Depending on the usage patterns of the service, some issues might not surface until hours after the deployment, so many teams also exclude Friday and late-afternoon deployments from their time windows to reduce the risk of needing to engage the on-call engineer at night or during the weekend after a deployment. We have found that this set of time windows enables fast recovery even when manual action is needed, ensures less engagement of on-call engineers outside of regular working hours, and ensures that only a small number of changes get bundled up while the time windows are closed.
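A toy time-window check in the same spirit (the specific days and hours are invented; each team tunes its own windows):

```python
from datetime import datetime

def window_is_open(now: datetime) -> bool:
    """Block weekends and Fridays, and avoid late-afternoon starts."""
    if now.weekday() >= 4:       # Friday (4), Saturday (5), Sunday (6)
        return False
    return 9 <= now.hour < 16    # business hours only, no late starts
```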

Pipelines as code

The typical AWS service team owns many pipelines to deploy the team's multiple microservices and multiple source types (application code, infrastructure code, operating system patches, etc.). Each pipeline has many stages of deployment for an ever-increasing number of Regions and Availability Zones. This translates into a lot of configuration for the team to manage in the pipeline system, the deployment system, and the alarm system, and a lot of effort to keep up to date with the latest best practices and with new Regions and Availability Zones. In the last few years, we have embraced the practice of "pipelines as code" as a way to more easily and consistently configure safe, up-to-date pipelines by modeling this configuration in code. Our internal pipelines-as-code tool pulls from a centralized list of Regions and Availability Zones to easily add new Regions and Availability Zones to pipelines across AWS. The tool also lets teams model pipelines through inheritance, defining configuration that is common across a team's pipelines in a parent class (like which Regions go in each wave and how long the bake time should be for each wave) and defining all microservice pipeline configuration as a subclass that inherits all the common configuration, as in the sketch below.
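A minimal sketch of that inheritance model (the class and repository names are invented; the internal tool itself is not public):

```python
# Common wave and bake-time configuration lives in a parent class; each
# microservice pipeline subclasses it and overrides only what differs.

class TeamBasePipeline:
    waves = [
        ["low-traffic-region-1"],    # wave 1
        ["high-traffic-region-1"],   # wave 2
    ]
    bake_time_hours = [12, 4]        # per-wave bake times

class OrdersServicePipeline(TeamBasePipeline):
    source_repository = "orders-service"            # service-specific bits only

class OrdersInfrastructurePipeline(TeamBasePipeline):
    source_repository = "orders-infrastructure"
```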

Conclusion

At Amazon, we've evolved our automated deployment practices over time based on what helps us balance deployment safety against deployment speed. At the same time, we want to minimize the time that developers need to spend attending to deployments. By building automated deployment safety into the release process through extensive pre-production testing, automatic rollbacks, and staggered production deployments, we can minimize the potential impact to production caused by deployments. This means that developers don't need to actively watch deployments to production.

With fully automated pipelines, developers use code reviews to check their code and also to approve that the change is ready to go to production. After the change is merged into the source code repository, the developer can move on to the next task and forget about the deployment, trusting the pipeline to take their change to production safely and cautiously. The automated pipeline takes care of deploying continuously to production multiple times a day, while balancing safety and speed. By modeling our continuous delivery practice in code, it is easier than ever for AWS service teams to set up their pipelines to deploy their code changes automatically and safely.

About the author

Clare Liguori is a Principal Software Engineer at AWS. Her current focus is the developer experience for AWS Container Services, building developer tools at the intersection of containers and the software development lifecycle: local development, infrastructure as code, CI/CD, observability, and operations.