Saturday, December 24, 2011

Scaling a Portion of Xbox Live on Azure

It's been a crazy couple of weeks here on the Xbox Live team. We publicly released a major update to the console on the 6th, and a bunch of new apps went live during each of the following weeks. My sub-team within the broader Live team builds our services on Windows Azure. The scalability of Windows Azure has been a real asset in meeting the high demand brought on by the new releases, but there have also been some challenges.

Azure allows us to not worry about managing hardware. When we needed to bring up additional instances of our roles (for those unfamiliar with Azure, this is analogous to adding VMs to the system) to meet demand, it was just a matter of a 15-second adjustment to a config, and everything else was handled automatically by Azure within minutes. Because Azure performs upgrades using upgrade domains (i.e., no more than a certain percentage of the VMs is taken down at a given time), updates happen without downtime. There are a couple of important lessons we learned about working with Azure from this experience.
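For readers who haven't used Azure: the scale-out knob is just the instance count in the service configuration file. Roughly, the change looks like the fragment below; the service and role names are placeholders, not our actual services.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Illustrative .cscfg fragment: scaling out a role is a matter of raising
     this count and pushing the updated configuration. Names are made up. -->
<ServiceConfiguration serviceName="ExampleService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="ExampleWorkerRole">
    <Instances count="12" />
  </Role>
</ServiceConfiguration>
```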

  1. Include consideration for Azure capacity limits in our scalability plan. Azure is built on clusters of hardware, and services run within a multi-tenant cluster of limited size. Because of this, it is possible to reach capacity within a particular cluster. The Azure team works to ensure all tenants in a cluster have room to grow, but some services are just too large, and that was the case for us. To deal with it, the Azure team worked with us to provide the additional capacity we'll need for the near term. In the longer term, however, we may need to run across multiple clusters, which means running additional instances of our services. The challenge we foresee is routing requests to the various endpoints. This could be solved with a front-door service, but if that service ever has to scale beyond a single cluster, we're back in the same situation. A better option may be an endpoint look-up service that splits traffic accordingly (the first sketch after this list shows the idea), though it may eventually hit the same scalability limits. At that point, we'd need to either work with Azure to figure out how to handle it or do some clever DNS tricks.
  2. Plan ahead for the correct number of upgrade domains. In our system, we retained the number of upgrade domains used when we originally deployed. As we began to scale, we realized that number was too small: too many role instances were taken down at one time during upgrades. Keep in mind that when some percentage of role instances is taken down, the traffic that would otherwise be directed to those instances is redirected to the remaining live instances, which means the remaining instances must be able to absorb that flash of additional load without falling over (the second sketch after this list puts numbers on this). Something else to be aware of: changing the number of upgrade domains requires deploying a new build rather than a simple config change.
  3. Prepare for doing DNS swaps. Sometimes in-place upgrades and VIP swap upgrades aren't possible. There are a couple of changes we have made to our services in the past few months for which the standard upgrade procedures facilitated by Azure are not allowed; one such case is adding new endpoints to a service. A look-up service is a good solution to this problem since it allows us to redirect clients to updated service instances. But what happens when we need to update the look-up service itself? In those situations we do a DNS swap: we bring up a second instance of the target service and change the DNS records so client traffic is routed to the new instance. One important thing to keep in mind is that it's a good idea to have an additional service slot ready to deploy to, since we wouldn't want to set that up under the pressure of dealing with scalability issues or other problems.
  4. Account for the flood of traffic caused by a VIP swap or DNS swap. VIP swaps and DNS swaps can cause a set of endpoints to see traffic go from non-existent to very high in a very short time. One way to handle this is to seed the new service; for example, if caching is a concern, pre-populate the cache. This isn't always possible, though, so an alternative is to redirect traffic slowly, in stages, giving the new service a chance to ramp up before it takes the entirety of the traffic (the last sketch after this list shows the idea). One important benefit of staging the redirect is that it softens the effect of the traffic spike on whatever persistent storage system is being used.
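To make the look-up idea from item 1 concrete, here's a minimal sketch. The endpoint URLs and the hashing scheme are invented for illustration; the point is simply that a small service (or library) can deterministically map a caller onto one of several deployments running in different clusters.

```python
# Minimal sketch of an endpoint look-up that spreads callers across service
# deployments living in different clusters. The URLs are placeholders.
import hashlib

CLUSTER_ENDPOINTS = [
    "https://example-service-a.cloudapp.net",
    "https://example-service-b.cloudapp.net",
]

def endpoint_for(caller_id):
    """Deterministically map a caller to one deployment so a given client
    always lands on the same cluster."""
    digest = hashlib.md5(caller_id.encode("utf-8")).hexdigest()
    return CLUSTER_ENDPOINTS[int(digest, 16) % len(CLUSTER_ENDPOINTS)]

print(endpoint_for("player-12345"))
```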
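The upgrade-domain concern in item 2 is easy to put numbers on. If N instances are spread over U upgrade domains, roughly N/U of them go down at once during an upgrade, so each survivor temporarily absorbs about N/(N - N/U) times its normal load. A quick back-of-the-envelope helper (the instance counts below are just examples):

```python
import math

def surge_factor(instance_count, upgrade_domains):
    """Approximate per-instance load multiplier while one upgrade domain
    (the largest, if they're uneven) is offline."""
    down = math.ceil(instance_count / upgrade_domains)
    return instance_count / (instance_count - down)

# 20 instances in 5 upgrade domains: 4 go down at once, survivors take 1.25x load.
print(surge_factor(20, 5))   # 1.25
# The same 20 instances in 10 upgrade domains soften the spike to about 1.11x.
print(surge_factor(20, 10))  # ~1.11
```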
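And for item 4, the staged redirect boils down to sending a growing fraction of requests to the new deployment instead of flipping everyone over at once. A rough sketch of the idea (the ramp percentages and the routing hook are made up):

```python
import random

# Hypothetical ramp schedule: the fraction of traffic sent to the new
# deployment at each stage of the cut-over.
RAMP_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]

def route(ramp_fraction, old_endpoint, new_endpoint):
    """Divert a fraction of requests to the new deployment so its caches and
    connections warm up before it takes the full load."""
    return new_endpoint if random.random() < ramp_fraction else old_endpoint

# At the 5% stage, roughly 1 in 20 requests lands on the new deployment.
sample = [route(0.05, "old", "new") for _ in range(10000)]
print(sample.count("new"))  # ~500
```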
There are several additional lessons we learned from our use of Azure Storage. These lessons were primarily about understanding how our system consumes storage and what our user traffic looks like, and architecting the system accordingly. The properties of Azure Storage are fairly well publicized, and it is important to keep Azure's behaviors and constraints in mind during the design of the target system. To be clear, this isn't meant to imply the waterfall development model should be used, with all the design done up-front, but rather that this information should be kept in mind whenever thinking about design.

  1. Know what assumptions are being made and their potential impact. Many assumptions are made during the initial iterations of building a system: what future client traffic patterns will look like, what level of scale needs to be reached, which parts of the system will be exercised most frequently, and so on. These assumptions should be identified as early as possible and challenged with what-if scenarios. We made some incorrect assumptions about the level of scale certain parts of our system would see. In those areas we were using reverse-timestamps as the prefix for our partition keys when accessing Azure Storage. This is a common pattern, since having the most recent items at the top of the table is useful for processing the data. However, Azure Storage doesn't handle sequential access patterns well, and requests get throttled as the scale increases. The best option is to randomize the partition key somehow, but that isn't always possible, depending on the requirements around reading the data. In our case, it made more sense to use a bucketing mechanism that divides the traffic into several buckets and, within each bucket, uses the reverse-timestamp (the first sketch after this list shows the shape of such a key). As traffic increases, so does the number of buckets, so while we still get the desired sort order, we are no longer bottlenecked by the storage limitation.
  2. Design the system's storage access patterns. This is directly related to the above item. To know how the system will behave when traffic starts flowing in, one must be conscious of the storage access patterns it uses. We must take great care to design our storage access patterns so that they don't become a bottleneck in our system.
  3. Understand where concurrency is an issue. Concurrency has been a recurring problem for us since we do a lot of asynchronous processing. The major issue is a typical one in distributed systems: agreement and dealing with failures. One significant limitation in Azure Storage is the throughput of an individual message queue, which is about 500 requests per second. Clearly, this is insufficient for a large-scale distributed system on its own. The obvious ways to increase queue throughput are to pack multiple messages into a single message or to use multiple queues. However, packing multiple requests means risking data loss while they accumulate, which is often unacceptable, and using multiple queues requires agreement (most likely via something like Paxos) about which processors should access which queues. It might seem straightforward to have processors round-robin across all the queues, but we found that this can overload individual queues if the workers synchronize in their round-robin selection, that the same message can be retrieved many times since Azure only guarantees at-least-once delivery, and that building a proper queue-management system is a fair amount of work with many problems of its own (the last sketch after this list shows a couple of simple mitigations). In the end, we decided to remove Azure queues from our system, since the guarantees they provide beyond those of blob and table storage are unnecessary for us. We do, however, expect to come back and re-evaluate queues at a later time. To summarize, the following led us away from queues: messages can be processed multiple times, message order isn't guaranteed, and there is a low per-queue scale limit. This means the processors must be idempotent, must process correctly regardless of order, and require additional mechanisms to scale, so queues don't really save us any work for our particular case.
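To show the shape of the bucketing scheme from item 1: the partition key gets a bucket prefix derived from the entity, and the reverse-timestamp only has to keep things sorted within that bucket. The bucket count and key format below are illustrative, not our production values.

```python
import time
import zlib

BUCKET_COUNT = 16  # grows as traffic grows; illustrative value only

def partition_key(entity_id, timestamp=None):
    """Prefix the reverse-timestamp with a bucket derived from the entity id
    so writes spread across BUCKET_COUNT key ranges instead of all landing on
    the single "most recent" end of the table."""
    timestamp = time.time() if timestamp is None else timestamp
    bucket = zlib.crc32(entity_id.encode("utf-8")) % BUCKET_COUNT
    # Reverse timestamp (in 100ns ticks): newest items sort first in a bucket.
    reverse_ticks = (2**63 - 1) - int(timestamp * 10**7)
    return "{:02d}-{:019d}".format(bucket, reverse_ticks)

print(partition_key("player-12345"))
```

Reading the most recent items then means issuing one range query per bucket and merging the results, which is the price paid for spreading the write load.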
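Finally, for item 3, two of the queue problems (workers marching over the queues in lock-step, and at-least-once redelivery) have simple mitigations: randomize each worker's queue order and make processing idempotent. The sketch below uses a small in-memory stand-in for the queue client, since the point is the access pattern, not the Azure API.

```python
import collections
import random
import uuid

# In-memory stand-in for a queue client; the real thing would be the queue
# service's API. It exists only so the sketch runs on its own.
Message = collections.namedtuple("Message", ["id", "body"])

class FakeQueueClient:
    def __init__(self, names):
        self.queues = {name: collections.deque() for name in names}
    def put_message(self, name, body):
        self.queues[name].append(Message(str(uuid.uuid4()), body))
    def get_message(self, name):
        return self.queues[name][0] if self.queues[name] else None
    def delete_message(self, name, msg):
        if self.queues[name] and self.queues[name][0].id == msg.id:
            self.queues[name].popleft()

QUEUE_NAMES = ["work-{}".format(i) for i in range(8)]
processed_ids = set()  # in practice this would live in durable storage

def drain_once(client):
    """One pass over the queues. Shuffling the order keeps workers from
    hitting the queues in lock-step, and the processed-id check makes
    redelivered messages (at-least-once delivery) harmless."""
    order = list(QUEUE_NAMES)
    random.shuffle(order)
    for queue in order:
        msg = client.get_message(queue)
        if msg is None:
            continue
        if msg.id not in processed_ids:    # idempotency guard
            print("processing", msg.body)  # application work goes here
            processed_ids.add(msg.id)
        client.delete_message(queue, msg)

client = FakeQueueClient(QUEUE_NAMES)
client.put_message("work-3", "example work item")
drain_once(client)
```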