Saturday, December 31, 2011

The On-Campus Interview Process at Microsoft

At Microsoft, on-campus interviews for external candidates typically go roughly as follows (internal transfers between teams require interviews too, but the process is slightly different):

  1. Check in with the recruiter managing the interview process
  2. Meet with the hiring manager for a short conversation about the team and an initial interview
  3. Interview with about 4 members of the team
  4. If all goes well, interview with an "as-appropriate" interviewer
  5. Meet with the hiring manager and/or recruiter again to wrap things up
The first conversation of the day, with the recruiter, is light and informational. The recruiter will let the candidate know what the day's schedule looks like and possibly chat about the team and the location (e.g. Redmond, Silicon Valley, Charlotte). They will often offer a few useful tips: try to relax and have fun, ask questions to get a feel for the team, and so on.

The first conversation with the actual team is usually with the hiring manager. They will often tell you about the team and warm you up with conversation about yourself. Some hiring managers will conduct actual technical interviews at this time and others will just get a feel for the candidate's cultural fit, leaving it to the remainder of the interviewers to test the candidate's technical skills.

The interviews with the actual team will definitely involve some technical interviewing. For software engineers, this means getting up in front of a whiteboard and working out challenging design or coding problems. Typically, each interviewer will be evaluating a particular skill set (algorithms, design, coding, etc.). If a candidate doesn't meet expectations in an earlier interview, a later interviewer may assess one or more of the same areas, since most interviewers acknowledge that nervousness can affect performance during interviews. One may also notice that interviews often start off with 5 to 10 minutes of friendly conversation. This helps the candidate feel comfortable, in the hope that they won't be as nervous during the technical portion of the conversation.

What kind of whiteboard questions are asked? This varies greatly by interviewer. Over time, most interviewers develop a methodology for assessing candidates. For testing technical capability, I personally like to start with a simple coding question, followed by a more complex coding question, followed by an open-ended design problem. I sometimes swap the order of the second coding question and the open-ended design problem so that I can base the coding question on the candidate's response to the design problem. Whatever the case, I try to be consistent to make a fair comparison among candidates. Each of the interviews with the team is about an hour long, so there is definitely a time squeeze when trying to get through these three problems.

Most technical interviewers will also take some time for two additional things. The first is to find out a bit about the candidate's characteristics. This frequently falls into those first 5 to 10 minutes of friendly conversation. The second is to let the candidate ask questions about the team, product, and/or company. The candidate's opportunity to ask questions almost universally comes in the last 5 to 10 minutes of the interview. After the hour is up, the candidate is handed off to the next interviewer.

One of the team interviews will be over lunch and will be less technical in nature (no, you don't have to code while eating). I have seen some interviewers discuss a design problem over lunch, but most will have a deeper conversation to learn about the candidate's character and what they are looking for career-wise. Clearly, there are no right or wrong answers here. Just be honest, reflective, and thoughtful. Always remember that it's best to identify a poor fit as early as possible. Adding someone who is a poor fit will not only damage the team, it will also harm the candidate, who will suffer career stagnation and be generally unhappy in the job.

The as-appropriate interviewer is someone who has been a Microsoft employee for a long time and who has more hiring experience than the other interviewers. This interviewer usually focuses on the candidate's long-term potential at Microsoft. Sometimes this interview is technical, but it is almost always heavily focused on the candidate's characteristics. Unfortunately, it can be pretty hard to tell if you're interviewing with an as-appropriate interviewer, but if you are, it's a good sign. This interviewer is called "as-appropriate" because they don't participate in the interviews if the team doesn't feel positively about the candidate.

To wrap things up with the team, the candidate will often chat briefly with the hiring manager again. This is usually just to discuss how the candidate felt about the day, answer any final questions the candidate might have, and possibly sell the candidate on the team. Depending on the situation, candidates may also have a final chat with someone higher up the hierarchy than the hiring manager. If this happens, it is usually because the team already decided that they like the candidate and really want to sell the candidate on the team. As with the as-appropriate interviewer, this is a good sign.

Sometimes, candidates will also meet with the recruiter at the end of the day. The recruiter will probably have a few questions about how the day went and might offer some suggestions of things to do while visiting the area if the candidate is from out of town.

Although it is reasonable to be eager for feedback, don't expect any indication of the results until the recruiter follows up. The team will make a final decision within a few hours of completing the interviews and may have other candidates to follow up with. Also, don't expect any information as to why a hiring decision was or wasn't made. Again, it is understandable why one would want additional information, but for legal reasons, it is inappropriate to discuss the reasoning behind the decision, and such information simply won't be provided.

Candidates are typically notified of the final outcome of the interview by phone within two weeks. If the team decides to extend an offer, the recruiter will follow up the phone call with an offer letter, which will include details such as salary. If the fit wasn't right, this will be the final contact, but remember: if you feel Microsoft is an important part of the path to your goals, be persistent and keep trying!

For an example of a technical interview (one of the interviews from step 3 of the day), see this post.

Saturday, December 24, 2011

Scaling a Portion of Xbox Live on Azure

It's been a crazy couple of weeks here on the Xbox Live team. We publicly released a major update to the console on the 6th, and a bunch of new apps have gone live in the weeks since. My sub-team within the broader Live team builds our services on Windows Azure. The scalability of Windows Azure has been a real asset in meeting the high demand brought on by the new releases, but there have also been some challenges.

Azure allows us to not worry about managing hardware. When we needed to bring up additional instances of roles to meet demand (for those unfamiliar with Azure, this is analogous to adding VMs to the system), it was just a matter of a 15-second config adjustment, and everything else was handled automatically by Azure within minutes. Because Azure uses upgrade domains (i.e. no more than a certain percentage of the VMs are taken down at a given time) to perform upgrades, updates happen without downtime. There are several important lessons we learned about working with Azure from this experience.

  1. Include Azure capacity limits in the scalability plan. Azure is built on clusters of hardware, and services run within a multi-tenant cluster of limited size. Because of this, it is possible to reach capacity within a particular cluster. The Azure team works to ensure all tenants in a cluster have room to grow, but some services are just too large; ours was one of them. To deal with this, the Azure team worked with us to provide the additional capacity we'll need for the near term. In the longer term, however, we may need to run across multiple clusters, which means running additional instances of our services. The challenge we foresee is routing requests to the various endpoints. This could be solved with a front-door service, but if that has to scale beyond a single cluster, we're back in the same situation. A better solution may be an endpoint look-up service that splits traffic accordingly. However, this may also hit scalability limits. At that point, we'd need to either work with Azure to figure out how to handle it or do some clever DNS tricks.
  2. Plan ahead for the correct number of upgrade domains. In our system, we retained the number of upgrade domains used when we originally deployed. As we began to scale, we realized that the number of upgrade domains we were using was too small: too many role instances were taken down at one time during upgrades. Keep in mind that when some percentage of role instances are taken down, the traffic that would otherwise be directed to those instances is redirected to the remaining live instances. This means the remaining instances must be able to support that flash of additional load without falling over (e.g. with five upgrade domains, 20% of the instances go down at once and each of the remaining 80% must absorb roughly 25% more traffic). Something else to be aware of is that changing the number of upgrade domains requires deploying a new build rather than a simple config change.
  3. Prepare for doing DNS swaps. Sometimes in-place upgrades and VIP swap upgrades aren't possible. There are a couple of changes we have made to our services in the past few months for which the standard upgrade procedures facilitated by Azure are not allowed. One such case is the addition of new endpoints to our services. A look-up service is a good solution to this problem since it allows us to redirect clients to updated service instances. However, what happens when we need to update the look-up service itself? In such situations, a DNS swap is applied: we bring up a second instance of the target service and change the DNS records so client traffic is routed to the new instance. One important thing to keep in mind is to have an additional service slot ready to deploy to, since you don't want to set one up while under the pressure of dealing with scalability issues or other problems.
  4. Account for the flood of traffic caused by a VIP swap or DNS swap. VIP swaps and DNS swaps can cause a set of endpoints to see traffic go from non-existent to very high in a very short time. One way to handle this is to seed the new service; for example, if caching is a concern, pre-populate the cache. This is not always possible, though, so an alternative is to redirect traffic slowly in stages, giving the new service a chance to ramp up before it receives the entirety of the traffic (a sketch of this idea follows this list). One important benefit is that this reduces the effect of the traffic spike on whatever persistent storage system is being used.
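To illustrate the staged redirect from item 4, here is a minimal sketch in Python. The endpoint names and the make_weighted_router helper are made up purely for illustration; in a real system the weighting would live in the DNS or routing layer rather than in application code.

    import random

    def make_weighted_router(old_endpoint, new_endpoint, new_fraction):
        """Return a routing function that sends new_fraction of requests
        to the new deployment and the remainder to the old one."""
        def route():
            return new_endpoint if random.random() < new_fraction else old_endpoint
        return route

    # Ramp up in stages, pausing at each stage to let caches warm and to
    # watch error rates and latency before shifting more traffic.
    for fraction in (0.05, 0.25, 0.50, 1.0):
        route = make_weighted_router("old.service.example", "new.service.example", fraction)
        sample = [route() for _ in range(1000)]
        print("stage %.2f -> %.3f of traffic on new service"
              % (fraction, sample.count("new.service.example") / 1000.0))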
There are several additional lessons we learned from our use of Azure Storage. These lessons were primarily about understanding how our system consumes storage and what our user traffic looks like, and architecting the system accordingly. The properties of Azure Storage are fairly well publicized, and it is important to keep Azure's behaviors and constraints in mind during the design of the target system. To be clear, this isn't meant to imply the waterfall development model, with all the design done up-front, but rather to keep the information in mind whenever thinking about design.

  1. Know what assumptions are being made and their potential impact. Many assumptions are made during the initial iterations of building a system. Some of these include what future client traffic patterns will look like, what level of scale needs to be reached, what parts of the system will be exercised most frequently, etc. These assumptions should be identified as early as possible and challenged with what-if scenarios. We made some incorrect assumptions about the level of scale certain parts of our system would see. In those particular areas, we were using reverse-timestamps as a prefix for our partition keys when accessing Azure Storage. This is a common scenario, as having the most recent items at the top of the table is useful for processing the data. However, sequential access patterns aren't handled well by Azure Storage, causing requests to get throttled as the scale increases. The best option is to randomize the partition key somehow. This is not always possible, though, depending on the requirements around reading data. In our case, it made more sense to use a bucketing mechanism that divides the traffic into several buckets and, within each bucket, uses the reverse-timestamp (see the sketch after this list). As traffic increases, so does the number of buckets, so that while we still end up with the desired sort order, we are no longer bottlenecked by the storage limitation.
  2. Design the system's storage access patterns. This is directly related to the above item. To know how the system will behave when traffic starts flowing in, one must be conscious of the storage access patterns it uses. We must take great care to design our storage access patterns so that they don't become a bottleneck in our system.
  3. Understand where concurrency is an issue. Concurrency has been a recurring problem for us since we do a lot of asynchronous processing. The major issue is a typical one in distributed systems: agreement and dealing with failures. One significant limitation in Azure Storage is the throughput of an individual Azure Storage message queue, which is about 500 requests per second. Clearly, this is insufficient for a large-scale distributed system. The obvious solutions for increasing queue throughput are to pack multiple messages into a single message or to use multiple queues. However, packing multiple requests means risking data loss while they accumulate, which is often unacceptable, and using multiple queues requires agreement about which processors should access which queues (most likely via a consensus protocol such as Paxos). It might seem straightforward to say processors should do round-robin access of all queues, but we found that this can overload the queues if the workers synchronize in their round-robin selection, can cause retrieval of the same message many times (Azure only guarantees messages are delivered at least once), and can require a fair amount of work to develop a queue management system, which has many problems of its own. In the end, we decided to remove Azure queues from our system, since the guarantees they provide beyond those of blob and table storage are unnecessary for us. We do, however, expect to re-evaluate the use of queues at a later time. To summarize, the following led us away from queues: messages can be processed multiple times, message order isn't guaranteed, and there is a low per-queue scale limit. This means the processors must be idempotent, must process correctly regardless of order, and require additional mechanisms to scale. Thus, queues don't really save us any work for our particular case.
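As a rough illustration of the bucketing scheme from item 1, here is a sketch in Python. The key format, tick range, and partition_key helper are hypothetical rather than our production code; the point is simply that a stable hash spreads writes across buckets while the reverse timestamp keeps the newest rows first within each bucket.

    import time
    import zlib

    MAX_TICKS = 10**19  # illustrative upper bound for the timestamp range

    def partition_key(entity_id, num_buckets):
        """Build a bucketed reverse-timestamp partition key."""
        # A stable hash spreads writes across num_buckets key ranges,
        # avoiding the purely sequential pattern that gets throttled.
        bucket = zlib.crc32(entity_id.encode("utf-8")) % num_buckets
        # Within a bucket, the reverse timestamp sorts newest rows first.
        reverse_ts = MAX_TICKS - time.time_ns()
        # Zero-pad both parts so lexicographic order matches numeric order.
        return "%04d-%020d" % (bucket, reverse_ts)

    # Readers scan each bucket's key range and merge the already-sorted
    # results; as traffic grows, num_buckets grows with it.
    print(partition_key("user-42", num_buckets=8))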

Tuesday, December 6, 2011

Xbox Gets Voice Control, Search, and More

Congrats to the Xbox org and the Xbox Live services teams at Microsoft! Today, we publicly released the latest updates to the console and the backend services supporting it. With the update comes a plethora of cool technologies like voice control, search, and more.

Xbox is in a great position to lead a revolution in the way we consume and interact with media. I believe this revolution is already in progress, and it's not a unique opinion (see here). However, it is possible that I've just drunk the Xbox Kool-Aid. Whatever the case, I'm excited about the future of interactive media and computing in the living room.

Television has offered the same consumption model for decades. Over roughly the last 10 years, the likes of TiVo, Netflix, Amazon, Hulu, and others have moved us in a new direction, yet I still see most people tuning into their favorite shows at the network-mandated day and time. What's worse is that we consume the media passively. There's limited interaction, whether with peers, other audience members, or the collective community. Why can't you chat with other viewers during the show? Actually, you can, but not in a seamless, appealing way. You can vote on shows like American Idol, but you have to send a text message from your phone. Why isn't this more fluid? We should be able to say the name of our favorite contestant a la Kinect to cast our vote at the end of the show.

I think Xbox and Microsoft are in a great position to change this. Amazing technologies have been released over the past couple of years and there are many more to come. Xbox is in tens of millions of living rooms with a large portion of those people subscribed to Xbox Live. Live provides access to some powerful services including a marketplace where developers can submit apps for consumers to download. There are cutting edge input mechanisms such as Kinect's skeletal tracking and voice recognition. The possibilities for innovation are boundless and startups are already working on creative ideas. In fact, Microsoft and TechStars are joining forces to co-sponsor a Kinect-based startup accelerator program.

With the stage set like this, I think it's safe to say that we have a very exciting year ahead!

Friday, December 2, 2011

Regular Expressions Rock!

A colleague at work recently cobbled together the following regular expression, but wanted help to better understand it.

^(?:(?!(core\.windows\.net)).)*$

This regex is intended to match anything that is not a Windows Azure Storage URL. It essentially looks for strings that do not contain "core.windows.net". Here's a breakdown of the regex from the inside out:

  1. \.: Character match. Matches '.'. The '\' escapes the following character, so the '.' is matched literally instead of acting as a wildcard.
  2. core\.windows\.net: Word match. Matches the text "core.windows.net".
  3. (core\.windows\.net): Text grouping. Treats the "core.windows.net" text as a group.
  4. (?!(core\.windows\.net)): Negated look-ahead assertion. From the point of evaluation, fails the match if the text that follows is "core.windows.net".
  5. (?:(?!(core\.windows\.net)).): A non-capturing match-all. The "(?:" means that the contents of the grouping, which would normally be made available for retrieval after evaluation, are not captured for retrieval. The '.' matches any character, unlike the escaped "\.", which matches only a '.' character.
  6. (?:(?!(core\.windows\.net)).)*: Repeats the evaluation any number of times.
  7. ^(?:(?!(core\.windows\.net)).)*$: Anchors the expression to the full line. '^' denotes the start of a line and '$' denotes the end, so the line matches only if every character from start to end passes the negated look-ahead, i.e. only if "core.windows.net" appears nowhere in the line.
The key to understanding this is to really understand how the negated look-ahead works. Here's an excellent reference for regexes. The reference is specifically for Perl, but the concepts are nearly the same across all systems that support regexes. A good question is why we can't just write something that directly fails when a body of text contains the target text. The reason is that regular expressions do not support such a simple negated expression, so the regex shown here achieves that functionality through alternative means.
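As a quick sanity check, the pattern can be exercised in any regex engine that supports negative look-ahead; here is one way to try it in Python (the test URLs are made-up examples):

    import re

    # Matches only lines that do NOT contain "core.windows.net".
    pattern = re.compile(r"^(?:(?!(core\.windows\.net)).)*$")

    tests = [
        "http://myaccount.blob.core.windows.net/container",  # contains the text
        "http://example.com/some/path",                      # does not
    ]
    for s in tests:
        print(s, "->", bool(pattern.match(s)))
    # http://myaccount.blob.core.windows.net/container -> False
    # http://example.com/some/path -> True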

This regex is interesting because it requires a strong understanding of how regexes work versus a simple regex like ^abc$ (which matches lines containing exactly the text "abc"). To really understand it, one must understand the implications of the behavior of the negated look-ahead, which does not "consume" text. Likewise, the non-capturing grouping is about memory efficiency rather than having anything to do with the correctness of the expression.

Regular expressions are a very powerful tool for matching and retrieving data from text. They are usually concise, replacing lots of custom parsing with one clever expression. They are also easy to test in the same way the custom parsing would be tested. That is, pass them various texts and check for the expected output.

Thinking about the problem above actually leads to a great interview question: "Write a function to evaluate a set of strings to determine whether they contain <some complex pattern>." An example complex pattern is valid email addresses. Clearly, it's possible to code up some custom parsing or devise a regex (see the sketch below). Either way, the candidate will need to do some abstract thinking, identify edge cases, test for correctness, and explain why their solution works. They should also be able to do this relatively quickly. If the candidate evaluates the trade-offs of the different methods (regexes are not always simpler), it shows practicality in their approach to problem solving. Familiarity with regexes is also a good sign, as it shows they are knowledgeable about technologies that are not often taught in schools.
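For the email-address variant, a candidate's answer might look something like this sketch in Python. The pattern is deliberately simplified for illustration (full RFC 5322 address validation admits many more forms), and the function name is arbitrary:

    import re

    # Simplified email pattern for illustration; real-world validation
    # accepts many address forms this pattern rejects.
    EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

    def matching_strings(strings, pattern=EMAIL_RE):
        """Return the subset of strings that match the given pattern."""
        return [s for s in strings if pattern.match(s)]

    print(matching_strings(["alice@example.com", "not an email", "bob@sub.example.org"]))
    # ['alice@example.com', 'bob@sub.example.org']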