January 16, 2006

Developing and managing Hotmail

This interview has already been picked up and commented upon (and /.’ed), but if you have not yet taken a look, I recommend reading this ACM piece on Hotmail and on what it means to manage one of the largest services on the web. Hotmail runs on 10,000 servers, involves several petabytes of storage (i.e., millions of gigabytes), and serves, according to this Wikipedia article, 221M users who generate billions of e-mail transactions daily. It is operated by 100 sysadmins, or about 100 servers per administrator, which is not that large a team for a service of this size.

Phil Smoot, the PM in charge of Hotmail product development at the Microsoft Silicon Valley campus, shares a number of insights. I noted the following points regarding automation, versioning, capacity planning, impact analysis and QA:

  • QA is a challenge in the sense that mimicking Internet loads on our QA lab machines is a hard engineering problem. The production site consists of hundreds of services deployed over multiple years, and the QA lab is relatively small, so re-creating a part of the environment or a particular issue in the QA lab in a timely fashion is a hard problem. Manageability is a challenge in that you want to keep your administrative headcount flat as you scale out the number of machines.
  • [...] if you can manage five servers you should be able to manage tens of thousands of servers and hundreds of thousands of servers just by having everything fully automated—and that all the automation hooks need to be built in the service from the get-go. Deployment of bits is an example of code that needs to be automated. You don’t want your administrators touching individual boxes making manual changes. But on the other side, we have roll-out plans for deployment that smaller services probably would not have to consider. For example, when we roll out a new version of a service to the site, we don’t flip the whole site at once.
  • We do some staging, where we’ll validate the new version on a server and then roll it out to 10 servers and then to 100 servers and then to 1,000 servers—until we get it across the site. This leads to another interesting problem, which is versioning: the notion that you have to have multiple versions of software running across the sites at the same time. That is, version N and N+1 clients need to be able to talk to version N and N+1 servers and N and N+1 data formats. That problem arises as you roll out new versions or as you try different configurations or tunings across the site.
  • The big thing you think about is cost. How much is this new feature going to cost? A penny per user over hundreds of millions of users gets expensive fast. Migration is something you spend more time thinking about over lots of servers versus a few servers. For example, migrating terabytes worth of data takes a long time and involves complex capacity planning and data-center floor and power consumption issues. You also do more up-front planning around how to go backwards if the new version fails.
  • We strive to build tools that can replay live-site transactions and real-time live-site loads against single nodes. The notion is that the application itself is logging this data on the live site so that it can be easily consumed in our QA labs. Then as applications bring in new functionalities, we want to add these new transactions to the existing test beds.
  • The notion of tape backups is probably no longer feasible. Building systems where we’re just backing up changes—and backing them up to cheap disks—is probably much more where we’re headed. How you can do this in a disconnected fashion is an interesting problem. That is, how are you going to protect the system from viruses and software and administrative scripting bugs? What you’ll start to see is the emergence of the use of data replicas and applying changes to those replicas, and ultimately the requirement that these replicas be disconnected and reattached over time.
  • As you go to, let’s say, a commodity model, you have to assume that everything is going to fail underneath you, that you have to deal with these failures, that all the data has to be replicated, and that the system essentially self-heals. For example, if you are writing out files, you put a checksum in place that you can verify when the file is read. If it wasn’t correct, then go get the file somewhere else and repair the old file.
  • Last word: If you rely on scale up, you’ll probably get killed. You should always be relying on scale out.
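
The checksum-and-repair idea in the self-healing point above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Hotmail implements it: the sidecar `.sha256` file, the choice of SHA-256, and the function names `write_with_checksum`, `is_valid` and `read_with_repair` are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def _sidecar(path: Path) -> Path:
    # Hypothetical convention: the digest lives next to the file as <name>.sha256.
    return path.with_name(path.name + ".sha256")


def write_with_checksum(path: Path, data: bytes) -> None:
    """Write the data, plus a sidecar file holding its SHA-256 digest."""
    path.write_bytes(data)
    _sidecar(path).write_text(hashlib.sha256(data).hexdigest())


def is_valid(path: Path) -> bool:
    """Return True if the file exists and matches its stored digest."""
    sidecar = _sidecar(path)
    if not path.exists() or not sidecar.exists():
        return False
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == sidecar.read_text().strip()


def read_with_repair(primary: Path, replicas: Sequence[Path]) -> bytes:
    """Read the primary copy; on checksum mismatch, fetch a good replica
    and overwrite the damaged primary with it."""
    if is_valid(primary):
        return primary.read_bytes()
    for replica in replicas:
        if is_valid(replica):
            data = replica.read_bytes()
            write_with_checksum(primary, data)  # repair the old file
            return data
    raise IOError(f"no valid copy of {primary.name} found")
```

A caller would simply use `read_with_repair(primary, [replica_a, replica_b])` in place of a plain read. Corruption is detected on read, and the damaged copy is rewritten as a side effect, so no administrator has to touch the box.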

Comments

Jeff, thanks for sharing this interview. Wow: 10,000 servers and 100 system administrators. It’s hard not to have a profound appreciation for the complexity and scale of Hotmail. I wonder how much of their energy goes into fighting spam?

cheers

Wayne Lambright
http://sfsurvey.com

Fascinating.
I have been interested in this issue of scaling your infrastructure for some time now (http://rodrigo.typepad.com/english/2005/10/planning_your_w.html)

Thanks for the pointer! Very interesting read.

Very interesting. The funny thing about Hotmail is that new users are treated better than old users. My account created in 1997 still offers only 2 MB of storage, the one created in 2005 offers 250 MB.

Excellent article, there is some food for everyone here!

Emmanuel
http://galide.jazar.co.uk
