February 01, 2006

Ed Sim on successful offshoring

A year ago, offshoring development to India was considered a must. Since then it has generated enough operational issues to make startup teams and VCs much more cautious. Beyond local issues, such as the difficulty of hiring and keeping talent and rising wages, getting coordination, leadership and motivation right is challenging.

Ed Sim’s post is timely: he describes a successful offshoring operation run by one of his companies. Some key points:

  • Offshore interesting and motivating projects
  • Develop local leadership talent
  • Hire, train and manage local staff with a long term view in mind

At the same time, SAP is going to be looking at alternatives to India because of rising costs.

January 16, 2006

Developing and managing Hotmail

This interview has already been picked up and commented upon (and /.’ed), but if you have not yet taken a look, I recommend reading this ACM piece on Hotmail and on what it means to manage one of the largest services on the web. Hotmail runs on 10,000 servers, involves several petabytes of storage (i.e., millions of gigabytes) and serves, according to this Wikipedia article, 221M users who generate billions of e-mail transactions daily. It is operated by 100 sysadmins, which is not that large a team.

Phil Smoot, the PM in charge of Hotmail product development out of the Microsoft Silicon Valley campus, shares a number of insights, from which I noted the following points regarding automation, versioning, capacity planning, impact analysis and QA:

  • QA is a challenge in the sense that mimicking Internet loads on our QA lab machines is a hard engineering problem. The production site consists of hundreds of services deployed over multiple years, and the QA lab is relatively small, so re-creating a part of the environment or a particular issue in the QA lab in a timely fashion is a hard problem. Manageability is a challenge in that you want to keep your administrative headcount flat as you scale out the number of machines.
  • [...] if you can manage five servers you should be able to manage tens of thousands of servers and hundreds of thousands of servers just by having everything fully automated—and that all the automation hooks need to be built in the service from the get-go. Deployment of bits is an example of code that needs to be automated. You don’t want your administrators touching individual boxes making manual changes. But on the other side, we have roll-out plans for deployment that smaller services probably would not have to consider. For example, when we roll out a new version of a service to the site, we don’t flip the whole site at once.
  • We do some staging, where we’ll validate the new version on a server and then roll it out to 10 servers and then to 100 servers and then to 1,000 servers—until we get it across the site. This leads to another interesting problem, which is versioning: the notion that you have to have multiple versions of software running across the sites at the same time. That is, version N and N+1 clients need to be able to talk to version N and N+1 servers and N and N+1 data formats. That problem arises as you roll out new versions or as you try different configurations or tunings across the site.
  • The big thing you think about is cost. How much is this new feature going to cost? A penny per user over hundreds of millions of users gets expensive fast. Migration is something you spend more time thinking about over lots of servers versus a few servers. For example, migrating terabytes worth of data takes a long time and involves complex capacity planning and data-center floor and power consumption issues. You also do more up-front planning around how to go backwards if the new version fails.
  • We strive to build tools that can replay live-site transactions and real-type live-site loads against single nodes. The notion is that the application itself is logging this data on the live site so that it can be easily consumed in our QA labs. Then as applications bring in new functionalities, we want to add these new transactions to the existing test beds.
  • The notion of tape backups is probably no longer feasible. Building systems where we’re just backing up changes—and backing them up to cheap disks—is probably much more where we’re headed. How you can do this in a disconnected fashion is an interesting problem. That is, how are you going to protect the system from viruses and software and administrative scripting bugs? What you’ll start to see is the emergence of the use of data replicas and applying changes to those replicas, and ultimately the requirement that these replicas be disconnected and reattached over time.
  • As you go to, let’s say, a commodity model, you have to assume that everything is going to fail underneath you, that you have to deal with these failures, that all the data has to be replicated, and that the system essentially self-heals. For example, if you are writing out files, you put a checksum in place that you can verify when the file is read. If it wasn’t correct, then go get the file somewhere else and repair the old file.
  • Last word: If you rely on scale up, you’ll probably get killed. You should always be relying on scale out.
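
The checksum-and-repair idea from Smoot's commodity-hardware point can be sketched in a few lines. This is only an illustration under my own assumptions, not Hotmail's actual design: the on-disk layout (a SHA-256 hex digest followed by the payload), the function names and the notion of a plain list of replica paths are all hypothetical.

```python
import hashlib
from typing import List, Optional

DIGEST_LEN = 64  # length of a SHA-256 hex digest in ASCII bytes


def write_with_checksum(path: str, data: bytes) -> None:
    """Write the payload prefixed by its SHA-256 hex digest (assumed layout)."""
    digest = hashlib.sha256(data).hexdigest().encode("ascii")
    with open(path, "wb") as f:
        f.write(digest + data)


def read_verified(path: str) -> Optional[bytes]:
    """Return the payload if the stored digest matches, otherwise None."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return None
    stored, data = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha256(data).hexdigest().encode("ascii") != stored:
        return None  # corrupted or truncated file
    return data


def read_with_repair(replicas: List[str]) -> bytes:
    """Read from the first healthy replica and rewrite the bad ones tried before it."""
    bad = []
    for path in replicas:
        data = read_verified(path)
        if data is None:
            bad.append(path)  # remember it so it can be repaired later
            continue
        for broken in bad:
            write_with_checksum(broken, data)  # "repair the old file"
        return data
    raise IOError("no healthy replica found")
```

Calling read_with_repair(["/data/a/msg1", "/data/b/msg1"]) (made-up paths) would return the payload from whichever copy verifies first and overwrite any corrupted copy it tried before that. A real system would also have to decide where replicas live and how to repair copies it did not happen to read.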

December 18, 2005

Thoughts on Shared Nothing Architectures

Buddy Brad Feld has a great post on Shared Nothing Architecture as a potential solution to the performance and reliability issues faced by services I use on a day-to-day basis: TypePad and del.icio.us (and to some extent Bloglines, though I don't use it so much now). I had actually spotted that del.icio.us was down as well, and was about to write my own piece out of frustration, but Brad summarizes the situation well. In the meantime, here is my backup del.icio.us.

On the heals of TypePad’s 18 hour outage this week, there’s been (and will be) a lot of continued discussion about how to build scalable and reliable online / web-based applications.  This is not a new problem (I not so fondly remember major and systemic outages in large services such as eBay and Amazon in the late 1990’s) but it’s gotten new attention as some of the emerging applications have scaled up the point as to have an interesting numbers of regular users (e.g. – it sucks if their service goes down for more than 15 minutes).  For example, as far as I can tell, del.icio.us has been down for the last four hours (“del.icio.us is down for emergency maintenance. we'll be back as soon possible.”) and on 12/15/05 Bloglines acknowledged that “Bloglines performance has sucked eggs lately.”
Tim Wolters – an extremely capable CTO – has an introduction to how he is approaching this at Collective Intellect.  He’s taking a page from Google’s playbook and developing a web service based on a “shared nothing architecture”.  On Friday, I had two different discussions about scalable architectures (e.g. “we’re going to scale up between 10x and 100x on a meaningful base in 2006 – here’s what we are planning”) and both included elements of what Tim is describing.

The ultimate Shared Nothing Architecture relies on mirrored data centers in different geographic locations that allow a system to switch over in (quasi) real time in case of any type of failure (power, hardware, database, etc.), and this is expensive to deploy. Del.icio.us is not there yet, but will clearly benefit from Yahoo's scalability expertise. And as to Six Apart, well, let's hope that they'll figure this out, since quite a few of us users have expressed our “discontent” (and I am being soft since many of my close friends are involved with the company). These problems happen to almost every company that experiences rapid growth of its online presence, and often their backup solutions are just not appropriate (and remember, don't trust these backup generators).

If you need to substantiate early exits by Web 2.0 companies, beyond generating nice payoffs for company founders, look no further: scaling to tens of millions of users and gigabytes of traffic is no simple feat, and the companies facing these issues will be at risk of losing at least a portion of their momentum if they don't handle the situation properly.

Update: it appears that del.icio.us has had to rebuild their corrupted database after a... power failure - I wonder what happened to the generators...

September 28, 2004

Colo facilities: don't trust those generators, and backup your backup plans

OK, I should be in bed by now. But as I was browsing my blogroll, I noticed Dave Sifry's post about his not-so-cool weekend spent fixing Technorati's infrastructure after a fire at his colo. Unusual? Unique? Hardly...

This must be the tenth story I have heard about a colo facility that (supposedly) had all the redundancy required to "ensure" reliability, including the (infamous) diesel generators that kick in to take over from short-term UPSs. The issue is that those generators never seem to kick in (I hope that hospitals use a different brand than buildings and datacenters).

One of my former portfolio companies, an ASP that did not go public on the issue but wrote to its rather unhappy clients, faced exactly the same issue, and the note from the CEO contained statements very similar to David's. Here is a brief excerpt, from which I only removed named references:

On Monday morning at 9.45am there was a complete power outage at ..., our datacenter provider. Although our Uninterruptible Power Supplies (UPSs) were triggered and ran, ...’s diesel generators failed to start before the batteries in the UPSs ran out. As a result, all our servers abruptly lost power. The power was restored by 10.45am, but some severe damage had been done to our infrastructure by the abrupt power failure, and the surge which took place when power was restored.
[Pages of detailed explanations deleted]
On behalf of the whole ... team, I would like to apologize most sincerely to our clients for this severe lapse in our service. Rest assured that we will be working extremely hard in the coming days and weeks to review every aspect of the resilience of our service, and to ensure that an incident like this cannot happen again.

Interestingly similar, ain't it?

As to the underlying issue, the disaster recovery plan, it seems to be a common mistake to believe that having some level of redundancy leads to reliability. As a result, a lot of time is generally spent designing a technical infrastructure (comms, power, servers, disks, backups,...) that will "always" work, as opposed to defining the processes and procedures that will be applied if the s..t hits the fan, the database is completely corrupted, and servers aren't able to restart. I.e., when Murphy's law kicks in (aka the "buttered slice of bread theorem", or "Theoreme de la Tartine Beurree", which states that if you put some spread on a slice of bread and the slice falls, it has a %$@^%*$ tendency to repeatedly fall on the wrong side).
