June 06, 2006

Can't go back to a zeroband world

SBC Internet had a major network failure this morning, which led to a 2+ hour interruption of our commercial DSL service. Goodbye VoIP phones, faxes, Skype, etc. I tried to share my EVDO connection with the rest of the edgeio development team, but there is just not enough bandwidth to go around. Using a cell phone as a conference phone and an EVDO card as the sole network access only works for the mobile worker.

As our connection comes back, I somehow thought about the first (narrow narrowband) access we had in the early days of my startup in France, where we downloaded emails via uucp (unix to unix copy) over a 19.2 kbps Hayes modem (or was that 9.6?). That was 17 years ago. Damn!

As I am about to drop DSL in favor of cable at home, being sick and tired of Earthlink, I read the announcement from Comcast about their PowerBoost, promising bursts of up to 12 to 16 Mbps, which would finally be close to what my parents are getting in the center of France. We'll see if they deliver on that.

April 24, 2006

Can we all use the same micro-formats please ?

I live-blogged the launch of the Structured Blogging initiative during the Syndicate conference in San Francisco, and I could not help noticing that not everyone seemed in favor of it. Edgeio (usual disclaimer: I am an advisor, investor and landlord to the company), through co-founder Mike Arrington, had pledged support to Structured Blogging and micro-formats.

Google recently introduced GData, a set of protocols for Google Base, and unfortunately these RSS/Atom extensions don't seem to be aligned with already established micro-formats. Richard MacManus raised the matter last week, and Edgeio's Matt Kaufman has a good post on the issue, which will hopefully be solved quickly.


January 16, 2006

Developing and managing Hotmail

This interview has already been picked up and commented upon (and /.’ed), but if you have not yet taken a look, I recommend reading this ACM piece on Hotmail, and what it means to manage one of the largest services on the web. Hotmail runs on 10,000 servers, involves several petabytes of storage (i.e., millions of gigabytes) and serves, according to this Wikipedia article, 221M users who generate billions of e-mail transactions daily. It is operated by 100 sysadmins, which is not that large a team.

Phil Smoot, the PM in charge of Hotmail product development out of the Microsoft Silicon Valley campus, shares a number of insights – from which I noted the following points regarding automation, versioning, capacity planning, impact analysis and QA:

  • QA is a challenge in the sense that mimicking Internet loads on our QA lab machines is a hard engineering problem. The production site consists of hundreds of services deployed over multiple years, and the QA lab is relatively small, so re-creating a part of the environment or a particular issue in the QA lab in a timely fashion is a hard problem. Manageability is a challenge in that you want to keep your administrative headcount flat as you scale out the number of machines.
  • [...] if you can manage five servers you should be able to manage tens of thousands of servers and hundreds of thousands of servers just by having everything fully automated—and that all the automation hooks need to be built in the service from the get-go. Deployment of bits is an example of code that needs to be automated. You don’t want your administrators touching individual boxes making manual changes. But on the other side, we have roll-out plans for deployment that smaller services probably would not have to consider. For example, when we roll out a new version of a service to the site, we don’t flip the whole site at once.
  • We do some staging, where we’ll validate the new version on a server and then roll it out to 10 servers and then to 100 servers and then to 1,000 servers—until we get it across the site. This leads to another interesting problem, which is versioning: the notion that you have to have multiple versions of software running across the sites at the same time. That is, version N and N+1 clients need to be able to talk to version N and N+1 servers and N and N+1 data formats. That problem arises as you roll out new versions or as you try different configurations or tunings across the site.
  • The big thing you think about is cost. How much is this new feature going to cost? A penny per user over hundreds of millions of users gets expensive fast. Migration is something you spend more time thinking about over lots of servers versus a few servers. For example, migrating terabytes worth of data takes a long time and involves complex capacity planning and data-center floor and power consumption issues. You also do more up-front planning around how to go backwards if the new version fails.
  • We strive to build tools that can replay live-site transactions and real-type live-site loads against single nodes. The notion is that the application itself is logging this data on the live site so that it can be easily consumed in our QA labs. Then as applications bring in new functionalities, we want to add these new transactions to the existing test beds.
  • The notion of tape backups is probably no longer feasible. Building systems where we’re just backing up changes—and backing them up to cheap disks—is probably much more where we’re headed. How you can do this in a disconnected fashion is an interesting problem. That is, how are you going to protect the system from viruses and software and administrative scripting bugs? What you’ll start to see is the emergence of the use of data replicas and applying changes to those replicas, and ultimately the requirement that these replicas be disconnected and reattached over time.
  • As you go to, let’s say, a commodity model, you have to assume that everything is going to fail underneath you, that you have to deal with these failures, that all the data has to be replicated, and that the system essentially self-heals. For example, if you are writing out files, you put a checksum in place that you can verify when the file is read. If it wasn’t correct, then go get the file somewhere else and repair the old file.
  • Last word: If you rely on scale up, you’ll probably get killed. You should always be relying on scale out.
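The checksum-and-repair idea quoted above can be sketched in a few lines. This is only an illustration of the principle, not how Hotmail does it: the sidecar ".sha256" file, the function names and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _checksum_path(path: Path) -> Path:
    # Hypothetical convention: the checksum lives next to the file.
    return Path(str(path) + ".sha256")


def write_file(path: Path, data: bytes) -> None:
    """Write the data together with a sidecar checksum file."""
    path.write_bytes(data)
    _checksum_path(path).write_text(_digest(data))


def _read_verified(path: Path):
    """Return the contents, or None if the copy is missing or corrupt."""
    try:
        data = path.read_bytes()
        expected = _checksum_path(path).read_text().strip()
    except OSError:
        return None
    return data if _digest(data) == expected else None


def read_with_repair(primary: Path, replicas: list[Path]) -> bytes:
    """Read the primary copy; on a checksum mismatch, fetch an intact
    replica and use it to repair the primary."""
    data = _read_verified(primary)
    if data is not None:
        return data
    for replica in replicas:
        data = _read_verified(replica)
        if data is not None:
            write_file(primary, data)  # self-heal the damaged copy
            return data
    raise IOError(f"no intact copy of {primary.name} found")
```

In a real system the replicas would sit on other machines, and the repair would probably be queued rather than done inline, but the read path would follow the same verify, fall back, repair sequence.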

December 13, 2005

Repeat after me: the index of a search engine is a commodity

This was quietly mentioned at the last session of the Search SIG, hosted by John Battelle, and has been discussed in the ranks of the searcherati's (search engine specialists): what happens to the market the day a search engine index becomes available for free (or close enough to free), and so is an entire crawling infrastructure ? This is actually something that Inktomi had been doing for a while with its index - for a fee.

Mike Arrington just pinged the Web 2.0 Workgroup (the 4th top blog network according to this new ranking ?) about the announcement by Alexa (aka Amazon) that they did just that: making their index and crawling infrastructure available to anyone for free - OK not free, but very close to it. John Battelle (who else :-) was briefed on this, and clarifies the “offering”:

In short, Alexa, an Amazon-owned search company started by Bruce Gilliat and Brewster Kahle (and the spider that fuels the Internet Archive), is going to offer its index up to anyone who wants it. Alexa has about 5 billion documents in its index - about 100 terabytes of data. It's best known for its toolbar-based traffic and site stats, which are much debated and, regardless, much used across the web. [...]

Anyone can also use Alexa's servers and processing power to mine its index to discover things - perhaps, to outsource the crawl needed to create a vertical search engine, for example. Or maybe to build new kinds of search engines entirely, or ...well, whatever creative folks can dream up. And then, anyone can run that new service on Alexa's (er...Amazon's) platform, should they wish.

It's all done via web services. It's all integrated with Amazon's fabled web services platform. And there's no licensing fees. Just “consumption fees” which, at my first glance, seem pretty reasonable. (“Consumption” meaning consuming processor cycles, or storage, or bandwidth).

The fees? One dollar per CPU hour consumed. $1 per gig of storage used. $1 per 50 gigs of data processed. $1 per gig of data uploaded (if you are putting your new service up on their platform).
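To get a feel for what these consumption fees add up to, here is a back-of-the-envelope calculator using only the four rates quoted above. The function name and the sample workload are made up for illustration; the actual Alexa billing rules may well include details not covered in the announcement.

```python
# Rates as quoted in the announcement (US dollars).
USD_PER_CPU_HOUR = 1.0
USD_PER_GB_STORED = 1.0
GB_PROCESSED_PER_USD = 50.0   # $1 per 50 gigs of data processed
USD_PER_GB_UPLOADED = 1.0


def estimate_cost(cpu_hours: float, gb_stored: float,
                  gb_processed: float, gb_uploaded: float = 0.0) -> float:
    """Return the estimated bill in dollars for one job."""
    return (cpu_hours * USD_PER_CPU_HOUR
            + gb_stored * USD_PER_GB_STORED
            + gb_processed / GB_PROCESSED_PER_USD
            + gb_uploaded * USD_PER_GB_UPLOADED)


# Hypothetical job: 10 CPU hours, 20 GB stored, 500 GB processed, 2 GB uploaded.
print(estimate_cost(10, 20, 500, 2))  # 10 + 20 + 10 + 2 = 42.0
```

Processing is by far the cheapest of the four meters, which suggests that a job scanning a large slice of the index would be billed mostly for CPU time and for whatever results it stores.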

So... What does it mean to have an index just 25% (-ish) the size of Google's and Yahoo's available to anyone and everyone? A few thoughts:

  • Alexa is always perceived as an approximate source of traffic statistics based on the usage of the toolbar, and one can wonder about the distribution of the 5 billion pages available in the index.
  • Search engine indexes are one step closer to being a commodity - at least for the “Surface web” (as opposed to the Deep Web). When will an open source index be developed for everyone to use ? A proper plugin infrastructure would be required to allow specialized search engines like Truveo to apply their specific heuristics (in that case, to find code that likely suggests that there might be videos on a site) - as well as scheduling.
  • This furthers the notion that the value of a search engine is in the application(s) built on top of it, like ad networks and the ability to match relevant ads to content, or any specific vertical search functionality.
  • Amazon is further leveling the playing field by offering this commodity infrastructure to competitors of the large search players - and making a little bit of money around a third layer of business: selling stuff it holds in inventory, facilitating the sale of stuff that affiliates hold in inventory, and now selling access to information about stuff one can find on the Internet.

The implications of that announcement will further develop over the next couple of days. More can be found on the Alexa blog and the Alexa Web Search site:

The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools. Alexa provides compute and storage resources that allow users to quickly process and store large amounts of web data. Users can view the results of their processes interactively, transfer the results to their home machine, or publish them as a new web service.

Update: the blogosphere is buzzing around the topic this morning - not surprisingly (here is the Memeorandum thread). I found this article from Phil Wainewright quite on point regarding the changes that this announcement might imply for Google. At the end of the day, their value is the depth and breadth of their advertising network, and their ability to match relevant ads - and therefore they most likely will continue to do well.

November 21, 2005

From syndication to synchronization: bi-directional RSS is under way with SSE

Ray Ozzie, Microsoft's recently appointed CTO, announced this morning the release of a preliminary specification of SSE (Simple Sharing Extensions), allowing RSS to be bi-directional. The draft was released under Creative Commons (Attribution-ShareAlike) - which means that anyone can implement it, hack it, modify it, as long as due credit is given to Microsoft, and changes are also made available under this license.

I'll come back to SSE in a minute, but witnessing Microsoft quickly putting together a specification that had an overstated goal of being simple, making it available under CC and engaging with the industry so openly feels novel. Can someone point to previous examples or is it the first time ? Microsoft's previous efforts to “embrace and extend” industry standards often led to incompatible specifications with proprietary extensions. Ray writes about the genesis of the idea:

Shortly after I started at Microsoft, I had the opportunity to meet with the people behind Exchange, Outlook, MSN, Windows Mobile, Messenger, Communicator, and more.  We brainstormed about this “meshed world” and how we might best serve it - a world where each of these products and others’ products could both manage these objects and synchronize each others’ changes.  We thought about how we might prototype such a thing as rapidly as possible – to get the underpinnings of data synchronization working so that we could spend time working on the user experience aspects of the problem – a much better place to spend time than doing plumbing.

Re SSE, if it is as simple and open as it seems, it could be the building block that has been missing to build richer application interactions, where calendars, contacts, etc. can be peered without going through a centralized repository. Synchronizing calendars, though, is much harder than people initially think, and that's why calendar companies are getting funded. I wonder if they looked at SyncML, another general purpose XML standard for synchronization. I could not find any mention of it in the FAQ.

I just saw that Niall thinks he will have working SSE code tonight. Got to love that open source production model. Update: Niall has finished his prototype.

November 02, 2005

New Yahoo!Maps: interactive, functional, impressive

Yahoo released at 9PM PT a beta version of a new maps implementation. The first version was functional, but not as powerful as Google’s, and did not allow for map mashups to be included in a web page (they appeared in a popup). This new one solves that limitation, and adds many more cool features.

I have just started playing with this new version and find it quite remarkable. It leverages Flash to deliver maps and some Ajax code to input addresses, and therefore provides a lot of interactivity. And it supports multi-address routes - a welcome functionality.

Another interesting one is the integration of local search results. For example, entering an address maps it out to the business located at that address, or provides a list of businesses located at that address if there is more than one.

The “Find on the Map” feature allows you to perform a local search, and display all matching results on the map. For example, these are all the “Peet’s Coffee and Tea” locations in the region. I can then drag one of these coffee shop addresses and drop it on an address field, thereby building an itinerary. You can also overlay live traffic information. A small window at the top right of the map controls the zoom factor and the area displayed on the larger map. Clicking on “Print” provides the itinerary, the list of points of interest (the Peet’s I searched on) and a small map.

Overall, very cool.

Yahoo has also released a new set of APIs (simple, Flash, Ajax), alongside a suite of sample applications – most of them are still being worked on.


October 18, 2005

Geolocation services specialist Whereonearth goes Yahoo!

Good week for my former partners at RVC: after the acquisition of Moreover Technologies by Verisign (which was made official yesterday), Yahoo has announced the acquisition of UK-based Whereonearth.

We (RVC, the managers of the Reuters Greenhouse Fund) had invested in the company back in January 2000. I remember my first visit to their London office, where a large number of GIS (as in Geographic Information Systems) experts were producing and/or aggregating multi-layered maps of the world. Each layer (representing contours of the land, countries, counties, zip codes, roads, rail tracks, etc.) could be independently overlaid in a very impressive way. They also had an extensive database of locations expressed in different languages, with distances. Using these assets, they had developed a number of location-based solutions.

The Yahoo Search Blog actually carried the announcement:

Whereonearth’s very talented team of software engineers and Geographic Information Systems (GIS) experts have worked hard to develop sophisticated technology that contains a unique combination of global data and software algorithms that make local search possible. Together, we’ll be able to provide the most geo-relevant information across all of Yahoo!’s products and services.

The corresponding Reuters piece is here.

Congratulations to Dev Patel and his team.

 

October 12, 2005

MSN Messenger and Yahoo IM interoperability - at last

The widely anticipated announcement is now out: MICROSOFT AND YAHOO! ANNOUNCE LANDMARK INTEROPERABILITY AGREEMENT TO CONNECT CONSUMER INSTANT MESSAGING COMMUNITIES GLOBALLY. One of the longest standing Internet “Berlin walls” is about to collapse, as these two giants are working together to connect their back-ends and create a virtual community of 275M users. This is slated for release sometime in Q2 2006 (why so late ?):

[…] In addition to exchanging instant messages, consumers from both communities will be able to see their friends’ online presence, share select emoticons, and easily add new contacts from either service to their friends’ list, all as part of their free IM service.

The next three questions that come to mind are:

  1. OK, how about AIM and Skype ?
    AIM is still the largest IM network, and from what we heard in the “Web 2.0 Teens Panel” it has a strong foothold in the younger generation of Internet users. Will AOL be able to maintain its virtual segregation now that its largest two competitors have agreed to do this or will they use the usual “respect for the security and privacy of our users” excuse ? Or is Rafer right in thinking that AOL is about to do a deal with Microsoft ?
    What about Skype’s IM ?
  2. What if the integration backplane was XMPP/Jabber ?
    This would create a level playing field for other IM networks - consumer and professional ones – to connect, deliver presence information and text communication. Client Userplane would be able to offer federated access to the 10,000 communities they power with such an approach.
  3. What is being connected: text messaging and presence at a minimum, but what about voice and video ?
    The press release does not mention voice and video, which is not surprising. Even if most voice-enabled networks provide a SIP interface one way or another, implementing voice integration might be a step that these companies want to consider taking a bit later. Om thinks that voice bridging is actually part of the interoperability ?

August 05, 2005

DNS servers under attacks - a weakness of the Internet and many online businesses

CNET has a detailed article on yet another threat facing the online presence of businesses, and the Internet itself: DNS cache poisoning.

In a DNS cache poisoning attack, miscreants replace the numeric addresses of popular Web sites stored on the machine with the addresses of malicious sites. The scheme redirects people to the bogus sites, where they may be asked for sensitive information or have harmful software installed on their PC. The technique can also be used to redirect e-mail, experts said.

As each DNS server can be in use by thousands of different computers looking up Internet addresses, the problem could affect millions of Web users, exposing them to a higher risk of phishing attack, identity theft and other cyberthreats. […]

The poisoned caches act like "forged street signs that you put up to get people to go in the wrong direction," said DNS inventor Paul Mockapetris. […]  BIND is distributed free by the Internet Software Consortium. In an alert on its Web site, the ISC says that there "is a current, wide-scale...DNS cache corruption attack."

 DNS cache poisoning is not new. In March, the attack method was used to redirect people who wanted to visit popular Web sites such as CNN.com and MSN.com to malicious sites that installed spyware, according to SANS  Internet Storm Center. […] 
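To make the mechanism concrete, here is a toy model of a caching resolver. It is a deliberately simplified sketch: real attacks have to forge DNS responses on the wire, which the hypothetical `inject` method below simply stands in for, and the class name and addresses (taken from the reserved documentation ranges) are invented for the example.

```python
import time


class ToyDnsCache:
    """A toy caching resolver: name -> (address, expiry time)."""

    def __init__(self, upstream):
        # upstream is any callable returning (address, ttl_seconds) for a name.
        self._upstream = upstream
        self._entries = {}

    def resolve(self, name, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]               # served from cache, no upstream check
        address, ttl = self._upstream(name)
        self._entries[name] = (address, now + ttl)
        return address

    def inject(self, name, address, ttl, now=None):
        """Stand-in for a forged response that the cache wrongly accepts."""
        now = time.monotonic() if now is None else now
        self._entries[name] = (address, now + ttl)


def authoritative(name):
    return ("192.0.2.10", 300)            # the legitimate answer


cache = ToyDnsCache(authoritative)
print(cache.resolve("example.com", now=0))     # 192.0.2.10
cache.inject("example.com", "203.0.113.66", ttl=3600, now=1)
print(cache.resolve("example.com", now=2))     # 203.0.113.66 (poisoned)
print(cache.resolve("example.com", now=4000))  # 192.0.2.10 (entry expired)
```

The point of the sketch is that every client of the cache receives the forged address until the poisoned entry expires, which is why a single compromised server can misdirect thousands of users.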

According to the article, there are about 9 million DNS servers, and a sample test has shown that as many as 30% of them might be at risk if targeted by a cache poisoning attack. Given the potential implications, businesses have to upgrade their DNS infrastructures to BIND 9, the latest revision.

The alternative, which also increases the reliability, speed and functionality of a DNS infrastructure, is to switch to the Managed DNS service developed by UltraDNS. Because UltraDNS relies on its own custom DNS server implementation rather than on BIND, it is not subject to cache poisoning and other weaknesses of BIND servers. The company manages about 20% of all Internet domain names, through 20+ global and country TLDs (Top Level Domain names), and direct customers like Amazon.com and others. An additional benefit of UltraDNS is that it allows one to propagate DNS configuration changes in quasi real-time, as opposed to a few hours to a couple of days, thanks to a cluster of servers distributed around the world.

Disclosure: I have been involved in UltraDNS as an investor, a board member and more recently a consultant.


April 10, 2005

Lunch at the Internet Archive

I was meeting with a friend last Friday, who suggested I join him for the Friday lunch hosted by the Internet Archive, in its office located in the Presidio (in SF). The Archive is led by Brewster Kahle, who founded Alexa Internet and sold it to Amazon.com. The mission he has set for himself is to digitize the world's media and store it for future generations to enjoy and benefit from. I had met Brewster Kahle a couple of times at conferences, and found his project absolutely fascinating. There is actually a great podcast on ITConversations where he explains it all.
The Archive is also involved in providing free storage and free bandwidth to OurMedia, the multimedia sharing non-profit organization spearheaded by Marc Canter and JD Lasica.

The lunch was really open to anyone showing up, and started with Brewster going around the table and asking everyone to introduce themselves, and explain what they were doing and why they were here. There were a few guys coming for interviews to join the Web development team. People then split into smaller groups to discuss different topics.

Next to the meeting room was this custom-made book scanner, and I chatted with the guy who had developed the software to automate the process. This machine is used to take a picture of both pages of a book whilst keeping it open flat. They can barely be seen, but there are two hi-res digital cameras attached, one on each side of the machine. This is linked to a computer that uploads pictures of pages, applies OCR and adds them to the archive.

Standing open against the adjacent wall is a gigantic book. As a matter of fact, it is the World's Biggest Book, which is 6.7 x 5 feet and weighs 172 pounds. Most impressive, even though this picture does not really convey the scale. According to this article:

The 60.3 kilogram book about Bhutan (a country in South-Eastern Asia) has been recognized the biggest book in the world.
Writing it took 4 liters of ink and the amount of paper enough to cover a football field. The author of the giant book is professor from Massachusetts Institute of Technology Michael Holey.

Every copy of the book "Bhutan: Odyssey around the Kingdom" is 213.3 centimeters long by 152.3 centimeters broad, has 112 pages and costs about two thousand dollars. Holey is going to donate a part of the sum he will earn selling the book -10 thousand dollars to the charity foundation he established. The foundation builds schools in Cambodia and Bhutan. According to the Associated Press, the new book has already been enlisted in Guinness Book of Records.
Holey arranged many expeditions of students of the Institute to Cambodia and Bhutan - an isolated country with population of 700 thousand people. Then he decided to publish the photos he made in Bhutan to earn money for building schools there.

At first the professor did not plan to publish the biggest book in the world, but in a course of work on its colorful pictures he realized how exciting big photos of Bhutan's nature and life are. Everything there is unusual, even rice is of red color.
According to those who saw the book you meet it like a person, face to face. Bookbinders had to invent a special mode of binding the book because of its enormous size.

As I was walking out, I saw the Internet Bookmobile, the van that carries full printing equipment and Internet satellite connectivity. It is mentioned in the podcast I referred to, as well as in this detailed piece.

I am not sure if the lunch takes place every Friday, but it was certainly interesting to spend some time with these folks. And if you have never tried it, the Wayback Machine (on the Archive's home page) allows one to see how a web site looked almost ten years ago. For example, here is Reuters.com as of Nov 13, 1996.
