Alexa Archives the Internet
Building 116 on the decommissioned Presidio army base near the northwest corner of San Francisco is not the most conventional place to run a technology company. But then, Alexa Internet is anything but conventional. The company has taken on a task that at first glance seems quixotic: archiving the Internet, saving a copy of this voluminous digital record to tape. As you might imagine, this is no small task. Alexa has so far amassed about seven terabytes of data, about one-third of that held in hardcopy form by the Library of Congress. Indeed, a library is precisely the metaphor the company is aiming at. The name "Alexa" derives from the fabled Library of Alexandria, the largest library in the ancient world. Despite the fire that destroyed part of its collection, the library became the conduit by which many ancient works have survived. Alexa Internet hopes it will do comparable good in this digital age.
Alexa is the brainchild of Brewster Kahle, one of the architects of the Thinking Machines massively parallel supercomputer. Kahle went on to found WAIS, Inc., whose technology was a predecessor to today's Web search engines. After he sold that business to America Online, Kahle co-founded Alexa Internet and its non-profit counterpart, the Internet Archive. In both organizations, Kahle serves as the visionary while co-founder Bruce Gilliat handles sales and marketing.
Alexa's Internet archiving effort primarily focuses on the Web, but also includes Usenet discussions, collected from an on-site news server at the rate of about 1GB a day. Taken together, this is one of the most ambitious archiving efforts ever made. "The National Archives of the United States' digital collection has got 300GB in it. We collect that in a week," Kahle says. "One of the larger banks in the United States, Wells Fargo, has a data warehousing project where they store and mine their own information. It has about 1 terabyte in it, while we've got 7 terabytes, growing at about 1.
"It's an error message that comes back from Web servers that says that a page is no longer available," says Kahle. "It used to be there, but is no longer there; it's out of print. Alta Vista says one percent of all Web pages that are there on one day are gone one week later. So one percent of the Web per week is permanently erased, and the turnover is sometimes much faster than that. That's a shame because there is so much good content. MSNBC [the joint venture between Microsoft and the NBC television network] only keeps its materials up on the Net for two weeks. It makes sense that a publisher wouldn't necessarily keep their newspapers and magazines on the newsstands all the time. But it's really crucial that we be able to use that material for our research, whether it's for historians and scholars or everyday people. Any user of the World Wide Web will want to have access to and use the best of the Net, not just what happens to be available right now."
The Alexa toolbar also provides subscribers with "metadata" on the sites: who a given site is registered to, how many other sites point to it, how frequently it is updated, and how popular it is among users. The software shows other Web pages of potential interest by tracking where other Alexa users have gone, as well as by considering which Web sites link to that page (see sidebar). In August, Alexa struck a deal with TRUSTe, a non-profit group working to ensure privacy for Web users. The deal will allow TRUSTe's logo to appear in the "Where You Are" window of the Alexa toolbar, showing how personal information entered by the user will be used. Alexa intends to make all these services available free of charge. As with so many sites, revenues are intended to come from advertising.
In addition, Alexa gives a copy of its database archive to its non-profit counterpart, the Internet Archive, which is based in Seattle, Washington. This is the far-sighted part of Kahle's plan: a realization that the Web ought to be preserved, just like any other mass medium. Books, after all, are collected by libraries, and indeed, the Library of Congress's mandate is to house a copy of every book published in the United States. Television is archived by broadcasters themselves and by institutions like UCLA. As for audio recordings, you can still hear the first one, made by Thomas Edison.
Of course, the Web is so new that few people have thought that its ephemeral nature matters. It was, rather, a means of exchange between researchers. With its rapid commercialization and wide adoption, the Internet has inserted itself into the very fabric of everyday communications and commerce. The Internet as a whole is the latest, and probably the last, major medium of the twentieth century. It has, in short, become a medium worth remembering: not only as a record of late 20th century events like the death of Princess Diana, but as a record of itself, a new way for people to exchange information.
"Basically, the Net is seen as a huge newsstand," says Kahle. "But what's really important to people isn't necessarily what happens to be on the newsstand today; it's whenever it happened. You want to be able to do research based on anything that's transpired." Kahle says that not only technology historians, but historians and archeologists of every discipline are already using the Web. "We've got an unprecedented collection of human voices in this archive that has never been accessible to historians before. We believe that our early digital history represents an important change in how humans communicate. That is what we want to make sure is saved in a way that future scholars can make use of."
For example, Kahle and company are working with AT&T Laboratories, which is using the archives for computational linguistic studies, and with Xerox PARC (Palo Alto Research Center), the visionary laboratory of the Xerox Corporation. "Based on our archive, they've found there are over 200 human languages represented on the Internet," Kahle says. Xerox PARC Research Scientist Jim Pitkow couldn't reveal the details of his work, but said it had to do with what he called document ecologies. "Ecology is the study of relationships within an environment. We are interested in the relationship between users and the elements of the Internet." Pitkow noted that for his purposes, the archive represents about 2-3 terabytes of documents once you factor out things like Linux source code distributions. "That's still large enough to be interesting."
Tape beats out disk
While Kahle is not the first person to note the ephemeral nature of the Internet, he is probably the first to do something about it, with a project of considerable ambition. Given the growth of the Web, many people might have guessed that a project of this scope would be impossible, or at least so difficult that it would take a major company to accomplish. The fact that it is being done by a company as small as Alexa is a testament to Kahle's foresight, as well as to the amazing strides in density made by magnetic storage media.
The archiving process starts with "web crawler" technology, which moves from site to site inventorying the holdings of the World Wide Web. Using this technology, Alexa took less than six months to record all of the text data available on public Internet sites, completing the process in March of 1996. Now, the technology revisits each page about every six weeks, although the company is tweaking it to make more frequent visits to sites that change more often.
But where to store the data? Kahle looked at the market to determine the most cost-effective mechanism for storing large amounts of information while keeping it accessible. At the high end were disk drives, offering fast writes and fast reads, but at a high cost: $200 per gigabyte. In the middle were writable CD-ROMs, offering slower access than conventional magnetic media, but costing less, about $120 per gigabyte. While both of these technologies were tempting, they were also prohibitively expensive considering the size of the data set. A terabyte is 1000 gigabytes, so if your archive is growing even at 1.
"By our observations, the number of Web sites is doubling every six months," Kahle says. "And the number of pages is doubling even faster than that. It's hard to count, but we estimate there are now 640,000 separate Web sites."
So Kahle went instead for tape technology, which, at $20 per gigabyte, enables the archive to expand dramatically without bankrupting the company. Not that this is a permanent storage solution, but at least it's good for the foreseeable future. "Storage technology has tracked the traditional Moore's Law curve: every 18 months it gets twice as good," Kahle says. "But if the Net is getting twice as big every six months, then eventually we're going to outstrip what the storage technology can do, at least for a fixed cost. Right now we've got a backlog where the technology can handle more than what we're doing with it right now, so we've got a little bit of grace period. But, eventually we're going to have to be more selective."
Or find something even better. "These different technologies are evolving and we're interested in using whatever we can. Who knows? In ten years we may be storing these bits in a crystal or by bouncing them off the moon. All we know for certain is that storage will be getting cheaper."
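The arithmetic behind these comparisons can be sketched in a few lines of Python. The per-gigabyte prices, the 1000-gigabyte terabyte, the 7-terabyte archive size and the two doubling periods come from the article; the function names are made up for illustration, and reading "twice as good" as "twice the capacity per dollar" is an assumption.

```python
# Figures quoted in the article; the helper names are hypothetical.
COST_PER_GB = {"disk": 200, "writable CD-ROM": 120, "tape": 20}  # US dollars
GB_PER_TB = 1000  # the article's definition of a terabyte


def cost_per_terabyte(medium: str) -> int:
    """Dollars needed to store one terabyte on the given medium."""
    return COST_PER_GB[medium] * GB_PER_TB


def archive_cost(terabytes: float, medium: str) -> float:
    """Dollars needed to store an archive of the given size."""
    return terabytes * cost_per_terabyte(medium)


def growth_factor(months: float, doubling_period_months: float) -> float:
    """How many times larger a quantity is after `months` of steady doubling."""
    return 2 ** (months / doubling_period_months)


if __name__ == "__main__":
    for medium in COST_PER_GB:
        print(f"{medium}: ${archive_cost(7, medium):,.0f} for 7 terabytes")

    # Assumption: "twice as good" means twice the capacity per dollar.
    net = growth_factor(18, 6)        # the Net doubles every six months
    storage = growth_factor(18, 18)   # storage doubles every 18 months
    print(f"After 18 months the Net is {net:.0f}x larger, "
          f"storage per dollar only {storage:.0f}x better")
```

On these assumptions the gap widens by a factor of four every 18 months, which is why Kahle expects a fixed budget to force more selectivity eventually.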
Obviously, one tape, even a high-capacity one, won't do. Internet Archive uses a StorageTek TimberWolf 9710 tape robot, which is linked to a Sun SPARCstation 20. The robot contains three Quantum DLT 7000 drives and, at present, 420 tapes, each of which can store up to 70 GB of compressed data. The advantage is cost, but the disadvantage is access speed.
"A tape search usually takes between five and 15 minutes, depending on how busy the system is and how complicated the page being retrieved is," says Mike Burner, Alexa's vice president of development. He notes that some pages require multiple accesses to the tape. While you are waiting, you can go on and do other work. The Alexa software will bring up a window when the search is completed. In addition, if you should happen to request a missing page that someone else has requested previously, access is much quicker. That's because Alexa also maintains a 200GB cache made up of DEC hard drives, which it doesn't intend to erase, holding previously requested pages. Delivery here takes 10 milliseconds, essentially instantaneous. Alexa also uses Quantum Atlas II drives to store the card catalog, about 180GB in size: the software entity that maps where in the vast array of tapes a requested Web page resides.
Will this setup keep up with growth? Kahle is confident that it will at least be able to keep track of text and graphics. But video is another question. "When everybody is putting videos of their kids on the Net, it will be impossible to keep up with all of the video from every Web site, and we'll have to become more selective. But that's okay because it doesn't make complete sense to archive every minute of a camcorder that's pointed at a baby's crib."
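The retrieval path Burner describes (disk cache first, then the card catalog, then tape) can be pictured with a small Python sketch. This is not Alexa's software: the class, the field names and the example URL are all hypothetical, and ordinary dictionaries stand in for the disk cache, the catalog and the tape robot.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TapeLocation:
    tape_id: int  # which cartridge in the robot (hypothetical numbering)
    offset: int   # position of the page on that tape (hypothetical unit)


@dataclass
class Archive:
    # URL -> tape location; stands in for the "card catalog"
    catalog: dict[str, TapeLocation]
    # tape location -> page contents; stands in for the tape robot
    tapes: dict[TapeLocation, bytes]
    # URL -> page contents; stands in for the disk cache, never erased
    cache: dict[str, bytes] = field(default_factory=dict)

    def fetch(self, url: str) -> tuple[bytes, str]:
        """Return the page and the tier that served it ("cache" or "tape")."""
        if url in self.cache:
            return self.cache[url], "cache"   # fast path: already on disk
        location = self.catalog.get(url)
        if location is None:
            raise KeyError(f"{url} is not in the archive")
        page = self.tapes[location]           # slow path: minutes on real tape
        self.cache[url] = page                # later requests hit the cache
        return page, "tape"


if __name__ == "__main__":
    where = TapeLocation(tape_id=17, offset=4096)
    archive = Archive(
        catalog={"http://example.com/old-page": where},
        tapes={where: b"<html>archived copy</html>"},
    )
    print(archive.fetch("http://example.com/old-page")[1])  # first request
    print(archive.fetch("http://example.com/old-page")[1])  # repeat request
```

The first request goes to tape and the repeat request is served from the cache, which mirrors the article's point that a page someone has already asked for comes back much faster.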
Sidebar: Complement to search engines
Brewster Kahle thinks that one of Alexa's most valuable subscriber services is its navigation capability: the ability to provide an alternative to the conventional search engines like Alta Vista, Excite and Yahoo. The company's software talks to the browser and "knows" what page you are looking at. Then, making use of the archives and records of other Alexa users, it suggests other links that you might find of interest. "We had this idea when we were at Thinking Machines," Kahle says. "The key idea is when there is so much content out there, you need powerful methods to find just the stuff you want. When you are looking for something, you only want the 10 best. Right now, the search engines have so much to search across that they are coming back with thousands of hits. We thought we needed better techniques and technology to help you find what you want.
"The key thing was not a smarter search engine, or using intelligent agent software. It was using all the people that are on the Internet to help you find things. If you could use the experiences of millions to help you find whatever you are looking for this morning, that's a key navigation technique that isn't being used right now in the search engines."
Conventional search engines work with key phrases. Search for the term "FedEx" on Alta Vista and you get 36,810 hits, including the FedEx home page (which appears number one in the list), and related topics like the FedEx St. Jude golf tournament. By contrast, Kahle says, if you were already on the FedEx home page, Alexa's software would show you other shipping companies: the Postal Service, DHL, international shipping services. "We give you 10 and we try to make them completely honed."
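One simple way to picture "using the experiences of millions" is to count which sites turn up in the same browsing sessions. The Python sketch below is only an illustration of that general idea, not a description of Alexa's actual method, which the article does not spell out. The session data, site labels and function names are all invented.

```python
from collections import Counter, defaultdict


def build_related(sessions: list[list[str]]) -> dict[str, Counter]:
    """Count, for every site, how many sessions also included each other site."""
    related: dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        sites = set(session)  # count each site once per session
        for site in sites:
            for other in sites:
                if other != site:
                    related[site][other] += 1
    return related


def suggest(related: dict[str, Counter], site: str, limit: int = 10) -> list[str]:
    """Return up to `limit` sites most often visited alongside `site`."""
    counts = related.get(site, Counter())
    return [other for other, _ in counts.most_common(limit)]


if __name__ == "__main__":
    # Hypothetical browsing sessions; the labels are placeholders, not real data.
    sessions = [
        ["shipper-a.example", "shipper-b.example", "shipper-c.example"],
        ["shipper-a.example", "shipper-b.example"],
        ["shipper-a.example", "post-office.example", "shipper-b.example"],
    ]
    related = build_related(sessions)
    print(suggest(related, "shipper-a.example"))
```

Capping the result at ten matches Kahle's remark that a reader wants only the ten best suggestions, and the counts improve as more sessions are recorded.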
Kahle maintains that the Alexa service would do really well in Japan because "Japanese tend to use communication infrastructure really solidly, where Americans often think that they know everything. I'd like to have lots of Japanese users because that is how links start to get better and better. This is a technology that learns from people."
As a test of how the engine might work on a Japanese page, Kahle logged onto the home page of Software Design's publisher, Gijutsu Hyoron: http://
softbank.co.jp
ascii.co.jp
ai-pub.co.jp
iwanami.co.jp
gakken.co.jp
jri.co.jp/park/kyoritsu
Kahle is interested in the relevancy of these sites and invites Software Design readers to write him (in English) at alexa.
Sidebar: "Colonel" Kahle at the San Francisco Presidio
Alexa Internet's base in the San Francisco Presidio makes it a part of one of the more unusual national parks in the United States. The 1,480-acre property has been an army base since the Spanish established it as a fort in 1776, the same year the United States declared its independence from England. But even when the U.
Now, the Presidio is indeed a park, kind of like Yosemite, but with a big difference. It is administered by the Presidio Trust, whose goals are both to provide recreation and to make the property pay for itself within 15 years by renting out the buildings to organizations, many of which are committed to looking after world resources. Alexa's neighbors include the Thoreau Center for Sustainability, the Gorbachev Foundation, USA/
Alexa has been in its Presidio headquarters since it was founded over a year ago, and co-founder Brewster Kahle says its park-like setting and relatively low population feel like a university campus on summer break. "One of the perks of working here is you can live in the Presidio," he says. "Earlier this year, I moved into a colonel's house."