F E A T U R E S    Issue 2.05 - May 1996

Seek and Ye Shall Find (Maybe)

By Steve G. Steinberg



In 1668, the English philosopher John Wilkins presented a universal classification scheme to London's Royal Society. The scheme neatly divided all of reality into 40 root categories, including "things; called transcendental", "discourse" and "beasts". These categories were further divided into subgenuses (whole-footed beasts and cloven-footed beasts, for instance), and each was carefully documented with examples. Wilkins's eagerly awaited proposal was immediately published and distributed throughout Europe.

Today, Wilkins's system is remembered only as an example of the arbitrariness of attempts to classify the knowable universe. Indeed, the dream of organising all knowledge has been thoroughly discredited. It peaked in popularity during the 18th century, when the scope of human knowledge was still imaginable and the universe was thought to be rational. By the century's close, projects such as Wilkins's universal classification scheme, or Ephraim Chambers's comprehensive Cyclopaedia: or, Universal Dictionary of Arts and Sciences, had come to seem utopian. Although a few have continued to dream of a universal library - Vannevar Bush, who described his memex system in 1945; Ted Nelson, who has been working on Xanadu since the early 1970s - they are widely seen as laughable in our relativist, postmodern era.

But recently there have been hints of entirely new ways to classify knowledge, new systems for sorting and storing information that avoid the pitfalls of the past and can work on unimaginably large corpuses. The long-moribund fields of knowledge organisation and information retrieval are, once again, showing signs of life. The reason, of course, is the Web.

The most popular sites on the Web today are those - like the Yahoo! catalogue, like the Alta Vista search engine - that attempt to exert some kind of order on an otherwise anarchic collection of documents. The hard problems of knowledge classification and indexing are suddenly of commercial importance. The result has been a spate of high-tech start-ups, formed mostly by computer scientists and linguists, that are intent on making the Web act more like a well-organised library. Their efforts, rooted in equal parts hubris and brilliance and marked by a conviction that the problem is solvable, can seem startlingly reminiscent of John Wilkins and his contemporaries.

Admittedly, equating the Web with all human knowledge is an exaggeration - but not as much of one as you might think. A year and a half ago, the Web's content was heavily tilted toward a few niches: there was a lot about Unix and UFOs, not much about real estate or poetry. But today the breadth of the Web comes close to covering all major subjects. Indeed, at its current growth rate, the Web will contain more words than the giant Lexis-Nexis database by this summer, and more than today's Library of Congress within three years. And the Web defines "knowledge" far more loosely than any library. Even the Total Library of Jorge Luis Borges, which contained all knowledge and its contradiction, didn't include live video feeds of coffeepots. So if the entire Web can be organised, that goes a long way toward organising all of knowledge.

But the difficulty of the task quickly becomes apparent when we look at attempts to solve similar problems. The most obvious place to turn - library science - turns out to be of almost no help. For one thing, even librarians admit that the schemes used today are antiquated and inadequate: the phrase "classification in crisis" has become a cliché in the library community. The most common systems in the US - the Dewey Decimal System and Library of Congress Classification - were developed at the end of the 19th century. Unsurprisingly, they are poor at classifying knowledge in "newly" established fields like genetics or electrical engineering. More important, library classification is bound by restrictions that the digital world is not. While a physical book can be shelved in only one place, a digital document can be placed in several categories at the cost of only a few bytes.

The field of information retrieval, which focuses on automated techniques like keyword indexing for searching large databases, isn't much more encouraging for those trying to organise the Web. The simple reason: even humans are poor at deciding what information is relevant to a particular question. Trying to get a computer to figure it out is nearly impossible.

Given all this, how do researchers possibly believe they can organise the rapidly growing Web? Have they really solved the problems that have stumped scientists for the last 200 years, or are they just ignoring them? And if organising the Web really is possible, what are the implications?

Yahoo!

To find some answers, I set out on a tour of the Bay Area's new Deweys. First stop - a grubby little office park in Mountain View, where transmission repair shops nestle next to high-tech start-ups - to meet with the people behind Yahoo! Their cramped office, jammed full of dilapidated desks, seemed at odds with the light-hearted image Yahoo! projects online. But the disarray clearly reflected the company's rapid growth.

Yahoo!'s statistics are impressive. Created in 1994 by Jerry Yang and David Filo, two disaffected electrical engineering and computer science grad students from Stanford University, Yahoo! lists more than 200,000 Web sites under 20,000 different categories. Sites that track pollution, for example, are listed under Society and Culture:Environment and Nature:Pollution. These categories form what the people at Yahoo! a bit pretentiously refer to as their ontology - a taxonomy of everything. Their ordering of the Web is precise enough - and intuitive enough - that almost 800,000 people a day use Yahoo! to search for everything from Web-controlled hot tubs to research on paleontology. In almost every way you can measure, Yahoo! has successfully imposed order on the chaotic Web.

But how much longer can its hold last? Already, Yahoo! falls short of cataloguing the half-million or so sites on the Web. The sheer scale of its task is almost comical - I picture Jerry Yang as Charlie Chaplin in Modern Times, confronted with an endless stream of new work that is only increasing in speed. Sites that don't make a point of notifying Yahoo! of their existence often don't end up being listed. And as the Web continues its exponential growth, Yahoo! too will have to grow exponentially if it's to keep up.

It's a concern that Jerry Yang, the less publicity-shy of the two founders, has been thinking about a lot lately. Not that he seemed terribly worried - at least, not at first. A studiously casual 27-year-old from Taiwan, Yang had the Web-to-riches rap down. His speech was peppered with buzzwords. I imagined him coolly promoting Yahoo! - "We're a content-driven, interactive information provider" - to the executives at companies like Softbank Corp. and walking away with a couple of million dollars in financing.

Yahoo!'s technology is relatively straightforward. It works like this: first, the URLs of new Web sites are collected. Most of these come by email from people who want their sites listed, and some from Yahoo!'s spider - a simple program that scans the Web, crawling from link to link in search of new sites. Then, one of twenty human classifiers at Yahoo! looks the Web site over and determines how to categorise it.

Really, the only hard part - the only part that your average high-school geek couldn't do - is developing the classification scheme, the ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, such as, Are movies part of Art or Entertainment?

To solve this problem, Yang and Filo hired Srinija Srinivasan as their "Ontological Yahoo!". Another former Stanford student, Srinivasan is unfailingly helpful, quick to answer any question in her relaxed California accent. Perhaps that's why Newsweek claimed she was trained in library science when including her among the 50 people who matter most on the Internet. Actually, her background is in artificial intelligence. But Srinivasan was well prepared for tackling the organisation of the Web: previously she had been working at a lunatic-fringe project in Texas, attempting to teach a computer the fundamentals of human knowledge.

Starting with the ad hoc categories she inherited from Yang and Filo, Srinivasan began slowly and deliberately steering Yahoo!'s ontology toward completeness. Mainly, it's been a matter of adding new categories and reorganising hierarchies as the Web evolves from containing only specialised, technical information to containing content from every field of knowledge. But she's also set up certain guidelines to ensure consistency. For example, every regional Web site is now put in the regional hierarchy, and a cross-link to the site is placed under the appropriate topic. So a Florida real estate company is listed under Florida, with a cross-link from real estate.

A few months ago, Srinivasan told me, she was adding categories almost every day. Now major adjustments are becoming much more infrequent. She pointed to this as support for Yang's assertion that "at some point, our scheme will become relatively stable. We will have captured the breadth of human knowledge."

I'd like to think it was that easy, and that Yang would succeed. But a story he and Srinivasan told me about recent events at Yahoo! left me convinced I would have to look elsewhere for the answer.

The story began when the Messianic Jewish Alliance of America submitted its Web page to Yahoo! A classifier quickly reviewed the site - one which contains everything from Stars of David to articles about Israel, not to mention the word "Jewish" in its name - and placed it under Society and Culture:Religion:Judaism.

But here's where things got tricky. True, MJAA members are born of Jewish mothers and are hence, by definition, Jews. But they also believe that Jesus Christ is the messiah. In the eyes of most Jews, that makes the MJAA a bunch of heretics. Or at least Christians.

So when a few vocal and Net-savvy Jews saw the MJAA listed under Judaism, they let loose a salvo of email demanding that Yahoo! remove MJAA's listing. A bit taken aback by the protesters' virulence ("threats of boycotts," Yang said with amazement), Yahoo! quickly yielded and reclassified MJAA under Christianity. Of course, this caused the MJAA to protest that they were now being incorrectly labelled. After a modern-day Solomonic compromise, the MJAA and a few similar groups can now be found listed under Society and Culture:Religion:Christianity:Messianic Judaism - which is linked by a cross-reference from Judaism.

Yang looked at me sheepishly when telling this story. After all, he believes in truth, justice and the Internet way. Hell, he even gave me a mini-sermon about how the Net is egalitarian - the little guy can publish just as easily as the big guy. Yet he knows the MJAA was pushed around because it didn't have mainstream Judaism's clout.

But the MJAA story is interesting not just for exposing the realpolitik of classification. It's proof that no ontology is objective - all have their own biases and proclivities. Yang was quick to admit this: in fact, he referred to Yahoo!'s ontology as the company's editorial. "Organising the Web is sometimes like being a newspaper editor and inciting riots," he said with a touch of exasperation. "If we put hate crimes in a higher level of the topic hierarchy, well, it's our editorial right to do so, but it's also a very heavy responsibility."

Yahoo!'s success, Yang argued, is evidence that point of view and knowledge classification are not incompatible. Just as we learn to automatically compensate for right-wing bias while reading The Wall Street Journal's editorial page, we can also learn to adjust for the perspective that Yahoo! embodies. We can learn to think like a Yahoo! classifier. The real problem, Yang and Srinivasan agreed, is making sure that Yahoo!'s point of view remains consistent.

That point of view, after all, comes from having the same 20 people classify each site and from having those people crammed together in the same building, where they are constantly engaged in a discussion of what belongs where. Lose that closeness and the biases will start to become more diffuse. Yang admitted as much, saying, "It's hard to expand Yahoo!, because you end up with too many points of view." Instead of the Journal's editorial page, you end up with something like CNN, where prejudices are masked by a pretense of objectivity. For Yahoo!, that translates into a category scheme where users have a hard time guessing where they'll find what they're looking for. So Yahoo! is faced with an unforgiving trade-off between the size and the quality of its directory. If Yahoo! hires another 50 or 60 classifiers to examine every last site on the Web, the catalogue will become less consistent and more difficult to use. On the other hand, if Yahoo! stays with a small number of classifiers, the percentage of sites Yahoo! knows about will continue to shrink.

Yahoo! will probably take this latter path and simply admit that it is an opinionated guide, a sort of "best of the Web", and not a complete catalogue. That will make for a successful business - look how popular the "cool site of the day" Web pages are - but it brings us little closer to a universal library. By relying on human intelligence to organise the Web, Yahoo! falls victim to subjectivity.

Inktomi

What's needed, I decided, is an index of the Web. A concordance that keeps track of every word on every Web site. Like a catalogue, a keyword index organises Web sites based on their content, but it does so at the word level instead of by subject. Sites about Messianic Judaism are found by looking for pages that contain the words "Jesus" and "Jewish". This eliminates the subjectivity that plagues classification schemes like Yahoo! - a document either contains the word "Judaism" or it doesn't. However, indexing increases the size of the task from keeping track of millions of documents to keeping track of billions of words.

When the first concordance, or keyword index, of the Bible was compiled by Hugues de Saint-Cher in 1240, the task required the labour of 500 monks. But the labour involved is almost completely mindless; today, a computer can construct a keyword index for a small library in minutes, using a straightforward technique known as an inverted index.

An inverted index is simply a huge table, where the rows represent documents and columns represent words. If document x contains word y, then there will be a binary 1 in row x, column y of the table. To find all documents that contain a specific word, the computer simply scans for 1s in the appropriate column. With a little added work, it's possible to do more complex searches: find all documents that contain the word Wired and not the word "amphetamine". The table helps speed up the search process because only the appropriate columns, instead of the documents themselves, need to be examined.
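
To make the table concrete, here is a toy sketch of my own (in Python) - made-up documents, not anybody's production code - in which each word maps to the set of documents that contain it, and the Wired-but-not-"amphetamine" search becomes simple set arithmetic:

    # Toy inverted index over three made-up documents. Each entry plays
    # the role of one column in the table described above: the set of
    # documents in which the word appears.

    from collections import defaultdict

    docs = {
        1: "wired covers the web and its search engines",
        2: "amphetamine use is discussed in this wired article",
        3: "the web keeps growing every month",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    # All documents containing "wired" but not "amphetamine":
    print(sorted(index["wired"] - index["amphetamine"]))   # -> [1]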

Even with the aid of computers, however, the problem of scale becomes daunting as the size of the corpus increases. Depending on whom you ask, the Web currently contains somewhere between 30 million and 50 million pages. (Louis Monier, the technical leader at Digital Equipment Corp.'s Alta Vista search engine, says at least 45 million, while Michael Mauldin of Lycos says 30 to 50.) Given that the average Web page contains about 500 words, or 7 kilobytes of text, we can guess that the Web contains somewhere between 200 and 330 gigabytes of text. And these numbers are growing by 20% every month, says Mauldin. In three years, as the Web surpasses the roughly 16 terabytes in the current Library of Congress, will the inverted index become too large to store feasibly? Will it simply take too long to compute? Or will attempts at indexing the Web break down in some other, unexpected way?

To find out, I headed to the computer science department at the University of California at Berkeley, where Eric Brewer, an assistant professor, is studying these questions. I caught up with him in Cal's new computer science building, the startlingly ugly green-tiled Soda Hall. As we sat down in an empty conference room, Brewer was quick to mention that, along with grad student Paul Gauthier, he had created Inktomi (inktomi.berkeley.edu/, named after a mythological spider of the Plains Indians) - one of the largest indexes of the Web. And how, unlike other large Web indexes such as Lycos and Alta Vista, Inktomi doesn't require a half-a-million-dollar investment in computer hardware. "We didn't just throw money at the problem like those guys," he said. "We've come up with a truly scalable solution." The result, Brewer assured me, is a system that will be able to index the entire Web even five years from now.

Inktomi is one of the first real-world applications of hive computing (see Wired 1.07, page 26). The idea is to create a supercomputer by lashing together lots of existing workstations with a network, then having each workstation work on one piece of a problem. The result is cheap (because you're using off-the-shelf components) and fast (because you can keep adding more workstations to increase performance). Inktomi works by splitting the inverted index of the entire Web over four Sun SPARCstations. This is enough computational power and memory to handle about a million users per day and index several million documents. But despite Brewer's assurance, I wasn't convinced that Inktomi's technique would work once the number of documents and users had increased by two orders of magnitude. At some point, it seemed to me, the Web would be so large, and changing so fast, that it would be physically impossible to keep up. Something would break.
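
To get a feel for the hive approach, here is a rough sketch - an illustration of the general idea of splitting an index across machines by document and merging their answers, not Inktomi's actual design:

    # Rough illustration of splitting an inverted index across several
    # machines ("workers") by document id, then merging each worker's
    # partial answer to a query. A sketch of the general hive idea only.

    from collections import defaultdict

    def build_index(docs):
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    def partition(docs, n_workers=4):
        # Assign each document to one of n_workers shards.
        shards = [dict() for _ in range(n_workers)]
        for doc_id, text in docs.items():
            shards[doc_id % n_workers][doc_id] = text
        return [build_index(shard) for shard in shards]

    def search(shard_indexes, word):
        # Each worker answers from its own shard; the front end merges.
        hits = set()
        for index in shard_indexes:
            hits |= index.get(word, set())
        return hits

    shards = partition({1: "web spider", 2: "spider silk", 3: "web page"})
    print(search(shards, "web"))   # -> {1, 3}

Adding capacity, in a scheme like this, is mostly a matter of adding more shards - which is exactly the scalability Brewer was selling.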

My first guess was that the bottleneck would be getting the data. Right now, indexers use software spiders that crawl through the Web and download every page for indexing. The spiders start with a list of a dozen or so known sites. They index these pages, then follow every link to every new page the sites contain and index those. The process repeats until the spider can't find any links in the Web that it hasn't visited. Back when the Web was young, when it contained only a few thousand pages, this procedure took less than a day. Now it takes even the quickest spiders three or four days to roam the entire Net. Alta Vista's spider, for example, downloads 2.5 million pages a day out of the more than 21 million it knows about. Won't the Web soon be so large, I asked Brewer, that the index cannot be completed before more pages are added, making it perpetually out of date?

Brewer says no: according to him, we will just use smarter and faster spiders. After all, he pointed out, it's possible to get a 155Mbps connection to the Internet. That means the entire contents of the Web can theoretically be sucked down in about five hours. Sure, the Web is growing, said Brewer, but so is available bandwidth. The real problem is that Web spiders spend most of their time waiting to connect to the Web site.
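
His five-hour figure roughly checks out on the back of an envelope, taking the Web at about 300 gigabytes of text and ignoring protocol overhead:

    # Back-of-envelope check of the "about five hours" claim.
    web_bytes = 300e9                 # roughly 300 GB of text (see above)
    link_bits_per_second = 155e6      # a 155Mbps connection
    hours = web_bytes * 8 / link_bits_per_second / 3600
    print(round(hours, 1))            # -> about 4.3 hours, before overhead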

So, to speed up the process, Inktomi uses multiple computers to crawl the Web. A few dozen workstations in the Berkeley computer science department are set up to start crawling the Web when nobody else is using them. By breaking the problem up this way, Inktomi can take almost full advantage of its Net connection. Inktomi also plans a few other tricks - for example, to keep track of which Web sites change most frequently and make sure it checks those sites every day.
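
Put into code, the basic crawl loop looks roughly like this - a minimal single-machine sketch of my own, with simplistic link extraction and an arbitrary page limit, not Inktomi's spider:

    # A toy spider: start from a few seed URLs, fetch each page, pull out
    # its links, and keep going until nothing new turns up or a limit is hit.

    import re
    import urllib.request
    from collections import deque

    LINK_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

    def crawl(seeds, max_pages=100):
        seen = set(seeds)
        frontier = deque(seeds)
        pages = {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                raw = urllib.request.urlopen(url, timeout=10).read()
            except OSError:
                continue                          # unreachable site: skip it
            html = raw.decode("utf-8", "replace")
            pages[url] = html                     # hand the page to the indexer
            for link in LINK_RE.findall(html):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages

Inktomi's trick, as Brewer described it, is simply to run many such loops at once on idle workstations, so the time spent waiting on slow Web sites overlaps.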

OK. But what about storage? After all, you're trying to keep track of the entire Web! You're trying to store in one place the contents of hundreds of thousands of hard drives. In a few more years, surely, that will be prohibitively expensive.

Not so, claimed Brewer, excited by the opportunity to showcase another advantage of his system. Remember that you need to store only the inverted index, instead of the actual documents. Inktomi uses some clever techniques to reduce the table's size even further, so a document takes up only about 4% of its original space. Which means that even when the Web is a terabyte of text, a complete index will take up only about 41 gigabytes. You can buy that kind of disk space for less than $10,000 today.
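
The arithmetic behind that 41-gigabyte figure is simple enough, if a terabyte is counted as 1,024 gigabytes:

    # Where "about 41 gigabytes" comes from: 4% of a terabyte of text.
    web_text_gb = 1024               # one terabyte, counted as 1,024 GB
    index_ratio = 0.04               # Inktomi's quoted index-to-text ratio
    print(web_text_gb * index_ratio) # -> 40.96, i.e. roughly 41 GB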

Admittedly, Inktomi currently keeps track only of which words appear in which document - it doesn't know the order in which the words occur. That means Inktomi can't search for occurrences of "Clinton" within five words of "President", for example. However, this is an easy thing to add, said Brewer, and he'll add it soon. Even with this word-proximity information, the index will still be only about 15% of the total size of the Web.
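
The proximity feature comes down to storing word positions alongside document ids. Here is a toy sketch of my own, not Inktomi's code:

    # A positional index: for each word, remember where it occurs in each
    # document, so "X within five words of Y" queries become possible.

    from collections import defaultdict

    def positional_index(docs):
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.lower().split()):
                index[word][doc_id].append(pos)
        return index

    def within(index, w1, w2, window=5):
        hits = []
        for doc_id in set(index[w1]) & set(index[w2]):
            if any(abs(p1 - p2) <= window
                   for p1 in index[w1][doc_id]
                   for p2 in index[w2][doc_id]):
                hits.append(doc_id)
        return sorted(hits)

    index = positional_index({
        1: "president clinton spoke to reporters today",
        2: "clinton street runs past the old downtown grand president hotel",
    })
    print(within(index, "clinton", "president"))   # -> [1]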

OK, OK, I was willing to acknowledge, storage might not be a problem. But it's not just the size of the Web that's growing, it's the number of users. What happens when everyone in the whole world is connected to the Web, and half of them are trying to use Inktomi at the same time? Perhaps computational power will be the bottleneck.

Definitely not a problem, Brewer insisted unflaggingly. Inktomi has been stress tested at more than 2.5 million queries a day with no difficulty - and that's with just four outdated workstations. Hook together 40 state-of-the-art computers and Inktomi should be able to handle 100 million queries a day - easily. The Web may be growing exponentially, but processors are on that same curve.

I left Berkeley convinced that indexing the Web, while likely to remain a challenge, won't be insurmountable. But after using Inktomi more, I started to wonder if an index really satisfied my desire for organising knowledge. I could usually find what I was looking for, but I felt as if I was poking around in the dark. I remembered something Jerry Yang had told me at Yahoo!: "The difference between a catalogue and an index is that a catalogue provides context." That made sense now.

A catalogue not only helps you find a Web site, it also tells you how it fits into the grand scheme of things. Yahoo!, for example, shows that the site for the United Patriotic Alliance belongs to Society and Culture:Alternative:Militia Movement. It also lists sites that offer an opposing viewpoint in that same section. And it includes a handy cross-reference to Society and Culture:Firearms. Doing a keyword search for the United Patriotic Alliance, on the other hand, doesn't provide any of that. It's like operating with blinders on: you can only see what's directly in front of you.

And, I found, there's another, more subtle drawback. Indexes not only don't provide context for the document, they don't provide context for the keywords - because the user can immediately jump to the page that contains a particular word. Using an online index, it's all too easy to find out what someone has said about, say, racism, and then quickly take that quote out of context. By allowing you to jump right to the good stuff, instead of forcing you to read all the way through the document, indexes promote scanning, not reading.

Organising knowledge with a keyword index is less like a universal library than like a giant, Burroughs-style cut-up poem. Pages end up grouped together for no reason other than a random confluence of words. While indexes solve the problems of subjectivity and scale that plague classification schemes, they don't impose enough order. The more I tried to use Inktomi, the more I realised that operating just on words is too low-level. There needs to be something in between.

Architext

Finding that in-between has long been a goal of information retrieval research. Even in the 1960s, when online databases were puny by comparison, it was clear that simple keyword searching was inadequate. What was needed was some way to make sense of a document, to figure out what it was really about. But despite concerted efforts, nothing that really works much better has been found. That's why Architext Software's announcement last October of the Excite system, which indexes the Web by concept rather than by keyword, was greeted with as much scepticism as enthusiasm.

For one thing, Architext had come out of nowhere. Founded in 1993 by six Stanford students, none with any real background in information retrieval, the company picked up $3 million from Kleiner Perkins Caufield & Byers and began to promote Excite's "concept-based searching". But Architext didn't release any details about how the system actually worked, nor did it enter the annual TREC (Text Retrieval Conference) competitions, where search engines compete head-to-head. In short, it looked like just one more case study on how the word "Web" has the ability to cloud investors' minds.

That's why I was so surprised when I met Graham Spencer, Architext's 24-year-old vice president of technology. Instead of the glad-handing salesman I expected, he was a self-described punk. Tall, ectomorphic, with tightly cropped hair, Spencer looked out of place in the cubicle-filled office. But, he quietly insisted, he has stuck to the punk do-it-yourself ethic by founding a start-up and making sure it offers a useful service "without fucking anyone over".

The actual service, it turns out, was decided on somewhat arbitrarily. The company's founders knew they wanted to start a business but weren't sure what kind. It was Spencer who suggested they build a search engine, because "information retrieval seemed like the easiest place to make progress".

The "problem" of information retrieval can actually be nailed down to two issues: synonymy and homonymy. The first is a problem because a search for documents containing the word "film" won't find documents containing synonyms such as "movie". Homonyms, words that are spelled the same but have different meanings, are a problem because the search will find documents containing "a film of oil". All efforts at improving information retrieval involve trying to remove these problems. For example, some of today's best systems - such as Cornell's SMART engine - use a thesaurus to expand automatically a user's search and capture more documents. Some also eliminate homonyms by trying to figure out how a word is being used in a document. This is done by collecting statistics on which words commonly occur together. This way, if the search engine sees the word "film" near the word "director", it can guess that the word is being used to refer to a motion picture. When I quizzed Spencer on the actual technique Excite uses, he became noticeably more circumspect. On the one hand, he wants to brag about his system's algorithm so people don't think he's just full of hype. On the other hand, he doesn't want to give too much away. From what he did finally tell me, the system appears to use a fairly sophisticated approach. The idea is to take the inverted index of the Web, with its rows of documents and columns of keywords, and compress it so that documents with roughly similar profiles are clustered together. This way, two documents about movies will be clustered together - even if one uses the word "movie" and one uses "film" - because they will have many other words in common. The result is a matrix where the rows now represent concepts instead of actual documents. This cleanly attacks the problems of both synonymy and homonymy.

It turns out that the basic idea behind this approach was first developed in 1988 by a group of scientists at Bellcore, under the name Latent Semantic Indexing. The technique, although shown to be very effective, has been plagued by its heavy computational requirements. It's simply too slow for most practical applications. But, after all, that's what computer scientists like Spencer are good at. And what Architext has apparently done is find a way to perform LSI more efficiently.
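
A toy version of the idea - with illustrative documents and parameters of my own choosing, not details of Excite's system - can be built from a truncated singular value decomposition of the term-document matrix:

    # Toy Latent Semantic Indexing: build a term-document matrix, take a
    # truncated SVD, and compare documents in the reduced "concept" space.
    # The documents and k=2 are made up for illustration.

    import numpy as np

    docs = [
        "the director shot the movie on location with local actors",
        "the director shot the film on location with local actors",
        "a thin film of oil coated the engine parts",
    ]
    vocab = sorted({w for d in docs for w in d.split()})
    A = np.array([[d.split().count(w) for d in docs] for w in vocab], float)

    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                   # keep the two strongest "concepts"
    doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T  # one reduced vector per document

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # The two cinema documents land close together despite the movie/film
    # wording difference; the oil-film document does not.
    print(cosine(doc_vecs[0], doc_vecs[1]), cosine(doc_vecs[0], doc_vecs[2]))

On a real corpus the matrix is enormous and the decomposition expensive, which is exactly the efficiency problem Architext claims to have cracked.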

What makes Excite so exciting is that it comes up with a classification scheme through statistical analysis of the actual documents. It learns about subject categories from the bottom up, instead of imposing an order from the top down. It is a self-organising system. This eliminates two of the biggest criticisms of library classification: that every scheme has a point of view, and that every scheme will be constantly struggling against obsolescence.

To come up with subject categories, Architext makes only one assumption: words that frequently occur together are somehow related. As the corpus changes - as new connections emerge between, say, O. J. Simpson and murder - the classification scheme automatically adjusts. The subject categories reflect the text itself - not the world-view of a few computer scientists in Mountain View, or of a 19th-century Puritan named Melvil Dewey.

Anecdotal evidence in support of Excite is fairly strong. I've tried doing identical searches on Inktomi, Lycos and Excite, and found that Excite returned the most relevant documents. Which isn't to say that Excite is perfect: it still returned a fair number of superfluous documents that left me scratching my head, trying to figure out the possible connection. This isn't too surprising, since some words may frequently occur together even if they aren't really related - thereby throwing off Excite's statistical algorithms. Nonetheless, what bothered me most about Excite is not how it searches, but what it searches.

Excite doesn't just index the Web - it also indexes every message posted to about 10,000 Usenet newsgroups. That sounds harmless - after all, Usenet is a completely public message board that anyone can read. Yet searching Usenet with Excite, or similar services such as DejaNews and Alta Vista, can feel surprisingly invasive. It's possible, for example, to search on a person's name and find every message they've posted - whether on comp.client-server or rec.arts.erotica. Using these tools, anyone can build a profile of a person's interests, based on where they post.

Spencer became animated when I asked him about the privacy issue; he launched into a topic he had obviously thought about. "I think that indexing Usenet is OK because it is a completely public forum, but other things do make me uncomfortable." For example, Web indexes often end up indexing the archives of Internet mailing lists.

"There is a process of joining a mailing list, so it does seem kind of private." Nonetheless, Architext has plans to take indexing even further. "One thing we want to do is index IRC," said Spencer. "It will let you find people who are talking about things you're interested in." And it will let anyone play at being Big Brother.

But Architext's Excite takes a significant step toward building a universal library. By using concepts instead of keywords, information is forced into an organised structure instead of being left as a jumble of words. Excite still has a couple of technical shortcomings: its fairly simple statistical approach to automatic classification is prone to error, and it doesn't provide the context a system like Yahoo! does.

Oracle

I found the pieces I was looking for at Oracle Corp.'s sprawling campus in Redwood Shores. I had been hearing rumours for a while about the product they were developing, but it seemed like no one in the close-knit Web indexing community really knew how it worked. What little I had heard was contradictory: Oracle's software was a hopeless pipe dream, a Byzantine attempt at artificial intelligence that would never work - or it was the mother of all search engines.

To clear things up, I met with the man in charge of the project at Oracle, Kelly Wical. Genial and rotund, Wical isn't some computer scientist fresh out of college: he's been working for the last 20 years on a system to help computers understand English. His goal is a program that can not only analyse a sentence and figure out information such as what the important nouns are and how they are being modified, but actually understand the written word from the reader's point of view. His quest began while he was at a computer company in Houston, working on a program to help users search gigantic online manuals for information on specific topics. Then, in 1988 he founded Artificial Linguistics Inc. and continued to attack the problem of understanding written English, producing a sophisticated grammar checker as a spin-off of its core technology. In 1991, ALI was purchased by Oracle, and Wical was brought on board to develop his system under the name ConText.

None of which may sound terribly relevant to building a universal library. Except that ConText's ability to understand English comes both from its knowledge of grammar and from its incredibly detailed hierarchy of concepts. ConText knows, for example, that Paris is a city in France, which is a country in Europe. This combination of knowledge is exactly what Excite lacks (and what causes its automatic classification algorithm sometimes to make glaring errors).

The problem is that creating such a comprehensive knowledge base seems impossible. OK, I said, even assuming that this linguistic engine can parse English sentences (something scientists have been struggling with for years), the process of creating a taxonomy of concepts, not just major subjects, would require an unprecedented amount of effort.

Wical smugly agreed. Already, more than 100 person-years have been spent building ConText's database of knowledge. To do so, Oracle has employed dozens of "lexicographers", a lofty title for what are often college interns who do the necessary legwork. "We've sent people to grocery stores, to scientific conferences, even sex shops," said Wical. There, the lexicographers identify the subfields of metallurgy, for example, or the types of pornography, and then incorporate the results into ConText's ontology. This data is supplemented with automatic statistical techniques, similar to those used by Excite, that analyse huge collections of documents for unique concepts and relations between them.

The result of all this effort is a nine-level hierarchy - with each level offering increased specificity - that currently identifies a quarter-million different concepts in English. The scheme also includes approximately ten million cross-references between related concepts, such as Paris and France, roadways and death. ConText uses this data when it analyses a document and then decides which of the concepts best describe the document's topic.
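
Wical's description suggests a structure along these lines - the entries below are invented for illustration and say nothing about ConText's internal format:

    # Invented, greatly simplified picture of a concept hierarchy with
    # cross-references between related concepts.

    parents = {
        "Paris": "France",
        "France": "Europe",
        "Europe": "Geography",
        "Supercomputing": "Computer Industry",
        "Computer Industry": "Hard Sciences",
        "Hard Sciences": "Science and Technology",
    }

    cross_references = {("Paris", "France"), ("roadways", "death")}

    def path(concept):
        # Walk up the hierarchy from a concept to its top-level category.
        chain = [concept]
        while chain[-1] in parents:
            chain.append(parents[chain[-1]])
        return ":".join(reversed(chain))

    print(path("Supercomputing"))
    # -> Science and Technology:Hard Sciences:Computer Industry:Supercomputing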

I wanted to see how ConText works in practice, so Wical let me try out the text-analysis engine on a few articles I had brought along while he watched over my shoulder. Some of ConText's features - like its summarise tool, which takes a document and tries to compress it down to just the important parts - turned out to be pretty unimpressive. But when it came to document classification, ConText was unerring.

An article I wrote about hive computing, for example, was correctly classified under Science and Technology:Hard Sciences:Computer Industry:Supercomputing; Science and Technology:Hard Sciences:Computer Industry:Workstations; and Business and Economics:Economics. An excerpt from Takedown, a book by Tsutomu Shimomura and John Markoff about notorious hacker Kevin Mitnick, was classified under Science and Technology:Hard Sciences:Computer Industry:Cyberculture:Hackers.

The only time ConText really failed at classification was when we tried it on a piece of fiction. Wical happened to have a chapter of Tolkien's The Hobbit on his hard drive, and it came back classified under Geography and Mythology, neither of which seemed to me the real topic of the book. I half jokingly took Wical to task for this, and he shrugged. "What's in The Hobbit isn't what it's about."

An obvious enough point, but it underscores an important truth about information retrieval. No matter how good the technology, it can only work when the meaning of a document is directly correlated to the words it contains. Fiction - or anything that relies on metaphor or allegory, that evokes instead of tells - can't be usefully classified or indexed. Its meaning comes from the reader. That's a big limitation for any attempt to automatically organise the Web.

Despite this concern, I left Oracle's headquarters convinced that a useful, complete organisation of the Web was possible. The Web no longer seemed too large, nor computers too dumb. I imagined a system combining Inktomi's scalable hive computing, Excite's self-organised classification and ConText's raw knowledge. I was, though, starting to question the real point of indexing the Web. I'd had some vague notion of a universal library advancing science, informing voters, saving the world - who knows? The feeling of omniscience that came from searching gigantic databases like Lexis-Nexis seemed reason enough. But something Wical had said made me wonder.

Why bother?

The issue came up when I asked Wical what could have possibly kept him interested and motivated to work on the same project - the quest to understand English - for the last 20 years. At first, he just made some vague noises: it was an "interesting problem with a lot of practical application". But, wanting to hear something that jibed with my own reasons, I kept probing. Finally, he leaned back and said, "My personal reason? Well, I want to talk to hobbits."

Wical slowly began to talk about his fascination with The Lord of the Rings and his dream to bring Tolkien's books to life by writing a computer program that understands everything in the fantastical trilogy. Once the books have been made digital, Wical said, they could be interactive. He could enter the story.

I found this reply so unexpected that it made me wonder if my own motives for wanting to organise knowledge might appear equally odd. I decided to watch how people use existing search engines in order to understand their popularity. But after a dull half hour watching queries come into Inktomi, I still had no answer.

True, just looking at the most common search terms pointed to an obvious driving force: sex. The top ten search terms sent to Inktomi were "sex", "nude", "pictures", "adult", "women", "software", "erotic", "erotica", "gay" and "naked". But these terms made up less than a quarter of all queries. Other search terms ranged from people's names to "wood-burning stove" to "nine-inch nails".

So I went to see Brewster Kahle. The founder of Wide Area Information Servers Inc., Kahle is one of the handful of people who have actually managed to get rich from information retrieval. With his huge, bushy hair and exaggerated hyperkineticism, he looked like a clown after too much coffee. But no one knows more about the intersection of the Internet and knowledge organisation.

"Information retrieval is not about finding how much tannin there is in an apple," he declared in his San Francisco office. "It's about letting everyone publish." And he was off on a long rant about how organising the Web matters, because, as Architext's Spencer had told me, "it's about people finding people, not finding information." Indexing the Web allows the 40 people interested in Bulgarian folk-singing to find each other; it creates communities.

Even with Kahle's gesticulations, the argument didn't seem very convincing. Then Kahle started speaking at a fever pitch, one foot on the table, arms oscillating wildly: "I grew up watching just a whole lot of TV, signals coming right at me. Then, at school, teachers would just tell me stuff, and I'd just try to remember it. But, when I finally hit graduate school, the teachers would say, 'Here's what's known, here's what isn't. If you make any progress... here's my home number!' Finally, I had a chance to contribute!"

Here was something I could relate to! Now I understood why indexes mattered. As Yahoo!'s Jerry Yang had said: "If the Web was a broadcast medium, then we could just do something like TV Guide." But once anyone can publish - once anyone can contribute - we need some new kind of organisation.

Knowledge organisation is important not because of how much knowledge there is now, but because of how many people are becoming involved in its production. Web indexes now play the same role that atlases did in the 16th century. Both hold an appeal that goes far beyond any possible usefulness. Both lead to dreams of exploring new territories, of discovering new opportunities. And both are evocative because of what they leave blank.

Steve G. Steinberg (steve@wired.com) is a section editor at Wired US.