Web Site Expert Cover Report (English)

Build your Own Search Engine: Yahoo! BOSS

On the Hakia search engine, the term "Riza Berkan" pulls up some highly organized results. There's a biography section: "Dr. Berkan is a nuclear scientist with a specialization in artificial intelligence and fuzzy logic." There are also separate sections for images, news and interviews, awards and accomplishments, and bibliography, each with its own links. Not every query on Hakia produces results this crisp-it helps that Riza Berkan is Hakia's founder and CTO. But when I plugged in famous names from the U.S. presidential election-Barack Obama, John McCain, Sarah Palin, Joe Biden, and for that matter, Paris Hilton (who created her own reply video to a McCain ad)-each was presented in results decidedly more organized than you'd find using mainstream search engines.

Berkan calls Hakia a general purpose "semantic" search engine, but to create it, he didn't have to start from scratch. He used an API called Yahoo! Search BOSS-as in "Build your Own Search Service." With BOSS, anyone with decent Web programming skills can create a search engine of their own making. Launched last July, BOSS is the successor to Yahoo! Search API. The difference is as much conceptual as technical. "The initial API was a way to distribute our search product," said Bill Michels, senior director and general manager of the Open Search Platform. BOSS, he said, represents a philosophical shift. Its purpose is not to drive traffic to Yahoo! search, but to use that engine to build a search platform of your own. "The API gives you more flexibility: you don't have to take our page rankings, our branding, or our presentation. Anything you want to change can be changed. We even allow you to re-rank the search results based on what's relevant to you. Nobody else does that."

Michels said that search APIs have traditionally been about distribution, not about enabling third parties to build their own search products. "That was true with Yahoo! Search API, as well. Whatever you build, it's got to say 'Powered by Yahoo! Search' or whichever API you are using. In addition, you are not always getting the full search index. So you wind up spreading the word about someone else's product and driving people back to the 'real' product." With BOSS, he said, not just the API, but the business proposition itself, allows you to build a search engine of your own. This "white label" approach can be seen on both Hakia and another early BOSS adopter, Me.dium. On both sites, the term "Powered by Yahoo!" is nowhere in evidence. Berkan's blog does have an entry explaining how BOSS integrates with the Hakia architecture, but that was his idea, not a Yahoo! requirement.

Yahoo! has also removed the limit of 5,000 searches per day. Instead, developers agree that in the future, their site will either feature Yahoo! advertising or pay a fee. And so it would appear that Yahoo! is thinking of BOSS as a way to bring in additional revenue by featuring its advertising in search engines other than its own. Yahoo! provides the API, you use the API to build a search engine, users click on the featured ads, and Yahoo! gives you a share of those revenues. In the competition with Google, BOSS will probably not change the balance. So far, there are few search engines that actually use the technology. But Yahoo! spokespeople express high hopes. "Over the course of years, we want this to be much more than a blip," said Prabhakar Raghavan, head of research and search strategy at Yahoo!, in an interview with the New York Times. The newspaper said that as of last May, Yahoo! had 20.6 percent of all searches in the United States, versus 61.8 percent for Google.

BOSS's ability to change that 1:3 ratio with its chief competitor remains to be seen. But for developers, the benefits of the API are clearer, especially when it comes to branding. "It's important to point out just how radical the BOSS concept is," said Peter Newcomb, founder and CTO of Me.dium, another early BOSS adopter. "It really changes the game for anyone wanting to get into the search business. It's a great way to add value to any site that can use search, without requiring direct attribution to Yahoo! and without the usual usage restrictions. With BOSS, you can govern the look and feel and you can change how the search is configured. It's not just manipulating the results with a Boolean operator or two. The API offers all kinds of interesting filters, specifiers, and fairly complex operators. So if you have a website about, say, biking, you can 'tune' the results to that interest."

Newcomb's point about adding value to other sites is worth repeating. Yahoo! has promoted BOSS by pointing to websites that present themselves largely as alternative search engines. But the API is flexible enough that BOSS would also appear to have value for existing websites that want to feature a more specialized search engine-in terms of presentation, ranking of results, or both. In other words, a search engine built with BOSS does not need to be at the center of your site to make a difference. You can offer search without calling yourself a new search engine.

For developers, the biggest challenge posed by BOSS, and the key to BOSS's success, is in coming up with an idea that sets your search engine apart. The mainstream search engines may not be all things to all people, but they are what most people call home-if not via an actual home page, then by a search box in the browser. Setting your engine apart means giving people something they can't quite get anywhere else. With the BOSS API, said Michels, "you can innovate with rankings, with presentation, and in blending in your own content. You can bring in your technology, data, insights into your user base, and metadata associated with other URLs." Yahoo! also offers an experimental BOSS Mashup Framework, in which SQL-like commands can mash up the BOSS API with third-party data sources. Hence, depending on your creativity and approach, you may be able to attract people with specific interests, whether sports or politics, with a search engine more tailored to their needs. Or you might build a search engine tailored to the users of an existing site-so that the searches they conduct bring in results that meet their interests-and your website's focus.
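
To make the idea of innovating with rankings concrete, here is a minimal Python sketch of re-ranking, written for illustration only. It assumes the search results have already been fetched and reduced to a title and a URL. The Result class, the rerank function, and the DOMAIN_BOOST table are hypothetical names, and the scoring rule is just one simple possibility, not anything Yahoo! or the sites in this article are known to use.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Result:
    title: str
    url: str


# Hypothetical boosts: domains that an imaginary biking site prefers.
DOMAIN_BOOST = {"bikeforums.example": 3.0}


def rerank(results, domain_boost):
    """Re-order results, best first.

    The engine's original position gives a base score (earlier is higher),
    which is then multiplied by the boost for the result's domain.
    """
    def score(item):
        position, result = item
        base = 1.0 / (position + 1)
        host = urlparse(result.url).hostname or ""
        return base * domain_boost.get(host, 1.0)

    ranked = sorted(enumerate(results), key=score, reverse=True)
    return [result for _, result in ranked]
```

Because the original position still contributes to the score, the engine's own ranking is bent toward the site's interests rather than thrown away.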

Hakia: weighing toward credibility

For Berkan, the opportunity to build a search engine came out of his background. A nuclear physicist involved with information processing, he co-authored the 1997 book Fuzzy Systems Design Principles: Building Fuzzy If-Then Rule Bases, which was published by IEEE. "You can't operate a nuclear system with junk information," he said. And "junk information" is how Berkan characterizes the typical results of a Web search. "The information being pushed today is popularity-based," he said, as opposed to at least aspiring to some academic standard. "With all the search engines today, including Google and Yahoo!, it's very much like getting up in the morning and turning on CNN. What is pushed to you is whatever is popular. That's the perspective, and there really has been no other perspective available."

With a semantic search, he said, the driving force is not popularity but credibility-that is, the results come from academic sources that are less biased and more verifiable. "For instance, if you search on the benefits of aspirin, a conventional search engine has a mixture of sites. As a consumer, you don't know which are credible and which are not. With Hakia, we are trying to bring you results that are more credible." Berkan thinks Hakia will first be attractive to professional users doing what he calls "knowledge-intensive" searches in the areas of medicine, finance and law-"where the quality of information can be critical." That difference is not always apparent, at least not yet. When I searched on "aspirin" and the cholesterol-lowering drug "Lipitor" in both Yahoo! and Hakia, the sites provided by each engine largely overlapped. The biggest apparent difference was that Hakia's were better categorized: "Basic information and FAQ," "Diseases treated by this drug," "Side effects," "Clinical Trials," "News," "Research and statistics." Hakia is in beta, and Berkan doesn't minimize the challenge. "We are trying to finish the site this year, but a semantic search is a difficult thing to build, and we expect it could take years." On the other hand, he said, the BOSS API is comparatively easy to use. "Developers who are considering it shouldn't think twice. Yahoo! has provided a lot of resources, and there's no point in re-inventing the wheel."

Me.dium: social browsing

If Hakia emphasizes credibility over popularity, Me.dium aims for the reverse-emphasizing search results that reflect the immediate interest of its users. The site offers a downloadable browser extension to enable what Me.dium calls "social browsing," the ability to surf the Web with friends. "The content of the index of our search engine is based entirely on what our users are interested in," said founder and CTO Peter Newcomb. Me.dium uses BOSS to fill in gaps. "Without BOSS, our search engine works very well for what people are actually interested in, right now. During the presidential election, for example, you'll get great results. However, if you're looking for a needle in the haystack-obscure error messages, for example-our technology is not going to do a great job with it. Before BOSS, our only option was to give really bad results or no results at all. The great thing about BOSS is that we can bring in Yahoo! results, which we can merge right in." Newcomb describes BOSS as a "super-simple advanced query access to the Yahoo! search infrastructure, without any real constraints."

When a search term is entered, it is added to a URL template that is sent to BOSS, which returns an XML file. JSON (JavaScript Object Notation) is also supported. "We then mix those results using a proprietary algorithm," said Newcomb. "In some cases, our results are so good, we just use those. In other cases, we actually don't have any results-so we just use Yahoo! But in most cases, we take the merged results and present them to the user." The coding didn't take long. "Our first swipe at it took about a week, and we were still building our search engine infrastructure. The actual Yahoo! integration was fairly trivial. If you just wanted to use BOSS and 're-decorate' the results, that could be done in an afternoon."
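
Me.dium's mixing algorithm is proprietary, so the Python sketch below is only a stand-in that mirrors the three cases Newcomb describes: use your own results when there are enough of them, fall back to Yahoo! when you have none, and otherwise merge the two. The function name merge_results, the "enough" threshold, and the simple own-results-first merge rule are all assumptions made for illustration.

```python
def merge_results(own, yahoo, enough=10):
    """Combine two lists of result URLs, each ordered best first.

    - If the site's own index already yields `enough` results, use only those.
    - If it yields nothing, use the Yahoo! results as they are.
    - Otherwise put the site's results first and let Yahoo! fill the gaps,
      skipping any URL that has already been listed.
    """
    if len(own) >= enough:
        return own[:enough]
    if not own:
        return yahoo[:enough]

    merged, seen = [], set()
    for url in own + yahoo:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged[:enough]
```

A real implementation would presumably weigh relevance scores rather than simply concatenating the lists, but the control flow stays the same.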

Newcomb said that the challenge with BOSS has more to do with user expectations than with implementation. "Users expect that search results are whatever Google or Yahoo! gives them. Our results can be somewhat different, and changing those expectations can be difficult." Newcomb maintains that when a query is on a particularly hot topic, Me.dium's results are markedly better than with a conventional search engine-but users aren't always aware of the comparative differences. The conventional search engine remains the norm, at least until proven otherwise. Newcomb cites as an example the Cuil search engine, which got a lot of attention when it launched as a Google competitor. "They fell flat on their faces because people were expecting something, and they didn't get it. And that has to do with the user interface, with the results you get back, and with niceties like search suggestions if you misspell a word. You have to get all that right first before people will tolerate differences."

A related challenge, said Newcomb, is that people also expect search results to be fast-in less than half a second, or even a tenth of a second. "Yahoo! by itself does a pretty good job on that. But BOSS represents an extra hop, and therefore, the response is slightly longer. There's really nothing major you can do about that, short of building your own infrastructure. But it's worth noting. In our case, it's important that when we get a search term, we don't first run our search, and only then search Yahoo! We do both in parallel."
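
Running both searches in parallel can be sketched in Python with a thread pool. This is an illustration under stated assumptions, not Me.dium's code: search_own_index and search_boss are hypothetical stand-ins that sleep briefly to simulate latency instead of making real requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def search_own_index(term):
    """Hypothetical stand-in for a site's own search; sleeps to simulate work."""
    time.sleep(0.05)
    return [f"own result for {term}"]


def search_boss(term):
    """Hypothetical stand-in for the remote BOSS request and its extra hop."""
    time.sleep(0.05)
    return [f"yahoo result for {term}"]


def search_both(term):
    """Start both searches at once, then wait for both to finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        own_future = pool.submit(search_own_index, term)
        boss_future = pool.submit(search_boss, term)
        return own_future.result(), boss_future.result()
```

With this arrangement the user waits roughly as long as the slower of the two searches, rather than the sum of both.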

I asked Newcomb how he can compete in one of the most entrenched markets imaginable, in which "Google" has become a verb ("let's Google it"), and Yahoo! and Microsoft are both trying their best to catch up. How does a small search engine with no marketing budget and no big development team attract eyeballs? "Getting people to try our service is absolutely a hard thing. In fact, Yahoo! and Microsoft are having a hard time getting people to use their engines instead of Google's. Google's brand is so incredibly strong, and anyone competing against them will have a huge challenge." Newcomb said that for smaller search engine companies, there's really no choice-you can't go head-on, but must think of yourself as filling a niche-finding a need the big search engines have overlooked. "Me.dium is two million people strong and growing," he said.

Newcomb said that Me.dium's strengths could be seen last spring during the U.S. presidential primaries when the search engine first came online. "At the time, if you did a mainstream search on 'Clinton,' you'd get a lot of stuff about Bill Clinton. You'd have to go to the second or third page to see anything about Hillary. The first time we did a search on our engine, it came back with YouTube videos of Hillary Clinton's latest speech. Because that's what people are interested in, right now. Whereas Google and Yahoo! are interested in what's historically of interest. Now they've since caught up, but it took a long time." Even so, Newcomb doesn't expect that everyone will make Me.dium their default search engine.

Using BOSS

To use BOSS, developers first obtain a BOSS App ID. Registration requires some basic information about the developer, company, and the application being built. From there, developers can use BOSS to access Yahoo! search services: Web, News, and Image, as well as Spelling Suggestions, which is becoming an expected feature on all English-language search engines, helping ensure that a misspelled term will not run into a dead end. Yahoo! promises additional search "verticals," as these selected searches are called, as well as additional data sources. The BOSS API, like other Yahoo! Web Services, is "REST-Like" (representational state transfer), with parameters encoded into the request URL. The returned results are in XML or JSON, as determined by the programmer, who can then change the result order, eliminate any results they don't want, and blend in their own data. Yahoo! says it is also releasing "an experimental Python library called the Boss Mashup Framework, which provides simplified interfaces for retrieving search results via the Boss API. The framework also provides functions for remixing the results with other data sources."
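
Since the parameters are encoded into the request URL, a request can be assembled with nothing more than string handling. The Python sketch below shows one way to do it. The host name, the path layout, and the parameter names appid and format are assumptions that do not appear in this article, so check them against Yahoo!'s documentation before relying on them. No request is actually sent.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint layout; not given in the article.
BASE_URL = "http://boss.yahooapis.com/ysearch"


def build_request_url(term, app_id, vertical="web", fmt="xml", base_url=BASE_URL):
    """Build a REST-style request URL for one search term.

    The search term goes into the URL path; the App ID and the
    response format (XML or JSON) go into the query string.
    """
    if fmt not in ("xml", "json"):
        raise ValueError("fmt must be 'xml' or 'json'")
    path = f"{base_url}/{vertical}/v1/{quote(term, safe='')}"
    return f"{path}?{urlencode({'appid': app_id, 'format': fmt})}"
```

Passing base_url as an argument keeps the sketch usable even if the real endpoint differs from the one assumed here.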

As with the Yahoo! search engine itself, multiple languages are supported, including Japanese-with language and region set using Universal BOSS API arguments that apply to Web, Image and News searches. Other universal arguments include the number of results to return (10 is the default, 50 maximum), the XML/JSON preference, and the ability to restrict searches (except image searches) to a specified list of URLs. The request URL can be formatted to deliver more specific results: content can be restricted to a set of pre-defined sites using the 'sites' URL parameter, with a comma-separated list of domains, e.g., sites=mlb.com,cnn.com.
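
The universal arguments lend themselves to a small helper that enforces the documented limits before a request goes out. In the Python sketch below, the sites parameter and the 10/50 limits come from the description above, while the parameter names count and format, and the function name universal_args, are assumptions made for illustration.

```python
from urllib.parse import urlencode

DEFAULT_COUNT = 10  # default number of results, per the article
MAX_COUNT = 50      # maximum number of results, per the article


def universal_args(count=DEFAULT_COUNT, fmt="xml", sites=None):
    """Return a query string holding the universal arguments.

    `sites` is an optional list of domains; it is sent as one
    comma-separated value, as in sites=mlb.com,cnn.com.
    """
    if not 1 <= count <= MAX_COUNT:
        raise ValueError(f"count must be between 1 and {MAX_COUNT}")
    args = {"count": count, "format": fmt}
    if sites:
        args["sites"] = ",".join(sites)
    # safe="," keeps the commas in the sites list readable in the URL.
    return urlencode(args, safe=",")
```

Rejecting an out-of-range count locally is a design choice; it avoids spending a round trip on a request the service would presumably refuse or truncate.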

A set of API query operators can tailor the search. Quotes produce searches on the exact phrase. A minus (-) operator excludes key words, and a site: operator includes or excludes documents based on their domain. Other arguments work specifically on Web, Image or News searches. For Web searches, you can filter out adult and hate content (filter=[-porn] [-hate]) for content in 14 different languages, including Japanese. And you can use type= to specify what types of documents to return: HTML, text, PDF, etc. XML response fields also vary by the type of search. For example, response fields for a News search include the total number of hits, the summary abstract of the story, its headline, language, date of publication, and URLs for the story and the publication itself. Developers are free to use any and all of these XML field descriptions to create their own results layout.
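
The query operators can be combined into a single query string before it is placed in the request URL. The Python sketch below is a hypothetical helper, build_query, that covers the exact-phrase quotes, the minus operator, and the site: operator. Writing a domain exclusion as -site: is an assumption; the article says only that site: can include or exclude documents by domain.

```python
def build_query(phrase=None, words=(), exclude=(), site=None, exclude_site=None):
    """Assemble a query string from the operators described above.

    phrase       -> wrapped in quotes for an exact-phrase match
    words        -> ordinary key words
    exclude      -> key words prefixed with a minus sign
    site         -> restricts results to one domain with site:
    exclude_site -> assumed exclusion form, written as -site:
    """
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')
    parts.extend(words)
    parts.extend(f"-{word}" for word in exclude)
    if site:
        parts.append(f"site:{site}")
    if exclude_site:
        parts.append(f"-site:{exclude_site}")
    return " ".join(parts)
```

The resulting string still has to be URL-encoded before it is sent, since it contains spaces and quotation marks.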

Yahoo! provides the following XML example of a news search, with abstract, article URL, title, language, date, time, source, and publication URL all shown. The total number of hits is 8,775 (totalhits="8775"), shown 10 at a time (count="10"), starting from the beginning of the results (start="0").

<ysearchresponse responsecode="200">
  <resultset_news count="10" start="0" totalhits="8775" deephits="8775">
      <abstract>June 16 (Bloomberg) -- Adidas AG , the world's second - largest sporting-goods maker, will ``clearly exceed'' its
        full-year sales target for soccer-related goods and gain share in all major markets, Chief Executive Officer Herbert Hainer</abstract>
      <title>Adidas Will `Clearly Exceed' Soccer Sales Target, Hainer Says</title>
      <language>en english</language>