People-centric Computing: MIT's Project Oxygen

Amer.
Hal, despite your enormous intellect, are you ever frustrated by your dependence on people to carry out actions?

Hal.
Not in the slightest bit. I enjoy working with people. I have a stimulating relationship with Dr Poole and Dr Bowman. My mission responsibilities range over the entire operation of the ship, so I am constantly occupied. I am putting myself to the fullest possible use, which is all, I think, that any conscious entity can ever hope to do.

Amer.
Dr Poole, what's it like living for the better part of a year in such close proximity with Hal?

Poole.
Well, it's pretty close to what you said about him earlier, he is just like a sixth member of the crew - [you] very quickly get adjusted to the idea that he talks, and you think of him - uh - really just as another person.

-2001: A Space Odyssey

Hal-the smart computer with the famously bad judgment-has become a 21st century version of the Turing test, suggesting a new benchmark for intelligent computing. Alan Turing defined a thinking machine as one that could fool an interrogator into thinking it is human. Hal's implicit definition is tougher: the computer should be indistinguishable from a colleague. It should be a "sixth member of the crew," exhibiting human-like reasoning power with the ability to interact with people conversationally. A thinking machine should try to be one of us-replicating the colleague you converse with while walking down the hall.

With the year 2001 come and gone, just how close are we to passing the "Hal test"? A research project at the Massachusetts Institute of Technology is trying to find out. Its goal is to develop what it calls "pervasive, human-centered computing." MIT's Project Oxygen was conceived by Michael Dertouzos, the late director of MIT's Laboratory for Computer Science (LCS), LCS associate director Anant Agarwal, and Rodney Brooks, director of the MIT Artificial Intelligence Laboratory. LCS director Victor Zue has since joined Brooks and Agarwal as a project leader, and each-like the proverbial blind men describing the elephant-has a slightly different take on Project Oxygen's purpose.

"Michael [Dertouzos] felt there has been too much interacting with Windows," recalls Brooks. "His slogan was: 'Let's enable people to do more by doing less.'" Agarwal wants to harness computing and communications technology that is growing more abundant and less costly by the year. Zue is interested in speech recognition, arguing that researchers should pay more attention to semantics and intent than to syntax and form. "My particular take is that we have been suckered by the computer revolution into being drawn into the computer's world," says Brooks. "So we spend our day in front of a virtual desktop, manipulating virtual paper files, and the computer doesn't even care if we're there looking at the screen. We've conformed to it, rather than the other way around. So I say-we should bring the computer out into our world. It should be serving us."

These variations seem to boil down to a single theme: as much as possible, computers should work with humans, not the other way around-a goal pursued in earnest at least since Xerox Palo Alto Research Center's creation of the graphical user interface. The very idea of an icon is to provide users with a visual analogy, a way of relating the arcana of technology to the ordinary world. The interface suggested by Project Oxygen is the next obvious step-emphasizing the person-to-person elements of speech and physical gestures, de-emphasizing the keyboard and mouse. Not that either of these ubiquitous devices will disappear. With a wink, Brooks will argue that keyboards could go the way of paper tape, a technology also once considered indispensable to the human-machine interface. While Brooks candidly acknowledges that typing is a fast, efficient way to communicate with a machine, he thinks there are better ways to interact with a machine, if only the machine were sufficiently smart.

Project Oxygen includes some 30 MIT faculty members and six corporate partners: NTT, Hewlett-Packard Company, Nokia Research Center, Philips Research, Acer Group, and Delta Electronics. The project has gotten a big financial boost from the Defense Advanced Research Projects Agency (DARPA), the instigator of the Internet. The term "Oxygen" derives from the project's goal of making computational and communications resources as ubiquitous as air and as easy as breathing. In an "Oxygenated" world, hardware installation involves not assembling workstations and cables, but climbing ladders to mount radio frequency "beacons" and passive "listeners" on the ceiling, turning a room or even an entire building into a massive computer interface. People are tracked by location, and listened to with microphone arrays that extract the speaker's commands from ambient room noise. Computer vision technology can determine whether they are looking at a screen or pointing. Because the technology is passive (compared to a keyboard), it is inherently intrusive, but potentially convenient.

A measure of how far Project Oxygen has come can be seen from several MPEG demonstrations on its website (http://oxygen.lcs.mit.edu). (The demos are actually a bit dated, as the newest, at this writing, have not yet been released to the public.) One shows a student with a small circuit board around his neck in the manner of a corporate badge. Behind him sit a desk lamp and a computer, which plays the opening notes to the John Lennon song Imagine. As the man leaves his office, the light shuts off and the music stops. As he enters a public lounge, the song "follows" him, playing where it left off on a nearby laptop.

In another demo, a student tells a computer to display a world map, display Iraq, and "tell me about this place." The computer responds in writing. The conversation continues:

"Query start"

"What's your question?" asks the computer in a Hal-like voice.

"What is the population of Poland?"

"I'm asking the Infolab START system 'what is the population of Poland.'" The answer comes up on the screen.

"What is the infant mortality rate in Iran?"

The computer responds similarly, and again, the answer appears on the screen. (You can exercise the natural language component of the demo at www.ai.mit.edu/projects/infolab/ailab).

Intelligent spaces, smart hand-helds

The basic construct of Project Oxygen is an "intelligent space"-a room, or perhaps an entire building, wired to receive verbal and visual input, process that information, and act accordingly. Fully fledged intelligent spaces would take on human-like qualities: sensing the outside world, processing that information, and even altering the physical environment based on the conclusions drawn. The "eyes" and "ears" of such a space include wall- and ceiling-mounted "beacons" that track motion, a bank of microphones capable of homing in on a conversation, a video camera, and sensors and actuators embedded, say, in the coffee maker or flower pot. The beacons are components of the Cricket location support system, which employs both ultrasound and RF signals to determine a user's location. Because the two signals travel at different speeds, a listening device built into the hardware can use the difference in their arrival times to calculate its distance from each beacon, and from that its location. Oxygen also incorporates handheld devices, called H21s, as well as sketching and design tools. A prototype H21 is built atop a Compaq iPAQ palmtop running Linux, with the addition of a CMOS camera, an accelerometer, a field-programmable gate array, and an audio codec. Future H21s will incorporate a new MIT-designed CPU architecture, called RAW (Raw Architecture Workstation), a low-powered chip featuring programmable data paths.
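The timing arithmetic behind this kind of RF-plus-ultrasound ranging can be sketched in a few lines of Python. This is an illustration only, not Cricket's actual code: the function names, the beacon names, and the assumed speed of sound are all hypothetical choices made for the example.

```python
# Illustrative sketch only: names and constants are assumptions, not Cricket's API.
SPEED_OF_SOUND_M_S = 343.0          # assumed: dry air at roughly 20 C
SPEED_OF_LIGHT_M_S = 299_792_458.0  # RF travels at the speed of light


def distance_from_arrival_gap(gap_seconds):
    """Distance to a beacon, given how long its ultrasound pulse lags its RF signal."""
    if gap_seconds < 0:
        raise ValueError("ultrasound cannot arrive before the RF signal")
    # gap = d / v_sound - d / c, so d = gap / (1 / v_sound - 1 / c)
    return gap_seconds / (1.0 / SPEED_OF_SOUND_M_S - 1.0 / SPEED_OF_LIGHT_M_S)


def nearest_beacon(gaps_by_beacon):
    """Name of the beacon whose measured arrival gap implies the shortest distance."""
    return min(gaps_by_beacon, key=lambda name: distance_from_arrival_gap(gaps_by_beacon[name]))
```

A gap of 10 milliseconds corresponds to a little over 3.4 meters. Because RF arrives almost instantly over room-scale distances, the result is dominated by the ultrasound's travel time.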

To control this equipment, and to process and exchange the sensed information, Oxygen relies on what it calls, rather fuzzily, an "environmental space," or "E21." (The "21" suffix refers to the 21st century, but some students think the reference dated and are wearing T-shirts that say "E22.") In a sense, E21s provide the unembedded intelligence behind the embedded systems. Applications include tracking people, identifying faces, and filtering a command request from other conversation. "In some sense my office is an E21 because I haven't touched the light switch in here for over six months," says Brooks. "I walk into the room, say 'computer,' and it says 'beep.' I say 'turn on the lights, open the drapes,' and they do. If I have a visitor, I say 'begin the presentation' and the slide projector turns on, powering up PowerPoint on my machine."

Oxygen's conception of a network (called an "N21") prizes flexibility, with variable data rates and error correction. Oxygen developers refer to a "collaborative region," a self-organizing set of computers and other devices that work together and share some degree of trust. Hence the sensors within a meeting room would operate on the same set of assumptions regarding confidentiality.

Interpreting human gestures and speech is a prime Project Oxygen goal. "Our ability to recognize where someone is looking is working pretty well," says Brooks. "We can tell where someone is pointing at a screen with a fairly good angular resolution. We also have the components to recognize whether someone is sitting or standing or walking around, or recognize a person from their gait-although we don't have them all integrated into the Oxygen system. Computer vision for those sorts of tasks has made remarkable strides in the last three years."

The H21 handheld includes built-in face recognition. Pick one up and it "sees" your face, recognizes who you are, and adapts to your preferences-thereby achieving a level of accommodation higher than many marriages. Brooks says that the human face is just the sort of vision problem that can be solved with current technology. "Basically all humans are the same: we have a head on top of a body, we have two eyes-we have a structure," he says. "By contrast, we can't do general object recognition and we're not trying to do that in Oxygen. We're not trying to recognize the difference between a glass of water and a cellphone and a screwdriver." Brooks thinks the combination of face and voice recognition (i.e. the identification of a person through his or her voice) provides a highly reliable means of identification, with no passcode required.

Speech recognition encounters the same sort of hurdles as machine vision: it needs a context for the conversation it is having. If the system "knows" you are talking about the weather, it can react fairly intelligently. But the technology has not advanced to the state where it can hold an open conversation, in which no topic is preordained and the human, rather than the computer, drives it. "With many conventional speech-understanding systems, you are essentially navigating a menu, which is really annoying," says Brooks. "In our demo, people take the initiative over what aspect of weather they are talking about and when. They can change the subject, which is very important to make systems usable. But we're still within limited domains."

Oxygen has succeeded in creating applications that are intelligent within a context-like a discussion of the weather. Moreover, voice recognition systems from MIT's Spoken Language Systems Group are speaker-independent, and can be language-independent as well. The system can identify a new language and switch to it, while still tracking the conversation. In one online demonstration, an English and a Japanese speaker alternate in conversation with the computer, which responds in kind.

Once the system understands what you are saying, it needs to reply. Developers envision a universal knowledge base that places no restrictions on name or data type, with support for both relational and hypertext relationships-with the ability to represent knowledge from one person in the format preferred by another. Haystack, a new take on an Outlook-like personal information manager, is one early attempt. It employs Semantic Web RDF (a post-XML move toward making Web-based data more machine-readable, using the Resource Description Framework). For end-user queries, researchers are using START, a natural language query system. (For a demonstration, see http://www.ai.mit.edu/projects/infolab/ailab).

As Haystack demonstrates, the gap between Project Oxygen's vision and the reality of ongoing research is wide. The project's website describes a computer system with the intelligence of a smart assistant, able to prioritize your e-mail and book an airline flight on your behalf. But flesh and blood assistants need not fear for their jobs. Computers have become better listeners, viewers and language speakers. They can open drapes and turn on the lights, recognize faces and track movement. But these are nanosteps on a long road. A computer is now the world's best chess master, but we are a long way from a computer with an intellect so inflated it is frustrated by its human colleagues. Project Oxygen offers the possibility for a more human-like interface, not, as yet, a superior intellect. But for anyone who has gone cross-eyed chained to a screen and keyboard, a superior interface would be quite a leap.

A conversation with Rodney Brooks, director of the MIT Artificial Intelligence Laboratory and the Fujitsu Professor of Computer Science.

Joining MIT in 1984, Rodney Brooks is best known for his research in robots and what they can tell us about human intelligence. He is author of Flesh and Machines: How Robots Will Change Us (2002) and Cambrian Intelligence: The Early History of the New AI (1999), as well as a 1985 book on programming in Common Lisp. Brooks was one of four men in widely diverse occupations featured in the Errol Morris documentary "Fast, Cheap and Out of Control," named for one of Brooks' scientific papers (www.sonypictures.com/classics/fastcheap/). Brooks is also Chairman and Chief Technical Officer of iRobot Corporation and holds degrees from Flinders University of South Australia and Stanford University.

One of Project Oxygen's main goals is to re-invent the computer interface. How are you going about it?

We have several ways of achieving it. For example, the system understands where people are in the room. It uses that information to "steer" microphone arrays that are mounted on the ceiling to listen in on that portion of the room. "Steering" means delaying the signals from different microphones by differing amounts of time, then combining those signals.
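The steering Brooks describes is commonly known as delay-and-sum beamforming. A minimal Python sketch follows, assuming whole-sample delays, a known target position, and a nominal speed of sound. The function names and the geometry are hypothetical and are not taken from the Oxygen code.

```python
import math

# Illustrative sketch only: a textbook delay-and-sum beamformer, not Oxygen's code.
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at roughly 20 C


def steering_delays(mic_positions, target, sample_rate_hz):
    """Whole-sample delay per microphone so that sound from `target` lines up."""
    distances = [math.dist(mic, target) for mic in mic_positions]
    farthest = max(distances)
    # Sound reaches nearer microphones sooner, so their signals are held back longer.
    return [round((farthest - d) / SPEED_OF_SOUND_M_S * sample_rate_hz) for d in distances]


def delay_and_sum(signals, delays):
    """Shift each signal by its delay (in samples), then average across microphones."""
    length = min(len(signal) for signal in signals)
    output = []
    for n in range(length):
        total = 0.0
        for signal, delay in zip(signals, delays):
            index = n - delay
            if 0 <= index < length:
                total += signal[index]
        output.append(total / len(signals))
    return output
```

When the delays match the true source position, the speaker's voice adds up coherently while sound from other directions stays misaligned and is attenuated by the averaging.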

Now, where should you steer to?

One way of determining that is by searching the acoustic space to find out where the sound is coming from. But you can do a lot better if you've got visual information locating the person who is talking. You can do an even better job if you watch their lips move. You've probably had the experience at a noisy party where you're talking to someone, but as soon as you look away, you can't understand what they're saying. Vision helps in an unencumbered way to get the right speaker into the speech understanding system. We look at who is sitting down, who's standing up, where they are pointing, whether they are looking at the screen. This information gives context to the speech. We do that all the time in conversation. We understand what people are paying attention to by where they are looking and pointing; it's a very important part of interactive discussion.

We are also experimenting with a sketching interface. A lot of what people do in meetings is work on a white board, with others contributing. So we want the machine to understand what is being sketched and written, by connecting the white board to back-end legacy systems.

Did you see the video of the guy speaking to his handheld? He uses it as a cellphone outside, but when he walks inside, the connection automatically switches to 802.11, turning the conversation into a video conference. The handheld has a little camera, but the resolution isn't particularly high. So he walks into another room where there's a big screen display and high resolution cameras, he puts the handheld down on the table, and the system realizes what's going on and shifts the video conference over to that big screen.

That's very different from our current way of interacting with machines and communications, where the connection is fixed. If we're connected by cellphone, we must reestablish the connection if we switch to a laptop. In Oxygen, the abstraction is the session, not the connection. The session is centered around the person, and continues even though the underlying connections are changing up the wazoo. We're implementing all this on top of TCP/IP. In that sense we're not recreating the Internet, we're building on top of it. But this different abstraction level changes the way you think about human-machine interaction.
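One way to picture "the abstraction is the session, not the connection" is a small Python sketch in which the session object outlives any single transport. The class, its methods, and the transport labels are hypothetical, chosen only to illustrate the idea. They do not describe how Oxygen implements sessions.

```python
# Illustrative sketch only: a hypothetical model, not Oxygen's session layer.
class Session:
    """A conversation that belongs to a person rather than to one network link."""

    def __init__(self, owner):
        self.owner = owner
        self.transport = None   # whichever link currently carries the session
        self.transports = []    # every link the session has used, in order
        self.delivered = []     # (transport, message) pairs, in order sent

    def attach(self, transport):
        """Move the session onto a new link without ending the conversation."""
        self.transport = transport
        self.transports.append(transport)

    def send(self, message):
        if self.transport is None:
            raise RuntimeError("session has no transport attached")
        self.delivered.append((self.transport, message))


session = Session("visitor")
session.attach("cellular")
session.send("hello")
session.attach("802.11")        # walking indoors: same session, different link
session.send("switching to video")
```

The caller keeps talking to the same `Session` object throughout, and only the link underneath it changes.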

Is the same true, for example, in seeking "the nearest printer"?

Right. It's what you need, it's centered around the person, the task, not the details of what they are doing. The abstraction is that someone wants to print something.
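A request phrased as "someone wants to print something" can be resolved against the user's location rather than a named device. The Python sketch below assumes a flat list of devices with known coordinates. The `Device` type, the device names, and the positions are hypothetical.

```python
import math
from dataclasses import dataclass

# Illustrative sketch only: hypothetical types and data, not an Oxygen interface.


@dataclass(frozen=True)
class Device:
    name: str
    service: str      # what the device can do, e.g. "print"
    position: tuple   # (x, y) in meters within the building


def nearest_device(devices, service, user_position):
    """Closest device offering `service`, or None if no device offers it."""
    candidates = [device for device in devices if device.service == service]
    if not candidates:
        return None
    return min(candidates, key=lambda device: math.dist(device.position, user_position))


devices = [
    Device("lobby-printer", "print", (0.0, 0.0)),
    Device("lab-printer", "print", (40.0, 5.0)),
    Device("lounge-display", "display", (38.0, 4.0)),
]
```

The user states the task and the system chooses the device, so the same request gives a different answer in the lobby than in the lab.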

Is the goal of Project Oxygen to create a smart assistant?

That's one way of looking at it, but it's not necessarily the goal. Our goal is broader: creating smarter systems for all kinds of applications. For example, I don't think changing a cellphone to a videoconference is really quite the role of an assistant-it's slightly different.

From a software development standpoint, what are the challenges?

There are many. One challenge is in trying to develop new types of services on top of TCP/IP-services that the infrastructure wasn't envisioned for. Another challenge: integrating the work from many researchers who come from different areas of computer science. How do we make their systems talk to each other in a flexible layer above the traditional operating system? The idea is to allow researchers to plug their pieces in and get them to talk to the other pieces in flexible ways, yet have the system be robust while running on lots and lots of different processors all the time.

From a development standpoint, are you still working within C and conventional languages, or are we seeing tools coming along at the meta-language level?

Certainly at the meta-language level. One of the things we developed under Oxygen is a Web tool for building language systems for a new domain. We built our weather system this way. You go to the web site, click around to specify what needs to be talked about, what actions need to be taken, how the interface has to be structured. This is all done with mouse clicks; you don't write any actual code. The system then compiles up this special purpose speech system. It's a very high level of interaction. Underneath, there's a lot of C and C++ code.

A different example: we have a system called MetaGlue that pastes together the agents that operate as Oxygen. These agents are highly varied. You might have a conventional piece of software you got from somewhere else that you are putting together with other pieces, and you wrap some pieces of Java code around those, and they negotiate over the Internet over who is talking to whom, and where the particular piece of code runs. This is all transparent to you. The process can migrate from machine to machine, and move between different operating systems, and it's robust and transaction-based, so that if a machine crashes, it is able to back up to within a few hundred milliseconds of where it was and restart things. The system is perpetually on and perpetually running.

What's your relationship with your industry partners? Are they contributing equipment? People?

They are certainly contributing funding. They also contribute people, who are in residence at MIT from all the companies all the time. Some come for three months, others for six months. The longest projected visit we have is a three-year visit from one researcher. So there are certainly people from the companies here involved in the research directly, and that's a very good mechanism for transferring results back to the company. Then we have quarterly workshops where more people come from the companies. And we have visits to the companies with maybe 10 faculty showing up en masse, where we can then communicate with maybe a few hundred at a central research lab for two days.

We've got six companies very involved, and NTT is very much involved, with Oxygen and beyond. I'm on two steering committees-one at NTT and one here at MIT-and I chair the MIT side. The NTT/MIT collaboration started on 7/1/98. When Oxygen came along, they joined in with Oxygen also. Projects range from fundamental computer science-algorithmic things-through applications to broadband to computer vision, to delivery of services. NTT has a fantastic set of research labs. We have great respect for them, and have known people there for many years, and have had lots of small collaborations-so it was great to have this big collaboration.