Software Designers: The People Behind the Code (English)

#8 Matthew Levine, Director of Mapping Engineering, Akamai Technologies

From his Cambridge, Massachusetts office at Akamai Technologies, Matthew Levine works in a kind of control tower, routing not aircraft but data: about 20 percent of total Internet traffic, using 48,000 servers in 70 countries. As director of mapping engineering, Levine works with his team to figure out how the Internet is put together and what to do when bottlenecks slow things down. The job isn’t getting any easier. Once a text-only medium, the Web now hosts streaming high-definition video and is vying to become an alternative to both television and DVDs, and that is putting a lot more data into the “pipes.”

At Princeton University, Levine was an early student of Brian Kernighan, the Bell Labs researcher turned professor whose interview appeared on these pages a couple of months ago. In 1995, Levine studied theoretical computing, specializing in the design and analysis of algorithms, at the Massachusetts Institute of Technology (MIT). In 1996, he collaborated with Tom Leighton, then a professor of applied mathematics, and with fellow graduate student Danny Lewin and other researchers on a paper describing a method for data caching, that is, storing frequently requested data closer to the people who request it. Leighton and Lewin became the principal founders of Akamai in 1998. Levine joined the following year. Over his 10-year career, Levine moved from software engineer to architect to his current role as data-router-in-chief.

What appealed to you about algorithms?

I found it interesting that by doing something very simple, your program could be tremendously more efficient. Algorithms are like clever puzzles, and over time, I found that the problem sets were lots of fun, but they were generally about tweaking a known solution. I found it much harder to pursue open questions, the ones that a lot of smart people have already looked at, which is what I needed to do as a graduate student. In the theory community, there’s a bunch of these: open problems that lots of people wish they could solve.

One open problem (the one you worked on with Tom Leighton and Danny Lewin) has become your life’s work.

Early on, the World Wide Web Consortium was grappling with this problem: if a website gets really popular, then the Internet tends to melt. So people began to ask if theoretical computer science could do something about it. We ran with that, trying to formulate this practical problem as a theory problem, and that turned into a conference paper. Tom and Danny wondered if it had commercial possibilities.

Did you?

I wasn’t sure. But I told them that’s not what I was at MIT to do: I was there to work on my PhD. As far as actually creating a company, I told them “good luck.” Looking back, I was right: joining them would have halted my PhD work. It took them a couple of years to really look at the nature of the problem and the business opportunity, and to do the other non-academic work necessary to get the business off the ground.

But I was ultimately wrong about what I wanted to do professionally. After they launched the company and got a few customers, I finally realized that the opportunity to join them was too good to ignore.

As a researcher, you seemed hesitant to enter the commercial world. Did you want to teach at a university?

I was thinking something like research labs: Bell Labs seemed like an attractive place where you could work on open, interesting questions. I wasn’t thinking that it would be fun to be a professional programmer. That turned out to be a mistake. As a student, I didn’t yet see how satisfying it can be to build a system that people use and care about. In an academic environment, your goal is to build a system that is both novel and technically exciting. That’s the only way other researchers will be even remotely interested in it. Whereas in a commercial setting, you can build a system that is only modestly innovative from a pure technical perspective, as long as it is well engineered and useful enough that people want to pay for it.

What I’ve since learned is that building a system that people actually care about in their daily lives can be very satisfying.

Akamai’s mapping system has become that system for you. What is it?

It’s the control system for our network operations. We have servers deployed all over the place, and the idea is to serve you from the “closest” one. That sounds simple, but there’s a huge technical problem: how do you figure out what “close” is on the Internet? What’s the fastest way to get data to a client machine? That answer isn’t always obvious. We have servers in Tokyo. And we have an end user, who we see as an IP address. And we need to figure out: are those servers really the best way to get data to that user? Or would other servers get the data there faster?

I was completely naive when I first encountered this problem. You’d think that, while the Internet is huge, we would know where the routers are and what’s connected to what. But it’s not that simple. The various ISPs don’t like to talk to each other about their precise network architecture. They consider that information proprietary. So when we as a third party ask how the pieces of the Internet are connected, nobody knows. There is some public information available. We might know where an ISP’s peering points are and how we might connect to them. But when it comes to the internal network architecture, the ISPs will pretty much never tell you.

So this takes some detective work. Step one is to figure out what the Internet actually looks like using trace routes and other diagnostics. That will give us a rough sense of what seems to be connected to what. But even when we’ve mapped things out, there’s still the problem that a seemingly “close” connection may be overloaded. We think we have end users in Tokyo and servers in Tokyo, but given network conditions, do we really want to serve those end users from these servers? To answer that, we recheck the connections every two minutes, and if we spot an overload, we redirect the traffic to the next closest place. Ultimately, we approach this at the global level. The goal is to serve everybody from the best place, without overloading anything.
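
The selection rule described above (prefer the closest servers, but move to the next closest when they are overloaded) can be illustrated with a small sketch. This is not Akamai's mapping code: the `Cluster` type, the `pick_cluster` function, the latency and load figures, and the 0.9 load threshold are all assumptions made up for illustration. In the real system a check like this would be repeated on each measurement cycle (every two minutes, per the interview).

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    latency_ms: float  # hypothetical measured latency to the end user's network
    load: float        # hypothetical current load, as a fraction of capacity


def pick_cluster(clusters, max_load=0.9):
    """Return the lowest-latency cluster whose load is below max_load.

    If every cluster is at or above the threshold, fall back to the
    least-loaded one. The threshold value is an illustrative assumption.
    """
    for cluster in sorted(clusters, key=lambda c: c.latency_ms):
        if cluster.load < max_load:
            return cluster
    return min(clusters, key=lambda c: c.load)


# Made-up measurements: Tokyo is closest but nearly saturated.
clusters = [
    Cluster("tokyo", latency_ms=5.0, load=0.97),
    Cluster("osaka", latency_ms=12.0, load=0.40),
    Cluster("seoul", latency_ms=35.0, load=0.20),
]
print(pick_cluster(clusters).name)
```

With these invented numbers, the Tokyo user is steered to the Osaka cluster because Tokyo is over the threshold. Note that this picks a server for one user at a time; the global approach Levine describes has to balance all users at once.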

Tell me about your team.

My team has 29 people, including 22 software engineers and four managers who report to me. The managers know the job first-hand: they all worked as software engineers in mapping. Most of us are in Cambridge, and those who work remotely once worked here, so that they know the other team members face-to-face.

We have quite a few people with computer theory and other academic backgrounds. Over time, we’ve added more people who think of themselves as professional software engineers. So we have some people who are stronger in thinking about the algorithms and systems design. And we have others who know how to write and test good, solid code, which is critical for any production system. These various backgrounds are complementary. The people with a strong algorithmic background learn from the strong programmers how to write good solid code. The strong programmers learn some of the algorithmic considerations that will make their code more efficient. The common language for most of this work is C++.

The demands on the network are only increasing. How is that going to affect traffic flow, and how will it affect Akamai?

It remains to be seen how much things scale up. But, for example, if people start routinely watching HDTV on the Internet, we will see a lot of bottlenecks. The danger here is that the current Internet infrastructure could get saturated. That’s the industry problem. The Akamai problem is that with growth comes more responsibility.

What do you mean?

When we got started, we were just one of several companies pushing bits around the Internet and trying to figure out the congestion points and how to get around them. We weren’t the cause of that congestion; we were just trying to avoid it. But now that we are controlling a big mass of the traffic, we could potentially cause Internet problems by routing traffic badly. For example, if we were to serve too many end users across a peering point, we would saturate that peering point. And it wouldn’t be sufficient just to send that traffic somewhere else. We would need to think of ways to avoid the rerouting problem in the first place.

People don’t think about this because we do a good job. But it is an issue my team is watching carefully, so that we continue to do a good job as the Web starts to see more high-definition media. This could be an opportunity for a more candid conversation between us and the ISPs, so that we get a fuller view of their networks. After all, we both have the same goal: to mitigate problems before they arise.

The problem of data transport seems to vary widely from country to country. In its quarterly “State of the Internet” report, Akamai says that South Korea has maintained the world’s highest connection speed, averaging 11.3 Mbps. At the bottom is Eritrea, which averages just 42 Kbps.

South Korea has great broadband penetration and excellent local connectivity. But it’s not terrifically connected to the rest of the world. They aren’t alone. Australia too has good connectivity within the continent, but it is expensive to get out of Australia because, geographically, they aren’t connected to anything nearby. In both of these markets, it’s important to have server deployments so that international traffic is served locally.

What other challenges are you facing?

One of the interesting problems we’ve encountered is the challenge of applying advanced algorithms to a real system. In the abstract, network traffic is a supply and demand problem. We have a supply of server capacity, and user demand for the data contained on those servers. We can quickly formulate this as a theoretical problem: how do you match server capacity with consumer demand in some optimal way? What’s the best algorithm to do that? But in the real world, it’s not that simple because we don’t know the demand all that precisely. Ideally, we’d want to know, over the next few minutes, which end users will want to fetch which pages. That’s the ideal, but it’s not easy to anticipate. In other words, even the cleverest algorithm is only as good as the data you feed it.
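
The supply-and-demand framing above can be made concrete with a toy allocation. The sketch below is a simple greedy heuristic, not an optimal solver and not the algorithm Akamai uses; the `assign` function, the region and server names, the traffic units, and the cost table are all hypothetical. It also assumes the demand figures are known exactly, which is precisely what Levine says is not true in practice.

```python
def assign(demand, capacity, cost):
    """Greedily match regional demand to server capacity.

    demand:   {region: units of traffic requested}
    capacity: {server: units of traffic it can serve}
    cost:     {(region, server): distance-like cost, lower is closer}
    Returns {(region, server): units assigned}.
    """
    remaining = dict(capacity)
    plan = {}
    # Place the largest demands first so they get first pick of nearby servers.
    for region in sorted(demand, key=demand.get, reverse=True):
        need = demand[region]
        for server in sorted(remaining, key=lambda s: cost[(region, s)]):
            if need == 0:
                break
            take = min(need, remaining[server])
            if take > 0:
                plan[(region, server)] = take
                remaining[server] -= take
                need -= take
        if need > 0:
            raise ValueError(f"not enough capacity for {region}")
    return plan


# Made-up inputs: Tokyo demand exceeds what the Tokyo servers can carry.
demand = {"tokyo_users": 80, "osaka_users": 30}
capacity = {"tokyo": 60, "osaka": 60}
cost = {
    ("tokyo_users", "tokyo"): 1, ("tokyo_users", "osaka"): 3,
    ("osaka_users", "osaka"): 1, ("osaka_users", "tokyo"): 3,
}
plan = assign(demand, capacity, cost)
for (region, server), units in plan.items():
    print(region, "->", server, units)
```

Here the Tokyo servers fill up, so the overflow from Tokyo users spills to Osaka while Osaka users are still served locally. If the real demand differs from the estimate passed in, the plan can overload a server even though the arithmetic was done correctly, which is the point of the remark that an algorithm is only as good as its data.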

There’s another question we’re always dealing with: how complex should the system be? A complex system can be more sophisticated and adapt to more situations, but it can be much harder to modify and manage. In 2005, we bought a competitor, Speedera, which gave us a chance to look at a different approach. They had developed technology for the same problem, network mapping, but their solution was much simpler: almost aggressively so. We could immediately see the tradeoffs. Their system was much easier to understand, but wasn’t able to handle every situation. Our approach is roughly the reverse.

We are constantly going back and forth between these two approaches: simplicity and complexity. Should we try to simplify the system even if we lose some functionality? Or is functionality all that matters?

Do you get emergency calls in the middle of the night?

It can happen, though that has decreased over time. Ten years ago, when we had just three customers and were building the system as fast as we could, problems were pretty common. Now, the system will tolerate a fair number of failures. When something breaks, as equipment inevitably does, we can deal with it the next day. Another difference is that I’m now a father. Before then, I used to come to work whenever I woke up and go home when I felt like it. With little kids, I wake up in the morning on their schedule, and I try to have dinner with them each night.