Software Designers: The People Behind the Code (English)

#1 Steve Lord, Director of the U.S. National Weather Service’s Environmental Modeling Center

As a teenager racing sailboats on Long Island Sound near his Connecticut home, Steve Lord paid close attention to weather forecasts. A few decades later, as director of the U.S. National Weather Service’s Environmental Modeling Center, he is helping make those forecasts better. Lord and his 145-person team are responsible for the care and tweaking of the kinds of numerical models that have become the basis for weather prediction around the world.

“This is a great job,” he says. “Every day, I know I’m doing something that can improve things for society. If you screw up a forecast, there’s a lot that can go wrong. And when you get it right, as we usually do, there are many rewards.”

Weather prediction is one of the fundamental problems that spawned digital computing in the first place. The challenge also spurred MIT meteorologist Edward Lorenz to invent chaos theory after his initial attempts at weather modeling ran into what he would call the “butterfly effect.” As the professor put it in the title of a paper: does the flap of a butterfly’s wings in Brazil set off a tornado in Texas? Building on Lorenz’s work and aided by huge gains in hardware processing power, Lord and his colleagues have gotten environmental modeling down to a science. Forecasters now use the model runs to predict the weather with near certainty three days out, and with reasonable certainty up to a week ahead. Lord calls weather modeling “one of the miracles of science and engineering.”

Lord majored in physics at Yale and earned his PhD in atmospheric sciences from the University of California, Los Angeles. His work on a model incorporating a new theory about how clouds affect the environment would set the course for the rest of his career. He researched hurricane behavior at the Atlantic Oceanographic and Meteorological Laboratory in Florida before moving to the forerunner of the Environmental Modeling Center in 1989, becoming permanent director in 2000. Lord is a lifelong Fortran programmer who first learned the language as an undergraduate for a senior project. “That was one of the best decisions of my life,” he recalls. “I didn’t know back then where it would take me, but I got into graduate school based on the fact that I could program computers.”

Steve Lord spoke to me by phone from the Environmental Modeling Center in Washington, D.C.

This is over-simplifying, but is your job essentially about improving algorithms?

Part of it is. We have algorithms implemented within computer codes that we improve year over year. We are also adapting the codes to new satellites and other types of observations. And we add new capabilities. A few years ago, we weren’t doing any ocean modeling. Now, we are running a model every day covering the north and south Atlantic oceans.

This work is mid-ground between engineering and science. We read the scientific papers, we develop algorithms, and we write computer codes, which we test to make sure they are reliable and deliver results on time. And time is a big constraint: in an operational center, you’ve got to finish your job and make room for the next one. For example, we have an hour and 15 minutes to run our global model, and if we take an hour and 20 minutes, then the next job is late and the job after that is even later. So everything has got to run in a time box, and you’ve got to weigh the percentage of the computer’s computational power versus the amount of time available.
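The knock-on effect Lord describes can be sketched in a few lines of Python. This is only an illustration, not the Center's scheduler: the 75-minute slot and 80-minute run come from his example, while the two downstream jobs, their slot lengths, and the `schedule` function are hypothetical. Each job is planned to start when the previous slot ends, but cannot actually start until the previous job has finished.

```python
def schedule(jobs, start=0):
    """Return (name, start_delay, minutes_past_deadline) for each job.

    jobs is a list of (name, slot_minutes, actual_minutes) tuples that run
    back to back on the same machine.
    """
    planned_start = start  # when the job's time box opens
    free_at = start        # when the machine actually becomes free
    report = []
    for name, slot, actual in jobs:
        actual_start = max(planned_start, free_at)
        finish = actual_start + actual
        deadline = planned_start + slot
        report.append((name, actual_start - planned_start, max(0, finish - deadline)))
        planned_start = deadline
        free_at = finish
    return report


jobs = [
    ("global model", 75, 80),  # figures from the interview: 1h15 slot, 1h20 run
    ("next job", 30, 30),      # hypothetical downstream job that runs on budget
    ("job after", 30, 32),     # hypothetical job that also overruns slightly
]
for name, start_delay, late_by in schedule(jobs):
    print(f"{name}: starts {start_delay} min late, finishes {late_by} min late")
```

In this sketch a job that runs exactly on budget still finishes late because it inherited the delay, and any further overrun is added on top of it.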

What hardware do you use?

We run code on an IBM Power 5 system with 2,400 processors and on a newer computer, an IBM Power 6 system with 5,000 processors. We have a better than 99.5% delivery rate, so our downtime is almost zero. Our codes have to work 24/7/365, run after run.

You are running Fortran even though much of the computing world seems to have gone with C and C++.

Parallel computing is absolutely essential for what we do. The code must be scalable--and the scalability of Fortran is still good. The scalability of C compilers is not so good yet, even though it is coming along.

Can Fortran code scale from 4,000 to 30,000 CPUs?

You do have to rethink the structure of the code. Today, our scalability is between 30 and 800 or 900 processors, but we haven’t had any practice going to 10,000 and above. This also depends on the application. A model tends to be more scalable because the code is very structured and you are doing the same thing at every grid point. But a data assimilation and analysis program has a different number of observations, with different communications structures, each time it is run.
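One common way to reason about why a code stops scaling is Amdahl's law, which Lord does not mention by name; it is offered here only as a rough illustration. The sketch below assumes a hypothetical program in which 99.9 percent of the work can be spread across processors. That fraction is invented for the example and is not a measurement of any weather code.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speedup when only part of the work can be run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical parallel fraction of 0.999, chosen only for illustration.
for processors in (30, 900, 10_000, 30_000):
    speedup = amdahl_speedup(0.999, processors)
    print(f"{processors:>6} processors: about {speedup:.0f}x faster")
```

Under that assumption the speedup can never exceed 1,000 no matter how many processors are added, which is one reason the code structure has to be rethought before moving to much larger machines.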

How hard is it to find programmers who can do this?

It turns out it’s rare that anyone can do it just walking in. We’ve found it often takes a year and a half for a person coming in from graduate school to become a viable, productive, self-supporting member of the group. Until that time has passed they are very reliant on their co-workers for knowing what to do.

How come?

In a lot of environmental science education programs, particularly in United States universities, you learn how to manipulate the black box, but not what’s inside. Many people can run a set of code to generate a forecast because they know what inputs are available, and they can work on the fringes of the system to make changes. But in order to deliver better results, you’ve got to go inside the code and change the inner workings. We have a harder time finding those kinds of scientific programmers because that kind of education is harder to find.

What changed?

As a graduate student, I had the opportunity to work on one of the earliest and most distinguished global atmospheric models. My first job back then was to build a code based on a professor’s theory so that he could demonstrate whether that theory was accurate enough to simulate global climate. When the code didn’t work, I was the first person called. I had to go fix the code, even if it was Saturday morning at 1:00 am. So I made sure that I could produce code that was bulletproof. That was pretty good training. It taught me that you have to be your own worst enemy, meaning you must invent problems for your code to encounter, and then figure out robust solutions. Unless you do that, you are going to get lots of bad phone calls. For me, that was very valuable training for being in an operational center.

So now that models are at such an advanced state, nobody gets the hands-on training?

That’s partly it. And part is just the sheer complexity of what we are trying to do. As the models get better, it’s harder and harder to make them even better still. We’re at a point now where our accuracy three days out is up to 92 to 94 percent. Given the uncertainties that we face, it’s difficult to gain one more percentage point.

The real challenge is longer-term forecasts?

Yes, the opportunity for progress is out there on the fringe of the predictability. As a benchmark, we are always looking at the point in time where our accuracy reaches 70 percent. In 1989, that number was around 5.8 days. Today, 18 years later, we are at a little over 7.5 days. Our 7.5 day forecast today is as good as our 5.8 day forecast back in 1989. The biggest factor is increased compute power: the more you have, the better your models. So we are governed by Moore’s Law. The number and speed of the CPUs, as well as the speed of the connections between them, are the limiting factor in our ability to produce more skillful systems.
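The benchmark figures Lord quotes imply a rate of progress that is easy to work out. The short calculation below uses only the numbers from his answer (5.8 days in 1989, 7.5 days 18 years later); the function name is made up for the example, and treating the gain as linear over time is a simplifying assumption.

```python
def days_gained_per_decade(lead_then, lead_now, years_elapsed):
    """Average growth in forecast lead time, expressed per ten years."""
    return (lead_now - lead_then) / years_elapsed * 10


gain = days_gained_per_decade(5.8, 7.5, 18)
print(f"about {gain:.2f} days of lead time gained per decade")
```

On those figures, the point at which accuracy falls to 70 percent has moved out by a little under one day per decade.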

So your software engineers can only be so clever in optimizing code.

There’s only so much blood you can get out of a stone--and in the end, this is all about computing speed. We must have high reliability. We must generate our models on time. We are on a relatively fixed operating budget. With those constraints in mind, imagine you now want to double the resolution--moving from, say, a 60 to a 30 kilometer grid. That requires almost 10 times the computing power. So we are really dependent on hardware advances.
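A rough scaling argument shows where most of that factor comes from. The sketch below assumes that halving the grid spacing doubles the number of points in each of the two horizontal directions and also halves the time step, as numerical stability typically requires. The interview does not give this breakdown, and the function and its parameters are hypothetical.

```python
def relative_cost(old_spacing_km, new_spacing_km, horizontal_dims=2, refine_time_step=True):
    """Estimate how much more computing a finer grid needs than a coarser one."""
    refinement = old_spacing_km / new_spacing_km
    cost = refinement ** horizontal_dims  # more grid points in each horizontal direction
    if refine_time_step:
        cost *= refinement                # assumption: time step shrinks with the spacing
    return cost


print(relative_cost(60, 30))  # 8.0
```

That accounts for a factor of 8 of the “almost 10 times” Lord cites; the interview does not say what makes up the remainder.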

When you entered the field, were numerical models as important as they are now--versus traditional methods of forecasting?

By the time I started, the necessity for running models on computers was pretty much determined. But 20 years earlier, people said that computer models would never do as well as a forecaster. Well, now we do very well indeed, and the role of the forecaster has changed. Forecasters add value on top of our models by having the experience of knowing under what circumstances different models do the best.

As an institution, the Center has also taken that approach by using a technique called ensemble forecasting. We do multiple runs, each with slightly different initial conditions and, perhaps, different models. The approach makes sense given the chaotic dynamics of the atmosphere and because we are uncertain about a lot of the aspects of the prediction. We’re uncertain about the models because they are not 100 percent correct. We’re uncertain about defining the initial conditions because we can’t always observe them. And the atmosphere itself is chaotic, meaning that two solutions that start off almost the same will tend to grow apart.
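The idea of an ensemble, and the way nearly identical starting points drift apart, can be illustrated with a toy system. The sketch below uses Lorenz's three-variable 1963 equations with their standard textbook parameters as a stand-in for a weather model; it is not the Center's code, and the ensemble size, perturbation size, and function names are assumptions made for the example.

```python
def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz (1963) equations."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance one time step with the classical fourth-order Runge-Kutta method."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def ensemble_spread(members=5, perturbation=1e-8, steps=4000, dt=0.01):
    """Run several members from slightly different starts; track the spread in x."""
    ensemble = [(1.0 + i * perturbation, 1.0, 1.0) for i in range(members)]
    spread = []
    for _ in range(steps + 1):
        xs = [member[0] for member in ensemble]
        spread.append(max(xs) - min(xs))
        ensemble = [rk4_step(member, dt) for member in ensemble]
    return spread


spread = ensemble_spread()
print(f"initial spread: {spread[0]:.1e}, largest spread: {max(spread):.1f}")
```

The members start out indistinguishable for practical purposes and end up spread across the range of possible states. In an operational ensemble, how quickly and how widely the members diverge gives a measure of how much confidence a forecast deserves.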

Do researchers in the U.S., Europe and Asia look at each other’s code?

One of the very interesting facts about this business is that while we all have about the same skill in terms of forecast accuracy, almost none of the code is portable to each other’s forecast system without a lot of work. We have all spent years learning from the same research, but with independent ways of implementing that knowledge in actual code. In our case, we have more than 40 years of development history and millions of lines of code. So it’s not like we can very easily do a forklift upgrade of our code from another weather center.

I’m still surprised when a prediction of a California storm seven days out proves correct. I don’t think you guys get enough credit.

We’ve come such a long, long way since the days when I sailed on Long Island Sound in the mid-60s. Today, I listen to the forecasts broadcast on NOAA [National Oceanic and Atmospheric Administration] radio, and they are right on, day after day. You go out more than four or five days, and you still get scenarios of what might happen with the weather, and a lot of those scenarios come to pass. Considering the importance of the weather in so many things, it’s a remarkable contribution.