Tracking Human Migration Through DNA: the Genographic Project

At face value, the Genographic Project seems a perfect marriage of field research, database number crunching, and the quest to know where we as people came from. The study addresses a fundamental question of human existence: how did our collective ancestors migrate from Africa to populate the world? The answer may be found through genetic markers, the strands of the DNA chain that are passed largely unchanged from generation to generation. Collecting and analyzing those markers will be the goal of the project, a five-year endeavor announced last April by the National Geographic Society, IBM, and the Waitt Family Foundation, which was established by Gateway computer founder Ted Waitt.

"Put yourself back 50,000 to 75,000 years ago," says Ajay Royyuru, the Genographic Project's lead scientist for IBM and the head of the Computational Biology Center at IBM's Thomas J. Watson Research Center. "The entire population of humanity existed as a single group of individuals, possibly 2,000, all in Africa. It is that group that created future generations. Some remained in Africa. Others came to other parts of the world. The big questions we are trying to answer are: who were our ancestors, how are we as people related, and what was the human journey? All of recorded history goes back only 10,000 years or less. For 40,000 years before that, we have no information about who those people were or how they migrated. All we have are tantalizing glimpses." Royyuru says that the period from 40,000 to about 100,000 years ago is the darkest period in archeology, because the fossil record is so thin. So it is the genetic, rather than the fossil record, that may hold the most clues to human ancestry and migration.

To raise money for the project and give people a chance to participate, the project's website offers a $100 cheek-swab kit. Submit your DNA, wait four to six weeks, and read your results anonymously online. Samples from males take advantage of the relative stability of the Y chromosome to analyze origins along paternal lines. Female samples use the mitochondrial DNA (mtDNA) to check migratory origins along the maternal line. To ensure privacy, each kit comes with a random, untracked code which becomes the login for viewing results.

The heart of the study and the debate: DNA from indigenous people

Researchers say that time is at a premium. Vital to the project are indigenous people, who have stayed long enough in one place that their genetic markers have remained relatively unchanged over many generations. But the stability of these populations is changing as they migrate to cities, meet people from other corners of the world, and intermarry. Project researchers call the phenomenon a "scrambling of genetic signals" as "distinct peoples, languages, and cultures are quickly vanishing into a 21st century global melting pot."

And so indigenous people will be the key to the project. But there's an obvious catch: indigenous people are much more than just a donor pool. They are not just a walking collection of raw data. They have their own history and culture, which may conflict with the project's findings, as well as their own privacy concerns. Indeed the very DNA they donate can have financial value. And like any study participant anywhere, they need to be fully informed.

"In order to get an undergraduate enrolled in a scientific project, the researchers must explain what they are doing, how they are doing it, and what is the hoped-for outcome," said Jonathan Marks, who teaches at the Department of Sociology and Anthropology at the University of North Carolina and is a critic of the project. It is tougher to provide that explanation to people from a different culture, who may not "share our ideas of blood, heredity and DNA. People may not realize that they are giving an immortal bit of themselves, DNA, to geneticists to do with as they like--forever. Those purposes may include things you haven't agreed to."

Another ethical problem, says Marks, is that tracing human migration through DNA may contradict legend. "We are asking people to participate in a study that will undermine who they are and where they came from. What person in their right mind would participate in a study that would undermine their self-identity? The answer is: someone who really doesn't understand what it is you want the DNA for." And then there is the question of financial interests: if you donate your DNA and someone makes a profit, do you get a financial reward, as well? American court rulings to date have sided with the research institutes, not the donor. On its website, The Genographic Project says that no patents will be filed; that all research is in the public domain.

Marks expressed concern about the lack of a bioethicist on the project's advisory panel. He says that a bioethicist's job "is to imagine the future, figure out what the boundaries are, and patrol them." Since our conversation, Dr. Simon Longstaff, Executive Director of the St. James Ethics Center in Queensland, Australia, has joined the project's advisory board. Also serving on the board is Tammy Williams, an indigenous rights activist in Cape York, Queensland.

Brian Schwegler, associate director of ethics, education, policy, and compliance at the University of Chicago, says that the clash between culture and science goes back at least to the 19th century, when German philologists used Sanskrit in an attempt to track humanity's origins. Not that legend and science always disagree. "As a number of studies have shown, the two tend to be relatively close," he says. "There are stories among the Northwest indigenous groups of the black times when their civilization was almost wiped out. Recent studies have shown that this seems to link up very closely to a massive tsunami that hit the region. In those cases, the geological record and the cultural record seem to line up."

A bigger problem, Schwegler says, has been control over the data. He cites an Arizona State University study that collected blood samples from the Havasupai tribe. "The goal of the study as described was to track the prevalence of diabetes within the cultural group." Researchers were trying to determine whether the disease was a genetic condition, developed through lifestyle, or a mix of the two. "However, due to a lack of control over the data samples, this blood was sent all over the country and was used in a number of studies, tracking other types of issues, among them schizophrenia, in-breeding, migration." Neither the blood donors nor the tribe had agreed to these uses. "So the initial study was fine, but there was a lack of control over the samples." He says that, in general, any study that views collected samples as a general data bank "is a potentially ethical and legal problem. That's a lot of what underlies this fear. For indigenous groups, donating genetic material may mean losing control over their own cultural and genetic history."

I asked Schwegler if he thought The Genographic Project had avoided some of these problems. He thought they had gone at least part of the way. "They indicated the data would be destroyed at the completion of the study, at least for the stuff they are doing with the western populations who buy the [cheek swab] kit." On the other hand, "the question of when the study will be concluded is up in the air because, in terms of retesting, this could drag out for years or decades."

In an email response, project spokeswoman Lucie McNeil said that samples taken from indigenous groups will be maintained in the country, "with the strictest security at the regional research centers performing the field sampling for the project. The samples will remain at those facilities after the project has completed in five years, and a participant may choose to have his or her sample removed or destroyed at any time and for any reason." Samples donated by public participants will be stored at the University of Arizona for the duration of the project, which is five years. "At the conclusion of the project, or at any point beforehand at the request of the participant, the genetic material will be destroyed," she wrote.

Schwegler says that the project appears to be following standard procedures in planning to meet with community leaders ahead of taking samples "to make sure that the local groups understand the project, and that both the individuals and groups agree to participate. Ideally, there are no pressures, so that the research design becomes fine-tuned to the cultural practices of the group." In her email response, McNeil wrote categorically: "Without the proper approvals, indigenous groups will not be sampled in the field."

Schwegler says that one question still unaddressed is who controls the data: the donors or the researchers. "There are a number of indigenous groups, both within and outside of the United States, that are requiring ownership of the data, approval of any findings before they are published, or both. That, to me, doesn't seem to be addressed."

McNeil says that while no medical research, patenting, or licensing will be permitted with any sample collected for the Genographic Project, "a major goal of the Genographic Project is to establish a 'Virtual Biobank' of diverse samples from around the world to be used for strictly anthropological inquiry. These samples will be safeguarded by several layers of security, including biometric encryption at the research centers, lock-and-key access, and several software and database security measures to protect the data. While the samples will not be distributed openly, it is the goal that they be made available on a case-by-case basis in the future to other research projects that will analyze the samples for strictly anthropological purposes."

Despite the divergent goals of researchers and their subjects, Schwegler does not think that the conflict is intractable. "These two traditions and the representatives of both need to work together to develop a collaborative research model. And you need to demonstrate why each is important to the other." The problem, he says, is particularly acute on the research side, where a study may have been designed entirely by academics. By the time the details are worked out, the only remaining question is: "How do we get these people to agree to it?"

And so for the indigenous people at the project's core, an act of faith is required: that the participation at the very least follows the Hippocratic oath, "First, do no harm." The project is too new to assess how negotiations with community leaders are going. But in the long run, the way the Genographic Project goes about its work could be as important as the genetic atlas it hopes to produce. If time is short because the genetic signals are being scrambled, then the geneticists and bioethicists have little leeway in dealing with prospective DNA donors: this time, the researchers must get it right.

Sidebar: A Conversation with IBM's Ajay Royyuru

Ajay Royyuru heads the Computational Biology Center at IBM's Thomas J. Watson Research Center and is IBM's lead scientist for the Genographic Project. He holds degrees in human biology and biophysics from institutions in New Delhi and Mumbai, with post-doctoral study in structural biology in New York. He joined IBM Research in 1998. Royyuru says that computational biology is about "applying information technology to understanding biology as an information system. It's the most complex information system that we've ever seen."

Where did the idea for the Genographic Project come from?: The idea started with Dr. Spencer Wells, who is an explorer-in-residence with National Geographic. He's a population geneticist and has been looking at markers on human DNA and how those markers could be used to trace ancestry, and therefore understand human migration. It's a broad theme that many population geneticists have been studying for the last two decades or so. Spencer was looking at the big picture of how this kind of research could be scaled up and applied to the entire human population, following his previous work, the documentary "Journey of Man" [a 2003 documentary in which Wells argued that all humans alive today are descended from a single man who lived in Africa around 60,000 years ago]. That is when Spencer and National Geographic started getting excited about the fact that something like this could be done on an entire population. They realized they really needed a technology partner to be with them all the way from idea to design, implementation, execution of the research, and communication of the results. Fortunately, they came to us and never went anywhere else.
What has IBM contributed?: Three things: technology, research, and communications. This is a global project. We have set up ten regional centers in each geography of the world, and our goal is to obtain DNA samples from about 100,000 indigenous people and an equal or larger number of public participants. The technology contribution includes gathering the data, organizing it at the regional center, making it flow into the central database, and then turning it over to the research folks to do the analysis. Field work will be done with ThinkPad 42s equipped with fingerprint readers, which will limit data access to the authorized researcher. So the data is secure all the way from its inception in the field to the regional center, where we have some infrastructure to house the data temporarily. We'll use a whole bunch of IBM software to flow the data over whatever kind of wire is available, whether high bandwidth or low, into the central repository that we have built at National Geographic headquarters in Washington, DC.
What kind of systems are used?: The data itself at National Geographic is housed on two blade servers running Linux and DB2. We are still trying to figure out the compute requirements for the research part of the project, and we will bring in whatever is necessary.
Why use blade technology?: Blade is nice because it is scalable. You can do multiple things in the blade architectures simultaneously: we can do many independent tasks very efficiently, with the ability to scale back and forth between tasks.
What are the challenges of researching DNA, in terms of storage, analysis, etc.?: The storage requirements are relatively modest: they won't tax the technology here. The markers on the genome that we are gathering take just a few megabytes per individual. We combine that with data from the questionnaires, which ask things like your geographic location, the place of origin of your paternal and maternal ancestors, the languages they spoke, and so on.
Where are the challenges?: The biggest challenge, which is why we are engaged as a research division of IBM, is in analyzing the data: finding the correlations that we can make between genotypes and phenotypes and finding groups of individuals who hang together on the family tree.
What's involved with that?: That is an immensely large data mining problem. If you can think of the data, imagine that as three-dimensional. On the first axis of the three-dimensional cube you place your genotypic markers: basically, the status of whatever marker we are using on your genome. On the second axis are the answers to the questions about your phenotype: about your culture, your anthropology, the history of your family. On the third axis are the individuals themselves. This is not how we are structuring it in DB2, but conceptually, that describes the primary components of the data. The challenge is to mine the data to find groups of individuals that possess similar markers and have similar characteristics, in terms of where they live, where they have been, what language they speak.
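
As a rough illustration of the conceptual layout Royyuru describes (and explicitly not how the project stores data in DB2), the three axes can be sketched in a few lines of Python. The marker names, questionnaire fields, sample values, and the exact-match grouping rule below are all assumptions invented for this example; real population-genetics analyses rely on far more sophisticated methods.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    # Hypothetical record layout; not the project's actual schema.
    participant_id: str   # individuals axis
    markers: dict         # genotype axis: marker name -> observed state
    answers: dict         # phenotype axis: questionnaire field -> answer


def group_by_markers(participants, marker_names):
    """Bucket participants that share the same state at every chosen marker."""
    groups = defaultdict(list)
    for p in participants:
        profile = tuple(p.markers.get(name) for name in marker_names)
        groups[profile].append(p)
    return dict(groups)


def trait_summary(group, field):
    """Count how often each questionnaire answer occurs within one group."""
    return Counter(p.answers.get(field) for p in group)


# Made-up sample data: marker states and languages are placeholders.
participants = [
    Participant("p1", {"M1": "A", "M2": "G"}, {"language": "X", "region": "north"}),
    Participant("p2", {"M1": "A", "M2": "G"}, {"language": "X", "region": "north"}),
    Participant("p3", {"M1": "A", "M2": "G"}, {"language": "Y", "region": "south"}),
    Participant("p4", {"M1": "T", "M2": "G"}, {"language": "Y", "region": "south"}),
]

groups = group_by_markers(participants, ["M1", "M2"])
for profile, members in groups.items():
    print(profile, dict(trait_summary(members, "language")))
```

Grouping on identical marker profiles is the simplest possible stand-in for "individuals that possess similar markers"; the point of the sketch is only to show how the genotype and phenotype axes meet at each individual.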
Is this strictly a heuristic problem or are there algorithms that cut through it?: There are algorithms that geneticists and researchers in this field have been using. Our goal is to push that field enormously. The sheer quantity of data can help fuel that.
So this is more an algorithmic problem than a computational one?: To begin with it is an algorithmic problem; at some point, it may become a computational problem. Today, there are no solutions that will scale to this size and give you reasonable answers for the diversity of participants that we are hoping to attract. And there are additional challenges that we hope to sink our teeth into as the data grows. For example, linguists estimate that there are some 6,000 languages spoken in the world today. Linguists anticipate that in the course of the next 100 years, probably only 1,500 of those languages will survive: that is, three-quarters of the linguistic diversity in the world will disappear in 100 years or less. And studies at the University of California, Berkeley indicate that a large number of languages are spoken by 10 or fewer people. That's how small these linguistic groups are becoming. Some of them are geographically collocated, some of them are geographically dispersed.
So you are trying to find a correlation between genetic markers and linguistics?: Absolutely. Languages take a long time to evolve. They evolve with large populations who have lived in relative isolation from the rest of the world. Linguists estimate that it takes upwards of 1,000 years for a complete, independent language to evolve. Languages therefore become extremely good proxies for the cultural diversity that we see on the planet today. One of the challenges that we have to quickly address is how do you identify people who have that similarity in cultural traits? Do they share genetic markers? If so, what can we do to understand the relationship in linguistic diversity between one group of individuals and that of another? And can we determine whether linguistic evolution is similar to the evolution of the genome itself?
Is this going to become a test of how well linguistic traits serve as a measure of human migration?: No. The two are correlated but they are not necessarily exactly identical. In fact, the other extreme is also interesting: seeing a correlation in language with no corresponding correlation in genetics. For example, there are populations that are separated by a few thousand miles, yet speak languages that fall into the same linguistic group. Now the question is: how did these languages become similar in these two far-apart regions of the world? Did the people move? Or did the language migrate as part of the culture, with neighbors teaching neighbors: and lo and behold, 2,000 years later you have two very distinct groups of individuals that are speaking the language but they share no genetic history. It's that sort of analysis that this project will certainly facilitate. If you gather data from a huge number of people, geographically dispersed, culturally very diverse, it is these sorts of questions that you will be able to answer in the course of the project. The other big challenge is data collection, which has always been the hardest part of this kind of genetic population research. It takes a lot of field work to encounter individuals who will be willing to contribute data. That's why we took the approach of collecting data from indigenous peoples through field research as well as inviting public participants to provide samples and volunteer specific data about themselves. All the participants are, in a sense, associate researchers. By revealing a little bit about themselves (nothing that compromises their anonymity or the confidentiality of the data), they are allowing this kind of research to happen.
Have there been changes in how DNA is represented?: The most raw form of DNA information is the sequence itself. We are not sequencing the entire genome, but specific marker locations on the Y chromosome, which is the male chromosome, and mitochondrial DNA, which is a piece of DNA that is circular and is possessed by both males and females. From a cost standpoint, we have only been able to do this over the last 15 years or so.
How has the cost of sampling changed over the last few years?: Certainly an order of magnitude or more, and it continues to come down. People are still inventing methods, and the vision for most biotechnology folks working in this area is that we have to make whole genome sequencing affordable in the near time frame: within the next ten years or so.
So what was your family's genographic migration?: My ancestors got to be in India and now I know how. Based on the marker from my DNA, I can trace my paternal ancestry and I know they were approximately the second wave of migration to occupy the Indian subcontinent. People who possess my marker are the dominant population in southern India; that's no surprise to me. I lived in central India but my parents were born and lived in southern India so I know that I do belong to that place.