Math Magic Tricks: A Map of Human History, Hidden in DNA

QUANTA MAGAZINE: What got you thinking about genetic diversity as a computational problem?

JOHN NOVEMBRE: For me, the path starts pretty far back. In high school, I was a bit of a computer programming nerd. But in my classes, I was learning about the genetic code, which was completely mesmerizing. Then in college, I got a chance to do a summer research internship at Stanford, where I heard a talk by a student who had interned in Luigi Luca Cavalli-Sforza’s lab. What they do, and what they’ve become famous for, is to look at variations in human genes, how they’re distributed across the globe, and what they can tell us about human history. That was fascinating to me.

I went back to my home campus, and I found a lab working on the population genetics of Quercus gambelii, the Gambel oak. I learned just how difficult a lot of the analysis tools were to use, and how much math and computation is involved in analyzing genetic data. All of a sudden I realized, “Wait a minute. Here’s this thing I really love, programming, so why don’t I combine these two passions?” My day-to-day activity became tinkering with computers, but my larger end is something that intellectually fascinates me, which is understanding genetic variation and how it changes through time.

Early in your career, you made waves by uncovering deficiencies in a common statistical tool known as principal component analysis (PCA). How did this discovery further your work in genetics?

What PCA does is, it takes an individual’s genetic data and boils it down to just a few numbers. In learning about how this method works, its strengths and its weaknesses, I understood that the patterns it produces could reflect spatial structure in population data.
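To make "boils it down to just a few numbers" concrete, here is a minimal sketch in plain Python. It is not the analysis described in the interview: the two groups, their allele frequencies, the marker count, and the helper names `simulate_genotypes` and `first_pc_scores` are all assumptions made up for illustration. Each simulated individual starts as 500 genotype values and ends up as a single number, their position on the first principal component.

```python
import random

def simulate_genotypes(freqs, n_individuals, rng):
    """Draw diploid genotypes: 0, 1 or 2 copies of an allele at each marker."""
    return [[int(rng.random() < p) + int(rng.random() < p) for p in freqs]
            for _ in range(n_individuals)]

def first_pc_scores(genotypes, rng, iterations=200):
    """Return a unit-length vector proportional to each individual's PC1 score."""
    n = len(genotypes)
    # Center every marker (column) so that PCA describes variation around the mean.
    means = [sum(col) / n for col in zip(*genotypes)]
    x = [[g - m for g, m in zip(row, means)] for row in genotypes]
    # Gram matrix: entry (i, j) measures how similar individuals i and j are.
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    # Power iteration converges to the leading eigenvector of the Gram matrix.
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        v = [sum(g * w for g, w in zip(row, v)) for row in gram]
        norm = sum(w * w for w in v) ** 0.5
        v = [w / norm for w in v]
    return v

rng = random.Random(0)
n_markers, n_per_group = 500, 20
# ASSUMPTION: two hypothetical groups whose allele frequencies differ by 0.2
# at every marker. Real populations differ far less at most markers.
freqs_a = [rng.uniform(0.2, 0.6) for _ in range(n_markers)]
freqs_b = [p + 0.2 for p in freqs_a]
genotypes = (simulate_genotypes(freqs_a, n_per_group, rng)
             + simulate_genotypes(freqs_b, n_per_group, rng))

scores = first_pc_scores(genotypes, rng)
mean_a = sum(scores[:n_per_group]) / n_per_group
mean_b = sum(scores[n_per_group:]) / n_per_group
print(f"mean PC1 score, group A: {mean_a:+.3f}; group B: {mean_b:+.3f}")
```

The two groups land on opposite sides of zero along this one axis, even though no group labels were given to the method. A real analysis would keep several components and use a linear algebra library rather than hand-written power iteration.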

I was hoping to get access to genetic data from a region of the world where there’s dense sampling, so that I could see what variation looks like at a continuous scale, where populations kind of blend into one another. And it turned out I was very lucky in that I got invited to join a collaboration with Carlos Bustamante, [then] at Cornell, to analyze one of the largest collections of [genomic data] being applied to human populations. The full data set was 3,192 European individuals. A large fraction of the sample had answered an ancestry questionnaire to say where their grandparents came from, and based on that, we saw we had samples from roughly 37 different origins across Europe.

So what did you learn?

When we applied PCA, right away we saw this major pattern: There was a striking resemblance between where individuals are located in genetic space and their geography, meaning where their grandparents came from. That’s really remarkable given how closely related human individuals are. Most geneticists wouldn’t have thought you could tease apart very fine-scale structure within continental scales.

How fine-scale are we talking about?

Let’s say I took an individual and hid their geographic location and then tried to put them back on a map. How well could I do? When we did this, we could often get within a few hundred kilometers. Even when we looked at German-speaking Swiss versus French-speaking Swiss versus Italian-speaking Swiss, we could see shifts in the genetic distribution.
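The hide-and-replace test described here is a leave-one-out evaluation, and its mechanics can be sketched on made-up data. Everything below is a toy under stated assumptions: the 2,000 km square, the 150 km noise level, the nearest-neighbor rule, and the function name `leave_one_out_errors` are hypothetical and are not taken from the study. The resulting error reflects the noise level chosen for the toy, not any real measurement.

```python
import math
import random
import statistics

def leave_one_out_errors(pcs, locations, k=5):
    """Hide each individual's location in turn and guess it from genetic neighbors."""
    errors = []
    for i, pc in enumerate(pcs):
        others = [j for j in range(len(pcs)) if j != i]
        # The k individuals closest in genetic (PC) space, excluding the hidden one.
        nearest = sorted(others, key=lambda j: math.dist(pc, pcs[j]))[:k]
        guess = (sum(locations[j][0] for j in nearest) / k,
                 sum(locations[j][1] for j in nearest) / k)
        # Error is the straight-line distance, in km, from the guess to the truth.
        errors.append(math.dist(guess, locations[i]))
    return errors

rng = random.Random(1)
# Toy map: 300 individuals scattered over a flat 2,000 km by 2,000 km square.
locations = [(rng.uniform(0, 2000), rng.uniform(0, 2000)) for _ in range(300)]
# ASSUMPTION: the two PC coordinates mirror map position up to a change of
# scale, blurred by noise worth about 150 km in each direction.
pcs = [((x + rng.gauss(0, 150)) / 1000, (y + rng.gauss(0, 150)) / 1000)
       for x, y in locations]

errors = leave_one_out_errors(pcs, locations)
median_error = statistics.median(errors)
print(f"median placement error: {median_error:.0f} km")
```

The point of the sketch is the protocol: the individual being placed never contributes their own location to the guess, so the reported error is an honest estimate of how well a new person could be placed.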

I’m surprised that my grandparents’ geographic coordinates could have such a notable effect on my genetics, given how often humans migrate. How do you explain this influence?

This is something I want to stress: The effect on your genetics is actually incredibly small. It’s just that we’re looking at so many locations in the genome that we can pick up very small effects. This is the magic of big data: Very subtle patterns become detectable. So it’s not that where your grandparents live has a huge impact on your genetics. It’s actually a very, very minor effect. But when you have hundreds of thousands of measurements, you can start to pick out that an individual seems to come from one location versus another.
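A small simulation can illustrate how many weak signals add up. This is a sketch under assumptions, not the method used in the research: two hypothetical populations differ by only 0.03 in allele frequency at each marker, the true frequencies are treated as known, and the names `genotype`, `log_likelihood`, and `assignment_accuracy` are invented for the example. The marker counts are also scaled down from "hundreds of thousands" so the code runs quickly.

```python
import math
import random

def genotype(p, rng):
    """Number of copies (0, 1 or 2) of an allele that has frequency p."""
    return int(rng.random() < p) + int(rng.random() < p)

def log_likelihood(genos, freqs):
    """Log-probability of the genotypes given population allele frequencies.

    The binomial coefficient is left out because it is the same for both
    populations and cancels when the two likelihoods are compared.
    """
    return sum(g * math.log(p) + (2 - g) * math.log(1 - p)
               for g, p in zip(genos, freqs))

def assignment_accuracy(n_markers, rng, shift=0.03, n_test=100):
    """Fraction of simulated individuals assigned to their true population."""
    freqs_a = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
    # ASSUMPTION: population B differs from A by a tiny shift at every marker.
    freqs_b = [p + rng.choice((-shift, shift)) for p in freqs_a]
    correct = 0
    for i in range(n_test):
        from_a = i % 2 == 0
        source = freqs_a if from_a else freqs_b
        genos = [genotype(p, rng) for p in source]
        guess_a = log_likelihood(genos, freqs_a) > log_likelihood(genos, freqs_b)
        correct += guess_a == from_a
    return correct / n_test

rng = random.Random(42)
accuracy = {m: assignment_accuracy(m, rng) for m in (10, 500, 5000)}
for n_markers, acc in accuracy.items():
    print(f"{n_markers:>5} markers: {acc:.0%} assigned correctly")
```

With a handful of markers the assignment is close to a coin flip, because each marker barely distinguishes the two groups. With thousands of markers the same tiny per-marker differences accumulate into a reliable call.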

What are your thoughts on the ethics of commercial ancestry tests?

I advise for Ancestry.com (their DNA branch), so I’m very sensitive to the challenges of communicating results. On the one hand, projects like our genetic map of Europe show the tremendous potential and power of these tools for learning about ancestry. But then there’s also the immense complexity of it: What does it really mean to talk about where an individual is from? We can talk about where our parents and our grandparents are from, or we can go very far back into the past when we all came from Africa. And we can have different ideas about origin, in terms of geographic location versus some kind of cultural or ethnic population.

I’d say we’re still in the early days of really nailing this problem of using genetic data from today to interpret the past. We’re still facing the complexity of real biological systems and populations, which resist some of our attempts to use very simple models of history.

In what ways has your work influenced how you think about race?

It’s very clear that genetics research has a difficult and dark history. But it’s been exciting to be part of a new generation doing this kind of work in a time when diversity is much more appreciated and understood and valued, and when we have the data to make it even more clear just how poorly conceived racial worldviews have been.

Are you thinking of a particular example?

A very powerful one for me was being part of some of the first teams to look at genome-wide data taken from multiple human populations. You can sort the genome by what regions vary the most across human populations and then ask, “OK, what genes are near those locations, and what do we know about them?”

If you do this exercise, you will see, at the very extreme top of the list, variants that are involved in skin pigmentation, in eye color, in hair color. So it’s an empirical fact that the things we use to see differences in each other are outliers in the human genome. Your average set of genes in the human genome is much more similar globally.
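The sorting exercise can be sketched with one common measure of differentiation, the fixation index F_ST, computed here as (H_T - H_S) / H_T from allele frequencies. The interview does not say which statistic was used, so this choice is an assumption, as are the equal weighting of populations, the placeholder locus names, and the made-up frequencies.

```python
def fst(freqs):
    """F_ST for one locus from per-population allele frequencies (equal weights)."""
    mean_p = sum(freqs) / len(freqs)
    # Expected heterozygosity if all populations were pooled into one.
    h_total = 2 * mean_p * (1 - mean_p)
    if h_total == 0:
        return 0.0  # the allele is absent or fixed everywhere: no variation to split
    # Average expected heterozygosity within each population.
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total

# HYPOTHETICAL data: allele frequencies at four placeholder loci in three
# populations. The names and numbers are invented for illustration only.
loci = {
    "locus_1": [0.50, 0.52, 0.48],
    "locus_2": [0.05, 0.95, 0.50],
    "locus_3": [0.30, 0.35, 0.25],
    "locus_4": [0.10, 0.12, 0.80],
}

# Sort loci from most to least differentiated across populations.
ranked = sorted(loci, key=lambda name: fst(loci[name]), reverse=True)
for name in ranked:
    print(f"{name}: F_ST = {fst(loci[name]):.3f}")
```

Loci whose frequencies are nearly the same everywhere score close to zero, while loci with sharply different frequencies rise to the top of the list. On real data, the next step described in the interview is to look up which genes sit near the top-ranked regions.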

You analyzed the first whole-genome sequences of three gray wolf lineages and compared them to the genomes of three dog lineages. What did you discover?

That was a big surprise. We were thinking we might find that all three of the dog lineages are most closely related to one of the three wolf lineages. They might all be related to the Israeli wolf, for instance, because maybe dogs were domesticated in the Middle East. Or maybe there were two domestications of dogs, and the dingo would be related to the Chinese wolf while the basenji was related to the Croatian wolf.

But what we saw was that the three dogs were most closely related to each other but not embedded within the genealogy of the three wolves. Our hypothesis is that there was a wolf lineage that dogs were domesticated from that has since gone extinct. The story’s gotten incredibly complicated, and I think the final chapter’s not written yet.

Are you a dog person?

Not particularly, no. I would say my motivation was primarily to try to solve this larger challenge for the whole field, which is: How do we use DNA sequences today as a record of the past? You can swap out the species names for me, and it’s still interesting. It’s still a fun problem.

How has your approach to analyzing genetic data evolved over time?

There’s been increasing movement in my work toward data visualization. Your eye can actually process a large amount of information and interpret complex patterns. With the right visualization tools, you gain a more direct and intuitive understanding of the major features of the data and how they reflect biological processes.

Saturday, April 25, 2020
