How we’re building the world’s largest family tree | Yaniv Erlich


People use the internet
for various reasons. It turns out that one of the most
popular categories of website is something that people
typically consume in private. It involves curiosity, non-insignificant levels
of self-indulgence and is centered around recording
the reproductive activities of other people. (Laughter) Of course, I’m talking about genealogy — (Laughter) the study of family history. When it comes to detailing family history, in every family, we have this person
that is obsessed with genealogy. Let’s call him Uncle Bernie. Uncle Bernie is exactly the last person
you want to sit next to at Thanksgiving dinner, because he will bore you to death
with peculiar details about some ancient relatives. But as you know, there is a scientific side to everything, and we found that Uncle Bernie’s stories have immense potential
for biomedical research. We let Uncle Bernie
and his fellow genealogists document their family trees through
a genealogy website called geni.com. When users upload
their trees to the website, it scans their relatives, and if it finds matches to existing trees, it merges the existing
and the new tree together. The result is that large
family trees are created, beyond the individual level
of each genealogist. Now, by repeating this process
with millions of people all over the world, we can crowdsource the construction
of a family tree of all humankind. Using this website, we were able to connect 125 million people into a single family tree. I cannot draw the tree
on the screens over here because they have fewer pixels than the number of people in this tree. But here is an example of a subset
of 6,000 individuals. Each green node is a person. The red nodes represent marriages, and the connections represent parenthood. In the middle of this tree,
you see the ancestors. And as we go to the periphery,
you see the descendants. This tree has seven
generations, approximately. Now, this is what happens
when we increase the number of individuals to 70,000 people — still a tiny subset
of all the data that we have. Despite that, you can already see
the formation of gigantic family trees with many very distant relatives. Thanks to the hard work
of our genealogists, we can go back in time
hundreds of years. For example, here is Alexander Hamilton, who was born in 1755. Alexander was the first
US Secretary of the Treasury, but is mostly known today
due to a popular Broadway musical. We found that Alexander has deeper
connections in the showbiz industry. In fact, he’s a blood relative of … Kevin Bacon! (Laughter) Both of them are descendants
of a lady from Scotland who lived in the 13th century. So you can say that Alexander Hamilton is 35 degrees of Kevin Bacon genealogy. (Laughter) And our tree has millions
of stories like that. We invested significant efforts
to validate the quality of our data. Using DNA, we found that 0.3 percent of
the mother-child connections in our data are wrong, which could match the adoption rate
in the US pre-Second World War. For the father’s side, the news is not as good: 1.9 percent of the father-child
connections in our data are wrong. And I see some people smirk over here. It is what you think — there are many milkmen out there. (Laughter) However, this 1.9 percent error rate
in patrilineal connections is not unique to our data. Previous studies found
a similar error rate using clinical-grade pedigrees. So the quality of our data is good, and that should not be a surprise. Our genealogists have
a profound, vested interest in correctly documenting
their family history. We can leverage this data to learn
quantitative information about humanity, for example, questions about demography. Here is a look at all our profiles
on the map of the world. Each pixel is a person
that lived at some point. And since we have so much data, you can see the contours
of many countries, especially in the Western world. In this clip, we stratified
the map that I’ve shown you based on the year of birth of individuals
from 1400 to 1900, and we compared it
to known migration events. The clip is going to show you
that the deepest lineages in our data go all the way back to the UK, where they had better record keeping, and then they spread along
the routes of Western colonialism. Let’s watch this. (Music) [Year of birth: ] [1492 – Columbus sails the ocean blue] [1620 – Mayflower lands in Massachusetts] [1652 – Dutch settle in South Africa] [1788 – Great Britain penal
transportation to Australia starts] [1836 – First migrants use Oregon Trail] [all activity] I love this movie. Now, since these migration events
are giving the context of families, we can ask questions such as: What is the typical distance
between the birth locations of husbands and wives? This distance plays
a pivotal role in demography, because the patterns in which
people migrate to form families determine how genes spread
in geographical areas. We analyzed this distance using our data, and we found that in the old days, people had it easy. They just married someone
in the village nearby. But the Industrial Revolution
really complicated our love life. And today, with affordable flights
and online social media, people typically migrate more than
100 kilometers from their place of birth to find their soul mate. So now you might ask: OK, but who does the hard work
of migrating from place to place to form families? Are these the males or the females? We used our data to address this question, and at least in the last 300 years, we found that the ladies do the hard work of migrating from place
to place to form families. Now, these results
are statistically significant, so you can take it as scientific fact
that males are lazy. (Laughter) We can move from questions
about demography and ask questions about human health. For example, we can ask to what extent genetic variations
account for differences in life span between individuals. Previous studies analyzed the correlation
of longevity between twins to address this question. They estimated that the genetic
variations account for about a quarter of the differences
in life span between individuals. But twins can be correlated
for so many reasons, including various environmental effects or a shared household. Large family trees give us the opportunity
to analyze everything from close relatives, such as twins, all the way to distant relatives,
even fourth cousins. This way we can build robust models that can tease apart the contribution
of genetic variations from environmental factors. We conducted this analysis using our data, and we found that genetic variations
explain only 15 percent of the differences in life span
between individuals. That is five years, on average. So genes matter less to life span than
we previously thought. And I find it great news, because it means that
our actions can matter more. Smoking, for example, determines
10 years of our life expectancy — twice as much as what genetics determines. We can even have more surprising findings as we move from family trees and we let our genealogists
document and crowdsource DNA information. And the results can be amazing. It might be hard to imagine,
but Uncle Bernie and his friends can create DNA forensic capabilities that even exceed
what the FBI currently has. When you place the DNA
on a large family tree, you effectively create a beacon that illuminates the hundreds
of distant relatives that are all connected to the person
that originated the DNA. By placing multiple beacons
on a large family tree, you can now triangulate the DNA
of an unknown person, the same way that the GPS system
uses multiple satellites to find a location. The prime example
of the power of this technique is capturing the Golden State Killer, one of the most notorious criminals
in the history of the US. The FBI had been searching
for this person for over 40 years. They had his DNA, but he never showed up
in any police database. About a year ago, the FBI
consulted a genetic genealogist, and she suggested that they submit
his DNA to a genealogy service that can locate distant relatives. They did that, and they found a third cousin
of the Golden State Killer. They built a large family tree, scanned the different
branches of that tree, until they found a profile
that exactly matched what they knew about
the Golden State Killer. They obtained DNA from this person
and found a perfect match to the DNA they had in hand. They arrested him
and brought him to justice after all these years. Since then, genetic genealogists
have started working with local US law enforcement agencies to use this technique
in order to capture criminals. And in the past six months alone, they were able to solve
over 20 cold cases with this technique. Luckily, we have people like Uncle
Bernie and his fellow genealogists. These are not amateurs
with a self-serving hobby. These are citizen scientists
with a deep passion to tell us who we are. And they know that the past
can hold a key to the future. Thank you very much. (Applause)
