Genetics meets Voodoo Historical Linguistics: Genetic Variation and Population Structure in Native Americans

November 30, 2007

The process of the settlement of the Americas is one of those long-standing and fascinating research questions that can probably only be properly tackled by bringing to bear the tools of multiple disciplines: archeology, historical linguistics, and biology — especially genetic analyses of Native American populations. I was excited to see, therefore, a recent study, Genetic Variation and Population Structure in Native Americans (PLoS Genetics), that sought to use information on genetic variation in Native American populations to develop and test hypotheses about the question of prehistoric migration in the Americas.

There is much to chew on in this interesting article, and I have some queries on methodological issues related to the genetics discussed in the article, but in this post I want to comment on the use the authors made of historical linguistics. Most of the article is devoted to analyses of genetic samples from various indigenous peoples of the Americas, but one section is entitled “Genes and Languages”. The first sentence of this section reads:

We compared the classification of the populations into linguistic “stocks” with their genetic relationships as inferred on a neighbor-joining tree constructed from Nei genetic distances.

When I saw the word “stocks”, my eyebrows went up, and I read on:

In a neighbor-joining tree, a reasonably well-supported cluster (86%) includes all non-Andean South American populations, together with the Andean-speaking Inga population from southern Colombia. Within this South American cluster, strong support exists for separate clustering of Chibchan-Paezan (97%) and Equatorial-Tucanoan (96%) speakers (except for the inclusion of the Equatorial-Tucanoan Wayuu population with its Chibchan-Paezan geographic neighbors, and the inclusion of Kaingang, the single Ge-Pano-Carib population, with its Equatorial-Tucanoan geographic neighbors).

Chibchan-Paezan? Equatorial-Tucanoan? Ge-Pano-Carib? Uh-oh, I thought, it looks like the authors are using Greenberg’s classification of the languages of the Americas. The citations confirmed it: Greenberg (1987) and Ruhlen (1991) are their main linguistic references. I was stunned.

The authors are geneticists, not historical linguists specializing in the Americas, so they are probably blissfully unaware that Greenberg’s classification (which Ruhlen essentially repeats) has been severely criticized by Americanist historical linguists, and is regarded by most of them as unreliable at best. They may exist, but I’ve never met an Americanist who finds Greenberg’s classification even vaguely plausible. The authors do thank Merritt Ruhlen for assistance in their acknowledgements section, which indicates at least one source for their linguistic advice.

The problematic nature of the use of Greenberg’s classification is nicely, if subtly, indicated by the following observation by the authors:

As the use of a single-family grouping (Amerind) of all languages not belonging to the Na-Dene or Eskimo-Aleut families is controversial [here they cite Bolnick et al. 2004], we focused our analysis on the taxonomically lower level of linguistic stocks.

To say that Amerind is “controversial” is an understatement, but never mind that for now. As Lyle Campbell points out, even Greenberg and Ruhlen admit that they have greater confidence in the Amerind supergroup than they do in the accuracy of the subgroupings within Amerind:

Moreover, there is some reason to believe that not even Greenberg and Ruhlen have strong faith in the validity of these eleven groupings, since they repeatedly mentioned their belief that the overall Amerind construct “is really much more robust than some [of these eleven] lower branches of Amerind” (Ruhlen 1994b:15; see Greenberg 1987:59). (Campbell 1997: 328)

The Greenberg citation in question reads:

The validity of Amerind as a whole is more secure than that of any of its stocks.

So, the authors of GVPSNA think that Amerind is too controversial to be used in their paper, but Greenberg and Ruhlen think that Amerind is “more robust” and “more secure” than the “taxonomically lower level of linguistic stocks” used in GVPSNA. By simple transitivity, the authors should not trust the lower-level stocks either.

The root problem with the lower-level groupings in Amerind is that even if the method of mass lexical comparison (MMLC) used by Greenberg and Ruhlen is viable (and there are not many historical linguists who would defend this position), the method is (as Bill Poser, among many others, has pointed out) incapable of defining subgroupings. The very best that MMLC can do (and once again, historical linguists have grave doubts even here) is show that a group of languages is related. It cannot elucidate subgroupings within that group of related languages.

I’ll save the explanation of the flaws in MMLC and its inability to define subgroupings for another post, but we see in the case of GVPSNA both a pervasive problem and an opportunity. The pervasive problem is that literacy in linguistics is low both among laymen and in other scientific disciplines, a horse long ago beaten to death over at Language Log (the horse in question is, unfortunately, undead, and requires periodic new beatings). The opportunity is twofold: first, it’s clear that linguistics has something to offer scientists in other fields, which is nice; and second, getting the word out about the state of the art in linguistics gives linguists a great way to achieve world domination. Fast.

Works Cited

Bolnick DA, Shook BA, Campbell L, Goddard I. 2004. Problematic use of Greenberg’s linguistic classification of the Americas in studies of Native American genetic variation. Am J Hum Genet 75: 519–522.

Campbell, Lyle. 1997. American Indian Languages: The historical linguistics of Native America. Oxford University Press.

Greenberg, Joseph. 1987. Language in the Americas. Stanford University Press.

Ruhlen, Merritt. 1991. A guide to the world’s languages. Volume 1: Classification. Stanford, CA: Stanford University Press.

15 Responses to “Genetics meets Voodoo Historical Linguistics: Genetic Variation and Population Structure in Native Americans”

  1. David Marjanović Says:

    with their genetic relationships as inferred on a neighbor-joining tree constructed from Nei genetic distances.

    ARGH!!! As a biologist, let me mention that I had no idea anyone still uses neighbor-joining. That’s because neighbor-joining is not phylogenetics. It is phenetics — it counts similarities, without even trying to distinguish shared derived similarities from shared retained similarities.

    When there’s little enough homoplasy (convergence, reversals, borrowing) in the dataset, it does give a tree that is congruent with the phylogenetic tree, but how much homoplasy there is in the dataset is among the questions a phylogenetic analysis tries to answer; no such answer can be used as an a priori assumption.

    Mass lexical comparison (or multilateral comparison, as Greenberg & Ruhlen prefer) is phenetics, too. It is great for generating phylogenetic hypotheses, but incapable of testing them. Which is a real pity, because the state of phylogenetics in linguistics compared to that in biology is lamentable.

  2. levmichael Says:

    David,

    Yes, as you point out, the problem is that Greenberg and Ruhlen’s method, under any name, cannot distinguish between true cognates and chance similarities, loans, onomatopoeia, etc.

    I was unfamiliar, however, with the term ‘phenetics’; it’s interesting to note the parallels between the concepts in the two disciplines.

    I’m curious about the following comment:

    …the state of phylogenetics in linguistics compared to that in biology is lamentable.

    I may be misunderstanding your intent here, but it seems to me that the comparative method in its full form (i.e. with proper cognate sets and systematic reconstruction of protolanguages) is actually pretty good.

  3. David Marjanović Says:

    Yes, as you point out, the problem is that Greenberg and Ruhlen’s method, under any name, cannot distinguish between true cognates and chance similarities, loans, onomatopoeia, etc.

    No method can do that a priori, apart from recognizing the most obvious cases of onomatopoeia and the most obvious loans. You have to reconstruct the tree and then look at the distribution of the a-priori similarities on the tree to see whether they are cognate or not.

    Cladistics and the comparative method can, however, as part of the building of the tree, distinguish shared innovations from shared retentions. Phenetics, including multilateral comparison, doesn’t even try to do that, so it finds similarity clusters, not clades (clade = an ancestor and all its descendants).

    it seems to me that the comparative method in its full form (i.e. with proper cognate sets and systematic reconstruction of protolanguages) is actually pretty good.

    It is good, but not as good as it could be. It can only compare very few languages at once and lacks a clear optimality criterion (which is parsimony in the case of cladistics). Biological datasets often have dozens of species and hundreds of characters; these are plugged into a computer program which finds all trees that explain the dataset with the smallest number of additional assumptions.
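
    As a rough illustration of the parsimony criterion described above, here is a minimal sketch of Fitch's small-parsimony count for a single binary character on a rooted binary tree. The taxon names, the nested-tuple tree encoding, and the function name are assumptions made for this example; it is not the software used in any study mentioned in this thread, and real programs also search over many candidate trees.

```python
# Minimal sketch of Fitch parsimony scoring (illustrative only).
# A tree is a nested tuple of two subtrees; a leaf is a taxon name (str).
# `states` maps each hypothetical taxon to its observed character state.

def fitch_score(tree, states):
    """Return the minimum number of state changes needed on `tree`."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its state is observed
            return {states[node]}
        left, right = node
        a, b = visit(left), visit(right)
        common = a & b
        if common:                         # children agree: no change needed
            return common
        changes += 1                       # children disagree: one change
        return a | b

    visit(tree)
    return changes


# Hypothetical character: taxa A and B share state 1, C and D share state 0.
character = {"A": 1, "B": 1, "C": 0, "D": 0}

tree_1 = (("A", "B"), ("C", "D"))   # groups A with B
tree_2 = (("A", "C"), ("B", "D"))   # groups A with C

score_1 = fitch_score(tree_1, character)
score_2 = fitch_score(tree_2, character)
print(score_1, score_2)
```

    Under these assumptions the first tree needs fewer changes than the second, so parsimony prefers it. Summing such counts over all characters gives the tree length that the programs minimize.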

  4. David Marjanović Says:

    Here is an example. I recommend the (very long) supplementary information.

  5. David Reed Says:

    So what then is the “best” linguistic analysis for the Americas that would be useful for examining the biological relationships?

  6. Lev Michael Says:

    David Reed,

    Probably the best work to consult would be Lyle Campbell’s book, which I include in the references at the end of my post. Not only is the classification in that book the most widely accepted by Americanists, but Campbell also provides extensive discussion of other classificatory proposals (including Greenberg’s) and the methodological and theoretical issues involved. (A rough measure of linguists’ opinion of the volume is given by the fact that it won the Linguistic Society of America’s annual book prize.)

  7. Lev Michael Says:

    David Marjanović,

    Comments on your comments:

    No method can do that a priori, apart from recognizing the most obvious cases of onomatopoeia and the most obvious loans. You have to reconstruct the tree and then look at the distribution of the a-priori similarities on the tree to see whether they are cognate or not.

    Right. I think that the point that was not clear to the authors of the original article is that mass/multilateral comparison has no analytical means for distinguishing true cognates from false ones, whereas the comparative method does (even if it is not, in your terms, an a priori method).

    It can only compare very few languages at once and lacks a clear optimality criterion (which is parsimony in the case of cladistics).

    I’m not sure I agree with your claim that the comparative method can only compare a few languages at a time. There is no restriction, in principle, on the number of languages for which one can construct cognate sets and correspondence sets. But maybe I am misunderstanding you…

    But I am intrigued by the possibility of developing computational tools to aid reconstruction and classification — perhaps historical linguists could learn something from their biological colleagues, as you suggest.

  8. David Marjanović Says:

    Right. I think that the point that was not clear to the authors [...]

    Agreed.

    There is no restriction, in principle, on the number of languages for which one can construct cognate sets and correspondence sets.

    Indeed there isn’t any in theory, but there is one in practice. In biology, the method was made so explicit in 1950 that it can be used in computers. Lying next to me I have a paper that did a phylogenetic analysis with 3297 nucleotides of a gene of 88 species. The simplest method they used produced the two shortest trees ( = the ones that require the smallest number of assumptions) that explain the enormous dataset and contain, as part of the tree-building process, complete reconstructions of every single node; the authors don’t mention how long the calculation took, but based on my experience with much smaller datasets, I suppose it took a few hours to at most a day. Imagine comparing 88 languages within that time, including a complete reconstruction of their phylogeny and a complete reconstruction of each node.

    Now, of course, this was a molecular dataset; from sequencing to analysis, it may have taken six weeks (apart from the fact that the authors downloaded most sequences from GenBank rather than sequencing the genes themselves). Compiling a morphological (anatomical) dataset of, say, 150 taxa and 400 characters is a Ph.D. thesis. The analysis itself, however, is a matter of hours.

    perhaps historical linguists could learn something from their biological colleagues, as you suggest.

    Would be the first time, though! :o)

    Have you read the paper by Rexová et al. (2004) in Cladistics? (It’s a cladistic analysis of Indo-European.) If not, I can send you the pdf.

  9. Cmonkey Says:

    Hold up, guys. The tree shown is a population tree, not a phylogenetic tree. And, it’s based on autosomal microsatellites, not languages. Neighbor joining in this context is perfectly acceptable.

  10. Lev Michael Says:

    CMonkey,

    Just to be clear, I have no complaints about the genetic analysis in GVPSNA. I am far too ignorant about the issues to have any opinion one way or another about the suitability of neighbor-joining and the like. I am also clear on the fact that the population trees presented in the paper are not based on linguistic classifications. After all, the point is to treat genetic relatedness and linguistic relatedness as independent variables and see how the two measures coincide or not.

    My point is a much narrower one: that the method by which the authors of GVPSNA arrive at their measure of linguistic relatedness would not be accepted by most Americanist historical linguists.

  11. Cmonkey Says:

    No worries, Lev. I just saw some of the comments diverting away from your original post and argument.

    I look forward to reading your post on MMLC and subgroupings.

  12. David Marjanović Says:

    The paper is interested in where people come from and along which routes they migrated in which directions, so a phylogenetic tree would have been more useful.


  13. [...] 14, 2007 Since I have been critical of the use of historical linguistics in Genetic Variation and Population Structure in Native [...]

  14. Lev Michael Says:

    David,

    Just so you know, I am not blowing off your recommendation to read Rexová. In fact, I’m looking into the flurry of work of which Rexová is an example in some detail. I’ll probably write a post about the issues raised by this body of work in the not-too-distant future.

  15. John Cowan Says:

    I’d very much like to read that Rexová paper. Is anyone still checking these comments who can send it to me? I’m at cowan at ccil dot org.

