Using Mesquite to inspect cognate sets

This post describes the use of a phylogenetic analysis program, Mesquite, to identify possibly erroneous cognacy judgments in large lexical datasets. I’ve found it to be a very useful tool, and I haven’t heard other linguists talk about it a great deal, so I thought it might be interesting for others to hear about. But first, some background…

For the past couple of years the Berkeley Comparative Tupí-Guarani Project* has been working to develop an improved internal classification of the Tupí-Guaraní (TG) family. By this point we have, among other things, collected lexical data on 30 TG languages (plus Awetí and Mawe, two non-TG Tupian languages, to serve as out-group languages), using a 539-item comparative list, and arranged these data into approximately 1300 non-singleton cognate sets. We will, in the not-too-distant future, start constructing correspondence sets in order to begin applying the Comparative Method to this dataset, but in the meantime, we are running computational phylogenetic analyses on the lexical data to obtain a preliminary internal classification. What we obtain are trees like the following:

An inferred phylogeny of TG languages based on lexical cognate sets

This is actually a pretty credible TG tree (although it may not, of course, be entirely correct): it largely reproduces the basic groups of Rodrigues (1984/5) and the proposed subgroups of Rodrigues and Cabral (2002), along with additional structure that seems plausible if you have, like us, been spending a lot of time looking at TG lexical and morphological data. (It also yields a very sensible model for the geographical dispersal of the family, but that’s a matter for another day.) One weakness of the phylogenetic result, however, is the support values for certain subgroups. Support values correspond roughly to the probability that a given subgroup is, in fact, a subgroup, and we have wanted to use a value of 0.85 as our cutoff point for considering a clade (or subgroup) credible. Unfortunately, some of our most interesting subgroups have lower values. For example, the subgroup that corresponds more or less to Groups I+II+III in the Rodrigues classification has a support value of 0.81.
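To make the notion of a support value concrete, here is a minimal sketch. It assumes that support is estimated as the share of sampled trees that contain a given subgroup, as is typical when the values are posterior probabilities; the post does not say how our values were computed, and the languages A to D and the four-tree sample are placeholders, not our data.

```python
from collections import Counter

def clade_support(sampled_trees):
    """Fraction of sampled trees in which each clade (subgroup) appears."""
    counts = Counter()
    for clades in sampled_trees:
        counts.update(set(clades))  # count each clade at most once per tree
    return {clade: n / len(sampled_trees) for clade, n in counts.items()}

# Hypothetical sample: each tree is reduced to the set of clades it contains.
AB = frozenset({"A", "B"})
ABC = frozenset({"A", "B", "C"})
CD = frozenset({"C", "D"})
sample = [{AB, ABC}, {AB, ABC}, {AB, ABC}, {AB, CD}]

CUTOFF = 0.85  # the credibility threshold mentioned above
support = clade_support(sample)
credible = {clade for clade, value in support.items() if value >= CUTOFF}
```

In this toy sample the A+B subgroup appears in every tree and passes the cutoff, while A+B+C appears in three of four trees (0.75) and does not, which is the situation our Group I+II+III subgroup is in.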

Fortunately, one can increase the support values by improving the reliability of the cognate sets (assuming that they are not already perfect, ha ha). Computationally, lowered support values arise from ‘conflicting signals’, i.e. different sets of evidence that point to different subgroups. So, for example, there is good evidence for our Group I+II+III subgroup, i.e. cognate sets that uniquely define this subgroup, but there are other cognate sets that lead one to want to include other languages in this larger subgroup, or languages from this subgroup in other subgroups, reducing the support for all of the subgroups.
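One simple way to see what a conflicting signal looks like is a pairwise compatibility check. The sketch below assumes, as a deliberate simplification, that each cognate set arose exactly once and was never lost or borrowed; under that assumption two cognate sets can sit on the same tree only if their language sets are nested or disjoint. The set names and languages are hypothetical.

```python
from itertools import combinations

def conflicts(langs_a, langs_b):
    """True if two cognate sets overlap without one containing the other."""
    a, b = set(langs_a), set(langs_b)
    return bool(a & b) and not (a <= b or b <= a)

# Hypothetical cognate sets over placeholder languages A-D.
cognate_sets = {
    "SET1": {"A", "B", "C"},
    "SET2": {"A", "B"},
    "SET3": {"C", "D"},
}

conflicting_pairs = [
    (x, y)
    for x, y in combinations(sorted(cognate_sets), 2)
    if conflicts(cognate_sets[x], cognate_sets[y])
]
```

Here SET1 and SET2 are nested and agree with each other, while SET1 and SET3 pull language C into two different subgroups. Real analyses allow for loss and innovation, so a check like this only flags candidates for a second look.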

This kind of conflicting signal can arise from a number of sources, but two important ones are: 1) independent innovations that yield false cognacy; and 2) mistakes in building cognate sets, where two elements are deemed to be cognate when they are not. The latter is, of course, always a potential problem at this stage in the process, i.e. before complete application of the Comparative Method, since without adequate knowledge of the relevant sound changes, it is possible to treat bogus look-alikes as cognate, and miss true cognates due to changes that obscure cognacy. And in dealing with such a large dataset, human error inevitably comes into play: forms are deemed cognate in the wee hours of a particular morning, which really aren’t credibly cognate by the cold light of day.

Fortunately, we have found a very useful tool for ferreting out potentially bogus cognacy judgments in the form of Mesquite, an application that serves to carry out analyses on inferred phylogenetic trees. Mesquite has many functions, but the relevant one for our purposes is its ‘reconstruction’ of ancestral states. Basically what this function does is to ‘reconstruct’ (i.e. identify) how far back in a phylogenetic tree a given phylogenetic character (in our case, a form that is a member of a particular cognate set) reconstructs, according to the tree that one’s phylogenetics application has inferred. In doing so, it also identifies cases of independent innovation (likewise, according to the inferred tree).

One thing that makes Mesquite especially appealing is its graphical interface, which makes it easy to spot instances of independent innovation. First, in the following screen shot, one can see a clear instance of a character (KNEE4, presence of forms for ‘knee’ cognate to, e.g., Assuriní de Tocantíns kanawá) that seems to reconstruct quite solidly for one of the robust subgroups in our larger ‘Central’ subgroup.

A Mesquite ‘reconstruction’ for the TG KNEE4 cognates

Next, in the following screen shot, one can see a character (TOE2) that was, according to the ancestral state reconstruction associated with the tree, independently innovated three times: in Chiriguano, Pauserna, and Wayampí. This is a somewhat suspicious state of affairs, suggesting that it might make sense to look at the cognate set again. Doing so we see that the word for ‘toe’ in these languages is actually a compound meaning something like ‘foot head’. Body-part compounds with ‘head’ or ‘bone’ are fairly common in TG languages, suggesting that these forms for ‘toe’ are independently innovated, based on (true) cognates for ‘foot’ and ‘head’. On this basis we exclude this compound as informative for purposes of phylogenetic analysis. And note that the pattern evident in the ‘reconstruction’ is precisely the kind of conflicting signal that might lower the support for subgroups like Central and Peripheral.

A Mesquite ‘reconstruction’ for the TG TOE2 cognates
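For readers curious about what a ‘reconstruction’ of this kind involves, here is a minimal sketch of one standard approach, Fitch parsimony, applied to a presence/absence character on a strictly binary tree. This is not Mesquite’s code, and the post does not depend on which reconstruction method is used; the tree, the language labels A to F, and the rule of preferring ‘absent’ at ambiguous nodes are all assumptions made for illustration.

```python
def fitch_sets(tree, present):
    """Bottom-up pass: return (possible_states, children) for each node.

    A tree is either a leaf name (str) or a pair (left, right).
    State 1 means the cognate set is present, 0 means absent.
    """
    if isinstance(tree, str):
        return ({1} if tree in present else {0}, [])
    left, right = (fitch_sets(child, present) for child in tree)
    common = left[0] & right[0]
    return (common if common else left[0] | right[0], [left, right])

def count_gains(node, parent_state=0):
    """Top-down pass: count 0 -> 1 changes, assuming absence above the root."""
    states, children = node
    # Keep the parent's state when it is allowed; otherwise the set is a singleton.
    state = parent_state if parent_state in states else next(iter(states))
    gained = 1 if (parent_state == 0 and state == 1) else 0
    return gained + sum(count_gains(child, state) for child in children)

def independent_gains(tree, present):
    return count_gains(fitch_sets(tree, present))

# Hypothetical six-language tree, not the TG tree shown above.
toy_tree = ((("A", "B"), ("C", "D")), ("E", "F"))

clean = independent_gains(toy_tree, {"A", "B"})            # one origin, like KNEE4
suspicious = independent_gains(toy_tree, {"A", "C", "E"})  # scattered, like TOE2
```

A character confined to one subgroup comes out with a single gain, while a character scattered across the tree comes out with one gain per language, which is the pattern that sends us back to re-check the cognate set.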

Examining suspicious ‘reconstructions’ like the TOE2 one has led us to identify previously unnoticed complex forms, as in this case, as well as instances of poor cognacy judgments. And having identified several dozen problematic sets in this way, we have high hopes that our next TG tree will have the support values that we are pining for. We’re keeping our fingers crossed, and I’ll post our next set of results.

In any case, I’ve found Mesquite to be such a wonderful tool for evaluating cognate sets in the context of phylogenetic analysis that I wanted to share it with others who might not be familiar with it. (And thanks, Natalia, for introducing it to the TG group!)


Rodrigues, A. D. 1984/1985. Relações internas na família lingüística tupí-guaraní. Revista de Antropologia 27/28, 33–53.

Rodrigues, A. D. and A. S. A. C. Cabral. 2002. Revendo a classificação interna da família tupí-guaraní. In A. S. A. C. Cabral and A. D. Rodrigues (eds.), Línguas Indígenas Brasileiras: Fonologia, Gramática e História, pp. 327–337. Belém: Editora Universitária, Universidade Federal do Pará.

*Current project members include Keith Bartolomei, Natalia Chousou-Polydori, Erin Donelly, and Zachary O’Hagan; alumni include Mike Roberts and Vivian Wauters. The work described here has been funded in part by NSF BCS #0966499. Thanks also to Sebastian Drude and Françoise Rose for data-sharing!

Evaluating the linguistic evidence for an Out of America hypothesis

A lively debate has been going on regarding a proposal by German Dziebel, expounded in his recent book, that modern humans originated in the Americas and spread from there to the rest of the world: an Out of America (OOAm) hypothesis to mirror the more widely-accepted Out of Africa (OOAf) hypothesis. The debate has been stimulated by two posts by Dziebel (here and here), which argue that many diverse sources of data suggest that modern humans originated in the Americas, or at the very least, that the available data certainly do not rule this possibility out. Much of the debate in the comment threads focuses on genetic arguments, as one might expect, but I was interested in the linguistic evidence. I asked Dziebel to elaborate on the linguistic evidence, and he kindly responded as follows:

Regarding the relevance of linguistic diversity in the Americas to the problem of the peopling of the Americas, I base myself off of Johanna Nichols’s “Linguistic diversity and the first settlement of the New World.” Language 66:3. (1990) as well as her Linguistic Diversity in Space and Time (1992).

Being stranded in Lima as I am, I have no access to Nichols’ 1992 book, but I was able to get the Language article through JSTOR. In this post my goal is basically to evaluate to what degree the evidence and arguments presented in Nichols (1990), cited by Dziebel in support of the OOAm hypothesis, in fact support this hypothesis. For those who want the Reader’s Digest summary, my conclusions are the following: to a large degree, the basic evidence given in Nichols (1990) is neutral with respect to the OOAm hypothesis or competing hypotheses that place human origins in other continents. However, those parts of the paper that raise arguments relevant to distinguishing various origin hypotheses come down in favor of America as a site of colonization from the Old World, and not as a site from which humans migrated. (Just to be clear: I am not arguing for or against the OOAm hypothesis as a whole, but rather, taking on the much more restricted question of whether the linguistic evidence that Dziebel cites in fact supports the OOAm hypothesis.)

For Dziebel, the interesting point of Nichols (1990) lies in the relatively high linguistic diversity of the Americas and the implications of this diversity for the antiquity of human presence in the Americas. In his comment to me, Dziebel writes:

As measured by the number of independent linguistic stocks, linguistic divergence in the Americas must have taken at least 35,000 years. Of course, this figure cannot be taken literally but there’s a marked contrast between language diversity in the Americas (and in places like Papua New Guinea, with human archaeological record of some 40,000 years) and language diversity in Africa.

Dziebel raises two points here that are based on Nichols (1990). First, the linguistic diversity found in the Americas suggests that the human presence in the Americas goes back at least 35,000 years. And second, the linguistic diversity of the Americas is significantly greater than that found in Africa.

The arguments that Nichols (1990) marshals for the early date for the initiation of human migration to the Americas are very interesting, and rely on converging sources of data. However, the single most important piece of evidence is the sheer number of linguistic stocks found in the Americas. If we follow a uniformitarian assumption about rates of linguistic differentiation, and then calculate the rate of development of distinct stocks in other parts of the world, we are led to the conclusion that there is simply no way that the linguistic diversity we find in the Americas could have developed in the time window given by Clovis-based chronologies that posit that colonization of the Americas began around 12,000 years ago, or more recent accepted chronologies that push that date back to about 20,000 years ago. Pulling together as much linguistic and archeological evidence as she can about migration rates across Beringia and the Bering Straits, Nichols suggests a date of roughly 35,000 years for the initial migrations into the Americas.
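The shape of this uniformitarian calculation can be sketched in a few lines. This is a deliberately simplified model of my own, not Nichols’ actual procedure: it assumes a single founding stock and a constant branching rate, and every number in it is an arbitrary placeholder rather than a figure from Nichols (1990).

```python
import math

def years_to_reach(n_stocks, daughters_per_split, years_per_split, founders=1):
    """Years needed for `founders` stocks to diversify into `n_stocks`,
    if every stock yields `daughters_per_split` stocks each `years_per_split` years."""
    rounds = math.log(n_stocks / founders) / math.log(daughters_per_split)
    return rounds * years_per_split

# Placeholder values only: 8 stocks, each stock doubling every 5,000 years.
toy_estimate = years_to_reach(n_stocks=8, daughters_per_split=2, years_per_split=5000)
```

With these toy values three rounds of doubling are needed, so the estimate is about 15,000 years. The point is only that, for a fixed branching rate, a larger number of stocks forces an earlier starting date; allowing several founding stocks (the `founders` parameter) shortens the required time.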

If we abstract away from the colonization-based scenario that Nichols employs, as Dziebel clearly does, we could argue that Nichols’ calculations support human presence in the Americas from 35,000 years ago, whether due to migration or otherwise. However, this interesting result cannot distinguish between the OOAm hypothesis and hypotheses that place human origins in other continents. It counts as an interesting piece of evidence regarding human presence in the Americas, but does not speak to the validity of OOAm, because it tells us nothing about how these humans got to be in the Americas.

It is worth noting that although Nichols (1990) does indeed argue for an earlier human presence in the Americas than do hypotheses based on physical remains, the entire point of the article is to develop an estimate for the date of human colonization of the Americas, based on linguistic evidence. Dziebel takes the early date for human presence in the Americas presented in the paper as support for the OOAm hypothesis, but discards the fact that this date is given in the context of a model for colonization of the Americas from the Old World.

Let us now take up Dziebel’s second point, which concerns the relative linguistic diversity of the Americas and Africa. Nichols (1990) observes that if one looks at the density of linguistic stocks globally, certain areas, such as New Guinea and South America, show a higher density than other areas, such as Europe. And, as Dziebel correctly notes, the density of the Americas as a whole is higher than that of Africa. But does this fact count as evidence either for or against OOAm? No, not at all.

Dziebel’s interest in the relative linguistic diversity of the Americas and of Africa lies in the supposed ability of linguistic diversity to predict the age of populations:

To summarize, linguistic diversity is a good and straightforward predictor of a population’s age if geography is factored in and if it’s checked against the mtDNA and Y-chromosome picture.

While it is certainly true that, all other things being equal, linguistic diversity in a region increases over time, it does not follow that linguistic diversity is a straightforward indicator of the age of that area’s population. The confounding factor is large-scale language shift. As Nichols argues, there is good reason to believe that in Europe, for example, Indo-European languages replaced pre-Indo-European languages on a massive scale, radically reducing the linguistic diversity of the region.

Of course, Dziebel also mentions the “mtDNA and Y-chromosome picture” — but it’s not clear to me how this is relevant to the utility of using linguistic diversity to estimate the age of a population, unless his following comment gives us a clue:

Linguistic diversity steadily increases with time, unless this process is checked by geography and reversed by population replacements.

So here it appears that Dziebel makes use of the concept of ‘population replacement’ to account for interruptions in the steady growth of linguistic diversity. But of course, language shift need not co-occur with population replacement, entirely disrupting the tidy correspondence between linguistic diversity and the age of populations. In Europe, for example, Nichols argues that Indo-European *languages* replaced pre-Indo-European ones, not that *populations* were replaced. The result was a loss of linguistic diversity. And as the following comment shows, Dziebel seems perfectly aware of this fact:

Translated into the levels of linguistic diversity, Europe experienced periods of language replacement (now it’s dominated by Indo-European languages) but all these replacements originated from the same genetic pool.

But then he concludes:

However the factors of geography and population replacement are subordinate to the factor of spontaneous differentiation because differentiation occurs all the time and everywhere, while geographical constraints and population replacements are accidental events.

What Dziebel seems to be arguing here is that even though we know that language shift occurs (and on vast scales, as in Europe and Africa), at the end of the day, linguistic diversity is still a reliable measure of a population’s age. But this is clearly false, or maybe I am misunderstanding his point. The fact that large-scale language shift occurs, without necessarily significant changes in the *biological* population, means that linguistic diversity is good only as a measure of the amount of time that has transpired *subsequent to* such large-scale linguistic shifts. These shifts largely erase the linguistic history of an area, screening off the population’s age prior to that point from measures based on linguistic diversity.

The fact that such large scale shifts appear to have occurred in Africa and Europe means that measures of linguistic diversity simply cannot tell us very much about the ultimate ages of those populations. Consequently, the fact that the Americas display greater linguistic diversity than Africa tells us nothing about the relative ages of the populations of the two regions. The linguistic diversity evidence that Dziebel cites simply does not bear on the validity of OOAm.

Apart from the linguistic diversity evidence just discussed, Dziebel also cites typological evidence:

The distribution of grammatical features (such as head-marking vs. dependent-marking, numeral classifiers, etc.) again shows a cline from America and Australasia to Africa and Europe, and Nichols’s argued that our perspective on an early human language comes from America and Australasia and not Africa and Europe.

It is certainly true that Nichols (1990) observes that certain typological features appear to cluster in certain geographical areas, and that intermediate areas show intermediate values for the parameters in question. Thus, as extremes, South America shows a very high proportion of head-marking languages, while Europe and Africa show a very high proportion of dependent-marking languages. Intermediate areas, such as Australasia, tend to show either mixed-marking or double-marking. However, the fact that one can identify typological parameters that exhibit a cline of values between the Americas, on the one hand, and Europe and Africa, on the other, tells us little about the locus of modern human origins. By themselves, these linguistic facts are consistent with both OOAm and OOAf scenarios. They simply do not speak to the validity of one hypothesis over the other.

Dziebel also says, however, that “Nichols’s argued that our perspective on an early human language comes from America and Australasia and not Africa and Europe.” Well, if she does so in Nichols (1990), I can’t find it. The closest argument I can find in Nichols (1990) to the one that Dziebel attributes to her is an observation about the relationship between colonization and the preservation of linguistic features. To summarize, Nichols observes that when new areas are colonized, it is not unusual for linguistic features to survive in the colonized area that are subsequently lost in the areas from which the linguistic stocks originally spread. Note, of course, that the languages in the colonized area continue to change, as do all human languages, so it is misleading to characterize them as somehow reflecting “early human languages”. Rather, the languages in question simply preserve some features that were present at the time of colonization, and which tend to get lost in the original area due to language shift. Note, btw, that *were* it possible to show that American languages retain certain features subsequently lost in other parts of the world, this would actually serve as evidence, following Nichols’ arguments, for the Americas having been colonized from the Old World, rather than the reverse, as Dziebel proposes.

Thus far, then, I can find no evidence in Nichols (1990) that supports the OOAm hypothesis. I now wish to briefly review evidence given in the paper that argues against the OOAm hypothesis.

First, linguistic diversity in the Americas tends to increase the further south one goes. Modulo issues of language shift, touched on above, this fact suggests that the older American populations are found in the south, and successively more recent populations are found in Meso-America and North America. These facts are easy to reconcile with a scenario in which populations entered the Americas in the north in stages, with subsequent populations pushing prior ones towards the south. It is not clear how these linguistic diversity facts fit with an OOAm scenario.

Second, Nichols argues that linguistic diversity is, in general, higher in areas that have been colonized than the centers from which colonization occurred (a point to which I alluded above). Nichols argues (p. 487) that this is due to the fact that centers are loci of large scale economies, which result in linguistic spreads that reduce linguistic diversity. The greater linguistic diversity of the Americas is, by this reasoning, supportive of the Americas being a colonized region, and not the OOAm hypothesis.

To summarize, Dziebel cited Nichols (1990) as a source of evidence and arguments that support the OOAm hypothesis. In particular, Dziebel cites linguistic evidence from this work for the antiquity of human settlement in the Americas and for the existence of a typological cline linking the Old World and New. However, neither piece of evidence supports an OOAm scenario over an OOAf scenario (or vice versa). Moreover, other evidence and arguments presented in Nichols (1990) cast doubt on an OOAm scenario. In particular, the evidence regarding linguistic diversity within the Americas is consistent with a process of colonization of the New World by multiple migrations from the north, but is not easy to reconcile with an OOAm scenario. Additionally, Nichols makes arguments regarding the effects of colonization on linguistic diversity which are consistent with the Americas being the site of colonization, but not with the Americas being the point from which the Old World was colonized.

Regardless of the ultimate validity of the OOAm hypothesis, then, the linguistic arguments Dziebel presents in its favor are unconvincing to me. I wish to emphasize that I am restricting my attention to the linguistic arguments, and it is possible that the genetic arguments or those based on kinship terminology provide much better evidence for OOAm. At this point, however, I am led to conclude that the linguistic evidence that Dziebel has presented so far in favor of OOAm is weak.

Of tobacco spirits and tobacco changelings: The etymology of seripigari, Part III

In previous posts (here and here) I have worried the Matsigenka word seripigari ‘shaman’ in an effort to arrive at a decent etymology for the word. I ultimately concluded that the term was originally a compound: seri ‘tobacco’ + pigari ‘seer’, where the head of the compound is a nominalized form of the verb pig ‘hallucinate, see visions’.

Having arrived at what I find to be a fairly satisfactory conclusion to the etymological puzzle presented by seripigari, I now wish to throw a serious wrench into the works: some Matsigenkas, instead of saying seripigari, say seripegari. Moreover, as Chris Beier noted in a comment to my first post on the subject, if we look in the first published dictionary of Matsigenka, Pio Aza’s 1923 Vocabulario español-machiguenga, we actually find the seripegari variant and not the seripigari variant.

At this point, I must admit I find the occurrence of the seripegari variant to be quite mysterious, although I have three hypotheses about the form. Before I go into these in detail, however, I want to observe that cognates of seripigari/seripegari are to be found in all the Kampan languages, suggesting that the term is an old one, and probably reconstructs to Proto-Kampan, which I estimate was spoken some 750-1000 years ago. So it’s important to keep in mind that the history of this term could be quite complex, and it will probably not be possible to lay this issue to rest until a great deal more historical work has been carried out on the Kampan family.

There are two basic ways to account for the Matsigenka facts. The first is to assume that seripegari is essentially the Proto-Kampa form, and that seripigari is an innovation that has spread to certain dialects of Matsigenka. The second is to assume the converse: that seripigari is the original form and that seripegari is the innovation.

So, the first idea for accounting for the seripigari ~ seripegari variation is that the original form of the term in Proto-Kampa was, in fact, seripegari, and that in some varieties of Matsigenka, there was a sound shift from /e/ to /i/. We know for a fact that some Kampan varieties (e.g. certain varieties of Ashéninka) systematically experienced this very change, which suggests that we are on the right track. However, we find the form seripigari even in varieties that did not experience the systematic sound change, such as Nomatsigenga and Matsigenka itself, which raises a problem with the sound change analysis.

On the other hand, if we compare certain forms in Matsigenka with those in the closely-related language Nanti, we do see some /i/:/e/ correspondences: ponchoheni ‘bird sp.’ (Nanti), ponchoini ‘bird sp.’ (Matsigenka); pomerintsih ‘take pains doing something (v.)’ (Nanti), pomirintsi ‘work hard (v.)’ (Matsigenka); taheri ‘tree sp.’ (Nanti), tairi ‘tree sp.’ (Matsigenka). The curious thing about these correspondences is that they appear to be idiosyncratic. That is, they do not seem to be the result of regular sound changes, as it does not appear possible to identify an environment that correctly predicts the alternations. In conjunction with data from other Kampan languages, we can identify the sound change in these idiosyncratic cases as /e/ to /i/ in Matsigenka, but the reason for the sound changes in these isolated instances remains quite mysterious to me. (One possible explanation for this situation is that the Matsigenka references may be mixing forms from more than one dialect, in a way that obscures the systematicity of the sound changes.) So, examples like this seem to give credence to the idea that Matsigenka has undergone some irregular /e/ to /i/ changes, which could account for the seripigari form in Matsigenka, despite the fact that Matsigenka has not undergone a systematic /e/ to /i/ change. However, if we accept the idiosyncratic sound change hypothesis, we would be forced to hypothesize an identical idiosyncratic change in Nomatsigenga, which is not particularly plausible.
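The search for a conditioning environment can be started mechanically. The sketch below aligns the three Nanti/Matsigenka pairs cited above and lists each mismatch with its neighboring segments in the Nanti form. It uses Python’s difflib as a crude character-level aligner, which is an assumption of convenience: it works on orthographic characters rather than phonemes (so digraphs like ‘ch’ and ‘ts’ count as two symbols), and it is no substitute for a proper phonemic alignment.

```python
from difflib import SequenceMatcher

# The Nanti / Matsigenka pairs quoted in the paragraph above.
PAIRS = [
    ("ponchoheni", "ponchoini"),
    ("pomerintsih", "pomirintsi"),
    ("taheri", "tairi"),
]

def differences(nanti, matsigenka):
    """List (nanti_segment, matsigenka_segment, left, right) for each mismatch.

    `left` and `right` are the neighboring characters in the Nanti form;
    '#' marks a word boundary.
    """
    found = []
    matcher = SequenceMatcher(None, nanti, matsigenka, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        left = nanti[i1 - 1] if i1 > 0 else "#"
        right = nanti[i2] if i2 < len(nanti) else "#"
        found.append((nanti[i1:i2], matsigenka[j1:j2], left, right))
    return found

for nanti, matsigenka in PAIRS:
    print(nanti, matsigenka, differences(nanti, matsigenka))
```

On these three pairs the mismatch involving /e/ and /i/ is flanked by o_n, m_r, and a_r respectively, so even this tiny sample shows no single shared environment. With a larger word list, the same tabulation would also need the forms where Nanti /e/ corresponds to Matsigenka /e/, in order to check whether any environment separates the two outcomes.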

Another possibility is that the current distribution of seripigari and seripegari in Matsigenka is due to language contact among Kampan languages. For example, one possibility is that the occurrence of the seripigari variant is a result of relatively recent language contact between Matsigenka and Ashéninka speakers, which has resulted in the displacement of the hypothesized historically prior Matsigenka seripegari variant. This is not as crazy as it might first seem. I have noted, for example, that the Ashéninka word shirampari ‘man’ has displaced the Matsigenka word surari ‘man’ in parts of the Lower Urubamba River valley. I believe that the primary language contact occurred in the Picha River basin (which is an affluent of the Urubamba), where some Ashéninkas resettled in traditionally Matsigenka territories in the 1970s and 1980s to escape the violence of the Shining Path in their home territories to the west. From there, it seems that shirampari spread from Matsigenka speaker to Matsigenka speaker. I’ve heard the word in use as far east as Cashiriari, the uprivermost Matsigenka community on the Camisea River, which is quite far from the Picha Basin. (Note also that there is intense interaction between Nomatsigenga speakers, who also use the seripigari form, and Ashéninka speakers.) In certain respects I think this is a nice explanation, in that it tidily explains why there are two variants of the word in use by Matsigenka speakers. However, we would really need a lot more information to confirm or falsify this hypothesis. At the very least it would be nice to have isoglosses for the two variants. Any records about the date at which seripigari began to be used by Matsigenkas would also be helpful.

Note that if either of the two preceding explanations is basically correct, we would need to completely rethink the etymology of seripigari/seripegari. Following the reasoning in my previous post, I would need to locate an intransitive verbal root peg that is consonant with Kampan ideas about shamanism to serve as the basis for the nominalized head of the compound.

When we do so, however, the options are not particularly promising. The best is peg ‘become invisible’, but Matsigenka shamans are not particularly known for becoming invisible. However, we find that in Ashéninka, the word peyari, which is cognate to Matsigenka pegari, means ‘spirit’ (lit. ‘fantasma’) (Payne 1980, p. 103). So plausibly, the compound seripegari originally meant something like ‘tobacco spirit’. The problem I see with this term is that it would seem to have originally denoted not the shaman, but rather his spirit helpers. Semantic shift is certainly a possibility (consider, for example, the multiple senses of ‘leech’ in traditional European medicinal practice, where the term applied to both the invertebrate and the person who employed them for curing), but I am beginning to feel like I’m stretching here.

It is interesting to note, in this regard, that in Payne’s entry for sheripiyari ‘curandero, hechicero’ (healer, witch doctor) (p. 126), he actually proposes the etymology sheri ‘tobacco’ + peyari ‘fantasma’ (ghost, spirit). So even in Ashéninka we run across a mismatch in the vowel quality between the synchronic term and its supposed components under the etymology we are presently considering. It’s possible that there is a tidy historical explanation for this discrepancy, but at this point I am beginning to feel that the semantic and phonological difficulties piling up for the Proto-Kampa *seripegari hypothesis render this option unattractive, even if we appeal to language contact processes.

So I think that the most plausible hypothesis at this point is that the Proto-Kampa form was indeed *seripigari and that seripegari is an innovation in Matsigenka. The question, then, is why such a change has occurred in certain Matsigenka dialects. It would be nice if there were any evidence of dissimilation phenomena in Matsigenka that could account for this, but I have not come across any signs of such a process. Another possibility is that some Matsigenkas have reanalyzed seripigari as seri + pegari on semantic grounds, a kind of Amazonian eggcorn that subsequently gained currency as a kind of folk etymology. For this hypothesis to have much chance of being correct, we would need to have evidence that ‘transformation’ plays a prominent role in Matsigenka conceptions of shamanism. There is actually some evidence for this, as Allen Johnson notes (pdf):

In the Matsigenka conception a seripigari works by changing places with his spirit helper (or counterpart, or double) among the unseen ones. Working only at night, the seripigari drinks ayahuasca and climbs the ladder or notched pole to his platform (menkotsi) in the roof beams of his house. According to Shepard (1990: 32), the seripigari’s counterpart simultaneously drinks ayahuasca and the two trade places, occupying each other’s bodies. The spirit is now present in this world to help treat those who need his powers.

Under this analysis, then, Matsigenkas have reanalyzed the Proto-Kampan seripigari, originally meaning ‘tobacco seer’, as seripegari ‘tobacco changeling’, or the like. At this point, this is the best hypothesis I have for explaining the seripigari ~ seripegari variation that fits the historical facts for the Kampan family. However, I strongly suspect that further historical work on the Kampan family will reveal complexities I have yet to understand, so I expect to be writing a Part IV post in a couple of years…


Shepard, Glenn. 1990. Health and healing plants of the Matsigenka in Manu, Southeastern Peru. Department of Anthropology, University of California, Berkeley. Ms.

Payne, David. 1980. Diccionario Ashéninca – Castellano. Instituto Lingüístico de Verano.

Snell, Betty. Pequeño Diccionario Machiguenga – Castellano. Instituto Lingüístico de Verano.

What GVPSNA Got Right (And What Others Get Wrong)

Since I have been critical of the use of historical linguistics in Genetic Variation and Population Structure in Native Americans (PLoS Genetics) (GVPSNA), it’s only fair that I point out that they did get one major point very much correct: the contingent nature of the correspondence between genetic relatedness and linguistic relatedness. In Anthropology, this fundamental observation goes at least as far back as Boas, who observed that languages, cultures, and populations each have potentially independent trajectories through time and space. Of course, language, culture, and populations may remain bundled together for periods of time, but this is a fact to be determined by empirical investigation, and cannot be assumed at the outset.

GVPSNA presents a number of intriguing examples of the lack of tidy correspondence between genetic and linguistic relatedness, but let me focus on just one: the fact that the Arawak language-speaking Wayuu are apparently more closely genetically related to the Chibchan language-speaking groups to the west than they are to the Arawak-speaking Piapoco to the south.

What could this mean? Could this be evidence for Arawak cultural imperialism — an Arawak linguo-cultural takeover — as Max Schmidt’s theory of Arawak expansion proposes? The data presented in GVPSNA are far too spotty to permit us to reach any conclusions (genetic data are only presented for two Arawak languages and five Chibchan languages), but the hints are tantalizing, and the direction for further research is clear.

As it turns out, however, not all geneticists are clear on the loose connection between genetic and linguistic relatedness. Let me take an areally relevant example: a 2005 article in Current Anthropology, by Francisco Mauro Salzano, Mara Helena Hutz, Sabrina Pinto Salamoni, Paula Rohr, and Sidia Maria Callegari-Jacques, entitled Genetic Support for Proposed Patterns of Relationship among Lowland South American Languages (GSPPRLSAL).

The opening paragraph gives a sense of the authors’ perspective:

Comparison of different sets of markers to unravel the past of human populations is an established procedure in both anthropology and genetics. Language characteristics can be easily quantified, and the field of comparative linguistics has a respectable past [citations cut]. Therefore, it is only natural that evolutionary geneticists have turned to linguistics to evaluate the population relationships that they have been obtaining with genetic markers.

Well, I don’t know whether such a move is “natural,” but I suspect anyone who follows this strategy is naïve. Linguistic classification simply cannot tell us anything about the genetic relationships between the populations speaking the languages in question. GVPSNA, not to mention Boas long before, makes this point quite clearly.

In any event, the goal of the research reported in the paper is to use genetic data to evaluate three different classificatory proposals (due to Greenberg, Loukotka, and Aryon Rodrigues), linking the Arawak, Carib, Gê, and Tupí families:

The testing of the hypotheses concerning language relationship patterns of Greenberg, Loukotka, and Rodrigues was performed using genetic data by means of a method developed by Cavalli-Sforza and Piazza …

And off they go! Data is presented, algorithms are mentioned, and conclusions are drawn. However, just as linguistic classification tells us nothing about genetic relatedness, neither does genetic relatedness of populations tell us anything about linguistic relatedness of the languages the populations speak. (Incidentally, I find it incongruous that a group of Brazilian geneticists, living in a country of mixed African, European, and Native American heritage, in which Portuguese is the dominant language, could, for a moment, imagine that there is any kind of tidy correspondence between genetic and linguistic relatedness.)

And that, really, should be the end of the post. GVPSNA got an important point about the contingent correspondence between genetic and linguistic relatedness right, and, as we can see, this point is not obvious to everyone working at the intersection of genetic and linguistic classification. Kudos to the authors of GVPSNA.

But what of the conclusions of GSPPRLSAL? According to the authors:

Rodrigues’s hypothesis was the only one not rejected. Other possible tree arrangements representing the relationships among the language families not considered by the three linguists were identified but are irrelevant to the present inquiry.

But then a little later they summarize their results as follows:

Genetic data support other tree configurations besides those proposed by Greenberg, Loukotka, and Rodrigues. This is not surprising, because different kinds of data may produce somewhat different estimates of the history of the populations. Among the three possibilities proposed by these scholars, that of Rodrigues has the best genetic support.

Hmm. Ok, so, of the three initial proposals, Rodrigues’ fares best, but the genetic data actually supports totally different classifications as well. I fail to understand why “this is not surprising”, or what the comment about “different kinds of data … produc[ing] different estimates of the history of the populations” means. Maybe it’s my ignorance about genetics speaking, but something smells fishy here.

In any event, it seems to me that if one buys the utility of genetic data in supporting linguistic classification (and there is no good reason to do so in the opinion of most historical linguists, I believe), then the fact that Rodrigues’ proposal is only one of a number of logically possible classifications supported by the genetic data should be a big deal! We don’t even know if Rodrigues’ classification fares better than some of the unmentioned alternatives.

Of course, at the end of the day, none of this really matters, because the genetic data tells us nothing about linguistic classification.

The fact that articles like this and GVPSNA get published in solid journals with major linguistic blunders in them just makes me wonder about the peer review process. Doesn’t it occur to editors that, when reviewing an article involving linguistic classification, it wouldn’t be a bad idea to get a historical linguist involved?

Genetics meets Voodoo Historical Linguistics: Genetic Variation and Population Structure in Native Americans

The process of the settlement of the Americas is one of those long-standing and fascinating research questions that can probably only be properly tackled by bringing to bear the tools of multiple disciplines: archeology, historical linguistics, and biology — especially genetic analyses of Native American populations. I was excited to see, therefore, a recent study, Genetic Variation and Population Structure in Native Americans (PLoS Genetics), that sought to use information on genetic variation in Native American populations to develop and test hypotheses about the question of prehistoric migration in the Americas.

There is much to chew on in this interesting article, and I have some queries on methodological issues related to the genetics discussed in the article, but in this post I want to comment on the use the authors made of historical linguistics. Most of the article is devoted to analyses of genetic samples from various indigenous peoples of the Americas, but one section is entitled “Genes and Languages”. The first sentence of this section reads:

We compared the classification of the population into linguistic “stocks” with their genetic relationships as inferred on a neighbor-joining tree constructed from Nei genetic distances.

When I saw the word “stocks”, my eyebrows went up, and I read on:

In a neighbor-joining tree, a reasonably well-supported cluster (86%) includes all non-Andean South American populations, together with the Andean-speaking Inga population from southern Colombia. Within this South American cluster, strong support exists for separate clustering of Chibchan-Paezan (97%) and Equatorial-Tucanoan (96%) speakers (except for the inclusion of the Equatorial-Tucanoan Wayuu population with its Chibchan-Paezan geographic neighbors, and the inclusion of Kaingang, the single Ge-Pano-Carib population, with its Equatorial-Tucanoan geographic neighbors).

Chibchan-Paezan? Equatorial-Tucanoan? Ge-Pano-Carib? Uh-oh, I thought, it looks like the authors are using Greenberg’s classification of the languages of the Americas. The citations confirmed it: Greenberg (1987) and Ruhlen (1991) are their main linguistic references. I was stunned.

The authors are geneticists, and not historical linguists specializing in the Americas, so they are probably blissfully unaware of the fact that Greenberg’s classification (which Ruhlen essentially repeats) has been severely criticized by Americanist historical linguists, and is regarded by most of them as unreliable at best. They may exist, but I’ve never met an Americanist who finds Greenberg’s classification even vaguely plausible. But the authors thank Merritt Ruhlen for assistance in their acknowledgements section, which indicates at least one source for their linguistic advice.

The problematic nature of the use of Greenberg’s classification is nicely, if subtly, indicated by the following observation by the authors:

As the use of a single-family grouping (Amerind) of all languages not belonging to the Na-Dene or Eskimo-Aleut families is controversial [here they cite Bolnick et al. 2004], we focused our analysis on the taxonomically lower level of linguistic stocks.

To say that Amerind is “controversial” is an understatement, but never mind that for now. As Lyle Campbell points out, even Greenberg and Ruhlen admit that they have greater confidence in the Amerind supergroup than they do in the accuracy of the subgroupings within Amerind:

Moreover, there is some reason to believe that not even Greenberg and Ruhlen have strong faith in the validity of these eleven groupings, since they repeatedly mentioned their belief that the overall Amerind construct “is really much more robust than some [of these eleven] lower branches of Amerind” (Ruhlen 1994b:15; see Greenberg 1987:59). (Campbell 1997: p.328)

The Greenberg citation in question reads:

The validity of Amerind as a whole is more secure than that of any of its stocks.

So, the authors of GVPSNA think that Amerind is too controversial to be used in their paper, but Greenberg and Ruhlen think that Amerind is “more robust” and “more secure” than the “taxonomically lower level of linguistic stocks” used in GVPSNA. Simple transitivity means that the authors should not trust the lower-level stocks either.

The root problem with the lower-level groupings in Amerind is that even if the method of mass lexical comparison (MMLC) used by Greenberg and Ruhlen is viable (and there are not many historical linguists who would defend this position), the method is (as Bill Poser, among many others, has pointed out) incapable of defining subgroupings. The very best that MMLC can do (and once again, historical linguists have grave doubts even here) is show that a group of languages is related. It cannot elucidate subgroupings within that group of related languages.

I’ll save the explanations for the flaws in MMLC and its inability to define subgroupings for another post, but we see in the case of GVPSNA both a pervasive problem and an opportunity. The pervasive problem is that literacy in linguistics is low both among laymen and in other scientific disciplines, a horse long ago beaten to death over at Language Log (the horse in question is, unfortunately, undead, and requires periodic new beatings). The opportunity is twofold: first, it’s clear that linguistics has something to offer scientists in other fields, which is nice; and second, getting the word out about the state of the art in linguistics gives linguists a great way to achieve world domination. Fast.

Works Cited

Bolnick, D. A., B. A. Shook, L. Campbell, and I. Goddard. 2004. Problematic use of Greenberg’s linguistic classification of the Americas in studies of Native American genetic variation. American Journal of Human Genetics 75: 519–522.

Campbell, Lyle. 1997. American Indian Languages: The historical linguistics of Native America. Oxford University Press.

Greenberg, Joseph. 1987. Language in the Americas. Stanford University Press.

Ruhlen, Merritt. 1991. A guide to the world’s languages. Volume 1: Classification. Stanford, CA: Stanford University Press.