This post describes the use of a phylogenetic analysis program, Mesquite, to identify possibly erroneous cognacy judgments in large lexical datasets. I’ve found it to be a very useful tool, and I haven’t heard other linguists talk about it a great deal, so I though it might be interesting for others to hear about. But first, some background…
For the past couple of years the Berkeley Comparative Tupí-Guarani Project* has been working to develop an improved internal classification of the Tupí-Guaraní (TG) family. By this point we have, among other things, collected lexical data on 30 TG languages (plus Awetí and Mawe, two non-TG Tupian languages, to serve as out-group languages), using a 539-item comparative list, and arranged these data into approximately 1300 non-singleton cognate sets. We will, in the not-too-distant future, start constructing correspondence sets in order to begin applying the Comparative Method to this dataset, but in the meantime, we are running computational phylogenetic analyses on the lexical data to obtain a preliminary internal classification. What we obtain are trees like the following:
This is actually a pretty credible TG tree (although it may not, of course, be entirely correct): it largely reproduces the basic groups of Rodrigues (1984/5) and the proposed subgroups of Rodrigues and Cabral (2002), along with additional structure that seems plausible if you have, like us, been spending a lot of time looking at TG lexical and morphological data. (It also yields a very sensible model for the geographical dispersal of the family, but that’s a matter for another day.) One weakness of the phylogenetic result, however, is the support values for certain subgroups. Support values correspond roughly to the probability that a given subgroup is, in fact, a subgroup, and we have wanted to use a value of 0.85 as our cutoff point for considering a clade (or subgroup) credible. Unfortunately, some of our most interesting subgroups have lower values. For example, the subgroup that corresponds the more or less to Groups I+II+III in the Rodrigues classification has a support value of 0.81.
Fortunately, one can increase the support values by improving the reliability of the cognate sets (assuming that they are not already perfect — ha ha). Computationally, lowered support values arise from ‘conflicting signals’, i.e. different sets of evidence that point to different subgroups. So, for example, there is good evidence for our Group I+II+III subgroup, i.e. cognate sets that uniquely define this subgroup, but there are other cognate sets that lead one to want to include other languages in this larger subgroup, or languages from this subgroup in other subgroups, reducing support the support for all of the subgroups.
This kind of conflicting signal can arise from a number of sources, but two important ones are: 1) independent innovations that yield false cognacy; and 2) mistakes in building cognate sets, where two elements are deemed to be cognate when they are not. The latter issue is, of course, always a potential issue at this stage in the process, i.e. before complete application of the Comparative Method, since without adequate knowledge of the relevant sound changes, it is possible to treat bogus look-alikes as cognate, and miss true cognates due to changes that obscure cognacy. And in dealing with such a large dataset, human error inevtiably comes into play: forms are deemed cognate in the wee hours of a particular morning, which really aren’t credibly cognate by the cold light of day.
Fortunately, we have found a very useful tool for ferreting out potentially bogus cognacy judgments in the form of Mesquite, an application that serves to carry out analyses on inferred phylogenetic trees. Mesquite has many functions, but the relevant one for our purposes is its ‘reconstruction’ of ancestral states. Basically what this function does is to ‘reconstruct’ (i.e. identify) how far back in a phylogenetic tree a given phylogenetic character (in our case, a form that is a member of a particular cognate set) reconstructs, according to the tree that one’s phylogenetics application has inferred. In doing so, it also identifies cases of independent innovation (likewise, according to the inferred tree).
One thing that makes Mesquite especially nice is that it has a nice graphical interface that allows one to easily spot instances of independent innovation. First, in the following screen shot, one can see a nice instance of a character (KNEE4, presence of forms for ‘knee’ cognate to, e.g. Assuriní de Tocantíns kanawá), that seems to reconstruct quite solidly for one of the robust subgroups in our larger ‘Central’ subgroup.
Next, in the following screen shot, one can see a character (TOE2) that was, according the ancestral state reconstruction associated with the tree, independently innovated three times: in Chiriguano, Pauserna, and Wayampí. This is a somewhat suspicious state of affairs, suggesting that it might make sense to look at the cognate set again. Doing so we see that the word for ‘toe’ in these languages is actually a compound meaning something like ‘foot head’. Body-part compounds with ‘head’ or ‘bone’ are fairly common in TG languages, suggesting that these forms for ‘toe’ are independently innovated, based on (true) cognates for ‘foot’ and ‘head’. On this basis we exclude this compound as informative for purposes of phylogenetic analysis. And note that pattern evident in the ‘reconstruction’ is precisely the kind of conflicting signal that might lower the support for subgroups like Central and Peripheral.
Examining suspicious ‘reconstructions’ like the TOE2 one has led us to identify previously unnoticed complex forms, as in this case, as well as instances of poor cognacy judgments. And having identified several dozen problematic sets in this way, we have high hopes that our next TG tree will have the support values that we are pining for. We’re keeping our fingers crossed, and I’ll post our next set of results.
In any case, I’ve found Mesquite to be such a wonderful tool for evaluating cognate sets in the context of phylogenetic analysis that I wanted to share it with others who might not be familiar with it. (And thanks, Natalia, for introducing it to the TG group!)
Rodrigues, A. D. 1984/1985. Relações internas na família lingüística tupí-guaraní. Revista de Antropologia 27/28, 33–53.
Rodrigues, A. D. and A. S. A. C. Cabral. 2002. Revendo a classificação interna da família tupí-guaraní. In A. S. A. C. Cabral and A. D. Rodrigues (eds.), Línguas Indígenas Brasileiras: Fonologia, Gramática e História, pp. 327–337. Belém: Editora Universitária, Universidade Federal do Pará.
*Current project members include Keith Bartolomei, Natalia Chousou-Polydori, Erin Donelly, and Zachary O’Hagan; alumni include Mike Roberts and Vivian Wauters. The work described here has been funded in part by NSF BCS #0966499 . Thanks also to Sebastian Drude and Françoise Rose for data-sharing!