Using Mesquite to inspect cognate sets

October 13, 2013

This post describes the use of a phylogenetic analysis program, Mesquite, to identify possibly erroneous cognacy judgments in large lexical datasets. I’ve found it to be a very useful tool, and I haven’t heard other linguists talk about it a great deal, so I thought it might be interesting for others to hear about. But first, some background…

For the past couple of years the Berkeley Comparative Tupí-Guarani Project* has been working to develop an improved internal classification of the Tupí-Guaraní (TG) family. By this point we have, among other things, collected lexical data on 30 TG languages (plus Awetí and Mawe, two non-TG Tupian languages, to serve as out-group languages), using a 539-item comparative list, and arranged these data into approximately 1300 non-singleton cognate sets. We will, in the not-too-distant future, start constructing correspondence sets in order to begin applying the Comparative Method to this dataset, but in the meantime, we are running computational phylogenetic analyses on the lexical data to obtain a preliminary internal classification. What we obtain are trees like the following:

An inferred phylogeny of TG languages based on lexical cognate sets

This is actually a pretty credible TG tree (although it may not, of course, be entirely correct): it largely reproduces the basic groups of Rodrigues (1984/5) and the proposed subgroups of Rodrigues and Cabral (2002), along with additional structure that seems plausible if you have, like us, been spending a lot of time looking at TG lexical and morphological data. (It also yields a very sensible model for the geographical dispersal of the family, but that’s a matter for another day.) One weakness of the phylogenetic result, however, is the support values for certain subgroups. Support values correspond roughly to the probability that a given subgroup is, in fact, a subgroup, and we have wanted to use a value of 0.85 as our cutoff point for considering a clade (or subgroup) credible. Unfortunately, some of our most interesting subgroups have lower values. For example, the subgroup that corresponds more or less to Groups I+II+III in the Rodrigues classification has a support value of 0.81.
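For readers who have not worked with support values, here is a minimal Python sketch of one common way they arise, assuming the values are posterior probabilities estimated from a sample of trees: the support for a clade is simply the share of sampled trees that contain it. The language labels L1 to L4, the five toy trees, and the function name clade_support are all invented for illustration; only the 0.85 cutoff comes from the text above.

```python
# Each sampled tree is represented by the set of clades it contains,
# and each clade is a frozenset of tip labels (hypothetical languages).
CUTOFF = 0.85  # the credibility threshold mentioned in the post


def clade_support(sampled_trees, clade):
    """Fraction of sampled trees in which `clade` appears as a subgroup."""
    clade = frozenset(clade)
    hits = sum(1 for clades in sampled_trees if clade in clades)
    return hits / len(sampled_trees)


# Five toy trees over the tips L1..L4 (invented, not project data).
sample = [
    {frozenset({"L1", "L2"}), frozenset({"L1", "L2", "L3"})},
    {frozenset({"L1", "L3"}), frozenset({"L1", "L2", "L3"})},
    {frozenset({"L2", "L3"}), frozenset({"L1", "L2", "L3"})},
    {frozenset({"L1", "L2"}), frozenset({"L1", "L2", "L3"})},
    {frozenset({"L1", "L2"}), frozenset({"L3", "L4"})},
]

support = clade_support(sample, {"L1", "L2", "L3"})
print(support, "credible" if support >= CUTOFF else "below cutoff")
```

In this toy sample the clade L1+L2+L3 appears in four of the five trees, so it falls just short of the cutoff, much as the 0.81 subgroup does.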

Fortunately, one can increase the support values by improving the reliability of the cognate sets (assuming that they are not already perfect, ha ha). Computationally, lowered support values arise from ‘conflicting signals’, i.e. different sets of evidence that point to different subgroups. So, for example, there is good evidence for our Group I+II+III subgroup, i.e. cognate sets that uniquely define this subgroup, but there are other cognate sets that lead one to want to include other languages in this larger subgroup, or languages from this subgroup in other subgroups, reducing the support for all of the subgroups.

This kind of conflicting signal can arise from a number of sources, but two important ones are: 1) independent innovations that yield false cognacy; and 2) mistakes in building cognate sets, where two elements are deemed to be cognate when they are not. The latter is, of course, always a risk at this stage in the process, i.e. before complete application of the Comparative Method, since without adequate knowledge of the relevant sound changes, it is possible to treat bogus look-alikes as cognate, and to miss true cognates due to changes that obscure cognacy. And in dealing with such a large dataset, human error inevitably comes into play: forms are deemed cognate in the wee hours of a particular morning, which really aren’t credibly cognate by the cold light of day.

Fortunately, we have found a very useful tool for ferreting out potentially bogus cognacy judgments in the form of Mesquite, an application that serves to carry out analyses on inferred phylogenetic trees. Mesquite has many functions, but the relevant one for our purposes is its ‘reconstruction’ of ancestral states. Basically what this function does is to ‘reconstruct’ (i.e. identify) how far back in a phylogenetic tree a given phylogenetic character (in our case, a form that is a member of a particular cognate set) reconstructs, according to the tree that one’s phylogenetics application has inferred. In doing so, it also identifies cases of independent innovation (likewise, according to the inferred tree).
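To give a feel for what such a reconstruction computes, here is a minimal Python sketch using Fitch parsimony on a single presence/absence character. This is an assumption made for illustration: the post does not say which reconstruction method was used in Mesquite, and the eight-language tree, the labels L1 to L8, and the function name fitch are all hypothetical. The count it returns is the minimum number of state changes the tree requires, so a cognate set confined to one subgroup needs a single change, while a set scattered across the tree needs several.

```python
def fitch(node, states):
    """Fitch parsimony, bottom-up pass.

    `node` is either a leaf name (str) or a pair (left, right) of subtrees.
    `states` maps each leaf name to 1 (cognate present) or 0 (absent).
    Returns (set of most-parsimonious states at this node, changes so far).
    """
    if isinstance(node, str):  # leaf: its state is observed directly
        return {states[node]}, 0
    left, right = node
    left_states, left_changes = fitch(left, states)
    right_states, right_changes = fitch(right, states)
    shared = left_states & right_states
    if shared:  # children can agree: no extra change needed here
        return shared, left_changes + right_changes
    # children disagree: one change is required on a branch below this node
    return left_states | right_states, left_changes + right_changes + 1


# Hypothetical binary tree over eight invented languages.
tree = ((("L1", "L2"), ("L3", "L4")), (("L5", "L6"), ("L7", "L8")))

# A well-behaved character: present only in the L1+L2 subgroup.
clean = {"L1": 1, "L2": 1, "L3": 0, "L4": 0,
         "L5": 0, "L6": 0, "L7": 0, "L8": 0}

# A suspicious character: present in three languages far apart on the tree.
scattered = {"L1": 1, "L2": 0, "L3": 0, "L4": 1,
             "L5": 0, "L6": 0, "L7": 1, "L8": 0}

print(fitch(tree, clean)[1])      # 1
print(fitch(tree, scattered)[1])  # 3
```

A character that needs more than one change on the inferred tree is not necessarily wrong, but it is the kind of case worth a second look.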

One thing that makes Mesquite especially appealing is its graphical interface, which makes it easy to spot instances of independent innovation. First, in the following screen shot, one can see a clear instance of a character (KNEE4, presence of forms for ‘knee’ cognate to, e.g. Assuriní de Tocantíns kanawá) that seems to reconstruct quite solidly for one of the robust subgroups in our larger ‘Central’ subgroup.

A Mesquite ‘reconstruction’ for the TG KNEE4 cognates

Next, in the following screen shot, one can see a character (TOE2) that was, according to the ancestral state reconstruction associated with the tree, independently innovated three times: in Chiriguano, Pauserna, and Wayampí. This is a somewhat suspicious state of affairs, suggesting that it might make sense to look at the cognate set again. Doing so, we see that the word for ‘toe’ in these languages is actually a compound meaning something like ‘foot head’. Body-part compounds with ‘head’ or ‘bone’ are fairly common in TG languages, suggesting that these forms for ‘toe’ are independently innovated, based on (true) cognates for ‘foot’ and ‘head’. On this basis we exclude this compound as informative for purposes of phylogenetic analysis. And note that the pattern evident in the ‘reconstruction’ is precisely the kind of conflicting signal that might lower the support for subgroups like Central and Peripheral.

A Mesquite ‘reconstruction’ for the TG TOE2 cognates

Examining suspicious ‘reconstructions’ like the TOE2 one has led us to identify previously unnoticed complex forms, as in this case, as well as instances of poor cognacy judgments. And having identified several dozen problematic sets in this way, we have high hopes that our next TG tree will have the support values that we are pining for. We’re keeping our fingers crossed, and I’ll post our next set of results.

In any case, I’ve found Mesquite to be such a wonderful tool for evaluating cognate sets in the context of phylogenetic analysis that I wanted to share it with others who might not be familiar with it. (And thanks, Natalia, for introducing it to the TG group!)

References

Rodrigues, A. D. 1984/1985. Relações internas na família lingüística tupí-guaraní. Revista de Antropologia 27/28, 33–53.

Rodrigues, A. D. and A. S. A. C. Cabral. 2002. Revendo a classificação interna da família tupí-guaraní. In A. S. A. C. Cabral and A. D. Rodrigues (eds.), Línguas Indígenas Brasileiras: Fonologia, Gramática e História, pp. 327–337. Belém: Editora Universitária, Universidade Federal do Pará.

*Current project members include Keith Bartolomei, Natalia Chousou-Polydori, Erin Donelly, and Zachary O’Hagan; alumni include Mike Roberts and Vivian Wauters. The work described here has been funded in part by NSF BCS #0966499. Thanks also to Sebastian Drude and Françoise Rose for data-sharing!

2 Responses to “Using Mesquite to inspect cognate sets”

  1. simonnetnz Says:

    Interesting, but you do need to be careful not to be circular – removing the data that doesn’t fit the tree well runs the risk of biasing your results. You should check *all* the data and be consistent (perhaps there are cognates that erroneously fit onto the tree well – how are you checking them?)

    Also, this is much easier to do numerically with the Retention Index or the Consistency Index (RI is better than CI). Anything less than e.g. 0.5 doesn’t fit the tree well. Mesquite can quite easily spit out stats like that for all characters across all trees. You can even calculate the *expected* CI/RI for a given tree size and use that as your filter.

    Finally, are these posterior probabilities or bootstrap support values? Don’t get fixated on pushing values higher – they are what they are, perhaps there’s some process going on that is weakening that signal (borrowing? areal diffusion? fast rates of change?). 0.81 is actually very good support. For a tree with – at a quick count – 34 tips, there are 7.297912e+45 possible permutations. The odds of finding that one particular configuration in 81% of your posterior distribution is pretty spectacular.

    –Simon

  2. Lev Michael Says:

    Hi Simon,

    Thanks very much for your comments. Quite right about the issue of circularity, of course. We *are* checking all the sets, but we are using Mesquite to identify areas that merit special scrutiny.

    We are evaluating the plausibility of cognate sets on the basis of sound correspondences, and I would say that we have not altered the majority of the sets that have ‘weird’ reconstructions. Whatever the explanation for their weirdness, there is no principled reason to split up the forms into different sets. Interestingly, we have ended up being able to merge a number of these apparently problematic sets together: it turns out that in a reasonable number of ‘weird’ cases, semantic shift resulted in two cognate sets in different parts of the data set, which jumped out as odd or problematic via the Mesquite review strategy, and which we were subsequently able to unify in a single set. Quite satisfying, as you might imagine.

    Thanks also for the observation about the Retention Index. I’ll look into that.

    As for the values, they are posterior probabilities. You are quite right that borrowing is at play in certain cases, and as you say, in these cases, the values are what they are, whatever the explanation. But the quality of cognate sets can also be improved as one becomes more familiar with a language family, and Mesquite has certainly focused our attention on certain problematic areas.

    Thanks for your comments once again!

