This post describes the use of a phylogenetic analysis program, Mesquite, to identify possibly erroneous cognacy judgments in large lexical datasets. I’ve found it to be a very useful tool, and I haven’t heard other linguists talk about it a great deal, so I thought it might be interesting for others to hear about. But first, some background…

For the past couple of years the Berkeley Comparative Tupí-Guarani Project* has been working to develop an improved internal classification of the Tupí-Guaraní (TG) family. By this point we have, among other things, collected lexical data on 30 TG languages (plus Awetí and Mawe, two non-TG Tupian languages, to serve as out-group languages), using a 539-item comparative list, and arranged these data into approximately 1300 non-singleton cognate sets. We will, in the not-too-distant future, start constructing correspondence sets in order to begin applying the Comparative Method to this dataset, but in the meantime, we are running computational phylogenetic analyses on the lexical data to obtain a preliminary internal classification. What we obtain are trees like the following:

An inferred phylogeny of TG languages based on lexical cognate sets

This is actually a pretty credible TG tree (although it may not, of course, be entirely correct): it largely reproduces the basic groups of Rodrigues (1984/5) and the proposed subgroups of Rodrigues and Cabral (2002), along with additional structure that seems plausible if you have, like us, been spending a lot of time looking at TG lexical and morphological data. (It also yields a very sensible model for the geographical dispersal of the family, but that’s a matter for another day.) One weakness of the phylogenetic result, however, is the support values for certain subgroups. Support values correspond roughly to the probability that a given subgroup is, in fact, a subgroup, and we have wanted to use a value of 0.85 as our cutoff point for considering a clade (or subgroup) credible. Unfortunately, some of our most interesting subgroups have lower values. For example, the subgroup that corresponds more or less to Groups I+II+III in the Rodrigues classification has a support value of 0.81.

Fortunately, one can increase the support values by improving the reliability of the cognate sets (assuming that they are not already perfect, ha ha). Computationally, lowered support values arise from ‘conflicting signals’, i.e. different sets of evidence that point to different subgroups. So, for example, there is good evidence for our Group I+II+III subgroup, i.e. cognate sets that uniquely define this subgroup, but there are other cognate sets that lead one to want to include other languages in this larger subgroup, or languages from this subgroup in other subgroups, reducing the support for all of the subgroups.
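To make the idea of ‘conflicting signals’ concrete, here is a minimal Python sketch of a pairwise compatibility check (the so-called four-gamete test) on presence/absence characters. This is not what our phylogenetics application does internally; it is just a simple, standard way to see that two cognate sets cannot both fit a single tree without some independent innovation or loss. The taxa (`A` to `F`) and the set names (`SET1` to `SET3`) are invented for illustration and are not from our dataset.

```python
from itertools import combinations

def incompatible(set_a, set_b, taxa):
    """Four-gamete test for two presence/absence characters.

    Each character is given as the set of taxa that have a reflex of the
    cognate set. If all four presence/absence combinations occur across
    the taxa, no single tree can explain both characters with one change
    each, so the two characters send conflicting signals.
    """
    patterns = {(t in set_a, t in set_b) for t in taxa}
    return len(patterns) == 4

def conflicting_pairs(characters, taxa):
    """Return every pair of character names that fails the test."""
    return [
        (a, b)
        for a, b in combinations(sorted(characters), 2)
        if incompatible(characters[a], characters[b], taxa)
    ]

# Hypothetical toy data: six languages and three cognate sets.
TAXA = ["A", "B", "C", "D", "E", "F"]
CHARACTERS = {
    "SET1": {"A", "B", "C"},  # supports a subgroup A+B+C
    "SET2": {"A", "B"},       # nested inside SET1's subgroup: no conflict
    "SET3": {"C", "D"},       # pulls C toward D: conflicts with SET1
}

print(conflicting_pairs(CHARACTERS, TAXA))
```

In this toy case only `SET1` and `SET3` conflict: one of them must involve an independent innovation, a loss, or a mistaken cognacy judgment.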

This kind of conflicting signal can arise from a number of sources, but two important ones are: 1) independent innovations that yield false cognacy; and 2) mistakes in building cognate sets, where two elements are deemed to be cognate when they are not. The latter is, of course, always a risk at this stage in the process, i.e. before complete application of the Comparative Method, since without adequate knowledge of the relevant sound changes, it is possible to treat bogus look-alikes as cognate, and miss true cognates due to changes that obscure cognacy. And in dealing with such a large dataset, human error inevitably comes into play: forms are deemed cognate in the wee hours of a particular morning, which really aren’t credibly cognate by the cold light of day.

Fortunately, we have found a very useful tool for ferreting out potentially bogus cognacy judgments in the form of Mesquite, an application that serves to carry out analyses on inferred phylogenetic trees. Mesquite has many functions, but the relevant one for our purposes is its ‘reconstruction’ of ancestral states. Basically what this function does is to ‘reconstruct’ (i.e. identify) how far back in a phylogenetic tree a given phylogenetic character (in our case, a form that is a member of a particular cognate set) reconstructs, according to the tree that one’s phylogenetics application has inferred. In doing so, it also identifies cases of independent innovation (likewise, according to the inferred tree).
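For readers who want a feel for what such a reconstruction involves, here is a minimal sketch in Python of Fitch parsimony on a binary presence/absence character. It is only an illustration of the general technique: Mesquite offers several reconstruction methods and I am not claiming this is how it works internally. The tree topology and the taxon names (`A` to `H`) are invented, as are the function and variable names.

```python
def fitch(tree, present):
    """Fitch parsimony pass for one presence/absence character.

    `tree` is a nested tuple of two-way splits with taxon names (strings)
    at the tips; `present` is the set of taxa that have a reflex of the
    cognate set. Returns (possible states at this node, minimum number of
    state changes required within this subtree).
    """
    if isinstance(tree, str):
        return {1 if tree in present else 0}, 0
    left, right = tree
    left_states, left_changes = fitch(left, present)
    right_states, right_changes = fitch(right, present)
    shared = left_states & right_states
    if shared:
        # The daughters agree on at least one state: no extra change here.
        return shared, left_changes + right_changes
    # The daughters disagree: one additional change is required.
    return left_states | right_states, left_changes + right_changes + 1

def min_changes(tree, present):
    """Minimum number of gains/losses needed to explain the character."""
    return fitch(tree, present)[1]

# Hypothetical eight-language tree.
TREE = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))

clean = {"A", "B"}            # attested in exactly one small subgroup
scattered = {"A", "C", "E"}   # attested in three unrelated tips

print(min_changes(TREE, clean))      # 1
print(min_changes(TREE, scattered))  # 3
```

A character that needs only one change on the tree reconstructs cleanly to a single subgroup, while one that needs several changes is a candidate for independent innovation, or for a second look at the cognacy judgment.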

One thing that makes Mesquite especially useful is its graphical interface, which allows one to easily spot instances of independent innovation. First, in the following screen shot, one can see a clear instance of a character (KNEE4, presence of forms for ‘knee’ cognate to, e.g. Assuriní de Tocantíns kanawá) that seems to reconstruct quite solidly for one of the robust subgroups in our larger ‘Central’ subgroup.

A Mesquite ‘reconstruction’ for the TG KNEE4 cognates

Next, in the following screen shot, one can see a character (TOE2) that was, according to the ancestral state reconstruction associated with the tree, independently innovated three times: in Chiriguano, Pauserna, and Wayampí. This is a somewhat suspicious state of affairs, suggesting that it might make sense to look at the cognate set again. Doing so we see that the word for ‘toe’ in these languages is actually a compound meaning something like ‘foot head’. Body-part compounds with ‘head’ or ‘bone’ are fairly common in TG languages, suggesting that these forms for ‘toe’ are independently innovated, based on (true) cognates for ‘foot’ and ‘head’. On this basis we exclude this compound as informative for purposes of phylogenetic analysis. And note that the pattern evident in the ‘reconstruction’ is precisely the kind of conflicting signal that might lower the support for subgroups like Central and Peripheral.

A Mesquite ‘reconstruction’ for the TG TOE2 cognates

Examining suspicious ‘reconstructions’ like the TOE2 one has led us to identify previously unnoticed complex forms, as in this case, as well as instances of poor cognacy judgments. And having identified several dozen problematic sets in this way, we have high hopes that our next TG tree will have the support values that we are pining for. We’re keeping our fingers crossed, and I’ll post our next set of results.

In any case, I’ve found Mesquite to be such a wonderful tool for evaluating cognate sets in the context of phylogenetic analysis that I wanted to share it with others who might not be familiar with it. (And thanks, Natalia, for introducing it to the TG group!)

References

Rodrigues, A. D. 1984/1985. Relações internas na família lingüística tupí-guaraní. Revista de Antropologia 27/28, 33–53.

Rodrigues, A. D. and A. S. A. C. Cabral. 2002. Revendo a classificação interna da família tupí-guaraní. In A. S. A. C. Cabral and A. D. Rodrigues (eds.), Línguas Indígenas Brasileiras: Fonologia, Gramática e História, pp. 327–337. Belém: Editora Universitária, Universidade Federal do Pará.

*Current project members include Keith Bartolomei, Natalia Chousou-Polydouri, Erin Donelly, and Zachary O’Hagan; alumni include Mike Roberts and Vivian Wauters. The work described here has been funded in part by NSF BCS #0966499. Thanks also to Sebastian Drude and Françoise Rose for data-sharing!

One of my favorite new blogs on the linguistics scene is Diversity Linguistics Comment, which presents itself as

… a scholarly blog that discusses current issues in language typology and language description, written by linguists for other linguists. The notion of “diversity linguistics” recognizes the close connections between the enterprises of language comparison and analysis of particular languages. Topics include grammatical structures (syntax and morphology, phonology), language contact, language change in a comparative perspective, and genealogical linguistics.

Posts appear somewhat infrequently, but they are always substantive and interesting, often accompanied by equally meaty comment threads. As an example of the fare provided, consider Simeon Floyd’s recent post on Quechua adjectives (here), which engages with the debate over the universality of word classes, especially as this question intersects with descriptive linguistic practice. The post focuses on Simeon’s own work on Quechua adjectives, and Martin Haspelmath’s criticism of Simeon’s (and others’) conclusions. I find it to be a very thoughtful piece that provides a nice example of the subtle issues involved in applying putatively cross-linguistically valid labels like ‘adjective’ to language-specific word classes, and also shows how attention to naturally-occurring discourse can play a crucial role in grammatical analysis.

In the context of work that I’ve been carrying out on 17th and 18th century Omagua society, I’ve come to be interested in the etymology of the names of Omagua communities mentioned in the Jesuit records of the period, including those found on Samuel Fritz’s map (a not awful copy is available here). A brief inspection of these names as given in these sources (e.g. Zuruité, Iviraté, Yoaivaté, and Aruparaté) reveals that most of them appear to consist of a nominal root (e.g. zurui = /surui/ ‘catfish sp.’ or ivira = /ɨwɨra/ ‘tree’) and a suffix -té. What, though, is the suffix in question?

An obvious first guess for a Tupí-Guaraní (TG) specialist would be the suffix -eté, found in most TG languages, which expresses meanings like ‘true’ or ‘real’. This suffix surfaces, for example, in the Paraguayan Guaraní word yawareté ‘jaguar’ (lit. ‘true jaguar’, to distinguish it from yawar, which now means ‘dog’, due to semantic shift). And indeed one finds cognates to this suffix in Omagua, as in the word Omaguayete ‘true Omaguas’, recorded in Jesuit sources, which was apparently the autonym for the Omagua group that lived on the Upper Napo (see here for details). But for two reasons, it seems unlikely that the -te found in the toponyms in question is the same suffix. In the first place, the meaning of the form that would be derived seems implausible and unmotivated: why would the Omaguas have wanted to call their communities ‘True Catfish’ or ‘True Tree’? Second, the form of the suffix in Omaguayete doesn’t quite seem to match that of the toponyms in question, since in the former it appears to be -yete, but in the toponyms, -te. Even if one argued, as one might want to (see below), that the form of the suffix in the toponyms is underlyingly -ete, it is unclear why vowel hiatus would have been resolved by vowel deletion in the toponyms, but not in the Omagua subgroup autonym.

By chance, however, I recently noticed that in at least one TG language, Parintintín, a suspiciously similar suffix, -ete, is used to derive river names (Betts 1981). One possibility, then, is that -te did the same thing in Old Omagua (it no longer does), and that the 17th and 18th century community names were ultimately river names. Nothing would be more natural, in fact: indigenous Amazonian communities very frequently take as their names the names of nearby small tributaries. On this view, then, -te was an endocentric derivational suffix that derived hydronyms. Whether -te derived forms that denoted rivers in general or small rivers is unclear at this point. Although every Amazonian language I’ve done substantial work with exhibits hydronymic derivational morphology like this, some languages exhibit two (or more) morphemes that distinguish the size of the river, while others don’t. For example, Máíhɨ̃ki distinguishes two sizes of river (-ya ‘river’ and -gaya ‘creek’), but Iquito exhibits only a single hydronymic derivational suffix (-mu).

There are two things that would need to be done to properly evaluate the hypothesis sketched out above. First, it would be good to see if there are tributaries in former Omagua territory that actually bear names with the -te suffix. Of course, there has been a lot of toponymic turnover since the 18th century, when the Omaguas were decimated by disease and Portuguese slave raids, and largely abandoned their former territories. Names of Quechua origin have probably replaced most of the older Omagua names, but some traces may remain. I don’t have maps on hand of the necessary detail, but my big 1:3,300,000 scale map of the Amazon basin reveals one such tributary, Puruté, suggesting that some progress could be made here.

The second step would be to see if other TG languages exhibit a hydronym-deriving suffix cognate to Parintintín -ete. This would help reassure us that the resemblance between the Omagua and Parintintín suffixes is not a chance similarity.

References

Betts, L. V. 1981. Dicionário Parintintin-Portugues Portugues-Parintintin. Cuiabá: Summer Institute of Linguistics.

A new volume in the occasional Survey (of Californian and Other Indian Languages) Reports series was just published, and is available online here. The volume, entitled Structure and contact in languages of the Americas, was edited by John Sylak-Glassman and Justin Spence, and includes a number of very interesting articles on South and Central American languages, as evident in the table of contents, reproduced below:
  • Subgrouping in the Tupí-Guaraní family: A phylogenetic approach by Natalia Chousou-Polydouri and Vivian Wauters
  • A ‘perfect’ evidential: The functions of -shka in Imbabura Quichua by Jessica Cleary-Kemp
  • Hierarchies, subjects, and the lack thereof in Imbabura Quichua subordinate clauses by Clara Cohen
  • One -mi: An evidential, epistemic modal, and focus marker in Imbabura Quechua by Iksoo Kwon
  • The stops of Tlingit by Ian Maddieson and Caroline L. Smith
  • The plank canoe of southern California: Not a Polynesian import, but a local innovation by Yoram Meroz
  • Variable affix ordering in Kuna by Lindsey Newbold
  • Passive constructions in Kʷak̓ʷala by Daisy Rosenblum
  • Dialect contact, convergence, and maintenance in Oregon Athabaskan by Justin Spence
  • Affix ordering in Imbabura Quichua by John Sylak-Glassman