Dissertations on Amazonian Languages

I recently discovered a very nice resource: an online collection of dissertations and master’s theses on Amazonian languages, here (http://www.etnolinguistica.org/teses). The majority of the dissertations on this page were written by students at Brazilian universities, but there are also several from US and European ones. It’s especially nice that this page has Brazilian dissertations, since those are frequently hard to get hold of outside of Brazil.

Here, to whet your appetite, is a small sample of titles:

Ferreira, Rogério Vicente. 2005. Língua Matis (Pano): uma descrição gramatical. Doutorado, Unicamp.

Freitas, Deborah de Brito Albuquerque Pontes. 2003. Escola Makuxi: identidades em construção. Doutorado, Unicamp.

Santos, Manoel Gomes dos. 2006. Uma gramática do Wapixana (Aruák): aspectos da fonologia, da morfologia e da sintaxe. Doutorado, Unicamp.

Sousa Filho, Sinval Martins de. 2007. Aspectos morfossintáticos da língua Akwe-Xerente. Doutorado, UFG.

Zuccolillo, Carolina Maria Rodriguez. 2000. Língua, nação e nacionalismo: um estudo sobre o Guarani no Paraguai. Doutorado, Unicamp.

Is it time for a ‘new’ Anthropological Linguistics?

Just a few days ago I finally obtained a copy of Unwrapping the Sacred Bundle, which drew my mind back to an issue that has concerned me for several years. Let me explain.

To risk stating the obvious, linguistic form and social action are complexly intertwined: linguistic form is instrumental in social action, and social action both affects the selection of particular elements of linguistic form in communicative interaction, and through the cumulative effects of such selection, drives changes in linguistic form, through processes such as grammaticalization. Nevertheless, there are domains in which, as an idealization, we can usefully treat linguistic form as largely independent of social action, and conversely, social action as largely independent of linguistic form. The viability of these idealizations is evident in the institutionalization of the disciplines of Linguistics, on the one hand, and disciplines like Anthropology and Sociology, on the other. I have no quarrel with the fact that these disciplines are oriented towards research in which the idealizations based on the relative independence of linguistic form and social action hold sway. However, I believe that this institutionalization of the division of the linguistic-formal/social-actional continuum has had an unfortunate effect on the study of the vast middle ground of phenomena for which these idealizations are untenable. Numerous scholars have, of course, recognized that the idealizations in question are problematic in certain respects, leading to the rise of several hybrid sub-disciplines: linguistic anthropology, sociolinguistics, ethnomethodology/conversational analysis, and the sociology of language among them.

Each of these subdisciplines has made important contributions to understanding the middle ground in which linguistic form and social action are irreducibly intertwined, but I also believe that the disciplinary centers of gravity around which they orbit have tended to pull each subdiscipline towards the respective idealizations of independence of linguistic form and social action that characterize the core of each institutionalized discipline. To be clear, this has not affected each sub-discipline’s capacity to do valuable work, as each still focuses on some portion of the linguistic-formal/social-actional spectrum that merits attention. But the overall consequence has been, in my opinion, to create (or recreate, or perhaps, leave) a gap in attention to the middle of the spectrum where linguistic form and social action are so tightly intertwined that serious attention must be paid to both.

This problem is revealed clearly, in the realm of Anthropology, at a number of points in Unwrapping the Sacred Bundle, which examines issues of subdisciplinarity in Anthropology. There are several relevant and thought-provoking passages in this collection, but I’d like to zero in on one in James Clifford’s contribution which speaks directly to the issue of the institutional division of labor with respect to the intersection of linguistic form and social action. Since he articulates the view from the anthropological hill so nicely, I quote him at length:

Perhaps the most dramatic disarticulation of the four fields ensemble [i.e. archaeology, cultural anthropology, linguistic anthropology, and physical anthropology] has taken place with respect to “linguistic anthropology.” Most departments today do not feel the need for a distinct linguistic track or faculty cluster. The study of linguistic process is very much part of anthropological work, but it tends to be seen as one of sociocultural anthropology’s many provinces. Few anthropologists now study “languages” in the sustained descriptive/analytic way that was common to the generation of Sapir or Kroeber. As Silverstein argues (this volume), “Linguistic anthropology is sociocultural anthropology with a twist, the theoretical as well as instrumental (via ‘discourse’ or ‘the discursive’) worrying of our same basic data, semiosis in various orders of contextualization.”

Here Clifford alludes to two related processes: the progressive elimination of linguistic anthropology from anthropology departments, and, in the minority of cases where it survives or flourishes, the ascendance within linguistic anthropology of theoretical concerns and methods dominant in cultural anthropology, and the concomitant marginalization of theory and methods related to linguistic form. As far as I see the disciplinary situation, the convergence of linguistic and cultural anthropology in recent decades is in itself a fine development; there is much interesting work being done in this vein. However, I do see a regrettable side-effect: the emergence of a significant gap in research coverage of a part of the linguistic-social spectrum to which linguistic anthropology used to attend. Specifically, I see a significant gap emerging in the area of the study of linguistic form as a socially-embedded phenomenon — that is, linguistic form as an instrument of social action and conversely, social action as a factor that affects linguistic form.

Lest I be seen as exaggerating the problem, let me point out that there is some work that I believe focuses on precisely the area in question, such as Bill Hanks’s work on referential practice, John Haviland’s work on spatial deixis and evidentiality, and some of Alessandro Duranti’s work on Samoan ethnopragmatics, among others. However, work of this type is becoming rarer in the pages of journals like the Journal of Linguistic Anthropology and Language in Society.

My sense is that linguistic anthropology’s estrangement from the close study of linguistic form constitutes a major change in the orientation of the discipline, and one that is unlikely to be reversed in the near future. It seems to me that those of us who believe that the socially-contextualized study of linguistic form is important and valuable need to find a new intellectual space in which to organize our efforts, and new institutional spaces where such work can be based. As my post title suggests, I am fond of the new-old name ‘Anthropological Linguistics’ as a denomination for a field that concentrates on the socially-contextualized study of linguistic form, but I could imagine others. Regardless, I think the real question is whether Anthropological Linguistics, so defined, can organize itself into a productive community and find an institutional home.

Claire Bowern’s Linguistic Fieldwork Site

Anyone who is preparing for their first serious fieldwork would do well to visit Claire Bowern’s fieldwork site: here. The site is mostly intended as a companion to her new book on linguistic fieldwork, but there is ample material on the site that is useful and/or of interest in its own right, including elicitation stimuli, lists of semantic domains, the odd article (like Himmelmann’s article distinguishing documentary and descriptive linguistics), suggested further reading, and numerous links to online resources. (Note that many of the most interesting things are accessed through the ‘Chapter’ link.)

I think people who are new to linguistic fieldwork will benefit the most from the site, but I suspect even old hands will find a few things of interest.


On the off chance that not everyone reads all the comments, I wanted to mention two recent comments of interest…

Northwest Journal of Linguistics

Tony Webster mentioned another new, free, open access, online linguistics journal: The Northwest Journal of Linguistics, which focuses on the languages of northwest North America. To be sure, even the most liberal definitions of Greater Amazonia don’t stretch the borders that far north, but we mustn’t be too parochial.

The journal started publishing in 2007, and has released four issues with an article apiece. One article caught my eye in particular as having significant import outside the areal linguistics of the northwest: Extending the Prosodic Hierarchy: Evidence from Lushootseed Narrative by David Beck and David Bennett. The basic claim of this article is that in Lushootseed narratives one finds evidence for a multi-utterance prosodic constituent, the prosodic paragraph. One of the nice points about this article is that it incidentally makes a case for the linguistic relevance of particular discourse genres and verbal art. The poetic line is a well-known prosodic constituent associated with verbal artistry, and the authors of this article argue that the prosodic paragraph, typically neglected in treatments of the prosodic hierarchy, has good empirical support. Showing one of the major strengths of online journals, the article also includes sound files of the sections of narrative analyzed in the article.


Nick Thieberger wrote in to mention a project that he is involved with at the University of Melbourne to develop an application (EOPAS) that allows one to export Toolbox text as HTML with time-aligned links to audio files. Discussion of this project, along with several other interesting discussions and links related to interlinearized text and other forms of annotation, can be found at their project wiki: here.

New Issue of Language Documentation and Conservation

The new issue of Language Documentation and Conservation (LD&C) just became available, through the LD&C homepage (http://nflrc.hawaii.edu/ldc/).

If you haven’t yet had a look at this free, online, peer-reviewed journal, I highly recommend you do so now. Articles include thoughtful discussions of matters related to language documentation, and language loss and revitalization. As a bonus, each issue includes a section with reviews of linguistic software.

What GVPSNA Got Right (And What Others Get Wrong)

Since I have been critical of the use of historical linguistics in Genetic Variation and Population Structure in Native Americans (PLoS Genetics) (GVPSNA), it’s only fair that I point out that they did get one major point very much correct: the contingent nature of the correspondence between genetic relatedness and linguistic relatedness. In Anthropology, this fundamental observation goes at least as far back as Boas, who observed that languages, cultures, and populations each have potentially independent trajectories through time and space. Of course, language, culture, and populations may remain bundled together for periods of time, but this is a fact to be determined by empirical investigation, and cannot be assumed at the outset.

GVPSNA presents a number of intriguing examples of the lack of tidy correspondence between genetic and linguistic relatedness, but let me focus on just one: the fact that the Arawak language-speaking Wayuu are apparently more closely genetically related to the Chibchan language-speaking groups to the west than they are to the Arawak-speaking Piapoco to the south.

What could this mean? Could this be evidence for Arawak cultural imperialism — an Arawak linguo-cultural takeover — as Max Schmidt’s theory of Arawak expansion proposes? The data presented in GVPSNA are far too spotty to permit us to reach any conclusions (genetic data are only presented for two Arawak languages and five Chibchan languages), but the hints are tantalizing, and the direction for further research is clear.

As it turns out, however, not all geneticists are clear on the loose connection between genetic and linguistic relatedness. Let me take an areally relevant example: a 2005 article in Current Anthropology, by Francisco Mauro Salzano, Mara Helena Hutz, Sabrina Pinto Salamoni, Paula Rohr, and Sidia Maria Callegari-Jacques, entitled Genetic Support for Proposed Patterns of Relationship among Lowland South American Languages (GSPPRLSAL).

The opening paragraph gives a sense of the authors’ perspective:

Comparison of different sets of markers to unravel the past of human populations is an established procedure in both anthropology and genetics. Language characteristics can be easily quantified, and the field of comparative linguistics has a respectable past [citations cut]. Therefore, it is only natural that evolutionary geneticists have turned to linguistics to evaluate the population relationships that they have been obtaining with genetic markers.

Well, I don’t know whether such a move is “natural,” but I suspect anyone who follows this strategy is naïve. Linguistic classification simply cannot tell us anything about the genetic relationships between the populations speaking the languages in question. GVPSNA, not to mention Boas long before, makes this point quite clearly.

In any event, the goal of the research reported in the paper is to use genetic data to evaluate three different classificatory proposals (due to Greenberg, Loukotka, and Aryon Rodrigues), linking the Arawak, Carib, Gê, and Tupí families:

The testing of the hypotheses concerning language relationship patterns of Greenberg, Loukotka, and Rodrigues was performed using genetic data by means of a method developed by Cavalli-Sforza and Piazza …

And off they go! Data is presented, algorithms are mentioned, and conclusions are drawn. However, just as linguistic classification tells us nothing about genetic relatedness, neither does genetic relatedness of populations tell us anything about linguistic relatedness of the languages the populations speak. (Incidentally, I find it incongruous that a group of Brazilian geneticists, living in a country of mixed African, European, and Native American heritage, in which Portuguese is the dominant language, could, for a moment, imagine that there is any kind of tidy correspondence between genetic and linguistic relatedness.)

And that, really, should be the end of the post. GVPSNA got an important point about the contingent correspondence between genetic and linguistic relatedness right, and, as we can see, this point is not obvious to everyone working at the intersection of genetic and linguistic classification. Kudos to the authors of GVPSNA.

But what of the conclusions of GSPPRLSAL? According to the authors:

Rodrigues’s hypothesis was the only one not rejected. Other possible tree arrangements representing the relationships among the language families not considered by the three linguists were identified but are irrelevant to the present inquiry.

But then a little later they summarize their results as follows:

Genetic data support other tree configurations besides those proposed by Greenberg, Loukotka, and Rodrigues. This is not surprising, because different kinds of data may produce somewhat different estimates of the history of the populations. Among the three possibilities proposed by these scholars, that of Rodrigues has the best genetic support.

Hmm. Ok, so, of the three initial proposals, Rodrigues’ fares best, but the genetic data actually supports totally different classifications as well. I fail to understand why “this is not surprising”, or what the comment about “different kinds of data … produc[ing] different estimates of the history of the populations” means. Maybe it’s my ignorance about genetics speaking, but something smells fishy here.

In any event, it seems to me that if one buys the utility of genetic data in supporting linguistic classification (and there is no good reason to do so in the opinion of most historical linguists, I believe), then the fact that Rodrigues’ proposal is only one of a number of logically possible classifications supported by the genetic data should be a big deal! We don’t even know if Rodrigues’ classification fares better than some of the unmentioned alternatives.
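To get a rough feel for how many “logically possible classifications” are in play, here is a minimal sketch that enumerates every rooted, strictly binary tree over the four families named above. The restriction to binary rooted trees is my own simplifying assumption (real proposals may leave some groupings unresolved), and the function names are purely illustrative; the paper does not say which of these trees its data supported.

```python
def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    # Attach above the current (sub)tree's root.
    yield (tree, leaf)
    # If this is an internal node, also attach somewhere inside either child.
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def rooted_binary_trees(leaves):
    """Return all rooted binary trees whose leaves are exactly `leaves`."""
    trees = [leaves[0]]
    for leaf in leaves[1:]:
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return trees


families = ["Arawak", "Carib", "Gê", "Tupí"]
trees = rooted_binary_trees(families)
print(len(trees))  # 15
for tree in trees[:3]:
    print(tree)
```

Under these assumptions there are 15 candidate groupings for just four families, so a test that fails to reject one of three hand-picked proposals leaves most of the space unexamined.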

Of course, at the end of the day, none of this really matters, because the genetic data tells us nothing about linguistic classification.

The fact that articles like this and GVPSNA get published in solid journals with major linguistic blunders in them just makes me wonder about the peer review process. Doesn’t it occur to editors that, if they are reviewing an article involving linguistic classification, it wouldn’t be a bad idea to get a historical linguist involved?

“The Linguists” at Sundance

Just in case you have missed the blizzard of linguisticky publicity…

Dear Colleagues, Friends, Family, and Supporters of Ironbound Films,

We are nothing short of elated to announce that our documentary feature THE LINGUISTS was selected to world premiere in the newly minted “Spectrum: Documentary Spotlight” category at the 2008 Sundance Film Festival.

THE LINGUISTS is the first documentary supported by the National Science Foundation to ever make it to Sundance.

The trailer is at http://www.thelinguists.com. Here’s a brief synopsis:

It is estimated that of 7,000 languages in the world, half will be gone by the end of this century.

THE LINGUISTS follows David Harrison and Gregory Anderson, scientists racing to document languages on the verge of extinction. In Siberia, India, and Bolivia, the linguists’ resolve is tested by the very forces silencing languages: institutionalized racism and violent economic unrest.

David and Greg’s journey takes them deep into the heart of the cultures, knowledge, and communities at risk when a language dies.

Loreto Regionalism and Indigenous Amazonians

I was recently delighted to discover that Gabel Sotíl has a blog, Tipishca. Gabel Sotíl is a prolific writer and public intellectual based in Iquitos, Peru, who writes on Amazonian environmental issues and indigenous groups, in connection to education and politics in Loreto. One of the especially nice things about his blog is that Gabel uses it to reproduce pieces he has written for local newspapers and magazines, which are all but impossible to obtain outside of Iquitos. Check them out!

Gabel is part of an interesting political movement, which for want of a better term in English I refer to as Loreto regionalism. Some background first: Loreto is the largest departamento (a state-like administrative unit) in Peru, and covers the entire north of Peruvian Amazonia. (Loreto was previously even larger, until 1980, when a large chunk of Loreto was split off as its own departamento, Ucayali.) Loreto is fairly geographically cut off from the rest of Peru, and its capital, Iquitos, is said to be the largest city in the world without road connections (apart from roads to nearby population centers, Iquitos can only be reached by river or air). Loreto has been a major producer of oil since at least the 1970s, from which time, roughly, one can date the emergence of a growing number of regionalist political parties that have pushed for greater regional autonomy. The oldest of these parties is Fuerza Loretana, which has been joined by a host of other parties, such as Unidos Por Loreto, Frente Independiente de Loreto, and Frente Patriótico de Loreto, among many others. The Loreto regionalist movement is largely driven by the perception that the national government’s interest in Loreto (regardless of the party in power) extends little beyond natural resource extraction (especially petrochemicals), and that the needs of Loreto are essentially ignored when it comes to government policy, despite it being the largest departamento in the country (its population is small, however: ~900,000 for the entire department).

One of the most interesting facets of Loreto regionalism to me is that the political movement is tied to a vibrant community of regionalist intellectuals. This group of regionalist intellectuals is mostly located in Iquitos, the region’s capital, and they dedicate much effort to reconceptualizing the relationship between Loreto and the remainder of Peru, and creating what they frame as a properly Amazonian vision for Loreto. A major part of this intellectual project has been to develop an historical and cultural grounding for Loreto independent from the coastal and Andean grounding that occupies a central place in the national imaginary of most Peruvians. This reconceptualization has focused on three major themes: rainforest ecology, indigenous Amazonian societies, and ribereño folklore. The interest of these Iquitos intellectuals in indigenous Amazonian societies is motivated by two concerns: a desire to create a regional historical narrative distinct from the prevalent national narrative originating with the “Inkas”, and a desire to rethink the extractivist economic paradigm that characterizes the national government’s interest in Amazonia.

You can even see this totally different historical grounding of Loreto in the first paragraph of its Wikipedia article:

Lo que es hoy el vasto departamento de Loreto ha sido una región habitada desde los inicios de su poblamiento por una gran diversidad de tribus que lograron un profundo conocimiento de las especies de sus respectivos entornos.

[What is today the vast department of Loreto has been an inhabited region, from the beginnings of its peopling, by a great diversity of tribes that achieved a deep knowledge of the species of their respective environments.]

Passages like the preceding one may not seem particularly revolutionary to the casual observer, but in a nation that (when it looks beyond its mestizo vision of itself) grounds itself in the Inka Empire, the decision to tie the origins of Loreto to Amazonian indigenous peoples represents a significant shift in the way the cultural origins of (part of) Peru are conceived. Many more examples of this conceptual reframing of Loreto history can be found on Tipishca.

In some cases, the yoking of the Loreto regionalist movement to indigenous Amazonian cultures and societies makes use of rather romantic ideas regarding these societies, or consigns them to an originary past, but in many cases it has led to an explicit valorization of Amazonian societies and languages. While anti-indigenous racism is still a significant factor in Loreto, it is now balanced, to a degree I have seen nowhere else in Peruvian Amazonia, by a respect for indigenous Amazonians, their societies, and their languages. Once again, not all the manifestations of this respect are entirely benign (for example, I find the widespread “folklorization” of Amazonian cultures to be quite suspect), but other manifestations have served to open up spaces in public discourse for the concerns of Amazonian peoples. For all its flaws and shortcomings, I find this general receptivity to indigenous Amazonian issues to be unparalleled in the remainder of Peru.

I will close with an illustrative example that I found quite striking: in late 2006, while I was in Loreto, a group of Shuar communities on the upper Pastaza river decided to occupy a number of well-heads and pumping stations owned by PlusPetrol, thereby temporarily bringing oil production in that part of Loreto to a halt. The Shuar have been fighting for many years to have something done about the massive pollution resulting from the petrochemical operations near their communities, and the occupation was an effort to force PlusPetrol and the government to negotiate in earnest. What surprised me about this series of events was the general level of support in Iquitos (where I was at the time), for the Shuar and their actions, both in the local press and among people on the street. This contrasts rather starkly with pollution and accidents PlusPetrol has recently been responsible for in the southern Peruvian Amazon, in the Urubamba River basin. In this latter case, there has been little concern, either in the press or among local mestizos, for the impacts of petrochemical activities on indigenous peoples (principally the Matsigenka). Similarly, there is very little interest in, or respect for, Amazonian peoples among mestizos at the local level, and no serious regionalist movement of the type one finds in Loreto.

I am led to wonder how unusual the political situation in Loreto is, when compared with other regions in Greater Amazonia. In any event, Tipishca is a window on the world of Loreto regionalism and its relation to indigenous concerns, and I recommend it to all Amazonianists.

‘People’ is the Plural of ‘Stupid’

I was recently using the facilities in Epoch when I spied the following graffito: “‘people’ is the plural of ‘stupid’”. There are many things to admire about this pithy cynicism, but there is a specifically linguistic angle from which it can be appreciated, which jogged my memory about an erroneous etymology I recently saw for a Matsigenka word.

Despite the adage that to explain a joke is to ruin it, clarity on this point is essential for what follows: ‘stupid’ is an adjective, and in English, at least, adjectives do not have a plural form. Furthermore, ‘people’ is a noun, whereas ‘stupid’ is an adjective — and the plural form of an adjective (if one exists in a given language) is not, generally speaking, a noun. In other words, there is no way that ‘people’ could be the plural of ‘stupid’. In a way, then, the graffito lamenting the stupidity of the masses is itself stupid (even if it is deliberate), which I find quite charming, as if the cynic is winking at his or her own complicity in the situation being lamented.

The connection with Matsigenka etymology comes in the form of a footnote to a paper (PDF) on Matsigenka religion by Dan Rosengren, in which he remarks, regarding a class of spiritual beings:

Saangaríte, which is a plural form of saankari, is usually translated as “the pure ones” which as a rule is conceived of in a moral sense as synonymous with “the good ones.” … Since saankari also is used to describe clean water it is here suggested that it is the visual rather than the moral quality that is referred to. Clean water cannot be seen and neither can the saangaríte.

To be clear about what Rosengren is saying, he identifies saankari as an adjective (“used to describe clean water”) and then claims that saangaríte is the “plural form” of this word. (I have my doubts, btw, about the long vowel in saankari; I suspect it’s just stress. It’s not important, really, but I spell it with a short /a/. Also, the surface g in saangaríte is due to allophonic post-nasal voicing, and I replace this with the underlying /k/.)
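For readers unfamiliar with the notion, the post-nasal voicing mentioned in the parenthetical can be sketched as a one-line rewrite rule that maps the underlying form to the surface form. This is only a toy illustration: the function name is made up, and limiting the rule to /k/ after m or n is my simplifying assumption, not a full statement of Matsigenka phonology.

```python
import re


def apply_postnasal_voicing(underlying: str) -> str:
    """Rewrite underlying /k/ as surface [g] when it directly follows a nasal."""
    # The lookbehind keeps the nasal itself untouched.
    return re.sub(r"(?<=[mn])k", "g", underlying)


print(apply_postnasal_voicing("sankarite"))  # sangarite
print(apply_postnasal_voicing("kaseta"))     # kaseta (no nasal before the k)
```

Going the other way, from a surface g after a nasal back to /k/, is the normalization applied to the spellings used in the rest of this post.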

What I found humorous about Rosengren’s etymological proposal is that it unintentionally repeats the error that gives the above-discussed graffito its linguistic edge. That is, while sankari is an adjective (according to Rosengren), sankarite is a noun, with the consequence that whatever the relationship between sankari and sankarite may be, the latter is certainly not the plural of the former, contrary to Rosengren’s claim. In fact, sankarite is not plural at all, but rather, is unspecified for number, as are all Matsigenka nouns which are not overtly plural marked (with the plural suffix -egi, or the collective plural -page).

But Rosengren certainly is correct in noting a connection between the two forms — let’s see if by applying a little knowledge of Matsigenka grammar and comparative Arawak linguistics we can figure out what it is.

Let’s start from the beginning. In Matsigenka, and most of the related Kampan languages, adjectives ending with the syllable ri are generally derived from verbs. In this case, the verb root in question is sank ‘be invisible, be transparent, not be visible’. An example of the use of this verb is given in (1).

(1) Komaginaro isankanaka.
‘The Woolly Monkey disappeared from sight.’ (e.g. by brachiating away into the foliage).

I’ve also heard a causativized form of the verb used to mean ‘erase’ (i.e. make invisible), as in (2).

(2) Posankanakero kaseta.
‘Please erase the audio cassette.’

In any event, one can derive from the intransitive verb root sank the adjective sankari ‘invisible, transparent’ which can be applied to clear water and glass, as well as to non-visible entities. This, it would seem at first glance, is the origin of the sankari in sankarite.

But things aren’t quite that straightforward. The single biggest puzzle is that sankarite is a noun, whereas the element sankari that Rosengren identified is an adjective. Moreover, Matsigenka does not exhibit a deadjectival nominalizer.

I think the best clue regarding the correct etymology of sankarite involves the final syllable of the word, te. As it turns out, cognates of the morpheme -te surface in other languages of the Arawak family as an animate noun class marker. In other words, this morpheme indicates that the noun to which it is affixed is, in general, a living thing. A particularly clear example is the morpheme -ite, found in Tariana (Aikhenvald, 2003, pp. 93-4).

If this is correct, then, the animate noun class marker -te would have to be suffixed to a noun, meaning that sankari must be a noun, not an adjective, as Rosengren suggests. Fortunately, as it turns out, we can reconstruct a deverbal nominalizer -ri for Proto-Kampan (the language from which Matsigenka descended). This nominalizer can still be seen in certain forms in Matsigenka, such as matsikanari ‘dark shaman’ (cf. matsik ‘bewitch’) or shigatsiri ‘satellite’, from shig ‘run’. (It is a curious fact that the deverbal nominalizer and the deverbal adjectivizer have the same form, but the generalization is quite clear.) If this is correct, then the sankari in sankarite is not an adjective meaning ‘clear, invisible’, but rather a derived noun meaning ‘clear, invisible thing’.

If we now consider the full form sankarite, we conclude that the name of this class of spirit beings stems from a form meaning, roughly, ‘invisible living things’, or perhaps more evocatively ‘invisible beings’. Which is, as it turns out, a pretty good description of sankarite!
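The proposed derivation can be laid out step by step in a small sketch. The `Morpheme` class, the gloss labels, and in particular the epenthesis rule (insert /a/ whenever two consonants would meet at a morpheme boundary) are my own assumptions, added only to account for the a in sankari; the three morphemes themselves are the ones discussed above.

```python
from dataclasses import dataclass

VOWELS = set("aeio")


@dataclass(frozen=True)
class Morpheme:
    form: str
    gloss: str


def build_word(morphemes):
    """Concatenate morphemes, inserting epenthetic /a/ between two consonants."""
    surface = ""
    for morpheme in morphemes:
        # Assumed rule: break up a consonant cluster at the boundary with /a/.
        if surface and surface[-1] not in VOWELS and morpheme.form[0] not in VOWELS:
            surface += "a"
        surface += morpheme.form
    return surface


root = Morpheme("sank", "be.invisible")
nominalizer = Morpheme("ri", "NMLZ")  # reconstructed deverbal nominalizer
animate = Morpheme("te", "ANIM")      # animate noun class marker

print(build_word([root, nominalizer]))           # sankari
print(build_word([root, nominalizer, animate]))  # sankarite
print("-".join(m.gloss for m in (root, nominalizer, animate)))
```

The last line prints the gloss be.invisible-NMLZ-ANIM, which is just a compact way of writing ‘invisible beings’.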

Works Cited

Aikhenvald, Alexandra. 2003. A grammar of Tariana. Cambridge University Press.

Shoebox and beyond

Having previously discussed research funding, let me now turn to another question that I get asked with some frequency by people thinking of heading to the field for the first time: what software do I need — and more specifically: what is Shoebox, and what do you think of it? What follows is my very personal opinion about the Shoebox program.

Briefly, Shoebox is a program for creating dictionaries and interlinearized texts, produced by SIL. What do I think of it? Well, I think of Shoebox like that ancient hulk of a car that your uncle gave you for your 18th birthday, that you still use because you can’t afford anything better, that wheezes, rattles, belches smoke and smells like oil and gasoline, which breaks down so regularly that you have to keep a toolbox in the trunk to make repairs by the side of the road — and which you have to pump the gas on when you come to a stop sign so that it doesn’t stall — but which, at the end of the day, gets you to your destination — perhaps not in style, and definitely not in comfort — but gets you there.

Most people I know who do fieldwork use Shoebox, but dream of the day that something better will come along and they can take the rattling hulk out into the back field and leave it there to rust. (Speaking of which, the most recent version of Shoebox was re-christened Toolbox, although it differs in only a few ways. SIL has, more recently, produced a new program, called FLEx, that fills the same basic function as Shoebox/Toolbox, and which is reviewed here. I have not tried it yet, but it looks like a significant improvement on Shoebox in most respects.)

What is Shoebox? Shoebox is essentially a lexical database program joined to a morphological parsing program. The database is designed with dictionary-making in mind, and one of its major virtues is that it includes an export function that permits one to export the dictionary database as a Microsoft Word document that is formatted in a recognizable dictionary style, with headwords and subentries. The database has basic filtering and search functions.

The morphological parser is intended to split the words of a text into their constituent morphemes and assign glosses and parts of speech to the segmented morphemes. The results of the parsing process are output as interlinearized text. The parser finds morphemes by searching the dictionary database, and a potentially very useful function is that one can set the preferences so as to open a new dictionary entry whenever the parser comes across a morpheme that it does not find in the dictionary, thereby allowing one to build up a dictionary by parsing texts. One can also in this way build up corpora of parsed texts that can then be searched for glosses or morphemes of interest.
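To make the idea of dictionary-driven parsing concrete, here is a minimal sketch in Python. It is not how Shoebox itself is implemented; it only illustrates the general technique of segmenting a word by looking up pieces in a lexicon. The lexicon entries, the function names, and the example words are all invented for illustration and do not come from any real language.

```python
# Toy lexicon: form -> (gloss, part of speech).
# All entries are invented placeholders, not real language data.
LEXICON = {
    "ta": ("1SG", "pfx"),
    "mi": ("2SG", "pfx"),
    "miro": ("see", "v"),
    "ro": ("go", "v"),
    "ku": ("PST", "sfx"),
}


def segment(word, lexicon):
    """Return every way of splitting `word` into forms found in `lexicon`."""
    if not word:
        return [[]]  # one parse of the empty string: no morphemes
    parses = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in lexicon:
            # Keep this piece and try to parse whatever is left over.
            for rest in segment(word[end:], lexicon):
                parses.append([piece] + rest)
    return parses


def gloss(parse, lexicon):
    """Join the glosses of a parse with hyphens, interlinear style."""
    return "-".join(lexicon[m][0] for m in parse)


def unparsed(words, lexicon):
    """Words with no parse at all: candidates for new dictionary entries."""
    return [w for w in words if not segment(w, lexicon)]


if __name__ == "__main__":
    for parse in segment("tamiroku", LEXICON):
        print("-".join(parse), "=", gloss(parse, LEXICON))
    print("needs entries:", unparsed(["tamiroku", "tazzku"], LEXICON))
```

Even this toy returns two competing segmentations for the same word, which hints at why a lexicon-lookup parser needs a lot of guidance once a language has many short morphemes.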

In my view Shoebox’s strengths are its dictionary-format output, and the fact that the lexical database and parser are fairly well integrated. For a long time, there was little else out there that fulfilled these functions so conveniently in a single package, although that is now changing (see below). SIL, who wrote and maintained the program, also made it available for free, which is a good price.

As my griping above suggests, however, I find Shoebox frustrating in several respects.

Perhaps the single greatest weakness of Shoebox is its documentation, which is woefully inadequate. I have known several people who have tried to use Shoebox, but have given up in frustration. The program comes with a tutorial, which is good, as far as it goes, and it has a help feature, but much of what you need to know to use the program is simply not documented anywhere. If you have programming experience or are generally software savvy, you can, with a lot of patience and gnashing of teeth, figure out most of what you need to know (although there are still a number of things that remain mysterious to me, even after several years). Otherwise, I strongly recommend getting some tutoring from someone who has been using the program for a long time. It will save you a lot of time.

Because Shoebox documentation is so spotty, a number of people have written their own notes for set-up, such as this and this. Here is an article that discusses using Shoebox.

The dictionary-formatted output is one of the best things about Shoebox, in my opinion, but it still leaves a lot to be desired. The user has fairly little control over the formatting, unless one is willing to edit the files that control the conversion from the plain-text Shoebox database to Word (or RTF). That is not something that most people feel comfortable doing.
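Because the database is plain text, one alternative is to read it with a short script and format the entries oneself. The following is a minimal sketch in Python, not a replacement for Shoebox’s own export. It assumes backslash-coded fields with the markers \lx (headword), \ps (part of speech), and \ge (English gloss), and assumes that each \lx starts a new record; a real database may use other markers and multi-line fields, which this sketch ignores. The sample entries and function names are invented.

```python
# Invented sample records in a backslash-coded plain-text layout.
SAMPLE = r"""
\lx miro
\ps v
\ge see

\lx ku
\ps sfx
\ge PST
"""


def parse_records(text):
    """Split backslash-coded text into a list of {marker: value} dicts."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # this sketch skips blank and continuation lines
        marker, _, value = line[1:].partition(" ")
        if marker == "lx":
            # Assumption: every headword field opens a new record.
            current = {}
            records.append(current)
        if current is not None:
            current[marker] = value.strip()
    return records


def format_entry(record):
    """Render one record as a simple dictionary-style line."""
    headword = record.get("lx", "")
    pos = record.get("ps", "?")
    meaning = record.get("ge", "")
    return f"{headword} ({pos}) {meaning}"


if __name__ == "__main__":
    for record in sorted(parse_records(SAMPLE), key=lambda r: r["lx"]):
        print(format_entry(record))
```

The formatting lives entirely in `format_entry`, so changing the look of an entry means editing one small function rather than a conversion file.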

The morphological parser works well for languages with agglutinative morphology, and works best when there is little allomorphy or morphophonology. There are ways to handle allomorphy and morphophonology, including a way to input conditional or environmental rules, but I have found that in languages with lots of morphology and complex allomorphy or morphophonology the parser tends to make lots of errors and one has to spend a lot of time telling the parser what to do, or, even worse, one has to spend a lot of time beefing up one’s lexical entries to deal with ambiguities and parsing problems. In Nanti, a Kampan language I work on, the parser works pretty well, but for Iquito, a Zaparoan language I work on, the parser is really not worth the trouble, at least so far. (I suppose this may reflect my lack of computational skill; I’m not a computational linguist, but neither am I computationally illiterate. If a linguist with my abilities is having a hard time getting the parser to perform well, that suggests to me that it’s not very well designed.)

Another major weakness of the parser, in my view, is the formatting of its interlinearized output file. The interlinearized output looks OK on the screen, but it is a very laborious process of cutting, pasting, and reformatting to move it from the Shoebox text file to another document. As a consequence, it’s fairly impractical to use Shoebox to create publishable interlinearized texts (contrast this with the dictionary output).
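Part of the reformatting chore is simply keeping morphemes and glosses lined up once the text leaves Shoebox. As a rough illustration, and assuming the morpheme and gloss lines have already been split into matching lists, a few lines of Python can rebuild the alignment either with spaces (for a fixed-width font) or with tabs (for pasting into a table). The function names and the example forms are invented.

```python
def align(rows, gap=2):
    """Pad each column to its widest cell so the rows line up in fixed-width text."""
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    separator = " " * gap
    return [
        separator.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]


def to_tabbed(rows):
    """Tab-separated rows, which most word processors can convert to a table."""
    return ["\t".join(row) for row in rows]


if __name__ == "__main__":
    # Invented example: one morpheme line and one gloss line.
    rows = [
        ["ta-", "miro", "-ku"],
        ["1SG-", "see", "-PST"],
    ]
    print("\n".join(align(rows)))
    print("\n".join(to_tabbed(rows)))
```

Both functions expect every row to have the same number of cells; a mismatch between the morpheme and gloss lines would need to be fixed before alignment.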

Finally, I find that the overall design and organization of the user interface leaves much to be desired. In many cases, important functions are buried in such a way that one has to go through nested sets of dialogue boxes to get at them. Finding functions can also be a chore, since they are sometimes put in pretty obscure places. In addition, text windows and menus are sometimes very small, so that it’s hard to see all one needs to see. As a result of these issues, actually using the program can be frustrating. Maybe this is because I’m a Mac user, but I also find Shoebox’s sensitivity to, and stupidity about, file locations to be frustrating, as it makes transferring databases between files or between machines a tricky affair.

In summary, then, Shoebox is vastly, vastly better than nothing, but I find that the program leaves much to be desired. For a long time, though, it was really the only game in town, so the situation was pretty much “put up or shut up” (or for the computationally-minded among us: write something better). However, the situation is beginning to change…

Moving Beyond Shoebox

As I mentioned above, SIL has released a new package, FLEx, which is intended to replace Shoebox/Toolbox. I have not yet tried it, but it looks like a considerable improvement over Shoebox. It is reviewed here. As soon as I have some time (i.e. when I have finished my dissertation), I plan to take FLEx out for a spin and see how it works.

TshwaneLex is a commercially available lexicography program for creating dictionaries. This review makes it look like a fairly attractive option, except that it’s not free (150 Euros for an academic license). Also, it is not integrated with a parser, the way Shoebox is, so if that is important to you, then TshwaneLex is not for you.

There are some other tools out there, but as far as I can tell, most of them are not yet ready for prime time. The E-MELD School of Best Practice is a great resource for any linguist heading off to the field for the first time, and they have a large quantity of information about software here.