WeSay: A tool for community-participatory lexicography?

In the most recent volume of Language Documentation and Conservation there is an article (here) about a piece of lexicographical software called WeSay. The interesting thing about WeSay is that it is designed to be used by lay members of language communities — rather than professional linguists — to build dictionaries of their own languages. There are obvious reasons why this application would be interesting to many of us involved in language documentation, but I want to relate a personal experience that indicates why something like WeSay would definitely fill a serious gap in the range of currently available lexicographical software.

One of the major goals of the Iquito Language Documentation Project, in which I participated, was to integrate trained community members (‘community linguists’) into the day-to-day research activities of the project. One area in which we felt the community linguists could be especially productive was lexicography, especially since the tasks involved squared with the community linguists’ personal interests in the language. The question that immediately arose was how to coordinate the community linguists’ lexicographical work with the task of building the Shoebox lexical database.

Our first idea was simply to teach the community linguists how to enter data into the Shoebox database. We were already in the process of teaching the community linguists how to use PC laptops and word-processing software, so we thought that extending their training to include Shoebox would be a relatively straightforward matter. Unfortunately, this did not turn out to be the case. Shoebox can be difficult to use even for individuals with considerable computer-related experience, and for the community linguists, who were learning to use computers for the first time in their lives, the application proved far too finicky and difficult to use.

One of our team members had significant programming experience, however, and suggested that he write a front end for Shoebox that would considerably simplify the community linguists’ interactions with the database. The idea was a good one, but I had two misgivings. First, I was concerned that regardless of how foolproof the front end seemed, over the course of the nine months we were away from the community, and the community linguists were working on the dictionary independently, *something* unforeseen would happen with the front end, and bring work to a halt. Second, I was concerned that the team member who promised to maintain the front end would not stay with the project for its entire duration, and we would be left with a piece of home-grown software that we didn’t know how to modify or fix, should the need arise. We debated the issue at length, but the team leaned towards the front end idea, so we decided to try it.

At first, everything went well. The front end worked very well, and the community linguists found it easy and comfortable to use. The visiting linguists (including me) left at the end of the summer, and it was then that the problems arose. After about four months, something happened that disrupted the connection between the front end and the Shoebox database, and that was that until the team of visiting linguists returned five months later. The community linguists were smart and started entering their data into an Excel spreadsheet, so their work didn’t grind to a halt, but we had to spend a lot of time transferring the data into Shoebox. So, all in all, the front end experiment was not a great success. And, to top it off, the team member who wrote the front end didn’t return — he decided to quit linguistics and go into real estate.

From then on, the community linguists collected their data in notebooks, and every June, when the visiting linguists arrived, we spent many hours entering the data into Shoebox. Hardly a very efficient process, but the best we could manage at the time.

It should be obvious, then, why I was very excited to read about WeSay. Before I provide a brief description, let me add that apart from the LD&C article, information can also be obtained at the WeSay website (www.wesay.org), which includes a page of screenshots and Flash movies that illustrate how the program works (here).

Basically, WeSay is a much better implemented and more comprehensive version of the front end we came up with in the field. The user interface consists of relatively simple forms into which one enters data, and the entire paraphernalia of directories, data field codes, and the like are hidden from view. The program also provides guidance, in terms of semantic fields, to prompt the collection of lexical data, further facilitating the independent work of community members. Significantly, WeSay also provides localization tools, so that the interface can be translated into the locally appropriate language. Despite the simplicity of the interface, however, WeSay can also export in data formats used by more powerful lexicographical software. And note that WeSay is free, open source software, and can be downloaded from the WeSay site.

In many respects, then, WeSay sounds like the answer for those who are interested in linguistic documentation projects with significant community participation. I have yet to try it out myself, but I look forward to doing so when I have the time. If any readers have had personal experience with WeSay, I’d be interested to hear about it.

WALS now online

I just learned (via the etnolinguistica.org list) that the World Atlas of Language Structures (WALS) was recently made available online (here). This useful resource was formerly only available in book and CD format, and it cost several hundred dollars. It is now available for free, and in exploring the new online version, I actually found it easier to work with than the older CD version. At least on my Mac, the user interface for the CD version was fairly small, which gave it a cramped feeling and made it a little pesky to use. The online version, however, makes much better use of the screen, and the layout and navigation seem improved to me.

In case you’ve never used or seen WALS, I encourage you to take a look. Basically, it represents an effort to collate typological information on a large number of languages (2500, they say), present it in an easily searchable manner, and display the results on a map. Each major typological parameter (say, grammatical number) is also accompanied by an essay, which lays out the basic definitions and distinctions involved. But the best way to know how it works is probably just to play around with it. I must admit that I find I just enjoy poking around WALS, even when I don’t have any real work to do with it. It has even helped combat my Amazonia-centric typological provincialism ;).

The Ideophone

If you haven’t done so yet, I recommend visiting The Ideophone, a new blog written by Mark Dingemanse, a PhD student at MPI Nijmegen. So far he has mostly been writing substantial and interesting posts on African languages and expressivity. He has also just written a post on Zotero, a free bibliographic database program with nice web browser integration.

New Issue of Language Documentation and Conservation

The new issue of Language Documentation and Conservation (LD&C) just became available, through the LD&C homepage (http://nflrc.hawaii.edu/ldc/).

If you haven’t yet had a look at this free, online, peer-reviewed journal, I highly recommend you do so now. Articles include thoughtful discussions of matters related to language documentation, and language loss and revitalization. As a bonus, each issue includes a section with reviews of linguistic software.

Shoebox and beyond

Having previously discussed research funding, let me now turn to another question that I get asked with some frequency by people thinking of heading to the field for the first time: what software do I need — and more specifically: what is Shoebox, and what do you think of it? What follows is my very personal opinion about the Shoebox program.

Briefly, Shoebox is a program for creating dictionaries and interlinearized texts, produced by SIL. What do I think of it? Well, I think of Shoebox like that ancient hulk of a car that your uncle gave you for your 18th birthday, that you still use because you can’t afford anything better, that wheezes, rattles, belches smoke and smells like oil and gasoline, which breaks down so regularly that you have to keep a toolbox in the trunk to make repairs by the side of the road — and which you have to pump the gas on when you come to a stop sign so that it doesn’t stall — but which, at the end of the day, gets you to your destination — perhaps not in style, and definitely not in comfort — but gets you there.

Most people I know who do fieldwork use Shoebox, but dream of the day that something better will come along and they can take the rattling hulk out into the back field and leave it there to rust. (Speaking of which, the most recent version of Shoebox was re-christened Toolbox, although it differs in only a few ways. SIL has, more recently, produced a new program, called FLEx, that fills the same basic function as Shoebox/Toolbox, and which is reviewed here. I have not tried it yet, but it looks like a significant improvement over Shoebox in most respects.)

What is Shoebox? Shoebox is essentially a lexical database program joined to a morphological parsing program. The database is designed with dictionary-making in mind, and one of its major virtues is that it includes an export function that permits one to export the dictionary database as a Microsoft Word document that is formatted in a recognizable dictionary style, with headwords and subentries. The database has basic filtering and search functions.
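To make the idea of a plain-text lexical database with filtering a little more concrete, here is a minimal Python sketch. It assumes records made of backslash-coded fields, one field per line, with `\lx` starting each entry, `\ps` for part of speech and `\ge` for an English gloss. Those marker names, the function names, and the two sample entries (invented forms, not data from any real language) are assumptions for illustration only; this is not Shoebox's own code, and a real database will have more fields and complications than this handles.

```python
# Toy lexical database in a backslash-coded plain-text format.
# The markers (\lx, \ps, \ge) and both entries are illustrative assumptions.
SAMPLE = r"""
\lx kasa
\ps n
\ge house

\lx miri
\ps v
\ge walk
"""


def parse_records(text, record_marker="lx"):
    """Split backslash-coded text into a list of {marker: [values]} records."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and any line without a field marker
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = {}  # a record marker opens a new entry
            records.append(current)
        if current is not None:
            current.setdefault(marker, []).append(value.strip())
    return records


def filter_records(records, marker, value):
    """Keep the records in which field `marker` has exactly the value `value`."""
    return [r for r in records if value in r.get(marker, [])]


records = parse_records(SAMPLE)
verbs = filter_records(records, "ps", "v")
print([r["lx"][0] for r in verbs])
```

Keeping each field as a list of values is a deliberate choice in this sketch, since a single entry may well carry more than one gloss or part of speech.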

The morphological parser is intended to split the words in a text into their constituent morphemes and assign glosses and parts of speech to the segmented morphemes. The results of the parsing process are output as interlinearized text. The parser finds morphemes by searching the dictionary database, and a potentially very useful function is that one can set the preferences so that a new dictionary entry is opened whenever the parser comes across a morpheme that it does not find in the dictionary, thereby allowing one to build up the dictionary by parsing texts. One can also in this way build up corpora of parsed texts that can then be searched for glosses or morphemes of interest.
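The general idea of a lexicon-driven parser can be sketched in a few lines of Python. This is not Shoebox's actual algorithm, which I have not seen; it is a simple longest-match-first segmenter with backtracking, written only to illustrate the workflow described above. The lexicon, the glosses, and the word forms are invented, and the function names are hypothetical. Words that cannot be fully segmented are collected in a list, standing in for the step where a new dictionary entry would be opened.

```python
# Toy lexicon: morpheme -> (gloss, category). All forms are invented.
LEXICON = {
    "no": ("1SG", "pfx"),
    "kam": ("go", "v"),
    "ak": ("PFV", "sfx"),
    "i": ("REAL", "sfx"),
}


def segment(word, lexicon):
    """Return the first full segmentation of `word` into known morphemes, or None."""
    if not word:
        return []
    # Try the longest candidate first, backtracking if the remainder fails.
    for end in range(len(word), 0, -1):
        piece = word[:end]
        if piece in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [piece] + rest
    return None


def interlinearize(words, lexicon):
    """Pair each word's segmentation with its gloss line; collect unparsed words."""
    parsed, unknown = [], []
    for word in words:
        morphs = segment(word, lexicon)
        if morphs is None:
            unknown.append(word)  # candidate for a new dictionary entry
            parsed.append((word, "???"))
        else:
            glosses = "-".join(lexicon[m][0] for m in morphs)
            parsed.append(("-".join(morphs), glosses))
    return parsed, unknown


parsed, unknown = interlinearize(["nokamaki", "tsopi"], LEXICON)
for morphemes, glosses in parsed:
    print(morphemes)
    print(glosses)
print("needs a dictionary entry:", unknown)
```

Note that this sketch returns only the first segmentation it finds. A real parser has to cope with words that can be segmented in more than one way, which is where much of the manual effort goes.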

In my view Shoebox’s strengths are its dictionary-format output, and the fact that the lexical database and parser are fairly well integrated. For a long time, there was little else out there that fulfilled these functions so conveniently in a single package, although that is now changing (see below). SIL, who wrote and maintained the program, also made it available for free, which is a good price.

As my griping above suggests, however, I find Shoebox frustrating in several respects.

Perhaps the single greatest weakness of Shoebox is its documentation, which is woefully inadequate. I have known several people who have tried to use Shoebox, but have given up in frustration. The program comes with a tutorial, which is good, as far as it goes, and it has a help feature, but much of what you need to know to use the program is simply not documented anywhere. If you have programming experience or are generally software savvy, you can, with a lot of patience and gnashing of teeth, figure out most of what you need to know (although there are still a number of things that remain mysterious to me, even after several years). Otherwise, I strongly recommend getting some tutoring from someone who has been using the program for a long time. It will save you a lot of time.

Because Shoebox documentation is so spotty, a number of people have written their own notes for set-up, such as this and this. Here is an article that discusses using Shoebox.

The dictionary-formatted output is one of the best things about Shoebox, in my opinion, but it still leaves a lot to be desired. The user has fairly little control over the formatting, unless one is willing to edit the files that control the conversion from the plain-text Shoebox database to Word (or RTF). That is not something that most people feel comfortable about.

The morphological parser works well for languages with agglutinative morphology, and works best when there is little allomorphy or morphophonology. There are ways to handle allomorphy and morphophonology, including a way to input conditional or environmental rules, but I have found that in languages with lots of morphology and complex allomorphy or morphophonology the parser tends to make lots of errors and one has to spend a lot of time telling the parser what to do, or, even worse, one has to spend a lot of time beefing up one’s lexical entries to deal with ambiguities and parsing problems. In Nanti, a Kampan language I work on, the parser works pretty well, but for Iquito, a Zaparoan language I work on, the parser is really not worth the trouble, at least so far. (I suppose this may reflect my lack of computational skill; I’m not a computational linguist, but neither am I computationally illiterate. If a linguist with my abilities is having a hard time getting the parser to perform well, that suggests to me that it’s not very well designed.)
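A small Python sketch may help show why allomorphy makes life hard for a parser of this kind, even when environmental rules are available. It assumes a hypothetical suffix with two allomorphs, `ka` after a vowel-final stem and `ika` after a consonant-final stem. The suffix, the rule, and the word forms are all invented for illustration, and the code has nothing to do with how Shoebox represents its rules internally.

```python
VOWELS = set("aeiou")

# A hypothetical suffix with two allomorphs. Each is paired with a test on
# the stem that must hold for that allomorph to be a legal analysis.
SUFFIX_ALLOMORPHS = [
    ("ka", lambda stem: stem[-1] in VOWELS),       # after a vowel
    ("ika", lambda stem: stem[-1] not in VOWELS),  # after a consonant
]


def split_suffix(word, allomorphs):
    """Return every (stem, allomorph) split whose environment condition holds."""
    analyses = []
    for form, environment_ok in allomorphs:
        if word.endswith(form) and len(word) > len(form):
            stem = word[: -len(form)]
            if environment_ok(stem):
                analyses.append((stem, form))
    return analyses


print(split_suffix("tapaka", SUFFIX_ALLOMORPHS))  # a single analysis
print(split_suffix("matika", SUFFIX_ALLOMORPHS))  # two competing analyses
```

Even with the environment rule in place, `matika` can be split either as `mati` + `ka` or as `mat` + `ika`. Only extra information, such as which stems are actually in the dictionary, can decide between them, and that is exactly the kind of lexical beefing-up that eats up so much time.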

Another major weakness of the parser, in my view, is the formatting of its interlinearized output file. The interlinearized output looks OK on the screen, but it is a very laborious process of cutting, pasting, and reformatting to move it from the Shoebox text file to another document. As a consequence, it’s fairly impractical to use Shoebox to create publishable interlinearized texts (contrast this with the dictionary output).
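For those comfortable with a little scripting, some of the cutting and pasting can in principle be automated. The Python sketch below assumes that the interlinear text is stored as marker-prefixed lines whose columns are lined up with spaces, and that no single cell contains an internal space. The markers `\mb` and `\ge`, the function name, and the sample forms are assumptions for illustration. The output is one tab-separated row per line, which most word processors can convert into a table.

```python
# One interlinear block with space-aligned columns (invented forms).
BLOCK = r"""
\mb no-  kam -ak  -i
\ge 1SG- go  -PFV -REAL
"""


def interlinear_to_rows(block):
    """Turn space-aligned interlinear lines into tab-separated rows of equal width."""
    rows = []
    for line in block.strip().splitlines():
        marker, *cells = line.split()  # drop the field marker, keep the cells
        rows.append(cells)
    width = max(len(cells) for cells in rows)
    # Pad short rows so that every row has the same number of columns.
    return ["\t".join(cells + [""] * (width - len(cells))) for cells in rows]


for row in interlinear_to_rows(BLOCK):
    print(row)
```

The assumption that cells contain no spaces is the weak point here. Multi-word glosses would need a column-position approach instead of simple whitespace splitting.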

Finally, I find that the overall design and organization of the user interface leaves much to be desired. In many cases, important functions are buried in such a way that one has to go through nested sets of dialogue boxes to get at them. Finding functions can also be a chore, since they are sometimes put in pretty obscure places. In addition, text windows and menus are sometimes very small, so that it’s hard to see all one needs to see. As a result of these issues, actually using the program can be frustrating. Maybe this is because I’m a Mac user, but I also find Shoebox’s sensitivity to, and stupidity about, file locations to be frustrating, as it makes transferring databases between files or between machines a tricky affair.

In summary, then, Shoebox is vastly, vastly better than nothing, but I find that the program leaves much to be desired. For a long time, though, it was really the only game in town, so the situation was pretty much “put up or shut up” (or for the computationally-minded among us: write something better). However, the situation is beginning to change…

Moving Beyond Shoebox

As I mentioned above, SIL has released a new package, FLEx, which is intended to replace Shoebox/Toolbox. I have not yet tried it, but it looks like a considerable improvement over Shoebox. It is reviewed here. As soon as I have some time (i.e. when I have finished my dissertation), I plan to take FLEx out for a spin and see how it works.

TshwaneLex is a commercially available lexicography program for creating dictionaries. This review makes it look like a fairly attractive option, except that it’s not free (150 Euros for an academic license). Also, it is not integrated with a parser, the way Shoebox is, so if that is important to you, then TshwaneLex is not for you.

There are some other tools out there, but as far as I can tell, most of them are not yet ready for prime time. The E-MELD School of Best Practice is a great resource for any linguist heading off to the field for the first time, and they have a large quantity of information about software here.