Shoebox and beyond
December 4, 2007
Having previously discussed research funding, let me now turn to another question that I get asked with some frequency by people thinking of heading to the field for the first time: what software do I need — and more specifically: what is Shoebox, and what do you think of it? What follows is my very personal opinion about the Shoebox program.
Briefly, Shoebox is a program for creating dictionaries and interlinearized texts, produced by SIL. What do I think of it? Well, I think of Shoebox like that ancient hulk of a car that your uncle gave you for your 18th birthday, that you still use because you can’t afford anything better, that wheezes, rattles, belches smoke and smells like oil and gasoline, which breaks down so regularly that you have to keep a toolbox in the trunk to make repairs by the side of the road — and which you have to pump the gas on when you come to a stop sign so that it doesn’t stall — but which, at the end of the day, gets you to your destination — perhaps not in style, and definitely not in comfort — but gets you there.
Most people I know who do fieldwork use Shoebox, but dream of the day that something better will come along and they can take the rattling hulk out into the back field and leave it there to rust. (Speaking of which, the most recent version of Shoebox was re-christened Toolbox, although it differs in only a few ways. SIL has, more recently, produced a new program, called FLEx, that fills the same basic function as Shoebox/Toolbox, and which is reviewed here. I have not tried it yet, but it looks like a significant improvement over Shoebox in most respects.)
What is Shoebox? Shoebox is essentially a lexical database program joined to a morphological parsing program. The database is designed with dictionary-making in mind, and one of its major virtues is that it includes an export function that permits one to export the dictionary database as a Microsoft Word document that is formatted in a recognizable dictionary style, with headwords and subentries. The database has basic filtering and search functions.
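To give a feel for what such a lexical database looks like underneath, here is a minimal Python sketch that reads a few records from backslash-marked plain text and applies a simple filter. It assumes the common convention in which each field begins with a marker such as `\lx` (headword), `\ps` (part of speech), or `\ge` (English gloss); your own database may use different markers. The sample entries are invented, and the function names `parse_records` and `filter_by` are hypothetical, not part of Shoebox.

```python
from typing import Dict, List

# Invented sample data in a backslash-marker layout (markers are an assumption).
SAMPLE = r"""
\lx kaba
\ps n
\ge house

\lx tiri
\ps v
\ge run
"""


def parse_records(text: str, record_marker: str = "lx") -> List[Dict[str, str]]:
    """Split marker-prefixed text into one dict per record.

    A new record starts at every line whose marker equals record_marker.
    Lines that do not start with a backslash are skipped in this sketch.
    """
    records: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = {}
            records.append(current)
        if current is not None:
            current[marker] = value.strip()
    return records


def filter_by(records: List[Dict[str, str]], marker: str, value: str) -> List[Dict[str, str]]:
    """Return the records whose field `marker` equals `value`."""
    return [r for r in records if r.get(marker) == value]


if __name__ == "__main__":
    entries = parse_records(SAMPLE)
    for entry in filter_by(entries, "ps", "v"):
        print(entry["lx"], "=", entry["ge"])
```

Because the data is plain text, even a short script like this can do basic searching and filtering outside the program, which is handy when the built-in tools fall short.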
The morphological parser is intended to split the words in a text into their constituent morphemes and to assign glosses and parts of speech to the segmented morphemes. The results of the parsing process are output as interlinearized text. The parser finds morphemes by searching the dictionary database, and a potentially very useful function is that one can set the preferences so as to open a new dictionary entry whenever the parser comes across a morpheme that it does not find in the dictionary, thereby allowing you to build up your dictionary by parsing texts. One can also in this way build up corpora of parsed texts that can then be searched for glosses or morphemes of interest.
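The general idea of dictionary-driven segmentation can be illustrated with a toy Python sketch. This is not Shoebox's actual algorithm, which I have not seen; it is simply an exhaustive lookup that returns every way of splitting a word into forms found in a lexicon. The lexicon entries and the names `segment` and `needs_new_entry` are invented for illustration.

```python
from typing import Dict, List

# Invented toy lexicon: form -> gloss.
LEXICON: Dict[str, str] = {
    "ta": "1SG",
    "mi": "see",
    "ko": "PST",
    "tami": "fish",
}


def segment(word: str, lexicon: Dict[str, str]) -> List[List[str]]:
    """Return every segmentation of `word` into forms listed in `lexicon`."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphemes
    parses: List[List[str]] = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in lexicon:
            for rest in segment(word[end:], lexicon):
                parses.append([piece] + rest)
    return parses


def needs_new_entry(word: str, lexicon: Dict[str, str]) -> bool:
    """True when no segmentation exists, i.e. the lexicon has a gap."""
    return not segment(word, lexicon)


if __name__ == "__main__":
    for parse in segment("tamiko", LEXICON):
        glosses = [LEXICON[m] for m in parse]
        print("-".join(parse), "|", "-".join(glosses))
    print(needs_new_entry("tamilu", LEXICON))
```

Even this tiny lexicon yields two competing parses for one word (`ta-mi-ko` and `tami-ko`), which hints at why a lookup-based parser needs a lot of hand-holding once a language has many short morphemes.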
In my view Shoebox’s strengths are its dictionary-format output, and the fact that the lexical database and parser are fairly well integrated. For a long time, there was little else out there that fulfilled these functions so conveniently in a single package, although that is now changing (see below). SIL, who wrote and maintained the program, also made it available for free, which is a good price.
As my griping above suggests, however, I find Shoebox frustrating in several respects.
Perhaps the single greatest weakness of Shoebox is its documentation, which is woefully inadequate. I have known several people who have tried to use Shoebox, but have given up in frustration. The program comes with a tutorial, which is good, as far as it goes, and it has a help feature, but much of what you need to know to use the program is simply not documented anywhere. If you have programming experience or are generally software savvy, you can, with a lot of patience and gnashing of teeth, figure out most of what you need to know (although there are still a number of things that remain mysterious to me, even after several years). Otherwise, I strongly recommend getting some tutoring from someone who has been using the program for a long time. It will save you a lot of time.
The dictionary-formatted output is one of the best things about Shoebox, in my opinion, but it still leaves a lot to be desired. The user has fairly little control over the formatting, unless one is willing to edit the files that control the conversion from the plain-text Shoebox database to Word (or RTF). That is not something that most people feel comfortable about.
The morphological parser works well for languages with agglutinative morphology, and works best when there is little allomorphy or morphophonology. There are ways to handle allomorphy and morphophonology, including a way to input conditional or environmental rules, but I have found that in languages with lots of morphology and complex allomorphy or morphophonology the parser tends to make lots of errors and one has to spend a lot of time telling the parser what to do, or, even worse, one has to spend a lot of time beefing up one’s lexical entries to deal with ambiguities and parsing problems. In Nanti, a Kampan language I work on, the parser works pretty well, but for Iquito, a Zaparoan language I work on, the parser is really not worth the trouble, at least so far. (I suppose this may reflect my lack of computational skill; I’m not a computational linguist, but neither am I computationally illiterate. If a linguist with my abilities is having a hard time getting the parser to perform well, that suggests to me that it’s not very well designed.)
Another major weakness of the parser, in my view, is the formatting of its interlinearized output file. The interlinearized output looks OK on the screen, but it is a very laborious process of cutting, pasting, and reformatting to move it from the Shoebox text file to another document. As a consequence, it’s fairly impractical to use Shoebox to create publishable interlinearized texts (contrast this with the dictionary output).
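For readers who have not worked with interlinear text, the formatting problem is essentially one of keeping each gloss lined up under its morpheme. A minimal Python sketch of that alignment follows, assuming you already have the morphemes and glosses as parallel lists; it is not a converter for Shoebox's own files, and the name `align` and the sample forms are invented.

```python
from typing import List, Tuple


def align(morphemes: List[str], glosses: List[str], gap: int = 2) -> Tuple[str, str]:
    """Pad each morpheme/gloss pair to a shared column width."""
    if len(morphemes) != len(glosses):
        raise ValueError("need exactly one gloss per morpheme")
    # Each column is as wide as its longer member, plus a gap.
    widths = [max(len(m), len(g)) + gap for m, g in zip(morphemes, glosses)]
    top = "".join(m.ljust(w) for m, w in zip(morphemes, widths)).rstrip()
    bottom = "".join(g.ljust(w) for g, w in zip(glosses, widths)).rstrip()
    return top, bottom


if __name__ == "__main__":
    morph_line, gloss_line = align(["ta", "mi", "ko"], ["1SG", "see", "PST"])
    print(morph_line)
    print(gloss_line)
```

Space-padded alignment like this only holds up in a monospaced font, which is part of why pasting interlinear text into a word processor so often means redoing the layout by hand with tabs or tables.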
Finally, I find that the overall design and organization of the user interface leaves much to be desired. In many cases, important functions are buried in such a way that one has to go through nested sets of dialogue boxes to get at them. Finding functions can also be a chore, since they are sometimes put in pretty obscure places. In addition, text windows and menus are sometimes very small, so that it’s hard to see all one needs to see. As a result of these issues, actually using the program can be frustrating. Maybe this is because I’m a Mac user, but I also find Shoebox’s sensitivity to, and stupidity about, file locations to be frustrating, as it makes transferring databases between files or between machines a tricky affair.
In summary, then, Shoebox is vastly, vastly better than nothing, but I find that the program leaves much to be desired. For a long time, though, it was really the only game in town, so the situation was pretty much “put up or shut up” (or for the computationally-minded among us: write something better). However, the situation is beginning to change…
Moving Beyond Shoebox
As I mentioned above, SIL has released a new package, FLEx, which is intended to replace Shoebox/Toolbox. I have not yet tried it, but it looks like a considerable improvement over Shoebox. It is reviewed here. As soon as I have some time (i.e. when I have finished my dissertation), I plan to take FLEx out for a spin and see how it works.
TshwaneLex is a commercially available lexicography program for creating dictionaries. This review makes it look like a fairly attractive option, except that it’s not free (150 Euros for an academic license). Also, it is not integrated with a parser, the way Shoebox is, so if that is important to you, then TshwaneLex is not for you.
There are some other tools out there, but as far as I can tell, most of them are not yet ready for prime time. The E-MELD School of Best Practice is a great resource for any linguist heading off to the field for the first time, and they have a large quantity of information about software here.