Monday, February 21, 2011

Lexomics: An Explanation

I've been meaning to write this post for, I don't know, months, but life and being department Chair (those are two distinct states, like living and being a zombie) have gotten in the way. But now I've just learned that I'm going to be part of a working group at the Santa Fe Institute in March, so I need to get my ideas in order, and this blog is as good a place as any to do that.

The word "lexomics" describes an evolving set of methods for finding patterns in textual corpora.  The term is taken from Bioinformatics, where it is used to describe the search for "words" and patterns in genomes (my colleague Betsey Dyer is apparently the first person to coin the word, an obvious adaptation of "genomics"). 

Our lexomic methods (which are described in much more detail in forthcoming papers in JEGP and Modern Philology, as well as in my new book, Tradition and Influence) use computer-assisted statistical methods to troll through the Dictionary of Old English corpus. Essentially, we cut texts into segments, count the words in each segment and compute their relative frequencies, and then compare these relative frequencies from segment to segment using a statistical method called hierarchical agglomerative clustering. This method produces a branching diagram, called a dendrogram, that shows how similar each segment is to (and thus how different it is from) every other segment.
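For readers who like to see the steps laid out, here is a minimal sketch in Python of the cut, count, compare, and cluster procedure just described. It is an illustration only, not our actual software: the function names, the fixed segment size, the choice of Euclidean distance, and the use of average linkage are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def segment(words, size):
    """Cut a list of words into consecutive segments of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def relative_frequencies(words):
    """Map each word to its share of all the words in one segment."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def distance(p, q):
    """Euclidean distance between two relative-frequency profiles.

    Assumption: Euclidean distance is used here purely for illustration.
    """
    vocabulary = set(p) | set(q)
    return sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocabulary) ** 0.5


def cluster(profiles):
    """Hierarchical agglomerative clustering with average linkage.

    Starts with every segment in its own cluster and repeatedly merges
    the two closest clusters. Returns the merges in order as
    (left, right, distance) triples, where left and right are tuples of
    segment indices. Read top to bottom, the list describes the
    branching of a dendrogram from the leaves upward.
    """
    def linkage(a, b):
        # Average distance over all pairs drawn from the two clusters.
        pairs = [(i, j) for i in a for j in b]
        return sum(distance(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges
```

To try it, split a text into a list of words, pass it through segment, build a profile for each segment with relative_frequencies, and hand the list of profiles to cluster. Segments that merge early, at small distances, are the ones whose vocabulary is most alike.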

Other scholars, such as John Burrows, have used similar methods to examine texts and even to attempt to determine the authorship of disputed or anonymous texts -- with varying degrees of success. These approaches are usually much more complex and sophisticated than ours: scholars sometimes remove the 50 or 100 most frequently used words, or they force all words to standard spellings and morphologies, or they lemmatize their texts. We simply dumped all the words into the hopper and started counting them (this was just a preliminary experiment, after all).
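To make one of those refinements concrete, here is a small Python sketch of removing the most frequently used words before counting. The function name and the way ties are handled are assumptions for illustration; this is not a reconstruction of any particular scholar's procedure, and it is a step we did not take ourselves.

```python
from collections import Counter


def drop_most_frequent(words, n):
    """Return `words` with the n most frequent word types removed.

    Assumption: ties at the cutoff are broken by Counter.most_common,
    which keeps words in the order they were first seen.
    """
    common = {word for word, _ in Counter(words).most_common(n)}
    return [word for word in words if word not in common]
```

With n set to 50 or 100 on a real corpus, what remains is largely content vocabulary, so the choice of whether to apply this step changes what the later comparison is measuring.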

But to our surprise, our methods seemed to "work" very effectively without any sophistication.  For example, we were able to match one poem with the correct section of another poem to which it was related, and we could separate out the two well-known sections of a third poem with absolute accuracy (I'm only being oblique here because I think it would be impolite to scoop the massive JEGP article which will be out soon).  It may be that by lumping together orthographic, morphological and other variants, we were able to detect patterns that were relatively subtle. 

We also seem to be able to detect when a segment of a text has a different source from the main body of the text, which is particularly exciting for Anglo-Saxon studies because we have so many texts that are composites (so we have good controls) and others whose composite nature is controversial.

Thus far we get "good" results--in the sense that they are consistent with what we know from traditional methods--for Old English poetry and prose, Latin prose, Middle English poetry, and, intriguingly, some Modern English prose. We are hoping to refine the techniques by testing fully lemmatized texts (this is difficult, because lemmatizing is incredibly time-consuming) and trying other, more sophisticated statistical methods. As you'll see in the two articles (and the book), we've been able to shed some light on the Cynewulfian corpus and on the structure of Guthlac A and its relationship to other texts. Soon we hope to be able to tell you some interesting things about Alfred's Orosius, King Horn, Bede's History, and the play Mule Bone.

The connection to my book is this: lexomic methods can detect and to some degree measure influence. In my new book, I argue that tradition is a special case of influence, and so detecting influence is in a way detecting certain kinds of traditions. This gives us an empirical way of looking at a topic that has tended to be approached in a very fuzzy way.

But--and this is perhaps the most important point in this capsule summary--lexomics does not work at all if you don't have a deep familiarity with the texts ("wearing the English Professor hat" I call it) and the critical problems associated with them. A dendrogram by itself can tell you very little, but a dendrogram coupled to an understanding of the sources and structure of a poem has the ability to shed light upon--and even re-date--a complex text.

I'm hoping in my visit to the Santa Fe Institute to learn how others are approaching culture as a complex evolutionary system, and perhaps to improve lexomics (and certainly to offer it to them) as a tool for trying to untangle and trace a few strands of the massive cultural tapestry.