Wednesday, April 27, 2011

NEH Supports Lexomics

We got the grant!

I guess third time was the charm.  The National Endowment for the Humanities has fully funded our Lexomics project for the next two years (project total $178,000).  We will be expanding lexomic analysis from just Old English (though we will be continuing this research) to medieval Latin, Middle English, and texts from the Harlem Renaissance, and we will be collaborating with Shawn Christian (Harlem Renaissance), Sarah Downey (Latin), Yvette Kisor (Old English, Beowulf) and Scott Kleinman (Old and Middle English, approaches to computational lemmatization).  It's going to be an exciting two years.   

Very soon we will have available on the lexomics website (lexomics.wheatoncollege.edu) a tool called "Divi-text," which allows people to upload any electronic text and cut it into chunks (in preparation for lexomic analysis).  In the next year or so we will also complete the "dendro-grammer," which will enable researchers to produce their own dendrograms without having to learn how to use the statistical analysis software, R. 

In July our team's first article in a major journal will appear:

Michael D.C. Drout, Michael J. Kahn, Mark D. LeBlanc and Christina Nelson, "Of Dendrogrammatology: Lexomic Methods for Analyzing the Relationships Among Old English Poems," JEGP 110 (2011): 301-36. 

Some time after that another article from the research group will appear in Modern Philology:

Sarah Downey, Michael D.C. Drout, Michael J. Kahn and Mark D. LeBlanc, "'Books Tell Us': Lexomic and Traditional Evidence for the Sources of Guthlac A."

Currently work is ongoing on the Cynewulfian corpus (though some of that is in the JEGP article), Beowulf, Bede's Ecclesiastical History, the Old English translation of Orosius, King Horn, and Mule Bone by Zora Neale Hurston and Langston Hughes.

I will post links to software and papers.

Monday, February 21, 2011

Lexomics: An Explanation

I've been meaning to write this post for, I don't know, months, but life and being department Chair (those are two distinct states, like living and being a zombie) have gotten in the way.  But now I've just learned that I'm going to be part of a working group at the Santa Fe Institute in March, so I need to get my ideas in order, and this blog is as good a place as any to do that.

The word "lexomics" describes an evolving set of methods for finding patterns in textual corpora.  The term is taken from Bioinformatics, where it is used to describe the search for "words" and patterns in genomes (my colleague Betsey Dyer apparently coined the word, an obvious adaptation of "genomics").

Our lexomic methods (which are described in much more detail in forthcoming papers in JEGP and Modern Philology, as well as in my new book, Tradition and Influence) use computer-assisted statistical methods to troll through the Dictionary of Old English corpus.  Essentially, we cut texts into segments, count the words in each segment and compute their relative frequencies, and then compare these relative frequencies from segment to segment using a statistical method called hierarchical, agglomerative clustering.  This method produces a branching diagram, called a dendrogram, that shows how similar (and thus how different) the segments are to one another.
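
For anyone curious about the nuts and bolts, here is a bare-bones sketch of that procedure in Python. Our actual pipeline runs in R on the Dictionary of Old English corpus; the file name, chunk size, distance measure, and linkage method below are just placeholders for illustration, not necessarily the settings we use.

# Cut a text into equal-sized chunks, compute the relative frequency of every
# word in each chunk, and cluster the chunks hierarchically.
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

words = open("some_text.txt", encoding="utf-8").read().lower().split()
chunk_size = 1500
chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# One row per chunk, one column per word, each cell a relative frequency.
vocab = sorted(set(words))
col = {w: j for j, w in enumerate(vocab)}
freqs = np.zeros((len(chunks), len(vocab)))
for i, chunk in enumerate(chunks):
    for w, c in Counter(chunk).items():
        freqs[i, col[w]] = c / len(chunk)

# Hierarchical, agglomerative clustering of the chunk profiles; the resulting
# dendrogram groups the most similar chunks together.
Z = linkage(freqs, method="average", metric="euclidean")
dendrogram(Z, labels=[f"chunk {i + 1}" for i in range(len(chunks))])
plt.show()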

Other scholars, such as John Burrows, have used similar methods to examine texts and even to attempt to determine the authorship of disputed or anonymous texts -- with varying degrees of success.  These approaches are usually much more complex and sophisticated than ours: scholars sometimes remove the 50 or 100 most frequently used words, or they force all words to standard spellings and morphologies, or they lemmatize their texts.  We simply dumped all the words into the hopper and started counting them (this was just a preliminary experiment, after all).

But to our surprise, our methods seemed to "work" very effectively without any sophistication.  For example, we were able to match one poem with the correct section of another poem to which it was related, and we could separate out the two well-known sections of a third poem with absolute accuracy (I'm only being oblique here because I think it would be impolite to scoop the massive JEGP article which will be out soon).  It may be that by lumping together orthographic, morphological and other variants, we were able to detect patterns that were relatively subtle. 

We also seem to be able to detect when a segment of a text has a different source than the main body of the text, which is particularly exciting for Anglo-Saxon studies because we have so many texts that are composites (so we have good controls) and others whose composite nature is controversial.

Thus far we get "good" results--in the sense that they are consistent with what we know from traditional methods--for Old English poetry and prose, Latin prose, Middle English poetry, and, intriguingly, some Modern English prose.  We are hoping to refine the techniques by testing fully lemmatized texts (this is difficult, because lemmatizing is incredibly time-consuming) and trying other, more sophisticated statistical methods.  As you'll see in the two articles (and the book), we've been able to shed some light on the Cynewulfian corpus and on the structure of Guthlac A and its relationship to other texts.  Soon we hope to be able to tell you some interesting things about Alfred's Orosius, King Horn, Bede's History, and the play Mule Bone.

The connection to my book is this: lexomics methods can detect and to some degree measure influence.  In my new book, I argue that tradition is a special case of influence, and so detecting influence is in a way detecting certain kinds of traditions.  This gives us an empirical way of looking at a topic that has tended to be approached in a very fuzzy way. 

But--and this is perhaps the most important point in this capsule summary--lexomics does not work at all if you don't have a deep familiarity with the texts ("wearing the English Professor hat," I call it) and the critical problems associated with them.  A dendrogram by itself can tell you very little, but a dendrogram coupled with an understanding of the sources and structure of a poem has the ability to shed light upon--and even re-date--a complex text.

I'm hoping in my visit to the Santa Fe Institute to learn how others are approaching culture as a complex evolutionary system, and perhaps improve lexomics (and certainly offer it to them) as a tool for trying to untangle and trace a few strands of the massive cultural tapestry. 

Monday, January 31, 2011

A Nice Little Trick Enabled by Lexomics (and Excel)

It's nice when people overhear conversations and then help you.

I was at the climbing gym (Rock Spot Climbing in Boston -- the best climbing gym anywhere) and, in between bouldering runs, was talking with my wife about how my research was coming.  Somehow we got to talking about whether Excel could speed up some of my searching.  A guy at the gym overheard and said he had been the Excel guru for a Psych research project and offered to help.  What follows comes from that brief collaboration.  By combining material on the Lexomics website with Excel, you can do some interesting searches for uncommon words in the Anglo-Saxon corpus.


Let's say you are researching a particular Old English poem, say, Juliana.  You want to look at the more uncommon words in this poem and see if they are shared with the rest of Cynewulf's poetry or with other texts in Anglo-Saxon. 


Go to the Lexomics website, choose "tools," and then "word frequencies."  Click "entire corpus" and then "get stats."  Click on HERE to download the results as an Excel file.  You now have a file with a list of every word in the Anglo-Saxon corpus ranked in order of frequency.

Copy the column of words and the column of word frequencies and paste them into a new spreadsheet as column A and column B.

Now go back to the Lexomics website, go to "tools," then "word frequencies," and choose the poem of interest.  "Get stats" for that poem and download them by clicking on HERE.  You now have an Excel file with a list of every word in the poem ranked in order of frequency.  Copy the column with the words and paste it into column C in your spreadsheet.

Now you are ready to find those words that appear in your poem and only a few times in the rest of the corpus.

Go to cell D1 and enter the following formula:

=SUM(IF($A$1:$A$x = C1, IF($B$1:$B$x < n, 1, 0), 0)), where x is the number of the last row of the corpus list in column A and n is the low-frequency threshold (for example, n = 5 if you want every word that appears fewer than 5 times in the corpus)


*Important*: do not just press ENTER.  Because this is an array formula, press CTRL-SHIFT-ENTER instead.

Then copy the formula down the entire D column by clicking the small box in the lower right corner of the cell and dragging down to the row of the last word in column C.

It will take a few moments for the formulas to recalculate.

When processing is complete, you will have a 0 in every cell in D in which the word does not fulfill the criteria and a 1 when the word does fulfill them (that is, the word appears in your poem and fewer than n times in the corpus as a whole).

You can search for these 1's manually or use "Conditional Formatting" to bold or color the rows with a 1 in column D.
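
If you would rather script this step than spreadsheet it, here is a rough Python equivalent of the whole filtering procedure.  The file names and the tab-separated word/count layout are my assumptions about the files you download from the word-frequency tool; adjust them to match what you actually get.

# Flag words that appear in the poem and fewer than n times in the whole corpus.
# Assumes two tab-separated files, each with one "word <TAB> count" pair per line.
import csv

n = 5  # the low-frequency threshold

def read_counts(path):
    """Read a word-frequency download into a {word: count} dictionary."""
    with open(path, encoding="utf-8") as f:
        return {row[0]: int(row[1]) for row in csv.reader(f, delimiter="\t") if row}

corpus = read_counts("corpus_frequencies.tsv")   # every word in the corpus
poem = read_counts("juliana_frequencies.tsv")    # every word in your poem

# The same test the spreadsheet formula performs, word by word.
rare = sorted(w for w in poem if corpus.get(w, 0) < n)
for w in rare:
    print(w, corpus.get(w, 0))

The printed list is the same set of words that end up with a 1 in column D.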

Now you can search these words in the Dictionary of Old English concordance and see where else they appear.  Look for patterns.  Enjoy.