[This is the text from a lightning talk I gave at the Triangle Research Library Network Annual Meeting on July 23, 2014. It was just 6 minutes so not much room for detail. I will probably be working on a longer piece about Doc South Data in the coming months.]
I want to start by thanking everyone for coming and thanks to Chad for organizing. While I’m thanking folks, I should note that I am showing off other people’s work. Thanks to Nick Graham, Tim Shearer, Emily Brassell and Steve Erhart for making this project possible.
- One of the oldest digital archive projects at UNC, dating back to 1996
- Built to provide web access to some of the University’s most popular collections
- Currently, Doc South consists of 17 collections
These are high quality collections that do an excellent job of letting researchers take a look at primary documents without having to come to Chapel Hill and without causing further wear on the items. In short, they do exactly what such online projects were designed to do: take advantage of technology to increase access to unique collections.
However, we are starting to see researchers who want to take advantage of even newer technology. You may have heard people talk about Big Data, Text Analysis, Distant Reading: all of these terms point toward a variety of practices that use digital tools to look for patterns in large quantities of text. This has been very common in medicine, sociology, political science and marketing, but humanities scholars are showing interest as well.
As anyone who has ever tried to do any data analysis knows, the secret to good results is to start with good data. Because we have not generally thought of humanities collections as data, finding something interesting to analyze is kind of a problem.
“Part of the explanation for why more historians have not undertaken text mining and topic modeling projects lies in the limited availability of machine readable texts.”
– Stephen Robertson, Director, CHNM
Libraries have an opportunity to do something about this. In addition to working with vendors to help them understand the kinds of work our faculty want to do, we can make it easier to do this work on the collections we already control.
In early fall 2014, we will be releasing the Doc South Data project. This will provide easy access to the collections in DocSouth in formats that are optimized for text analysis tools.
Once we go live, users will be able to download a .zip file that contains the following:
- A folder with all the items as plain text,
- A folder with all the items as XML,
- A table of contents file,
- A “read me” file that provides useful information about the collection.
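As a sketch of how a researcher might start with the download (the folder and file names here are hypothetical, since the final layout of the release may differ), a few lines of Python can pull every plain-text item into memory:

```python
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in `folder` into a dict keyed by file name stem."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    }

# Hypothetical path to the unzipped plain-text folder; the actual
# folder name inside the Doc South Data zip may differ.
corpus = load_corpus("docsouth-data/texts")
print(f"Loaded {len(corpus)} texts")
```

From there the dictionary of texts can be fed into whatever analysis tool or library the researcher prefers.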
The data are very flexible and can be analyzed with several popular tools. That being said, I had a tool called Voyant (http://voyant-tools.org/) in mind when we designed the project. Voyant was created by two literary scholars and digital humanists in Canada, Stefan Sinclair and Geoffrey Rockwell. It is a really great tool for doing fast, simple visualizations of texts (documentation on how to use Voyant is here: http://docs.voyant-tools.org/start/).
Below is a simple word cloud based on the text of The Life and Times of Frederick Douglass.
Word clouds aren’t terribly exciting anymore, but we can also create comparative visualizations in Voyant. The screenshot below shows a word cloud based on the text from two narratives combined. By clicking on any of the words, you will see that word in context and also a chart that compares the frequency of that word’s use in both texts.
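The comparison Voyant draws rests on a simple idea: relative frequency (occurrences per 1,000 words) makes texts of different lengths comparable. A minimal sketch, using tiny stand-in passages rather than the real narratives:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text and count word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def per_thousand(counts, word):
    """Relative frequency of `word`, per 1,000 words of the text."""
    total = sum(counts.values())
    return 1000 * counts[word] / total

# Stand-in passages; the real inputs would be the plain-text files
# from the Doc South Data download.
narrative_a = "I was born a slave. The slave knew no freedom."
narrative_b = "Freedom was the hope of every heart."

freq_a = word_frequencies(narrative_a)
freq_b = word_frequencies(narrative_b)

print(per_thousand(freq_a, "slave"))    # → 200.0 (2 of 10 words)
print(per_thousand(freq_b, "freedom"))
```

Voyant does all of this (and the charting) for you; the point is only that the underlying measure is straightforward once the texts are available as plain data.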
This was a relatively easy project but we hope it helps people get more out of our collections.