Notes on Text and Data Mining for Libraries

On Wednesday (2/19/2014) I attended a webinar organized by the folks at the Center for Research Libraries on the topic of Text and Data Mining. Each of the speakers focused on what libraries can do to help scholars, particularly in the humanities and social sciences, use digital/digitized resources not simply to access and read texts but to look for patterns in large chunks of text. Largely, this is what “big data” looks like for the humanities and social sciences at the moment. I think this is a really timely and important topic for research libraries because of three converging factors:

1. Scholars like Stanford’s Franco Moretti and the University of Richmond’s Ed Ayers are doing groundbreaking work in the humanities using data/text analysis. You can read about Moretti’s practice of “distant reading” (as opposed to close reading) in THIS Times article. You can read about Ed Ayers’ Visualizing Emancipation project HERE. Really smart people getting recognition for this kind of work will lead to both greater acceptance in the academy generally (think tenure and promotion) and greater interest from individual scholars.

2. Knowledge of how to use the tools needed to perform text/data analysis is becoming more common. Part of this is that there are more opportunities for training from online courses. More importantly, some of the tools are becoming much easier to use. The Google N-Gram Viewer (discussed below) is a really simple word count tool, but other, more sophisticated tools have also emerged that require little or no advanced technical training. Voyant Tools, developed by Canadian literary scholars including Stéfan Sinclair and Geoffrey Rockwell, offers numerous ways to visualize texts, including word clouds, word frequencies and network graphing. Newer tools such as RAW perform similar functions.
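To make the “word count tool” idea concrete, here is a minimal sketch (in Python, using a made-up sample sentence) of the kind of word- and n-gram-frequency counting that tools like the N-Gram Viewer and Voyant automate behind friendlier interfaces. A real tool would also handle stop words, lemmatization, and much larger corpora:

```python
from collections import Counter
import re

def tokenize(text):
    # Lowercase and keep only runs of letters; real tools do
    # much more careful tokenization than this.
    return re.findall(r"[a-z]+", text.lower())

def ngram_counts(text, n=1):
    # Count n-word sequences: n=1 gives plain word frequencies,
    # n=2 gives bigrams, and so on.
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

sample = "To be, or not to be, that is the question."
print(ngram_counts(sample, 1).most_common(3))
print(ngram_counts(sample, 2).most_common(2))
```

Scaled up from one sentence to millions of books, this is essentially the statistic the N-Gram Viewer plots over time.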

3. The machine-readable data sets scholars need to conduct this research are becoming more prevalent. Approaches like distant reading depend on statistics and averages, so the patterns you detect are more useful the more representative (and large) your data set is. Google has been allowing users to conduct relatively crude text analysis through the Google Books N-Gram Viewer. (Read more about what they problematically call Culturomics HERE). More recently, Twitter announced plans to open up their archive of tweets to a handful of lucky researchers in the form of DATA GRANTS. Say what you will about Twitter, but this archive offers some tantalizing possibilities for researching the fine grain of daily life. This is particularly interesting when you are looking at daily life during extraordinary times. While I was at Emory, some graduate students I worked with used an archive of tweets from the Occupy Wall Street protests to link protest actions with weather conditions and trace the evolution of the movement day to day. You can see that project HERE.
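As a rough illustration of the day-to-day tracing described above (this is my own hedged sketch with invented sample records, not the Emory students’ actual method or data), the core move is just bucketing timestamped tweets by calendar day and counting mentions of a term:

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet records as (ISO timestamp, text) pairs; a real
# archive would come from the Twitter API or a data grant.
tweets = [
    ("2011-10-01T09:15:00", "Marching across the bridge #ows"),
    ("2011-10-01T18:40:00", "Rain starting, crowds thinning #ows"),
    ("2011-10-02T11:05:00", "General assembly at the park #ows"),
]

def daily_keyword_counts(records, keyword):
    # Group tweets by calendar day, counting those that mention
    # the keyword (case-insensitive substring match).
    counts = Counter()
    for stamp, text in records:
        day = datetime.fromisoformat(stamp).date().isoformat()
        if keyword.lower() in text.lower():
            counts[day] += 1
    return counts

print(daily_keyword_counts(tweets, "rain"))
```

Line those daily counts up against a weather record for the same dates and you have the skeleton of the protest/weather comparison.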

The webinar on Wednesday featured three speakers though, due to audio problems, I missed most of what the last person was saying. CRL captured the audio and the slides and should be making them available shortly. In the meantime, I wanted to share a couple of things that struck me from the first two speakers.

The first presenter was Bob Scott from Columbia. I was particularly struck by his description of the library as a workshop where scholars can manipulate data. He argued that libraries are particularly well suited to this role because they have access to the data, are centrally located and, importantly, can keep an eye out for ways to reuse, remix and recycle tools and data sets. Scott suggested that libraries could do more to work with vendors to make more analyzable and text-mineable data available. He suggested that we could argue for APIs, vendor “sandboxes” or even just files in cloud-based storage. Some libraries already have data sets they may not know about in the form of back-up drives sent from vendors of online databases. Duke has experimented with making these drives available for data analysis.

In order to get up and running with this emerging research practice, Scott suggested librarians do two things:

1) Identify interested faculty and partner with them to develop infrastructure and expertise

2) Allow librarians to begin experimenting with data sets and tools as a way to build skills and identify needs.

Scott also pointed out the problems associated with this, including the need for new skills, new partnerships and even new positions despite the absence of any new time or new money.

The second speaker was Kalev Leetaru from Georgetown. He is the author of Data Mining Methods for the Content Analyst (UNC Library // Amazon). This talk focused a bit more on working directly with vendors to provide usable data for scholars. In his experience, vendors are sometimes willing to work with libraries and scholars on a case-by-case basis. However, doing this can be very time-consuming, so they want to make sure the researcher knows what he or she is doing.

I doubt that “case by case” will be the way vendors want to deal with this for much longer as text/data analysis becomes more and more common. Libraries and librarians can play a crucial role here by communicating the needs of researchers to vendors and demonstrating the value of these techniques to scholars.