The Scholarly Communications Institute is a long-running program supported by the Mellon Foundation. The philosophy of SCI is to identify promising hives of activity around academia and provide them with a chunk of time, some inspiration and a collaborative working environment so they can really focus on the problem.
SCI was based at the University of Virginia from 2003 until 2013. 2014 marked the Institute’s first year in its new home in the Research Triangle. Staff from Duke did most of the coordination, but an advisory board made up of national experts and representatives from area institutions did the high-level planning. They set a very cooperative, productive and collegial tone for the event.
Last spring, there was a call for proposals for teams to participate in SCI. This was unlike most CFPs in that they were not looking for people to talk about projects that were finished or slated to be completed. To quote the SCI website:
Rather, the SCI will offer the time, freedom, and diversity of participants to foster intellectual risk taking, collaborative and creative speculation, bridging of institutional divides, germination of actionable ideas, cultivation of new networks, discovery of common ground, all without fear of failure or the burden of having to produce immediate, concrete, sustainable deliverables.
This really is a rare opportunity and we knew that UNC libraries had to be involved. Fortunately, we had the perfect project just begging for some attention.
The University of North Carolina’s North Carolina Collection contains over three million pages of historic newspapers from around the state, which are being digitized as part of a partnership with Newspapers.com. In exchange for allowing Newspapers.com to digitize the collection and add it to their subscription service, the library has been given access to that service. This is a powerful yet simple way for the library’s users to search the collection and view page images.
In addition to this really great access, we also received all of the output from the digitization of each of the 3 million pages on several external hard drives. This output includes the following:
- OCR text output in .xml
All told, that’s about 80 terabytes of data (which is a lot of data). We can only use it locally for three years but, after that embargo expires, we can mount our own publicly accessible version of the North Carolina newspapers (with some sort of search interface) or allow other UNC partners to do so.
We cannot – and don’t need to – do anything with this data that replicates what you can already do with a subscription to Newspapers.com. Furthermore, we cannot – nor would we want to – create a commercial product with the data.
So, our task at SCI was to determine if there was some useful thing we could do with all of those digitized pages, with all of those images and text. I teamed up with Pam Lach, Associate Director of the Digital Innovation Lab, and together we built a team and wrote a proposal. Fortunately, the proposal was accepted and we all spent November 10 – 13 at the Rizzo Center in Chapel Hill.
In addition to Pam and myself, there were four other team members:
Nick Graham is the Program Coordinator of the North Carolina Digital Heritage Center. He has an intimate knowledge of the collection and the deal with Newspapers.com.
Stephanie Williams, a software developer in the Digital Heritage Center, has a deep technical understanding of information systems. She helped us understand what was possible and provided the team with crucial reality checks.
Mike Barker, Assistant Vice Chancellor for Research Computing, was also on the team. His technical know-how was important but, more important still, was his perspective as a university administrator. Mike was able to help us see this project from a much higher level and to put it in the context of the University’s mission and strategy.
Newspapers.com let us borrow their Director of Business Development, Brent Carter. Brent was super helpful when we needed inside information about how some of the data had been structured.
Even if we had not come out of SCI with a plan for the newspaper data, just having the opportunity for the six of us to hang out for a few days would have been worth it. Teams composed of people from a variety of campus units are often very difficult to convene and this one was particularly diverse. The units represented were:
- The Research Hub and Davis R&IS
- The Digital Innovation Lab
- Wilson Library
- Library Information Technology (LIT)
- Research Computing
- A digital content provider
All of our worlds overlap in important ways and it is in our best interests to figure out how to collaborate. We got to think through those issues for this particular project but I think we all came away with a better understanding of how we can work together in the future.
Very early on, we were struck by the possibility of making the data available for text mining and data analysis, similar to what we had done with Doc South over the summer.
In order to do this, we need at least four things:
- A place for all that data
- A way to search the data and select a subset of it for study
- A mechanism for delivering that data to the user
- A community of researchers who want to use the resource
All of these things are actually pretty complicated but I’m happy to report that there is at least a little movement on each of them.
On the second day, Mike Barker set up some secure storage space for the data. It was neither accessible nor secure sitting on hard drives in Nick’s office, so this storage space made everyone more comfortable and set the stage for the rest of our plans. This alone made SCI a huge win for the team.
But that was just Tuesday morning. We then turned to the idea of the interface, and Mike, Brent and Stephanie had some really good conversations about how to build it so that users could search, select and get access to the data they wanted. Having identified a workable solution, we now face the more complicated challenge of finding some time to actually build it.
As for community building, Pam and I have been working with Ashley Reed from the Carolina Digital Humanities Initiative to put together a series of training opportunities and line up at least one guest speaker to help generate interest and know-how amongst the researchers on campus. Definitely be on the lookout for announcements about such things, as library staff are absolutely invited to take part in anything we plan.
All in all, I think this was a really great experience for everyone on the team. We got lots of work done on the project but we also got to see what the other groups were working on and learn from them. There were a few presentations, including a really great one on the Open Library of the Humanities. Finally, we were able to take advantage of a roving band of inspirational experts including Steve Wheatley from the American Council of Learned Societies, Don Waters from Mellon and our own Anne Gilliland.
I’ve really just scratched the surface of what we did out at the Rizzo Center and I’m happy to take questions now or after the meeting.
(This is the slightly cleaned-up text of a talk I gave at an all-staff meeting on Dec. 2, 2014.)