Last week’s TwTT session explored some of the impressive analytical tools that can be applied to archives once they become machine readable. This week, Dean Irvine, visiting Yale from Dalhousie University in Nova Scotia, gave an engaging and technical talk on the crafting of the digital edition – the process that takes printed text to hypertext while adding layers of functionality. By the end of the talk, attendees had not only been introduced to leading content management systems and workflows, but also walked through the steps involved in producing a digital edition, warned of the obstacles most frequently encountered, and given tips on where to begin and on the advantages a digital edition can provide.
Irvine’s work is associated with the Editing Modernism in Canada (EMiC) project, a 33-university endeavor with roots in the classroom. EMiC uses new technology to stimulate interest in Canadian modernist studies, and part of the program is an annually updated summer course on creating digital editions with the newest available technology. Each year’s changes are unveiled at a workshop held before the program begins – this year at Yale in May 2012. Irvine’s TwTT presentation offered a sneak peek at what will be revealed at that workshop and at the latest developments in the production of digital editions.
Before delving into the production of digital editions, a little should be said about what a digital edition is and what makes it special. A digital edition is an edited electronic version of a printed work that has been enhanced with annotations and metadata tags to increase the usability and value of the text. Digital editions can exist as text only (content, no form), which has the advantage of being easily searched, read, and annotated, but loses the original image and character of the work – a particular loss when dealing with letters, handwritten manuscripts, and other historical documents where structure and form are at least as significant as content. “Digital page” editions also exist; these are just scanned images (form, no content) that are easy to produce and accumulate but difficult to use. What EMiC found was that students wanted the best of both worlds – an easily searchable and annotatable text that preserves the original form and structure – and this led to the creation of the “image-based edition” (preserving both form and content). Irvine points out that the best way to think of the image-based edition is as a multi-layered object: looking at the screen, a reader sees an image of the document, but behind the screen the computer can read, search, and mark up the text, delivering a level of usability that is simply not possible in a printed work while preserving the aesthetic quality that is frequently lost in internet editions.
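This multi-layered structure can be made concrete with a small sketch. The snippet below is illustrative TEI-flavored XML, not EMiC’s actual schema: a facsimile layer records the scanned page image and a pixel zone on it, while a text layer carries the transcription linked back to that zone, so software can search the text even though the reader sees the page image.

```python
import xml.etree.ElementTree as ET

# Illustrative TEI-style markup (a sketch, not EMiC's real schema):
# the <facsimile> layer holds the image and coordinates, the <text>
# layer holds the transcription, and the facs attribute links them.
tei = """
<TEI>
  <facsimile>
    <surface>
      <graphic url="page-001.tif"/>
      <zone xml:id="z1" ulx="120" uly="80" lrx="900" lry="140"/>
    </surface>
  </facsimile>
  <text>
    <body>
      <line facs="#z1">The first line of the manuscript page.</line>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(tei)

def text_for_zone(root, zone_id):
    """Return the transcribed text linked to a facsimile zone."""
    for line in root.iter("line"):
        if line.get("facs") == "#" + zone_id:
            return line.text
    return None

print(text_for_zone(root, "z1"))
```

Because the transcription is machine readable, a viewer could highlight the matching image region for any search hit – the usability of plain text with the aesthetics of the scanned page.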
The path of converting a scanned page to a “digital page” was the next topic of discussion, and Irvine walked the audience through the steps of crafting a digital edition, the first of which is to select a content management system. A CMS is a tool to manage collections and workflows, and two are frequently used by EMiC students in the production of digital editions: Omeka and Islandora. While Omeka (colloquially called “WordPress for scholars”) is easier to use and may be a good choice if your institution has limited support for the digital humanities, Islandora is a more powerful solution that combines a Drupal front end with a Fedora Commons back end; it is the one used by EMiC, the Smithsonian, and others. While both are used in the presentation of digital editions, it should be noted that they are not limited to electronic books: they are scalable systems that can also accommodate photographs, audio, and other mixed media. The CMS can be thought of as the library or exhibition space that sets the rules and format of your exhibition and also holds the content. An example of the possibilities of using an Islandora system can be seen here.
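Whatever CMS is chosen, each item in the collection is described by a structured metadata record (Omeka, for instance, uses Dublin Core by default). The sketch below is a hedged illustration of what such a record might hold, independent of either system’s real API; the field values and the helper function are invented for demonstration.

```python
# A hypothetical Dublin Core-style item record, as a CMS might store it.
# (Illustrative only -- not Omeka's or Islandora's actual data model.)
item = {
    "dc:title": "Letter to a friend, 14 March 1923",
    "dc:creator": "Anonymous",
    "dc:date": "1923-03-14",
    "dc:type": "Text",
    "dc:format": "image/tiff",
}

def missing_fields(record, required=("dc:title", "dc:creator", "dc:date")):
    """List required metadata fields absent from a record."""
    return [field for field in required if field not in record]

print(missing_fields(item))   # this record is complete
print(missing_fields({"dc:title": "Untitled"}))
```

Consistent records like this are what let the CMS sort, search, and exhibit books, photographs, and audio through one interface.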
Once a CMS has been selected, Dean pointed out that the process of actually making a digital edition is designed to seem deceptively simple. Uncompressed TIFF files are uploaded to the system, the user fills out a few metadata fields, and presses an “ingest” button. Hiding behind the ingest button is a flowchart of processing and formatting, handled by the computer, that does not fit onto a single slide. The changes most relevant to the user are the recognition of text characters (OCR), restructuring into XML (a markup language that encodes the content), and transformation via XSLT (a language for converting XML into other formats for display). Images are converted to the JPEG format to improve compatibility and decrease file size, and the text is encoded to comply with TEI, the leading standard for text encoding in the humanities. These steps are complex, and introduce some issues that Irvine discussed later, but for the moment, what the end user sees is an output document that has been tagged and is ready for markup.
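The flow behind the ingest button can be sketched as a pipeline of stages. Everything below is stubbed out for illustration – a real system would call an actual OCR engine (such as ABBYY), an XSLT processor, and an image-conversion library – but the shape of the process matches the steps described above.

```python
# A heavily simplified sketch of the ingest pipeline. Each stage is a
# stand-in for real software; only the overall flow is meant literally.

def ocr(tiff_path):
    # Stand-in for optical character recognition on the scanned page.
    return "Recognized text from " + tiff_path

def to_tei(text):
    # Wrap the recognized text in a minimal TEI-style XML document
    # (a real pipeline would apply XSLT and full TEI encoding here).
    return "<TEI><text><body><p>" + text + "</p></body></text></TEI>"

def derive_jpeg(tiff_path):
    # Stand-in for TIFF-to-JPEG conversion (smaller, more compatible).
    return tiff_path.rsplit(".", 1)[0] + ".jpg"

def ingest(tiff_path):
    # The single action the user sees: everything else happens inside.
    return {"tei": to_tei(ocr(tiff_path)), "jpeg": derive_jpeg(tiff_path)}

result = ingest("page-001.tif")
print(result["jpeg"])  # → page-001.jpg
```

The point of the sketch is the asymmetry Irvine highlighted: one button for the user, many coordinated transformations underneath.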
At the moment, text and image markup are not integrated in the same software package. For image markup, a tool called the IMT is used, available under an open source license here. For text, a number of editors are available, but Irvine teaches an experimental XML editor called CWRC-Writer. Although it will eventually be released as an open source application, for now it is under development, and people affiliated with Yale or EMiC who are interested in using it should contact Dean directly. Some other interesting markup solutions exist, including the TILE environment from the University of Maryland, which allows for both image and text markup but relies too heavily on users programming their own plug-ins to be widely accessible.
After markup, the user has a digital edition of a text that is accessible via a content management system – so what comes next? For one thing, the user is free to take advantage of the many benefits of a digital text. Tagging and metadata make it easier to find related works and to understand the content of a collection. Software tools also exist to bring the data to life. An example is Juxta, which allows users to collate and compare electronic texts, identifying and saving differences across long texts. The applications for this kind of data are manifold, including this example, which traces changes in Darwin’s The Origin of Species across its various editions to reflect shifts in scientific thought. An upcoming project at the University of Toronto will examine the changes made to the complete works of Shakespeare over time. Digital editions are becoming easier to make, and they add a level of depth and open doors to analysis that are simply not possible with print editions.
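Juxta’s core operation – collating two witnesses of a text and recording where they diverge – can be approximated in a few lines. The toy version below uses Python’s standard difflib rather than Juxta’s actual collation algorithm, and the two sample readings are invented for illustration, not quoted from Darwin.

```python
import difflib

def collate(base, witness):
    """Return (base_reading, witness_reading) pairs where the texts differ."""
    base_words, witness_words = base.split(), witness.split()
    matcher = difflib.SequenceMatcher(None, base_words, witness_words)
    variants = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only the divergent readings
            variants.append((" ".join(base_words[i1:i2]),
                             " ".join(witness_words[j1:j2])))
    return variants

# Invented sample readings standing in for two editions of a text:
ed1 = "natural selection acts solely by the preservation of variations"
ed2 = "natural selection acts only through the preservation of variations"
print(collate(ed1, ed2))  # → [('solely by', 'only through')]
```

Run across whole editions, this kind of word-level collation is what lets scholars see a text’s revision history at a glance.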
Although the advantages of the digital edition are plentiful, and there is significant interest in making printed texts available in edited electronic form, there are many challenges that confront scholars trying to make the leap to digital. The first is that while packages have advanced and become more usable, there is still a significant amount of technical overhead involved in setting up a collection of digital editions – enough to scare away people who may be uncomfortable in a terminal environment or with a command-line text editor. A second problem is the lack of a unified software solution: as noted above, many different packages come together to produce a digital edition, some of which run on different operating systems, which can add steps and headaches to the process. Finally, licensing can be an issue. If ABBYY is used as the OCR engine, universities have to pay for a page quota that can quickly be exhausted through beginner error. EMiC has completed a switch to completely open source software, but this is still not universal. Irvine pointed out that his group is seeking to address these issues, with the ultimate goal of a completely cloud-based solution no harder to use than WordPress, though that solution may still be some way off. For this reason he emphasizes the importance of making training and programs available, since many people will be willing to participate in the digitization process if they receive help and guidance.
Dean Irvine’s TwTT talk opened the black box of the virtual collection and the digital edition, showing not only the features of enhanced texts but also how users can create digital editions themselves. Next week’s TwTT talk will expand on adding depth to virtual texts with Carol Chiodo’s presentation, “Blogging Dante’s Comedy: Beatrice in the Tag Cloud.” See you there!
For full coverage of this session, please click the video below (note a slight delay upon initial playback):