Arts Special Collections in the Classroom

In the past year, more students have entered the classroom of the Haas Arts Library than the study room. Special Collections librarians Jae Rossman and Molly Dotson joined us this Tuesday to discuss what this means, what is available through the arts library, and how instructors have taken advantage of the Haas Special Collections.

The first step to integrating special collections into the classroom is to realize that the arts library has been reorganized to make accessing materials as straightforward as possible. A fifteen-seat seminar room with its own projector has been installed on site, and rather than being spread throughout different libraries and exhibition spaces, the arts Special Collections are all housed either in the Haas library or offsite at the Library Shelving Facility (LSF). Although items at the LSF must be requested online, a location filter for “Haas Special Collections” on Orbis will still find them, making searching for relevant materials more intuitive.

The materials available through special collections are diverse. The Haas Arts Library holds materials from the Faber Birren Collection of Books on Color to a leading collection of bookplates and the Arts of the Book Collection on the history of printing and typography. With a focus on accessibility, the classroom and related materials are not only used by Yale instructors, but also by other regional universities and even the local arts magnet high school. Classes that use the space – over 40 in the past year – typically invite an arts librarian to comment on the materials being discussed. Having the classroom inside the library also allows for a degree of student interaction with sensitive materials that would be impossible otherwise – for example, students are allowed to touch and look at parts of the collection relevant to their work.

Jae and Molly decided to use three classes to highlight the value of special collections resources in teaching, two in Yale College and one at the graduate level. Jessica Helfand’s freshman seminar “Studies in Visual Biography” has a session in the Haas library where students are able to interact with relevant collections. Richard Rose’s college seminar “Art of the Printed Word” uses the Arts of the Book collection to give students exposure to historical bookmaking in advance of their final project of making a book.

Anna Craycroft’s graduate course on painting represents a creative use of special collections in which research itself is treated as an artistic process. To accomplish this, students work with the Yale database of finding aids, engaging each finding aid as a depiction of one possible way to think about the collection it represents.  Once again, holding a session in a place where students have simultaneous access to the database and to the materials themselves makes course objectives achievable that would otherwise be very difficult.

The Haas Arts Library’s Special Collections can suit a variety of needs, and Jae and Molly pointed out that class programs can be tailored to match the level of the students being taught.  Since librarians are almost always involved in the class, students can also explore the collections from a perspective they may not have encountered before.  Even with the growing number of classes using the Special Collections, librarians are still happy to help professors set up a class at Haas.  For more information, email Jae Rossman directly, at least a week in advance of the desired classroom date.  For frequently asked questions, see the Special Collections’ access policies page here.

For full coverage of this session, please click the video below (note a slight delay upon initial playback):

Blogging Dante’s Comedy: Beatrice in the Tag Cloud

Over the past two weeks TwTT has peered behind the user interface to reveal the technology at work in the production of digital editions and the analysis of text databases.  This week, Carol Chiodo, a PhD candidate at Yale’s Italian Language and Literature program, was Virgil for the session, guiding us through the other side of academic technology, and showing how a deceptively simple blog can prompt students to truly engage with a text and work together in the production of unique and meaningful criticism.

Why a blog?

Teaching Dante’s work, Carol points out, is a bit of an economics problem.  The task itself is daunting – many universities split the minor works and the Divina Commedia into two semesters.  Meanwhile, meaningfully “engaging” the text can easily lead to a dissertation.  The resources available are also limited – the class must be accessible to students with a background in neither the subject nor the original language, and Giuseppe Mazzotta’s Dante in Translation is taught in a single fall term with two 75-minute lectures and a 50-minute discussion section each week.  With such serious time and information constraints, a successful presentation depends not only on the ability to convey information understandably, but also on encouraging students to spend time considering, analyzing, and discussing the text – the essence of meaningful engagement.

The key to a successful class therefore becomes taking advantage of all available teaching resources.  Two in particular stand out to Carol: the campus technology infrastructure, and other students.  Rather than looking at the Yale WiFi network as the Facebook-bringing bane of the classroom, Carol encourages educators to see it as a communication platform where professors and teaching fellows can bring students together – both with each other and with the text – providing an alternative to the shallow insights of Wikipedia or SparkNotes.  Peer interaction is also crucial.  By encouraging discussion and analysis beyond the classroom, students begin to think more critically about the works, and a space is created for engagement that is always open, always communal, and always subject to discussion and peer review.

Professor Mazzotta’s class is itself no stranger to technology.  In 2004, a CMI2 project helped bring multimedia and more efficient presentation into the classroom.  Then, in 2008, the class was made available through the Open Yale Courses project.  Nonetheless, the student-centered focus the blog would provide was new for the class, and even though students were clearly making use of the other enhancements, how much they would embrace the blog was an open question.

What happened?

Carol’s section was the only one of the four to use a blog, set up with help from the Yale ITG, and her students were some of the most successful in the class.  She admits that it was initially difficult to convince students to write and tag entries, particularly since no additional course credit was given, but in a very short time the worth of the blog became evident, and students became enthusiastic about writing entries and commenting on the work of their peers.  Students who were reluctant to comment verbally in the seminar room were sometimes prolific posters on the blog, and their ideas were opened up for consideration.

In short order, three advantages of the blog became clear, which eventually led to superior final papers.  First, students were obligated to constantly formulate, express, and defend ideas in writing.  Second, tagging allowed for a visual representation of what students were finding and considering important, allowing themes to be easily traced over time.  Finally, students were able to build on each other’s ideas to assemble a body of analysis that, by the end of the course, was complete enough to serve as the sole secondary text for the term paper.
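
To make the tag-cloud idea concrete, here is a toy Python sketch of how a cloud emerges from tagging: count how often each tag appears and scale its display by that count. The tags below are invented for illustration; they are not actual tags from the course blog.

    # Toy sketch: build a crude "tag cloud" by counting tag frequencies.
    # The tags are invented examples, not data from the course blog.
    from collections import Counter

    post_tags = [
        ["beatrice", "allegory"], ["virgil", "exile"], ["beatrice", "light"],
        ["exile", "florence"], ["beatrice", "allegory", "light"],
    ]

    counts = Counter(tag for tags in post_tags for tag in tags)
    for tag, n in counts.most_common():
        # Display size grows with frequency, as in a visual tag cloud.
        print(f"{tag:<10} {'#' * n}  (relative size {n})")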

Beyond these benefits, Carol points out that the blog actually improved writing quality by bringing peer review and accountability into every week’s discussion and into the term paper.  She notes that her own involvement in the blog, after getting it running, was minimal; she rarely needed to comment or moderate.  Blog posts were shown on the classroom projector in every section meeting, encouraging students to submit meaningful and timely entries.  The web format made commenting and review easy, which fostered a care in writing that was reflected in the final papers.  One third of her students wrote term papers of such caliber that they were submitted to the Dante Society’s undergraduate competition.

What next?

Carol’s presentation reveals how a blog can be used to unite two resources that are frequently underutilized in a lecture class – the internet and peer interaction.  By exploiting this untapped potential, students can be encouraged not only to engage texts aggressively but also to generate and defend interpretations in the face of peer review.  The ultimate product, she argues, is a demonstrably better term paper and a firmer grasp of the material than would be possible in a more passive environment.

For full coverage of this session, please click the video below (note a slight delay upon initial playback):

EMiC and Making Digital Editions

Last week’s TwTT session explored some of the impressive analytical tools that can be applied to archives once they become machine readable.  This week, Dean Irvine, visiting Yale from Dalhousie University in Nova Scotia, gave an engaging and technical talk on the crafting of the digital edition – the process that takes printed text to hypertext while adding layers of functionality.  By the end of the talk, attendees had been introduced not only to leading content management systems and workflows, but also to the steps involved in producing a digital edition, the obstacles most frequently encountered, and tips on where to begin and what kind of advantages a digital edition can provide.

Irvine’s work is associated with the “Editing Modernism in Canada,” or EMiC, project, a 33-university endeavor with roots in the classroom.  EMiC uses new technology to stimulate interest in Canadian modernist studies, and part of the program is an annually updated summer course in creating digital editions using the newest available technology.  The yearly changes are unveiled at a workshop held before the program begins – this year at Yale in May 2012.  Dean’s TwTT presentation gave a sneak peek at what will be revealed at that workshop and at the latest developments in the production of digital editions.

Before delving into the production of digital editions, a little should be said about what a digital edition is and what makes it special.  A digital edition is an edited electronic version of a printed work that has been enhanced with annotations and metadata tags to increase the usability and value of the text.  Digital editions can exist as text only (content, no form), which has the advantage of being easily searched, read, and annotated, but loses the original image and character of the work – a particular loss when dealing with letters, handwritten manuscripts, and other historical documents where structure and form are at least as significant as content.  “Digital page” editions also exist, which are simply scanned images (form, no content) that are easy to produce and accumulate but difficult to use.  What EMiC found was that students wanted the best of both worlds – an easily searchable and annotatable text that preserves the original form and structure – and this led to the creation of the “image-based edition” (form and content).  Dean points out that the best way to think of the image-based edition is as a multi-layered object.  Looking at the screen, the reader sees an image of the document, but behind the screen the computer can read, search, and mark up the text, delivering a level of usability that is simply not possible in a printed work while preserving the aesthetic quality that is frequently lost in internet editions.
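
One way to picture the multi-layered object is as a small data structure that keeps the page image, the machine-readable text, and the markup linking them side by side. The following Python sketch uses invented field names for illustration; it is not EMiC’s actual data model.

    # Schematic sketch of an image-based edition page as a layered object:
    # the image the reader sees, the text the computer searches, and zones
    # tying regions of the image to their transcriptions.
    from dataclasses import dataclass, field

    @dataclass
    class Zone:
        x: int              # region of the page image, in pixels
        y: int
        width: int
        height: int
        transcription: str  # the text found in that region

    @dataclass
    class Page:
        image_file: str                            # what the reader sees
        ocr_text: str                              # what the computer searches
        zones: list = field(default_factory=list)  # markup linking the two layers

        def contains(self, term: str) -> bool:
            return term.lower() in self.ocr_text.lower()

    page = Page(
        image_file="letter_p1.jpg",
        ocr_text="My dear friend, the press arrived on Tuesday...",
        zones=[Zone(120, 80, 900, 60, "My dear friend,")],
    )
    print(page.contains("press"))  # True: searchable, though the reader sees an image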

The path of converting a scanned page into a digital edition was the next topic of discussion, and Irvine walked the audience through the steps of crafting one, the first of which is selecting a content management system.  A CMS is a tool for managing collections and workflows, and two are frequently used by EMiC students in the production of digital editions: Omeka and Islandora.  While Omeka (colloquially called “WordPress for scholars”) is easier to use and may be a good answer if your institution has limited support for the digital humanities, Islandora is a more powerful solution that combines a Drupal front end with a Fedora Commons back end, and it is the one used by EMiC, the Smithsonian, and others.  While both are used in the presentation of digital editions, it should be noted that they are not limited to electronic books; they are scalable systems that can also accommodate photographs, audio, and other mixed media.  The CMS can be thought of as the library or exhibition space that sets the rules and format of your exhibition and also holds the content.  An example of the possibilities of an Islandora system can be seen here.

Once a CMS has been selected, Dean pointed out, the process of actually making a digital edition is designed to be deceptively simple.  Uncompressed TIFF files are uploaded to the system, the user fills out a few metadata tags, and presses an “ingest” button.  Hiding behind the ingest button is a flowchart of processing and formatting, handled by the computer, that does not fit onto a single slide.  The changes most relevant to the user are the recognition of text characters (OCR), restructuring into XML (a markup language that holds the content), and transformation via XSLT (a stylesheet language that describes how that content is structured and presented).  Images are converted to JPEG format to improve compatibility and decrease file size.  Text is encoded to be compliant with the leading standard for humanities text encoding, known as TEI.  These steps are complex and introduce some issues that Irvine discusses later, but for the moment, what the end user sees is an output document that has been tagged and is ready for markup.
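
To make the hidden pipeline a little more concrete, here is a minimal Python sketch of the kind of steps an ingest button automates: converting a TIFF to JPEG, running OCR, and wrapping the result in a bare-bones TEI-style shell. The libraries used (Pillow and pytesseract) are stand-ins chosen for illustration, not the components of EMiC’s actual workflow.

    # Minimal stand-in for an "ingest" step: TIFF -> JPEG, OCR, and a
    # bare-bones TEI-style wrapper ready for later markup.
    from pathlib import Path
    from xml.sax.saxutils import escape

    from PIL import Image   # pip install Pillow
    import pytesseract      # pip install pytesseract (requires the Tesseract OCR engine)

    def ingest_page(tiff_path: str, out_dir: str = "ingested") -> Path:
        out = Path(out_dir)
        out.mkdir(exist_ok=True)

        # 1. Convert the archival TIFF to a smaller, web-friendly JPEG.
        image = Image.open(tiff_path).convert("RGB")
        jpeg_path = out / (Path(tiff_path).stem + ".jpg")
        image.save(jpeg_path, "JPEG", quality=85)

        # 2. Recognize the text on the page (OCR).
        text = pytesseract.image_to_string(image)

        # 3. Wrap the recognized text in a minimal TEI-flavored XML shell.
        tei = (
            '<TEI xmlns="http://www.tei-c.org/ns/1.0">\n'
            "  <text><body>\n"
            f'    <pb facs="{jpeg_path.name}"/>\n'
            f"    <p>{escape(text)}</p>\n"
            "  </body></text>\n"
            "</TEI>\n"
        )
        xml_path = out / (Path(tiff_path).stem + ".xml")
        xml_path.write_text(tei, encoding="utf-8")
        return xml_path

    # ingest_page("scans/page_001.tif")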

At the moment, text and image markup are not integrated in the same software package.  For image markup, a tool called the IMT is used, available under an open source license here.  For text, a number of editors are available, but Dean teaches an experimental XML editor called CWRC-writer.  Although it will eventually be released as an open source application, it is currently under development, and people affiliated with Yale or EMiC who are interested in using it should contact Dean directly.  Some other interesting markup solutions exist, including the TILE environment from the University of Maryland, which allows for image and text markup but relies too heavily on users programming their own plug-ins to be widely accessible.

After markup, the user has a digital edition of a text that is accessible via a content management system – so what comes next?  For one thing, the user is free to take advantage of the many benefits of a digital text.  Tagging and added metadata make it easier to find related works and to understand the content of a collection.  Software tools also exist to make the data come to life.  An example is Juxta, which allows users to collate and compare electronic texts, identifying and saving the differences between long texts.  The applications of this kind of data are manifold, including this example, which exposes changes in Darwin’s The Origin of Species across its various editions, reflecting shifts in scientific thought.  An upcoming project at the University of Toronto will examine the changes made to the complete works of Shakespeare over time.  Digital editions are becoming easier to make, and they add a level of depth and open doors to analysis that are not possible with print editions.
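
As a rough illustration of what collation software does, the short Python sketch below lines up two versions of a passage and reports the differences using the standard library’s difflib; it is a stand-in for Juxta, not Juxta itself, and the sample lines paraphrase a well-known revision between early editions of The Origin of Species.

    # Stand-in for collation: compare two versions of a passage and print a diff.
    import difflib

    earlier_edition = [
        "There is grandeur in this view of life,",
        "having been originally breathed into a few forms or into one;",
    ]
    later_edition = [
        "There is grandeur in this view of life,",
        "having been originally breathed by the Creator into a few forms or into one;",
    ]

    for line in difflib.unified_diff(earlier_edition, later_edition,
                                     fromfile="earlier edition",
                                     tofile="later edition",
                                     lineterm=""):
        print(line)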

Although the advantages of the digital edition are plentiful, and there is significant interest in making printed texts available in edited electronic form, many challenges confront scholars trying to make the leap to digital.  The first is that while software packages have advanced and become more usable, there is still a significant amount of technical overhead involved in setting up a collection of digital editions – enough to scare away people who may be uncomfortable in a terminal environment or with a command-based text editor.  A second problem is the lack of a unified software solution.  As reflected above, many different packages come together to produce a digital edition, some of which run on different operating systems, which can add steps and headaches to the process.  Finally, licensing can be an issue.  If ABBYY is used as an OCR engine, universities have to pay for a page quota that can quickly be exhausted through beginner error.  A switch to completely open source software has been completed at EMiC, but such a switch is still not universal.  Irvine pointed out that his group is seeking to address these issues, with the ultimate goal of a completely cloud-based solution no harder to use than WordPress, though that solution may still be some way off.  For this reason he emphasizes the importance of making training and programs available, since many people will be willing to participate in the digitization process if they receive help and guidance.

Dean Irvine’s TwTT talk opened the black box of the virtual collection and the digital edition, showing not only the features of enhanced texts but also how users can create digital editions themselves.  Next week’s TwTT talk will expand on adding depth to virtual texts with Carol Chiodo’s presentation “Blogging Dante’s Comedy: Beatrice in the Tag Cloud.”  See you there!

For full coverage of this session, please click the video below (note a slight delay upon initial playback):

How to do Your Own Topic Modeling

In the first Teaching with Technology Tuesday of the fall 2011 semester, David Newman delivered a presentation on topic modeling to a full house in Bass’s L01 classroom.  His research concentrates on data mining and machine learning, and for the past three years he has been working with Yale on an IMLS-funded project on the applications of topic modeling in museum and library collections.  In Tuesday’s talk, David broke down what topic modeling is and how it can be useful, and introduced a tool he designed to make the process accessible to anyone who can use a computer.

What is Topic Modeling and How is it Useful?

David introduced topic modeling as an “answer to information overload.” In short, it’s a system to have a computer automatically search and categorize large archives, combing them for patterns that can eventually be used to get a better idea of what’s inside.  The process works best when there are thousands to millions of documents involved, and the output can be thought of as a list of subject tags, although that description is not completely accurate.  As the computer sifts through the documents, it identifies words that repeat and words that co-occur.  It then identifies sets of these “tokens” and groups them together.  The result is a list of keyword groups that link to the documents that contain those keywords – a form of AI subject classification.
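
As a rough illustration of the idea, here is a short Python sketch using scikit-learn’s LDA implementation on a handful of invented toy documents; it is a generic stand-in for the approach David described, not the tool he presented (that tool is covered below).

    # Generic topic-modeling sketch with scikit-learn (pip install scikit-learn).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    documents = [
        "the orchestra performed a symphony and a concerto",
        "the museum exhibited paintings and sculpture from the renaissance",
        "the jazz band played an improvised concerto downtown",
        "renaissance sculpture and fresco painting filled the gallery",
    ]

    # Count word occurrences per document (the "tokens" the model works with).
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(documents)

    # Ask for two topics; real collections use thousands of documents and more topics.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)

    # Print the top keywords for each discovered topic.
    words = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(lda.components_):
        top = [words[i] for i in topic.argsort()[-5:][::-1]]
        print(f"Topic {topic_idx}: {', '.join(top)}")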

Although the computer can never be quite as creative or accurate as a human reader, it compensates with sheer volume – making topic modeling perfect for large data sets.  As books are scanned and archives digitized, topic modeling provides a fast way to help collections managers figure out what they are holding, and gives researchers better metadata with which to quickly find what they need.

Applications of topic modeling are diverse.  The NSF uses topic modeling to figure out what subjects are most active in publications, helping to produce “field surveys” that assist in funding decisions and understanding the state of research.  Historians can use topic modeling to try to identify changes in the historical record over time.  Social scientists may wish to identify trending topics on social networks.  Creative humanists can even model long books, although David concedes that the output, even in a long text divided by pages, can vary in quality.  At Yale, topic modeling is being applied to art metadata in the Haas Art and Architecture library in an effort to make collections more accessible to researchers.  With all of the applications of the technology, aspiring topic-modelers will be glad to know that Dr. Newman has helped to produce a piece of open-source software that makes the process accessible to anyone.

DIY Topic Modeling with the Topic Modeling Tool (TMT)

While topic modeling has applications in diverse disciplines, the amount of intensive computer work involved scares away many academics who could potentially benefit from the technique.  For this reason, the tool presented by David focuses on keeping the process simple and automated, allowing the researcher to spend more time analyzing and less time typing.

Accessible here, David’s software (called simply the “topic-modeling-tool”) is a graphical user interface for an existing open source project called MALLET, which is included in the download and does the behind-the-scenes heavy lifting.  Written in Java for maximum portability, the TMT allows users to import text files (either as individual files in a folder or as a single large text file), set a few options for how they want topics identified, and specify how many topic categories they want produced; a few minutes later, they get out both HTML- and CSV-formatted results listing the topics generated and the documents containing those topics.

Instructions and sample files are given on the website, and the options are intuitive enough to allow users to “learn by playing,” but David gave us some tips on how to approach topic modeling projects with the TMT.  Users should expect to increase the number of output topics if they want more precise results: if the goal is simply to identify documents that discuss music, 10 topics should be sufficient, but differentiating between types of music may require 20.  Results can also be made more specific through the use of stopwords, which the computer ignores as it models the documents; this can be used to cut down on word “polluters” – for example, text that appears frequently in bylines.  Thresholds for tagging can also be set to increase the resolution of the results – for example, requiring that a document repeat the key terms at least five times before it is tagged.
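
Continuing the scikit-learn stand-in from earlier (again, not the TMT itself), the sketch below shows the two knobs David described: extra stopwords to suppress “polluter” words, and a weight threshold before a document is tagged with a topic. The word lists and numbers are arbitrary illustrations.

    # Two tuning knobs, sketched with scikit-learn: custom stopwords and a
    # tagging threshold on the document-topic weights.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
    from sklearn.decomposition import LatentDirichletAllocation

    documents = [
        "staff report: the orchestra performed a symphony and a concerto",
        "staff report: the museum exhibited renaissance paintings and sculpture",
        "staff report: the jazz band played an improvised concerto downtown",
        "staff report: renaissance sculpture and fresco painting filled the gallery",
    ]

    # Treat byline-style boilerplate ("staff", "report") as extra stopwords.
    extra_stopwords = {"staff", "report"}
    vectorizer = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS | extra_stopwords))
    counts = vectorizer.fit_transform(documents)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)  # rows: documents, columns: topic weights

    # Tag a document with a topic only when that topic carries enough weight,
    # a rough analogue of the TMT's tagging threshold.
    threshold = 0.6
    for doc_idx, weights in enumerate(doc_topics):
        tagged = np.where(weights >= threshold)[0]
        print(f"document {doc_idx}: topics {tagged.tolist()}")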

In addition to being easy to use, TMT is not limited to English, and can process any language with clearly delimited words, including languages that use Cyrillic or Arabic alphabets.  Unfortunately, some East Asian languages pose a challenge as the computer has difficulty distinguishing between tokens.

What Next?

David’s presentation exposed some of the uses for and tools of topic modeling, and the TMT opens up this powerful system of analysis to almost anyone.  As some audience members pointed out, however, the greatest difficulty of topic modeling arguably comes from getting the data one wishes to analyze into a usable form.  Yale has a number of resources to help with this challenge, including an upcoming workshop on using the open source package R in conjunction with Google Documents for data mining, as well as next week’s TwTT workshop, which will include information on how to work with large archives in the humanities.

For full coverage of this session, please click the video below (note a slight delay upon initial playback):