Cleaning Data to Enhance Access and Standardize User Experience, Part I: Planning and Prioritization

Hi! This is Alicia Detelich, archivist at Manuscripts and Archives, and Christy Tomecek, project archivist at the Fortunoff Video Archive for Holocaust Testimonies. We are co-leaders of the ArchivesSpace Public User Interface (PUI) implementation project’s Data Cleanup and Enhancements Workgroup. Today we’re going to share a little bit about the initial planning efforts that our group has relied upon to guide us through this project.

Our five-member group has a variety of tasks to accomplish before the PUI “goes live” early next year, including:

  • Reviewing data across Yale’s 14 ArchivesSpace repositories
  • Determining the extent of cleanup and normalization required, and the approach(es) to making changes
  • Working with repository liaisons to ready data for ArchivesSpace publication
  • Creating, testing, and executing cleanup and normalization scripts
  • Performing quality control on updated data

Data cleanup and normalization is a full-time job for some archivists, and we all have other responsibilities. Because of our time constraints and the enormous quantities of metadata produced by Yale’s special collections repositories, we had to start by doing some hard thinking about which data issues would have the greatest impact on the security of our records and on the experience of our users, and to limit our work to just those areas. Keeping our expectations realistic and avoiding overcommitting has been an important part of this process, and has almost certainly kept us from going insane (for the most part).

During an early brainstorming session in which we identified a laundry list of potential data issues to address, we debated which of these issues would be “show-stoppers,” which ones should be prioritized in order to enhance access, and which we could afford to deal with at a later date. By the end of the meeting, we had settled on the following areas of focus:

Publication status

In our current system, YFAD, finding aids are published once a week after the documents are proofread for content and for correct EAD encoding. This process also includes using an XSLT transformation which, among other things, has certain stopgaps in place for suppressing data that may have been accidentally published or is restricted from researchers while still necessary for staff. In the ArchivesSpace PUI, our finding aids will now be published instantly and will not have these XSLT suppressions. This will require us to review our current finding aid data to ensure that nothing confidential, such as student or patient names, is accidentally made available. Repositories may also wish to unpublish records for collections that are still in process and that they cannot make available to researchers. To that end, we will work with representatives of each repository to identify any necessary changes to the current publication status of all resources, archival objects, accessions, and notes, and make these changes prior to the official launch date.

Date normalization

The ArchivesSpace PUI is more dynamic than our current system, and provides many more opportunities for filtering and faceting data. Because of this, it is much more important for us to have structured data that can be read and manipulated by machines. For instance, the PUI allows users to search for materials created during a given date range. This functionality requires that dates entered into ArchivesSpace be machine-readable. During previous migrations, date information was often added to the expression field, but not parsed into ArchivesSpace’s machine-readable beginning and end dates. Additionally, practices for formulating dates have varied widely among repositories – there are almost too many variations to count. While we may not be able to fix every single date issue, the more we can accomplish before launch, the more effective the PUI will be for our users.
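To make the idea of parsing a free-text date expression into machine-readable begin and end values concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the workgroup's actual script: the function name `parse_expression`, the `UNDATED` list, and the handful of patterns it recognizes are all hypothetical, and real finding aid data would need many more cases.

```python
import re
from typing import Optional, Tuple

# Hypothetical spellings of "undated"; a real audit would build this list from the data.
UNDATED = {"undated", "n.d.", "nd", "no date", "date unknown"}

# Optional "circa"-style prefix, then either a year range or a single year.
PREFIX = r"(?:circa|ca\.?|c\.)?\s*"
YEAR_RANGE = re.compile(PREFIX + r"(\d{4})\s*(?:-|\u2013|to)\s*(\d{4})$")
SINGLE_YEAR = re.compile(PREFIX + r"(\d{4})$")


def parse_expression(expression: str) -> Optional[Tuple[str, str]]:
    """Return (begin, end) years for simple expressions, or None.

    None covers both recognized "undated" values and anything this
    sketch cannot parse; such values would be left for human review.
    """
    text = expression.strip().lower()
    if text in UNDATED:
        return None
    match = YEAR_RANGE.match(text)
    if match:
        begin, end = match.group(1), match.group(2)
        # Reject reversed ranges rather than guessing which year is wrong.
        return (begin, end) if begin <= end else None
    match = SINGLE_YEAR.match(text)
    if match:
        return (match.group(1), match.group(1))
    return None
```

Returning `None` for anything unrecognized is a deliberately cautious choice: it keeps the script from writing a guessed date into a structured field.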

How many ways can you say ‘undated’?

Machine-actionable restrictions

Our conditions governing use and conditions governing access data are also top candidates for normalization. Since 2015, it has been possible for repositories to add structured restriction information, such as end dates or condition notes, to our resource and archival object records. Aeon is able to act upon these restrictions, letting users and staff know if an item is restricted, and allowing for appeal by researchers and review by repositories.

A great deal of work has already been done to add machine-actionable dates and local access restriction types to the records of some repositories, but there are still a number of resources and archival objects which could benefit. The impending integration of Aeon with ArchivesSpace, and the potential impact of this functionality on staff and on users, indicated to us that it should be one of our top priorities for clean-up and enhancement.

Note Labels

Yale’s special collections repositories have traditionally operated independently of one another, and so over time developed different policies and systems for doing descriptive work. This has resulted in a wide array of standards, jargon, and even grammar choices in our finding aids. One example is the variability in the way that descriptive notes are labeled. While this may not seem all that consequential, labels can affect a user’s experience quite drastically. It might not be clear that a “Summary” or “Description of the Papers” is the same as a “Scope and Content” note. It is important that the same type of note be called the same thing, no matter which repository a user is searching.

In our current YFAD setup, the display of labels is suppressed by the above-mentioned XSLT, but this will no longer be the case once the PUI is implemented. This necessitates a thorough evaluation of our note label usage, and an eventual policy decision about how notes should be labeled – across all repositories – going forward.
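An evaluation of label usage could start with something like the following Python sketch, which counts how often each label variant maps to a preferred form. The mapping contains only the three labels mentioned above. The names `CANONICAL_LABELS`, `normalize_label`, and `audit_labels` are hypothetical, and any real mapping would come out of the policy decision rather than from code.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# Illustrative mapping only: variant label (lowercased) -> preferred label.
CANONICAL_LABELS: Dict[str, str] = {
    "summary": "Scope and Content",
    "description of the papers": "Scope and Content",
    "scope and content": "Scope and Content",
}


def normalize_label(label: str) -> Optional[str]:
    """Return the preferred label, or None if the variant is not in the mapping."""
    # Collapse runs of whitespace, drop a trailing colon, ignore case.
    key = " ".join(label.split()).rstrip(": ").casefold()
    return CANONICAL_LABELS.get(key)


def audit_labels(labels: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count labels that map to a preferred form and those needing review."""
    mapped: Counter = Counter()
    unmapped: Counter = Counter()
    for label in labels:
        canonical = normalize_label(label)
        if canonical is None:
            unmapped[label] += 1
        else:
            mapped[canonical] += 1
    return mapped, unmapped
```

The unmapped counts are the useful part of an audit like this, since they show which labels still need a decision.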


Our repositories have long been adding URLs to note fields or digital object records. Unfortunately, we’ve been doing this so long that some of these links are likely broken. Directing users to a 404 page is never ideal, so we took this project as an opportunity to review our links to determine how many are broken. Though we aren’t necessarily testing the accuracy of the links – whether they direct the user to the intended web page – we just want to know (for now) if the links actually work.
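A check of whether links "actually work" can be sketched in Python with the standard library alone. This is an assumption-laden illustration, not the workgroup's script: the function names are hypothetical, the URL pattern is deliberately simple, and it uses HEAD requests, which some servers reject even when the page itself is fine.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Optional

# Simple pattern for URLs embedded in note text; it will miss unusual cases.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(note_text: str) -> List[str]:
    """Pull URLs out of free text, trimming trailing sentence punctuation."""
    return [url.rstrip(".,;") for url in URL_PATTERN.findall(note_text)]


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for a URL, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, timeout, malformed URL, etc.


def find_broken_links(
    note_text: str,
    get_status: Callable[[str], Optional[int]] = fetch_status,
) -> Dict[str, Optional[int]]:
    """Map each URL that failed (status >= 400 or no response) to its status."""
    broken: Dict[str, Optional[int]] = {}
    for url in extract_urls(note_text):
        status = get_status(url)
        if status is None or status >= 400:
            broken[url] = status
    return broken
```

Passing the status lookup in as `get_status` lets the function be tried against a fake lookup before it is pointed at live websites.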


With the exception of Manuscripts and Archives, most repositories at Yale are still using two systems to manage their descriptive and collection control metadata. YFAD and Aeon pull in descriptive information from ArchivesSpace, and container and location data comes from our ILS, Voyager.

Integration with Aeon is a major part of the PUI implementation project, and having complete and accurate container data in ArchivesSpace is a necessary part of the integration work. In order to ensure the accuracy of our top container data, we will need to compare what is in ArchivesSpace with what is in Voyager. Any discrepancies, particularly where there is data in Voyager but not in ArchivesSpace, will need to be resolved in conjunction with the repositories which created the data.
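The comparison itself is essentially a set difference. The Python sketch below assumes, purely for illustration, that container records exported from each system can be reduced to a mapping from barcode to location. The function name, the field choices, and the use of barcodes as the matching key are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class ContainerDiff(NamedTuple):
    only_in_voyager: List[str]
    only_in_archivesspace: List[str]
    location_mismatches: List[str]


def compare_containers(
    archivesspace: Dict[str, str], voyager: Dict[str, str]
) -> ContainerDiff:
    """Compare two barcode -> location mappings and report discrepancies."""
    aspace_barcodes = set(archivesspace)
    voyager_barcodes = set(voyager)
    shared = aspace_barcodes & voyager_barcodes
    return ContainerDiff(
        # Containers present in Voyager but missing from ArchivesSpace.
        only_in_voyager=sorted(voyager_barcodes - aspace_barcodes),
        only_in_archivesspace=sorted(aspace_barcodes - voyager_barcodes),
        # Containers in both systems whose recorded locations disagree.
        location_mismatches=sorted(
            barcode for barcode in shared
            if archivesspace[barcode] != voyager[barcode]
        ),
    )
```

Each list in the result would then go back to the repository that created the data, since the script can only report a discrepancy, not decide which system is right.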

Shared records and controlled value lists

One interesting side effect of importing EAD into ArchivesSpace is that any controlled values that are present in the EAD file can be added to the enumeration values list in ArchivesSpace. This has left us with some very messy controlled value data. For instance:


Normalizing these shared lists – most importantly those related to container and extent types – will present users with a more unified experience, and facilitate robust searching and faceting in the PUI. Removing duplicative or erroneous values will also help prevent messy data issues from recurring in the future.
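One way to surface likely duplicates in a controlled value list is to group values that become identical once case, spacing, and punctuation are ignored. The Python sketch below illustrates that idea with hypothetical function names and made-up sample values. It would not catch abbreviations or misspellings, which still need a human eye.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def value_key(value: str) -> str:
    """Reduce a value to lowercase letters and digits for comparison."""
    return re.sub(r"[^a-z0-9]+", "", value.casefold())


def find_duplicate_values(values: Iterable[str]) -> Dict[str, List[str]]:
    """Group values sharing a comparison key; keep groups with more than one member."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        groups[value_key(value)].append(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}
```

For example, `find_duplicate_values(["linear feet", "Linear Feet", "linear_feet", "boxes"])` groups the first three values under one key and leaves "boxes" out of the report.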

The (nearly completed!) project to clean up our name and subject records that Alison mentioned in her last post also dovetails nicely with this work. The addition of Library of Congress URIs and the eventual removal of duplicate records will greatly enhance the functionality of the PUI, and will remove duplicative or incorrect values that may confuse users.

All this in six months? No problem!

Current mood.

Since deciding on the scope of the work, we’ve undertaken the fun and slightly intense task of auditing our data in each of these areas. We’ll be back in another post to talk a bit about our data auditing process and reveal some of our most interesting results. For now though, we’re excited to do our part to help make the ArchivesSpace PUI more useful for staff and researchers.

Cooperation, Co-Everything, and One (of many) Excellent Question(s)

Hi, everybody.  Long-time reader, first-time poster.  I’m Mark Custer, and I’ve been working as an Archivist and Metadata Coordinator at the Beinecke Rare Book & Manuscript Library for just over two years now.  This past year, most of my job duties have centered on ArchivesSpace. In addition to co-chairing Yale University’s ArchivesSpace Committee with Mary Caldera, I co-taught two ArchivesSpace workshops last year that were offered by Lyrasis, a membership community of information professionals, which was formed by the combination of two other regional consortiums.  In October, I helped out at a Boston workshop as a trainer in training; and in December, I co-taught a workshop that was co-sponsored by the Rochester Regional Library Council and the University of Rochester.  Looking back on the year 2014, then, what stands out most to me in my professional life is the increasing importance and necessity of partnerships. The Latin prefix co- was everywhere, and I don’t think that this notion of co-everything will be taking a backseat anytime soon.

These partnerships are precisely the sorts of things that have me so excited about ArchivesSpace.  To me, the most important thing that is emerging from the ArchivesSpace project so far is the community, not the system — don’t get me wrong, though, I’m extremely impressed by how the software has been able to combine the features and functions of Archivists’ Toolkit and Archon into a single project in such a short amount of time!  I’d even venture to say that the community is not only influencing the development of the software by making itself known through its individual and institutional voices, but that the community is also showing signs that it intends to nourish and nurture that software with a collective voice.  And, full disclosure, I’m also currently serving on the ArchivesSpace Users Advisory Council, so if you don’t agree with that statement, please let me know.

Of course, there’s still a long way for us to go.  For instance, at the end of the two-day ArchivesSpace workshop in Rochester, one of the participants asked an excellent question, which I’ll paraphrase here:

“How can I adopt more efficient workflows using ArchivesSpace?”

Each of the instructors, myself included, as well as a few of the other participants, offered a few suggestions in response to this important question.  What struck me about those answers, though, is that none of the suggestions were ArchivesSpace-specific just yet.  That shouldn’t actually surprise me, given the relative newness of ArchivesSpace – both the software and the community – but it does remind me that we have a lot of work to do.  But it’s precisely this sort of work that I’d really like to see the archival community communicating more about in 2015.

As Maureen has already talked about in another blog post, one of the ways that we’d like to enable more efficient workflows in ArchivesSpace is to enhance its container management features, ideally by really letting those functions run in the background so that archivists can focus on archival description.  A few other (collective) workflows that I hope that ArchivesSpace will make more efficient include:

  • Assessing archival collections
  • Printing box and folder labels
  • Publishing finding aids to external aggregators, such as ArchiveGrid, automatically
  • Integrating with other specialized systems, such as Aeon, Archivematica (check out what the Rockefeller Archive Center has done with Archivematica and the AT in this blog post, for example!), Google Analytics, SNAC, Wikipedia, etcetera

I’d love to hear how others would like to create efficiencies using ArchivesSpace, so please leave comments here or send me an email.  I think that we need to strive for cooperative systems that promote cooperative data, including web-based documents, and I really do think that the ArchivesSpace community is poised to achieve those goals.

Building a Community Through ArchivesSpace Implementation

So far you have probably seen posts by my colleagues discussing the efforts to make ArchivesSpace work in our complex multi-repository environment at Yale. To date, we have evaluated the application in its present form, hired consultants to develop additional functionality, and are currently engaged in extensive testing. However, in addition to trying to effectively implement ArchivesSpace, we have also needed to consider how we might work together more effectively.

There are twelve discrete repositories at Yale that will be implementing ArchivesSpace. Currently many of these repositories work in their own instance of Archivists’ Toolkit or outside of an archives management system, and the archivists at each repository have developed some individual repository-specific methods for managing containers and describing materials. While we need to ensure that ArchivesSpace will work for us, in committing to a single, university-wide version of ArchivesSpace, the implementation of ArchivesSpace is providing us with a unique opportunity to further develop cooperation amongst the many repositories at Yale.

Much of our work to this end has been straightforward. For example, in the summer of 2014, our Committee standardized the controlled vocabulary lists in ArchivesSpace. However, some of our work has been more complex and far-reaching. In the fall of 2014, we interviewed archivists at all twelve repositories about their practices, including their approaches to managing containers and locations as well as their description of archival material, particularly non-paper formats. While these interviews began with the explicit goal of gaining a better understanding of procedures at Yale so that our Committee could make sure that our implementation of ArchivesSpace met everyone’s needs, during our discussions regarding description it became apparent that current practices are widely divergent among campus repositories, requiring further cross-repository discussion regarding the description of born-digital materials, digital surrogates, and A/V materials.

We have developed a task force consisting of archivists from multiple units on campus in order to determine basic guidelines for description of these types of materials. This task force will share its proposed description guidelines with all stakeholders at the University, responding to feedback and reaching consensus, with the goal of configuring Yale’s installation of ArchivesSpace to accommodate these guidelines.

We look forward to updating you on our progress and sharing our guidelines once they are complete.

Happy New Year!