Mucking around in ArchivesSpace Locally

It may occasionally be part of your job to get inside the guts of ArchivesSpace so that you can test a new release, diagnose a bug, or generally get a better sense of what’s going on with the system you rely on. Also, depending on your organizational structure and the way that resources are distributed, you may need to be in a position to help your local IT department answer questions about what’s happening with ArchivesSpace.

It’s common (and very smart) practice not to give anyone beyond a small cadre of IT department professionals access to server environments. And you wouldn’t want to experiment there anyway! Thus, you need a sandbox.

A few folks have asked for instructions about how to create this environment. In short, you’ll be installing ArchivesSpace on your local machine and then hooking it up to a copy of your production MySQL database. I’m assuming that your organization uses MySQL behind ArchivesSpace, because most do. There’s really great documentation about all of this on the ArchivesSpace GitHub page, but it may be overwhelming if you haven’t done this before. (By the way, if you’ve ever managed a database before, you really don’t need this tutorial.)

I’ll also talk through configuration, regular updates to the database from your production database, and things to think about when testing.
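To give you a flavor of the database piece of this, the setup usually amounts to creating an empty local database, creating a user for ArchivesSpace to connect as, and loading a dump of production into it. Here’s a sketch, run from the MySQL client — the database name, user, and password are the defaults suggested in the ArchivesSpace documentation, and the dump filename is made up, so adjust all of these to your environment:

```sql
-- Create an empty local database and a user for ArchivesSpace to connect as.
-- (Names and password here are illustrative; match them to your config.rb.)
CREATE DATABASE archivesspace DEFAULT CHARACTER SET utf8;
CREATE USER 'as'@'localhost' IDENTIFIED BY 'as123';
GRANT ALL PRIVILEGES ON archivesspace.* TO 'as'@'localhost';

-- Then, from your shell (not the MySQL prompt), load the copy of production:
--   mysql -u as -pas123 archivesspace < production_dump.sql
```

Once the dump is loaded, you point your local ArchivesSpace config at this database and let it run its setup scripts.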

Congratulations! You’re about to create a consequence-free environment. This means that you should feel free to mess around, take risks, and learn by breaking things. You’re going to break things.

Migration, step by step

Like Mary mentioned, we’re in! Migration was a five-week process that involved a lot of on-the-fly problem solving and the chance to really engage with how ArchivesSpace works. It also required setting alarms to check on scripts at 2:00 am. Thank goodness we’re done.

We work in a large, complex organization with a lot of local requirements. We also monkeyed around with AT quite a bit, and are moving into an ArchivesSpace environment with a lot of plug-ins. For most folks, it will be a lot easier than this. Here’s my documentation of what we’ve done and what I would think about doing differently if I were to do this again.

Keeping timestamps and creator names from AT

If you’ve already migrated from Archivists’ Toolkit to ArchivesSpace, you know that timestamps on your records will be reset to the time of migration, and the name associated with the creator of the record will be changed to “admin.” Here at Yale, since accessioning is such a serious activity (after all, it’s the moment when we take legal, physical, and intellectual control of records), we wanted to keep that information in ArchivesSpace. At this time, we’re fine with having this for accession records only, although this technique could be modified for other record types, too.

This was a project for Adam Shahrani, Mark Custer, and me. Basically, with Adam’s patient assistance, we wrote a SQL script to select cognate values from Archivists’ Toolkit and ArchivesSpace and update the ASpace values with the AT values.

The script for that is here, on GitHub. All relevant warnings apply. Obviously, you’ll need to update schema names. The other thing is that we did this in an environment where both databases were on the same server (localhost). Doing this update across servers would be more complicated.
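In outline, the approach looks something like the following. This is a sketch, not our production script: the schema names, the join condition, and the AT column names are all assumptions that you’d need to adapt to your own databases.

```sql
-- Copy audit fields from AT accessions into ASpace accessions.
-- Assumes both schemas live on the same MySQL server (localhost), as ours
-- did. Schema names (archivesspace, atdb), the join condition, and the AT
-- column names are hypothetical; check them against your own databases.
UPDATE archivesspace.accession AS sp
JOIN   atdb.Accessions         AS atk
  ON   sp.id = atk.accessionId        -- hypothetical join condition
SET    sp.create_time      = atk.created,
       sp.created_by       = atk.createdBy,
       sp.user_mtime       = atk.lastUpdated,
       sp.last_modified_by = atk.lastUpdatedBy;
```

As always with an UPDATE like this: run the corresponding SELECT first, and work on a copy of the database, not production.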

And here’s what it looks like:



Check out those sweet timestamps from 2012!

Reading Migration Errors and Fixing Our Data

We’ve been doing what feels like a zillion practice migrations to get our data ready for ArchivesSpace. Every time we do a test migration, we read the error reports to see what’s wrong with our AT data. From there, we clean up our AT database with the aim of a completely error-free migration when it’s time to do this for real. This is still in progress, but common errors and clean-up techniques are below.

We had to get around inherent problems with the AT -> ASpace migrator.

  •  Resource records and accession records that are linked to subjects and agents already in the database won’t migrate. This is a really, really bad one.
    Here’s what the error looks like:

    Endpoint: http://localhost:8089/agents/corporate_entities
    AT Identifier:Name_Corporate->Yale Law School.
    Status code: 400
    Status text: Bad Request
    {"error":{"names":["Agent must be unique"]}}

    Make no mistake — this is a “record save error”. It’s not just that the agent or subject is no longer linked — the whole finding aid or accession record actually isn’t migrating. That’s a no-go for us.
    Since we’re moving the records of eight repositories (in four different Archivists’ Toolkit databases) into a single ArchivesSpace instance, we knew that we wouldn’t be able to live with this error. We toyed with the idea of various hacks (prepending subjects and agents with a unique string in each database so that they wouldn’t repeat), but in the end we decided to contract with Hudson Molonglo to fix the importer. We’ll be happy to report more on that once the work is done.

Because of the requirements of our advanced container management plug-in, we had to make sure that existing data met compatibility requirements.

  • Barcodes and box numbers have to match up. If you have ten components with the same barcode where nine say box “1” and one says box “2”, the migrator can’t create a top container. We wrote a bunch of SQL reports to anticipate these problems, and have done a ton of clean-up over the last few months to make sense of them. In most cases, this required actually pulling down the materials and checking which components belonged to which containers and what their barcodes were. Many, many thanks to my colleague Christy who did this work.
  • Boxes can’t be assigned to more than one location (because of the laws of physics).
    Here’s what the error looks like; the relevant bit is the container_locations validation message:

    Endpoint: http://localhost:8089/repositories/2/batch_imports?migration=ArchivistToolkit
    AT Identifier:RU.121
    Status code: 200
    Status text: OK
     "errors": ["Server error: #<:ValidationException: {:errors=>{\"container_locations\"=>[\"Locations in ArchivesSpace container don't match locations in existing top container\"]}, :object_context=>{:top_container=>#<TopContainer @values={:id=>1963, :repo_id=>2, :lock_version=>1, :json_schema_version=>1, :barcode=>\"39002042754961\", :indicator=>\"1\", :created_by=>\"admin\", :last_modified_by=>\"admin\", :create_time=>2015-04-17 21:44:39 UTC, :system_mtime=>2015-04-17 21:44:39 UTC, :user_mtime=>2015-04-17 21:44:39 UTC, :ils_holding_id=>nil, :ils_item_id=>nil, :exported_to_ils=>nil, :legacy_restricted=>0}>, :aspace_container=>{\"container_locations\"=>[{\"ref\"=>\"/locations/124\", \"start_date\"=>\"Fri Apr 17 17:44:37 EDT 2015\", \"status\"=>\"current\"}], \"indicator_1\"=>\"1\", \"type_1\"=>\"box\"}, :top_container_locations=>[\"/locations/1\"], :aspace_locations=>[\"/locations/124\"]}}>"],
     "saved": []

    This may take a bit of explanation. Basically, as I wrote before, in the AT data model, the container indicator is just a piece of data that’s associated with every component. The database has no way of knowing that each of those components called box “8” actually refers to the same box. This can result in a lot of problems.
    One of those problems is that in the same collection, some components can be called box “8” and be assigned to one location, while other components can be called box “8” and be assigned to another location. Our migrator is trying to make sense of these containers in a more rigorous way, and it knows that the same box can’t be in two different places. Thus, it throws an error.
    I did these fixes in the database — in most cases, it’s only one or two components out of many that are associated with the errant location. We made a decision internally that we’re comfortable going with a majority-rules fix — in other words, if seventeen folders in a box are assigned to the off-site storage facility and one folder in the box is assigned to Drawer 8 in room B59, we’re pretty sure that the whole box is actually at the off-site storage facility. We have other data stores (our ILS, for instance) that we can use to double-check this data.
    If you want descriptive information about these components, this is actually kind of tricky — I’ve written about this report on my other blog, and you’re welcome to use it.

  • Box numbers have to be very, very literally the same. “17a” and “17A” aren’t the same. Neither are “8” and “8 ” (see the space?). We’ve written a pre-migration normalization script to help deal with some of these. Others we’ve fixed by updating the database, based on the error report.
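To give a sense of what these clean-up reports and fixes can look like, here are two sketches. The table and column names are my best recollection of the AT schema (ArchDescriptionInstances, container1Indicator) and may not match your database exactly, so verify them before running anything — and, as above, work on a copy.

```sql
-- Report: barcodes attached to more than one box number in AT.
-- (Table and column names are assumptions; check your own schema.)
SELECT barcode,
       GROUP_CONCAT(DISTINCT container1Indicator) AS box_numbers,
       COUNT(DISTINCT container1Indicator)        AS how_many
FROM   ArchDescriptionInstances
WHERE  barcode IS NOT NULL AND barcode <> ''
GROUP  BY barcode
HAVING COUNT(DISTINCT container1Indicator) > 1;

-- Normalization: strip stray whitespace from box numbers ("8 " vs "8").
UPDATE ArchDescriptionInstances
SET    container1Indicator = TRIM(container1Indicator)
WHERE  container1Indicator <> TRIM(container1Indicator);
```

The first query is the kind of report that tells you which boxes need to be pulled and checked by hand; the second is the kind of fix that can safely be done in bulk.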

Some errors are just warnings, but are really good to clean up.

  • Two collections probably shouldn’t have the same EADID. In most cases, this is a typo and easily fixed by looking at the record.
  • Begin dates shouldn’t be later than end dates
    Here’s the error:

    End date: 1061 before begin date: 1960, ignoring end date
    Record:: Resource Component: RU.126/ref14250

    Haha. That’s probably not from before the Battle of Hastings. We can just fix that typo.

  • Digital objects shouldn’t have the same identifier. Again, usually just a typo. The record will still migrate but the migrator will append a string to the end of the identifier to make it unique.
  • Miscellaneous other stuff. For instance:
    Endpoint: http://localhost:8089/repositories/2/batch_imports?migration=ArchivistToolkit
    AT Identifier:RU.703
    Status code: 200
    Status text: OK
     "errors": ["Server error: #<:ValidationException: {:errors=>{\"notes/0/subnotes/1/items/168/label\"=>[\"Property is required but was missing\"]}}>"],
     "saved": []

    Finding and fixing this was a huge pain in the neck. We had a list, deeeeeep into this collection, that was missing a label for one of the entries. Once we found it, we found that we couldn’t edit it in the application and had to figure out how to fix it in the database. We all learned a lot that day.

We’re very happy to hear from others about what errors they’ve found during their migrations, and how they’ve gone about fixing them!

Working more efficiently with ArchivesSpace — some use cases

We’ve talked a bit already about our work with a vendor to make sure ArchivesSpace supports efficient workflows. After reading Mark’s blog post, I’m re-energized to think of the good this work will do — archivists in repositories will be able to do collection control projects in a far more robust and efficient way!

Right now, we are deeeeeeeeep into this work. During our planning call last night, one of the software developers asked for use cases for some of our requirements for selecting and changing information about containers — I was going to just make a screencast and notes to put in our project management software, but it occurred to me that this could be useful information to document more publicly, especially since folks are asking about how ArchivesSpace can help them do their work more efficiently.

So, here we go.

Here are some use cases.

Managing information about containers in bulk

An entire collection has been described, and I know for sure which containers I’ll be using. I’m now ready to slap barcodes on these boxes, tell the system what these barcodes are, tell the system what kinds of boxes I’m using (we call this a container profile), associate these boxes with a location, and associate an ILS holdings record identifier.

In order to do this, I need an easy way of saying “yeah, I’m looking for boxes that belong to this resource” or even “yeah, I’m looking for boxes that belong to this series.” In our environment (and I know this happens elsewhere too), it’s very common for each series to start over with box 1. So, when I’m at the point of putting a barcode on a box, I need to be sure I know which box 1 I’m sticking that barcode on.[1]
Holy Cow! Look at all of those box ones!

In Archivists’ Toolkit, if you want to do this work, you can only scope it to a single resource record. We know that it might be desirable to do bulk actions across several collections, so we want the option to scope this work to a single resource, but we don’t want to be stuck with it.

And then, in the current Archivists’ Toolkit plug-in, you would pick the containers you want and update various fields. We’ve been thinking slightly differently about which fields we would want to update and how, but suffice it to say that we would want to pick boxes that share an attribute (like location, ILS holdings ID, container type, whatever), and then be able to enter data about that attribute and expect the changes to propagate across containers. [2]

This is really exciting because currently in ArchivesSpace, every time that you associate a container with a component (like “Correspondence from Anne Shirley to Diana Barry 1877”), you would have to enter the barcode anew. This obviously isn’t very efficient, and it can result in a whole lot of errors. In the workflow we’re proposing, you would be able to know that the box one for each of those components is the SAME box one, and you’d only have to enter the barcode once.

Managing the relationship between containers and locations in bulk

Here’s another use case: Maybe I’m doing a shelf-read. I come to a location in my repository that’s described in a location record. But maybe the location associated with those containers in the database isn’t correct! I want a quick and easy way of selecting the containers in my repository that I see in that location and associating those with the appropriate location record. This is currently impossible to do in ArchivesSpace — in fact, it does a really screwy thing where if you look up a location (like “Library Building, A Vault, Row 1, Bay 1, Shelf 1” ) it gives you a list of all the archival objects there (in other words, intellectual description), not the containers! You don’t want to go into every intellectual description and update the repeating box information. This work would make it possible to say “Boxes 3, 4, 5, 7, and 10 from MSS.111, series 4 all belong in this location” and for the system to know that.

Or maybe that location is a pallet or a set of shelves that are designated as needing to go off-site. In order to document the fact that they are, indeed, going off-site, I want a quick and easy way to pick those containers and update the ILS holdings ID. If the pallet or location is itself barcoded, that makes it even easier to disambiguate what I’m trying to do! [3]

I hope you’ve enjoyed this journey through how we want ArchivesSpace to work. Obviously, software development is an ever-changing endeavour, ruled by compromise and the art of the possible. So don’t take these use cases as promises. But it should give you a good sense of what our priorities and values are as we approach this project.

[1] Our user story for this is: “BULK SELECTION: As an archivist, I want an easy and fast way to choose all or some (contiguous and non-contiguous) records in a result set to further act upon.”

[2] Our user story for this is: “BULK SELECTION: As an archivist, I would like to define a set of container records that are associated with descendant archival_objects of an archival_object with a set Component Unique Identifier in order to perform bulk operations on that set.” which is obviously a perfect storm of software jargon and archives jargon, but basically we’re saying that we need to know which series this belongs to. Since series information is stored in the component unique identifier, that’s what we want to see.

[3] Our user story for this is: “BULK SELECTION: As an archivist, I would like to define a set of container records associated with a location in order to perform bulk operations on that set. When defining a set of container records by location, I would like the option to choose a location by its barcode.”

Managing Content, Managing Containers, Managing Access

In my last blog post, I talked a bit about why ArchivesSpace is so central and essential to all of the work that we do. And part of the reason why we’re being so careful with our migration work is not just because our data is important, but also because there’s a LOT of it. Just at Manuscripts and Archives (the department in which I work at Yale, which is one of many archival repositories on campus), we have more than 122,000 distinct containers in Archivists’ Toolkit.

With this scale of materials, we need efficient, reliable ways of keeping physical control of the materials in our care. After all, if a patron decides that she wants to see material in one specific container among the more than 122 thousand, we have to know where it is, what it is, and what its larger context is in our collections.

Many years ago, when Manuscripts and Archives adopted Archivists’ Toolkit (AT) as our archival management system, we developed a set of ancillary plug-ins to help with container management. Many of these plug-ins became widely adopted in the greater archival community. I’d encourage anyone interested in this functionality to read this blog post, written at the time of development, as well as other posts on the AT@Yale blog  (some things about the plug-in look marginally different today, but the functions are more or less the same).

In short, our AT plug-in did two major things.

  1. It let us manage the duplication of information between AT and our ILS
    At Yale, we create a finding aid and a MARC-encoded record for each collection*. In the ILS, we also create “item records” for each container in our collection. That container has an associated barcode, information about container type, and information about restrictions associated with that container.
    All of this information needs to be exactly the same across both systems, and should be created in Archivists’ Toolkit and serialized elsewhere. Part of our development work was simply to add fields so that we could keep track of the record identifier in our ILS that corresponds to the information in AT.
  2. It let us assign information that was pertinent to a single container all at once (and just once).
    In Archivists’ Toolkit (and in ArchivesSpace too, currently), the container is not modeled. By this I mean that if, within a collection, you assign box 8 to one component and also box 8 to another component, the database has not declared in a rigorous way that the value of “8” refers to the same thing. Adding the same barcode (or any other information about a container) to every component in box 8 introduces huge opportunities for user error. Our plug-in for Archivists’ Toolkit did some smart filtering to create a group of components that have been assigned box 8 (they’re all in the same collection, and in the same series too, since some repositories re-number boxes starting with 1 at each series), and then created an interface to assign information about that container just once. Then, in the background, the plug-in duplicated that information for each component that was called box 8.
    This wasn’t just about assigning barcodes and Voyager holdings IDs and BIBIDs — it also let us assign a container to a location in an easy, efficient way. But you’ll notice in my description that we haven’t really solved the problem of the database not knowing that those box 8’s are all the same thing. Instead, our program just went with the same model and did a LOT of data duplication (which you database nerds out there know is a no-no).
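You can see that duplication directly in the database: every component row carries its own copy of the container data, so the same barcode shows up once per component. Here’s a sketch of a query that surfaces this (the table and column names are assumptions based on the AT schema; check yours before running it):

```sql
-- Count how many component rows repeat each barcode in AT.
-- (ArchDescriptionInstances and barcode are assumed names.)
SELECT barcode, COUNT(*) AS components_sharing_barcode
FROM   ArchDescriptionInstances
WHERE  barcode IS NOT NULL AND barcode <> ''
GROUP  BY barcode
ORDER  BY components_sharing_barcode DESC
LIMIT  10;
```

In a properly modeled system, each of those barcodes would live on exactly one container record, and components would simply point at it.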

Unfortunately, ArchivesSpace doesn’t yet model containers, and as it is now, it’s not easy to declare facts about a container (like its barcode or its location) just once. Yale has contracted with Hudson Molonglo to take on this work. Anyone interested in learning more is welcome to view our scope of work with them, available here — the work I’m describing is task 6 in this document, and I look forward to describing the other work they will be doing in subsequent blog posts. We’ve also parsed out each of the minute actions that this function should be able to do as a set of user stories, available here. Please keep in mind that we are currently in development and some of these functions may change.

Once this work is completed, we plan to make the code available freely and with an open source license, and we also plan to make the functions available for any repository that would like to use them. Please don’t hesitate to contact our committee if you have questions about our work.


* We (usually/often/sometimes depending on the repository) create an EAD-encoded finding aid for a collection at many levels of description, and also create a collection-level MARC-encoded record in a Voyager environment. This process currently involves a lot of copying and pasting, and records can sometimes get out of sync — we know that this is an issue that is pretty common in libraries, and we’re currently thinking of ways to synchronize that data.

Making Our Tools Work for Us

Metadata creation is the most expensive thing we do.

I hear myself saying this a lot lately, mostly because it’s true. In the special collections world, everything we have is unique or very rare. And since we’re in an environment where patrons who want to use our materials can’t just browse our shelves (and since the idea of making meaning out of stuff on shelves is ludicrous!), we have to tell them what we have by creating metadata.

Creating metadata for archival objects is different from creating it for a book — a book tells you more about itself. From a book’s title page, one can discern its title, its author, and who published it. Often, an author will even write an abstract of what happens in the book, and someone at the Library of Congress will have done work (what we call subject analysis) to determine its aboutness.

In archives, none of that intellectual pre-processing has been done for us. Someone doing archival description has to answer a set of difficult questions in order to create high-quality description — who made this? Why was it created? What purpose did it serve for its creator? What evidence does it provide about what happened in the past? And the same questions have to be addressed at multiple levels — what is the meaning behind this entire collection? What does it tell us about the creator’s impact on the world? What is the meaning behind a single file collected by the creator? What purpose did it serve in her life?

Thus, the metadata we create for our materials is also unique, rare, intellectually-intensive, and essential to maintain.

Here, as part of a planning session, Mary and Melissa are talking through which tasks need to be performed in sequence.

Currently, we use a tool called Archivists’ Toolkit to maintain information about our holdings, and this blog is about our process of migrating to a different tool, called ArchivesSpace. Because, like I say, this data is expensive and unique, we’ve taken a very deliberative and careful approach to planning for migration.

We’re lucky to have a strong group, with diverse backgrounds. Mary Caldera and Mark Custer are our co-chairs, and have strong management and metadata expertise between them. Melissa Wisner, our representative from Library IT, has a background in project management and business analysis. She was able to walk us through our massive project planning, and helped us understand and make sense of the many layers of dependencies and collaborations that will all have to be executed properly in order for this project to be successful. Others on the group include experts in various archival functions and standards. And beyond this, we have established a liaison system between ourselves on the committee and archivists from other repositories at Yale, so we can make sure that the wisdom of our wider community is being harnessed and the transition to this new system is successful for all of us.

Anyone interested in viewing our project timeline is welcome to see it here. We know that other repositories are also involved in transition to ArchivesSpace, and we would be happy to answer questions you may have about our particular implementation plan.