Reading Migration Errors and Fixing Our Data

We’ve been doing what feels like a zillion practice migrations to get our data ready for ArchivesSpace. Every time we do a test migration, we read the error reports to see what’s wrong with our AT data. From there, we clean up our at database with the aim of a completely error-free migration when it’s time to do this for real. This is still in progress, but common errors and clean-up techniques are below.

We had to get around inherent problems with the AT -> ASpace migrator.

  •  Resource records and accession records that are linked to subjects and agents already in the database won’t migrate. This is a really, really bad one.
    Here’s what the error looks like:

    Endpoint: http://localhost:8089/agents/corporate_entities
    AT Identifier:Name_Corporate->Yale Law School.
    Status code: 400
    Status text: Bad Request
    {"error":{"names":["Agent must be unique"]}}

    Make no mistake — this is a “record save error”. It’s not just that the agent or subject is no longer linked — the whole finding aid or accession record actually isn’t migrating. That’s a no-go for us.
    Since we’re moving the records of eight repositories (in four different Archivists’ Toolkit databases) into a single ArchivesSpace instance, we knew that we wouldn’t be able to live with this error. We toyed with the idea of various hacks (pre-pending subjects and agents with a unique string in each database so that they wouldn’t repeat), but in the end we decided to contract with Hudson Molonglo to fix the importer. We’ll be happy to report more on that once the work is done.

Because of the requirements of our advanced container management plug-in, we had to make sure that existing data met compatibility requirements.

  • Barcodes and box numbers have to match up. If you have ten components with the same barcode where eight say box “1” and one says box “2”, the migrator can’t create a top container. We wrote a bunch of sql reports to anticipate these problems, and have done a ton of clean-up over the last few months to make sense of them. In most cases, this required actually pulling down the materials and checking which components belonged to which containers and what their barcodes are. Many, many thanks to my colleague Christy who did this work.
  • Boxes can’t be assigned to more than one location (because of the laws of physics).
    Here’s what the error looks like. The relevant bits are in red text:

    Endpoint: http://localhost:8089/repositories/2/batch_imports?migration=ArchivistToolkit
    AT Identifier:RU.121
    Status code: 200
    Status text: OK
    {
     "errors": ["Server error: #<:ValidationException: {:errors=>{\"container_locations\"=>[\"Locations in ArchivesSpace container don't match locations in existing top container\"]}, :object_context=>{:top_container=>#<TopContainer @values={:id=>1963, :repo_id=>2, :lock_version=>1, :json_schema_version=>1, :barcode=>\"39002042754961\", :indicator=>\"1\", :created_by=>\"admin\", :last_modified_by=>\"admin\", :create_time=>2015-04-17 21:44:39 UTC, :system_mtime=>2015-04-17 21:44:39 UTC, :user_mtime=>2015-04-17 21:44:39 UTC, :ils_holding_id=>nil, :ils_item_id=>nil, :exported_to_ils=>nil, :legacy_restricted=>0}>, :aspace_container=>{\"container_locations\"=>[{\"ref\"=>\"/locations/124\", \"start_date\"=>\"Fri Apr 17 17:44:37 EDT 2015\", \"status\"=>\"current\"}], \"indicator_1\"=>\"1\", \"type_1\"=>\"box\"}, :top_container_locations=>[\"/locations/1\"], :aspace_locations=>[\"/locations/124\"]}}>"],
     "saved": []
    }

    This may take a bit of explanation. Basically, as I wrote before, in the AT data model, the container indicator is just a piece of data that’s associated with every component. The database has no way of knowing that each of those components called box “8” actually refer to the same thing. This can result in a lot of problems.
    One of those problems is that in the same collection, some components can be called box “8” and be assigned to one location, while other components can be called box “8” and be assigned to another location. Our migrator is trying to make sense of these containers in a more rigorous way, and it knows that the same box can’t be in two different places. Thus, it throws an error.
    I did these fixes in the database — in most cases, it’s only one or two components out of many that are associated with the errant location. We made a decision internally that we’re comfortable going with a majority-rules fix — in other words, if seventeen folders in a box are assigned to the off-site storage facility and one folder in the box is assigned to Drawer 8 in room B59, we’re pretty sure that the whole box is actually at the off-site storage facility. We have other data stores (our ILS, for instance) that we can use to double-check this data.
    If you want descriptive information about these components, this is actually kind of tricky — I’ve written about this report on my other blog, and you’re welcome to use it.

  • Box numbers have to be very, very literally the same. “17a” and “17A” aren’t the same. Neither are “8” and “8 ” (see the space?). We’ve written a pre-migration normalization script to help deal with some of these. Others we’ve fixed by updating the database, based on the error report.

Some errors are just warnings, but are really good to clean up.

  • Two collections probably shouldn’t have the same EADID. In most cases, this is a typo and easily fixed by looking at the record.
  • Begin dates shouldn’t be bigger than end dates
    Here’s the error:

    End date: 1061 before begin date: 1960, ignoring end date
    Record:: Resource Component: RU.126/ref14250

    Haha. That’s probably not from before the Battle of Hastings. We can just fix that typo.

  • Digital objects shouldn’t have the same identifier. Again, usually just a typo. The record will still migrate but the migrator will append a string to the end of the identifier to make it unique.
  • Miscellaneous other stuff. For instance:
    Endpoint: http://localhost:8089/repositories/2/batch_imports?migration=ArchivistToolkit
    AT Identifier:RU.703
    Status code: 200
    Status text: OK
    {
     "errors": ["Server error: #<:ValidationException: {:errors=>{\"notes/0/subnotes/1/items/168/label\"=>[\"Property is required but was missing\"]}}>"],
     "saved": []
    }

    Finding and fixing this was a huge pain in the neck. We had a list, deeeeeep into this collection, that was missing a label for one of the entries. Once we found it, we found that we couldn’t edit it in the application and had to figure out how to fix it in the database. We all learned a lot that day.

We’re very happy to hear from others about what errors they’ve found during their migrations, and how they’ve gone about fixing them!