The underlying storage infrastructure in use by Yale University Library's Hydra repository is now 3-Lock data compliant. Data and information at Yale is classified into three tiers of data security. 3-Lock data can include things like Social Security numbers, credit card numbers, trade secrets, medical records, tax records, grades for assignments and courses, passport numbers, Veterans Administration data, and bank account numbers. This upgrade is intended to meet the needs of units across the Library depositing material, in particular units that are interested in the storage of digital personal papers that may contain sensitive content. Thanks to Steve DeGroat and John Coleman of the Design & Quality Assurance team within Yale ITS Infrastructure Services.
In addition, I wanted to provide a link to the November Partner Phone call notes. Some highlights below:
- The Hydra Steering group elected two new members: Jon Dunn, Indiana University, and Michael Giarlo, Penn State University.
- The Steering group will undertake a process to draft a set of bylaws for governance of the project given its rapid growth.
- The Steering group is also beginning to look at contract services for banking and other legal activities such as trademark and intellectual property.
- Significant expansion of Hydra adoption in Europe was reported as well as a successful Hydra UK event at the University of Hull.
- Hydra had excellent presence and reception at DLF in Atlanta this past October.
- DuraSpace conducted a Fedora 4 webinar which is now available online.
- A proposal for formalizing the creation of Hydra special interest groups is near ratification.
- The Hydra Archivists Working Group (HAWG) has reformed under the new leadership of Ben Goldman, Penn State University.
- Stanford announced ArcLight as a formal project that is kicking off and seeking partnerships for planning and development. In short, ArcLight is an effort to build a Blacklight-based environment to support discovery and digital delivery of information in archives, including integration with ArchivesSpace and support for EAD.
- Hydra expansion into South America is underway, and several universities (wishing to remain anonymous at this time) are beginning Hydra development. Another major milestone is the work these universities are taking on to translate all of the Hydra documentation from English into Spanish.
Fedora is used locally in a number of applications including our Hydra instance, the Finding Aids Database, AMEEL, and the Joel Sumner Smith collection. In our local instances we have been using versions of Fedora 3 since 2006. Fedora 3.8, to be released in December, will be the final release of the 3.x line, with energy at that point shifting primarily to Fedora 4.
Fedora 4 is a complete refactoring of the Fedora 3 code, now built on top of ModeShape and the JCR repository APIs, with improvements in ease of installation, scaling, and RDF support. Below is a full list of features.
Yale contributes financially to the project as a bronze member, and we also contribute to the programming efforts. Osman Din and Eric James, both in Digital Library Programming Services, actively participate in development. In addition, I sit on the Fedora Leadership Committee, which handles the gathering of use cases, the prioritization of features being programmed, and budget planning.
Fedora 4.0 is now undergoing a cycle of beta releases to allow institutions to begin adopting it. On November 9th, Fedora 4.0 Beta 4 was released, again with an eye towards simple installation and support for performance and repository size. Fedora 4.1, to begin development in 2015, will focus on supporting the upgrade/migration process from Fedora 3.x. Some peers, including Penn State, have already begun to replace some of their Fedora 3 repositories with Fedora 4. We are also starting to think about how our migration strategies can dovetail with Fedora 4.1 for our own adoption starting around the summer of 2015.
With that little bit of background, I thought I would share the recent development notes. If you have questions about Fedora, do not hesitate to contact me (firstname.lastname@example.org), Eric James (email@example.com) or Osman Din (firstname.lastname@example.org).
(note: Eric provided valuable editorial feedback for the above post)
Release date: 9 November, 2014
We are proud to announce the fourth Beta release of Fedora 4. In the continuing effort to complete the Fedora 4 feature set, this Beta release is one of several leading up to the Fedora 4 release. Full release notes and downloads are available on the wiki: https://wiki.duraspace.org/display/FF/Fedora+4.0+Beta+4+Release+Notes.
Andrew Woods (DuraSpace)
1) Sprint Developers
Adam Soroka (University of Virginia)
Benjamin Armintor (Columbia University)
Chris Beer (Stanford University)
Esme Cowles (University of California, San Diego)
Giulia Hill (University of California, Berkeley)
Jared Whiklo (University of Manitoba)
Jon Roby (University of Manitoba)
Kevin S. Clarke (University of California, Los Angeles)
Longshou Situ (University of California, San Diego)
Michael Durbin (University of Virginia)
Mohamed Mohideen Abdul Rasheed (University of Maryland)
Osman Din (Yale University)
2) Community Developers
Aaron Coburn (Amherst College)
Frank Asseg (FIZ Karlsruhe)
Nikhil Trivedi (Art Institute of Chicago)
1) Removed features
In the interest of producing a stable, well-tested release, the development team identified and removed a number of under-developed features that had not been sufficiently tested and documented. These features were not identified as high priorities by the community, but they may be re-introduced in later versions of Fedora 4 based on community feedback.
– Namespace creation/deletion endpoint
– Locks endpoint
– Workspaces other than the ‘default’
– Admin internal search endpoints
– Policy-driven storage
– Batch operations in single request
– Auto-versioning configuration option
– Sitemaps endpoint
– Writable nodetypes endpoint
2) REST API
The REST API is one of the core Fedora 4 components, and this release brings it more in line with the emerging W3C Linked Data Platform 1.0 specification. An example of this is the new tombstone functionality: URIs are not supposed to be reused, so deleting a resource leaves a tombstone in its place that serves as a notification that the resource has been deleted. Child nodes of deleted resources also leave tombstones. Other examples of LDP-related REST API changes include:
– Support for hashed URIs as subjects and objects in triples.
– Binary and binary description model changed:
– From: binary description at /resource, and binary at /resource/fcr:content,
– To: binary description at /resource/fcr:metadata, and binary at /resource
– Labels are required when creating new versions of resources.
– Content-Disposition, Content-Length, Content-Type are now available on HEAD requests.
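The binary model change can be sketched with a couple of hypothetical helper methods; the `/rest` base path and the method names are assumptions for illustration, not part of the Fedora API:

```ruby
# Illustrative helpers for the Beta 4 URI layout.
# Before: binary description at /resource, content at /resource/fcr:content
# After:  content at /resource, description at /resource/fcr:metadata
BASE = "/rest".freeze

# URI where the binary content itself now lives
def binary_uri(path)
  "#{BASE}/#{path}"
end

# URI where the RDF description of the binary now lives
def binary_description_uri(path)
  "#{BASE}/#{path}/fcr:metadata"
end

puts binary_uri("objects/book1/page1")             # => /rest/objects/book1/page1
puts binary_description_uri("objects/book1/page1") # => /rest/objects/book1/page1/fcr:metadata
```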
The Fedora 4 ontology was previously broken out into several different namespaces, but these have now been collapsed into the repository namespace. Additionally, the oai-pmh namespace has been added to the ontology.
Fedora 4 provides native linked data functionality, primarily by conforming with the W3C Linked Data Platform 1.0 specification. The LDP 1.0 test suite is executed against the Fedora 4 codebase as a part of the standard build process, and changes are made as necessary to pass the tests. Additionally, integration tests for real-world RDF assertions have also been added to the codebase.
Recent changes to support LDP include:
– When serializing to RDF, child resources are included in responses, versus having to traverse auto-generated intermediate nodes.
– All RDF types on properties are now supported.
– Prefer/Preference-Applied headers have been updated to match the latest requirements.
– RDF language types are now supported.
– The full range of LDP containers is now supported.
– Changed terminology from:
– object -> container
– datastream -> non-rdf-source-description
– Replaced relationships from:
– hasContent/isContentOf, to:
5) External modules
In addition to the core Fedora 4 codebase, there are a number of supported external modules that offer useful extensions to the repository. Two such modules are being introduced in the Fedora 4.0 Beta 4 release: Fedora 4 OAI Provider and Fcrepo Camel.
The Fedora 4 OAI Provider implements the Open Archives Initiative protocol, version 2.0, using Fedora 4 as the backend. It exposes an endpoint at /oai which accepts OAI-conforming HTTP requests. Fedora resources containing set information can be created and then exposed at the module's endpoint, which accepts HTTP POST requests containing serialized set information adhering to the OAI schema.
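As a rough sketch of how a harvester might form requests against the /oai endpoint (the host below is a placeholder; only the verbs and parameters come from the OAI-PMH protocol):

```ruby
require "uri"

# The host is a placeholder; only the /oai path and OAI-PMH
# verbs/parameters come from the protocol itself.
OAI_ENDPOINT = "http://localhost:8080/oai".freeze

# Build an OAI-PMH request URL with the given verb and parameters.
def oai_request(verb, params = {})
  query = URI.encode_www_form({ "verb" => verb }.merge(params))
  "#{OAI_ENDPOINT}?#{query}"
end

puts oai_request("Identify")
puts oai_request("ListRecords", "metadataPrefix" => "oai_dc")
```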
Fcrepo Camel provides access to an external Fedora 4 Containers API for use with Apache Camel. Camel is middleware for writing message-based integrations, so this component can be used to connect Fedora 4 to an extensive number of external systems, including Solr and Fuseki. This functionality is similar to that of the Fcrepo Message Consumer, except it is based on a well-maintained Apache project rather than being custom Fedora 4 code. Therefore, this component is likely to replace the Message Consumer in the future, though the Message Consumer will still be part of the Fedora 4.0 release.
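Fcrepo Camel itself is Java-based; as a language-neutral sketch of the message-routing pattern it enables (repository event in, index update out), here is a toy version in Ruby. The event shape and action names are illustrative assumptions, not the actual Fedora message format:

```ruby
# Toy illustration of the message-based integration pattern:
# a repository emits events, middleware routes them to an index.
Event = Struct.new(:uri, :action)

# Apply one repository event to a (simulated) external index.
def route(event, index)
  case event.action
  when :create, :update then index[event.uri] = "indexed"
  when :delete          then index.delete(event.uri)
  end
  index
end

index = {}
route(Event.new("/rest/obj1", :create), index)
route(Event.new("/rest/obj2", :create), index)
route(Event.new("/rest/obj1", :delete), index)
puts index.keys.inspect  # => ["/rest/obj2"]
```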
6) Admin Console
The administrative console provides a simple HTML user interface for viewing the contents of the repository and accessing functionality provided by the REST API. This release introduces support for custom Velocity templates based on the hierarchy of mixin types. Now, if you create a new mixin type, the templates to be used in the admin console will include the resource's primary type, mixin types, and parent types thereof.
The projection (also known as federation) feature allows Fedora 4 to connect to external storage media via a pluggable connector framework. A read-only filesystem connector is included with this release.
Additionally, Fedora 4 now has standardized support for externally-referenced content.
8) Java client library
The Java Client Library is an example of a module that was conceived by Fedora community members who recognized a common need and rallied to design and implement the functionality. This release includes an improvement to list the children of a resource in the client library.
A key component under the covers of Fedora 4 is ModeShape, one that the Fedora 4 project tracks closely. Fedora 4.0 Beta 4 includes an upgrade to the production version of ModeShape 4.0.0.
Fedora 4 comes with built-in profiling machinery that keeps track of how many times specific services have been requested, how long each request takes to be serviced, etc. These metrics can be visualized using Graphite. Because Graphite can be difficult to set up and configure, this release includes a Packer.io build which completely automates the process of standing up a Graphite server.
Additionally, the pluggable role-based and XACML authorization modules have been pre-packaged into fcrepo-webapp-plus. This project builds custom-configured fcrepo4 webapp war files that include extra dependencies and configuration options.
10) Test Coverage
Unit and integration test coverage is a vital factor in maintaining a healthy code base. The following are the code coverage statistics for this release.
– Unit tests: 66.2%
– Integration tests: 69.4%
– Overall coverage: 82.5%
I'd like to introduce everyone to Tracy MacMath, who joined us this week as a User Interface Programmer. She will be working primarily on Hydra, Blacklight, Ladybird, and other Hydra-related applications we adopt, such as Avalon. Previously Tracy worked as a User Experience Producer at Gartner and received a Master of Science in Interactive Communications from Quinnipiac University.
In addition to introducing Tracy, I thought it might help to offer some quick bios for the whole team.
Michael Friscia – Manager, Digital Library Programming Services
I arrived at Yale in 2007, primarily supporting workstations and ILLiad and programming on various projects for the LSF, the Map Department, and the wide range of Digital Library interfaces. Since then Library IT has changed quite a bit, and I now manage the group that is primarily responsible for working with the Hydra implementation in support of a number of grant-funded projects, including Arcadia, NEH, and the Dr. Henry Kissinger Papers. I started programming early (my first computer arrived for Christmas in 1979) and recently celebrated 30 years of programming C++ applications, though I still enjoy programming in Algol and BASIC on a variety of vintage computers in a collection that spans from 1964 to 1995, with some overflow in my office that I crank up from time to time. I enjoy writing software of all types but spend my free time working on several open source video game and game emulation projects.
Eric James, Senior Programmer Analyst
My official title is Programmer Analyst, Library IT. I was hired 7 years ago to work on a digital repository service that became the current Yale Finding Aid Database, and on a grant project, A Middle Eastern Electronic Library (AMEEL), that was one of the earliest adopters of a software stack for the submission, archiving, and dissemination of digital material. This work has matured over the years and has now basically taken the form of Hydra, an interinstitutional project with these same goals, bringing together such components as Fedora, Solr, and MySQL. I am a programmer on these projects (PHP, Java, and Ruby/Rails) and have been involved in various teams such as the Digitization Task Force, the YFAD Coordinating Committee, the Digital Repository Archiving Committee, and most recently the Kissinger Project, working to coordinate technology with our strategic plans. I have participated as a programmer in several sprints for the Fedora 4 project (the future underlying repository of most of our solutions) and in the development and use of the Hydra stack, and am involved in working groups related to these projects. Throughout these projects I have worked with software project management tools such as Git, SharePoint, Wrike, Basecamp, Pivotal Tracker, JIRA, Confluence wikis, and Classesv2. I have been involved in several conferences, including participation as a presenter and in lightning talks and poster sessions at Open Repositories, code4lib, Hydra Connect, and the Digital Library Federation.
Osman Din, Senior Programmer Analyst
I got into software engineering, and programming in particular, due to my background in Computer Science. My current career focus is on developing back-end large-scale services for digital content management and publishing, as well as writing web applications and tools that aid in this enterprise. Besides other assignments or projects that I participate in, the bulk of my time is dedicated currently to two major projects, Ladybird 2 and Fedora 4. Ladybird 2 is a Java application for managing the publishing and discovery of digital content to repositories (such as Fedora) and user-facing web applications (such as the Hydra interfaces). I’m the lead developer for this project. The code is written via IntelliJ, lives in GitHub (eventually, it will become an open source project), and is managed via Jenkins for continuous integration. For Fedora 4, which is an open source repository system with about 30 developers, I keep the code that I write in a fork on GitHub, and submit it to the Fedora 4 project team lead in the form of GitHub pull requests. The code is tested automatically for integration and fitness via Travis and Jenkins; the documentation for new functionality is kept up-to-date in Confluence; the status updates for features and bugs are recorded in Pivotal Tracker. My favorite tools for software design and programming are IntelliJ, bash, Eclipse (for proprietary frameworks), Virtual Box, Git, Jenkins and LucidChart.
Lakeisha Robinson, Programmer Analyst
How I began a career in programming:
Immediately following college I started my career at IBM. It was there, designing and coding programs in C and Assembly, that I realized how much I loved programming. I had a hardware background in Electrical Engineering and didn't have a strong programming background, so when I left that job I decided to pursue a degree in Computer Science to strengthen my programming skills. I started my career here at Yale University 2 years ago and have participated in many exciting projects, including Quicksearch, Kissinger, Ladybird, and Findit.
Projects I’ve worked on and am currently working on:
I am the technical lead on the Blacklight-based Quicksearch project, where we are unifying our Orbis and Morris records so they can be searched in the same interface. I am responsible for many of the code changes for the setup, ingest, and interface functionality. I am also working on the Kissinger project, where I am responsible for the discovery of Kissinger material. I'm also one of the original contributors to our Blacklight-based digital 'Findit' interface, where I was responsible for creating the MODS-formatted XML document that retrieves data from the Ladybird database. I was also responsible for object discovery in the interface, and I continue to do ongoing enhancement work.
Anju Meenattoor, Programmer Analyst
I have been working in IT for 8 years, focusing mainly on web development and C#.NET applications. I started my career at Yale in January 2014 and am currently working on the Dr. Henry Kissinger project. Since January, I have been involved in developing applications for importing Kissinger MODS files, automating the import of Kissinger digital files into Ladybird, a manual QC tool for checking Kissinger digital files, and Ladybird maintenance.
Tracy MacMath, User Interface Programmer
I’m the newest member of the Digital Library Programming team. As a User Interface Programmer, I’ll be working primarily on our implementation of Hydra, Blacklight, Ladybird and other Hydra-related applications we adopt in the future (such as Avalon). Before coming to Yale, I was a User Experience Producer for the Marketing group at Gartner in Stamford. I received a Master of Science in Interactive Communications from Quinnipiac University, and an undergraduate degree in music (drums and percussion).
Yale University Library FindIt and QuickSearch services have completed a Security Design Review (SDR) by the Information Security Office of Yale ITS. These systems use the Hydra repository solution as the underlying technology stack. The SDR process is used to provide recommendations for building, improving, or reengineering services to meet University policies, industry best practices, laws, and regulation requirements. Thanks to Bob Rice for evaluating and implementing the recommendations and Tom Castiello and Marcus Aden from the Information Security Office for their insight and participation.
Library IT recently purchased a license for the performance management and monitoring service New Relic. We will be using the New Relic APM (Application Performance Management) application to monitor and improve performance of the new Hydra/Blacklight complex (aka Findit and Quicksearch beta). This is a SaaS, cloud-based service for monitoring applications and their underlying infrastructure as well as the programs themselves.
New Relic does do some usage monitoring, much in the vein of Google Analytics, but the particulars of the installation and setup of this service will allow the Information Architecture Group in Library IT and others to specifically target performance issues like page loads and search result returns. New Relic will be a great help in assessing the health and responsiveness of the critical servers and applications which run the Library's key services.
The Digital Library & Programming group is pleased to announce that we've hit a major milestone in the development of the Hydra digital repository system at YUL. Communication and syncing between repository system components became fully automated at the end of September 2014. This automation applies not just to work on the infrastructure built for the Kissinger papers, but to all Ladybird/Hydra interaction.
Automation like this allows metadata and objects to travel within the Hydra system without intervention, which in turn allows Library IT to focus more intensely on structural and workflow development. As a Project Hydra partner, Yale is now in the position to share this work with the Project Hydra community, and empower those members to scale up their own repository ingest services.
Join us for the inaugural Yale Technology Summit, a day-long program of conversations with Yale faculty, students, and staff working with innovative and cutting-edge technologies. The event, coordinated by Yale Information Technology Services, is free and open to all members of the Yale community.
Library and Library IT presentations at this event include:
- Library Development for Digital Repositories: What is this Hydra Fedora stuff?
In response to a fragmented digital collections environment developed over many years using many systems, the Yale Library has launched a project to unify digital collections within a single open source software framework using Hydra/Fedora. Michael Dula, the Library CTO, will talk about the decision to go open source with Hydra and Fedora as the underlying technologies. Topics will include Yale’s contributions to the open source Hydra community, a demonstration of initial projects, and future development plans and possibilities.
- Quicksearch: Universal Search at the University Library
The Library offers several search interfaces: Orbis and MORRIS search the Library and Law Library catalogs, Articles+ for articles, journals and newspapers, and several digitized collection searches. The many search interfaces present a challenge to our patrons, who have to select the correct search depending on the material they need. The Library will combine several of these search interfaces into one unified ‘Quicksearch’, which over time will become a comprehensive search interface for the majority of Library resources. The Quicksearch poster session will highlight progress on the project so far. We will also provide laptops so Summit Participants can try the new search for themselves.
- Humanities Data Mining in the Library
In response to increased scholarly demand, Yale University Library is helping humanists make sense of large amounts of digital data. In this presentation, we will highlight recent projects based on Yale-digitized data, data from large commercial vendors, and data from the Library of Congress. We’ll address 1) working with digitized collections that are subject to license & copyright, 2) thinking about both explicit metadata and latent structure in large digital collections, and 3) moving beyond text to consider machine vision and computational image analysis.
- Preservation and Access Challenges of Born-Digital Materials
We will provide an introduction to the scope of born-digital materials at Sterling Memorial Library and the Beinecke Rare Book & Manuscript Library, and in particular will discuss the innovative ways staff at the Yale libraries are collaborating with colleagues on different initiatives, including a digital forensics lab devoted to the capture of born-digital materials, an emulation service that can provide online access to vintage computing environments via a web browser, and a vision for digital preservation to ensure that collection materials we capture today will remain usable in the future.
Watch the conversation on #YaleTechSummit2014 on Twitter!
Just a brief update on the work of our group for the past week.
We continue our efforts in contributing to the Fedora 4 project. We use Fedora as one of the core products in our Hydra implementation. Currently we have several installations of version 3. Version 4 has been in development for a little over a year with an expected release date of June 2015. While Yale has been a financial contributor to the Fedora Commons project for many years now, we only started contributing code to the project in 2013.
The Quicksearch project is also moving along swiftly. This week the major milestone of handling CAS login was completed. This is used for some features in the Blacklight software like the bookbag and search history. CAS is generally simple to integrate with most software products as long as the link between a NetID and the local user database can be made. In the case of Blacklight, making this link became complicated because of the use of several different code libraries in the specific version of Blacklight being used for Quicksearch, which is different from the version we use for our Hydra interface, FindIt.
Almost all efforts this week were related to ingest operations for the Kissinger project. There was also some vacation time taken so the output this week was limited.
Yesterday we met to discuss the development for full text search for objects ingested into Hydra. The work is broken up into the following steps:
1- alter the SOLR index to accept 2 new fields that will store full text
2- alter the Hydra ingest application to store the content of the TXT files into the new SOLR fields
3- setup the Blacklight controllers for handling if/when each of the FT fields are used in user searches
4- develop the Blacklight user interface to allow the FT search option
At this point we are only focused on the first two steps; steps 3 and 4 require us to have data in place. We will be moving steps 1 and 2 to the test environment the week of Oct 20 and then roll these changes into production the week of Oct 27. We will be doing all our FT testing with the Day Missions collection, which uses a Hydra ingest package very similar to Kissinger.
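A rough sketch of steps 1 and 2 above, assuming hypothetical Solr field names (our actual schema may differ):

```ruby
# Build a Solr document that adds two new full-text fields alongside
# the existing metadata: one for open-access OCR text, one for
# restricted text. Field names here are illustrative placeholders.
def build_solr_doc(id, metadata, open_text: nil, restricted_text: nil)
  doc = { "id" => id }.merge(metadata)
  doc["full_text_tsim"] = open_text if open_text
  doc["full_text_restricted_tsim"] = restricted_text if restricted_text
  doc
end

doc = build_solr_doc(
  "daymissions:1",
  { "title_tsim" => "Day Missions report" },
  open_text: "OCR text extracted from the scanned pages..."
)
puts doc.keys.inspect  # => ["id", "title_tsim", "full_text_tsim"]
```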
This is a repost from another location that has more information on our full text search plan. So I will give a brief overview of what that plan looks like with the use cases used to draft this approach.
There are two types of full text search for objects we ingest into Hydra.
The first is the simplest: OCR text from a scanned image, like a page in a book or a manuscript. This type of text is treated as an extension of the metadata, making it simple to combine into search results since the text is considered open access.
The second is significantly more complex: the contents of the full text require special permission to search, so instead of the text being treated as an extension of the metadata, it is treated the same way we treat files that carry special access conditions. This permission would have been granted ahead of time, so at the time you execute your full text search it will include results from the restricted items. This use case is currently specific to the Kissinger papers project but is being programmed to scale out as needed.
So the approach we are taking is fairly simple: we place the open access full text into one SOLR field and the restricted access full text into a field specifically designed for restricted content. When the search is executed, the open access text is searched, and the restricted field is filtered so that your search is only applied to the restricted content you have been granted access to view.
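In query terms, the filtering described above might look something like the following sketch; the field names, group values, and query syntax are illustrative assumptions rather than our production configuration:

```ruby
# Combine an open-access full-text clause with a restricted clause that
# is limited to the groups the searcher belongs to. With no groups,
# only the open-access field is searched.
def full_text_query(terms, user_groups)
  open_clause = "full_text_tsim:(#{terms})"
  return open_clause if user_groups.empty?
  restricted = "(full_text_restricted_tsim:(#{terms}) AND " \
               "read_access_group_ssim:(#{user_groups.join(' OR ')}))"
  "#{open_clause} OR #{restricted}"
end

puts full_text_query("detente", [])                    # open access only
puts full_text_query("detente", ["kissinger_project"]) # plus permitted restricted text
```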