The current ingest procedure is somewhat long-winded and technical. The example here uses a single EAD XML file containing a large number (~48,000) of individual documentary unit items in a single fonds. The repository is the International Tracing Service (ITS), which has EHRI repository ID de-002409.
This ingest covers importing the EAD file into the staging server, after which the data should be ready for verification and, if necessary, corrections before the production ingest.
First, log into the EHRI staging server via SSH and open a bunch of shells. In one of them, tail the following log file, which will tell us what went wrong when something inevitably goes wrong during the first few attempts:
tail -f /opt/webapps/neo4j-version/logs/log/neo4j.log
The Neo4j DB lives in /opt/webapps/data/neo4j/databases/graph.db. You can back it up without shutting down the server by running:
/opt/webapps/neo4j-backup.sh graph.db.BAK
To restore the DB the procedure is:

- shut down Neo4j
- replace /opt/webapps/data/neo4j/databases/graph.db with the backup directory you specified previously
- ensure all files in the graph.db directory are owned and writable by the webadm group:
  - chgrp -R webadm graph.db
  - chmod -R g+rw graph.db
- restart Neo4j
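As a concrete sketch of the restore (assuming the backup script wrote its backup to /opt/webapps/graph.db.BAK — check the script for the actual destination — and that Neo4j is managed by the neo4j-service init script used below):

sudo service neo4j-service stop
cd /opt/webapps/data/neo4j/databases
mv graph.db graph.db.broken   # keep the damaged DB around, just in case
cp -r /opt/webapps/graph.db.BAK graph.db
chgrp -R webadm graph.db
chmod -R g+rw graph.db
sudo service neo4j-service start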
Onwards with the ingest…
Next, in another shell, copy the file(s) to be ingested to the server and place them in /opt/webapps/data/import-data/de/de-002409, the working directory for ITS data.
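For example, from your local machine (the hostname and username here are placeholders):

scp KHSK_GER.xml user@ehri-staging:/opt/webapps/data/import-data/de/de-002409/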
Import properties files handle certain mappings between tags (with particular attributes) and EHRI fields. The ITS data has a particular mapping indicating that when a <unitid> has type="refcode" it is the main doc unit identifier, and that the rest are the alternates. In this case the file is:
/opt/webapps/data/import-data/de/de-002409/its-pertinence.properties
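To illustrate, this mapping targets EAD markup of roughly the following shape (the identifier values and the alternate type attribute are invented for the example):

<did>
  <unitid type="refcode">1.1.5.1/0001</unitid>
  <unitid type="other">Alternative identifier</unitid>
</did>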
The actual import is done via the /ehri/import/ead endpoint on the Neo4j extension. It is documented here: http://ehri.github.io/docs/api/ehri-rest/ehri-extension/wsdocs/resource_ImportResource.html
The basic procedure is as follows.

To make the curl command less cumbersome, let's export the path to the properties file as an environment variable:
export PROPERTIES=/opt/webapps/data/import-data/de/de-002409/its-pertinence.properties
Also, let's create a log file and export its path as an environment variable:
echo "Importing ITS data with properties: $PROPERTIES" > LOG.txt export LOG=`pwd`/LOG.txt
Now we can POST the data to the ingest endpoint:
curl -XPOST \
     -H "X-User:mike" \
     -H "Content-type: text/xml" \
     --data-binary @KHSK_GER.xml \
     "http://localhost:7474/ehri/import/ead?scope=de-002409&log=$LOG&properties=$PROPERTIES&commit=true"
These parameters are:

- scope: the EHRI ID of the repository (here de-002409) under which the items will be created
- log: the path of a file containing a log message describing the ingest
- properties: the path of the properties file described above
- commit: when true, the import is actually committed; without it the transaction is rolled back, which serves as a dry run
Note: when importing a single EAD containing ~50,000 items in a single transaction the staging server might run out of memory. If it does, the only option is to increase the Neo4j heap size by uncommenting and setting dbms.memory.heap.max_size=MORE_MB (say, 3500) in $NEO4J_HOME/conf/neo4j-wrapper.conf and restarting Neo4j by running:
sudo service neo4j-service restart
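For example, the uncommented line in neo4j-wrapper.conf would end up looking like this (the value is in megabytes; size it to the available RAM):

dbms.memory.heap.max_size=3500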
Additional note: certain date patterns are fuzzy-parsed by the importer, and invalid dates such as 31 April will currently throw a runtime exception, resulting in a BadRequest from the web service. So fix all of these first ;)
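One quick, low-tech way to eyeball the date values before ingesting (assuming the dates live in standard <unitdate> elements, each on a single line):

grep -o '<unitdate[^>]*>[^<]*</unitdate>' KHSK_GER.xml | sort | uniq -c | sort -rn | head -20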
If all goes well you should get something like this:
{"created":48430,"unchanged":0,"message":"Import ITS 0.4 data using its-pertinence.properties.\n","updated":0,"errors":{}}
In theory, that ingest should be idempotent, so you can run the same command again and not change anything. Instead you’d get a reply like:
{"created":0,"unchanged":48430,"message":"Import ITS 0.4 data using its-pertinence.properties.\n","updated":0,"errors":{}}
The final step is to re-index the ITS repository, making the items searchable. This can be done from the Portal Admin UI, or via the following command:
java -jar /opt/webapps/docview/bin/indexer.jar \
     --clear-key-value holderId=de-002409 \
     --index -H "X-User=admin" \
     --stats \
     --solr http://localhost:8080/ehri/portal \
     --rest http://localhost:7474/ehri \
     "Repository|de-002409"
(This tool is a library/CLI utility that is used by the portal UI and is available on the server: see the https://github.com/EHRI/ehri-search-tools project for more details.)
To update existing collections, when, for example, adding descriptions in another language, the procedure is exactly the same with one exception: the curl import command needs an additional parameter:
&allow-update=true
Without this parameter the importer will throw a mode violation error when it attempts to update an existing collection.
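For example, ingesting a second, English-language description file (the filename KHSK_ENG.xml is hypothetical) with the environment variables set as above:

curl -XPOST \
     -H "X-User:mike" \
     -H "Content-type: text/xml" \
     --data-binary @KHSK_ENG.xml \
     "http://localhost:7474/ehri/import/ead?scope=de-002409&log=$LOG&properties=$PROPERTIES&commit=true&allow-update=true"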
If you want to overwrite an existing item description with data from a new EAD, the EAD must have the same sourceFileId value as the current description. The sourceFileId is a property computed from two aspects of the EAD file, the eadheader/eadid value and the eadheader/profiledesc/langusage/language/@langcode value, combined thus: [EADID]#[UPPER-CASE-LANGCODE].
For example, if the eadid is 100 and the language code is eng, the sourceFileId will be 100#ENG.
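In EAD terms, that example corresponds to a header like the following (trimmed to the relevant elements):

<eadheader>
  <eadid>100</eadid>
  <profiledesc>
    <langusage>
      <language langcode="eng">English</language>
    </langusage>
  </profiledesc>
</eadheader>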
Only documentary unit descriptions created via the EAD ingest process will have a sourceFileId; those created using the portal interface will not. For descriptions that have the property it is visible (but not editable) on the portal admin pages.
Note: the consequence of the above is that the eadid value should not contain the language code, since this is redundant and will result in a sourceFileId like eng#ENG.
It is possible to ingest multiple EAD files in a single transaction by providing the importer with an archive file (containing multiple XML files) instead of a single XML file. Currently the following formats are supported:
The importer will assume the data it is given is an archive if the content type of the request is application/octet-stream (i.e. miscellaneous binary) rather than text/xml (a single XML file) or text/plain (local file paths).
Note: if several EAD files provide different translations of the same items, it is necessary to enable update ingests via &allow-update=true.
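Putting this together, a multi-file ingest might look like the following (its-eads.zip is a hypothetical archive name, and zip is assumed to be among the supported formats; &allow-update=true is included on the assumption that the files contain translations of the same items):

curl -XPOST \
     -H "X-User:mike" \
     -H "Content-type: application/octet-stream" \
     --data-binary @its-eads.zip \
     "http://localhost:7474/ehri/import/ead?scope=de-002409&log=$LOG&properties=$PROPERTIES&commit=true&allow-update=true"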