
E-ARK IP validation and manipulation tool and library

This project provides a command-line interface and Java library to validate and manipulate OAIS Information Packages of different formats: E-ARK (version 1, 2.0.4, 2.1.0), BagIt, Hungarian type 4 SIP.

The E-ARK Information Packages are maintained by the Digital Information LifeCycle Interoperability Standards Board (DILCIS Board). The DILCIS Board is an international group of experts committed to maintaining and sustaining a set of interoperability specifications which allow for the transfer, long-term preservation, and reuse of digital information regardless of the origin or type of the information.

More specifically, the DILCIS Board maintains specifications initially developed within the E-ARK Project (02.2014 - 01.2017):

The DILCIS Board collaborates closely with the Swiss Federal Archives on the maintenance of the SIARD (Software Independent Archiving of Relational Databases) specification.

For more information about the E-ARK Information Packages specifications, please visit http://www.dilcis.eu/

Installation

Requirements

Download the latest release to use as a tool or check below how to use it as a Java Library.

Usage

You can use commons-ip as a command-line tool or as a Java library.

Use as a command-line tool

To use the commons-ip command-line tool, download the latest release. The tool can validate a SIP or create a valid E-ARK 2 SIP.

To validate a SIP, use the following options:

To create an E-ARK 2 SIP, use the following options:

This is the descriptive metadata section:

NOTE: if the metadata version is not given, the tool tries to obtain it from the file name, which is expected to follow this format (file: ead_2002.xml -> result: metadata version: 2002)
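The file-name fallback described in the note can be sketched as follows. This is a minimal illustration, not the tool's actual code: the class name `MetadataVersionGuesser`, the method `guessVersion`, and the exact pattern (`<type>_<version>.xml`) are assumptions derived from the single `ead_2002.xml` example, and the tool may accept other forms.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of commons-ip.
final class MetadataVersionGuesser {

  // Assumed naming convention: "<type>_<version>.xml", e.g. "ead_2002.xml".
  private static final Pattern TYPE_VERSION =
      Pattern.compile("^[^_]+_(.+)\\.xml$", Pattern.CASE_INSENSITIVE);

  // Returns the version part of the file name, or empty if the name
  // does not follow the assumed convention.
  static Optional<String> guessVersion(String fileName) {
    Matcher m = TYPE_VERSION.matcher(fileName);
    return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(guessVersion("ead_2002.xml").orElse("(unknown)")); // prints 2002
    System.out.println(guessVersion("metadata.xml").orElse("(unknown)")); // prints (unknown)
  }
}
```

When the file name carries no version, passing the metadata version explicitly on the command line avoids relying on this fallback.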

This is the representation section:

Examples:

Full create SIP command with long options:

java -jar commons-ip-cli-2.X.Y.jar create --metadata-file metadata.xml --metadata-type ead --metadata-schema ead2002.xsd \
--representation-data "dataFile1.pdf,dataFolder1,dataFile2.png,Mixed,representation1" \
--sip-id sip1 --ancestors sip2,sip3 --documentation documentation1,documentationFolder --path folder2 --submitter-name agent1 --submitter-id 123
java -jar commons-ip-cli-2.X.Y.jar validate -i sip1.zip
java -jar commons-ip-cli-2.X.Y.jar validate -i sip1.zip sip2.zip -o output/

Output Example

The report generated by the validator is in JSON format and has the following structure:

{
  "header": {
    "title": "Validation Report CSIP",
    "specifications": [
      {
        "id": "CSIP-2.0.4",
        "url": "https://github.com/DILCISBoard/E-ARK-CSIP/releases/tag/v2.0.4"
      },
      {
        "id": "SIP-2.0.4",
        "url": "https://github.com/DILCISBoard/E-ARK-SIP/releases/tag/v2.0.4"
      }
    ],
    "version_commons_ip": "2.0.0-alpha3-SNAPSHOT",
    "date": "2021-09-24T13:30:26.203+01:00",
    "path": "${HOME}/Full-EARK-SIP.zip"
  },
  "validation": [
    {
      "specification": "CSIP-2.0.4",
      "id": "CSIPSTR1",
      "name": "CSIP Information Package folder structure",
      "location": "",
      "description": "Any Information Package MUST be included within a  single physical root folder (known as the “Information Package root folder”). For packages presented in an archive format, see CSIPSTR3, the archive MUST unpack to a single root folder.",
      "cardinality": "",
      "level": "MUST",
      "testing": {
        "outcome": "PASSED",
        "issues": [],
        "warnings": [],
        "notes": []
      }
    },
    {
      "specification": "CSIP-2.0.4",
      "id": "CSIP32",
      "name": "Digital provenance metadata",
      "location": "mets/amdSec/digiprovMD",
      "description": "For recording information about preservation the standard PREMIS is used. It is mandatory to include one <digiprovMD> element for each piece of PREMIS metadata. The use if PREMIS in METS is following the recommendations in the 2017 version of PREMIS in METS Guidelines.",
      "cardinality": "0..n",
      "level": "SHOULD",
      "testing": {
        "outcome": "FAILED",
        "issues": [],
        "warnings": [
          "It is mandatory to include one <digiprovMD> element in Root METS.xml for each piece of PREMIS metadata"
        ],
        "notes": []
      }
    }
  ],
  "summary": {
    "success": 120,
    "warnings": 3,
    "errors": 5,
    "skipped": 33,
    "notes": 8,
    "result": "INVALID"
  }
}
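A script that consumes this report usually only needs the overall outcome. The sketch below pulls the `result` field out of the `summary` object using nothing but the Java standard library. The class `ReportSummary` is hypothetical, and the regular expression is a shortcut that only works because `summary` contains no nested objects in the example above; a real JSON parser would be the more robust choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of commons-ip.
final class ReportSummary {

  // Finds "result": "<value>" inside the flat "summary" object.
  private static final Pattern RESULT = Pattern.compile(
      "\"summary\"\\s*:\\s*\\{[^}]*\"result\"\\s*:\\s*\"([^\"]*)\"");

  // Returns the summary result (e.g. "INVALID"), or empty if absent.
  static Optional<String> result(String reportJson) {
    Matcher m = RESULT.matcher(reportJson);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  // True when the report's summary result is "INVALID", as in the example.
  static boolean isInvalid(String reportJson) {
    return result(reportJson).map("INVALID"::equals).orElse(false);
  }

  public static void main(String[] args) {
    String json = "{ \"summary\": { \"errors\": 5, \"result\": \"INVALID\" } }";
    System.out.println(result(json).orElse("(no summary)")); // prints INVALID
  }
}
```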

Use as a Java Library

  1. Add the following repository
  2. Add the following dependency
<dependency>
  <groupId>org.roda-project</groupId>
  <artifactId>commons-ip2</artifactId>
  <version>2.6.0</version>
</dependency>

Write some code

// 1) instantiate E-ARK SIP object
SIP sip = new EARKSIP("SIP_1", IPContentType.getMIXED(), IPContentInformationType.getMIXED());
sip.addCreatorSoftwareAgent("RODA Commons IP", "2.0.0");

// 1.1) set optional human-readable description
sip.setDescription("A full E-ARK SIP");

// 1.2) add descriptive metadata (SIP level)
IPDescriptiveMetadata metadataDescriptiveDC = new IPDescriptiveMetadata(
    new IPFile(Paths.get("src/test/resources/eark/metadata_descriptive_dc.xml")),
    new MetadataType(MetadataTypeEnum.DC), null);
sip.addDescriptiveMetadata(metadataDescriptiveDC);

// 1.3) add preservation metadata (SIP level)
IPMetadata metadataPreservation = new IPMetadata(
    new IPFile(Paths.get("src/test/resources/eark/metadata_preservation_premis.xml")));
sip.addPreservationMetadata(metadataPreservation);

// 1.4) add other metadata (SIP level)
IPFile metadataOtherFile = new IPFile(Paths.get("src/test/resources/eark/metadata_other.txt"));
// 1.4.1) optionally one may rename file final name
metadataOtherFile.setRenameTo("metadata_other_renamed.txt");
IPMetadata metadataOther = new IPMetadata(metadataOtherFile);
sip.addOtherMetadata(metadataOther);

// 1.5) add xml schema (SIP level)
sip.addSchema(new IPFile(Paths.get("src/test/resources/eark/schema.xsd")));

// 1.6) add documentation (SIP level)
sip.addDocumentation(new IPFile(Paths.get("src/test/resources/eark/documentation.pdf")));

// 1.7) set optional RODA related information about ancestors
sip.setAncestors(Arrays.asList("b6f24059-8973-4582-932d-eb0b2cb48f28"));

// 1.8) add an agent (SIP level)
IPAgent agent = new IPAgent("Agent Name", "OTHER", "OTHER ROLE", CreatorType.INDIVIDUAL, "OTHER TYPE", "",
    IPAgentNoteTypeEnum.SOFTWARE_VERSION);
sip.addAgent(agent);

// 1.9) add a representation (status will be set to the default value, i.e.,
// ORIGINAL)
IPRepresentation representation1 = new IPRepresentation("representation 1");
sip.addRepresentation(representation1);

// 1.9.1) add a file to the representation
IPFile representationFile = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile.setRenameTo("data.pdf");
representation1.addFile(representationFile);

// 1.9.2) add a file to the representation and put it inside a folder
// called 'def' which is inside a folder called 'abc'
IPFile representationFile2 = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile2.setRelativeFolders(Arrays.asList("abc", "def"));
representation1.addFile(representationFile2);

// 1.10) add a representation & define its status
IPRepresentation representation2 = new IPRepresentation("representation 2");
representation2.setStatus(new RepresentationStatus(REPRESENTATION_STATUS_NORMALIZED));
sip.addRepresentation(representation2);

// 1.10.1) add a file to the representation
IPFile representationFile3 = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile3.setRenameTo("data3.pdf");
representation2.addFile(representationFile3);

// 2) build SIP, providing an output directory
Path zipSIP = sip.build(tempFolder);

Note: SIP implements the Observer Pattern. To be notified of SIP build progress, implement the SIPObserver interface and register the observer with the SIP, for example (only some of the events are shown):

public class WhoWantsToBuildSIPAndBeNotified implements SIPObserver {

  public void buildSIP() {
    ...
    SIP sip = new EARKSIP("SIP_1", IPContentType.getMIXED());
    sip.addObserver(this);
    ...
  }

  @Override
  public void sipBuildPackagingStarted(int totalNumberOfFiles) {
    ...
  }

  @Override
  public void sipBuildPackagingCurrentStatus(int numberOfFilesAlreadyProcessed) {
    ...
  }
}
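As one possible use of the two callbacks above, an observer can turn the file counts into a progress percentage. The sketch below is self-contained and therefore declares its own `PackagingListener` interface as a stand-in for `SIPObserver`; that interface and the `PackagingProgress` class are hypothetical, and only the two method names and parameters are taken from the example.

```java
// Hypothetical stand-in for the two SIPObserver callbacks shown above;
// the real interface ships with commons-ip and declares more events.
interface PackagingListener {
  void sipBuildPackagingStarted(int totalNumberOfFiles);

  void sipBuildPackagingCurrentStatus(int numberOfFilesAlreadyProcessed);
}

// Tracks packaging progress from the two callbacks.
final class PackagingProgress implements PackagingListener {
  private int total;
  private int processed;

  @Override
  public void sipBuildPackagingStarted(int totalNumberOfFiles) {
    total = totalNumberOfFiles;
    processed = 0;
  }

  @Override
  public void sipBuildPackagingCurrentStatus(int numberOfFilesAlreadyProcessed) {
    processed = numberOfFilesAlreadyProcessed;
  }

  // Percentage of files packaged so far, clamped to the range 0..100.
  int percent() {
    if (total <= 0) {
      return 0;
    }
    return Math.min(100, (int) (100L * processed / total));
  }

  public static void main(String[] args) {
    PackagingProgress progress = new PackagingProgress();
    progress.sipBuildPackagingStarted(4);
    progress.sipBuildPackagingCurrentStatus(1);
    System.out.println(progress.percent() + "%"); // prints 25%
  }
}
```

With the real library, the same logic would live in a class that implements SIPObserver and is registered through sip.addObserver(...).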

To parse an existing E-ARK SIP:

// 1) invoke static method parse and that's it
SIP earkSIP = EARKSIP.parse(zipSIP);

Development

This section contains notes relevant to Commons IP development.

XML Beans

XML Beans are used by Commons IP to manipulate METS files using Java code.

Details

Some changes were made to XML Schemas to be able to compile XML Schemas into Java classes using XJC as well as to be able to validate an XML file against its XML Schema without Internet connections.

The changes are:

After the Java classes were generated, some changes were made so that the resulting METS XML files are well-defined in terms of namespaces. Namely, the package-level @XmlSchema annotation was changed from:

@jakarta.xml.bind.annotation.XmlSchema(namespace = "http://www.loc.gov/METS/", elementFormDefault = jakarta.xml.bind.annotation.XmlNsForm.QUALIFIED)

to:

@jakarta.xml.bind.annotation.XmlSchema(namespace = "http://www.loc.gov/METS/", elementFormDefault = jakarta.xml.bind.annotation.XmlNsForm.QUALIFIED, xmlns = {
  @jakarta.xml.bind.annotation.XmlNs(prefix = "", namespaceURI = "http://www.loc.gov/METS/"),
  @jakarta.xml.bind.annotation.XmlNs(prefix = "xsi", namespaceURI = "http://www.w3.org/2001/XMLSchema-instance"),
  @jakarta.xml.bind.annotation.XmlNs(prefix = "csip", namespaceURI = "https://DILCIS.eu/XML/METS/CSIPExtensionMETS"),
  @jakarta.xml.bind.annotation.XmlNs(prefix = "sip", namespaceURI = "https://DILCIS.eu/XML/METS/SIPExtensionMETS"),
  @jakarta.xml.bind.annotation.XmlNs(prefix = "xlink", namespaceURI = "http://www.w3.org/1999/xlink")})

IANA Media Types

The IANA Media Types list is required to perform SIP validation. The list is stored at the following path:

/src/main/resources/controlledVocabularies/IANA_MEDIA_TYPES.txt
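To illustrate how such a list can back a validation check, the sketch below loads a media types file into a set and tests membership. It is not the validator's actual code: the class `MediaTypeVocabulary` is hypothetical, the file is assumed to hold one media type per line, and the case-insensitive comparison is a choice made for this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of commons-ip.
final class MediaTypeVocabulary {
  private final Set<String> known = new HashSet<>();

  private MediaTypeVocabulary() {
  }

  // Builds the vocabulary from lines, assuming one media type per line.
  static MediaTypeVocabulary fromLines(Collection<String> lines) {
    MediaTypeVocabulary vocabulary = new MediaTypeVocabulary();
    for (String line : lines) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        vocabulary.known.add(trimmed.toLowerCase(Locale.ROOT));
      }
    }
    return vocabulary;
  }

  // Loads the vocabulary from a text file such as IANA_MEDIA_TYPES.txt.
  static MediaTypeVocabulary load(Path list) throws IOException {
    return fromLines(Files.readAllLines(list));
  }

  // True if the media type appears in the list (case-insensitive).
  boolean contains(String mediaType) {
    return mediaType != null
        && known.contains(mediaType.trim().toLowerCase(Locale.ROOT));
  }
}
```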

How to generate/update

To update the IANA media types list, run the following command from the commons-ip root directory:

./scripts/ianaMediaTypes_parser.sh

The command runs a script that downloads the IANA Media Types registries (Application, Audio, Font, Image, Message, Model, Multipart, Text, Video) from https://www.iana.org/assignments/media-types/${iana_file}.csv, one .csv file per registry, and then creates a single .txt file to which all the media types are appended.
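The merge step can be sketched in Java as follows. The real script is a shell script, so this is only an illustration of the same idea with the download left out. It assumes that each CSV starts with a header row, that the media type sits in the second ("Template") column, and that the output holds one media type per line. The class `IanaMediaTypesMerger` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the actual update is done by the shell script.
final class IanaMediaTypesMerger {

  // Registries named in the text above, lower-cased for use in the URL.
  static final List<String> REGISTRIES = List.of("application", "audio", "font",
      "image", "message", "model", "multipart", "text", "video");

  // Builds the download URL for one registry.
  static String csvUrl(String registry) {
    return "https://www.iana.org/assignments/media-types/" + registry + ".csv";
  }

  // Extracts the second column of every data row. This is a naive split
  // on commas: quoted fields that contain commas are not handled.
  static List<String> templates(String csv) {
    List<String> result = new ArrayList<>();
    String[] lines = csv.split("\\R");
    for (int i = 1; i < lines.length; i++) { // skip the header row
      String[] columns = lines[i].split(",", -1);
      if (columns.length > 1 && !columns[1].isBlank()) {
        result.add(columns[1].trim());
      }
    }
    return result;
  }

  // Appends the media types of all CSV files, one per line.
  static String merge(List<String> csvFiles) {
    List<String> all = new ArrayList<>();
    for (String csv : csvFiles) {
      all.addAll(templates(csv));
    }
    return String.join("\n", all) + "\n";
  }
}
```

Rows with an empty second column are skipped, so entries without a registered template do not produce blank lines in the merged list.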

Extend List of IANA Media Types

The IANA media types list from https://www.iana.org/assignments/media-types/${iana_file}.csv was extended with the following media types:

Image
Text
Video

Update Extended list of IANA Media Types

Commercial support

For more information or commercial support, contact KEEP SOLUTIONS.

Further reading

Contributing

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D

License and Intellectual Property

All contributions to this project are licensed under LGPL v3, which includes an explicit grant of patent rights, meaning that the developers who created or contributed to the code relinquish their patent rights with regard to any subsequent reuse of the software.

Credits

License

LGPLv3