UK-based digital archiving provider Arkivum is one of three providers selected for a €4.8m second prototyping phase of the European multinational Archiver project, which aims to provide petabyte-scale data archiving and preservation for its scientific partners.
The project is led by CERN – home of the Large Hadron Collider near Geneva – and also comprises DESY (Deutsches Elektronen-Synchrotron), EMBL-EBI (European Bioinformatics Institute) and PIC (Port d’Informació Científica).
The Archiver project aims to provide petabyte-scale storage for a wide variety of research and analytics use cases for the scientific partners involved, said João Fernandes, Archiver project leader from CERN’s IT department.
The scalability of the technology is a high priority because the expected eventual capacity will be in the tens of petabytes. In the prototype phase of the project, the system will ingest data at rates of up to 100TB a day.
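To put the 100TB-a-day figure in perspective, the short calculation below converts it into a sustained transfer rate. It is a back-of-envelope sketch only: it assumes decimal units (1TB = 10^12 bytes) and ingest spread evenly over 24 hours, and the 10PB target used for the fill-time estimate is a hypothetical stand-in for "tens of petabytes", not a project figure.

```python
# Back-of-envelope: sustained throughput implied by a daily ingest volume.
# Assumptions (not from the project): decimal units, evenly spread 24-hour day.

SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(tb_per_day: float) -> tuple[float, float]:
    """Return (gigabytes per second, gigabits per second)."""
    bytes_per_second = tb_per_day * 10**12 / SECONDS_PER_DAY
    gbytes_per_second = bytes_per_second / 10**9
    return gbytes_per_second, gbytes_per_second * 8


def days_to_fill(capacity_pb: float, tb_per_day: float) -> float:
    """Days needed to ingest capacity_pb petabytes at tb_per_day."""
    return capacity_pb * 1000 / tb_per_day


gb_s, gbit_s = sustained_rate(100)
print(f"{gb_s:.2f} GB/s, {gbit_s:.2f} Gbit/s")  # 1.16 GB/s, 9.26 Gbit/s

# Hypothetical 10PB archive, used only for illustration.
print(f"{days_to_fill(10, 100):.0f} days to ingest 10PB")  # 100 days
```

On these assumptions, the peak prototype rate would keep a 10Gbit/s link almost fully occupied around the clock.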
This second phase is worth €4.8m and will last eight months. Archiver is co-funded by the European Union’s Horizon 2020 research and innovation programme.
“It’s not just about storing bits, but also about control of the data, so preserving what has been done with the data previously, who by, and keeping the documentation and the software,” said Fernandes.
Those requirements are summed up in the FAIR principles – Findable, Accessible, Interoperable and Reusable – which are intended to ensure that experiments can be reproduced, and continued if needed, long after they were last worked on.
The time between experiments can be lengthy, so data has significant long-term value and needs to remain active and accessible in the archive for possibly decades after a research project has ended. At present, custom-built databases for handling complex and sometimes sensitive datasets present barriers to researchers uploading and downloading data.
“We have about 10 major use cases,” said Fernandes. “Some are purely about data preservation. Some are about keeping a second copy, but also about being able to reproduce the analysis. Others, such as those connected to genome work, involve data that is increasing by 50% a year and decisions will need to be made about where data is cached to allow access to it.”
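The effect of 50% annual growth compounds quickly. The sketch below illustrates the arithmetic; the 1PB starting size is a hypothetical value chosen for illustration, and the growth rate is assumed to stay constant, which the article does not state.

```python
import math


def projected_size(initial_pb: float, annual_growth: float, years: int) -> float:
    """Dataset size after compounding annual_growth for a number of years."""
    return initial_pb * (1 + annual_growth) ** years


def years_to_multiply(factor: float, annual_growth: float) -> float:
    """Years needed for the dataset to grow by the given factor."""
    return math.log(factor) / math.log(1 + annual_growth)


GROWTH = 0.5  # 50% a year, as quoted above

# Hypothetical 1PB starting point, for illustration only.
print(projected_size(1.0, GROWTH, 5))             # 7.59375 (PB after five years)
print(round(years_to_multiply(2, GROWTH), 2))     # 1.71 years to double
print(round(years_to_multiply(10, GROWTH), 2))    # 5.68 years to grow tenfold
```

At that rate a dataset doubles in well under two years, which is why decisions about where data is cached cannot be deferred for long.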
Fernandes said the protocols that need to be supported range from those developed by the institutions themselves – such as CERN/Stanford’s xrootd – to wider industry-standard methods, such as S3 object storage.
The three designs chosen for the second prototyping stage are: Arkivum, with a managed storage-as-a-service layer that uses Google Cloud Platform; Libnova, which Fernandes described as “similar to Arkivum” but which uses AWS and is “more proprietary” in terms of the software used; and Onedata, which is delivered by Deutsche Telekom/T-Systems and is based on open source software.
Each of the three designs will be evaluated in limited prototype projects before the project moves to its next phase and to multi-petabyte scale in 2022.
Arkivum is a spinout from the University of Southampton IT Innovation Centre. It successfully completed the initial design phase of the three-year project in October 2020.