The European Molecular Biology Laboratory (EMBL) runs a data lake built on NetApp storage across two datacentre sites to support scientific research.
Modern scientific research depends heavily on big data analysis and increasingly relies on artificial intelligence (AI) methods. EMBL’s research groups develop their own AI models, which are trained and run from Jupyter notebooks in container-based environments. EMBL manages the underlying data with NetApp Astra Trident, which provides persistent storage for those containers.
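As an illustration of this pattern, the sketch below shows how a containerised Jupyter workload might request Trident-backed persistent storage through the Kubernetes API. It is a minimal example under stated assumptions, not EMBL’s actual configuration: the namespace, claim name, storage class and volume size are all hypothetical.

```python
# Minimal sketch: requesting a Trident-backed persistent volume for a
# containerised Jupyter workload via the Kubernetes Python client.
# Namespace, names and sizes are hypothetical, not EMBL's configuration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "jupyter-scratch"},
    "spec": {
        "accessModes": ["ReadWriteMany"],        # NFS-style shared access
        "storageClassName": "ontap-nas",         # hypothetical Trident class
        "resources": {"requests": {"storage": "500Gi"}},
    },
}

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="research", body=pvc
)
```

Once the claim is bound, Trident dynamically provisions a volume on the backing NetApp system; the notebook container simply mounts the claim, and the data survives container restarts.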
EMBL uses NetApp cloud and data services to deliver up to 400PB of scientific data to its more than 80 research groups and the global research community.
Rupert Lueck, head of IT at EMBL, said the lab runs huge experiments at its imaging and sequencing centres using high-end electron microscopes. Cryo-electron microscopy allows the spatial structure and function of individual molecules to be studied very precisely, but the technique generates “tonnes of data”: EMBL needs to store 10-15PB of research data a year across all its sites, he said.
The analysis of the experimental data is often performed on EMBL’s high-performance compute clusters and cloud systems. Both are accessed by many scientists simultaneously and therefore have extremely high data throughput requirements. NetApp systems at EMBL meet these requirements, both for the research groups’ applications running on the compute clusters and for efficient interaction between the systems and services involved.
To support the data requirements of its researchers, EMBL’s data lake comprises several clusters distributed across the institute’s sites. The datacentres at EMBL in Heidelberg and Cambridge provide a total of more than 400PB of storage on NetApp systems.
The setup is designed to offer efficient access to the extensive data volumes via Network File System (NFS) and Common Internet File System (CIFS). It supports uninterrupted movement of demanding datasets, such as those used for machine learning-based data analysis or the training of AI models, and enables hardware and data to be migrated without downtime.
In the current setup, said Lueck, all the data is committed to disk and then goes through a data-processing pipeline, where it is analysed for quality. The data ends up in the data lake, but some is also passed on to EMBL’s high-performance computing facility. Other datasets are processed using GPUs.
Lueck said EMBL’s storage infrastructure has evolved with the advent of cloud-based storage and containerisation. “We are moving some workloads into the cloud and we are exploring cloud-based data provisioning,” he said. “NetApp Trident allows us to provision storage flexibly into Kubernetes or OpenStack.”
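By way of illustration, provisioning in Kubernetes hinges on a StorageClass that delegates volume creation to the Trident CSI driver; a claim such as the one sketched earlier then triggers dynamic provisioning. The class name and backend parameters below are hypothetical, though `csi.trident.netapp.io` is Trident’s actual CSI provisioner name.

```python
# Minimal sketch: a StorageClass that hands provisioning to the Trident
# CSI driver. Class name and backend parameters are hypothetical.
from kubernetes import client, config

config.load_kube_config()

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "ontap-nas"},
    "provisioner": "csi.trident.netapp.io",      # Trident's CSI provisioner
    "parameters": {"backendType": "ontap-nas"},  # selects the NetApp backend type
    "allowVolumeExpansion": True,
}

client.StorageV1Api().create_storage_class(body=storage_class)
```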
Persistent and object-based storage, provisioned for containers through NetApp Trident, is an important part of the organisation’s data storage strategy. Lueck added: “We need to ensure data is stored redundantly and can be provisioned very fast using a storage grid.”
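The storage grid Lueck mentions is most likely NetApp StorageGRID, whose object store exposes an S3-compatible API. As a rough sketch of what object access looks like from a research workload’s perspective, the example below uses boto3 against a hypothetical endpoint; the URL, bucket, file names and credentials are all placeholders.

```python
# Minimal sketch: reading and writing research data through an
# S3-compatible object store such as NetApp StorageGRID.
# Endpoint, bucket, keys and credentials are placeholders, not EMBL's.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.storagegrid.example",  # hypothetical gateway URL
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Upload a processed dataset to the object store.
s3.upload_file("results/run-042.h5", "research-data", "cryo-em/run-042.h5")

# Stream it back later, e.g. from a training job in another container.
s3.download_file("research-data", "cryo-em/run-042.h5", "/tmp/run-042.h5")
```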