NetApp provides data infrastructure for EMBL

11 February 2022

The European Molecular Biology Laboratory (EMBL) runs a data lake on NetApp storage across two datacentre sites to support scientific research.

Modern scientific research is heavily dependent on big data analysis and increasingly relies on artificial intelligence (AI)-based methods. EMBL’s research groups develop their own AI models, which are trained and operated with Jupyter notebooks in container-based environments. EMBL manages the underlying data with NetApp Astra Trident, providing persistent data storage for container environments.

EMBL uses NetApp cloud and data services to deliver up to 400PB of scientific data to its more than 80 research groups and the global research community.

Rupert Lueck, head of IT at EMBL, said the lab runs huge experiments at its imaging and sequencing centres using high-end electron microscopes. Cryo-electron microscopy allows the spatial structure and function of individual molecules to be studied very precisely. The technique generates “tonnes of data”, he said, so EMBL needs to store 10-15PB of research data a year across all its sites.
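To put the 10-15PB a year figure in perspective, the short sketch below converts it into an average sustained ingest rate. It is illustrative arithmetic only: it assumes decimal petabytes (10^15 bytes) and a perfectly even spread over a 365-day year, neither of which is stated in the article, and the function name is hypothetical.

```python
# Rough conversion of a yearly data volume into an average ingest rate.
# Assumptions (not from the article): 1 PB = 10**15 bytes, and data
# arrives evenly over a 365-day year. Real instrument output is bursty.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
BYTES_PER_PB = 10**15


def average_ingest_rate_mb_per_s(petabytes_per_year: float) -> float:
    """Return the average sustained rate in MB/s (1 MB = 10**6 bytes)."""
    bytes_per_second = petabytes_per_year * BYTES_PER_PB / SECONDS_PER_YEAR
    return bytes_per_second / 10**6


if __name__ == "__main__":
    for pb in (10, 15):
        rate = average_ingest_rate_mb_per_s(pb)
        print(f"{pb} PB/year is about {rate:.0f} MB/s sustained")
```

Under these assumptions the average works out to a few hundred megabytes per second, around the clock, before any analysis traffic from the compute clusters is counted.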

The experimental data is often analysed on EMBL’s high-performance compute clusters and cloud systems. Both are accessed by many scientists simultaneously and so have extremely high data throughput requirements. NetApp systems at EMBL support these requirements, both for research groups’ applications running on the compute clusters and for efficient interaction between the systems and services involved.

To support the data requirements of its researchers, EMBL’s data lake comprises several clusters distributed across the institute’s sites. The datacentres at EMBL in Heidelberg and Cambridge provide a total of more than 400PB of storage on NetApp systems.

The setup is designed to offer efficient access to the extensive data volumes via Network File System (NFS) and Common Internet File System (CIFS). It supports uninterrupted movement of demanding datasets, such as those used for data analysis based on machine learning or for training AI models, and enables hardware and data to be migrated without downtime.

In the current setup, said Lueck, all the data is committed to disk and then goes through a data-processing pipeline, where it is analysed for quality. The data ends up in the data lake, but some is also passed on to EMBL’s high-performance computing facility. Other datasets are processed using GPUs.

Lueck said EMBL’s storage infrastructure has evolved with the advent of cloud-based storage and containerisation. “We are moving some workloads into the cloud and we are exploring cloud-based data provisioning,” he said. “NetApp Trident allows us to provision storage flexibly into Kubernetes or OpenStack.”
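As a rough illustration of what provisioning storage into Kubernetes looks like from the user's side, the sketch below builds a standard PersistentVolumeClaim manifest. It is not EMBL's configuration: the claim name, size and the storage class name `trident-nas` are hypothetical, and the sketch assumes a cluster administrator has already defined a storage class backed by Trident.

```python
import json


def build_pvc_manifest(name, size, storage_class, access_mode="ReadWriteMany"):
    """Build a Kubernetes PersistentVolumeClaim manifest as a plain dict.

    The storage class decides which provisioner serves the claim. Here it
    is assumed (not confirmed by the article) to be one backed by Trident.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
            "storageClassName": storage_class,
        },
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    pvc = build_pvc_manifest("jupyter-workspace", "100Gi", "trident-nas")
    print(json.dumps(pvc, indent=2))
```

The printed JSON could be applied with `kubectl apply -f -`, after which a containerised workload such as a Jupyter notebook would mount the claim as a persistent volume. The `ReadWriteMany` default reflects the shared, NFS-style access the article describes, but that choice is also an assumption.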

Persistent, object-based storage, available on NetApp Trident for containerisation, is an important part of the organisation’s data storage strategy. Lueck added: “We need to ensure data is stored redundantly and can be provisioned very fast using a storage grid.”

Source: ComputerWeekly.com
