The endless capacity of cloud object storage with Posix-compatible file access – that’s the promise of CunoFS, from Cambridge-based PetaGene, which aims to resolve the increasingly widespread challenge of how to combine high-performance compute and huge amounts of stored data.
It’s a challenge for workloads such as artificial intelligence (AI), video production, medical research and security anomaly detection, which work best when they can frequently and rapidly refresh data from deep or widely distributed sources. Ordinarily, reading and writing quickly while accessing very large volumes of storage is a costly proposition.
“To save costs on capacity, the solution is to use object storage, locally or in the cloud,” said Dan Greenfield, co-founder and CEO of PetaGene, at a recent IT Press Tour event in Berlin attended by LeMagIT. “The problem is that often applications are not built for object storage. They open, save, etc, as files, typically in NAS [network-attached storage]. They don’t send HTTP requests as needed in [Amazon] S3.”
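The gap Greenfield describes can be sketched in a few lines. The snippet below is a hypothetical illustration, not PetaGene code: it contrasts the Posix-style file access most applications are written for with the HTTP request an S3 client must issue instead (the bucket and file names are invented).

```python
import boto3  # AWS SDK for Python; every S3 call is ultimately an HTTP request

# File mode, as most applications are written (local disk or NAS mount):
with open("/mnt/nas/results/run1.csv", "rb") as f:
    data = f.read()

# Object mode: no open/seek/close; a GET request returns the whole object.
s3 = boto3.client("s3")
response = s3.get_object(Bucket="my-results", Key="results/run1.csv")
data = response["Body"].read()
```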
Put file access in S3 buckets
Greenfield provided some examples: “On AWS [Amazon Web Services], 1TB [terabyte] of frequently accessed S3 storage costs $276 a month. With AWS’s NAS service, EFS, the bill rises to $3,600 a month. And if you want the same kind of parallel access offered in S3, you’d need AWS FSx Lustre, and for 1TB that costs $7,200 a month.”
Even so, few enterprises are likely to convert applications to work with S3 the way they currently work in file mode. Leaving aside the differences in communication protocol – REST APIs versus access through the host operating system – existing algorithms and user habits would also have to change.
In object storage, there is no concept of directories, no Posix management of users or user groups, and no in-place modification of data, only the creation of new versions. In short, the migration from file to object is a long and costly process.
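The “no modifications” point is often the hardest to work around. As a hedged illustration of why (again with invented names), appending a line to a file is a one-call Posix operation, whereas S3 objects are immutable and must be rewritten in full:

```python
import boto3

# Posix: append in place -- only the new bytes are written.
with open("/mnt/nas/app.log", "a") as f:
    f.write("one more line\n")

# S3: no append -- download the whole object, then re-upload it as a new version.
s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-logs", Key="app.log")["Body"].read()
s3.put_object(Bucket="my-logs", Key="app.log", Body=body + b"one more line\n")
```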
The traditional approach to the high cost of NAS access has been to put a gateway in front of object storage and convert NFS or SMB traffic on the fly. One example of such a gateway is the Python-based open source s3fs.
“The problem with this type of architecture is that the gateway creates a bottleneck, funnelling into a single stream all the accesses that servers could otherwise make to S3 in parallel,” said Greenfield. “Our solution is to deploy a file/object gateway in every server that runs an application.”
CunoFS is mounted on servers at the Posix path “/cuno/s3” and pointed at the bucket specified in its configuration. That makes it possible to navigate through directories using the “cd” command, extract files via “tar”, change access rights with “chmod”, filter contents with “grep”, and so on.
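The upshot is that ordinary Posix code runs unchanged against the mount. A minimal sketch, assuming a bucket already mounted under the hypothetical path “/cuno/s3/my-bucket”:

```python
import os
import shutil

base = "/cuno/s3/my-bucket/project"  # an S3 bucket seen as a directory tree

os.makedirs(base, exist_ok=True)  # directories inside the bucket
with open(os.path.join(base, "notes.txt"), "w") as f:
    f.write("written through the Posix API\n")  # a plain file write lands in S3

os.chmod(os.path.join(base, "notes.txt"), 0o640)  # access rights, as with chmod
print(os.listdir(base))                           # navigation, as with cd/ls
shutil.copy(os.path.join(base, "notes.txt"), "/tmp/notes.txt")
```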
Much faster than traditional NAS
CunoFS doesn’t just avoid the bottleneck of a single gateway; it also accelerates access beyond what is possible with traditional NAS.
According to performance figures from PetaGene, CunoFS installed on an AWS virtual server writes the Linux kernel source code to S3 storage in 128 seconds and reads it back in 21 seconds.
Using an application server to write the same code to AWS’s EFS NAS takes six minutes, and reading it back takes 10.5 minutes. Here, the write is quicker than the read because EFS uses a cache.
Passing through an external NAS/object gateway like s3fs to write the same code from the same server to the same S3 storage takes a little over two hours, while the read takes around 15 minutes.
AI frameworks are another use case likely to matter to a growing number of companies. A PyTorch server hosted on Google Cloud Platform (GCP) writes at 260Mbps to an object storage service via an s3fs gateway, and at 350Mbps to a NAS with no conversion step. With CunoFS on the PyTorch server, that leaps to 20Gbps.
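In the PyTorch case, the training code itself needs no changes; the dataset simply reads from the mount path. The sketch below is a hypothetical example (the mount path and file layout are invented), not PetaGene’s benchmark code:

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

class MountedS3Dataset(Dataset):
    """Reads training samples as ordinary files from a CunoFS mount."""

    def __init__(self, root="/cuno/s3/training-data/tensors"):
        self.files = sorted(
            os.path.join(root, name)
            for name in os.listdir(root) if name.endswith(".pt")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # A plain file read; the data actually comes from an S3 bucket.
        return torch.load(self.files[idx])

loader = DataLoader(MountedS3Dataset(), batch_size=32, num_workers=8)
```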
How does CunoFS read and write files faster than a NAS that has no HTTP requests to deal with? Because CunoFS is not just a local gateway, but also an efficient on-the-fly compression tool. It is quicker because it transfers far less data.
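The effect is easy to reproduce in miniature. The sketch below uses Python’s general-purpose zlib rather than PetaGene’s proprietary codecs, and only shows the principle: at a fixed network rate, transfer time scales with the bytes sent.

```python
import zlib

# Repetitive data, such as logs or genomic text, compresses well.
payload = b"timestamp,sensor,value\n" * 100_000
compressed = zlib.compress(payload, 6)

ratio = len(compressed) / len(payload)
print(f"raw: {len(payload):,} bytes")
print(f"compressed: {len(compressed):,} bytes ({ratio:.1%} of original)")
# A 10x reduction in bytes sent is roughly a 10x faster transfer.
```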
CunoFS: A variant of PetaSuite
PetaGene started out as a supplier of compression tools for genomics labs with PetaSuite, which could shrink files by 60% to 90%.
PetaSuite was followed by the PetaLink library, which enabled compression and rehydration of files on the fly on application servers. It was this that allowed for accelerated reads and writes on NAS.
In 2018, the platform gained the ability to store files in S3 buckets with on-the-fly conversion to object mode, but it was four more years before the module was used for anything other than genomic data.
“Initially, PetaSuite Cloud Edition was very efficient at saving very large files to the cloud, but performance was very disappointing on files of more usual sizes,” said Greenfield. “We understood that resolving this problem would allow us to expand our customer base to all those who want to process large numbers of files.”
As it happens, PetaGene realised it had made the mistake of storing Posix metadata (directories, access rights) alongside the other metadata held in S3 (author’s name, content type).
“Posix metadata is much simpler than S3 metadata,” said Greenfield. “It is much more compressible, and we can federate it between several files. It is therefore possible to treat it separately and this is how PetaSuite Cloud Edition became CunoFS.”
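The observation is easy to check in miniature. In this hypothetical sketch (with fabricated metadata records), the Posix attributes of 10,000 files are federated into a single blob before compression, which shrinks far better than compressing one record per object:

```python
import json
import zlib

# Fabricated Posix metadata for 10,000 files: the records are nearly identical.
records = [
    {"mode": 0o644, "uid": 1000, "gid": 1000,
     "size": 4096 + i, "mtime": 1700000000 + i}
    for i in range(10_000)
]

# Federated: one blob covering many files compresses extremely well.
packed = zlib.compress(json.dumps(records).encode(), 9)

# Per-object: each small record compresses poorly on its own.
per_file = sum(len(zlib.compress(json.dumps(r).encode(), 9)) for r in records)

print(f"federated + compressed: {len(packed):,} bytes")
print(f"compressed per file:    {per_file:,} bytes")
```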
Since 2022, CunoFS has won over some big storage players, most notably Dell and NetApp, which see in it a way to accelerate their own solutions.
PetaGene also plans to extend CunoFS beyond Linux servers, with client versions for Windows and macOS on the way, as well as a CSI driver for Kubernetes clusters. A version compatible with ARM servers is expected by the end of this year.