ARM and Meta: Plotting a path to dilute GPU capacity

Source is ComputerWeekly.com

News that ARM is developing its own datacentre processors for Meta, as reported in the Financial Times, signals the chip designer’s move to capitalise on the tech industry’s appetite for affordable, energy-efficient artificial intelligence (AI).

Hyperscalers and social media giants such as Meta use vast arrays of expensive graphics processing units (GPUs) to run workloads that require AI acceleration. But along with the cost, GPUs tend to use a lot of energy and require investment in liquid cooling infrastructure.

Meta sees AI as a strategic technology initiative that spans its platforms, including Facebook, Instagram and WhatsApp. CEO Mark Zuckerberg is positioning Meta AI as the artificial intelligence everyone will use. In the company’s latest earnings call, he said: “In AI, I expect this is going to be the year when a highly intelligent and personalised AI assistant reaches more than one billion people, and I expect Meta AI to be that leading AI assistant.”

To reach this volume of people, the company has been working to scale its AI infrastructure and plans to migrate from GPU-based AI acceleration to custom silicon chips, optimised for its workloads and datacentres.

During the earnings call, Meta chief financial officer Susan Li said the company was “very invested in developing our own custom silicon for unique workloads, where off-the-shelf silicon isn’t necessarily optimal”.

In 2023, the company began a long-term venture called Meta Training and Inference Accelerator (MTIA) to provide the most efficient architecture for its unique workloads.

Li said Meta began adopting MTIA in the first half of 2024 for core ranking and recommendations inference. “We’ll continue ramping adoption for those workloads over the course of 2025 as we use it for both incremental capacity and to replace some GPU-based servers when they reach the end of their useful lives,” she added. “Next year, we’re hoping to expand MTIA to support some of our core AI training workloads, and over time some of our GenAI [generative AI] use cases.”

Driving efficiency and total cost of ownership

Meta has previously said efficiency is one of the most important factors for deploying MTIA in its datacentres. This is measured as performance per watt (TFLOPS/W), which it said is a key component of total cost of ownership. The MTIA chip is fitted to an Open Compute Project (OCP) plug-in module, which consumes about 35W. But the MTIA architecture still requires a central processing unit (CPU) together with memory and chips for connectivity.
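To illustrate why performance per watt feeds into total cost of ownership, the sketch below compares two accelerator modules. All figures are hypothetical assumptions for illustration only (apart from the roughly 35W OCP module power mentioned above); they are not published specifications for MTIA or any GPU.

```python
# Hypothetical sketch: performance-per-watt (TFLOPS/W) as a TCO input.
# The TFLOPS and wattage figures below are assumptions, not real specs.

def perf_per_watt(tflops: float, watts: float) -> float:
    """Performance per watt in TFLOPS/W - higher means less energy
    (and so less power/cooling cost) per unit of compute."""
    return tflops / watts

# (assumed TFLOPS, assumed watts) per module, for illustration only
accelerators = {
    "generic_gpu_module": (300.0, 700.0),
    "mtia_style_module": (25.0, 35.0),  # ~35W OCP module, per the article
}

for name, (tflops, watts) in accelerators.items():
    print(f"{name}: {perf_per_watt(tflops, watts):.2f} TFLOPS/W")
```

Under these assumed numbers, the lower-power module delivers more compute per watt even though its absolute throughput is far smaller, which is the trade-off a workload-specific chip can exploit.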

The reported work it is doing with ARM could help the company move from the highly customised application-specific integrated circuits (ASICs) it developed for its first-generation chip, MTIA 1, to a next-generation architecture based on general-purpose ARM processor cores.

Looking at ARM’s latest earnings, the company is positioning itself to offer AI that can scale power efficiently. ARM has previously partnered with Nvidia to deliver power-efficient AI in the Nvidia Grace Blackwell architecture.

At the Consumer Electronics Show in January, Nvidia unveiled the ARM-based GB10 Grace Blackwell Superchip, which it claimed offers a petaflop of AI computing performance for prototyping, fine-tuning and running large AI models. The chip uses an ARM processor with Nvidia’s Blackwell accelerator to improve the performance of AI workloads.

The semiconductor industry offers system on a chip (SoC) devices, where various computer building blocks are integrated into a single chip. Grace Blackwell is an example of an SoC. Given the work Meta has been doing to develop its MTIA chip, the company may well be exploring how it can work with ARM to integrate its own technology with the ARM CPU on a single device.

Although an SoC is more complex from a chip fabrication perspective, the economies of scale once production ramps up, and the fact that the device integrates several external components into one package, make it considerably more cost-effective for system builders.

Li’s remarks on replacing GPU servers, and MTIA’s goal of reducing Meta’s total cost of ownership for AI, align with the reported deal with ARM, which would potentially enable the company to scale up AI cost-effectively and reduce its reliance on GPU-based AI acceleration.

Boosting ARM’s AI credentials

ARM, which is a SoftBank company, recently found itself at the core of the Trump administration’s Stargate Project, a SoftBank-backed initiative to deploy sovereign AI capabilities in the US.

During the earnings call for ARM’s latest quarterly results, CEO Rene Haas described Stargate as “an extremely significant infrastructure project”, adding: “We are extremely excited to be the CPU of choice for such a platform combined with the Blackwell CPU with [ARM-based] Grace. Going forward, there’ll be huge potential for technology innovation around that space.”

Haas also spoke about the Cristal intelligence collaboration with OpenAI, which he said enables AI agents to move across every node of the hardware ecosystem. “If you think about the smallest devices, such as earbuds, all the way to the datacentre, this is really about agents increasingly being the interface and/or the driver of everything that drives AI inside the device,” he added.
