The breakneck pace at which AI models have evolved since the first AI programs were written in 1951 is impressive. Over the past few years, we have witnessed remarkable progress in machine learning, natural language processing, computer vision, and robotics. Indeed, some experts estimate that the pace of AI advancement doubles every few years, and they expect this exponential trajectory to persist.

What is not keeping pace with the development of AI? Traditional data center system design. As researchers continue to push the boundaries of AI, models with hundreds of trillions of parameters could soon become routine.

How can organizations address these data center constraints amid the growing demands of AI and machine learning workloads?


The “Paradigm Shift”

The phrase may be overused, but it aptly describes what we’re seeing in data centers affected by AI data traffic. Traditionally, data centers were built primarily around north-south traffic, where data moved between servers and the internet. With the rise of AI and scale-out technologies, however, the focus has shifted to east-west traffic, characterized by extensive communication between servers, nodes, and data centers.

This shift means far more inter-server communication, which demands a significant increase in bandwidth and the ability to scale it effectively. North-south traffic historically had modest bandwidth requirements; the east-west traffic generated by AI workloads is another matter entirely.


Scalability and GPU Focus

It’s no secret that data centers require more bandwidth between servers, racks, and data centers to handle AI and machine learning workloads efficiently. To meet these demands, there has been a significant focus on scalability, both within servers themselves and in the communication between servers.

The most noteworthy development in AI-oriented high-performance computing (HPC) has been the use of GPUs (and now even custom ASICs) to accelerate training workloads. One key characteristic of an AI training workload is that it generates enormous amounts of east-west traffic, more than even the most demanding traditional HPC tasks. GPUs and ASICs also pack enormous computing power into a single host, so each host is capable of generating over 100Gbps of traffic toward other hosts on an ongoing basis. This is why AI-related HPC has now overtaken general-purpose cloud computing as the driver for faster and more capable networking equipment.
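
As a rough illustration, here is a back-of-envelope calculation of the east-west traffic a single host and rack could generate; the NIC and host counts are the example figures used later in this article, not measurements from any specific deployment.

    # Back-of-envelope east-west bandwidth per host and per rack.
    # All figures are illustrative assumptions, not vendor specs.
    nics_per_host = 8         # e.g., eight 100GbE NICs per GPU server
    nic_speed_gbps = 100      # per-NIC line rate
    hosts_per_rack = 6        # assumed rack density

    host_bw_gbps = nics_per_host * nic_speed_gbps   # 800 Gbps per host
    rack_bw_gbps = host_bw_gbps * hosts_per_rack    # 4,800 Gbps per rack

    print(f"Per-host east-west capacity: {host_bw_gbps} Gbps")
    print(f"Per-rack east-west capacity: {rack_bw_gbps / 1000:.1f} Tbps")

Even with conservative assumptions, a single rack can demand multiple terabits per second of east-west capacity, which is why the network, not the compute, is often the first bottleneck.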

The Impact of AI on Data Center Infrastructure

AI-focused data center design faces significant challenges in power density, thermal management, electrical infrastructure, and physical space. A modern AI-focused server can consume up to 6,000W; put six of those in a standard 42U rack and you reach roughly 37kW per rack, and a deployment may run to 100+ racks of this hardware. Note that the big names in AI, such as OpenAI, don’t publish their footprints, but those footprints are almost certainly substantially larger than these numbers. This is world-leading scale, and it demands new ways of thinking about how to build and operate data centers. Power and thermal management systems could be articles of their own, but as far as the servers themselves go, the hardware is built to custom specifications (though in a standard form factor), and its lifecycle is much shorter than in any other environment. The traditional five-year refresh cycle simply doesn’t exist in the current state of AI development.
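
The power math alone makes the challenge clear. Here is a minimal sketch of the arithmetic behind the figures above; the per-server draw and rack density come from this article, while the facility-level total is an extrapolation for illustration.

    # Rack and facility power arithmetic using the figures cited above.
    server_power_w = 6000     # per-server draw cited in the text
    servers_per_rack = 6
    racks = 100               # "100+ racks" from the text

    rack_power_kw = server_power_w * servers_per_rack / 1000   # 36 kW (~37 kW with overhead)
    it_load_mw = rack_power_kw * racks / 1000                  # IT load alone, before cooling

    print(f"Per-rack IT load: {rack_power_kw:.0f} kW")
    print(f"Facility IT load at {racks} racks: {it_load_mw:.1f} MW (before cooling overhead)")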

How to Manage a Herd of Servers

Managing that herd of innumerable servers is now a field unto itself. Automation is key: every component’s management interface must itself be networked, and with that in place, the majority of data center operations tasks can be handled without human intervention. This makes dealing with hardware failures a far more automated process. An individual server failure ceases to be an issue warranting immediate attention and becomes part of regular maintenance operations. New hardware can bootstrap its OS over the network, load a core OS, and join its cluster without human intervention, so the only labor required is the physical hardware swap. With sufficient capacity planning, failed hardware can then be replaced at weekly, monthly, or even longer intervals. With this change in maintenance strategy, some companies have moved away from regular maintenance contracts entirely, simply replacing failed hardware rather than paying to repair it.
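
To make the workflow concrete, here is a minimal sketch of the automated failure-handling loop described above. The function bodies are hypothetical stubs for illustration; they stand in for whatever inventory service and networked management interfaces a real fleet would use.

    # Minimal sketch of automated failure handling.
    # Host names and function bodies are hypothetical placeholders,
    # not a real fleet-management API.
    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        healthy: bool

    def drain(host: Host) -> None:
        """Remove the host from its cluster so work reschedules elsewhere."""
        print(f"draining {host.name}")

    def queue_for_swap(host: Host) -> None:
        """Queue the host for the next scheduled hardware-swap window."""
        print(f"{host.name} queued for the weekly/monthly swap batch")

    fleet = [Host("gpu-node-17", healthy=False), Host("gpu-node-18", healthy=True)]

    for host in fleet:
        if not host.healthy:
            drain(host)           # no human intervention needed here
            queue_for_swap(host)  # replacement happens at the batch cadence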

For companies playing in the AI space, every detail matters when building scalable, efficient infrastructure. A savvy data center architect weighs all the core requirements and then makes the best decisions on how to balance the many variables that go into a large-scale data center.



How Can OSI Help?

With port speeds above 10GbE, passive and active transceivers are the only viable solution, since UTP copper does not scale past 10GbE. Such deployments may include a wide variety of transceiver types (active and passive) and speeds (10GbE, 25GbE, 40GbE, 50GbE, 100GbE, and 400GbE, for example).

In an AI-focused data center, each host may have up to eight 100GbE NICs, and there may be 6+ hosts per rack (and dozens, hundreds, or thousands of racks). At this scale, the number of 100GbE/400GbE+ optics required per rack adds up quickly, as does the total cost of the optical solution. When a large number of network optics is needed, pricing, availability, and the ability to deploy and scale data centers quickly are paramount. OSI specializes in third-party optics, and we help customers across the globe save money and time.
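
To see how quickly the optic count grows, here is a rough per-rack tally using the figures above; the switch-side count and rack total are illustrative assumptions, not a quote for any particular design.

    # Rough optic count per rack, using the host figures cited above.
    # Switch-side assumptions are illustrative, not from a specific design.
    nics_per_host = 8
    hosts_per_rack = 6
    host_side_optics = nics_per_host * hosts_per_rack   # 48 server-facing optics

    # Each server-facing port typically has a matching switch-side port,
    # so the in-rack total roughly doubles before counting uplinks.
    switch_side_optics = host_side_optics
    per_rack_total = host_side_optics + switch_side_optics   # ~96 optics per rack

    racks = 100
    print(f"Optics per rack (before uplinks): {per_rack_total}")
    print(f"Fleet total at {racks} racks: {per_rack_total * racks}")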

The scale-out, east-west-heavy nature of these data centers means intra- and inter-site connectivity must scale as well. Open Line Systems can help scale bandwidth per fiber pair when communicating between buildings or sites. OSI can help design, build, and implement these solutions to scale.
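
For a sense of the scale an open line system can provide, here is an illustrative calculation; the channel count and per-channel rate are common DWDM figures assumed for the example, not a specification of any particular system.

    # Illustrative DWDM capacity per fiber pair on an open line system.
    # Channel count and per-channel rate are assumed example figures.
    channels = 64             # e.g., 64 wavelengths on a C-band system
    gbps_per_channel = 400    # e.g., 400G coherent optics per wavelength

    fiber_pair_tbps = channels * gbps_per_channel / 1000
    print(f"Capacity per fiber pair: {fiber_pair_tbps:.1f} Tbps")   # 25.6 Tbps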

OSI can help you unlock the full potential of AI-oriented high-performance computing. GPU-accelerated training workloads generate enormous amounts of east-west traffic, with each host pushing over 100Gbps toward its peers, and our optics and networking solutions keep that traffic moving seamlessly and efficiently. AI-related HPC is now the driving force behind the demand for faster and more capable networking equipment, and OSI can help you stay ahead of it.

Contact us today for a complimentary consultation about your data center requirements.