As AI inference adoption grows rapidly, memory requirements are climbing with it: context lengths now exceed one million tokens, concurrent users number in the hundreds or thousands, multiturn conversations are commonplace, and models keep getting larger. While GPUs and accelerators often dominate the conversation, a critical bottleneck lies behind the scenes: memory. Bandwidth, latency, and scalability challenges often determine the success or limits of AI systems.
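
To make the scale concrete, here is a rough sizing sketch for the key-value (KV) cache that transformer inference keeps per token. The formula is the usual one (keys plus values, per layer, per KV head); the model shape below is a hypothetical example, not a specific product or model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_tokens, concurrent_sequences, bytes_per_element=2):
    """Bytes needed to hold keys and values for every token of every sequence."""
    # The leading 2 accounts for storing both a key and a value vector.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens * concurrent_sequences


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 16-bit values.
one_sequence = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                              context_tokens=1_000_000, concurrent_sequences=1)
print(f"{one_sequence / 2**30:.1f} GiB")  # prints "122.1 GiB"
```

Under these assumptions a single million-token sequence already needs more memory than most individual accelerators carry, and the total grows linearly with the number of concurrent sequences.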

Compute Express Link (CXL), a high-speed, cache-coherent interconnect protocol designed to improve memory and resource sharing between processors and accelerators, has emerged as a transformative answer to these challenges. It enables the efficient resource utilization and flexible memory scaling that modern enterprises require.

CXL 3.0 represents a significant step forward from its predecessor by doubling bandwidth, introducing advanced fabric capabilities, and enabling direct peer-to-peer communication between devices. For technology leaders, that leap creates a new set of strategic decisions. Understanding the differences between these generations is essential for organizations looking to future-proof their infrastructure and maximize return on investment (ROI).

CXL 2.0 vs CXL 3.0

CXL marked a significant turning point in data center architecture. Before its introduction, traditional architectures forced CPUs, GPUs, and accelerators to operate in silos. Each processor was tied to its own isolated memory, often leaving capacity underused while other resources struggled for access. CXL 2.0 addressed this by introducing two concepts: memory expansion for local in-server memory and memory pooling that allows several servers to access the same physical memory over a special memory network or fabric.

Memory Expansion

In-server memory expansion is implemented using one or more CXL add-in cards (AICs) installed in available PCIe slots in the server, adding capacity directly to the server CPU's system memory. Memory pooling, by contrast, allows memory to be dynamically allocated to different servers through a CXL switch and CXL I/O adapter based on workload needs, rather than being statically bound to a single CPU or GPU.

Memory Pooling

CXL 2.0 brought basic memory pooling capabilities, allowing multiple hosts to connect to a single memory device or pool, with each region of that memory assigned to one host at a time. Together, expansion and pooling helped eliminate "stranded capacity" (memory sitting idle on one server while another struggles for access), enabling significantly higher utilization across the environment. For enterprise leaders, this translated directly to lower total cost of ownership (TCO): organizations could achieve the same performance with less total memory, avoiding the cost of overprovisioning hardware to compensate for inefficient allocation.
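
The overprovisioning argument can be sketched with a little arithmetic. In this simplified model every server is either sized for its own peak (static) or draws from one pool sized for the combined peak. The demand figures are hypothetical, and a real deployment would keep some local DRAM in each server and pool only part of the total.

```python
def static_capacity(demand_by_server):
    """Each server is provisioned for its own peak demand."""
    return sum(max(series) for series in demand_by_server.values())


def pooled_capacity(demand_by_server):
    """A shared pool is provisioned for the peak of the combined demand."""
    return max(sum(step) for step in zip(*demand_by_server.values()))


# Hypothetical memory demand in GiB at three points in time.
demand_gib = {
    "server_a": [200, 600, 300],
    "server_b": [500, 200, 300],
    "server_c": [100, 300, 700],
}

print(static_capacity(demand_gib))  # prints 1800
print(pooled_capacity(demand_gib))  # prints 1300
```

Because the three servers peak at different times, the pool needs 500 GiB less than the sum of the individual peaks in this example.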

Memory Sharing

To understand the strategic value of moving to CXL 3.0, it helps to first examine how it differs from CXL 2.0 at a fundamental level, starting with the distinction between memory pooling and memory sharing:

  • Memory Pooling (CXL 2.0): In a pooled environment, a segment of memory is assigned to a specific host. Once assigned, that host "owns" that memory region. If Host A needs more memory, the orchestrator can reallocate idle memory from the pool to Host A. However, Host B cannot access that same memory region simultaneously.
  • Memory Sharing (CXL 3.0): CXL 3.0 enables coherent memory sharing. This allows multiple hosts (e.g., the servers in a GPU cluster) to access and modify the same data at the same physical memory address simultaneously. Hardware-enforced cache coherency ensures that all devices see the most up-to-date data.
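
The ownership rules in the two bullets above can be expressed as a toy model. The `Region` class and host names are hypothetical and exist only for illustration; the sketch captures who may attach to a region, not the hardware coherency protocol that keeps shared data consistent.

```python
class Region:
    """A region of fabric-attached memory, either pooled or shared."""

    def __init__(self, name, shareable):
        self.name = name
        self.shareable = shareable  # False: pooled (one owner); True: shared
        self.hosts = set()

    def attach(self, host):
        # A pooled region accepts a new host only when no other host owns it.
        if self.hosts and not self.shareable and host not in self.hosts:
            raise PermissionError(f"{self.name} is already owned by another host")
        self.hosts.add(host)

    def detach(self, host):
        self.hosts.discard(host)


pooled = Region("pooled-0", shareable=False)
pooled.attach("host_a")          # host_a now owns the region
pooled.detach("host_a")          # the orchestrator returns it to the pool
pooled.attach("host_b")          # only now can host_b take it

shared = Region("shared-0", shareable=True)
shared.attach("host_a")
shared.attach("host_b")          # both hosts hold the same region at once
```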

For AI workloads, this distinction is profound. In training large models, traditional architectures require copying massive datasets to the local memory of each GPU. With coherent memory sharing, a single copy of the dataset can reside in a shared CXL memory pool, accessible by all GPUs in the cluster.

Likewise, in AI inference where different classes of GPUs are assigned to different tasks (e.g., prefill and decode), keeping the KV cache in shared memory can dramatically reduce time to first token (TTFT). Sharing memory drastically reduces the need for redundant data storage and eliminates the latency associated with copying data, streamlining both training and inference.
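
The difference between per-device copies and one shared copy can be illustrated in plain Python. This is only an analogy on ordinary host memory: `bytearray` copies stand in for per-GPU local memory, and `memoryview` objects stand in for workers reading one shared region. The payload and worker count are hypothetical, and nothing here touches CXL hardware.

```python
def private_copies(dataset, workers):
    """Give every worker its own copy, as with per-GPU local memory."""
    return [bytearray(dataset) for _ in range(workers)]


def shared_views(dataset, workers):
    """Give every worker a zero-copy view onto the single stored dataset."""
    return [memoryview(dataset) for _ in range(workers)]


dataset = bytearray(b"weights-v1")           # hypothetical payload
copies = private_copies(dataset, workers=4)  # 4 x 10 bytes stored
views = shared_views(dataset, workers=4)     # still only 10 bytes stored

dataset[-1] = ord("2")                       # update the single stored copy

print(bytes(views[0]))   # prints b'weights-v2': every view sees the update
print(bytes(copies[0]))  # prints b'weights-v1': each copy must be refreshed
```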

Key Improvements and New Features in CXL 3.0

While CXL 2.0 focused on resource pooling and initial memory disaggregation, CXL 3.0 represents a fundamental shift toward larger-scale, fabric-centric architectures and memory sharing capabilities. It builds upon the successes of its predecessor while introducing substantial performance and functional upgrades designed for the next generation of AI and HPC workloads. The most significant of these upgrades are doubled bandwidth, advanced fabric capabilities, and peer-to-peer communication between devices.

Double the Bandwidth

One of the most immediate performance gains in CXL 3.0 is the doubling of the link rate to 64 GT/s (gigatransfers per second) per lane, up from 32 GT/s in CXL 2.0. This increase aligns with PCIe 6.0 speeds, ensuring that the interconnect does not become a bottleneck as processors and accelerators continue to get faster. For data-intensive applications like real-time analytics and LLM training, this throughput is essential for maintaining high performance.
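
As a back-of-the-envelope check, the transfer rate converts to raw link bandwidth as follows. The x16 link width is an assumption chosen for illustration, and the result is the raw figure per direction before encoding and protocol overhead, so usable throughput will be lower.

```python
def raw_link_gbytes_per_s(gigatransfers_per_s, lanes):
    """Raw one-direction bandwidth: one bit per transfer per lane, 8 bits per byte."""
    return gigatransfers_per_s * lanes / 8


# Assumed x16 link in both cases.
print(raw_link_gbytes_per_s(32, lanes=16))  # prints 64.0  (32 GT/s, CXL 2.0)
print(raw_link_gbytes_per_s(64, lanes=16))  # prints 128.0 (64 GT/s, CXL 3.0)
```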

Fabric Capabilities and Multilevel Switching

CXL 3.0 introduces native fabric capabilities. Unlike the simple switching topologies of CXL 2.0, CXL 3.0 supports multilevel switching, meaning traffic can traverse multiple switching tiers without being routed back through a central host. This allows for the creation of complex, non-blocking fabric topologies (such as spine-leaf architectures) that connect hundreds or thousands of devices. This capability transforms the data center from a collection of servers into a unified, composable computer, where resources across an entire rack or row can be addressed as if they were local.
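
To see how multilevel switching reaches these device counts, consider the generic sizing rule for a two-tier, non-blocking spine-leaf fabric. This is standard topology arithmetic rather than anything taken from the CXL specification, and the switch radix used below is a hypothetical value.

```python
def two_tier_nonblocking(radix):
    """Size a two-tier spine-leaf fabric built from identical switches.

    Each leaf splits its ports evenly: half face endpoints, half face spines,
    so uplink capacity matches downlink capacity (non-blocking).
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number of at least 2")
    down_ports = radix // 2
    return {
        "spines": down_ports,             # one uplink from every leaf to every spine
        "leaves": radix,                  # each spine port connects one leaf
        "endpoints": radix * down_ports,  # leaves x endpoint-facing ports
    }


print(two_tier_nonblocking(32))
# prints {'spines': 16, 'leaves': 32, 'endpoints': 512}
```

Adding a third switching tier multiplies the endpoint count again, which is how a fabric grows from hundreds of devices toward thousands.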

Enhanced Coherency and Peer-to-Peer Models

A critical advancement in CXL 3.0 is the support for enhanced coherency and peer-to-peer (P2P) data transfer. In CXL 2.0, communication largely flowed between the host and the device. CXL 3.0 allows multiple devices (such as two GPUs or a GPU and a memory module) to communicate directly with one another without routing data through a host CPU. This direct path reduces latency and frees up CPU cycles for other tasks.
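
The saving can be pictured as a shorter path through the topology. The sketch below runs a breadth-first search over a tiny hypothetical topology (one host and two GPUs behind one switch) and compares the host-mediated route with the direct one. It counts link traversals only and makes no claim about actual latencies.

```python
from collections import deque


def shortest_path(links, src, dst):
    """Breadth-first search over undirected links; returns the node list or None."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical topology: everything hangs off a single switch.
links = [("host", "switch"), ("gpu0", "switch"), ("gpu1", "switch")]

peer_to_peer = shortest_path(links, "gpu0", "gpu1")
via_host = (shortest_path(links, "gpu0", "host")
            + shortest_path(links, "host", "gpu1")[1:])

print(peer_to_peer)  # prints ['gpu0', 'switch', 'gpu1']
print(via_host)      # prints ['gpu0', 'switch', 'host', 'switch', 'gpu1']
```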

Embracing Incremental Adoption

The leap from CXL 2.0 to CXL 3.0 is more than a speed upgrade; it is a shift toward a composable and increasingly resilient fabric-centric data center. For enterprises, the goal is not just to wait for the next standard but to build a strategy that accommodates this evolution.

CXL 3.0 offers the bandwidth and coherency required for the next generation of AI, but the operational principles can be adopted today. By leveraging CXL 2.0 solutions that enable memory expansion, pooling, and persistence, technology leaders can validate their software ecosystems and infrastructure designs in advance.

Prepare for CXL 3.0 Today

The transition to CXL 3.0 will not happen overnight. However, enterprise architects do not need to wait for full CXL 3.0 hardware availability to begin preparing their infrastructure. Solutions available today can help organizations prototype CXL 3.0 architectures using CXL 2.0 technology.

Penguin Solutions, through its SMART Modular CXL product line, offers solutions that bridge this gap. SMART’s CXL Type 3 (CXL.mem) products decouple memory capacity from the host CPU’s native channels and can be used both for in-server memory expansion and as a key modular component of a disaggregated CXL pooling appliance. Although predominantly CXL 2.0 based, these products provide a number of CXL 3.0 features, specifically around reliability, availability, and serviceability (RAS) management. This enables architectures that provide access to some CXL 3.0 features even when deployed on CXL 2.0 links.

Refining Persistent Memory Stacks

As CXL memory architectures are increasingly adopted, it becomes critical to provide memory-class devices that keep pace with memory speeds while offering options to store data persistently. SMART’s NV-CMM-E3S EDSFF module integrates high-performance DRAM with persistent flash memory and a built-in backup power source. During normal operation, it functions as standard CXL memory. On power loss, it automatically moves data to flash, preserving the state.
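
The behavior described above can be modeled as a small state machine, which is handy for exercising failure-handling logic in software tests. The `BackedUpMemory` class is a hypothetical illustration and not the module's firmware or API; in particular, restoring the contents from flash when power returns is an assumption made for this sketch.

```python
class BackedUpMemory:
    """Toy model of DRAM that is saved to flash when power is lost."""

    def __init__(self):
        self.dram = {}       # volatile contents, address -> value
        self.flash = {}      # persistent backup image
        self.powered = True

    def write(self, address, value):
        if not self.powered:
            raise RuntimeError("device is powered off")
        self.dram[address] = value

    def read(self, address):
        if not self.powered:
            raise RuntimeError("device is powered off")
        return self.dram[address]

    def power_loss(self):
        # The backup power source keeps the device alive long enough to save.
        self.flash = dict(self.dram)
        self.dram.clear()    # DRAM contents are gone once power drops
        self.powered = False

    def power_restore(self):
        # Assumed behavior: the saved image is copied back before use.
        self.dram = dict(self.flash)
        self.powered = True


memory = BackedUpMemory()
memory.write(0x1000, "checkpoint-42")
memory.power_loss()
memory.power_restore()
print(memory.read(0x1000))  # prints checkpoint-42
```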

This technology allows OEMs and hyperscalers to implement persistent-memory use cases such as checkpointing, write-back caching, and fast recovery today. These are the exact features that will define reliable memory in both CXL 2.0 and CXL 3.0 infrastructures. By refining software stacks and failure-handling logic on NV-CMM-E3S now, organizations can accelerate their eventual deployment of fully composable, resilient CXL memory capabilities.

The Role of SMART CXL Products

The SMART CXA-4F1W and CXA-8F2W are CXL 2.0 add-in cards (AICs) that provide cache-coherent memory expansion in the familiar PCIe add-in-card form factor used by many server peripheral vendors, making them easy to adopt. By deploying these cards behind a PCIe/CXL switch today, system architects can validate kernel drivers, provisioning flows, and monitoring policies that will extend naturally to multi-hop, pooled CXL 3.0 fabrics.

This allows organizations to:

  • Qualify CXL.mem in production environments: Test telemetry and capacity orchestration APIs now.
  • Prototype pooling topologies: Gain operational expertise in managing disaggregated memory before deploying complex CXL 3.0 switches.
  • De-risk future migrations: Using hardware included on the CXL Consortium Integrator List helps CTOs ensure interoperability and a smoother transition to future standards.
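
As one example of the kind of check that can be built during qualification, the sketch below looks for NUMA nodes that have memory but no CPUs. It assumes a Linux host where the CXL memory has been onlined as system RAM, in which case it typically appears as such a node. The heuristic is not specific to CXL, since other memory types can also show up as CPU-less nodes, and the function name is hypothetical.

```python
import re
from pathlib import Path


def cpuless_nodes(root="/sys/devices/system/node"):
    """Return {node_id: MemTotal in kB} for NUMA nodes with memory but no CPUs."""
    result = {}
    for node_dir in sorted(Path(root).glob("node[0-9]*")):
        # An empty cpulist means no CPU is local to this node.
        cpulist = (node_dir / "cpulist").read_text().strip()
        meminfo = (node_dir / "meminfo").read_text()
        match = re.search(r"MemTotal:\s+(\d+)\s+kB", meminfo)
        if not cpulist and match and int(match.group(1)) > 0:
            result[int(node_dir.name[len("node"):])] = int(match.group(1))
    return result
```

Calling `cpuless_nodes()` with no arguments reads the live sysfs tree; the `root` parameter exists so the same logic can be pointed at a captured or synthetic directory in tests.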

In addition, the SMART CMM-E3S CXL EDSFF modules provide a similar capability for emerging modular servers that require front-panel, plug-and-play options for CXL.

For organizations already deploying AI inference at scale, Penguin Solutions' MemoryAI™ KV Cache Server demonstrates what this architecture looks like in production. The server offloads the KV cache from GPU memory to a dedicated high-capacity CXL-based appliance, reducing time to first token, relieving GPU memory bottlenecks, and enabling up to 11 TB of pooled memory accessible across the cluster. It is the industry's first production-ready KV cache server of its kind.
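
The general idea of KV cache offload can be sketched as a two-tier cache. This is a generic illustration and not a description of how the MemoryAI server is implemented: the class name, the least-recently-used policy, and the capacity are all assumptions. Blocks evicted from the small fast tier are spilled to a larger tier instead of being discarded, so a later request can reuse them rather than recompute them.

```python
from collections import OrderedDict


class TieredKVCache:
    """Small fast tier (standing in for GPU memory) backed by a larger tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # least recently used entry first
        self.large = {}            # stands in for the offload tier

    def put(self, key, block):
        self.large.pop(key, None)
        self.fast[key] = block
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_block = self.fast.popitem(last=False)
            self.large[old_key] = old_block  # spill instead of discarding

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.large:
            block = self.large.pop(key)
            self.put(key, block)             # promote on reuse
            return block
        return None                          # miss: the block must be recomputed


cache = TieredKVCache(fast_capacity=2)
for prompt in ("a", "b", "c"):
    cache.put(prompt, f"kv-{prompt}")

print(sorted(cache.large))  # prints ['a']: the oldest block was spilled
print(cache.get("a"))       # prints kv-a: served from the larger tier
```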

Penguin Solutions provides the scalable, composable infrastructure and specialized expertise needed to navigate this CXL transition. By aligning today’s investments with tomorrow’s architectural capabilities, enterprises can ensure their infrastructure remains a competitive advantage in an AI-driven future. Contact us today to learn more.
