Maximizing the potential of disaggregated pipelines
The inference crunch is here.
GPU prices are rising, and agentic tools like Codex, Hermes, Claude Code, and OpenClaw are exploding in popularity. What were once single-model calls have become long multi-step operations, with each step carrying its own inference cost.
Efficiency in agentic chains has never been more important. Rather than monolithic multi-modal models handling each individual step, smaller models can handle a subset of those tasks for a fraction of the price at significantly lower latency. Disaggregated pipelines are improving that efficiency even further.
Disaggregated pipelines that maximize the performance and value of GPUs and optimized accelerators offer an immediate answer to the shortage of available inference compute.
And it isn’t just about splitting the prefill and decode steps onto different types of hardware. Optimized inference accelerators can dramatically speed up the decode phase by taking over tasks like speculative decoding.
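To make the split concrete, here is a minimal sketch of phase-aware routing in a disaggregated pipeline: compute-heavy prefill requests go to one worker pool and memory-bound decode steps to another. The `Request` shape, pool names, and round-robin policy are all illustrative assumptions, not a real scheduler API.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    phase: str  # "prefill" or "decode"


# Hypothetical worker pools: prefill is compute-bound, so it lands on
# large GPUs; decode is memory-bandwidth-bound, so it lands on
# inference-optimized accelerators. All names are invented for the sketch.
POOLS = {
    "prefill": ["gpu-0", "gpu-1"],
    "decode": ["accel-0", "accel-1", "accel-2"],
}

_counters = {phase: itertools.count() for phase in POOLS}


def route(req: Request) -> str:
    """Round-robin the request to the worker pool matching its phase."""
    pool = POOLS[req.phase]
    return pool[next(_counters[req.phase]) % len(pool)]
```

A real scheduler would also weigh queue depth, KV-cache placement, and batch composition; round-robin is the simplest policy that shows the phase split.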
These agentic tools are only going to become more popular—and empowering them to scale gracefully with efficient pipelines while maintaining performance will be critical to the future of AI.
How speculative decoding in disaggregated pipelines supercharges AI inference
Speculative decoding—much like disaggregated pipelines in general—is not a new concept.
It has also emerged as an immediate way to both tap the value of inference-optimized accelerators and maximize the potential of GPUs already in place. Accelerators running smaller draft models can propose tokens at extremely low latency to GPU verifiers running larger models.
Speculative decoding still has massive room to grow. Smaller draft models are constantly improving, and as their accuracy rises, the verifier accepts more of their drafted tokens per round, which means even faster results with significantly lower compute overhead.
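The draft-and-verify loop can be sketched in a few lines. This is a toy, assuming greedy decoding and stand-in "models" that are just simple next-token functions over integer token IDs; in a real deployment the drafter would run on an accelerator and the verifier on a GPU.

```python
def draft_model(prefix):
    # Toy drafter: always predicts last token + 1 (mod 50).
    return (prefix[-1] + 1) % 50


def target_model(prefix):
    # Toy "large" verifier: same rule, except it emits 0 after token 9,
    # so the two models occasionally disagree.
    return 0 if prefix[-1] == 9 else (prefix[-1] + 1) % 50


def speculative_decode(prompt, n_tokens, k=4):
    """Generate n_tokens greedily: the drafter proposes k tokens per
    round; the verifier accepts the longest matching prefix and supplies
    one corrected token at the first mismatch."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft k tokens cheaply with the small model.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify each drafted position against the target model.
        ctx = list(out)
        for t in draft:
            expected = target_model(ctx)
            if expected == t:
                out.append(t)
                ctx.append(t)
            else:
                out.append(expected)  # target's correction ends the round
                break
    return out[len(prompt):len(prompt) + n_tokens]
```

When the drafter matches the target, each verification round yields several tokens instead of one; the output is identical to decoding with the target model alone, which is the whole point of the technique.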
Disaggregated pipelines now make it possible to capture everything smaller models offer while still generating the kind of quality responses enterprises need.
GigaIO + d-Matrix: accelerating AI inference even further
We announced this month that we've acquired GigaIO's data center business to bring in even more expertise in rack-scale infrastructure and high-performance interconnects.
Figuring out how to measure agentic success
Agentic networks have already proven their value at a high level. The next challenge is measuring their success at a much more granular one.
That’s critical because every enterprise’s needs are different and may align with different goals. Agentic networks will inevitably have to prove their ROI, and evaluating how each one contributes becomes more important as they gain widespread adoption.
Each step in an agentic network may also have its own definition of success within the context of the whole pipeline. A voice pipeline only succeeds if the response on the other end of the line arrives as quickly as possible. But you could also measure the text generation model by how quickly it produces sentences and hands them to a text-to-speech model, rather than by the time to complete the full response.
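One way to capture both views is to log per-stage timing events and derive metrics from them. The sketch below assumes hypothetical stage names ("text_gen", "tts") and timestamps; real pipelines would emit these from tracing instrumentation.

```python
from dataclasses import dataclass


@dataclass
class StageEvent:
    stage: str    # e.g. "text_gen" or "tts" (illustrative names)
    start: float  # seconds since the request arrived
    end: float


def per_stage_latency(events):
    """Granular view: total wall-clock time spent in each stage."""
    totals = {}
    for e in events:
        totals[e.stage] = totals.get(e.stage, 0.0) + (e.end - e.start)
    return totals


def time_to_first_audio(events):
    """Holistic view: when does the first speech chunk finish, i.e.
    when does the caller start hearing a response?"""
    tts_ends = [e.end for e in events if e.stage == "tts"]
    return min(tts_ends) if tts_ends else None
```

With streamed, sentence-by-sentence generation, time-to-first-audio can be far lower than total pipeline time, which is exactly the gap a per-step evaluation surfaces.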
A one-size-fits-all evaluation harness likely won’t cut it for larger enterprises as AI applications explode in popularity. The next step is meeting in the middle between holistic product goals and per-step agentic performance.

