Memory-centric processing is transformative for LLM inference because it addresses two critical challenges with precision.

KV Cache Bottleneck

In Transformer-based LLMs, the Key-Value (KV) cache grows linearly with sequence length, as each new token adds to the context. This leads to escalating memory use, especially in reasoning models that retain KV data across multiple steps.

For a large model such as DeepSeek-70B, the KV cache can reach roughly 1 MB per token per request, so a 2K-token context consumes on the order of 2 GB. In hyperscaler setups with thousands of users, this quickly overwhelms memory bandwidth, becoming a major bottleneck for real-time inference.
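To make this arithmetic concrete, the short Python sketch below estimates per-request KV-cache size and the aggregate traffic generated when that cache is re-read on every decode step. The ~1 MB-per-token figure comes from the text above; the context length, user count, and per-user token rate are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope KV-cache sizing for a large (70B-class) model.
# Assumes ~1 MB of KV cache per token per request, as cited above; all other
# numbers are illustrative assumptions, not measured values.

KV_BYTES_PER_TOKEN = 1 * 1024**2  # ~1 MB per token per request (from the text)

def kv_cache_bytes(context_tokens: int) -> int:
    """KV-cache footprint of a single request with the given context length."""
    return context_tokens * KV_BYTES_PER_TOKEN

def fleet_kv_bytes(context_tokens: int, concurrent_users: int) -> int:
    """Aggregate KV cache held in memory across all concurrent requests."""
    return kv_cache_bytes(context_tokens) * concurrent_users

def decode_read_bandwidth(context_tokens: int, concurrent_users: int,
                          tokens_per_s_per_user: float) -> float:
    """Rough bandwidth needed if the full KV cache is re-read for every
    generated token during decode (ignores weight traffic and reuse)."""
    return fleet_kv_bytes(context_tokens, concurrent_users) * tokens_per_s_per_user

if __name__ == "__main__":
    GB, TB = 1024**3, 1024**4
    ctx, users, tok_rate = 2_000, 1_000, 20.0  # illustrative assumptions
    print(f"Per-request KV cache : {kv_cache_bytes(ctx) / GB:.1f} GB")
    print(f"Fleet KV cache       : {fleet_kv_bytes(ctx, users) / TB:.1f} TB")
    bw = decode_read_bandwidth(ctx, users, tok_rate)
    print(f"KV read bandwidth    : {bw / TB:.0f} TB/s (order of magnitude)")
```

Even under these rough assumptions, a thousand concurrent 2K-token sessions generate tens of terabytes per second of KV-cache reads alone, far more than a single GPU's HBM stack can supply.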

Scaling Workflows

Concurrency Requirements: Hyperscale deployments must handle hundreds or even thousands of users running LLM inference at once.

Latency Sensitivity: Real-time, agentic workflows demand ultra-low latency to maintain seamless interactions.

Throughput Demands: High token throughput is critical to serve large volumes of concurrent users efficiently.

The Consequences

Compute Throughput Underutilization

The memory wall forces GPUs to operate at reduced efficiency, with compute cores idling while they wait for data to move between memory and processing units. Nvidia GPUs, for example, often run at only around 60% of capacity because of memory bottlenecks.

Poor Energy Efficiency

Memory-bound workloads consume significant energy due to frequent data movement between the processor and off-chip memory. In large-scale deployments, cooling systems are sized for peak power requirements rather than average power consumption, leading to substantial waste.

Understanding the Highways

In the hyperconnected metropolis of AI, data flows like traffic, with every piece of information acting as a vehicle navigating the intricate network of computing systems. Traditional chip architectures resemble a sprawling city where densely packed populations reside in towering skyscrapers—analogous to weights stored in 3D memory—and must travel long distances to distant factories, representing compute cores, for processing. This constant movement along the "highways" between memory and compute cores creates significant congestion, wasting energy and causing delays.

As the demands of our digital world surge exponentially, these traditional "highways" become increasingly clogged, unable to keep up with the sheer volume of data that needs to be transferred from memory to compute cores for processing. This bottleneck, famously termed the "Memory Wall," leaves the compute cores starved for data, leading to an underutilized processing system.


What if, instead of forcing data to travel these congested "highways", we could bring the computation directly to the data itself?
This is the vision of Qernel AI and its revolutionary approach, which unites 3D stacking with a charge-domain compute-in-memory (CIM) architecture.


The Memory Wall


In the world of digital cities—where massive computational workloads are routine—GPUs have long been the standard for LLM inference. Industry leaders like Nvidia’s H100 and B200 dominate this space with their immense processing capabilities. Yet, a fundamental limitation persists: memory bandwidth.

Despite their exceptional compute power, GPUs deliver far less memory bandwidth (bytes per second) than compute throughput (operations per second): the ratio of peak compute to peak bandwidth can exceed 500 operations per byte. This imbalance creates a severe bottleneck, particularly for long context windows in hyperscaler operations that demand extreme concurrency. As a result, GPUs often hit the so-called “memory wall,” where insufficient memory bandwidth prevents them from fully utilizing their computational potential.
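A roofline-style back-of-the-envelope check makes this concrete: a workload is memory-bound whenever its arithmetic intensity (operations performed per byte moved) falls below the hardware's compute-to-bandwidth ratio. The sketch below uses approximate H100-class placeholder numbers, not vendor specifications, and a rough order-of-magnitude intensity for low-batch decode.

```python
# Roofline-style check: a workload is memory-bound when its arithmetic
# intensity (ops per byte moved) is below the machine balance
# (peak ops/s divided by peak bytes/s). Hardware numbers are approximate
# H100-class placeholders, not specifications.

PEAK_COMPUTE_OPS = 2.0e15     # ~2e15 low-precision ops/s (approximate)
PEAK_BANDWIDTH_BPS = 3.35e12  # ~3.35 TB/s of HBM bandwidth (approximate)

MACHINE_BALANCE = PEAK_COMPUTE_OPS / PEAK_BANDWIDTH_BPS  # ops the GPU could do per byte

def is_memory_bound(arithmetic_intensity: float) -> bool:
    """True when the workload cannot feed the compute units fast enough."""
    return arithmetic_intensity < MACHINE_BALANCE

if __name__ == "__main__":
    decode_intensity = 1.0  # rough order of magnitude for low-batch LLM decode
    print(f"Machine balance : {MACHINE_BALANCE:.0f} ops/byte")
    print(f"Decode at ~{decode_intensity:.0f} op/byte -> memory-bound: "
          f"{is_memory_bound(decode_intensity)} "
          f"(gap of ~{MACHINE_BALANCE / decode_intensity:.0f}x)")
```

Batching raises arithmetic intensity, but under long contexts and tight latency budgets it rarely closes a gap of several hundred times, which is exactly the regime the memory wall describes.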


Why is Memory-Centric Processing Essential?

Memory-centric processing is crucial for overcoming these challenges. Such architectures provide massively parallel memory access and bring the compute-throughput-to-memory-bandwidth ratio to 1:1. They address the core issues of LLM execution by:

  • Eliminating the KV Cache Bottleneck: Performs computations directly within the memory array

  • Enabling Scalable Agentic Workflows: High memory bandwidth (1 PB/s) and capacity (1 GB per chip); a rough sizing sketch follows this list

  • Efficient Throughput Utilization: Arithmetic intensity of one and horizontally scalable architecture

  • Energy Efficiency of a Human Brain: Operating at 100 TOPS/W, roughly 1/1000th the power of a GPU
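Taking the per-chip figures in the list above at face value, the sketch below gives a hypothetical sizing: how many 1 GB chips are needed to hold a model plus its resident KV caches, and what aggregate bandwidth that array provides. The model and KV-cache sizes are illustrative assumptions, not Qernel deployment numbers.

```python
# Hypothetical sizing using the per-chip figures quoted above:
# 1 GB of capacity and 1 PB/s of bandwidth per memory-centric chip.
# Model and KV-cache totals below are illustrative assumptions.
import math

CHIP_CAPACITY_GB = 1.0
CHIP_BANDWIDTH_PBS = 1.0

def chips_needed(model_gb: float, kv_cache_gb: float) -> int:
    """Chips required just to hold the weights plus resident KV caches."""
    return math.ceil((model_gb + kv_cache_gb) / CHIP_CAPACITY_GB)

if __name__ == "__main__":
    model_gb = 70.0      # e.g., a 70B-parameter model at ~1 byte per weight (assumption)
    kv_cache_gb = 200.0  # aggregate KV cache for a pool of concurrent sessions (assumption)
    n = chips_needed(model_gb, kv_cache_gb)
    print(f"Chips needed        : {n}")
    print(f"Aggregate bandwidth : {n * CHIP_BANDWIDTH_PBS:.0f} PB/s across the array")
```

Because every chip added for capacity also adds bandwidth, the compute-to-bandwidth ratio stays near 1:1 as the deployment grows, which is the horizontal-scaling property the list refers to.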


Qernel's Q1 Performance: 100x Higher Throughput

While some competitors have explored near-memory or in-memory computing architectures (e.g., Groq and Cerebras), these solutions often involve a trade-off. To increase bandwidth, they frequently sacrifice memory capacity due to memory density limitations in SRAM-based logic chips. This compromise limits their ability to support large-scale deployments with many concurrent users.


Qernel's 3D Charge-domain CIM Architecture
A Paradigm Shift




Qernel is pushing the boundaries of computing with its novel approach of integrating computation directly into memory. Its product differentiation highlights three key innovations designed to deliver significant performance and efficiency gains:

Redefining Memory Processing

At the foundation of the stack are Qernel's computational DRAM (cDRAM) cells, which perform computations directly within the memory array using charge-domain processing. Because weights and KV-cache data never have to travel across the congested "highways" between memory and compute cores, the data movement that creates the memory wall is removed at the source.

Heterogeneous Integration for Performance

The Q1 Chip uses a 3D computational memory stack that combines heterogeneous logic and memory. Its analog CIM dies are co-packaged with a logic controller, enabling programmability and efficient vector math.

This 3D architecture delivers 1 PB/s of bandwidth and 2 POPs of compute at only 20 W, achieving 100 TOPS/W for INT8 workloads.

Additionally, the 3D packaging method cuts costs significantly by avoiding the 3–4× expense of traditional interposer-based GPU–HBM setups.

Modular Multichip System with FPGA Router Die: Scaling Compute and Capacity

The Q16 PCIe Card takes Qernel's technology to the next level by connecting multiple Q1 chip cubes via a high-speed router SerDes integrated into an FPGA die. This modular design enables a substantial 16 GB of capacity and 32 POPs of compute per card. This high level of integration and compute power makes the Q16 ideal for fine-grained multi-agent AI applications, such as retrievers, planners, and reasoners, opening new possibilities in complex AI systems.
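As a quick consistency check on the quoted figures, the sketch below aggregates the Q1 numbers into per-card totals. Treating the Q16 as sixteen Q1 cubes is an inference from the 16 GB and 32 POPs figures, and the per-card power shown is a naive multiple of the Q1's 20 W that ignores the FPGA router die and other overheads.

```python
# Aggregate the quoted Q1 figures into Q16-card totals. Assumes the Q16 carries
# sixteen Q1 cubes (inferred from 16 GB and 32 POPs) and ignores the power and
# latency contribution of the FPGA router die (a simplification, not a spec).
from dataclasses import dataclass

@dataclass
class ChipSpec:
    capacity_gb: float
    bandwidth_pbs: float  # PB/s
    compute_pops: float   # peta-ops/s
    power_w: float

    @property
    def tops_per_watt(self) -> float:
        return self.compute_pops * 1000 / self.power_w  # 1 POPs = 1000 TOPS

Q1 = ChipSpec(capacity_gb=1, bandwidth_pbs=1, compute_pops=2, power_w=20)

def scale(chip: ChipSpec, count: int) -> ChipSpec:
    """Naive linear aggregation across identical chips."""
    return ChipSpec(chip.capacity_gb * count, chip.bandwidth_pbs * count,
                    chip.compute_pops * count, chip.power_w * count)

if __name__ == "__main__":
    print(f"Q1 efficiency : {Q1.tops_per_watt:.0f} TOPS/W")  # matches the quoted 100 TOPS/W
    q16 = scale(Q1, 16)
    print(f"Q16 card      : {q16.capacity_gb:.0f} GB, {q16.compute_pops:.0f} POPs, "
          f"~{q16.power_w:.0f} W before router/FPGA overhead")
```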

The Future of AI Inference

Qernel's groundbreaking work in computational memory promises to revolutionize various computing domains by offering significant improvements in performance, energy efficiency, and cost-effectiveness. Their three-tiered product strategy, from innovative cDRAM cells to modular multichip systems, demonstrates a comprehensive vision for the future of high-performance computing.

For the Technical White Paper with more details on Qernel's 3D Charge-domain CIM computing, please reach out to us at founders@qernel.ai


