Understanding the Highways
In the hyperconnected metropolis of AI, data flows like traffic, with every piece of information acting as a vehicle navigating the intricate network of computing systems. Traditional chip architectures resemble a sprawling city where densely packed populations reside in towering skyscrapers—analogous to weights stored in 3D memory—and must travel long distances to distant factories, representing compute cores, for processing. This constant movement along the "highways" between memory and compute cores creates significant congestion, wasting energy and causing delays.
As the demands of our digital world surge exponentially, these traditional "highways" become increasingly clogged, unable to keep up with the sheer volume of data that needs to be transferred from memory to compute cores for processing. This bottleneck, famously termed the "Memory Wall," leaves the compute cores starved for data, leading to an underutilized processing system.
What if, instead of forcing data to travel these congested "highways", we could bring the computation directly to the data itself?
This is the vision of Qernel AI and its revolutionary approach: pairing 3D stacking with a charge-domain compute-in-memory (CIM) architecture.
In the world of digital cities—where massive computational workloads are routine—GPUs have long been the standard for LLM inference. Industry leaders like Nvidia’s H100 and B200 dominate this space with their immense processing capabilities. Yet, a fundamental limitation persists: memory bandwidth.
Despite their exceptional compute power, GPUs typically offer memory bandwidth (bytes per second) that is more than 500 times lower than their peak computational throughput (operations per second). This imbalance creates a severe bottleneck, particularly for long context windows in hyperscaler deployments that demand extreme concurrency. As a result, GPUs often hit the so-called “memory wall,” where insufficient memory bandwidth prevents them from fully utilizing their computational potential.
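To put that imbalance in perspective, here is a minimal back-of-the-envelope sketch in Python. The throughput and bandwidth figures are illustrative assumptions for an H100-class accelerator rather than vendor-verified specifications:

```python
# Back-of-the-envelope compute-to-bandwidth ratio for a modern datacenter GPU.
# Both spec figures are illustrative assumptions, not vendor-verified numbers.
PEAK_COMPUTE_TOPS = 2_000      # assumed dense FP16 throughput, in tera-ops per second
MEMORY_BANDWIDTH_TBPS = 3.35   # assumed HBM bandwidth, in terabytes per second

# Operations the chip can issue for every byte it can fetch from memory.
ops_per_byte = (PEAK_COMPUTE_TOPS * 1e12) / (MEMORY_BANDWIDTH_TBPS * 1e12)
print(f"Compute-to-bandwidth ratio: ~{ops_per_byte:.0f} ops per byte")

# LLM decoding is memory-bound: each 2-byte weight is read once per token and used for
# roughly two operations (multiply + add), i.e. about 1 op per byte of arithmetic
# intensity. Workloads far below `ops_per_byte` leave the compute units idle: the
# "memory wall" described above.
```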
Qernel's Q1 Performance: 100x Higher Throughput
While some competitors have explored near-memory or in-memory computing architectures (e.g., Groq and Cerebras), these solutions often involve a trade-off. To increase bandwidth, they frequently sacrifice memory capacity due to memory density limitations in SRAM-based logic chips. This compromise limits their ability to support large-scale deployments with multiple concurrent users.
Qernel's 3D Charge-domain CIM Architecture
A Paradigm Shift
Qernel is pushing the boundaries of computing with its novel approach to integrating computation directly into memory. Its product differentiation rests on three key innovations designed to deliver significant performance and efficiency gains:
Redefining Memory Processing
In Transformer-based LLMs, the Key-Value (KV) cache grows linearly with token length, as each new token adds to the context. This leads to escalating memory use—especially in reasoning models that store KV data across multiple steps.
For large models like DeepSeek-70B, the KV cache can reach about 1 MB per token per batch, scaling to roughly 2 GB per batch for a 2K-token sequence. In hyperscaler setups with thousands of users, this quickly overwhelms memory bandwidth, becoming a major bottleneck for real-time inference.
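The arithmetic behind those figures can be sanity-checked with a short sketch. The model dimensions below (layer count, KV heads, head size) are hypothetical placeholders chosen to land near the ~1 MB-per-token figure; they are not DeepSeek-70B's published configuration:

```python
# Rough KV-cache sizing for a decoder-only transformer.
# Model dimensions are illustrative placeholders, not DeepSeek-70B's actual config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, batch=1, bytes_per_elem=2):
    """Bytes of Key + Value activations cached across all layers, heads, and tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * n_tokens * batch

# Hypothetical 70B-class configuration with an FP16 cache.
one_token = kv_cache_bytes(n_layers=64, n_kv_heads=32, head_dim=128, n_tokens=1)
print(f"~{one_token / 2**20:.1f} MiB per token per sequence")     # ~1.0 MiB

# A 2K-token context for a single sequence in the batch.
full_context = kv_cache_bytes(n_layers=64, n_kv_heads=32, head_dim=128, n_tokens=2048)
print(f"~{full_context / 2**30:.1f} GiB for a 2K-token context")  # ~2.0 GiB
```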
Heterogeneous Integration for Performance
The Q1 Chip uses a 3D computational memory stack that combines heterogeneous logic and memory. Its analog CIM dies are co-packaged with a logic controller, enabling programmability and efficient vector math.
This 3D architecture delivers 1 PB/s of bandwidth and 2 POPS of compute at only 20 W, achieving 100 TOPS/W for INT8 workloads.
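As a quick sanity check, the efficiency figure follows directly from the quoted throughput and power budget:

```python
# Derive the quoted Q1 efficiency from its stated throughput and power budget.
Q1_THROUGHPUT_TOPS = 2_000   # 2 POPS of INT8 compute, expressed in tera-ops per second
Q1_POWER_W = 20              # stated power budget

print(f"{Q1_THROUGHPUT_TOPS / Q1_POWER_W:.0f} TOPS/W")   # -> 100 TOPS/W
```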
Additionally, the 3D packaging method cuts costs significantly by avoiding the 3–4× expense of traditional interposer-based GPU–HBM setups.
Modular Multichip System with FPGA Router Die: Scaling Compute and Capacity
The Q16 PCIe Card takes Qernel's technology to the next level by connecting multiple Q1 chip cubes via a high-speed router SerDes integrated into an FPGA die. This modular design enables a substantial 16 GB of capacity and 32 POPS of compute per card. This high level of integration and compute power makes the Q16 ideal for fine-grained multi-agent AI applications, such as retrievers, planners, and reasoners, opening new possibilities in complex AI systems.
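A minimal sketch of the per-card arithmetic, assuming the Q16 carries sixteen Q1 cubes (implied by the product name and the quoted totals rather than stated explicitly above):

```python
# Aggregate the stated per-chip Q1 figures up to a Q16 card.
Q1_COMPUTE_POPS = 2        # per-chip compute, from the Q1 description above
CHIPS_PER_CARD = 16        # assumption: "Q16" = sixteen Q1 cubes per card
CARD_CAPACITY_GB = 16      # stated per-card memory capacity

card_compute_pops = Q1_COMPUTE_POPS * CHIPS_PER_CARD
per_chip_capacity_gb = CARD_CAPACITY_GB / CHIPS_PER_CARD

print(f"{card_compute_pops} POPS per card")            # matches the stated 32 POPS
print(f"{per_chip_capacity_gb:.0f} GB per Q1 cube")    # implied per-chip capacity
```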
The Future of AI Inference
Qernel's groundbreaking work in computational memory promises to revolutionize various computing domains by offering significant improvements in performance, energy efficiency, and cost-effectiveness. Its three-tiered product strategy, from innovative cDRAM cells to modular multichip systems, demonstrates a comprehensive vision for the future of high-performance computing.
For the Technical White Paper with more details on Qernel's 3D charge-domain CIM computing, please reach out to us at founders@qernel.ai.