When Physics Computes
What If Physics Did the Math?
This is the question at the heart of Qernel's work. Rather than representing data as digital voltage levels and performing arithmetic through transistor switching, our charge domain processing (cDP) architecture encodes weights directly as electrical charge and activations as pulse durations in the time domain. When a pulse is applied to a charge domain cell, the resulting charge transfer is physically proportional to the product of weight and activation. Multiplication happens through physics, not logic gates.
Charge Adds Itself
When thousands of these cells contribute to a shared accumulation node, their outputs naturally sum through charge conservation. A full multiply-accumulate operation emerges from charge transfer alone: no digital multipliers, no adder trees, no intermediate registers shuttling data back and forth.
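To make the idea concrete, here is a minimal numerical sketch of a charge domain multiply-accumulate. The cell model, units, and constants are illustrative assumptions, not a description of Qernel's actual circuits:

```python
# Minimal numerical sketch of a charge domain multiply-accumulate.
# The cell model, units, and constants here are illustrative assumptions,
# not a description of Qernel's actual circuits.
import numpy as np

def charge_domain_mac(weights, activations, i_unit=1e-6, t_unit=1e-9):
    """Each cell transfers charge Q_i = (w_i * i_unit) * (a_i * t_unit):
    a weight-dependent current integrated over an activation-dependent pulse.
    Charge conservation on the shared node then sums the contributions."""
    currents = np.asarray(weights) * i_unit          # weight encoded as a transfer current
    pulse_widths = np.asarray(activations) * t_unit  # activation encoded as a pulse duration
    cell_charges = currents * pulse_widths           # multiplication happens in the physics
    return cell_charges.sum()                        # accumulation via charge conservation

w = np.array([3.0, -1.0, 2.0, 4.0])   # multi-level weights stored one per cell
a = np.array([1.0, 2.0, 0.0, 3.0])    # activations expressed as pulse durations
q = charge_domain_mac(w, a)
assert np.isclose(q / (1e-6 * 1e-9), np.dot(w, a))  # accumulated charge is proportional to the dot product
print(f"{q:.2e} C")  # 1.30e-14 coulombs for these toy values
```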
One Cell. Full Weight.
This isn't compute-in-memory in the conventional sense. Existing CIM approaches still rely on binary SRAM cells, requiring multiple cells and significant peripheral circuitry to represent multi-bit values. Qernel's gain cell architecture stores multi-level weights natively in a single compact structure, achieving 5x better density and 10x lower power than conventional SRAM before you even get to the compute advantages.
25X FLOPs per Watt vs. leading GPUs
20X FLOPs per Dollar compute cost reduction
100X Cache Bandwidth/W vs. SRAM baselines
Inference Is Outpacing Training and Outpacing Power Grids
Here's a number that should get your attention: Cloudflare reported a 4,000% year-over-year increase in Workers AI inference requests in early 2025. Akamai launched an entirely new Inference Cloud product line. Gcore expanded serverless AI across 180+ points of presence. Meanwhile, centralized GPU datacenters are spending $3–7 trillion on AI infrastructure buildout through 2030, with power demand projected to hit 219 gigawatts by the end of the decade. Inference has officially surpassed training as the dominant AI workload by revenue, and everyone from hyperscale datacenter operators to distributed edge providers is scrambling to serve it profitably.
But this surge is colliding with a hard physical wall at every scale. Nvidia GPU power consumption has escalated from 400W per accelerator (A100) to 700W (H100) to over 1,000W (B200), with the Rubin R200 expected to approach 1,800W. For edge operators, that's more power than an entire server rack. For central datacenters, it means hundreds of megawatts consumed just generating tokens, turning electricity bills into the single largest line item and making the economics of token delivery razor-thin or outright unprofitable at current GPU efficiency levels.
The core tension: AI inference needs to be everywhere, inside hyperscale token factories, across CDN edge nodes, embedded in telco infrastructure, and deployed at sovereign locations. But the hardware powering it burns through electricity faster than the economics can justify. At central datacenters, GPU-powered inference at scale can produce net losses for years. At the edge, the hardware simply doesn't fit the power envelope. Something has to give.
The Real Bottleneck Isn't Compute. It's Energy.
The problem manifests differently at different scales, but the root cause is the same. At CDN edge sites operating within 50–500 kW power envelopes, deploying even a modest GPU cluster can push power costs up 5–10x overnight, compressing margins that justify the investment. Cloudflare's gross margins have declined roughly 350 basis points year over year as it scales GPU infrastructure. Akamai has deployed GPUs in just 17 of its 4,400+ locations, not by choice, but by physical constraint.
At central datacenters, the numbers are even more sobering. Model a 0.5 MW datacenter allocation running a GPT-class 120B-parameter model on GPU infrastructure and the math breaks down: 267 GPU nodes costing $134M in hardware, consuming $1.5–3M per year in electricity, and generating 650 tokens/sec per node. At current selling prices of roughly $0.40 per million output tokens, that facility runs at a net loss of $91M over three years. The tokens are flowing, but every joule of energy spent producing them is destroying value.
This isn't a temporary supply chain problem. It's an architectural one. Today's digital AI accelerators, even the most advanced, remain fundamentally constrained by the separation of memory and compute. Weights sit in SRAM or HBM. Arithmetic happens in separate logic blocks. Every multiply-accumulate operation requires fetching data across interconnects, toggling thousands of transistors, and charging large capacitances. In modern GPUs, more than half of total system energy is consumed outside the arithmetic units themselves, just moving data around.
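A back-of-the-envelope budget makes the imbalance visible. The per-event energies below are placeholder assumptions chosen only to illustrate the structure of the problem, not measurements of any specific chip:

```python
# Back-of-the-envelope energy budget for one multiply-accumulate in a
# conventional digital accelerator. The per-event energies are placeholder
# assumptions for illustration, not measurements of any specific chip.
E_MAC_PJ       = 1.0    # the arithmetic itself (pJ)
E_SRAM_READ_PJ = 5.0    # fetching one operand from on-chip SRAM (pJ)
E_DRAM_READ_PJ = 100.0  # fetching one operand from off-chip DRAM/HBM (pJ)

def energy_per_mac(sram_fetches=2, dram_fetches_amortized=0.1):
    """Total energy = arithmetic + operand movement; the amortized DRAM term
    models weights that are only partially reused from on-chip buffers."""
    movement = sram_fetches * E_SRAM_READ_PJ + dram_fetches_amortized * E_DRAM_READ_PJ
    return E_MAC_PJ + movement, movement

total_pj, movement_pj = energy_per_mac()
print(f"data movement share: {movement_pj / total_pj:.0%}")  # ~95% under these assumptions
```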
Shrinking transistors doesn't fix this. Moore's Law improvements are plateauing, and the memory wall is widening faster than process nodes can close it. The industry needs a fundamentally different approach.
Silicon-Validated. Not Slideware.
Claims in AI hardware are cheap. Silicon results are not. Qernel has fabricated and validated test chips demonstrating charge domain compute with measured performance of 10 picojoules per classification (100x more energy efficient than GPU baselines) and 100 nanosecond latency per classification. The measured FP4 MAC linearity shows sub-0.1 LSB equivalent noise across wide operating conditions, with stable behavior across process corners, temperature, and die-to-die variation.
This robustness comes from a self-calibrated gain cell design where each cell inherently compensates for local threshold voltage variations. The result is an architecture that doesn't require complex external calibration loops and is on track for automotive-grade (AEC-Q100) reliability qualification, which is unusual for a novel compute paradigm.
What this means practically: Qernel's Q-Core delivers 400 TOPS in approximately 16 mm² at 2W on a 5nm process. That's an AI tensor core you can embed almost anywhere, inside an edge server, a CDN appliance, an autonomous vehicle, or a defense system, without redesigning your power and cooling infrastructure.
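Taken at face value, those figures reduce to a few simple ratios. This is plain arithmetic on the numbers above, not an additional claim:

```python
# Plain arithmetic on the stated Q-Core figures: 400 TOPS, ~16 mm^2, 2 W.
tops, area_mm2, power_w = 400, 16, 2
print(tops / power_w)                   # 200 TOPS per watt
print(tops / area_mm2)                  # 25 TOPS per mm^2
print(power_w / (tops * 1e12) * 1e15)   # ~5 femtojoules per operation
```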
Token Factories: Where Tokens per Joule Becomes the Only Metric That Matters
For centralized inference providers, the "token factories" serving billions of API calls daily, the economics of inference reduce to a single question: how many tokens can you generate per joule of energy consumed? Every other metric (throughput, latency, cost per token) ultimately traces back to this ratio, because electricity is the dominant and fastest-growing operational cost.
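A rough sketch of that dependence follows; every parameter value below is a hypothetical placeholder, not an operator or Qernel figure:

```python
# How cost per token traces back to tokens per joule. Every parameter value
# below is a hypothetical placeholder, not an operator or Qernel figure.
def energy_cost_per_million_tokens(tokens_per_joule,
                                   electricity_usd_per_kwh=0.10,
                                   overhead_multiplier=1.5):
    """Energy-driven cost floor per 1M tokens; the multiplier stands in for
    cooling and other costs that roughly track energy draw."""
    joules = 1e6 / tokens_per_joule
    kwh = joules / 3.6e6  # 1 kWh = 3.6 MJ
    return kwh * electricity_usd_per_kwh * overhead_multiplier

print(energy_cost_per_million_tokens(0.1))  # ~$0.42 per 1M tokens
print(energy_cost_per_million_tokens(1.0))  # ~$0.04: 10x more tokens/joule cuts the floor 10x
```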
Consider the same 0.5 MW datacenter allocation modeled with Qernel accelerators instead of GPUs. Where a GPU-based deployment requires 267 nodes at $134M to serve demand, a Qernel-based deployment achieves the same token throughput with 27 nodes at $2.7M, delivering 10x higher tokens/sec per node while consuming a fraction of the power. The impact cascades across the entire cost structure: 80% lower compute capex, 60% lower cooling costs, and 90% lower electricity operating expenses.
10X Token Throughput per node vs. 8x H100
90% Lower Electricity OpEx (same throughput)
The business impact is stark. Over a three-year horizon with identical token demand (11 billion tokens/day on a 120B parameter model), the GPU-based token factory projects a net loss of $91M. The Qernel-based facility projects $40M in profit. Same tokens delivered, same revenue collected; radically different economics, driven entirely by tokens per joule.
This is not an edge-only story. Central datacenter operators, inference-as-a-service providers, and any organization building a token factory are facing the same fundamental math: GPU-powered inference at scale is an energy economics problem masquerading as a compute problem. Solving the energy equation doesn't just improve margins; it determines whether the token business is viable at all.
From Core to Cloud: A Scalable Architecture
A single efficient tensor core is a breakthrough. A scalable system architecture is a product. Qernel's design is built on a modular hierarchy: individual Q-Cores compose into Q-Clusters, which aggregate into full SoC configurations. This allows the architecture to scale along two independent axes, latency per kernel (by clustering cores) and throughput (by adding clusters), while maintaining a consistent programming model.
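As an illustration of those two axes, here is a toy model of the hierarchy; the class names and the ideal-scaling assumptions are ours, not a published Qernel specification:

```python
# Toy model of the two scaling axes. Class names and ideal-scaling assumptions
# are illustrative, not a published Qernel specification.
from dataclasses import dataclass, field

@dataclass
class QCore:
    tops: float = 400.0
    watts: float = 2.0

@dataclass
class QCluster:
    cores: int  # more cores working on one kernel -> lower latency per kernel

@dataclass
class SoC:
    clusters: int  # more clusters -> more kernels in flight -> higher throughput
    cluster: QCluster
    core: QCore = field(default_factory=QCore)

    def kernel_latency_us(self, single_core_latency_us: float) -> float:
        # Axis 1: latency per kernel, assuming a near-ideal split across a cluster.
        return single_core_latency_us / self.cluster.cores

    def peak_tops(self) -> float:
        # Axis 2: aggregate throughput grows with the number of clusters.
        return self.clusters * self.cluster.cores * self.core.tops

soc = SoC(clusters=4, cluster=QCluster(cores=8))
print(soc.kernel_latency_us(80.0))  # 10.0 us per kernel with 8 cores on it
print(soc.peak_tops())              # 12800.0 TOPS across 4 clusters of 8 cores
```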
Q1 2026 | Q-Core | Tensor Core IP & SoC Samples
The foundational charge domain tensor core.
Q4 2026 | Flux AI | xPU Adapter Card
A drop-in PCIe accelerator for existing datacenter and edge infrastructure. Plug-in efficiency without rearchitecting your stack.
Q1 2027 | Qbe | Q-Core Stacked with 3D DRAM
Vertically integrated compute and memory, eliminating the off-chip bandwidth bottleneck for LLM inference (see the bandwidth sketch after this roadmap).
Q3 2027 | Orbit | AI xPU Server Box
A complete inference server platform, built for operators who need maximum throughput per watt at rack scale.
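The bandwidth sketch referenced in the Qbe entry: at small batch sizes, autoregressive decoding has to stream roughly the full weight set once per generated token, so token rate is capped by memory bandwidth rather than raw TOPS. The figures below are illustrative placeholders, not Qbe specifications.

```python
# Why stacking DRAM on the compute die matters for LLM inference: at small batch
# sizes, decoding streams roughly the full weight set once per generated token,
# so the achievable token rate is capped by memory bandwidth, not TOPS.
# The figures below are illustrative placeholders, not Qbe specifications.
def decode_tokens_per_sec_ceiling(params_billion, bytes_per_param, bandwidth_gb_s):
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# A 120B-parameter model with 4-bit weights (0.5 bytes per parameter):
print(decode_tokens_per_sec_ceiling(120, 0.5, bandwidth_gb_s=3_000))   # 50 tok/s ceiling
print(decode_tokens_per_sec_ceiling(120, 0.5, bandwidth_gb_s=12_000))  # 200 tok/s ceiling
```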
Why This Matters Now
The edge AI inference market is projected to grow from $54 billion in 2024 to $157 billion by 2030. Central datacenter AI infrastructure spend is measured in trillions through the decade. Data sovereignty regulations across 65+ countries are mandating local inference processing. Whether you're operating a hyperscale token factory trying to make the unit economics work, or a CDN operator trying to fit inference into a 200 kW edge site, you face the same binding constraint: the energy cost of generating each token with today's hardware.
Qernel doesn't try to incrementally improve the digital paradigm. We've replaced it with one where the laws of physics do the heavy lifting, where charge and time perform the arithmetic that transistors used to struggle with. The result is an inference platform that delivers dramatically more tokens per joule, whether deployed at rack scale in a central facility or embedded in a power-constrained edge node, without the energy bill that currently makes large-scale inference a losing proposition.
The beginning of sustainable inference starts here. Reach out to info@qernel.ai for the detailed white paper.