Introduction
The explosive growth of artificial intelligence has created an insatiable demand for more efficient and more powerful hardware processing. Compute-In-Memory (CIM) has emerged as a promising alternative to conventional von Neumann architectures, performing matrix-vector multiplication directly within memory arrays to minimize data movement. However, not all CIM architectures are created equal. QernelAI’s Qcore/Qcluster IP represents a revolutionary leap forward, leveraging a multi-bit, charge-domain, gain-cell technology to deliver industry-leading performance and efficiency. This whitepaper details the core architectural innovations that give QernelAI qualitative and quantitative merit over competitors such as D-Matrix and Encharge-AI.

Inherent Robustness and Precision by Design
Analog compute is often criticized for its susceptibility to noise and process variations. QernelAI overcomes these challenges through two key innovations: a self-calibrated design for noise immunity and an adaptive pulse modulation scheme for superior dynamic range.
Figure 2: The self-calibration mechanism provides superior noise margin and robustness against temperature and process variations compared to non-calibrated designs. Measurements are obtained across 6-sigma variations including all PVT corners and die-to-die variations.
QernelAI’s Qcore IP is immune to the cell-to-cell and die-to-die threshold voltage (Vt) mismatches that plague other analog designs, as shown in Figure 2. This is achieved through a self-calibration process inherent to the read operation:
1. An input current is fed into the gain cell.
2. This current builds a self-calibrated gate-source voltage (Vgs) on the bitcell’s transistor.
3. This Vgs automatically compensates for any local variations in the transistor’s Vt.
4. During the MAC operation, this calibrated Vgs generates an output current that is a precise mirror of the input, regardless of underlying process mismatches.
This “calibration-free” design ensures high linearity and consistent performance even at elevated temperatures (up to 100°C), a critical requirement for enterprise and automotive applications. It eliminates the need for the complex and costly external calibration circuits required by competing solutions.
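To make the compensation step concrete, below is a minimal numerical sketch using an idealized square-law transistor model. It illustrates the current-mirror principle described above with made-up parameter values; it is not QernelAI’s actual device equations. The calibration phase stores a Vgs that absorbs each cell’s own Vt, so the read-phase output current reproduces the input current regardless of the sampled Vt.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 1e-3      # square-law transconductance parameter (A/V^2), illustrative
i_in = 2e-6   # input current fed into the gain cell (A), illustrative

# Cell-to-cell threshold-voltage mismatch: mean 400 mV, sigma 50 mV.
vt = rng.normal(loc=0.4, scale=0.05, size=8)

# Calibration phase: the input current through the diode-connected
# transistor settles a gate-source voltage that absorbs each cell's Vt.
vgs = vt + np.sqrt(2 * i_in / k)

# MAC (read) phase: the stored Vgs drives the same transistor, so the
# Vt term cancels and the output current precisely mirrors the input.
i_out = 0.5 * k * (vgs - vt) ** 2

print(np.allclose(i_out, i_in))  # True for every cell, despite Vt spread
```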
Dynamic Range Enhancement with Adaptive Precision Control
Adaptive pulse control of the read word line (RWL) provides a larger dynamic range per cell (and therefore a higher SNR), and the pulse timing is controlled entirely in the digital domain by external peripheral circuits.

Figure 3: Competitor architectures with fixed dynamic range per cell suffer from a loss of precision (valuable bits) in the high-probability central range of the activation distribution. QernelAI’s adaptive pulse modulation enables a “zoom-in” on the central range, dedicating higher effective precision where the neural network activations concentrate.
In neural networks, input and weight distributions are typically normal (Gaussian), meaning most values cluster around the center of the distribution. An ideal CIM architecture should offer the highest precision in this high-probability central range.
Voltage-mode CIMs such as Encharge-AI fail here. Their capacitive-coupled nature means each cell receives only a small, fixed fraction of the total dynamic range, providing uniformly low precision with no ability to focus. As shown in Figure 3, this fixed, uniform quantization results in a significant loss of valuable bits in the high-probability central range of the activation distribution.
QernelAI’s current-based design employs an adaptive pulse modulation scheme. By adjusting the integration time (i.e., the pulse width) of the input current, the system can perform a native zoom-in operation. This adaptive control allows QernelAI to dedicate a higher number of effective quantization levels to the narrow central range where most activation values lie, while using fewer levels for outlier values in the tails of the distribution. The result is significantly higher effective precision for real-world AI workloads, as illustrated in Figure 3.
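The precision argument can be sketched numerically. Assuming Gaussian-distributed activations, the toy comparison below contrasts a fixed uniform quantizer with one whose sixteen levels are concentrated where the probability mass lies; the quantile-based level placement stands in for pulse-width-controlled integration and is purely illustrative, not QernelAI’s actual transfer function.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100_000)   # Gaussian-distributed activations

def quantize(x, levels):
    """Map each sample to its nearest quantization level."""
    levels = np.asarray(levels)
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

n = 16                                        # a 4-bit budget either way
uniform = np.linspace(-6.0, 6.0, n)           # fixed full-scale range
zoomed = np.quantile(x, np.linspace(0.01, 0.99, n))  # levels follow the mass

for name, lv in [("uniform", uniform), ("zoomed", zoomed)]:
    mse = np.mean((x - quantize(x, lv)) ** 2)
    print(f"{name:8s} MSE = {mse:.5f}")
# The "zoomed" quantizer yields a much lower error because its levels are
# dense in the high-probability central range, mirroring the text's claim.
```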
Unmatched Efficiency and Native Multi-bit Support
The fundamental advantage of the QernelAI architecture lies in its ability to perform multi-bit input and multi-bit output operations within a single, compact gain cell, which also natively supports advanced low-precision formats such as INT4, FP4, MXFP4, and NVFP4. This stands in stark contrast to competing SRAM-based CIMs, which are inherently single-bit and require significant overhead to approximate multi-bit functionality.
As shown in Figure 1, the QernelAI gain cell is programmed to store multiple, distinct levels of analog current (e.g., eight magnitude levels for E2M1 FP4: {0, 0.5, 1, 1.5, 2, 3, 4, 6}). This allows direct, native execution of FP4 arithmetic without any conversion overhead at the cell level. SRAM-based competitors cannot achieve this: their single-bit nature makes it impossible to store these multi-level values directly, forcing them to use multiple cells and complex digital encoding/decoding logic to approximate FP4, which incurs significant area, power, and latency penalties. Beyond FP4, Qcore also supports signed INT4 [-8…7] and unsigned INT4 [0…15] precision.
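As a software reference for the number format itself (independent of QernelAI’s hardware), the snippet below enumerates the signed E2M1 value set and rounds arbitrary inputs to the nearest representable FP4 value:

```python
import numpy as np

# E2M1 (FP4) magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit.
FP4_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_LEVELS = np.unique(np.concatenate([-FP4_MAGS, FP4_MAGS]))  # 15 values

def to_fp4(x):
    """Round every element to the nearest representable E2M1 value."""
    x = np.asarray(x, dtype=float)
    idx = np.abs(x[..., None] - FP4_LEVELS).argmin(axis=-1)
    return FP4_LEVELS[idx]

print(to_fp4([0.3, 1.2, 2.6, 5.0, -7.0]))  # 0.5, 1.0, 3.0, 4.0, -6.0
```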
As illustrated in Figure 1, a 4-bit Multiply-Accumulate (MAC) operation in QernelAI’s architecture is executed within one self-contained gain cell. Competitors, however, must resort to complex and inefficient workarounds:
D-Matrix
To achieve a 4-bit operation, D-Matrix requires four separate 1-bit SRAM cells and a power-hungry 8-bit digital adder tree to combine the partial products. The long combinational path through these digital adders also limits the number of row activations that can be fed per cycle, constraining throughput.
Encharge-AI
Like D-Matrix, Encharge-AI utilizes four 1-bit SRAM cells, but its voltage-mode design necessitates four power-intensive Analog-to-Digital Converters (ADCs) to process the outputs before aggregation, with each cell contributing only a small, fixed fraction of the total dynamic range.
QernelAI’s architectural elegance translates directly into dramatic improvements in power efficiency (TOPS/W) and area efficiency (TOPS/mm²). A simplified quantitative analysis highlights this disparity.
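The single-bit workaround can be made concrete with a short model of bit-serial weight decomposition, a generic shift-and-add scheme of the kind the text attributes to the SRAM-based designs (not vendor code). Four bit planes, and hence four cells plus a digital adder tree, are needed to reproduce what one multi-level cell computes in a single step:

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.integers(0, 16, size=64)   # unsigned INT4 weights, one per row
a = 9                              # a 4-bit input activation

# Native multi-level cell: one analog MAC over the whole column.
native = int(np.dot(w, np.full_like(w, a)))

# Single-bit decomposition: each weight is split into 4 bit planes; each
# plane is a column of 1-bit cells whose partial products the adder tree
# must weight by 2^b and accumulate -- 4x the cells, plus digital adders.
bitserial = 0
for b in range(4):
    plane = (w >> b) & 1           # the 1-bit SRAM contents for bit b
    bitserial += (int(plane.sum()) * a) << b

assert bitserial == native         # same result, far more hardware
```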
Table 1: Relative comparison of PPA metrics for QernelAI, D-Matrix, and Encharge-AI. This component-level analysis, supported by the transistor-level view in Figure 1, demonstrates a fundamental 10x to 20x advantage in TOPS/W and TOPS/mm².
| Metric | QernelAI (Gain-Cell + 1x ADC) | D-Matrix (SRAM + Adder) | Encharge-AI (SRAM + 4x ADC) | QernelAI Advantage (Ratio) |
| --- | --- | --- | --- | --- |
| Transistor/Capacitor Count per 4-bit MAC | ~5T + 1C (Bitcell) + 5T (ADC) | ~6T (Bitcell) + 70T (Adder) | 40T + 4C (Bitcell) + 5T (ADC) | 4x - 7x |
| Relative Compute Density (TOPS/mm²) | 20x (Row-parallel activations) | ~1x | ~4x (Row-parallel activations) | ~5x - 20x |
| Relative Energy Efficiency (TOPS/W) | 10x | ~1x | ~2x | ~5x - 10x |
| Combined Efficiency Metric (TOPS/mm²/W) | 100x | ~1x | ~8x | ~25x - 200x |
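The first row’s advantage ratio can be reproduced directly from the table’s device counts (a quick sanity check that counts capacitors alongside transistors, as the table does):

```python
# Device counts per 4-bit MAC, taken from Table 1.
qernel = 5 + 1 + 5        # ~5T + 1C bitcell, plus 5T ADC share
dmatrix = 6 + 70          # ~6T bitcell, plus 70T adder tree
encharge = 40 + 4 + 5     # 40T + 4C bitcells, plus 5T ADC share

print(f"vs D-Matrix:    {dmatrix / qernel:.1f}x")   # ~6.9x
print(f"vs Encharge-AI: {encharge / qernel:.1f}x")  # ~4.5x
# Together these bracket the 4x - 7x range quoted in the table.
```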

