The cjsonx Algorithm

The architecture of cjsonx is designed around extreme performance and memory locality. It achieves parse speeds in excess of 1.0 GB/s by decoupling the parsing process into two distinct phases: SIMD Structural Scanning (Stage 1) and Recursive Descent Parsing (Stage 2).

This approach was pioneered by simdjson and popularized in C by yyjson. cjsonx builds on these foundations by integrating the Eisel-Lemire float parsing algorithm into the lexical layer for fast numerical decoding.

Stage 1: SIMD Structural Scanning

In a conventional parser, the CPU inspects the input one character at a time (if (c == ' ') continue;). Every such test is a data-dependent branch, so irregular input leads to frequent branch mispredictions and pipeline stalls.

cjsonx avoids this by reading the JSON payload in 32-byte (AVX2) or 16-byte (NEON) chunks.

Bitmask Generation: SIMD instructions are used to compare the chunk against structural characters ({, }, [, ], :, ,, ").
String Escaping: A secondary SIMD pass identifies the start and end of strings, ensuring that characters inside strings (like a colon : inside "time: 12:00") are masked out and ignored.
Tape Creation: The indices of all valid structural characters are compressed into a "Tape" (an array of integers).
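
The three steps above can be sketched in portable C. This is only an illustration under stated assumptions: a scalar loop stands in for the SIMD comparisons, the block size is 64 bytes so that one uint64_t holds a whole mask, and build_tape, prefix_xor and ctz64 are hypothetical names rather than cjsonx functions. The string mask is derived from the quote bitmask with a prefix XOR, and the tape is filled by extracting the set bits one at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_structural(unsigned char c) {
    return c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',';
}

/* Prefix XOR: bit i of the result is the XOR of bits 0..i of x. */
static uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;  x ^= x << 2;  x ^= x << 4;
    x ^= x << 8;  x ^= x << 16; x ^= x << 32;
    return x;
}

/* Index of the lowest set bit (portable stand-in for a ctz instruction). */
static unsigned ctz64(uint64_t x) {
    unsigned n = 0;
    assert(x != 0);
    while (!(x & 1)) { x >>= 1; n++; }
    return n;
}

/* Hypothetical helper: writes the offsets of all structural characters and
   unescaped quotes into `tape` and returns how many entries were written. */
size_t build_tape(const char *json, size_t len, uint32_t *tape, size_t cap) {
    size_t n = 0;
    int escaped = 0;         /* previous byte was an unescaped backslash */
    uint64_t in_string = 0;  /* all ones if the previous block ended inside a string */

    for (size_t base = 0; base < len; base += 64) {
        size_t block = (len - base < 64) ? (len - base) : 64;
        uint64_t structural = 0, quotes = 0;

        /* Step 1: bitmask generation (a real Stage 1 would use SIMD compares). */
        for (size_t i = 0; i < block; i++) {
            unsigned char c = (unsigned char)json[base + i];
            if (escaped)   { escaped = 0; continue; }  /* e.g. the quote in \" */
            if (c == '\\') { escaped = 1; continue; }
            if (c == '"') quotes |= 1ULL << i;
            else if (is_structural(c)) structural |= 1ULL << i;
        }

        /* Step 2: bits between an opening and a closing quote are "inside". */
        uint64_t string_mask = prefix_xor(quotes) ^ in_string;
        in_string = (uint64_t)0 - (string_mask >> 63);  /* carry to next block */

        /* Step 3: keep structurals outside strings, plus the quotes themselves. */
        uint64_t bits = (structural & ~string_mask) | quotes;
        while (bits) {
            if (n == cap) return n;
            tape[n++] = (uint32_t)(base + ctz64(bits));
            bits &= bits - 1;  /* clear the lowest set bit */
        }
    }
    return n;
}
```

For the input {"time: 12:00":[1,2]} this sketch records the braces, brackets, quotes, the comma and the colon after the key, while the two colons inside the key string are masked out.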

By the end of Stage 1, the tape is a complete map of the JSON structure. Whitespace never appears on the tape, so it is skipped in bulk at SIMD scanning speed (several gigabytes per second) instead of being tested byte by byte.

Stage 2: Recursive Descent, Computed Gotos & Arena Allocation

With the Tape constructed, the parser no longer reads the JSON character by character. It jumps directly from one structural character to the next.
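
As a rough illustration of walking the tape rather than the raw bytes, the sketch below computes the maximum container nesting depth from a prebuilt tape. It dispatches with the GCC/Clang labels-as-values extension (computed goto) where available and falls back to a plain switch elsewhere. The function tape_max_depth, the class_of table and the three-way byte classification are assumptions made for this example, not part of cjsonx.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical walker: returns the maximum nesting depth of objects/arrays,
   or -1 if the brackets on the tape are unbalanced. */
int tape_max_depth(const char *json, const uint32_t *tape, size_t n) {
    /* Byte class: 0 = other, 1 = opener, 2 = closer. */
    static const unsigned char class_of[256] = {
        ['{'] = 1, ['['] = 1, ['}'] = 2, [']'] = 2,
    };
    int depth = 0, max_depth = 0;
    size_t i = 0;

#if defined(__GNUC__)
    /* One label address per class; every handler ends in its own indirect jump. */
    static void *dispatch_table[3] = { &&on_other, &&on_open, &&on_close };
#define NEXT() do { \
        if (i == n) goto done; \
        goto *dispatch_table[class_of[(unsigned char)json[tape[i++]]]]; \
    } while (0)

    NEXT();
on_open:
    if (++depth > max_depth) max_depth = depth;
    NEXT();
on_close:
    if (--depth < 0) return -1;
    NEXT();
on_other:
    NEXT();
done:
#undef NEXT
#else
    /* Portable fallback: the same logic as an ordinary switch. */
    for (; i < n; i++) {
        switch (class_of[(unsigned char)json[tape[i]]]) {
        case 1: if (++depth > max_depth) max_depth = depth; break;
        case 2: if (--depth < 0) return -1; break;
        default: break;
        }
    }
#endif
    return depth == 0 ? max_depth : -1;
}
```

Only the bytes referenced by the tape are ever read; digits, letters and whitespace between structural characters are never touched by this loop.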

Computed Gotos: On supported compilers (GCC/Clang), the central switch statement is replaced with a dispatch table (goto *dispatch_table[c]). Each handler then ends in its own indirect jump instead of returning to one shared switch, so the CPU's branch predictor can learn the typical successor of each JSON state separately. This reduces branch mispredictions and improves parse speed by roughly 10-15%.
Flat Arena Allocation: Instead of calling malloc() for every object, array, or string, cjsonx allocates a single block of memory (an Arena) upfront.
16-byte Nodes: Every JSON element (whether it's a number, a string reference, or an object) is represented by a highly compressed 16-byte structure (cjsonx_node_t), verified with _Static_assert at compile time:
4 bytes type_and_length: packs an 8-bit type tag and a 24-bit length/count (max 16,777,215 elements).
4 bytes next_sibling: the flat-array index of the next sibling, enabling O(1) subtree skipping without recursion.
8 bytes val: a union overlapping a double, a zero-copy const char* string pointer, a bool, or a uint32_t first-child index — whichever applies to the node type.
Cache Locality: Because nodes are stored sequentially in the Arena array, exactly 4 nodes fit in a single 64-byte CPU cache line. Traversing the DOM tree therefore touches few cache lines, which speeds up querying (cjsonx_get).
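
The node layout and arena described above can be sketched as follows. The field names come from the description, but everything else is an assumption for illustration: the struct is called sketch_node_t to make clear it is not the real cjsonx_node_t, the type tag is assumed to sit in the low 8 bits, UINT32_MAX is assumed as the "no node" sentinel, and arena_push and nth_child are hypothetical helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SK_NULL, SK_BOOL, SK_NUMBER, SK_STRING, SK_ARRAY, SK_OBJECT };
#define SK_NO_NODE UINT32_MAX  /* assumed sentinel: no sibling / no child */

typedef struct {
    uint32_t type_and_length;  /* assumed: low 8 bits type, high 24 bits length */
    uint32_t next_sibling;     /* arena index of the next sibling */
    union {
        double      num;
        const char *str;          /* zero-copy pointer into the input */
        bool        boolean;
        uint32_t    first_child;  /* arena index of the first child */
    } val;
} sketch_node_t;

_Static_assert(sizeof(sketch_node_t) == 16, "node must stay 16 bytes");

/* Flat arena: one preallocated array, nodes handed out by bumping `used`. */
typedef struct {
    sketch_node_t *nodes;
    uint32_t used, cap;
} sketch_arena_t;

static uint8_t node_type(const sketch_node_t *n) {
    return (uint8_t)(n->type_and_length & 0xFFu);
}

static uint32_t node_length(const sketch_node_t *n) {
    return n->type_and_length >> 8;
}

/* Returns the index of the new node, or SK_NO_NODE if the arena is full
   or the length does not fit in 24 bits. No malloc() per node. */
static uint32_t arena_push(sketch_arena_t *a, uint8_t type, uint32_t length) {
    if (a->used == a->cap || length > 0xFFFFFFu) return SK_NO_NODE;
    sketch_node_t *n = &a->nodes[a->used];
    n->type_and_length = (uint32_t)type | (length << 8);
    n->next_sibling = SK_NO_NODE;
    n->val.num = 0.0;
    return a->used++;
}

/* k-th child of a container. Each hop follows next_sibling, so a nested
   subtree is skipped in O(1) without visiting its descendants. */
static uint32_t nth_child(const sketch_arena_t *a, uint32_t parent, uint32_t k) {
    if (k >= node_length(&a->nodes[parent])) return SK_NO_NODE;
    uint32_t i = a->nodes[parent].val.first_child;
    for (; k > 0 && i != SK_NO_NODE; k--) i = a->nodes[i].next_sibling;
    return i;
}
```

With an array such as [1, [2, 3], 4] stored this way, reaching the third element takes two sibling hops regardless of how large the nested array in the middle is.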

Single-Pass String Parsing & SWAR

When parsing strings, cjsonx avoids copying wherever possible: a string with no special bytes is referenced in place. If a string contains escape sequences or raw control characters, it must be validated and, where escapes have to be decoded, copied into newly allocated storage.

To maximize throughput, cjsonx uses single-pass SIMD and SWAR techniques:

SIMD (AVX2/NEON/WASM): It scans 16 or 32 bytes at a time, checking for escapes (\), non-ASCII characters, and raw control characters (< 0x20) together in one pass over each vector.
SWAR (SIMD Within A Register): On platforms without vector instructions, it loads 8 bytes into a 64-bit register and uses bitwise arithmetic to validate the whole block at once, without looping character by character.
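
The SWAR idea can be sketched with the classic "has a zero byte" and "has a byte less than n" bit tricks. The code below is an illustration under assumptions, not the cjsonx implementation: find_special and word_has_special are hypothetical names, and the set of "special" bytes (quote, backslash, control characters below 0x20, non-ASCII) is taken from the description above. It tests 8 bytes per step and only drops to a byte loop to locate the exact position.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero iff some byte of v is 0x00. */
static uint64_t has_zero_byte(uint64_t v) {
    return (v - ONES) & ~v & HIGHS;
}

/* Nonzero iff the 8 bytes in w contain a quote, a backslash,
   a control character (< 0x20) or a non-ASCII byte (>= 0x80). */
static int word_has_special(uint64_t w) {
    uint64_t quote     = has_zero_byte(w ^ (ONES * '"'));
    uint64_t backslash = has_zero_byte(w ^ (ONES * '\\'));
    uint64_t control   = (w - ONES * 0x20) & ~w & HIGHS;
    uint64_t non_ascii = w & HIGHS;
    return (quote | backslash | control | non_ascii) != 0;
}

static int byte_is_special(unsigned char c) {
    return c == '"' || c == '\\' || c < 0x20 || c >= 0x80;
}

/* Hypothetical helper: index of the first special byte in s[0..len),
   or len if there is none. */
size_t find_special(const char *s, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);           /* unaligned-safe 8-byte load */
        if (word_has_special(w)) break; /* locate the byte below */
    }
    for (; i < len; i++) {
        if (byte_is_special((unsigned char)s[i])) return i;
    }
    return len;
}
```

Because the word test only answers "is there any special byte in these 8", the result does not depend on byte order; the short scalar loop pins down the position once a block is flagged.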

The Eisel-Lemire Float Engine

Many parsers convert decimal text to 64-bit IEEE 754 doubles with strtod or with custom integer math. The former can be very slow, and the latter can round incorrectly on hard edge cases.

cjsonx leverages the Eisel-Lemire Algorithm directly in the parsing pipeline:

As the parser reads a number, it accumulates the mantissa and exponent using fast 64-bit integer arithmetic.
It then performs a high-precision table lookup (using a precomputed table of powers of 10) and a 128-bit multiplication (__uint128_t or emulated on 32-bit platforms).
In the vast majority of cases (about 99.9% of floating-point inputs), this single fast path yields the exact, correctly rounded double using integer arithmetic only, with no floating-point math overhead.
If the number is an extreme edge case that the fast path cannot resolve unambiguously, the parser falls back first to cjsonx_fastfloat (a secondary fast path using double arithmetic) and finally to the standard C library (strtod), so that the result is still correct.
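
The accumulate-then-fall-back structure can be sketched as below. This is not the Eisel-Lemire algorithm itself, which additionally needs a table of 128-bit powers of 10. The sketch shows only the 64-bit accumulation of mantissa and decimal exponent, an exact double-arithmetic fast path (valid when the mantissa fits in 53 bits and the power of 10 is at most 10^22), and the final strtod fallback. The function name sketch_parse_number is hypothetical, and the code assumes well-formed JSON number text, the default round-to-nearest mode and the "C" locale.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: converts JSON number text at `s` to a double. */
double sketch_parse_number(const char *s) {
    /* Powers of ten that are exactly representable as doubles. */
    static const double pow10_exact[23] = {
        1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
        1e8,  1e9,  1e10, 1e11, 1e12, 1e13, 1e14, 1e15,
        1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22
    };
    const char *p = s;
    int negative = (*p == '-');
    uint64_t mantissa = 0;
    int digits = 0;    /* significant digits accumulated so far */
    int exp10 = 0;     /* decimal exponent applied to mantissa */
    int too_long = 0;  /* more than 19 significant digits seen */

    if (negative) p++;
    for (; *p >= '0' && *p <= '9'; p++) {          /* integer part */
        if (digits < 19) {
            mantissa = mantissa * 10 + (uint64_t)(*p - '0');
            digits += (mantissa != 0);
        } else {
            too_long = 1;
        }
    }
    if (*p == '.') {                               /* fraction part */
        for (p++; *p >= '0' && *p <= '9'; p++) {
            if (digits < 19) {
                mantissa = mantissa * 10 + (uint64_t)(*p - '0');
                digits += (mantissa != 0);
                exp10--;
            } else {
                too_long = 1;
            }
        }
    }
    if (*p == 'e' || *p == 'E') {                  /* exponent part */
        int esign = 1, e = 0;
        p++;
        if (*p == '+' || *p == '-') { if (*p == '-') esign = -1; p++; }
        for (; *p >= '0' && *p <= '9'; p++) {
            if (e < 10000) e = e * 10 + (*p - '0');
        }
        exp10 += esign * e;
    }

    /* Exact fast path: both operands are exact doubles, so the single
       multiply or divide is correctly rounded. */
    if (!too_long && mantissa <= (1ULL << 53) && exp10 >= -22 && exp10 <= 22) {
        double d = (double)mantissa;
        d = (exp10 < 0) ? d / pow10_exact[-exp10] : d * pow10_exact[exp10];
        return negative ? -d : d;
    }
    /* Everything else is left to the C library in this sketch. */
    return strtod(s, NULL);
}
```

In a pipeline like the one described, the Eisel-Lemire step would sit in front of these two fallbacks and consume the same mantissa and exp10 values.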