How It Works

Architecture & Internals of SyndrDB

1. System Overview

SyndrDB processes every request through a layered pipeline. A client connection arrives over TCP, is authenticated and bound to a session, and its command is routed through the parser, planner, and execution engine. Results flow back through the wire protocol. Cross-cutting concerns — MVCC, WAL, locking, and indexes — integrate at each layer.


2. Query Pipeline — From SQL to Results

Every SyndrQL statement follows a deterministic pipeline from raw text to executed results. The pipeline is designed for zero-allocation token scanning, modular parsing, and cost-based plan selection.

Query pipeline:

SQL String → Tokenizer → Parser → Expression AST → Query Router → Cost-Based Planner → Execution Plan → Execute → Results

Tokenizer

Character-by-character scanning produces a flat list of tokens. The tokenizer distinguishes between == (equality comparison, TOKEN_EQ) and = (assignment, TOKEN_ASSIGN), and treats * as TOKEN_MULTIPLY — context determines whether it means "all fields" or multiplication.
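As a sketch of the == vs = distinction, the scanner can peek one character ahead before emitting a token. The token names below mirror the ones mentioned above; everything else (the scan function, the TOKEN_IDENT fallback) is illustrative, not SyndrDB's actual code:

```go
package main

import "fmt"

// Token kinds; TOKEN_EQ/TOKEN_ASSIGN/TOKEN_MULTIPLY match the names in
// the text, the rest is invented for this sketch.
type TokenKind int

const (
	TOKEN_ASSIGN TokenKind = iota // "="
	TOKEN_EQ                      // "=="
	TOKEN_MULTIPLY                // "*"
	TOKEN_IDENT
)

type Token struct {
	Kind TokenKind
	Text string
}

// scan walks the input character by character, peeking one ahead to
// decide between "=" (assignment) and "==" (equality).
func scan(src string) []Token {
	var toks []Token
	for i := 0; i < len(src); i++ {
		switch c := src[i]; {
		case c == ' ':
			// skip whitespace
		case c == '=':
			if i+1 < len(src) && src[i+1] == '=' {
				toks = append(toks, Token{TOKEN_EQ, "=="})
				i++ // consume the second '='
			} else {
				toks = append(toks, Token{TOKEN_ASSIGN, "="})
			}
		case c == '*':
			toks = append(toks, Token{TOKEN_MULTIPLY, "*"})
		default:
			j := i
			for j < len(src) && src[j] != ' ' && src[j] != '=' && src[j] != '*' {
				j++
			}
			toks = append(toks, Token{TOKEN_IDENT, src[i:j]})
			i = j - 1
		}
	}
	return toks
}

func main() {
	for _, t := range scan("active == true") {
		fmt.Println(t.Kind, t.Text)
	}
}
```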

Parsers

SyndrDB uses dedicated parsers per statement type rather than a single monolithic grammar. Each parser consumes tokens and produces a typed query struct:

  • SELECT parser — handles projections, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT/OFFSET, JOINs, subqueries
  • INSERT / UPDATE / DELETE parsers — DML operations with expression-based WHERE clauses
  • Transaction parser — BEGIN, COMMIT, ROLLBACK, SAVEPOINT
  • Cursor parser — DECLARE, FETCH, CLOSE
  • Trigger parser — CREATE/DROP/ENABLE/DISABLE TRIGGER
  • DDL parsers — CREATE/DROP BUNDLE, CREATE INDEX, CREATE VIEW

Expression AST

WHERE clauses, HAVING filters, and computed expressions are represented as a tree of expression nodes:

Node Type | Purpose | Example
BinaryExpression | Two-operand comparison or logic | age >= 25 AND active == true
UnaryExpression | NOT, IS NULL, EXISTS | NOT EXISTS (...)
IdentifierExpression | Field reference | name
QualifiedIdentifierExpression | Table-qualified field | "users"."name"
LiteralExpression | Constant value | 42, "hello"
SubqueryExpression | Nested SELECT | IN (SELECT ...)

Cost Model

The query planner evaluates candidate plans using a cost model that considers:

  • CPU cost — per-row evaluation overhead (expression complexity, function calls)
  • I/O cost — pages to read from storage (full scan vs. index lookup)
  • Memory cost — intermediate result buffering (hash tables, sort buffers)
  • Selectivity estimation — HyperLogLog cardinality and histogram statistics guide row-count predictions
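A minimal sketch of how such a cost model might compare a full scan against an index lookup. The weight constants and function names here are invented for illustration; SyndrDB's actual cost constants are not documented in this section:

```go
package main

import "fmt"

// Illustrative cost weights — not SyndrDB's actual constants.
const (
	cpuPerRow = 0.01 // per-row evaluation overhead
	ioPerPage = 1.0  // cost of reading one page from storage
)

type PlanCost struct{ CPU, IO float64 }

func (c PlanCost) Total() float64 { return c.CPU + c.IO }

// fullScanCost touches every page and evaluates the predicate on every row.
func fullScanCost(rows, pages int) PlanCost {
	return PlanCost{CPU: cpuPerRow * float64(rows), IO: ioPerPage * float64(pages)}
}

// indexLookupCost reads only the pages holding matching rows, using the
// planner's selectivity estimate to predict the matching row count.
func indexLookupCost(rows, pages int, selectivity float64) PlanCost {
	matched := float64(rows) * selectivity
	pagesTouched := float64(pages) * selectivity
	return PlanCost{CPU: cpuPerRow * matched, IO: ioPerPage * pagesTouched}
}

func main() {
	full := fullScanCost(1_000_000, 4000)
	idx := indexLookupCost(1_000_000, 4000, 0.001) // ~0.1% of rows match
	fmt.Printf("full scan: %.1f, index lookup: %.1f\n", full.Total(), idx.Total())
}
```

With a selective predicate the index plan wins by orders of magnitude; with selectivity near 1.0 the full scan's sequential I/O would win instead, which is exactly the trade-off the planner weighs.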

3. Execution Engine — Nodes & Interfaces

SyndrDB's execution engine uses a node-based architecture where each operation (scan, filter, sort, aggregate) is represented as a composable node. Nodes can be stacked into execution trees where data flows from leaf nodes (scans) up through processing nodes to the root.

Three Execution Interfaces

Different query patterns need different data-flow models. SyndrDB provides three interfaces:

  • ExecutionNode — materialized results in a map
  • SliceExecutionNode — scan-optimized slice results
  • IteratorNode — Volcano-style pull-based streaming

Interface | Method | Returns | Use Case
ExecutionNode | Execute(ctx) | map[string]*Document | General queries with random access by doc ID
SliceExecutionNode | ExecuteSlice(ctx) | []*Document, []string | Full scans where map overhead is unnecessary
IteratorNode | Init(ctx), Next(), Close() | *Document, error | Streaming, cursors, memory-bounded execution

The IteratorNode uses Volcano-model semantics: Next() returns one document per call, and (nil, nil) signals end-of-data. Nodes implementing IterableNode can produce iterators via AsIterator().
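The Volcano contract above can be sketched as a small interface plus a leaf node. The sliceIterator and drain helpers are hypothetical, and Init is shown without its ctx parameter for brevity:

```go
package main

import "fmt"

// Document is a stand-in for SyndrDB's document type.
type Document struct{ ID string }

// IteratorNode mirrors the Volcano-style interface described above:
// Next returns one document per call and (nil, nil) at end-of-data.
type IteratorNode interface {
	Init() error
	Next() (*Document, error)
	Close() error
}

// sliceIterator is a minimal leaf node streaming from an in-memory slice.
type sliceIterator struct {
	docs []*Document
	pos  int
}

func (s *sliceIterator) Init() error { s.pos = 0; return nil }

func (s *sliceIterator) Next() (*Document, error) {
	if s.pos >= len(s.docs) {
		return nil, nil // (nil, nil) signals end-of-data
	}
	d := s.docs[s.pos]
	s.pos++
	return d, nil
}

func (s *sliceIterator) Close() error { return nil }

// drain pulls documents one at a time, Volcano style, until end-of-data.
func drain(it IteratorNode) ([]string, error) {
	if err := it.Init(); err != nil {
		return nil, err
	}
	defer it.Close()
	var ids []string
	for {
		doc, err := it.Next()
		if err != nil {
			return nil, err
		}
		if doc == nil {
			return ids, nil
		}
		ids = append(ids, doc.ID)
	}
}

func main() {
	it := &sliceIterator{docs: []*Document{{"a"}, {"b"}, {"c"}}}
	ids, _ := drain(it)
	fmt.Println(ids)
}
```

Because consumers only ever hold one document at a time, memory usage stays bounded regardless of result size — which is why cursors and the streaming protocol are built on this interface.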

Execution Node Types

Example execution tree for SELECT status, COUNT(*) FROM "orders" WHERE total > 100 GROUP BY status ORDER BY status LIMIT 10:

LimitNode
  └─ SortNode
       └─ AggregationNode
            └─ FilterNode
                 └─ FullScanNode
Node | Purpose | Key Optimization
FullScanNode | Sequential scan of all pages | Predicate pushdown into scanner, projection pushdown
IndexScanNode | Hash/BTree/BRIN index lookup | Falls back to full scan if index miss
BRINScanNode | Block-range skip scan | Skips entire page ranges based on min/max summaries
IndexOnlyScanNode | Answers query from index alone | Zero page reads when index covers all projected fields
BTreeOrderedScanNode | Pre-sorted range traversal | Avoids in-memory sort for ORDER BY on indexed column
FilterNode | WHERE expression evaluation | SIMD batch evaluation for simple predicates
AggregationNode | GROUP BY with hash/sort strategy | Streaming aggregation, SIMD vectorized SUM/MIN/MAX
SortNode | ORDER BY | Radix sort, parallel merge sort, SIMD-accelerated comparisons
LimitNode | LIMIT / OFFSET | Short-circuits upstream execution
DistinctNode | DISTINCT deduplication | Hash-based with pre-sorted optimization
JoinExecutionNode | Hash join | Build-side selection, predicate pushdown
CorrelatedSubqueryNode | IN/EXISTS subqueries | Hash semi/anti-join rewriting (O(N+M) vs O(N*M))

SIMD Acceleration

Performance-critical paths use SIMD (Single Instruction, Multiple Data) via the syndrdb-simd library:

  • Batch predicate evaluation — evaluates WHERE conditions on entire document batches
  • Vectorized aggregation — SUM, MIN, MAX on int64 columns processed in SIMD lanes
  • Compound predicate bitmaps — AND/OR of multiple conditions via bitmap operations
  • SIMD string operations — UPPER/LOWER with ASCII fast path
  • Accelerated sorting — SIMD-assisted comparisons for radix and parallel merge sort

4. Plan Cache — Adaptive Query Optimization

Query planning is expensive (cost estimation, index selection, join ordering). SyndrDB caches execution plans to amortize this cost across repeated queries.

Plan cache lookup:

SQL Query → xxhash → Shard [N] → Version Check → hit: serve cached plan / miss: build new plan

8-Shard LRU

The cache is divided into 8 independent shards, each with its own LRU eviction and mutex. The shard is selected by xxhash(queryText) % 8. This reduces lock contention under high concurrency — 8 concurrent planners can each hit a different shard without blocking.
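Shard selection can be sketched as follows. The standard library's FNV-1a stands in for xxhash so the example has no external dependencies; numShards mirrors the 8-shard design:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 8

// shardFor picks the plan-cache shard for a query. SyndrDB uses xxhash;
// FNV-1a from the standard library stands in here.
func shardFor(queryText string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(queryText))
	return h.Sum32() % numShards
}

func main() {
	for _, q := range []string{
		`SELECT * FROM "users"`,
		`SELECT * FROM "orders"`,
	} {
		fmt.Printf("%-24s -> shard %d\n", q, shardFor(q))
	}
}
```

Two concurrent planners collide only when their queries hash to the same shard, so each shard's LRU mutex is contended roughly 1/8th as often as a single global lock would be.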

Adaptive Generic/Custom Planning

Inspired by PostgreSQL, SyndrDB uses an adaptive strategy:

  • First 5 executions — always use a custom plan (parameter-specific)
  • After 5 executions — compare generic plan cost vs. average custom plan cost
  • If generic is cheaper — switch to generic plan (parameter-independent, reusable)
  • Periodic re-evaluation — if statistics change, reconsider the choice
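The decision rule above might look like this in code; planChoice and its fields are illustrative names, not SyndrDB's API:

```go
package main

import "fmt"

const customPlanTrials = 5 // first N executions always use a custom plan

type planChoice struct {
	executions    int
	customCostSum float64
	genericCost   float64
}

// choose implements the adaptive rule: custom plans during the trial
// window, then generic if its cost beats the average custom-plan cost.
func (p *planChoice) choose() string {
	if p.executions < customPlanTrials {
		return "custom"
	}
	avgCustom := p.customCostSum / float64(p.executions)
	if p.genericCost <= avgCustom {
		return "generic"
	}
	return "custom"
}

// recordCustom logs the cost of one custom-plan execution.
func (p *planChoice) recordCustom(cost float64) {
	p.executions++
	p.customCostSum += cost
}

func main() {
	p := &planChoice{genericCost: 90}
	for i := 0; i < 5; i++ {
		fmt.Println(p.choose()) // "custom" during the trial window
		p.recordCustom(100)
	}
	fmt.Println(p.choose()) // generic (90) beats the custom average (100)
}
```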

Lazy Invalidation

Each bundle tracks a version number that increments on schema changes, index creation/deletion, or statistics refresh. Cached plans store the version at creation time. On cache hit, if the plan's version is stale, a fresh plan is built. During rebuild, the stale plan continues serving read queries to avoid latency spikes.

Setting | Default | Description
planCacheCapacity | 1000 per shard | Max entries before LRU eviction
planCacheEnabled | true | Enable/disable plan caching

5. Storage Engine — Segments, Pages & Write Path

SyndrDB stores documents in append-only binary segment files, organized by bundle. The storage engine is designed for sequential write throughput and efficient page-level reads.

On-Disk Layout

Bundle directory structure:

database/bundleName/
├── bundle.manifest      JSON metadata
├── 000001.bnd           binary segment (BSON)
├── 000002.bnd           binary segment (BSON)
└── sorted_index.idx     page lookup index

Component | Format | Purpose
bundle.manifest | JSON | Tracks all segment files, document counts, bloom filter state
*.bnd | Binary (BSON) | Append-only segment files containing document data
sorted_index.idx | Binary | Sharded sorted index for O(log n) pageID calculation

Segment Files

Documents are serialized as BSON and appended to the current active segment file. When a segment reaches the maximum size (default 32MB), a new segment is created. Old segments are immutable — compaction merges them to reclaim space from deleted/superseded versions.

Document Pages

In memory, documents are organized into pages of approximately 4,096 documents each. Each page provides two access patterns:

  • Documents map[string]Document — keyed by document ID for random access
  • DocumentSlice []Document — flat array for scan-optimized sequential access

Pages form a linked list via NextPageID / PreviousPageID pointers.

Write Buffer

Writes use a double-buffered design for zero-contention I/O:

Double-Buffered Write Path
Writer 1
Writer 2
Writer N
↓ atomic offset reservation
Active Buffer (pwrite, no mutex)
↓ swap on flush
Back Buffer → Disk (background)

Writers atomically reserve an offset in the active buffer and write via pwrite — no mutex needed. When the buffer is flushed, the active and back buffers swap atomically. The background flusher writes the back buffer to disk without blocking new writes.
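A minimal sketch of atomic offset reservation over an in-memory buffer. The real write path targets a file via pwrite and swaps buffers on overflow; this sketch just returns failure when the buffer is full, and all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// writeBuffer sketches atomic offset reservation: each writer claims a
// disjoint region with a single atomic add, then copies its payload in
// without taking any mutex.
type writeBuffer struct {
	buf    []byte
	offset atomic.Int64
}

// reserve claims n bytes and returns the start of the claimed region,
// or -1 if the buffer is full (a real implementation would trigger the
// active/back buffer swap here).
func (w *writeBuffer) reserve(n int) int64 {
	off := w.offset.Add(int64(n)) - int64(n)
	if off+int64(n) > int64(len(w.buf)) {
		return -1
	}
	return off
}

func (w *writeBuffer) write(p []byte) bool {
	off := w.reserve(len(p))
	if off < 0 {
		return false
	}
	copy(w.buf[off:], p) // exclusive region: no lock needed
	return true
}

func main() {
	w := &writeBuffer{buf: make([]byte, 1024)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			w.write([]byte(fmt.Sprintf("record-%d;", id)))
		}(i)
	}
	wg.Wait()
	fmt.Println("bytes used:", w.offset.Load())
}
```

Because every writer owns a non-overlapping byte range, concurrent copies never race — the only shared state is a single atomic counter.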

Setting | Default | Description
bundleFileMaxSizeMB | 32 | Segment file rotation threshold
maxLoadedDocumentPages | 500 | Max pages in memory before eviction

6. Page Cache — 64-Shard Lock-Free Design

The page cache is the most contended data structure in SyndrDB — every query touches it. Its design prioritizes lock-free reads under high concurrency.

A page request is routed via xxhash(pageKey) % 64 to one of 64 independent shards (Shard 0 … Shard 63). Each shard holds:

  • sync.Map (fast path) — lock-free atomic loads
  • Authoritative map — RWMutex-protected
  • LRU chain — eviction ordering

Read Path (Zero Contention)

Reads first attempt sync.Map.Load() which is a lock-free atomic load. Under read-heavy workloads (the common case), no locks are ever acquired. On miss, the shard's RWMutex is taken for a read lock to consult the authoritative map.

Write Path (Copy-Outside-Lock)

Writes follow a copy-outside-lock pattern: the new page state is prepared without holding any lock, then a brief write lock on the target shard updates both the authoritative map and sync.Map atomically. This minimizes the critical section to a pointer swap.

COW Snapshots for GROUP BY

GROUP BY queries need a consistent view of page data while concurrent writes may be modifying pages. The cache provides copy-on-write snapshots: an immutable []Document array is created from the page and cached with a staleness timestamp. Multiple concurrent GROUP BY queries share the same snapshot if the page hasn't changed.


7. Index System — Hash (LSM), B-Tree, BRIN

SyndrDB supports three index types, each optimized for different access patterns. All indexes support partial indexes (WHERE clause), functional expressions (LOWER, YEAR, arithmetic), and INCLUDE columns for covering queries.

Hash Index V3 (LSM Architecture)

Hash Index V3 — LSM tiers:

MemTable (in-memory, 100K max entries)
  ↓ overflow / flush
Entry Storage (256 buckets, append-only)
  ↓ compaction
Compacted Files (merged, deduplicated)

O(1) average lookup for equality queries (field == value). Uses an LSM-tree approach:

  • Write path: append entry to disk bucket → update MemTable → check compaction threshold
  • Read path: check MemTable → scan bucket files backward (newest first) → cache result
  • MVCC-aware: reads filter by CommitSequence to return only visible versions
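The read path might be sketched like this, using an in-memory stand-in for the on-disk bucket files (MVCC filtering and result caching are omitted, and all names are illustrative):

```go
package main

import "fmt"

// entry is one index posting: a key mapped to a document ID.
type entry struct{ key, docID string }

// hashIndex sketches the V3 read path: consult the in-memory MemTable
// first, then scan flushed segments from newest to oldest.
type hashIndex struct {
	memTable map[string]string
	segments [][]entry // segments[len-1] is the newest flushed segment
}

func (h *hashIndex) lookup(key string) (string, bool) {
	if id, ok := h.memTable[key]; ok {
		return id, true // freshest data always lives in the MemTable
	}
	for i := len(h.segments) - 1; i >= 0; i-- { // newest segment first
		seg := h.segments[i]
		for j := len(seg) - 1; j >= 0; j-- { // append-only: last write wins
			if seg[j].key == key {
				return seg[j].docID, true
			}
		}
	}
	return "", false
}

func main() {
	idx := &hashIndex{
		memTable: map[string]string{"alice": "doc-9"},
		segments: [][]entry{
			{{"alice", "doc-1"}, {"bob", "doc-2"}},
			{{"bob", "doc-7"}}, // newer segment supersedes the older posting
		},
	}
	fmt.Println(idx.lookup("alice")) // served from the MemTable
	fmt.Println(idx.lookup("bob"))   // newest segment wins
}
```

Scanning newest-first is what makes the append-only design correct: a stale posting in an older tier is shadowed rather than deleted, and compaction later removes it physically.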

B-Tree V2

B+ tree structure:

              Root
            /      \
      Internal    Internal
       /    \       /    \
   Leaf → Leaf → Leaf → Leaf

Linked leaf nodes enable efficient range traversal.

O(log n) lookup for range queries, ORDER BY, and unique constraints. B+ tree with linked leaf nodes for efficient range traversal:

  • Page-based storage: 8KB pages, metadata page 0, LRU page cache (1000 pages default)
  • WAL for crash recovery: separate B-tree WAL with CRC32 checksums
  • Range queries: O(log n) search to first matching leaf + O(k) sequential traversal

BRIN (Block Range INdex)

BRIN range-skip example:

Pages | min | max
1-128 | 1 | 500
129-256 | 480 | 1200
257-384 | 900 | 1500
385-512 | 1400 | 2000

Query: WHERE value BETWEEN 600 AND 1300 — skips pages 1-128 and 385-512 entirely.

One entry per ~128 pages storing min/max values, NULL tracking, and document count. Ideal for naturally ordered data (timestamps, auto-incrementing IDs). Tiny footprint: ~250 entries per 1M documents.
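The skip logic reduces to an interval-overlap test per block range; brinEntry and rangesToScan are illustrative names. Using the four ranges from the example above:

```go
package main

import "fmt"

// brinEntry summarizes one block range (~128 pages): the min and max
// values seen for the indexed field across that range.
type brinEntry struct {
	firstPage, lastPage int
	min, max            int64
}

// rangesToScan returns the block ranges whose [min, max] summary overlaps
// the query interval; all other ranges are skipped with zero page reads.
func rangesToScan(entries []brinEntry, lo, hi int64) []brinEntry {
	var keep []brinEntry
	for _, e := range entries {
		if e.max >= lo && e.min <= hi { // intervals overlap
			keep = append(keep, e)
		}
	}
	return keep
}

func main() {
	entries := []brinEntry{
		{1, 128, 1, 500},
		{129, 256, 480, 1200},
		{257, 384, 900, 1500},
		{385, 512, 1400, 2000},
	}
	// WHERE value BETWEEN 600 AND 1300
	for _, e := range rangesToScan(entries, 600, 1300) {
		fmt.Printf("scan pages %d-%d\n", e.firstPage, e.lastPage)
	}
}
```

Only pages 129-256 and 257-384 survive the filter, matching the example: the first and last ranges are pruned purely from their summaries.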

Index Comparison

Type | Best For | Complexity | Implementation
Hash V3 | Equality (field = value) | O(1) avg | LSM: MemTable + append-only buckets
B-Tree V2 | Range, ORDER BY, unique | O(log n) | B+ tree with linked leaves, page cache
BRIN | Range on ordered data | O(ranges) | Block-range min/max summaries

8. MVCC — Multi-Version Concurrency Control

Every write creates a new version of a document rather than overwriting in place. Readers see a consistent snapshot without blocking writers, and writers don't block readers.

Document Version Fields

Field | Type | Purpose
CommitSequence | uint64 | Global monotonic sequence assigned at commit
VersionSequence | uint64 | Per-document version counter (1, 2, 3...)
CreatedByTxID | uint64 | Transaction that created this version
DeletedByTxID | uint64 | Transaction that deleted this version
SupersededAt | time.Time | Timestamp when replaced by a newer version (zero = current)

Visibility Rules

A document version's visibility to a transaction's snapshot is decided by five checks (the first short-circuits the rest):

MVCC Visibility Check
1. Read-your-own-writes: if CreatedByTxID == myTxID, always visible
2. Snapshot boundary: CommitSequence <= snapshotSeq
3. Active tx exclusion: CreatedByTxID not in active transaction set
4. Not deleted: DeletedByTxID == 0 or deleted after snapshot
5. RCU grace period: superseded versions visible for 100ms window
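Rules 1-4 can be sketched as a visibility predicate (the 100ms RCU grace window from rule 5 is omitted). The documented MVCC fields keep their names; the snapshot type and its fields are illustrative:

```go
package main

import "fmt"

// docVersion carries the MVCC fields described above.
type docVersion struct {
	CommitSequence uint64
	CreatedByTxID  uint64
	DeletedByTxID  uint64
}

type snapshot struct {
	myTxID      uint64
	snapshotSeq uint64
	activeTxIDs map[uint64]bool   // transactions in flight at snapshot time
	deleteSeq   map[uint64]uint64 // commit sequence of each committed deleting tx
}

// visible applies rules 1-4 from the checklist above.
func (s snapshot) visible(v docVersion) bool {
	if v.CreatedByTxID == s.myTxID {
		return true // 1. read-your-own-writes
	}
	if v.CommitSequence > s.snapshotSeq {
		return false // 2. committed after our snapshot boundary
	}
	if s.activeTxIDs[v.CreatedByTxID] {
		return false // 3. creator was still in flight at snapshot time
	}
	if v.DeletedByTxID != 0 {
		if seq, committed := s.deleteSeq[v.DeletedByTxID]; committed && seq <= s.snapshotSeq {
			return false // 4. deleted before our snapshot
		}
	}
	return true
}

func main() {
	s := snapshot{myTxID: 42, snapshotSeq: 300,
		activeTxIDs: map[uint64]bool{}, deleteSeq: map[uint64]uint64{}}
	fmt.Println(s.visible(docVersion{CommitSequence: 250, CreatedByTxID: 1})) // v2 at seq 250
	fmt.Println(s.visible(docVersion{CommitSequence: 500, CreatedByTxID: 1})) // v3 at seq 500
}
```

This reproduces the version-chain example: a snapshot at sequence 300 sees the version committed at 250 but not the one committed at 500.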

Version Chain

Document version history:

v1 (CommitSeq 100, superseded) → v2 (CommitSeq 250, superseded) → v3 (CommitSeq 500, current)

A transaction with a snapshot at seq 300 sees v2; a transaction at seq 600 sees v3.

Dead Version Reclamation (Vacuum)

Old versions that are no longer visible to any active transaction are cleaned up by the vacuum process:

  • isDeadVersion() checks: superseded + grace period elapsed + commitSequence < oldest active snapshot
  • RemoveDeadVersionsFromPage() performs in-memory cleanup at the page level
  • Configurable via vacuumDeadRatioThreshold (default 0.3) and vacuumMaxPagesPerCycle (default 100)

HOT Updates

When an UPDATE modifies only non-indexed fields, SyndrDB skips the index update entirely (Heap-Only Tuple optimization). This avoids index maintenance overhead for common "update a status field" patterns.


9. WAL — Write-Ahead Log & Crash Recovery

The WAL guarantees durability: every state-changing operation is recorded to the log before the in-memory state is modified. On crash, the WAL is replayed to recover to the last consistent state.

WAL Entry Format

+----------+--------+------------+-------+--------+-----------+------------+--------+
| TxID     | OpType | BundleName | DocID | Before | After     | Timestamp  | CRC32  |
| (uint64) | (byte) | (string)   | (str) | (data) | (data)    | (int64)    | (4B)   |
+----------+--------+------------+-------+--------+-----------+------------+--------+

Each entry is self-describing with a CRC32 checksum for corruption detection. The Before field stores the pre-modification state for undo-based rollback.

Three Durability Modes

Mode | Behavior | Trade-off
Strict | fsync after every op | Safest, slowest
Balanced | Group commit | 10x fewer fsyncs
Performance | Async flush | Fastest, risk of loss

Group Commit (Balanced Mode)

Multiple concurrent transactions share a single fsync by batching their WAL entries into a double-buffered write pipeline:

Group Commit Flow
Tx 1
Tx 2
Tx 3
↓ append entries
Main Buffer (accumulating)
↓ swap (atomic)
Back Buffer → fsync to disk
One fsync serves all three transactions

Crash Recovery

On startup, the recovery process:

  1. Finds the last checkpoint marker in the WAL
  2. Replays all WAL entries after that checkpoint
  3. Reloads affected bundles from their segment files
  4. Rolls back any incomplete transactions

Write Coordinator

Three background goroutines manage the WAL lifecycle:

Goroutine | Purpose
WAL Writer | Drains the entry queue, writes to the log file, triggers group commit
Background Writer | Periodically flushes dirty pages from cache to segment files
Checkpointer | Writes checkpoint markers, enables WAL file rotation

Setting | Default | Description
walEnabled | true | Enable/disable WAL
durabilityMode | balanced | strict / balanced / performance
walMaxFileSizeMB | 100 | WAL file rotation threshold

10. Transaction System — ACID Guarantees

SyndrDB provides full ACID transactions with three isolation levels, undo-based rollback, and document-level write locks.

Transaction Lifecycle

Transaction flow:

BEGIN → Capture Snapshot → DML Operations → Conflict Check → COMMIT

ROLLBACK at any point undoes changes via WAL before-images.

Isolation Levels

Level | Behavior | Use Case
READ COMMITTED | Each statement sees the latest committed data | Simple read workloads, low contention
REPEATABLE READ (default) | Snapshot captured at BEGIN; all reads see the same point in time | Consistent reporting, analytics
SERIALIZABLE | SSI (Serializable Snapshot Isolation) detects read/write conflicts | Financial transactions, strict consistency

Serializable Snapshot Isolation (SSI)

SERIALIZABLE uses a technique called SSI to detect anomalies without blocking reads:

  • SIREAD locks — recorded after SELECT execution, tracking which documents were read
  • rw-antidependency tracking — when a write conflicts with another transaction's SIREAD, an edge is recorded
  • Dangerous structure detection — at COMMIT, checks for cycles in the dependency graph
  • Abort policy — the transaction that creates a dangerous structure is aborted with a serialization error

Deadlock Detection

Document-level write locks can create deadlock situations. SyndrDB detects these in real-time:

  • Wait-for graph — when a transaction blocks on a lock, an edge is added to the dependency graph
  • DFS cycle detection — runs on every new wait edge, not periodically
  • Victim selection — the youngest transaction in the cycle is aborted (least work lost)
  • Channel-based waiting — blocked transactions wait on a channel rather than polling
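The on-edge DFS might be sketched like this; waitForGraph and hasCycleFrom are illustrative names, not SyndrDB's actual types:

```go
package main

import "fmt"

// waitForGraph maps each blocked transaction to the transactions it waits on.
type waitForGraph map[uint64][]uint64

// hasCycleFrom runs a DFS starting at the transaction that just began
// waiting — mirroring "runs on every new wait edge, not periodically".
// A path that leads back to the starting transaction is a deadlock.
func (g waitForGraph) hasCycleFrom(tx uint64) bool {
	seen := map[uint64]bool{}
	var dfs func(cur uint64) bool
	dfs = func(cur uint64) bool {
		if cur == tx && len(seen) > 0 {
			return true // walked back to where we started: deadlock
		}
		if seen[cur] {
			return false // already explored this branch
		}
		seen[cur] = true
		for _, next := range g[cur] {
			if dfs(next) {
				return true
			}
		}
		return false
	}
	return dfs(tx)
}

func main() {
	g := waitForGraph{}
	g[1] = append(g[1], 2)         // tx1 waits on tx2
	g[2] = append(g[2], 3)         // tx2 waits on tx3
	fmt.Println(g.hasCycleFrom(2)) // no cycle yet
	g[3] = append(g[3], 1)         // tx3 waits on tx1: cycle 1→2→3→1
	fmt.Println(g.hasCycleFrom(3)) // deadlock detected
}
```

On detection, the victim-selection step would abort the youngest transaction in the cycle and remove its edges from the graph.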

Savepoints

Single-level savepoints allow partial rollback within a transaction:

BEGIN TRANSACTION;
  ADD DOCUMENT TO BUNDLE "orders" WITH ({...});
  SAVEPOINT "before_update";
  UPDATE DOCUMENTS IN BUNDLE "orders" (...) CONFIRMED WHERE status == "pending";
  -- Oops, wrong update
  ROLLBACK TO SAVEPOINT "before_update";
  -- orders table is restored to the savepoint state
COMMIT;

11. Concurrency Architecture — Shards, Atomics & Lock-Free Patterns

SyndrDB is designed for high-concurrency workloads. The master pattern is sharded access with lock-free reads: split data structures into independent shards, and use atomic operations for the read path so that readers never block.

Sharding Overview

Sharded subsystems:

Subsystem | Shards | Mechanism
Page Cache | 64 | RWMutex + sync.Map
Session Manager | 64 | RWMutex + sync.Map
Plan Cache | 8 | LRU per shard
Rate Limiter | 32 | Immutable whitelist

Lock-Free Patterns

Pattern | Where Used | Mechanism
atomic.Pointer | ServiceManager, BucketFileManager | Lock-free singleton access via atomic load/store
sync.Map | Page cache fast path, scanner registry, session indexes | Lock-free reads, amortized-lock writes
Copy-outside-lock | Page cache writes | Prepare new state outside critical section, brief lock for pointer swap
Double-checked locking | Manifest creation | RLock fast-path check, then Lock + re-check for initialization
Atomic offset reservation | Write buffer | Writers atomically claim a region in the buffer without any mutex
RCU (Read-Copy-Update) | Write path, reader views | Immutable snapshots published atomically, old versions reclaimed after grace period

Why Sharding Works

With 64 shards, even at 60 concurrent connections, the expected number of concurrent accesses per shard is less than 1. This virtually eliminates lock contention. The hash function (xxhash) provides uniform distribution, ensuring no hot shards under random access patterns.


12. Server & Wire Protocol

SyndrDB uses a custom TCP wire protocol designed for low-latency command execution, pipelining, and streaming of large result sets.

Connection Lifecycle

Connection Flow
TCP Accept
Parse Connection String
Authenticate
Create Session
Command Loop: Read → Parse → Execute → Send Result
Disconnect → Cleanup Session → Release Locks

Wire Protocol Format

Feature | Detail
Command Terminator | \x04 (EOT); a literal \x04 in the payload is escaped as \x04\x04
Parameter Delimiter | \x05 (ENQ) separates prepared statement parameters
Pipeline Mode | Client sends multiple commands; server responds to each with a READY\n sentinel after completion
Compression | Optional zstd compression via compress=zstd in the connection string

Connection String

syndrdb://host:port:database:user:password[:options]

Options (colon-separated key=value):
  compress=zstd        Enable zstd compression
  pipeline=true        Enable pipeline mode
  streaming=chunked    Enable streaming protocol

Streaming Protocol

For large result sets, streaming avoids materializing the entire result in memory:

Streaming Protocol (STREAM:v1)
STREAM:v1\n — header (negotiated)
CHUNK:<len>\n<data> — uncompressed chunk
ZCHUNK:<comp>:<uncomp>\n<data> — zstd compressed chunk
END:<count>,<timeMS>\n — terminator with stats

The streaming chunk size defaults to 256 documents. The execution engine pulls documents from an IteratorNode, batches them into chunks, and sends each chunk over the wire as it's produced.

Session Manager

Sessions are managed in a 64-shard storage with per-shard RWMutex. Lock-free secondary indexes (via sync.Map) allow fast lookup by username or connection ID. Each session is cryptographically bound to the client's IP address and user-agent fingerprint.

Rate Limiting & Throttling

  • Per-IP rate limiting — 32-shard design with immutable whitelist set and atomic global connection counter
  • Large query throttling — semaphore limits concurrent full scans to 15, preventing any single query pattern from starving others

Setting | Default | Description
maxConnections | 1000 | Maximum concurrent connections
streamingChunkSize | 256 | Documents per streaming chunk
maxOpenCursorsPerSession | 64 | Cursor limit per session
queryTimeoutSeconds | 300 | Maximum query execution time