1. System Overview
SyndrDB processes every request through a layered pipeline. A client connection arrives over TCP, is authenticated and bound to a session, and its command is routed through the parser, planner, and execution engine. Results flow back through the wire protocol. Cross-cutting concerns — MVCC, WAL, locking, and indexes — integrate at each layer.
2. Query Pipeline — From SQL to Results
Every SyndrQL statement follows a deterministic pipeline from raw text to executed results. The pipeline is designed for zero-allocation token scanning, modular parsing, and cost-based plan selection.
Tokenizer
Character-by-character scanning produces a flat list of tokens. The tokenizer distinguishes between == (equality comparison, TOKEN_EQ) and = (assignment, TOKEN_ASSIGN), and treats * as TOKEN_MULTIPLY — context determines whether it means "all fields" or multiplication.
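The `==` vs `=` distinction comes down to maximal-munch scanning. The sketch below is illustrative, not SyndrDB's actual tokenizer; only the token names (TOKEN_EQ, TOKEN_ASSIGN, TOKEN_MULTIPLY) come from the text.

```go
package main

import "fmt"

// Token kinds named in the text; the scanner below is a sketch,
// not SyndrDB's actual implementation.
type TokenKind string

const (
	TOKEN_EQ       TokenKind = "TOKEN_EQ"       // ==
	TOKEN_ASSIGN   TokenKind = "TOKEN_ASSIGN"   // =
	TOKEN_MULTIPLY TokenKind = "TOKEN_MULTIPLY" // *
)

// scanOperator applies maximal munch: at position i, a '=' followed by
// another '=' becomes one TOKEN_EQ; a lone '=' is TOKEN_ASSIGN.
func scanOperator(src string, i int) (TokenKind, int) {
	switch src[i] {
	case '=':
		if i+1 < len(src) && src[i+1] == '=' {
			return TOKEN_EQ, i + 2
		}
		return TOKEN_ASSIGN, i + 1
	case '*':
		return TOKEN_MULTIPLY, i + 1
	}
	return "", i + 1
}

func main() {
	kind, next := scanOperator("==", 0)
	fmt.Println(kind, next) // TOKEN_EQ 2
	kind, _ = scanOperator("=", 0)
	fmt.Println(kind) // TOKEN_ASSIGN
}
```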
Parsers
SyndrDB uses dedicated parsers per statement type rather than a single monolithic grammar. Each parser consumes tokens and produces a typed query struct:
- SELECT parser — handles projections, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT/OFFSET, JOINs, subqueries
- INSERT / UPDATE / DELETE parsers — DML operations with expression-based WHERE clauses
- Transaction parser — BEGIN, COMMIT, ROLLBACK, SAVEPOINT
- Cursor parser — DECLARE, FETCH, CLOSE
- Trigger parser — CREATE/DROP/ENABLE/DISABLE TRIGGER
- DDL parsers — CREATE/DROP BUNDLE, CREATE INDEX, CREATE VIEW
Expression AST
WHERE clauses, HAVING filters, and computed expressions are represented as a tree of expression nodes:
| Node Type | Purpose | Example |
|---|---|---|
BinaryExpression | Two-operand comparison or logic | age >= 25 AND active == true |
UnaryExpression | NOT, IS NULL, EXISTS | NOT EXISTS (...) |
IdentifierExpression | Field reference | name |
QualifiedIdentifierExpression | Table-qualified field | "users"."name" |
LiteralExpression | Constant value | 42, "hello" |
SubqueryExpression | Nested SELECT | IN (SELECT ...) |
Cost Model
The query planner evaluates candidate plans using a cost model that considers:
- CPU cost — per-row evaluation overhead (expression complexity, function calls)
- I/O cost — pages to read from storage (full scan vs. index lookup)
- Memory cost — intermediate result buffering (hash tables, sort buffers)
- Selectivity estimation — HyperLogLog cardinality and histogram statistics guide row-count predictions
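The four factors above might combine as in the following sketch. The linear formula, the weights, and the field names are illustrative assumptions, not SyndrDB's actual cost model.

```go
package main

import "fmt"

// PlanCost is a hedged sketch of the cost factors listed above; the
// weights and linear formula are assumptions, not SyndrDB's real model.
type PlanCost struct {
	CPUPerRow float64 // per-row expression/function evaluation overhead
	IOPages   float64 // pages to read from storage
	MemBytes  float64 // hash table / sort buffer footprint
}

// Total estimates plan cost given an estimated row count, which would be
// derived from selectivity (HyperLogLog cardinality and histograms).
func Total(c PlanCost, estRows float64) float64 {
	const ioWeight, memWeight = 4.0, 0.001 // assumed relative weights
	return c.CPUPerRow*estRows + ioWeight*c.IOPages + memWeight*c.MemBytes
}

func main() {
	fullScan := PlanCost{CPUPerRow: 1.0, IOPages: 1000}
	indexScan := PlanCost{CPUPerRow: 1.2, IOPages: 30}
	// Selectivity estimation predicts 500 of 1M rows match, so the
	// index plan wins despite higher per-row CPU cost.
	fmt.Println(Total(fullScan, 1_000_000) > Total(indexScan, 500)) // true
}
```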
3. Execution Engine — Nodes & Interfaces
SyndrDB's execution engine uses a node-based architecture where each operation (scan, filter, sort, aggregate) is represented as a composable node. Nodes can be stacked into execution trees where data flows from leaf nodes (scans) up through processing nodes to the root.
Three Execution Interfaces
Different query patterns need different data-flow models. SyndrDB provides three interfaces:
| Interface | Method | Returns | Use Case |
|---|---|---|---|
| ExecutionNode (materialized map) | Execute(ctx) | map[string]*Document | General queries with random access by doc ID |
| SliceExecutionNode (scan-optimized slice) | ExecuteSlice(ctx) | []*Document, []string | Full scans where map overhead is unnecessary |
| IteratorNode (Volcano pull-based) | Init(ctx), Next(), Close() | *Document, error | Streaming, cursors, memory-bounded execution |
The IteratorNode uses Volcano-model semantics: Next() returns one document per call, and (nil, nil) signals end-of-data. Nodes implementing IterableNode can produce iterators via AsIterator().
Execution Node Types
| Node | Purpose | Key Optimization |
|---|---|---|
FullScanNode | Sequential scan of all pages | Predicate pushdown into scanner, projection pushdown |
IndexScanNode | Hash/BTree/BRIN index lookup | Falls back to full scan if index miss |
BRINScanNode | Block-range skip scan | Skips entire page ranges based on min/max summaries |
IndexOnlyScanNode | Answers query from index alone | Zero page reads when index covers all projected fields |
BTreeOrderedScanNode | Pre-sorted range traversal | Avoids in-memory sort for ORDER BY on indexed column |
FilterNode | WHERE expression evaluation | SIMD batch evaluation for simple predicates |
AggregationNode | GROUP BY with hash/sort strategy | Streaming aggregation, SIMD vectorized SUM/MIN/MAX |
SortNode | ORDER BY | Radix sort, parallel merge sort, SIMD-accelerated comparisons |
LimitNode | LIMIT / OFFSET | Short-circuits upstream execution |
DistinctNode | DISTINCT deduplication | Hash-based with pre-sorted optimization |
JoinExecutionNode | Hash join | Build-side selection, predicate pushdown |
CorrelatedSubqueryNode | IN/EXISTS subqueries | Hash semi/anti-join rewriting (O(N+M) vs O(N*M)) |
SIMD Acceleration
Performance-critical paths use SIMD (Single Instruction, Multiple Data) via the syndrdb-simd library:
- Batch predicate evaluation — evaluates WHERE conditions on entire document batches
- Vectorized aggregation — SUM, MIN, MAX on int64 columns processed in SIMD lanes
- Compound predicate bitmaps — AND/OR of multiple conditions via bitmap operations
- SIMD string operations — UPPER/LOWER with ASCII fast path
- Accelerated sorting — SIMD-assisted comparisons for radix and parallel merge sort
4. Plan Cache — Adaptive Query Optimization
Query planning is expensive (cost estimation, index selection, join ordering). SyndrDB caches execution plans to amortize this cost across repeated queries.
8-Shard LRU
The cache is divided into 8 independent shards, each with its own LRU eviction and mutex. The shard is selected by xxhash(queryText) % 8. This reduces lock contention under high concurrency — 8 concurrent planners can each hit a different shard without blocking.
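Shard selection is a hash-and-modulo, as sketched below. SyndrDB uses xxhash; FNV-1a from the standard library stands in here so the sketch is self-contained — the sharding logic is the same either way.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 8

// shardFor picks the plan-cache shard for a query text. The real system
// hashes with xxhash; FNV-1a is a stdlib stand-in for illustration.
func shardFor(queryText string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(queryText))
	return h.Sum64() % numShards
}

func main() {
	q1 := `SELECT * FROM "users" WHERE age >= 25`
	q2 := `SELECT * FROM "orders" WHERE status == "pending"`
	// Different query texts usually land on different shards, so
	// concurrent planners rarely contend on the same shard mutex.
	fmt.Println(shardFor(q1), shardFor(q2))
}
```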
Adaptive Generic/Custom Planning
Inspired by PostgreSQL, SyndrDB uses an adaptive strategy:
- First 5 executions — always use a custom plan (parameter-specific)
- After 5 executions — compare generic plan cost vs. average custom plan cost
- If generic is cheaper — switch to generic plan (parameter-independent, reusable)
- Periodic re-evaluation — if statistics change, reconsider the choice
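The decision rule above can be sketched as a small state machine. The struct, field names, and cost inputs are illustrative assumptions; only the five-execution threshold and the generic-vs-average-custom comparison come from the text.

```go
package main

import "fmt"

// planStats sketches the adaptive generic/custom choice described above.
// Names and the cost inputs are illustrative, not SyndrDB's real API.
type planStats struct {
	executions  int
	customRuns  int
	customTotal float64 // sum of custom-plan costs observed so far
}

func (s *planStats) choosePlan(genericCost, customCost float64) string {
	s.executions++
	if s.executions <= 5 {
		s.customRuns++
		s.customTotal += customCost
		return "custom" // first five executions: always parameter-specific
	}
	if genericCost <= s.customTotal/float64(s.customRuns) {
		return "generic" // reusable, parameter-independent plan wins
	}
	s.customRuns++
	s.customTotal += customCost
	return "custom"
}

func main() {
	var s planStats
	for i := 0; i < 5; i++ {
		fmt.Println(s.choosePlan(90, 100)) // custom
	}
	fmt.Println(s.choosePlan(90, 100)) // generic: 90 <= avg custom cost 100
}
```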
Lazy Invalidation
Each bundle tracks a version number that increments on schema changes, index creation/deletion, or statistics refresh. Cached plans store the version at creation time. On cache hit, if the plan's version is stale, a fresh plan is built. During rebuild, the stale plan continues serving read queries to avoid latency spikes.
| Setting | Default | Description |
|---|---|---|
planCacheCapacity | 1000 per shard | Max entries before LRU eviction |
planCacheEnabled | true | Enable/disable plan caching |
5. Storage Engine — Segments, Pages & Write Path
SyndrDB stores documents in append-only binary segment files, organized by bundle. The storage engine is designed for sequential write throughput and efficient page-level reads.
On-Disk Layout
| Component | Format | Purpose |
|---|---|---|
bundle.manifest | JSON | Tracks all segment files, document counts, bloom filter state |
*.bnd | Binary (BSON) | Append-only segment files containing document data |
sorted_index.idx | Binary | Sharded sorted index for O(log n) pageID calculation |
Segment Files
Documents are serialized as BSON and appended to the current active segment file. When a segment reaches the maximum size (default 32MB), a new segment is created. Old segments are immutable — compaction merges them to reclaim space from deleted/superseded versions.
Document Pages
In memory, documents are organized into pages of approximately 4,096 documents each. Each page provides two access patterns:
- Documents map[string]Document — keyed by document ID for random access
- DocumentSlice []Document — flat array for scan-optimized sequential access
Pages form a linked list via NextPageID / PreviousPageID pointers.
Write Buffer
Writes use a double-buffered design for zero-contention I/O:
Writers atomically reserve an offset in the active buffer and write via pwrite — no mutex needed. When the buffer is flushed, the active and back buffers swap atomically. The background flusher writes the back buffer to disk without blocking new writes.
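The offset reservation step can be sketched as a single atomic add: each writer claims a disjoint region, then copies its payload with no mutex. Buffer swapping and the background flusher are omitted; names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// writeBuffer sketches atomic offset reservation: one atomic add claims
// a disjoint region per writer. Double-buffer swapping is omitted.
type writeBuffer struct {
	buf []byte
	off atomic.Int64 // next free offset in the active buffer
}

// reserve claims n bytes and returns the region's start offset, or -1 if
// the buffer is full (the real system would swap to the back buffer).
func (w *writeBuffer) reserve(n int) int64 {
	start := w.off.Add(int64(n)) - int64(n)
	if start+int64(n) > int64(len(w.buf)) {
		return -1
	}
	return start
}

func (w *writeBuffer) write(p []byte) bool {
	start := w.reserve(len(p))
	if start < 0 {
		return false
	}
	copy(w.buf[start:], p) // region is exclusively ours: no lock needed
	return true
}

func main() {
	w := &writeBuffer{buf: make([]byte, 1<<16)}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.write([]byte("0123456789"))
		}()
	}
	wg.Wait()
	fmt.Println(w.off.Load()) // 1000: 100 writers x 10 bytes, no overlap
}
```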
| Setting | Default | Description |
|---|---|---|
bundleFileMaxSizeMB | 32 | Segment file rotation threshold |
maxLoadedDocumentPages | 500 | Max pages in memory before eviction |
6. Page Cache — 64-Shard Lock-Free Design
The page cache is the most contended data structure in SyndrDB — every query touches it. Its design prioritizes lock-free reads under high concurrency.
Read Path (Zero Contention)
Reads first attempt sync.Map.Load() which is a lock-free atomic load. Under read-heavy workloads (the common case), no locks are ever acquired. On miss, the shard's RWMutex is taken for a read lock to consult the authoritative map.
Write Path (Copy-Outside-Lock)
Writes follow a copy-outside-lock pattern: the new page state is prepared without holding any lock, then a brief write lock on the target shard updates both the authoritative map and sync.Map atomically. This minimizes the critical section to a pointer swap.
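Both paths can be sketched together as below. The shard shape (authoritative map plus `sync.Map` fast path) follows the description in this section; the concrete types and method names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// pageShard sketches the pattern described above: sync.Map serves
// lock-free reads, and writes prepare state outside the critical section.
// Types and names are illustrative, not SyndrDB's real ones.
type Page struct{ ID, Version int }

type pageShard struct {
	mu    sync.RWMutex
	pages map[int]*Page // authoritative map
	fast  sync.Map      // lock-free read path: pageID -> *Page
}

func (s *pageShard) Get(id int) (*Page, bool) {
	if v, ok := s.fast.Load(id); ok { // common case: no lock acquired
		return v.(*Page), true
	}
	s.mu.RLock() // miss: consult the authoritative map under a read lock
	defer s.mu.RUnlock()
	p, ok := s.pages[id]
	return p, ok
}

func (s *pageShard) Put(old *Page) {
	// Prepare the new page state entirely outside the critical section.
	updated := &Page{ID: old.ID, Version: old.Version + 1}
	s.mu.Lock() // brief: two map stores, effectively a pointer swap
	s.pages[updated.ID] = updated
	s.fast.Store(updated.ID, updated)
	s.mu.Unlock()
}

func main() {
	s := &pageShard{pages: map[int]*Page{}}
	s.Put(&Page{ID: 7, Version: 0})
	p, _ := s.Get(7)
	fmt.Println(p.Version) // 1
}
```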
COW Snapshots for GROUP BY
GROUP BY queries need a consistent view of page data while concurrent writes may be modifying pages. The cache provides copy-on-write snapshots: an immutable []Document array is created from the page and cached with a staleness timestamp. Multiple concurrent GROUP BY queries share the same snapshot if the page hasn't changed.
7. Index System — Hash (LSM), B-Tree, BRIN
SyndrDB supports three index types, each optimized for different access patterns. All indexes support partial indexes (WHERE clause), functional expressions (LOWER, YEAR, arithmetic), and INCLUDE columns for covering queries.
Hash Index V3 (LSM Architecture)
The index is layered LSM-style:
- MemTable — in-memory, 100K max entries
- Disk buckets — 256 buckets, append-only
- Compacted files — merged, deduplicated
O(1) average lookup for equality queries (field == value). Uses an LSM-tree approach:
- Write path: append entry to disk bucket → update MemTable → check compaction threshold
- Read path: check MemTable → scan bucket files backward (newest first) → cache result
- MVCC-aware: reads filter by CommitSequence to return only visible versions
B-Tree V2
O(log n) lookup for range queries, ORDER BY, and unique constraints. B+ tree with linked leaf nodes for efficient range traversal:
- Page-based storage: 8KB pages, metadata page 0, LRU page cache (1000 pages default)
- WAL for crash recovery: separate B-tree WAL with CRC32 checksums
- Range queries: O(log n) search to first matching leaf + O(k) sequential traversal
BRIN (Block Range INdex)
Example block-range summaries: [min:1 max:500] → [min:480 max:1200] → [min:900 max:1500] → [min:1400 max:2000].
One entry per ~128 pages storing min/max values, NULL tracking, and document count. Ideal for naturally ordered data (timestamps, auto-incrementing IDs). Tiny footprint: ~250 entries per 1M documents.
Index Comparison
| Type | Best For | Complexity | Implementation |
|---|---|---|---|
| Hash V3 | Equality (field == value) | O(1) avg | LSM: MemTable + append-only buckets |
| B-Tree V2 | Range, ORDER BY, unique | O(log n) | B+ tree with linked leaves, page cache |
| BRIN | Range on ordered data | O(ranges) | Block-range min/max summaries |
8. MVCC — Multi-Version Concurrency Control
Every write creates a new version of a document rather than overwriting in place. Readers see a consistent snapshot without blocking writers, and writers don't block readers.
Document Version Fields
| Field | Type | Purpose |
|---|---|---|
CommitSequence | uint64 | Global monotonic sequence assigned at commit |
VersionSequence | uint64 | Per-document version counter (1, 2, 3...) |
CreatedByTxID | uint64 | Transaction that created this version |
DeletedByTxID | uint64 | Transaction that deleted this version |
SupersededAt | time.Time | Timestamp when replaced by a newer version (zero = current) |
Visibility Rules
A document version is visible to a transaction's snapshot if the transaction created it, or if all of the remaining conditions hold:
- CreatedByTxID == myTxID — own writes are always visible (short-circuits the rest)
- CommitSequence <= snapshotSeq — the version committed before the snapshot was taken
- CreatedByTxID is not in the snapshot's active transaction set
- DeletedByTxID == 0, or the deletion committed after the snapshot
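The rules can be sketched as a single predicate over the version fields in the table above. The `Snapshot` type and the `DeleteSeq` field (commit sequence of the deleting transaction) are assumptions for illustration.

```go
package main

import "fmt"

// Snapshot is an assumed shape for a transaction's snapshot state.
type Snapshot struct {
	SnapshotSeq uint64
	MyTxID      uint64
	ActiveTxIDs map[uint64]bool // transactions in flight at snapshot time
}

// Version mirrors the MVCC fields in the table above; DeleteSeq (commit
// sequence of the deleting tx) is an illustrative assumption.
type Version struct {
	CommitSequence uint64
	CreatedByTxID  uint64
	DeletedByTxID  uint64
	DeleteSeq      uint64
}

func visible(v Version, s Snapshot) bool {
	if v.CreatedByTxID == s.MyTxID {
		return true // own writes are always visible
	}
	if v.CommitSequence > s.SnapshotSeq {
		return false // committed after our snapshot was taken
	}
	if s.ActiveTxIDs[v.CreatedByTxID] {
		return false // creator was still in flight at snapshot time
	}
	if v.DeletedByTxID != 0 && v.DeleteSeq <= s.SnapshotSeq {
		return false // the deletion is visible to us
	}
	return true
}

func main() {
	s := Snapshot{SnapshotSeq: 300, MyTxID: 42, ActiveTxIDs: map[uint64]bool{99: true}}
	fmt.Println(visible(Version{CommitSequence: 100}, s)) // true
	fmt.Println(visible(Version{CommitSequence: 500}, s)) // false
}
```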
Version Chain
A document's version chain, oldest to newest: CommitSeq 100 (superseded) → CommitSeq 250 (superseded) → CommitSeq 500 (active).
Dead Version Reclamation (Vacuum)
Old versions that are no longer visible to any active transaction are cleaned up by the vacuum process:
- isDeadVersion() checks: superseded + grace period elapsed + commitSequence < oldest active snapshot
- RemoveDeadVersionsFromPage() performs in-memory cleanup at the page level
- Configurable via vacuumDeadRatioThreshold (default 0.3) and vacuumMaxPagesPerCycle (default 100)
HOT Updates
When an UPDATE modifies only non-indexed fields, SyndrDB skips the index update entirely (Heap-Only Tuple optimization). This avoids index maintenance overhead for common "update a status field" patterns.
9. WAL — Write-Ahead Log & Crash Recovery
The WAL guarantees durability: every state-changing operation is recorded to the log before the in-memory state is modified. On crash, the WAL is replayed to recover to the last consistent state.
WAL Entry Format
| TxID | OpType | BundleName | DocID | Before | After | Timestamp | CRC32 |
|---|---|---|---|---|---|---|---|
| (uint64) | (byte) | (string) | (str) | (data) | (data) | (int64) | (4B) |
Each entry is self-describing with a CRC32 checksum for corruption detection. The Before field stores the pre-modification state for undo-based rollback.
Three Durability Modes
fsync after every op
Safest, slowest
Group commit
10x fewer fsyncs
Async flush
Fastest, risk of loss
Group Commit (Balanced Mode)
Multiple concurrent transactions share a single fsync: their WAL entries are batched into a double-buffered write pipeline, and one fsync durably commits the entire batch.
Crash Recovery
On startup, the recovery process:
- Finds the last checkpoint marker in the WAL
- Replays all WAL entries after that checkpoint
- Reloads affected bundles from their segment files
- Rolls back any incomplete transactions
Write Coordinator
Three background goroutines manage the WAL lifecycle:
| Goroutine | Purpose |
|---|---|
| WAL Writer | Drains the entry queue, writes to log file, triggers group commit |
| Background Writer | Periodically flushes dirty pages from cache to segment files |
| Checkpointer | Writes checkpoint markers, enables WAL file rotation |
| Setting | Default | Description |
|---|---|---|
walEnabled | true | Enable/disable WAL |
durabilityMode | balanced | strict / balanced / performance |
walMaxFileSizeMB | 100 | WAL file rotation threshold |
10. Transaction System — ACID Guarantees
SyndrDB provides full ACID transactions with three isolation levels, undo-based rollback, and document-level write locks.
Transaction Lifecycle
A transaction begins with BEGIN TRANSACTION, acquires document-level write locks as it modifies documents, and ends with COMMIT or with ROLLBACK, which undoes changes using the Before images recorded in the WAL.
Isolation Levels
| Level | Behavior | Use Case |
|---|---|---|
| READ COMMITTED | Each statement sees the latest committed data | Simple read workloads, low contention |
| REPEATABLE READ (default) | Snapshot captured at BEGIN, all reads see the same point-in-time | Consistent reporting, analytics |
| SERIALIZABLE | SSI (Serializable Snapshot Isolation) detects read/write conflicts | Financial transactions, strict consistency |
Serializable Snapshot Isolation (SSI)
SERIALIZABLE uses a technique called SSI to detect anomalies without blocking reads:
- SIREAD locks — recorded after SELECT execution, tracking which documents were read
- rw-antidependency tracking — when a write conflicts with another transaction's SIREAD, an edge is recorded
- Dangerous structure detection — at COMMIT, checks for cycles in the dependency graph
- Abort policy — the transaction that creates a dangerous structure is aborted with a serialization error
Deadlock Detection
Document-level write locks can create deadlock situations. SyndrDB detects these in real-time:
- Wait-for graph — when a transaction blocks on a lock, an edge is added to the dependency graph
- DFS cycle detection — runs on every new wait edge, not periodically
- Victim selection — the youngest transaction in the cycle is aborted (least work lost)
- Channel-based waiting — blocked transactions wait on a channel rather than polling
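The wait-for graph with DFS-on-every-edge can be sketched as below. Victim selection and channel-based waiting are omitted; the types and names are illustrative.

```go
package main

import "fmt"

// waitForGraph sketches the deadlock detector described above: an edge
// waiter -> holder is added when a transaction blocks, and a DFS from
// the new edge looks for a cycle. Victim selection is omitted.
type waitForGraph struct {
	edges map[uint64][]uint64 // txID -> txIDs it waits for
}

// addEdge records that waiter blocks on holder and reports whether the
// new edge closes a cycle (a deadlock).
func (g *waitForGraph) addEdge(waiter, holder uint64) bool {
	g.edges[waiter] = append(g.edges[waiter], holder)
	seen := map[uint64]bool{}
	var dfs func(tx uint64) bool
	dfs = func(tx uint64) bool {
		if tx == waiter {
			return true // reached the new waiter again: cycle found
		}
		if seen[tx] {
			return false
		}
		seen[tx] = true
		for _, next := range g.edges[tx] {
			if dfs(next) {
				return true
			}
		}
		return false
	}
	return dfs(holder)
}

func main() {
	g := &waitForGraph{edges: map[uint64][]uint64{}}
	fmt.Println(g.addEdge(1, 2)) // false: 1 -> 2, no cycle yet
	fmt.Println(g.addEdge(2, 3)) // false
	fmt.Println(g.addEdge(3, 1)) // true: 3 -> 1 -> 2 -> 3 is a deadlock
}
```

Running detection only when a new wait edge appears (rather than on a timer) means deadlocks are caught immediately, at the cost of a DFS whose work is bounded by the number of currently blocked transactions.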
Savepoints
Single-level savepoints allow partial rollback within a transaction:
BEGIN TRANSACTION;
ADD DOCUMENT TO BUNDLE "orders" WITH ({...});
SAVEPOINT "before_update";
UPDATE DOCUMENTS IN BUNDLE "orders" (...) CONFIRMED WHERE status == "pending";
-- Oops, wrong update
ROLLBACK TO SAVEPOINT "before_update";
-- orders table is restored to the savepoint state
COMMIT;
11. Concurrency Architecture — Shards, Atomics & Lock-Free Patterns
SyndrDB is designed for high-concurrency workloads. The master pattern is sharded access with lock-free reads: split data structures into independent shards, and use atomic operations for the read path so that readers never block.
Sharding Overview
| Structure | Shards | Mechanism |
|---|---|---|
| Page cache | 64 | RWMutex + sync.Map |
| Session manager | 64 | RWMutex + sync.Map |
| Plan cache | 8 | LRU per shard |
| Rate limiter | 32 | Immutable whitelist |
Lock-Free Patterns
| Pattern | Where Used | Mechanism |
|---|---|---|
| atomic.Pointer | ServiceManager, BucketFileManager | Lock-free singleton access via atomic load/store |
| sync.Map | Page cache fast path, scanner registry, session indexes | Lock-free reads, amortized-lock writes |
| Copy-outside-lock | Page cache writes | Prepare new state outside critical section, brief lock for pointer swap |
| Double-checked locking | Manifest creation | RLock fast-path check, then Lock + re-check for initialization |
| Atomic offset reservation | Write buffer | Writers atomically claim a region in the buffer without any mutex |
| RCU (Read-Copy-Update) | Write path, reader views | Immutable snapshots published atomically, old versions reclaimed after grace period |
Why Sharding Works
With 64 shards, even at 60 concurrent connections, the expected number of concurrent accesses per shard is less than 1. This virtually eliminates lock contention. The hash function (xxhash) provides uniform distribution, ensuring no hot shards under random access patterns.
12. Server & Wire Protocol
SyndrDB uses a custom TCP wire protocol designed for low-latency command execution, pipelining, and streaming of large result sets.
Connection Lifecycle
A client connects over TCP, is authenticated and bound to a session, and then issues commands terminated by \x04; results (or streaming chunks) flow back over the same connection until the client disconnects and the session is closed.
Wire Protocol Format
| Feature | Detail |
|---|---|
| Command Terminator | \x04 (EOT). Literal \x04 escaped as \x04\x04 |
| Parameter Delimiter | \x05 (ENQ) separates prepared statement parameters |
| Pipeline Mode | Client sends multiple commands; server responds to each with READY\n sentinel after completion |
| Compression | Optional zstd compression via compress=zstd in connection string |
Connection String
syndrdb://host:port:database:user:password[:options]
Options (colon-separated key=value):
- compress=zstd — enable zstd compression
- pipeline=true — enable pipeline mode
- streaming=chunked — enable streaming protocol
Streaming Protocol
For large result sets, streaming avoids materializing the entire result in memory:
- STREAM:v1\n — header (negotiated)
- CHUNK:<len>\n<data> — uncompressed chunk
- ZCHUNK:<comp>:<uncomp>\n<data> — zstd-compressed chunk
- END:<count>,<timeMS>\n — terminator with stats
The streaming chunk size defaults to 256 documents. The execution engine pulls documents from an IteratorNode, batches them into chunks, and sends each chunk over the wire as it's produced.
Session Manager
Sessions are managed in a 64-shard storage with per-shard RWMutex. Lock-free secondary indexes (via sync.Map) allow fast lookup by username or connection ID. Each session is cryptographically bound to the client's IP address and user-agent fingerprint.
Rate Limiting & Throttling
- Per-IP rate limiting — 32-shard design with immutable whitelist set and atomic global connection counter
- Large query throttling — semaphore limits concurrent full scans to 15, preventing any single query pattern from starving others
| Setting | Default | Description |
|---|---|---|
maxConnections | 1000 | Maximum concurrent connections |
streamingChunkSize | 256 | Documents per streaming chunk |
maxOpenCursorsPerSession | 64 | Cursor limit per session |
queryTimeoutSeconds | 300 | Maximum query execution time |