How to Build a Version-Controlled Database with Prolly Trees

Introduction

Modern databases and filesystems rely heavily on B-trees for efficient storage and retrieval of sorted key-value pairs. However, traditional B-trees have no built-in version control, which makes it difficult to track changes over time or to branch and merge datasets. Dolt, an open-source database released under the Apache 2.0 license, uses a B-tree variant called a Prolly tree (probabilistic B-tree) to version-control an entire database. This guide walks you through the conceptual steps to implement a similar system, from understanding the fundamentals to handling branching and merging. Whether you are building a new database or adding versioning to an existing one, these steps will help you use Prolly trees for efficient, immutable data versioning.

What You Need

Before diving into implementation, ensure you have the following:

Solid grasp of data structures – Familiarity with B-trees, hash functions, and tree traversal.
Programming language proficiency – Choose a language with good support for binary data (e.g., C++, Rust, Python, Go).
Understanding of version control concepts – Basic ideas of commits, branches, diffs, and merges (like in Git).
Storage layer – A block-based storage system (e.g., local filesystem, S3, or a key-value store) to persist nodes.
Hashing library – SHA-256 or similar for generating content hashes.
Time and patience – Building a versioned database is non-trivial; expect iterations.

Step-by-Step Guide

Step 1: Understand B-tree Limitations for Versioning

Standard B-trees update nodes in-place, meaning that when you insert or delete a key, the changed node is overwritten. This destroys previous states, making it impossible to retrieve historical versions. To support version control, you need persistent or copy-on-write data structures that preserve all previous versions. The key insight is to make every node immutable: instead of updating a node, create a new version of it that shares unchanged subtrees. This is where Prolly trees come in.

Step 2: Introduce Prolly Trees – Probabilistic B-trees

A Prolly tree is a B-tree variant in which each node's content is hashed to produce an identifier that is, for practical purposes, unique. When a node’s content changes, its hash changes, so the edited node is stored as a new version alongside the old one. The tree structure relies on content-based addressing: nodes are stored by their hash, and a root hash (similar to a Git commit ID) identifies the entire database state. This lets you keep every version of the tree while storing only the nodes that differ between versions.
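To make content-based addressing concrete, here is a minimal sketch of an in-memory store that keys each node by the SHA-256 of its serialized bytes. The `ChunkStore` class, the JSON serialization, and the node field names are assumptions chosen for illustration; they are not Dolt's storage format.

```python
import hashlib
import json


class ChunkStore:
    """In-memory content-addressed store: nodes are keyed by the SHA-256 of their bytes."""

    def __init__(self):
        self._chunks = {}

    def put(self, node):
        # Canonical serialization so that equal nodes always produce equal bytes.
        data = json.dumps(node, sort_keys=True, separators=(",", ":")).encode("utf-8")
        digest = hashlib.sha256(data).hexdigest()
        self._chunks.setdefault(digest, data)  # identical content is stored only once
        return digest

    def get(self, digest):
        return json.loads(self._chunks[digest])

    def __len__(self):
        return len(self._chunks)


store = ChunkStore()
leaf_a = store.put({"leaf": True, "entries": [["a", 1], ["b", 2]]})
leaf_b = store.put({"leaf": True, "entries": [["a", 1], ["b", 2]]})  # same content
root = store.put({"leaf": False, "children": [["b", leaf_a]]})       # points to the leaf by hash
```

Writing the same leaf twice yields the same hash and a single stored copy, which is what lets separate versions of the tree share unchanged nodes.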

Step 3: Design Node Structure with Content Hashing and Reference Counting

Each node in a Prolly tree should contain:

Keys and values – The sorted key-value pairs that form the leaf or internal entries.
Child pointers – Not physical addresses, but content hashes of child nodes.
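Metadata – Node type (internal/leaf) and fanout parameters. If you use a reference count for garbage collection, keep it outside the hashed node content, because updating a count stored inside the node would change the node's hash.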
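The node layout listed above might be sketched as an immutable record whose address is derived from its content. The `Node` class, its field names, and the side table `refcounts` are hypothetical; the sketch assumes that reference counts are kept beside the nodes rather than inside the hashed bytes.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    is_leaf: bool
    keys: tuple      # sorted keys
    payload: tuple   # values for a leaf, child hashes for an internal node

    def serialize(self) -> bytes:
        body = {"leaf": self.is_leaf, "keys": list(self.keys), "payload": list(self.payload)}
        return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")

    def address(self) -> str:
        # The content hash doubles as the node's identifier and storage key.
        return hashlib.sha256(self.serialize()).hexdigest()


# Assumption of this sketch: reference counts live in a side table keyed by
# address, so changing a count never changes a node's bytes or its hash.
refcounts = {}

leaf = Node(True, ("a", "b"), (1, 2))
parent = Node(False, ("b",), (leaf.address(),))  # child pointer is a hash, not a memory address
refcounts[leaf.address()] = 1
```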

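Node sizes vary, so choose the split threshold so that the average node size is close to your storage block size (e.g., 4KB or 8KB). To decide where one node ends and the next begins, Prolly trees use a probabilistic technique: compute a hash of each key (or entry) and, if the hash falls below a threshold (e.g., it has a given number of leading zero bits), end the node after that entry. Because boundaries depend only on the data, the same set of entries produces the same tree regardless of insertion order, and the tree stays balanced statistically, without explicit rebalancing operations.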
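The probabilistic split rule can be sketched as follows. The boundary test (the first four bytes of a SHA-256 hash compared against a threshold) and the names `is_boundary`, `chunk`, and `target_size` are assumptions made for this illustration; production systems typically use tuned or rolling hash functions.

```python
import hashlib


def is_boundary(key, value, target_size=4):
    """Hypothetical rule: an entry closes its node when its hash falls below
    a threshold, which happens with probability of about 1 / target_size."""
    digest = hashlib.sha256(f"{key}={value}".encode("utf-8")).digest()
    threshold = 2**32 // target_size
    return int.from_bytes(digest[:4], "big") < threshold


def chunk(entries, target_size=4):
    """Split sorted (key, value) pairs into nodes at hash-determined boundaries."""
    nodes, current = [], []
    for key, value in entries:
        current.append((key, value))
        if is_boundary(key, value, target_size):
            nodes.append(current)
            current = []
    if current:
        nodes.append(current)  # the final node may end without a boundary
    return nodes


entries = [(f"key{i:03d}", i) for i in range(200)]
nodes = chunk(entries)

# Insert one entry in the middle and re-chunk from scratch.
edited = sorted(entries + [("key100x", -1)])
new_nodes = chunk(edited)
changed = [n for n in nodes if n not in new_nodes]
print(len(nodes), "nodes;", len(changed), "rewritten after one insert")
```

Because each boundary depends only on the entry itself, an insert disturbs at most the one node it lands in; every other node keeps the same content and therefore the same hash. A larger `target_size` lowers the threshold and produces larger nodes on average.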

Step 4: Implement Copy-on-Write Semantics for Updates

When inserting or deleting a key:

Start from the root node, read it (by its hash).
If the node is a leaf, create a new leaf node with the updated key-value set, re-applying the split rule because the edit may add or remove a node boundary. Compute its new hash.
If the node is internal, recursively update the appropriate child. Once the child’s new hash is known, create a new internal node with the updated child pointer and keys, then compute its hash.
Propagate upward. The final root hash becomes the new version identifier for the database.

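All unchanged nodes are shared between versions: an update writes only the nodes on the path from the root to the changed leaf. This makes checkouts, branches, and merges storage-efficient, since each new version costs roughly one new node per tree level.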
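The update procedure above can be sketched with path copying over a small content-addressed tree. The helpers `put`, `leaf`, `internal`, `child_index`, `upsert`, and `get` are hypothetical names, and the sketch deliberately omits re-splitting nodes after an edit, so it shows only the copy-on-write part.

```python
import hashlib
import json

store = {}  # hash -> node; stands in for the block storage layer


def put(node):
    data = json.dumps(node, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = node
    return digest


def leaf(entries):
    return put({"leaf": True, "entries": sorted(entries)})


def internal(children):
    # children: list of [max_key_in_subtree, child_hash]
    return put({"leaf": False, "children": children})


def child_index(children, key):
    """Index of the first child whose max key covers `key`, else the last child."""
    for i, (max_key, _) in enumerate(children):
        if key <= max_key:
            return i
    return len(children) - 1


def upsert(root_hash, key, value):
    """Return the root hash of a new tree in which `key` maps to `value`.
    No stored node is modified; only the root-to-leaf path is rewritten."""
    node = store[root_hash]
    if node["leaf"]:
        entries = dict(map(tuple, node["entries"]))
        entries[key] = value
        return leaf([[k, v] for k, v in entries.items()])
    children = [list(c) for c in node["children"]]  # copy, never edit in place
    i = child_index(children, key)
    max_key, child_hash = children[i]
    children[i] = [max(max_key, key), upsert(child_hash, key, value)]
    return internal(children)


def get(root_hash, key):
    node = store[root_hash]
    while not node["leaf"]:
        children = node["children"]
        node = store[children[child_index(children, key)][1]]
    return dict(map(tuple, node["entries"])).get(key)


l1 = leaf([["a", 1], ["b", 2]])
l2 = leaf([["c", 3], ["d", 4]])
v1 = internal([["b", l1], ["d", l2]])
v2 = upsert(v1, "c", 30)  # new version; v1 remains readable
```

After the update, `v1` and `v2` are both valid root hashes. They share the untouched leaf `l1`, and only one new leaf and one new root were written.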

Step 5: Handle Branching and Merging Using Tree Diffs

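To support multiple branches, each branch stores its own root hash. Creating a branch is as simple as recording a new name that points to an existing root hash. Merging two branches requires computing the diff between two Prolly trees. Because nodes are content-addressed, you can compare two root hashes and descend recursively, skipping any subtree whose hash is identical on both sides, until you reach the divergent leaves. For leaf entries, you perform a three-way merge (common ancestor, branch A, branch B) and flag a conflict when both branches changed the same key in different ways. Rather than merging internal nodes directly, one workable approach is to apply the merged edits to one branch's tree through the normal copy-on-write path, which rebuilds the affected internal nodes and re-splits them where necessary. The result is a new tree representing the merged state.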
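The three-way merge of leaf entries can be sketched on plain dictionaries, assuming the tree diff has already narrowed the work down to the key-value pairs that differ. The function name `three_way_merge`, the `MISSING` sentinel, and the "keep ours and report the conflict" policy are choices made for this illustration only.

```python
MISSING = object()  # sentinel meaning "key absent in this version"


def three_way_merge(base, ours, theirs):
    """Merge two edited key-value maps against their common ancestor.
    Returns (merged, conflicts); conflicts lists keys changed differently on both sides."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, MISSING)
        o = ours.get(key, MISSING)
        t = theirs.get(key, MISSING)
        if o == t:
            result = o              # both sides agree (including both deleting)
        elif o == b:
            result = t              # only theirs changed
        elif t == b:
            result = o              # only ours changed
        else:
            conflicts.append(key)   # both changed, in different ways
            result = o              # placeholder policy: keep ours, report the conflict
        if result is not MISSING:
            merged[key] = result
    return merged, conflicts


base = {"a": 1, "b": 2, "c": 3, "e": 5}
ours = {"a": 10, "b": 2, "c": 3, "d": 4, "e": 6}   # changed a and e, added d
theirs = {"a": 1, "c": 30, "e": 7}                 # deleted b, changed c and e
merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)     # {'a': 10, 'c': 30, 'd': 4, 'e': 6}
print(conflicts)  # ['e']
```

A real system would surface the conflicting keys to the user or apply a configurable resolution strategy instead of silently preferring one side.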

Step 6: Optimize Storage and Garbage Collection

Over time, many node versions accumulate. To reclaim space from unreachable versions (nodes not referenced by any branch or tag), implement a garbage collector. Use a mark-and-sweep approach: start from all branch and tag root hashes and traverse reachable nodes, marking them. Then sweep the storage for unmarked nodes and delete them. Alternatively, use reference counting. Prolly trees are acyclic, so cycles cannot keep dead nodes alive, but the counts must be stored outside the nodes so that updating them does not change any node hash. Periodic garbage collection keeps storage efficient.
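A minimal mark-and-sweep pass over a hash-keyed store might look like the sketch below. The `collect_garbage` function, the `children` field, and the short placeholder strings used in place of real hashes are assumptions made for readability.

```python
def collect_garbage(store, roots):
    """Delete every node not reachable from the given root hashes.
    `store` maps hash -> node; internal nodes list child hashes under "children"."""
    marked = set()
    stack = list(roots)
    while stack:                      # mark phase: iterative traversal
        digest = stack.pop()
        if digest in marked:
            continue                  # shared subtree already visited
        marked.add(digest)
        stack.extend(store[digest].get("children", []))
    unreachable = [d for d in store if d not in marked]
    for digest in unreachable:        # sweep phase
        del store[digest]
    return len(unreachable)


# Placeholder identifiers stand in for real content hashes.
store = {
    "r1": {"children": ["l1", "l2"]},
    "r2": {"children": ["l1", "l3"]},   # shares l1 with r1
    "old": {"children": ["l4"]},        # root of a version no branch points to
    "l1": {"entries": [["a", 1]]},
    "l2": {"entries": [["b", 2]]},
    "l3": {"entries": [["b", 3]]},
    "l4": {"entries": [["z", 0]]},
}
branches = {"main": "r1", "dev": "r2"}
removed = collect_garbage(store, branches.values())
print(removed, sorted(store))  # 2 ['l1', 'l2', 'l3', 'r1', 'r2']
```

The traversal visits each shared node once, so the cost of the mark phase grows with the number of distinct reachable nodes rather than with the number of versions.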

Tips for Success

Start simple – Implement a single-threaded, in-memory Prolly tree first before adding persistence and concurrency.
Use test-driven development – Create a suite of tests for insertion, deletion, branching, and merging with known results.
Monitor node balance – Adjust the split probability threshold to control fanout. Smaller thresholds yield larger nodes (fewer levels); larger thresholds yield smaller nodes (more levels).
Consider serialization format – Use a compact binary format (e.g., Protocol Buffers, FlatBuffers) to minimize node size.
Optimize hash computation – Cache hashes of immutable nodes since they never change. This can speed up diff operations.
Study Dolt’s implementation – The Dolt source code (on GitHub) provides a real-world example of Prolly trees in action. Examine their node layout, chunking, and merge logic for inspiration.
Think about scalability – If your database grows beyond memory, ensure your storage backend (filesystem or object store) supports random reads with low latency.
Document your design – The interplay of hashing, copy-on-write, and merging can be complex; clear documentation helps maintainability.

By following these steps, you can build a database that inherently supports version control, branching, and merging, much as Dolt does. Prolly trees offer a practical foundation for immutable, deterministic storage that suits collaborative data systems.
