rqlite is a lightweight, open-source, distributed relational database written in Go. It is built on the Raft consensus protocol and uses SQLite as its storage engine.
Development of rqlite 9.0 has begun, with the aim of reducing disk usage by approximately 50%. This goal will be achieved through a high-level design overhaul targeting the primary causes of disk consumption in rqlite.
What drives current disk usage?
Today, rqlite’s disk space usage is driven by three main factors:
- Raft Log: The log of changes made to the system. This log is at the core of the Raft consensus system.
- Working SQLite Database: The live database that rqlite uses to serve reads and writes. Once a SQLite statement is successfully committed to the Raft log, that statement is applied to the working SQLite database.
- A snapshot of the working SQLite Database: To prevent the Raft log from growing without bound, the Raft subsystem within rqlite periodically generates and stores a point-in-time copy of the working SQLite database. This copy is known as a snapshot. Once the snapshot is taken, rqlite can truncate the Raft log. The snapshot is then used by rqlite to restore a node when it restarts, or is transmitted to another node when that node needs to “catch up” with the state of an existing rqlite cluster. Snapshotting and log truncation are core concepts in Raft-based systems.
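To get a feel for how these three components add up, here is a rough back-of-the-envelope sketch. It is illustrative only: the function names and example sizes are made up, it assumes the snapshot is the same size as the working database (it is a point-in-time copy of it), and it ignores smaller files such as the SQLite WAL.

```python
def total_disk_usage(db_bytes: int, raft_log_bytes: int) -> int:
    """Raft log + working database + snapshot (assumed equal in size to the database)."""
    snapshot_bytes = db_bytes  # assumption: the snapshot is a full copy of the database
    return raft_log_bytes + db_bytes + snapshot_bytes


def snapshot_share(db_bytes: int, raft_log_bytes: int) -> float:
    """Fraction of total disk usage taken up by the snapshot copy."""
    return db_bytes / total_disk_usage(db_bytes, raft_log_bytes)


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    # Hypothetical node: 4 GiB working database, 64 MiB of Raft log.
    print(f"snapshot share of disk: {snapshot_share(4 * GIB, 64 * MIB):.1%}")
```

For a node like the hypothetical one above, the snapshot accounts for just under half of the total, and the share approaches one half as the database grows relative to the Raft log.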
High-Level design for rqlite 9.0
The key strategy for reducing disk usage is to remove the need to store the snapshotted copy of the working SQLite database in the Raft system. While the Raft log is periodically truncated, and so stops growing after a certain point thanks to snapshotting, the working SQLite database continues to grow as more data is written to it. And since the snapshot copy is roughly the same size as the working SQLite database, it grows too. So if we could eliminate the snapshot copy, rqlite would use roughly 50% less disk.
However, an rqlite node needs a snapshotted copy at certain points — this cannot be avoided. So how do we skip the copy, but still meet the needs of Snapshot and Restore?
To understand how we can avoid storing an extra copy during the snapshotting process, it’s important to know that rqlite runs its underlying SQLite database in Write-Ahead Log (WAL) mode. This is crucial for our new approach. In the proposed 9.0 design, the working SQLite database file (excluding its associated WAL file) and the snapshotted copy in the Raft system are logically the same. By using this fact, we can eliminate the need to store a separate snapshotted copy in the Raft system.
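If you haven’t worked with WAL mode before, the following sketch shows the property the new design relies on: once a database is in WAL mode, committed writes land in the separate `-wal` file and the main database file is left alone until a checkpoint runs. This is not rqlite code (rqlite is written in Go). It uses Python’s standard `sqlite3` module for brevity, the helper names and the `foo` table are made up, and disabling automatic checkpoints is an assumption made for this sketch so that the main file only changes when a checkpoint is explicitly requested.

```python
import os
import sqlite3
import tempfile


def open_wal_database(db_path: str) -> sqlite3.Connection:
    """Open a SQLite database in WAL mode with automatic checkpointing disabled."""
    # isolation_level=None puts the sqlite3 module in autocommit mode, so each
    # statement executed on this connection commits on its own.
    conn = sqlite3.connect(db_path, isolation_level=None)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchall()[0][0]
    if mode != "wal":
        raise RuntimeError(f"could not enable WAL mode, got {mode!r}")
    # Assumption for this sketch: checkpoints happen only when explicitly requested.
    conn.execute("PRAGMA wal_autocheckpoint=0").fetchall()
    return conn


def wal_demo(directory: str):
    """Return (main file size before writes, main file size after writes, WAL size)."""
    db_path = os.path.join(directory, "demo.db")
    conn = open_wal_database(db_path)
    try:
        main_before = os.path.getsize(db_path)
        conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany("INSERT INTO foo (name) VALUES (?)", [("fiona",), ("declan",)])
        main_after = os.path.getsize(db_path)
        wal_size = os.path.getsize(db_path + "-wal")
        return main_before, main_after, wal_size
    finally:
        conn.close()


if __name__ == "__main__":
    before, after, wal = wal_demo(tempfile.mkdtemp())
    print(f"main file: {before} -> {after} bytes, WAL: {wal} bytes")
```

The main file size is the same before and after the writes, while the WAL file has grown to hold them.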
New Snapshotting approach
Let’s delve into how snapshotting will work in 9.0. This can help us see why the working SQLite file can also serve as the copy needed by the Raft Snapshot store.
- Snapshot and WAL Checkpointing: At snapshot time, rqlite will checkpoint the Write-Ahead Log (WAL) of the working SQLite database. All subsequent writes will then be directed to a new WAL file, leaving the main SQLite file unchanged from the point when the snapshot was generated. Consequently, until the next snapshot occurs, the main SQLite file represents the point-in-time state required by the Raft Snapshot store. This approach allows us to use the combined SQLite file and WAL file for regular read and write operations, while the unchanged main SQLite file serves as the dataset for the Raft Snapshot store. No more need for an extra copy!
- Write a Reference to Snapshot Store: Instead of copying the entire SQLite file, rqlite will write a Reference, such as a checksum, to the Snapshot store. This Reference can be used to validate that the main SQLite file matches what the Snapshot store references, whenever snapshot data is needed. (This check protects against bugs, operational mistakes, or disk corruption but isn’t strictly needed.)
- Restoration from Snapshot: As mentioned earlier, since all writes after the snapshot process are directed to the WAL file, the main SQLite file remains ready for the Restore-from-Snapshot process when needed (e.g., during a node restart or when transferring the snapshot to another node). In other words, the main SQLite file (ignoring its associated WAL file) remains logically identical to what would have been written to the Raft snapshot store if rqlite had actually created a duplicate copy.
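The three steps above can be sketched as follows. Again, this is a minimal illustration under stated assumptions rather than rqlite’s implementation: it is written in Python with the standard `sqlite3` and `hashlib` modules, the function names are hypothetical, SHA-256 is just a stand-in since the checksum choice for 9.0 has not been settled, and automatic checkpointing is turned off so that only the snapshot step modifies the main file.

```python
import hashlib
import os
import sqlite3
import tempfile


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def referential_snapshot(conn: sqlite3.Connection, db_path: str) -> str:
    """Checkpoint the WAL into the main file and return a Reference to that file."""
    busy, _log_frames, _checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchall()[0]
    if busy:
        raise RuntimeError("checkpoint could not complete")
    # The Reference (here a checksum) is what would go into the Snapshot store.
    return file_sha256(db_path)


def reference_matches(db_path: str, reference: str) -> bool:
    """Check that the main SQLite file is still the one the Reference describes."""
    return file_sha256(db_path) == reference


def demo(directory: str):
    db_path = os.path.join(directory, "demo.db")
    conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
    try:
        conn.execute("PRAGMA journal_mode=WAL").fetchall()
        conn.execute("PRAGMA wal_autocheckpoint=0").fetchall()
        conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO foo (name) VALUES ('fiona')")

        reference = referential_snapshot(conn, db_path)

        # A write after the snapshot goes to the WAL, not to the main file.
        conn.execute("INSERT INTO foo (name) VALUES ('declan')")
        rows_visible = conn.execute("SELECT COUNT(*) FROM foo").fetchall()[0][0]
        still_valid = reference_matches(db_path, reference)

        # The next snapshot folds the WAL into the main file, producing a new Reference.
        new_reference = referential_snapshot(conn, db_path)
        return rows_visible, still_valid, new_reference != reference
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo(tempfile.mkdtemp()))
```

In `demo`, reads through the connection see both rows (main file plus WAL), while the main file on its own still matches the Reference taken at snapshot time. Only the next snapshot changes it.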
I call this new design Referential Snapshotting.
Bonus Enhancements
Referential Snapshotting will also bring a few other significant improvements.
- Faster snapshotting: By writing minimal data to the Raft Snapshot store, the snapshotting process will be much faster. It should consist of the SQLite WAL checkpointing time (which is usually very short) and the checksum computation time. There will be no need to copy large amounts of SQLite data to the Snapshot store on every snapshot. When one realises that writes to rqlite are blocked during the Snapshot process, the advantages of faster snapshotting are clear.
- Faster restarts: Nodes, even those with multiple gigabytes of SQLite data, will restart much, much faster. Currently, at restart, rqlite has to restore the working SQLite database file from the copy in the Raft Snapshot store. But with this new design, the working SQLite database file will already be in the correct place at start-up. At most, rqlite will only need to compare the checksum in the Snapshot store to the checksum of the working SQLite database. Multi-GB systems should restart within a few seconds.
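The start-up check described in the last bullet might look something like the sketch below. The file names, the one-line text format for the stored Reference, and the `"reuse"` / `"restore"` outcomes are all hypothetical, as is the use of SHA-256. What a node should actually do on a mismatch is one of the details still to be worked out, so the fallback here is only a placeholder.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def startup_action(db_path: str, reference_path: str) -> str:
    """Return "reuse" if the working file matches the stored Reference, else "restore"."""
    if not (os.path.exists(db_path) and os.path.exists(reference_path)):
        return "restore"
    with open(reference_path, "r", encoding="ascii") as f:
        reference = f.read().strip()
    # One sequential read of the file replaces copying it out of the Snapshot store.
    return "reuse" if file_sha256(db_path) == reference else "restore"


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    db_path = os.path.join(directory, "db.sqlite")
    reference_path = os.path.join(directory, "snapshot.ref")
    with open(db_path, "wb") as f:
        f.write(b"stand-in for SQLite data" * 1000)
    with open(reference_path, "w", encoding="ascii") as f:
        f.write(file_sha256(db_path) + "\n")
    print(startup_action(db_path, reference_path))
```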
Next Steps
The move to rqlite 9.0 should be a big step forward for the efficiency of rqlite. By implementing Referential Snapshotting, I expect to achieve significant reductions in disk usage, faster snapshotting, and improved node restart times.
There are many details to get right, including SQLite WAL management, seamless upgrades from earlier releases, and checksum choice. So stay tuned for further updates as we progress towards this major release.
If you’re interested in learning more about Raft and how it can be used to build distributed systems such as rqlite, check out my recent talk at GopherCon.