Replicating SQLite using Raft Consensus

raft-logoSQLite is a “self-contained, serverless, zero-configuration, transactional SQL database engine”.  However, it doesn’t come with replication built in, so if you want to store mission-critical data in it, you better back it up. The usual approach is to continually copy the SQLite file on every change.

I wanted SQLite, I wanted it distributed, and I really wanted a more elegant solution for replication. So rqlite was born.

Why replicate SQLite?

SQLite is very convenient to work with — the entire database is contained within a single file on disk, making working with it very straightforward. Many people have experience with it, and it’s been a natural choice for adding relational-database functionality to many systems. It’s also rock-solid. However, since it isn’t replicated it can become a single point of failure in a system design. While it is possible to continually copy the SQLite file to a backup server every time it is changed, this file-copy must not take place while the database is being accessed.

I decided to build a distributed replication layer using the Raft consensus protocol, which gives me effective replication without the hassle of running a much heavier solution like MySQL. It provides all the advantages of replication, with the data modelling functionality of a relational database, but with the convenience of a single-file database. The entire system is written in Go, and the source is available on github.

An rqlite cluster

The diagram below shows an example rqlite cluster of 3 nodes, which continually work together to ensure that the SQLite file under each node is identical. With 3 nodes running, 1 node can fail, the cluster will remain up, and the data is still safe. In this example a leader has been elected and is coloured red. The Raft protocol dictates that all reads and writes should go through this node.

A 3-node rqlite cluster
A 3-node rqlite cluster, replicating SQLite using Raft Consensus

For a write operation, only when a majority of nodes (including the leader) acknowledge that write, is that change actually committed to the Raft log and then to the actual SQLite databases underneath each node — it is the leader’s job to ensure this consensus is reached. If the leader fails, or there is a network partition such that the leader is cut off from the other two nodes, one of the other nodes will be elected leader shortly afterwards.

rqlite is a CP system. When faced with a network partition it chooses consistency over availability — reads and writes in the partition with a quorum of servers will remain available. But the servers on the other side of the partition will refuse to accept any changes. When the partition is healed however, these nodes will receive any changes made to the nodes on the other side of partition, and all copies of SQLite database will be in consensus again.

Choosing a Distributed Consensus algorithm

Raft is used as the consensus protocol for multiple projects including InfluxDB and etcd. They both use the goraft implementation, and since I want to write more Go, it was a natural choice to use for rqlite.

Deploying rqlite

You can find the source code for rqlite, and instructions on how to build and deploy it, here on github. I hope to continue developing this software, as distributed consensus systems are immensely interesting.

23 thoughts on “Replicating SQLite using Raft Consensus”

  1. Hi:

    So, only operations by HTTP API are replicated , right ?

    What happens if all nodes reply with an ACK ( raft consensus ) and , before committing to raft log , one node fails ? When is synchronized again ?

    Regards

  2. Yes, only operations committed through the HTTP API are replicated.

    As for the failure you outlined, when the node is brought back online it will contact the leader and find that the leader has entries in its log that it is missing. It will then write those entries to its log.

    Check the Raft paper for full details.

    1. Hi Jens — it started as a real need. A system I was working on professionally needed a small amount of (ideally relational) database storage. The system was clustered, so the question was which node should host this storage, which I didn’t like. So this was about showing we could run SQLite in a replicated manner, such that the loss of any single node would mean not mean the data would be lost. We ended up going a different route however, and rqlite was not used in production.

        1. Hi Stanislav — we also had access to a clustered Java-based key-value store (open source, but can’t provide the name), so just decided to use that. It did mean we had to build the relational layer ourselves though.

  3. Nice work! Very useful piece of software, but i have question. Philip is this possible to use https to secure database access? I want to synchronize many devices but i want to secure database against outside access? Is this possible or do you plan this functionality in near future? Regards.

    1. Mike — thanks. There is no support for HTTPS access to the system, though you might be able to do it yourself by putting something like nginx in front of each node, and have it do the HTTPS to HTTP conversion.

  4. This is interesting!

    Would it be feasible to modify this to tolerate byzantine failures? I.e, could I configure it to commit transactions only when 2/3+1 peers have voted for the same transaction?

    Background: @ ball.askemos.org we have a programming platform sitting atop of replicated state machines. Now I’m looking for alternative implementations of replicated state machines, which may fit the bill. Our platform is a bit Erlang-inspired having persistent agents which communicate by message passing. We replicate those agents in consensus with byzantine failure resistance. Often we store the state of our agents in a sqlite3 database. Pretty much like you’re doing here. That’s how I found rsqlite in the first place.

    1. Hi Jörg — if this system was changed to support byzantine failures, the code base would be very different. Since it’s built on the Raft consensus protocol, it only supports the failure modes that Raft tolerates. So if the code was modified as you would like, it would be a very different system — it wouldn’t be a modification so much as a new system. I hope this helps.

      1. I see. When I posted, I assumed that Raft was actually a *consensus* protocol. Which implies that it would normally tolerate byzantine faults. But Raft is in fact a *coherence* protocol, where you must trust the leader (making the latter a SPOF).

        Sure it helps: we better stick with the code we’ve been using so long as this protects us against byzantine faults. The slowdown is not as bad as a backdoor would be.

        1. I am not sure the guys at Stanford would agree with you. They clearly consider Raft a consensus protocol. I am unfamiliar with the term “coherence protocol”.

          Also the leader is clearly not a single point of failure when you consider the system as a whole (which is the point). The system remains fully functional if the leader fails — another node simply becomes leader.

    1. Alex — I don’t really understand the question. I know what you might mean by a database lock, but I don’t see exactly how that is specific to rqlite.

    1. it works like a charm.
      i’m really impressed Philip.

      one question
      if i want to change location of the db
      can i do localhost:4001/dev/shm/db ?

  5. how do you determine which one is the leader ?
    if i start the node 1, i’m seeing the following:

    is there a way to do a clean start?

    2016/02/21 10:40:28 [ERROR] attempted leader redirection, but no leader available
    2016/02/21 10:40:29 [ERROR] attempted leader redirection, but no leader available
    2016/02/21 10:40:42 [ERROR] attempted leader redirection, but no leader available

    1. Alex — rqlite has a new consensus module, and those errors will no longer occur. You should run the latest code instead.

  6. Buenas noches. Amigos me gustaría saber en que sistema operativo fue montado el cluster o como hizo?

  7. Does it support multiple SQLite DBs? We plan to use multiple SQLite DBs and want to replicate all of them. Also the writes and reads can be done on any of the DBs.

    1. No, you must run a rqlite cluster per SQLite database. And all writes must go through the leader, but reads can go through any node, depending on what read-consistency guarantees you can accept.

Leave a Reply

Your email address will not be published. Required fields are marked *