rqlite is a lightweight, open-source, distributed relational database written in Go, which uses SQLite as its storage engine. Recently I introduced a high-performance write-path into rqlite and, to my great surprise, it exposed a bug in SQLite.
The SQLite team quickly addressed the issue but let’s take a look at the behavior I saw, and how I finally created a simple test which reproduced the issue.
It always starts with a new feature
rqlite 7.5.0 introduced Queued Writes — allowing users to write large amounts of data to rqlite in a high-performance manner. Of course, this meant that the SQLite INSERT rate would also increase, which, of course, was the point.
127.0.0.1:4001> SELECT COUNT(*) FROM logs
Imagine my surprise when I would occasionally get back:
ERR! database disk image is malformed
Well, that’s strange. rqlite runs an in-memory database by default. How could the disk image be corrupted? I put it down to re-use of an error code within SQLite, which is a fairly common practice in programming generally. But then I made the real mistake. I didn’t take the error seriously enough and decided it was probably something transient, never to be seen again.
The real world intrudes
On November 1st a rqlite user, João Arruda, filed the following issue on GitHub:
There was no denying it now — this was a real issue, and potentially a serious one. Who wants to see errors about “malformed disk image” when using a database?
The first thing to do was to think about what was happening — and could I reproduce what João was seeing? I fired up my load test again, increased the INSERT rate until the disk IO was maxed out, and started querying the database via the CLI — and hit the issue again.
ERR! database disk image is malformed
What my testing did show was that this was a query-time issue only. Once the INSERT traffic ceased the correct number of records were always in the database — and those records always looked fine. So that was the good news — whatever was going on was not actually corrupting the database.
Let’s unit test it
The next was to simplify the setup. Could I write a test that would bring out the issue? Thinking about what was going on I realised it wasn’t much more complicated than queries being performed on one database connection, while rows were being concurrently inserted on a different connection. Shouldn’t be too difficult to reproduce, right?
And it wasn’t. I wrote a simple unit test in Go, which involved no rqlite code, and ran it.
$ go test -run Test_TableCreationInMemoryLoadRaw --- FAIL: Test_TableCreationInMemoryLoadRaw (2.30s) malformed_test.go:59: rows had error after Next() (1203 loops): database disk image is malformed FAIL exit status 1 FAIL github.com/mattn/go-sqlite3 2.303s
Boom! Now I knew I was onto something. Next step was to check with the maintainers of the Go SQLite driver used by rqlite. They confirmed there was something going on deep in SQLite, and wasn’t a problem with the Go code.
If you didn’t code it in C, it didn’t happen
Now it was time to post on the SQLite forum. I made my first post in early November, looking for help. I asked why should my queries see a corrupt disk image, with an in-memory database? While I got some response to my question, there was no real progress. I recognized that a Go-only example wouldn’t be too convincing to the SQLite community. There were just too many layers between my code and the SQLite implementation — too many reasons why it could be my usage of SQLite, and not SQLite itself.
So time to write a simple C program. Would I bring out the issue? Would it disappear? Either way I hoped to learn something.
I put together a C program pretty easily — the SQLite C API is surprisingly easy to code against. And then I compiled and executed it:
$ gcc in-memory.c -pthread -l sqlite3 $ ./a.out Running SQLite version 3.39.4 THREADSAFE=1 Failed to step data: database disk image is malformed
And there you go. A simple use of SQLite and my perfectly-fine queries encounter a corrupt database. It was time to post on the SQLite forum again. And very soon afterwards the SQLite team confirmed it was a bug in SQLite, and patched their source code.
Looking at the patch it appears that my query connection was getting access to the database when it shouldn’t have — and then seeing a database that was in the middle of changing. The query code interpreted this as a corrupt database, and returned an error to my code.
This was all very gratifying. I realised finding this bug was probably the most significant thing I’d done in my entire software career. I quickly patched my copy of SQLite, not waiting for a new SQLite release. I rebuilt rqlite and could no longer reproduce the issue. As a result rqlite 7.12.1 was released, including the fix for this issue.
And finally João confirmed it no longer occurred in his production setup either:
The lesson learned here is the same as it always is. The computers are doing exactly what you tell them to do. They don’t randomly insert errors in your system, and the computers are not making stuff up as they go along. Ignoring the issue when I saw it first was a big mistake, and was not intellectually honest.
Many thanks to João for filing the issue on GitHub, rittneje@ who maintains the Go SQLite driver at the center of rqlite and helped show it was a SQLite issue, and the SQLite team for fixing the bug so quickly.