rqlite is a lightweight, open-source, distributed relational database written in Go, which uses SQLite as its storage engine. Recently I introduced a high-performance write-path into rqlite and, to my great surprise, it exposed a bug in SQLite.
The SQLite team quickly addressed the issue but let’s take a look at the behavior I saw, and how I finally created a simple test which reproduced the issue.
It always starts with a new feature
rqlite 7.5.0 introduced Queued Writes — allowing users to write large amounts of data to rqlite in a high-performance manner. Of course, this meant that the SQLite INSERT rate would also increase, which, of course, was the point.
During my load testing of Queued Writes, I would sometimes run the rqlite CLI, and keep an eye on test progress by issuing the following command periodically:
127.0.0.1:4001> SELECT COUNT(*) FROM logs
Imagine my surprise when I would occasionally get back:
ERR! database disk image is malformed
Well, that’s strange. rqlite runs an in-memory database by default. How could the disk image be corrupted? I put it down to re-use of an error code within SQLite, which is a fairly common practice in programming generally. But then I made the real mistake. I didn’t take the error seriously enough and decided it was probably something transient, never to be seen again.
The real world intrudes
On November 1st a rqlite user, João Arruda, filed the following issue on GitHub:
There was no denying it now — this was a real issue, and potentially a serious one. Who wants to see errors about “malformed disk image” when using a database?
The first thing to do was to think about what was happening — and could I reproduce what João was seeing? I fired up my load test again, increased the INSERT rate until the disk IO was maxed out, and started querying the database via the CLI — and hit the issue again.
ERR! database disk image is malformed
What my testing did show was that this was a query-time issue only. Once the INSERT traffic ceased the correct number of records were always in the database — and those records always looked fine. So that was the good news — whatever was going on was not actually corrupting the database.
Let’s unit test it
The next was to simplify the setup. Could I write a test that would bring out the issue? Thinking about what was going on I realised it wasn’t much more complicated than queries being performed on one database connection, while rows were being concurrently inserted on a different connection. Shouldn’t be too difficult to reproduce, right?
And it wasn’t. I wrote a simple unit test in Go, which involved no rqlite code, and ran it.
$ go test -run Test_TableCreationInMemoryLoadRaw --- FAIL: Test_TableCreationInMemoryLoadRaw (2.30s) malformed_test.go:59: rows had error after Next() (1203 loops): database disk image is malformed FAIL exit status 1 FAIL github.com/mattn/go-sqlite3 2.303s
Boom! Now I knew I was onto something. Next step was to check with the maintainers of the Go SQLite driver used by rqlite. They confirmed there was something going on deep in SQLite, and wasn’t a problem with the Go code.
If you didn’t code it in C, it didn’t happen
Now it was time to post on the SQLite forum. I made my first post in early November, looking for help. I asked why should my queries see a corrupt disk image, with an in-memory database? While I got some response to my question, there was no real progress. I recognized that a Go-only example wouldn’t be too convincing to the SQLite community. There were just too many layers between my code and the SQLite implementation — too many reasons why it could be my usage of SQLite, and not SQLite itself.
So time to write a simple C program. Would I bring out the issue? Would it disappear? Either way I hoped to learn something.
I put together a C program pretty easily — the SQLite C API is surprisingly easy to code against. And then I compiled and executed it:
$ gcc in-memory.c -pthread -l sqlite3 $ ./a.out Running SQLite version 3.39.4 THREADSAFE=1 Failed to step data: database disk image is malformed
And there you go. A simple use of SQLite and my perfectly-fine queries encounter a corrupt database. It was time to post on the SQLite forum again. And very soon afterwards the SQLite team confirmed it was a bug in SQLite, and patched their source code.
Looking at the patch it appears that my query connection was getting access to the database when it shouldn’t have — and then seeing a database that was in the middle of changing. The query code interpreted this as a corrupt database, and returned an error to my code.
This was all very gratifying. I realised finding this bug was probably the most significant thing I’d done in my entire software career. I quickly patched my copy of SQLite, not waiting for a new SQLite release. I rebuilt rqlite and could no longer reproduce the issue. As a result rqlite 7.12.1 was released, including the fix for this issue.
And finally João confirmed it no longer occurred in his production setup either:
The lesson learned here is the same as it always is. The computers are doing exactly what you tell them to do. They don’t randomly insert errors in your system, and the computers are not making stuff up as they go along. Ignoring the issue when I saw it first was a big mistake, and was not intellectually honest.
Many thanks to João for filing the issue on GitHub, rittneje@ who maintains the Go SQLite driver at the center of rqlite and helped show it was a SQLite issue, and the SQLite team for fixing the bug so quickly.
11 thoughts on “How I found a bug in SQLite”
I think I hit the same bug in django, and it took them 5 years to fix it. Django tagline is “webframework for perfectionists with deadlines”. I was fired because of this bug.
Damn, what a ride!
The fix is for memdb. I doubt that you used the inmemory db for django….
I did use in memorydb for a testing pipeline, andteh bug was preventing transition from django 1 to django 2.
As a maintainer and software dev myself, I can 100% relate w/ how you decided to ignore it at first, as a “maybe it’s my env or smth”.
As a lesson to learn from this, it could be an interesting approach for the next time u hit something similar, bc let’s face it: it is very likely to happen, to add a ” known issues” disclaimer stating what you saw, in case someone else also manages to reproduce it, and if so, then focus on it, rather than ignoring it and then having an issue filed.
In any case, well done, and what a nice achievement 😀
That’s actually a good idea, I should do that going forward for any other issue like this.
> add a ” known issues” disclaimer stating what you saw, in case someone else also manages to reproduce it, and if so, then focus on it
If you go that route, you need to explicitly ask for reports of bug reproductions, though – if I ran into a bug that was listed as a known issue, I wouldn’t file a bug report (after all, they already know!)
Computers really can insert random errors in your code! They happen when radiation or cosmic rays randomly flip a bit… and yes, that really does happen. But only really at large scale, and we have error correction in hardware (both RAM and disks) to minimise the (uncorrected) rate of these random bit flips below the level where you will never see it in a whole career.
I had that happen to me once compiling a large C++ code base. I got an error talking about “klass Something” or something similar. The source didn’t contain such an egregious error. Looking deeper it turned out it was a bit flipped in the class keyword. Recompiling obviously fixed it. Still a bit scary.
I would say any software issue can be truly random until you can reproduce the issue on another system that doesn’t share any piece of hardware.
That’s the only true way to be absolutely sure your current system is not simply broken. Brokeness can happen in various ways: storage, power, broken memory controller, broken L1/L2/L3 cache, broken CPU core, broken micro-op execution unit in the CPU core etc etc.
Only when you can reproduce it with two fully independent systems (in best case having everything different so Intel vs AMD CPU, Samsung vs ADATA storage etc. you can be sure. (This would be required to rule out design problem in a given component.)
When you’re sure the issue is caused by software, then it’s time to put resources to figure out the exact cause.
It was a bigger issue, congratulations!