Programming a database is fascinating work. I’ve been deeply involved with developing open source databases for the past two years and programming a database is possibly the most instructive project one can ever complete as a software developer.
What’s really striking however, is how much my attitude towards databases has changed over the past 6 years. From a state of disinterest, I’ve come to think of these systems as a pinnacle of software engineering.
Not knowing any better
For most of my career, my only experience of databases was reading about them, usually in a rather dry context — open any undergraduate textbook on databases to see what I mean. Often you’re presented with something like the table below, as the canonical use case for a relational database:
ID | First | Last | Title | Department |
---|---|---|---|---|
1 | Robert | Kelly | Director | Marketing |
2 | Tom | Burke | Representative | Sales |
3 | John | Smith | Vice President | Sales |
Could you read about anything more, well, boring? If this was all databases were about I wanted nothing to do with them. What was the point? Software was much cooler than this, right? And so I completely avoided anything to do with databases for a long, long time.
You never forget your first CRUD application
In 2009, after many years writing embedded software, Linux device drivers, and networking software, I found myself leading a team that needed to build a web-based system. You see, the AWS cloud had arrived and licensing techniques based on MAC addresses would no longer work. My team had to build a licensing portal for our new EC2-based software appliances. Since we had plenty of experience with Python, we chose Django, running on top of MySQL. And something new happened. I actually started working with a database.
As development of our plain ol’ CRUD app proceeded, I began to realise just how important the database was — how fundamental it had become to our system. If we lost the database, our software development was for nought. If the database corrupted data, our customers’ appliances could go unlicensed, and their networks would cease to operate. If the database didn’t perform — and didn’t perform well — thousands of people would be affected simultaneously. But none of these things ever happened. The database always worked. It never let us down. I was impressed.
Later on I discovered foreign key constraints, unique constraints, referential integrity, indexes, (remember, at this time I knew nothing about these things) — all the myriad ways the database could help me build a more robust system. I finally realised that modern databases were amazing — that databases are the most boring thing in the world until you actually have to build a system with them.
You never forget your first search system either
By 2012 I was leading a team building a large indexing and search system, with elasticsearch at its core. Seeing what a system like elasticsearch could do — a big key-value database built on world-class indexing technology — even with terabytes of log data underneath it, was a real eye-opener.
By now I had even seen databases and search systems fail too, but I was hooked on database technologies. By 2014 I was part of a small and dedicated team, developing the core of an open-source time-series database.
What I learned
Algorithms really do matter
Only in database development does Big O analysis really come alive. A database is one of the few applications where a programmer still has to loop over, sort, and filter, millions of objects. It’s one of the few places that much of the dry material one learns in CS class actually matters.
The same cannot be said of a lot other software development. Writing Boot ROM firmware? No, algorithms never mattered to me then. Tuner device drivers? Nope, didn’t matter. Network appliance management software? CRUD apps? Hardly. All those disciplines required different skills, different knowledge. Most of the time it was only in job interviews that I discussed runtime complexity.
But that all changed with database development. To actually see a system return the correct results, but in a fraction of the previous time, due to an algorithm change — to actually see it happen in your code, in the system you built — is a wonderful thing.
Performance really matters too
There is an old story in software that goes like this: a programmer writes some code that runs ten times faster than the previous version. He shows it around, but someone points out that the data it produces are slightly different than the correct data. “But it’s ten times faster.” the programmer points out. “Well, if it doesn’t need to be correct, I can make a version that takes up no space at all and runs infinitely fast”, replies the other.
This morality tale always had a big impact on me. Correctness was always more important than anything else. And that is true. But it also made me believe that programs are worth something simply because they produce the correct results.
This is not true for databases.
Performance is more than just a feature. It’s a requirement. The people who will pay real money for databases often do so because they have very large amounts of data. If the database doesn’t perform well in this situation — if it doesn’t return results quickly and efficiently — it might as well not work at all.
You think the write system is complex?
I think what stunned me most about developing databases was how complex the query engine becomes. I had had lots of experience building systems that write and store data to disk. Making those systems perform well can be a significant challenge.
But that complexity is often significantly less than that of the query engine. A flexible query system — effectively building a system to answer questions, when you don’t know what the questions will be ahead of time — requires serious design thought. The query planner must be effective. The query system must support many orthogonal requirements — filtering by some dimensions, and grouping by others, joining data from disparate tables — and sometimes from external sources. And finally the query system must be efficient and perform very well. This leads to a very real tension in design and implementation, between abstraction and optimization, which takes real skill to manage well.
In the real world, it must be operated
Any serious database must support basic operations such as backup, restore, shard management, and monitoring.
If I, as a serious operator, can’t backup your database, I can’t use it, It’s as simple as that. It doesn’t matter how fast your database can accept writes. It doesn’t matter how small its memory footprint is during queries. If I can’t protect the the data in my database from the failures you, the database creator, can’t control for, I’ll never be comfortable running it.
Of course, there are many ways to backup a database that don’t require co-operation from the database. But built-in methods are often best. It’s one of the reasons I added hot-backup support to rqlite v2.0. If I wanted anybody to try using rqlite seriously, I had to address real-world concerns that the system could completely fail, and take the data long with it.
So when designing and implementing a database, build in operational support from the start. Make it a fundamental part of your design. Your users will thank you for it.
The answer is usually “it depends”
When you first start working with databases, especially as an operator, you often ask questions such as: At what rate can the system index? How fast will it respond to queries? How much disk space will I need? How large can a shard be and still be OK? How can I speed stuff up? All asked without qualification. I used to do it myself.
And perhaps you get to talk to the database programmers, and you ask them these questions. And the response you often — perhaps always — get is: It depends. You must benchmark, you must measure. It can be infuriating to hear this, and can seem like a cop-out.
But it’s not.
Now I smile when I hear such questions. It’s naive.
Indexing rates might depend on the size of your data, not just the number of documents or data points. It might depend on batching, the cardinality of your data, whether your database is clustered, what columns and fields in your data are indexed, whether it’s new data, or updates to existing data, the machine you run your database on, the RAM, the IO performance, the replication you use.
The variables controlling performance never end.
For queries, perhaps it depends on the time range for time-series data. It depends on the number of records hit, the number of fields queried, whether range scans are involved, whether the data is indexed, what types of indexes are in use, the number of shards that might be accessed, whether the data is local. And the characteristics of the machine. Is it under load? Is it performing maintenance? Is the network busy?
So the answer is always, it depends. The database designers are being honest. They can know everything about the system they built, and still not know the answer to your question.
Things to program before you die
If there was one piece of advice I would give to fellow developers looking to improve their programming ability, it would be to become part of a database development team. My programming skills increased enormously as a result of database development — it is a wonderful coding experience.
Thanks for the post. Can you tell me the path to becoming a member of a database development team? I love to read database stuff (e.g.: how they work underneath) but it is hard to find an open-source project and contribute. Mostly because those projects have matured and having a big team behind.
Hi Thao — I feel your pain. It is difficult due to a) code maturity (as you say), and the complexity of database software.
One (possibly) easier way is to contribute to the client library code base for the database first — contribute to the libraries that talk to the database’s API. That code is usually less complex, and often not as mature. Or, if a client library doesn’t exist for your favourite language, create it and offer it to the maintainers. That can get you noticed, and studying the API can start to get you up to speed on the database.