Some time ago I was asked where to begin learning data engineering. It was a broad question, and it took me a while to understand what exactly I was being asked.
Soon I realised what the question really was: where should a generalist programmer, without much knowledge of the domain, start to learn about modern data infrastructure systems?
Many people think distributed systems knowledge is the key in this area, but it isn’t. Distributed systems knowledge, while important, is secondary. What really matters is developing a solid understanding of databases and, I would argue, search technology.
Start by learning what relational databases can really do for you. Many developers gain only a superficial understanding of what a relational database does and how it does it, but a relational database is truly an amazing piece of software engineering. Learn how to model data with it, how it performs at scale, what its limitations are, and how it can fail.
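A good way to look under the hood is to ask the database how it plans to execute a query. The sketch below uses SQLite (bundled with Python) with illustrative table names; it models a one-to-many relationship and shows how adding an index changes the query plan from a full table scan to an index search.

```python
import sqlite3

# Model a one-to-many relationship: one author, many articles.
# Table and column names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE article (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES author(id),
        title TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO author VALUES (1, 'Ada')")
conn.executemany("INSERT INTO article VALUES (?, 1, ?)",
                 [(i, f"Post {i}") for i in range(100)])

# Without an index, looking up articles by author scans the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM article WHERE author_id = 1"
).fetchall()
print(plan)  # the plan detail typically reports a SCAN of article

# Add an index and the planner switches to an index search.
conn.execute("CREATE INDEX idx_article_author ON article(author_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM article WHERE author_id = 1"
).fetchall()
print(plan)  # typically reports a SEARCH using idx_article_author
```

Reading query plans like this on your own schemas is one of the quickest ways to learn how a relational database actually performs at scale.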
From there, spend some time working with search engines. I have built a couple of significant systems using search technology, and the more I worked with them, the more I came to think search is the answer to a lot of data problems. Gaining a decent understanding of both relational and search systems, comparing and contrasting what each does well (and does not), is very informative. It is also important to understand how the performance of each type of system varies as data sets get very large. But don’t ask that question too soon.
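The core structure behind search engines is the inverted index: a map from each term to the set of documents containing it. This toy sketch (sample documents are made up) shows the idea; real engines such as Lucene and Elasticsearch add text analysis, relevance scoring, and compressed on-disk postings.

```python
from collections import defaultdict

# Sample documents, keyed by id. In a real system these would be
# web pages, log lines, product records, and so on.
docs = {
    1: "relational databases model structured data",
    2: "search engines index unstructured text",
    3: "databases and search engines are complementary",
}

# Build the inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return ids of docs containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(search("search engines"))  # {2, 3}
```

Contrast this with the relational approach: a `WHERE` clause filters rows by column values, while the inverted index answers "which documents contain these words" directly, which is why search systems handle unstructured text so well.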
Next, learn about the need for replication and sharding. They are not the same thing, though they can be complementary, and they are usually done for quite different reasons: performance, reliability, or fault tolerance.
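The distinction is easy to see in code. Below is a minimal sketch (shard count and replication factor are arbitrary choices for illustration) of hash-based sharding, which spreads keys across shards for performance, combined with simple replication, which copies each key to extra shards for fault tolerance.

```python
import hashlib

NUM_SHARDS = 4          # sharding: split the keyspace for scale
REPLICATION_FACTOR = 2  # replication: extra copies for fault tolerance

def shard_for(key: str) -> int:
    # A stable hash, so every node agrees on where a key lives.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_for(key: str) -> list[int]:
    # Primary shard plus the next shards around the ring.
    primary = shard_for(key)
    return [(primary + i) % NUM_SHARDS for i in range(REPLICATION_FACTOR)]

# Each shard is just a dict here; in reality each would be a server.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(key: str, value):
    for s in replicas_for(key):
        shards[s][key] = value  # write to primary and replicas

def get(key: str):
    # Read from the primary; a replica could serve if the primary is down.
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
print(get("user:42"))  # {'name': 'Ada'}
```

Notice the trade-off already visible in this sketch: more replicas mean more write work but better read availability, while more shards mean more parallelism but more coordination.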
Only after you understand these areas should you dive into true distributed systems concepts. Learn why distributed systems can help you address many of the issues mentioned above, but also why they introduce problems of their own. Study distributed consensus as a formal process to design for, and solve, the distributed systems problem.
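Consensus protocols such as Raft or Paxos are far too involved to sketch in a few lines, but the majority-quorum rule at their heart is simple: any two majorities of N nodes must overlap, so a read quorum always includes at least one node that saw the latest successful write. This simulation (node states and versions are invented for illustration) shows why that overlap matters.

```python
N = 5
QUORUM = N // 2 + 1  # 3 of 5: any two quorums share at least one node

# Each node stores a (version, value) pair; versions let a reader
# tell which reply is newest. Everyone starts on version 1.
nodes = [(1, "old")] * N

def quorum_write(version, value):
    # A real system would retry and handle unreachable nodes; here we
    # write to a bare majority and deliberately leave the rest stale.
    for i in range(QUORUM):
        nodes[i] = (version, value)

def quorum_read():
    # Read the worst-case majority: the one overlapping the write
    # quorum in only a single node. The highest version still wins.
    replies = [nodes[i] for i in range(N - QUORUM, N)]
    return max(replies)

quorum_write(2, "new")
print(quorum_read())  # (2, 'new') despite two stale replies
```

This is not consensus itself (there is no leader election and no handling of conflicting concurrent writes), but it is the overlap argument you will meet again and again when you study Raft and Paxos.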
The final key element of data engineering is pipelines: moving data from one place to another. I’m not just talking about networking, though that is clearly critical. I’m talking about the control flow, feedback, buffering, and recovery challenges involved when you need to move large amounts of data from one place to another, often with low-latency requirements.
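Two of those challenges, buffering and feedback, can be shown with nothing more than a bounded queue. In this sketch (buffer size and event names are arbitrary), the bound is what creates backpressure: a producer that outruns the consumer blocks instead of exhausting memory, and a sentinel value gives a clean shutdown.

```python
import queue
import threading

BUFFER_SIZE = 10  # the bound is the point: it creates backpressure
SENTINEL = None   # signals the consumer that the stream has ended

buffer = queue.Queue(maxsize=BUFFER_SIZE)  # put() blocks when full
received = []

def producer():
    for i in range(100):
        buffer.put(f"event-{i}")  # blocks if the consumer falls behind
    buffer.put(SENTINEL)

def consumer():
    while True:
        item = buffer.get()
        if item is SENTINEL:
            break
        received.append(item)  # stand-in for indexing, writing, etc.

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(len(received))  # 100
```

Systems like Kafka wrap the same ideas (bounded retention, consumer feedback, recovery via replayable logs) in durable, distributed form.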
Apache Kafka has now become the leading pipeline technology, and is a great system to study.
A place to start
One place to start is with the search engine and a data set. Take the contents of Wikipedia and index it. Study what’s involved, and how performance varies depending on your hardware and reliability requirements. Next build a prototype pipeline that you can send data into on demand, and plug your indexing system into the other end. Decoupling different parts of your system is an important design pattern, and pipelines let you do it.
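A first cut of that exercise can fit in a few lines. In this sketch (the documents stand in for Wikipedia articles), the producer only knows how to put documents on the pipeline, and the indexer only knows how to take them off; neither knows the other exists, which is exactly the decoupling the pattern buys you.

```python
import queue
import threading
from collections import defaultdict

pipeline = queue.Queue(maxsize=100)  # the decoupling point
index = defaultdict(set)             # term -> doc ids
DONE = None                          # end-of-stream marker

def indexer():
    # Consumes documents from the pipeline; knows nothing of the source.
    while True:
        doc = pipeline.get()
        if doc is DONE:
            break
        doc_id, text = doc
        for term in text.lower().split():
            index[term].add(doc_id)

worker = threading.Thread(target=indexer)
worker.start()

# The producer knows nothing about indexing; it just sends documents.
pipeline.put((1, "Alan Turing was a computer scientist"))
pipeline.put((2, "Grace Hopper was a computer scientist and admiral"))
pipeline.put(DONE)
worker.join()

print(sorted(index["computer"]))  # [1, 2]
```

From here you could swap the producer for a Wikipedia dump reader, or swap the toy indexer for a real search engine, without touching the other side; that swap-ability is the payoff of decoupling.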
A place to finish
When you’ve done all this, study the data structures underlying it all. Possibly the two most important are the B+ Tree and the Log-structured Merge Tree. The former is most prevalent in relational databases, the latter in systems like Cassandra and Elasticsearch. Again, compare and contrast, and understand why one is used rather than the other.
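To make the contrast concrete, here is a toy LSM-style store (sizes and names are illustrative, and real LSM trees add write-ahead logs, compaction, and bloom filters). Writes go to an in-memory memtable, which is flushed to sorted, immutable runs when full; reads check the memtable first, then runs from newest to oldest. A B+ Tree instead updates pages in place, which is why it favours reads while the LSM approach favours write throughput.

```python
import bisect

MEMTABLE_LIMIT = 4  # flush after this many keys (tiny, for illustration)

memtable = {}
sstables = []  # each run is a sorted list of (key, value); newest last

def flush():
    # Flushing is a sequential write of a sorted, immutable run:
    # the source of LSM write efficiency.
    global memtable
    sstables.append(sorted(memtable.items()))
    memtable = {}

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def get(key):
    # Reads may touch several runs: the cost LSM pays for cheap writes.
    if key in memtable:
        return memtable[key]
    for run in reversed(sstables):  # newest run wins
        i = bisect.bisect_left(run, (key,))
        if i < len(run) and run[i][0] == key:
            return run[i][1]
    return None

for k in ["a", "b", "c", "d", "e"]:
    put(k, k.upper())
put("a", "A2")  # a later write shadows the flushed value

print(get("a"), get("e"))  # A2 E
```

Trace a `get` through this sketch and you can see why LSM systems need compaction, and why a B+ Tree, with one page to update per key, makes the opposite trade.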
And now you know what I wrote in response to the question. Hopefully this gives you an idea of where to start, and what to do next.