Stop asking "how much data do you have?"

In every field there is a question that, while it sounds interesting, betrays a certain naiveté and lack of sophistication.

In my field — SaaS and data platforms — it’s “How much data do you have?”

I’ve been asked this question many times, usually with little indication that the asker understands what the answer would actually tell them.

Let’s understand why

Let me explain why this question is often meaningless. Imagine I am part of a team that has built a SaaS-based analytics system (I am). Let’s say you ask me how much data I have, and I answer 20 terabytes. Have you learned anything? You might think you have, but you probably learned very little.

Perhaps you think 20 terabytes is really impressive. Perhaps you think it’s nothing. But you don’t really know, in any way that matters.

Perhaps I am cheating

Most modern data storage systems support replication. Replication is done for fault-tolerance and query performance. So perhaps I am replicating 2, 3, or even 4 times. Now my 20TB might really only be 5TB. But I didn’t tell you that, did I? And perhaps I’ll turn up the replication in case you’d like a bigger number. You want 60TB? Just let me know.
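A back-of-the-envelope sketch makes the point (the figures here are invented for illustration):

```python
def raw_size_tb(logical_tb: float, replication_factor: int) -> float:
    """Bytes on disk once every replica is counted."""
    return logical_tb * replication_factor

logical = 5.0  # TB of actual, unique data (a hypothetical figure)
for rf in (1, 2, 3, 4):
    print(f"replication factor {rf}: {raw_size_tb(logical, rf):.0f} TB on disk")
# At replication factor 4, 5TB of data becomes the "20TB" I quoted you.
```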

Perhaps I am selling myself short

Perhaps I use a Lucene-based system like Solr, or perhaps InfluxDB. Both store data in compressed form. On disk I might have 10TB, but it might represent 100TB of information. So I ask you again: how much data do I have?
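If you want a feel for how large that gap can be, here is a minimal sketch using Python’s standard zlib; the payload is invented, but repetitive machine data like this is exactly what time-series stores and inverted indexes compress so well:

```python
import zlib

# Hypothetical payload: highly repetitive telemetry records.
raw = b'{"ts": 1700000000, "cpu": 0.42, "host": "web-01"}\n' * 100_000
compressed = zlib.compress(raw, 6)

print(f"logical size: {len(raw) / 1e6:.1f} MB")
print(f"on-disk size: {len(compressed) / 1e6:.2f} MB")
print(f"ratio: {len(raw) / len(compressed):.0f}x")
# Identical lines compress absurdly well; real telemetry often still
# manages 5-10x. Either way, "size on disk" is not "amount of data".
```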

What does my system do?

Storing 100MB of data under strict consensus, where every write must be acknowledged by a quorum of nodes, can be much more challenging than storing 100GB in an eventually consistent system. Sometimes it’s the other way around. But without knowing what kind of system I am running, a number doesn’t tell you much.

And running ten 5TB data systems is very different from running a single 50TB system; the operational challenges of the latter can be far greater.

Perhaps I am throwing my data away

So I tell you I have 100GB of data in my system. What you don’t know is that I have a system that processes 2TB of data per day.

You see, my team spent months building our streaming analytics pipeline to handle that ingest volume — and this was where we excelled. The team designed it, debugged it, deployed the monitoring infrastructure, made it rock-solid. It can actually be operated and can withstand all manner of failures. That’s where the engineering is, and the amount of data you see in my system completely misses the point.
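The arithmetic, using the numbers above, shows how misleading the stored figure is (a rough sketch; the rates are the ones from this example):

```python
ingest_per_day_tb = 2.0   # what the pipeline actually processes each day
stored_gb = 100.0         # what you see sitting "in my system"

# If nothing were filtered or aggregated, the retained data would cover:
retained_days = (stored_gb / 1000) / ingest_per_day_tb
print(f"retention at raw ingest rate: {retained_days * 24:.1f} hours")
# ~1.2 hours of raw input; everything else was aggregated or aged out,
# which is precisely where the engineering effort went.
```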

It’s not the amount, it’s what’s inside it

Is my data indexed? By how many dimensions? Do I have 20TB sitting underneath an Elasticsearch cluster? Is it data that has many high-cardinality attributes? Is it heterogeneous in nature? In all these cases that 20TB is much richer, much more sophisticated, than 20TB sitting in AWS S3. It takes a lot more skill to manage the former than the latter.
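One rough way to see whether data is “rich” in this sense is to measure attribute cardinality before arguing about byte counts. A sketch (the events.csv file and its columns are invented):

```python
import csv
from collections import defaultdict

# Gauge which attributes are high-cardinality, and therefore expensive
# to index, in a hypothetical export of event data.
distinct = defaultdict(set)
with open("events.csv", newline="") as f:
    for row in csv.DictReader(f):
        for column, value in row.items():
            distinct[column].add(value)

for column, values in sorted(distinct.items(), key=lambda kv: -len(kv[1])):
    print(f"{column}: {len(values):,} distinct values")
```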

Which brings me to my next point.

Perhaps my data is dead

I tell you I have 700TB of data. You are impressed. But my data sits in AWS Glacier, and never sees the light of day. Have I ever even attempted to restore a fraction of it? Do I really know what I am doing? So I ask you again, how much data do I have?
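Anyone who has actually tried knows that pulling data back from Glacier is an asynchronous job, not a read. A minimal boto3 sketch (the bucket and key are hypothetical):

```python
import boto3

# A Glacier restore is a job you initiate and wait on, not a GET.
s3 = boto3.client("s3")
s3.restore_object(
    Bucket="my-archive-bucket",
    Key="events/2021/01/part-000.parquet",
    RestoreRequest={
        "Days": 7,  # how long the restored copy stays readable
        "GlacierJobParameters": {"Tier": "Bulk"},  # cheapest tier, slowest wait
    },
)

# Until the job completes, the object is unreadable; poll its status.
head = s3.head_object(
    Bucket="my-archive-bucket", Key="events/2021/01/part-000.parquet"
)
print(head.get("Restore"))  # e.g. 'ongoing-request="true"'
```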

A better question

Now, of course the amount of data I have does matter. Dealing with 1MB of data is clearly different from dealing with 1TB, but just knowing the number is not enough.

There is a better question to ask.

If you had a major system failure and lost your data system, how long would it take to bring it back online from backups?

I like this question because it will tell you so much more about what you’re dealing with. Does the team snapshot the database? How long do snapshots take? How do snapshots work? If they must restore a snapshot, how do they recover changes made since the last snapshot? Does AWS actually do everything for them?

If the data is stored raw somewhere, how long will it take to re-index it into a fresh database? Is it all replayed from a system like Apache Kafka? How long will that take? How will you know when it’s done?

Can certain data — perhaps more valuable than other data — be re-indexed first? Are we talking minutes, hours, or days before you’re back online? If it’s a distributed system, is it a particular challenge bringing it back to life? Where does the schema come from?
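Even a crude model of the replay forces the right conversation, because every input is something the team must actually know about its own system. A sketch with invented numbers:

```python
def replay_hours(backlog_tb: float, consumers: int, mb_per_sec_each: float) -> float:
    """Wall-clock estimate for replaying a Kafka backlog into a fresh index."""
    total_mb = backlog_tb * 1_000_000
    throughput_mb_s = consumers * mb_per_sec_each  # assumes enough partitions
    return total_mb / throughput_mb_s / 3600

# Hypothetical: 20TB of retained topics, 12 consumers at 50 MB/s each.
print(f"{replay_hours(20, 12, 50):.0f} hours")  # roughly 9 hours, if nothing fails
```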

The real challenges

When it comes to building systems, what really matters is what has always mattered. Is my system well engineered? Can it tolerate failure? Am I humble enough to know I will make mistakes, and can my system recover? Do I actually know what it means to manage a modern multi-node clustered database?

Am I part of a great team that writes clear code, takes pride in its work, works hard, and is always trying to learn?

So by all means ask me how much data I have. But please ask other questions too.
