For the past few years, I’ve been building indexing and search systems for various types of data, often at scale. It’s fascinating work: only at scale does O(n) really come alive. Developing embedded systems teaches you how computers really work, but working on search systems and databases teaches you that algorithms really do matter.
Common to all these systems are a few key requirements and design principles. This series of blog posts aims to outline some of those principles, in terms of the design and implementation of a real program for indexing log data. The program also provides near-real-time search for that data.
A log search system either pulls the data in for analysis, or has it pushed to it. Perhaps it uses a mixture of both. Simply reading files from disk falls into the first category, but receiving data over the network falls into the second category.
Once you decide to build a network service, you must decide which protocols it will support, what performance trade-offs it will make, and what guarantees the system will provide. Does reception at the network layer guarantee the data will be successfully indexed? This guarantee is particularly difficult to meet: you can get very close, but providing 100% reliability is very hard. Supporting a request-response protocol like HTTP makes it easier, since the server can acknowledge only once the data is safely stored, but performance may suffer.
Why parse log lines? Why make the effort to identify timestamps, IP addresses, hostnames, program names, and the like? If all you want to offer is free-text search, then parsing brings little benefit. But once the data is parsed, much more sophisticated indexing and search is possible and — this is of particular importance — searching by time is possible.
And once a log line is parsed, it yields many distinct pieces of data in addition to the source line itself. The system design must support associating these new attributes with the log data during indexing, so that searching by them can be efficient and effective.
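To make this concrete, here is a minimal sketch of what parsing one log line might yield. The syslog-style format, the regular expression, and the field names are all illustrative assumptions, not the schema of any particular system:

```python
import re
from datetime import datetime, timezone

# A hypothetical syslog-ish pattern; real log formats vary widely.
LINE_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<prog>[\w/.-]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<msg>.*)"
)

def parse_line(line, year):
    """Extract structured attributes from one log line, or None on no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Classic syslog timestamps omit the year, so the caller supplies one.
    ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
    return {
        "event_time": ts.replace(tzinfo=timezone.utc),
        "host": m.group("host"),
        "program": m.group("prog"),
        "pid": m.group("pid"),
        "message": m.group("msg"),
    }

rec = parse_line("Mar  7 04:05:06 web1 sshd[217]: Accepted publickey for alice", 2024)
```

One source line becomes half a dozen indexable attributes, each of which can back its own index structure.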
But what time is it?
There are multiple answers to this question when talking about log data. There is the time the log data is received, and there is the time parsed from the log message itself. Let’s call the first the reception time and the second the event time. In most cases event time is more important than reception time.
Most people don’t give much thought to these differences, but they are very important. So important, in fact, that systems which perform very well in the (common) case where reception time closely tracks event time often perform very poorly when event time varies significantly from reception time. This can happen, say, when a sender is delayed for a significant period.
This is also a very important usability issue. Log data is generated, obviously, in event time. To make sense to a user, search results must also be ordered by event time.
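The distinction is easy to see with a delayed sender. The records below are hypothetical; they arrive in one order but must be presented in another:

```python
from datetime import datetime, timedelta

t0 = datetime(2024, 3, 7, 12, 0, 0)

# Records in reception order. The second record came from a sender that
# was delayed for three hours, so its event time is far behind.
records = [
    {"reception_time": t0 + timedelta(seconds=1), "event_time": t0,                      "msg": "a"},
    {"reception_time": t0 + timedelta(seconds=2), "event_time": t0 - timedelta(hours=3), "msg": "delayed"},
    {"reception_time": t0 + timedelta(seconds=3), "event_time": t0 + timedelta(seconds=2), "msg": "b"},
]

# Search results must be ordered by event time, regardless of arrival order.
results = sorted(records, key=lambda r: r["event_time"])
```

A system that stores data strictly in reception order must do this re-sorting somewhere, and that somewhere is exactly where performance suffers when the two times diverge.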
Indexing and search systems are usually built to handle large amounts of data, and the value of older data may decrease rapidly with time; this is particularly true in the case of Technical Operations. So the question of data management is a very real one. How will your system delete data that is no longer needed, so that storage requirements remain controllable? And how will your system respond if data is written into the past, older than retention policies allow? Or, a related question, into the future?
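One common answer, sketched below under assumed daily partitioning (the function names and the one-day future slack are hypothetical), is to partition storage by event time: retention then becomes dropping whole partitions, and out-of-range writes can be rejected up front:

```python
from datetime import date, timedelta

def partitions_to_delete(partition_dates, today, retention_days):
    """Given daily partitions, return those older than the retention window.

    Time-partitioned storage makes retention cheap: dropping a whole
    partition is one directory delete, not a per-record scan.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)

def within_writable_range(event_date, today, retention_days, future_slack_days=1):
    """One possible policy: refuse writes into the already-deletable past,
    or more than a little way into the future."""
    return (today - timedelta(days=retention_days)
            <= event_date
            <= today + timedelta(days=future_slack_days))
```

Rejecting far-future writes matters too: a record dated next year would otherwise pin a partition open long past everything around it.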
Disk is cheap
Yes it is, but that is beside the point.
It’s very important that these types of system make efficient use of storage. Why, if storage is cheap? There are two main reasons, one emotional, one practical.
The first reason is that the storage a system requires for a given set of source data is easily measured. Because of that, it’s often the first thing measured when evaluating a system. And so even though it may not matter much in practice, storage footprint is a very important consideration. This is the emotional reason.
The practical reason applies to deployments dealing with large amounts of data. If you don’t pay attention to storage efficiency, a very large system can easily generate terabytes of data. Operating a system with, say, 30TB of data on disk can be hard and slow work. Need to back up the system in an emergency? It’ll take time. Need to restart your entire cluster? It may take hours to come back online.
Always think about storage efficiency because every bit (really every lack of a bit!) helps.
You want fast indexing, or you want fast search?
Pick one, but only one.
When building these systems, there are many ways to improve indexing performance, and many ways to improve search performance. Using B-trees to store the data gives great read performance, but write performance can suffer. If instead you use an LSM-tree approach, you get great write performance, but searches can take longer.
Or perhaps the system batches up the incoming data so it can be flushed to disk in a single fsync? That improves write performance, but it may delay when the data becomes available for search.
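The batching trade-off looks something like this minimal sketch (the `flush_batch` helper is hypothetical): durability, and in a real system searchability, is deferred to the end of the batch in exchange for paying for one fsync instead of one per line.

```python
import os
import tempfile

def flush_batch(f, lines):
    """Append a batch of log lines, then fsync once rather than per line."""
    for line in lines:
        f.write(line.encode() + b"\n")
    f.flush()
    os.fsync(f.fileno())  # one durability barrier for the whole batch

# Usage: until flush_batch returns, none of these lines is durable.
with tempfile.NamedTemporaryFile(delete=False) as f:
    flush_batch(f, ["line one", "line two"])
    path = f.name
```

The larger the batch, the better the write throughput and the longer the window in which recent data is neither crash-safe nor searchable.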
At the system level, sharding (effectively writing to many storage units in parallel) is often used to improve write performance, and it can be very effective. But when it comes to search, your system may need to read multiple shards, then merge and sort many sets of results from those shards. Search slows down and becomes more complex.
Depending on the system design, sharding may also help with search: if you know the data you are searching for can only exist in a certain subset of shards, you need only search those shards. But this can add complexity to your system at index and configuration time, and may result in hotspots, where some shards are accessed far more heavily than others. Ironically, smoothing out unbalanced access patterns is one of the main reasons for sharding in the first place!
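Both halves of the trade-off fit in a small sketch. The shard count, the hash-by-hostname routing, and the function names are assumptions for illustration; indexing routes each record to one shard, while search must fan out to all of them and k-way merge the per-shard results back into event-time order:

```python
import heapq
import zlib

NUM_SHARDS = 4  # hypothetical; real systems size this to the hardware

def shard_for(host):
    # A stable hash, so a given host always routes to the same shard
    # across processes and restarts.
    return zlib.crc32(host.encode()) % NUM_SHARDS

def index(shards, record):
    # Write path: each record touches exactly one shard.
    shards[shard_for(record["host"])].append(record)

def search_all(shards):
    # Read path: every shard is consulted; each shard's results are
    # sorted by event time, then k-way merged into one ordered stream.
    per_shard = (sorted(s, key=lambda r: r["event_time"]) for s in shards)
    return list(heapq.merge(*per_shard, key=lambda r: r["event_time"]))
```

A query constrained to one hostname could call `shard_for` and search a single shard, which is the shard-pruning optimisation described above, and also exactly how a single busy host becomes a hotspot.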
This is not to say you can’t build systems that index (write) and search (read) with high performance. You can. It’s just that there is a fundamental tension between the two goals, and improving one usually means degrading the other.
Querying the data
Finally, you’ve got to decide how the log data should be queried. Should it be simple free-text search? Should it support regular-expression-like functionality? How does the user specify orthogonal query parameters like time range, e.g. searching only log data from the last day?
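One way to combine free text with an orthogonal time range is a small query mini-language. The `last:<N><unit>` syntax below is entirely made up for illustration; the point is only that the time range is parsed out separately from the search terms:

```python
from datetime import datetime, timedelta

UNITS = {"m": "minutes", "h": "hours", "d": "days"}

def parse_query(text, now):
    """Split a query string into free-text terms and an optional start time.

    A token like 'last:1d' (hypothetical syntax) sets the time range;
    every other token is treated as a free-text search term.
    """
    terms, since = [], None
    for tok in text.split():
        if tok.startswith("last:") and tok[-1] in UNITS:
            n = int(tok[5:-1])
            since = now - timedelta(**{UNITS[tok[-1]]: n})
        else:
            terms.append(tok)
    return terms, since
```

Keeping the time range out of the free-text terms matters because, as discussed above, time is the one attribute the index is most likely to be organised around.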
And this doesn’t even begin to discuss graphical interfaces versus simple text-based command-line interfaces. Most operators prefer the command-line, but casual users usually prefer a GUI.
In the next post I’ll describe a real system I built to index log data received over the network and make that data available for search. It addresses many of the requirements outlined above and demonstrates the trade-offs that are made in real systems.