Why are we still tailing log files?

Sit down with any developer today, even those at pretty new software companies, and you’ll see them doing something that programmers have been doing for years — you’ll see them tailing log files.

But why? Why are we still tailing logs?

Didn’t we solve log analysis?

I’m part of the Engineering team for Google Cloud Logging, one of the largest logging systems in the world. We’ve built advanced log analysis at scale. So has AWS, and so has Microsoft.

[Image: just a few of the Logging companies in the market.]

Logging companies are everywhere, because engineers love to solve the problems they understand best. It’s a super crowded market — a market that has provided a plethora of log ingestion, search, and insights tools. We have had advanced log management and analysis systems for years.

Yet developers are still tailing log files.

A long time ago, when I was at Loggly, our mission was to build a product so effective that tailing would seem unnecessary and anachronistic — and yet our customers still tailed their log files. We even tailed our own log files while developing and monitoring Loggly itself!

Now, to be clear, we used our own product plenty, but tailing never went away. In fact, one of our competitors at the time, Papertrail, made its name building a really slick yet simple product that tailed logs and didn’t do much else.

Tailing is dead, long live tailing!

It turns out tailing isn’t going anywhere. In fact, my team at Google recently released its own tailing feature. So it’s worthwhile thinking about why developers still like to tail logs, and where it can still go wrong.

Tailing remains popular because it’s simple and gives a real sense of connection with the computer and the software it’s running. As a developer, you really enjoy the low-latency experience, almost a sense of mind-meld with the software you’re developing.

And, of course, tailing is easy. No agents to install, no web browsers; just, perhaps, an ssh session and a shell prompt. And Control-C when you’re done.

It’s these very characteristics that make it difficult to emulate a tailing experience in advanced log analysis products. Any emulation of tailing must seem very simple, and low latency is key. This can be difficult to do well in a (relatively) complex framework such as a webpage, not to mention transmitting data over HTTP.

Traps for young players

But we were right at Loggly, to a certain extent. If one is not careful, tailing encourages bad habits, especially when monitoring production systems.

The temptation is always to ssh to a machine, start tailing the log files, and watch for the critical log message to go by, the one that will tell you what you need to know there and then. But it’s too easy to miss the very message you need to see when text is streaming by. It’s then that Search is really needed, when you need to be sure, and when Alerting and Insights are needed too.
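To make that concrete, here is a minimal sketch of letting the machine do the watching instead of your eyes: follow a file the way `tail -f` does, but only surface lines that match a pattern. It is an illustration only, not how any particular product works. The names `follow` and `watch`, and the `poll_interval` and `max_idle_polls` parameters, are my own inventions for this example, and it assumes a plain local file that is only ever appended to (no log rotation).

```python
import re
import time
from typing import IO, Iterable, Iterator, Optional


def follow(stream: IO[str], poll_interval: float = 0.2,
           max_idle_polls: Optional[int] = None) -> Iterator[str]:
    """Yield complete lines as they are appended to `stream`.

    Hypothetical helper: polls the file rather than using OS notifications.
    Stops after `max_idle_polls` consecutive empty reads; with the default
    of None it runs until interrupted (Control-C).
    """
    idle = 0
    buffer = ""
    while max_idle_polls is None or idle < max_idle_polls:
        chunk = stream.readline()
        if not chunk:
            # Nothing new yet: wait a little and try again.
            idle += 1
            time.sleep(poll_interval)
            continue
        idle = 0
        buffer += chunk
        # Only emit once the writer has finished the line.
        if buffer.endswith("\n"):
            yield buffer.rstrip("\n")
            buffer = ""


def watch(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield only the lines that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line
```

Used as `for line in watch(follow(open("app.log")), "ERROR"): print(line)`, where `app.log` is a made-up file name, this prints matching lines from the start of the file and then keeps waiting for new ones. To start from the end, as `tail -f` effectively does, call `stream.seek(0, 2)` on the file before passing it in. Even then, this is filtering, not Search: anything that scrolled by before you started watching is still gone.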

In fact, too much tailing is an anti-pattern, a sign that your Operations is not what it should be, and that centralized and secure logging is not where it should be.

So enjoy tailing logs; it has its place. Just don’t let it make you lazy.

2 thoughts on “Why are we still tailing log files?”

kubectl logs also sits in the same interesting place, between “SSH to a host and tail logs” and “use a full-fledged analytics-focused UI.”

Tailing logs is crucial for event-based debugging. Developers often need to answer how services react when processing certain events.

At the heart of troubleshooting is the question of what happens to service X when it is triggered by event Y. Half the battle is finding the problematic service X and the specific event Y that triggers the bug. You need to iterate, triggering many event types and watching the reactions of multiple services, until you discover the root cause of the issue.

Although printf debugging may be scoffed at in the modern era of debugging tools, it is still the simplest and most effective way to track down those darn bugs that only happen in production, especially when you are running short on time. In such scenarios, low-latency tailing is critical to shorten the iterative debugging period and, in effect, the MTTR.

What if there were a way to centrally tail the logs so that you don’t have to SSH into the host? What if the central tailing solution were as fast as `tail -f` on the host? I think developers around the world would love a highly scalable distributed tailing solution because, as you may have guessed, most developers are too lazy for anything more sophisticated.

Vallified

Philip O'Toole