This is a post following up on my Monitorama Baltimore 2019 talk.
Logging and Monitoring systems — Observability Systems, if you prefer — often seem to struggle to meet the needs of their users.
Following up on my earlier post, it has been pretty straightforward so far to migrate this blog from Rackspace to GCP. It’s going pretty much as expected, but the architecture is going to be slightly different from what I initially thought.
Another interesting paper came my way, thanks to the Morning Paper mailing list. Nines are Not Enough: Meaningful Metrics for Clouds discusses a topic that I deal with regularly in my role at Google.
SLIs, SLOs, and SLAs are easy to discuss in a general sense, but surprisingly subtle to put into practice. This paper, authored by Google engineers, explores why this is so, and offers a new framework for thinking about them.
Since I recently joined Google Cloud Platform (GCP), I thought it was time to get some practical experience with the platform. As a result I’m going to migrate this blog from Rackspace to GCP. Specifically, I’ll use GCE for WordPress, and Cloud SQL for the persistent database storage.
Monitoring — the measurement of your system, the gathering of telemetry, and alerting when it behaves anomalously — is key to running large-scale, modern computer systems. But what many developers today don’t realise is that monitoring can be a key part of your design cycle too.
In every field there is a question that, while it sounds interesting, betrays a naiveté and lack of sophistication.
In my field — SaaS and data platforms — it’s how much data do you have?
rqlite is an open-source distributed relational database, which uses SQLite as its storage engine. rqlite is written in Go and uses Raft to achieve consensus across a set of SQLite databases. It gracefully handles leader election, and can tolerate machine failure.
I recently presented at the InfluxDB San Francisco Meetup, on InfluxDB and the Raft consensus protocol. My talk was about the fundamental problems of distributed systems, and how InfluxDB uses Raft to solve these issues.
In the last post we examined the design and implementation of Ekanite, a system for indexing log data, and making that data available for search in near-real-time. In this final post, let’s see Ekanite in action.
In the previous post I outlined some of the high-level requirements for a system that indexes log data, and makes that data available for search, all in near-real-time. Satisfying these requirements involves making trade-offs, and sometimes there are no easy answers.
For the past few years, I’ve been building indexing and search systems, for various types of data, and often at scale. It’s fascinating work — only at scale does O(n) really come alive. Developing embedded systems teaches you how computers really work, but working on search systems and databases teaches you that algorithms really do matter.
I’ve recently been thinking about why running Services is particularly hard. By Services I mean Software-as-a-Service platforms. Over the years, I’ve written software for many different systems (embedded software, web services, databases, and distributed systems), but being involved with designing and running a SaaS platform was difficult in a whole new way: running Services is hard work.
Real-time — or near real-time — data pipelines are all the rage these days. I’ve built one myself, and they are becoming key components of many SaaS platforms. SaaS Analytics, Operations, and Business Intelligence systems often involve moving large amounts of data, received over the public Internet, into complex backend systems. And managing the incoming flow of data to these pipelines is key.