Another interesting paper came my way, thanks to the Morning Paper mailing list. Nines are Not Enough:Meaningful Metrics for Clouds discusses a topic that I deal with regularly in my role at Google.
SLIs, SLOs, and SLA are easy to discuss in a general sense, but surprisingly subtle to put into practise. This paper, authored by Google engineers, explores why this is so, and offers a new framework for thinking about them.
Since I recently joined Google Cloud Platform (GCP), I thought it’s time to get some practical experience with the platform. As a result I’m going to migrate this blog from Rackspace to GCP — specifically I’ll use GCE for WordPress, and Cloud SQL for the persistent database storage.
Continue reading Deploying Vallified on GCP
Monitoring — the measurement of your system, the gathering of telemetry, and alerting when it behaves anomalously — is key to running large-scale, modern computer systems. But what many developers today don’t realise is that monitoring can be a key part of your design cycle too.
Continue reading Monitoring: it’s not just for production
In every field there is a question that, while it sounds interesting, betrays a naiveté and lack of sophistication.
In my field — SaaS and data platforms — it’s how much data do you have?
Continue reading Stop asking “how much data do you have?”
rqlite is an open-source distributed relational database, which uses SQLite as its storage engine. rqlite is written in Go and uses Raft to achieve consensus across a set of SQLite databases. It gracefully handles leader election, and can tolerate machine failure.
With the v3 release series, rqlite can now replicate SQLite databases on a global scale, with very little effort. Let’s see it in action using the AWS EC2 cloud.
Continue reading rqlite v3: Globally replicating SQLite
I recently presented at the InfluxDB San Francisco Meetup, on InfluxDB and the Raft consensus protocol. My talk was about the fundamental problems of distributed systems, and how InfluxDB uses Raft to solve these issues.
Continue reading InfluxDB and the Raft consensus protocol
This is the last part of a 3-part series “Designing and building a search system for log data”. Be sure to check out part 1 and part 2.
In the last post we examined the design and implementation of Ekanite, a system for indexing log data, and making that data available for search in near-real-time. Is this final post let’s see Ekanite in action.
Continue reading Designing a search system for log data — part 3
This is the second part of a 3-part series “Designing and building a search system for log data”. Be sure to check out part 1. Part 3 follows this post.
In the previous post I outlined some of the high-level requirements for a system that indexed log data, and makes that data available for search, all in near-real-time. Satisfying these requirements involves making trade-offs, and sometimes there are no easy answers.
Continue reading Designing a search system for log data — part 2
This is the first part of a 3-part series “Designing and building a search system for log data”. Part 2 is here, and part 3 is here.
For the past few years, I’ve been building indexing and search systems, for various types of data, and often at scale. It’s fascinating work — only at scale does O(n) really come alive. Developing embedded systems teaches you how computers really work, but working on search systems and databases teaches you that algorithms really do matter.
Continue reading Designing a search system for log data — part 1
I’ve written my first post for the InfluxDB blog. In it I discuss the new statistics and monitoring system built into InfluxDB, starting with the 0.9.4 release. Functionality like this is critical when it comes to running a distributed database like InfluxDB.
You can check it out here.
It’s been 418 days since my first Github commit of Go code. In that time I’ve written a Syslog-to-Kafka producer, a Raft-based distributed SQLite database, a near real-time log search system, and become a core developer of InfluxDB.
Continue reading 400 days of Go
I’ve recently been thinking about why running Services is particularly hard. By Services I mean Software-as-a-Service platforms. During the years, I’ve written software for many different systems — embedded software, web services, databases, and distributed systems, but being involved with designing and running a SaaS platform was difficult in a whole new way: running Services is hard work.
Continue reading Running services is hard
Real-time — or near real-time — data pipelines are all the rage these days. I’ve built one myself, and they are becoming key components of many SaaS platforms. SaaS Analytics, Operations, and Business Intelligence systems often involve moving large amounts of data, received over the public Internet, into complex backend systems. And managing the incoming flow of data to these pipelines is key.
Continue reading Drop, Throttle, or Buffer
I’ve been thinking a lot recently about what makes computer services and products sticky — what makes users and customers come back again and again to what you’ve built. There are lots of ways to summarize it, but when it comes to systems that help technical people run their own systems, they come for the features, but they stay for the uptime.
Continue reading Come for the Features, Stay for the Uptime
This blog describes working with InfluxDB 0.8. InfluxDB 0.8 is no longer supported, and has been superseded by the 1.0 release.
I recently came across InfluxDB — it’s a time-series database built on LevelDB. It’s designed to support horizontal as well as vertical scaling and, best of all, it’s not written in Java — it’s written in Go. I was intrigued to say the least.
Continue reading InfluxDB and Grafana HOWTO
Java is the predominant language of Big Data technologies. HBase, Lucene, elasticsearch, Cassandra – all are written in Java and, of course, run inside a Java Virtual Machine (JVM). There are some other important Big Data technologies, while not written in Java, also run inside a JVM. Examples include Apache Storm, which is written in Clojure, and Apache Kafka, which is written in Scala. This makes basic knowledge of the JVM quite important when it comes to deploying and operating Big Data technologies.
Continue reading What I wish I’d been told about the JVM
Loggly recently held an elasticsearch meetup, which was a great success. One question that was repeatedly asked was how to ensure elasticsearch does not suffer a partition — known as a split-brain. This can be a particular problem in AWS EC2, where the network is subject to interruptions. It can also happen if the elasticsearch master node performs long garbage collection cycles.
One configuration that is very effective at preventing this problem is described in this post.
Continue reading Avoiding elasticsearch split-brain
I recently wrote my first post for the Loggly blog. It illustrates why host machines are often the worst place to store the logs those machines are generating.
You can check it out here.
When running a large real-time processing system, monitoring is critical. But it does more than allow you to keep an eye on your system. During development it allows you test hypotheses about how it works, how it performs when certain parameters are changed, and takes the guessing out of working with dynamic systems.
Storm, a real-time computational framework open-sourced by Twitter, is such a system and comes with a Spout, allowing messages to be streamed from a Kafka Broker.
Continue reading Monitoring Storm Kafka Spouts using Python