Real-time — or near real-time — data pipelines are all the rage these days. I’ve built one myself, and they are becoming key components of many SaaS platforms. SaaS Analytics, Operations, and Business Intelligence systems often involve moving large amounts of data, received over the public Internet, into complex backend systems. And managing the incoming flow of data to these pipelines is key.
Once you connect your systems to the Internet, it’s chaos. You can’t predict ahead of time what data you’ll receive, from where, and at what rate. And at some point your system will receive more data than it can handle. In a multi-tenant system a single individual customer may overwhelm your pipeline. This is a particular problem in systems like AWS, where that customer might be a single network hop away over a 1Gb network link.
Unless you’ve thought about how your data pipeline should behave when this happens, it’ll most likely fail spectacularly when it does, bringing your reputation down with it. In the end your design choice comes down to one or more of the following options: drop, throttle, or buffer.
This is the cheapest option for you, the system designer. The cheapest traffic to handle is that which you don’t handle at all. However, dropping data can create a seriously negative impression. Your customers will consider their data lost. If you decide to drop data you’ll need to build a notification system, which lets them know when data is being thrown away.
At a network level, a simple way to communicate to clients is to always close the network connection after dropping data, assuming the data arrives over TCP. However, network-level operations like this can cause many problems for sender — for example, a Syslog client — so sometimes it makes better sense to leave the connection up, but silently drop the data.
An alternative to dropping is throttling — deliberately slowing ingestion of the data, for the client in question.
It’s hard to implement throttling and rate-limiting well in software — you often need buffers in memory, queuing policies, and timers to drain those queues. The Linux kernel has extensive support for traffic control. Go also has some very nice rate-limiting patterns based on channels. All this can be quite CPU-intensive too.
Buffering provides the best experience for your customers, but is the most expensive for you. Apache Kafka is the best system I have worked with, when it comes to building these buffers. Its stellar performance is more than enough for most designs, and its storage is only limited by the amount of disk you are prepared to provide.
A system of buffering is only a partial answer however. A better solution combines buffering with an overflow system — route the offending traffic into a buffer other than that for the main incoming data. You can then decide how to handle the overflow — process it later, or discard it. They key characteristic is that the data is not lost.
But Measure First
When the times comes to implement one of these policies, to be able to choose, you must first measure. Your system must have measurement of all flows built-in, and if your system is multi-tenant, you must measure everything on a customer-by-customer basis. Only then you can decide if you should Drop, Throttle, or Buffer the offending traffic.