July 19th, 2016 Announcements Open Source Monitoring Go Prometheus has come of age – a reflection on the development of an open-source project By Björn "Beorn" Rabenstein
On Monday this week, the Prometheus authors released version 1.0.0 of the central component of the Prometheus monitoring and alerting system, the Prometheus server. (Other components will follow over the coming months.) This is a major milestone for the project. Read more about it on the Prometheus blog, and check out the announcement from the CNCF, which recently accepted Prometheus as a hosted project.
January 26th, 2015 Announcements Open Source Monitoring Go Prometheus: Monitoring at SoundCloud By Julius Volz, Björn Rabenstein
In previous blog posts, we discussed how SoundCloud has been moving towards a microservice architecture. Soon we had hundreds of services, with many thousands of instances running and changing at the same time. With our existing monitoring setup, mostly based on StatsD and Graphite, we ran into a number of serious limitations. What we really needed was a system with the following features:
A multi-dimensional data model, so that data can be sliced and diced at will, along dimensions like instance, service, endpoint, and method.
Operational simplicity, so that you can spin up a monitoring server where and when you want, even on your local workstation, without setting up a distributed storage backend or reconfiguring the world.
Scalable data collection and decentralized architecture, so that you can reliably monitor the many instances of your services, and independent teams can set up independent monitoring servers.
Finally, a powerful query language that leverages the data model for meaningful alerting (including easy silencing) and graphing (for dashboards and for ad-hoc exploration).
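The first of these features, the multi-dimensional data model, can be sketched in a few lines of Go. This is not the Prometheus implementation, just an illustration of the idea under assumed, hypothetical names: a metric is identified by a name plus a set of key/value labels, and every distinct label combination gets its own time series that can later be sliced and aggregated.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Labels identify one series within a metric, e.g.
// {service="stream", endpoint="/repost", method="POST"}.
type Labels map[string]string

// key flattens a label set into a stable string by sorting
// the label names, so equal label sets always collide.
func key(l Labels) string {
	names := make([]string, 0, len(l))
	for n := range l {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		b.WriteString(n + "=" + l[n] + ";")
	}
	return b.String()
}

// Counter keeps one value per label combination: the
// "multi-dimensional" part of the model in miniature.
type Counter struct{ series map[string]float64 }

func NewCounter() *Counter { return &Counter{series: map[string]float64{}} }

func (c *Counter) Inc(l Labels) { c.series[key(l)]++ }

func (c *Counter) Get(l Labels) float64 { return c.series[key(l)] }

func main() {
	requests := NewCounter()
	requests.Inc(Labels{"service": "stream", "endpoint": "/repost", "method": "POST"})
	requests.Inc(Labels{"service": "stream", "endpoint": "/repost", "method": "POST"})
	requests.Inc(Labels{"service": "stream", "endpoint": "/feed", "method": "GET"})
	fmt.Println(requests.Get(Labels{"service": "stream", "endpoint": "/repost", "method": "POST"}))
}
```

A query language over this model then amounts to filtering and aggregating series by their labels, rather than parsing structure out of flat, dotted metric names.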
All of these features existed in various systems. However, we could not identify a system that combined them all until a colleague started an ambitious pet project in 2012 that aimed to do so. Shortly thereafter, we decided to develop it into SoundCloud's monitoring system: Prometheus was born.
May 9th, 2014 Announcements Go Open Source Roshi: a CRDT system for timestamped events By Peter Bourgon
Let's talk about the stream.
The SoundCloud stream represents stuff that's relevant to you primarily via your social graph, arranged in time order, newest-first. The atom of that data model, an event, is a simple enough thing.
- User who did the thing
- Identifier of the thing that was done
- Timestamp of when it happened
Say A-Trak reposts a Skrillex track. If you followed A-Trak, you'd want to see that repost event in your stream. Easy. The difficult thing about time-ordered events is scale, and there are basically two strategies for building a large-scale time-ordered event system.
Fan out on write means everybody gets an inbox.
That's how it works today: we use Cassandra, and give each user a row in a column family. When A-Trak reposts Skrillex, we fan that event out to all of A-Trak's followers, and make a bunch of inserts. Reads are fast, which is great. But writes carry perverse incentives: the more followers you have, the longer it takes to persist all of your updates. Storage requirements also grow quadratically with user count and follower count (i.e. affiliation density). And mutations, e.g. changes in the social graph, become costly or infeasible to implement at the data layer. It works, but it's unwieldy in a lot of dimensions.
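The write-amplification problem is easy to see in a toy in-memory sketch (nothing like our actual Cassandra schema; all names are made up for illustration): one post becomes one insert per follower.

```go
package main

import "fmt"

// Event is the atom of the stream data model.
type Event struct {
	Actor, Object string
	Timestamp     int64
}

// Store keeps, per user, who follows them and a materialized inbox.
type Store struct {
	Followers map[string][]string
	Inboxes   map[string][]Event
}

// Post fans the event out on write: one insert per follower.
// Write cost grows with follower count; a read is a single lookup.
func (s *Store) Post(e Event) {
	for _, follower := range s.Followers[e.Actor] {
		s.Inboxes[follower] = append(s.Inboxes[follower], e)
	}
}

func main() {
	s := &Store{
		Followers: map[string][]string{"a-trak": {"alice", "bob"}},
		Inboxes:   map[string][]Event{},
	}
	s.Post(Event{Actor: "a-trak", Object: "skrillex-track", Timestamp: 1})
	// Every follower now holds their own copy of the event.
	fmt.Println(len(s.Inboxes["alice"]), len(s.Inboxes["bob"]))
}
```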
At some point, those caveats and restrictions started affecting our ability to iterate on the stream. To keep up with product ideas, we needed to address the infrastructure. And rather than tackling each problem in isolation, we thought about changing the model.
The alternative is fan in on read.
When A-Trak reposts Skrillex, it's a single append to A-Trak's outbox. When users view their streams, the system will read the most recent events from the outboxes of everyone they follow, and perform a merge. Writes are fast, storage is minimal, and since streams are generated at read time, they naturally represent the present reality. (It also opens up a lot of possibilities for elegant implementations of product features and experiments.)
Of course, reads are difficult. If you follow thousands of users, making thousands of simultaneous reads, time-sorting, merging, and cutting within a typical request-response deadline isn't trivial. As far as we know, nobody operating at our scale builds timelines via fan-in-on-read. And we presume that's due at least in part to the challenges of reads.
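For contrast, here is fan-in-on-read in the same toy style. This naive sketch concatenates and sorts; a read path at our scale would instead k-way merge with a heap and read a bounded number of recent events per source, but the shape of the problem is the same.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is the atom of the stream data model.
type Event struct {
	Actor, Object string
	Timestamp     int64
}

// Timeline materializes a user's stream at read time by merging
// the outboxes of everyone they follow, newest-first, cut to limit.
func Timeline(outboxes map[string][]Event, followees []string, limit int) []Event {
	var merged []Event
	for _, u := range followees {
		merged = append(merged, outboxes[u]...)
	}
	sort.Slice(merged, func(i, j int) bool {
		return merged[i].Timestamp > merged[j].Timestamp // newest first
	})
	if len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}

func main() {
	outboxes := map[string][]Event{
		"a-trak":   {{Actor: "a-trak", Object: "skrillex-track", Timestamp: 3}},
		"skrillex": {{Actor: "skrillex", Object: "new-single", Timestamp: 5}},
	}
	tl := Timeline(outboxes, []string{"a-trak", "skrillex"}, 10)
	fmt.Println(tl[0].Object, tl[1].Object)
}
```

A write is now a single append to the actor's own outbox, and the stream always reflects the current social graph, because it is computed from it on every read.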
Yet we saw potential here. Storage reduction was actually huge: we projected a complete fan-in-on-read data size for all users on the order of a hundred gigabytes. At that size, it's feasible to keep the data set in memory, distributed among commodity servers. The problem then becomes coördination: how do you reliably and correctly populate that data system (writes), and materialize views from up to thousands of sources by hard deadlines (reads)?
Enter the CRDT
If you're into so-called AP data systems, you've probably run into the term CRDT recently. CRDTs are conflict-free replicated data types: data structures for distributed systems. The tl;dr on CRDTs is that by constraining your operations to only those which are associative, commutative, and idempotent, you sidestep a lot of the complexity in distributed programming. (See: ACID 2.0 and/or CALM theorem.) That, in turn, makes it straightforward to guarantee eventual consistency in the face of failure.
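The three properties are easiest to see in the simplest state-based CRDT, a grow-only set, whose merge operation is plain set union. Union is associative, commutative, and idempotent, so replicas can exchange state in any order, with any duplication, and still converge. A minimal sketch:

```go
package main

import "fmt"

// GSet is a grow-only set, the simplest state-based CRDT:
// elements can be added but never removed.
type GSet map[string]bool

func (s GSet) Add(x string) { s[x] = true }

// Merge is set union. Because union is associative, commutative,
// and idempotent, replicas converge regardless of the order or
// number of times their states are exchanged.
func Merge(a, b GSet) GSet {
	out := GSet{}
	for x := range a {
		out[x] = true
	}
	for x := range b {
		out[x] = true
	}
	return out
}

func main() {
	r1, r2 := GSet{}, GSet{}
	r1.Add("x")
	r2.Add("y")
	ab := Merge(r1, r2)
	ba := Merge(r2, r1)       // commutative: same result either way
	again := Merge(ab, r1)    // idempotent: re-merging changes nothing
	fmt.Println(len(ab), len(ba), len(again))
}
```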
With a bit of thinking, we were able to map a fan-in-on-read stream product to a data model that could be implemented with a specific type of CRDT. We were then able to focus on performance, optimizing our reads without becoming overwhelmed by incidental complexity imposed by the consistency model.
The result of our work is Roshi, a distributed storage system for time-series events. It implements what we believe is a novel CRDT set type, closely resembling a LWW-element-set with inline garbage collection. At its core, it uses the Redis ZSET sorted set to store state, and orchestrates self-repairing reads and writes on top, in a stateless operational layer. We spent a long while optimizing the read path to support our latency and QPS requirements, and we're confident that Roshi will accommodate our exponential growth for years. It took about six developer months to build, and we're in the process of rolling it out now.
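To give a feel for the semantics (not Roshi's actual implementation, which lives in Redis ZSETs with inline garbage collection; see the repository for the real thing), here is an in-memory sketch of an LWW-element-set: each element keeps only its latest add and latest remove timestamp, and the later of the two decides membership. The tie-breaking rule here is an arbitrary choice for the sketch.

```go
package main

import "fmt"

// LWWSet sketches a last-writer-wins element set. Each key tracks
// the newest add and newest remove timestamps it has seen; keeping
// only the max makes every operation commutative and idempotent.
type LWWSet struct {
	adds, removes map[string]int64
}

func NewLWWSet() *LWWSet {
	return &LWWSet{adds: map[string]int64{}, removes: map[string]int64{}}
}

func (s *LWWSet) Add(k string, ts int64) {
	if ts > s.adds[k] {
		s.adds[k] = ts
	}
}

func (s *LWWSet) Remove(k string, ts int64) {
	if ts > s.removes[k] {
		s.removes[k] = ts
	}
}

// Contains: present if the latest add is strictly newer than the
// latest remove (this sketch breaks ties toward removal).
func (s *LWWSet) Contains(k string) bool {
	return s.adds[k] > s.removes[k]
}

func main() {
	s := NewLWWSet()
	s.Add("track:1", 10)
	s.Remove("track:1", 12)
	s.Add("track:1", 15)    // a later add wins over the earlier remove
	s.Remove("track:1", 12) // replayed stale remove is a no-op
	fmt.Println(s.Contains("track:1"))
}
```

Because replaying or reordering operations never changes the outcome, stateless layers on top can repair divergent replicas simply by re-sending what they know, which is exactly what makes self-repairing reads and writes tractable.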
Roshi is fully open-source, and all the gory technical details are in the repository, so please do check it out. I hope it's easy to grok: at the time of writing, it's 5000 lines of Go, of which 2300 are tests. And we intend to keep the codebase lean, explicitly not adding features that are outside of the tightly defined problem domain.
Open-sourcing our work naturally serves the immediate goal of providing usable software to the community. We hope that Roshi may be a good fit for problems in your organizations, and we look forward to collaborating with anyone who's interested in contributing. Open-sourcing also serves another, perhaps more interesting goal, which is advancing a broader discussion about software development. The obvious reaction to Roshi is to ask why we didn't implement it with an existing, proven data system like Cassandra. But we too often underestimate the costs of doing that: costs like mapping your domain to the generic language of the system, learning the subtleties of the implementation, operating it at scale, and dealing with bugs that your likely novel use cases may reveal. There are even second-degree costs: when software engineering is reduced to plumbing together generic systems, software engineers lose their sense of ownership, which is the foundation of craftsmanship and software quality.
Given a well-defined problem, a specific solution may be far less costly than a generic version: there's a smaller domain translation, a much smaller surface area, and less operational friction. We hope that Roshi stands in evidence for the case that the practice of software engineering can be a more thoughtful and crafted process. Software that is "invented here" can, in the right circumstances, deliver outstanding business value.
July 24th, 2012 Go Open Source Go at SoundCloud By Peter Bourgon
SoundCloud is a polyglot company, and while we’ve always operated with Ruby on Rails at the top of our stack, we’ve got quite a wide variety of languages represented in our backend. I’d like to describe a bit about how—and why—we use Go, an open-source language that recently hit version 1.
It’s in our company DNA that our engineers are generalists, rather than specialists. We hope that everyone will be at least conversant about every part of our infrastructure. Even more, we encourage engineers to change teams, and even form new ones, with as little friction as possible. An environment of shared code ownership is a perfect match for expressive, productive languages with low barriers to entry, and Go has proven to be exactly that.
Go has been described by several engineers here as a WYSIWYG language. That is, the code does exactly what it says on the page. It’s difficult to overemphasize how helpful this property is toward the unambiguous understanding and maintenance of software. Go explicitly rejects “helper” idioms and features like the Uniform Access Principle, operator overloading, default parameters, and even exceptions, on the basis that they create more problems through ambiguity than they solve in expressivity. There’s no question that these decisions carry a cost of keystrokes—especially, as most new engineers on Go projects lament, during error handling—but the payoff is that those same new engineers can easily and immediately build a complete mental model of the application. I feel confident in saying that time from zero to productive commits is faster in Go than any other language we use; sometimes, dramatically so.
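The error-handling keystrokes lamented above look like this in practice: every fallible call returns an error value that the caller must check explicitly, so the failure paths sit right on the page. A small illustrative example (the function and its rules are made up for this sketch):

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parsePort shows Go's explicit error handling: no exceptions,
// just returned error values checked at every step.
func parsePort(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("not a number: %v", err)
	}
	if n < 1 || n > 65535 {
		return 0, errors.New("port out of range")
	}
	return n, nil
}

func main() {
	if p, err := parsePort("8080"); err == nil {
		fmt.Println("listening on", p)
	}
	if _, err := parsePort("http"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Verbose, yes, but there is no hidden control flow: a reader of the call site sees exactly where the function can fail and what happens when it does.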
Go’s strict formatting rules and its “only one way to do things” philosophy mean we don’t waste much time bikeshedding about style. Code reviews on a Go codebase tend to be more about the problem domain than the intricacies of the language, which everyone appreciates.
Further, once an engineer has a working knowledge of Effective Go, there seems to be very little friction in moving from “how the application behaves today” to “how the application should behave in the ideal case.” Should a slow response from this backend abort the entire request? Should we retry exactly once, and then serve partial results? This agent has been acting strangely: can we install a 250ms timeout? Every high-level scenario in the behavior of a system can be expressed in a straightforward and idiomatic implementation, without the need for libraries or frameworks. Removing layers of abstraction reduces complexity; plainly stated, simpler code is better code.
Go has some other nice properties that we’ve taken advantage of. Static typing and fast compilation enable us to do near-realtime static analysis and unit testing during development. It also means that building, testing and rolling out Go applications through our deployment system is as fast as it gets.
In fact, fast builds, fast tests, fast peer reviews, and fast deployment mean that some ideas can go from the whiteboard to running in production in less than an hour. For example, the search infrastructure on Next is driven by Elastic Search, but managed and interfaced with the rest of SoundCloud almost exclusively through Go services. During validation testing, we realized that we needed the ability to mark indexes as read-only in certain circumstances, and needed the indexing applications to detect and respect this new dimension of index state. Adding the abstraction in the code, polling a new endpoint to reliably detect the state, changing the relevant indexing behaviors, and writing tests for them all took half an afternoon. By the evening, the changes had been deployed and had been running under load for hours. That kind of velocity, especially in a statically typed, natively compiled language, is exhilarating.
I mentioned our build and deployment system. It’s called Bazooka, and it’s designed to be a platform for managing the deployment of internal services. (We’ll be open-sourcing it pretty soon; stay tuned!) Scaling 12-Factor apps over a heterogeneous network can be thought of as one large, complex state machine, full of opportunities for inconsistency and race conditions. Go was a natural choice for this kind of job. Idiomatic Go is safely concurrent by default; Bazooka developers can reason about the complexity of their problem without being distracted by the complexity of their tools. And Bazooka makes use of Doozer to coordinate its shared state, which—in addition to being the only open-source implementation of Paxos in the wild (that we’re aware of)—is also written in Go.
Altogether, SoundCloud maintains about half a dozen services and over a dozen repositories written entirely in Go. And we're increasingly turning to Go when spinning up new backend projects.
Interested in writing Go to solve real problems and build real products? We’d love to hear from you!