Understanding Observability (and Monitoring) with Christine Yen

Monitoring and observability are near and dear to my own heart, so this week's episode is exciting: Christine Yen, Cofounder & CEO of Honeycomb, joins me to talk about observability, why dashboards aren't as helpful as you think, and the value of being able to ask questions of your own application and infrastructure when you're troubleshooting.

About Christine Yen

Christine delights in being a developer in a room full of ops folks. As a cofounder of Honeycomb.io, a tool for engineering teams to understand their production systems, she cares deeply about bridging the gap between devs and ops with technological and cultural improvements. Before Honeycomb, she built out an analytics product at Parse (bought by Facebook) and wrote software at a few now-defunct startups.


Transcript

Mike Julian: This is the Real World DevOps podcast and I'm your host, Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps: from the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.


Mike Julian: This episode is sponsored by the lovely folks at Influx Data. If you're listening to this podcast you're probably also interested in better monitoring tools and that's where Influx comes in. Personally I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools, Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to Influx Data for helping to make this podcast possible.


Mike Julian: Hi folks. Welcome to another episode of the Real World DevOps podcast. I'm your host, Mike Julian. This week's conversation is one I've been wanting to have for quite some time: I'm chatting with Christine Yen, CEO and co-founder of Honeycomb, and previously an engineer at Parse. Welcome to the show.


Christine Yen: Hello. Thanks for having me.


Mike Julian: I want to start this conversation off with what might sound like a really foundational question. What are we talking about when we're all talking about observability? What do we mean?


Christine Yen: When I think about observability, and when I talk about observability, I like to frame it in my head as the ability to ask questions of our systems. And the reason we've got that word, rather than just saying, "Okay, well, monitoring is asking questions about our system," is that we really feel like observability is about being a little bit more flexible and ad-hoc about asking those questions. Monitoring brings to mind defining very strict parameters within which to watch your systems, or thresholds, or putting your systems in a jail cell and monitoring that behavior. Whereas we're like, "Okay, our systems are going to do things, and those things aren't necessarily bad. Let's be able to understand what's happening and why." Let's observe and look at the data that your systems are putting out, as well as thinking about how asking more free-form questions might impact how you even think about your systems, and how you even think about what to do with that data.


Mike Julian: When you say asking questions what do you mean?


Christine Yen: When I say asking questions of my system, I mean being able to proactively investigate and dig deeper into data, rather than passively sitting back and looking at the answers I've curated in the past. To illustrate this, and to compare observability a little more directly with monitoring, especially traditional monitoring where we're curating these dashboards: what we're essentially doing is looking at sets of answers to questions that we posed when we pulled those dashboards together. So if a dashboard has existed for six months, the graphs that I'm looking at to answer a question like "What's going on in my system?" are answers to the questions that I had in mind six months ago, when I tried to figure out what information I would need to decide whether my system was healthy or not. In contrast, an observability tool should let you say, "Oh, is my system healthy? What does healthy mean today? What do I care about today?" And if I see some sort of anomaly in a graph, or I see something odd, I should be able to continue investigating that thread without losing track of where I am, or again relying on answers from past questions.


Mike Julian: So does that mean that curating these dashboards to begin with is just the wrong way to go? Like is it just a bad idea?


Christine Yen: I think dashboards can be useful, but I think that overuse of them has led to a lot of really bad habits in our industry.


Mike Julian: Yeah, tell me more about the bad habits there.


Christine Yen: An analogy I like to use is when you go to the doctor and you're not feeling well. The doctor looks at you and asks, "What doesn't feel well? Oh, it's your head. What kind of pain are you feeling in your head? Is it acute? Is it just kind of a dull ache? Oh, it's acute. Where in your head?" They're asking progressively more detailed questions based on what they learned at each step along the way. Honestly, this parallels how humans naturally solve problems. In contrast, I think the bad habits that dashboards lead us to build are the equivalent of a doctor saying, "Oh, well, based on the charts from the last three times you visited, you broke your ankle and you skinned your knee." Say you go to the doctor with a skinned knee: "Oh okay, you broke your ankle last time, did you break your ankle again? No? Okay, did you ... How's your knee doing?"


With dashboards, we have built up this belief that these answers to past questions we've asked are going to continue to be relevant today. And there's no guarantee that they are. Especially for engineering teams that are staying on top of incidents, responding, and fixing the things they've found along the way: you're going to continue to run into new problems, new kinds of strange interactions between routine components. And you're going to need to be able to ask new questions about what your systems are doing today.


Mike Julian: It seems like with that dashboard problem we have the same issue with alerting. I've started calling it a kind of reflexive alerting strategy, where it's like, "Oh God, we just had this problem. Well, we'd better add a new alert so we catch it next time it happens." It's like, well, how many times is that new alert going to fire? Probably never. You're probably never going to see that issue again. Dashboards are the same way. What you're describing, God, I've seen this a hundred billion times: someone curates a dashboard, and it's like, "Okay, now that we have this alert, the first thing is to go look at the dashboards and see what went wrong." And I'm like, "Well, no, the graphs look fine." So, no problem. But clearly the site's down.


Christine Yen: Yeah, there's a term that we've been playing with: dashboard blindness. Where if it doesn't exist in the dashboard, it clearly hasn't happened, or it just can't exist, because people start to feel like, "Okay, we have so many dashboards, one of them must be showing something wrong if there's something going wrong in our system." But that's not always going to be the case. To expect that means you have this unholy ability to predict the future of your system, and man, if people could really predict the future, I would do a lot more things than just build dashboards with that.


Mike Julian: Right. Rather than just shit on dashboards forever what is a good use of a dashboard? Like presumably you have dashboards in your office somewhere?


Christine Yen: Yes. I think dashboards are great jumping-off points. And I actually very much feel like dashboards are a tool; they've just been overused. So I absolutely don't want to shit on dashboards, because they serve a purpose of providing kind of a unified entry point. Right? What are our KPIs? What are the things that matter the most? Great, let's agree on those. Let's go through the exercise of agreeing on those, because as much as we would like to think this is a technology problem that can be solved with tools, a lot of the time these sorts of things require humans and process to determine. So let's decide on our KPIs, and let's put them up on a wall, but expect, and spread the understanding, that the wall is only going to tell us when to start paying attention. Dashboards themselves can't be our way of debugging, or our way of interacting with our systems.


Mike Julian: Right. So in other words, that dashboard is going to tell you that something has gone wrong, but it won't tell you what?


Christine Yen: Right.


Mike Julian: I think that's a fantastic thing. And that actually mirrors a lot of the current advice around alerting strategy, too: you find your SLIs and alert only on an SLI, not on these low-level system metrics.


Christine Yen: Yeah, I love watching this conversation evolve. I think at Monitorama 2018, something like three talks in a row were all about alert fatigue. And it's so true to see these engineering teams fall into this purely reactive mode of, "Okay, well, if this happened, this is how we will prevent it from happening again." And each postmortem just spins out more alerts and more dashboards. Inevitably people end up in an unsustainable state, with hundreds or thousands of dashboards to comb through. And then their problem isn't "how do I find out what's going on?" It's "how do I figure out which dashboard to look at?" Which, again, is looking at things from the wrong perspective. Dashboards tell you that something has happened, and you need a tool that's flexible enough to follow your human problem-solving brain patterns to figure out what's actually wrong.


Mike Julian: Funny you mention Monitorama. There was a talk, I want to say 2016 maybe, I think it was Twitter, where they had this problem of alert overload, just constant alerts. So they decided, "You know what we're going to do? We're just going to delete them all." Done. "We'll just start over." I'm like, "That's such a fantastic idea." People think that I'm insane when I recommend it, but hey, Twitter did it, so I'm sure it's fine.


Christine Yen: Yeah, I mean, drastic times call for drastic measures. It's funny, especially being in the vendor seat, talking to a lot of different engineering teams about their tools and how they solve problems with their production systems. There is definitely an element of a safety-blanket feeling. Right? "Okay, but we need all of our alerts. How will we know when anything is going wrong?" Or, "We need all of our alarms, for all time, at full resolution." And I get it. There are patterns that folks get into, and it's how they know how to solve their problems, especially when things are on fire. It feels like you don't have time to step back and change your process when you're like, "No, this is what I'm doing to keep most of the fires under control." And I think this is why communities like yours and Monitorama's are so good: we have ways to share different techniques for addressing this, so that folks who are in the piles-and-piles-of-alerts hole can dig themselves out of it and start to find ways to address that.


Mike Julian: Yep, yep, completely agreed. So I want to take a few steps back and talk about monitoring. There's been a lot of discussion about how observability is not monitoring. Monitoring is, I guess, looking at things that we can predict. And feel free to correct me at any time here: we think through failure modes that could possibly happen, and design dashboards or alerts for those failure modes that we can predict. Whereas what you were describing earlier, observability, is not that; it's for the things that we can't predict. Therefore, we have to make the data able to be explored. Is that about right?


Christine Yen: That's about right. For anyone in the audience knee-jerking about that, I want to clarify: I really think of observability as a superset of monitoring. And the exercise of thinking through what might go wrong is still a necessary exercise. It's the equivalent of software developers still writing tests. You should still be doing this due diligence: what might go wrong? What will be the signals when it goes wrong? What information will I need in order to address it once it does go wrong? All of these are still important parts of any release process. But instead of framing it as, "Here's the signal, I'm going to preserve it as this one metric and immortalize it as the only way to know if something is going wrong," what we'd say, what Honeycomb would encourage you to do, is take those signals, whatever metric or whatever piece of metadata you'd want in order to identify that something is going wrong, and instead of immortalizing them, flattening them into pre-aggregated metrics, capture them as events. And, you know, maybe it does make sense to define an alert or define a graph somewhere so that you can keep an eye on it. But instead of freezing the sort of question that you might ask, make sure you have the information available later if you want to ask a slightly different take on that question, or have a little bit of flexibility down the road.
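
To make that contrast concrete, here's a minimal sketch in Python. The `send_event` helper is a hypothetical stand-in for whatever pipeline you use, not Honeycomb's actual API; the point is that the raw context travels with every request instead of being flattened into a lone counter ahead of time.

```python
import json
import time

def send_event(event: dict) -> None:
    # Hypothetical transport: a real setup would ship this to your
    # observability backend; here we just print the structured event.
    print(json.dumps(event))

def do_work(endpoint: str) -> bool:
    # Stand-in for real request handling.
    return endpoint != "/flaky"

# Pre-aggregated habit: one counter, all context thrown away up front.
error_count = 0

def handle_request_metric_style(app_id: str, endpoint: str, sdk: str) -> None:
    global error_count
    if not do_work(endpoint):
        error_count += 1  # later, all you can ask is "how many?"

# Wide-event habit: keep the context so new questions stay askable later.
def handle_request_event_style(app_id: str, endpoint: str, sdk: str) -> None:
    start = time.monotonic()
    ok = do_work(endpoint)
    send_event({
        "name": "http_request",
        "app_id": app_id,          # high-cardinality, and that's fine here
        "endpoint": endpoint,
        "sdk": sdk,
        "success": ok,
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
    })

handle_request_event_style("app_42", "/flaky", "ios-1.9.3")
```

Nothing stops you from also alerting on a count derived from these events; the difference is that the derivation happens at query time instead of being the only thing you kept.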


Mike Julian: So thinking through all the times that I've instrumented code, hasn't this always been possible?


Christine Yen: It has. I would say not-


Mike Julian: I feel a very large but coming on.


Christine Yen: I think that as engineers we are taught to think about, or understand, the constraints of the data store we're writing into when we write into it. We're taught to think about the type of data we're writing and the trade-offs, and traditionally the two kinds of data stores we've used, either a log store or a time-series metrics store, have limitations that restrict the expressiveness of the metadata we can send. Talking specifically about things like high-cardinality data and time-series metrics, we've just been conditioned that we can't send that sort of information over there. Or, "Okay, logs are just going to be read by human eyeballs and grep, so I'm not going to challenge myself to structure them, or to put information potentially useful for analytical queries into my logs." I think the known trade-offs of the end result have shaped habits in instrumentation, when, like you say, all this should have been possible all along. We just haven't done it, because the end tools haven't supported the sort of very high-level, flexible analytical queries that we can and should be asking today.
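
One small habit shift she's describing is emitting logs as structured records rather than prose aimed only at grep. A sketch using nothing but the standard library; the field names are illustrative.

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")

# Grep-era habit: a sentence only human eyeballs (and grep) can use.
logging.info("request from app 42 to /search took 312ms and failed")

# Structured habit: the same facts as fields, so a tool can later filter,
# group, or aggregate on any of them without regex archaeology.
def log_event(**fields) -> None:
    logging.info(json.dumps(fields, sort_keys=True))

log_event(msg="request_finished", app_id="42", endpoint="/search",
          duration_ms=312, status=500)
```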


Mike Julian: Yeah, you used a word there that I want to call attention to, because it's kind of the crux of all of this, which is high-cardinality. I have had the question come up many, many times of what in the world is it? And it's always couched in terms of, "I think of myself as quite a smart person, but what the shit is high-cardinality?" It's one of those things where you're afraid to ask the question, because you feel like you should know it, like everyone thinks you should. I know it because I had to go figure out what in the world everyone was talking about. So what is it? What are we talking about here?


Christine Yen: I'm glad you asked. This is also, for the record, why our marketing folks have tried to shy away from us using this term publicly: a lot of people don't know what it means, and they're afraid to ask. So thank you for asking.


Mike Julian: But it's so core to everything we're talking about.


Christine Yen: So, at a very clinical level, high-cardinality describes a quality of data in which there are many, many unique values. So types of data that are high-cardinality are things like IP addresses or social security numbers, not that you would ever store those in your, in any data-


Mike Julian: And if you were, please don't.


Christine Yen: Things that are lower cardinality are things like the browser of the person issuing the request, or things like AWS instance type. Yes, there are a lot of them, but there are far fewer of them than there are IP addresses. And-


Mike Julian: There's a known bound on that, measured in maybe hundreds.


Christine Yen: Yeah. Yeah, and I think the reason that we're talking about this term more, and it's coming up more, is that we're moving towards a more high-cardinality world in our infrastructure, in our systems. And when I say things like that, I mean that 10 or 15 years ago it was much more common to have a monolithic application on five servers, where when you needed to find out what was going wrong, you really only had five different places to look, or five different places to start looking. Now, even at that basic level, instead of one monolith we might have 10 microservices spread across 50 containers, and then 500 Kubernetes pods all shuffling in and out over the course of a day. And even just that basic question, "Which process is struggling?" is much harder to answer now, because we have many more of these combinations of attributes, which then produce a high-cardinality data problem. And I think that's something people are starting to experience more of in their own lives, and that a lot of vendors and open source metrics projects are starting to recognize they also have to deal with, as an effect of the industry moving in this technical direction.
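
To put a number on "cardinality": it's simply the count of distinct values a field takes on. A toy illustration with made-up data:

```python
# A tiny, invented batch of request events to show what "cardinality" means:
# the number of distinct values a field takes on.
events = [
    {"instance_type": "m5.large", "pod": f"web-{i % 7}", "user_id": f"u{i}"}
    for i in range(1000)
]

for field in ("instance_type", "pod", "user_id"):
    distinct = len({e[field] for e in events})
    print(f"{field}: {distinct} distinct values across {len(events)} events")

# instance_type: 1 (low cardinality), pod: 7 (low), user_id: 1000 (high).
# A metrics store that creates one time series per distinct value copes
# fine with the first two and falls over on the third.
```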


Mike Julian: One of my favorite examples of this comes from back in the days when I ran Graphite clusters. The common advice was: don't include request IDs or user IDs in a metric name. And ten to one, if you're running Graphite, that's still pretty common advice, because if you do, well, it explodes your Graphite server. The number of Whisper files that get created is astronomical. So the end result is that we just don't do it; you just don't record that data. But what you're saying is, no, you actually do need that data. Not having it is hampering your exploration when you're trying to answer questions.


Christine Yen: Absolutely. And in this case with request IDs or user IDs, again, there might be some folks in the audience being like, "Well, I'm Pinterest and I have the luxury of not having to worry about individual user IDs." Maybe, but I guarantee that there are some high-cardinality attributes that you do care about, that are important for debugging. For us at Parse it was app ID. We were a platform, so we had tens, hundreds, eventually millions of unique apps all sending us data, and we needed to be able to distinguish, "Okay, well, this one app is doing something terrible. Let's blacklist it and go on with our day." And if it's not user ID, for some folks it might be shopping cart ID, or the Mongo instance a request is talking to. Our infrastructure has gotten so much more complicated. There are so many more intersections of things. In Graphite world, you would need to define so many individual metrics to figure out that a particular combination of SDK, on a particular node type, hitting a particular endpoint, for a particular class of user, was misbehaving; you'd have to track so many different combination metrics to find out that one intersection of those was the problem. But more and more, that's our reality. And more and more, our tools need to support this very flexible combining of attributes.
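
A sketch of the kind of ad-hoc breakdown she's describing, over raw events rather than pre-named metrics. The data and field names are invented; what matters is that the grouping key is chosen at query time, not at instrumentation time.

```python
from collections import defaultdict
from statistics import mean

# Pretend these are raw request events with their context still attached.
events = [
    {"app_id": "a1", "sdk": "ios-2.1",     "endpoint": "/push",  "duration_ms": 40},
    {"app_id": "a1", "sdk": "ios-2.1",     "endpoint": "/query", "duration_ms": 55},
    {"app_id": "a9", "sdk": "android-1.4", "endpoint": "/query", "duration_ms": 900},
    {"app_id": "a9", "sdk": "android-1.4", "endpoint": "/query", "duration_ms": 870},
    {"app_id": "a2", "sdk": "ios-2.1",     "endpoint": "/query", "duration_ms": 60},
]

def breakdown(events, *fields):
    """Group events by an arbitrary combination of fields chosen right now."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[f] for f in fields)].append(e["duration_ms"])
    rows = [(key, mean(vals), len(vals)) for key, vals in groups.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# "Which app/SDK/endpoint combination is misbehaving?" No metric was
# pre-declared for this particular intersection; the question is asked late.
for key, avg_ms, n in breakdown(events, "app_id", "sdk", "endpoint"):
    print(key, f"avg={avg_ms:.0f}ms", f"count={n}")
```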


Mike Julian: Right. Yeah, the more we start to build customer-facing applications, especially applications where the customer can kind of have free rein over what they're doing and what they're sending... like, I don't know, a public API means that one customer, using one version of the API, with one particular SDK, could cause everyone to have a very bad day. And if you're aggregating all of that, how are you going to find out it's them? All you see is that the service is sucking.


Christine Yen: 100%. Yeah, ultimately we're all moving toward a world where we're running multi-tenant platforms, and if not user-facing platforms, then often shared services inside larger companies. Your co-workers are your customers, and you still need to be able to distinguish between that one team using 70% of your resources and everyone else.


Mike Julian: Right. Yep. So it seems to me that there's kind of a certain level of scale and engineering maturity required before you can really begin to leverage these techniques. Is that actually true?


Christine Yen: I don't think there is. There's no "you must be this tall to ride" bar on the observability journey. There are steps along the way that let you use more and more of these techniques, but when I think about teams that are farther along their journey than others, it's often more a matter of mindset than anything technical. When I think of steps along the observability maturity model (and Liz Fong-Jones, our new developer advocate, formerly with Google, is actually working on something along these lines for release, I think sometime in June), it's part tools, but it's also process and people. I think there are some changes afoot in the industry in how people think about their systems, how people instrument, and how people set up their systems in order to be observable, and those all factor into how effectively teams are able to pick up some of these techniques and start running.


And choice of tooling is a catalyst for this. Ideally you have a tool that, sorry Graphite, lets you capture the high-cardinality attributes that you want to, but that's only one piece. And I think we're in for a lot of really fun cultural conversations about what it means to have a data-driven culture, and what it means to be grounded by what's actually happening in production when you're trying to figure out why the things you're observing don't line up with what you expect.


Mike Julian: All right. So you've given a lot of talks lately and over the past year or two about observability driven development, which sounds really cool. Can you tell us what it is?


Christine Yen: Yeah. Observability-driven development, or, as I like to say when I kind of zoom out, just observability and the development process, is a way of trying to bring the conversation about observability away from pure ops land, or pure SRE land, and into the part of the room where developers and engineers hang out. My background is much more that of an engineer; my co-founder, Charity, comes much more from the ops side of the room. And we've really started to see observability as basically a bridge that allows and empowers software engineers to think more about production and really own their services.


And one of the things I've pressed on in these talks about how observability can benefit the development process is what a positive feedback loop it is to be looking at production even way before I'm at the point of shipping code. There are so many spots along the development process: when you're figuring out what to build, how to go about building it, what algorithm to choose. Or, "Hey, I've written this. My tests passed, but I'm not totally sure whether it works." There are so many spots where, if developers gained this muscle of "Hey, let me check my theory against what's actually happening in production," people could ship faster, better code, and be a lot more confident in the code they're pushing out there in the first place.


My favorite example is from one of our customers, Geckoboard. They're obviously a very data-driven culture; their primary business is providing dashboards for KPI metrics. They were telling me the other day about a project that their PMs were running, actually, where the PMs were the primary users, not the engineers, and where they ultimately had an incomplete problem to try and solve. And their PMs were like, "Well, we could have the engineers go off and try to come up with a perfect solution, or we could come up with, like, three possible approaches to solving this problem, run these experiments in production, capture the results in Honeycomb, and then actually look at what the data says about how these algorithms are performing." And the key here is that they're actually running it on their data. Right?


There's a realism that looking at real production data gets you that is so much better than sitting around debating theoreticals. They were able to say, "Okay, well, we've had these three implementations running in parallel, and looking at the data, this one seems like the clear winner. Great, let's move forward with this implementation." And they can feel confident that it's going to continue behaving well, at least for the foreseeable future, so long as traffic remains roughly the same.
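
A minimal sketch of that experiment pattern, with invented variant names and a hypothetical `send_event` transport: tag every result with the implementation that produced it, and let production data pick the winner.

```python
import json
import random
import time

def send_event(event: dict) -> None:
    print(json.dumps(event))  # hypothetical transport to your event store

# Two made-up candidate implementations of the same behavior.
def rank_alphabetical(items):
    return sorted(items)

def rank_by_length(items):
    return sorted(items, key=len)

VARIANTS = {"alphabetical": rank_alphabetical, "by_length": rank_by_length}

def handle_request(items):
    # Assign a variant per request, run it, and record which implementation
    # produced the result, so the comparison happens later on real traffic.
    name = random.choice(list(VARIANTS))
    start = time.monotonic()
    result = VARIANTS[name](items)
    send_event({
        "name": "ranking_experiment",
        "variant": name,
        "duration_ms": round((time.monotonic() - start) * 1000, 3),
        "result_size": len(result),
    })
    return result

handle_request(["pear", "fig", "apple"])
```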


Again, these are bad habits that people have fallen into, right? Where devs are like, "Okay, monitoring is something I need to add right before I ship, just so that the ops folks will stay off my back when I tell them that everything is fine," or it's something the ops folks are going to look at in order to come yell at me. I don't know, but that shouldn't be the only time we're thinking about instrumentation. That shouldn't be the only time we, and I'm speaking for software developers here, are thinking about what will happen in production. Because at every stage, you know, more and more people are using feature flags to release their code. Cool. You should be capturing those feature flags in your instrumentation. Alongside, "Hey, cool, what does user X think about this thing we've feature-flagged them into?" you should be looking at: what does the performance of your system look like for folks who have that feature flag turned on or turned off? Are your monitoring, metrics, and observability tools flexible enough to capture that? Isn't that just as interesting as the qualitative "does user X like this new feature?" It's got to be.
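
If you're already releasing behind feature flags, the change she's suggesting is small: attach the flag state to every event so performance can be compared with the flag on versus off. A sketch with a hypothetical flag lookup and event transport; none of these helpers are a real library's API.

```python
import json
import time

def flag_enabled(flag: str, user_id: str) -> bool:
    # Hypothetical stand-in for a real feature-flag client.
    return hash((flag, user_id)) % 2 == 0

def send_event(event: dict) -> None:
    print(json.dumps(event))  # hypothetical transport

def old_ranker(query):
    return [query]

def new_ranker(query):
    return [query, query.upper()]

def handle_search(user_id: str, query: str):
    use_new_ranker = flag_enabled("new_ranker", user_id)
    start = time.monotonic()
    results = new_ranker(query) if use_new_ranker else old_ranker(query)
    send_event({
        "name": "search",
        "user_id": user_id,
        "flag.new_ranker": use_new_ranker,  # now you can break down by this
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
        "result_count": len(results),
    })
    return results

handle_search("u123", "observability")
```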


And there are so many things that are starting to be part of the development process that observability tools should be tapping into, and should be encouraging, in order to break down this wall between developers and operators. Because ultimately, like you said, more and more we're building user-facing systems, and at the end of the day our goal has to be delivering a great experience for those users.


Mike Julian: Right. Yeah, we're all on the same team here.


Christine Yen: We're all on the same team.


Mike Julian: So let's say that I'm a listener to this show, but I don't use Honeycomb, I can't use Honeycomb for whatever reason, but I really like all of these ideas. I want more of this for me. How can I get started with it? Like are there ways I can implement this stuff with open source technologies?


Christine Yen: There are probably some. First, you want a data store that is flexible enough to support these operations. Right? So you should be looking for something that lets you capture all the bits of metadata that you know are important to your business. For Parse, to use it as an example, that was things like app ID and the operating system version of the client. Parse was a mobile backend-as-a-service, so we had a bunch of SDKs you could use to talk to our API. So when we were evaluating the, quote-unquote, health of that service, it was: which app is sending this traffic? What SDKs are they using? Which endpoints are they hitting? Those mattered to our business, and those are also, incidentally, much easier for developers to map to code when talking about health or anomalies than traditional monitoring system metrics.
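
One practical way to do that, sketched below with invented field names and a hypothetical `send_event` transport, is to capture those fields once in a request wrapper so every handler's events carry them automatically.

```python
import json
import time

def send_event(event: dict) -> None:
    print(json.dumps(event))  # hypothetical transport

def with_instrumentation(handler):
    """Wrap a handler so business-relevant metadata rides along on every event."""
    def wrapped(request: dict):
        start = time.monotonic()
        status = 500
        try:
            status = handler(request)
            return status
        finally:
            send_event({
                "name": "api_request",
                "app_id": request.get("app_id"),
                "sdk": request.get("sdk"),
                "endpoint": request.get("path"),
                "status": status,
                "duration_ms": round((time.monotonic() - start) * 1000, 2),
            })
    return wrapped

@with_instrumentation
def query_handler(request: dict) -> int:
    # Stand-in for a real endpoint; returns an HTTP status code.
    return 200

query_handler({"app_id": "a42", "sdk": "ios-2.1", "path": "/classes/query"})
```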


So identify those useful pieces of metadata, and make sure your tool can support whatever interesting slices along those pieces of metadata you'll want. And honestly, again, there might be some folks in the audience thinking, "Well, I can do this with my data tools." I don't know how many data scientists you have in your listenership, and it's true, lots of data science tools can do that. I know that for our intents and purposes, as an engineering team at Honeycomb, we care about real time, so that tends to be something that disqualifies many of the data science tools.


But I think that, more than tool choice, folks who are excited about observability, folks who are looking for the next step beyond monitoring, should really start looking at places in their development process, or release process, where they're relying on intuition rather than data. Right? Where else can we be validating our assumptions? Where else can we be checking our expectations against what is actually happening out there in the wild? This culture and process is really what that observability-driven concept is trying to get at: where can you be more regularly, efficiently, and naturally looking to production to inform development, in order to deliver a great experience for your customers?


Mike Julian: Yeah, that's fantastic advice. This has been absolutely wonderful. Thank you so much for joining me. Where can people find out more about you and your work?


Christine Yen: The Honeycomb blog is a great place to find a mix of stories and more conceptual posts: Honeycomb.io/blog. We actually also have our own podcast; it's called the o11ycast. I think it's at o11y.fm. And of course there's the Honeycomb Twitter feed, and we have a community Slack as well, for folks who just want to talk about observability and want to get a chance to play around.


Mike Julian: Yeah, awesome. As a parting story, I was on one of the first trials of Honeycomb, way back when it was still closed, and I can't remember where I read it. It might have been part of the in-app documentation, it might have been something that Charity said on Twitter, but it was like, "Don't use Honeycomb for WordPress; that's not what we're built for." At the time I had about a 100-node WordPress cluster, so I'm like, "You know what, I'm going to use this for WordPress."


Christine Yen: Awesome.


Mike Julian: I did actually find some interesting things with it, which I found pretty hilarious.


Christine Yen: Cool.


Mike Julian: So there you go. I believe you do actually have a free trial as well now?


Christine Yen: We do. We have a free trial. We also have the community edition; it's a little bit smaller, but it should be enough for folks to get a feel for what Honeycomb can offer. A note about the WordPress disclaimer: I'm glad you got value out of it. I think that's awesome. I would also say that a 100-node WordPress cluster is a whole lot more complicated than what we had in mind when we said that early on. And I think the distinction we wanted to make there was, you know, if you have a simple system, maybe you don't need this much flexibility. Maybe whatever you have set up is working fine. Because ultimately, as we've talked about over the course of this podcast, observability involves changes not just to your tooling, but to how you work and how you think about your systems. And that disclaimer was really there to make sure folks were interested in investing a little bit in all of that.


Mike Julian: Yeah. Yeah, absolutely.


Christine Yen: I'm glad you overcame and tried it out.


Mike Julian: All right. Well thank you so much for joining. It's been wonderful.


Christine Yen: Thank you. This has been a lot of fun. I'm a big fan.


Mike Julian: Wonderful.


Christine Yen: Thanks.


Mike Julian: And to everyone listening, thanks for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com, on iTunes, Google Play, or wherever it is you get your podcasts. I will see you in the next episode.


VO: This has been a HumblePod production. Stay humble.