Observability & Robots with Ian Sherman

When I heard about a company doing observability on robots in the physical world, I was hooked--I had to know more! Thankfully, my guest this week was all too happy to talk about what he and his team at Formant.io are doing, how it works, and the challenges they run into. Some stuff you can look forward to in this episode: the robot just got stuck in a puddle--how does it know? What considerations do you make for the safety of humans around the robot, so they don't end up getting whacked by it? And, of course, a whole lot more.

About Ian Sherman

Ian Sherman is Head of Software at Formant, a company building cloud infrastructure for robotics. Prior to Formant, Ian led engineering teams at Google X and Bot & Dolly. The through line of his career has been tool building, for engineers and artists alike. He’s inspired by interdisciplinary collaboration of all types; currently this takes the form of applying patterns and practices from distributed systems operations to the relatively nascent field of robotics.



Mike: This is the Real World DevOps Podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing the awesome work in the world of DevOps, from the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find. This episode is sponsored by the lovely folks at Influx Data. If you're listening to this podcast, you're probably also interested in better monitoring tools and that's where Influx comes in.

Personally I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database, Influx DB, but you may not be as familiar with their other tools. Telegraf for metrics collection from systems, Chronograf for visualization and Kapacitor for real time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of that out at Influxdata.com. My thanks to Influx Data for helping make this podcast possible.

Mike: Robots. You apparently are working at some company that does observability for robots and I'm a little confused because like what in the world is this all about? Do robots actually need observability?

Ian: Yeah. I work at a company called Formant. We are about a year and a half old and we're focused on a lot of problems in supporting robots, but observability for robotics specifically is very important to us. I think it's representative of the type of concern that hasn't historically been important in robotics, but is becoming more so as we ship robots to more and more customers, deploy fleets of robots in semi-structured environments, and generally see their numbers increase in the wild.

Mike: These robots, are these like Johnny 5 style robots, or are they more like C-3PO or The Terminator or Wall-E? Are these more Wall-E, or maybe even the really terrifying stuff that Boston Dynamics is putting out?

Ian: Right. We like to maintain a flexible definition of a robot. I think that's maybe just a way of avoiding the definition question.

Mike: I'm sure the robots in the singularity will be very happy about your loose definition.

Ian: Yeah. The vast majority of deployed robots in the world has traditionally been in the space of automotive manufacturing. That's where we see bolted-down work cells of high-payload, position-controlled, heavy metal robots performing assembly and welding and applications like that. But the fastest growing part of the robotics market is actually in service robotics and in the deployment of robotics into less structured environments. That's environments like logistics and warehousing, retail, agriculture. That's where we have started focusing: robots in semi-structured environments.

We do think that we have a lot to offer in industrial robotics as well, but that hasn't been our focus to date.

Mike: I saw on your website there's this really interesting photo of a robot kind of strolling down the aisle at the grocery store.

Ian: Mm-hmm (affirmative).

Mike: Is that indicative of the kind of robots we're talking about primarily?

Ian: It is. We may have a little bit of insight into the way things are going just from the customers we're talking to every day, and we are seeing more and more robots deployed into retail, for example. That's just what that image shows. The applications at the moment are typically in things like floor cleaning and inventory scanning. Those are the front-of-house applications that we see most often. Of course, in order fulfillment and logistics and warehousing, we see a lot of additional applications of robotics.

Mike: Got you. I want to take a little tangent here and ask how in the world did you get into this? I don't think anyone comes out of school and says, “You know what I'm going to do? I'm going to build robots and observability.”

Ian: I came to robotics through work at a company called Bot & Dolly about seven or eight years ago. It was focused on applying robotics to challenges in film and visual effects. I had an opportunity to get involved in novel applications of industrial robotics at that company. We were acquired into Google around the time that a number of robotics companies were acquired, including Boston Dynamics, which we mentioned. Inside Google, I had the chance to see how all of our peers were thinking about these problems. We ultimately left Google about a year and a half ago because we were excited to ship products. The timeline for that is...

Mike: There's a very subtle danger there.

Ian: The timeline for shipping products at Google is long, but the experience was really invaluable. Personally, I was already interested in the tools and infrastructure side of robotics. Through building tools to support these teams inside Google, and through seeing how people thought about problems like observability, software deployment, and configuration management in the context of robotics, it became clear that there's actually a huge opportunity to bring some of the best practices that have been developed over decades in the backend distributed systems world to the robotics world.

That's where I find a lot of inspiration. The problem is similar enough that we have a lot to learn, but different enough that it does require some new thinking and some new technology.

Mike: That's a really great segue into a really good question: what does it look like to do observability on robots? You mentioned all these tools and all these techniques that infrastructure people rely on every day, things like configuration management. How is that being applied in your work?

Ian: The fundamental requirement of observability in robotics is really no different than it is in monitoring backend systems. We want to maintain visibility into the state of the system and use that information to allow both humans and automated systems to respond to changes in internal system state. But there are a few key differences. One is that the data types that are relevant to us in robotics are often different than they are in backend distributed systems. We have sensors generating a lot of data about the physical world. Those data types are often geometric or three-dimensional or media-based.

The infrastructure and tooling to ingest, index, and visualize that type of data is different. The workflows that we use to debug issues are different. They often require making sense of a lot of that geometric and visual data. Another difference is that centralizing data is often challenging from a field-deployed robot relative to a server in a data center. The availability of network resources is often unpredictable, and we need to have contingency plans in place for when that network is unavailable.

Relative to an IoT application, there's sort of a different set of resources available to us at the edge as opposed to extremely constrained IoT devices that might be running on bare metal. We typically have access to an operating system. We might even have access to a GPU. That allows us to make different trade-offs in the system design to maintain observability into these remote machines.

Mike: It sounds like you're ... Due to the perhaps limit of availability of network or the unknown availability of network and especially with robots out doing their thing in the fields, you're probably pushing a lot of decisions and logic to the edge, to the robots themselves. Is that right?

Ian: That's right. One thing we've learned over the course of building our product is that one of those decisions that's really important to our customers is actually decisions about what data is being centralized and when.

Mike: Oh, that's interesting.

Ian: Typically, in a backend monitoring setup we define a set of metrics that are continuously pushed or pulled at a common rate. In the robotics world, we may care about different types of data around different events of interest, or different resolutions of data at different times of day or around, say, a particularly sensitive manipulation behavior. Giving our customers those levers to dynamically turn on and off what telemetry is being sent, and at what resolution, is something that I think is an interesting problem to work on and specific to the robotics domain.
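In code, the kind of dynamic instrumentation levers Ian describes might look something like this sketch. All of the names and events here are hypothetical illustrations, not Formant's actual API:

```python
from dataclasses import dataclass


@dataclass
class StreamConfig:
    """Per-stream telemetry settings that can be adjusted at runtime."""
    enabled: bool = True
    hz: float = 1.0          # sampling rate
    resolution: str = "low"  # e.g. downsampled images vs. full frames


class TelemetryController:
    """Holds the live telemetry configuration for one robot.

    Sketch of the idea only: levers to turn streams on and off and
    change their rate or resolution around events of interest.
    """

    def __init__(self):
        self.streams = {}

    def configure(self, name, **kwargs):
        cfg = self.streams.setdefault(name, StreamConfig())
        for key, value in kwargs.items():
            setattr(cfg, key, value)

    def on_event(self, event):
        # Around a sensitive manipulation behavior, send camera frames
        # at full resolution and a higher rate; revert afterwards.
        if event == "manipulation_start":
            self.configure("camera", hz=10.0, resolution="full")
        elif event == "manipulation_end":
            self.configure("camera", hz=1.0, resolution="low")


controller = TelemetryController()
controller.configure("camera", hz=1.0, resolution="low")
controller.on_event("manipulation_start")
print(controller.streams["camera"].hz)          # 10.0
print(controller.streams["camera"].resolution)  # full
```

The point of the sketch is that, unlike a fixed scrape interval in a backend monitoring setup, the configuration itself is a first-class, mutable thing that events on the robot can change.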

Mike: Right, yeah. Absolutely. I'm imagining that perhaps some of the problems that you're running into are things like, in the example of the grocery store, a robot going down an aisle hits a spill in the middle of the aisle. What do you do about that? How does the robot even know that's a thing? That would be a really prime candidate for "we need to record this information because the robot needs to know how to handle this next time." How are you recording things like this? It's not just "Oh, the CPU is X now." It's much more visual.

Ian: Yeah, so that fundamental question of how the robot knows that something bad happened is a limit that we'll always have to confront. I think, similarly to backend systems, we often have to rely on second-order, or sort of best-guess, indications that something has gone wrong.

In the case of the spill, it could be that we are seeing wheel slippage, which is something we can detect in the robot control stack, and that type of event might mean for us that the logs from the last 30 seconds are dumped and prioritized for upload to a centralized server.
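The "dump the last 30 seconds on an anomaly" pattern Ian mentions is essentially a flight-recorder ring buffer. A minimal sketch, with hypothetical names and not Formant's implementation, could look like:

```python
import collections
import time


class BlackBoxRecorder:
    """Keep the last N seconds of log records in memory, and flush
    them for prioritized upload when an anomaly (such as wheel
    slippage) fires. Illustrative sketch only.
    """

    def __init__(self, window_s=30.0):
        self.window_s = window_s
        self.buffer = collections.deque()  # (timestamp, line) pairs

    def record(self, line, now=None):
        now = time.monotonic() if now is None else now
        self.buffer.append((now, line))
        # Drop anything older than the retention window.
        while self.buffer and now - self.buffer[0][0] > self.window_s:
            self.buffer.popleft()

    def flush_for_upload(self):
        """Return the buffered window and clear it, e.g. on anomaly."""
        dump = [line for _, line in self.buffer]
        self.buffer.clear()
        return dump


rec = BlackBoxRecorder(window_s=30.0)
rec.record("wheel_odom ok", now=0.0)
rec.record("wheel_odom ok", now=40.0)  # the first record has aged out
rec.record("wheel slippage detected", now=41.0)
print(rec.flush_for_upload())  # ['wheel_odom ok', 'wheel slippage detected']
```

Nothing leaves the robot continuously; only the window around the detected event is prioritized for the (possibly unreliable) network.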

Mike: Mm-hmm (affirmative). It occurs to me that you would have some granularity challenges, too. Let's say I have a web app and it's serving customers. If it's having problems for four or five minutes, that's probably fine. People are going to be upset, but it's not the end of the world.

If I have a robot spinning in circles for five minutes, someone is going to be really upset about that, which means that you have to be able to know about these problems within seconds, whereas in standard web operations, for us, it's more like minutes. Is that right?

Ian: I think that's right, and I think it gets to some of the safety challenges that come with deploying these systems in the physical world alongside humans. That's really a system design problem that we cannot solve entirely. It's the responsibility of the application developer to make sure that there are sufficient layers of safety and local autonomy in the system design to keep people safe and hopefully keep our customers happy.

So the stakes of mistakes are high, but the challenge of observability into those mistakes is also high. That's what I think makes it a really interesting space to work in.

Mike: I'm just imagining being in a grocery store, and being run over by one of these things. Like, that would be a very unpleasant experience.

Ian: I agree. We like to make sure that our customers know that it is not our responsibility as an observability platform to prevent that from happening, but that would be the worst-case scenario, I agree.

Mike: I'm going to assume that you're not the first solution to ever come to market to solve this problem.

Ian: There is a long legacy of SCADA systems that have been deployed in industrial control settings.

Mike: Ooh, yes.

Ian: And-

Mike: Big fan of ICS.

Ian: Okay, and anybody who has worked with them knows that they are a proven technology that meets a specific need for a specific set of users. Unfortunately, they don't really apply to this world of semi-structured robots wandering around retail stores.

While we are not the first, I would say that we are part of the first wave of products that have emerged really just in the last year to address these concerns. I think that's because, to date, robotics companies have built everything in-house, and we're seeing a trend similar to what we saw 15 years ago in the web world, which is a growing realization that not every part of the stack is central to a company's value proposition. We're hoping to take some problems off people's plates.

Mike: Yeah, I'm glad you went that direction. That was going to be my question. How have people been solving this to begin with before you came along? Sounds like they're just writing a bunch of stuff themselves and hoping for the best.

Ian: Yes, that's what we see. It's extremely fragmented. I think the standardization that has happened in the robotics ecosystem we're targeting has really been around solving problems of single-agent autonomy, and for that, there are great open source tools out there like the Robot Operating System that have gone a long way towards standardizing approaches to those problems.

But when it comes to thinking about logs and monitoring and fleet management, it has been extremely fragmented. One challenge is that the people who make up robotics companies often come from a very deep robotics research background and don't have experience building and maintaining cloud infrastructure. As a result, we see a lot of avoidable mistakes being made. It's another place where we see some opportunity.

Mike: I want to talk about your tech stack a bit. What's going on under the hood with all this? Are you using the same tools that operations engineers are going to recognize? Have you completely built stuff from scratch? What's going on there?

Ian: That's a great question. We're trying to strike a balance between building what needs to be built, but not what doesn't. A great example is our approach to exposing business intelligence capability on top of the telemetry we have collected. Our approach is not to build that in-house in any sense. We are leveraging the workflows that already exist for pushing data into a data lake and running business intelligence on top of that.

On the other hand, building the monitoring that's required to handle not just scalar metric data, but also streams of images and geometric data, is something that would be hard to ask of existing server monitoring tools, so that's an area where we have made investments. We have made investments in the functionality of the edge and some of that dynamic instrumentation we were talking about.

And we've made investments in some of the visualization, because obviously looking at 3D data is very different than looking at text logs.

Mike: Yeah, I'm just trying to think about how I would solve that problem, and coming up with a whole lot of blanks. The time series database, oh, that's easy; it's not an impossible problem. But visualizing 3D data in a way that I could go back and look at it? That sounds tricky.

Ian: Yeah, but it makes it fun though.

Mike: For a bit, I was thinking that maybe you were just collecting a whole bunch of different data points and assembling them into an image, but it sounds like you're actually taking image snapshots from the edge and storing those?

Ian: Yeah, so we can consume full or reduced resolution images, point clouds, maps, data types like that. The biggest challenge is really in making those trade-offs: when are the resources available to do that compression at the edge, and when is the network available to centralize that data? That's why we've been really focused on the capabilities of the software running at the edge.

Mike: Mm-hmm (affirmative). I imagine that you probably have somewhat limited retention on the robots themselves. Are you talking minutes, hours, days, months?

Ian: Well, it definitely depends on the customer. We often see a tiered approach, as you would expect, where LIDAR data that might be publishing at a kilohertz, generating gigabytes per minute, has a very low retention period. Text data is obviously easier to keep around for a long time. But we do have the luxury of typically having full SSD resources locally, and that gives us retention better than what you would get on a Raspberry Pi-class IoT device.
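The tiered retention Ian describes can be sketched as a simple per-stream policy. The stream names and retention periods here are made-up illustrations, not real customer numbers:

```python
# Hypothetical tiered on-robot retention: high-rate sensor streams age
# out quickly, while cheap text logs can live on the local SSD much longer.
RETENTION_S = {
    "lidar_points": 60,            # ~1 kHz stream, gigabytes per minute
    "camera_frames": 3600,         # an hour of compressed images
    "text_logs": 30 * 24 * 3600,   # a month of text is comparatively tiny
}


def expired(stream: str, age_s: float) -> bool:
    """True if a local record of this age should be deleted."""
    return age_s > RETENTION_S[stream]


print(expired("lidar_points", 120))  # True: two-minute-old LIDAR is gone
print(expired("text_logs", 120))     # False: text logs stick around
```

A real agent would run this kind of check in a periodic cleanup loop, keyed off actual disk pressure as well as age.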

Mike: Right, yeah. I want to talk about failures in these robots a little bit. It would seem to me, in my naïve understanding of robots, that you know everything that's on a robot. You know what's there, you know what isn't. It seems to me that you could predict all the different failures that could happen. But with our example of the spill on the floor, we clearly get into, well, maybe not. So what kinds of failures are common in these robots? I think you mentioned earlier that there are a whole lot of unknown unknowns that you're getting into as well. Can you talk more about those?

Ian: Sure. I think we can reason pretty well about the internal state of the robot software, but where it gets challenging is that these robotic systems obviously include hardware components, and they are interacting with an external world that can be very hard to reason about.

So the failure modes are really diverse. To the question of what types of failure modes we see often: a good example is that mobile robots often encounter mislocalization, a low confidence about a position in a map. This can be solved in a few ways. One approach that we see some companies taking is a shared autonomy approach, where there is actually support from a human operator in the case that a robot identifies itself as mislocalized. They can sort of help the robot get back on track.

That is something I think is unique to robotics and a trend that we're seeing.

Mike: Is this mislocalization failure like the equivalent of you thinking there is another stair on the stairway?

Ian: Well, that might be hard to detect. I think it's more like moving around a retail environment for which a static map exists, but finding that the inventory manager actually moved a shelf overnight.

Mike: Oh.

Ian: And all of a sudden, my SLAM algorithm, which usually has a very high degree of confidence about my position on a map, is returning very low confidence about whether I'm on aisle 9 or aisle 10.
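The shared-autonomy handoff Ian describes boils down to watching the localizer's confidence and escalating to a human when it stays low. A toy sketch, with a made-up threshold and window rather than anything application-specific:

```python
LOW_CONFIDENCE = 0.3  # hypothetical threshold; real values are app-specific
WINDOW = 5            # consecutive pose estimates to consider


def check_localization(pose_confidence: float, history: list) -> str:
    """Flag mislocalization when SLAM pose confidence stays low over a
    window of estimates, and hand off to a human operator (shared
    autonomy). Illustrative sketch only.
    """
    history.append(pose_confidence)
    recent = history[-WINDOW:]
    # Escalate only on sustained low confidence, not a single blip.
    if len(recent) == WINDOW and max(recent) < LOW_CONFIDENCE:
        return "request_operator_assist"
    return "continue_autonomous"


history = []
for conf in [0.9, 0.2, 0.1, 0.15, 0.1, 0.05]:  # a shelf moved overnight
    action = check_localization(conf, history)
print(action)  # request_operator_assist
```

Requiring a full window of low readings is the design choice that keeps one noisy estimate from paging an operator, analogous to "for" durations on backend alerts.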

Mike: It's like your significant other rearranging the living room when you're gone.

Ian: Exactly. Yep.

Mike: Or in the middle of the night.

Ian: Yep.

Mike: Okay, got it.

Ian: That's one failure mode that we see. Another common failure mode is at the hardware layer. Again, this is typically pretty hard to predict except through sort of second-order measurements in software. What we try to do to support that use case is to give people good visibility into what hardware is deployed where. As anybody who's worked on fleet management problems before knows, that's a tricky problem in itself, especially when you are swapping out components and doing repairs. But that's an area where we definitely see pain and opportunity for robotics.

Mike: Right. How do you decide where the fault detection should take place? Sometimes it should happen on the robot, other times it should happen centrally. How do you decide which is which?

Ian: It's tricky. That is pretty application-specific, and as an infrastructure company, we don't know the answer to that as well as our customers do. I think that's where the domain expertise of people solving inventory scanning problems in retail really comes into play.

You know, what we're trying to do is give people hooks in the right places to do that monitoring wherever in the stack makes sense.

Mike: Do you ever advise your customers on what's possible? I imagine that when I worked in retail a million years ago, if I had a robot there sitting in front of me, I wouldn't even be sure what I could do with it. I could imagine a few things, but I'm sure you, being experts in this particular area, can imagine all sorts of other things.

Ian: Yeah, we definitely talk to customers, and are happy to consult on some of that system design. But, they're often very good at what they do, and when it comes to knowing their own hardware, knowing their own local software stack, there is only so much value we can provide.

Mike, I'm curious for you: how do you see observability and monitoring practices extending beyond backend distributed systems? I'm sure we're just one of a number of domains that is starting to sort of borrow and steal. Where else are you seeing this happen?

Mike: Man, that is a fascinating topic. Listeners can go back to a previous episode; I believe it will be the episode with my friend Andrew Rodgers. Andrew and I discussed observability in manufacturing, observability in industrial control systems, the SCADA we were talking about earlier, and his work now with building-scale and city-scale monitoring and observability, which is absolutely fascinating stuff.

We made a comment that the manufacturing world was really great at coming up with the principles of process engineering, but they didn't have the technical ability to write the software to execute on those principles. Then at some point, software engineering and technical operations found these principles, and because we are experts at writing software, we started to do the execution. Now manufacturing is taking all that and applying it to themselves.

Like it was this really cool two-way street that's happened over the past 10 years or so-

Ian: I would guess that includes some sensing of not just software systems, but a physical system as well.

Mike: Yeah, so it's actually ... not just software, it's actually almost entirely physical stuff. It's things like: I have a boiler, or I have a furnace, or hell, I have a road. My road has sensors embedded in it. I'm gathering all this data about environmental conditions and traffic and things like this, and shoving that into some software system that is going to do stuff and send that information back out into the physical world to change the physical world.

The same technologies we've been talking about are being used here. They are making heavy use of Grafana, time series databases like Cassandra, and Kafka for streaming. All of this is standard web operations tooling that you would expect in any monitoring platform, but it's being used for these real-world physical interactions.

I think that this is super cool because it's very similar to what's going on here where you are collecting this physical world data, shoving it into software that we would all know and recognize if we saw it, making decisions about it, and using that to change what's going on in the real world.

Ian: Yeah, I think the control loop element of this is really interesting. In software, typically the control might involve horizontal auto-scaling or traffic shifting or something like that. In the physical world, we get to actually move metal or plastic, or turn cameras, or reach out and touch things, which is pretty exciting.

Mike: Yeah, in a lot of ways, the impact is bigger and the risk is also bigger.

Ian: Yeah, true. Yep

Mike: So, we're using software to control traffic lights. It doesn't take a genius to see that could potentially turn out very badly.

Ian: Absolutely.

Mike: Same with manufacturing furnaces. These are things that get things to a bajillion degrees, and we're using software to monitor what the temperature is. Well, one of the things that a lot of people don't realize is that if a furnace is getting too hot, yes, that sucks, but it's not a huge issue. It's actually a furnace getting too cold that's a problem.

So if the furnace gets too cold, then everything in it solidifies. These furnaces haven't been shut off for 20-plus years, so when it solidifies, the furnace is done. It's time to replace it, and that's a million-dollar investment, simply because the software that was paying attention to it didn't catch a failure in time.

Ian: Yeah, I think for me that points at the deep domain knowledge that people who've been working in these industries for their entire careers bring to bear on the problem. Deep knowledge of furnace behavior is not something I can speak to, but it's really exciting to me when people with this expertise are empowered with software that lets them do things they've always wanted to do.

I think when we talk in the robotics world about these different waves of robotics startups, the earlier robotics startups were founded by robotics Ph.D.s who knew they could build robots and were looking for a problem. I think we're starting to see robotics-

Mike: Kind of like the AI at MIT. All the early AI companies were just people from the AI Lab.

Ian: Yep, maybe a hammer looking for a nail.

Mike: Right, yep.

Ian: The most recent crop of robotics startups, which I think has a much higher chance of success, are people coming in with deep domain expertise and looking at robotics as just one tool among many that they might apply to solving a real problem.

Mike: Right. Incidentally, for anyone who just heard that particular comment: that's the foundation of building a business.

Ian: Yeah, you'd hope. Yeah.

Mike: Yeah, having domain expertise in a certain area and then looking for ways to solve that problem is a much better way to go about life than coming in with a hammer looking for a nail.

Ian: But, we know a lot of the latter for sure.

Mike: Right. A whole bunch of us: "Man, we really know Python, let's go find problems we can use Python for." As it turns out, that doesn't usually go so well.

Ian: I'm curious when you see people applying monitoring and observability to these new domains. Do you see people making mistakes that have already been made in that world? Or do you think that people are benefiting from the recent developments in monitoring?

Mike: I honestly think it's a bit of both. One of the challenges that I'm seeing is that you have all these people from manufacturing or building management; those are two domains that are top of mind for me. You see all these people who have that deep domain expertise, and they're taking tooling and processes that we've come up with in software and applying them.

Well, they're able to apply them fairly well in their own domains. What they are actually missing is all the domain expertise from the software world, things like how to do effective alerting. It's not "every time a value passes a number, page someone." We know that doesn't work, and in software engineering, I think we're pretty well at the forefront of how to do that, and of the different anti-patterns there.

But when you look at places like nursing and medicine, they're not there yet. They know it's a problem, but they don't have good solutions. We kind of have solutions, but their environment is much higher stakes than ours is. You can see why they are kind of hesitant to adopt some of the solutions we've come up with, which is very understandable.

Ian: Yeah, the stakes are very high, and I'm very excited to see how that plays out.

Mike: Yeah, me too. I think it's going both ways. Manufacturing and building management are clearly doing really cool stuff with the stuff we've built and designed. But I think, for the most part, all the monitoring vendors outside of those domains are not even seeing what's going on. I happen to know that Influx DB and Grafana and Cassandra and Kafka are being used in places that these companies don't even know they're being used in.

That's super cool, but on the other hand, it's pretty shitty for everyone. It would be really helpful if we had more of a dialogue between these two groups, but-

Ian: Yeah. For people who care about tooling and infrastructure for operations, there's just a ton of opportunity in some of the domains you just mentioned.

Mike: Right. Most of the opportunity in the world is not within the software domain. It's in non-software domains, places where you wouldn't expect software to be.

Ian: Yeah, I agree.

Mike: Yeah, yeah. Robots. Well, now that we've come full circle, this has been an absolutely fascinating discussion. Thank you so much for joining us.

Ian: I really enjoyed it. Thanks, Mike. This was fun.

Mike: And to everyone else listening in, thank you for listening to the Real World DevOps podcast. If you want to stay up-to-date on the latest episodes, you can find us at realworldDevOps.com and on iTunes, Google Play, or wherever it is that you get your podcasts. I'll see you in the next episode.
Want to sponsor the podcast? Send me an email.

2019 Duckbill Group, LLC