Cherre's Stefan Thorpe - Heading Straight For The Cloud


May 8, 2022

Liran Haimovitch

Rookout CTO Liran Haimovitch sits down with Stefan Thorpe, Chief Engineering Officer at Cherre. They discuss how he knew that the cloud was where he was going to go, the whole point of DevOps, why they monitor everything, what they’re doing with Kubernetes and why they’re excited about it, and why they don’t use Terraform.


Intro: Welcome to The Production First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I’m your host, Liran Haimovitch, CTO and co-founder of Rookout.

Liran Haimovitch: Today, we’re going to be discussing the complexities of data integration. With us is Stefan Thorpe, Chief Engineering Officer at Cherre. Thank you for joining us, and welcome to the show.

Stefan Thorpe: Thank you for having me. It’s a pleasure to be here.

Liran Haimovitch: So, Stefan, tell me a little bit about yourself, who you are.

Stefan Thorpe: So, obviously, I head up the engineering department at Cherre. I’ve worked in the technology space for, well, 24-25 years now. I come from an ops background. I used to build networks, build servers, everything from laying cables to — actually, I worked on the physical building of a data center back in the UK. That was a long time ago. I was also really lucky to get involved in AWS and actually do a POC with AWS back in 2008-2009. And I knew instantly that the cloud was exactly where I was going to go. That has kept me at the forefront of DevOps for just over a decade, nearly a decade and a half now. Scaling systems is what I’ve done and what I’ve focused on. I’ve traveled the world doing it, and that’s exactly what I love doing.

Liran Haimovitch: Awesome. Now, you mentioned you started with operations, going back to data center operations and networking operations. And I have to wonder, when people think about the head of engineering, the first image that comes to mind is usually some ex-software engineer, somebody who writes tons of code. And I kind of have to wonder, how do you see it? How does this different background story, so to speak, make a difference, or not, in how you see your role?

Stefan Thorpe: It makes a difference in a number of ways. I think really, partly my view of DevOps here becomes important. Whilst I come from an ops background, I realized — and always had an interest in some of that development background as well. And especially as I started to adopt those cloud environments, the first thing I did for most companies, circa 2009-2010, was migrate them onto the cloud. Everyone was excited about the massive scale and everything else. And I was moving these Ruby applications on, maybe PHP, and then they would go, “Great, we’ve got this system that scales, but it’s now not scaling.” And what I quickly learned was, it was because we weren’t architecting the actual software.

The whole point of DevOps is that it’s dev and operations. And so, from that point on, as soon as I started to realize that, I just re-geared my training and skilling. I then spent four or five years heavily in the development space. I still did ops, but I learned everything about design patterns, true OO, how and when to apply functional, which ones work, which ones don’t work, and really bring — So I just spent that entire time — I still do that. It’s interesting, I now code in more languages than I care to speak. But every time I sit down and look at a system, I bring the software architecture and what I’m going to do and my code and then the application, and I can’t help but view the entire system in this kind of 3D integrated model in my head.

Liran Haimovitch: Makes perfect sense.

Stefan Thorpe: So I had to work on the dev side of it, but yeah, it’s made me a better engineer. It’s actually one of the things I push all my engineers to do. So, whilst I have a team of some 30-40 data engineers, many of those guys are more skilled in, say, Kubernetes and infrastructure than some DevOps guys I’ve met. No offense to the DevOps community. But it’s all part of our continuous learning, our continuous improvement process. We’ll probably talk about this more later but we run a blameless post-mortem process. The very root of that is getting down to those five Whys: What’s going on? What’s happened in the system? Now, that system as a whole is the people, the actual production, development, operations. So we really have to run through that entire stack. And through that process, our team learns about all of those technologies.

Liran Haimovitch: So you’ve mentioned the blameless post-mortem, let’s dive into that. How do you monitor your SaaS platform and where does the blameless post-mortem come into play?

Stefan Thorpe: We monitor everything. And the mechanism we use predominantly to assess how well we’re doing is SLOs, service level objectives. To put it in the simplest terms for those who might not be aware, it’s a ratio: we commit to doing something a certain percentage of the time. And that gives us an error budget and a response time. Through those SLOs, we work out how well the system is doing. And as I said, the system includes people, not just the technical system. So, are we able to deliver and set our client expectations? When I say we monitor everything, it’s phenomenal how much data we produce from just monitoring the metrics. We literally have hundreds of metrics across hundreds of variations. We still miss stuff.
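For readers who want to see the mechanics, here is a minimal sketch of the SLO and error-budget arithmetic Stefan is describing. The target, window size, and function names below are illustrative assumptions, not Cherre’s actual figures.

```python
# Minimal sketch of SLO / error-budget arithmetic (illustrative numbers only).

def error_budget(slo_target: float, total_events: int) -> int:
    """Number of failures the SLO tolerates over a window of total_events."""
    return int(total_events * (1.0 - slo_target))

def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is blown."""
    budget = total_events * (1.0 - slo_target)
    return (budget - bad_events) / budget if budget else 0.0

# Example: "deliver data on time 99.5% of the time" measured over 1,000 pipeline runs.
print(error_budget(0.995, 1000))         # 5 late deliveries allowed in the window
print(budget_remaining(0.995, 1000, 2))  # 0.6 of the budget left after 2 misses
```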

And again, that comes into the blameless post-mortem process. If something happens, as we go through those five Whys, it’s like, “Oh, well, we could have had a test here that might have alerted on that,” and the first thing we do is go and add that test into the system. So, maybe we have anomaly checks, and again, you have to continuously improve the monitoring and those benchmarks. Maybe the anomaly threshold or the standard deviation was too wide or too narrow and alerts too much or doesn’t alert enough. And all of that plays in. So we’re constantly tweaking and improving those. We’re also constantly looking for additional systems to monitor in very different ways. Honestly, I think what matters is the post-mortem process and the fact that it’s the system, always the system; it’s never a single person or a single thing.

Frankly, if a person has been able to do something, the system has been configured incorrectly. So the first thing we do in those scenarios is sit everybody down. It’s like, “Okay, we get it.” I’m sure you know that kind of ice-cold bath feeling when somebody has done something, or maybe they felt they could have done something, or stopped it, or, “Maybe I could have raised my hand and said X, Y, and Z.” But the point is, the system should do it. And from there, we just look at how we can improve it, add in the fix, move on to the next thing, and continue to move forward.

Liran Haimovitch: I have to wonder, you’ve repeatedly mentioned metrics and SLOs, so I’m guessing that when it comes to the three pillars of observability, you’re very metrics-driven compared to other types of observability. Can you share a bit more about that? What key metrics are you looking at and, especially, how do they differ between different parts of the system?

Stefan Thorpe: For Cherre as a data integration platform, it’s all about delivering. In fact, for any startup, it’s all about delivering on your clients’ needs. And in our case, our clients — everyone’s interested in uptime because you want to make sure that you get access to the platform that you’re using. That’s a simple one; everyone should do that out of the gate. For us, it’s around the data and getting the data to our clients in a consistent manner. And so we want to make sure that we deliver data on time. If the source has an update Monday, 5 am, we might want to get it out to our clients by 8 am so they’ve got it there for their working day moving forward. We deal with all types of data, by the way. So we actually pull public data sources, we pull our data partners’ data sources, so we’re working with our data partners because they’re able to deliver quicker, and then we work with our clients’ integration services. And that’s how our clients are able to form their insights quicker, make their decisions better, and go through that entire process.

Liran Haimovitch: Let’s take a step back for a second and say a bit more about what you are doing as a company and what are you doing with all that data you seem to be moving around.

Stefan Thorpe: So, Cherre is real estate’s leading data integration and insights platform. What that means is we enable our customers to connect both their internal and external disparate data for insights and better decision-making. Because the world knows that data has value. But for it to have value, it needs to connect and join seamlessly. A very simple way to put it is that we build data warehouses for our clients. What I was coming to is, that can be public data, it can be data partner data. Especially in the real estate world, there are a lot of people that focus on the right comparison for, I don’t know, a single-family home or the trends over a particular sub-market and those kinds of things.

So there are companies that niche down on getting the right quality in a very particular question area. We don’t. We want our clients to be able to bring all of that data together and then start to look at the entire picture as a whole. What that allows them to do is make better decisions — make better and quicker business decisions. And if we do that in a consistent way, it’s much better. So, we focus on reducing risk, we automate the entire process, and we accelerate the time to business insights, giving clients value from their data.

Liran Haimovitch: So if we go back to the metrics you’ve mentioned, I understand your key metrics around the data. How quickly do you move the data? How accurate is the data? How well do you integrate it or process it?

Stefan Thorpe: Exactly. That’s exactly what we’re looking at. So we look at, end-to-end, how long it takes us to process, and the data delivery windows. That in itself is set as contractual SLAs, and then we have our own objectives to deliver that consistently. We then have our performance SLAs, and I’ll get onto that once we talk about how we deliver data to our clients. And then the other thing we do is check for quality, anomalies, and changes within the data. What you typically see with almost any data set is trend patterns; maybe the fill rate of a column increases by half a percent every time you do a weekly update.

So you’ve got this trend that goes up by half a percent each week; if all of a sudden it drops off by 10%, our alert system will come up and say, “Okay, something was different here. Not sure whether it was right or wrong, but it was different.” And then we’ll trigger an incident or check what’s going on and work it out — it could be that the source was down. It could have been something as simple as the file that we downloaded was corrupt and we weren’t able to process it, so we come back and we restart that process. So that’s all around our data delivery and what our clients want. Then we have the performance of the underlying systems, every system we touch. Is our ingest system working as expected? Are the packages that build all of those up still functioning in the same way?
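As a rough illustration of the fill-rate check described above, the sketch below compares the latest weekly fill rate against the recent trend and flags a sharp drop. The 10% threshold, window size, and names are assumptions made for the example, not Cherre’s actual monitoring code.

```python
# Rough sketch of a fill-rate anomaly check; thresholds and names are illustrative.

def fill_rate(values: list) -> float:
    """Fraction of non-null values in a column."""
    return sum(v is not None for v in values) / len(values) if values else 0.0

def fill_rate_dropped(history: list[float], current: float, max_drop: float = 0.10) -> bool:
    """True if the current fill rate sits more than max_drop below the recent average."""
    if not history:
        return False
    recent = history[-4:]
    expected = sum(recent) / len(recent)
    return (expected - current) > max_drop

weekly_fill_rates = [0.900, 0.905, 0.910, 0.915]  # the slow upward trend he mentions
if fill_rate_dropped(weekly_fill_rates, current=0.80):
    print("Fill rate dropped sharply: open an incident and check the source.")
```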

We then have machine learning models. Are they performing in the same way that they were previously doing? What you do with metrics is you build this picture of what’s going on in a live environment, and frankly, trigger on anything and everything to start off with, and then you start to work out which metrics are valuable and which ones aren’t. I’ve set up monitors that I thought were going to be hugely important and then I never heard from them ever again. But that’s part of the joy of going through the system and continually learning from what’s needed and what’s not.

Liran Haimovitch: Yeah, it’s all about trial and error. Something can seem completely reasonable and yet pop up as an alert time after time after time. Or something else that might seem obvious can never ever make an appearance.

Stefan Thorpe: Yeah, exactly. I mean, honestly, those alerts that go off all the time are probably even more dangerous than the ones that don’t.

Liran Haimovitch: Exactly. Stefan, I’m wondering, you’ve mentioned a lot about the data you’re ingesting, what data is it? What data sources do you integrate and what are the challenges of working with them?

Stefan Thorpe: It’s any real estate data that our clients need, as tangential as that needs to be. We’ve connected everything from the court systems, so that people can understand when an asset is in a distressed state and there’s something going on within the court systems and legal filings. We’ve connected floodplain data so that people can understand the risks, and maybe an insurer is looking at it and saying, “Okay, is this building within a high-risk area?” It’s literally any data. The challenges that come with that are varying and massive. This is probably one of my favorites: we had to deal with an FTP server that was nearly 25 years old.

Liran Haimovitch: That sounds like a recipe for great success.

Stefan Thorpe: Oh, honestly, it was so much fun. It hadn’t been updated, so, obviously, nowadays it’s running these obscure security protocols, which took some time to work out and to get all of that through all of our layers of security and back out, and that was fun in itself. And then the files that had been uploaded had been manually edited and added to since ’95 on Windows. The encoding, whilst the files were templated, has been every variation of Microsoft’s simple coded fonts since Windows 95. And Microsoft had some really interesting operating systems during that period. We would process it, we’d get 25% through, and we’d be like, “Okay, we found an error.” The first time around we were like, “Okay, what’s going on here?” Then we’re like, “Oh, it’s an encoding thing.” So we then built this really long list of, “Does it match this encoding? Does it match this encoding?” — those kinds of things.

We then have to deal with the other end of it: high-speed data that’s coming through on a much quicker cadence. One of my favorite analogies around this: there’s a video that’s going around the internet of a philosopher talking about how heavy it is to hold a glass. And what he goes on to say is, “The glass isn’t heavy, but you stand there and hold the glass for 30 minutes, it gets heavy. You stand there for two hours, your arm is going to be aching and throbbing, and now that glass is really heavy.” When you’re managing hundreds of pipelines, you’ve got hundreds of those glasses that you’re holding and having to keep in a working state, constantly trying to deliver.

And, like I said, we do that coming back to our metrics and just making sure that they’re running smoothly. Do that consistently over weeks, months, and years — as I said, some of our ingests are every 15 minutes, pretty high speed. And then some of them go off once a year. So you don’t look at it for a year, and then it fires up and it comes out, and then if there’s a potential issue, you’re like, “Okay, we worked on this a year ago, what were we doing?” So, yeah. There’s a whole heap of them, but it’s a lot of fun, honestly.

Liran Haimovitch: So how do you go about building hundreds of different data pipelines? Do you have a platform for that? Do you automate some of the processes? What does it look like?

Stefan Thorpe: Everything in our production system is automated, even manual processes, and I’ll come to that in a minute. As I said earlier, when I think about systems, I’m really good at picturing high-level systems. Architecture is kind of the space I’ve lived in. And for me, layering is the most important thing. And so our system is layered. Each area has a very particular set of responsibilities, and it must just deliver the same thing at its seam, or its join, to the next layer. And this was a concept I took from the OSI model, the seven-layer networking model that you learn when you go and do Cisco or you go and do operations. And that seven-layer model has stood the test of time. It’s older than most people have been working in this industry. And it works because, as long as the seam is the same, what you do in that layer doesn’t matter.

Now, what that gives you the ability to do is version one section. So we’ve gone through four, five, maybe six versions of our ingest to work out the most efficient process. And as long as it delivers into the warehouse in the same format, we can do what we like. Same with our data warehouse. We did things in SQL on Postgres, and then we did things in SQL on BigQuery. And now we’re doing things in DBT. And even just the standard we’re using within DBT is layered. So we iterate that. So again, we layer everything. That gives us huge amounts of flexibility. That gives us our automation across the entire stack. 
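To make the “same seam” idea concrete, here is a small sketch of how a layer contract might look in code: as long as each ingest implementation delivers the same interface, the warehouse side never needs to change. The layer names and types are assumptions made for illustration, not Cherre’s actual components.

```python
# Sketch of the "same seam" idea: layers can be rewritten freely as long as
# they keep delivering the same interface to the next layer. Names are illustrative.
from typing import Iterable, Protocol

class IngestLayer(Protocol):
    def extract(self, source: str) -> Iterable[dict]: ...

class LegacyFtpIngest:
    def extract(self, source: str) -> Iterable[dict]:
        # v1 of the layer: pull files from an old FTP server (details omitted)
        yield {"source": source, "row": 1}

class PartnerApiIngest:
    def extract(self, source: str) -> Iterable[dict]:
        # v5 of the layer: pull the same data from a partner API instead
        yield {"source": source, "row": 1}

def load_to_warehouse(ingest: IngestLayer, source: str) -> None:
    # The warehouse side depends only on the seam, never on ingest internals.
    for record in ingest.extract(source):
        print("loading", record)

load_to_warehouse(LegacyFtpIngest(), "county-deeds")
load_to_warehouse(PartnerApiIngest(), "county-deeds")
```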

Coming back to what I said about the manual side, even on manual stuff, sometimes we have to do some data crunching that’s very unique to a very particular use case that we know we’re not going to repeat. It doesn’t make sense to build a fully automated system to do that. That’s gold plating. It doesn’t fall into that 80-20 rule. So what we simply did was build a system that allows us to drop a file into a particular safe bucket, with the right approvals, and then that file gets loaded into our database and into our data warehouse and then checked. It means it goes through the automation checks that most things go through. Obviously, it doesn’t have the same guarantees, but it goes through some safety checks to make sure it gets into production. So that way, our entire production is hands-off. We don’t have anybody doing anything. Even our code deployment is all through CI and through that entire process.
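As a sketch of what that file-drop path might look like on Google Cloud (the transcript mentions BigQuery, but the bucket, dataset, and table names here are hypothetical, and Cherre’s actual approval flow and quality checks are not shown), an approved file landing in a staging bucket could be loaded into the warehouse like this:

```python
# Hypothetical sketch of loading an approved one-off file from a staging bucket
# into BigQuery. Bucket, dataset, and table names are illustrative assumptions.
from google.cloud import bigquery

def load_manual_file(gcs_uri: str, table_id: str) -> None:
    """Load an approved one-off file into the warehouse, then rely on downstream checks."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,                  # one-off files, so infer the schema
        write_disposition="WRITE_APPEND",
    )
    job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    job.result()                          # block until the load finishes, surfacing errors

load_manual_file("gs://example-approved-drops/one_off.csv", "example_dataset.manual_loads")
```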

Liran Haimovitch: This sounds like some pretty impressive stuff, and a pretty complex and robust platform for ingesting the data, integrating it and making it available for your customers. So I have to wonder, what are the future developments you have in mind?

Stefan Thorpe: Scale. We’ve still got hundreds of thousands of data sources to add. Our schema’s already massive. We recently spoke to an API gateway company, a well-known one, I won’t name them because they’re a great company. But the first thing they said when they looked at our schema was, “Your schema’s bigger than any schema we’ve ever seen,” like, literally.

Liran Haimovitch: But size doesn’t matter.

Stefan Thorpe: It doesn’t matter until it breaks everything. But we’ve got more; there’s much more data in the real estate industry out there. It’s growing at an exponential rate. And there are many different ways to look at it. So, we’ve got more of that to do. We’ve got a long list of data partners that we’re working with, and we’re adding them into the system on a very quick cadence, so we’re doing more of that. As we add each of those data sources in, it builds out our graph, and our machine-learning and data science teams just continue to push the boundaries on what we can do and what they can find out. In a theoretical world, if you can see all the data, you’re going to get the answers right. And so, we’re just adding to that, and as we discover new spaces and new scopes of work that we can go into, we’ll continue to push the boundaries on that.

For myself, as I said, I’m a DevOps engineer. So personally, one of the things that I’m interested in and looking forward to is what we’re currently doing with Kubernetes and custom resource definitions. Kubernetes is so extensible. In theory, you could build a domain object model or an API that allows us to deploy a pipeline with, like, 10 lines of configuration. Tell us what the source is, tell us where — they should be simple. And that’s where we’re driving right now. So we’re using the extensibility of Kubernetes to drive some of that. I’ve always found that deeply interesting. Yeah. We don’t use Terraform. I know it’s the de facto within the industry. We did use Terraform, but everything is within Kubernetes manifest. Again, part of the reason I said earlier, my team understands Kubernetes is because they only need a couple of languages. How does a Kubernetes manifest work? What does it do? And, can I build my own? The answer is yes. If you learn that, and you learn DBT, you can do a hell of a lot within our system. So that gives us growth and area, and again, just continued scalability.

Liran Haimovitch: Very inspiring.

Stefan Thorpe: Thank you.

Liran Haimovitch: I have just one more question for you before we wrap this up; it’s a question I ask all of my guests. What’s the single bug that you remember the most from your career, from all of those — everything from data centers to the cloud and Kubernetes?

Stefan Thorpe: I knew this question was coming, and I couldn’t think of a single bug. But the one thing that just kept playing on my mind: I’m dyslexic, pretty heavily so. It’s interesting, it doesn’t come up as much in coding. But the one consistent thing that has always come up in coding for me is missing a single letter. Again, with dyslexia, you read something and the way the words form, the letters are either there or not there. And I was joking with one of our employees: I had to type in a Wi-Fi password, something really simple. The password was QUICKOWL, and rather than put a U in, I put a W. I’m not joking, I had 15-20 minutes of me and Shannon, my wife, like, “I know this is the password, why is it not working?” And she walks over and was like, “You have a W rather than a U,” and I’m like, “Ooh, okay.” That has happened in coding for me so often. I’ve banged my head, looked at something, put it down, calmed down, come back four hours later, looked at it, and gone, “Yep, I was missing a letter,” or “I had too many letters in that word.” Something as simple as that is probably the biggest one.

Liran Haimovitch: Dyslexia can be challenging at times.

Stefan Thorpe: It’s interesting. I think it’s what forms my systems thinking. So whilst it’s not great for writing, and my Slack communication is interesting, I think it’s part of how I visualize everything and work through that. I recently saw that LinkedIn and Richard Branson are now doing a dyslexic thinking skill as a tag — you know those things where you say, “I’m good at this, I’m good at that,” they’ve started one in there.

Liran Haimovitch: Any final thoughts for our listeners?

Stefan Thorpe: I mean, for me, Cherre’s always growing. We’re a culture-first company, meaning that we literally hire on culture, less on experience. So, whether someone’s got one year of experience or 15 years of experience, come and say, “Hi,” sit down with us. We’re growing rapidly. We’re doing some very exciting stuff in lots of locations across all of the cloud providers. Yeah, some great opportunities. And if not, I’m always just happy to network with people.

Liran Haimovitch: So, Stefan, how should people reach out to you to network or to learn more about Cherre?

Stefan Thorpe: They can go to careers@cherre.com; that’s the first place if they want to apply. If you just want to reach out to me, my LinkedIn is Stefan Thorpe, and you’ll find me under Cherre. Those are probably the two best places.

Liran Haimovitch: Awesome. Thank you very much for joining us.

Stefan Thorpe: Again, thank you for having me. This is a lot of fun.

Outro: So that’s a wrap on another episode of The Production First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.