Transmit Security's Orit Yaron - Making New Mistakes

May 15, 2022

Liran Haimovitch

Rookout CTO Liran Haimovitch sits down with Orit Yaron, VP Cloud Platform at Transmit Security. They discuss technical vs. cultural priorities, how they handle on-call rotations and keep them from being a burden, human error and how it can be used to explain… everything, designing systems to cope with mistakes, and being proactive in learning.

Intro: Welcome to The Production First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I’m your host, Liran Haimovitch, CTO and co-founder of Rookout.

Liran Haimovitch: What does the Head of the Cloud Platform mean? Today, I’m hosting Orit Yaron, who has led the cloud platform at Outbrain, Kaltura, and now Transmit Security. Thank you for joining us, and welcome to the show.

Orit Yaron: Hi, Liran, thank you for hosting me today.

Liran Haimovitch: So, Orit, who are you? What can you tell us about yourself?

Orit Yaron: So on the professional side, I’ve been around the industry for quite some time now. My first real job was at Amdocs, and since then, I’ve had the pleasure and the opportunity to accompany several companies throughout their growth. One was Cyota, which I joined when it had about 30 people; four years later it was acquired by RSA, and the year after, by EMC. I ended up joining a 30-person company and leaving a 30,000-person company. And later on, Outbrain, which I also joined as a relatively small company and grew with to the scale that it is today. And actually, that’s what I like: I like the combination of the technical challenges of scale and the cultural challenges of growth. Other than that, in between, I’m the mother of two teenagers and trying to find some time to sail and windsurf whenever I get the chance.

Liran Haimovitch: You sail?

Orit Yaron: Sail, yeah. We try to sail, as a family, at least once a year. It’s a very different dynamic between the people in the family.

Liran Haimovitch: Definitely. Well, I get seasick, even at harbor. The only thing that gets me on a boat is scuba diving. I love that. Other than that, I’m sticking to the shore.

Orit Yaron: You’d be surprised but I also get seasick with the combination of having the four of us on a boat without any distraction, sail things and stuff. So I take a pill and then get over it. It’s worth it.

Liran Haimovitch: So you’ve been Head of Cloud Platform at three companies now. Can you tell us what it means? What are the roles and responsibilities that come with this title?

Orit Yaron: So I think that the first thing is to make sure that we enable the business. We are here at the end of the day to support the business, make sure that we provide the customers with the best-in-class infrastructure on top of which the service can run reliably, and to provide value to the customers. Regardless of that, in order to achieve that, we also need to make sure that we provide the right infrastructure and tools to internal functions within the company. I am a very big believer in the empowerment of people, so we want to strive and build as much self-serve infrastructure and tools to enable whether it’s R&D engineers, customer support functions, sales. We get to work with all of the entities within the company, we get to have also external customers and be involved with external customers, and also internal customers.

Liran Haimovitch: Before we dive into the details of what it looks like in your current role, can you share a bit more about Transmit Security? What is the company doing and what are the unique challenges?

Orit Yaron: So, Transmit Security. I joined Transmit about eight months ago. Our main goal is to solve the identity problems, the security identity problems. At the end of the day, when you see the security problems that we have and the rise of breaches and so forth, it’s all around the fact that security is broken, whether an identity was compromised or a password was stolen. And we provide the SaaS services to make sure that we secure that, whether it’s by providing a user identity platform or by providing a solution for passwordless authentication. And it’s been very interesting to be part of that journey because the company has been growing very rapidly. We about doubled our size in the past year. And we are also going through a very interesting transformation of focusing on the SaaS services and modernizing our technology stack.

Liran Haimovitch: You’ve mentioned you’re on a big journey, the company’s growing rapidly, and you touched upon SaaS. And I happen to know you didn’t start out as a SaaS company. What are the challenges as you’re shifting, even as a fairly young company, from on-prem devices to a SaaS platform?

Orit Yaron: So I think that is very much related to the topic of this podcast, The Production First Mindset. When you are an on-prem solution, you are a bit detached from the production, the day-to-day production stuff, at least the R&D. You release a version, especially in the financial industry, you release a version, it’s not very often, definitely not once a day or once a week. And the whole burden, let’s put it like that, and the responsibility of running it in production is actually being done by the customer. And when you transform to SaaS, you need to make sure that you also connect the people in the company and the developers to the production itself, because now we are the owner of production and we need to make sure that we provide the best service. So The Production First Mindset becomes something that is critical for success.

Liran Haimovitch: So, as the Head of the Cloud Platform in a company that’s on the rise to move to the cloud and offer a SaaS service, what are your key priorities?

Orit Yaron: Wow, we have many right now. But I think that we can split it into technical priorities and cultural priorities. I strongly believe that, at the end of the day, technology and culture go hand in hand and you cannot really separate them. But for the sake of the discussion right now, on the tech side, we have a very big priority on modernizing our delivery stack, our CI/CD stack. We’ve now introduced Argo CD into our platform and we’re currently evaluating our CI technology. We’re currently working with Jenkins, and we are thinking hard about whether we should continue with Jenkins or move to other platforms that are out there. On the observability side, we have a very big project of making our observability more accessible to the developers. We have now completed a project of migrating all of our log-based monitoring into the Coralogix solution. And now we’re working on implementing an event and metric-based monitoring solution that will be based on the Prometheus and Grafana stack.

Liran Haimovitch: Yeah, I think I’ve seen many companies go through those processes. And as you grow, as your environment becomes more critical, the SREs, the people who are working on production, need those advanced observability tools such as metrics, events, and down the line, event tracing. But those aren’t always easily accessible to developers because they require a lot of specialized knowledge on how to use them.

Orit Yaron: So I think that’s the key thing. You can collect a lot of data, collecting all the metrics and all the logs, but if you don’t organize it well, you’ll end up with a bunch of data that means nothing. The key here is to be able to organize it in a way that it’s easy and accessible for everyone that needs to use it. I personally don’t believe in the approach of having just an SRE team that will handle production. I think that in order to have a very successful SaaS offering, you need to make sure that the developer, at the end of the day, looks at production, understands what happens in production to the code that they develop, and also relates to the notion of ownership.

The way I see it, developers own their code, not just until they release it in the mode of “works on my computer.” They should own the code while it’s developed, while it’s being delivered into production, while it’s running in production, and until the point that it is replaced by a different piece of code, because of scale or whatnot. So, connecting the engineering to production is critical for our success. It’s a joint effort that is being done, not just by my group, but also by the engineering team. And having proper visibility technologies is critical for succeeding in that.

Liran Haimovitch: So how do you measure the observability from the developer’s perspective? I mean, I know that, for the most part, SREs are focused on mean time to detect, mean time to resolve, and the ability to understand the health of the system, but as you mentioned, developers are more focused on the code rather than the overall system as a whole. So how do you go about measuring the success of the visibility you have? How good is it for the SREs as well as for the developers?

Orit Yaron: So we are still in the very early stages of trying to figure out that part. I have a theory of what I want, but if we meet in six months, I will be able to tell you if we succeeded. But in general, I think that the same metrics apply. So you want to optimize for short detection time and, of course, short resolution time, but you also want your alerts to be actionable and routed to the right people that can actually solve them, in order to avoid burnout of people. And I think that the observability ecosystem needs to support that.

Liran Haimovitch: Makes sense. What other tools and techniques are you using to connect your engineers to production, to have them know what’s happening there?

Orit Yaron: So as I said, the first thing is that they will be able to see and understand what’s happening in production. The second thing is more around processes. For example, we added a very thorough RCA (root cause analysis) process. So whenever there is a problem in production, it’s not just the SRE or the Cloud Platform team that needs to understand what happened and why; the engineers should be part of this process. And in some cases, when the issue is applicative, they own the process of understanding the root cause. We’ve added the engineering into an on-call rotation to make sure that when there is an applicative issue they get the alert. Again, it’s still in process. I don’t think that we will ever be able to say that we are done. Something like that needs to be nourished, and we need to invest in it constantly, all the time. But I think that we are going in the right direction.

Liran Haimovitch: Early on, you mentioned culture. Besides ownership, what are the key elements you see as important to the culture of your SaaS company?

Orit Yaron: Good question. I think that it also correlates to another key responsibility or key challenge that we have with recruiting the right people. You know, every company today struggles with recruiting. Recruiting is difficult, but we need to make sure that we don’t just recruit but we recruit the right people with the right mindset. If you tell me that someone is a great engineer but he doesn’t share his knowledge, probably Transmit Security is not the right place for this person. It’s super important for us to be a learning organization. And as such, we need to make sure that we invest time in it, we do the lessons learned part. From every incident, good or bad, you can learn. And we also need to make sure that we share the knowledge. We like to share it internally, we like to share it with the community, we like to learn from the community. So it’s super important to make sure that we recruit the right people for that. Also, people with production in mind. If you want to write code but you don’t care what happens to it after that and you don’t find it interesting to deal with the scaling issues or the problems that occur once it’s running in production, again, probably you’re not the right person to join a SaaS company.

Liran Haimovitch: You’ve touched a bit on the on-call rotation. Now I know this is one of the most hotly debated topics out there because, on the one hand, it’s very important to have somebody on call who can respond and fix whatever is going on, which obviously means having a lot of knowledge about the system. On the other hand, being on-call is also a learning opportunity, and there are very different people within the company, different roles, different experiences. How do you go about training, preparing, and allocating people for on-call?

Orit Yaron: So I know that on-call is a difficult topic and many people look at it as a burden. So to start with, I think that we shouldn’t look at it as a burden but we should look at it, as you mentioned, as a learning opportunity, as a tool that will enable you to be a better engineer as you move forward. I think that the place where on-call becomes a burden and a pain is when you have: (a) a lot of noise that is not actionable, and (b) the feeling of being a router, when the alerts are not being routed to the right people that can solve the issue and you get woken up at night at the most inconvenient time there can ever be, and the only thing that you find you need to do is to call someone else. Then you feel a bit stupid.

So instead of that, I suggest that we take the approach of looking at on-call as a learning opportunity and make sure that we: (a) have very good quality alerts, only alerts that are actionable. If an alert fires and you look at it and you say, “I don’t have anything to do with it,” that’s probably not the right alert that you should have. And (b) make sure that you route it to the right person. If an engineer gets an alert in the middle of the night about the bug that he introduced into the system, I’m pretty sure that the next day the bug will be fixed because he doesn’t want to be woken up again the next night. On the other hand, when you move the pain to someone else, then you find yourself with hacks like, “Okay, a restart will do here,” and you find yourself applying band-aids all over the system.

I think that if you make sure that you put those two principles in place, you will: (a) improve the quality of your system; (b) make sure that you don’t burn out the people by waking them up and interrupting them without any reason. And lastly, you will also be able to make them better engineers. You know, one of the data points that we do measure is mean time to sleep. So we are not only optimizing to make sure that we keep detection time, resolution time, etc. short, we also want to make sure that we don’t burn out people. So, if we see that a certain team has a very high level of alerts in the middle of the night, this is something that we do monitor and we will act on it.
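The episode does not say how "mean time to sleep" is computed, so the following is only a rough proxy for the idea of watching night-time alert load per team: count the alerts that fire during assumed night hours and flag teams above an assumed threshold. The hour boundaries, the threshold, the team names, and all function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed values for illustration; none of them come from the episode.
NIGHT_START_HOUR = 23      # night begins at 23:00
NIGHT_END_HOUR = 6         # night ends at 06:00
NIGHT_ALERT_THRESHOLD = 3  # more night alerts than this is worth a look


def is_night(fired_at: datetime) -> bool:
    """True if the alert fired inside the assumed night window."""
    return fired_at.hour >= NIGHT_START_HOUR or fired_at.hour < NIGHT_END_HOUR


def night_alerts_per_team(alerts):
    """Count night-time alerts per team from (team, fired_at) pairs."""
    return dict(Counter(team for team, fired_at in alerts if is_night(fired_at)))


def teams_over_threshold(alerts, threshold=NIGHT_ALERT_THRESHOLD):
    """Return the teams, sorted by name, whose night alert count exceeds the threshold."""
    counts = night_alerts_per_team(alerts)
    return sorted(team for team, count in counts.items() if count > threshold)
```

Fed with a week of `(team, timestamp)` pairs exported from an alerting tool, `teams_over_threshold` gives a short list of teams whose nights are being interrupted often enough to investigate.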

Liran Haimovitch: So what would be your go-to action if you’re seeing that some team is being alerted too often? What are your options for fixing that?

Orit Yaron: First of all, we need to understand why they are being alerted so often. If it’s because they are like the collector for other teams and the alerts are not being routed properly, that can be one reason. Another reason is that there is a quality issue with the code that this team is writing, so maybe we should take a sprint or two and invest in quality. Definitely, not seeing that, not measuring that, and being blind to it will not help you.
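The two on-call principles discussed above (only actionable alerts, routed to the people who can fix them) can be sketched as a small routing function. This is a minimal illustration, not a description of Transmit Security's setup: the ownership map, the team and service names, and the rule that "an alert is actionable if it carries a runbook" are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ownership map: which team owns which service.
SERVICE_OWNERS = {
    "login-api": "identity-team",
    "billing-worker": "payments-team",
}
# Hypothetical fallback for services nobody has claimed yet.
FALLBACK_TEAM = "cloud-platform"


@dataclass
class Alert:
    service: str
    summary: str
    runbook: Optional[str] = None  # what the responder is expected to do


def route(alert: Alert) -> Optional[str]:
    """Return the team to page, or None if the alert is not actionable.

    Assumption: an alert with no runbook gives the responder nothing to do,
    so it should not wake anyone up.
    """
    if not alert.runbook:
        return None
    return SERVICE_OWNERS.get(alert.service, FALLBACK_TEAM)
```

A team that shows up as the fallback for many services is the "collector" case described above, which is a signal to fix the ownership map rather than to keep paging that team.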

Liran Haimovitch: Definitely. Speaking of bugs, especially bugs in production, there’s one question that I’m asking all of my guests. You’ve been around many companies, you’ve done multiple roles, but what are the bugs that you remember the most?

Orit Yaron: The bugs that we remember most, I think, are the bugs that we learn the most from. There is one that I especially love; it was at Outbrain. We built a system to allow all developers to deploy their code into production very easily. It kind of masked all the gory details behind the scenes and it had a nice GUI on top of it. And we had one developer deploying into production and accidentally, he switched the fields between the version number and the number of pods that he wanted to generate. Can you imagine what that cost? So we had a very interesting take on it and lessons learned. And after peeking into it, we learned that this mistake didn’t happen just because of lack of attention. This specific developer was used to working with a different view of the system. And in that other view, the order of the fields was reversed. So it was out of habit that he did it. And actually, it’s a bug in our system, because we switched the order of the fields.

Liran Haimovitch: Yeah. I would also mention that some sanity tests, either on the version number or on the number of pods, would have helped.

Orit Yaron: Exactly. So we learned from it a lot. We learned: (a) that every system needs to protect itself, you cannot count on people’s common sense. And people will always make mistakes, that will always happen. So the system needs to protect itself. The second thing that we learned, which is super important, I believe, almost any production issue you can dismiss as a human error. And if we would have dismissed that as a human error, we wouldn’t have found out about the actual bug. Every time that you think that something was caused out of human error, try to dig in a bit more and you may find that there is a reason for that human error.
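The sanity checks mentioned above, a system protecting itself against swapped fields, could look roughly like the following. This is a hedged sketch rather than the Outbrain tool: the `DeployRequest` type, the `MAJOR.MINOR.PATCH` version format, and the pod limit are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical limits and formats; none of these come from the episode.
MAX_PODS = 50
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")  # assumed MAJOR.MINOR.PATCH


@dataclass
class DeployRequest:
    version: str
    pods: int


def validate(request: DeployRequest) -> List[str]:
    """Return a list of problems; an empty list means the request looks sane."""
    problems = []
    if not VERSION_PATTERN.fullmatch(request.version):
        problems.append(
            f"version {request.version!r} is not in MAJOR.MINOR.PATCH form"
        )
    if not 1 <= request.pods <= MAX_PODS:
        problems.append(f"pod count {request.pods} is outside 1..{MAX_PODS}")
    return problems
```

With these assumed rules, a request where the two values are swapped (a bare number in the version field and a large build number in the pods field) fails both checks before anything is deployed, instead of relying on the person at the form to notice.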

Liran Haimovitch: Yeah, I think human error can be used to explain everything. If production was down, at some point someone made a mistake, otherwise, it wouldn’t have been down. On the other hand, that’s the least useful reason of them all because there is nothing to be done about human errors. We have to assume people are going to make mistakes and we have to design our systems to be able to cope with those mistakes without catastrophic consequences.

Orit Yaron: I agree. But I also think that in many cases, we categorize things as human error when behind the things there are reasons why this error occurred. So we should try to understand why it happened and not just use the human error excuse.

Liran Haimovitch: A reason why it happened, reasons why we didn’t detect it, reasons why it caused so much damage.

Orit Yaron: And I think that at the end of the day, I always tell my teams, it’s okay to make mistakes. You will make mistakes, you’re human. But let’s make sure that we: (a) make a new mistake every time, not repeat ourselves, and (b) that we take those opportunities as learning opportunities and learn from those mistakes.

Liran Haimovitch: Makes perfect sense. Because I would be worried about giving engineers the challenge of making new mistakes because they can come back every day with a new mistake to make on purpose.

Orit Yaron: Yeah, we need to be careful with that.

Liran Haimovitch: Orit, it’s been a pleasure having you and discussing cloud platforms and human errors and learning and all of that. Any parting words?

Orit Yaron: So first of all, Liran, thank you very much for having me. It has been a real pleasure. I think that if I need to sum it up, I would just tell people to make sure that they are proactive, proactive in their learning, proactive in what they do. Even if you are with the same company for a long time, there are always interesting opportunities to take ownership and be proactive, and I’m sure that every organization will love to have those kinds of people. And if you don’t do that and you keep in your comfort zone for too long, you’ll simply grow old. So, be careful of that.

Liran Haimovitch: Definitely. Be careful of that.

Orit Yaron: Thank you.

Outro: So that’s a wrap on another episode of The Production First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.