Box's Tamar Bercovici - The North Star Of Your System
Liran Haimovitch: Welcome to The Production-First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I’m your host, Liran Haimovitch, CTO and Co-Founder of Rookout. Today, we’re going to be discussing the cloud journey of one of the most interesting SaaS companies out there. With us is Tamar Bercovici, VP of Engineering at Box. Tamar, thank you for joining us, and welcome to the show.
Tamar Bercovici: Thank you for having me.
Liran Haimovitch: So Tamar, what can you tell us about yourself?
Tamar Bercovici: So my name is Tamar, I am based in Redwood City in California, a mom of two kids, been working at Box for 11 years now. So I joined sort of at a different phase of the company, we were a lot smaller. I joined as an engineer on the backend team, I initially worked on building out the scalability layer for our database infrastructure, and shifted into a managerial role at some point. And now I am one of our VPs of engineering, and I lead the core platform team.
Liran Haimovitch: That’s awesome. What was it like growing from a senior engineer to the VP of engineering of one of the groups, I mean, one of the most interesting core groups, at Box?
Tamar Bercovici: I mean, it’s been a really fascinating journey. I think being at Box for this past, I guess, a little over a decade. I mean, first off, just the company itself, you know, Box was around 130 people when I joined. And, you know, it wasn’t a brand-new company, the company was founded in 2005. So they’ve been around for a while, but they were sort of– before the big kind of hockey stick growth moment. And so it was really interesting to see on sort of all facets, like how we had to evolve our product and sort of go to market for that product as we grew. And then of course, how we had to evolve the infrastructure to be able to scale up to meet these new challenges. And then at the same time, how you grow the team. And you know, especially in the rapid growth phases, I think you find that everything that you put in place that works well is sort of immediately not working well again, when the team doubles again. And so, it was just sort of this very rapid iteration, I think, through a lot of product infrastructure and team phases and challenges and trying things out and seeing if it works. And so you just get that very rapid learning. I think for me, it’s been a fun journey, because it’s felt like every two to three years max, I’ve actually been doing a different role in some way. And so, I’m the kind of person that likes that it’s a good learning and growth opportunity, for sure.
Liran Haimovitch: Cool. So to give us some context, what’s Box all about?
Tamar Bercovici: So, at Box, we power how the world works together. We provide a platform for enterprises to support all of their content needs. So if you think about it across a lot of different industries, a lot of different company sizes, like we all tend to deal with content, with documents, in some ways, whether we’re collaborating on putting together a presentation or reviewing data in a spreadsheet, or, you know, sharing our HR policies, like whatever it is, in all of our different roles and functions and across all these different industries, content tends to be a very central way of doing work. And at Box, we really focus on providing the best platform for your content that is easy to use, and facilitates, you know, all the different types of collaboration that we require. But then also meet sort of the security and compliance requirements of real content and real enterprise use cases, and that integrates with all the best in class, other tools that we all love to use to do our job. So that’s what we focus on. That’s what gets us up in the morning and that’s what we’ve been working on and continue to work on going forward.
Liran Haimovitch: Impressive, as we’re talking today, you’re actually in a very interesting place on your cloud journey. I mean, the company’s been around for almost 20 years, as you mentioned, but today, you’re doing a pretty big shift on your end.
Tamar Bercovici: Yeah. So you know, it’s funny, I think, companies, when you look at their tech stack, their infrastructure setup, it’s a little bit like tree rings, you can kind of see what the technological landscape was like when they were created. And then also, you know, what maybe went in– what became like the fad at a particular point in time when it was adopted, and you kind of accumulate these bits and pieces of things along the way. And so, when I joined Box, it was very much that classic LAMP stack, a single MySQL DB that was running everything on the database side and then basically a PHP monolith webapp and we were running in a data center. Basically, we had space in colos, and that was sort of our infrastructure. And obviously, we’ve grown a lot since then, we actually did introduce some workloads in the cloud relatively early on. So we were sort of running hybrid on-prem and cloud for many years. But I would say that, predominantly the bulk of our infrastructure was running in data centers. And it’s actually fascinating because even there, we went through a lot of iterations and evolutions and probably one of the biggest projects that I led was a migration that we did between two data centers. So, we had to basically move to a new facility. And when you think of, you know, production changes, like that’s one of the most complex ones because we were already … we had just like thousands and thousands of servers. And so, it’s like setting up this new environment, in the physical sense, right? Because you’re not– you don’t have the benefit of the cloud. So you’re actually racking and stacking and provisioning everything, and validating and you know, how do you navigate that scope of change without impacting customers was definitely a big challenge. But now we’re on sort of the next phase of our journey, where we’re actually leaning way more heavily into cloud.
And so, we are in the process of migrating a lot of our major workloads to run fully cloud. And so that’s again, yet another form of a pretty large migration that we need to navigate. But one that I think will provide us a lot of advantages going forward in terms of how we run our stack and how we can evolve Box going forward.
Liran Haimovitch: I’m wondering what are the key advantages you’re seeing to moving to the cloud? I mean, after you’ve run pretty big data centers for a pretty long time, you’ve been running thousands of servers, you’re not new at this. Why are you moving to the cloud?
Tamar Bercovici: Yeah. It was very much a choice. Because as you said, you know, we’ve obviously built up an expertise, and there’s definitely a certain level of optimization and customization that you can do when you’re really tuning your own physical space and you’re exactly configuring the SKUs that you’re using. You know, there’s a certain level of optimization that you can do that you almost have to forego when you go cloud. I think it’s a mix of a lot of things. Definitely running in the cloud just gives more flexibility in the sense that when you’re running your own data center, you have to sort of accommodate for a lot of eventualities, right? Because there’s sort of a delay between saying, like, oh, I need more capacity in this pool for this workload, okay, do we have enough spare capacity already in our data center? Do we need to go procure a new piece of hardware to support that? So, you end up needing to sort of be a little more conservative in a way, in terms of keeping around extra capacity, being very structured in how you roll things out. Obviously, you know, as our customer base has been expanding around the globe, being able to bring some of our infrastructure closer to our customers, there’s a lot more flexibility that we can gain by just leveraging the strong international presence that the large cloud providers have already built up. So, there’s definitely these opportunities for more flexibility and optimization and that we don’t have to– You know, when you’re running multiple different workloads, which these cloud providers are, it just lets them manage those buffers and those pipelines of hardware in a different way and we get to capitalize on their economies of scale, which are even larger than ours. I think beyond that, there’s also just the opportunity for more easily leveraging a lot of the infrastructure innovation that’s coming out in these environments.
When you look at the different technologies, in terms of data stores, eventing pipelines, machine learning technologies, like all these different things, that we can leverage, sort of the best-in-class versions of, that just turbocharges our ability to leverage those tools, and then build the things that make Box special and are delivering the value proposition to our customers. So, it shifts some of where we need to focus away from aspects that we can get as good as if not better from leveraging companies that are focused on that as their product.
Liran Haimovitch: Makes perfect sense. Now, this is yet another step in your Box journey. And I’ve heard you say that software engineering, especially when engineering a SaaS product, it’s always a journey.
Tamar Bercovici: Yeah, that’s right. I think the previous company that I worked at, we shipped software, like we were– you know, we shipped shrink-wrapped software. And I think there’s a very different model that you have with SaaS, because, you know, you sell to a customer, yes, you have that sale that you just made. But really, it’s a lot more about the beginning of that long-term relationship with the customer and their lifetime contract value that you’re going to have from them. So, it is a lot more about thinking for that long term. And a lot of the value proposition to the customers themselves is that what they’re using is continually improving. And they’re sort of continuously going to be on the best-in-class version of these tools that they’re using. It’s been such a fundamental shift because now we’re all at sort of table stakes. We’re all expecting that. Like if your software doesn’t do that, then it feels like there’s something wrong. And so, when you really think about that, though, it means that we’re all on this continuous journey of evolution, right? The product, we’re continually pushing forward the boundaries of what we’re building and what we’re providing and new capabilities or new features, better performance, fixing bugs, whatever it is, right? The product is constantly moving forward, our customers are constantly moving forward in how they’re using us, right? Like, Box is one of those companies where we need to deal with the aggregate scale of everyone leveraging our platform, but also because we sell to businesses, some of those businesses are large businesses that are running complex, heavy workloads, where in and of themselves, that’s sort of a different type of scale challenge. Scale boundaries are continually pushing forward, the product is continually introducing sort of new angles and new challenges that we need to rise to.
And so, when you think about that, in terms of how you design your architecture and your systems, there is no perfect North Star destination that you’re trying to get to, and the more you try to figure that out, the worse off actually you’re going to be. So, it’s this interesting tension of looking at the short-term needs of the system, and then trying to project out a little further. Okay, where are we headed? What’s the trajectory? Where do we need to invest? So that we can meet the business where it’s headed towards, and then continuously be on that iterative process of moving the system forward, you know, there’s a lot of information out there around agile product development … and getting something into the hands of the customers to try it out. But in a way, there’s an element of that to system and infrastructure design as well, you have to sort of put a system into production to then see how it evolves, to then see where the next bottleneck is coming in, so that you can then focus on that. And sometimes trying to pre-optimize too many steps into the future, you end up building a complex system, that’s not actually meeting the challenges that end up emerging. And now, it’s more difficult to evolve the system and to fix it. So it’s this interesting balance of again, looking across multiple time horizons, but acknowledging that it’s always going to be an iterative process and there is no one step that you’re aiming for.
Liran Haimovitch: Speaking of long term versus short term, if there was ever a conflict in SaaS engineering, it’s about delivering features for the next quarter, or for the next year, versus engineering having doomsday scenarios about all the technologies that have to be replaced, the databases, the load balancers. What has worked for you to kind of figure out what are the short-term needs? Especially, how much can you invest in them? Versus what are the true long-term needs? What’s truly important to carry out rather than just engineers fancying new technologies?
Tamar Bercovici: Yeah, I mean, this is– Definitely I think, one of all of our canonical challenges. There isn’t an easy answer. I see it less as a dichotomy between features and sort of tech debt and more on that actually the short term versus the long term. Because even on the product side, right? Should I add this next incremental feature in my current product? Or should I launch a completely new product line, right? Like we have these tensions between long-term investments that require more upfront, and it’s going to take a longer time until we realize the benefit versus short-term quick wins. But, you know, maybe you’re missing out, you’re sort of optimizing for a local maximum. And so again, I don’t think there’s a one size fits all answer to this. I do think realizing that we’re constantly iterating actually reduces some of the stress, because it means that you can make a choice one or the other way, you can live with it for a little bit. And then you can see how it’s panning out and you can reevaluate, okay, now that I’ve learned something in this past quarter, what do I need to do with the next quarter? So, it’s all about informing that ongoing process, but to not evade your question altogether, I think we need to look at engineering and architecture investments through a business value lens. And sometimes it’s a little harder to make that case. But at the end of the day, everything that we do in a business should either let us acquire new customers, enable us to better retain our current customers, or improve sort of the efficiency, like the cost of providing the service, right, like at the end of the day. That’s what we all do. And by the way, investing in the team, making the team more productive. That’s an efficiency metric as well, right? You want to have a productive team that stays and doesn’t churn out. So, sort of all of the, like, developer enablement elements are obviously critical as well.
And so, any project that we take on, we should look at through those lenses. So, if we say we need to scale a system, well, why? What are we seeing that’s causing us to come to this conclusion and getting really concrete about what is the specific problem, also then lets you create a much more effective project or program around it, because I think what we tend to do as engineers sometimes is we say, Oh, this system is so broken, we just need to rebuild it. And then we kind of throw everything out that we’re going to change our database and change our development language and move to microservices and fix this thing that’s always been– you know, it’s like we have this desire to make it all perfect. But again, that’s actually a bad way to build great systems. And so, if you step back from that, you say, what we’re trying to address is this aspect. How do we make sure that we’re focusing on, sort of what’s the MVP architecture of that? And it could still be a very long, complex project, like we’ve definitely run multi-quarter and even multiyear investments in architecture and infrastructure, but they have to be focused in terms of what is it that they’re trying to deliver? Why do we care about that? So that we can make the right trade-off and prioritization decisions as we go through that project over time.
Liran Haimovitch: Now, as you’re going through those projects over time, you have the long-term commitments, you have the short-term stuff, you have requirements that are always changing, and you’re always trying to learn. So how do you make sure that you stay agile? How do you keep both the teams and the architecture agile so that you can react to those learnings?
Tamar Bercovici: I think it’s again, making sure that we’re always focused on the impact of the work that we’re doing. And I think that helps to be less attached to the specific tactics that we chose, and more, you know, is it working or not? Or what are the new requirements that are coming in, which can then let us assess the system in a different way? Because if you get– If what I’m doing is, you know, building this component, and now it turns out that component is no longer the right thing to do, then emotionally that’s a lot more difficult to deal with, as opposed to: what I’m doing is improving performance. Hypothesis one is that this component is going to help, so we’re going to start building it; it doesn’t seem to be working, so what else can we explore? So I think keeping the team always focused on those outcomes, and ideally, measurable outcomes that you can track helps that process of having that feedback loop, and then letting the team be a lot more engaged in that basically self-assessment of, hey, we want to move this metric, and it’s not working, what do we need to do? And so, it’s a lot more about pushing that perspective, and that understanding of what we’re trying to accomplish, and the exposure to the data down to the team and the developers that are working on it, so that everyone can be sort of synergetically aligned and try to row in the same direction.
Liran Haimovitch: Yeah, keeping everyone aligned on the impact, rather than on the technology stuff. As you’re doing those processes, what kind of techniques did you find effective for managing the risks of new technologies, of old technologies, of big shifts?
Tamar Bercovici: Definitely. When I reflect on a lot of what we’ve done at Box, over the last decade, it’s been a lot of risky projects. And the truth is that there is no meaningful re-architecting work that is without risk, especially when you’re operating at a high scale. These systems, they’re sort of finicky, right? Like you change one thing, and it’s like, oh, it turns out that we were tuned for this certain workload, now, it’s a different workload. And, you know, no matter what you do, these are often large changes they require– They have a major impact to the system, and they can have, you know, consequences maybe that you’re not expecting or side effects that you’re not expecting. And so, what I found that’s been most useful is to sort of take a step back from that and acknowledge that risk is a part of the equation, there’s no way to remove the risk from the project, like no amount of pre-production testing, or design reviews. And by the way, you should absolutely do design reviews and testing and like all of those things are important. But no matter what you do, there is no such thing as a large-scale production change on a live high-scale system that is without risk. And so now once you’ve sort of acknowledged that, you can actually manage it in a different way, right? It sort of shifts your perspective to saying, okay, risk is a part of the equation, what can we do to manage that risk. And again, it depends on what the type of the risk is. But two techniques that are generally useful are: A) to try to pull in the de-risking as early as you can into the process. So, when I think of mapping out a large-scale, long-running sort of high-risk program, I think of the milestones not as value-add milestones, because often the system as a whole is not fully going into production until the end, like maybe you’re migrating or something. So, you’re not going to see the benefits when you’re midway through. So, sort of incremental value doesn’t make as much sense.
But incremental risk reduction does, right? So if I think of the whole project, and I’m really worried about, did we pick the right database technology, what can I do to actually validate that choice as early as possible in the program so that if I indeed find that, you know, it’s not scaling in the way that we want, or that we are hitting some other unforeseen challenge, we have the most amount of time to reassess, and change our plans and adapt. So, one is like working through the project, according to risk reduction as your milestones, and then trying to pull those up as early as possible.
Liran Haimovitch: Rather than just kind of looking at the risks as you go live on your goal.
Tamar Bercovici: Yeah. Exactly.
Liran Haimovitch: Approval discussion. So did we think about the risk? Yes, no, we’re going to go live.
Tamar Bercovici: Exactly. So it’s like, can you find out as much of that as early as possible? And then actually, this is a great segue to the second half of that, like, at some point, you need to push something out. You need to make a change in production, right? How can you de-risk that, and I think that when you think about how you design your system, a lot of times we think of the end state. “Oh, I want to be running, you know, fully sharded, or I want to be running on this new technology, or in this new environment, or microservices, or whatever it is.” We sort of think of the end state, but as we design an architecture we need to be spending probably as much time on the migration path to that architecture. Because you can often build things in such a way that every production change that you’re making reduces the scope of impact from the risks that you’re worried about. And again, this is very context-specific on what change you’re making. But there are a lot of techniques that you can leverage to migrate a system incrementally over time, so that you’re not making these big-bang changes. And then the actual risk is, it’s still there. But the impact to customers is greatly mitigated. And so, if you sort of combine those two, you can often actually pull off very high-risk changes, where you make a lot of mistakes, and you make a lot of like bad design choices, or bugs in the code or you know, things you forgot, you actually have a lot of issues because we’re all human. And there’s no way that you can sort of navigate a change like that with no problems. But those problems don’t manifest as customer-impacting outages or degradation. They don’t manifest as something that derails the whole program. And so, it’s sort of a mechanism of still managing that process and managing the risk, as opposed to trying to remove it completely, which is kind of a futile effort.
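One common way to make a production change incrementally, rather than as a big bang, is to route a small, stable cohort of tenants to the new code path and widen that cohort over time. The sketch below illustrates that general idea only; the interview does not say which techniques Box used, and the names `bucket`, `use_new_path`, `rollout_percent`, and `internal_tenants` are hypothetical.

```python
import hashlib
from typing import FrozenSet


def bucket(tenant_id: str, buckets: int = 100) -> int:
    """Map a tenant to a stable bucket in [0, buckets) using a hash.

    The same tenant always lands in the same bucket, so raising the
    rollout percentage only ever adds tenants to the new path.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def use_new_path(
    tenant_id: str,
    rollout_percent: int,
    internal_tenants: FrozenSet[str] = frozenset(),
) -> bool:
    """Decide whether this tenant should be served by the new system.

    Assumption for illustration: internal (dogfooding) tenants are moved
    first, as soon as the rollout is above zero percent.
    """
    if tenant_id in internal_tenants:
        return rollout_percent > 0
    return bucket(tenant_id) < rollout_percent
```

With a gate like this, a mistake in the new path affects only the tenants currently in the cohort, and setting `rollout_percent` back to 0 returns everyone to the old path.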
Liran Haimovitch: I feel much of what you’re describing here, in this answer, in previous answers. It’s kind of changing the mindset to be focused on the operation of the software, operation of the system. I own it today. I’m going to own it tomorrow. And I’m not so much focused on what’s going to happen someday in the future on some imaginary state that may or may not occur.
Tamar Bercovici: Yes, that’s right. I think it’s all about that path. And look, I think it is important to have a notion of where you think you’re headed in the long term. And it’s sort of the further out you look, the fuzzier it should be because you’re missing a lot of information. But having that direction of, you know, here’s where I think we’re headed, here’s where I think we should be focusing on, it does help to make more intentional choices in the short term. And then again, as you progress, you should constantly be re-evaluating, like, is that North Star that I’m heading towards still right? Or do I need to shift it a little bit? It’s good to have that trajectory to make an intentional choice towards that trajectory, but also to understand that you’re not trying to jump all the way there in one go. You’re taking a step and then after every step, you’re reevaluating, and maybe you learned something, either through that step, or just something in the broader context changed, right? And now, that shifts your next step. But if you were intentional about those choices that you’re making, then you can more easily adapt your strategy as you go forward. Like, for example, for us, you know, before we had committed, made this sort of broader commitment to focus on Cloud, there were various choices that we were making that were definitely informed by the fact that we were running in a data center. And so, when we shifted that sort of big decision now, you know, obviously, that changed a lot of the next steps for some of the ways we were thinking about evolving our major infrastructure components. And so that’s a great example of how something in the broader context changes. But it’s sort of, you can adapt to that and now say, okay, so here’s the next step we’re going to take given that new input into the equation.
Liran Haimovitch: Yeah, stay focused on reality while keeping the dream inside, rather than focusing on a dream or trying not to hit any roadblocks in reality.
Tamar Bercovici: Yeah, exactly. It’s the realist view, the practical view.
Liran Haimovitch: It’s been super fun. There is one question, I’m asking all of my guests. So, I’m getting a lot of bugs, Rookout is all about debugging and removing bugs. So, what’s the single bug that you remember the most from your career at Box and elsewhere?
Tamar Bercovici: Good question, because anyone who has developed code or managed systems has a whole list of these but maybe I’ll think back to one of the interesting ones I had earlier on at my time at Box, when I was still working as an engineer. As I mentioned, my first big project was building out our database scalability layers. So, when I joined, all of Box was effectively at any given time running on a single MySQL host, which is kind of amazing how far that can actually scale. But clearly, we were sort of at the edges of that for a variety of reasons. And so, that was what I worked on, and we built out the sharding architecture for MySQL for Box, or at least the initial version of it. It’s evolved quite a bit since then. And so, I had been very proud of this approach that I’d come up with to enable us to do an incremental deployment, because sharding a database is actually quite a complicated production change, because you’re both needing to move the data and actually like partition it. It’s not like you’re moving from one data store to another data store, but you’re actually splitting the data out into separate sharded databases. And at the same time, you need to adapt your application code to be able to now interface with this different database technology. And so, it’s quite a complex change. And I was personally worried about the risk of sort of, kind of a flip a switch, yay, we’re sharded, which I know there– I know several companies who’ve done that successfully. So, it’s not that that’s impossible. But for me, it scared me and I tend to think like, if there’s something that scares me, can I design around that. And so, we had this incremental approach, where we adapted the application code to think that the data was sharded as a first step, even though all of the data was still co-located. And then, we could kind of incrementally carve out the physical shards, and for the app, it’s just that the database configuration changed at that point, to point at a new database.
So, it was sort of a way to sequence out the changes. So, this worked great, we got to that point where everything was logically sharded. And we were going to pull out our first physical shard for our first enterprise, which just had Box, the actual account for our Box enterprise, on it. And this was– the company was smaller. So, I actually remember sending an email to the entire company saying, hey, so we’re going to be rolling out this change, everything should be fine. But if you see any problems, let me know. There were a few steps that we needed to do, and we pushed out the first step. And immediately we could see that in our Box account, where you can sort of see your files and folders, it looked like everything was doubled. So, let’s say you had a, you know, folder called sharding project, and now you had two of those. I was like, oh, that’s so weird, but actually we relatively quickly realized what the problem was, because now it was sort of pulling the data twice, but it was in the same database. I was like, oh, that’s silly, we should just fix that. We rolled it back, made a small tweak, rolled it out again, everything was fine. And then we went a few more steps through the process. And when we pulled out that first physical shard, and now we got further through that migration, again, everything was doubled. And this was a bigger problem, because we didn’t want to rush to delete the data from the original source, right? We wanted, you know, it was always meant to be this incremental phase, I hadn’t accounted for the fact that we would be basically pulling data from both places. So anyway, I had to send out a follow-up email, like, hey, okay, so we found a problem. Yay for dogfooding, stay tuned. We took a week to code the support for this case, and then rolled it out again, and it was successful.
But what I found really interesting from that, is that, yes, it’s very good to have these incremental production deployment plans, but then you have to make sure that you’ve actually validated every interim step on that path, because we had done a ton of testing that we could work well, being fully sharded in our development environment, in our staging environment, we’d even– you know, created like a test enterprise. So, everything was working, we just never tested that in-between migration step. And that was the one that had the bug in it. And so, maybe to update what I said before about risk reduction, it’s like, you’ve got to come up with an incremental deployment plan. But then you have to acknowledge the fact that you’re going to be living in each one of those increments for a period of time. And so, it in a way becomes in and of itself a state that needs to be, you know, working, and it’s sort of an edge case in its own right, but one that needs to be tested for and accounted for. So at least we found it out with our own enterprise. So, none of our customers were subjected to that particular strange bug. But it was a fun one for sure.
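The interim state in this story can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not Box's implementation: the classes `Database` and `ShardRouter` and both `list_folders_*` functions are hypothetical, and the exact read path that produced the doubled folders is not described in the interview. It models a carve-out that copies an enterprise's rows to a new physical shard while leaving the originals in place, then shows how a read that gathers rows from every database sees duplicates, while a read that follows the routing table does not.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Database:
    """Stand-in for one physical database host: enterprise_id -> folder names."""
    name: str
    rows: Dict[int, List[str]] = field(default_factory=dict)


class ShardRouter:
    """Logical sharding: every enterprise maps to an owning database.

    Before any carve-out, all enterprises resolve to the same legacy host,
    so the application is 'logically sharded' while data is co-located.
    """

    def __init__(self, legacy: Database) -> None:
        self.legacy = legacy
        self.owner: Dict[int, Database] = {}

    def db_for(self, enterprise_id: int) -> Database:
        return self.owner.get(enterprise_id, self.legacy)

    def carve_out(self, enterprise_id: int, new_db: Database) -> None:
        # Copy first, then repoint. The source rows are deliberately NOT
        # deleted yet, which is the interim state that needs testing.
        new_db.rows[enterprise_id] = list(self.legacy.rows.get(enterprise_id, []))
        self.owner[enterprise_id] = new_db


def list_folders_naive(dbs: List[Database], enterprise_id: int) -> List[str]:
    """Buggy during migration: concatenates rows from every database."""
    found: List[str] = []
    for db in dbs:
        found.extend(db.rows.get(enterprise_id, []))
    return found


def list_folders_routed(router: ShardRouter, enterprise_id: int) -> List[str]:
    """Reads only from the database that currently owns the enterprise."""
    return list(router.db_for(enterprise_id).rows.get(enterprise_id, []))
```

Running both read functions right after `carve_out`, before the source rows are cleaned up, is the kind of in-between-state test the story says was missing: the fully unsharded and fully sharded states both look fine, and only the intermediate one exposes the duplicate.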
Liran Haimovitch: Well, it’s been super fun having you on the show. And you know, those cool stories about Box. If anyone is looking for enterprise storage, or for a cool place to work at, I guess they should reach out to you.
Tamar Bercovici: Yes, absolutely. Both. We can give you a platform for your content or an opportunity to work on this–
Liran Haimovitch: Amazing platform.
Tamar Bercovici: Amazing large-scale infrastructure that we have backing it.
Liran Haimovitch: Awesome. Thanks, everyone for listening in. So that’s a wrap on another episode of The Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.