AppsFlyer's Nir Rubinstein - Databases, Databases Everywhere
Liran Haimovitch: Welcome to the Production-First Mindset podcast, where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I’m your host, Liran Haimovitch, CTO and co-founder of Rookout. Hey Nir, welcome to the show. Can you tell us a little bit about yourself?
Nir Rubinstein: Sure. First of all, thank you for having me. My name is Nir. I’m 40 something? 42. I live in Tel Aviv with my wife and two kids, and I am the Chief Architect of AppsFlyer. I’ve been so for the last nine and a half years. I’ve done various startups before AppsFlyer, and most of them, as startups go, failed miserably. Then the co-founders of AppsFlyer, when they got their seed money, asked me if I wanted to join, and I joined them. That, again, was nine and a half years ago, and I was the first employee then. It was just the two of them and me. And now, nine and a half years later, it’s a big company, 1,300 employees. We’re doing marketing analytics and attribution, mostly in the mobile landscape. And that’s it.
Liran Haimovitch: AppsFlyer – I’m sure everybody knows it. It’s one of the biggest companies in the tech scene in Israel and probably has one of the biggest footprints around. So what did that infrastructure look like in the early days?
Nir Rubinstein: In the early days we were poor, because the seed money back then was low. We were running on Amazon even then, but the tech stack was vastly different from what we have now. We wrote in Python, and we used Redis both as a database and as a message bus. And we had a single database for persistent storage, which was CouchDB. That’s it. That was the infrastructure back in the day. I think we had like two servers, three servers, running on AWS. No Docker, no nothing: pure Python code, pure bash-script installations of the databases. And that’s it.
Liran Haimovitch: Is any of that code still in production?
Nir Rubinstein: No. I think the most interesting aspect of AppsFlyer is that, contrary to a lot of other companies that have some legacy code, where new people who join are told: beware, this is the legacy code, don’t touch it, it works, no one knows what it does. That’s not the case at AppsFlyer. We’ve been diligently rewriting both infrastructure and services over the years, not because it’s fun, even though it is fun, right, but because we need to justify it in a business sense. And the truth of the matter is that AppsFlyer grew a lot over the decade. Sometimes it was linear growth, sometimes it was exponential growth, depending on the year. But as we grew, not only did the traffic grow, the business requirements also changed drastically, because the whole concept of marketing analytics and attribution in the mobile landscape was born a decade ago. It matured to the point where we are now, and everything is vastly different than it was back in the day. So that means we essentially need to rewrite almost everything from scratch as we go along: different services, different everything.
Liran Haimovitch: And what does the infrastructure look like today?
Nir Rubinstein: Conceptually, we can split the infrastructure down the middle. I’m going to give a few numbers, because numbers are interesting. AppsFlyer is running primarily on AWS. We have some footprint on Google as well, but primarily AWS. At peak traffic, it’s around 15,000 servers. We’re elastic during the day, like most everyone, and we’re handling traffic of around 130 to 140 billion events per day. That means upwards of a million to a million and a half events per second in the system. So the system is split down the middle. One part deals with all the real-time, streaming aspects of the system, right? So it has a lot of Kafka, for sure. There are a lot of web handlers and load balancers; volatile, ephemeral databases such as Memcached and Redis. We have persistent, high-performance key-value stores like Aerospike and DynamoDB, and a lot of microservices, primarily written in Clojure and Go. That’s one half of the system. The other half is our offline processing. All the events that we stream through Kafka, we flush down into our data lake. We use S3 on Amazon as a data lake, and we’re also running Delta Lake on top of that. Then we have a lot of Hadoop clusters where we run multiple Spark jobs to do our glorified MapReduce, the results of which get written back into S3, into the data lake, and also into analytical databases, primary among those being Druid. That’s how the system looks.
Liran Haimovitch: That’s a pretty big system. I mean, that’s quite a journey from just two servers running Python to 15,000 servers running everything. Can you share a bit more about that journey? How do you go about adopting, and especially dumping, technologies as you grow out of them? How did you know when you should change a technology, and how do you go about it?
Nir Rubinstein: The easy answer is: when it broke, right? We can split the journey into three parts. Conceptually, AppsFlyer had an incubation phase, the early stage; then a hyper-growth stage; and right now we’re in what I like to call the mature phase. Back at the beginning, in the early phase, we had no problem. Everything was written in Python, because it’s easy, and with Python it’s really easy to ship production-grade code to our clients. A lot of clients called us in the morning asking for feature A or B or C, and at noon or in the afternoon we called them back and told them it’s in production, it’s ready, you can go. And then, as we started to grow, we hit our first major bump of traffic. A lot of traffic was pouring into the system, and Python is amazing at lots of stuff, but it has that obnoxious GIL, the global interpreter lock, which means that essentially you can’t do concurrent or parallel work unless it’s on top of I/O, and we had some CPU-bound stuff to do as well. Then we knew that we had to grow in how we look at the system, right? Because up until then, the concepts were there. We had the concept of microservices, because everyone had them. We had the concept of a message bus, because service A didn’t communicate directly with service B; it communicated with the message bus, which was a good concept. Back in the day, the message bus was based primarily on Redis. But as we came into this first traffic bump, we had to change technology stacks, because Python didn’t serve us as well, and Redis also started jittering as the traffic grew. And Redis is a single point of failure. We could have sharded it and blah, blah, blah. So when you come to that point, you can go in one of two distinct routes.
You can either take whatever tool you have right now, hack it or delve deep, and do whatever you need to do in order to make it work. We could have run C code on top of Python, or we could have used PyPy, or we could have used a lot of other stuff. And again, with Redis, we could have sharded it. The alternative is simply to adopt a new tool, and the business constraints of AppsFlyer dictated that we didn’t have the time to start digging deep into the existing technology. It would be much simpler to take off-the-shelf technology and just adopt that, and maybe the new paradigms that come with the new technology would also serve us in the future. So in the first major shift of that kind, which was, I think, two years down the road, we ditched Python and transitioned everything to Clojure. Clojure is a functional language; it’s a Lisp that runs on top of the JVM, but the implementation doesn’t really matter. The only thing that interested us back in the day was that we wanted a functional programming language. Why? Because it’s really, really easy to do parallel and concurrent work in functional languages. Both Reshef, the co-founder and CTO, and myself had a lot of previous experience doing parallel work in other languages, and we knew that it’s difficult, to say the least. We saw the traffic and where it would take us, and we said, well, we don’t have time to deal with the nitty-gritty of juggling multiple locks and intricate lock acquire-and-release over regular old code. We wanted functional code. So we chose Clojure, and Clojure, more than the programming language itself, brought with it a slew of ideas that are relevant to functional programming, such as CSP, communicating sequential processes. At that point, we also ditched Redis and adopted Kafka. And that was the first major shift of the system, right?
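As an aside, the CSP idea mentioned here (the style Clojure’s core.async popularized) boils down to independent sequential workers that communicate only through channels. Here is a minimal sketch in Python, with a bounded queue standing in for a channel; the event names and sentinel convention are illustrative, not AppsFlyer code:

```python
import queue
import threading

def producer(ch):
    # Each process is sequential internally; it talks to others only via the channel.
    for event in ["install", "click", "purchase"]:
        ch.put(event)
    ch.put(None)  # sentinel signals "no more events"

def consumer(ch, out):
    # Pull events off the channel until the sentinel arrives.
    while (event := ch.get()) is not None:
        out.append(event.upper())

channel = queue.Queue(maxsize=2)  # a bounded channel gives natural backpressure
results = []
t1 = threading.Thread(target=producer, args=(channel,))
t2 = threading.Thread(target=consumer, args=(channel, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # ['INSTALL', 'CLICK', 'PURCHASE']
```

The bounded queue is the key design point: a slow consumer blocks the producer instead of letting memory grow, which is the same role Kafka’s consumer offsets play at much larger scale.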
Instead of Python and Redis, now we had Clojure and Kafka, and it was really, really difficult at the beginning, because change is always difficult. But, you know, two years in we were, I dunno, three developers, four developers, five developers. When the workforce is small, change is really, really easy. And then we had several databases and services that refused to scale. Now we come to the second phase of AppsFlyer, where traffic grew an order of magnitude every year. Essentially every tool that we adopted in the database landscape made the same promise: “Hey, look, I support infinite scale and a scale-out paradigm.” And everything was nice up until a point, and when we reached that point, stuff broke, and then we had no option but to migrate from one tech stack to another. Look at the main database back then at AppsFlyer, where we did the matching algorithm: it was written first on CouchDB, then we rewrote it on Mongo, then we transitioned it to Couchbase, and now it’s Aerospike. And this is just one aspect of the system. We had other stuff; the analytics side, for example. Again, it started on CouchDB, then we transitioned it to an in-memory SQL database, I don’t remember its name. It grew with us up until a point, but then it broke again, and then we transitioned to Druid. And Druid is really nice, but even now we have like a Lambda architecture around Druid, because it’s a time-series database, and events streaming into it as we speak can be in the past, so we have to have a Lambda architecture around it, and blah, blah, blah. So we needed to change a lot of stuff along the way. And change is difficult.
Liran Haimovitch: Yeah. I would say most people would argue that changing languages is not that hard, especially in a microservices environment. You just write the new piece of code in a different language, pipe it back in, and over time you get new code in the new language on top of the old language. Message buses are also a bit easier: you start writing to both, then at some point you start reading from the new one over the old one. Changing a database, on the other hand, is way more challenging, way more complex. You have to migrate the data, you have to shift your business logic paradigms to a certain extent, stop data queries. How do you go about those large projects, especially with so much data and so much traffic involved?
Nir Rubinstein: Yeah, it’s a good question.
So I always like to say that at AppsFlyer we’re not doing any work that’s super-duper complex in the algorithmic sense, right?
That’s also changing as we speak, because the attribution landscape is changing.
So we’re doing really, really interesting stuff with machine learning. But the main core aspect of AppsFlyer, and the main difficulty of working at AppsFlyer, revolves around the sheer scale of the data that we have to work with, right?
So migrating a database is easy if the database is small, right? If you have a database with a thousand records, you just back it up and restore it somewhere else, or even go over it line by line with a script and write into another database.
And you’re done; essentially it can take a few seconds. What happens if you have a database of 12 billion records, 15 billion records, 30 billion records, where the backup-and-restore process takes more than a week? If the backup and restore takes more than a week, I can’t hang a message on my front door saying, “Listen, out of service, be back in a week”, because the business needs to continue.
A lot of the expertise that we’ve developed over the years has to do with these kinds of migrations.
So what we actually do: first, we formulate a game plan, right? People think that migrating a database is something that can take like a few days or whatever. Again, at lower scales, that’s true.
But like you said, Liran, when the paradigms are different and the scale is different, it can take even a few months. So how do we go about it? First of all, we go into an early period where we try to find the new database that will work for us. Usually it takes like a month or so to read up, write a few POCs, stream a bit of data, and see how it behaves. Once we’ve picked the potential winner, we load test it and stress test it. We spin up machines and then we start bombarding them with actual data.
It doesn’t have to be real production data, but it has to be data that approximates the data in production. And we see how it handles that.
We need to gain confidence that it will work for us, and not only on the operational aspect, because the operational aspect itself splits into many concerns: streaming data into the database, writing, reading. But also: can we do cross-data-center replication, can we do XDR? Can we do backup and restore in the same manner? How does a backup impact the performance of the other services while it’s running, and stuff like that? We take all of that into consideration when we’re testing the database.
And again, once we’ve said, okay, we’re good, then comes the tricky part. Because if you’re migrating from, say, one version of Aerospike to another version of Aerospike, it’s easier.
It’s not easy, but it’s easier. Essentially the change means that your code base now needs to talk, quote unquote, to two different databases, right? One of them is the old database and one of them is the new database.
So we start there. We rewrite this slice of code, and we put it into production.
For us, that means a single machine, a canary deployment into production. Then we verify that this single machine has no discernible negative impact when talking to two databases at the same time; the old database is still being read and written as usual, and only the new data is being written to the new database. Once we have confidence that this works, we back up the old data and restore it into the new cluster. And from the moment we start the backup, our entire code base is already talking to both databases.
So we’re writing into the two databases at the same time, and then we start the backup process.
This takes a few days, because it’s a lot of data. And then we restore the data into the new cluster. But in the meantime, some new data was already written into it, right? So when we restore the data into the new database, we do it in a manner that says: if the key already exists, don’t override it, just keep the new one. Again, this also needs to be supported by the new database that we choose. That’s the easy way. But what happens if you’re migrating from MySQL, which was the database we had back then? What happens when you’re migrating from MySQL to Druid, or from Couchbase to Aerospike? This is different, because not only are the database drivers different, so, probably, are the queries, the indexing, whatever. So again, it has a lot to do with applicative code and infrastructure code.
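The same-engine path described above (canary dual writes, then a backfill restore that never overwrites fresher keys) can be sketched roughly like this. Plain dicts stand in for the old and new stores, and all names are illustrative, not AppsFlyer’s actual code:

```python
class DualWriter:
    """Writes every new record to both stores; reads still hit the old one."""
    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db

    def write(self, key, value):
        self.old_db[key] = value
        self.new_db[key] = value  # new traffic lands in both stores

    def read(self, key):
        return self.old_db.get(key)  # old store stays authoritative for now

def restore_backup(backup, new_db):
    """Backfill the snapshot, but never clobber fresher dual-written data."""
    for key, value in backup.items():
        new_db.setdefault(key, value)  # "if the key already exists, don't override"

# Usage: snapshot taken at the moment dual writes begin.
old_db = {"a": 1, "b": 2}
backup = dict(old_db)
new_db = {}
dw = DualWriter(old_db, new_db)
dw.write("b", 99)              # fresh traffic arrives during the restore window
restore_backup(backup, new_db)
print(new_db)                  # the backfill kept the fresher value of 'b'
```

The whole scheme hinges on the restore being a "write if absent" operation, which is why he notes the new database has to support that semantic natively.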
We essentially have a service or a bunch of services that know how to communicate in both languages or dialects or whatever.
And then, if we can do a backup and restore because it’s the same engine, good for us. If we can do a backup but we can’t restore, because it’s a different engine, then we have to get creative.
Sometimes, if it’s database X, you can iterate over the backup file with some scripts or whatever, and then write another service that iterates over the backup data and writes it into the new database.
Again, with the same logic: if the key exists, don’t override it; if the index exists, don’t override it, or something like that.
But if you can’t iterate over the backup file directly, then essentially what you need to do is spin up a third cluster and restore the data into it, right? So you have the old database – technology X – the new database – technology Y – and then you have the third cluster you restored the backup to, which is also technology X, the old one.
And then you write another service that goes line by line, or key by key, and just writes into the new database.
And again, when the two databases are, quote unquote, full, we run them side by side in production.
And we do a lot of sanity checks, both on key volumes or row volumes, and checking for key hits or misses between the databases.
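Those sanity checks (comparing volumes and probing sampled keys for misses and mismatches) might look something like this sketch; the store contents and function name are made up for illustration:

```python
def compare_stores(old_db, new_db, sample_keys):
    """Report count drift and per-key problems between the two stores."""
    report = {
        "old_count": len(old_db),
        "new_count": len(new_db),
        "missing_in_new": [],   # key hit in old, miss in new
        "value_mismatch": [],   # key present in both, values differ
    }
    for key in sample_keys:
        if key not in new_db:
            report["missing_in_new"].append(key)
        elif old_db[key] != new_db[key]:
            report["value_mismatch"].append(key)
    return report

# Usage: the new store dropped one key and corrupted another.
old = {"k1": "a", "k2": "b", "k3": "c"}
new = {"k1": "a", "k3": "x"}
report = compare_stores(old, new, sample_keys=["k1", "k2", "k3"])
print(report)
```

At billions of keys you would sample rather than scan everything, but the pass/fail signal is the same: counts within tolerance and an empty (or explainable) mismatch list before the old database is retired.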
And once we’re satisfied that everything is good in the new database, then we retire the old one via configuration.
This entire process can take anywhere between a couple of weeks to a couple of months, because it’s a lot of data and it’s difficult.
So I think that it sounds easy enough, but reality is difficult.
And reality also means that we now have to support two databases. We have PagerDuty alerts over two databases, because if either of them has some kind of problem, it matters: both of them are conceptually running in production.
Again, we have to do it over a period of a couple of weeks to a couple of months.
And that’s a long time to be like hyper alert and hyper ready.
When we started doing these kinds of exercises back in the day, we were really naive.
We thought it was going to take us, you know, a week or two and we’d be done. But like everything in production, whatever can go wrong will go wrong, and stuff takes much longer than you expect.
And it impacts everything.
It impacts new features that you need to deliver, it impacts cloud costs, it impacts the diligence of the teams, both the one writing the services and the platform team that has to keep these databases alive, and stuff like that. It has a lot of impact, and we need to take it into consideration.
Liran Haimovitch: Sounds like quite a journey scaling from just three engineers to over 300, scaling from just a couple of services to 1500, and changing a lot of technologies along the way. I have to ask you one question that I ask all of my guests. What’s the single bug that you remember the most?
Nir Rubinstein: The single bug, what, that I wrote? There are so many of them over the years, it’s really hard to remember. I think I’ll give the dignified chief architect answer, and then I’ll give a concrete example. The dignified chief architect answer is that everything that’s wrong with the system today, everything that’s a bit missing, a bit irrelevant, schemas for, uh, APIs, or API proliferation, or, uh, whatever: that’s on me. These are mistakes of the past that have an immediate impact now. Stuff that Reshef and myself wrote nine, ten years ago, people have to deal with today, and it’s not easy, right? So that’s on us. As for the biggest one that comes to mind: during one of those database migrations, we had a VP R&D back in the day who told us, listen, you’re migrating, and some of the keys are going to get expired, so write them a default TTL of, I dunno, two months. By then it’s going to be in production, and if the TTLs need to be updated, they’ll be updated by the service, and everything’s going to be okay. Of course, it took us more than two months to arrive in production. So what happened? Like a week before, or something like that, we realized that like 99% of the keys in the new database were going to expire within a week, and it wasn’t in production yet. So we wrote like a Lua script that runs over all the keys in the database and updates the TTL of everything. This also takes a long time to run, and we missed by a bit; I dunno, a few single-digit percent of the keys got evicted, but, uh, you know, it is what it is.
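The rescue script described here (walk every key and push its TTL out before it expires) can be sketched like this, with Python and a dict-backed store standing in for the actual Lua script against the real database; everything below is illustrative:

```python
import time

def refresh_ttls(store, expiries, new_ttl, now=None):
    """Scan every key: evict ones that already expired, extend the rest."""
    now = time.time() if now is None else now
    refreshed = evicted = 0
    for key in list(store):               # list() so we can delete while iterating
        if expiries.get(key, 0) <= now:   # expired before the scan reached it
            del store[key]
            expiries.pop(key, None)
            evicted += 1
        else:
            expiries[key] = now + new_ttl  # push the expiry out
            refreshed += 1
    return refreshed, evicted

# Usage: one key expired before the script reached it, two get a fresh TTL.
now = 1_000_000.0
store = {"k1": "v1", "k2": "v2", "k3": "v3"}
expiries = {"k1": now - 5, "k2": now + 60, "k3": now + 60}
refreshed, evicted = refresh_ttls(store, expiries, new_ttl=60 * 86400, now=now)
print(refreshed, evicted)   # 2 1
```

The `evicted` count is exactly the "few percent" he mentions: any key whose original TTL ran out before the long-running scan reached it is already gone, which is why such a script races the clock.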
Liran Haimovitch: Sucks. Now I’m guessing you guys are hiring?
Nir Rubinstein: Yes, we’re hiring; everyone’s hiring. But I think that if what I spoke about right now interests you, like doing stuff with absurd amounts of traffic and data, and doing pure engineering work, pipelines and algorithms: I don’t know a lot of companies in the world right now that deal with our traffic and our challenges. The best thing about it is that AppsFlyer is not stopping, right? The market is changing all the time, and the changes are always taking us further and further, both in business complexity and in sheer volume of data. So whatever is true right now, I’m sure we’ll sit down in a couple of years and talk about the data and scale, and it’s not going to be 140 or 150 billion events per day; it’s going to be 300 billion, something like that. It’s going to double itself. So if that interests you, we’re open.
Liran Haimovitch: So definitely check out AppsFlyer and thank you very much for listening. Thank you very much Nir for joining us.
Nir Rubinstein: Thank you Liran for having me. Have a great one.
Liran Haimovitch: So that’s a wrap on another episode of The Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.