Deepchecks’s Shir Chorev – Where The Machine Learning Part Comes In
Liran Haimovitch: Welcome to the production-first mindset, a podcast where we discuss the world of building code, from the lab all the way to production.
We explore the tactics, methodologies, and metrics used to drive real customer value by engineering leaders actually doing it. I’m your host, Liran Haimovitch, CTO and co-founder of Rookout.
Liran Haimovitch: Today, we’re going to be discussing testing machine learning applications. With us is Shir Chorev, CTO and co-founder of Deepchecks. Thank you for joining us and welcome to the show.
Shir Chorev: Thank you, happy to be here.
Liran Haimovitch: Shir, what can you tell us about yourself?
Shir Chorev: So in the past 12 years, I’ve been in tech. I started originally in Unit 8200, working on cyber research after finishing the Talpiot program, and in the past six years I’ve been in the field of machine learning, initially working on it for various anti-terror purposes, using different types of data sources to find interesting phenomena. For the past two years, I’ve been at Deepchecks, where we help machine learning models work in a better and more trusted manner.
Liran Haimovitch: I know everybody’s talking about machine learning and I would love to know what machine learning means for you.
Shir Chorev: Maybe – I wouldn’t say it’s a popular misconception, but thinking ten years back, people imagining machine learning would maybe picture robots taking over the world and having their own understanding, which today we would call artificial general intelligence, and we’re still quite far away from that. I’d say that machine learning is generally a simple concept: taking data and correct answers – or, as we call them, the labels – and having a computer or a model figure out the correct relation between them, using various statistics and optimization methods to learn the algorithm by itself. So that would probably be the way I see it currently.
Liran Haimovitch: So essentially, a machine learning application is some sort of application that can take a piece of data and label it somehow.
Shir Chorev: Right, and the way it does it somehow is based on previous data and labels, usually. There are some exceptions to that.
Liran Haimovitch: So we train the application to be able to tell us stuff about new data that’s going to come in the pipeline sometime in the future.
Shir Chorev: Exactly, and we don’t do it based on like, heuristics that we say explicitly, unlike in regular algorithms, yes.
Liran Haimovitch: Yeah, which is where the machine learning part comes in.
Shir Chorev: Right.
Liran Haimovitch: You’ve been doing a lot of engineering before you dove into machine learning, so how is machine learning different from traditional engineering?
Shir Chorev: I’d say that conceptually there are quite a few differences and challenges. One of the things is that machine learning is essentially more like algorithmic research: you try to understand, did the model learn it successfully or not? And it’s often also very hard to understand what exactly the model learned and what it is supposed to learn. So I’d say the focus of the modeling part is quite mathematical and algorithmic, while many of the actual challenges in deploying and developing machine learning algorithms are actually in everything surrounding the model. So while you need people with a very, very deep understanding of algorithms and data, the challenges are usually on the engineering part: model serving, data engineering, data cleaning. There are lots of challenges that are not actually in that specific area. So I’d say it’s a different kind of connection between those two areas.
Liran Haimovitch: So if I were to compare it to traditional engineering, where a product manager comes and writes some spec about what needs to be done, then this spec now has to be converted into something that can be used for learning. Whether it’s the input data or the labels or anything that comes together, we have to define whatever we want to teach the computer and then make it accessible for the computer to learn.
Shir Chorev: Right. If I had to map it to reality, I’d split the differences into two parts. One part is everything that’s around the model, so the engineering and the modeling aspect, and the other part is what you referred to: it’s relatively more researchy, in the sense that you may have the spec you’re trying to achieve, but like many research tasks, you don’t know whether the data is good enough or whether you’ll be able to get there. So there are two areas, I would say.
Liran Haimovitch: As you go through this process, as you build those models, obviously the next steps, kind of like in software engineering, is testing it. Testing whether the model you’ve built, the application you’ve built is good enough, testing it for various edge cases, testing its performance, testing everything you would test in a normal software application. But how is it different? How is testing machine learning applications different than traditional software engineering?
Shir Chorev: First of all, machine learning is a relatively new field. Just like software went through a long process of working out how to do it – do we do it manually, do we have test-driven development, do we do it in a methodological manner? – machine learning in that aspect is a few steps behind, so there are many differences in how we approach it and what should be tested.
I think much of what’s specific about testing machine learning is that it’s more challenging to understand what’s supposed to happen. If I take a few concepts from software, let’s look at things like coverage testing. In software, one of the things we want to do is go over all the branches of the code, and if we went over all the flows, that does represent relatively thorough testing and gives us some peace of mind. But for machine learning, this isn’t even possible. How do I cover everything? Do I cover the whole input space? That’s not possible, because it’s theoretically infinite. And do I cover the model? Let’s say it’s a neural network – do I cover all the numbers in it, do I go through all of its branches? So in that aspect, it’s much more challenging to understand what to test.
And there are also additional aspects, like the fact that it’s very likely to fail silently, which is very troubling. Essentially, you put in numbers, and even if the numbers don’t represent what you think they represent, the model still takes a number and outputs a different number. That’s just the way it’s built, and sometimes you’ll discover a problem only many months later – especially in production, where you don’t even have the labels, the correct answers – and you suddenly understand that loans you approved a few months ago are actually performing much worse than you thought.
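Silent failures like the one described here can often be caught with a simple guard that validates incoming features against the ranges seen during training before the model scores them. A minimal sketch of the idea – all names and thresholds here are hypothetical illustrations, not the Deepchecks API:

```python
def fit_feature_ranges(training_rows):
    """Record the (min, max) range of each feature from the training data."""
    n_features = len(training_rows[0])
    ranges = []
    for i in range(n_features):
        column = [row[i] for row in training_rows]
        ranges.append((min(column), max(column)))
    return ranges

def check_in_range(row, ranges, tolerance=0.1):
    """Return the indices of features that fall outside the training range
    (with a small tolerance), instead of silently scoring them."""
    violations = []
    for i, (lo, hi) in enumerate(ranges):
        span = (hi - lo) or 1.0
        if not (lo - tolerance * span <= row[i] <= hi + tolerance * span):
            violations.append(i)
    return violations

train = [[0.1, 5.0], [0.9, 7.0], [0.5, 6.0]]  # features roughly 0-1 and 5-7
ranges = fit_feature_ranges(train)

print(check_in_range([0.4, 6.5], ranges))   # []  -- looks like training data
print(check_in_range([42.0, 6.5], ranges))  # [0] -- first feature arrived unscaled
```

A guard like this turns a silent mispricing into a loud alert at serving time, months before the labels would have revealed the problem.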
Liran Haimovitch: Now on the one hand, you can obviously go about manual testing, saying this is something I want to test, whether it’s an edge case you find interesting or a performance test you want to run. But you can also automate. And as we’ve seen in software engineering, automation is great because it allows you to retest time and time again, for every small modification, often very quickly. But automated testing is not without its price, whether it’s designing the automated tests, running them, spending the compute on them, or dealing with the instability of some tests. So from your experience, where do automated tests shine in machine learning?
Shir Chorev: So I’ll say that initially, when we faced these problems in our own projects – machine learning projects having some unstable or unknown behavior – we tried to understand how we could catch it as early as possible. And when you ask peers or search the internet or check existing tools, you basically understand that there isn’t yet something in place where “okay, this is how we do it, you have unit tests for this part.” So I’d say it’s maybe a few steps behind – I mean, I’ll be happy to say how we do automated testing, but I think that conceptually, this is something that isn’t yet in place, and I would say that’s one of the reasons that solutions like that – from our experience, like with our package – are being quite keenly adopted, because people are craving them.
‘Cause currently what usually happens is you have something and it’s ready to go to production, and you would probably have the most senior member of the team go over it manually, check everything, and verify that no edge cases were missed and so on. So I’d say that we’re on the verge of transitioning to automated testing. What we saw is that some things that really have to do with the methodology of machine learning – the fact that we have training data and test data, the fact that we have [inaudible 08:16] time, the fact that we have various ways of how it’s built – enable us to look at it as a generic approach: okay, let’s check things about data integrity, about data distributions, about model performance. These are relevant for basically any model. Of course, what exactly is a problem or not is something you have to determine in your specific domain. But what we really like about automatically testing machine learning is that the ideas are the same, and then you only have to do the customizations and adaptations for your use case, enabling you to work in a more methodological manner, with a framework that really helps you comprehensively check things with less rewriting of code or less manual inspection every time.
Liran Haimovitch: So today you mentioned this more senior member of the team would probably go ahead and do a bunch of manual tests. What would they be focusing on? What would be important to test? How do they go about testing it?
Shir Chorev: By the way, I’m not sure it’s a bunch of manual tests; it’s rather to go over it together with the researcher or data scientist who developed the model and make sure what they did or didn’t test for. I really think it splits into various areas. Probably the most straightforward is evaluating the performance of the model. It’s very typical to do that at a high level of “okay, let’s look at the overall performance,” but what about things like how I perform in a specific demographic, or whether there’s a difference in the model’s performance between various demographics? For obvious reasons, some use cases – for example, insurance-related ones – sometimes have to have the same performance, or the same kind of estimations, for different genders and things like that. So that’s one area, which has everything to do with performance evaluation.
Other areas: I think every data scientist at some point in their career makes a very embarrassing mistake of not noticing something in the data. For example, there’s a specific leakage in the way you collected the data – you didn’t know it, but you have many more car accidents from Europe, while the data you collected from the US has far fewer accidents. This has to do with the collection, not even with your model. But it’s something you have to manually inspect: does my data represent the world of my problem?
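The “does my data represent the world?” inspection can be sketched as a comparison of each category’s share in the collected data against the share expected in the real population. A hypothetical illustration of the Europe/US accident example – all names and numbers here are invented:

```python
def representation_gaps(samples, expected_shares, max_gap=0.10):
    """Flag categories whose share in the collected data deviates from the
    expected population share by more than max_gap.
    Returns {category: (actual_share, expected_share)} for flagged ones."""
    total = len(samples)
    gaps = {}
    for category, expected in expected_shares.items():
        actual = sum(1 for s in samples if s == category) / total
        if abs(actual - expected) > max_gap:
            gaps[category] = (round(actual, 2), expected)
    return gaps

# 80% of collected accidents come from Europe, but we believe the real
# split is roughly even -- a collection bias, not a modeling problem.
regions = ["EU"] * 80 + ["US"] * 20
print(representation_gaps(regions, {"EU": 0.5, "US": 0.5}))
# {'EU': (0.8, 0.5), 'US': (0.2, 0.5)}
```

The point is that this check runs against the raw data, before any model exists, which is exactly why it belongs early in the pipeline.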
Obviously, as you get more experience, there are more and more pitfalls you want to look at – like getting an overview of your data and making sure it works as you expected, or as you understood it to work. So I would say that the senior person, someone experienced, will go and check which things were tested or not, and whether they represent what we expected.
Liran Haimovitch: Let’s say I have this model and have been working on it for a few months now, and I’ve released a few versions, so I have a great checklist of everything that needs to be tested whenever I release a new version of the model. And now I want to go about automating that. What should I do?
Shir Chorev: I’ll take one step back and say that you don’t – or you shouldn’t – wait until you get to that phase. The way we look at testing in machine learning is that, for example, things having to do with data integrity can arise at any phase. They can arise in your initial data set, and they can arise when you’ve retrained your model and have new data coming in, and suddenly some data source changed because whoever you bought the data from didn’t tell you they changed something. I think you have to do two things. One is to have a checklist and check different things – we know specific examples, but again, it’s in the data integrity, in the methodology, and in evaluating your model – and what you can or should check for really varies with the phase you’re in. Some things are very relevant in the research phase: for example, the example I gave of data being collected in an unrepresentative manner is something you want to check and verify much earlier.
Liran Haimovitch: Potentially as soon as you start collecting the data, as you’re collecting the data, you want to check that every new data source you bring into your research is aligned with what you want to train the model to do.
Shir Chorev: Exactly, and the same for, for example, data that is maybe stale and has things that, for various reasons, don’t represent reality – just like mistakes in the data. Essentially what you want is a very elaborate checklist, and then to understand which of these checks can be run in the same manner, so you don’t have to rewrite them every time. You can have some library framework in which you implement the relevant checks and then run them in the relevant phase. I would split it into a few phases. One is during research, both when you get the data and when you finish the research. Then when you retrain the model, which is something usually done once a day or once every few weeks, depending on the use case. And then in a continuous manner – but that would already be called monitoring, when it’s in production. So I would say it’s in the research, in the deployment, and in production.
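The checklist-as-library idea described here – implement each check once, tag it with the phases where it applies, and run the relevant subset at each phase – can be sketched roughly like this. All names are hypothetical, not the Deepchecks API:

```python
CHECKS = []  # global registry of (phases, check_function) pairs

def check(*phases):
    """Decorator registering a check function for the given phases."""
    def register(fn):
        CHECKS.append((set(phases), fn))
        return fn
    return register

@check("research", "retraining")
def no_duplicate_samples(data):
    return len(data) == len(set(map(tuple, data)))

@check("research", "retraining", "production")
def no_missing_values(data):
    return all(v is not None for row in data for v in row)

def run_phase(phase, data):
    """Run every check registered for this phase; return names of failures."""
    return [fn.__name__ for phases, fn in CHECKS
            if phase in phases and not fn(data)]

data = [[1, 2], [1, 2], [3, None]]
print(run_phase("research", data))    # both checks fail on this data
print(run_phase("production", data))  # only the missing-value check runs here
```

The same check functions serve research, retraining, and monitoring; only the subset that runs changes per phase.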
Liran Haimovitch: And you recently released an open source project focusing on exactly that, automating some of the most common tests engineers and researchers should run on their data and models, right?
Shir Chorev: Right. So while working with customers, seeing many machine learning models in production, and analyzing the problems they face, we understood that many of these problems could have been identified much earlier – whether in the research phase or right before the models were actually deployed. And when we were thinking about why these types of things happen, how we can help with wider adoption of machine learning in general, and what the missing tools are – and talking with many more people, asking them how they test their models, what the process is in their teams, whether they do peer evaluation and how it looks – we understood that on one hand there’s a lot of awareness, but on the other hand, there isn’t yet a solution.
And one of the things we thought could really help advance the field – and basically every data scientist, whether they’re still in their academic research phase, just entering the field, or actively deploying models to production on a daily basis – is having a way to easily and comprehensively check the things you know might go wrong. There’s no reason to wait. That was our motivation for entering this area of offline evaluation, I would say.
Liran Haimovitch: And the open source project is called…
Shir Chorev: Deepchecks.
Liran Haimovitch: Deepchecks. So how would you go about using that if you were doing research or if you were releasing a new model? How would you go about employing Deepchecks to improve the quality of the model?
Shir Chorev: So if you want to run Deepchecks – it’s a Python package, so what you need is to import it and then give it your data, or your data and your model, depending on the phase you’re in. If you’re just starting and only have your data, you give it that; if you already have an initial version of your model, you can give it that as well. Then it runs a very long list of checks on your data, your model, and your performance. And of course you can also add various custom checks, or give it your own metrics and things like that. The idea of having something that you can…
Liran Haimovitch: Extend.
Shir Chorev: Both extend, and also that you don’t have to put in the effort every time of thinking again about what is relevant for your domain and what is relevant in general. We do sometimes find interesting conclusions or surprises. One cute example: we had someone trying the package on something they knew had a problem, but they weren’t sure whether it would give interesting insights and catch it. And a simple test that many times people don’t necessarily do is checking your model against various types of baselines – for example, a simple heuristic model, or what if I just guess the most popular answer? This can be a very interesting indication of something going wrong that I’m not necessarily aware of, ‘cause if I see my performance go down from 0.9 accuracy to 0.85, okay, those things can happen. But what if a very naive model outperforms me? That’s a very good indication. Many times, just having a very wide range of tests and checks can help spotlight different types of surprises that you understand shouldn’t be there.
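The baseline check described here can be sketched in a few lines: compare the trained model’s accuracy against a naive guess of the most popular answer, and warn if the naive model wins. All data and names in this sketch are hypothetical:

```python
from collections import Counter

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def majority_baseline(train_labels, n):
    """Predict the most common training label for every sample."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n

train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels  = [0, 0, 1, 0, 0, 1]
model_preds  = [1, 1, 0, 1, 1, 0]  # output of a (hypothetically broken) model

baseline_preds = majority_baseline(train_labels, len(test_labels))
model_acc = accuracy(model_preds, test_labels)
base_acc = accuracy(baseline_preds, test_labels)

if base_acc >= model_acc:
    # Accuracy alone looked like a number; the comparison makes it a red flag.
    print("warning: naive baseline outperforms the model")
```

An absolute accuracy figure can look acceptable in isolation; it is the comparison against a trivial baseline that turns it into a pass/fail signal.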
Liran Haimovitch: How has the reception of the open source project been so far?
Shir Chorev: It was an interesting process, ‘cause initially when we released it, we saw a lot of traction – people seeing the repo or recommending it to other people – and it was cool to see, mainly from people in the space. We really didn’t know what to expect when we released the package initially. Our internal criteria was to reach a phase where, when we asked beta testers “would you recommend this to a friend, or would you use this yourself?”, the answer was yes, or usually yes. So that was the criteria, but still, we didn’t know. We basically kind of open-sourced it, had a post about it, and waited to see what would happen.
And it was a really interesting process, because initially – I guess given the topic and the fact that, as I said, awareness is quite high – it received lots of attention, but mainly attention of “oh, this is cool, let’s check it out”: stars and traffic and things like that. And then, something like a few weeks later, you suddenly start to see usage patterns, and suddenly we have some issues and questions. It was nice to see people initially adopting or liking the concept and then starting to work with it. And well, we’re still in that process, but it’s a really nice journey to experience and see how it evolves.
Liran Haimovitch: Going back to the software development life cycle – or rather, the machine learning development life cycle – what happens after we’ve built our model, tested it, we’re happy with the performance, and we’ve gone through all the sanity checks out there? What do we do now, and what do we have to worry about in the next step, as we’re heading into production?
Shir Chorev: I’ll split it into two. The first thing is that when we initially deploy, we have to make sure – and these are also types of bugs we’ve seen with some customers that surprised us, but I guess these kinds of things happen – that the production pipeline actually behaves the same as the training pipeline. You may have just updated some feature engineering, and you trained the model on data that was normalized, with features between 0 and 1, but in production it’s still between 0 and 100. Again, by the way, failing silently – these are the types of things that can happen there if you don’t check your distributions and monitor them continuously.
So I would say the first thing is to make sure that the initial phase of the model’s life in production really looks and behaves as it should, and as it did in training. And I’d say the second area is the fact that data is very dynamic and the world changes – suddenly shoppers’ behavior is different because of Covid, or there was a specific event like Christmas that your model wasn’t trained on. These are obviously things you have to be aware of, but sometimes maybe there was a specific promotion you didn’t know about, and suddenly the data doesn’t represent reality and you’re mispricing all of your offerings. So the second part is continuously making sure that the data and the model still represent reality as you think they should. And for that, there are various areas, like retraining – and when I retrain, how do I check that the retrained model behaves as I expect it to? – or, in general, how do I continuously verify that the data and model are still relevant?
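One common way to “continuously make sure the data still represents reality” is a simple drift check: compare a production batch’s per-feature statistics against those recorded at training time. A hedged sketch, with hypothetical names and thresholds – note how the 0-to-1 versus 0-to-100 scaling bug mentioned above would trip it immediately:

```python
import statistics

def feature_stats(rows):
    """Per-feature (mean, population std dev) for a batch of rows."""
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.pstdev(c)) for c in cols]

def drifted_features(train_rows, prod_rows, max_shift=3.0):
    """Flag features whose production mean moved more than max_shift
    training standard deviations away from the training mean."""
    flagged = []
    for i, ((t_mean, t_std), (p_mean, _)) in enumerate(
            zip(feature_stats(train_rows), feature_stats(prod_rows))):
        scale = t_std or 1.0  # avoid dividing by zero on constant features
        if abs(p_mean - t_mean) / scale > max_shift:
            flagged.append(i)
    return flagged

train = [[0.2, 10], [0.5, 12], [0.8, 11]]     # normalized feature + raw one
prod  = [[55.0, 11], [60.0, 10], [52.0, 12]]  # first feature arrived unnormalized

print(drifted_features(train, prod))  # [0]
```

Real monitoring systems use richer distribution-distance measures, but the mean-shift version already catches the loudest failures, including pipeline mismatches that would otherwise fail silently.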
Liran Haimovitch: Do you have any tips or recommendations about that?
Shir Chorev: The way I see it in machine learning is that the challenges are kind of the same throughout the process and life cycle, but the emphasis is a bit different. In research, we want to understand and analyze the weak spots of our model and how we can improve them, or whether we have any problems with labeling and things like that. In production, I would say the focus should still be on continuously monitoring and checking, but a bit more on a few specific areas: data distribution; performance evaluation if I can, and if not, things like the distribution of the model’s predictions – is it rapidly changing?; and always things like data integrity, which stays relevant because things continuously change. So I do think it’s the same mindset, just with a slightly different focus. And in monitoring, the challenges are also different, because now I have to do it continuously, over time, and take note of when I retrain and how I check it. I’d say that’s the way to look at it: have a comprehensive checklist and go through it.
Liran Haimovitch: I think you have some offering around that at Deepchecks?
Shir Chorev: We look at making machine learning models work as expected over time as a wider notion, from research to production, including monitoring during production. That’s the area of our enterprise offering, and it quite naturally also relates to validation, ‘cause some of the same things I want to check before I deploy, I still want to check over time. That’s, I’d say, our paid product.
Liran Haimovitch: Interesting stuff. Before we wrap this up, there is one question I ask all of my guests and I would love to hear your input on. You want to share with me some interesting bugs you’ve had?
Shir Chorev: I had a project with something like 50 features, and after some complex feature engineering I managed to take it down to only two features, which had many advantages and made it much leaner. Just before deploying the final project, I rechecked the model performance, and suddenly I saw it was random. It took me something like a day to debug what happened. What I understood was that while I was trying to optimize the way the model was trained, one of the things I did was switch to a NumPy-based framework – just a different framework from the one I was working with before – in which the order of the samples didn’t stay the same. So eventually, what happened is I had a lot of complex and really good, sophisticated feature engineering that worked well, and after that, the model was trained on random labels, because the samples were mixed among themselves. So luckily, I did check it again just before production, and obviously sorted the labels and the samples before deploying.
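The bug described here – reordering samples without reordering labels in lockstep – can be reproduced and avoided in a few lines. A minimal sketch with hypothetical data, where each label is fully determined by its sample so a broken pairing is easy to detect:

```python
import random

samples = [[i] for i in range(100)]
labels = [i % 2 for i in range(100)]  # label is the sample's parity

# Buggy path: reorder samples only -- the sample/label pairing is lost.
rng = random.Random(0)
shuffled_samples = samples[:]
rng.shuffle(shuffled_samples)
broken_pairs = list(zip(shuffled_samples, labels))

# Safe path: shuffle the (sample, label) pairs together, then unzip.
pairs = list(zip(samples, labels))
rng.shuffle(pairs)
safe_samples, safe_labels = zip(*pairs)

def pairing_intact(pairs):
    """True if every label still matches its sample's parity."""
    return all(label == sample[0] % 2 for sample, label in pairs)

print(pairing_intact(broken_pairs))                    # almost surely False
print(pairing_intact(zip(safe_samples, safe_labels)))  # True
```

The broken variant still trains without errors – the model just learns from noise, which is exactly the silent-failure mode discussed earlier in the episode.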
Another bug – I’m not sure if it’s a classical bug in the area of data integrity, but it certainly was a weird or surprising one. After working a few weeks on a project, we were kind of stuck at a specific accuracy level and trying to understand how to improve it. So, re-checking the data and joining it with additional data sources, I suddenly saw that some samples had different labels – people were both doing a certain action and not doing it, which didn’t make sense. Trying to understand what happened, I found that something like 30 percent of the data samples were of dead people.
Liran Haimovitch: So I see dead people.
Shir Chorev: [laughs] Yeah, exactly. That eventually improved the model a lot, but yeah, it was a nice one.
Liran Haimovitch: Awesome. Thank you. It was a pleasure having you on the show.
Shir Chorev: Thank you for having me.
Liran Haimovitch: So that’s a wrap on another episode of the production-first mindset. Please remember to like, subscribe and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter @Productionfirst. Thanks again for joining us.