What Machine Learning Can Learn from DevOps
- Product teams, CI/CD, automated testing and IT are working closely with the business. However, where are your data scientists and ML engineers? Bring them closer as well – but there is no need to call it DevMLOps, ok?
- Data scientists usually come from an academic background and are afraid of sharing models that they do not consider good enough – create a safe environment for them to fail!
- Continuous integration and continuous deployment (CI/CD) are amazing best practices that can also be applied to machine learning components.
- More than CI/CD, we should do continuous evaluation of the models – version algorithms, parameters, data and its results.
- Machine learning bugs are not just functions returning wrong values; they can cause bias, accuracy drift, and model fragility.
The fact that machine learning development focuses on hyperparameter tuning and data pipelines does not mean that we need to reinvent the wheel or look for a completely new way. According to Thiago de Faria, DevOps lays a strong foundation: culture change to support experimentation, continuous evaluation, sharing, abstraction layers, observability, and working in products and services.
Developing apps with if/elses, loops and deterministic functions encapsulates the vast majority of cases in the industry. The tools we built to support the ecosystem, how we debug and test… all of it was designed with those use cases in mind. A different perspective must arise for ML applications, argued de Faria.
Data science practitioners must also absorb a lot of the industry gains from the last year, a direct result of the DevOps culture: deployable artifacts, observability, a sharing culture, experimentation/failure at its core, and also working in products and services, said de Faria.
Bringing people to the “hell of operations” helps to improve the quality of products, argued de Faria.
InfoQ spoke with de Faria about applying machine learning with DevOps.
InfoQ: What's your definition of artificial intelligence and machine learning?
Thiago de Faria: My definition of it is not the true or unique one. I usually point this out at the beginning of talks in order to have a common ground with the audience, because I have noticed that sometimes ML, AI, data science, deep learning and all of it are used as the same thing.
Artificial Intelligence (AI) is making computers capable of doing things that, when done by a human, would be thought to require a unique intelligence. Machine learning (ML) is one of the ways of building the intelligence part, making sure that these machines can find patterns without explicitly programming them to do so.
To make it less vague, especially for the ML part, we can imagine the following:
- Digital pictures are composed of values of RGB at every pixel, varying from 0 to 255.
- If you want to create a rule to identify cats in pictures, you could try to do this with traditional loops, if/else and rules.
- When the 4th pixel has an RGB value (24, 42, 255) and the 5th pixel is (28, 21, 214) and blah blah blah – it is a cat!
- Can you imagine all the possibilities of if/else you would have to write? Infinite! Also, when you get a new picture that you have never seen, would you catch it in a clause? This makes the world extremely binary and hard to look at!
Machine learning allows us to have algorithms that, based on statistical learning (a process where you try to find a predictive function based on the data you have), can give you the chance of that picture containing a cat, and they can be trained based on data.
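To make the contrast concrete, here is a minimal sketch in Python. It is not de Faria's code: the two numeric features per picture, the toy data, and every function name are assumptions made up for illustration. The first function is the brittle hand-written rule from the list above; the rest fits a tiny logistic regression that returns a probability instead of a yes/no answer.

```python
import math

def rule_based_is_cat(pixels):
    """Hand-written rule: matches only one exact combination of pixel values."""
    return pixels[3] == (24, 42, 255) and pixels[4] == (28, 21, 214)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Estimated chance (0..1) that the picture contains a cat."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train_logistic(samples, labels, learning_rate=0.5, epochs=500):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Toy, made-up data: two hypothetical features per picture, scaled to [0, 1].
# Label 1 means "cat", label 0 means "not a cat".
TOY_SAMPLES = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.95),
               (0.1, 0.2), (0.2, 0.1), (0.3, 0.15)]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    weights, bias = train_logistic(TOY_SAMPLES, TOY_LABELS)
    # A picture the model has never seen still gets a graded answer.
    print(predict_proba(weights, bias, (0.85, 0.85)))
```

Changing a single pixel value by one unit makes the hand-written rule fail, while the trained function still returns a probability for inputs it has never seen. A real image model would of course learn from far more data and far richer features.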
InfoQ: What are the challenges that come with artificial intelligence and machine learning regarding integrating with development and deployment?
De Faria: They are the same problems that we solved for traditional development but with a different perspective now: version control, packaging, deployment, collaboration and serving. The main issue is that we are trying to force the solutions we used before in software development into this ecosystem.
In the last year, we saw a significant increase in the number of products (especially Open Source) trying to solve ML's development lifecycle – Spark running on top of Kubernetes, TensorFlow, Kubeflow, MLflow, Uber's Michelangelo, and cloud providers offering tools that allow the training and serving of models. We are witnessing the maturation of this ecosystem, and it's a growing environment.
InfoQ: What about bugs and testing… how does that work with ML components?
De Faria: Concerning bugs, it is important to take notice of the machine learning bugs: bias, drift and fragility.
Bias comes from the bias that exists in the datasets used to build the feature and can have catastrophic results, especially when used in blackbox-like models (Hello, DeepLearning!) – Amazon scrapping its ‘sexist AI’ tool is one example of an ML model that passed all the tests in a company known for its high quality of engineering. However, while trying to filter and recruit software engineers, the data is biased because we are part of an industry where women are a largely underrepresented group. That meant that the algorithm was disfavoring women applicants since it didn't have lots of cases in the training set. This can also happen in mortgage scoring systems and many other businesses. Cathy O'Neil's Weapons of Math Destruction is a book from 2016 that raised a lot of these problems in algorithms making important decisions concerning hiring, classifying people and others – this is a great read!
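One simple way to look for this kind of bias is to compare how often each group is selected by the model. The sketch below is an illustration only, not something described in the interview: the group labels, the records and the function names are hypothetical, and the ratio of lowest to highest selection rate is just one of several possible fairness checks.

```python
from collections import defaultdict

def selection_rates(records):
    """records: iterable of (group, was_selected) pairs.

    Returns a dict mapping each group to the share of its members selected.
    """
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in records:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {group: selected[group] / totals[group] for group in totals}

def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest; 1.0 means parity."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was selected in any group
    return min(rates.values()) / highest

if __name__ == "__main__":
    # Made-up screening outcomes for two hypothetical applicant groups.
    toy_records = ([("group_a", True)] * 4 + [("group_a", False)] * 4
                   + [("group_b", True)] * 1 + [("group_b", False)] * 3)
    rates = selection_rates(toy_records)
    print(rates, disparate_impact_ratio(rates))
```

A ratio well below 1.0 does not prove the model is unfair, but it is a cheap signal that the training data and the model's decisions deserve a closer look.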
Drift occurs when models are built, working well and deployed. You may consider the job over and nothing else is needed, right? Unfortunately not. The model must be recalibrated and resynced according to the usage and data to keep the accuracy. Otherwise, it will drift and become worse with time.
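A drift check can be as small as comparing recent accuracy with the accuracy measured at deployment time. The following is a minimal sketch under stated assumptions: the class name, the window size and the tolerance are hypothetical, and it presumes that the true outcome for each prediction eventually becomes available.

```python
from collections import deque

class AccuracyDriftMonitor:
    """Flags drift when rolling accuracy falls below a baseline minus a tolerance."""

    def __init__(self, baseline_accuracy, window_size=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def has_drifted(self):
        # Wait for a full window before raising any alarm.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance

if __name__ == "__main__":
    monitor = AccuracyDriftMonitor(baseline_accuracy=0.9, window_size=10)
    for _ in range(10):
        monitor.record(prediction=1, actual=1)
    print(monitor.has_drifted())
```

When the monitor reports drift, that is the cue to recalibrate or retrain the model on more recent data.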
Fragility is related to bias, but more related to changes outside of the team's reach. A change in definition, data that becomes unavailable, a null value that should not be there… how can your model cope with these issues, how fragile is it?
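One way to make a model less fragile is to validate every incoming record before it reaches the model. The sketch below assumes records arrive as dictionaries; the schema, the field names and the function name are hypothetical and chosen only for illustration.

```python
def validate_record(record, schema):
    """Return a list of problems found in `record`; an empty list means it is usable.

    `schema` maps each required field name to its expected Python type.
    """
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] is None:
            problems.append(f"null value: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

# Hypothetical schema for a scoring model's input.
EXAMPLE_SCHEMA = {"age": int, "income": float}

if __name__ == "__main__":
    print(validate_record({"age": 41, "income": None}, EXAMPLE_SCHEMA))
```

Records that fail validation can be rejected, routed to a fallback, or counted as a metric, so that an upstream change shows up as an alert rather than as silently worse predictions.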
The worst part is, the majority of these bugs in ML cannot be identified before production. That is why monitoring and observability, other pillars from DevOps, play a gigantic role in machine learning components. You must measure proxies that identify the business value that your ML components should impact. For example, have you created a recommendation engine, and are you applying an A/B test strategy to roll out? Let's see the variation of spend between the two groups. Or maybe you have an image tagging component now, so are people using the features around it? You cannot directly track ML components, but you may be able to analyze proxy measures on it. These types of metrics and focusing on measuring can help you to detect and approach the ML bugs early on: bias, drift, and fragility.
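The spend comparison for an A/B roll-out could look like the sketch below. The spend figures are made up, the function name is hypothetical, and Welch's t statistic is one reasonable choice for comparing two group means, not a method named in the interview.

```python
import math
from statistics import mean, variance

def compare_spend(control, treatment):
    """Compare average spend between the control and treatment groups.

    Returns both means, the lift (treatment minus control) and Welch's t
    statistic, or None for the statistic when neither group has any variance.
    """
    lift = mean(treatment) - mean(control)
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    return {
        "control_mean": mean(control),
        "treatment_mean": mean(treatment),
        "lift": lift,
        "welch_t": lift / standard_error if standard_error > 0 else None,
    }

if __name__ == "__main__":
    # Made-up spend per user: control sees no recommendations,
    # treatment sees the new recommendation engine.
    control_spend = [10.0, 12.0, 11.0, 9.0]
    treatment_spend = [14.0, 15.0, 13.0, 16.0]
    print(compare_spend(control_spend, treatment_spend))
```

Tracking this proxy over time, rather than only at launch, is what lets it double as an early warning for drift.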
InfoQ: You spoke about having distance between data science and operations. What is causing this distance?
De Faria: The same problem that affected (and still does) the business world and “gave birth” to the DevOps movement – a distance between the business and the actual industrialization/operationalization of what is built. This gap is a result of three things: slowness (things flowing from idea to production taking a gigantic amount of time), lots of handovers (X talks to client, A writes the user story, B builds, C validates, D approves, E deploys, F operates, G fixes bugs, H rebuilds,…), and clustered teams working on projects, not products. The Accelerate book and State of DevOps Report from Dr. Nicole Forsgren, Jez Humble and Gene Kim (co-founders of DORA) are good places to look into this. This distance is getting more explicit and more evident as organizations start to change the way they approach software development and delivery lifecycle. Why? Because we can see a lot of organizational wins in adopting new practices, tools and doing the hardest thing: changing the culture.
InfoQ: What can we do to decrease distances and improve collaboration?
De Faria: Again, the hardest thing that any organization can do: change the culture. In the case of ML engineers and data scientists, some cultural aspects can impact a lot, but the most compelling one I have seen is related to the background of the professionals. The majority of them have a very academic background, meaning that they are used to spending long periods working on one problem until it is good enough to be accepted in a publication. The bar there, to be good enough, is extremely high, not just on some metrics but also on the design of the experiments, mathematical rigor, and so on. In a business context this is important, but less so… That means that it is OK to publish a model with 60% accuracy and have it in a deployable state. It is better to have that ready and consider putting it in production today than to wait months to have something “good enough”. Maybe in three months that will not be a problem worth solving anymore. Moving fast with flexible possibilities is the best way to go.
InfoQ: What's your advice for companies who want to reap the benefits from applying artificial intelligence and machine learning? What should they do or not do?
De Faria: Training data scientists and generating value from ML techniques to build AI applications is extremely hard. To make it viable, fun and to attract these professionals, we must change the culture around it. It is hard to design a path to this “optimal culture”, as every company has its own way and interactions are hard. Some cultural characteristics I have seen that support a short time-to-market and where a lot of value is generated from data science include:
- Data science – the “science” part indicates experimentation and failed tryouts. Not all experiments succeed, so data science will not produce mind-blowing solutions at all times. Sometimes you can go into a rabbit hole. Alternatively, the business may change. So the question is: if you are a data scientist working on a project for a few days and you see no future in it, do you have the courage and autonomy to tell that to your boss/stakeholders? Likewise, the other way around… can they come to you at any time and say that the business changed and we must pivot?
- More than CI/CD, we need to talk about CE – continuous evaluation. Every time a new model is tested – it can be new hyperparameters in the same algorithm or a completely new algorithm – we must be able to compare it with previous runs. How accurate was it? Why is the result different? Are we using the same dataset? This is fundamental for a good usage of resources and a learning experience inside the company. Thus, let's implement CI/CD/CE pipelines!
- Share, share, share. Share not only your good models, but also the ones that are a total flunk! Version control your code and your models, at all times! Learn to use git at every moment! Why? Because when someone else sees that, they will not try it again with the same datasets and parameters… stop the waste!
- Provide platforms and tools for the data scientists to abstract the things they do not know (and they do not need to know). A data scientist may not know if they want to serve their models in REST or gRPC, how to package the model, if it should be deployed on the Kubernetes cluster or how many requests per second it should withstand – this is not their expertise. A team must provide a platform and have a pipeline to do that and let the decisions be taken, experimented with and changed. The problem here is companies “selling the silver bullet platform”. Every company has its flow, ways of working and ideas… do not bend the culture to the tool.
- Work on products and services, not projects. Developers, security specialists, SREs… everyone should be involved and help. By doing this, you can make sure that you have deployable artifacts from day one! After it is deployed, the job is not over… You have to operate, monitor, refactor, calibrate and do several things with ML models that are running in production.
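The continuous evaluation idea from the list above can be sketched as a small run log that records the algorithm, its parameters, a fingerprint of the dataset and the resulting accuracy, so that each new run can be compared with the previous one. This is an in-memory illustration only: the class, the function names and the example values are hypothetical, and a real pipeline would more likely persist runs in a tracking tool or in version control.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Short, stable hash of the dataset so runs can tell whether data changed."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

class RunLog:
    """Keeps every evaluated run, including the failed experiments."""

    def __init__(self):
        self.runs = []

    def record(self, algorithm, params, dataset_rows, accuracy):
        run = {
            "run_id": len(self.runs) + 1,
            "algorithm": algorithm,
            "params": dict(params),
            "dataset_id": dataset_fingerprint(dataset_rows),
            "accuracy": accuracy,
        }
        self.runs.append(run)
        return run

    def compare_latest(self):
        """Answer 'how accurate was it?' and 'same dataset?' for the last two runs."""
        if len(self.runs) < 2:
            return None
        previous, latest = self.runs[-2], self.runs[-1]
        return {
            "accuracy_delta": round(latest["accuracy"] - previous["accuracy"], 4),
            "same_dataset": latest["dataset_id"] == previous["dataset_id"],
            "same_algorithm": latest["algorithm"] == previous["algorithm"],
        }

if __name__ == "__main__":
    log = RunLog()
    rows = [[1, 2], [3, 4]]  # made-up dataset
    log.record("logistic_regression", {"C": 1.0}, rows, accuracy=0.60)
    log.record("logistic_regression", {"C": 0.1}, rows, accuracy=0.71)
    print(log.compare_latest())
```

Because failed runs are logged alongside successful ones, a colleague can see which combinations of data and parameters have already been tried before spending time on them again.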