
MLOps Is an Extension of DevOps. Not a Fork — My Thoughts on THE MLOps Paper as an MLOps Startup CEO

15 min
7th September, 2023

By now, everyone must have seen THE MLOps paper.

“Machine Learning Operations (MLOps): Overview, Definition, and Architecture”

By Dominik Kreuzberger, Niklas Kühl, Sebastian Hirschl

Great stuff. If you haven’t read it yet, definitely do so.

The authors give a solid overview of:

  • What MLOps is,
  • Principles and components of the MLOps ecosystem,
  • People/roles involved in doing MLOps,
  • MLOps architecture and workflow that many teams have.

They tackle the ugly problem in the canonical MLOps movement: How do all those MLOps stack components actually relate to each other and work together?

In this article, I share how our reality as an MLOps tooling company and my personal views on MLOps agree (and disagree) with the paper. Many of the things I will talk about here I already see today. Some are my 3–4 year bets.

Just so you know where I am coming from:

  • I have a heavy software development background (15+ years in software). Lived through the DevOps revolution. Came to ML from software.
  • Founded two successful software services companies.
  • Founded neptune.ai, a modular MLOps component for ML metadata store, aka “experiment tracker + model registry”.
  • I lead the product and see what users, customers, and other vendors in this corner of the market do.
  • Most of our customers are doing ML/MLOps at a reasonable scale, NOT at the hyperscale of big-tech FAANG companies.

If you’d like a TLDR, here it is:

  • MLOps is an extension of DevOps. Not a fork:
    – The MLOps team should consist of a DevOps engineer, a backend software engineer, a data scientist, + regular software folks. I don’t see what special role ML and MLOps engineers would play here.
    – We should build ML-specific feedback loops (review, approvals) around CI/CD.
  • We need both automated continuous monitoring AND periodic manual inspection.
  • There will be only one type of ML metadata store (model-first), not three.
  • The workflow orchestration component is actually two things: workflow execution tools and pipeline authoring frameworks.
  • We don’t need a model registry. If anything, it should be a plugin for artifact repositories.
  • Model monitoring tools will merge with the DevOps monitoring stack. Probably sooner than you think.

Ok, let me explain.

MLOps is an extension of DevOps. Not a fork.

First of all, it is great to talk about MLOps and MLOps stack components, but at the end of the day, we are all just delivering software here.

A special type of software with ML in it but software nonetheless.

We should be thinking about how to connect to already existing and mature DevOps practices, stacks, and teams. But so much of what we do in MLOps is building things that already exist in DevOps and putting the MLOps stamp on them.

[Figure: MLOps is an extension of DevOps]

When companies add ML models to their products/services, something is already there.

That something is regular software delivery processes and the DevOps tool stack.

In reality, almost nobody is starting from scratch.

And in the end, I don’t see a world where MLOps and DevOps stacks sit next to each other and are not just one stack.

I mean, if you are with me on “ML is just a special type of software”, MLOps is just a special type of DevOps.

So, figuring out MLOps architecture and principles is great, but I wonder how that connects to extending the already existing DevOps principles, processes, and tools stacks.

Production ML team composition

Let’s take this “MLOps is an extension of DevOps” discussion to the team structure.

Who do we need to build reliable ML-fueled software products?

  • Someone responsible for the reliability of software delivery 🙂
  • We are building products, so there needs to be a clear connection between the product and end users.
  • We need people who build ML-specific parts of the product.
  • We need people who build non-ML-specific parts of the product.

Great, now, who are those people exactly?

I believe the team will look something like this:

  • Software delivery reliability: DevOps engineers and SREs (DevOps vs SRE here)
  • ML-specific software: software engineers and data scientists
  • Non-ML-specific software: software engineers
  • Product: product people and subject matter experts

Wait, where is the MLOps engineer?

How about the ML engineer?

Let me explain.

MLOps engineer is just a DevOps engineer

This may be a bit extreme, but I don’t see any special MLOps engineer role on this team.

MLOps engineer today is either an ML engineer (building ML-specific software) or a DevOps engineer. Nothing special here.

Should we call a DevOps engineer who primarily operates ML-fueled software delivery an MLOps engineer?

I mean, if you really want, we can, but I don’t think we need a new role here. It is just a DevOps eng.

Either way, we definitely need that person on the team.

Now, where things get interesting for me is here.

Data scientist vs ML engineer vs backend software engineer

So first, what is the actual difference between a data scientist, ML engineer, software engineer, and an ML researcher?

Today I see it like this.

[Figure: how ML researchers, data scientists, ML engineers, and software engineers compare on ML vs software skills]

In general, ML researchers are super heavy on ML-specific knowledge and less skilled in software development.

Software engineers are strong in software and less skilled in ML.

Data scientists and ML engineers are somewhere in between.

But that is today or maybe even yesterday.

And there are a few factors that will change this picture very quickly:

  • Business needs
  • Maturity of ML education

Let’s talk about business needs first.

Most ML models deployed within product companies will not be cutting-edge, super heavy on tweaking.

They won’t need state-of-the-art model compression techniques for lower latency or tweaks like that. They will be run-of-the-mill models trained on specific datasets that the org has.

That means the need for the super custom model development that data scientists and ML researchers do will be less common than building, packaging, and deploying run-of-the-mill models to prod.

There will be teams that need ML-heavy work for sure. It’s just that the majority of the market will not. Especially as those baseline models get so good.

Ok, so we’ll have more need for ML engineers than data scientists, right?

Not so fast.

Let’s talk about computer science education.

When I studied CS, I had one semester of ML. Today, there is 4x or more ML content in that same program.

I believe that packaging/building/deploying the vanilla, run-of-the-mill ML model will become common knowledge for backend devs.

Even today, most backend software engineers can easily learn enough ML to do that if needed.

Again, not talking about those tricky-to-train, heavy-on-tweaking models. I am talking about good baseline models.

So considering that:

  • Baseline models will get better
  • ML education in classic CS programs will improve
  • Business problems that need heavy ML tweaking will be less common

I believe the current roles on the ML team will evolve:

  • ML heavy role -> data scientist
  • Software heavy role -> backend software engineer

[Figure: how the current ML team roles map to the data scientist and backend software engineer roles]

So who should work on the ML-specific parts of the product?

I believe you’ll always need both ML-heavy data scientists and software-heavy backend engineers.

Backend software engs will package those models and “publish” them to production pipelines operated by DevOps engineers.

Data scientists will build models when the business problem is ML-heavy.

But you will also need data scientists even when the problem is not ML-heavy, and backend software engineers can easily deploy run-of-the-mill models.

Why?

Cause models fail.

And when they fail, it is hard to debug them and understand the root cause.

And the people who understand models really well are ML-heavy data scientists.

But even if the ML model part works “as expected”, the ML-fueled product may be failing.

That is why you also need subject matter experts closely involved in delivering ML-fueled software products.

Subject matter experts

Good product delivery needs frequent feedback loops. Some feedback loops can be automated, but some cannot.

Especially in ML. Especially when you cannot really evaluate your model without you or a subject matter expert taking a look at the results.

And it seems those subject matter experts (SMEs) are involved in MLOps processes more often than you may think.

We saw fashion designers sign up for our ML metadata store.

WHAT? It was a big surprise, so we took a look.

Turns out that teams want SMEs involved in manual evaluation/testing a lot.

Especially teams at AI-first product companies want their SMEs in the loop of model development.

It’s a good thing.

Not everything can be tested/evaluated with a metric like AUC or R2. Sometimes, people just have to check whether things actually improved, not just whether the metrics got better.

This human-in-the-loop MLOps system is actually quite common among our users.

So this human-in-the-loop design makes true automation impossible, right?

That is bad, right?

It may seem problematic at first glance, but this situation is perfectly normal and common in regular software.

We have Quality Assurance (QA) or User Researchers manually testing and debugging problems.

That is happening on top of the automated tests. So it is not “either or” but “both and”.

But SMEs definitely are present in (manual) MLOps feedback loops.

Principles and components: what is the diff vs DevOps

I really liked something that the authors of THE MLOps paper did.

They started by looking at the principles of MLOps. Not just tools but principles. Things that you want to accomplish by using tools, processes, or any other solutions.

They go into components (tools) that solve different problems later.

Too often, it is completely reversed, and the discussion is shaped by what tools do.
Or, more specifically, what the tools claim to do today.

Tools are temporary. Principles are forever. So to speak.

And the way I see it, some of the key MLOps principles are missing, and some others should be “packaged” differently.

More importantly, some of those things are not “truly MLOps” but actually just DevOps stuff.

I think as the community of builders and users of MLOps tooling, we should be thinking about principles and components that are “truly MLOps”. Things that extend the existing DevOps infrastructure.

This is our value added to the current landscape. Not reinventing the wheel and putting an MLOps stamp on it.

So, let’s dive in.

Principles

So CI/CD, versioning, collaboration, reproducibility, and continuous monitoring are things that you also have in DevOps. And many things we do in ML actually fall under those quite clearly.

Let’s go into those nuances.

CI/CD + CT/CE + feedback loops

If we say that MLOps is just DevOps + “some things”, then CI/CD is a core principle of that.

With CI/CD, you get automatically triggered tests, approvals, reviews, feedback loops, and more.

With MLOps come CT (continuous training/testing) and CE (continuous evaluation), which are essential to a clean MLOps process.

Are they separate principles?

No, they are a part of the very same principle.

With CI/CD, you want to build, test, integrate, and deploy software in an automated or semi-automated fashion.

Isn’t training ML models just building?

And evaluation/testing just, well, testing?

What is so special about it?

Perhaps it is the manual inspection of new models.

That feels very much like reviewing and approving a pull request by looking at the diffs and checking that (often) automated tests passed.

Diffs between not only code but also models/datasets/results. But still diffs.

Then you approve, and it lands in production.

I don’t really see why CT/CE are not just a part of CI/CD. If not in naming, then at least in putting them together as a principle.

The review and approval mechanism via CI/CD works really well.

We shouldn’t be building brand new model approval mechanisms into MLOps tools.

We should integrate CI/CD into as many feedback loops as possible. Just like people do with QA and testing in regular software development.
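
To make this concrete, here is a minimal sketch of what such a feedback loop could look like: a small Python script that a CI job (GitHub Actions, GitLab CI, CircleCI, whatever you already run) executes on a retraining pull request, failing the check when the candidate model is worse than the one in production. The file paths, metric, and threshold are made up for illustration.

```python
# evaluate_candidate.py -- a hypothetical CI gate for a retraining pull request.
# The CI job runs this script; a non-zero exit code fails the check, so the
# "model review" becomes part of the normal PR review/approval flow.
import json
import sys

BASELINE_METRICS = "metrics/production.json"  # metrics of the current prod model (hypothetical path)
CANDIDATE_METRICS = "metrics/candidate.json"  # metrics produced by the CI training/evaluation step
MAX_ALLOWED_DROP = 0.01                       # tolerate at most a 0.01 drop in AUC


def load_auc(path: str) -> float:
    with open(path) as f:
        return json.load(f)["auc"]


def main() -> int:
    baseline = load_auc(BASELINE_METRICS)
    candidate = load_auc(CANDIDATE_METRICS)
    print(f"production AUC={baseline:.4f}, candidate AUC={candidate:.4f}")
    if candidate < baseline - MAX_ALLOWED_DROP:
        print("Candidate model is worse than production -- failing the CI check.")
        return 1
    print("Candidate model passes the automated gate and goes to human review/approval.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```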

Workflow orchestration and pipeline authoring

When we talk about workflow orchestration in ML, we usually mix two things.

One is the scheduling, execution, retries, and caching. Things that we do to make sure that the ML pipeline executes properly. This is a classic DevOps use case. Nothing new.

But there is something special here: the ability to author ML pipelines easily.

Pipeline authoring?

Yep.

When creating integration with Kedro, we learned about this distinction.

Kedro explicitly states that they are a framework for “pipeline authoring”, NOT workflow orchestration. They say:

“We focus on a different problem, which is the process of authoring pipelines, as opposed to running, scheduling, and monitoring them.”

You can use different back-end runners (like Airflow, Kubeflow, Argo, Prefect), but you can author them in one framework.
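
To show what “authoring” means in practice, here is a minimal Kedro-style sketch. The dataset names and functions are made up for illustration; the point is that the pipeline definition is plain Python, separate from whatever runner eventually executes it.

```python
# A tiny Kedro pipeline definition (illustrative only). The dataset names
# ("raw_reviews", "clean_reviews", "model") would refer to entries in Kedro's
# Data Catalog; the functions are ordinary Python functions.
from kedro.pipeline import Pipeline, node


def clean(raw_reviews):
    # e.g., drop empty rows, normalize text
    return raw_reviews.dropna()


def train(clean_reviews):
    # fit and return a model object (placeholder here)
    return "trained-model-placeholder"


def create_pipeline() -> Pipeline:
    return Pipeline(
        [
            node(clean, inputs="raw_reviews", outputs="clean_reviews", name="clean"),
            node(train, inputs="clean_reviews", outputs="model", name="train"),
        ]
    )
```

The same pipeline definition can then be executed locally or handed to one of those back-end runners; the authoring layer doesn’t care which.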


Pipeline authoring is this developer experience (DevEx) layer on top of orchestrators that caters to data science use cases. It makes collaboration on those pipelines easier.

Collaboration and re-usability of pipelines by different teams were the very reasons why Kedro was created.

And if you want re-usability of ML pipelines, you sort of need to solve reproducibility while you are at it. After all, if you re-use a model training pipeline with the same inputs, you expect the same result.

Versioning vs ML metadata tracking/logging

Those are not two separate principles but actually parts of a single one.

We’ve spent thousands of hours talking to users/customers/prospects about this stuff.

And you know what we’ve learned?

Versioning, logging, recording, and tracking of models, results, and ML metadata are extremely connected.

I don’t think we know exactly where one ends and the other starts, let alone our users.

They use versioning/tracking interchangeably a lot.

And it makes sense as you want to version both the model and all metadata that comes with it. Including model/experimentation history.

You want to know:

  • how the model was built,
  • what were the results,
  • what data was used,
  • what the training process looked like,
  • how it was evaluated,
  • etc.

Only then can you talk about reproducibility and traceability.

And so in ML, we need this “versioning +” which is basically not only versioning of the model artifact but everything around it (metadata).

So perhaps the principle of “versioning” should just be a wider “ML versioning” or “versioning +” which includes tracking/recording as well.
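
To illustrate what “versioning +” boils down to in practice, here is a rough, self-contained sketch of the record you want for every model version. The field names and helper function are made up; an ML metadata store / experiment tracker is essentially a searchable database of records like this.

```python
# A hypothetical "versioning +" record: the model artifact version plus the
# metadata needed to reproduce and trace it.
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def build_model_version_record(model_path: str, dataset_path: str,
                               params: dict, metrics: dict) -> dict:
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "code": {  # how the model was built
            "git_sha": subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True).strip(),
        },
        "data": {"path": dataset_path, "sha256": file_sha256(dataset_path)},  # what data was used
        "params": params,    # what the training process looked like
        "metrics": metrics,  # what the results were / how it was evaluated
        "artifact": {"path": model_path, "sha256": file_sha256(model_path)},
    }


if __name__ == "__main__":
    record = build_model_version_record(
        "model.pkl", "train.csv",
        params={"lr": 0.01, "max_depth": 6},
        metrics={"auc": 0.87},
    )
    print(json.dumps(record, indent=2))
```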

Model debugging, inspection, and comparison (missing)

“Debugging, inspection and comparison” of ML models, experiments, and pipeline execution runs is a missing principle in the MLOps paper.

Authors talked about things around versioning, tracking, and monitoring but a principle that we see people want that wasn’t mentioned is this:

As of today, a lot of the things in ML are not automated. They are manual or semi-manual.

In theory, you could automatically optimize hyperparameters for every model to infinity, but in practice, you are tweaking the model config based on the results exploration.

When models fail in production, you don’t know right away from the logs what happened (most of the time).

You need to take a look, inspect, debug, and compare model versions.

Obviously, you experiment a lot during the model development, and then comparing models is key.

But what happens later when those manually-built models hit retraining pipelines?

You still need to compare the in-prod automatically re-trained models with the initial, manually-built ones.

Especially when things don’t go as planned, and the new model version isn’t actually better than the old one.

And those comparisons and inspections are manual.

Automated continuous monitoring (+ manual periodic inspection)

So I am all for automation.

Automating mundane tasks. Automating unit tests. Automating health checks.

And when we speak about continuous monitoring, it is basically automated monitoring of various ML health checks.

You need to answer two questions before you do that:

  • What do you know can go wrong, and can you set up health checks for that?
  • Do you even have a real need to set up those health checks?

Yep, many teams don’t really need production model monitoring.

I mean, you can inspect things manually once a week. Find problems you didn’t know you had. Get more familiar with your problem.

As Shreya Shankar shared in her “Thoughts on ML Engineering After a Year of my PhD”, you may not need model monitoring. Just retrain your model periodically.

“Researchers think distribution shift is very important, but model performance problems that stem from natural distribution shift suddenly vanish with retraining.” — Shreya Shankar

You can do that with a cron job. And the business value that you generate through this dirty work will probably be 10x the tooling you buy.
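
If it helps, here is roughly what that “dirty work” can look like: a plain retraining script plus one crontab line. Everything here (paths, data, model choice) is hypothetical.

```python
# retrain.py -- hypothetical periodic retraining, no orchestrator needed.
# Scheduled with cron, e.g. every Monday at 03:00:
#   0 3 * * 1  /usr/bin/python3 /opt/ml/retrain.py >> /var/log/retrain.log 2>&1
from datetime import date

import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def main() -> None:
    df = pd.read_csv("/data/latest_training_data.csv")  # most recent labeled data
    X, y = df.drop(columns=["label"]), df["label"]
    model = GradientBoostingClassifier().fit(X, y)
    # Write a dated artifact so you can roll back to last week's model.
    joblib.dump(model, f"/models/model-{date.today().isoformat()}.joblib")


if __name__ == "__main__":
    main()
```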

Ok, but some teams do need it, 100%.

Those teams should set up continuous monitoring, testing, and health checks for whatever they know can go wrong.
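
As one concrete example of such a health check (a common pattern, not something prescribed by the paper): a population stability index (PSI) check on a single input feature, with the usual rule-of-thumb thresholds.

```python
# A simple drift health check: population stability index (PSI) between the
# training (reference) distribution of a feature and recent serving traffic.
# Thresholds around 0.1 / 0.25 are a common rule of thumb, not a law.
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    training_feature = rng.normal(0.0, 1.0, 50_000)
    todays_feature = rng.normal(0.3, 1.0, 5_000)  # slightly shifted traffic
    score = psi(training_feature, todays_feature)
    print(f"PSI = {score:.3f}")
    if score > 0.25:
        print("ALERT: significant drift on this feature")
```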

But even then, you need to manually inspect/debug/compare your models from time to time.

To catch new things that you didn’t know about your ML system.

Silent bugs that no metric can catch.

I guess that was a long way of saying that:

You need not only continuous monitoring but also manual periodic inspection.

Data management

Data management in ML is an essential and much bigger process than just version control.

You have data labeling, reviewing, exploration, comparison, provisioning, and collaboration on datasets.

Especially now, when the idea of data-centric MLOps (iterating over datasets is more important than iterating over model configurations) is gaining so much traction in the ML community.

Also, depending on how quickly your production data changes or how you need to set up evaluation datasets and test suites, your data needs will determine the rest of your stack. For example, if you need to retrain very often, you may not need the model monitoring component, or if you are solving only CV problems, you may not need a feature store, etc.

Collaboration

When authors talk about collaboration, they say:

“P5 Collaboration. Collaboration ensures the possibility to work collaboratively on data, model, and code.”

And they show this collaboration (P5) happening in the source code repository:

[Figure: the collaboration (P5) principle attached to the source code repository in the paper’s architecture]

This is far from the reality we observe.

Collaboration is also happening with:

  • Experiments and model-building iterations
  • Data annotation, cleanups, sharing datasets and features
  • Pipeline authoring and re-using/transferring
  • CI/CD review/approvals
  • Human-in-the-loop feedback loops with subject matter experts
  • Model hand-offs
  • Handling problems with in-production models and communication between the front line (users, product people, subject matter experts) and model builders

And to be clear, I don’t think we as an MLOps community are doing a good job here.

Collaboration in source code repos is a good start, but it doesn’t solve even half of the collaboration issues in MLOps.

Ok, so we talked about the MLOps principles, let’s now talk about how those principles are/should be implemented in tool stack components.

Components

Again, many components, like CI/CD, source version control, training/serving infrastructure, and monitoring, are just part of DevOps.

But there are a few extra things and some nuance to the existing ones IMHO.

  • Pipeline authoring
  • Data management
  • ML metadata store (yeah, I know, I am biased, but I do believe that, unlike in software, experimentation, debugging, and manual inspection play a central role in ML)
  • Model monitoring as a plugin to application monitoring
  • No need for a model registry (yep)

Workflow executors vs workflow authoring frameworks

As we touched on it before in principles, we have two subcategories of workflow orchestration components:

  • Workflow orchestration/execution tools
  • Pipeline authoring frameworks

The first one is about making sure that the pipeline executes properly and efficiently. Tools like Prefect, Argo, and Kubeflow help you do that.

The second is about the DevEx of creating and reusing pipelines. Frameworks like Kedro, ZenML, and Metaflow fall into this category.

Data management

What this component (or a set of components) should ideally solve is:

  • Data labeling
  • Feature preparation
  • Feature management
  • Dataset versioning
  • Dataset reviews and comparison

Today, it seems to be either done by a home-grown solution or a bundle of tools:

  • Feature stores like Tecton. Interestingly, they are now going more in the direction of a feature management platform: “Feature Platform for Real-Time Machine Learning”.
  • Labeling platforms like Labelbox.
  • Dataset version control with DVC.
  • Feature transformation and dataset preprocessing with dbt (dbt Labs).

Should those be bundled into one “end-to-end data management platform” or solved with best-in-class, modular, and interoperable components?

I don’t know.

But I do believe that the collaboration between users of those different parts is super important.

Especially now in this more data-centric MLOps world. And even more so when subject matter experts review those datasets.

And no tool/platform/stack is doing a good job here today.

ML metadata store (just one)

In the paper, ML metadata stores are mentioned in three contexts, and it is not clear whether we are talking about one component or more. Authors talk about:

  • ML metadata store configured next to the Experimentation component
  • ML metadata store configured with Workflow Orchestration
  • ML metadata store configured with Model registry

[Figure: the three ML metadata stores in the paper’s MLOps architecture]

The way I see it, there should just be one ML metadata store that enables the following principles:

  • “reproducibility”
  • “debugging, comparing, inspection”
  • “versioning+” (versioning + ML metadata tracking/logging), which includes metadata/results from any tests and evaluations at different stages (for example, health checks and test results of model release candidates before they go to a model registry)

Let me go over those three ML metadata stores and explain why I think so.

1. ML metadata store configured next to the Experimentation component

This one is pretty easy. Maybe because I hear about it all the time at Neptune.

When you experiment, you want to iterate over various experiment/run/model versions, inspect the results, and debug problems.

You want to be able to reproduce the results and have the ready-for-production models versioned.

You want to “keep track of” experiment/run configurations and results, parameters, metrics, learning curves, diagnostic charts, explainers, and example predictions.

You can think of it as a run or model-first ML metadata store.

That said, most people we talk to call the component that solves it an “experiment tracker” or an “experiment tracking tool”.

The “experiment tracker” seems like a great name when it relates to experimentation.

But then you use it to compare the results of initial experiments to CI/CD-triggered, automatically run production re-training pipelines, and the “experiment” part doesn’t seem to cut it anymore.

I think that “ML metadata store” is a way better name because it captures the essence of this component: make it easy to “Log, store, compare, organize, search, visualize, and share ML model metadata”.

Ok, one ML metadata store explained. Two more to go.

2. ML metadata store configured with Workflow Orchestration

This one is interesting as there are two separate jobs that people want to solve with this one: ML-related (comparison, debugging) and software/infrastructure-related (caching, efficient execution, hardware consumption monitoring).

From what I see among our users, those two jobs are solved by two different types of tools:

  • People solve the ML-related job by using native solutions or integrating with external experiment trackers. They want to have the re-training run results in the same place where they have the experimentation results. Makes sense, as you want to compare/inspect/debug those.
  • The software/infrastructure-related job is done either by the orchestrator components or by traditional software tools like Grafana, Datadog, etc.

Wait, so shouldn’t the ML metadata store configured next to the workflow orchestration tool gather all the metadata about pipeline execution, including the ML-specific part?

Maybe it should.

But most ML metadata stores configured with workflow orchestrators weren’t purpose-built with the “compare and debug” principle in mind.

They do other stuff really well, like:

  • caching intermediate results,
  • retrying based on execution flags,
  • distributing execution on available resources,
  • stopping execution early.

And probably because of all that, we see people use our experiment tracker to compare/debug the results of complex ML pipeline executions.

So if people are using an experiment tracker (or run/model-first ML metadata store) for the ML-related stuff, what should happen with this other pipeline/execution-first ML metadata store?

It should just be a part of the workflow orchestrator. And it often is.

It is an internal engine that makes pipelines run smoothly. And by design, it is strongly coupled with the workflow orchestrator. Doesn’t make sense for that to be outsourced to a separate component.

Ok, let’s talk about the third one.

3. ML metadata store configured with Model registry

Quoting the paper:

“Another metadata store can be configured within the model registry for tracking and logging the metadata of each training job (e.g., training date and time, duration, etc.), including the model-specific metadata — e.g., used parameters and the resulting performance metrics, model lineage: data and code used”

Ok, so almost everything listed here is logged to the experiment tracker.

What is typically not logged there? Probably:

  • Results of pre-production tests, logs from retraining runs, CI/CD triggered evaluations.
  • Information about how the model was packaged.
  • Information about when the model was approved/transitioned between stages (stage/prod/archive).

Now, if you think of the “experiment tracker” more widely, like I do, as an ML metadata store that solves for “reproducibility”, “debugging, comparing, inspection”, and “versioning +” principles, then most of that metadata actually goes there.

Whatever doesn’t (stage transition timestamps, for example) is saved in places like GitHub Actions, Docker Hub, Artifactory, or other CI/CD tools.

I don’t think there is anything left to be logged to a special “ML metadata store configured next to the model registry”.

I also think that this is why so many teams that we talk to expect close coupling between experiment tracking and model registry.

It makes so much sense:

  • They want all the ML metadata in the experiment tracker.
  • They want to have a production-ready packaged model in the model registry.
  • They want a clear connection between those two components.

But there is no need for another ML metadata store.

There is only one ML metadata store. That, funny enough, most ML practitioners don’t even call an “ML metadata store” but an “experiment tracker”.

Ok, since we are talking about “model registry”, I have one more thing to discuss.

Model registry. Do we even need it?

Some time ago, we introduced the model registry functionality to Neptune, and we keep working on improving it for our users and customers.

At the same time, if you asked me if there is/will be a need for a model registry in MLOps/DevOps in the long run, I would say No!

For us, “model registry” is a way to communicate to the users and the community that our ML metadata store is the right tool stack component to store and manage ML metadata about your production models.

But it is not and won’t be the right component to implement an approval system, do model provisioning (serving), auto-scaling, canary tests etc.

Coming from the software engineering world, it would feel like reinventing the wheel here.

Wouldn’t some artifact registry like Docker Hub or JFrog Artifactory be the thing?

Don’t you just want to put the packaged model inside a Helm chart on Kubernetes and call it a day?

Sure, you need references to the model building history or results of the pre-production tests.

You want to ensure that the new model’s input-output schema matches the expected one.

You want to approve models in the same place where you can compare previous/new ones.

But all of those things don’t really “live” in a new model registry component, do they?

They are mainly in CI/CD pipelines, docker registry, production model monitoring tools, or experiment trackers.

They are not in a shiny new MLOps component called the model registry.

You can solve it with nicely integrated:

  • CI/CD feedback loops that include manual approvals & “deploy buttons” (check out how CircleCI or GitLab do this)
  • + Model packaging tool (to get a deployable package)
  • + Container/artifact registry (to have a place with ready-to-use models)
  • + ML metadata store (to get the full model-building history)

Right?

Can I explain the need for a separate tool for the model registry to my DevOps friends?

Many ML folks we talk to seem to get it.

But is it because they don’t really have the full understanding of what DevOps tools offer?

I guess that could be it.

And truth be told, some teams have a home-grown solution for a model registry, which is just a thin layer on top of all of those tools.

Maybe that is enough. Maybe that is exactly what a model registry should be. A thin layer of abstraction with references and hooks to other tools in the DevOps/MLOps stack.
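
If I had to sketch that thin layer, it would be little more than a record of references into the tools you already run. Nothing below is a real registry API; it is just an illustration of how little such a component has to own itself.

```python
# A hypothetical "thin" model registry entry: it owns almost nothing itself
# and mostly points at the systems that already hold the real information.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ModelRegistryEntry:
    name: str                      # e.g. "churn-classifier"
    version: str                   # e.g. "1.4.0"
    image: str                     # container/artifact registry reference
    experiment_run: str            # link to the run in the ML metadata store
    ci_pipeline: str               # link to the CI/CD pipeline that built and tested it
    stage: str = "staging"         # staging / production / archived
    approved_by: Optional[str] = None
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


entry = ModelRegistryEntry(
    name="churn-classifier",
    version="1.4.0",
    image="registry.example.com/ml/churn-classifier:1.4.0",
    experiment_run="https://tracker.example.com/org/churn/runs/CHURN-123",
    ci_pipeline="https://ci.example.com/pipelines/4567",
)
```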

Model monitoring. Wait, which one?

“Model monitoring” takes the cake when it comes to the vaguest and most confusing name in the MLOps space (“ML metadata store” came second btw).

“Model monitoring” means six different things to three different people.

We talked to teams that meant:

  • (1) Monitor model performance in production: See if the model performance decays over time, and you should re-train it.
  • (2) Monitor model input/output distribution: See if the distribution of input data, features, or predictions changes over time.
  • (3) Monitor model training and re-training: See learning curves, trained model predictions distribution, or confusion matrix during training and re-training.
  • (4) Monitor model evaluation and testing: Log metrics, charts, predictions, and other metadata for your automated evaluation or testing pipelines.
  • (5) Monitor infrastructure metrics: See how much CPU/GPU or Memory your models use during training and inference.
  • (6) Monitor CI/CD pipelines for ML: See the evaluations from your CI/CD pipeline jobs and compare them visually.

For example:

  • Neptune does (3) and (4) really well, (5) just ok (working on it), but we’ve also seen teams use it for (6)
  • Prometheus + Grafana is really good at (5), but people use it for (1) and (2)
  • Whylabs or Arize AI are really good at (1) and (2)

As I do believe MLOps will just be an extension of DevOps, we need to understand where software observability tools like Datadog, Grafana, NewRelic, and ELK (Elasticsearch, Logstash, Kibana) fit into MLOps today and in the future.

Also, some parts are inherently non-continuous and non-automatic. Like comparing/inspecting/debugging models. There are subject matter experts and data scientists involved. I don’t see how this becomes continuous and automatic.

But above all, we should figure out what is truly ML-specific and build modular tools or plugins there.

For the rest, we should just use more mature software monitoring components that quite likely your DevOps team already has.

So perhaps the following split would make things more obvious:

  • Production model observability and monitoring (WhyLabs, Arize)
  • Monitoring of model training, re-training, evaluation, and testing (MLflow, Neptune)
  • Infrastructure and application monitoring (Grafana, Datadog)

I’d love to see how the CEOs of Datadog and Arize AI think about their place in DevOps/MLOps long-term.

Is drift detection just a “plugin” to the application monitoring stack? I don’t know, but it seems reasonable, actually.
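
Here is roughly what that “plugin” could look like: the ML-specific code computes a drift score, and the existing monitoring stack does the rest. A minimal sketch using the Prometheus Python client (the metric name, model, and features are made up); Prometheus scrapes the endpoint, and Grafana dashboards and alert rules take over from there.

```python
# Exposing a drift score to the existing monitoring stack instead of a
# standalone ML monitoring UI. Prometheus scrapes :8000/metrics; dashboards
# and alerting live in the tools the DevOps team already runs.
import random
import time

from prometheus_client import Gauge, start_http_server

feature_drift = Gauge(
    "model_feature_drift_psi",
    "Population stability index of a model input feature",
    ["model", "feature"],
)


def compute_psi_for(feature: str) -> float:
    # Placeholder for the real drift computation (see the PSI sketch above).
    return random.uniform(0.0, 0.3)


if __name__ == "__main__":
    start_http_server(8000)
    while True:
        for feature in ["age", "num_sessions", "days_since_signup"]:
            feature_drift.labels(model="churn-classifier", feature=feature).set(
                compute_psi_for(feature))
        time.sleep(60)
```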

Final thoughts and open challenges

If there is anything I want you to take away from this article, it is this.

We shouldn’t be thinking about how to build the MLOps stack from scratch.

We should be thinking about how to gradually extend the existing DevOps stack to specific ML needs that you have right now.

Authors say:

“To successfully develop and run ML products, there needs to be a culture shift away from model-driven machine learning toward a product-oriented discipline

Especially the roles associated with these activities should have a product-focused perspective when designing ML products”.

I think we need an even bigger mindset shift:

ML models -> ML products -> Software products that use ML -> just another software product

And your ML-fueled software products are connected to the existing infrastructure of delivering software products.

I don’t see why ML is a special snowflake here long-term. I really don’t.

But even when looking at the MLOps stack presented, what is the pragmatic v1 version of this that 99% of teams will actually need?

The authors interviewed ML practitioners from companies with 6500+ employees. Most companies doing ML in production are not like that. And the MLOps stack is way simpler for most teams.

Especially those who are doing ML/MLOps at a reasonable scale.

They choose maybe 1 or 2 components that they go deeper on and have super basic stuff for the rest.

Or nothing at all.

You don’t need:

  • Workflow orchestration solutions when a cron job is enough.
  • Feature store when a CSV is enough.
  • Experiment tracking when a spreadsheet is enough.

Really, you don’t.

We see many teams deliver great things by being pragmatic and focusing on what is important for them right now.

At some point, they may grow their MLOps stack to what we see in this paper.

Or go to a DevOps conference and realize they should just be extending the DevOps stack 😉


I occasionally share my thoughts on ML and MLOps landscape on my Linkedin profile. Feel free to follow me there if that’s an interesting topic for you. Also, reach out if you’d like to chat about it.
