
What Does GPT-3 Mean For the Future of MLOps? With David Hershey

23 min
19th December, 2023

This article was originally an episode of MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners.

Every episode is focused on one specific ML topic, and during this one, we talked to David Hershey about GPT-3 and the future of MLOps.

You can watch it on YouTube or listen to it as a podcast. But if you prefer a written version, here you have it!

In this episode, you will learn about: 

  • 1 What is GPT-3 all about?
  • 2 What is GPT-3’s impact on the MLOps field and how is it changing ML?
  • 3 How can language models complement MLOps?
  • 4 What are the concerns associated with building this sort of MLOps system?
  • 5 How are startups and companies already leveraging LLMs to ship products fast?

Stephen: On this call, we have David Hershey, one of the community’s favorites, I would say – I dare to say, in fact – and we will be talking about what OpenAI GPT-3 means for the MLOps world. David is currently a Vice President at Unusual Ventures, where they are raising the bar of what founders should expect from their venture investors. Prior to Unusual Ventures, he was a Senior Solutions Architect at Tecton. Prior to Tecton, he worked as a Solutions Engineer at Determined AI and as a Product Manager for the ML Platform at Ford Motor Company.

David: Thank you. Excited to be here and excited to chat.

Stephen: I’m just curious, just to give some background: what’s really your role at Unusual Ventures?

David: Unusual is a venture fund, and my current focus is on our machine learning and data infrastructure investments. I lead all the work we do thinking about the future of machine learning infrastructure and data infrastructure, and a little bit about DevTools more generally. It’s a continuation of what I’ve been doing: I’ve spent five or six years now dedicated to thinking about ML infrastructure, and I’m still doing that, but this time trying to figure out the next wave of it.

Stephen: Yeah, that’s pretty awesome. And you wrote a few blog posts on the next wave of ML infrastructure. Could you sort of throw more light into what you’re seeing there?

David: Yeah, it’s been a long MLOps journey, I suppose, for a lot of us, and there have been ups and downs for me. We’ve accomplished an amazing number of things. When I got into this, there were not many tools, and now there are so many tools and so many possibilities, and I think some of that’s good and some of it’s bad. 

The topic of this conversation, obviously, is to dive a little bit into GPT-3 and language models; there’s all this hype now about Generative AI. 

I think there’s this incredible opportunity to broaden the number of ML applications we can build and the set of people that can build machine learning applications thanks to recent advances in language models like ChatGPT and GPT-3 and things like that. 

Regarding MLOps, there are new tools we can think about, there are new people that can participate, and there are old tools that might have new capabilities that we can think about too. So there’s a ton of opportunities.

What is GPT-3? 

Stephen: Yeah, absolutely, we’ll definitely delve into that. Speaking of the Generative AI space, the core focus of this episode would be the GPT-3, but could you share a bit more about what GPT-3 means and just give a background there?

David: Of course. GPT-3 is related to ChatGPT, which is the thing I guess the whole world’s heard about now. 

In general, it’s a large language model, not altogether that different from language machine learning models we’ve seen in the past that do various natural language processing tasks.

 It’s built on top of the transformer architecture that was released by Google in 2017, but GPT-3 and ChatGPT are sort of proprietary incarnations of that from OpenAI. 

They’re called large language models because, in the last six or so years, what we’ve been doing largely is giving them more data and making the models bigger. As we’ve done that, through both GPT-3 and other folks who have trained language models, we’ve seen these amazing sets of capabilities emerge, beyond just the classical things we’ve associated with language processing, like sentiment analysis.

These language models can do more complex reasoning and solve a ton of language tasks efficiently; one of the most popular incarnations is ChatGPT, which is essentially a chatbot capable of having human-like conversations.


The impact of GPT-3 on MLOps  

Stephen: Awesome. Thanks for sharing that… What are your thoughts on the impact of GPT-3 on the MLOps field? And how do you see machine learning changing?

David: I think there are a couple of really interesting pieces to tease out about what language models mean for the world of MLOps – maybe I want to separate it into two things.

1. Language models

Language models, as I said, have an amazing number of capabilities. They can solve a surprisingly large number of tasks without any extra work; this means you don’t have to train or tune anything – you are only required to write a good prompt.

Several problems can be solved using language models.
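
To make that concrete, here’s a minimal sketch of that “just write a good prompt” workflow, assuming the legacy (pre-1.0) OpenAI Python client and an API key in your environment; the model choice and prompt are illustrative:

```python
import os

import openai  # pip install "openai<1.0"

openai.api_key = os.environ["OPENAI_API_KEY"]

# No training, tuning, or serving infrastructure: the whole "model
# development" step is reduced to writing a good prompt.
prompt = (
    "Classify the sentiment of this review as positive, negative, or neutral.\n"
    "Review: The checkout flow kept crashing on my phone.\n"
    "Sentiment:"
)

response = openai.Completion.create(
    model="text-davinci-003",  # illustrative model choice
    prompt=prompt,
    max_tokens=5,
    temperature=0,  # keep the output stable for a classification-style task
)

print(response["choices"][0]["text"].strip())  # e.g. "Negative"
```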

The nice thing about being able to use a model someone else trained is you offload the MLOps to the people building the model, and you still get to do a whole bunch of fun work downstream. 

You don’t need to worry as much about inference, versioning, or data.

There are all these problems that suddenly fall out, enabling you to focus on other things, which I think broadens the accessibility of machine learning in a lot of cases. 

But not every use case is going to be immediately solved; Language models are good, but they’re not everything yet.

One category to think about is this: if we don’t need to train models anymore for some set of things,

  • What activities are we taking part in? 
  • What are we doing, and what tools do we need? 
  • What talents and skills do we need to be able to build machine learning systems on top of language models? 

2. How language models complement MLOps

We are still training models; there are still a lot of cases where we do that, and I think it’s worth at least commenting on the impact of language models today. 

One of the hardest things about MLOps today is that a lot of data scientists aren’t native software engineers, but it may be possible to lower the bar to software engineering. 

For example, there has been a lot of hype around translating natural language to things like SQL so that it’s a little bit easier to do data discovery and things like that. Those are more sideshows of the conversation, or complementary pieces, maybe.

But I think it is still impactful when you think about whether there’s a way language models can be used to lower the bar of who can actually participate in traditional MLOps by making the software aspects more accessible, the data aspects more accessible, et cetera.

The accessibility of large language models

Stephen: When you talk about GPT-3 and Large Language Models (LLMs), some people think these are tools for large companies like Microsoft, OpenAI, Google, etc.

How are you seeing the trend toward making these systems more accessible for smaller organizations, early-stage startups, or smaller teams that want to leverage this stuff and put it out there for consumers?

David: Yeah, I actually think this is maybe the most exciting thing that’s come out of language models, and I’ll frame it in a couple of ways. 

Someone else has figured out MLOps for the Large Language Models.

To some extent, they’re serving them, they’re versioning them, they’re iterating on them, they’re doing all the fine-tuning. And what that means is for a lot of companies that I work with and talk to, machine learning in this form is way more accessible than it’s ever been – they don’t need to hire a person to learn how to do machine learning and learn PyTorch and figure out all of MLOps to be able to get something out.

The amazing thing with language models is you can kind of get your MVP out by just writing a good prompt on the OpenAI playground or something like that.

A lot of them are demos at that point, they’re still not products. But I think the message is the same: it’s suddenly so easy to go from an idea to something that looks like it actually works.

At a very surface level, the obvious thing is that anybody can try and potentially build something pretty cool; it’s not that hard, and that’s great – not hard is great.

We’ve been doing very hard work to create simple ML models for a while, and this is really cool. 

The other thing I’ll touch on is this: when I look back to my time at Ford, a major theme that we thought about was democratizing data. 

How can we make it so the whole company can interact with data? 

Democratization has been all talk for the most part, and language models, to some extent, have done a little bit of data democratizing for the whole world. 

To explain that a little further, when you think about what those models are, the way that GPT-3 or the other similar language models are trained is on this corpus of data called the Common Crawl, which is essentially the whole internet, right? So they download all of the text on the internet, and they train language models to predict all of that text. 

One of the things you used to need to do the machine learning that we’re all familiar with is data collection.

When I was at Ford, we needed to hook things up to the car, telemetry it out, download all that data somewhere, make a data lake, and hire a team of people to sort that data and make it usable; the blocker to doing any ML was changing cars and building data lakes and things like that.

One of the most exciting things about language models is you don’t need to hook up a lot of stuff. You just sort of say, please complete my text, and it will do it. 

I think one of the bars that a lot of startups had in the past was this cold start problem. Like, if you don’t have data, how do you build ML? And now, on day one, you can do it, anybody can. 

That’s really cool.


What do startups worry about if MLOps is solved? 

Stephen: And it’s quite interesting because if you’re not worrying about these things, then what are you worrying about as a startup? 

David: Well, I’ll give the good and then the bad…

The good case is worrying about what people think, right? You’re customer-centric.

Instead of worrying about how you’re going to find another MLOps person or a data engineer, who are hard to find because there aren’t enough of them, you can worry about building something that customers want, listening to customers, building cool features, and hopefully, you can iterate more quickly too.

The other side of this that all of the VCs in the world like to talk about is defensibility – we don’t need to get deep into that.

But when it’s so easy to build something with LLMs, then it’s sort of table stakes – it stops being this cool, differentiated thing that sets you apart from your competition.

Before, if you built an incredible credit scoring model, that could make you a better insurance provider or a better loan provider – the model itself was the differentiator.

Text completion is kind of table stakes right now. A lot of folks are worried about how to build something that competitors can’t rip off tomorrow – but hey, that’s not a bad problem to have.

Going back to what I said earlier, you can focus on what people want and how people are interacting with it and maybe frame it slightly differently. 

For example, there’s all of this MLOps tooling, and the thing that’s kind of at the far end is monitoring, right? When we think about it, it’s like you ship a model, and the last thing you do is monitor it so that you can continuously update and stuff like that. 

But monitoring for a lot of MLOps teams I work with is sort of still an afterthought because they’re still working on getting to the point where they have something to monitor. But monitoring is actually the cool part; it’s where people are using your system, and you’re figuring out how to iterate and change it to make it better. 

Almost everybody I know that’s doing language model stuff right now is already monitoring because they shipped something in five days; they’re working on iterating with customers now instead of scratching their heads trying to figure things out.

We can focus more on iterating these systems with users in mind instead of the hard PyTorch stuff and all that.


Has the data-centric approach to ML changed after the arrival of large language models? 

Stephen: Prior to LLMs, there was a frenzy around data-centric AI approaches to building the systems. How does this sort of approach to building your ML systems link to now having Large Language Models that already have been trained on this vast amount of data?

David: Yeah, I guess one thing I want to call out is that – 

The machine learning that’s least likely to be replaced by language models in the short term is some of the most data-centric stuff.

When I was at Tecton – they build a feature store – a lot of the problems we were working on were things like fraud detection, recommendation systems, and credit scoring. It turns out the hard part of all of those systems is not the machine learning part, it’s the data part.

You almost always need to know a lot of small facts about all of your users around the world, in a short amount of time; this data is then used to synthesize the answer. 

In that sense, the hard part of the problem is still data: you need to know what someone just clicked on, or the last five things they bought. Those problems aren’t going away. You still need to know all of that information. You need to be focused on understanding and working with data – I’d be surprised if language models had almost any impact on some of that.

There are a lot of cases where the hard part is just having the right data to make decisions. In those cases, being data-centric – asking what data you need to collect, how to turn it into features, and how to use it to make predictions – means asking the right questions.

On the language model side of things, the data question is interesting – you potentially need a little bit less focus on data to get started. You don’t need to curate and think about everything, but you must ask questions about how people actually use this – as well as all the monitoring questions we talked about.

If you’re building something like a chatbot, you need product analytics to track how users respond to each generation or whatever we’re doing. So data is still really important for those.

We can get into it, but it certainly has a different texture than it used to because data is not a blocker to building features with language models as often anymore. It’s maybe an important part to keep improving, but it’s not a blocker to get started like it used to be.

How are companies leveraging LLMs to ship products fast? 

Stephen: Awesome. And I’m trying not to lose my train of thought on the other MLOps component side of things, but I just wanted to give a bit of context again…

From your experience, how are companies leveraging these LLMs to ship products out fast? Have you seen use cases you want to share, based on your time with them at Unusual?

David: It’s almost everything; you’d be amazed at how many things are out there. 

There are maybe a handful of obvious use cases of language models out there, and then we’ll talk about some of the quick-shipping things too…

Writing assistants 

There are lots of tools that help you write; for example, copy for marketing or blogs or whatever. Examples of such tools include Jasper.AI and Copy.AI – they have been around the longest. This is probably the easiest thing to implement with a language model.

Agents 

There are use cases out there helping you take action. These are some of the coolest things going on right now. The idea is to build an agent that takes tasks in natural language and carries them out for you. For example, it could send an email or hit an API. It’s still nascent and there’s more work going on there, but it’s neat.

Search & semantic retrieval

A lot of folks are working on search and semantic retrieval and things like that… For example, if I want to look for a note, I can get a rich understanding of how to search through a large body of information. Language models are good at digesting and understanding information, so knowledge management and finding information are cool use cases.
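
As a rough illustration of that semantic retrieval pattern, here’s a sketch that embeds a few notes and ranks them against a query, again assuming the legacy OpenAI embeddings endpoint (the model name and the notes are illustrative):

```python
import numpy as np
import openai  # assumes openai.api_key is already set

def embed(text: str) -> np.ndarray:
    # One embedding vector per input text.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return np.array(resp["data"][0]["embedding"])

notes = [
    "Meeting notes: decided to migrate the feature store next quarter.",
    "Grocery list: milk, eggs, coffee.",
    "Draft blog post on monitoring LLMs in production.",
]
note_vectors = [embed(n) for n in notes]

query = "what did we decide about ML infrastructure?"
q = embed(query)

# Cosine similarity: higher means semantically closer.
scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in note_vectors]
best = max(range(len(notes)), key=lambda i: scores[i])
print(notes[best])  # the meeting note, even though it shares no keywords
```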

I give broad answers because nearly every product in every industry has some opportunity to incorporate or improve a feature using language models. There are so many things out there to do and not enough time in the day to do them.

Stephen: Cool. And are there DevTool-related use cases too – like dev tooling and stuff?

David: I think there are all sorts of things out there, but on the DevTool side, there’s Copilot, which helps you write code faster. And there are a lot of things beyond that: I’ve seen tools that help you write and author pull requests more efficiently, and tools that help automate building documentation. The whole universe of how we develop software is, to some extent, also ripe for change. So yes, along those lines exactly.

Monitoring LLMs efficiently in production

Stephen: Usually, when we talk about an ML platform or MLOps, we’re talking about tightly connecting different components: data is moved across the workflow, modeled, and then deployed.

Then there’s a link between your development environment and the production environment, where monitoring happens.

But in this case, where LLMs have almost eliminated the development side…

How have you seen folks monitor these systems efficiently in production, especially when it comes to swapping them out for other models and systems?

David: Yeah, it’s funny. I think monitoring is one of the hardest challenges for language models now: because we eliminated development, it becomes challenge number one.

With most of the machine learning we’ve done in the past, the output is structured (i.e., is this a cat or not?), and monitoring it was pretty easy. You can look at how often you’re predicting it’s a cat and evaluate how that’s changing over time.
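
For contrast, that “easy” structured-output monitoring can be as simple as tracking the positive-prediction rate from one window to the next; a toy sketch, with a made-up alert threshold:

```python
from collections import Counter

def positive_rate(predictions: list[str]) -> float:
    """Fraction of predictions that are the positive class ("cat")."""
    return Counter(predictions)["cat"] / max(len(predictions), 1)

last_week = ["cat", "not_cat", "cat", "cat", "not_cat"]
this_week = ["not_cat", "not_cat", "cat", "not_cat", "not_cat"]

drift = abs(positive_rate(this_week) - positive_rate(last_week))
if drift > 0.2:  # arbitrary alert threshold for the sketch
    print(f"Prediction distribution shifted by {drift:.0%} - investigate.")
```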

With language models, the output is a sentence – not a number. Measuring how good a sentence is, is hard. It’s no longer a question like “is this number above 0.95?”; you have to think about things such as:

  • 1 Is this sentence authoritative and nice?
  • 2 Are we friendly, are we not toxic, are we not biased?

And all these questions are way harder to evaluate, track, and measure. So what are people doing? I think the first response for a lot of folks is to go to something like product analytics.

It’s closer to tools like Amplitude than to the classic ML tools: you generate something, and you see if people like it or not. Do they click? Do they click off the page? Do they stay there? Do they accept this generation? Things like that. But man, that’s a really coarse metric.

That doesn’t give you nearly the detail of understanding the internals of a model. But it’s what people are doing.
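
In practice, that product-analytics approach often amounts to emitting an event for every generation and every piece of user feedback; a hypothetical sketch, where `track` stands in for whatever analytics SDK (Amplitude, Segment, etc.) you actually use:

```python
import time
import uuid

def track(event: str, properties: dict) -> None:
    # Placeholder for your analytics SDK call.
    print(f"track {event}: {properties}")

def log_generation(user_id: str, prompt: str, completion: str) -> str:
    generation_id = str(uuid.uuid4())
    track("generation_shown", {
        "user_id": user_id,
        "generation_id": generation_id,
        "prompt_chars": len(prompt),
        "completion_chars": len(completion),
        "ts": time.time(),
    })
    return generation_id

def log_feedback(user_id: str, generation_id: str, accepted: bool) -> None:
    # The coarse signal David describes: did the user keep the output?
    track("generation_feedback", {
        "user_id": user_id,
        "generation_id": generation_id,
        "accepted": accepted,
    })
```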

There aren’t many great answers to that question yet. How do you monitor these things? How do you keep track of how good my model is doing besides looking at how users interact with it? It’s an open challenge for a lot of people. 

We know a lot of ML monitoring tools out there… I’m hopeful some of our favorites will iterate into being able to more directly help with these questions. But I also think there’s an opportunity for new tools to emerge that help us say how good a sentence is and help you measure that before and after you ship a model; this will make you feel more confident over time.

Right now, the most common way I’ve heard people say they ship new versions of models is: they have five or six prompts that they test on, they check with their eyes whether the output looks good, and they ship it.
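
Formalized only slightly, that eyeball workflow is a loop like this sketch: run a fixed suite of test prompts through the candidate prompt template and print the outputs for manual review (the test cases and stubbed model call are placeholders):

```python
# A fixed suite of representative inputs the team eyeballs before shipping.
TEST_CASES = [
    "The meeting ran long, but we shipped the release on time.",
    "Customer reported a crash when uploading very large files.",
    "Support backlog doubled this week; hiring two more agents.",
]

def generate(prompt: str) -> str:
    # Stub for the real model call (OpenAI, self-hosted, etc.).
    return f"[model output for: {prompt[:40]}...]"

def eyeball_check(template: str) -> None:
    """Run every test case through the candidate prompt and print the results."""
    for case in TEST_CASES:
        print(f"--- input: {case}")
        print(generate(template.format(input=case)), end="\n\n")

# A human looks at the printouts and decides: ship or keep tinkering.
eyeball_check("Summarize in one sentence:\n{input}")
```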

Stephen: That’s wild – ironic and amazing at the same time.

David: I don’t think that will last forever – this world where people just happily look at five examples with their eyes and hit the ship-to-production button.

That’s bold, but there’s so much hype right now that people will ship anything, I guess. It won’t take long for that to change.

Closing the active learning loop

Stephen: Yeah, absolutely. And just one step further on that: even before the large language model frenzy, when it was just the basic transformers, most companies dealing with these sorts of systems would usually find a way to close the active learning loop.

How can you close that active learning loop, where you’re continuously refining that system or model with your own dataset as the data comes in?

David: I think this is still an active challenge for a lot of folks ā€“ not everybody’s figured it out. 

OpenAI has a fine-tuning API, for example. Others do too: you can collect data, and they’ll make a fine-tuned endpoint for you. I’ve talked to a lot of folks that go down that route eventually, either to improve their model or, more commonly actually, to improve latency. GPT-3 is really large and expensive, so if you can fine-tune a cheaper model to be similarly good but much faster and cheaper, that’s a win. I’ve seen people go down that route.
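
At the time of this conversation, that flow on OpenAI looked roughly like the sketch below: upload a JSONL file of prompt/completion pairs and kick off a fine-tune of a smaller base model. The file name and base model are illustrative, and the API shown is the legacy (pre-1.0) Python client:

```python
import openai  # legacy (pre-1.0) openai-python API; openai.api_key set elsewhere

# data.jsonl: one {"prompt": "...", "completion": "..."} object per line,
# typically collected from user feedback or your best large-model outputs.
upload = openai.File.create(file=open("data.jsonl", "rb"), purpose="fine-tune")

job = openai.FineTune.create(
    training_file=upload["id"],
    model="curie",  # a smaller, cheaper base model than the largest GPT-3
)
print(job["id"])  # poll this job; when it finishes you get a private endpoint
```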

We’re in the early days of using these language models, and I have a feeling over time that the active learning component is still going to be just as, if not more important to refine models. 

You hear a lot of people talking about per-user fine-tuning, right? Can you have a model per user that knows my style, what I want, or whatever it may be? It’s a good idea for anybody using these right now to be thinking about that active learning loop today, even if it’s hard to execute on right now – you can’t download the weights of GPT-3 and fine-tune it yourself.

Even if you could, there are all sorts of challenges in fine-tuning a 175-billion-parameter model, but I expect that the data you collect now, to be able to continuously improve, is going to be really important in the long run.


Is GPT-3 an opportunity or risk for MLOps practitioners? 

Stephen: Yeah, that’s quite interesting to see how the field sort of evolves in that sense. So at this point, we’ll jump right into some of the community questions. 

So the first question from the community: is GPT-3 an opportunity or risk for MLOps Practitioners? 

David: I think opportunities and risks are two sides of the same coin in some ways, I guess is what I would say. I’ll cop out and say both.

I’ll start with the risk: I think a lot of the workloads that we used to rely on training models for, where you had to do the whole MLOps cycle, you won’t have to anymore. As we talked about, language models can’t do everything right now, but they can do a lot, and there’s no reason to believe they won’t be able to do more over time.

And if we have these general-purpose models that can solve lots of things, then why do we need MLOps? If we’re not training models, then a lot of MLOps goes away. So there’s a risk that, if you aren’t paying attention, the amount of work out there to be done is going to go down.

Now, the good news is there aren’t enough MLOps practitioners today to begin with. Not even close, right? So I don’t think we’re going to shrink to the point where the number of MLOps practitioners today is too many for how much MLOps we need to do in the world. So I wouldn’t worry too much about it, I guess, is what I would say.

But the other side of it is there’s a whole bunch of new stuff to learn: what are the challenges of building language model applications? There are a lot of them, and there are a lot of new tools. Looking ahead to a couple of the community questions, I think we’ll get into it. But there’s a real opportunity to be a person that understands that and maybe even to push it a little bit further.

If you’re an MLOps person but not a data scientist – an engineer that helps people build and push models to production – you can use a language model, and maybe you don’t need the data scientist anymore. Maybe the data scientist should be worried. Maybe you, the MLOps person, can build the whole thing. You’re suddenly a full-stack engineer, in a sense, where you get to build ML applications on top of language models – you build the infrastructure and the software around them.

I think that’s a real opportunity to be a full-stack practitioner of building language model-powered applications. You’re well positioned, you understand how ML systems work and you can do it. So I think that’s an opportunity.

What should MLOps practitioners learn in the age of LLMs? 

Stephen: That’s a really good point; we have a question in chat…

In this age of Large Language Models, what should MLOps practitioners actually learn, and what should they prioritize when trying to gain skills as a beginner?

David: Yeah, good question…

I don’t want to be too radical. There are a lot of machine learning use cases that aren’t going to be impacted drastically by language models. We still do fraud detection and things like that – cases where someone’s going to train a model on their own proprietary data.

If you’re passionate about MLOps and the development, training, and full lifecycle of machine learning, learn the same MLOps curriculum you would have learned before: software engineering best practices and an understanding of how ML systems get built and productionized.

Maybe I’d complement that with something simple: just go to the GPT-3 playground by OpenAI and play around with a model. Try to build a couple of use cases. There are lots of demos out there. Build something. It’s easy.

Personally, I’m a VC… I’m barely technical anymore, and I’ve built four or five of my own apps to play with and use in my spare time – it’s ridiculous how easy it is. You wouldn’t believe it.

Just build something with language models, it’s easy, and you’ll learn a lot. You’ll be amazed probably at how simple it is.

I have something that takes transcripts of my calls and writes call summaries for me. I have something that takes a paper, like a research paper, and lets me ask questions against it. Those are simple applications. But you’ll learn something.
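
A transcript summarizer like the one David describes really can be this small; a hedged sketch, once more assuming the legacy OpenAI completions API:

```python
import openai  # assumes openai.api_key is set

def summarize_call(transcript: str) -> str:
    prompt = (
        "Summarize this call transcript in three bullet points, "
        "including any action items:\n\n" + transcript + "\n\nSummary:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",  # illustrative model choice
        prompt=prompt,
        max_tokens=200,
        temperature=0.3,
    )
    return resp["choices"][0]["text"].strip()

print(summarize_call("Stephen: ...\nDavid: ..."))  # paste a real transcript here
```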

I think it’s a good idea to be somewhat familiar with what it feels like to build and iterate with these things right now and it’s fun too. So I highly recommend anybody in the MLOps field try it out. I know it’s your free time, but it should be fun.

What are the best options to host an LLM at a reasonable scale? 

Stephen: Awesome. So focus on shipping stuff. Thanks for the suggestion. 

Let’s jump right into the next question from the community: what are the best options to host large language models at a reasonable scale?

David: This is a tough one…

One of the hardest things about language models is their size. GPT-3 has 175 billion parameters, and a model only starts fitting on the biggest GPUs we have today somewhere around the 30 billion parameter range.

The biggest GPU on the market today in terms of memory is the A100 with 80 GB of memory. GPT-3 does not fit on that.

You can’t run inference on GPT-3 with a single GPU. So what does that mean? It gets horribly complicated to do inference with a model that doesn’t fit on a single GPU – you have to do model parallelism, and it’s a nightmare.

My short advice is don’t try unless you have to – there are better options.

The good news is a lot of people are working on taking these models and turning them into form factors that fit on a single GPU. For example [we’re recording on February 28th], I think it was just yesterday or last Friday that the LLaMA paper from Facebook came out; they trained a language model that does fit on one GPU and has similar capabilities to GPT-3.

There are others like it, ranging from 5 billion parameter models up to around 30…

The most promising approach we have is to find a model that does fit on a single GPU and then use the tools we’ve used for all historical model deployment to host it. You can pick your favorite – there are lots out there; the folks at BentoML have a great serving product.

You do still need to make sure you get a really big, beefy GPU to put it on. But I think it’s not much different at that point, as long as you pick something that fits on one machine at least.
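
Once you’ve picked a model that fits on one GPU, serving looks much like any other model deployment. A rough sketch with Hugging Face transformers, where the model name is a placeholder for whichever open model you choose:

```python
import torch
from transformers import pipeline  # pip install transformers accelerate

# Any open model that fits in your GPU's memory; the name is illustrative.
generator = pipeline(
    "text-generation",
    model="EleutherAI/gpt-neo-2.7B",
    device_map="auto",          # place weights on the available GPU(s)
    torch_dtype=torch.float16,  # halve memory use at some precision cost
)

out = generator("The hardest part of MLOps is", max_new_tokens=40)
print(out[0]["generated_text"])
```

From there, wrapping the pipeline in your favorite serving tool (BentoML, for example) is the same exercise as for any historical model.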

Are LLMs for MLOps going mainstream? 

Stephen: Oh yeah, thanks for sharing that…

The next question is whether LLMs for MLOps are going mainstream; what are the new challenges that they can address better than conventional MLOps for NLP use cases? 

David: Man, I feel like this is a landmine – I’m going to make people angry no matter what I say here. It’s a good question, though. There’s an easy version of this, which we’ve talked about: for a lot of ML applications built on top of language models, you don’t need to train a model anymore, you don’t need to host your own model anymore – all of that goes away. So it’s easy in a sense.

There’s just a whole bunch of stuff you don’t need to build with language models. The new questions you should be asking yourself are:

  • 1 What do I need?
  • 2 What are the new questions I need to answer?
  • 3 What are the new workflows, if it’s not training, hosting, serving, and testing?

Prompting is a new workflow for language models… Building a good prompt is like a really simple version of building a good model. It’s still experimental.

You try a prompt, and it works or it doesn’t work. You tinker with it until it does – it’s almost like tuning hyperparameters in a way.

You’re tinkering and trying stuff until you come up with a prompt that you like, and then you push it. So some folks are focused on prompt experimentation, and I think that’s a valid way to think about it – the way you think about Weights & Biases as experiment tracking for models.

How do you have a similar tool for experimentation on prompts?

Keep track of versions of prompts and what worked and all that. I think that’s a tooling category of its own. And whether or not you think prompt engineering is a lesser form of machine learning, it’s certainly something that warrants its own set of tools; it’s completely new and certainly different from all of the MLOps we’ve done before. I think there’s a lot of opportunity to think about that workflow and to improve it.
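
Even a homegrown version of that tooling makes the point; a hypothetical sketch that logs every prompt variant you try, along with what you observed, so you can roll back to what worked:

```python
import json
import time

LOG_PATH = "prompt_experiments.jsonl"

def log_prompt_trial(prompt_id: str, template: str, notes: str, score: float) -> None:
    """Append one prompt experiment to a local JSONL log."""
    record = {
        "ts": time.time(),
        "prompt_id": prompt_id,  # e.g. "summarizer-v7"
        "template": template,
        "notes": notes,          # what you observed when eyeballing outputs
        "score": score,          # however coarse: 0-1 from manual review
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

log_prompt_trial(
    "summarizer-v7",
    "Summarize in 3 bullets:\n{input}",
    notes="stopped inventing dates after adding 'only use the text above'",
    score=0.8,
)
```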

We touched on evaluation and monitoring and some of the new challenges that are unique to evaluating the quality of the output of a language model compared to other models.

There are similarities between that and monitoring historical ML models, but there are things that are just uniquely different. The questions we’re asking are different. As I said, a lot of it is like product analytics: do you like this or not? And everything you capture might be used to fine-tune the model in a slightly different way than before.

You can say we know about monitoring and MLOps, but I think there are at least new questions we need to answer about how to monitor language models. 

For example, what’s similar? It’s experimental and probabilistic.

Why do we have MLOps as opposed to DevOps? That’s the question you could ask first, I guess. It’s because ML has this weird set of probabilities and distributions that act differently from traditional software – and that’s still the same.

In some sense, there’s a big overlap for similarity because a lot of what we’re doing is figuring out how to work with probabilistic software. The difference is we don’t need to train models anymore; we write prompts. 

The challenges of hosting and interacting are different… Does it warrant a new acronym? Maybe. The fact that saying “LLMOps” is such a pain doesn’t mean we shouldn’t be trying to do it in the first place.

Regardless of the acronyms, there are certainly some new challenges that we need to address and some old challenges that we don’t need to address as much.

Stephen: I just wanted to touch on the experimentation part – I know developers are already taking notes… A lot of prompt engineering is happening; it’s now actively becoming a role. There are even advanced prompt engineers, which is incredible in itself.

David: It’s easier to become a prompt engineer than it is to become an ML person. Maybe. I’m just saying that because I have a degree in machine learning, and I don’t have a degree in prompting. But it’s certainly a skill set, and managing and working with prompts is a good skill to have – clearly a valuable one. So why not?

Does GPT-3 require any form of orchestration? 

Stephen: Absolutely. All right, let’s check the other question: 

Does GPT-3 need to involve any form of orchestration or maybe pipelining? From their understanding, MLOps feels like an orchestration type of process more than anything else.

David: Yeah, I think there are two ways to think about that. 

There are use cases of language models that you could imagine happening in batch. For example, take all of the reviews of my app, pull out relevant user feedback, and report them to me or something like that. 

There are still all of the same orchestration challenges: grabbing all the new data – all the new reviews from the App Store – passing them through a language model in parallel or in sequence, collecting that information, and then sticking it out wherever it needs to go. Nothing has changed there. If you had your model hosted at an internal endpoint before, now you have it hosted at the OpenAI endpoint externally. Who cares? Same thing, no changes, and the challenges are about the same.

At inference time, you’ll hear a lot of people talking about things like chaining in language models. The core insight there is that a lot of the use cases we have actually involve going back and forth with a model a lot. I write a prompt, the language model says something back, and based on what it says, I send another prompt to clarify or to move in some other direction. That’s an orchestration problem.
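
In plain Python, that back-and-forth is just sequenced calls in which each step’s prompt embeds the previous step’s output; a minimal sketch with a stubbed model call:

```python
def llm(prompt: str) -> str:
    # Stub for a hosted model call (OpenAI, self-hosted, etc.).
    return f"[model answer to: {prompt[:50]}...]"

def answer_with_clarification(question: str) -> str:
    # Step 1: get a first draft answer.
    draft = llm(f"Answer concisely: {question}")
    # Step 2: feed the draft back in and ask the model to refine it.
    # Each hop is an orchestration step: call, collect, re-prompt.
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Rewrite the draft to be clearer and state any assumptions."
    )

print(answer_with_clarification("Why do LLM apps need orchestration?"))
```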

Fundamentally, getting data back and forth from a model a few times is an orchestration problem. So yeah, there are certainly orchestration challenges with language models. Some of them look just like before; some of them are net new. I think the tools we have to orchestrate are the same tools we should keep using. If you’re using Airflow, that’s a reasonable thing to do; if you’re using Kubeflow Pipelines, that’s a reasonable thing to do. If you’re doing those live, interactive things, maybe we want slightly new tools, like what people are using LangChain for now.

It also looks similar to a lot of general orchestration tools, like Temporal or other things that help with orchestration and workflows. So I think that’s a good insight. There’s a lot of similar work – gluing all these systems together so they work when they’re supposed to – that still needs to be done. And it’s software engineering, in a way: building something that reliably does the set of things you need it to do, every time. Whether you call that MLOps or DevOps or whatever it is, it’s building reliable computational flows.

That’s good software engineering.

What MLOps principles are required to get the most from LLMs? 

Stephen: I know MLOps has its own principles – you talk about reproducibility, which might be a hard problem to solve, and about collaboration. Are there MLOps principles that need to be followed so that teams can properly utilize the potential of these Large Language Models?

David: Good question. I think it’s too early to actually know, but there are some similar questions…

A lot of what we’ve learned from MLOps and DevOps boils down to principles of how to do this well. At the end of the day, both MLOps and DevOps are software engineering to some extent: can we build stuff that’s maintainable, reliable, reproducible, and scalable?

For a lot of these questions, we want to build products, essentially. Maybe specifically for language model ops, you probably want to version your prompts. It’s a similar thing: you want to keep track of versions as they change, and you want to be able to roll back. And if you have the same version of the prompt and temperature zero on the model, it’s reproducible – it’s the same thing.

Again, the scope of challenges is innately smaller, so I don’t think there’s a lot of new stuff we necessarily need to learn. But I need to think more about it, I guess, because I’m sure there will be a playbook of all the things we need to follow for language models moving forward. Nobody’s written it yet, so maybe one of us should go do that.

Regulations around generative AI applications

Stephen: Yeah, an opportunity. Thank you for sharing that, David. 

The next question from the community: are there regulatory and compliance requirements that small DevTool teams should be aware of when embedding generative AI models into services for users?

David: Yeah, good questionā€¦ 

There’s a range of things that are probably worth considering. I’ll caveat that I’m not a lawyer, so please don’t take my advice and run with it, because I don’t know everything.

A few vectors of challenges, though:

  1. OpenAI and external services: a lot of the folks that host language models right now are external services, and we’re sending them data. The story going around is that because Amazon engineers have been sending their code to ChatGPT, and the model gets fine-tuned on that kind of usage, you can now sort of back proprietary Amazon source code out of it.

That’s a good reminder that you’re sending your data to someone else when you use an external service. Depending on the legal or company implications, that might mean you shouldn’t do it, and you may want to consider hosting on-site – with all the challenges that come with that.

  2. The European Union: the EU AI Act should pass this year, and it has pretty strict things to say about introducing bias to models, measuring bias, and things like that. When you don’t own the model, it’s worth being aware that these models certainly have a long history of producing biased or toxic content, and there could be compliance ramifications for not testing for it and being aware of it.

And that’s probably a new set of challenges we’re going to have to face: how can you make sure that when you’re generating content, you’re not generating toxic or biased content, or taking biased actions because of what’s being generated? We’re used to a world where we own the data used to train these models, so we can iterate and try to scrub them of biased things. If that’s not true, there are certainly new questions you need to ask about whether it’s even possible to use these systems in a way that’s compliant with the evolving landscape of legislation.

In general, AI legislation is still pretty new. A lot of people are going to have to figure out a lot of things, especially when the EU AI Act passes.

Testing LLMs

Stephen: You mentioned something really interesting about the model testing part… Has anybody figured that out for LLMs?

David: Lots of people are trying; I know people are trying interesting things. There are metrics people have built in academia to measure toxicity. There are methods and measures out there to evaluate the output of text, and there have been similar tests for gender bias and things like that. So there are methods out there.

There are folks using models to test models. For example, you can use a language model to look at the output of another language model and ask, “Is this hateful or discriminatory?” or something like that – and they are pretty good at that.
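
That model-judging-model pattern can be as simple as asking a second model a yes/no question about the first model’s output; a sketch with an illustrative prompt and model, assuming the legacy OpenAI client (this is a screen, not a guarantee):

```python
import openai  # assumes openai.api_key is set

def is_toxic(text: str) -> bool:
    prompt = (
        "Does the following text contain hateful, discriminatory, or toxic "
        f"content? Answer yes or no.\n\nText: {text}\n\nAnswer:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",  # illustrative judge model
        prompt=prompt,
        max_tokens=1,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip().lower().startswith("yes")

generation = "..."  # output from your primary language model
if is_toxic(generation):
    generation = "Sorry, I can't help with that."  # block or regenerate
```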

I guess the short version is we’re really early and I don’t think there’s a single tool I can point someone to, to say, like, here’s the way to do all of your evaluation and testing. But there are building blocks in the raw form out there right now to try to work on some of this at least. But it’s hard right now. 

“I think it’s one of the biggest active challenges for people to figure out right now.”

Generative AI on limited resources

Stephen: When you talk about a model evaluating another model, my mind goes straight to teams using monitoring on some of the latest platforms, which have models actively doing the evaluation themselves. It’s probably a really good business area for these tools to look into.

I’m just going to jump right into the next question and I think it’s all about the optimization part of things… 

There’s a reason we call them LLMs, and you spoke of a couple of tools – the most recent one being LLaMA from Facebook.

How are we going to see more generative AI models optimized for resource-constrained environments over time – cases where there are limited resources, but you want to host the model on your own platform?

David: Yeah, I think this is really important, actually. It’s probably one of the more important trends we’re going to see. People are working on it, and it’s still early, but there are a lot of reasons to care about this:

  1. Cost – It’s very expensive to operate thousands of GPUs to do this.
  2. Latency – If you’re building a product that interacts with a user, every millisecond of latency in loading a page impacts their experience.
  3. Environments that can’t have a GPU – you can’t carry a GPU cluster around in your phone, or wherever you are, to do everything.

I think there’s a lot of development happening in image generation; there’s been an incredible amount of progress in a few short months on improving performance. My MacBook can generate images pretty quickly now.

Now, language models are bigger and more challenging still – I think there’s a lot more work to be done. But there are a lot of promising techniques that I’ve seen folks use, like using a very large model to generate data to tune a smaller model to accomplish a task.

For example, if the biggest model from OpenAI is good at some task but the smallest one isn’t, you can have the biggest one do that task 10,000 times and fine-tune the smallest one – or a smaller one – to get better at that task.
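
A hedged sketch of that distillation loop: have the big “teacher” model label lots of examples, write them out as prompt/completion pairs, and fine-tune a small “student” on the file (model names illustrative, legacy OpenAI client assumed):

```python
import json

import openai  # legacy (pre-1.0) API; openai.api_key set elsewhere

def teacher(prompt: str) -> str:
    """Ask the big, expensive model for its answer."""
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=20, temperature=0
    )
    return resp["choices"][0]["text"].strip()

task = "Extract the product name from this review:\n{review}\n\nProduct:"
reviews = [
    "Love my new AeroPress, the coffee is great.",
    "The battery on this laptop dies within an hour.",
]  # thousands of examples in practice

# The teacher's answers become training labels for the cheap student model.
with open("distill.jsonl", "w") as f:
    for review in reviews:
        prompt = task.format(review=review)
        pair = {"prompt": prompt, "completion": " " + teacher(prompt)}
        f.write(json.dumps(pair) + "\n")

# Then fine-tune a small, cheap model on distill.jsonl, e.g.:
# openai.FineTune.create(training_file=<uploaded file id>, model="ada")
```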

The components are there, but this is another place where I don’t think we have all of the tooling we need yet to solve this problem. It’s also one of the places that I’m the most excited about; how can we make it easier and easier for folks to take the capabilities of these really big impressive models and tune them down into a form factor that makes sense for their cost or latency constraints or environmental constraints?

What industries will benefit from LLMs and how can they integrate it? 

Stephen: Yeah, and it does seem like the way we think about active learning and other techniques is in fact changing over time. Because if you can have a large language model fine-tune a smaller one, or train a smaller one, that’s an incredible chain of events going on there.

Thank you for sharing that, David. 

I’m going to jump right into the next community question: what kind of industries do you think would benefit the most from GPT-3’s language generation capabilities and how can they integrate it?

David: Maybe let’s start with the obvious, and then we’ll get into the less obvious, because I think that’s easy.

Any content generation should be complemented by language models now.

That’s obvious. 

For example, copywriting and marketing are fundamentally different industries now than they used to be – and it’s obvious why: it’s way cheaper to produce quality content than it’s ever been. You can build customized, quality content in no time, at almost infinite scale.

It’s hard to believe that nearly every aspect of that industry shouldn’t be somewhat changed and somewhat quickly be adopting language models. And we’ve seen that largely to date. 

There are people that will generate your product descriptions, your product photos, your marketing content, your copy, and all of that. And it’s no mistake that this is the biggest and most obvious breakout, because it’s a big, obvious fit.

Moving downstream, I think my answer gets a little bit worse. Everybody should probably take a look at how they can use a language model, but the use cases are probably less obvious. Not everybody needs a chatbot; not everybody needs autocomplete for text or something like that.

But whether it means your software engineers are more efficient because they’re using Copilot, or your internal documentation has better search, or your product’s documentation has better search capabilities because you can index it with language models – that’s probably true for most people in some form. And once you get more complicated, as I said, there are opportunities to automate actions and other workflows, and you start to get into a whole range of applications across nearly everything.

I guess there’s the stuff that’s obviously completely transformed by language models – anywhere content is being generated should be completely transformed in some sense. Then there’s a long tail of potential augmentative changes that apply across nearly every industry.

Tools to help with the deployment of LLMs

Stephen: Right, thanks for sharing that. And just two final questions before we sort of wrap up the session. 

Are there tools that are really changing the landscape right now that folks should be aware of, especially ones making the deployment of these models easier?

David: Well, since we were just complaining about “LLMOps”, I’ll call out a few of the folks that are working in that space and doing cool stuff. The biggest breakout tool for helping people with prompting and orchestrating prompts and things like that is LangChain – it’s gotten really popular.

They have a Python library and a JavaScript library, and they’re iterating at an incredible rate. That community is really amazing and vibrant. So check that out if you’re trying to get started and tinker; I think it’s the best place to get started.
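
A quickstart with the LangChain Python API of that era looked something like this sketch (the library moves fast, so treat the imports and classes as a snapshot of early 2023):

```python
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# A reusable prompt with a single templated variable.
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Write a two-sentence summary of recent developments in {topic}.",
)

# Chain = model + prompt; temperature 0 keeps outputs reproducible.
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
print(chain.run("MLOps tooling for language models"))
```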

Other tools like Dust and GPT Index are there in a similar space to help you write and then build, like, prototypes of actually interacting with language models. 

There’s some other stuff around. We talked a lot about evaluation and monitoring; there’s a company called Humanloop and a company called HoneyHive that are both in that space, as well as four or five companies in the current YC batch – maybe they’ll get mad at me for not calling them out individually, but they’re all building really cool stuff there.

A lot of new stuff is coming out around evaluation, managing prompts, managing costs, and everything. So I’d say take a look at those tools and familiarize yourself with the new things that need solving.

The future of MLOps with GPT, GPT-3, and GPT-4

Stephen: Awesome. Thanks, David. We’ll definitely leave those in the show notes for the podcast episode that will be released later.

Any final words, David, on the future of MLOps, with GPT-3 here and GPT-4 on the horizon?

David: I’ve been working on MLOps for years and years now, and this is the most excited I’ve ever been, because this is our opportunity to go from a relatively niche field to impacting everybody and every product. Things are going to change, and there are a lot of differences.

But for the first time, it feels real. I’ve been hoping that MLOps would make it so that everybody in the world could use ML to change their products, and this is the closest I feel we’ve ever been: by lowering the barrier to entry, everybody can do it. So I think we have a huge opportunity to bring ML to the masses now, and I hope that as a community, we can all make that happen.

Wrap up

Stephen: Great. I hope so as well because I’m also excited about the landscape in and of itself. So thank you so much. David, where can people find you and connect with you online?

David: Yeah, both LinkedIn and Twitter are great.

@DavidSHershey on Twitter, and David Hershey on LinkedIn. So please reach out, shoot me a message anytime. Happy to chat about language models, MLOps, whatever.

Stephen: Awesome. Here at MLOps Live, we’ll be back again in two weeks, and we’re going to be talking with Leanne about how you can navigate organizational barriers while doing MLOps. Lots of MLOps stuff on the horizon, so don’t miss out on that one. Thank you so much, David, for joining the session – we appreciate your time and your work as well. It was really great to have you.

David: Thanks for having me. It was really fun.

Stephen: Awesome. Bye and take care.
