Problems in using LLMs in commercial products

I recently participated in a panel discussion where I was asked to speak (to an academic audience) about commercial usage of LLMs. I think some of the points I made were novel to this audience (they may be familiar to a commercial audience), so I thought I would write a blog post on the topic.

Are there successful products which use LLMs?

The first point is that, as of the time of writing (August 2023), it isn't clear whether there are high-volume commercial products using LLMs which are both profitable and useful (I'm looking at products, not stock market valuations or success in attracting VCs). I recently read an interesting post from Gary Marcus which relates to this topic (see also this post); Marcus essentially speculates that there are solid use cases in helping developers and generating marketing copy, but that other high-volume use cases, including search assistants, are more questionable.

I won't comment on the commercial market for LLMs and generative AI (Marcus and others are much better qualified to do so), except to say that as far as I know it's unclear whether the highest-profile LLM apps (ChatGPT itself, and the Google and Microsoft search bots) are profitable. Only the companies concerned know, and they aren't sharing this information.

Anyway, I want to focus on the engineering and technical challenges in using LLMs in successful commercial products, since I feel more qualified to talk about these. I discuss some of them below.

Engineering: Cost, response time, stability, etc.

As a software engineer, I find it's not easy to work with LLMs such as GPT-4. Reasons include:

  • Cost: Producing a text with GPT-4 can easily cost US$0.50 or more (see the cost-and-latency sketch after this list).
  • Response time: Large LLMs can sometimes take 10 seconds or more to produce a text, which is too slow for many interactive applications, including dialogue systems. ChatGPT is presented as a dialogue system, but it's too slow to be used in many dialogue contexts.
  • Instability: Commercial LLMs are constantly being updated. Worse, they can be “discontinued”. This makes testing and quality assurance very difficult.
  • Uncertain legal status: Ongoing IP-related lawsuits may make current LLMs illegal to use.
  • Hallucinations and inappropriate texts: Last but not least, hallucinations and other text-quality problems may lead to LLMs producing unacceptable texts in an unpredictable fashion.
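
To make the first two bullets concrete, here is a minimal sketch of timing an LLM call and estimating its cost from token usage. It assumes the 2023-era openai Python package (pre-1.0 interface), and the per-token prices are illustrative, not quoted from the vendor's price list.

```python
# Minimal sketch: time a GPT-4 call and estimate its cost from token usage.
# Assumes the 2023-era openai package (pre-1.0 interface) and an API key in
# the OPENAI_API_KEY environment variable. Prices below are illustrative.
import time

import openai

PROMPT_PRICE_PER_1K = 0.03      # assumed USD rate per 1K prompt tokens
COMPLETION_PRICE_PER_1K = 0.06  # assumed USD rate per 1K completion tokens

def timed_completion(messages):
    start = time.monotonic()
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    latency = time.monotonic() - start
    usage = response["usage"]
    cost = (usage["prompt_tokens"] / 1000 * PROMPT_PRICE_PER_1K
            + usage["completion_tokens"] / 1000 * COMPLETION_PRICE_PER_1K)
    return response["choices"][0]["message"]["content"], latency, cost

text, latency, cost = timed_completion(
    [{"role": "user", "content": "Draft a 300-word product description."}])
print(f"latency: {latency:.1f}s  estimated cost: ${cost:.3f}")
```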

For example, suppose we use GPT-4 to build a dialogue system for ordering pizzas, which is a “standard” dialogue task. We’d end up with a system which (1) was more expensive (in cost per order) than a call centre, (2) provided a much worse user experience (seemingly endless waits for a response), and (3) might be switched off if IP-related lawsuits against LLMs succeeded.
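
On the instability point in particular, one partial mitigation is to pin a dated model snapshot and re-run a small regression suite whenever the vendor updates or deprecates a model. Below is a minimal sketch, again assuming the 2023-era openai package; the snapshot name and the test cases are illustrative.

```python
# Minimal regression check: run fixed prompts against a pinned model snapshot
# and assert simple application-level invariants on the outputs.
import openai

# Illustrative cases: (prompt, substring the output must contain).
REGRESSION_CASES = [
    ("List the three pizza sizes we offer: small, medium, large.", "medium"),
    ("Reply with the single word OK.", "OK"),
]

def run_regression(model="gpt-4-0613"):  # a dated snapshot, not bare "gpt-4"
    failures = []
    for prompt, expected in REGRESSION_CASES:
        response = openai.ChatCompletion.create(
            model=model,
            temperature=0,  # reduces (but does not eliminate) variation
            messages=[{"role": "user", "content": prompt}],
        )
        output = response["choices"][0]["message"]["content"]
        if expected not in output:
            failures.append((prompt, output))
    return failures

for prompt, output in run_regression():
    print(f"REGRESSION FAILURE on {prompt!r}: got {output!r}")
```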

Human-in-loop workflow

I’ve seen a lot of talk about using LLMs in “human-in-loop” workflows, where a person checks and edits LLM outputs before they are released. This is not straightforward to implement in a real-world context (blog).

For example, I recently saw some real-world data about a deployed app used to generate documents which were then checked and edited by a domain expert. When the system was deployed in real-world usage (with all of the messiness and chaos this entails), the actual productivity gain in domain-expert time (i.e., time saved by checking and editing a document instead of writing it from scratch) was only 10%. Since the domain experts were doing other things as well, introducing the system only increased their total productivity (across all tasks) by 1-2%. The company decided not to proceed with the project. They didn't tell me why, but I suspect it was partially because the small overall productivity gain was not worth the cost of the AI system or the potential disruption caused by major changes in workflow (how experts produce documents).
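
To make the arithmetic explicit (the workload fraction below is my back-calculation from these figures, not something the company told me):

```python
# Back-of-envelope: how a 10% saving on one task dilutes to 1-2% overall.
time_saved_on_documents = 0.10  # check+edit vs. write-from-scratch (reported)

# Assumption: documents are roughly 10-20% of the experts' total workload.
for doc_fraction in (0.10, 0.20):
    overall_gain = time_saved_on_documents * doc_fraction
    print(f"documents = {doc_fraction:.0%} of workload -> "
          f"overall gain = {overall_gain:.0%}")
```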

I’ve seen similar scenarios elsewhere. New technology which requires a major change to workflow (e.g., check-and-edit instead of write) is only going to be used if it offers large advantages over the current setup.

Of course there are contexts where human-in-loop workflows work! But people building commercial LLM systems should not assume that human post-editing can always be used to fix problems in LLM texts; post-editing has its own challenges, and it doesn't make sense in many contexts.

Real-world use-case-specific challenges

The final point is that I see a lot of use-case-specific challenges in real-world applications. For example:

  • Health: As I pointed out in a previous blog, LLMs can generate health messages which are factually true but inappropriate because of the stress or emotional impact they have on the user. We have done some recent work (not yet published) which shows that 15% of LLM outputs in a specific health context are regarded as unsafe by domain experts, usually because of stress/emotional impact (a toy screening sketch follows this list).
  • Coding: As others have pointed out, there are major security risks in incorporating code from random (and unknown) GitHub projects into a commercial project. The source projects may not have been designed to be hacker-proof; even worse, they could be deliberately created by hackers in order to “train” models to produce hackable code full of security bugs.
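
For the health case, one obvious (but only partial) mitigation is to screen generated messages and route anything potentially distressing to a human reviewer instead of sending it automatically. Here is a minimal sketch; the trigger list is purely hypothetical, and a deployed system would need clinically validated screening rather than keyword matching.

```python
# Toy screening gate for generated health messages: anything matching a
# distress-relevant trigger is held for expert review rather than sent.
# The trigger list is purely illustrative, not clinically validated.
DISTRESS_TRIGGERS = ["cancer", "terminal", "life-threatening", "fatal"]

def route_message(message: str) -> str:
    lowered = message.lower()
    if any(trigger in lowered for trigger in DISTRESS_TRIGGERS):
        return "EXPERT_REVIEW"  # hold for a domain expert
    return "SEND"               # lower-risk: deliver automatically

print(route_message("Your scan showed no abnormalities."))            # SEND
print(route_message("These symptoms can indicate a fatal illness."))  # EXPERT_REVIEW
```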

In short, despite all of the talk about “AGI” and general “foundation” models, I suspect that most real-world use cases will require solving use-case-specific challenges; we can't just drop in our favourite general-purpose LLM and expect the system to work acceptably.

Final Thoughts

LLMs are certainly a very exciting technology, but they can be difficult to use in real-world commercial products, for reasons which I rarely see discussed in the academic or research literature. Of course the technology is very new, and the above issues may be solved over time, as has happened with other new technologies which faced pragmatic engineering challenges.
