building NLG systems

LLMs and Data-to-text

In January I wrote a blog on Can ChatGPT do Data-to-Text?, which was based on some simple experiments I ran on ChatGPT. Since then I’ve had lots of discussions with people, in both academic and commercial contexts, about whether it makes sense to use LLMs to generate texts that summarise data. Below I give a high-level summary of my current views on this topic.

Simple data-to-text: simple inputs, short outputs

Let’s first look at simple data-to-text tasks, where the system’s input is small and clearly defines the semantics of what will be in the output text, and the output is one or two sentences. Most “leaderboard” data-to-text tasks, such as E2E and WebNLG, fit into this category.
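To make the category concrete, here is a minimal sketch of what such an input and prompt might look like, loosely modelled on the E2E restaurant domain. The attribute names, the prompt wording, and the call_llm helper are illustrative assumptions, not the benchmark’s actual format or a real LLM client.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call; returns a canned string here."""
    return "The Eagle is a family-friendly French coffee shop in the riverside area."

# A small, fully specified input: every fact that should appear in the text.
mr = {
    "name": "The Eagle",
    "eatType": "coffee shop",
    "food": "French",
    "area": "riverside",
    "familyFriendly": "yes",
}

prompt = (
    "Write one or two sentences describing this restaurant, "
    "using only the facts below:\n"
    + "\n".join(f"- {attr}: {value}" for attr, value in mr.items())
)

print(call_llm(prompt))
```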

Generally speaking, LLMs do very well on such tasks *if* they are evaluated on a “leaderboard” basis, that is, average performance on a test set as measured by automatic or simple human evaluation. However, there are a number of important caveats that apply to real-world (as opposed to leaderboard) usage:

  • Data Protection: Some data is sensitive, for example because it contains confidential commercial or medical information. Such data cannot be used to train models because of the danger of leakage. I think most LLM providers accept this and allow users to forbid the use of their data for training purposes.
  • Time/Cost: GPT4 in particular is neither fast nor cheap. Depending on context, it can cost $0.10 or more to generate a response with GPT4, and take 5-10 seconds. Many applications require a faster and cheaper solution. Of course most LLMs are faster and cheaper than GPT4!
  • Controllability: In real-life NLG, users often want control over output texts; for example, in corporate contexts the language used in a text must match the corporate brand. In theory this can be done with LLMs by using suitable prompts, but this is not always straightforward (see the sketch after this list).
  • Beyond English: Many people have told me that GPT does poorly in low-resource languages, which perhaps is not surprising. It also seems to have problems in high-resource non-English languages, for example applying English punctuation rules to Spanish texts.
  • Safety and Robustness: Generated texts should be accurate and safe under all circumstances, even when real-world users start doing crazy things which developers never anticipated. In some cases, developers must provide proof (e.g. to regulators) that this is the case. Specifics depend on use case and workflow, but this can be a major challenge for black-box neural systems.
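To illustrate the controllability point, below is a minimal sketch of steering output style with a system prompt, assuming the openai Python package (v1-style client) and an API key in the environment; the brand rules, model name, and example data are invented for illustration, and real brand control usually needs more than a prompt.

```python
from openai import OpenAI  # assumes the openai v1-style Python client

client = OpenAI()  # reads the API key from the environment

# Invented brand guidelines: the point is that style constraints go in the system prompt.
brand_rules = (
    "You write for Acme Corp. Use short sentences, British spelling, "
    "no exclamation marks, and never speculate beyond the data provided."
)

data_summary = "Q3 revenue: 4.2m GBP (up 6% on Q2); operating costs flat."  # invented example data

response = client.chat.completions.create(
    model="gpt-4",  # model name is illustrative
    messages=[
        {"role": "system", "content": brand_rules},
        {"role": "user", "content": f"Write a two-sentence summary of: {data_summary}"},
    ],
    temperature=0,  # reduces variation between runs, but does not guarantee the style is followed
)

print(response.choices[0].message.content)
```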

Of course LLMs are improving and hopefully we will see progress in the above areas.

Complex analytics

In my experience, most NLG users care more about content than about language. Providing good content may require doing data analytics in order to extract key insights from the data. When I wrote my earlier blog on Can ChatGPT do Data-to-Text?, I pointed out that ChatGPT really struggled with analytics.

The latest versions of GPT and other LLMs have improved a bit, but they are still not very good at analytics and extracting insights from data. And perhaps we shouldn’t expect them to be good at analytics; after all, they are language models!

Everyone I have talked to recently about this, in both academic and commercial contexts, has said that the best approach is to do analytics and insight creation separately (outside the LLM), and then provide these insights to the LLM as part of its input data. Which makes sense to me!
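A minimal sketch of that division of labour is below; the toy data, the insight rules, and the call_llm helper are invented purely for illustration and stand in for whatever analytics code and LLM client are actually used.

```python
import pandas as pd

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return "(LLM-generated summary would appear here)"

# Toy sensor readings; real analytics would of course be much richer.
readings = pd.DataFrame({
    "hour": [0, 1, 2, 3, 4, 5],
    "value": [5.1, 5.4, 9.8, 10.2, 6.0, 5.5],
})

# Step 1: compute insights with ordinary analytics code, outside the LLM.
insights = [f"the mean value over the period was {readings['value'].mean():.1f}"]
if readings["value"].max() > 9.0:
    peak = readings.loc[readings["value"].idxmax()]
    insights.append(f"the value peaked at {peak['value']} in hour {int(peak['hour'])}")

# Step 2: pass the pre-computed insights to the LLM and ask it only to verbalise them.
prompt = (
    "Write a short, factual summary based only on these insights:\n"
    + "\n".join(f"- {insight}" for insight in insights)
)
print(call_llm(prompt))
```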

Long texts

Finally, many data-to-text systems generate texts that are several paragraphs or even pages long. The impression I get from people is that current LLMs do not do this task very well, unless they are given detailed guidance about document structure and content. As texts get longer, we see more hallucinations, more omissions, more discourse/contextual/lexical errors, and confusing document structures.

I gave a simple example in my earlier blog, where I gave ChatGPT some super-simple made-up weather data on wind speed, precipitation, and temperature, and it produced the following text:

This weather data shows a mostly stable weather pattern with little precipitation and temperatures ranging from 6 to 15 degrees Celsius. The wind speed fluctuates between 9-12 mph. The temperature increases during the day and decreases at night. It appears to be a sunny day.

The last sentence is a hallucination, because I did not give the system any information about whether the day was sunny or cloudy.

Of course, LLMs may get better at generating multi-paragraph texts. I think asking LLMs to do analytics fundamentally makes little sense, but it’s certainly possible that LLMs will get better at discourse-level issues and hence generate better long-form texts.

LLMs and Data-to-text Pipeline

Many years ago I described a data-to-text pipeline which divided data-to-text into several stages: signal analysis, data interpretation, document planning, microplanning, and surface realisation. What the above suggests is that, at least at this point in time:

  • LLMs are very good at microplanning and surface realisation, at least in academic leaderboard contexts. However, there are some important caveats about real-world usage.
  • LLMs currently don’t do nearly as well at document planning, but perhaps this will change over time.
  • LLMs are poor at signal analysis and data interpretation, and it is a mistake to expect them to do these tasks.

Of course, in the real world we don’t need to be purists and insist on 100% LLM solutions; we can build systems which use LLM technology where it makes sense and other technologies elsewhere.
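As a concrete illustration of such a hybrid, here is a skeleton in which the early pipeline stages are ordinary code and only the final wording is delegated to an LLM. The stage names follow the pipeline above, but the rules, data, and call_llm helper are illustrative assumptions rather than a recommended implementation.

```python
from typing import List

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return "(LLM-generated paragraph would appear here)"

def signal_analysis(raw: List[float]) -> dict:
    # Conventional numeric processing: extract basic features from the signal.
    return {"start": raw[0], "end": raw[-1], "max": max(raw)}

def data_interpretation(features: dict) -> List[str]:
    # Rule-based insight creation, done outside the LLM.
    insights = []
    if features["end"] > features["start"]:
        insights.append("the value rose over the period")
    if features["max"] > 1.5 * features["start"]:
        insights.append(f"there was a pronounced peak of {features['max']}")
    return insights

def document_plan(insights: List[str]) -> List[str]:
    # Decide content ordering deterministically (trivially, here).
    return sorted(insights)

def realise_with_llm(plan: List[str]) -> str:
    # Only microplanning and surface realisation are delegated to the LLM.
    prompt = (
        "Turn these points into a fluent paragraph, keeping this order "
        "and adding no new facts:\n" + "\n".join(f"- {point}" for point in plan)
    )
    return call_llm(prompt)

features = signal_analysis([2.0, 2.5, 4.1, 3.0])
print(realise_with_llm(document_plan(data_interpretation(features))))
```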

One thought on “LLMs and Data-to-text”

  1. I do agree with all your points, but would like to add another one: I have encountered situations where two facts seem to exclude each other, because they VERY rarely occur together. A prime example would be “diabetes type 1” and “diabetes type 2”. Based on any reasonable amount of training data, an ML system would take the presence of “diabetes type 2” as enough grounds to state “no diabetes type 1” or something similar.
    This is a particular kind of “hallucination” which, as I see it, is hard to prevent and may render LLM output unusable in a real-world (medical) context.

