Why tell stories with data visualisations?

Published in Applied Data Science · 10 min read · Jun 1, 2021

“All happy families are alike; each unhappy family is unhappy in its own way.”

From the very first sentence of Anna Karenina, Tolstoy gives us a universal truth about success. You can fail in many ways, but there’s only one way in which you can succeed. Successes share a similar set of qualities, as it is only by satisfying these qualities that our works can be perceived as successful by the world.

In this post we attempt to answer the question: “What are the qualities of a successful visualisation?” We dig into a real-world dataset to search for stories worth telling and explain how common practices in data visualisation sometimes fail to convey the right message.

This blog post is for visual creators, visual communicators, curious readers fascinated by visuals on the internet, readers whose business depends on correctly interpreting visuals, and readers annoyed by the lack of truth and elegance in the visuals proliferating around them. Finally, for practical people who like to get their hands dirty with some code, you can find our Jupyter notebooks for analysing the dataset and reproducing our visualisations here.

Why tell stories?

A career in data science requires an extensive and at times daunting set of skills, including knowledge of programming, statistics, machine learning, databases and business intelligence. Story-telling recently entered this list and is arguably still puzzling the data science community: is it just a buzzword or is there more to it?

Our society is bombarded by numbers. Tables, pie charts and exponential curves are being featured in our magazines and social media front pages, spurred on by the current demand for real-time data and analysis. Used to justify governmental decisions, alarm the public and shape public opinion, these numbers play a much more important role than most of us realise.

Numbers are, by definition, objective. After all, they are the vocabulary of science and engineering. Chances are, however, that the majority of the numbers you see in your everyday life are the products of statistical analysis, rather than physical measurements of the world.

Statistics is the science of reaching conclusions about a large population by studying a small sample. There is no statistical analysis without assumption and no data collection without bias. Models are the end result of statistics and they are inadvertently influenced by the assumptions made during the analysis. A model can only be perfect if it explains the whole complexity of our world, which is obviously impossible. In any other case, a model is but an abstraction of our world. The art of statistics is in finding the right model. The art of visualisation is in finding the right presentation for it.

“All models are wrong but some are useful.” — George Box

Data story-telling is the process by which we choose the plot, protagonists and level of detail of our data analysis. Contrary to popular belief, story-telling is not just the last step in a data science project, solely related to the task of communication. Instead, it should play an active role in shaping every step of the data science pipeline.

Let’s see why.

The dataset: grocery shopping habits

In 2017 Instacart open-sourced 3 million grocery orders made by people in the US — arguably the largest dataset currently available capturing shopping habits. Take the expression “we are what we eat” a bit seriously, and this dataset makes for a fascinating playground that, if interrogated appropriately, can reveal insights about correlations between products, day/night variations and shopping trends across the time and space of consumer consumption.

The dataset also has immense commercial value: these data can fuel market basket analysis in grocery retail, where AI-driven customer insights and engagement reign supreme.

In the rest of this post we put on our data-scientific glasses and get a glimpse of some of the story-telling potential of this dataset. We do not aim at exhaustively listing all its potential data insights, but you can check some cool analysis shared by the curators here.

The database consists of different tables describing the aisles, departments and products of the supermarket, as well as the orders of customers, whose user ids allow us to track their long-term behaviour. For example, below we can see the products bought by user 1 in their first and second orders. We have rich information about the day of the week, the hour and the order in which items were placed in the basket. The unique product ids allow us to associate products with departments and build product and customer profiles.
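To make the table structure concrete, here is a minimal sketch of how the tables can be linked through their ids. It uses plain Python and tiny made-up tables; every id, product name and field name below is an assumption for illustration and not the dataset's real schema.

```python
# Toy stand-ins for the dataset's tables. All ids, names and field names
# are illustrative assumptions, not the real schema.
products = {
    1: {"name": "Soda", "department_id": 7},
    2: {"name": "Organic Bananas", "department_id": 4},
    3: {"name": "Pistachios", "department_id": 19},
}
departments = {4: "produce", 7: "beverages", 19: "snacks"}
orders = [
    {"order_id": 10, "user_id": 1, "order_number": 1, "day_of_week": 2, "hour": 8},
    {"order_id": 11, "user_id": 1, "order_number": 2, "day_of_week": 3, "hour": 7},
]
order_items = [
    {"order_id": 10, "product_id": 1, "cart_position": 1},
    {"order_id": 10, "product_id": 2, "cart_position": 2},
    {"order_id": 11, "product_id": 1, "cart_position": 1},
    {"order_id": 11, "product_id": 3, "cart_position": 2},
]


def baskets_for_user(user_id):
    """Return one enriched row per purchased item of the given user."""
    order_by_id = {o["order_id"]: o for o in orders if o["user_id"] == user_id}
    rows = []
    for item in order_items:
        order = order_by_id.get(item["order_id"])
        if order is None:  # item belongs to another user's order
            continue
        product = products[item["product_id"]]
        rows.append({
            "order_number": order["order_number"],
            "day_of_week": order["day_of_week"],
            "hour": order["hour"],
            "cart_position": item["cart_position"],
            "product": product["name"],
            "department": departments[product["department_id"]],
        })
    # Sort by order, then by the position in which items entered the basket.
    return sorted(rows, key=lambda r: (r["order_number"], r["cart_position"]))


user_rows = baskets_for_user(1)
for row in user_rows:
    print(row)
```

In a notebook the same joins would more naturally be written as table merges; the dictionaries here only keep the sketch dependency-free.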

Fancy visuals can be engaging

Product correlations are some of the most important insights one can derive from market basket data. They can be used for online product recommendations, inventory optimisation and creating market basket bundles.

The most common tool used to derive these insights, standing right between the raw data and the final product associations, is the correlation matrix. This is a two-dimensional table featuring the same set of elements in its rows and columns. The cells contain values that capture how two variables move together: higher values indicate higher correlation and negative values show that when one variable increases, the other one tends to decrease.
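As a rough sketch of how such a matrix can be computed, the snippet below marks, for each basket, whether a product is present (1) or absent (0) and takes the Pearson correlation between these indicators. The baskets are invented and deliberately tiny, and Pearson on presence indicators is one reasonable choice for illustration, not necessarily the measure behind the figure in this post.

```python
from math import sqrt

# Hypothetical baskets: each set holds the products of one order.
baskets = [
    {"organic bananas", "organic avocado"},
    {"organic bananas", "organic avocado", "organic strawberries"},
    {"bananas"},
    {"bananas", "organic strawberries"},
    {"organic bananas", "organic avocado"},
    {"bananas"},
]
items = sorted(set().union(*baskets))


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# One 0/1 presence vector per product, one entry per basket.
indicators = {item: [1 if item in b else 0 for b in baskets] for item in items}

# Symmetric matrix stored as a dict of dicts: corr[row][column].
corr = {a: {b: pearson(indicators[a], indicators[b]) for b in items} for a in items}

for a in items:
    print(f"{a:>22}", " ".join(f"{corr[a][b]:+.2f}" for b in items))
```

In this toy data organic bananas and organic avocado always appear together, while organic and regular bananas never do, so the matrix contains both strongly positive and strongly negative cells.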

Here’s a correlation matrix of our products:

Correlation matrices show how variables react to changes in other variables

Found any stories in there?

Scanning through the colourful tiles, I saw that people buying organic bananas tend to regularly buy organic avocado and often buy organic strawberries and spinach (it seems we have an eco-conscious group in our customer database). On the other hand, people who buy organic bananas do not buy regular bananas (one can only eat so many bananas, I suppose).

Our correlation matrix certainly contains all the information necessary to capture correlations between products. But doesn’t it feel a bit dry? How could we spice it up?

Enter chord plots. These plots depict correlations between products, albeit in a distinctly different format. Here’s what a chord plot for the same dataset looks like:

The products are placed on the periphery of a circle and their relationships are drawn as arcs connecting them. Several qualities set chord plots apart from correlation matrices:

instead of using numbers and colours, the amount of correlation is shown as the width of a chord
colours are reserved to distinguish the different categories. They take values in a categorical rather than a continuous colour scale
these plots are interactive; the reader can choose to focus on one particular category and follow the cross-category arcs to see the correlations

Chord plots are an example of a visualisation that makes the reader part of the story-telling experience. They are proof that, sometimes, providing all the raw information at once can be overwhelming and uninformative. By interacting with the plot, the reader can focus on one side of the data without losing the opportunity to see the full story.

On the downside, chord plots cannot by definition show negative correlations. If this information matters for your own story, chord plots may not be a good choice for you.
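The limitation is easy to see when preparing the data for such a plot. In the sketch below, each chord needs a non-negative width, so only positive correlations survive. The correlation values, the helper name chord_edges and the min_weight threshold are all assumptions made for illustration.

```python
# Hypothetical pairwise correlations between four products.
corr = {
    ("organic bananas", "organic avocado"): 0.42,
    ("organic bananas", "organic strawberries"): 0.31,
    ("organic bananas", "bananas"): -0.27,
    ("organic avocado", "organic strawberries"): 0.18,
    ("organic avocado", "bananas"): -0.05,
    ("organic strawberries", "bananas"): 0.02,
}


def chord_edges(corr, min_weight=0.1):
    """Keep the pairs that can be drawn as chords, widest first.

    A chord's width encodes the correlation, so negative values (and,
    by choice, very weak ones below min_weight) cannot be represented.
    """
    edges = [(a, b, w) for (a, b), w in corr.items() if w >= min_weight]
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


edges = chord_edges(corr)
lost = [pair for pair, w in corr.items() if w < 0]

for a, b, w in edges:
    print(f"{a} <-> {b}: width {w:.2f}")
print("negative correlations that the chord plot cannot show:", lost)
```

The pair of organic and regular bananas, arguably the most interesting story in the matrix, is exactly the one that drops out.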

Fancy visuals can be misleading

If you are not a time traveller from the past, then you’ve probably come across a word cloud, also known as a tag cloud. At first glance, this visual feels more like an artistic canvas of letters than a statistical visualisation.

Before we deconstruct word clouds, let’s do a small exercise. How would you read this plot?

My shortlist of questions-for-conscious-graph-reading contains the following practical questions:

what were the data used to produce the plot?
how does the plot represent the values characterising the data?
what information is the plot trying to convey?

You can probably extract this information yourself, but here’s my insider info:

To create this plot I computed the total number of times, across all orders and customers, that a product was bought. This number is translated into the size of the word describing the product and the plot is trying to convey the relative popularity of the different products.
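The counting step described above can be sketched in a few lines. The orders here are made up, and the real notebooks may compute the totals differently; the point is only that each product ends up with a single number, its purchase count, which then drives the word size.

```python
from collections import Counter

# Hypothetical orders; each inner list is the products of one basket.
orders = [
    ["banana", "organic strawberries", "organic baby spinach"],
    ["banana", "organic avocado"],
    ["banana", "organic baby spinach"],
    ["organic avocado", "banana"],
    ["organic strawberries"],
]

# Total number of times each product was bought, across all orders.
counts = Counter(product for basket in orders for product in basket)
total = sum(counts.values())

# most_common() sorts by count, so the rank of every product is explicit.
ranking = counts.most_common()
for rank, (product, n) in enumerate(ranking, start=1):
    print(f"{rank}. {product}: {n} purchases ({n / total:.0%} of all items)")
```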

Perhaps the most important property of word clouds is their attractiveness and their ability to capture the attention of a wide array of readers, even those lacking a background in statistics, or even a general interest in reading graphs.

Once the attention of the reader is there, however, one needs to contemplate: what information does the reader receive from this plot?

This is where the weakness of these graphs lies. Many data science experts have questioned the true value of these visuals:

the longer a word is the more important it seems. This compromises the quality of information offered by these graphs.
colours may be pretty but they do not convey any relevant information
in the case of text analysis, words are taken out of context, which can often lead to extremely wrong conclusions. There’s certainly a big difference between the word “great” and the expression “not great” being the most popular in your database of customer reviews.
you can easily spot the most popular word, but can you tell which is the 8th most popular one? These graphs fail to convey even the original information they were intended for, the relative frequency of the different words.

This list is enough to keep word clouds out of your story-telling quiver. Even Flickr, the platform that originally popularised this type of visualisation, publicly acknowledged their shortcomings and apologised.

Sometimes going old-school, as with this classical bar chart, may be better for your readers:
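To illustrate why a sorted bar chart answers the "8th most popular" question so easily, here is a dependency-free sketch that renders one as text. The counts and the helper name text_bar_chart are invented for illustration; in a notebook you would normally reach for a plotting library instead.

```python
# Hypothetical purchase counts per product.
counts = {
    "organic baby spinach": 240,
    "banana": 470,
    "organic hass avocado": 215,
    "bag of organic bananas": 380,
    "organic strawberries": 265,
}


def text_bar_chart(counts, width=40):
    """Return one line per product, sorted by count, with a scaled bar."""
    top = max(counts.values())
    label_width = max(len(name) for name in counts)
    lines = []
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * n / top)  # bar length is proportional to count
        lines.append(f"{name:<{label_width}} | {bar} {n}")
    return lines


chart = text_bar_chart(counts)
print("\n".join(chart))
```

Position encodes rank and length encodes frequency, so neither word length nor colour can distort the reading.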

A Paradox In My Basket?

So you’ve chosen the story that you want to share with your readers, you’ve identified the right plot for the job and you are ready to share your message. Can things still go wrong?

Data paradoxes are perhaps the most illustrative example of the subjectivity of statistical analysis. They teach us that a single dataset is just a representation of reality, a facet that can tell many stories, simultaneously contradictory and true.

Let us look at an example inspired by our own dataset. As the collected data did not contain the features that would allow us to illustrate the paradox, we carefully construct a scenario inspired by our setting.

Let us assume that Instacart collected data from online recommendations made to customers. A recommendation has three features: the time at which it is displayed, the product it promotes and whether the customer ended up ordering the product.

We focus on two products, bananas and avocados. Then, our table aggregating all banana and avocado recommendations looks like this:

These data tell us that, in total, there were 1000 recommendations for bananas, out of which 800 led to an order. For avocados, out of the 1000 recommendations, 750 led to an order. Thus, our conclusion should be clear: banana recommendations were more successful than avocado recommendations by 5 percentage points.
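The arithmetic behind this comparison is short enough to spell out. The sketch below uses only the totals quoted above and distinguishes the gap in percentage points from the relative improvement, which are easy to confuse.

```python
# Aggregated totals quoted in the text.
recommendations = {"bananas": 1000, "avocados": 1000}
orders = {"bananas": 800, "avocados": 750}

# Success rate = orders / recommendations, per product.
rates = {p: orders[p] / recommendations[p] for p in recommendations}

# Absolute gap, in percentage points.
gap_points = (rates["bananas"] - rates["avocados"]) * 100

# Relative improvement of bananas over avocados, in percent.
relative_gain = (rates["bananas"] / rates["avocados"] - 1) * 100

print(f"bananas: {rates['bananas']:.0%}, avocados: {rates['avocados']:.0%}")
print(f"gap: {gap_points:.1f} percentage points")
print(f"relative improvement: {relative_gain:.1f}%")
```

So the gap between 80% and 75% is 5 percentage points, which corresponds to a relative improvement of roughly 6.7%.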

But let’s say that before we share our success story with our colleagues, we decide to look at the data again, this time separating recommendations made during the day from those made during the night. Now our data looks like this:

Product recommendations during day and night

We now see that, during the day, avocados had an 86% success rate, higher than bananas with an 84% success rate. During the night, avocados were again more successful, with a 50% rate compared with 40% for bananas.

Perplexed, we look at our data and replay the story in our head: avocados are more successful both during the day and night, but bananas are more successful in total. How do we explain this?

Surely you are ready to throw your data and conclusions out of the window. But a more careful look reveals some interesting biases in this dataset: there are many more recommendations during the day than during the night, and they are not evenly distributed between the two products. Is this enough to justify our paradoxical conclusion?

Edward Simpson reassures us that it is. Simpson’s paradox is a well-known phenomenon in statistics, where a trend appears in several sub-groups of a dataset but disappears or reverses when the sub-groups are aggregated into a single group. You may have come across it before: it has been observed in analyses of public opinion and COVID-19 statistics.
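The reversal can be checked numerically. The per-group counts below are an assumption: they are illustrative numbers chosen to be consistent with the totals and the rounded rates quoted in this post (84% and 40% for bananas, 86% and 50% for avocados), not figures taken from the original table.

```python
# (recommendations shown, orders placed) per product and time of day.
# ASSUMPTION: illustrative counts consistent with the rates in the text.
data = {
    "bananas": {"day": (900, 760), "night": (100, 40)},
    "avocados": {"day": (700, 600), "night": (300, 150)},
}


def rate(shown, ordered):
    """Fraction of recommendations that led to an order."""
    return ordered / shown


# Success rate within each sub-group (day, night).
per_group = {
    product: {period: rate(*counts) for period, counts in groups.items()}
    for product, groups in data.items()
}

# Success rate after aggregating day and night.
overall = {
    product: rate(
        sum(shown for shown, _ in groups.values()),
        sum(ordered for _, ordered in groups.values()),
    )
    for product, groups in data.items()
}

for period in ("day", "night"):
    print(period, {p: f"{per_group[p][period]:.0%}" for p in data})
print("overall", {p: f"{overall[p]:.0%}" for p in data})
```

Under these assumed counts, 90% of banana recommendations were shown during the day, when customers order far more readily, against only 70% of avocado recommendations. This uneven split is what lets bananas win overall while losing in both sub-groups.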

Visualising our data for Simpson’s Paradox

Our need for consuming, making up and sharing stories goes further back in the past than you would expect.

5 million years, in fact: back to the point in time when human societies started growing in number and their need for communication led to the emergence of human language.

Our language is not a perfect tool for every purpose. If you want to teach someone how to tie a knot, language is probably not going to be of much help and you’d better get that rope in your hands.

Human language is a means for telling stories: stories of victory and betrayal, everyday stories, stories that make people see the world through our eyes. A common belief in stories is what holds human society together.

These ancient origins of stories are where we should look for our answer to the question: “What makes for a good visual?”

Every statistical analysis is a story told through numbers and visuals. It is subjective, has intentions and will be judged by us, the story-telling species.
