Synthetic data, the new wave of AI innovation

Ségolène Combettes
Aug 1, 2023


Over the past decade, artificial intelligence has grown rapidly thanks to major technological advances that are changing industrial, economic and societal environments. Data plays a key role in the development and performance of artificial intelligence algorithms, so access to a sufficient quantity of high-quality data is crucial for building robust artificial intelligence solutions.
Artificial intelligence is currently held back by the limited amount of available data, which is often costly and difficult to access because of privacy constraints and its sensitive nature. This is particularly true in healthcare, where the available data is limited (especially for rare diseases) and highly confidential, and therefore difficult to share.

Just imagine a world in which it is possible to produce unlimited amounts of high-quality, inexpensive, completely anonymous and secure data... This is now possible thanks to synthetic data!
Synthetic data is data artificially generated by an artificial intelligence algorithm trained on real data. The generated data replicates the characteristics and relationships found in real-world data. It can be produced in unlimited quantities and is built to respect privacy.

In an increasingly data-driven world, let’s see how synthetic data could overcome the limits encountered today and allow artificial intelligence to enter a new era.


What is Synthetic Data?

Today, real-world data is the most important ingredient in developing artificial intelligence solutions or running analysis projects, but it can be difficult to access, heavily protected by regulations and expensive. Less than 1% of the data used in artificial intelligence solutions development is currently synthetic, but this is expected to change: the research firm Gartner has claimed that “by 2030, synthetic data will overshadow real data in a wide range of artificial intelligence models”. Gartner has also placed synthetic data on its “Impact Radar for Edge AI”, ranking it among the top 3 high-profile technologies.

Synthetic data can therefore be an effective alternative or supplement to real-world data, enhancing existing datasets and improving the robustness of artificial intelligence models.
Synthetic data is artificial data generated by an artificial intelligence algorithm. The generation algorithm is trained on real data and creates synthetic data that reproduces the characteristics and statistical distributions of the original data.
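
To make the idea of "training on real data, then generating" concrete, here is a deliberately tiny sketch. It is not the kind of model used in practice: it assumes the real data has only two numeric columns and that a simple Gaussian describes them well. The function names and the sample rows (age, blood pressure) are hypothetical and only serve as an illustration.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample(model, n, seed=0):
    """Draw n synthetic rows that follow the fitted means, spreads and correlation."""
    rng = random.Random(seed)
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mix z1 into the second column so the correlation is preserved.
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        out.append((x, y))
    return out


# Hypothetical "real" rows: (age, systolic blood pressure).
real_rows = [(34, 118), (45, 125), (52, 131), (61, 140),
             (29, 112), (70, 148), (48, 127), (56, 135)]

model = fit_gaussian(real_rows)
synthetic_rows = sample(model, n=1000)
print(len(synthetic_rows), "synthetic rows generated from", len(real_rows), "real rows")
```

The point of the sketch is the workflow: the generator only ever sees the real rows during fitting, and afterwards it can produce as many new rows as requested. Real generators replace the Gaussian with far more expressive models.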

Synthetic data is particularly valuable in artificial intelligence, but it can also be used in many other fields that rely on data analysis, thanks to the numerous advantages detailed below.

The Advantages of Synthetic Data

1. One of the main advantages of synthetic data is that there is no limit on the quantity we can create. We can build a dataset covering all the situations or characteristics we want, with no quantity restrictions. This is very useful when the initial dataset is limited or hard to obtain, or when a pattern or situation is under-represented.

2. Synthetic data is also a good way to overcome difficulties in accessing data. In many cases, access to data is limited or impossible, and synthetic data helps accelerate research in areas where data is scarce, as in health, or expensive, as in finance.

3. Real-world data can be very expensive, and synthetic data can be more cost-effective. It can be a good alternative for testing a simulation or running a statistical analysis before buying an expensive real dataset.

4. The last advantage concerns security and confidentiality. Synthetic data is artificial data that corresponds to no real individual, which makes it anonymous and protects people’s privacy. It is therefore easy to share or to use in any project.

How can we evaluate the quality of Synthetic Data?

First of all, an easy way to evaluate the quality of synthetic data is to monitor the performance of the artificial intelligence algorithm that generates it. During and after the training of such models, we can measure their quality and thus get an idea of how well they generate data. Many different evaluation metrics can be used to measure how well an artificial intelligence model generates synthetic data.

Once synthetic data is generated, we can evaluate its quality along three key dimensions: fidelity, utility and privacy.

With these three main axes, we should be able to answer the following questions:
- How similar is the synthetic data to the real-world data used to train the artificial intelligence model?
- How useful is the synthetic data for the project I want to use it for?
- Is the synthetic data fully anonymized, with no personal information from the original dataset?

For each dimension, there are metrics that allow us to answer at least one of these questions, but they can differ from one dataset to another. For example, the metrics depend on the type of data we are using: images won’t have the same evaluation metrics as tabular data.

Figure: Evaluation metrics of synthetic data along 3 main axes: fidelity, privacy, utility (image by author)

Metric to evaluate fidelity

One of the main goals of synthetic data is to be as realistic as possible and to preserve the characteristics, patterns and statistical distributions of the real-world data. A human should not be able to tell synthetic data from real data when comparing them. To evaluate the fidelity of synthetic data, we mainly use visual representations. For example, we can compare the statistical distributions of the real and generated datasets, the histograms of categorical variables, the correlations between variables, etc.
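
The visual comparisons described above can be complemented with simple numbers. The sketch below is one possible choice, not a standard prescribed by the article: it assumes tabular data, compares a categorical column through the total variation distance between category frequencies, and compares the relationship between two numeric columns through the Pearson correlation. All column values are made up for illustration.

```python
import math
from collections import Counter


def category_frequencies(values):
    """Relative frequency of each category in a column."""
    counts = Counter(values)
    return {k: c / len(values) for k, c in counts.items()}


def total_variation_distance(real_values, synth_values):
    """0.0 means identical category frequencies, 1.0 means no overlap at all."""
    p = category_frequencies(real_values)
    q = category_frequencies(synth_values)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical categorical column (sex) in the real and synthetic datasets.
real_sex = ["F", "F", "M", "M", "F", "M", "F", "M"]
synth_sex = ["F", "M", "M", "M", "F", "M", "F", "M"]
print(total_variation_distance(real_sex, synth_sex))  # 0.125

# Hypothetical numeric columns (age, blood pressure) in both datasets.
real_age, real_bp = [30, 40, 50, 60], [115, 122, 131, 140]
synth_age, synth_bp = [32, 41, 49, 63], [117, 121, 133, 139]
gap = abs(pearson(real_age, real_bp) - pearson(synth_age, synth_bp))
print(f"correlation gap between real and synthetic: {gap:.3f}")
```

A small distance and a small correlation gap suggest that the synthetic data follows the real data closely on these two aspects; they do not replace looking at the plots.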

Metric to evaluate utility

Once we have confirmed that synthetic data is similar to the real-world data, it’s important to verify that using it improves performance on a task.
For training an artificial intelligence model, an easy way to evaluate the utility of synthetic data is to compare the performance of a model trained only on real-world data with that of the same model trained on both real-world and synthetic data. If the model achieves better results with synthetic data, the synthetic data is of good quality in terms of utility.
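
This comparison can be sketched in a few lines. The example below is only an illustration under simple assumptions: the model is a basic nearest-centroid classifier written for the occasion, the data points are invented, and both models are scored on the same held-out set of real data so that the comparison is fair.

```python
import math
from collections import defaultdict


def fit_centroids(rows):
    """Train a nearest-centroid classifier: one mean point per label."""
    groups = defaultdict(list)
    for features, label in rows:
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(points) for col in zip(*points))
        for label, points in groups.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


def accuracy(centroids, rows):
    """Share of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, features) == label for features, label in rows)
    return hits / len(rows)


def utility_gain(real_train, synthetic, real_test):
    """Accuracy on held-out real data, without and with synthetic training rows."""
    baseline = accuracy(fit_centroids(real_train), real_test)
    augmented = accuracy(fit_centroids(real_train + synthetic), real_test)
    return baseline, augmented


# Hypothetical data: one measurement per patient, with very few "sick" examples.
real_train = [((1.0,), "healthy"), ((2.0,), "healthy"), ((3.0,), "healthy"),
              ((9.0,), "sick")]
synthetic = [((5.5,), "sick"), ((6.5,), "sick"), ((6.0,), "sick")]
real_test = [((2.5,), "healthy"), ((5.0,), "sick"), ((6.0,), "sick"),
             ((3.5,), "healthy")]

print(utility_gain(real_train, synthetic, real_test))  # (0.75, 1.0)
```

Here the synthetic rows fill in the under-represented "sick" class, and accuracy on the real test set rises from 0.75 to 1.0. With real projects the same protocol applies, only with the actual model and metric of the task.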

Metric to evaluate privacy

Finally, privacy regulations are quite strict, especially in the medical sector, where sensitive information is highly protected. To make full use of synthetic data, we need to ensure that it is fully anonymized and that no personal information from the real-world data appears in it. As a first privacy metric, we can check that no record is duplicated between the real-world data and the synthetic data. We can then calculate the neighbors’ privacy score, which measures the proportion of synthetic records that are too similar to real-world records and could therefore point to potential privacy leakage.
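
Both checks can be sketched as follows. This is one possible reading of the neighbors’ privacy score, not a reference definition: it assumes numeric records that are already on comparable scales, uses the Euclidean distance, and relies on a threshold that has to be chosen for each dataset. The records and the threshold value are hypothetical.

```python
import math


def exact_duplicates(real_rows, synth_rows):
    """Synthetic rows that are identical copies of a real row."""
    real_set = set(real_rows)
    return [row for row in synth_rows if row in real_set]


def neighbor_privacy_ratio(real_rows, synth_rows, threshold):
    """Share of synthetic rows whose nearest real row lies closer than `threshold`."""
    too_close = 0
    for synth in synth_rows:
        nearest = min(math.dist(synth, real) for real in real_rows)
        if nearest < threshold:
            too_close += 1
    return too_close / len(synth_rows)


# Hypothetical records with two numeric features each.
real_rows = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
synth_rows = [(1.0, 1.0), (1.05, 1.0), (3.0, 3.0), (8.0, 8.0)]

print(exact_duplicates(real_rows, synth_rows))  # [(1.0, 1.0)]
print(neighbor_privacy_ratio(real_rows, synth_rows, threshold=0.1))  # 0.5
```

In this toy case, one synthetic record copies a real one and another sits almost on top of it, so half of the synthetic records would be flagged for review before sharing.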

All these metrics give us information about the quality of the synthetic data itself and contribute to the decision to use it in a project.

Beyond these metrics, evaluating the quality of synthetic data requires feedback from an expert in the domain the data belongs to. If we work with MRI, for example, we need to collaborate with radiologists, who validate the quality of the generated synthetic MRI scans with a specialist’s eye and confirm that they can be used in clinical studies.

Create your own synthetic dataset with Alia Santé

Made up of experts in developing artificial intelligence solutions, Alia Santé offers a new platform for synthetic data generation. Built on several innovative artificial intelligence models, the platform allows anyone to generate high-quality synthetic data of any type from their own dataset. This addresses the problems of insufficient data and the difficulty of sharing sensitive or personal data, and it can also help improve existing artificial intelligence models.

Figure: Alia Santé synthetic data generation platform (image by author)

The Alia Santé platform provides synthetic data generation models that can generate digital twins and avatars and, above all, scale up real-world data. A dedicated study of the real-world dataset makes it possible to select the best model from the library of generation models and ensure high-quality synthetic data. In addition, a quality report with a quality score is attached to the synthetic data. This report is based on multiple items and metrics that contribute to the final quality score.

In conclusion, synthetic data is a powerful innovation that is about to transform artificial intelligence by overcoming the limits currently encountered with real-world data. It gives researchers and industry access to an unlimited amount of data that is easy to share because it is fully anonymized, which accelerates and improves their work.
Synthetic data is without a doubt the key for artificial intelligence to reach the next stage of its evolution, making it possible to build more robust artificial intelligence models and improve the performance of existing ones while protecting privacy.

