As a data scientist, you know that high-performance machine learning models cannot exist without a large amount of high-quality data to train and test them. Most of the time, building an appropriate machine learning model is not a problem: there are plenty of architectures available, and since it is part of your job, you know exactly which one best suits your use case. Obtaining a large amount of high-quality data, however, can be much more challenging: you need a labeled, cleaned dataset that matches your use case exactly. Unfortunately, such a dataset is usually not readily available. Maybe you only have a few samples matching your requirements, maybe you have data but they do not exactly match what you want (they can have biases or unbalanced classes, for example), or maybe a suitable dataset exists but you cannot access it because it contains private information. You therefore need to collect, label, and clean new data, which can be a time-consuming and costly process, or may not even be possible at all.
Synthetic data can be a valuable tool for improving machine learning workflows.
They can allow you to:
- Obtain more training and testing data for your models.
- Improve the diversity and representativeness of your data.
- Share a synthetic version of a database in order to protect private information.
However, in order to make the most of synthetic data, it is necessary to ensure their quality and avoid generating inconsistent data. Tools such as Faker allow you to generate fake data easily, but those data are not complex enough to replace a real-world dataset. Using machine learning models can help you generate more realistic data, and by combining this with tricks that ensure your synthetic data meet business constraints, you can unlock useful and consistent synthetic data.
Synthetic data are artificially generated data that imitate the characteristics of real-world data so that you can use them for your use case. Therefore, they can help you in the various scenarios listed above.
So, is synthetic data generation that amazing? Well, it can be, but only if the quality of the synthetic data you generate is good enough. You could trivially generate a synthetic dataset by repeating the same value over and over, but would this dataset be useful to train or test a machine learning model? Of course not. Let’s see what options we have to generate synthetic data.
Faker is a library that allows you to quickly generate fake data: random names, dates, ages, and so on. It is mostly used by developers to populate databases for testing purposes. However, when you need data to train or test a machine learning model, what you are looking for is not a random dataset but a high-quality dataset with complex distributions and correlations. Faker has a lot to offer, but not the ability to handle such distributions and correlations. If we want our machine learning model to learn something useful from our synthetic data, we have to continue to level 1.
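Before climbing to level 1, here is a quick look at what this kind of generation gives you: a minimal sketch with Faker (the columns below are arbitrary examples, not tied to any particular dataset):

```python
from faker import Faker

fake = Faker()
fake.seed_instance(42)  # make the output reproducible

# Each call returns an independent random value: handy for populating a test
# database, but there are no distributions or correlations for a model to learn.
rows = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "birthdate": fake.date_of_birth(minimum_age=18, maximum_age=90),
        "city": fake.city(),
    }
    for _ in range(5)
]
print(rows)
```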
Unsupervised machine learning models are specifically designed to learn the complex distribution of real-world data and generate new samples from it. Exactly what we are looking for! You can use the data you already have as a training set for an unsupervised machine learning model, then generate new data once the model is trained. Deep learning models such as Generative Adversarial Networks (GAN) [1] or Variational Autoencoders (VAE) [2] are particularly effective for this kind of task.
Since many companies work with tabular data, adaptations of these models have been proposed to handle it more effectively. For instance, Conditional Tabular GAN (CTGAN) [3], an adaptation of the GAN architecture, is very popular. Fortunately, you do not have to implement it yourself: libraries such as the Synthetic Data Vault (SDV) allow you to instantiate architectures such as CTGAN, train them on the data you already have, and generate new data very easily.
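As an illustration, here is a minimal sketch of that workflow with the SDV single-table API (class and method names below follow SDV 1.x and have changed between versions; the CSV file name is a placeholder):

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Your real tabular data as a pandas DataFrame (placeholder file name).
df = pd.read_csv("my_real_data.csv")

# Describe the table (column types, etc.); SDV can infer this automatically.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# Train a CTGAN-based synthesizer on the real data ...
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(df)

# ... then sample as many new synthetic rows as you need.
synthetic_df = synthesizer.sample(num_rows=1000)
```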
So, are we done now? Sometimes, yes! But … not always. You should look carefully at your synthetic data, because your newly generated dataset may still be inconsistent. Indeed, the goal of a synthetic data generator is to estimate the distribution of the training set you feed it, but this training set is not always enough to teach it specific properties of your data. For instance, a numeric attribute $a$ may have to be greater than a second numeric attribute $b$, yet the distributions of $a$ and $b$ overlap, so the generator can produce rows for which $b > a$. Moreover, if you are generating synthetic data for privacy reasons, you should use Differential Privacy (DP), and, unfortunately, DP adds random noise to your generator that may deteriorate its performance a bit.
Fortunately, if your generator mostly learned the distribution of its training data but still produces some inconsistent samples, you can reach fully consistent synthetic data by adding constraints to your synthetic data generator: as long as the constraints you add do not involve sharing private information about individuals in your dataset, they can improve the utility of your dataset without privacy leaks. There are two strategies for this, and they can be combined: reject sampling and transformations.
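To make the reject sampling idea concrete, here is a hand-rolled sketch around any fitted SDV-style synthesizer (this only illustrates the principle, not SDV's internal implementation; the columns $a$ and $b$ echo the hypothetical constraint above):

```python
import pandas as pd

def reject_sample(synthesizer, n_rows, is_valid, max_tries=20):
    """Sample until `n_rows` rows satisfying the constraint are collected.

    `is_valid` takes a DataFrame and returns a boolean Series marking the
    rows that meet the business constraint.
    """
    valid_parts, n_valid = [], 0
    for _ in range(max_tries):
        batch = synthesizer.sample(num_rows=n_rows)
        batch = batch[is_valid(batch)]  # reject the inconsistent rows
        valid_parts.append(batch)
        n_valid += len(batch)
        if n_valid >= n_rows:
            break
    return pd.concat(valid_parts).head(n_rows)

# Constraint from the example above: attribute `a` must stay greater than `b`.
consistent_rows = reject_sample(synthesizer, 1000, lambda d: d["a"] > d["b"])
```

Transformations, by contrast, re-encode the constrained columns (for instance, generating $b$ and the positive difference $a - b$ instead of $a$ directly) so that every sampled row is valid by construction.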
Let’s see how it is possible to generate constrained synthetic data with SDV on the breast cancer dataset, and how it can improve both the consistency and the utility of the synthetic data generated.
The dataset contains features describing the cells of breast samples, with the aim of determining, for each sample, whether the observed cells indicate the presence of breast cancer. Several characteristics (radius, perimeter, texture, etc.) are described, and for each of them both a mean value (mean radius, mean texture, etc.) and a maximum value (worst radius, worst texture, etc.) are available. For the data to be consistent, the mean value of each feature must be less than or equal to its maximum value in every example. For instance, the attribute mean perimeter cannot be strictly greater than the attribute worst perimeter of the same instance; otherwise it would mean that the mean perimeter of the cells in the sample is greater than the maximum perimeter of that same set of cells, which does not make sense.
Let’s generate a synthetic version of the dataset with the SDV library.
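The exact code is not reproduced here, but a minimal version could look like the following, loading the dataset from scikit-learn and reusing the SDV 1.x API shown above (the choice of GaussianCopulaSynthesizer is an assumption; any SDV single-table synthesizer fits the same workflow):

```python
from sklearn.datasets import load_breast_cancer
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# The breast cancer dataset as a pandas DataFrame (30 features + target).
df = load_breast_cancer(as_frame=True).frame

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# Unconstrained generator: it only sees the empirical distribution.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(df)
synthetic = synthesizer.sample(num_rows=len(df))
```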
Now, let’s generate another synthetic version, but with the constraint that, for each feature, the mean value generated has to be lower than or equal to the maximum value generated.
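Assuming the SDV 1.x constraints API, one Inequality constraint per feature expresses that the mean column must stay below or equal to the corresponding worst column:

```python
# The 10 base features; each has a "mean <feature>" and a "worst <feature>" column.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]

# One Inequality constraint per feature: mean <= worst.
constraints = [
    {
        "constraint_class": "Inequality",
        "constraint_parameters": {
            "low_column_name": f"mean {feature}",
            "high_column_name": f"worst {feature}",
            "strict_boundaries": False,  # the mean may equal the maximum
        },
    }
    for feature in FEATURES
]

constrained_synthesizer = GaussianCopulaSynthesizer(metadata)
constrained_synthesizer.add_constraints(constraints=constraints)
constrained_synthesizer.fit(df)
synthetic_constrained = constrained_synthesizer.sample(num_rows=len(df))
```

Internally, SDV enforces such constraints with the two strategies mentioned earlier, transforming the columns when possible and falling back to reject sampling otherwise.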
Now, for each feature and for each dataset, let’s check for how many rows the mean value is greater than the maximum value.
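A simple pandas check over the three datasets, reusing the objects from the previous sketches:

```python
datasets = {
    "original": df,
    "synthetic (no constraints)": synthetic,
    "synthetic (constraints)": synthetic_constrained,
}

for name, data in datasets.items():
    for feature in FEATURES:
        # Count rows where the mean exceeds the maximum: it should always be 0.
        n_bad = int((data[f"mean {feature}"] > data[f"worst {feature}"]).sum())
        print(f"{name:>26} | {feature:<17} | {n_bad} inconsistent rows")
```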
Both the original dataset and the synthetic dataset generated with constraints are perfectly consistent. In contrast, the synthetic dataset generated without constraints suffers from inconsistencies for every feature.
Now, let’s compare the utility of each dataset on the prediction of whether the cells observed are representative of the presence of breast cancer or not.
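The article does not specify which model or metric was used for this comparison; a common approach is to train the same classifier on each dataset and evaluate it on held-out real rows, sketched here with a random forest and accuracy (both choices are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

TARGET = "target"  # label column of the scikit-learn breast cancer frame

# Hold out real rows for evaluation (ideally the generators above would
# also have been fitted on the training split only).
train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)
X_test, y_test = test_df.drop(columns=TARGET), test_df[TARGET]

training_sets = {
    "original": train_df,
    "synthetic (no constraints)": synthetic,
    "synthetic (constraints)": synthetic_constrained,
}

for name, train_data in training_sets.items():
    X, y = train_data.drop(columns=TARGET), train_data[TARGET]
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name:>26}: accuracy = {accuracy:.3f}")
```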
Adding constraints to the synthetic data generator improved the utility of the generated data, even though there is still a drop in utility compared to the original dataset. Indeed, adding constraints brings your synthetic data generator closer to perfection, but does not quite get it there: there is still a lot of work to be done on synthetic data generation!
Synthetic data generators can be helpful tools to get more data for training and testing ML models, as well as to generate a synthetic version of private data. That is why, at Craft AI, we are working on this subject, so that we can help you easily integrate a synthetic data generation step into your machine learning pipelines.
References:
[1] Goodfellow, I., et al., "Generative Adversarial Nets", 2014.
[2] Kingma, D. P., Welling, M., "Auto-Encoding Variational Bayes", 2013.
[3] Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K., "Modeling Tabular Data using Conditional GAN", 2019.