Creating Your Own Face Dataset: DatasetGAN, GPUs

What Are Face Datasets?

Image datasets contain digital images chosen to train, test, and evaluate the performance of machine learning (ML) and artificial intelligence (AI) algorithms, typically computer vision algorithms. Face datasets, specifically, contain images of human faces curated for ML projects.

A face dataset includes faces shown in a variety of lighting conditions, emotions, and poses, and spanning different ethnicities, genders, and ages. Face datasets are key enablers of face recognition, a computer vision field applicable to various use cases, such as augmented reality (AR), personal device security, and video surveillance.

How Are GPUs Advancing Deep Learning?

You create deep learning (DL) models by training neural networks. GAN architectures are one type of neural network. Training neural networks typically involves exposing the DL model to millions of data points, which can be very computationally intensive.

What Is a GPU?

A graphics processing unit (GPU) is a specialized microprocessor that can run multiple computations simultaneously, which can significantly speed up the process of training deep learning models. There are various GPUs designed and optimized for training AI and DL models. Each GPU includes many cores, enabling it to compute many parallel processes efficiently.


In the past, machine learning was slower partly because it relied on traditional central processing units (CPUs), which offered memory bandwidth of only around 50 GB/s. Today's GPUs offer up to 750 GB/s of memory bandwidth. GPUs are highly suitable for deep learning because DL computations process massive amounts of data in parallel, and the high memory bandwidth offered by GPUs improves performance.
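As a back-of-the-envelope sketch of why bandwidth matters, the snippet below estimates a lower bound on data-transfer time at the two bandwidth figures quoted above; the 150 GB batch size is an arbitrary assumption for illustration:

```python
# Rough estimate of how memory bandwidth bounds the time to stream
# a batch of training data through a processor. The 50 GB/s and
# 750 GB/s figures are the illustrative numbers quoted above,
# not benchmarks of any specific hardware.

def transfer_time_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to move `data_gb` gigabytes."""
    return data_gb / bandwidth_gb_per_s

batch_gb = 150.0  # e.g. a large batch of image tensors (assumed size)

cpu_time = transfer_time_seconds(batch_gb, 50.0)   # ~3.0 s
gpu_time = transfer_time_seconds(batch_gb, 750.0)  # ~0.2 s

print(f"CPU: {cpu_time:.2f} s, GPU: {gpu_time:.2f} s, "
      f"speedup: {cpu_time / gpu_time:.0f}x")
```

The same ratio holds for any batch size: the bandwidth gap translates directly into a roughly 15x difference in how fast data can be fed to the compute units.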


You can further improve the performance of DL workloads by using a multi-GPU cluster, with or without GPU parallelism. GPU parallelism combines multiple GPUs in one computer to work on a single computation, providing better performance. Each system supports a different level of parallelism, which determines the overall performance.

Not all deep learning frameworks support GPU parallelism. However, you do not have to use GPU parallelism to run a multi-GPU cluster. In that case, each GPU runs separately and computes its own processes. This approach does not speed up any single job, but it gives you the freedom to run and experiment with several algorithms at once.
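The "independent experiments" setup described above can be sketched as follows. The thread pool and the `run_experiment` stub are illustrative stand-ins: in practice, each worker would pin its job to a different GPU device (e.g. `cuda:0`, `cuda:1`, ...).

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of running several experiments at once without GPU
# parallelism: each job is independent and would, in practice,
# train on its own device. The configs and fake metric below
# are placeholders.

def run_experiment(config: dict) -> dict:
    # Placeholder for a training loop; a real job would build and
    # train a model on config["device"] here.
    score = sum(ord(c) for c in config["name"]) % 100  # fake metric
    return {"name": config["name"], "device": config["device"], "score": score}

configs = [
    {"name": "baseline", "device": "cuda:0"},
    {"name": "wide-mlp", "device": "cuda:1"},
    {"name": "deep-mlp", "device": "cuda:2"},
]

with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    results = list(pool.map(run_experiment, configs))

for r in results:
    print(r)
```

Because the jobs never communicate, there is no synchronization overhead; the trade-off, as noted above, is that no single job finishes any faster.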

DatasetGAN: AI Training Dataset Generator

DatasetGAN is a system that generates annotated synthetic visual data for training computer vision models. Based on the NVIDIA StyleGAN technology, it generates realistic images based on minimal inputs. A human annotates the first 16 images, and an interpreter learns to generate further annotations.

DatasetGAN can produce an effectively unlimited number of annotated images that you can use to train AI models. A generative adversarial network (GAN) produces images: a generator learns to generate photorealistic data while a discriminator tries to distinguish the synthetic data from the real data. Once trained, the generator can produce realistic datasets.
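As a minimal sketch of the adversarial setup (not DatasetGAN's actual architecture), the toy example below uses single linear maps for the generator and discriminator, just to make the two competing loss terms concrete:

```python
import numpy as np

# Toy sketch of the adversarial objective: the discriminator D outputs
# the probability that a sample is real; the generator G is trained to
# make D(G(z)) approach 1. Both "networks" here are single linear maps
# with random, untrained weights -- placeholders for real models.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w_d = rng.normal(size=4)        # discriminator weights (placeholder)
w_g = rng.normal(size=(4, 4))   # generator weights (placeholder)

def D(x):                       # discriminator: P(sample is real)
    return sigmoid(x @ w_d)

def G(z):                       # generator: latent noise -> fake sample
    return z @ w_g

real = rng.normal(loc=2.0, size=(8, 4))   # stand-in for "real" data
z = rng.normal(size=(8, 4))               # latent noise
fake = G(z)

# Discriminator loss: push D(real) -> 1 and D(fake) -> 0.
d_loss = -np.mean(np.log(D(real)) + np.log(1.0 - D(fake)))
# Generator loss: push D(fake) -> 1 (fool the discriminator).
g_loss = -np.mean(np.log(D(fake)))

print(f"d_loss={d_loss:.3f}  g_loss={g_loss:.3f}")
```

In real training, these two losses are minimized in alternation, which is what drives the generator toward photorealistic output.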

Before DatasetGAN, computer vision datasets often required thousands of annotators to label images manually. Complex computer vision applications require massive datasets depicting diverse, complex events, usually with semantic segmentation. Some scenes may have dozens of objects and take over an hour to annotate.

With DatasetGAN, the input includes annotations, allowing the interpreter to generate labels alongside the GAN-generated images. It uses an MLP classifier to label every pixel in a generated image. For instance, the interpreter can label facial features in a face image. The NVIDIA researchers trained this interpreter on labeled images of faces, rooms, birds, cats, and cars, with at least 16 examples for each class.
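The per-pixel labeling idea can be sketched as below. The feature dimension, class count, and random weights are placeholders, not the interpreter's actual configuration:

```python
import numpy as np

# Minimal sketch of the interpreter idea: every pixel of a generated
# image carries a feature vector (in DatasetGAN, upsampled StyleGAN
# activations), and a small MLP classifies each pixel into a part
# label (e.g. eye, nose, mouth for faces). The image size, feature
# size, class count, and random weights here are all placeholders.

rng = np.random.default_rng(0)

H, W, FEAT, CLASSES = 16, 16, 32, 5   # tiny made-up dimensions

features = rng.normal(size=(H, W, FEAT))      # per-pixel features
w1 = rng.normal(size=(FEAT, 64)) * 0.1        # hidden-layer weights
w2 = rng.normal(size=(64, CLASSES)) * 0.1     # output-layer weights

def mlp_label_pixels(feats: np.ndarray) -> np.ndarray:
    """Classify every pixel independently with a 2-layer MLP."""
    x = feats.reshape(-1, FEAT)               # (H*W, FEAT)
    hidden = np.maximum(x @ w1, 0.0)          # ReLU
    logits = hidden @ w2                      # (H*W, CLASSES)
    return logits.argmax(axis=1).reshape(H, W)

label_map = mlp_label_pixels(features)
print(label_map.shape)   # one label per pixel
```

Because the classifier operates per pixel, a handful of annotated images yields a very large number of labeled training examples, which is what makes the 16-image budget workable.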

The team compared DatasetGAN’s performance using computer vision benchmarks like Celeb-A. It outperformed each benchmark’s baseline.

Setting Up a GPU Compute Instance on AWS

To work with DatasetGAN, you need a machine with a powerful GPU. If you don't have a sufficiently powerful machine, you can rent a GPU workstation in the Amazon cloud and pay only for the time you use it.

Amazon Elastic Compute Cloud (EC2) provides several types of GPU compute instances. The latest generation of NVIDIA GPU-based instances is EC2 G5. The G5 instance comes with up to 8 NVIDIA A10G Tensor Core GPUs, which provide huge processing power. Keep in mind that GPU instances are not cheap: if you plan on working with them on an ongoing basis, check out a few tips on AWS cost optimization.

To launch a Linux G5 instance via the AWS Management Console:

  1. Go to the EC2 console.

  2. On the EC2 console dashboard, find the Launch instance box and select Launch instance.

  3. Under Name and tags, enter a descriptive name for the instance under Name.

  4. Under Application and OS Images (Amazon Machine Image), follow these steps:

    1. Select Quick Start, and then select Amazon Linux as the operating system for the instance.

    2. Under Amazon Machine Image (AMI), choose an HVM version of Amazon Linux 2.

  5. In the Instance type list, select a G5 instance type as the hardware configuration for the instance.

  6. Under Key pair (login), select the key pair you created earlier as the Key pair name.

  7. Under Network settings, select Edit. The wizard creates and selects a security group for you, shown under Security group name.

  8. Keep the default selections for all other instance configurations.

  9. Review the instance configuration in the Summary panel. When ready, select Launch instance.
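If you prefer scripting the launch, the same configuration can be expressed with the AWS CLI. The AMI ID, key pair name, and security group ID below are placeholders you must replace with your own values:

```shell
# Sketch of the same launch via the AWS CLI. The IDs and names are
# placeholders -- substitute your own AMI, key pair, and security group.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type g5.xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-0123456789abcdef0 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=datasetgan-box}]' \
    --count 1
```

The `g5.xlarge` size carries a single A10G GPU; larger G5 sizes scale up to the eight-GPU configuration mentioned above.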

Training Your First Dataset with DatasetGAN


To train a dataset using DatasetGAN, install Python 3.6 and PyTorch 1.4.0 on your GPU workstation. Check requirements.txt for the additional Python packages required.

Next, download the DatasetGAN dataset and place it in the ./datasetGAN/dataset_release folder. You can also download a pre-trained StyleGAN checkpoint and convert it from TensorFlow to PyTorch.
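A possible environment setup matching the versions above might look like this (assuming conda is available; the repository URL is the public NVIDIA DatasetGAN release):

```shell
# Sketch of an environment setup for the versions mentioned above
# (Python 3.6, PyTorch 1.4.0). Assumes conda is installed.
conda create -n datasetgan python=3.6 -y
conda activate datasetgan
pip install torch==1.4.0 torchvision==0.5.0

git clone https://github.com/nv-tlabs/datasetGAN_release.git
cd datasetGAN_release
pip install -r requirements.txt
```

torchvision 0.5.0 is the release paired with PyTorch 1.4.0; mismatched versions are a common source of import errors with older PyTorch code.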

Reproduce the Original DatasetGAN Research Paper

To learn how DatasetGAN works, it is a good idea to reproduce the experiments from the original research paper. First, change into the project directory with cd datasetGAN, then follow these steps:

1. Train the interpreter:

python train_interpreter.py --exp experiment/<exp_1>.json

2. Sample the GAN:

python train_interpreter.py --generate_data True --exp experiment/<exp_1>.json \
--resume [path-to-trained-interpreter in step 1] \
--num_sample [num-samples]

You can run parallel sampling processes using:

sh datasetGAN/script/ 

3. Train a downstream task:


python train_deeplab.py \
--data_path [path-to-generated-dataset in step 2] \
--exp experiment/<exp_1>.json

Create Your Own Model

Now that you are familiar with how DatasetGAN works, you can create a new model using these steps:

1. Train a new StyleGAN model using the StyleGAN code. Convert the TensorFlow checkpoint to PyTorch. You can specify the StyleGAN checkpoint path in datasetGAN/experiment/custom.json.

2. Run the following command:

python datasetGAN/make_training_data.py --exp datasetGAN/experiment/custom.json --sv_path ./new_data

It will generate a sample dataset and write two .npy files to the --sv_path folder. The first, avg_latent_stylegan1.npy, enables truncation; the second, latent_stylegan1.npy, is used to retrieve the training images.

3. Annotate the generated image according to your preferences. Use the appropriate file format and store the annotations in the ./datasetGAN/dataset_release folder.
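To sanity-check the two .npy files produced in step 2, you can load them with NumPy. Since this sketch has no real generation output, it writes placeholder arrays first; the array shapes are assumptions for illustration, not StyleGAN's actual latent dimensions:

```python
import numpy as np

# Sketch of inspecting the two .npy files mentioned above. The real
# files come from the generation step; here we write placeholder
# arrays first so the load step has something to read. The shapes
# (18, 512) and (16, 512) are assumed, not StyleGAN's actual sizes.

avg_latent = np.zeros((18, 512))                            # truncation anchor
latents = np.random.default_rng(0).normal(size=(16, 512))   # per-image latents

np.save("avg_latent_stylegan1.npy", avg_latent)
np.save("latent_stylegan1.npy", latents)

# Later, load them back to drive truncation / retrieve training images:
loaded_avg = np.load("avg_latent_stylegan1.npy")
loaded_latents = np.load("latent_stylegan1.npy")

print(loaded_avg.shape, loaded_latents.shape)
```

Checking the shapes this way is a quick test that the generation step completed and wrote both files correctly before you invest time in annotation.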


In this article, I explained the basics of DatasetGAN and showed how to:

  • Set up a powerful GPU compute instance in AWS.

  • Install prerequisites for DatasetGAN.

  • Reproduce the original DatasetGAN paper on your workstation.

  • Create your own model and generate a new image dataset for your needs.

I hope this will help you get started with synthetic data in your computer vision projects.

