HaGRID — HAnd Gesture Recognition Image Datasets | by Karina Kvanchiani | Jun, 2022

Build your own datasets

The use of gestures in human communication plays an important role: gestures can reinforce statements emotionally or completely replace them. What is more, hand gesture recognition (HGR) can be a part of human-computer interaction.

Such systems can be used in video conferencing services (Zoom, Skype, Discord, Jazz, etc.), home automation systems, the automotive sector, services for people with speech and hearing impairments, etc. Besides, the system can be a part of a virtual assistant or service for active sign language users — hearing and speech-impaired people.

These areas require the system to work online and be robust to background, scenes, subjects, and lighting conditions. These and several other problems inspired us to create a new HGR dataset.

HaGRID (HAnd Gesture Recognition Image Dataset) is one of the largest datasets for HGR systems. This dataset contains 552,992 Full HD RGB images divided into 18 classes of gestures. We are especially focused on interaction with devices to manage them. That is why all 18 chosen gestures are functional, familiar to the majority of people, and may be an incentive to take some action.

“inv.” is the abbreviation of “inverted”

We used crowdsourcing platforms to collect the dataset and took into account various parameters to ensure data diversity. The dataset contains 34,730 unique scenes. It was collected mainly indoors with considerable variation in lighting, including artificial and natural light. In addition, the dataset includes images taken in extreme conditions, such as subjects facing a window or standing with their back to it. The subjects also had to show gestures at a distance of 0.5 to 4 meters from the camera.

HaGRID can be used for two HGR tasks, hand detection and gesture classification, and for an additional task: leading hand search. The annotations consist of hand bounding boxes in COCO format [top left X position, top left Y position, width, height] with gesture labels. The annotations also mark the leading hand (leading_hand, left or right, for the hand showing the gesture) and give leading_conf, the confidence of the leading_hand annotation. We provide a user_id field that allows you to split the dataset into train and test sets yourself.
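
As a small illustration of the bounding box layout described above, the following sketch converts a COCO-style box [x, y, width, height] into corner coordinates. The record is invented for illustration. The key names `bboxes` and `labels` are assumptions, and the exact file schema, as well as whether coordinates are stored in pixels or relative to the image size, should be checked against the repository.

```python
from typing import List, Tuple


def coco_to_corners(box: List[float]) -> Tuple[float, float, float, float]:
    """Convert [top-left x, top-left y, width, height] to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


# Hypothetical annotation record; all values are made up for illustration.
annotation = {
    "bboxes": [[100.0, 50.0, 40.0, 60.0]],
    "labels": ["stop"],
    "leading_hand": "right",
    "leading_conf": 1.0,
    "user_id": "example-user",
}

for box, label in zip(annotation["bboxes"], annotation["labels"]):
    print(label, coco_to_corners(box))
```

Corner coordinates are often more convenient for drawing boxes or computing overlaps, while the width/height form is what the annotations store.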

Keep in mind that the proposed dataset contains some gestures in two positions: the front and the back of the hand. This allows dynamic gestures to be interpreted as a sequence of two static gestures. For example, with the gestures stop and stop inverted you can design the dynamic gestures swipe up (stop upside down, i.e., stop rotated by 180 degrees, as the start of the sequence and stop inverted as the end) and swipe down (stop as the start of the sequence and stop inverted upside down, i.e., stop inverted rotated by 180 degrees, as the end). You can also get two more dynamic gestures, swipe right and swipe left, with 90-degree rotation augmentation.

The leading_hand field was added to the annotations to help interpret dynamic gestures through static ones: swipe up and swipe down can be shown with one hand, while swipe right and swipe left are hard to show without using the second hand. If the horizontal static gestures stop and stop inverted are shown with the left hand, the dynamic gesture is swipe right; otherwise it is swipe left. Leading hand labels can also be used to derive two gestures from one. For example, three shown with the right hand and three shown with the left hand can be treated as two different gestures, right three and left three.
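
The leading-hand rule for horizontal swipes can be sketched as a small function over per-frame predictions. This is only an illustration under several assumptions: the label strings `stop` and `stop_inverted`, the function name `horizontal_swipe`, and the requirement that stop comes before stop inverted are all hypothetical, and the frames are assumed to come from horizontally oriented (90-degree rotated) gestures.

```python
from typing import Optional, Sequence


def horizontal_swipe(labels: Sequence[str], leading_hand: str) -> Optional[str]:
    """Map a sequence of static gesture labels to a horizontal swipe, if any.

    `labels` holds one predicted static gesture per frame, in time order.
    """
    # Collapse consecutive repeats so only gesture changes remain.
    collapsed = []
    for label in labels:
        if not collapsed or collapsed[-1] != label:
            collapsed.append(label)

    # Look for the assumed start/end pair: stop followed by stop_inverted.
    for first, second in zip(collapsed, collapsed[1:]):
        if (first, second) == ("stop", "stop_inverted"):
            # Left hand means swipe right, otherwise swipe left.
            return "swipe right" if leading_hand == "left" else "swipe left"
    return None


print(horizontal_swipe(["stop", "stop", "stop_inverted"], "left"))
print(horizontal_swipe(["stop", "stop_inverted"], "right"))
```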

Links to download the HaGRID are publicly available in the repository.

The dataset was collected in 4 stages: (1) a gesture image collection stage called mining, (2) a validation stage where mining rules and some other conditions are checked, (3) filtration of inappropriate images, and (4) an annotation stage for marking up bounding boxes and leading hands. The classification stage is built into the mining and validation pipelines by splitting pools for each gesture class.

  1. Mining. The crowd workers’ task was to take a photo of themselves showing the particular gesture indicated in the task description. We defined the following criteria: (1) the subject must be at a distance of 0.5 to 4 meters from the camera, and (2) the hand with the gesture (i.e., the leading hand) must be completely in the frame. Sometimes, subjects received a task to take a photo in low light or against a bright light source, to make the neural network resilient to extreme conditions. All received images were also checked for duplicates using image hash comparison.
  2. Validation. We implemented the validation stage to obtain high-confidence images, because users tried to cheat the system at the mining stage. The goal of the validation stage is to select the images from the mining stage that were executed correctly.
  3. Filtration. Images of children, people without clothes, and images with inscriptions were removed from HaGRID at this stage for ethical reasons.
  4. Annotation. At this stage, crowd workers drew a red bounding box around the gesture in each image, and a green bounding box around the hand without the gesture if it was completely in the frame. The different colors are needed so that the boxes can later be translated into labels.
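
The duplicate check mentioned in the mining stage can be sketched with a plain content hash. The article does not say which hash was used, so the SHA-256 digest below is an assumption; it only catches byte-identical files, whereas a perceptual hash would be needed to catch resized or re-encoded copies. The function name and sample data are hypothetical.

```python
import hashlib
from typing import Dict, List


def find_duplicates(images: Dict[str, bytes]) -> List[str]:
    """Return names of images whose bytes match an earlier image."""
    seen: Dict[str, str] = {}  # digest -> name of the first image with it
    duplicates: List[str] = []
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            duplicates.append(name)
        else:
            seen[digest] = name
    return duplicates


# Stand-in byte strings instead of real image files.
uploads = {"a.jpg": b"\x01\x02", "b.jpg": b"\x03", "c.jpg": b"\x01\x02"}
print(find_duplicates(uploads))
```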

Detailed information about the dataset and mining is provided in our paper.

In this section we provide a tutorial that shows how to use HaGRID to build an HGR system that can detect hands with gestures and classify them. In addition, we explain how to build a model with two heads, where the second head predicts the leading hand.

Let us start with importing all the necessary libraries:

HaGRID is divided into 18 classes of gestures and one no gesture class. We do not train for many epochs, because this tutorial uses a subsample of the dataset and the model learns quickly. Before training, the model is moved to the chosen device.

We implement a class GestureDataset that inherits from the Dataset type and defines the data reading and preprocessing functions:

In addition, we implement our own class ToTensor for get_transform() function:

We specify two different datasets, one to train the model (the training set) and the other to test it (the test set). The two calls differ in a single parameter, is_train, which splits the whole dataset into two parts by user, using a hash of user_id. If you have more data, you can further split the training set into training and validation sets.
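
One way to make such a user-level split reproducible is to hash user_id and route each user by the hash value, so that all images of one person land on the same side. The sketch below is an illustration, not the tutorial's GestureDataset code: the function name `is_train_user`, the SHA-256 hash, and the 80/20 ratio are all assumptions.

```python
import hashlib


def is_train_user(user_id: str, train_fraction: float = 0.8) -> bool:
    """Deterministically assign a user to the train (True) or test (False) part."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return bucket < train_fraction * 100


# Hypothetical user ids; real ids come from the user_id annotation field.
users = [f"user_{i}" for i in range(200)]
train_users = {u for u in users if is_train_user(u)}
test_users = set(users) - train_users
print(len(train_users), len(test_users))
```

Because the assignment depends only on user_id, the same user never appears in both parts, which avoids leaking a person's appearance or background from training into testing.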

The dataset/ directory has the following structure:

The output of the implemented GestureDataset will be the following:

Let’s create short class names for ease of visualization of images and confusion matrix:

Let us try to visualize multiple images to make sure the data is processed correctly:

Finally, we can put our pictures into a DataLoader.

We implement a class ResNet18 to add the second head to the pre-trained torchvision ResNet18 model.

We will train our model using SGD with a momentum of 0.9 and a weight decay of 0.0005 as the optimizer. The learning rate starts at 0.005. The cross-entropy loss function is chosen as the criterion.
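
To make the role of these hyperparameters concrete, here is a plain-Python sketch of one SGD update with momentum and weight decay, using the common formulation in which weight decay is added to the gradient before the momentum buffer is updated. It is a toy illustration on a list of scalars, not the tutorial's optimizer code, and the function name is hypothetical.

```python
from typing import List, Tuple


def sgd_momentum_step(
    params: List[float],
    grads: List[float],
    velocity: List[float],
    lr: float = 0.005,
    momentum: float = 0.9,
    weight_decay: float = 0.0005,
) -> Tuple[List[float], List[float]]:
    """Return updated (params, velocity) after one SGD step."""
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p  # L2 weight decay folded into the gradient
        v = momentum * v + g      # momentum buffer accumulates past gradients
        new_params.append(p - lr * v)
        new_velocity.append(v)
    return new_params, new_velocity


params, velocity = [1.0, -2.0], [0.0, 0.0]
params, velocity = sgd_momentum_step(params, grads=[2.0, 0.5], velocity=velocity)
print(params, velocity)
```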

Next, we iterate over batches of images in the train_dataloader. The whole pipeline is standard, except that the loss is calculated for two tasks and the two values are summed. The code:
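
The idea of summing the two losses can be shown without any deep learning framework. The sketch below computes cross-entropy from raw logits for a single sample and adds the gesture loss and the leading hand loss with equal weight. The function names and the example logits are hypothetical, and equal weighting is an assumption based on the description above.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one sample from raw logits (log-softmax, then NLL)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def two_task_loss(
    gesture_logits: Sequence[float],
    gesture_target: int,
    hand_logits: Sequence[float],
    hand_target: int,
) -> float:
    """Sum of the gesture loss and the leading hand loss."""
    return cross_entropy(gesture_logits, gesture_target) + cross_entropy(
        hand_logits, hand_target
    )


# Toy example: 3 gesture classes and 2 leading hand classes.
loss = two_task_loss([2.0, 0.5, -1.0], 0, [0.1, 1.2], 1)
print(round(loss, 4))
```

In a training loop the summed value would be the quantity that is backpropagated, so both heads update the shared backbone.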

Let’s evaluate the trained model on our test_dataloader. As with the training code, the following code differs from the standard pipeline only in that it calculates two metrics:

Model evaluation returned an F1 score of 93.8% for each task. Given the following confusion matrix, we can conclude that all classes are well separated from each other.
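
For readers who want to see what the metric measures, the following is a plain-Python sketch of an F1 score computed per class and then averaged. The article does not state which averaging was used, so macro averaging is an assumption here, and the labels are toy data rather than model output.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: one score per task, computed the same way.
gesture_f1 = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
hand_f1 = macro_f1([0, 1, 0, 1], [0, 1, 0, 1])
print(gesture_f1, hand_f1)
```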

The whole dataset, the trial version with 100 images per class, pre-trained models, and the demo are publicly available in the repository.

Other links: arXiv, Kaggle, Habr
