
Dataset Cleaning Redefined: The Power of Job Chaining


1. Introduction

In the previous post (link) we introduced our human-plus-AI joint labeling tool, "Model in the Loop" (MITL), and showcased some of its labeling capabilities on one class of the Objects365 dataset. In this post, we introduce Image Labeling Multiple Choice, a different kind of job in our system, together with the concept of chaining labeling jobs with the Workflow Orchestrator. With job chaining, it is possible to perform more complex tasks, such as hierarchical labeling, or to collect more fine-grained annotations by filtering images by label and applying a dedicated job to each label. This time, to keep the focus on job chaining, we will use only human labelers through HUMAN Protocol rather than humans and AI; keep in mind, though, that MITL can be used in any of these jobs.

2. Multi-class image classification

Image Labeling Multiple Choice (ILMC) is a type of job that lets us label an image with one of 2 to 4 different labels. There are two main differences between the manifest used to launch an ILMC job and the manifest used for the image label binary (ILB) job from the previous post. First, "job_type" should be set to "image_label_multiple_choice" instead of "image_label_binary". Second, in the ILB job we provided a list of three example images, all of them for the positive class; for the ILMC job, we provide one example image per class.

Now let's look at an example, using some images from the Objects365 dataset. Say we have some pictures and we would like to know which of them contain cows, cats, or dogs. To do so, we can run an ILMC job with the following manifest.

{
    "jobs": [
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Cow": {
                    "answer_example_uri": "https://someurl/images/Cow/1032219.jpg",
                    "en": "Cow"
                },
                "Cat": {
                    "answer_example_uri": "https://someurl/images/Cat/1023885.jpg",
                    "en": "Cat"
                },
                "Dog": {
                    "answer_example_uri": "https://someurl/images/Dog/1027230.jpg",
                    "en": "Dog"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_2_gt.json",
            "job_identifier": "animals_job",
            "platform": "human"
        }
    ],
    "config": {
        "taskdata_uri": "https://someurl/tasks.json",
        "restricted_audience": {}
    }
}
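As a quick sanity check, a manifest like this can be validated locally before launch. The sketch below is only an illustration (the platform performs its own validation on submission); it checks the fields shown above and the 2-to-4 label limit described earlier:

import json

# Minimal pre-launch checks for an ILMC manifest. Illustrative only:
# these mirror the manifest fields shown above, and the platform still
# runs its own validation on submission.

REQUIRED_JOB_KEYS = {
    "job_type", "question", "labels",
    "groundtruth_uri", "job_identifier", "platform",
}

def validate_manifest(path):
    with open(path) as f:
        manifest = json.load(f)  # also catches JSON syntax errors
    for job in manifest["jobs"]:
        missing = REQUIRED_JOB_KEYS - job.keys()
        if missing:
            raise ValueError(f"{job.get('job_identifier')}: missing keys {missing}")
        if job["job_type"] == "image_label_multiple_choice":
            n = len(job["labels"])
            if not 2 <= n <= 4:  # ILMC supports 2 to 4 labels per job
                raise ValueError(f"{job['job_identifier']}: ILMC needs 2-4 labels, got {n}")

validate_manifest("manifest.json")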

We also need to take into account that an image might not always contain a cow, a cat, or a dog. Asking a labeler to pick one of these options when none applies will lead to unexpected results, as they will have to select an answer at random to pass the captcha. To avoid this, we can add an extra class to cover that scenario. We will call this class "Other", and the resulting manifest looks like this.

{
    "jobs": [
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Cow": {
                    "answer_example_uri": "https://someurl/images/Cow/1032219.jpg",
                    "en": "Cow"
                },
                "Cat": {
                    "answer_example_uri": "https://someurl/images/Cat/1023885.jpg",
                    "en": "Cat"
                },
                "Dog": {
                    "answer_example_uri": "https://someurl/images/Dog/1027230.jpg",
                    "en": "Dog"
                },
                "Other": {
                    "answer_example_uri": "https://someurl/ilmc-empty-option.jpg",
                    "en": "Other"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_2_gt.json",
            "job_identifier": "animals_job",
            "platform": "human"
        }
    ],
    "config": {
        "taskdata_uri": "https://someurl/tasks.json",
        "restricted_audience": {}
    }
}

Finally, the challenge for this job will look as follows:

(Screenshot: the ILMC challenge as shown to labelers, displaying the image alongside the four answer options.)

3. Job chaining

The complexity of labeling jobs can escalate very quickly. For example, if we want bounding boxes for all the cats in a dataset, it is convenient to first know which images contain a cat and to request bounding boxes only for those images. This way we avoid asking labelers confusing questions, such as drawing a bounding box around a cat when there is no cat in the image.

To solve this labeling problem, we need to launch two jobs and use the results of the classification job as the input for the bounding box job. We call this operation "job chaining", and it is currently available as a feature of Factored Cognition.

3.1. Labeling more than 4 classes

As mentioned before, at the time of writing, ILMC jobs are limited to 4 classes. To go beyond this limit, we can create two labeling jobs and chain them together so that we avoid labeling every image twice.

For example, let's assume we have a dataset of unknown images, and on top of the animals labeled in the previous jobs, we want to label some instruments: guitars, pianos, and drums. To do so, we can launch two jobs: in the first, we label the animals, and in the second, the instruments.


{
    "jobs": [
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Cow": {
                    "answer_example_uri": "https://someurl/images/Cow/1032219.jpg",
                    "en": "Cow"
                },
                "Cat": {
                    "answer_example_uri": "https://someurl/images/Cat/1023885.jpg",
                    "en": "Cat"
                },
                "Dog": {
                    "answer_example_uri": "https://someurl/images/Dog/1027230.jpg",
                    "en": "Dog"
                },
                "Other": {
                    "answer_example_uri": "https://someurl/ilmc-empty-option.jpg",
                    "en": "Other"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_2_gt.json",
            "job_identifier": "animals_job",
            "platform": "human"
        },
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Guitar": {
                    "answer_example_uri": "https://someurl/images/Guitar/375964.jpg",
                    "en": "Guitar"
                },
                "Piano": {
                    "answer_example_uri": "https://someurl/images/Piano/1002675.jpg",
                    "en": "Piano"
                },
                "Drum": {
                    "answer_example_uri": "https://someurl/images/Drum/530108.jpg",
                    "en": "Drum"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_3_gt.json",
            "job_identifier": "instruments_job",
            "platform": "human"
        }
    ],
    "config": {
        "taskdata_uri": "https://someurl/tasks.json",
        "restricted_audience": {}
    }
}

We still have a problem: each image is labeled twice, even though the first job already tells us which images contain animals. To avoid annotating each image twice, we can filter the images labeled "Other" in the first job and use only those as the input for the second job. We also add the "Other" class to the second job, in case an image contains none of the mentioned animals or instruments.


{
    "jobs": [
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Cow": {
                    "answer_example_uri": "https://someurl/images/Cow/1032219.jpg",
                    "en": "Cow"
                },
                "Cat": {
                    "answer_example_uri": "https://someurl/images/Cat/1023885.jpg",
                    "en": "Cat"
                },
                "Dog": {
                    "answer_example_uri": "https://someurl/images/Dog/1027230.jpg",
                    "en": "Dog"
                },
                "Other": {
                    "answer_example_uri": "https://someurl/ilmc-empty-option.jpg",
                    "en": "Other"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_2_gt.json",
            "job_identifier": "animals_job",
            "platform": "human"
        },
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Guitar": {
                    "answer_example_uri": "https://someurl/images/Guitar/375964.jpg",
                    "en": "Guitar"
                },
                "Piano": {
                    "answer_example_uri": "https://someurl/images/Piano/1002675.jpg",
                    "en": "Piano"
                },
                "Drum": {
                    "answer_example_uri": "https://someurl/images/Drum/530108.jpg",
                    "en": "Drum"
                },
                "Other": {
                    "answer_example_uri": "https://someurl/ilmc-empty-option.jpg",
                    "en": "Other"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_3_gt.json",
            "job_identifier": "instruments_job",
            "platform": "human",
            "inputs": [
                {
                    "job_identifier": "animals_job",
                    "filter": {
                        "labels": [
                            "Other"
                        ]
                    }
                }
            ]
        }
    ],
    "config": {
        "taskdata_uri": "https://someurl/tasks.json",
        "restricted_audience": {}
    }
}

After this, the only images that get labeled twice are the ones labeled "Other" in the first job, since those are the ones passed on to the second job.
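Conceptually, the filtering between the two jobs works like the short sketch below. The results format here is hypothetical (a mapping from task to collected label); it only illustrates what the "inputs" filter does:

# Hypothetical results of the first (animals) job: task -> collected label.
animals_results = {
    "img_001.jpg": "Cow",
    "img_002.jpg": "Other",
    "img_003.jpg": "Cat",
    "img_004.jpg": "Other",
}

# The "inputs" filter forwards only the tasks labeled "Other" to the
# second job, so images already identified as animals are not labeled again.
instruments_tasks = [task for task, label in animals_results.items()
                     if label == "Other"]
print(instruments_tasks)  # ['img_002.jpg', 'img_004.jpg']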

3.2. Hierarchical image classification

Another scenario where we may want to apply job chaining is hierarchical labeling. For the animals and instruments above, we could first launch a job to label each image as animal, instrument, or "Other". Then, in a second job, we filter the animal images and label cows, cats, and dogs, and in a third job, we filter the instrument images and label guitars, pianos, and drums.

This allows us to get more complete labeling for the images, since each one receives usable categories at more than one level of the hierarchy. It also simplifies the experience for labelers, who receive pre-filtered images with a smaller set of options to choose from.


{
    "jobs": [
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Animal": {
                    "answer_example_uri": "https://someurl/images/Cow/1032219.jpg",
                    "en": "Animal"
                },
                "Instrument": {
                    "answer_example_uri": "https://someurl/images/Guitar/375964.jpg",
                    "en": "Instrument"
                },
                "Other": {
                    "answer_example_uri": "https://someurl/ilmc-empty-option.jpg",
                    "en": "Other"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_1_gt.json",
            "job_identifier": "animals_vs_instruments_job",
            "platform": "human"
        },
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Cow": {
                    "answer_example_uri": "https://someurl/images/Cow/1032219.jpg",
                    "en": "Cow"
                },
                "Cat": {
                    "answer_example_uri": "https://someurl/images/Cat/1023885.jpg",
                    "en": "Cat"
                },
                "Dog": {
                    "answer_example_uri": "https://someurl/images/Dog/1027230.jpg",
                    "en": "Dog"
                },
                "Other": {
                    "answer_example_uri": "https://someurl/resources/ilmc-empty-option.jpg",
                    "en": "Other"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_2_gt.json",
            "job_identifier": "animals_job",
            "platform": "human",
            "inputs": [
                {
                    "job_identifier": "animals_vs_instruments_job",
                    "filter": {
                        "labels": [
                            "Animal"
                        ]
                    }
                }
            ]
        },
        {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                "Guitar": {
                    "answer_example_uri": "https://someurl/images/Guitar/375964.jpg",
                    "en": "Guitar"
                },
                "Piano": {
                    "answer_example_uri": "https://someurl/images/Piano/1002675.jpg",
                    "en": "Piano"
                },
                "Drum": {
                    "answer_example_uri": "https://someurl/images/Drum/530108.jpg",
                    "en": "Drum"
                },
                "Other": {
                    "answer_example_uri": "https://someurl/resources/ilmc-empty-option.jpg",
                    "en": "Other"
                }
            },
            "groundtruth_uri": "https://someurl/labeling_job_3_gt.json",
            "job_identifier": "instruments_job",
            "platform": "human",
            "inputs": [
                {
                    "job_identifier": "animals_vs_instruments_job",
                    "filter": {
                        "labels": [
                            "Instrument"
                        ]
                    }
                }
            ]
        }
    ],
    "config": {
        "taskdata_uri": "https://someurl/tasks.json",
        "restricted_audience": {}
    }
}

This is just an example of a simple hierarchy, but the same approach can follow complex hierarchies like WordNet, reducing the complexity of each labeling step while grouping related classes together.
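As an illustration of how this scales, the sketch below expands a small class hierarchy into chained ILMC job definitions: each internal node becomes one job, and every non-root job filters its input on its parent's label. The example URIs and the helper functions are our own placeholders, not a platform utility:

# Each internal node of the hierarchy becomes one ILMC job; each non-root
# job is chained to its parent's job, filtered on the parent's label.
HIERARCHY = {
    "Root": ["Animal", "Instrument", "Other"],
    "Animal": ["Cow", "Cat", "Dog", "Other"],
    "Instrument": ["Guitar", "Piano", "Drum", "Other"],
}

def parent_of(node):
    for parent, children in HIERARCHY.items():
        if node in children:
            return parent
    return None

def build_jobs():
    jobs = []
    for node, children in HIERARCHY.items():
        job = {
            "job_type": "image_label_multiple_choice",
            "question": "Select the most accurate description of the image.",
            "labels": {
                c: {"answer_example_uri": f"https://someurl/images/{c}.jpg", "en": c}
                for c in children
            },
            "groundtruth_uri": f"https://someurl/{node.lower()}_gt.json",
            "job_identifier": f"{node.lower()}_job",
            "platform": "human",
        }
        parent = parent_of(node)
        if parent is not None:
            job["inputs"] = [
                {"job_identifier": f"{parent.lower()}_job", "filter": {"labels": [node]}}
            ]
        jobs.append(job)
    return jobs

manifest = {
    "jobs": build_jobs(),
    "config": {"taskdata_uri": "https://someurl/tasks.json", "restricted_audience": {}},
}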

4. Results of the cleaning process

Following the previous examples, we ran the hierarchical labeling manifest on the HUMAN platform. Below we analyze some of the collected results, in particular labels that can be corrected in the original Objects365 dataset.

The example workflow uses a subset of images from Objects365. The goal is to validate the class associated with each bounding box. To do this, we cropped 3,000 bounding boxes from 6 different classes (cow, cat, dog, piano, drum, and guitar), with the number of images per class fixed at 500. We then classified the cropped images into the same 6 classes through HUMAN Protocol and compared the original label against the collected label.
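For reference, the cropping step can be written in a few lines of Python. This sketch assumes a COCO-style annotation file for Objects365 (bounding boxes stored as [x, y, width, height]); the file names and paths are placeholders:

import json
import random
from pathlib import Path
from PIL import Image

CLASSES = {"cow", "cat", "dog", "piano", "drum", "guitar"}
PER_CLASS = 500  # fixed number of crops per class

coco = json.loads(Path("objects365_annotations.json").read_text())
category_names = {c["id"]: c["name"].lower() for c in coco["categories"]}
image_files = {i["id"]: i["file_name"] for i in coco["images"]}

# Group the annotations by class, keeping only the six classes of interest.
by_class = {name: [] for name in CLASSES}
for ann in coco["annotations"]:
    name = category_names.get(ann["category_id"])
    if name in CLASSES:
        by_class[name].append(ann)

# Sample 500 annotations per class and save the cropped bounding boxes.
for name, anns in by_class.items():
    out_dir = Path("crops") / name
    out_dir.mkdir(parents=True, exist_ok=True)
    for ann in random.sample(anns, PER_CLASS):
        x, y, w, h = ann["bbox"]
        image = Image.open(Path("images") / image_files[ann["image_id"]])
        crop = image.crop((x, y, x + w, y + h)).convert("RGB")
        crop.save(out_dir / f"{ann['id']}.jpg")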


As a result, we found that 11 of the 3,000 images (about 0.37%) had an incorrect label.
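The comparison itself is straightforward once both labels sit side by side. In this sketch the rows are hypothetical (crop id, original Objects365 label, label collected through the chained jobs):

from collections import Counter

# Hypothetical rows: (crop_id, original_label, collected_label).
results = [
    ("1023885.jpg", "cat", "cat"),
    ("1047210.jpg", "cat", "other"),  # a mislabeled crop
    ("1032219.jpg", "cow", "cow"),
]

mismatches = [(orig, new) for _, orig, new in results if orig != new]
print(f"{len(mismatches)} of {len(results)} crops "
      f"({len(mismatches) / len(results):.2%}) have a different label")
print(Counter(mismatches))  # which original -> collected pairs disagree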

4.1. Cat bounding boxes

Of the cat annotations, five bounding boxes didn't contain a cat; instead, they contained planes.

4.2. Guitar bounding boxes

For the guitar annotations, one image didn't contain a guitar; instead, it contained hockey players.

4.3. Dog bounding boxes

For the dog annotations, three images didn't contain a dog; instead, each contained a horse.

4.4. Drum bounding boxes

For the drum annotations, one image didn't contain a drum; instead, it contained a knife.


4.5. Cow bounding boxes

For the cow annotations, one image didn't contain a cow; instead, it contained a horse.

5. Conclusion

In this post we have shown how to use ILMC and job chaining to clean a subset of 6 classes of the Objects365 dataset. We found that some of the labels were incorrect and can be fixed, even if they are a small percentage. This labeling example can be extended to clean the whole dataset across all of its classes, leading to better-quality data and potentially better model training.

The code and dataset used for this blog post can be found here. If you wish to learn more about labeling in detail, check out our website! And if you would like to use our services, feel free to reach out to our support team at https://www.imachines.com.
