Building Data Annotation Pipelines | Ahmad Baracat | Skillshare

Ahmad Baracat, Facebook / ex-Amazon Alexa

Lessons in This Class

8 Lessons (17m)
    • 1. Introduction
    • 2. What & Why
    • 3. Golden Standard (GS)
    • 4. Training the Annotators
    • 5. Labeling New Data (Testing)
    • 6. Dataset Shift
    • 7. Project
    • 8. Conclusion

About This Class

You will learn:

  1. what a data annotation (or labelling) pipeline is
  2. why you need one
  3. the steps needed to build a data annotation pipeline
  4. related topics such as dataset shift

Even though this course doesn’t require prior experience beyond an interest in data science and machine learning, I highly recommend watching it when you are faced with a data labelling challenge, for example when you want to label your own data. That way you can directly apply what you have learnt.

Background Music:

Sunshine (version 2) by Kevin MacLeod



Meet Your Teacher

Ahmad Baracat

Facebook / ex-Amazon Alexa


I am currently a Software Engineer at Facebook. I used to work at Amazon Alexa solving computer vision problems using deep learning. In 2017, I was awarded a silver medal in the Understanding the Amazon from Space Kaggle Competition. Currently, I am building my AgriTech startup Priceless AI.

A few years back, I created 10+ apps/games for Windows Phone, Android & iOS with 300K+ customers, featured by Microsoft in 150+ countries. You can have a look at my games and apps on my personal website.


1. Introduction: Hello everyone. I am Ahmad. I've been working at Amazon and Facebook for the past four to five years on different projects involving machine learning. Today I'm going to walk you through what you need to know, and what I've learned in the past few years, about building data annotation pipelines. We're going to talk about what a data annotation pipeline actually means, when you need one, why you need one, and the different stages you go through to build one. Finally, we'll touch upon the considerations you need to be aware of while building one. In essence, we're going to talk about golden standard datasets, how to deal with bad-day annotations, and how to train the annotators by following an objective process. At the end of the course, there is a mini-project where you're going to apply what you've learned hands-on to your specific domain or problem space. I'm very excited to share this with you, so let's start.

2. What & Why: Okay, let's begin. The first thing I want to talk about is what a data labeling pipeline actually is. A data labeling pipeline has several parts. The general idea is that unlabeled data comes in to annotators, these humans label the data, and the labeled data is then processed by different stages until we eventually get the final labels. So why do we need all of this? Let me start by saying that if you are a small company, a start-up, or a small team, you may not need a data labeling pipeline from the start. It is an investment, and as such you need to justify it. Here are a few reasons that would justify investing in a data labeling flow or pipeline. The first is that you may have already exhausted the pre-trained models.
As a start-up or as a big company, we usually start by looking around and researching which pre-trained models we can use to solve a specific task. So we start with these models, but sometimes we want to squeeze out more performance: we want to improve the accuracy or other metrics. At some point, these pre-trained models may not be enough. The other thing is that sometimes you don't find a pre-trained model online, but you do find training data that you could use to train your own models. Even though this is a good starting point, you may exhaust this as well. The third possible reason is solving a new problem. If you are solving a new problem, you may not find a pre-trained model or training data, and as such you want to build your own process for labeling the data. And finally, and this is what we used at Amazon Alexa, you can build a data flow or data labeling pipeline to measure a model's online performance. We have a model that is doing whatever task, and we want to measure what is happening online. By online, I mean at testing time. For example, you may have an image understanding model that classifies whether an image is offensive or not. You want to sample from the test data, the data that the model is running on online, and measure the performance. The way to do that is by sampling the data, labeling it using a manual workforce, and measuring the accuracy of the model over time.

3. Golden Standard (GS): Let's start with the first component, the golden standard dataset. This is a dataset that is built by product owners. By product owners, I mean people who are familiar with the domain space and can actually label the first batch of data. At this stage, you want to make sure that the golden standard dataset is not biased.
So you don't want it to be labeled by one single person. You want it to be as diversified as possible, because we will be building upon the golden standard dataset later on. So what is this golden standard dataset? It is a big enough dataset, where "enough" depends on your problem space. If your problem space is simple, you may use something like 100 labeled samples in the golden standard dataset. The more complex the problem space, and the greater the variety of cases that people need to be aware of, the bigger this dataset needs to be. It contains the ground truth labels, that is, what we think is the right answer to the problem at hand. And finally, this golden standard will be used to judge annotators at different stages of building the pipeline, at training and at testing. I'm going to go into detail later.

4. Training the Annotators: So then let's talk about annotators. These are humans who are going to label the data by hand. Depending on the size of your company, the stage you're in, or how critical building a data pipeline is for your team, it could be an in-house team: you could invest in having full-time or part-time employees just to label data. Alternatively, you could outsource this task to something like Mechanical Turk or other services. Because these annotators have never seen this task before and are not subject matter experts, we need to train them. The way we're going to do that is by thinking about the whole data flow pipeline. We're going to first train the annotators and validate their performance, and then use them at test time, essentially to label unlabeled data. And because labeling is subjective, we want to move it as much as possible toward being an objective task. So we want to build a sustainable process by which we label the data.
The first thing we will need is a guideline document, a document that highlights the process that an annotator should follow when labeling new data. The whole process of training annotators then boils down to nailing the guideline document. So let's go deep into the guideline document. It will contain some examples from the golden standard dataset and what the annotator should do in these cases. We start with, let's say, V1 of the guideline document. Then we give the annotator V1 of the guideline document and another subset of the golden standard, made up of samples that were not added to the guideline document. So the annotator is given the guidelines and golden standard data that is unlabeled from their point of view, and they produce the labels. Finally, we measure the agreement between the golden standard labels, which we know, and the new labels that the annotator has produced. Once we have measured that agreement, we either iterate by changing the descriptions and adding new samples to highlight subtle differences that we want the annotators to be aware of, or we mark the whole process as a success and say that we have trained annotators who know what to do. Let's touch very quickly upon the metric. We could use something like the kappa score to judge how the annotators are performing, but any agreement metric would do, that is, any metric that can measure the agreement between two sets of labels: in this case a subset of the golden standard, to be precise, and the annotator's labels. Once we are happy with the agreement, we can move to the next step.

5. Labeling New Data (Testing): The next step essentially is testing. Now that we have trained the annotators and validated their performance, we want to start using them, so we start testing.
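The agreement check from the training lesson can be sketched in a few lines of Python. This is a minimal illustration that assumes Cohen's kappa is the "kappa score" meant in the lecture; the function name, the example labels, and the 0.6 threshold are all hypothetical and should be chosen for your own problem space.

```python
from collections import Counter


def cohen_kappa(gold_labels, annotator_labels):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(gold_labels) != len(annotator_labels) or not gold_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(gold_labels)
    # Observed agreement: fraction of samples where both labels match.
    observed = sum(g == a for g, a in zip(gold_labels, annotator_labels)) / n
    # Expected agreement by chance, from each side's label frequencies.
    gold_freq = Counter(gold_labels)
    ann_freq = Counter(annotator_labels)
    expected = sum(gold_freq[c] * ann_freq[c] for c in gold_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical held-out golden standard subset and one annotator's answers.
gold = ["cat", "cat", "dog", "dog", "cat", "dog"]
annotator = ["cat", "cat", "dog", "cat", "cat", "dog"]

kappa = cohen_kappa(gold, annotator)
TRAINED_THRESHOLD = 0.6  # assumption: pick a value that suits your task
is_trained = kappa >= TRAINED_THRESHOLD
print(f"kappa={kappa:.3f}, trained={is_trained}")
```

If the score falls below the threshold, the guideline document gets another iteration; otherwise the annotator moves on to labeling new data.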
The way to do that is by labeling unlabeled data, data that we don't know the labels for. The term "bad-day annotations" refers to the fact that humans, being humans, can make mistakes. They could be having a bad day, and this affects their labeling performance. As we did in training, although there are subtle differences, the golden standard will be used to judge whether or not the performance of an annotator should be trusted on that given day. Remember, all of these annotators have already been trained, so we know that they know the process. It is just that we don't trust their annotations. It is good practice not to trust any annotations and to try as much as possible to build a process that catches mislabeled data. So moving on, we want to start labeling unseen data. We send the sample plus a subset of the golden standard to an odd number of annotators. An odd number of annotators makes it easy to judge later on which label should be assigned to a sample, because we can use something as simple as majority voting. Now that we have a good view of the different components, let's talk about the whole process. This is the testing process. We want to label the unlabeled data shown on the bottom left of the diagram. We augment it with a golden standard subset, something that the annotators have not seen before during training, during validation, or in the guidelines, and we pass it to an odd number of annotators, who produce labels. We then measure the agreement between the golden standard and each annotator's labels. If there is no strong agreement, we discard all of that annotator's labels. This is key: all of their labels at this point in time are considered to be garbage. If there is strong agreement, we keep the annotator's labels, and then we do majority voting for each sample to produce the final labels. So let's talk about a subtle thing.
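The testing flow just described (mix in a hidden golden standard subset, drop annotators who disagree with it, then majority-vote the rest) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the lecture: it uses a plain fraction-of-matches agreement instead of kappa to stay short, and the function names, item ids, labels, and the 0.8 threshold are hypothetical.

```python
from collections import Counter


def gold_agreement(gold, labels):
    """Fraction of hidden golden standard items the annotator got right."""
    return sum(labels.get(item) == answer for item, answer in gold.items()) / len(gold)


def label_batch(gold, annotations, min_agreement=0.8):
    """Discard untrusted annotators, then majority-vote each new item."""
    # Keep only annotators who agree strongly with the golden standard today.
    trusted = {
        name: labels
        for name, labels in annotations.items()
        if gold_agreement(gold, labels) >= min_agreement
    }
    # New items are everything the trusted annotators saw, minus the gold items.
    new_items = sorted({item for labels in trusted.values() for item in labels} - set(gold))
    final = {}
    for item in new_items:
        votes = Counter(labels[item] for labels in trusted.values() if item in labels)
        # Note: ties are not handled here; the first label counted wins.
        final[item] = votes.most_common(1)[0][0]
    return final, sorted(trusted)


# Hypothetical batch: g* items are golden standard, x* items are unlabeled data.
gold = {"g1": "ok", "g2": "offensive", "g3": "ok"}
annotations = {
    "ann_a": {"g1": "ok", "g2": "offensive", "g3": "ok", "x1": "ok", "x2": "offensive"},
    "ann_b": {"g1": "ok", "g2": "offensive", "g3": "ok", "x1": "ok", "x2": "ok"},
    "ann_c": {"g1": "offensive", "g2": "ok", "g3": "ok", "x1": "offensive", "x2": "ok"},
    "ann_d": {"g1": "ok", "g2": "offensive", "g3": "ok", "x1": "ok", "x2": "offensive"},
    "ann_e": {"g1": "ok", "g2": "offensive", "g3": "ok", "x1": "ok", "x2": "offensive"},
}

final_labels, trusted_annotators = label_batch(gold, annotations)
print(final_labels)
print(trusted_annotators)
```

In this made-up batch, `ann_c` matches only one of three golden standard items, so all of that annotator's labels are dropped before voting.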
What happens when you discard an annotator? Remember, we started with an odd number. If we remove a single annotator, we're left with an even number of annotators, meaning that majority voting can break. In practice this is rarely a problem, because if the process of training the annotators was a success, most of the data points in a sample will have strong agreement between the annotators on what label to use. Nevertheless, there might be a few data points where they didn't agree on what to do. One solution is to split them: take a few of these data points and get subject matter experts to look at them, figure out why they were different from the other data points, and add them to either the guidelines or the golden standard dataset. Then take the rest and do another pass through the testing pipeline, using different annotators, for example. By iterating on this, we eventually reach a point where the golden standard and the guidelines capture enough subtleties to allow the annotators to label with confidence, and we won't have many data points that annotators don't agree on.

6. Dataset Shift: Finally, I want to talk about dataset shift. The golden standard is more or less set in stone, and the guidelines are as well, but the testing data is not. Usually you will see that, over time, the data that is fed to the annotators starts to shift. People refer to this as dataset shift. The way to combat it is to have a process by which a small subset of the newly labeled data is analyzed by subject matter experts and fed back into the golden standard. This is similar to the point I was making earlier. In some cases, you want to start sampling from the test data, the live data, the online data.
You want to sample a tiny bit of this data and get subject matter experts to look at it and analyze it, maybe on a quarterly or yearly basis. If they see that the data is starting to shift, then we need to go back and train, validate, improve the guidelines, add some of the new samples to the golden standard, and so on.

7. Project: For this course's project, we want to sketch our own pipeline. Essentially, I want you to build something similar to what I did for training, validation, and testing. You can use something like draw.io, which is what these diagrams were built on, or any other diagram-building tool. The key point is that I want you to focus on your own problem space. For example: does it make sense to do majority voting? Does it make sense to have an odd number of annotators? Does it make sense to use kappa as the agreement metric? What other metrics can you use? If you don't have a problem of your own, think about a problem space you are interested in. Then sketch these diagrams, taking these new details into account. Feel free to move blocks around if that makes sense for your problem space, and feel free to invent new blocks for other processes or steps the data needs to go through, or other subtleties that need to be taken into account when tackling your specific problem. Once you have done that, and it shouldn't be a big exercise, please post the diagrams, both if possible, in the project upload section on Skillshare for other students and myself to benefit. Thank you for taking the time. I hope this exercise will get you to notice things, because when you do things by hand, you start to notice things that may have gone unnoticed while just watching me lecture. So please go through this exercise. It is useful. See you in the next video.

8. Conclusion: In conclusion, we talked about the different components we need to build a data flow or data annotation pipeline. We also talked about the different stages, from training and validating the annotators, the whole process, and the guidelines we're going to use, to testing, which involves labelling unseen data. We also touched upon the fact that data shifts, and that there will be data points that the annotators don't know what to do with or don't agree on the labels for. This is another consideration that we need to take into account: for example, pass them through subject matter experts, add them to the golden standard, improve the guidelines, and so on. Hopefully you have found this useful, and hopefully it will help you build more reliable data products involving machine learning or other tools. Finally, if you're seeing this on Skillshare, I'd appreciate it if you could leave a comment on what you found useful and how I can improve my communication skills. Thank you.
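As a companion to the recap, here is one possible way to separate the data points annotators agree on from the ones that should go to subject matter experts, including the ties that can appear once a discarded annotator leaves an even number of voters. It is a sketch only: the function name, the strict-majority rule, and the example votes are assumptions, not something specified in the class.

```python
from collections import Counter


def split_by_consensus(votes_per_item, min_share=0.5):
    """Return (resolved labels, items to send to subject matter experts)."""
    resolved, needs_review = {}, []
    for item, votes in votes_per_item.items():
        label, count = Counter(votes).most_common(1)[0]
        # Require a strict majority; a 2-2 tie has a share of exactly 0.5.
        if count / len(votes) > min_share:
            resolved[item] = label
        else:
            needs_review.append(item)
    return resolved, needs_review


# Hypothetical votes from the four annotators left after one was discarded.
votes = {
    "x1": ["ok", "ok", "ok", "ok"],
    "x2": ["ok", "offensive", "offensive", "ok"],
    "x3": ["offensive", "offensive", "ok", "offensive"],
}

resolved, needs_review = split_by_consensus(votes)
print(resolved)
print(needs_review)
```

Items in `needs_review` are candidates for expert analysis, for new guideline examples or golden standard entries, or for another pass with different annotators.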