How we measure Quality in Semantic Segmentation
What is a Dataset?
A dataset has two components:
Images which represent the input data
Segments of pixels which represent objects in the images
The scope of this article is limited to defining only the quality of ground truth segmentation in the dataset.
What does Quality of a Segment Mean?
A dataset may have thousands of images with millions of objects represented by billions of pixels. The color of each pixel in an image tells us which object that pixel represents. However, when evaluating a ground truth dataset, it is not humanly possible to check every single pixel for quality. Hence we evaluate a segment (a cluster of pixels) rather than each pixel.
Every object in an image can have one or more segments representing it using a particular color in a well-defined color map. The color of the segment and its contours should correctly give information about the following parameters.
Object Type or Class: Each segment should correctly classify the object it represents as per the color map. e.g. if the color map says that cars are denoted by red, then all segments representing cars should be in red.
Instance (only applicable for instance segmentation): In the case of instance-level segmentation, each instance of an object type should have a different color, and each of those colors should be captured in the color map. e.g. segments representing car1 should have a different color than segments representing car2.
Segment Boundary/Contour: The boundary of a segment should neither include pixels which don't belong to the object nor exclude pixels which do belong to it. Getting this right is one of the most time-consuming aspects of data labelling. Usually a buffer zone of a couple of pixels is tolerated when checking that the boundary is correct.
Object Attributes: If multi-channel segmentation is used to capture attributes, then each segment will have a different color which represents the attribute value. The color of the segment is checked to ensure it represents the correct attribute value as per the attribute color map. e.g. if every car also needs to carry information about its direction (same or opposite), the segment for every car should have the correct value for direction.
Missed Objects: If there is an object which can be identified by human eyes and is not segmented, then such objects are called missed objects.
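As a rough illustration of the color map idea, the sketch below scans a mask for colors that the color map does not define. The `COLOR_MAP` values, the mask layout (a list of rows of RGB tuples) and the function name `undefined_colors` are assumptions made for this example and do not describe Playment's tooling.

```python
# Hypothetical color map: class name -> RGB color (values are illustrative).
COLOR_MAP = {
    "car": (255, 0, 0),
    "sky": (0, 0, 255),
    "road": (128, 128, 128),
}


def undefined_colors(mask, color_map):
    """Return the set of colors in `mask` that no class in `color_map` uses."""
    known = set(color_map.values())
    found = {pixel for row in mask for pixel in row}
    return found - known


# A tiny 2x3 mask: one pixel uses a color that is not in the color map.
mask = [
    [(255, 0, 0), (255, 0, 0), (0, 0, 255)],
    [(128, 128, 128), (128, 128, 128), (10, 20, 30)],
]
stray = undefined_colors(mask, COLOR_MAP)
print(stray)
```

A check like this only catches colors that fall outside the color map. A car painted in the valid color for sky would pass it, so the parameters listed above still need a human reviewer.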
What is the Traditional Method of Evaluating the Quality of Semantic Segmentation?
Most of the research done so far focuses on evaluating the performance of computer vision models which generate semantic segmentation results. One of the most popular methods we found is a pixel-to-pixel comparison between the AI-generated result and the corresponding ground truth segmentation mask. This method tells you the error percentage of your model relative to the ground truth data. Not enough research has been done on evaluating the quality of the ground truth data itself. We interviewed many of our clients to understand what they care about most in their ground truth data, and that formed the basis of our quality assurance process.
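The pixel-to-pixel comparison described above can be sketched in a few lines. This is a minimal illustration, assuming both masks are equally sized lists of rows holding one class id per pixel; the function name `pixel_error_rate` and the class ids are made up for the example.

```python
def pixel_error_rate(predicted, ground_truth):
    """Fraction of pixels whose predicted class differs from the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("masks must have the same number of rows")
    total = 0
    wrong = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        if len(pred_row) != len(gt_row):
            raise ValueError("mask rows must have the same length")
        for pred, gt in zip(pred_row, gt_row):
            total += 1
            if pred != gt:
                wrong += 1
    return wrong / total if total else 0.0


# Class ids per pixel: 0 = background, 1 = car (illustrative labels).
ground_truth = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
predicted = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
error = pixel_error_rate(predicted, ground_truth)
print(f"{error:.1%}")  # 1 of 8 pixels differs
```

Note that this measures a model against the ground truth. It says nothing about whether the ground truth itself is correct, which is the gap this article addresses.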
What does the Quality of a Segment Mean?
Concept of Doodles:
Instead of evaluating every pixel, we evaluate the contours and the color of the segments enclosed by those contours. If we find an error in a certain portion of a contour or in an entire object, we draw a rectangle around the problematic area of the mask. We can then select what kind of error is enclosed within the rectangle. It can be any one of the parameters explained above. Each such rectangle is called a doodle. You can also make a doodle if an object was missed.
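One way to picture a doodle is as a small record: a rectangle plus the kind of error it encloses. The sketch below is only an assumed representation; the `Doodle` and `ErrorType` names, their fields and the file names are hypothetical and are not taken from Playment's tool.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    # One value per parameter described earlier in this article.
    CLASS = "class"
    INSTANCE = "instance"
    BOUNDARY = "boundary"
    ATTRIBUTE = "attribute"
    MISSED_OBJECT = "missed_object"


@dataclass(frozen=True)
class Doodle:
    """A rectangle drawn over a problematic area of a mask, tagged with an error type."""
    image_id: str
    x: int       # left edge of the rectangle, in pixels
    y: int       # top edge of the rectangle, in pixels
    width: int
    height: int
    error_type: ErrorType


doodles = [
    Doodle("img_001.png", 40, 60, 25, 30, ErrorType.BOUNDARY),
    Doodle("img_001.png", 200, 10, 50, 50, ErrorType.MISSED_OBJECT),
    Doodle("img_007.png", 5, 5, 80, 40, ErrorType.CLASS),
]

# Images that contain at least one boundary error.
boundary_images = {d.image_id for d in doodles if d.error_type is ErrorType.BOUNDARY}
print(boundary_images)
```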
Playment's Quality Assurance Process:
For Playment, quality is always the top priority. We have multiple checks and balances in place during the execution phase to ensure the quality of the output is best in class. Each annotation is checked multiple times before it is ready for submission.
How do we Measure Quality Using Quantifiable Metrics?
After speaking with a lot of our clients, we have found that:
Clients want to evaluate the quality of each parameter separately, and don't want a single metric which fuses all errors together. This helps them fine-tune their requirements to Playment and optimise for costs accordingly.
Clients also want to differentiate between objects which are very critical for their model, e.g. cars, and objects which are not very critical, e.g. sky. Usually they have different quality benchmarks for critical and non-critical objects, which helps them optimise for costs. It is also perfectly fine if all the classes are critical.
They care about the entire segmentation mask being usable, i.e. a mask with one class mislabeling error is treated the same as a mask with multiple class mislabeling errors.
For each group of classes (critical and non-critical) we define the following error metrics:
Class precision: (# of images with no class mislabelling errors)/total images
Geometric precision: (# of images with no wrong boundary error)/total images
Instance precision: (# of images with no incorrect instance errors)/total images
Attribute precision: (# of images with no incorrect attribute errors)/total images
Recall: (# of images with no missed objects)/total images
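The five ratios above can be computed directly once each checked image is tagged with the error types found in it. The sketch below is one possible way to do that; the image names, the error labels and the `quality_metrics` helper are assumptions made for this example.

```python
# Error types found per checked image. An empty set means the image is clean.
# Image names and error labels are illustrative.
errors_by_image = {
    "img_001": {"boundary"},
    "img_002": set(),
    "img_003": {"class", "boundary"},
    "img_004": {"missed_object"},
    "img_005": set(),
}

# Which error type each metric looks at.
METRIC_TO_ERROR = {
    "class_precision": "class",
    "geometric_precision": "boundary",
    "instance_precision": "instance",
    "attribute_precision": "attribute",
    "recall": "missed_object",
}


def quality_metrics(errors_by_image):
    """For each metric, the share of images with no error of the matching type."""
    total = len(errors_by_image)
    if total == 0:
        raise ValueError("no images were checked")
    metrics = {}
    for metric, error_type in METRIC_TO_ERROR.items():
        clean = sum(1 for errs in errors_by_image.values() if error_type not in errs)
        metrics[metric] = clean / total
    return metrics


metrics = quality_metrics(errors_by_image)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Storing a set of error types per image, rather than a count, mirrors the image-level definition: an image either has an error of a given type or it does not. To score critical and non-critical classes separately, the same function would be run once per group on the errors belonging to that group.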
Example: suppose a company cares most about the classes cars, pedestrians, lane markings, traffic lights and traffic signs. These labels become part of our critical set. The company may have the following quality benchmarks, which are contractually agreed by both the client and Playment.
For critical classes:
Class precision: 95%, i.e. at least 95% of images in a submission should have 0 class mislabeling errors
Geometric precision: 85%, i.e. at least 85% of images in a submission should have 0 boundary errors
Instance precision: 90%, i.e. at least 90% of images in a submission should have 0 instance errors
Recall: 95%, i.e. at least 95% of images in a submission should have 0 objects missed
Similarly, for non-critical classes these benchmarks can be lower, e.g. class precision - 90%, geometric precision - 75%, instance precision - 85% and recall - 85%.
When we make a submission, the client can evaluate its quality against the above benchmarks as contractually agreed, and if any of the benchmarks is missed, they can send the submission for rework.
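The comparison against benchmarks is a simple threshold check per metric. The sketch below uses the critical-class benchmarks from the example above; the measured values and the `missed_benchmarks` helper are hypothetical.

```python
# Benchmarks for critical classes, taken from the example above.
CRITICAL_BENCHMARKS = {
    "class_precision": 0.95,
    "geometric_precision": 0.85,
    "instance_precision": 0.90,
    "recall": 0.95,
}


def missed_benchmarks(measured, benchmarks):
    """Return the metrics whose measured value falls below the agreed benchmark."""
    return {
        name: (measured[name], target)
        for name, target in benchmarks.items()
        if measured[name] < target
    }


# Hypothetical measured values for one submission.
measured = {
    "class_precision": 0.96,
    "geometric_precision": 0.82,
    "instance_precision": 0.93,
    "recall": 0.95,
}
missed = missed_benchmarks(measured, CRITICAL_BENCHMARKS)
needs_rework = bool(missed)
print(missed, needs_rework)
```

Here only geometric precision falls short (82% against an agreed 85%), so this submission would qualify for rework. A value exactly equal to its benchmark counts as met, in line with the "at least" wording of the benchmarks.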
Importance of Sampling in Determining Quality of Large Ground Truth Dataset:
As we have seen so far, calculating the quality of ground truth segmentation requires manually checking all the parameters of all the objects in an image. In a dataset with thousands of images, this is a huge manual effort and can become prohibitively expensive. Hence we draw a random sample of images from the dataset that is large enough to be a good representation of the whole dataset.
We usually consider a random sample of at least 5-10% of the dataset to be a good representation.
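Drawing such a sample can be sketched with Python's standard `random` module. The function name `sample_for_qc`, the file names and the choice to round the sample size up are assumptions made for this illustration.

```python
import math
import random


def sample_for_qc(image_ids, fraction=0.10, seed=None):
    """Pick a uniform random sample of images without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    # Round up so a small dataset still yields at least one image.
    size = min(len(image_ids), max(1, math.ceil(len(image_ids) * fraction)))
    return rng.sample(image_ids, size)


image_ids = [f"img_{i:04d}" for i in range(824)]  # hypothetical file names
sample = sample_for_qc(image_ids, fraction=0.10, seed=42)
print(len(sample))
```

Sampling without replacement ensures no image is checked twice, and recording the seed lets both parties reproduce exactly which images were selected.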
A Glimpse of Playment's Quality Assurance Tool:
Playment has built a quality checking tool which allows us to comprehensively check all the parameters of all the annotations present in a sample.
QC Results:
After finishing the QC of a sample set, the tool generates a result in the following format, which clearly tells us the quality of the dataset based on various parameters.
In the image above we can see the results of a QC performed on a submission of 824 images. The random sample that was checked consisted of 85 images. A class precision of 95% means that 81 of those 85 images (81/85 ≈ 95.3%) had 0 class mislabeling errors. No issues were found with recall, geometric precision or instance precision.
In the table you can also check the number of errors found in the images which were checked. The table also provides a view to check whether any particular type of mistake is most prevalent or whether a particular object has more errors.