How we measure Quality of 3D Cuboids

What is a Dataset?

A dataset has two components:

  1. Lidar data streams with camera streams for reference

  2. Annotations representing the objects present in the lidar data.

For the scope of this article, we define quality based only on the quality of the ground truth annotations involving 3D shapes (cuboids) in a particular dataset.

What does the Quality of an Annotation Mean?

A dataset may have millions of objects which need to be annotated and tracked along a sequence. An annotation has the following key parameters (a minimal sketch of such an annotation follows the list).

  1. Object Type or Class: The annotation should correctly classify the object it represents. e.g. A car shouldn't be annotated as a truck.

  2. Annotation Position and Dimensions: The annotation should be as tight as possible while ensuring no portion of the object is outside the annotation. This is evaluated on six parameters: length, width, height, roll, pitch, and yaw. Getting this right is one of the most time-consuming aspects of data labelling. Usually, a buffer zone of a couple of pixels is allowed while verifying the dimensions.

  3. Object Attributes: The annotation should include correct values for attributes as per the project requirements. e.g. If every car also needs information about its direction of travel, i.e. same or opposite, the annotation for every car should have the correct value for direction.

  4. Object Tracking Along a Sequence (if applicable): If an object is present in multiple frames of a sequence, all the annotations representing that object should share a common tracking ID. If an object ends up represented by two different tracking IDs, this is termed an ID-switching error.
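For reference, here is a minimal sketch (in Python, with hypothetical field names rather than Playment's actual schema) of a single 3D cuboid annotation carrying these four kinds of information:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class CuboidAnnotation:
    """One 3D cuboid annotation in a single lidar frame (hypothetical schema)."""
    object_class: str                       # 1. object type, e.g. "car", "truck"
    center: Tuple[float, float, float]      # 2. cuboid centre (x, y, z) in metres
    dimensions: Tuple[float, float, float]  # 2. (length, width, height) in metres
    rotation: Tuple[float, float, float]    # 2. (roll, pitch, yaw) in radians
    attributes: Dict[str, str] = field(default_factory=dict)  # 3. e.g. {"direction": "same"}
    tracking_id: str = ""                   # 4. shared by all frames of the same object
```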

True Positive (TP): When all the parameters of an annotation are correct, that annotation is called a true positive. This means that the object has been correctly annotated.

False Positive (FP): When at least one of the parameters of an annotation is incorrect, that annotation is called a false positive.

False Negative (FN): If there is an object which can be identified by human eyes but has no annotation, every such object is counted as a false negative.
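To make these definitions concrete, here is a minimal Python sketch (the review format is hypothetical) that tallies TP, FP and FN for one frame: an annotation whose checks all pass is a TP, an annotation with at least one failed check is an FP, and every visible object left without an annotation is an FN.

```python
def tally_frame(reviewed_annotations, missed_object_count):
    """Count TP/FP/FN for a single frame.

    reviewed_annotations: one dict of boolean checks per annotation, e.g.
        {"class_ok": True, "geometry_ok": False, "attributes_ok": True, "tracking_ok": True}
    missed_object_count: visible objects that have no annotation at all.
    """
    tp = sum(1 for checks in reviewed_annotations if all(checks.values()))  # all parameters correct
    fp = len(reviewed_annotations) - tp                                     # at least one parameter wrong
    fn = missed_object_count                                                # objects never annotated
    return tp, fp, fn
```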

How to Measure the Quality of a Dataset Using Quantifiable Metrics?

The quality of a ground truth dataset is good if most of the annotations in it are true positives and there are very few false positives and false negatives. In order to measure quality as a quantifiable metric, we experimented with a few different universally accepted methods. We have found that many of our clients view FP errors and FN errors differently, with different costs attached to each. Hence, the standard and simple "precision" and "recall" give a good representation of the quality of the dataset.

Precision = \frac{\sum_{t} TP_t}{\sum_{t}(TP_t + FP_t)}
Recall = \frac{\sum_{t} TP_t}{\sum_{t}(TP_t + FN_t)}
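Both formulas sum the per-frame counts over all frames t of the sample before dividing. A minimal sketch of that aggregation in Python, assuming per-frame (TP, FP, FN) tuples such as those produced by the tally above:

```python
def precision_recall(per_frame_counts):
    """per_frame_counts: list of (tp, fp, fn) tuples, one per frame t."""
    total_tp = sum(tp for tp, _, _ in per_frame_counts)
    total_fp = sum(fp for _, fp, _ in per_frame_counts)
    total_fn = sum(fn for _, _, fn in per_frame_counts)
    precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) else 0.0
    recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) else 0.0
    return precision, recall

# e.g. two frames with (tp, fp, fn) of (8, 1, 1) and (9, 0, 2):
# precision = 17 / 18 ≈ 0.944, recall = 17 / 20 = 0.85
print(precision_recall([(8, 1, 1), (9, 0, 2)]))
```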

How does Playment Check the Quality of a Submission?

Playment's Quality Assurance Process:

For Playment, quality is always the top priority. We have multiple checks and balances in place during the execution phase to ensure the quality of the output is best in class. Each annotation is checked multiple times before it is ready for submission.

Importance of Sampling in Determining the Quality of a Large Ground Truth Dataset:

As we have seen so far, in order to calculate the quality of a ground truth dataset, you need to manually check all the parameters of all the annotations present in the dataset. However, when dealing with millions of annotations, this activity becomes prohibitively expensive. Hence, we create a statistically significant random sample that is a good representation of the dataset.

There are two ways to generate a sample:

  1. Long High-FPS Sequences: If the dataset contains very long sequences, it is better to scale down the FPS by a certain factor. e.g. If the dataset has 100 sequences with 1,000 frames each, i.e. 100,000 frames in total, we can scale down the FPS of each sequence by a factor of 10 and pick 100 equidistant frames from each sequence. This gives a sample size of 100 sequences * 100 frames/seq = 10,000 frames, which is 10% of the original dataset (see the sketch after this list).

  2. Short Low-FPS Sequences: If the dataset contains a large number of short, low-FPS sequences, it is better to randomly pick a certain percentage of the sequences at the original FPS. e.g. If the dataset has 1,000 sequences with 100 frames each, i.e. 100,000 frames in total, we randomly pick 10% of the sequences, i.e. 100 sequences at the original FPS with 100 frames each. This gives a sample size of 100 sequences * 100 frames/seq = 10,000 frames, which is 10% of the original dataset.
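Both strategies are easy to express in code. Below is a minimal Python sketch (the function names and the sequence-dictionary format are hypothetical), where each sequence is a list of frame identifiers:

```python
import random

def sample_long_high_fps(sequences, fps_factor=10):
    """Strategy 1: keep every fps_factor-th frame of each long, high-FPS sequence."""
    return {seq_id: frames[::fps_factor] for seq_id, frames in sequences.items()}

def sample_short_low_fps(sequences, fraction=0.10, seed=42):
    """Strategy 2: randomly pick a fraction of the sequences at their original FPS."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(sequences), k=max(1, int(len(sequences) * fraction)))
    return {seq_id: sequences[seq_id] for seq_id in picked}

# 100 sequences x 1,000 frames, downsampled by 10 -> 100 x 100 = 10,000 frames (10%)
# 1,000 sequences x 100 frames, 10% of sequences  -> 100 x 100 = 10,000 frames (10%)
```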

We usually consider an unbiased sample of at least 5% of the total frames to be statistically significant.

A Glimpse of Playment's Quality Assurance Tool:

Playment has built a quality checking tool which allows us to comprehensively check all the parameters of all the annotations present in a sample.

QC Results:

Upon completion of QC on a sample set, the tool generates a result in the following format, which clearly tells us the quality of the dataset.

The image below shows the result of a QC task. We provide two views: the first is a categorisation of mistakes by class, with the bottom row showing the overall quality of the batch; the second view is based on jobs and shows the number of mistakes in each job being QCed.

A client can perform a QC to evaluate the quality of a batch. If the precision and recall numbers are below the contractually agreed-upon benchmarks, the client can request Playment to rework the batch and iron out the issues.
