How We Measure Quality in Point Cloud Segmentation
What is a Dataset?
A dataset has two components:
LIDAR data streams with camera streams for reference.
Segments of points that represent the objects in the point cloud.
The scope of this article is limited to defining only the quality of ground truth point-cloud segmentation in the dataset.
What Does the Quality of a Segment Mean?
A dataset may have thousands of point-cloud frames containing millions of objects, represented by millions of points in the point cloud. The color of each point in a frame tells us which object that point represents. However, when evaluating a ground truth dataset, it is not humanly possible to check every single point for quality. Hence we evaluate a segment (cluster) of points rather than each individual point.
Every object in a point-cloud frame is represented by a segment of points drawn in a particular color from a well-defined color map. The color of the segment should correctly convey the following parameters:
Object Type or Class: Each segment should correctly classify the object it represents as per the color map. e.g. if the color map says that cars are denoted by red, then all segments representing cars should be in red.
Instance (only applicable for instance point cloud segmentation): In case of instance-level segmentation, each instance of an object type should have a different color, e.g. segments representing Car#1 should have a different color than segments representing Car#2, and each color should be captured in the color map.
Segment Definition: All the points representing the object should be correctly classified, such that none of the points that represent the object are missed/unclassified.
How Do ML Teams Around the World Evaluate the Quality of Point Cloud Segmentation?
Most of the research done so far focuses on evaluating the performance of computer vision models that generate point cloud segmentation results. One of the most popular methods we found is a point-to-point comparison between the AI-generated result and the corresponding ground truth segmentation mask. This method tells you the error percentage of your model relative to the ground truth data. There is not enough research on evaluating the quality of the ground truth data itself. We interviewed many of our customers to understand what they care about most in their ground truth data, and their answers formed the basis of our quality assurance process.
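To make the point-to-point comparison concrete, here is a minimal sketch in Python. It assumes both the model output and the ground truth are available as per-point class labels in the same point order; the function name and the label values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def pointwise_error_percent(predicted: Sequence[int],
                            ground_truth: Sequence[int]) -> float:
    """Return the percentage of points whose predicted label differs
    from the ground truth label at the same index."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label sequences must have the same length")
    if not ground_truth:
        raise ValueError("cannot score an empty frame")
    mismatches = sum(p != g for p, g in zip(predicted, ground_truth))
    return 100.0 * mismatches / len(ground_truth)


# Hypothetical label ids: 0 = road, 1 = car, 2 = vegetation
ground_truth = [0, 0, 1, 1, 1, 2, 2, 0]
predicted = [0, 0, 1, 1, 2, 2, 2, 1]
print(pointwise_error_percent(predicted, ground_truth))  # 25.0
```

Note that this measures the model against the ground truth; it says nothing about whether the ground truth itself is correct.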
How Does Playment Check the Quality of Point Cloud Segmentation Submissions?
Concept of Beacons
Instead of evaluating every point, we evaluate the segment of points that defines an object. If we find an error in a portion of an object or in the entire object, we create a beacon around the problematic area of the point cloud. We then select the kind of error present in the marked area, which can relate to any of the parameters explained above. A beacon can also be created where an object was missed entirely.
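As an illustration only, a beacon could be modelled as a small record that ties a marked region to one kind of error. The sketch below is an assumption about how such a record might look, not Playment's actual data model: the class names, fields, and the spherical region are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ErrorKind(Enum):
    """Kinds of error a reviewer can attach to a beacon (hypothetical names)."""
    WRONG_CLASS = "wrong_class"        # object type or class is incorrect
    WRONG_INSTANCE = "wrong_instance"  # instance color is incorrect
    POINTS_MISSED = "points_missed"    # segment definition is incomplete
    OBJECT_MISSED = "object_missed"    # object was not segmented at all


@dataclass(frozen=True)
class Beacon:
    frame_id: str
    center_xyz: Tuple[float, float, float]  # center of the marked area
    radius_m: float                          # assumed spherical region
    error_kind: ErrorKind
    instance_id: Optional[str] = None        # None when the object was missed


beacon = Beacon("frame_0001", (12.5, -3.0, 0.4), 1.5,
                ErrorKind.WRONG_CLASS, "car_1")
print(beacon.error_kind.value)  # wrong_class
```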
Playment's Quality Assurance Process:
For Playment, quality is always the top priority. We have multiple checks and balances in place during the execution phase to ensure the quality of the output is best in class. Each frame is checked multiple times before it is ready for submission.
How Do We Measure Quality Using Quantifiable Metrics?
After speaking with a lot of our customers we have found that:
Customers weigh various kinds of mistakes differently. For example, misclassifying an entire object in a frame is more severe than misclassifying a few of its points, and errors on far-away objects are considered less severe.
Customers also want to differentiate between the nature of the objects. Thing objects are usually more critical than Stuff objects.
Thing class: Object categories that have a well-defined shape, such as people and cars.
Stuff class: Object categories that have an amorphous spatial extent, such as grass and sky.
To assess the quality of a frame, we score the frame on a scale of 0-100 (a perfect frame with zero mistakes has a score of 100). We have defined various kinds of errors and a penalty associated with each. If any of these errors is present in the frame, we deduct its penalty value from the frame score. We classify the errors as follows:
QC Scoring Config

| Error Type | Error Description | Penalty |
| --- | --- | --- |
| Critical Error | A thing object (e.g. car, pedestrian, etc.) is misclassified. | 10 |
| Critical Error | An object is missed entirely. | 10 |
| Critical Error | More than ~5%* of the points of a stuff class (e.g. sky, buildings, vegetation, etc.) are misclassified. Examples: entire buildings or areas of vegetation incorrectly classified, or sloppy classification of the stuff classes throughout the entire scene. | 10 |
| Non Critical Error | Less than ~5%* of the points are misclassified for a stuff class (e.g. border points are misclassified and other points are correct). | 3 |
| Non Critical Error | Less than ~5%* of the points belonging to an object have been missed. | 3 |
| Non Critical Error | Wrong instance. | 3 |
| Non Critical Error | Any mistake on a far-away (>75m**) object. | 3 |

* This threshold is only an approximation; the exact proportion of points is not measured when evaluating a frame.
** The distance value can vary based on the dataset.
Note: This scoring config is only applicable for 360-degree point cloud segmentation. Based on the type of dataset, the scoring config can be customized.
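Because the scoring config can be customized per dataset, it is convenient to keep it as data rather than hard-coding penalties. The sketch below encodes the table above as a Python dictionary. The key names, the `ErrorRule` type, and the `penalty_for` helper are hypothetical; only the severities and penalty values come from the table.

```python
from typing import Dict, NamedTuple


class ErrorRule(NamedTuple):
    severity: str  # "Critical" or "Non Critical"
    penalty: int   # points deducted from the frame score


# Keys are illustrative names for the rows of the QC scoring config.
SCORING_CONFIG: Dict[str, ErrorRule] = {
    "thing_misclassified": ErrorRule("Critical", 10),
    "object_missed": ErrorRule("Critical", 10),
    "stuff_misclassified_over_5pct": ErrorRule("Critical", 10),
    "stuff_misclassified_under_5pct": ErrorRule("Non Critical", 3),
    "object_points_missed_under_5pct": ErrorRule("Non Critical", 3),
    "wrong_instance": ErrorRule("Non Critical", 3),
    "far_away_object_mistake": ErrorRule("Non Critical", 3),
}


def penalty_for(error_key: str) -> int:
    """Look up the penalty for one error type; raises KeyError if unknown."""
    return SCORING_CONFIG[error_key].penalty


print(penalty_for("thing_misclassified"))  # 10
print(penalty_for("wrong_instance"))       # 3
```

A customized config for another dataset type would then only need a different dictionary, with the scoring logic left unchanged.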
Importance of Sampling in Determining the Quality of a Large Ground Truth Dataset
As we have seen so far, calculating the quality of ground truth segmentation requires manually checking all the parameters of all the objects in a frame. In a dataset with thousands of frames, this is a huge manual effort and can become prohibitively expensive. Hence we draw a statistically significant random sample of frames that is a good representation of the dataset. We usually consider a random sample of at least 5-10% of the dataset to be a good representation.
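Drawing such a sample can be sketched with Python's standard library. This is a minimal illustration that assumes frames are identified by string ids and that a uniform random sample without replacement is wanted; the function name, parameter names, and frame ids are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_frames(frame_ids: Sequence[str],
                  percent: float = 5,
                  seed: Optional[int] = None) -> List[str]:
    """Pick a uniform random sample covering `percent` % of the frames,
    rounded up, with at least one frame."""
    if not frame_ids:
        raise ValueError("dataset has no frames")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    k = max(1, math.ceil(len(frame_ids) * percent / 100))
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(list(frame_ids), k)


frame_ids = [f"frame_{i:05d}" for i in range(2000)]
qc_sample = sample_frames(frame_ids, percent=5, seed=42)
print(len(qc_sample))  # 100
```

Fixing the seed is a design choice that lets both parties reproduce the same sample when a score is disputed.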
We then evaluate the quality score of each frame in the random sample as 100 minus the sum of the penalties for the errors found in that frame.
Once we calculate the quality score for each frame, we evaluate the overall quality score for the sample as:
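As a minimal sketch of this aggregation step, the example below assumes the sample score is the simple average of the per-frame scores. That averaging rule is an assumption made for illustration (other weightings are possible), and the function name and scores are hypothetical.

```python
from statistics import mean
from typing import Sequence


def sample_quality_score(frame_scores: Sequence[float]) -> float:
    """Aggregate per-frame scores (0-100) into one score for the sample.

    Assumption: the aggregate is the arithmetic mean of the frame scores.
    """
    if not frame_scores:
        raise ValueError("sample contains no frames")
    return mean(frame_scores)


frame_scores = [100, 97, 90, 94, 87]  # hypothetical per-frame scores
print(sample_quality_score(frame_scores))  # 93.6
```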
When we make a submission, the customer can evaluate its quality against the contractually agreed benchmarks, and if any of the benchmarks is missed, they can send the submission back for rework.
Note: If multiple errors are present on the same instance, only the largest penalty is applied to the score.
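The per-frame scoring, including the rule in the note above, can be sketched as follows. The example assumes each error is recorded as an (instance id, penalty) pair; the function name and ids are hypothetical, and clamping the result at 0 is an assumption made to keep the score on the 0-100 scale.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PERFECT_SCORE = 100


def frame_score(errors: Iterable[Tuple[str, int]]) -> int:
    """Score one frame from (instance_id, penalty) pairs.

    Only the largest penalty per instance is deducted.
    """
    worst_per_instance: Dict[str, int] = defaultdict(int)
    for instance_id, penalty in errors:
        worst_per_instance[instance_id] = max(
            worst_per_instance[instance_id], penalty)
    total_penalty = sum(worst_per_instance.values())
    # Assumption: the score never drops below 0.
    return max(0, PERFECT_SCORE - total_penalty)


errors = [
    ("car_1", 10),        # thing object misclassified
    ("car_1", 3),         # wrong instance on the same object: ignored
    ("vegetation_4", 3),  # a few border points misclassified
    ("pedestrian_2", 3),  # mistake on a far-away object
]
print(frame_score(errors))  # 84
```

Here `car_1` has two errors, but only its 10-point penalty counts, so the frame loses 10 + 3 + 3 = 16 points.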