author: TL25693
score: / 10

Summary of the paper:

The paper introduces VoxelNet, an end-to-end deep network that goes from a raw 3D point cloud (from LiDAR) directly to object detection. It eliminates the need for the traditional hand-crafted features computed on the raw representation.

Voxel Partitioning From the raw 3D point cloud input, VoxelNet first partitions the 3D space into equally spaced voxels and assigns each point to a voxel.
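A minimal sketch of this step, assuming a NumPy (N, 4) array of (x, y, z, reflectance) points; the voxel sizes and range bounds below are illustrative placeholders, not the paper's exact KITTI settings:

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Map each point to the integer (x, y, z) index of its voxel."""
    mins = np.array(pc_range[:3])
    maxs = np.array(pc_range[3:])
    size = np.array(voxel_size)
    # Keep only points inside the cropped range.
    mask = np.all((points[:, :3] >= mins) & (points[:, :3] < maxs), axis=1)
    points = points[mask]
    # Integer voxel coordinates, one row per surviving point.
    coords = ((points[:, :3] - mins) / size).astype(np.int32)
    return points, coords
```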

Grouping Points are grouped according to the voxel they fall into. Because LiDAR points are sparse and unevenly distributed, some voxels contain significantly more points than others.

Random Sampling For voxels containing more than T points, randomly sample T of them. This drastically reduces computation and decreases the point-count imbalance between voxels.
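Continuing the voxelize() sketch above, a hypothetical helper that buckets points by voxel index and caps each voxel at T points (the paper uses T = 35 for cars):

```python
def group_and_sample(points, coords, T=35):
    """Group points by voxel and keep at most T points per voxel."""
    voxels = {}
    for p, c in zip(points, coords):
        voxels.setdefault(tuple(c), []).append(p)
    out = {}
    for c, pts in voxels.items():
        pts = np.stack(pts)
        if len(pts) > T:
            # Random subsample, without replacement.
            keep = np.random.choice(len(pts), T, replace=False)
            pts = pts[keep]
        out[c] = pts
    return out  # dict: voxel index -> (<=T, 4) array
```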

Stacked Voxel Feature Encoding A novel way of encoding the raw points (per voxel) into feature space. Each point's coordinates and reflectance are augmented with the point's offset from the voxel centroid, which encodes the local surface shape, and stacked VFE layers aggregate information from the surrounding points within the voxel.
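A rough PyTorch sketch of a single VFE layer under those assumptions: the input is the 7-dim augmented point (x, y, z, r, plus the three centroid offsets), a point-wise fully connected net produces features, element-wise max pooling aggregates them per voxel, and the aggregated feature is concatenated back to each point. Padding/masking of voxels with fewer than T points is omitted, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Point-wise FC; output is half of out_dim so the
        # concatenation below restores out_dim.
        self.fc = nn.Linear(in_dim, out_dim // 2)
        self.bn = nn.BatchNorm1d(out_dim // 2)

    def forward(self, x):
        # x: (num_voxels, T, in_dim) point features per voxel
        n, t, _ = x.shape
        pointwise = torch.relu(
            self.bn(self.fc(x).view(n * t, -1))).view(n, t, -1)
        # Element-wise max pooling aggregates the voxel's points.
        aggregated, _ = pointwise.max(dim=1, keepdim=True)
        # Concatenate point-wise and voxel-level features.
        return torch.cat([pointwise, aggregated.expand(-1, t, -1)], dim=2)
```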

Sparse Tensor Representation Because the vast majority (~90%) of the voxels are empty, the voxel features can be represented as a sparse 4D tensor of size C x D' x H' x W'.
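A small sketch of how the non-empty voxel features might be scattered into that dense (C, D', H', W') tensor for the downstream convolutional middle layers; coords is assumed to hold integer (d, h, w) voxel indices:

```python
import torch

def scatter_to_dense(features, coords, grid_shape):
    """features: (num_voxels, C); coords: (num_voxels, 3) as (d, h, w)."""
    C = features.shape[1]
    dense = torch.zeros(C, *grid_shape, dtype=features.dtype)
    d, h, w = coords[:, 0].long(), coords[:, 1].long(), coords[:, 2].long()
    # Advanced indexing writes each voxel's feature into its grid cell;
    # every other cell stays zero, reflecting the empty voxels.
    dense[:, d, h, w] = features.t()
    return dense  # (C, D', H', W')
```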

Region Proposal Network The network then feeds the resulting feature map to an RPN, modified from the original RPN design used in image-based detection.
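A highly simplified sketch of the RPN's output heads, assuming the 3D feature volume has already been flattened into a bird's-eye-view 2D map; the block depths and channel counts here are placeholders, not the paper's architecture:

```python
import torch.nn as nn

class TinyRPN(nn.Module):
    def __init__(self, in_ch, num_anchors=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )
        self.score_head = nn.Conv2d(128, num_anchors, 1)    # objectness map
        self.reg_head = nn.Conv2d(128, num_anchors * 7, 1)  # 7 box residuals

    def forward(self, x):
        f = self.backbone(x)
        return self.score_head(f), self.reg_head(f)
```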

Loss Function The loss is computed against the 3D ground-truth boxes by distinguishing positive anchors (intersection over union with a ground-truth box above 0.6) from negative anchors (IoU below 0.45), combined with a regression term on the box parameters.
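For reference, the loss has roughly this form: binary cross-entropy terms on the sampled positive and negative anchors, balanced by weights alpha and beta, plus a SmoothL1 regression loss on the 7 box residuals (center offsets, size ratios, and yaw):

```latex
\[
L = \alpha \frac{1}{N_{\mathrm{pos}}} \sum_i L_{\mathrm{cls}}(p_i^{\mathrm{pos}}, 1)
  + \beta \frac{1}{N_{\mathrm{neg}}} \sum_j L_{\mathrm{cls}}(p_j^{\mathrm{neg}}, 0)
  + \frac{1}{N_{\mathrm{pos}}} \sum_i L_{\mathrm{reg}}(\mathbf{u}_i, \mathbf{u}_i^{*})
\]
```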

Overall, VoxelNet performed much better than the baselines and previous works on the KITTI benchmark across all of Car, Pedestrian, and Cyclist detection.

Visually, most of the predicted bounding boxes match expectations.

TL;DR VoxelNet replaces hand-crafted point-cloud features with a learned, end-to-end pipeline: voxelize the LiDAR points, encode each voxel with stacked VFE layers, convolve the resulting sparse 4D tensor, and detect objects with an RPN; it outperforms prior methods on Car, Pedestrian, and Cyclist detection.