author: | TL25693 |

score: | / 10 |

TODO: Summarize the paper:

- What is the core idea?

The paper introduces VoxelNet, an end-to-end deep network that goes from a raw 3D point cloud (from LiDAR) to object detections. It eliminates the need for traditionally hand-crafted features on the raw representation.

- How is it realized (technically)?

**Voxel Partitioning** From the raw 3D point cloud input, VoxelNet first partitions the points into equally spaced voxels.

**Grouping** Points are grouped according to the voxel they fall into. Because the point cloud is sparse and unevenly distributed, some voxels contain significantly more points than others.
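The partitioning and grouping steps can be sketched in NumPy; the voxel size and range below are illustrative placeholders, not the paper's exact KITTI settings:

```python
import numpy as np

# Hypothetical voxel dimensions and scene range (meters); the paper
# tunes these per detection class.
VOXEL_SIZE = np.array([0.2, 0.2, 0.4])   # (vx, vy, vz)
RANGE_MIN = np.array([0.0, -40.0, -3.0])  # minimum (x, y, z) of the scene

def group_points_by_voxel(points):
    """Assign each point (x, y, z, reflectance) to a voxel and group them.

    Returns a dict mapping voxel index tuples (ix, iy, iz) to arrays
    of the points that fall inside that voxel.
    """
    coords = np.floor((points[:, :3] - RANGE_MIN) / VOXEL_SIZE).astype(int)
    groups = {}
    for pt, c in zip(points, coords):
        groups.setdefault(tuple(c), []).append(pt)
    return {k: np.stack(v) for k, v in groups.items()}
```

Only non-empty voxels ever appear as keys, which already hints at the sparse representation used later.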

**Random Sampling** Randomly sample at most T points from each voxel containing more than T points. This drastically reduces computation and decreases the point imbalance between voxels.
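A minimal sketch of the per-voxel sampling; the cap T = 35 is an assumed value, since the paper sets T per class:

```python
import numpy as np

T = 35  # hypothetical cap on points per voxel

def sample_voxel(points, t=T, rng=None):
    """Keep at most t points of a voxel, sampled uniformly without replacement."""
    rng = rng or np.random.default_rng(0)
    if len(points) <= t:
        return points  # small voxels pass through unchanged
    idx = rng.choice(len(points), size=t, replace=False)
    return points[idx]
```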

**Stacked Voxel Feature Encoding** A novel way of encoding the raw input points (per voxel) into feature space. It feeds each point's **coordinates, reflectance, and offset from the voxel centroid** to encode the **surface shape**, as stacked encoder layers aggregate information from surrounding points.
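One VFE layer can be sketched as a shared pointwise fully connected layer, element-wise max pooling over the voxel, and concatenation of the pooled aggregate back onto each point feature. This is a simplified NumPy version (no learned bias or batch norm, random weights for illustration):

```python
import numpy as np

def augment(points):
    """Append each point's offset from the voxel centroid to (x, y, z, r)."""
    centroid = points[:, :3].mean(axis=0)
    return np.concatenate([points, points[:, :3] - centroid], axis=1)  # (N, 7)

def vfe_layer(points, w):
    """One voxel feature encoding (VFE) layer, sketched in NumPy.

    points: (N, 7) augmented per-point input for one voxel
    w:      (7, C) weights of the shared pointwise fully connected layer
    """
    h = np.maximum(points @ w, 0.0)  # shared FC + ReLU on each point, (N, C)
    agg = h.max(axis=0)              # element-wise max pool -> voxel aggregate, (C,)
    # Concatenate the voxel aggregate back onto every point feature.
    return np.concatenate([h, np.broadcast_to(agg, h.shape)], axis=1)  # (N, 2C)
```

Stacking several such layers lets each point's feature mix in voxel-level context, which is how the encoder captures local surface shape.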

**Sparse Tensor Representation** Since the majority (~90%) of voxels are empty, the voxel features can be represented as a sparse 4D tensor of size C×D'×H'×W'.
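Before the convolutional middle layers, the per-voxel features are scattered back into a dense C×D'×H'×W' tensor. A minimal sketch of that scatter step:

```python
import numpy as np

def scatter_to_dense(voxel_features, voxel_coords, shape):
    """Scatter per-voxel feature vectors into a dense C x D' x H' x W' tensor.

    voxel_features: (K, C) features of the K non-empty voxels
    voxel_coords:   (K, 3) integer (d, h, w) indices of those voxels
    shape:          (C, D', H', W') of the output tensor; empty voxels stay zero
    """
    dense = np.zeros(shape, dtype=voxel_features.dtype)
    d, h, w = voxel_coords.T
    dense[:, d, h, w] = voxel_features.T  # advanced indexing writes all K voxels at once
    return dense
```

Only the K non-empty voxels are ever stored and processed before this point, which is where the memory and speed savings come from.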

**Region Proposal Network** The algorithm then feeds the feature map to the RPN, which is a modified version of the original region proposal network design.

**Loss Function** The loss is computed against 3D ground-truth boxes by distinguishing positive anchors (intersection over union with a ground-truth box above 0.6) from negative anchors (IoU below 0.45).
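The anchor assignment can be illustrated with an axis-aligned 2D (bird's-eye-view) IoU, a simplification of the paper's rotated-box matching; anchors between the two thresholds are ignored by the classification loss:

```python
def iou_2d(a, b):
    """Axis-aligned IoU of two boxes (x1, y1, x2, y2); BEV simplification."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assign_anchors(anchors, gt_boxes, pos_thr=0.6, neg_thr=0.45):
    """Label each anchor 1 (positive), 0 (negative), or -1 (ignored)."""
    labels = []
    for a in anchors:
        best = max((iou_2d(a, g) for g in gt_boxes), default=0.0)
        labels.append(1 if best >= pos_thr else (0 if best < neg_thr else -1))
    return labels
```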

- How well does the paper perform?

Overall, VoxelNet performs much better than the baselines and previous work across all of Car, Pedestrian, and Cyclist detection.

Visually, most of the predicted bounding boxes match expectations.

## TL;DR

- VoxelNet provides a unified, end-to-end approach to 3D bounding box detection
- A broad-phase / narrow-phase approach may provide a significant speed-up to the estimation
- With an efficient (sparse) implementation, high speed can be achieved with great accuracy.