PointPillars: Fast Encoders for Object Detection from Point Clouds, Lang, Vora, Caesar, Zhou, Yang, Beijbom; 2018 - Summary
author: joshpapermaster
score: 7 / 10

What is the core idea?

Previous solutions for 3D object detection were too slow, especially for real-time usage. This paper proposes PointPillars, a novel point cloud encoder and network that solves 3D object detection using only 2D convolution layers in an end-to-end learning process. It does this with “pillars”: the point cloud is viewed from a bird’s eye view, and each cell of the resulting 2D grid is interpreted as a vertical column spanning the full height of the scene (providing 3D depth without hand-tuned vertical binning). PointPillars then predicts 3D boxes for the objects. The encoder can feed any standard 2D convolutional architecture.

There are several benefits to PointPillars over other common approaches: the feature encoding is learned rather than fixed by hand, pillars remove the need to tune vertical binning, and all of the heavy computation runs as efficient 2D convolutions.

How is it realized technically?

There are three stages:


  1. Encoder that takes the point cloud and maps it to a sparse pseudo-image
    • point cloud input is x, y, z, r, where r is reflectance
    • point cloud is turned into a grid
    • The points within each pillar are additionally decorated with their offset from the pillar’s arithmetic mean and their offset from the pillar’s x, y center, giving 9 features per point
    • The sparse nature of the pillars is exploited to create a dense tensor representation
    • Random sampling if a pillar has too many points
    • Zero padding if it has too few
    • A simplified version of PointNet is applied per pillar
    • Following encoding, the pseudo-image is created by scattering the pillar features back to their original locations
  2. 2D convolutional backbone network that takes in the pseudo-image and learns a high-level representation
    • Similar backbone to VoxelNet
    • There are two sub-networks:
    • A top-down network that creates features at increasingly small spatial resolution (at low resolution it is harder to separate nearby objects)
    • A second network that upsamples the features from the top-down network and concatenates them
  3. Detection head predicts 3D boxes
    • Single Shot Detector (SSD) setup with prior (anchor) boxes
    • Matching to ground truth uses 2D Intersection over Union (IoU) only
    • Box height and elevation are not used for matching and are instead additional regression targets
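The encoder stage above can be sketched in NumPy. This is a simplified, un-learned stand-in: the real network applies a learned linear layer, BatchNorm, and ReLU per point before the max-pool, and the grid dimensions here are toy values, not the paper's KITTI settings.

```python
import numpy as np

def pillarize(points, x_range=(0, 8), y_range=(0, 8),
              pillar_size=1.0, max_pts=32):
    """Decorate lidar points (N, 4) = [x, y, z, r] into per-pillar
    tensors with 9 features per point, as in the encoder stage."""
    xs = ((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    ys = ((points[:, 1] - y_range[0]) / pillar_size).astype(int)
    W = int((x_range[1] - x_range[0]) / pillar_size)
    H = int((y_range[1] - y_range[0]) / pillar_size)
    pillars = {}
    for p, xi, yi in zip(points, xs, ys):
        if 0 <= xi < W and 0 <= yi < H:
            pillars.setdefault((xi, yi), []).append(p)
    feats, coords = [], []
    for (xi, yi), pts in pillars.items():
        pts = np.asarray(pts)
        # random sampling if the pillar has too many points
        if len(pts) > max_pts:
            pts = pts[np.random.choice(len(pts), max_pts, replace=False)]
        mean = pts[:, :3].mean(axis=0)  # pillar arithmetic mean
        center = np.array([(xi + 0.5) * pillar_size + x_range[0],
                           (yi + 0.5) * pillar_size + y_range[0]])
        dec = np.concatenate([pts,                       # x, y, z, r
                              pts[:, :3] - mean,         # offset to mean
                              pts[:, :2] - center],      # offset to center
                             axis=1)
        pad = np.zeros((max_pts, 9))  # zero padding if too few points
        pad[:len(dec)] = dec
        feats.append(pad)
        coords.append((xi, yi))
    return np.stack(feats), coords, (H, W)

def scatter_to_pseudo_image(pillar_feats, coords, shape):
    """Max-pool the per-point features in each pillar (a stand-in for
    the learned PointNet) and scatter them to a dense (C, H, W) image."""
    H, W = shape
    image = np.zeros((pillar_feats.shape[-1], H, W))
    pooled = pillar_feats.max(axis=1)
    for f, (xi, yi) in zip(pooled, coords):
        image[:, yi, xi] = f
    return image
```

Only the occupied pillars are processed, which is what makes the encoding fast; everything after the scatter is a dense 2D image that any convolutional backbone can consume.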

The loss was the same as used in SECOND: a smooth L1 loss on the localization residuals, a focal loss for classification, and a direction classification loss, combined as a weighted sum.


PointPillars uses non-maximum suppression (NMS) with an overlap threshold of 0.5 IoU to pick the best bounding box per object.
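The NMS step can be sketched as greedy suppression over axis-aligned bird's eye view boxes (a simplification; box format and threshold here are illustrative):

```python
import numpy as np

def nms_bev(boxes, scores, iou_thresh=0.5):
    """Greedy NMS on axis-aligned BEV boxes (x1, y1, x2, y2):
    keep the highest-scoring box, drop any box overlapping it by
    more than iou_thresh, and repeat on what remains."""
    order = np.argsort(scores)[::-1]  # boxes sorted by descending score
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[rest, 2] - boxes[rest, 0])
                  * (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # suppress heavy overlaps
    return keep
```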

Data augmentation (ground-truth sampling, per-box perturbations, and global flips, rotations, and scalings, following SECOND) was very important to the performance of PointPillars.

How well does the paper perform?

Bird’s eye view results:


3D detection results:


Object orientation results:

Ablation results:

There was a speed vs. accuracy trade-off: finer pillar resolution improves accuracy but reduces inference speed.

Learning the feature encoding was significantly better than methods using a fixed encoding


What interesting variants are explored?

No significant variants are explored.