Baking Neural Radiance Fields for Real-Time View Synthesis, Hedman, Srinivasan, Mildenhall, Barron, Debevec; 2021 - Summary
author: timchen0618
score: 8 / 10

What is the core idea?

NeRF is an effective method for rendering 3D scenes from viewpoints that were never observed. However, it is computationally intensive and cannot run in real time, since rendering a single pixel requires evaluating an MLP hundreds of times. The paper first reformulates NeRF so that only one MLP evaluation is needed per pixel (i.e., per ray). It then precomputes the values the reformulated model needs and stores them in a structure called a Sparse Neural Radiance Grid (SNeRG). Together, these two changes enable real-time rendering.

How is it realized (technically)?

To reduce computation, we want to avoid running so many MLP evaluations per pixel, which means trading computation for storage. At the same time, precomputing and storing the entire 5D representation would be too memory intensive.

Their solution is a hybrid approach that precomputes some values on a sparse 3D grid and leaves the remaining computation to inference time.

Deferred NeRF

They restructure NeRF to output diffuse colors, volume densities, and 4-dimensional feature vectors.

image

They compute the color of a pixel following the steps below.

First, they calculate the accumulated diffuse color and the accumulated feature vector along the ray. Then, the accumulated feature vector and the ray direction are passed into an MLP. The output of the MLP is treated as a view-dependent residual, which is added to the accumulated diffuse color.

image

This deferred version of NeRF requires only one MLP evaluation per pixel, in contrast to the hundreds of evaluations (one per sample along the ray) in standard NeRF.
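The deferred shading steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function names, the per-sample step sizes `delta`, and the `view_mlp` callable are all assumptions made for the example.

```python
import numpy as np

def composite_weights(sigma, delta):
    """Alpha-compositing weights w_k = T_k * alpha_k for samples along one ray."""
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance T_k: fraction of light surviving all samples in front of k.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    return trans * alpha

def deferred_pixel_color(sigma, delta, diffuse, features, view_dir, view_mlp):
    """Accumulate first, then call the view-dependence MLP exactly once.

    sigma:    (K,)   densities at the K samples
    delta:    (K,)   distances between consecutive samples (assumed input)
    diffuse:  (K, 3) per-sample diffuse colors
    features: (K, 4) per-sample feature vectors
    view_mlp: hypothetical callable (features, view_dir) -> (3,) residual
    """
    w = composite_weights(sigma, delta)
    acc_diffuse = w @ diffuse     # accumulated diffuse color, shape (3,)
    acc_features = w @ features   # accumulated feature vector, shape (4,)
    residual = view_mlp(acc_features, view_dir)  # the single MLP call
    return acc_diffuse + residual
```

Note that the number of MLP calls is independent of the number of samples `K`; only the cheap weighted sums scale with `K`.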

Sparse Neural Radiance Grids (SNeRG)

They need to store diffuse colors, volume densities, and feature vectors. Instead of storing a dense voxel grid, they devised a block-sparse representation.

Macroblock: Each densely occupied region is represented by a block of \(B^3\) voxels, stored in a 3D atlas.

Indirection Grid: A grid of size \((N/B)^3\) (for an \(N^3\) voxel grid) whose cells either mark the corresponding macroblock as empty or point to that macroblock's content in the 3D atlas.
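To make the two-level lookup concrete, here is a small sketch of a block-sparse grid. It is a simplification under stated assumptions: the atlas is a plain array of blocks rather than a packed 3D texture, `EMPTY = -1` is an arbitrary marker, and the `is_occupied` predicate is a hypothetical stand-in for whatever occupancy test is actually used.

```python
import numpy as np

EMPTY = -1  # assumed marker for an empty macroblock

def build_sparse_grid(dense, block_size, is_occupied):
    """Pack a dense (N, N, N, C) grid into an indirection grid plus an atlas."""
    n, b = dense.shape[0], block_size
    m = n // b  # macroblocks per axis, so the indirection grid is (N/B)^3
    indirection = np.full((m, m, m), EMPTY, dtype=np.int64)
    blocks = []
    for i, j, k in np.ndindex(m, m, m):
        block = dense[i*b:(i+1)*b, j*b:(j+1)*b, k*b:(k+1)*b]
        if is_occupied(block):
            indirection[i, j, k] = len(blocks)  # pointer into the atlas
            blocks.append(block)
    if blocks:
        atlas = np.stack(blocks)
    else:
        atlas = np.zeros((0, b, b, b, dense.shape[-1]), dtype=dense.dtype)
    return indirection, atlas

def lookup(indirection, atlas, voxel, block_size):
    """Return the stored values at an integer voxel, or None if its block is empty."""
    slot = indirection[tuple(v // block_size for v in voxel)]
    if slot == EMPTY:
        return None
    return atlas[slot][tuple(v % block_size for v in voxel)]
```

Only occupied macroblocks consume atlas storage, so memory scales with the occupied volume rather than with \(N^3\).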

Rendering

image

For each pixel,

  1. March the ray through the indirection grid, skipping empty macroblocks.
  2. For non-empty macroblocks, fetch the precomputed values at the sample points along the ray.
  3. Use alpha compositing to accumulate the diffuse color and feature vectors, stopping once the accumulated opacity saturates.
  4. With the accumulated values, compute the pixel color as in Eq. 7.
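Steps 1 to 3 above can be sketched as a simple fixed-step ray marcher. This is a rough illustration under several assumptions: nearest-voxel lookups, a fixed step size, a channel layout of `[sigma, r, g, b, f0, f1, f2, f3]` in the atlas, and an `EMPTY = -1` marker in the indirection grid. A real renderer would also jump directly to the exit of an empty macroblock instead of stepping through it sample by sample.

```python
import numpy as np

EMPTY = -1  # assumed marker for an empty macroblock

def march_ray(origin, direction, indirection, atlas, block_size,
              step=0.5, t_max=64.0, opacity_threshold=0.99):
    """Accumulate diffuse color and features along one ray (voxel coordinates)."""
    n = indirection.shape[0] * block_size
    acc_rgb, acc_feat, acc_alpha = np.zeros(3), np.zeros(4), 0.0
    fetches = 0
    t = 0.0
    while t < t_max:
        p = origin + t * direction
        t += step
        voxel = np.floor(p).astype(int)
        if np.any(voxel < 0) or np.any(voxel >= n):
            continue  # sample lies outside the grid
        slot = indirection[tuple(voxel // block_size)]
        if slot == EMPTY:
            continue  # step 1: ignore empty macroblocks
        sample = atlas[slot][tuple(voxel % block_size)]  # step 2: fetch
        fetches += 1
        alpha = 1.0 - np.exp(-sample[0] * step)
        weight = (1.0 - acc_alpha) * alpha  # transmittance times alpha
        acc_rgb += weight * sample[1:4]     # step 3: alpha compositing
        acc_feat += weight * sample[4:8]
        acc_alpha += weight
        if acc_alpha >= opacity_threshold:
            break  # opacity saturated, later samples are invisible
    return acc_rgb, acc_feat, acc_alpha, fetches
```

The returned `acc_rgb` and `acc_feat` are the inputs to step 4, where a single MLP call adds the view-dependent residual.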

Other Tricks & Details

Regularization: To enable the block-sparse representation described above and to reduce rendering time and storage, they introduce opacity regularization, which encourages sparsity in NeRF’s opacity field by penalizing the predicted density.
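As a minimal sketch of what a density penalty can look like, the snippet below uses a Cauchy-style loss, which as far as I recall is the form used in the paper; treat both the exact form and the default values of `weight` and `scale` as assumptions rather than the paper's settings.

```python
import numpy as np

def opacity_regularizer(sigma, weight=1e-4, scale=0.5):
    """Cauchy-style sparsity penalty: weight * sum(log(1 + sigma^2 / scale)).

    sigma:  predicted densities at the sampled points (any shape)
    weight: assumed loss weight (placeholder value)
    scale:  assumed scale of the Cauchy penalty (placeholder value)
    """
    sigma = np.asarray(sigma, dtype=np.float64)
    return weight * np.sum(np.log1p(sigma ** 2 / scale))
```

The penalty is zero for empty space and grows only logarithmically for large densities, so it pushes low-density regions toward zero without strongly penalizing solid surfaces. It would be added to the usual reconstruction loss during training.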

image

Sparsification: They further sparsify the voxel grid by culling macroblocks with low opacity or low visibility.

Compression: All values are quantized to 8 bits. The indirection grid is compressed as a lossless PNG, and the 3D atlas as a set of lossless PNGs, a set of JPEGs, or a video encoded with H264.
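The 8-bit step can be illustrated with plain uniform quantization. This is a sketch only: the value range `[lo, hi]` and the linear mapping are assumptions for the example, and the paper's exact mapping for each channel may differ.

```python
import numpy as np

def quantize_8bit(values, lo, hi):
    """Map floats in [lo, hi] to uint8 codes 0..255 (values outside are clipped)."""
    scaled = (np.clip(values, lo, hi) - lo) / (hi - lo)
    return np.round(scaled * 255.0).astype(np.uint8)

def dequantize_8bit(codes, lo, hi):
    """Recover approximate floats from uint8 codes."""
    return codes.astype(np.float64) / 255.0 * (hi - lo) + lo
```

With this scheme the round-trip error is at most half a quantization step, i.e. `(hi - lo) / 510`. Once the values are 8-bit, slices of the atlas can be written out with standard image or video codecs.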

How well does the paper perform?

They evaluate the method on free-viewpoint rendering of 360-degree scenes. Image quality is comparable to other methods, while rendering is significantly faster.

image

What interesting variants are explored?

They offer some interesting ablation studies regarding their design decisions.

image

Table 2 shows that using a smaller network (Tinyview) and deferred NeRF both hurt quality, and the compression schemes have a similar effect. However, these components are crucial for speeding up inference, and the quality drops are not too significant.

image

Table 3 shows that the compression techniques do reduce storage. For example, H264 reduces storage by 200x while sacrificing little quality.

TL;DR