NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng; 2020 - Summary
author: mattkelleher
score: / 10

Core Idea

This paper introduces a fully connected deep network that synthesizes 2D views of complex 3D scenes given a location \((x, y, z)\) and viewing direction \((\theta, \phi)\). The scene itself is represented by the network, which is trained using multiple images taken from known locations and viewing directions.
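In other words, the network is a single function \(F_\Theta\) mapping a 5D input to an emitted color and a volume density:

\[
F_\Theta : (x, y, z, \theta, \phi) \rightarrow (r, g, b, \sigma)
\]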

Technical Implementation

The 3D scenes are represented by the network as a Neural Radiance Field (NeRF): at every point the network outputs a color and a volume density, \(\sigma\). Volume density can be thought of as the differential probability that a ray terminates at that point.

The network is restricted so that the volume density \(\sigma\) depends only on the location, while the color value depends on both the location and the viewing direction.
The network is optimized to be consistent with all of the (training) images, whose locations and viewing directions are known.
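A minimal sketch of this restriction, written here in PyTorch (the paper does not tie itself to a framework, and the layer sizes and names below are illustrative rather than the paper's exact 8-layer architecture):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Illustrative NeRF-style MLP: density depends only on the (encoded)
    position, while color also sees the (encoded) viewing direction."""

    def __init__(self, pos_dim=60, dir_dim=24, hidden=256):
        super().__init__()
        # Trunk that processes the encoded 3D position only.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Volume density is predicted from position features alone.
        self.sigma_head = nn.Linear(hidden, 1)
        # Color additionally conditions on the viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc):
        feat = self.trunk(pos_enc)
        sigma = torch.relu(self.sigma_head(feat))                  # density >= 0
        rgb = self.color_head(torch.cat([feat, dir_enc], dim=-1))  # RGB in [0, 1]
        return rgb, sigma
```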

Predictions are made at inference time by marching a ray from the given location in the given direction through the scene, querying the network at samples along the ray, and using the volume density to weight the predicted colors: the density determines where the ray is likely to terminate and therefore what color light would be seen from that vantage point.
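A minimal sketch of this compositing step (the quadrature form of the volume rendering integral), assuming the densities and colors have already been sampled along one ray; the function and variable names here are my own:

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Accumulate color along one ray: each sample contributes its color,
    weighted by its opacity and by the transmittance, i.e. how much light
    survives the samples in front of it.

    sigmas: (N,) volume densities at N samples along the ray
    colors: (N, 3) RGB values predicted at those samples
    deltas: (N,) distances between adjacent samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance to each sample
    weights = trans * alphas                                        # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                  # expected ray color

# Toy usage: a single dense sample dominates the rendered color.
sigmas = np.array([0.0, 5.0, 0.1])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
deltas = np.array([0.5, 0.5, 0.5])
print(composite_ray(sigmas, colors, deltas))  # mostly green
```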

Positional Encoding

The authors reference previous work suggesting that neural networks are biased towards learning lower frequency functions. This was backed up by the fact that early iterations of NeRF struggled to represent high frequency variations in geometry and color. To overcome this, the authors project the inputs into a higher dimensional space before feeding them into the network. The projection can be seen below:
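For reference (the original figure is not reproduced here), the encoding \(\gamma\) applied to each scalar input \(p\) is:

\[
\gamma(p) = \big(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\big)
\]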

\(\gamma\) is applied to each of the three location inputs \((x, y, z)\) individually. It is also applied to each component of the Cartesian unit vector constructed from the viewing direction \((\theta, \phi)\). This encoding is very similar to the positional encoding used in the Transformer presented in the "Attention Is All You Need" paper.
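A minimal NumPy sketch of this encoding (the function name is my own; the paper uses \(L = 10\) frequencies for the location and \(L = 4\) for the viewing direction):

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Map each coordinate of p (normalized to roughly [-1, 1]) to
    [sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^(L-1) pi p), cos(2^(L-1) pi p)]."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi    # 2^k * pi for k = 0 .. L-1
    scaled = p[..., None] * freqs                    # (..., 3, L)
    enc = np.stack([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)            # (..., 3 * 2 * L)

# e.g. a 3D location with L = 10 becomes a 60-dimensional feature vector
loc = np.array([0.1, -0.4, 0.7])
print(positional_encoding(loc, num_freqs=10).shape)  # (60,)
```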

Results

NeRF is evaluated across three datasets: Diffuse Synthetic 360°, Realistic Synthetic 360°, and a real-world dataset presented in the LLFF paper, augmented with additional examples captured by the authors.

Evaluation metrics: PSNR and SSIM (higher is better) and LPIPS (lower is better).

NeRF is compared against the following methods for view synthesis: Scene Representation Networks (SRN), Neural Volumes (NV), and Local Light Field Fusion (LLFF).

As seen in the table above, NeRF outperforms the three other methods on all but one metric on one dataset, where LLFF performs better.

Subjectively, we can also see that NeRF performs better: the rendered views shown above from NeRF are significantly sharper than those of the other three methods. This is most easily seen with complex scenes such as the ship and the Lego model.

Ablations

Ablation studies were conducted using the Realistic Synthetic 360° dataset.

The authors explored the effects of each of the three main components of the system design (positional encoding, view dependence, hierarchical sampling), as seen in rows 1-4 above. As we would intuitively expect, a lack of view dependence was more detrimental to model performance than a lack of positional encoding or a lack of hierarchical sampling. They also show (rows 5-6) that with fewer views of the object (less training data) the network is not able to optimize as well and performance suffers, as expected. Finally, the authors show that the maximum frequency used in the positional encoding (rows 7-8) sits at a sweet spot: increasing or decreasing this value hurts performance.

TL;DR

NeRF represents a scene as a fully connected network mapping a 3D location and viewing direction to color and volume density. Rendering rays through this field, together with a positional encoding of the inputs, produces novel views that are significantly sharper than those of prior view synthesis methods.