We present STaR, a novel method that performs Self-supervised Tracking and Reconstruction of dynamic scenes with rigid motion from multi-view RGB videos.
Without any manual annotation, our method can reconstruct a dynamic scene with a single rigid object in motion by simultaneously decomposing it into its static and dynamic components and encoding each with its own neural representation. This is achieved by jointly optimizing the parameters of two neural radiance fields and a set of rigid poses that align the two fields at each frame.
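To make this concrete, the following is a minimal sketch, not the paper's implementation, of how two radiance fields could be composited under a per-frame rigid pose and optimized jointly. The tiny MLP architecture, the axis-angle pose parameterization, the additive density compositing rule, and the toy photometric loss are all simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyField(nn.Module):
    """Stand-in for a NeRF MLP: maps a 3D point to (density, RGB)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density channel + 3 color channels
        )

    def forward(self, x):
        out = self.net(x)
        return torch.relu(out[..., :1]), torch.sigmoid(out[..., 1:])  # density, color


def axis_angle_to_matrix(w):
    """Rodrigues' formula: 3-vector axis-angle -> 3x3 rotation matrix."""
    theta = w.norm() + 1e-8
    k = w / theta
    zero = torch.zeros((), dtype=w.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2],  k[1]]),
        torch.stack([k[2],  zero, -k[0]]),
        torch.stack([-k[1], k[0],  zero]),
    ])
    return torch.eye(3, dtype=w.dtype) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)


def composite(static_field, dynamic_field, rot_vec, trans, x):
    """Query both fields at world points x (N, 3) for one frame.

    The rigid pose (rot_vec, trans) carries the query points into the dynamic
    object's canonical frame; densities add and colors are density-weighted.
    """
    R = axis_angle_to_matrix(rot_vec)
    x_dyn = x @ R.T + trans                       # rigidly transformed query points
    sigma_s, rgb_s = static_field(x)
    sigma_d, rgb_d = dynamic_field(x_dyn)
    sigma = sigma_s + sigma_d
    rgb = (sigma_s * rgb_s + sigma_d * rgb_d) / (sigma + 1e-8)
    return sigma, rgb


# Joint optimization over both fields and one rigid pose per frame.
num_frames = 8
static_field, dynamic_field = TinyField(), TinyField()
rot_vecs = nn.Parameter(torch.zeros(num_frames, 3))   # per-frame axis-angle rotation
trans = nn.Parameter(torch.zeros(num_frames, 3))      # per-frame translation
params = list(static_field.parameters()) + list(dynamic_field.parameters()) + [rot_vecs, trans]
optimizer = torch.optim.Adam(params, lr=1e-3)

# One toy step: in practice (sigma, rgb) would feed a volume renderer and the
# loss would compare rendered pixels against the multi-view RGB videos.
x = torch.rand(1024, 3)
target_rgb = torch.rand(1024, 3)
for t in range(num_frames):
    sigma, rgb = composite(static_field, dynamic_field, rot_vecs[t], trans[t], x)
    loss = ((rgb - target_rgb) ** 2).mean()
    loss.backward()
optimizer.step()
```

Because the gradient flows through both field networks and the pose parameters, tracking (the poses) and reconstruction (the fields) are recovered together from photometric supervision alone.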
On both synthetic and real world datasets, we demonstrate that our method can render photorealistic novel views, where novelty is measured on both spatial and temporal axes. Our factored representation furthermore enables animation of unseen object motion.
The following videos show renderings of novel spatial-temporal views on two synthetic dynamic scenes (Lamp and Desk, Kitchen Table) and one real-world dynamic scene (Moving Banana). The rendered videos play at 20x slow motion relative to the training videos, from a continuously varying camera viewpoint unseen during training.
STaR's factored representation of motion and appearance allows it to synthesize novel views of animated trajectories of the dynamic object that were never seen during training, without any 3D ground truth or supervision.
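One plausible way to realize this is to keep both learned fields fixed and feed a hand-authored pose sequence to the compositing sketch above. The trajectory below is purely illustrative and is not taken from the paper.

```python
import torch

# A hand-authored trajectory (illustrative): the object slides along x while
# spinning about the vertical axis. These per-frame pose parameters would
# simply replace the learned (rot_vecs, trans) in the composite() sketch above.
num_frames = 60
t = torch.linspace(0.0, 1.0, num_frames)
rot_vecs = torch.stack([torch.zeros_like(t), 2 * torch.pi * t, torch.zeros_like(t)], dim=-1)
trans = torch.stack([0.5 * t, torch.zeros_like(t), torch.zeros_like(t)], dim=-1)
```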
With the decomposition into static and dynamic components learned by STaR, either the static background or the dynamic foreground can be seamlessly removed during spatial-temporal novel view rendering.
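Concretely, removing one component amounts to zeroing that field's density before the two fields are blended. The sketch below is a hypothetical, self-contained version of that blending step, using random stand-in queries in place of real field outputs.

```python
import torch

def blend(sigma_s, rgb_s, sigma_d, rgb_d, keep_static=True, keep_dynamic=True):
    """Blend per-point queries from the static and dynamic fields, optionally
    zeroing one field's density so that component vanishes from the render."""
    if not keep_static:
        sigma_s = torch.zeros_like(sigma_s)
    if not keep_dynamic:
        sigma_d = torch.zeros_like(sigma_d)
    sigma = sigma_s + sigma_d
    rgb = (sigma_s * rgb_s + sigma_d * rgb_d) / (sigma + 1e-8)
    return sigma, rgb

# Example with random stand-in queries: background-only vs. foreground-only.
sigma_s, rgb_s = torch.rand(1024, 1), torch.rand(1024, 3)
sigma_d, rgb_d = torch.rand(1024, 1), torch.rand(1024, 3)
background_only = blend(sigma_s, rgb_s, sigma_d, rgb_d, keep_dynamic=False)
foreground_only = blend(sigma_s, rgb_s, sigma_d, rgb_d, keep_static=False)
```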