A few visualizations are shown below, especially GIF animations that are difficult to include in the paper. We show EchoNeRF's two-stage learning process and compare it with the ground-truth floorplan.
In NeRFs (left), the sensors are cameras that record light (as RGB images) coming from an object of interest (a bulldozer). From these multi-view RGB images, (optical) NeRFs learn to synthesize novel views of the object and reconstruct its shape. Similarly, in EchoNeRF (right), the sensors are phones that measure RF signals (as signal strength) present in an environment (the house), except the measurements are now made inside the environment as opposed to outside. Given a few such RF measurements, EchoNeRF learns to infer the spatial layout of the house. While NeRFs use cameras to see objects, EchoNeRF uses RF signals to see environments. How does EchoNeRF achieve this?
A camera pixel (left) primarily captures light arriving directly from the object in front of it—along the line-of-sight (LoS) path (see [1]). In contrast, an RF sensor (right) such as a WiFi receiver collects signals that include multiple reflections, or “echoes,” from walls and other objects in the surrounding environment. EchoNeRF is designed to learn from such multipath signals, enabling it to “see” the environment and infer spatial structure from sparse WiFi measurements.
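For intuition, such measurements follow the standard multipath channel model (textbook notation, not necessarily EchoNeRF's internal parameterization): the received signal is a superposition of the LoS path and the wall echoes,

\[
h(t) \;=\; \sum_{k=0}^{K} a_k \, \delta(t - \tau_k),
\]

where \( k = 0 \) is the LoS path and \( k \ge 1 \) are reflections, with path gains \( a_k \) and delays \( \tau_k \). A signal-strength measurement aggregates these paths, which is why it carries information about the surrounding geometry.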
While EchoNeRF can be viewed as a general framework, as an application we show how it can be used to solve the floorplan-estimation problem in the figure above. (a) The user's device measures wireless signal power. (b) Transmitted signals arrive at the device along a line-of-sight (LoS) path and after reflections from surrounding walls. (c) Our LoS model estimates crude walls. (d) Our first-order reflection model refines the inner and outer walls. (e) Together, these yield the estimated floorplan, which can then be used to predict signals at new locations.
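As a rough illustration of stage (c), one plausible NeRF-style LoS model (a sketch under our assumptions, not necessarily the paper's exact formulation) attenuates the transmitted power through a learned attenuation field \( \sigma(\mathbf{x}) \) along the straight Tx-Rx ray:

\[
P_{\mathrm{rx}} \;\approx\; P_{\mathrm{tx}} \, G(d) \, \exp\!\left( -\int_{\mathbf{x}_{\mathrm{tx}}}^{\mathbf{x}_{\mathrm{rx}}} \sigma(\mathbf{x}) \, \mathrm{d}s \right),
\]

where \( G(d) \) is the free-space path gain at Tx-Rx distance \( d \). High-\( \sigma \) regions that consistently explain the observed attenuation across many Tx-Rx pairs emerge as the crude walls.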
We can use several post-processing steps to improve the output of EchoNeRF. Below, we show the raw outputs of EchoNeRF and the results of applying post-processing, first via traditional image processing and then via diffusion models [3].
The estimated floorplan images are first binarized and then refined by applying morphological closing to fill small white gaps, followed by dilation to expand wall thickness. A median blur is used to smooth the edges, and finally contour detection is applied to remove small black regions based on area.
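For concreteness, a minimal sketch of this pipeline with OpenCV is shown below; the wall/background polarity, the input path, the kernel sizes, and the area threshold are illustrative assumptions, not our exact values.

# A sketch of the post-processing pipeline, assuming the estimated floorplan
# is a grayscale image with dark walls on a light background.
import cv2
import numpy as np

img = cv2.imread("echonerf_floorplan.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# 1. Binarize, then invert so walls become the white foreground.
_, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
walls = cv2.bitwise_not(binary)

# 2. Morphological closing to fill small gaps in the walls.
kernel = np.ones((5, 5), np.uint8)
walls = cv2.morphologyEx(walls, cv2.MORPH_CLOSE, kernel)

# 3. Dilation to expand wall thickness.
walls = cv2.dilate(walls, kernel, iterations=1)

# 4. Median blur to smooth wall edges.
walls = cv2.medianBlur(walls, 5)

# 5. Contour detection to remove small spurious wall fragments by area.
contours, _ = cv2.findContours(walls, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    if cv2.contourArea(c) < 50.0:  # illustrative area threshold (pixels)
        cv2.drawContours(walls, [c], -1, 0, thickness=cv2.FILLED)

cv2.imwrite("echonerf_floorplan_clean.png", cv2.bitwise_not(walls))  # restore polarity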
We can also use generative priors to improve the output of EchoNeRF. We first train a diffusion model (DDPM [3]) on the ZInD dataset [4], which contains clean floorplans. Once trained, we perform denoising only over the final steps of the diffusion process, starting from EchoNeRF's output inserted at time step \( t = 100 \). The final outputs after denoising are shown below.
We also show the intermediate steps, visualizing the denoising trajectory. The first column shows the diffusion process at time step \( t = 100 \), i.e., the output of EchoNeRF with added noise. The last column shows the final denoised output.
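A minimal sketch of this partial denoising with the Hugging Face diffusers library follows, assuming a DDPM UNet already trained on ZInD floorplans; the checkpoint path and the helper that loads EchoNeRF's output are hypothetical placeholders.

# Partial DDPM denoising from t = 100, assuming a diffusers UNet2DModel
# trained on ZInD floorplans. Checkpoint path and loader are placeholders.
import torch
from diffusers import DDPMScheduler, UNet2DModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = UNet2DModel.from_pretrained("path/to/zind-ddpm").to(device)  # hypothetical checkpoint
scheduler = DDPMScheduler(num_train_timesteps=1000)

# EchoNeRF's estimated floorplan as a (1, C, H, W) tensor scaled to [-1, 1].
x0 = load_echonerf_output().to(device)  # hypothetical helper

# Noise the estimate up to t = 100, then denoise only over t = 100, ..., 0.
t_start = 100
noise = torch.randn_like(x0)
x = scheduler.add_noise(x0, noise, torch.tensor([t_start], device=device))

trajectory = [x.clone()]  # record intermediates for the trajectory figure
for t in range(t_start, -1, -1):
    with torch.no_grad():
        eps = model(x, t).sample               # predicted noise at step t
    x = scheduler.step(eps, t, x).prev_sample  # one reverse-diffusion step
    if t % 20 == 0:
        trajectory.append(x.clone())

# trajectory[0] is the noised EchoNeRF output; trajectory[-1] is the final floorplan.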
To demonstrate that EchoNeRF solves the true inverse problem, we perform the following tasks:
We also show ray-tracing results generated with Sionna [5] on the ground-truth floorplans, including higher-order reflections up to order 3. Tx and Rx are shown in the images as red and green stars, respectively.
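For reference, a minimal sketch of this step with the Sionna RT (0.x) API is shown below; the scene file exported from the ground-truth floorplan, the carrier frequency, and the Tx/Rx positions are placeholders.

# Ray tracing with reflections up to order 3, given a 3D scene exported
# from the ground-truth floorplan (e.g., a Mitsuba XML file).
from sionna.rt import load_scene, Transmitter, Receiver, PlanarArray

scene = load_scene("gt_floorplan.xml")  # hypothetical scene file
scene.frequency = 2.4e9                 # e.g., 2.4 GHz WiFi

# Single isotropic antennas at both ends.
scene.tx_array = PlanarArray(num_rows=1, num_cols=1,
                             vertical_spacing=0.5, horizontal_spacing=0.5,
                             pattern="iso", polarization="V")
scene.rx_array = scene.tx_array

scene.add(Transmitter(name="tx", position=[1.0, 2.0, 1.5]))  # red star
scene.add(Receiver(name="rx", position=[6.0, 4.0, 1.5]))     # green star

# Trace paths including up to third-order reflections.
paths = scene.compute_paths(max_depth=3)
a, tau = paths.cir()  # complex path gains and propagation delays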
We repeat the above steps for the post-processed floorplans and show the results below.
@inproceedings{echo-nerf-2025,
title = {Can NeRFs See Without Cameras?},
author = {Amballa, Chaitanya and Basu, Sattwik and Wei, Yu-Lin and Yang, Zhijian and Ergezer, Mehmet and Choudhury, Romit Roy},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2025},
url = {https://arxiv.org/pdf/2505.22441},
eprint = {2505.22441},
archivePrefix = {arXiv}
}