CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

1Google, 2Cornell University, 3Stanford University

CityRAG Generates a Simulated Environment of a City

Given an arbitrary image of San Francisco and a forward-driving trajectory, CityRAG generates a video that reveals the actual buildings on the street.

Controllable Weather and Appearance

Exploration of arbitrary weather and appearance at the same location (S King St, Honolulu).

Loop Closure

Geographically grounded and stable generation around Calle Quiñones, San Juan with loop closure. This trajectory (top right corner) is user-defined and not found in our database.

CityRAG generates minutes-long, physically grounded video sequences that 1) reconstruct real buildings and roads; 2) are initialized from a first image and respect its weather conditions and dynamic objects; and 3) follow arbitrary user-defined trajectories and exhibit loop closure.


Data

With explicit permission from Google, we collect Street View data from Google Maps across 10 diverse cities around the globe: Paris, Athens, Anchorage, Hyderabad, Philadelphia, San Francisco, San Juan, Honolulu, London, and São Paulo. Across these 10 cities, we collected a total of 5.5M panoramas. Importantly, all sensitive information, such as license plates and faces, is blurred prior to collection.


Training on paths that are geographically aligned, but temporally unaligned

We group all data by physical location and time of capture. We create a training pair whenever we find a continuous path in a city with two sets of captures located along the same path (within a distance threshold) but captured at different times (e.g., different dates, or even morning vs. afternoon of the same day).

This strategy forces CityRAG to disentangle what the pairs share (buildings, roads) from what differs between them (weather, dynamics, lighting), so it can reconstruct the shared structures while maintaining a consistent transient appearance. See video for details.
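The pairing strategy above can be sketched as follows. This is a minimal illustration, not the actual CityRAG pipeline: the field names (`path_id`, `position`, `timestamp`) and thresholds are hypothetical stand-ins for the real data schema.

```python
from collections import defaultdict

def make_training_pairs(captures, dist_thresh_m=5.0, min_time_gap_s=3600):
    """Pair captures that lie along the same path but were taken at
    different times. Each capture is a dict with illustrative fields:
    'path_id' (a discretized road segment), 'position' (meters along
    the path), and 'timestamp' (seconds)."""
    by_path = defaultdict(list)
    for c in captures:
        by_path[c['path_id']].append(c)

    pairs = []
    for path_captures in by_path.values():
        for i, a in enumerate(path_captures):
            for b in path_captures[i + 1:]:
                close_in_space = abs(a['position'] - b['position']) <= dist_thresh_m
                far_in_time = abs(a['timestamp'] - b['timestamp']) >= min_time_gap_s
                if close_in_space and far_in_time:
                    pairs.append((a, b))
    return pairs
```

Only captures that are close in space but far apart in time become pairs, which is what forces the model to treat shared structure and transient appearance differently.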




Results

The video for geospatial conditioning (left column), the trajectory defined by the target video (right column), and the first image of the target video are provided as conditions. The objective is to reconstruct the static structures (e.g., buildings, roads) from the geospatial condition, follow the trajectory, and respect the weather conditions and dynamic objects in the first image condition.


Non-trivial disentanglement of weather / cars (first image) and geometry (geospatial condition)


Non-pixel-aligned geospatial conditions and generations


Additional results



Baselines

To the best of our knowledge, there are no open-source baselines that perform our task of generating a 3D-consistent, navigable environment while simultaneously adhering to an external spatial cache. We therefore run baselines from each related category: 1) I2V + pose control (Gen3C); 2) V2V + pose control (Gen3C, TrajCrafter); 3) V2V + style transfer (AnyV2V). CityRAG can be considered a superset of these categories, while also possessing a strong understanding of scene content (static vs. transient) and layouts.




Inference via User Input and RAG

CityRAG allows users to navigate arbitrary trajectories that do not exist in the database by stitching multiple retrieved geospatial videos. In the video example below, CityRAG retrieves and combines two perpendicular paths from the same intersection to construct a new trajectory that resembles turning right. Despite the discontinuity in the geospatial condition frames, the generator produces a consistent video, indicating its robustness and its understanding of the static and transient elements in a scene.
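The retrieval step can be sketched as a nearest-neighbor lookup over frame locations, concatenated along the user's trajectory. This is a simplified illustration under assumed data structures (a `database` mapping frame ids to 2D coordinates); the actual CityRAG retrieval metric and indexing are not specified here.

```python
def retrieve_and_stitch(user_waypoints, database):
    """For each waypoint on the user trajectory, retrieve the nearest
    database frame (by squared Euclidean distance on 2D coordinates)
    and concatenate the results into one geospatial condition sequence.
    `database` maps frame id -> (x, y); schema is illustrative."""
    stitched = []
    for wx, wy in user_waypoints:
        nearest = min(
            database,
            key=lambda fid: (database[fid][0] - wx) ** 2 + (database[fid][1] - wy) ** 2,
        )
        # Skip duplicates when consecutive waypoints map to the same frame.
        if not stitched or stitched[-1] != nearest:
            stitched.append(nearest)
    return stitched
```

When the trajectory crosses from one retrieved path to another (e.g., at an intersection), the stitched sequence is discontinuous; the generator is what smooths over that seam.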




Minutes-long navigation with arbitrary trajectories

By stitching geospatial videos, users can specify arbitrary paths. CityRAG then generates consistent sequences via autoregressive generation (AR). Note: CityRAG was not trained autoregressively; consecutive generations simply take the last frame of the previous generation as the first image. Geospatial conditions ground each independent generation to ensure consistency of the static structures.
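The chaining described above can be sketched as a simple loop. Here `generate_clip` is a hypothetical stand-in for one CityRAG generation call, assumed to return a list of frames whose first frame is the conditioning image; these assumptions are illustrative, not the actual interface.

```python
def generate_long_video(first_image, geospatial_segments, generate_clip):
    """Chain independent generations into one long video: the last frame
    of each clip seeds the next generation, while each clip is grounded
    by its own retrieved geospatial segment."""
    video = []
    current = first_image
    for segment in geospatial_segments:
        clip = generate_clip(current, segment)
        # Drop the seed frame on all but the first clip to avoid duplicates.
        video.extend(clip if not video else clip[1:])
        current = clip[-1]
    return video
```

Because consecutive clips share only one frame, any drift must be corrected by the geospatial grounding rather than by the chaining itself.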

Example

Top: Generated video (sped up 2x) of the streets of Philadelphia. Even though consecutive generations are linked only by a single frame, the video (in particular the static structure) remains mostly stable. Each new generation is labeled by a green border.
Bottom: Geospatial and trajectory conditions. The trajectory is arbitrarily drawn and does not exist as a continuous video in the Street View database. 5 geospatial videos were retrieved and stitched, with each transition labeled in the "Retrieved geospatial #" counter.

Additional Results: Loop Closure

Additional Results: Navigation



Limitations and Future Work

Autoregressive Generation

There can be noticeable artifacts between consecutive generations, as they are linked only by a single image. Incorporating existing autoregressive methods could significantly improve stability and consistency.


Object-Oriented Control

We provide the model no heuristics regarding static vs. transient objects; the disentanglement is entirely data-driven. However, if much of our paired training data (same location, different time) is close in time (e.g., captured on the same day), then parked cars, trees, and similar elements could be interpreted as static objects. Future work could add fine-grained control and annotations over individual scene elements to improve controllability and realism.


Data Biases

Although we collected data from 10 cities across four continents, the majority of the data comes from Western countries, which could introduce representation bias. Though CityRAG is a research project without direct use in products or applications, any follow-up work should attempt to mitigate this bias via more diverse data collection or algorithmic corrections.



Acknowledgements

Gene Chou was supported by an NSF graduate fellowship (2139899). We thank Gordon Wetzstein, Aleksander Holynski, Jon Barron, Dor Verbin, Pratul Srinivasan, Rundi Wu, Ruiqi Gao, Haian Jin, Linyi Jin, and Haofei Xu for discussions and support.

BibTeX


@misc{chou2026cityrag,
  title         = {CityRAG: Stepping Into a City via Spatially-Grounded Video Generation},
  author        = {Chou, Gene and Herrmann, Charles and Genova, Kyle and Deng, Boyang and Peng, Songyou and Hariharan, Bharath and Zhang, Jason Y. and Snavely, Noah and Henzler, Philipp},
  year          = {2026},
  eprint        = {2604.19741},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2604.19741}
}