ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis

Abstract

Scene synthesis plays a crucial role in autonomous driving by addressing data scarcity and enabling closed-loop validation. Current approaches struggle to maintain temporal consistency in synthesized videos while preserving fine-grained details. We introduce ConsistentCity, a two-stage framework built around a novel Semantic Flow-guided Diffusion Transformer (SF-DiT) that converts sequential BEV semantic maps into temporally consistent driving videos. Operating in a pretrained occupancy VQ-VAE latent space, SF-DiT generates temporally consistent 3D occupancy, which in turn guides controlled image and video diffusion models for scene synthesis. To achieve temporal consistency, SF-DiT enhances standard DiT blocks with temporal semantic modeling through two designs: (1) a Semantic Flow Estimation module that captures scene motion (flow, uncertainty, and classification) from sequential BEV semantic maps, and (2) a Semantic Flow-Modulated Cross-Attention module that dynamically adapts attention based on semantic flow patterns. Integrating semantic flow modeling into DiT in this way enables a consistent understanding of scene evolution. Evaluations of image and video synthesis on the nuScenes dataset demonstrate state-of-the-art performance (FID 8.3, FVD 73.6), along with superior temporal occupancy generation results on the nuCraft and OpenOccupancy benchmarks.

Publication
IEEE/CVF International Conference on Computer Vision (ICCV)

Benjin Zhu1, Xiaogang Wang1, Hongsheng Li1,2,3

1 CUHK MMLab 2 Shanghai AI Laboratory 3 CPII under InnoHK

(Paper / Code)

Overview


Illustration of Semantic Flow Estimation

We present ConsistentCity, a two-stage scene synthesis framework that first converts sequential BEV semantic maps into temporally consistent 3D occupancies and then uses them to guide temporally coherent video synthesis. In the first stage, ConsistentCity adopts a DiT architecture in a pretrained occupancy VQ-VAE latent space for efficiency.

Scope of SFMCA

To ensure temporal consistency, we introduce Semantic Flow-guided Diffusion Transformers (SF-DiT), which extend standard DiT by incorporating temporal semantic flow modeling. SF-DiT denoises latent vectors conditioned on sequential BEV semantic maps, with semantic flow guidance integrated at each transformer block. The generated temporally coherent latents are then decoded into full 3D occupancies that preserve the scene's temporal evolution. In the second stage, these consistent 3D occupancies are rendered onto 2D image planes (as semantic and depth maps) and serve as control signals for pretrained diffusion models to generate driving videos.

Pipeline of ConsistentCity
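Before detailing each component, the following PyTorch-style sketch shows how the two stages could fit together at inference time. Every name here (sf_dit, vqvae, scheduler, renderer, video_diffusion) is a hypothetical stand-in for illustration, not the released interface.

import torch

@torch.no_grad()
def synthesize_driving_video(bev_maps, sf_dit, vqvae, scheduler,
                             renderer, video_diffusion, cameras):
    # Stage 1: SF-DiT denoises occupancy latents in the pretrained VQ-VAE
    # latent space, conditioned on the BEV maps and the semantic flow
    # estimated from them (SFMCA acts inside every transformer block).
    flow = sf_dit.estimate_semantic_flow(bev_maps)   # flow, uncertainty, classes
    z = torch.randn(sf_dit.latent_shape(bev_maps))   # start from Gaussian noise
    for t in scheduler.timesteps:
        eps = sf_dit(z, t, cond=bev_maps, flow=flow)
        z = scheduler.step(eps, t, z)
    occupancy = vqvae.decode(z)   # temporally consistent 3D occupancy sequence

    # Stage 2: render the occupancy onto each camera's image plane as
    # semantic and depth maps, then use them as control signals for a
    # pretrained controllable video diffusion model.
    semantic_maps, depth_maps = renderer(occupancy, cameras)
    return video_diffusion(controls=(semantic_maps, depth_maps))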

Specifically, our approach begins with a Semantic Flow Estimation step that extracts flow vectors (direction and magnitude), models the uncertainty of the estimated flows, and classifies flow semantics (e.g., objects appearing and disappearing); a minimal sketch of this module is given below. These combined semantic flows provide a physics-aligned motion foundation for both spatial and temporal modeling. Our Semantic Flow-guided Diffusion Transformer (SF-DiT) architecture integrates the semantic flow directly into DiT through a novel Semantic Flow-Modulated Cross-Attention (SFMCA) module (as in Fig. 1), sketched after the illustration below. Built on these physics-aligned, temporally consistent flow constraints, SF-DiT ensures both geometric precision and temporal consistency, achieving robust temporal fusion throughout the generation process.
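As a concrete illustration, here is a minimal PyTorch sketch of such a flow-estimation head. The layer widths and the four-way flow taxonomy (static / moving / appearing / disappearing) are our assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class SemanticFlowEstimator(nn.Module):
    # Sketch: predict per-cell flow vectors, flow uncertainty, and
    # flow-semantic classes from two consecutive BEV semantic maps.
    def __init__(self, bev_channels: int, num_flow_classes: int = 4, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * bev_channels, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
        )
        self.flow_head = nn.Conv2d(hidden, 2, 1)     # (dx, dy): direction & magnitude
        self.uncert_head = nn.Conv2d(hidden, 1, 1)   # log-variance of predicted flow
        self.class_head = nn.Conv2d(hidden, num_flow_classes, 1)  # e.g. appear/disappear

    def forward(self, bev_prev: torch.Tensor, bev_next: torch.Tensor):
        h = self.encoder(torch.cat([bev_prev, bev_next], dim=1))
        return self.flow_head(h), self.uncert_head(h), self.class_head(h)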

Illustration of SFMCA
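Below is a minimal PyTorch sketch of one plausible reading of SFMCA: standard cross-attention from occupancy-latent queries to BEV-condition keys and values, whose attention logits are biased by projected semantic-flow features and damped where flow uncertainty is high. The exact modulation scheme is our assumption, since the paper's formulation is not reproduced here.

import torch
import torch.nn as nn

class SemanticFlowModulatedCrossAttention(nn.Module):
    # Sketch of SFMCA (our reading): cross-attention whose logits are
    # modulated by semantic-flow features and flow uncertainty.
    def __init__(self, dim: int, flow_dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.flow_bias = nn.Linear(flow_dim, num_heads)  # per-head logit bias from flow
        self.out = nn.Linear(dim, dim)

    def forward(self, latent, cond, flow_feat, flow_log_var):
        # latent: (B, Nq, dim) noised occupancy-latent tokens (queries)
        # cond: (B, Nk, dim) BEV-condition tokens (keys/values)
        # flow_feat: (B, Nk, flow_dim); flow_log_var: (B, Nk, 1)
        B, Nq, _ = latent.shape
        q = self.q(latent).view(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(cond).chunk(2, dim=-1)
        k = k.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        logits = logits + self.flow_bias(flow_feat).permute(0, 2, 1).unsqueeze(2)
        logits = logits - flow_log_var.squeeze(-1)[:, None, None, :]  # damp uncertain keys
        x = logits.softmax(dim=-1) @ v                    # (B, H, Nq, head_dim)
        return self.out(x.transpose(1, 2).reshape(B, Nq, -1))

Biasing the attention logits (rather than the values) keeps the module a drop-in replacement for the cross-attention in a standard DiT block.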

Contributions of our work are threefold:

  • A comprehensive semantic flow estimation module that leverages sequential BEV semantic maps to model scene dynamics through temporal flow vectors, uncertainty, and classification.
  • An SF-DiT architecture with a novel Semantic Flow-Modulated Cross-Attention module that achieves temporally consistent driving video generation.
  • ConsistentCity achieves state-of-the-art performance on both nuScenes image & video generation (FID 8.3, FVD 73.6) and temporal 3D occupancy prediction on the nuCraft and OpenOccupancy datasets.
@inproceedings{zhu2025consistent,
  title={ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis},
  author={Zhu, Benjin and Wang, Xiaogang and Li, Hongsheng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025},
}
Benjin Zhu
Ph.D. candidate @ MMLab, CUHK.