ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis

Abstract

Scene synthesis plays a crucial role in autonomous driving by addressing data scarcity and enabling closed-loop validation. However, current approaches struggle to maintain temporal consistency in synthesized videos while preserving fine-grained details. We introduce ConsistentCity, a two-stage framework with a novel Semantic Flow-guided Diffusion Transformer (SF-DiT) that converts sequential BEV semantic maps into temporally consistent driving videos. Operating in a pretrained occupancy VQ-VAE latent space, SF-DiT generates temporally coherent 3D occupancy, which then guides controlled image and video diffusion for scene synthesis. To enforce temporal consistency, SF-DiT extends standard DiT blocks with temporal semantic modeling through two designs: (1) a Semantic Flow Estimation module that captures scene motion (flow, uncertainty, and flow semantics) from sequential BEV semantic maps, and (2) a Semantic Flow-Modulated Cross-Attention module that dynamically adapts attention based on semantic flow patterns. This integration of semantic flow modeling within DiT enables a coherent understanding of scene evolution. Evaluations of image and video synthesis on the nuScenes dataset demonstrate state-of-the-art performance with FID 8.3 and FVD 73.6, along with superior temporal occupancy generation on the nuCraft and OpenOccupancy benchmarks.

Publication
IEEE/CVF International Conference on Computer Vision (ICCV)

Benjin Zhu1, Xiaogang Wang1, Hongsheng Li1,2,3

1 CUHK MMLab    2 Shanghai AI Laboratory    3 CPII under InnoHK

(Paper / Code)

Overview


Illustration of Semantic Flow Estimation

We present ConsistentCity, a two-stage scene synthesis framework that first converts sequential BEV semantic maps into temporally consistent 3D occupancies, and then uses them to guide temporally coherent video synthesis. In the first stage, ConsistentCity adopts a DiT architecture in a pretrained occupancy VQ-VAE latent space for efficiency. To ensure temporal consistency, we introduce Semantic Flow-guided Diffusion Transformers (SF-DiT), which extend standard DiT by incorporating temporal semantic flow modeling.
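To make the structure concrete, below is a minimal sketch of how an SF-DiT block could extend a standard DiT block, assuming the BEV semantic conditions are injected through a cross-attention step between self-attention and the MLP (in the full method this cross-attention is flow-modulated, as sketched further below). Module names and sizes are hypothetical, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SFDiTBlock(nn.Module):
    """Illustrative DiT block with an extra cross-attention slot for conditions."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Cross-attention over sequential BEV semantic-map tokens
        # (flow-modulated in the full SF-DiT).
        self.cond_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, cond_tokens: torch.Tensor) -> torch.Tensor:
        """x: (B, N, dim) noised occupancy latents; cond_tokens: (B, M, dim)."""
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cond_cross_attn(h, cond_tokens, cond_tokens, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x
```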

Semantic Flow-Modulated Cross-Attention (SFMCA)

SF-DiT processes noised latent vectors conditioned on sequential BEV semantic maps, with semantic flow guidance integrated at each transformer block. The generated temporally coherent latents are then decoded into full 3D occupancy that faithfully preserves scene evolution. In the second stage, these consistent 3D occupancies are rendered (as semantic and depth maps) onto 2D image planes and used as control signals for pretrained diffusion models to synthesize driving videos.
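The second stage can be summarized as a rendering-then-conditioning loop. The sketch below assumes two hypothetical helpers, `render_to_camera` (projects a 3D occupancy frame to per-view semantic and depth maps) and a ControlNet-style `controlled_video_diffusion`; neither name comes from the paper or an existing library.

```python
import torch

@torch.no_grad()
def synthesize_video(occupancy, camera_params, render_to_camera, controlled_video_diffusion):
    """occupancy: (B, T, X, Y, Z) class-labelled volumes produced by stage one."""
    controls = []
    for t in range(occupancy.shape[1]):
        # Project each occupancy frame onto the 2D image plane of every camera.
        sem_map, depth_map = render_to_camera(occupancy[:, t], camera_params)
        controls.append(torch.cat([sem_map, depth_map], dim=1))
    control = torch.stack(controls, dim=1)  # (B, T, C_ctrl, H, W)
    # The pretrained image/video diffusion model is conditioned on these controls.
    return controlled_video_diffusion(control)
```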

Pipeline of ConsistentCity

Our approach begins with a Semantic Flow Estimation step to extract flow vectors (direction and magnitude), estimate flow uncertainty, and classify flow semantics (e.g., objects appearing and disappearing). This combined semantic flow provides a physics-aligned motion prior for both spatial and temporal modeling. Our SF-DiT architecture integrates this semantic flow directly into DiT via a novel Semantic Flow-Modulated Cross-Attention (SFMCA) module (as in Fig. 1). SF-DiT enforces both geometric precision and temporal consistency under physics-aligned flow constraints, achieving robust temporal fusion throughout the generation process.
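As a minimal sketch of the semantic flow estimation idea, the module below predicts flow, uncertainty, and flow semantics from two consecutive BEV semantic maps with a small shared CNN; the layer sizes and the three output heads are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SemanticFlowEstimator(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 64, num_flow_states: int = 3):
        super().__init__()
        # Shared encoder over the concatenated BEV pair (t-1, t).
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * num_classes, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
        )
        self.flow_head = nn.Conv2d(hidden, 2, 1)          # (dx, dy) per BEV cell
        self.uncertainty_head = nn.Conv2d(hidden, 1, 1)   # log-variance of the flow
        # Flow semantics, e.g. static / appearing / disappearing.
        self.semantics_head = nn.Conv2d(hidden, num_flow_states, 1)

    def forward(self, bev_prev: torch.Tensor, bev_curr: torch.Tensor):
        """bev_*: one-hot BEV semantic maps of shape (B, num_classes, H, W)."""
        feats = self.encoder(torch.cat([bev_prev, bev_curr], dim=1))
        return self.flow_head(feats), self.uncertainty_head(feats), self.semantics_head(feats)
```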

Illustration of the SFMCA mechanism
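One plausible reading of the SFMCA mechanism is sketched below: the estimated semantic flow (flow vectors, uncertainty, and flow-semantics logits) is embedded into per-token features that produce a scale and shift modulating the cross-attention output. This is an illustrative interpretation, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SFMCA(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, flow_dim: int = 6):
        super().__init__()
        self.flow_embed = nn.Linear(flow_dim, dim)     # flow + uncertainty + semantics
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.modulation = nn.Linear(dim, 2 * dim)      # per-token scale and shift

    def forward(self, latents, cond_tokens, flow_features):
        """latents: (B, N, dim); cond_tokens: (B, M, dim); flow_features: (B, N, flow_dim)."""
        attn_out, _ = self.cross_attn(latents, cond_tokens, cond_tokens)
        scale, shift = self.modulation(self.flow_embed(flow_features)).chunk(2, dim=-1)
        # Flow-dependent modulation adapts the attended conditions to scene motion.
        return latents + (1 + scale) * attn_out + shift
```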

Contributions. Our main contributions are threefold:

  • A comprehensive semantic flow estimation module that leverages sequential BEV semantic maps to model scene dynamics via temporal flow vectors, uncertainty, and flow semantics.
  • A Semantic Flow-guided Diffusion Transformer (SF-DiT) with a novel Semantic Flow-Modulated Cross-Attention (SFMCA) mechanism that enables temporally consistent driving video generation.
  • State-of-the-art performance on nuScenes image and video generation (FID 8.3, FVD 73.6), and superior temporal 3D occupancy prediction on the nuCraft and OpenOccupancy datasets.

Quantitative comparison to SOTA methods

@inproceedings{zhu2025consistentcity,
  title={ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis},
  author={Zhu, Benjin and Wang, Xiaogang and Li, Hongsheng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={26382--26392},
  year={2025}
}