Existing benchmarks for 3D semantic occupancy prediction in autonomous driving are limited by low resolution (up to [512×512×40] with a 0.2m voxel size) and inaccurate annotations, hindering the unification of 3D scene understanding through the occupancy representation. Moreover, previous methods can only generate occupancy predictions at 0.4m resolution or lower and require post-upsampling to reach the full 0.2m resolution. The root of these limitations lies in the sparsity, noise, and even errors present in the raw data. In this paper, we overcome these challenges by introducing nuCraft, a high-resolution and accurate semantic occupancy dataset derived from nuScenes. nuCraft offers an 8× increase in resolution ([1024×1024×80] with a 0.1m voxel size) and more precise semantic annotations than previous benchmarks. To address the high memory cost of high-resolution occupancy prediction, we propose VQ-Occ, a novel method that encodes occupancy data into a compact latent feature space using a VQ-VAE. This approach reduces semantic occupancy prediction to feature simulation in the VQ latent space, making it simpler and more memory-efficient. Our method directly generates semantic occupancy fields at high resolution without post-upsampling, facilitating a more unified approach to 3D scene understanding. We validate the superior quality of nuCraft and the effectiveness of VQ-Occ through extensive experiments, demonstrating significant advancements over existing benchmarks and methods.
Benjin Zhu1, Zhe Wang2, Hongsheng Li1,3,4
1 MMLab, The Chinese University of Hong Kong 2 SenseTime Research 3 Shanghai AI Laboratory 4 CPII under InnoHK
Paper / API / Dataset Download (HuggingFace, Kaggle)
The figure below showcases common failure cases of previous 3D semantic occupancy datasets such as OpenOccupancy and Occ3D. Occupancy datasets at lower resolutions (e.g., a 0.4m grid size) suffer from missing objects, unclear and noisy road boundaries, and incomplete shapes and geometries. In comparison, our nuCraft dataset overcomes these limitations and provides 3D occupancy with less noise at a higher 0.1m resolution.
The above figure shows a cross-sectional view of aggregated LiDAR point clouds along the Z-axis before and after pose estimation. The inaccurate ego poses provided in the raw nuScenes annotations make it difficult to obtain valid dense point clouds. Pose estimation is therefore necessary to better align the LiDAR frames and generate reliable aggregated point clouds for subsequent processing steps.
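Below is a minimal sketch of how such pose refinement could be carried out, assuming Open3D is available: each LiDAR scan is registered against a local aggregated map with point-to-plane ICP, starting from the coarse nuScenes ego pose. The helper names and parameters (`refine_pose`, `max_corr_dist`) are illustrative; the exact registration procedure used for nuCraft may differ.

```python
# Hedged sketch: refining per-frame ego poses by registering each LiDAR scan
# against an aggregated local map with point-to-plane ICP (Open3D). ICP is
# used here as an illustrative stand-in for the dataset's actual procedure.
import numpy as np
import open3d as o3d

def to_o3d(points_xyz: np.ndarray) -> o3d.geometry.PointCloud:
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points_xyz)
    return pcd

def refine_pose(scan_xyz, local_map_xyz, init_pose, max_corr_dist=0.5):
    """Refine a coarse 4x4 ego pose so the scan aligns with the local map."""
    source = to_o3d(scan_xyz)
    target = to_o3d(local_map_xyz)
    # Point-to-plane ICP needs normals on the target cloud.
    target.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=1.0, max_nn=30))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, init_pose,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return result.transformation  # refined 4x4 scan-to-map transform
```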
The above figure illustrates the difference between the raw point clouds with noisy semantic labels and the reconstructed semantic mesh (before post-processing). The raw point clouds suffer from inconsistent and inaccurate semantic labels. In contrast, the reconstructed semantic mesh exhibits less noise, more complete object shapes, and clearer semantic boundaries. The mesh reconstruction step in nuCraft helps to mitigate the issues with the raw semantic labels and produces a higher-quality representation of the 3D scene.
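As a rough illustration of this mesh reconstruction step, the sketch below uses Open3D's Poisson surface reconstruction and a nearest-neighbor label transfer to attach semantics to mesh vertices. The depth parameter, density filtering threshold, and helper names are assumptions rather than the pipeline's actual settings.

```python
# Hedged sketch: reconstruct a surface mesh from the aggregated point cloud
# with Poisson reconstruction, then give each mesh vertex the semantic label
# of its nearest input point. Illustrative only; nuCraft's reconstruction
# and label-transfer details may differ.
import numpy as np
import open3d as o3d

def reconstruct_semantic_mesh(points_xyz, point_labels, depth=11):
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points_xyz)
    pcd.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=0.3, max_nn=30))
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=depth)
    # Drop low-density vertices that correspond to hallucinated surface.
    dens = np.asarray(densities)
    keep = dens > np.quantile(dens, 0.05)
    mesh.remove_vertices_by_mask(~keep)

    # Nearest-neighbor label transfer from input points to mesh vertices.
    kdtree = o3d.geometry.KDTreeFlann(pcd)
    vertex_labels = np.empty(len(mesh.vertices), dtype=point_labels.dtype)
    for i, v in enumerate(np.asarray(mesh.vertices)):
        _, idx, _ = kdtree.search_knn_vector_3d(v, 1)
        vertex_labels[i] = point_labels[idx[0]]
    return mesh, vertex_labels
```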
The data generation pipeline of nuCraft consists of three main steps followed by post-processing. A pre-processing step provides better inputs for the subsequent steps (e.g., by generating longer sequences). A data aggregation step performs pose estimation before aggregating the LiDAR scans within each sequence. A mesh reconstruction step then generates high-quality semantic meshes for voxel densification. Finally, post-processing is applied to filter out outliers, reduce noise, and generate sensor visibility masks for evaluation.
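The densification step can be pictured as binning points sampled from the semantic mesh into a [1024×1024×80] grid at 0.1m resolution, with each occupied voxel taking the majority semantic label of the points inside it. The sketch below assumes illustrative grid extents (`ORIGIN`) and a free-space label of 0; these are not values prescribed by nuCraft.

```python
# Hedged sketch of semantic voxelization with per-voxel majority voting.
# Grid extents and label conventions are assumptions for illustration.
import numpy as np

GRID_SHAPE = (1024, 1024, 80)            # X, Y, Z voxels
VOXEL_SIZE = 0.1                         # meters
ORIGIN = np.array([-51.2, -51.2, -5.0])  # assumed lower corner of the volume

def voxelize_semantic_points(points_xyz, labels, num_classes=17):
    occ = np.zeros(GRID_SHAPE, dtype=np.uint8)   # 0 = free / unlabeled (assumed)
    idx = np.floor((points_xyz - ORIGIN) / VOXEL_SIZE).astype(np.int64)
    valid = np.all((idx >= 0) & (idx < np.array(GRID_SHAPE)), axis=1)
    idx, labels = idx[valid], labels[valid].astype(np.int64)
    flat = np.ravel_multi_index((idx[:, 0], idx[:, 1], idx[:, 2]), GRID_SHAPE)
    if flat.size == 0:
        return occ
    # Count (voxel, class) pairs, then keep the most frequent class per voxel.
    keys, counts = np.unique(flat * num_classes + labels, return_counts=True)
    voxel_id, class_id = keys // num_classes, keys % num_classes
    order = np.lexsort((counts, voxel_id))                  # largest count last per voxel
    last = np.append(np.diff(voxel_id[order]) != 0, True)   # last entry of each voxel group
    occ.reshape(-1)[voxel_id[order][last]] = class_id[order][last]
    return occ
```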
The framework consists of two main components: occupancy GT encoding and semantic occupancy prediction. In the encoding stage, a VQ-VAE is used to compress the high-resolution occupancy GT into a compact latent space using a learned codebook. For occupancy prediction, multi-view images (Here we only visualize image inputs, while LiDAR inputs can be encoded by point cloud backbones and easily integrated to generate the BEV feature.) are encoded by an image encoder to extract features, which are then projected to the VQ latent space dimensions. The model is trained to simulate the discrete VQ features of the corresponding occupancy GT, with an auxiliary depth prediction task to better capture scene geometry. During inference, the image features are encoded, projected to the VQ space, and then decoded by the pre-trained VQ-VAE decoder to directly generate the final high-resolution semantic occupancy predictions without any post-upsampling.
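A minimal PyTorch sketch of this vector-quantization idea is given below: a codebook bottleneck with a straight-through estimator for the stage-one occupancy encoding, and an inference path that quantizes projected BEV features and decodes them with the frozen VQ-VAE decoder. The codebook size, feature shapes, and the `proj`/`decoder` modules are assumptions, not the paper's exact architecture.

```python
# Hedged sketch of the VQ bottleneck and the VQ-Occ-style inference path.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                               # z: (B, C, H, W) latent features
        b, c, h, w = z.shape
        zf = z.permute(0, 2, 3, 1).reshape(-1, c)       # (B*H*W, C)
        d = torch.cdist(zf, self.codebook.weight)       # distances to all codes
        idx = d.argmin(dim=1)                           # nearest code per cell
        zq = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # Codebook + commitment losses; straight-through estimator for gradients.
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()
        return zq, idx.view(b, h, w), loss

# Inference: project camera/LiDAR BEV features to the code dimension, snap them
# to the frozen codebook, and decode high-resolution occupancy directly.
@torch.no_grad()
def predict_occupancy(bev_feat, proj, vq, decoder):
    z = proj(bev_feat)        # assumed projection head to the VQ code dimension
    zq, _, _ = vq(z)          # quantize with the frozen codebook
    return decoder(zq)        # high-resolution semantic occupancy logits
```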
We compare VQ-Occ with other methods on our nuCraft dataset at 0.2m resolution in the table below. VQ-Occ consistently outperforms all baselines across all input modalities. With C+L input, VQ-Occ achieves an IoU of 37.5% and an mIoU of 26.2%, surpassing M-CONet by 7.6% and 5.5%, respectively. These results validate the effectiveness of our nuCraft dataset in providing high-quality and precise semantic occupancy annotations for advancing 3D scene understanding. Additionally, the same models perform slightly better on nuCraft than on OpenOccupancy, reflecting the consistency and reduced noise of our nuCraft annotations.
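For reference, IoU here denotes geometric IoU over occupied-versus-free voxels and mIoU the mean of class-wise IoUs, both computed inside the sensor visibility mask. The sketch below is a simplified illustration; the free label id, class list, and masking details are assumptions about the benchmark's exact protocol.

```python
# Hedged sketch of occupancy evaluation: geometric IoU plus semantic mIoU
# over visible voxels. Label conventions are assumed for illustration.
import numpy as np

def occupancy_metrics(pred, gt, visible_mask, num_classes=17, free_id=0):
    p, g = pred[visible_mask], gt[visible_mask]
    # Geometric IoU: any occupied class vs. free space.
    po, go = p != free_id, g != free_id
    iou = (po & go).sum() / max((po | go).sum(), 1)
    # Semantic mIoU over the non-free classes that appear in pred or GT.
    ious = []
    for c in range(1, num_classes):
        inter = ((p == c) & (g == c)).sum()
        union = ((p == c) | (g == c)).sum()
        if union > 0:
            ious.append(inter / union)
    return float(iou), float(np.mean(ious))
```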
@inproceedings{zhu2024nucraft,
title={nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding},
author={Zhu, Benjin and Wang, Zhe and Li, Hongsheng},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2024}
}