We introduce a novel superpoint-based transformer 🤖 architecture for efficient ⚡ semantic segmentation of large-scale 3D scenes. Our method incorporates a fast algorithm to partition point clouds into a hierarchical superpoint structure 🧩, which makes our preprocessing 7× faster than existing superpoint-based approaches. Additionally, we leverage a self-attention mechanism to capture the relationships between superpoints at multiple scales, leading to state-of-the-art performance on three challenging benchmark datasets: S3DIS (76.0% mIoU 6-fold), KITTI-360 (63.5% mIoU on Val), and DALES (79.6% mIoU). With only 212k parameters 📦, our approach is up to 200 times more compact than other state-of-the-art models while maintaining similar performance. Furthermore, our model can be trained on a single GPU in 3 hours ⚡ for a fold of the S3DIS dataset, which is 7× to 70× fewer GPU-hours than the best-performing methods. Our code and models are accessible on GitHub.
Superpoint Transformer reasons on geometry-guided partitions of the scene, achieving SOTA task performance with orders of magnitude less compute and memory than existing 3D backbones.
This project aims at fusing the best of two worlds:
| Transformer-based models 🤖<br>(Point Transformer, Stratified Transformer, ...) | Superpoint-based models 🧩<br>(SPG, SSP+SPG, ...) |
|---|---|
| ✅ Expressivity<br>✅ Capture long-range interactions<br>❌ Compute effort guided by point sampling<br>❌ Loads of parameters<br>❌ Long training | ✅ Reduced problem complexity<br>✅ Geometry-guided compute effort allocation<br>✅ Fast training<br>✅ Lightweight model<br>❌ Long preprocessing time<br>❌ GNN's expressivity and long-range interactions<br>❌ No hierarchical reasoning |
To this end, we introduce:
| Superpoint Transformer 🧩🤖 |
|---|
| ✅ Much smaller problem complexity<br>✅ Geometry-guided compute effort allocation<br>✅ Fast training<br>✅ Lightweight model<br>✅ ⚡ Fast parallelized preprocessing<br>✅ ⚡ Transformer's expressivity and long-range interactions<br>✅ ⚡ Multi-scale reasoning on a hierarchical partition 🧩 |
The changes introduced in this work allow Superpoint Transformer (SPT) to match or surpass SOTA models with far fewer parameters and in a fraction of their training and inference time. Here are some SPT facts:
🚀 SOTA on S3DIS 6-Fold (76.0 mIoU)
🚀 SOTA on KITTI-360 Val (63.5 mIoU)
🚀 Near SOTA on DALES (79.6 mIoU)
📦 212k parameters (PointNeXt ÷ 200, Stratified Transformer ÷ 40)
⚡ S3DIS training in 3 GPU-hours (PointNeXt ÷ 7, Stratified Transformer ÷ 70)
⚡ Preprocessing 7× faster than SPG
This work builds on the idea that semantic parsing can be greatly accelerated by reasoning on superpoints, i.e., geometrically homogeneous pieces of the scene. This idea was previously explored in Superpoint Graph, which Superpoint Transformer substantially improves upon.
The interactive visualization below will help you get a sense of what our hierarchical partition structure looks like, on a sample scene from the DALES dataset.
Position RGB colors the points based on their 3D position.
Labels shows the semantic labels.
Predictions shows the model predictions.
Features 3D shows the point features used for superpoint partitioning.
Level 1 and Level 2 show the superpoint partitions and graphs (a toy sketch of this nested structure follows below).
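For illustration only, a hierarchical partition of this kind can be encoded as nested index arrays: each point maps to a level-1 superpoint, and each level-1 superpoint maps to a level-2 superpoint. This is a minimal sketch with hypothetical variable names, not the repository's actual data structures.

```python
# Toy sketch (hypothetical names, not the repository's actual data classes) of a
# two-level partition stored as nested index arrays.
import torch

point_to_l1 = torch.tensor([0, 0, 1, 1, 1, 2, 2, 3])  # [P] level-1 superpoint id per point
l1_to_l2 = torch.tensor([0, 0, 1, 1])                 # [S1] level-2 superpoint id per level-1 superpoint

# Composing the two maps gives each point's level-2 superpoint directly
point_to_l2 = l1_to_l2[point_to_l1]
print(point_to_l2)  # tensor([0, 0, 0, 0, 0, 1, 1, 1])
```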
Our model extends SPG with three key ideas: a parallelized partition algorithm to accelerate preprocessing, a hierarchical partition for multi-scale reasoning, and attention-based reasoning on the graphs of adjacent superpoints; a minimal sketch of the latter follows below.
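The snippet below is a rough, self-contained sketch (not the authors' actual implementation; the class and argument names are ours) of what attention restricted to a superpoint adjacency graph can look like in PyTorch: each superpoint attends only to the superpoints it is adjacent to.

```python
# Hedged sketch of single-head attention over a superpoint adjacency graph.
import torch
import torch.nn as nn


class SuperpointGraphAttention(nn.Module):
    """Attention restricted to the edges of a superpoint adjacency graph."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, edge_index):
        # x: [N, dim] superpoint features, edge_index: [2, E] (source, target) pairs
        src, dst = edge_index
        q, k, v = self.q(x), self.k(x), self.v(x)

        # One attention logit per edge: target superpoint attends to source superpoint
        logits = (q[dst] * k[src]).sum(-1) / x.shape[-1] ** 0.5

        # Softmax over each target's incoming edges, computed with scatter-style sums
        weights = (logits - logits.max()).exp()
        denom = torch.zeros(x.shape[0], device=x.device).index_add_(0, dst, weights)
        weights = weights / denom[dst].clamp(min=1e-12)

        # Weighted aggregation of neighbor values into each target superpoint
        return torch.zeros_like(v).index_add_(0, dst, weights.unsqueeze(-1) * v[src])


# Toy usage: 4 superpoints with 8-dimensional features on a small adjacency graph
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
out = SuperpointGraphAttention(8)(x, edge_index)  # [4, 8]
```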
Architecture of our Superpoint Transformer pipeline. Note that for training and metrics, we only need to classify superpoints, circumventing full-resolution operations.
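As a rough illustration of what superpoint-level supervision can look like (a hedged sketch with hypothetical variable names, not the exact loss used in the repository), point labels can be pooled into per-superpoint label histograms and the loss computed on superpoint logits only, never at full point resolution:

```python
# Hedged sketch: supervise superpoint logits with pooled point labels.
import torch
import torch.nn.functional as F

# point_labels: [P] ground-truth class per point, super_index: [P] superpoint id per point
point_labels = torch.tensor([0, 0, 2, 1, 1, 1])
super_index = torch.tensor([0, 0, 0, 1, 1, 1])
num_classes, num_super = 3, int(super_index.max()) + 1

# Pool point labels into one soft (histogram) target per superpoint
hist = torch.zeros(num_super, num_classes)
hist.index_add_(0, super_index, F.one_hot(point_labels, num_classes).float())
targets = hist / hist.sum(dim=1, keepdim=True)

# The loss only compares per-superpoint logits to per-superpoint targets
logits = torch.randn(num_super, num_classes)  # stand-in for the network's superpoint head
loss = F.cross_entropy(logits, targets)       # soft targets, supported since PyTorch 1.10
```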
Model size vs performance of 3D semantic segmentation methods on S3DIS 6-Fold. We observe that small, tailored models can offer a more flexible and sustainable alternative to large, generic models for 3D learning.
With training times of a few hours on a single GPU, SPT allows practitioners to easily customize the models to their specific needs, enhancing the overall usability and accessibility of 3D learning.
@inproceedings{robert2023spt,
title={{Efficient 3D Semantic Segmentation with Superpoint Transformer}},
author={Robert, Damien and Raguet, Hugo and Landrieu, Loic},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2023},
}
This work was funded by ENGIE Lab CRIGEN and carried out in the LASTIG research unit of Univ. Gustave Eiffel. It was supported by ANR project READY3D ANR-19-CE23-0007, and was granted access to the HPC resources of IDRIS under the allocation AD011013388R1 made by GENCI.
We thank Bruno Vallet, Romain Loiseau, and Ewelina Rupnik for inspiring discussions and valuable feedback.