Abstract

Superpoint Transformer

We introduce a novel superpoint-based transformer 🤖 architecture for efficient ⚡ semantic segmentation of large-scale 3D scenes. Our method incorporates a fast algorithm to partition point clouds into a hierarchical superpoint structure 🧩, which makes our preprocessing 7 times faster than existing superpoint-based approaches. Additionally, we leverage a self-attention mechanism to capture the relationships between superpoints at multiple scales, leading to state-of-the-art performance on three challenging benchmark datasets: S3DIS (76.0% mIoU 6-fold), KITTI-360 (63.5% on Val), and DALES (79.6%). With only 212k parameters 🦋, our approach is up to 200 times more compact than other state-of-the-art models while maintaining similar performance. Furthermore, our model can be trained on a single GPU in 3 hours ⚡ for a fold of the S3DIS dataset, which is 7× to 70× fewer GPU-hours than the best-performing methods. Our code and models are accessible at github.com/drprojects/superpoint_transformer.
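For intuition, the sketch below shows what a hierarchical superpoint partition means in practice: points are grouped into superpoints, whose centroids are in turn grouped into coarser superpoints. This is only an illustration; it uses plain k-means on point coordinates as a stand-in for the actual geometric partition algorithm used in SPT, and all names, cluster counts, and parameters are made up.

# Toy illustration of a hierarchical point cloud partition.
# NOTE: plain k-means on coordinates is used here as a stand-in for SPT's
# actual partition algorithm; names and cluster counts are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_partition(xyz, level_sizes=(512, 64, 8)):
    """Return, for each level, a per-point superpoint index.

    The first level groups raw points into superpoints; each further level
    groups the centroids of the previous level into coarser superpoints.
    """
    partitions = []
    elements = xyz                          # what gets clustered at this level
    point_to_element = np.arange(len(xyz))  # raw point -> current element
    for n_clusters in level_sizes:
        labels = KMeans(n_clusters=n_clusters, n_init=4,
                        random_state=0).fit_predict(elements)
        point_labels = labels[point_to_element]  # propagate to raw points
        partitions.append(point_labels)
        # Superpoint centroids become the elements of the next, coarser level
        elements = np.stack([elements[labels == c].mean(0)
                             for c in range(n_clusters)])
        point_to_element = point_labels
    return partitions

if __name__ == "__main__":
    cloud = np.random.rand(10000, 3).astype(np.float32)    # toy point cloud
    p1, p2, p3 = hierarchical_partition(cloud)
    print(p1.shape, int(p2.max()) + 1, int(p3.max()) + 1)  # (10000,) 64 8

In the actual pipeline, the partition is computed from the geometry of the point cloud with the fast algorithm mentioned above, but the nested structure of the resulting levels is the same.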

Motivation

This project aims to fuse the best of two worlds:

Transformer-based models 🤖
(Point Transformer, Stratified Transformer, ...)

✅ Expressivity
✅ Capture long-range interactions
❌ Compute effort guided by arbitrary point or voxel samplings
❌ Loads of parameters
❌ Long training

Superpoint-based models 🧩
(SPG, SSP+SPG, ...)

✅ Much smaller problem complexity
✅ Geometry-guided compute effort allocation
✅ Fast training
✅ Lightweight model
❌ Long preprocessing time
❌ GNNs' limited expressivity and lack of long-range interactions
❌ No hierarchical reasoning

To this end, we introduce Superpoint Transformer 🧩🤖:

✅ Much smaller problem complexity
✅ Geometry-guided compute effort allocation
✅ Fast training
✅ Lightweight model
❌ ➡ ✅ Fast parallelized preprocessing
❌ ➡ ✅ Transformer's expressivity and long-range interactions
❌ ➡ ✅ Multi-scale reasoning on a hierarchical partition 🧩

These changes allow SPT to match or surpass the performance of SOTA models with far fewer parameters and in a fraction of their training and inference time. Here are some SPT facts:

📊 SOTA on S3DIS 6-Fold (76.0 mIoU)
📊 SOTA on KITTI-360 Val (63.5 mIoU)
📊 Near SOTA on DALES (79.6 mIoU)
🦋 212k parameters (PointNeXt ÷ 200, Stratified Transformer ÷ 40)
⚡ S3DIS training in 3 GPU-hours (PointNeXt ÷ 7, Stratified Transformer ÷ 70)
⚡ Preprocessing 7× faster than SPG

Our model architecture replaces SPG's Graph Neural Networks with Transformer self-attention blocks, reasoning on a graph connecting adjacent superpoints.
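
To make this concrete, here is a minimal, self-contained sketch of single-head self-attention restricted to the edges of a superpoint adjacency graph, written in PyTorch. It is not our actual implementation: module and tensor names are made up, and it only illustrates the idea of attending over graph neighbors rather than over all pairs of points.

# Illustrative sketch only: single-head self-attention restricted to the
# edges of a superpoint adjacency graph (NOT the SPT implementation).
import torch
import torch.nn as nn

class GraphSelfAttention(nn.Module):
    """Each superpoint attends only to its neighbors in the adjacency graph."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, edge_index):
        # x: (N, dim) superpoint features
        # edge_index: (2, E) pairs (source, target), "target attends to source"
        src, tgt = edge_index
        q, k, v = self.q(x), self.k(x), self.v(x)

        # One unnormalized attention logit per graph edge
        logits = (q[tgt] * k[src]).sum(-1) * self.scale           # (E,)

        # Softmax over the incoming edges of each target superpoint
        num = (logits - logits.max()).exp()                       # stability
        den = torch.zeros(x.size(0), device=x.device).index_add_(0, tgt, num)
        alpha = num / den[tgt]                                    # (E,)

        # Attention-weighted aggregation of neighbor values
        out = torch.zeros_like(x)
        out.index_add_(0, tgt, alpha.unsqueeze(-1) * v[src])
        return out

if __name__ == "__main__":
    x = torch.randn(6, 32)                            # 6 superpoints, 32 features
    edges = torch.tensor([[0, 1, 2, 3, 4, 5, 1, 2],   # sources
                          [1, 0, 1, 2, 3, 4, 2, 3]])  # targets
    print(GraphSelfAttention(32)(x, edges).shape)     # torch.Size([6, 32])

In SPT, attention of this kind operates at every level of the superpoint hierarchy, which is what provides the multi-scale reasoning described above.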

SPT architecture

Visualizing the model size vs. performance of 3D semantic segmentation methods on S3DIS 6-Fold, we observe that small, tailored models can offer a more flexible and sustainable alternative to large, generic models for 3D learning.

Model size vs. performance

With training times of a few hours on a single GPU, SPT allows practitioners to easily customize the models to their specific needs, enhancing the overall usability and accessibility of 3D learning.

BibTeX

If you use all or part of this project, please cite the following paper:

@inproceedings{robert2023spt,
  title={Efficient 3D Semantic Segmentation with Superpoint Transformer},
  author={Robert, Damien and Raguet, Hugo and Landrieu, Loic},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023},
  url={https://github.com/drprojects/superpoint_transformer}
}

Acknowledgments 🙏

This work was funded by ENGIE Lab CRIGEN and carried out in the LASTIG research unit of Univ. Gustave Eiffel. It was supported by ANR project READY3D ANR-19-CE23-0007, and was granted access to the HPC resources of IDRIS under the allocation AD011013388R1 made by GENCI.

We thank Bruno Vallet, Romain Loiseau, and Ewelina Rupnik for inspiring discussions and valuable feedback.