Abstract


Recent works on 3D semantic segmentation propose to exploit the synergy between images 📸 and point clouds ☁ by processing each modality with a dedicated network and projecting learned 2D features onto 3D points. Merging large-scale point clouds and images raises several challenges, such as constructing a mapping between points and pixels, and aggregating features between multiple views. Current methods require mesh reconstruction or specialized sensors to recover occlusions, and use heuristics to select and aggregate available images. In contrast, we propose an end-to-end trainable multi-view aggregation model leveraging the viewing conditions of 3D points to merge features from images taken at arbitrary positions. Our method can combine standard 2D and 3D networks and outperforms both 3D models operating on colorized point clouds and hybrid 2D/3D networks without requiring colorization, meshing, or true depth maps. We set a new state-of-the-art for large-scale indoor/outdoor semantic segmentation on S3DIS (74.7 mIoU 6-Fold) and on KITTI-360 (58.3 mIoU). Our full pipeline is available on GitHub and only requires raw 3D scans and a set of images and poses.

Motivation

Modern 3D scene acquisition systems often produce images 📸 along with point clouds ☁. Previous works have already demonstrated that these modalities are complementary and that extracting features from both benefits scene understanding.

As an example, it is not possible to distinguish a frame 🖼 from the wall 🧱 on which it is hung solely based on the 3D geometric information in the point cloud ☁. However, the difference in texture between these two objects can easily be identified in images 📸 of the same scene.

Conversely, when objects are poorly lit, occluded, or have uniform textures, it is much easier to identify them from their 3D geometry ☁ than from their radiometry 📸.

You can play with the interactive visualization below to get an intuition of how the 3D and 2D modalities relate. It shows a 3D spherical sample of an indoor scene from the S3DIS dataset. The colored balls mark the positions of the equirectangular images, which are displayed below the 3D plot. RGB shows the points colorized from the images using human expertise and dedicated software. Labels shows the expected output for semantic segmentation. Times seen shows how many images see each point. Position RGB colors the points based on their 3D position. In the images below, the mapping between 3D points and their corresponding pixels is shown using the Times seen color scheme. Note that computing such a mapping is a non-trivial task, as it requires recovering occlusions. If you want to see the interactive visualization used in our poster, check out this sample 👈.

Although several works already exist on multimodal learning from point clouds and images, we found that they tend to overlook the multi-view problem in large-scene analysis. Indeed, when analysing a large 3D scene with images, each point may be seen in multiple images. You can see an illustration of this in the Times seen mode of the visualization below. To aggregate this multi-view information, one can use simple max-pooling or average-pooling schemes. However, this disregards the fact that not all views of the same object are equal. Views can be far away, close up, sideways, front-facing, occluded, distorted, etc., and carry information of varying quality depending on these observation conditions.

Our core idea is simply to let a model learn to aggregate the multi-view information based on the observation conditions 👀.
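
To make the idea concrete, below is a minimal PyTorch-style sketch of what "learning to aggregate" could look like: a small MLP scores each view from hand-crafted observation descriptors (distance, viewing angle, pixel footprint, etc.), and the scores become attention weights over the per-view image features. The class name, layer sizes and descriptors are hypothetical illustrations, not the exact DeepViewAgg architecture.

import torch
import torch.nn as nn

class ViewConditionAttention(nn.Module):
    # Toy learned multi-view aggregation: score each view from its observation
    # conditions and softmax-weight the corresponding 2D features.
    # Hypothetical layer sizes, not the paper's exact model.
    def __init__(self, cond_dim: int, hidden: int = 32):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, view_feats, view_conds, view_mask):
        # view_feats: (P, V, F) 2D features projected onto P points from V candidate views
        # view_conds: (P, V, C) observation conditions for each point/view pair
        # view_mask:  (P, V)   True where the view actually sees the point
        scores = self.scorer(view_conds).squeeze(-1)            # (P, V)
        scores = scores.masked_fill(~view_mask, float("-inf"))  # discard unseen views
        weights = torch.softmax(scores, dim=1)                  # attention over views
        weights = torch.nan_to_num(weights)                     # points seen by no view -> 0
        return (weights.unsqueeze(-1) * view_feats).sum(dim=1)  # (P, F) fused feature

# Example: 1000 points, up to 8 views each, 64-dim image features, 5 condition descriptors
agg = ViewConditionAttention(cond_dim=5)
fused = agg(torch.randn(1000, 8, 64), torch.randn(1000, 8, 5), torch.rand(1000, 8) > 0.3)

Replacing max-pooling or average-pooling with such a learned, condition-aware weighting is what lets the network favor close, front-facing, unoccluded views over distant or grazing ones.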

After playing with this tool, you may have noticed several things:

  • 3D point clouds ☁ and 2D images 📸 carry complementary information. This justifies trying to extract features from both modalities.
  • Views of the same object do not carry the same amount of information, depending on their observation conditions 👀. Incidentally, this implies that the colorization scheme used to assign RGB colors to the points is non-trivial, although it is often taken for granted by 3D semantic segmentation methods.
  • Mapping 3D points to the pixels of 2D images taken in the wild requires recovering occlusions. This means computing what is called a visibility model, which is easy if you have a mesh of the scene or depth cameras. But mesh reconstruction is not a simple task, and not all acquisition systems are equipped with depth cameras. A minimal sketch of such a mapping is given after this list.
  • Some images 📸 have only a small portion of their pixels actually linked to the 3D spherical sample at hand. This means that extracting feature maps from entire images with CNNs may waste a lot of computation.
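
To illustrate the visibility problem, here is a minimal NumPy sketch of a mesh-free, depth-camera-free point-to-pixel mapping: points are projected into one image with its pose and intrinsics, each projection is splatted over a few pixels, and a z-buffer keeps the closest point per pixel. The function name and the fixed splat size are hypothetical simplifications; the visibility model used in the paper is more elaborate.

import numpy as np

def visible_point_pixel_map(points, K, R, t, image_hw, splat=2):
    # points: (N, 3) world coordinates; K: (3, 3) intrinsics; R, t: world-to-camera pose.
    # Returns owner[v, u] = index of the point visible at that pixel, or -1.
    H, W = image_hw
    cam = (R @ points.T + t.reshape(3, 1)).T           # world -> camera coordinates
    in_front = cam[:, 2] > 0
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                        # perspective division
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    depth = np.full((H, W), np.inf)                    # z-buffer
    owner = np.full((H, W), -1, dtype=int)
    for i in np.flatnonzero(valid):
        u0, u1 = max(u[i] - splat, 0), min(u[i] + splat + 1, W)
        v0, v1 = max(v[i] - splat, 0), min(v[i] + splat + 1, H)
        closer = cam[i, 2] < depth[v0:v1, u0:u1]       # depth test on the splat window
        depth[v0:v1, u0:u1][closer] = cam[i, 2]
        owner[v0:v1, u0:u1][closer] = i
    return owner

Splatting each point over a small window is what lets nearer points occlude farther ones even though the point cloud has no surface: without it, background points would leak through the gaps between foreground samples.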

Our paper addresses these four points to learn multimodal aggregation without meshes or depth cameras.

Video

BibTex

If you use all or part of this project, please cite the following paper:

@inproceedings{robert2022dva,
    title={Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation},
    author={Robert, Damien and Vallet, Bruno and Landrieu, Loic},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={5575--5584},
    year={2022},
    url={https://github.com/drprojects/DeepViewAgg}
}

Acknowledgments 🙏

This work was funded by ENGIE Lab CRIGEN and carried out in the LASTIG research unit of Univ. Gustave Eiffel.

We thank AI4GEO for sharing their computing resources. AI4GEO is a project funded by the French future investment program led by the Secretariat General for Investment and operated by public investment bank Bpifrance.

We thank Philippe Calvez, Dmitriy Slutskiy, Marcos Gomes-Borges and Gisela Lechuga from ENGIE Lab CRIGEN, and Romain Loiseau, Vivien Sainte Fare Garnot and Ewelina Rupnik from the LASTIG lab at IGN, for inspiring discussions and valuable feedback.