Recent works on 3D semantic segmentation propose to exploit the synergy between images 📸 and point clouds ☁️ by processing each modality with a dedicated network and projecting learned 2D features onto 3D points. Merging large-scale point clouds and images raises several challenges, such as constructing a mapping between points and pixels, and aggregating features between multiple views. Current methods require mesh reconstruction or specialized sensors to recover occlusions, and use heuristics to select and aggregate available images. In contrast, we propose DeepViewAgg, an end-to-end trainable multi-view aggregation model that leverages the viewing conditions of 3D points to merge features from images taken at arbitrary positions. Our method can combine standard 2D and 3D networks and outperforms both 3D models operating on colorized point clouds and hybrid 2D/3D networks, without requiring colorization, meshing, or true depth maps. We set a new state of the art for large-scale indoor/outdoor semantic segmentation on S3DIS (74.7 mIoU, 6-fold) and on KITTI-360 (58.3 mIoU). Our full pipeline is available on GitHub and only requires raw 3D scans and a set of images and poses.
DeepViewAgg uses a simple visibility model to project 2D features onto 3D points and learns to leverage viewing conditions to aggregate multi-view information.
Modern 3D scene acquisition systems often produce images 📸 along with point clouds ☁️. Previous works have already demonstrated that these modalities are complementary and that extracting features from both benefits scene understanding. For example, it is not possible to distinguish a picture frame 🖼 from the wall 🧱 on which it is hung based solely on the 3D geometry of the point cloud ☁️. However, the difference in texture between these two objects is easy to identify in images 📸 of the same scene. Conversely, when objects are poorly lit, occluded, or have uniform textures, it is much easier to identify them by their 3D geometry ☁️ than by their radiometry 📸. To build an intuition of how the 3D and 2D modalities relate, see the interactive example below.
Sample scene from the S3DIS dataset.
Colored balls mark the positions of the equirectangular pictures, with the corresponding images displayed below.
RGB shows the colorized point cloud.
Labels shows the semantic labels.
Times seen shows how many images see each point.
Position RGB colors the points based on their 3D position.
In the images below, the mapping between 3D points and
the corresponding pixels is shown using the Times seen color scheme.
Note that computing such a mapping is a non-trivial task as it requires a visibility model to recover occlusions.
Although several works already exist on multimodal learning from point clouds and images, we
found that these tend to overlook the multi-view problem in large-scene analysis. Indeed,
when analyzing large 3D scenes with images, each point may be seen in multiple images; the
Times seen mode of the visualization above illustrates this. To aggregate this multi-view
information, one can use simple max-pooling or average-pooling schemes. However, this
disregards the fact that not all views of the same object are equal. Views can be far, close,
sideways, front-facing, occluded, distorted, etc., and carry information of different quality
depending on these observation conditions.
Our core idea is simply to let a model learn to aggregate the multi-view information based on the
observation conditions 👀.
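To make this concrete, here is a minimal PyTorch sketch of such a learned aggregation: a small MLP scores each point-view pair from a handful of observation features (e.g. viewing distance and angle), and each point's views are blended with the resulting softmax weights. The module name, feature choices, and tensor layout are illustrative assumptions, not the exact architecture of the paper.

import torch
import torch.nn as nn

class ViewConditionedAggregation(nn.Module):
    """Blend the multi-view 2D features of each 3D point, weighted by a score
    predicted from the view's observation conditions.
    Illustrative sketch only, not the exact DeepViewAgg architecture."""

    def __init__(self, cond_dim):
        super().__init__()
        # Small MLP scoring each point-view pair from its observation conditions
        self.scorer = nn.Sequential(
            nn.Linear(cond_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, view_feats, view_conds, point_idx, n_points):
        # view_feats: [V, C] 2D features sampled at the pixel seen by each point-view pair
        # view_conds: [V, D] observation conditions, e.g. viewing distance, angle, pixel footprint
        # point_idx:  [V]    index of the 3D point each view refers to
        scores = self.scorer(view_conds).squeeze(-1)               # [V] one score per view
        out = view_feats.new_zeros(n_points, view_feats.shape[1])  # [n_points, C]
        for p in range(n_points):  # per-point loop for clarity; use scatter ops in practice
            mask = point_idx == p
            if mask.any():
                w = torch.softmax(scores[mask], dim=0)             # attention over this point's views
                out[p] = (w.unsqueeze(-1) * view_feats[mask]).sum(0)
        return out  # aggregated image features, ready to be fused with 3D point features

In the full model, such aggregated image features are combined with the features of a standard 3D network operating on the points; the sketch above only covers the view-pooling step.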
After playing with this tool, you may have noticed several things:
🔍 3D point clouds ☁️ and 2D images 📸 carry complementary information. This justifies trying to
extract features from both modalities.
🔍 Views of the same object do not carry the same amount of information, depending on their
observation conditions 👀. Incidentally, this also implies that the colorization scheme used to
assign RGB colors to the points is non-trivial, although it is often taken for granted by 3D
semantic segmentation methods.
🔍 Mapping 3D points to pixels of 2D images taken in the wild requires recovering occlusions.
This means computing what is called a visibility model, which is straightforward if you have a
mesh of the scene or depth cameras. However, mesh reconstruction is not a simple task, and not
all acquisition systems are equipped with depth cameras (a minimal depth-buffer alternative is
sketched right after this list).
🔍 Some images 📸 have only a small portion of their pixels actually linked to the 3D spherical
sample at hand. This means that extracting feature maps from entire images with CNNs may require
a lot of unnecessary computation (see the cropping sketch further below).
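To make the visibility point concrete, here is a minimal NumPy sketch of the kind of depth-buffer check that can be computed directly from the point cloud, without a mesh or a depth camera: points are projected into a pinhole camera, the closest point claims each pixel cell, and only points close to that reference depth are kept. The function name, camera model, and margin parameter are illustrative assumptions, not the exact visibility model of the paper.

import numpy as np

def visible_points(points, R, t, K, img_size, margin=0.05):
    """Map 3D points to the pixels of one image and keep only unoccluded ones,
    using a coarse z-buffer built from the point cloud itself.
    Simplified sketch: pinhole camera, no lens distortion.
      points:   [N, 3] world coordinates
      R, t:     world-to-camera rotation [3, 3] and translation [3]
      K:        [3, 3] camera intrinsics
      img_size: (width, height) in pixels"""
    w, h = img_size
    p_cam = points @ R.T + t                     # world -> camera frame
    z = p_cam[:, 2]
    valid = z > 1e-6                             # discard points behind the camera

    uv = np.full((len(points), 2), -1.0)
    uv[valid] = (p_cam[valid] @ K.T)[:, :2] / z[valid, None]   # perspective projection
    in_img = valid & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    idx = np.where(in_img)[0]

    # Coarse z-buffer: the closest point seen by each pixel cell sets the reference depth
    ui, vi = uv[idx, 0].astype(int), uv[idx, 1].astype(int)
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (vi, ui), z[idx])

    # A point is visible if its depth is close enough to the z-buffer at its pixel
    visible = np.zeros(len(points), dtype=bool)
    visible[idx] = z[idx] <= zbuf[vi, ui] * (1.0 + margin)
    return visible, uv                           # visibility mask [N], pixel coords [N, 2]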
Our paper addresses these four points to efficiently learn multi-modal aggregation at scale,
without colorization, meshing, or depth cameras.
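Finally, as a small illustration of the last point about wasted computation, one simple option is to crop each image to the bounding box of the pixels actually mapped to the 3D sample before running the 2D network. The helper below is a hypothetical sketch of that idea, reusing the pixel coordinates returned by the visibility sketch above; the padding value is arbitrary.

import numpy as np

def crop_to_mapped_pixels(image, uv, pad=16):
    """Crop an image to the bounding box of the pixels mapped to the 3D sample,
    so that the 2D network only processes the relevant region (illustrative helper).
      image: [H, W, 3] array
      uv:    [M, 2] pixel coordinates of the mapped (visible) points
      pad:   extra context kept around the box, in pixels"""
    h, w = image.shape[:2]
    u_min, v_min = np.floor(uv.min(axis=0)).astype(int) - pad
    u_max, v_max = np.ceil(uv.max(axis=0)).astype(int) + pad
    u_min, v_min = max(u_min, 0), max(v_min, 0)
    u_max, v_max = min(u_max, w), min(v_max, h)
    return image[v_min:v_max, u_min:u_max]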
@inproceedings{robert2022deepviewagg,
title={{Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation}},
author={Robert, Damien and Vallet, Bruno and Landrieu, Loic},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2022}
}
This work was funded by ENGIE Lab CRIGEN and carried out in the LASTIG research unit of Univ. Gustave Eiffel. We thank AI4GEO for sharing their computing resources. AI4GEO is a project funded by the French future investment program, led by the Secretariat General for Investment and operated by the public investment bank Bpifrance.
We thank Philippe Calvez, Dmitriy Slutskiy, Marcos Gomes-Borges, and Gisela Lechuga from ENGIE Lab CRIGEN, and Romain Loiseau, Vivien Sainte Fare Garnot, and Ewelina Rupnik from the LASTIG lab at IGN for inspiring discussions and valuable feedback.