Abstract
Recent works on 3D semantic segmentation propose to exploit the synergy between images 📸 and point clouds ☁ by processing each modality with a dedicated network and projecting learned 2D features onto 3D points. Merging large-scale point clouds and images raises several challenges, such as constructing a mapping between points and pixels, and aggregating features between multiple views. Current methods require mesh reconstruction or specialized sensors to recover occlusions, and use heuristics to select and aggregate available images. In contrast, we propose an end-to-end trainable multi-view aggregation model leveraging the viewing conditions of 3D points to merge features from images taken at arbitrary positions. Our method can combine standard 2D and 3D networks and outperforms both 3D models operating on colorized point clouds and hybrid 2D/3D networks without requiring colorization, meshing, or true depth maps. We set a new state-of-the-art for large-scale indoor/outdoor semantic segmentation on S3DIS (74.7 mIoU 6-Fold) and on KITTI-360 (58.3 mIoU). Our full pipeline is available on GitHub, and only requires raw 3D scans and a set of images and poses.
Motivation
Modern 3D scene acquisition systems often produce images 📸 along with point clouds ☁. Previous
works have already demonstrated that these modalities are complementary and that extracting
features from both benefits scene understanding.
As an example, it is not possible to distinguish a frame 🖼 from the wall 🧱 on which it is hung
solely based on the 3D geometric information in the point cloud ☁. However, the difference in
texture between these two objects can easily be identified in images 📸 of the same scene.
Conversely, when objects are poorly lit, occluded, or have uniform textures, it is much
easier to identify them by their 3D geometry ☁ than by their radiometry 📸.
You can play with the interactive visualization below to get an intuition for how the 3D and 2D
modalities relate. It shows a 3D spherical sample of an indoor scene from the
S3DIS dataset; the colored balls mark the positions of the equirectangular images, and the
corresponding images are displayed below the 3D plot. RGB
shows the points colorized from the images using human expertise and dedicated software. Labels
shows the expected output for semantic segmentation. Times seen
shows how many images see each point. Position RGB
colors the points based on their 3D position. In the images below, the mapping between 3D points
and their corresponding pixels is shown using the Times seen
color scheme. Note that computing such a mapping is a non-trivial task, as it requires recovering
occlusions. If you want to see the interactive visualization used in our poster, check out this
sample 👈.
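To give a rough idea of what computing such a mapping involves, here is a minimal sketch for a single pinhole camera, where occlusions are approximated with a coarse z-buffer over pixel bins. The camera model, the `cell` bin size, and all function names are illustrative assumptions for this sketch: equirectangular images like the ones above require a different projection model, and this is not meant to be the exact visibility computation of the full pipeline.

```python
import numpy as np

def point_pixel_mapping(points, K, R, t, img_hw, cell=4):
    """Map visible 3D points to pixel coordinates in a single image.

    Minimal sketch assuming a pinhole camera with intrinsics K and
    world-to-camera pose (R, t). Occlusions are approximated with a coarse
    z-buffer of `cell` x `cell` pixel bins: in each bin, only the closest
    projected point is kept as visible.
    """
    H, W = img_hw

    # World -> camera frame, keep points in front of the camera
    p_cam = points @ R.T + t                      # (N, 3)
    z = p_cam[:, 2]
    candidates = np.flatnonzero(z > 1e-6)

    # Camera frame -> pixel coordinates (perspective projection)
    proj = p_cam[candidates] @ K.T
    u = proj[:, 0] / proj[:, 2]
    v = proj[:, 1] / proj[:, 2]
    in_frame = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    candidates, u, v = candidates[in_frame], u[in_frame], v[in_frame]

    # Coarse z-buffer: within each pixel bin, the closest point wins
    depth = np.full((H // cell + 1, W // cell + 1), np.inf)
    owner = np.full(depth.shape, -1, dtype=int)
    for k, i in enumerate(candidates):
        bin_vu = (int(v[k]) // cell, int(u[k]) // cell)
        if z[i] < depth[bin_vu]:
            depth[bin_vu] = z[i]
            owner[bin_vu] = k

    # Indices of visible points and their pixel coordinates
    kept = np.unique(owner[owner >= 0])
    return candidates[kept], np.stack([u[kept], v[kept]], axis=1)
```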
Although several works already exist on multimodal learning from point clouds and images, we
found that they tend to overlook the multi-view problem in large-scene analysis. Indeed,
when analysing large 3D scenes with images, each point may be seen in multiple images, as
illustrated by the Times seen
mode of the visualization below. To
aggregate this multi-view information, one can use simple max-pooling or average-pooling schemes.
However, this disregards the fact that not all views of the same object are equal: views can be
far, close, sideways, front-facing, occluded, distorted, etc., and carry information of different
quality depending on these observation conditions.
Our core idea is simply to let a model learn to aggregate the multi-view information based on the
observation conditions 👀.
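As a rough illustration of this idea, the snippet below sketches a view-conditioned aggregation module: a small MLP scores each candidate view from a handful of observation-condition descriptors (e.g. viewing distance, viewing angle, occlusion cues), and the resulting softmax weights pool the per-view 2D features of each point. The feature and descriptor dimensions, as well as the module and argument names, are illustrative assumptions rather than the exact architecture of our model.

```python
import torch
import torch.nn as nn

class ViewConditionedAggregation(nn.Module):
    """Merge the per-view image features of each 3D point into one vector.

    Instead of max or average pooling, a small MLP scores each view from its
    observation conditions, and the scores weight the corresponding 2D features.
    Dimensions below are illustrative.
    """

    def __init__(self, cond_dim=8, hidden=64):
        super().__init__()
        self.score_mlp = nn.Sequential(
            nn.Linear(cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, view_feats, view_conds, view_mask):
        """
        view_feats: (P, V, F) 2D features of each point's candidate views
        view_conds: (P, V, C) observation conditions of each view
        view_mask:  (P, V)    True where the view actually sees the point
        """
        scores = self.score_mlp(view_conds).squeeze(-1)         # (P, V)
        scores = scores.masked_fill(~view_mask, float('-inf'))  # drop unseen views
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)    # (P, V, 1)
        weights = torch.nan_to_num(weights)                     # points seen by no view
        return (weights * view_feats).sum(dim=1)                # (P, F)
```

Compared to max or average pooling, the learned weights let the model favour close, front-facing, unoccluded views and down-weight the rest.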