Multi-Sensor Fusion for Semantic Scene Understanding

Title Multi-Sensor Fusion for Semantic Scene Understanding
Summary Multi-Sensor Fusion for Semantic Scene Understanding
TimeFrame Fall 2022
Author Anton Persson
Supervisor Eren Erdal Aksoy
Level Master
Status Open

Our current robot setup has two cameras: an RGB-D camera fixed on the wall, which provides a third-person view of the scene, and an RGB camera attached to the robot's wrist, which provides images without any depth cues from the robot's point of view (i.e., a first-person view). Assume that several objects are placed on the table in front of the robot. The robot first acquires a 3D point cloud from the RGB-D camera. Because the scene is cluttered, some objects in this point cloud will be occluded. The robot should then reason about where around the table to move its arm so that the wrist camera increases the information gain about the scene. The first task is therefore to convert the RGB images from the wrist camera into RGB-D format using state-of-the-art depth estimation networks. Once this is done, a fusion step merges the two point clouds: one from the wrist camera and one from the fixed camera. The robot should also decide autonomously how many additional wrist-camera images (e.g., two more) it needs in order to detect more objects in the scene. After each fusion step, the robot should estimate the 6D pose of each detected object. This topic is mainly about computer vision and AI.
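
To make the depth-to-point-cloud and fusion steps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the project's implementation: the function names (backproject_depth, transform_points, fuse_clouds) are hypothetical, the wrist camera is assumed to follow a pinhole model with known intrinsics fx, fy, cx, cy, the estimated depth is assumed to be metric, and the camera-to-base transform is assumed to be available (e.g., from calibration and the arm's kinematics). The voxel-based merge is a simple stand-in for a proper registration method.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy):
    """Turn an (H, W) metric depth map into an (N, 3) cloud in the camera frame.

    Assumes a pinhole camera; fx, fy, cx, cy are the intrinsics in pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    valid = depth > 0  # non-positive depth is treated as "no measurement"
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def transform_points(points, T):
    """Apply a 4x4 homogeneous transform T to (N, 3) points."""
    return points @ T[:3, :3].T + T[:3, 3]


def fuse_clouds(cloud_a, cloud_b, voxel=0.01):
    """Merge two clouds given in the same frame, keeping one point per voxel.

    This is a naive merge (assumption): it does not refine the alignment.
    """
    merged = np.vstack([cloud_a, cloud_b])
    keys = np.floor(merged / voxel).astype(np.int64)  # integer voxel index per point
    _, first = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(first)]  # keep the first point seen in each voxel
```

In use, the wrist-camera depth map would be back-projected, moved into the fixed camera's frame with transform_points, and then passed to fuse_clouds together with the RGB-D point cloud. Many monocular depth networks predict depth only up to an unknown scale, so a scale-recovery step may be needed before the two clouds can be merged.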