CMRNet++ Demo

This demo contains localization examples on the KITTI, Argoverse, and Lyft Level 5 datasets. Select a dataset to load from the drop-down box below. CMRNet++ is trained on both the KITTI and Argoverse datasets. All examples depict places that were never seen during training. The Lyft Level 5 dataset was not included in the training set; we use it to evaluate the generalization ability of CMRNet++ without any retraining. Please see the Abstract section below for more details.


Given a pre-existing LiDAR map (for example, an HD map from map providers), a single RGB image, and a rough initial position estimate (e.g., obtained from GPS), we project the map's points onto a virtual image plane placed at this initial position. If we overlay the resulting depth image on the RGB image, the two points of view differ strongly when the initial position is inaccurate. Click on any image to see the map projected from the position estimated by CMRNet++. Our approach can effectively localize a single RGB image in any environment for which a LiDAR map is available, without any retraining or fine-tuning.
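The projection step described above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera without lens distortion and keeps only the nearest point per pixel; the function name, the argument names, and the dictionary-based depth image are hypothetical choices made for this illustration and are not taken from the CMRNet++ code.

```python
def project_map_to_depth_image(points_world, R, t, fx, fy, cx, cy, width, height):
    """Project 3D map points into a virtual pinhole camera (illustrative sketch).

    R (3x3 nested list) and t (length-3 list) map world coordinates into the
    camera frame: p_cam = R * p_world + t. fx, fy, cx, cy are the pinhole
    intrinsics. Returns a sparse depth image as a dict {(u, v): depth}.
    """
    depth = {}
    for X, Y, Z in points_world:
        # Transform the map point into the frame of the virtual camera.
        xc = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
        yc = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
        zc = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]
        if zc <= 0:
            continue  # the point lies behind the camera
        # Pinhole projection, rounded to the nearest pixel.
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if 0 <= u < width and 0 <= v < height:
            # Assumption: when several points hit one pixel, keep the closest.
            if (u, v) not in depth or zc < depth[(u, v)]:
                depth[(u, v)] = zc
    return depth
```

Calling this function with the rough initial pose (for example, from GPS) would yield the depth image that is overlaid on the RGB image; the less accurate the pose, the larger the visible misalignment.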

Please Select a Dataset:

Selected Dataset:

KITTI

Abstract

Localization is a critical enabler of autonomous robots. While deep learning has made significant strides in many computer vision tasks, it has yet to make a sizeable impact on the capabilities of metric visual localization. One of the major hindrances has been the inability of existing CNN-based pose regression methods to generalize to previously unseen places. Our recently introduced CMRNet effectively addresses this limitation by enabling map-independent monocular localization in LiDAR-maps.

In this work, we take it a step further by introducing CMRNet++, a significantly more robust model that not only generalizes effectively to new places, but is also independent of the camera parameters. We enable this capability by moving any metric reasoning outside of the learning process. Extensive evaluations of CMRNet++ on three challenging autonomous driving datasets, namely KITTI, Argoverse, and Lyft Level 5, demonstrate that our network outperforms CMRNet as well as other baselines by a large margin. More importantly, for the first time, we demonstrate the ability of a deep learning approach to accurately localize in a completely new environment, without any retraining or fine-tuning and independent of the camera parameters.

How Does It Work?

We present our novel CMRNet++ approach for camera to LiDAR-map registration. We extend our previously proposed CMRNet model, which localizes independently of the map, so that it is also independent of the camera intrinsics. Unlike existing state-of-the-art CNN-based methods for pose regression, CMRNet does not learn the map; instead, it learns to match images to a pre-existing map. Consequently, CMRNet can be used in any environment for which a LiDAR-map is available. However, since the output of CMRNet is metric (a 6-DoF rigid body transformation from an initial pose), the weights of the network are tied to the intrinsic parameters of the camera that was used for collecting the training data. In this work, we mitigate this problem by decoupling the localization into two steps: a pixel to 3D point matching step, followed by a pose regression step.


Network architecture
Figure: Outline of our proposed CMRNet++ framework. The input RGB image (a) and LiDAR-image (b) are fed to our CMRNet++, which predicts pixel displacements between the two inputs (c). The predicted matches (d) are used to localize the camera using a PnP+RANSAC scheme.

In the first step, the CNN focuses only on matching at the pixel level rather than on a metric basis, which makes the network independent of the intrinsic parameters of the camera. These parameters are instead employed in the second step, in which traditional computer vision methods estimate the pose of the camera given the matches from the first step. Consequently, after training, CMRNet++ can also be used with cameras and maps different from those used during training. We generate a synthesized depth image, which we refer to as a LiDAR-image, by projecting the LiDAR map into a virtual image plane placed at an initial pose (for example, obtained from GPS). We feed this LiDAR-image and the RGB image from the camera to CMRNet++, which estimates, for every 3D point in the LiDAR-image, the pixel of the RGB image that represents the same world point. To localize the camera given the set of 2D-3D correspondences estimated by CMRNet++, we use the EPnP algorithm within a RANSAC scheme. We evaluate our model on the KITTI, Argoverse, and Lyft Level 5 datasets and demonstrate that our approach exceeds state-of-the-art methods while being agnostic to map and camera parameters.
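As a rough illustration of how the predicted pixel displacements could be turned into the 2D-3D correspondences used in the second step, consider the following sketch. It is not the authors' implementation: the function name, the dictionary-based data layout, and the sign convention (RGB pixel = LiDAR-image pixel + displacement) are assumptions made for this example, and the 3D points are recovered by back-projecting the LiDAR-image with an ideal pinhole model.

```python
def matches_from_displacements(lidar_image, displacements, fx, fy, cx, cy):
    """Build 2D-3D correspondences from per-pixel displacements (illustrative sketch).

    lidar_image:   {(u, v): depth}, the sparse depth image of the projected map.
    displacements: {(u, v): (du, dv)}, the predicted offset from a LiDAR-image
                   pixel to the matching RGB pixel (assumed sign convention).
    Returns a list of (rgb_pixel, point_3d) pairs, where point_3d is expressed
    in the frame of the virtual camera placed at the initial pose.
    """
    matches = []
    for (u, v), depth in lidar_image.items():
        if (u, v) not in displacements:
            continue  # no prediction available for this pixel
        du, dv = displacements[(u, v)]
        # Back-project the LiDAR-image pixel to a 3D point (pinhole model).
        point_3d = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
        # The matching RGB pixel is the LiDAR-image pixel shifted by the offset.
        rgb_pixel = (u + du, v + dv)
        matches.append((rgb_pixel, point_3d))
    return matches
```

The resulting pairs could then be passed to a PnP solver inside a RANSAC loop; OpenCV, for instance, provides `cv2.solvePnPRansac`, which accepts `flags=cv2.SOLVEPNP_EPNP` to select the EPnP solver. Because the camera intrinsics enter only at this stage, the matching network itself never depends on them.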

Videos

Publications

Daniele Cattaneo, Domenico Giorgio Sorrenti, Abhinav Valada,
CMRNet++: Map and Camera Agnostic Monocular Visual Localization in LiDAR Maps
IEEE International Conference on Robotics and Automation (ICRA) Workshop on Emerging Learning and Algorithmic Methods for Data Association in Robotics, 2020.

(Pdf) (Bibtex)


People