Monocular Visual SLAM
Implementing monocular visual SLAM from scratch
I programmed a monocular visual SLAM system because mapping with just a camera seemed incredibly useful. Here is the source code.

Monocular Visual Odometry
First things first, we need a visual odometry system. To build one, we compute and match sparse features (such as ORB) across image frames and then recover the rotation and translation between frames using epipolar geometry. With a calibrated camera, we can estimate the essential matrix from the 2D point correspondences between two frames, then decompose it into a rotation and a translation. Chaining these relative transforms together gives odometry. Keep in mind that a monocular setup has an inherent scale ambiguity: the translation between frames can only be recovered up to an unknown scale factor, so its direction is known but not its magnitude.
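As a sketch of the decomposition step, here is a numpy-only toy (not the actual implementation; in practice OpenCV's recoverPose handles this, including the cheirality check that picks the correct candidate). It factors an essential matrix into its four candidate rotation/translation pairs:

```python
import numpy as np

def skew(v):
    """Cross-product matrix, so skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def decompose_essential(E):
    """Factor an essential matrix into its four (R, t) candidates.
    The true pose is the one for which triangulated points lie in
    front of both cameras (the cheirality check)."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations; E is only defined up to sign anyway.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]  # translation direction only; the scale is unobservable
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

The essential matrix itself is estimated from the matched (normalized) points, typically with the five-point algorithm inside a RANSAC loop; OpenCV wraps this as cv2.findEssentialMat.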
Mapping Points By Triangulation
We can map these 2D features into the world as 3D map points by triangulating the point correspondences between two views. This reduces to a linear least-squares problem (the direct linear transform, or DLT), which can be solved with the SVD.
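A minimal numpy sketch of that linear algebra, assuming the usual convention that a 3x4 projection matrix P maps a homogeneous 3D point to pixels (the matrices in the test are made up for illustration):

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """DLT triangulation: each view contributes two linear equations
    on the homogeneous 3D point; solve the stacked 4x4 system by SVD."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]            # null-space vector = homogeneous solution
    return X[:3] / X[3]   # dehomogenize
```

OpenCV offers the same operation as cv2.triangulatePoints, vectorized over many correspondences at once.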
Optimization By Bundle Adjustment
The core of visual SLAM is applying bundle adjustment (BA) to refine both the camera poses and the map points. This means minimizing the reprojection error of the map points against the observed points in the images taken from those camera poses. The minimization itself can be handled by a graph optimization library such as g2o. Setting that up is easy enough; the hard part is choosing where and when to apply BA, because it is a very expensive procedure. I am experimenting with applying BA over a local window of recent frames. Another approach would be to select the frames to optimize via a covisibility graph.
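To make the objective concrete, here is a numpy-only toy, not the g2o setup described above: it refines a single map point by Gauss-Newton on its reprojection error while holding the camera poses fixed (structure-only refinement). Full BA stacks the same residuals over all points and poses and exploits the sparsity of the resulting system; the cameras and point below are made up for illustration.

```python
import numpy as np

def project(P, X):
    """Pinhole projection of a 3D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def refine_point(cams, obs, X0, iters=10):
    """Gauss-Newton on the reprojection error of one map point,
    camera poses held fixed. Residual = predicted - observed pixels."""
    X = np.asarray(X0, dtype=float).copy()
    for _ in range(iters):
        r = np.concatenate([project(P, X) - z for P, z in zip(cams, obs)])
        J = np.zeros((r.size, 3))
        eps = 1e-6
        for j in range(3):  # numerical Jacobian w.r.t. the point
            Xp = X.copy()
            Xp[j] += eps
            rp = np.concatenate([project(P, Xp) - z
                                 for P, z in zip(cams, obs)])
            J[:, j] = (rp - r) / eps
        X -= np.linalg.solve(J.T @ J, J.T @ r)  # normal equations step
    return X
```

Real BA solvers use analytic Jacobians and the Schur complement trick to keep the joint pose-plus-point problem tractable; this toy only shows what quantity is being minimized.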
Loop Closure
I have not yet implemented a loop closure mechanism as of 6/14/2024, but I want to do so in the future. The idea is to apply a visual place recognition module: detect when the camera revisits a previously mapped place, then add a constraint that corrects the accumulated drift.
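The standard recipe is bag-of-visual-words retrieval (e.g. DBoW2 over ORB descriptors). A deliberately tiny numpy sketch of the idea, using made-up 2-D descriptors and a four-word vocabulary rather than real binary ORB descriptors:

```python
import numpy as np

def bow_histogram(descriptors, vocab):
    """Quantize each descriptor to its nearest 'visual word' and
    return an L2-normalized word histogram for the frame."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)
    h = np.bincount(words, minlength=len(vocab)).astype(float)
    return h / (np.linalg.norm(h) + 1e-12)

def place_similarity(desc_a, desc_b, vocab):
    """Cosine similarity between two frames' bag-of-words signatures;
    a high score flags a loop closure candidate for verification."""
    return float(bow_histogram(desc_a, vocab) @ bow_histogram(desc_b, vocab))
```

Real systems learn a large vocabulary tree offline, score frames with TF-IDF weighting, and geometrically verify candidates before closing the loop.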
Summary
So far, I was able to build a 3D map that was optimized through global BA. The most frustrating part about monocular SLAM was making sure that nothing was implemented backwards (inverted transforms, swapped coordinate conventions, and so on). Somehow I was able to make it work with some trial and error. More engineering is needed to control scale and improve the performance of the SLAM system, but I’ll take what I came up with.
Video demonstrating the monocular visual SLAM pipeline on the KITTI dataset: