
Robust stereo visual odometry: A comparison of random sample consensus algorithms based on three major hypothesis generators

Published online by Cambridge University Press:  12 May 2022

Guangzhi Guo
Affiliation:
Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai, China University of Chinese Academy of Sciences, Beijing, China
Zuoxiao Dai*
Affiliation:
Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai, China
Yuanfeng Dai
Affiliation:
Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai, China University of Chinese Academy of Sciences, Beijing, China
*
*Corresponding author. E-mail: tcgd_daizx@163.com

Abstract

Almost all robust stereo visual odometry work uses the random sample consensus (RANSAC) algorithm for model estimation in the presence of noise and outliers. To date, there have been few comparative studies evaluating the performance of RANSAC algorithms based on different hypothesis generators. In this work, we analyse and compare three popular and efficient RANSAC schemes. They differ mainly in whether they use the two-dimensional (2-D) data points measured directly or the three-dimensional (3-D) data points inferred through triangulation. We present several quantitative experiments comparing the accuracy, robustness and efficiency of each scheme under varying levels of noise and different percentages of outliers. The results suggest that in the presence of noise and outliers, the perspective-three-point RANSAC provides more accurate and robust pose estimates. In the absence of noise, however, the iterative closest point RANSAC obtains better results regardless of the percentage of outliers. In terms of efficiency, measured by the number of RANSAC iterations, the perspective-three-point RANSAC is relatively faster under low noise levels and low percentages of outliers; otherwise, the iterative closest point RANSAC may be computationally more efficient.

Type
Research Article
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press on behalf of The Royal Institute of Navigation

1. Introduction

As a fundamental problem of robot navigation, visual odometry (VO) has been studied for decades (Chen et al., Reference Chen, Mei and Wan2021). The basic principle of VO is to incrementally estimate the pose change between successive frames (He et al., Reference He, Zhu, Huang, Ren and Liu2019). Since monocular VO can only estimate the translation up to a scale ambiguity, stereo VO (SVO) adds a second camera so that 3-D depth information becomes available through triangulation with a calibrated stereo rig.

Because SVO is a dead reckoning process, drift due to error accumulation is difficult to avoid (Liu et al., Reference Liu, Balazadegan Sarvrood, Liu and Gao2021). Design choices crucial to overall system performance include: detecting distinctive image features for more robust matching, such as SIFT (Lowe, Reference Lowe2004), ASIFT (Wan et al., Reference Wan, Liu, Di, Wang and Zhou2014) and ORB (Rublee et al., Reference Rublee, Rabaud, Konolige and Bradski2011) features; establishing sub-pixel feature correspondences for more accurate stereo reconstruction (Vancea et al., Reference Vancea, Miclea and Nedevschi2016); and estimating an initial pose and refining it through loop closure and global optimisation, see, for example, Cvišić et al. (Reference Cvišić, Ćesić, Marković and Petrović2018) and Mur-Artal and Tardós (Reference Mur-Artal and Tardós2017). In addition, to reduce the drift, almost all VO works use the random sample consensus (RANSAC) algorithm for robust model estimation in the presence of noise and outliers (Kaess et al., Reference Kaess, Ni and Dellaert2009). However, there have been few comparisons of RANSAC algorithms based on different hypothesis generators. The notion of a hypothesis generator was proposed by Nistér (Reference Nistér2005). In this work, we are interested in evaluating the performance of three popular and efficient RANSAC schemes, based on the five-point method (RANSAC-5P) (Nistér, Reference Nistér2004), the perspective-three-point method (RANSAC-P3P) (Fischler and Bolles, Reference Fischler and Bolles1981) and the iterative closest point method (RANSAC-ICP) (Maimone et al., Reference Maimone, Cheng and Matthies2007), respectively.

We divide the comparison into three parts. First, we employ the ground truth transformation in the absence of noise and outliers to determine the absolute accuracy of each scheme. Second, robustness is measured as the 2-D data points are corrupted with increasing amounts of noise and outliers. Third, the relative efficiency of each scheme, in terms of the number of RANSAC iterations, is compared for the above cases. The conclusions drawn from these results should help in selecting the most suitable scheme for a specific application.

The rest of this paper is organised as follows. After the introduction, Section 2 discusses the relation to previous comparison work on hypothesis generators in the RANSAC framework. Section 3 outlines three major hypothesis generators decoupling the rotation and translation estimation. Experimental comparisons for three versions of RANSAC are shown in Section 4 and conclusions are made in Section 5.

2. Previous comparison

The work by Alismail et al. (Reference Alismail, Browning and Dias2010) evaluated the performance of perspective-n-point methods and absolute orientation methods within a RANSAC framework, in which the latter estimate the pose of the camera from 3-D-to-3-D correspondences, while the former use 2-D-to-3-D correspondences. A comparison with the classical five-point algorithm, which estimates the pose of the camera from 2-D-to-2-D correspondences, was not given. Nistér (Reference Nistér2004) studied and quantitatively compared the performance of the five-point method with that of the well-known eight- and seven-point methods as well as a six-point method in a RANSAC framework. He found that the five-point algorithm is the winner in both cases of sideways motion and forward motion. Also, Nistér et al. (Reference Nistér, Naroditsky and Bergen2006) argued that P3P should be greatly superior to ICP because 3-D reconstruction from 2-D-to-2-D correspondences increases the error and uncertainty of depth. In fact, the ICP algorithm minimises the 3-D Euclidean error in Equation (16), whereas the P3P algorithm minimises the reprojection error in Equation (19) (Scaramuzza and Fraundorfer, Reference Scaramuzza and Fraundorfer2011). The impartial comparison and analysis presented in this paper build on these previous findings.

3. Description of the hypothesis generators

Each of the three hypothesis generators solves the same problem, which can be described as follows.

First, the projection within each camera is modelled as an ideal perspective projection followed by scaling and translation of the image coordinates. As shown in Figure 1, with each stereo pair, we measure the image coordinates ${\textbf{q}_l} = {[{x_l},{y_l}]^T}$ and ${\textbf{q}_r} = {[{x_r},{y_r}]^T}$ of the projections of each landmark $\textbf{P} = {[X,Y,Z]^T}$ onto the left and right image, respectively. Throughout this paper, we assume that the stereo rig is well calibrated and the intrinsic matrix is known. Therefore, we can solve $\textbf{P}$ by performing stereo triangulation.

Figure 1. Stereo observation model
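For a rectified rig with focal length $f$, baseline $B$ and principal point $({c_x},{c_y})$, the triangulation step reduces to a few lines. A minimal sketch (illustrative code matching the ideal camera model later used in Section 4.1, not the authors' implementation):

```python
import numpy as np

def triangulate(q_l, q_r, f, B, c_x, c_y):
    """Recover the landmark P = [X, Y, Z] from a rectified stereo pair.
    q_l = (x_l, y_l) and q_r = (x_r, y_r) are the measured image coordinates;
    the disparity x_l - x_r is assumed positive (point in front of the rig)."""
    d = q_l[0] - q_r[0]            # disparity
    Z = f * B / d                  # depth from the disparity
    X = (q_l[0] - c_x) * Z / f
    Y = (q_l[1] - c_y) * Z / f
    return np.array([X, Y, Z])
```

In a general (unrectified) setting, the linear multi-view triangulation of Hartley and Zisserman (2000) would be used instead.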

Corresponding to acquisition of the current stereo image pair, the variables to be estimated are the rotation matrix $\textbf{R}$ and translation vector $\textbf{T}$ that describe the vehicle motion between the previous and the current stereo pair, plus the 3-D coordinates ${\textbf{P}_{pj}}$ of the landmarks in the previous coordinate frame and the 3-D coordinates ${\textbf{P}_{cj}}$ of the landmarks in the current coordinate frame. In the notation, the first subscript, p or c, indexes the ‘previous’ or ‘current’ coordinate frame, respectively, and the second subscript j indexes the landmarks. We define the coordinate transformation between frames by

(1)\begin{equation}{\textbf{P}_{pj}} = \textbf{R}{\textbf{P}_{cj}} + \textbf{T}\end{equation}

This describes the change in the true coordinates of the landmarks, relative to the vehicle position, from image to image.

We assume that no prior information is available about the 3-D coordinates of landmarks; that is, the only information available about landmarks is obtained from the images. Since the image coordinates contain measurement errors and wrong data associations, the inferred 3-D coordinates will also contain measurement errors and outliers (Matthies, Reference Matthies1989). RANSAC (Fischler and Bolles, Reference Fischler and Bolles1981) has been regarded as the classic approach for robustly estimating the egomotion in the presence of outliers. It computes model hypotheses from randomly sampled minimal sets of data points and verifies them on all or part of the data points. The hypothesis with the largest number of consistent data points is taken as the optimal model solution (Fraundorfer and Scaramuzza, Reference Fraundorfer and Scaramuzza2012). For egomotion estimation, as used in stereo visual odometry, the model to be estimated is the optimal transformation $[\hat{\textbf{R}},\hat{\textbf{T}}]$ that maps the set $\{ {\textbf{P}_{cj}}\}$ onto $\{ {\textbf{P}_{pj}}\}$. In this paper, each hypothesis generator of RANSAC is split into two parts, one calculating the rotation and the other calculating the translation. The following sections describe how the rotation matrix $\textbf{R}$ and translation vector $\textbf{T}$ are computed within the three major hypothesis generators, that is, the five-point algorithm, the perspective-three-point algorithm and the iterative closest point algorithm.
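The hypothesise-and-verify loop can be written generically, with a hypothesis generator and a residual function plugged in per scheme. The following is an illustrative Python sketch (the paper's own implementation is in MATLAB; `generate` and `residual` are placeholder names):

```python
import numpy as np

def ransac(data, sample_size, generate, residual, tol, max_iters=1000, rng=None):
    """Generic RANSAC loop: hypothesise a model from a random minimal sample,
    score it by its consensus set, and keep the hypothesis with the largest
    number of consistent data points (Fischler and Bolles, 1981)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    best_model, best_inliers = None, np.zeros(n, dtype=bool)
    for _ in range(max_iters):
        idx = rng.choice(n, size=sample_size, replace=False)
        model = generate([data[i] for i in idx])
        if model is None:          # degenerate sample, draw again
            continue
        inliers = np.array([residual(model, d) < tol for d in data])
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

The three generators described below slot into `generate` with minimal sample sizes of five (5P) or three (P3P, ICP), with the reprojection or 3-D Euclidean error as `residual`.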

3.1 Calculating rotation

3.1.1 Five-point

Image points are represented by homogeneous 3-vectors ${\textbf{q}_p} = (q_1^p,q_2^p,q_3^p)$ and ${\textbf{q}_c} = (q_1^c,q_2^c,q_3^c)$ in the previous and current view, respectively. Because the stereo rig is well calibrated, we can always assume that the image points ${\textbf{q}_p}$ and ${\textbf{q}_c}$ have been premultiplied by ${\textbf{K}^{ - 1}}$, where $\textbf{K}$ is a camera intrinsic matrix, so that the epipolar constraint simplifies to

(2)\begin{equation}\textbf{q}_c^T\textbf{E}{\textbf{q}_p} = 0\end{equation}

where $\textbf{E}$ is an essential matrix. We can obtain a constraint of the form in Equation (2) from image point correspondences. The above formula can also be rewritten as

(3)\begin{equation}{\bar{\textbf{q}}^T}\bar{\textbf{E}} = 0\end{equation}

where

\[\bar{\textbf{q}} \equiv {[{q_1^pq_1^c,q_2^pq_1^c,q_3^pq_1^c,q_1^pq_2^c,q_2^pq_2^c,q_3^pq_2^c,q_1^pq_3^c,q_2^pq_3^c,q_3^pq_3^c} ]^T}\]
\[\bar{\textbf{E}} \equiv {[{{E_{11}},{E_{12}},{E_{13}},{E_{21}},{E_{22}},{E_{23}},{E_{31}},{E_{32}},{E_{33}}} ]^T}\]

Stacking five point correspondences yields a 5 × 9 matrix. Four 3 × 3 matrices $\textbf{X},\textbf{Y},\textbf{Z},\textbf{W}$ are obtained from the right null space of this matrix, and the essential matrix must be of the form

(4)\begin{equation}\textbf{E} = x\textbf{X} + y\textbf{Y} + z\textbf{Z} + w\textbf{W}\end{equation}

where $x,y,z,w$ are four scalars. Substituting Equation (4) into the 10 cubic constraints given by Equations (5) and (6) yields a tenth-degree polynomial. After solving for its roots, the essential matrix is obtained from Equation (4).

(5)\begin{equation}\det (\textbf{E}) = 0\end{equation}
(6)\begin{equation}\textbf{E}{\textbf{E}^T}\textbf{E} - \frac{1}{2}\textrm{trace}(\textbf{E}{\textbf{E}^T})\textbf{E} = 0\end{equation}

Finally, the rotation matrix $\textbf{R}$ is recovered from the essential matrix (Hartley and Zisserman, Reference Hartley and Zisserman2000; Nistér, Reference Nistér2004).
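The recovery step can be sketched as follows: the SVD of the essential matrix yields two rotation candidates, and the cheirality test (triangulated points must lie in front of both cameras, omitted here) selects the valid one. This follows the standard construction in Hartley and Zisserman (2000) and is illustrative, not the authors' code:

```python
import numpy as np

def rotations_from_essential(E):
    """Return the two rotation candidates U W V^T and U W^T V^T obtained from
    the SVD of an essential matrix; the cheirality test then selects the
    physically valid one."""
    U, _, Vt = np.linalg.svd(E)
    # enforce proper rotations (det = +1); E is only defined up to sign
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    return U @ W @ Vt, U @ W.T @ Vt
```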

3.1.2 Perspective-three-point

Given three 3-D locations ${\textbf{P}_{pj}}$ of A, B and C in the previous coordinate frame by performing stereo triangulation and their 2-D projections $A^{\prime}$, $B^{\prime}$ and $C^{\prime}$ in a current calibrated camera, the P3P algorithm can estimate the rotation matrix $\textbf{R}$ relative to the previous coordinate frame. First, given the face angles of the opposing trihedral angle $({\theta _{ab}},{\theta _{ac}},{\theta _{bc}})$ and the lengths of the three sides of the base $({R_{ab}},{R_{ac}},{R_{bc}})$ (see Figure 2), the P3P algorithm can compute the lengths of the three edges of a triangular pyramid $(a,b,c)$ by using the cosine rule. Then the 3-D locations ${\textbf{P}_{cj}}$ of A, B and C in the current coordinate frame can be computed (Matthies, Reference Matthies1989). Once the three 3-D-to-3-D point correspondences are known, we can obtain the rotation matrix $\textbf{R}$ using the ICP algorithm (Section 3.1.3).

Figure 2. Geometry of the P3P algorithm
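For reference, the cosine rule applied to the three faces of the pyramid in Figure 2 gives the polynomial system that is solved for the unknown edge lengths $(a,b,c)$:

\[{a^2} + {b^2} - 2ab\cos {\theta _{ab}} = R_{ab}^2,\quad {a^2} + {c^2} - 2ac\cos {\theta _{ac}} = R_{ac}^2,\quad {b^2} + {c^2} - 2bc\cos {\theta _{bc}} = R_{bc}^2\]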

3.1.3 Iterative closest point

Given the 3-D coordinates ${\textbf{P}_{pj}}$ and ${\textbf{P}_{cj}}$ obtained by stereo triangulation, the ICP algorithm determines the rotation matrix $\textbf{R}$ directly by minimising the weighted least-squares criterion:

(7)\begin{equation}\sum {{w_j}\textbf{e}_j^T{\textbf{e}_j}} \end{equation}
(8)\begin{equation}{\textbf{e}_j} = {\textbf{P}_{pj}} - \textbf{R}{\textbf{P}_{cj}} - \textbf{T}\end{equation}
(9)\begin{equation}{w_j} = {[{\det ({{\Sigma _{pj}}} )+ \det ({{\Sigma _{cj}}} )} ]^{ - 1}}\end{equation}

where ${\textbf{e}_j}$ is the error residual between previous position ${\textbf{P}_{pj}}$ and current position ${\textbf{P}_{cj}}$ of the jth feature, and ${w_j}$ is the scalar weight, and ${\Sigma _{pj}}$ and ${\Sigma _{cj}}$ are the covariance of ${\textbf{P}_{pj}}$ and ${\textbf{P}_{cj}}$, respectively. Let

(10)\begin{equation}w = \sum\limits_j {{w_j}} \end{equation}
(11)\begin{equation}{\textbf{Q}_c} = \sum\limits_j {{w_j}{\textbf{P}_{cj}}} \end{equation}
(12)\begin{equation}{\textbf{Q}_p} = \sum\limits_j {{w_j}{\textbf{P}_{pj}}}\end{equation}
(13)\begin{equation}\textbf{A} = \sum\limits_j {{w_j}{\textbf{P}_{pj}}\textbf{P}_{cj}^T}\end{equation}
(14)\begin{equation}\textbf{E} = \textbf{A} - \frac{1}{w}{\textbf{Q}_p}\textbf{Q}_c^T\end{equation}

Let $\textbf{E} = \textbf{US}{\textbf{V}^T}$ be the singular value decomposition of $\textbf{E}$. Then (Maimone et al., Reference Maimone, Cheng and Matthies2007)

(15)\begin{equation}\textbf{R} = \textbf{U}{\textbf{V}^T}\end{equation}

3.2 Calculating translation

Given the camera rotation computed above, the hypothesis generators recover the translation by minimising one of two classical cost functions.

The following cost function minimises Euclidean error in 3-D space:

(16)\begin{equation}\mathop {\textrm{argmin}}\limits_T \sum {{w_j}{{||{(\hat{\textbf{R}}{\textbf{P}_{cj}} + \textbf{T}) - {\textbf{P}_{pj}}} ||}^2}} \end{equation}

where ${w_j} = {[{\textrm{det}({{\Sigma _{pj}}} )+ \textrm{det}({{\Sigma _{cj}}} )} ]^{ - 1}}$ and $\hat{\textbf{R}}$ is the optimal rotation matrix. Differentiating Equation (16) with respect to $\textbf{T}$ and setting the derivative to zero, we have

(17)\begin{equation}\hat{\textbf{T}} = {\bar{\textbf{P}}_p} - \hat{\textbf{R}}{\bar{\textbf{P}}_c}\end{equation}

where ${\bar{\textbf{P}}_p} = {{\sum {{w_j}{\textbf{P}_{pj}}} } / {\sum {{w_j}} }}$ and ${\bar{\textbf{P}}_c} = {{\sum {{w_j}{\textbf{P}_{cj}}} } / {\sum {{w_j}} }}$ are the weighted centroids of each point set, respectively.
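The closed-form pipeline of Equations (10)–(15) together with Equation (17) can be sketched as follows (an illustrative NumPy translation of the weighted least-squares solution, not the authors' MATLAB code; the reflection guard on the SVD step is a standard safeguard not spelled out in the text):

```python
import numpy as np

def weighted_rigid_transform(Pp, Pc, w):
    """Closed-form weighted least-squares estimate of R, T with Pp ~ R Pc + T,
    following Eqs. (10)-(15) and (17). Pp, Pc: (n, 3) arrays of corresponding
    3-D points; w: (n,) scalar weights."""
    w = np.asarray(w, float)
    Pp_bar = (w[:, None] * Pp).sum(0) / w.sum()   # weighted centroids
    Pc_bar = (w[:, None] * Pc).sum(0) / w.sum()
    # weighted cross-covariance of the centred point sets, Eqs. (13)-(14)
    E = (w[:, None, None] * ((Pp - Pp_bar)[:, :, None]
                             * (Pc - Pc_bar)[:, None, :])).sum(0)
    U, _, Vt = np.linalg.svd(E)
    # guard against a reflection (det = -1) in near-degenerate configurations
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                                 # Eq. (15)
    T = Pp_bar - R @ Pc_bar                        # Eq. (17)
    return R, T
```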

We assume that the projection within each camera is modelled as an ideal perspective projection

(18)\begin{equation}\left[ {\begin{array}{*{20}{c}} u\\ v\\ 1 \end{array}} \right] = \pi (\textbf{P};\textbf{R},\textbf{T}) = \textbf{K}[{\textbf{R}|\textbf{T}} ]\left[ {\begin{array}{*{20}{c}} X\\ Y\\ Z\\ 1 \end{array}} \right]\end{equation}

where ${[u,v,1]^T}$ is a homogeneous coordinate of 2-D image point, $\pi$ is the projection function, $\textbf{K}$ is an intrinsic matrix, $[\textbf{R}|\textbf{T}]$ is a 3-D rigid transformation between two views and ${[X,Y,Z,1]^T}$ is a homogeneous coordinate of 3-D world point. The following cost function minimises the image reprojection error (Cvišić et al., Reference Cvišić, Ćesić, Marković and Petrović2018):

(19)\begin{equation}\mathop {\textrm{argmin}}\limits_T \sum {{w_j}({{{||{{\textbf{p}_{l,j}} - {\pi_l}({\textbf{P}_j};\hat{\textbf{R}},\textbf{T})} ||}^2} + {{||{{\textbf{p}_{r,j}} - {\pi_r}({\textbf{P}_j};\hat{\textbf{R}},\textbf{T})} ||}^2}} )} \end{equation}

Unfortunately, using the reprojection error leads to a nonlinear optimisation problem that does not appear to have a direct solution. Minimising the cost function requires the use of iterative techniques (Hartley and Zisserman, Reference Hartley and Zisserman2000). Therefore, in our work, within RANSAC-5P, RANSAC-P3P and RANSAC-ICP, the initial translation is calculated by minimising Equation (16). Then we obtain the final translation by iteratively minimising Equation (19) on the largest consensus set. Here, we apply the Levenberg–Marquardt optimisation approach (Lourakis and Argyros, Reference Lourakis and Argyros2009).
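A minimal sketch of this translation refinement for a rectified rig follows (assuming the camera setup of Section 4.1; the damped Gauss–Newton loop below stands in for the cited Levenberg–Marquardt library, and all names are illustrative):

```python
import numpy as np

def refine_translation(T0, R_hat, P_c, p_left, p_right, K, baseline, iters=20):
    """Refine the translation by minimising the stereo reprojection error of
    Eq. (19), with the rotation held fixed. Left projection K[R|T]; the right
    camera is shifted by [-baseline, 0, 0] as in Section 4.1."""
    t_right = np.array([-baseline, 0.0, 0.0])
    def residuals(T):
        Pp = P_c @ R_hat.T + T                 # points mapped to previous frame
        def proj(offset):
            q = (Pp + offset) @ K.T
            return q[:, :2] / q[:, 2:3]
        return np.concatenate([(proj(0.0) - p_left).ravel(),
                               (proj(t_right) - p_right).ravel()])
    T, lam = np.asarray(T0, float), 1e-3       # lam: LM damping factor
    for _ in range(iters):
        r = residuals(T)
        # finite-difference Jacobian, one column per translation component
        J = np.column_stack([(residuals(T + h) - r) / 1e-6
                             for h in np.eye(3) * 1e-6])
        step = np.linalg.solve(J.T @ J + lam * np.eye(3), -J.T @ r)
        if np.linalg.norm(residuals(T + step)) < np.linalg.norm(r):
            T, lam = T + step, lam * 0.5       # accept, trust the model more
        else:
            lam *= 10.0                        # reject, increase damping
    return T
```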

4. Experimental comparison

We tested the performance of each RANSAC scheme. When choosing a scheme, one must take accuracy, robustness and efficiency into account. The results of these experiments are detailed below.

4.1 The implementations

The experiments were performed on a computer with an Intel i9 eight-core 3⋅6 GHz processor, and the code was written in MATLAB. For the evaluation, the data were generated as follows.

  1. (1) Assume without loss of generality that the images have a resolution of 640 × 480, and the intrinsic parameters of the stereo rig are the focal length $f = 600$, the principal point in the image plane ${c_x} = 320$, ${c_y} = 240$, respectively, and the baseline $B = 200$. Therefore, let the projection matrices for the left and right cameras be $\textbf{K}[\textbf{I}|\textbf{0}]$ and $\textbf{K}[\textbf{I}|\textbf{t}]$, where $\textbf{I}$ is a 3 × 3 identity matrix, $\textbf{K}$ is a 3 × 3 upper triangular calibration matrix holding the intrinsic parameters and $\textbf{t} = {[ - B,0,0]^T}$.

  2. (2) The 3-D point sets $\{ {\textbf{P}_{cj}}\}$ in the current coordinate frame, ranging from 50 points to 5,000 points, are randomly generated from a uniform distribution within a cube of size 3000 × 3000 × 3000 in front of the origin (relative to the positive Z direction). Their corresponding 2-D image plane coordinates $\{ u_{l,j}^c,v_{l,j}^c,u_{r,j}^c,v_{r,j}^c\}$ in the left and right cameras are obtained via the projection matrices $\textbf{K}[\textbf{I}|\textbf{0}]$ and $\textbf{K}[\textbf{I}|\textbf{t}]$. Note that we must eliminate the invalid 3-D points whose 2-D point correspondences fall outside the image boundaries.

  3. (3) The ground truth transformation is generated in two parts. A rotation $\textbf{R}$ is calculated from Euler angles $\Theta$ that are randomly generated from a uniform distribution (we restrict each angle to the range between 0 and $\pi /6$). Then we compute a 3-D translation $\textbf{T}$ by drawing each component from a uniform distribution over the range [−500, 500].

  4. (4) Transform the 3-D point sets $\{ {\textbf{P}_{cj}}\}$ to the previous coordinate frame using the transformation $[\textbf{R},\textbf{T}]$ and we can acquire the 3-D point sets $\{ {\textbf{P}_{pj}}\}$. Repeat the perspective projection process in step (2) and we obtain their corresponding 2-D image plane coordinates $\{ u_{l,j}^p,v_{l,j}^p,u_{r,j}^p,v_{r,j}^p\}$.

At this point, a list L of n 14-tuples is given, each 14-tuple including the true 3-D world point coordinates in the previous and current coordinate frames, and its corresponding 2-D image point coordinates in the previous and current left and right camera. In the following experiments, we assume that the only information available about landmarks is obtained from the images, that is, we just employed the 2-D point correspondences $\{ u_{l,j}^p,v_{l,j}^p,u_{r,j}^p,v_{r,j}^p\}$ and $\{ u_{l,j}^c,v_{l,j}^c,u_{r,j}^c,v_{r,j}^c\}$ and their 3-D spatial coordinates would be estimated via triangulation.
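Steps (1) and (2) of this protocol can be sketched as follows (illustrative Python; placing the cube at depths up to 3,000 in front of the camera is our reading of the stated geometry):

```python
import numpy as np

def make_scene(n, K, B, rng):
    """Draw n landmarks uniformly in a 3000-unit cube in front of the cameras,
    project them into both views of the rectified rig, and keep only points
    that land inside the 640 x 480 images."""
    P = rng.uniform([-1500.0, -1500.0, 1.0], [1500.0, 1500.0, 3001.0], (n, 3))
    def proj(offset):
        q = (P + offset) @ K.T
        return q[:, :2] / q[:, 2:3]
    q_l = proj(np.zeros(3))                      # left camera  K[I|0]
    q_r = proj(np.array([-B, 0.0, 0.0]))         # right camera K[I|t]
    ok = (((q_l >= 0) & (q_l < [640, 480])).all(axis=1)
          & ((q_r >= 0) & (q_r < [640, 480])).all(axis=1))
    return P[ok], q_l[ok], q_r[ok]
```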

4.2 The accuracy/robustness experiments under varying levels of noise conditions

In this experiment, we tested the absolute accuracy of three schemes by comparing their own output transformations to the ground truth transformations in the absence of noise and outliers. Then, at different noise levels in the input, the robustness was reported by comparing the error in the output. We added Gaussian noise to each 2-D point correspondence component. Because the feature detection and matching algorithms are able to register images with subpixel accuracy, the mean of the Gaussian noise is zero and its standard deviation ranges from zero to one pixel.

For each data set size/noise/outlier level, we performed 1,000 trials. Each trial was made up of new 2-D point correspondences, noise, outliers and ground truth transformations. We computed two error statistics over these 1,000 trials to determine the difference between the three schemes. These were the relative translation error between the true and estimated translation vectors, ${T_{\textrm{err}}} = {{||{{\textbf{T}_{\textrm{alg}}} - {\textbf{T}_{\textrm{true}}}} ||} / {||{{\textbf{T}_{\textrm{true}}}} ||}}$, and the relative rotation error between the true and estimated Euler angles, ${\Theta _{\textrm{err}}} = {{||{{\Theta _{\textrm{alg}}} - {\Theta _{\textrm{true}}}} ||} / {||{{\Theta _{\textrm{true}}}} ||}}$.

Figures 3(a)–3(d) show, in the absence of outliers, the robustness of the three schemes to changes in the noise level $\sigma = \{ 0,\;0.01,\;0.05,\;0.1,\;0.5,\;0.7,\;1.0\}$ in terms of rotation error and translation error. As the number of points increases, both errors approach a value related to the noise level, and RANSAC-P3P is substantially superior to the other two. However, in the noiseless case, as shown in Figures 3(e) and 3(f), it is the absolute accuracy of RANSAC-ICP, rather than that of RANSAC-P3P, that is obviously better than the others. We demonstrate why the ranking differs with the noise level in Section 4.5.

Figure 3. Graphs are rotation error and translation error versus data set size, N. A range of noise levels, $\sigma = 0.0$ to $\sigma = 1.0$, are shown in panels (a)–(d), while the noiseless case is enlarged in panels (e) and (f)

4.3 The robustness experiments under different percentages of outlier conditions

The tutorial by Fraundorfer and Scaramuzza (Reference Fraundorfer and Scaramuzza2012) only presents a comparison of the number of RANSAC iterations versus the percentage of outliers between five-point RANSAC, two-point RANSAC and one-point RANSAC. A quantitative comparison answering the question ‘Which is the most robust under different percentages of outlier conditions?’ was not given. In this section, we report the robustness by testing the rotation error and translation error under different percentages of outliers in the input. The primary causes of outliers are wrong data associations and image noise, so a given percentage of outliers is produced by randomly interchanging 2-D point correspondences between the previous and current coordinate frames at various noise levels $\sigma = \{{0,\;0.01,\;0.1,\;0.7,\;1.0} \}$. We proceeded as described in the previous section for each data set, and additionally computed the mean rotation error and translation error at different percentages of outliers.

As shown in Figure 4, in the presence of noise, RANSAC-P3P appears to be much more immune to changes in the percentage of outliers. We divide the outlier effect into two cases. The first case is the effect of those outliers that can be identified by the RANSAC scheme. These outliers only have a negative impact on the efficiency of each scheme: a larger percentage of outliers results in a higher number of RANSAC iterations needed to find inliers. On this point, because RANSAC-5P must sample at least five point correspondences per trial, while RANSAC-P3P and RANSAC-ICP must sample at least three, both RANSAC-ICP and RANSAC-P3P can find inliers more easily and quickly than RANSAC-5P. Section 4.4 provides a discussion of this. The second case is the effect of those outliers that cannot be identified by the RANSAC scheme. In our experiments, a reprojection error was used to decide whether a 2-D point correspondence was an outlier or not. Since the reprojection error of these outliers does not exceed a certain threshold, their effect can be regarded as that of noise, where the level of this noise depends on the threshold of the reprojection error; we set the threshold to two pixels. Therefore, as shown in Figure 4, RANSAC-ICP is still superior to the other two in the absence of noise. Otherwise, RANSAC-P3P does well under different percentages of outlier conditions.

Figure 4. Graphs are rotation error and translation error versus percentage of outliers for five noise levels, $\sigma = 0$ (a, b), $\sigma = 0.01$ (c, d), $\sigma = 0.1$ (e, f), $\sigma = 0.7$ (g, h) and $\sigma = 1.0$ (i, j)

4.4 The efficiency experiments

The number of RANSAC iterations (NRI) is an indirect indication of computational performance and efficiency (Alismail et al., Reference Alismail, Browning and Dias2010). In this section, we examine the NRI under different conditions. Figure 5 depicts the NRI for various noise levels at different percentages of outliers, and Figure 6 shows the NRI for different percentages of outliers at various noise levels. Note that we limited the number of RANSAC iterations in the experiments to between 9 and 1,000, where the lower limit is primarily used to avoid the problem of nearly degenerate data.

Figure 5. The NRI of algorithms for various noise levels in different percentages of outliers, $p = 0$ (a), $p = 0.3$ (b), $p = 0.5$ (c) and $p = 0.8$ (d)

Figure 6. The NRI of algorithms for different percentages of outliers with $\sigma = 0$ (a), $\sigma = 0.01$ (b), $\sigma = 0.1$ (c) and $\sigma = 0.7$ (d)

From a theoretical standpoint, at lower noise levels the NRI is mainly associated with the accuracy of the initial solutions and the minimum number of point correspondences needed to estimate the motion. For one thing, the more accurate the initial solution, the faster a sufficient number of inliers can be found. For another, the fewer point correspondences a scheme requires, the faster each motion hypothesis can be generated. The RANSAC-5P scheme must sample at least five point correspondences per iteration, whereas the RANSAC-P3P and RANSAC-ICP schemes must sample at least three. Therefore, when the percentage of outliers does not exceed 50%, RANSAC-P3P appears to find a solution much more easily at low noise levels, as shown in Figure 5. The NRI of each scheme rapidly reaches the maximum number of RANSAC iterations as the percentage of outliers increases. However, because RANSAC-P3P can have as many as four possible solutions, a fourth point is necessary to verify which solution is correct, which has a negative effect on the NRI not only at high noise levels (see Figure 5) but also in the presence of outliers (see Figure 6). Based on these results, we conclude that RANSAC-P3P may be computationally more efficient under low noise levels, typically $\sigma < 0.6$ pixels, and low percentages of outlier conditions, typically $p < 0.5$. Otherwise, the efficiency of RANSAC-ICP is better than that of the others.
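The role of the minimal sample size can be made concrete with the standard formula for the number of RANSAC trials required to draw at least one all-inlier sample with a given confidence (Fischler and Bolles, 1981); note this is a theoretical estimate, not the NRI measured in our experiments:

```python
import numpy as np

def ransac_trials(outlier_ratio, sample_size, confidence=0.99):
    """Number of iterations N such that, with probability `confidence`, at
    least one minimal sample of `sample_size` points is outlier-free:
    N = log(1 - confidence) / log(1 - (1 - outlier_ratio)**sample_size)."""
    w = 1.0 - outlier_ratio                 # probability a point is an inlier
    return int(np.ceil(np.log(1.0 - confidence)
                       / np.log(1.0 - w ** sample_size)))
```

At 50% outliers and 99% confidence this gives 35 trials for a three-point sample but 146 for a five-point sample, illustrating why RANSAC-P3P and RANSAC-ICP can find inliers with fewer iterations than RANSAC-5P.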

4.5 The analysis of accuracy comparison

Figure 3 shows that RANSAC-ICP provides more accurate pose estimates in the theoretically noiseless case. We believe that there are mainly two reasons behind this.

  1. (1) The 3-D data points used by RANSAC-ICP are determined by intersecting rays projected through the camera models, i.e. triangulation, while those of RANSAC-P3P are determined by the cosine rule, and the accuracy of the former is at least an order of magnitude higher than that of the latter. We estimated the 3-D positions of 1,000 random 2-D image point correspondences using the two methods. The results show that the mean 3-D Euclidean distance between the true and estimated positions is approximately 5 × 10−12 for triangulation and 7 × 10−8 for the cosine rule.

  2. (2) Another possibility is that both RANSAC-5P and RANSAC-P3P require finding the real roots of a polynomial to obtain the initial solutions. However, for RANSAC-ICP, the singular value decomposition-based solution is, in general, more concise and accurate in the absence of noise.

With increasing amounts of noise, although the cosine rule still fails to obtain more accurate 3-D data points, it can compensate for the noise and make the initial solution more accurate. RANSAC-P3P computes the current 3-D data points from a combination of 2-D image point correspondences in the current coordinate frame and 3-D data points in the previous coordinate frame, so that the distance between any two 3-D data points remains consistent in spite of the position and attitude changes between the two coordinate frames. Further evidence is as follows.

Solving for the initial transformation $[\textbf{R},\textbf{T}]$, each of the three schemes requires minimising the least-squares error criterion given by Equation (16). If we assume that ${w_j} = 1$, ${\textbf{P}_{pj}} = {\bar{\textbf{P}}_{pj}} + {\varepsilon _{pj}}$ and ${\textbf{P}_{cj}} = {\bar{\textbf{P}}_{cj}} + {\varepsilon _{cj}}$, where ${\bar{\textbf{P}}_{pj}}$ and ${\bar{\textbf{P}}_{cj}}$ are the true coordinates of ${\textbf{P}_{pj}}$ and ${\textbf{P}_{cj}}$, and ${\varepsilon _{pj}}$ and ${\varepsilon _{cj}}$ are the error vectors, respectively, we have

(20)\begin{equation}{|{{\textbf{e}_j}} |^2} = {|{{\varepsilon_{pj}}} |^2} + {|{\textbf{R}{\varepsilon_{cj}}} |^2} - 2|{{\varepsilon_{pj}}} ||{\textbf{R}{\varepsilon_{cj}}} |\textrm{cos}\left\langle {{\varepsilon_{pj}},\textbf{R}{\varepsilon_{cj}}} \right\rangle \end{equation}

where $|\cdot |$ is the ${L_2}$ norm and $\left\langle \cdot \right\rangle$ is the angle between two error vectors. One can see that Equation (20) attains its minimum only when the angle between the error vectors ${\varepsilon _{pj}}$ and $\textbf{R}{\varepsilon _{cj}}$ is equal to zero. Therefore, RANSAC-P3P acquires a more accurate initial rotation matrix $\textbf{R}$ and translation vector $\textbf{T}$ because the distance consistency constraint and the cosine rule keep the orientations of the error vectors in the previous and current coordinate frames similar. Combining this with the results shown in Figure 3, one can conclude that RANSAC-P3P provides greatly superior pose estimates except in the absence of noise in the 2-D point correspondences.

5. Conclusion

A comparison has been made between three popular and efficient RANSAC algorithms which robustly estimate the egomotion: RANSAC-5P, RANSAC-P3P and RANSAC-ICP. As with most comparisons, there is no single optimal algorithm under all levels of noise and all percentages of outliers. We have presented quantitative comparisons of accuracy, robustness and efficiency in the previous section. From these, we can draw a few general conclusions.

As mentioned earlier, the results of the accuracy and robustness experiments for various levels of noise indicate that RANSAC-P3P can provide greatly superior pose estimates in the presence of noise in 2-D point correspondences. Otherwise, RANSAC-ICP can yield better results.

A greater separation between the robustness of the three schemes was observed under different percentages of outliers. As the percentage of outliers increases, the performance of RANSAC-5P and RANSAC-ICP degrades substantially, while RANSAC-P3P deteriorates only slightly.

In terms of efficiency, the number of RANSAC iterations required is closely related to the noise level and the percentage of outliers in the 2-D point correspondences, regardless of the data set size. Under higher noise levels, and for virtually all percentages of outliers, the relative speed of RANSAC-ICP becomes superior.
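The dependence of the iteration count on the outlier percentage can be illustrated with the standard RANSAC stopping bound, which gives the number of random samples needed to draw at least one all-inlier minimal sample with a given confidence. The sketch below is illustrative only: it captures the effect of the minimal sample size (five points for RANSAC-5P, three for RANSAC-P3P and RANSAC-ICP) and the outlier ratio, but not the effect of noise on inlier classification discussed above.

```python
import math

def ransac_iterations(outlier_ratio, sample_size, confidence=0.99):
    """Standard RANSAC bound: iterations needed so that, with the given
    confidence, at least one minimal sample is drawn entirely from inliers."""
    inlier_prob = (1.0 - outlier_ratio) ** sample_size
    if inlier_prob >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - inlier_prob))

# Minimal sample sizes: 5 points for RANSAC-5P, 3 for RANSAC-P3P / RANSAC-ICP.
for eps in (0.1, 0.3, 0.5):
    n5 = ransac_iterations(eps, 5)
    n3 = ransac_iterations(eps, 3)
    print(f"outliers={eps:.0%}: 5-point needs {n5} iterations, 3-point needs {n3}")
```

The bound grows much faster for the five-point sampler, which is one reason the three-point hypothesis generators remain competitive as the outlier percentage rises.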

In conclusion, for real-world applications under low noise levels and low percentages of outliers, RANSAC-P3P outperforms both RANSAC-5P and RANSAC-ICP in terms of accuracy and robustness, while RANSAC-ICP does better on efficiency, especially under high noise levels and high percentages of outliers.

Figure 1. Stereo observation model

Figure 2. Geometry of the P3P algorithm

Figure 3. Rotation error and translation error versus data set size, N. A range of noise levels, $\sigma = 0.0$ to $\sigma = 1.0$, is shown in panels (a)–(d), while the noiseless case is enlarged in panels (e) and (f)

Figure 4. Rotation error and translation error versus percentage of outliers for five noise levels, $\sigma = 0$ (a, b), $\sigma = 0.01$ (c, d), $\sigma = 0.1$ (e, f), $\sigma = 0.7$ (g, h) and $\sigma = 1.0$ (i, j)

Figure 5. The NRI of the algorithms for various noise levels at different percentages of outliers, $p = 0$ (a), $p = 0.3$ (b), $p = 0.5$ (c) and $p = 0.8$ (d)

Figure 6. The NRI of the algorithms for different percentages of outliers with $\sigma = 0$ (a), $\sigma = 0.01$ (b), $\sigma = 0.1$ (c) and $\sigma = 0.7$ (d)