Hostname: page-component-78c5997874-s2hrs Total loading time: 0 Render date: 2024-11-13T01:32:32.280Z Has data issue: false hasContentIssue false

A semantic knowledge database-based localization method for UAV inspection in perceptual-degraded underground mine

Published online by Cambridge University Press:  30 October 2024

Qinghua Liang
Affiliation:
Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China
Minghui Zhao
Affiliation:
School of Electronics and Information Engineering, Tongji University, Shanghai, China China Coal Technology & Engineering Group Shanghai Co. Ltd. Shanghai, China
Shigang Wang
Affiliation:
Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China
Min Chen*
Affiliation:
Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China
*
Corresponding author: Min Chen; Email: 13541210030@163.com
Rights & Permissions [Opens in a new window]

Abstract

In recent years, unmanned aerial vehicles (UAVs) have been applied in underground mine inspection and other similar works depending on their versatility and mobility. However, accurate localization of UAVs in perceptually degraded mines is full of challenges due to the harsh light conditions and similar roadway structures. Due to the unique characteristics of the underground mines, this paper proposes a semantic knowledge database-based localization method for UAVs. By minimizing the spatial point-to-edge distance and point-to-plane distance, the relative pose constraint factor between keyframes is designed for UAV continuous pose estimation. To reduce the accumulated localization errors during the long-distance flight in a perceptual-degraded mine, a semantic knowledge database is established by segmenting the intersection point cloud from the prior map of the mine. The topological feature of the current keyframe is detected in real time during the UAV flight. The intersection position constraint factor is constructed by comparing the similarity between the topological feature of the current keyframe and the intersections in the semantic knowledge database. Combining the relative pose constraint factor of LiDAR keyframes and the intersection position constraint factor, the optimization model of the UAV pose factor graph is established to estimate UAV flight pose and eliminate the cumulative error. Two UAV localization experiments conducted on the simulated large-scale Edgar Mine and a mine-like indoor corridor indicate that the proposed UAV localization method can realize accurate localization during long-distance flight in degraded mines.

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. Introduction

The unmanned aerial vehicle (UAV) has the advantages of lightness, flexibility, and highly maneuverable. By equipping the UAV with some environmental sensing equipment, such as LiDAR and vision sensors, the UAV can replace workers to complete complex patrol inspection and other similar tasks in the underground mine. In this way, the incidence of coal mine safety accidents can be greatly reduced, and production efficiency can be improved. Furthermore, mechanization, automation, informatization, and intelligence of the coal mining industry could be rapidly promoted. However, the underground mine environment is extremely complex. The roadways of the mine are narrow and featureless. There are no GPS signals and the light condition is poor. These limitations make it difficult for the existing localization methods to achieve accurate positioning in the underground mine. The two commonly used types of sensors for localization in general environments are camera and LiDAR. Therefore, we review vision-based and LiDAR-based localization methods applied in the underground mine environment below.

Vision-based localization

ORB-SLAM [Reference Mur-Artal, Montiel and Tardos1Reference Campos, Elvira, Rodríguez, Montiel and Tardós3] is currently the most widely used robot localization method based on visual feature point matching. Rogers et al. [Reference Rogers, Gregory, Fink and Stump4] tested the localization accuracy of ORB-SLAM2 in the open-source underground mine. However, the lack of image texture features in underground roadways leads to the maximum localization error of ORB-SLAM up to 30 m. Compared to visual localization algorithms based on feature matching, direct sparse visual odometry exploits pixel gradients rather than image features to estimate the relative pose change, which is usually applied in featureless environments. Özaslan et al. [Reference Özaslan, Shen, Mulgaonkar, Michael and Kumar5Reference Özaslan, Loianno, Keller, Taylor, Kumar, Wozencraft and Hood7] equipped a UAV with an active light and a camera and estimated the axial displacement of the UAV based on Lucas–Kanade optical flow [Reference Shin, Kim, Kang, Lee, Paik, Abidi and Abidi8] during flight in a dark tunnel. The experimental results show that the maximum localization error is up to 40%. The large localization error results from the fact that the active lighting carried by UAV during the flight in the dark tunnel makes it difficult to satisfy the theory of the gray-scale invariance assumption of optical flow. To improve the localization accuracy of the vision-based method in underground mines, Jacobson et al. [Reference Jacobson, Zeng, Smith, Boswell, Peynot and Milford9] proposed a semi-supervised SLAM algorithm for pose tracking of mining vehicles. The semi-supervised SLAM established multiple positioning nodes and stored the keyframe images to form an image database and then used the optical flow to estimate the vehicle pose. Whereas, the semi-supervised SLAM requires a large amount of image data and manual intervention. With the development of robot intelligence, semantic features [Reference Zhu, Wang, Blum, Liu, Song, Pollefeys and Wang10] are conducive to navigation and localization. Furthermore, to cope with the sensitivity of lighting changes and motion blur of single-camera images, vision inertial odometry (VIO) was applied to the localization of robots in the underground mine. Kramer et al. [Reference Kramer, Kasper and Heckman11] tested the localization accuracy of existing OKVIS [Reference Leutenegger, Lynen, Bosse, Siegwart and Furgale12] algorithms. Chen et al. [Reference Chen, Henawy, Kocer and Seet13] evaluated VIO-Mono [Reference Qin, Li and Shen14] state estimator’s accuracy in featureless indoor environments like underground mine environment. Papachristos et al. [Reference Papachristos, Mascarich and Alexis15Reference Khattak, Papachristos and Alexis17] combined the images from an infrared camera and data from an IMU to localize the UAV indoors and proposed RITIO and KTIO algorithm [Reference Khattak, Papachristos and Alexis18]. Compared with visible light images, although infrared images can be adapted to the harsh lighting of underground mines to some extent, the lack of enough features leads to a large localization error. In summary, since the mine is located hundreds of meters underground, where most areas have only weak light and some areas are dark, the vision-based localization method cannot capture enough features or satisfy the gray-scale invariance assumption, making it difficult to achieve accurate localization.

LiDAR-based localization

In the visually degraded underground mine environment, LiDAR sensors can directly measure the distance and orientation information of the target object, which has the advantages of wide-ranging scope and high sensing accuracy compared to vision sensors. Thus, the LiDAR-based method is more suitable for robot localization and mapping [Reference Fasiolo, Scalera and Maset19]. Meanwhile, the LiDAR data is almost unaffected by the changes in ambient light, making it suitable for dark coal mine environments. LiDAR sensors can be divided into single-line LiDAR (2D LiDAR) and multi-line LiDAR (3D LiDAR) depending on the number of laser beams. 2D LiDAR was commonly used in the localization of indoor mobile robots in the early stage, which can only scan in a plane and cannot measure the height of the object. The ATRV mobile robot developed by Bakambu et al. [Reference Bakambu and Polotski20] was equipped with two 2D LiDARs to explore the mine. They used line segments scanned by 2D LiDAR as the basic elements for localization and mapping. In addition, the Groundhog robot developed by Sebastian et al. [Reference Thrun, Thayer, Whittaker, Baker, Burgard, Ferguson, Hähnel, Montemerlo, Morris, Omohundro and Reverte21] also used the point cloud data from 2D LiDAR aligned with a map to estimate the pose of the robot based on the Iterative Closest Point (ICP) matching. Since the 2D LiDAR cannot measure the spatial information, the above research can only acquire the x-y translation and heading angle of the mobile robot. Compared to 2D LiDAR, 3D LiDAR can measure the spatial geometric structure of the object. LOAM [Reference Zhang and Singh22] and Lego_LOAM [Reference Shan and Englot23] are two typical 3D point cloud-based localization algorithms. Li et al. [Reference Li, Zhu, You, Wang and Tang24] combined Lego_LOAM with the Normal Distributions Transform (NDT)-based feature matching algorithm to localize the pose of a mobile robot in a mine. The experimental results showed that the maximum error was up to 29%. Papachristos et al. [Reference Papachristos, Khattak, Mascarich and Alexis25] fused LOAM and IMU based on the Extended Kalman Filter (EKF) to estimate the pose of a UAV during flight in a mine. However, they did not quantitatively analyze the localization accuracy of the UAV during flight since it is difficult to obtain the ground truth of the UAV pose in the real mine environment. Chow et al. [Reference Chow, Kocer, Henawy, Seet, Li, Yau and Pratama26] conducted experimental analysis in an indoor environment for three SLAM methods: Hector_slam [Reference Kohlbrecher, Von Stryk, Meyer and Klingauf27], Gmapping [Reference Grisetti, Stachniss and Burgard28], and Cartographer [Reference Hess, Kohler, Rapp and Andor29]. The results showed that the trajectory generated by Cartographer fluctuated greatly and Gmapping failed in some tests. Koval et al. [Reference Koval, Kanellakis and Nikolakopoulos30] used a dataset collected in an underground area with multiple tunnels to conduct experimental analysis of various LiDAR-based positioning algorithms. However, all of the tested algorithms showed large drift in the Z-axis since the underground mine environment had sparse features and only one LiDAR was used. In addition to traditional pose estimation methods, Wang et al. [Reference Wang, Wu, Liu and Wang31] proposed a supervised 3D point cloud learning model, PWCLO-Net, for LiDAR localization. However, the supervised learning method requires a large amount of point cloud data labeled with ground truth for training. However, the ground truth trajectory in the underground mine is hard to obtain. In conclusion, LiDAR sensors have the advantage of accurate ranging, which are not affected by illumination. The localization accuracy of the LiDAR-based method can reach the decimeter level in general outdoor scenes. However, in degraded mine environments, the sparse point clouds measured by LiDAR cannot directly distinguish the similar geometry structure of the mine roadway, resulting in a non-negligible cumulative error in the LiDAR-based localization method.

Multi-sensor fusion-based localization

As mentioned above, each type of sensor has inherent defects, such as the visible light vision sensor being sensitive to illumination, the resolution of infrared vision image being low, and the LiDAR point cloud being sparse. Multi-sensor data fusion can effectively incorporate the advantages of each sensor to better adapt to complex mines. Alexis et al. [Reference Alexis32] proposed a concurrent fusion framework consisting of LOAM, ROVIO, and KTIO to estimate the pose of a UAV in the mine. However, they did not perform a quantitative evaluation of the localization accuracy. Jacobson et al. [Reference Jacobson, Zeng, Smith, Boswell, Peynot and Milford33] fused LiDAR point cloud with IMU data for pose estimation and used visual images for scene recognition. However, their method needs extensive human intervention to build maps for localization. In addition, some studies combined active sensors such as RFID, UWB, and WIFI with environmental sensing sensors such as LiDAR and camera to localize robot in a mine. Lavigne et al. [Reference Lavigne and Marshall34] fused 2D LiDAR, RFID, and absolute optical encoders to construct an underground localization network of passive RFID tags. Nevertheless, this method restricts the localization to a two-dimensional plane, which results in a large deviation in the robot’s pose when the height of the mine changes. Li et al. [Reference Li, Zhu, You and Tang35] proposed an underground mine pseudo-GPS system, composed of UWB to localize the robot in the mine. Unfortunately, the deployment cost of UWB is high, which limits its practical application in underground positioning. Wang et al. [Reference Wang, Wang, Ding, Liu, Wang, Fan, Bai, Hongbiao and Du36] integrated multi-source data containing WIFI, LiDAR, and a camera for estimating the pose of an underground search and rescue robot. However, the intricate tunnel structure of the mine has a large impact on the signal transmission, making it difficult to obtain the robot’s pose in areas where the WIFI signal intensity is low. In summary, the existing multi-sensor fusion-based localization methods are also limited in the perceptually degraded mine, and it is challenging to accurately estimate the UAV pose in the complex mine.

Based on the above review of localization methods for the robot/UAV in mines, the vision-based and LiDAR-based methods suffer from localization error accumulation due to the lack of enough image features and high similarity geometric structure. Most of the current robot/UAV localization methods for underground mines are based on the alignment of image features or geometric features. While with the development of the robot intelligence, semantic features [Reference Li, Gu, Li, Li, Guo, Gao, Zhao, Yang, Li and Dong37, Reference Ma, Jiang, Ou and Xu38] are conducive to improving navigation and localization. Inspired by manual inspection, intersections are the critical semantic features to achieve accurate UAV localization in underground mines. Therefore, based on our previous work proposed in [Reference Chen, Feng, Zhao, Wang and Liang39] and [Reference Chen, Feng, Wang and Liang40], this paper proposes a semantic knowledge database-based localization method for UAVs in underground mines. Based on the intersection recognition method proposed in [Reference Chen, Feng, Wang and Liang40], a semantic knowledge database is established by segmenting the intersection point cloud from the pre-built map of the mine. Furthermore, the global pose constraint of the current frame with the semantic knowledge database is constructed by detecting semantic intersection features of the mine in real time during UAV flight. Combining the relative pose constraints between the keyframes of the LiDAR point cloud, the pose graph model is established for UAV pose optimization.

2. System overview

To realize stable UAV flight over a long distance in underground mines, a UAV pose estimation framework is proposed in this paper based on LiDAR odometry constraint and semantic intersection pose constraint. The framework is shown in Figure 1. The gray part represents the process of relative pose constraint factor construction, and the yellow part represents the process of global pose constraint factor construction.

Figure 1. The framework for UAV pose estimation based on semantic knowledge database.

To construct the relative pose constraint during UAV flight, the keyframe selection procedure is performed first to improve the computational efficiency of pose estimation, using equal distance intervals and equal angle intervals selection strategies [Reference Zhou, Koppel and Kaess41, Reference Zhang42]. That is, if the relative translation distance between the new frame and the last keyframe is larger than a set distance threshold, or if the rotation angle between the new frame and the last keyframe is larger than a set angle threshold, the new frame will be selected as the keyframe.

Based on the selected keyframes, the geometric features of the point cloud are extracted for estimating the pose change between the neighboring keyframes. Inspired by the feature-extracting method proposed by LOAM, the curvature features of each point in the measured LiDAR point cloud are calculated for extracting edge features and plane features. By establishing the point-to-line and point-to-plane spatial distance functions based on the extracted edge and plane features between keyframes, the relative pose and local map can be estimated. Furthermore, the relative pose constraint factor is constructed taking into account the noise during UAV flight.

As stated in [Reference Chalvatzaras, Pratikakis and Amanatiadis43, Reference Liu, Di and Xu44], map-based localization algorithms are currently considered as the most accurate ones. In this paper, the intersection semantic knowledge database is established based on the pre-built point cloud map. The pre-built point cloud maps are often referred to as High Definition Maps (HD Maps). The resolution of HD Maps can reach centimeter-level accuracy. The intersection pose constraint factor building process is shown in the yellow part of Figure 1. The intersection semantic knowledge database is established based on the pre-built point cloud map. The green part of Figure 1 shows the building procedures of mine intersection semantic knowledge database. The pre-built point cloud map of the underground mine is the basis of the semantic knowledge database, which can be reconstructed by our previous work proposed in [Reference Chen, Feng, Zhao, Wang and Liang39]. Based on the pre-built point cloud map, the dense point cloud information of each intersection is segmented. Accordingly, the location of each intersection, as well as the Polar Wall Contour Vector (PWCV) and Cylinder Wall Contour (CWC) descriptors, are generated for building the intersection semantic knowledge database. Then, the semantic features of the sampled LiDAR point cloud during UAV flight are detected in real time. The candidate intersection is selected by comparing the intersection similarity between the current keyframe and the intersections in the semantic knowledge database. Next, the ICP algorithm is applied to register the detected intersection with the corresponding intersection in the mine intersection semantic knowledge database. If the ICP distance after registration is below the minimum ICP distance threshold, the intersection pose factor is constructed. Combining the relative pose constraint factor and the intersection pose constraint factor, the pose factor graph model is established for UAV pose estimation.

3. Intersection-based factor graph model

The factor graph model is a Bayesian network-based graph optimization model, which is proposed by Kschischang et al. [Reference Kschischang, Frey and Loeliger45] and suitable for modeling complex state estimation problems in SLAM and Structure From Motion (SFM). The factor graph is an undirected graph that contains two types of nodes, that is, variable nodes and factor nodes. The variable nodes represent the unknown variables to be estimated, such as the UAV pose during flight. The factor nodes represent probabilistic constraint relationships between variable nodes, which can be obtained from the sensor measurements or prior knowledge. The factor graph can handle back-end optimization problems incrementally. When the new edges and nodes are added to the graph model, only the nodes connected to the newly added nodes need to be optimized. Therefore, this paper proposes an intersection-based factor graph model for UAV pose estimation.

Figure 2. The pose factor graph model of UAV.

The proposed UAV pose factor graph model is shown in Figure 2. The UAV pose $\boldsymbol{x}_{\boldsymbol{i}}$ is the variable node to be estimated. $I_k$ is the position of the intersection. The gray box represents the relative pose constraint factor between LiDAR keyframes. The yellow box represents the intersection pose constraint factor between the current keyframe and the observed intersection. The pose variable of the graph factor model to be optimized is denoted as:

(1) \begin{equation} \boldsymbol{\chi } = \left \{{\boldsymbol{x}_{\textbf{0}},\boldsymbol{x}_{\textbf{1}},\boldsymbol{x}_{\textbf{2}}, \ldots, \boldsymbol{x}_{\boldsymbol{i}}, \ldots, \boldsymbol{x}_{\boldsymbol{\tau}}} \right \},\boldsymbol{x}_{\boldsymbol{i}} = \{ \boldsymbol{t}_{\boldsymbol{i}}^{\boldsymbol{W}},\boldsymbol{\theta}_{\boldsymbol{i}}^{\boldsymbol{W}}\} \end{equation}

where $\boldsymbol{\chi }$ is the set of pose variables to be estimated. $\boldsymbol{t}_{\boldsymbol{i}}^{\boldsymbol{W}}=(t_x,t_y,t_z)$ is the x-y-z translation of the UAV in the world coordinate system. $\boldsymbol{\theta}_{\boldsymbol{i}}^{\boldsymbol{W}}=(\theta _x,\theta _y,\theta _z)$ represents the roll-pitch-yaw rotation of the UAV in the world coordinate system. According to the Bayesian rule, the joint probability expression of $\boldsymbol{\chi }$ is

(2) \begin{equation} P(\boldsymbol{\chi }|\boldsymbol{Z}) = \prod \limits _{k,i}{P(\boldsymbol{z}_{\boldsymbol{k}}\left |{\boldsymbol{x}_{\boldsymbol{i}}} \right .)} \prod \limits _i{P(\boldsymbol{x}_{\boldsymbol{i}}\left |{\boldsymbol{x}_{\boldsymbol{i}\boldsymbol-\textbf{1}}} \right .)} \end{equation}

where $\boldsymbol{Z}$ represents the set of observations. The first term $P({I_k} |{\boldsymbol{x}_{\boldsymbol{i}}})$ is the likelihood probability, which represents the probability of obtaining observation $\boldsymbol{z}_{\boldsymbol{k}}$ at UAV pose $\boldsymbol{x}_{\boldsymbol{i}}$ . The second term $P(\boldsymbol{x}_{\textbf{i}}|{\boldsymbol{x}_{\boldsymbol{i}\boldsymbol-\textbf{1}}})$ is the prior probability, which represents the relationship between pose ${\boldsymbol{x}_{\boldsymbol{i}\boldsymbol-\textbf{1}}}$ and $\boldsymbol{x}_{\boldsymbol{i}}$ . More generally, Eq. (2) can be represented by the product of relative pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{L}} (\boldsymbol{\chi}_{\boldsymbol{i}})$ and intersection pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{I}} (\boldsymbol{\chi}_{\boldsymbol{i}})$ as:

(3) \begin{equation} P(\boldsymbol{\chi } |\boldsymbol{Z}) = \prod \boldsymbol{\psi}_{\boldsymbol{L}} (\boldsymbol{\chi}_{\boldsymbol{i}})\prod \boldsymbol{\psi}_{\boldsymbol{I}} (\boldsymbol{\chi}_{\boldsymbol{i}}) \end{equation}

where $\boldsymbol{\chi}_{\boldsymbol{i}}$ denotes the set of variables $\boldsymbol{x}_{\boldsymbol{i}}$ associated with the factor $\boldsymbol{\psi}_{\boldsymbol{i}}$ . In the following, we use unified $\boldsymbol{\psi}_{\boldsymbol{i}}$ to represent the factors $\boldsymbol{\psi}_{\boldsymbol{L}}$ and $\boldsymbol{\psi}_{\boldsymbol{I}}$ , respectively. Each factor $\boldsymbol{\psi}_{\boldsymbol{i}}$ is a function of the pose variables in  $\boldsymbol{\chi}_{\boldsymbol{i}}$ .

Based on the above derivation, the solution of the UAV pose factor graph model is to find the optimal estimation of the variables $\boldsymbol{\chi }$ for a prior relative pose and a known observation. It can then be converted into a maximum likelihood problem as:

(4) \begin{equation} \boldsymbol{\chi^*}_{MLE} = \arg \max \prod{\boldsymbol{\psi}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})} \prod{\boldsymbol{\psi}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})} \end{equation}

With the assumption that the measurement noise $\boldsymbol{\varpi}_{\boldsymbol{i}}$ follows the Gaussian distribution [Reference Thrun46], the maximum likelihood problem can be transformed into a simpler form. The general form of the measurement function is

(5) \begin{equation} \boldsymbol{z}_{\boldsymbol{i}} ={h_i}(\boldsymbol{\chi}_{\boldsymbol{i}}) + \boldsymbol{\varpi}_{\boldsymbol{i}} \end{equation}

where the noise term $\boldsymbol{\varpi}_{\boldsymbol{i}}$ follows the distribution $\boldsymbol{\varpi}_{\boldsymbol{i}} \sim \textrm{N}(0,\boldsymbol{\Omega}_{\boldsymbol{i}})$ . Then, the probability density function expansion of the corresponding constraint factor $\boldsymbol{\psi}_{\boldsymbol{i}}$ takes the form:

(6) \begin{equation} \begin{array}{l} \boldsymbol{\psi}_{\boldsymbol{i}}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right ) \propto \exp \{ - \frac{1}{2}\left \|{{h_i}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right ) -\boldsymbol{z}_{\boldsymbol{i}}} \right \|_{\boldsymbol{\Omega}_{\boldsymbol{i}}}^2\}{\rm{ = }}\exp \{ - \frac{1}{2}\left \|{\boldsymbol{e}_{\boldsymbol{i}}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )} \right \|_{\boldsymbol{\Omega}_{\boldsymbol{i}}}^2\} \end{array} \end{equation}

where $\boldsymbol{e}_{\boldsymbol{i}}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ denotes the residual function of the $i$ th constraint factor ${\boldsymbol{\psi}_{\boldsymbol{i}}} (\boldsymbol{\chi}_{\boldsymbol{i}})$ . $\left \|{\boldsymbol{e}_{\boldsymbol{i}}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )} \right \|_{\boldsymbol{\Omega}_{\boldsymbol{i}}}^2 = \boldsymbol{e}_{\boldsymbol{i}}{\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )^T}\boldsymbol{\Omega}_{\boldsymbol{i}}^{ - 1}\boldsymbol{e}_{\boldsymbol{i}}\left ({\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ is defined as the square of the Mahalanobis distance with covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{i}}$ . Substituting Eq. (6) into Eq. (4), the maximum likelihood estimation of the pose variable $\boldsymbol{\chi }$ is equivalent to minimizing the negative logarithm of the constraint factor. That is, the objective function for the optimization of the UAV flight pose $\boldsymbol{\chi } _{MLE}^ *$ can be established as:

(7) \begin{equation} \boldsymbol{\chi }_{MLE}^ * = \mathop{\arg \min }\limits _\chi \left \{{\left \|{\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})} \right \|_{\boldsymbol{\Omega}_{\boldsymbol{L}}}^2 + \left \|{\boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})} \right \|_{\boldsymbol{\Omega}_{\boldsymbol{I}}}^2} \right \} \end{equation}

where $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ is the residual function of the LIDAR relative pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ . $\boldsymbol{\Omega}_{\boldsymbol{L}}$ is the covariance matrix of the LIDAR relative pose estimation, which determines the weighting coefficients of the residual function $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ in the UAV pose objective function. $\boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ is the residual function of the intersection pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ . $\boldsymbol{\Omega}_{\boldsymbol{I}}$ is the covariance matrix of the intersection observation, which determines the weighting coefficients of the residual function $\boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ in the UAV pose objective function. The following subsection will separately describe the modeling of the LiDAR relative pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{L}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ and the intersection pose constraint factor $\boldsymbol{\psi}_{\boldsymbol{I}}(\boldsymbol{\chi}_{\boldsymbol{i}})$ in detail.

3.1. LiDAR relative pose constraint factor

The scanning frequency of LiDAR is 10 Hz. The continuously scanned LiDAR point clouds contain some redundant information, so LiDAR keyframes are selected to build the UAV factor graph model. Therefore, the equal interval relative pose change sampling criterion is applied to ensure the uniform distribution of the LiDAR keyframes. That is, the first scanned LiDAR frame is selected as the first keyframe point cloud. Then, the newly scanned frame is aligned with the previous keyframe for calculating the relative pose change $\boldsymbol{\Delta} \boldsymbol{T}$ :

(8) \begin{equation} \boldsymbol{\Delta} \boldsymbol{T} = \left [{\begin{array}{*{20}{c}}\boldsymbol{\Delta} \boldsymbol{R}&\boldsymbol{\Delta} \boldsymbol{t}\\ 0&1\end{array}} \right ] \end{equation}

where $\boldsymbol{\Delta} \boldsymbol{R}$ is the relative rotation matrix and $\boldsymbol{\Delta} \boldsymbol{t}$ is the relative translation vector. Based on $\boldsymbol{\Delta} \boldsymbol{R}$ and $\boldsymbol{\Delta} \boldsymbol{t}$ , the translation distance $||\boldsymbol{\Delta} \boldsymbol{t}||_2$ and the relative rotation angle $\Delta \theta$ between the current frame and the previous keyframe can be calculated by:

(9) \begin{equation} \left \{ \begin{array}{l}{\left \| \boldsymbol{\Delta} \boldsymbol{t} \right \|_2} = \sqrt{\Delta{\rm{t}}_x^2 + \Delta{\rm{t}}_y^2 + \Delta{\rm{t}}_z^2} \\ \Delta \theta = \arccos \dfrac{{trace(\boldsymbol{\Delta} \boldsymbol{R}) - 1}}{2} \end{array} \right . \end{equation}

If the relative distance $||\boldsymbol{\Delta} \boldsymbol{t}||_2$ is larger than the set minimum translation distance threshold $\varepsilon _t$ or the relative rotation angle $\Delta \theta$ is larger than the set minimum rotation angle threshold $\varepsilon _\theta$ , the current frame can be selected as a new keyframe. Then, the new keyframe can be added to the factor graph model as a new node.

Once a new keyframe is added to the factor graph model, the relative pose constraint between the new and previous keyframe needs to be constructed. For two keyframe point clouds $i$ and $j$ , the relative pose transformation matrix $T_i^j$ can be calculated by registering the spatial geometry features. The relative pose transformation matrix $T_i^j$ satisfies the following equation:

(10) \begin{equation} \boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{j}} = \boldsymbol{(}\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}})^{\boldsymbol-\textbf{1}}\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}} \end{equation}

where $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ is the transformation matrix from the $i$ th keyframe to the world coordinate system and $\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}$ is the transformation matrix from the $j$ th keyframe to the world coordinate system.

Due to the accumulated error caused by the environmental perception deviation during UAV flight, Eq. (10) will not be strictly valid. Then, the residual function $\boldsymbol{e}_{\boldsymbol{L}}$ between the $i$ th keyframe and $j$ th keyframe can be denoted as:

(11) \begin{equation} \boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}}) = ln((\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{j}})^{-1}{(\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}})^{ - 1}}\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}) \end{equation}

In the above residual function, there are two pose variables to be optimized: the pose $\boldsymbol{x}_{\boldsymbol{i}}$ of $i$ th keyframe and the pose $\boldsymbol{x}_{\boldsymbol{j}}$ of $j$ th keyframe. Therefore, it is required to find the derivative of $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ with respect to the variables $\boldsymbol{x}_{\boldsymbol{i}}$ and $\boldsymbol{x}_{\boldsymbol{j}}$ . It is equivalent to finding the derivative of ${\boldsymbol{e}_{\boldsymbol{L}}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ with respect to $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ and $\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}$ . However, the transformation matrices $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ and $\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}$ are not closed under addition and subtraction. Thus, Lie group and Lie algebra [Reference Gao, Zhang, Gao and Zhang47] are introduced to solve the deviation. The transformation matrix $\boldsymbol{T}$ is labeled as the special Euclidean group $SE(3)$ and its corresponding Lie algebra is $se(3)$ . In Lie algebra, each pose is denoted by $\boldsymbol{\xi } = \left [ \begin{array}{l}\boldsymbol{\rho } \\\boldsymbol{\phi } \end{array} \right ]$ , where $\boldsymbol{\xi }$ is a six-dimensional vector. The first three dimensions $\boldsymbol{\rho }$ denote the translations and the last three dimensions $\boldsymbol{\phi }$ denote the rotations.

According to the transformation relation between Lie groups and Lie algebras, the Lie algebra form of the pose $\boldsymbol{x}_{\boldsymbol{i}}$ of the $i$ th keyframe is denoted as $\boldsymbol{\xi}_{\boldsymbol{i}}$ , and the Lie algebra form between $i$ th and $j$ th keyframe is denoted as $\boldsymbol{\xi}_{\boldsymbol{ij}}$ . By adding the left perturbation terms ${\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{i}}$ and ${\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{j}}$ to the poses $\boldsymbol{\xi}_{\boldsymbol{i}}$ and $\boldsymbol{\xi}_{\boldsymbol{j}}$ , the residual function $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ is changed to the following form:

(12) \begin{equation} \hat{\boldsymbol{e}}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}}) = \ln{\left ({{{\left ( \boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{j}} \right )}^{ - 1}}{{\left ( \boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}} \right )}^{ - 1}}\exp \left ({{{\left (-{\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{i}} \right )}^ \wedge }} \right )\exp \left ({{{\left ( {\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{j}} \right )}^ \wedge }} \right )\boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}}} \right )^ \vee } \end{equation}

Based on the derivation rule of Lie algebra, the derivatives of the residual function $\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ with respect to $\boldsymbol{\xi}_{\boldsymbol{i}}$ and $\boldsymbol{\xi}_{\boldsymbol{j}}$ are equivalent to the derivatives of the residual function $\hat{\boldsymbol{e}}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})$ with respect to the left perturbation terms ${\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{i}}$ and ${\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{j}}$ :

(13) \begin{equation} \frac{{\partial \boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})}}{{\partial \boldsymbol{\xi}_{\boldsymbol{i}}}} = \frac{{\partial {{\hat{\boldsymbol{e}}}_{\boldsymbol{L}}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})}}{{\partial {\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{i}}}} = - \boldsymbol{J}_{\boldsymbol{r}}^{ - 1}\left ({\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})} \right )Ad\left ({{{\left ( \boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}} \right )}^{ - 1}}} \right ) \end{equation}
(14) \begin{equation} \frac{{\partial \boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})}}{{\partial \boldsymbol{\xi}_{\boldsymbol{j}}}} = \frac{{\partial {{\hat{\boldsymbol{e}}}_{\boldsymbol{L}}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})}}{{\partial {\boldsymbol\delta\boldsymbol\xi}_{\boldsymbol{j}}}} = \boldsymbol{J}_{\boldsymbol{r}}^{ - 1}\left ({\boldsymbol{e}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}})} \right )Ad\left ({{{\left ( \boldsymbol{T}_{\boldsymbol{j}}^{\boldsymbol{W}} \right )}^{ - 1}}} \right ) \end{equation}

where $\boldsymbol{J}_{\boldsymbol{r}}$ is the right multiplication of the Jacobi matrix and $Ad(\cdot )$ represents the adjacency matrix.

Furthermore, it is also necessary to estimate the covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{L}}$ of the relative pose factors. The inverse of the covariance matrix $(\boldsymbol{\Omega}_{\boldsymbol{L}})^{-1}$ is alternatively called the information matrix, which reflects the weight of residuals of each factor in the factor graph model. The covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{L}}$ between $i$ th and $j$ th keyframes can be constructed based on the uncertainty of the matched geometric feature points. Supposing that $\boldsymbol{{}^{\boldsymbol{i}}\boldsymbol{p}_{\boldsymbol{f}}}$ and $\boldsymbol{{}^{\boldsymbol{j}}\boldsymbol{p}_{\boldsymbol{f}}}$ are a pair of feature points between the $i$ th and $j$ th keyframes, the two feature points satisfy the following projection relationship:

(15) \begin{equation}{{}^{\boldsymbol{j}}\boldsymbol{p}_{\boldsymbol{f}}} = \boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{j}}\left (\boldsymbol{{}^{\boldsymbol{i}}\boldsymbol{p}_{\boldsymbol{f}}} - \boldsymbol{t}_{\boldsymbol{j}}^{\boldsymbol{i}} \right ) \end{equation}

Based on the above feature points, the covariance matrix can be estimated by calculating the sum of all matched feature point pairs in the two keyframes:

(16) \begin{equation} \boldsymbol{\Omega}_{\boldsymbol{L}}(\boldsymbol{x}_{\boldsymbol{i}},\boldsymbol{x}_{\boldsymbol{j}}) = \sum \limits _{m = 1}^{{N_f}}{{\boldsymbol{H}^T}{\boldsymbol{\Lambda}_{\boldsymbol{L}} ^{ - 1}}\boldsymbol{H}} \end{equation}

where $N_f$ is the total number of pairs. $\boldsymbol{\Lambda}_{\boldsymbol{L}}$ is the zero-mean Gaussian matrix (the noise matrix of LiDAR). $\boldsymbol{H}$ is the Jacobi matrix, which is defined as:

(17) \begin{equation} \boldsymbol{H} = \left [{\begin{array}{*{20}{c}}{{{\left (\boldsymbol{{}^{\boldsymbol{j}}\boldsymbol{p}_{\boldsymbol{f}}}\right )}^ \wedge }}&\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{j}}\end{array}} \right ] \end{equation}

The noise matrix $\boldsymbol{\Lambda}_{\boldsymbol{L}}$ is defined as:

(18) \begin{equation} \boldsymbol{\Lambda}_{\boldsymbol{L}} = \left [{\begin{array}{*{20}{c}}{2\sigma _x^2}&0&0\\ 0&{2\sigma _y^2}&0\\ 0&0&{2\sigma _z^2} \end{array}} \right ] \end{equation}

where $\sigma _x$ , $\sigma _y$ , and $\sigma _z$ are the measurement noise along the $X$ , $Y$ , and $Z$ axes of the LiDAR sensor with units of m. By setting the covariance matrix at the initial pose as the zero matrix $\boldsymbol{0_{6 \times 6}}$ , the covariance matrix corresponding to the $k$ th keyframe can be iterated according to the following equation as the LiDAR odometry is accumulated:

(19) \begin{equation} \boldsymbol{\Omega}_{\boldsymbol{L}}^{\boldsymbol{k}} \leftarrow \boldsymbol{\Omega}_{\boldsymbol{L}}^{\boldsymbol{k} - \textbf{1}} + {\boldsymbol{H}_{\boldsymbol{k,k}\textbf{-}\textbf{1}}}^T \boldsymbol{\Lambda}_{\boldsymbol{L}}^{ - 1} {\boldsymbol{H}_{\boldsymbol{k,k}\textbf{-}\textbf{1}}} \end{equation}

where $\boldsymbol{H}_{\boldsymbol{k,k}\textbf{-}\textbf{1}}$ corresponds to the Jacobi matrix between the $k$ th keyframe and the $(k-1)$ th keyframe.

3.2. Intersection pose constraint factor

Since most areas of underground roadways are long and narrow corridors with high similarity of geometric structures, the point cloud scanned by LiDAR in such degraded scenarios lacks adequate discriminative features, resulting in increasing accumulated positioning errors. Therefore, the intersection pose constraint factor is added to the factor graph model to provide reliable pose constraint for UAV pose optimization.

Based on our previous work proposed in [Reference Chen, Feng, Wang and Liang40], we first construct an intersection semantic knowledge database from the pre-built point cloud map of the underground mine. The constructed semantic knowledge database contains the geometric invariant point location $\boldsymbol{C}^{\boldsymbol{I}_{\boldsymbol{i}}}$ , the dense point cloud $\boldsymbol{P}^{\boldsymbol{I}_{\boldsymbol{i}}}$ , the PWCV descriptor $\boldsymbol{V}_{\boldsymbol{d}}^{\boldsymbol{I}_{\boldsymbol{i}}}$ , and CWC descriptor ${\boldsymbol{M}_{\boldsymbol{d}}^{\boldsymbol{I}_{\boldsymbol{i}}}}$ of each intersection. While the UAV is inspecting the underground mine, the pose constraint constructing process between the current keyframe and the intersection in the semantic knowledge database is shown in Figure 3. The detailed steps are as follows.

Figure 3. The procedure for intersection location constraints constructing.

  1. 1. The keyframe point cloud is preprocessed first. The key steps of preprocessing include point cloud filtering, mine ground plane detection, and wall point cloud segmentation.

  2. 2. Based on the segmented wall point cloud, the topology feature detection is performed to identify whether the current keyframe is an intersection or not. If the current keyframe is not an intersection, it returns to the first step. If the current keyframe is an intersection, then the geometrical invariant point location of the intersection is computed.

  3. 3. Centering at the geometrical invariant point within radius $R_{pwcv}$ , the intersection point cloud is decoded as a PWCV descriptor $\boldsymbol{V}_{\boldsymbol{curr}}$ . Then, the similarity between the current PWCV descriptor $\boldsymbol{V}_{\boldsymbol{curr}}$ and the PWCV descriptor in the semantic knowledge database is calculated one by one.

  4. 4. According to the calculated similarity, the candidate intersections subset with high similarity is selected. By setting the minimum PWCV similarity threshold $\varepsilon _{pwcv}$ , the intersections with similarity greater than $\varepsilon _{pwcv}$ are selected to form a candidate intersection subset.

  5. 5. The matched intersection is determined by the number of the candidate intersection subset $N_{inter}$ . If $N_{inter}$ is equal to 0, the current keyframe is not a real intersection. The subsequent processes are stopped, and then return to the first step to wait for the processing of the next keyframe. If $N_{inter}$ is equal to 1, it means the unique intersection in the candidate intersections subset is the matched intersection, and the process returns to step 8. If the number of candidate intersections $N_{inter}$ is larger than 1, it means that the current keyframe is similar to multiple intersections of the semantic knowledge database. It is difficult to identify the matched intersection using only the PWCV descriptor similarity results. It is necessary to further compare the similarity of CWC descriptors between the current keyframe and the intersections in the semantic knowledge database.

  6. 6. The CWC descriptor of the current keyframe is generated to select the final matched intersections. The CWC is a cylinder wall contour descriptor, which is generated by adding the height feature encoding on the basis of the PWCV descriptor. The single-frame LiDAR point cloud is sparse and contains less spatial information. Therefore, before generating the CWC descriptor of the current keyframe, the nearest neighboring $N_{his}$ keyframes are transformed to the current keyframe for staking dense point cloud to obtain richer spatial features. The dense point cloud is stacked by:

    (20) \begin{equation} \boldsymbol{P}_{\boldsymbol{dense}} = \boldsymbol{P}_{\boldsymbol{curr}} + \sum \limits _{i = 1}^{{N_{his}}} \boldsymbol{T}_{\boldsymbol{hi}{\boldsymbol{s}_{\boldsymbol{i}}}}^{\boldsymbol{curr}}{\boldsymbol{P}_{\boldsymbol{hi}{\boldsymbol{s}_{\boldsymbol{i}}}}} \end{equation}
    where $\boldsymbol{P}_{\boldsymbol{dense}}$ is the dense point cloud after stacking. $\boldsymbol{P}_{\boldsymbol{curr}}$ is the point cloud of current keyframe. $\boldsymbol{P}_{\boldsymbol{hi}{\boldsymbol{s}}_{\boldsymbol{i}}}$ is the $i$ th nearest-neighbor keyframe point cloud. $\boldsymbol{T}_{\boldsymbol{his}_{\boldsymbol{i}}}^{curr}$ is the transformation matrix from the current keyframe to the $i$ th nearest-neighbor keyframe. Based on the stacked dense point cloud $\boldsymbol{P}_{\boldsymbol{dense}}$ , the CWC descriptor $\boldsymbol{M}_{\boldsymbol{curr}}$ is generated. Meanwhile, its similarity with the CWC descriptors of intersections in the candidate subset is calculated one by one. The intersection with the highest similarity is selected as the final matched intersection.
  7. 7. To avoid false intersection recognition, the number of consecutive recognitions of the same intersection is counted. When the UAV flies through an intersection, this intersection should be recognized by multiple keyframes. At the beginning, the number of keyframes $N_{same}$ continuously recognized for the same intersection is set to 0. If the matched intersection ID of the current keyframe is the same as the matched intersection ID of the latest keyframe, $N_{same}$ is increased by 1. If the matched intersection ID of the current keyframe is different from the matched intersection ID of the latest keyframe, $N_{ same}$ is initialized to 0 and the subsequent steps are terminated. The process then returns to the first step and waits for the processing of the next keyframe.

  8. 8. Judge whether the same intersection is stably detected based on the number of consecutive recognition $N_{same}$ . When $N_{same}$ is smaller than the set minimum number of consecutive recognition $\varepsilon _{same}$ , the subsequent step is terminated until $N_{same}$ is equal to $\varepsilon _{same}$ . Then, it proceeds to the next step, and $N_{same}$ is initialized to 0 simultaneously.

  9. 9. The ICP distance $d_{icp}$ between the current keyframe and the matched intersection is used for distance verification. If the ICP distance $d_{icp}$ is larger than the minimum distance threshold $\varepsilon _{icp}$ , it is considered a wrong match and the intersection pose constraint will not be added to the fact graph model. If the ICP distance $d_{icp}$ is smaller than $\varepsilon _{icp}$ , it is considered as a correct match. Then, the intersection pose constraint between the current keyframe and the matched intersection in the semantic knowledge database is added to the factor graph model for UAV pose optimization.

According to the above procedures, the intersection pose constraint can be established. When the UAV detects a stable intersection at pose $\boldsymbol{x}_{\boldsymbol{i}}$ , the spatial pose relationship among the UAV pose (the LiDAR coordinate system $O_LX_LY_LZ_L$ ), the world coordinate system $O_WX_WY_WZ_W$ , and the intersection coordinate system $O_IX_IY_IZ_I$ is shown in Figure 4.

Figure 4. The 3D location observation of intersection.

As Figure 4 shown, $\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{W}}$ displayed as the orange dashed line is the 3D vector of the observed intersection in the world coordinate system at pose $\boldsymbol{x}_{\boldsymbol{i}}$ . $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ and $\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}$ are displayed as the purple dashed line. $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ represents the 3D vector of the LiDAR sensor, and $\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}$ represents the 3D vector of the observed intersection in the current LiDAR coordinate system. Therefore, the observation $\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}}$ of the intersection in the world coordinate system can be denoted as:

(21) \begin{equation} {{}^{\boldsymbol{W}}{\boldsymbol{z}_{\boldsymbol{I}}}} = \boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{W}} = \boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}} + \boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}} \end{equation}

where $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ is the rotation matrix of the LiDAR coordinate system to the world coordinate system.

The error between the intersection observation ${{}^{\boldsymbol{W}}{{\boldsymbol{z}}_{\boldsymbol{I}}}}$ at pose $\boldsymbol{x}_{\boldsymbol{i}}$ and the prior intersection location provided by the intersection semantic knowledge database $\boldsymbol{\mathord{\buildrel{\lower 3pt\hbox{$\scriptscriptstyle \frown $}} \over t} _I^W}$ constitutes the residual function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ of the current intersection pose constraint factor:

(22) \begin{equation} \boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right ) = \boldsymbol{\mathord{\buildrel{\lower 3pt\hbox{$\scriptscriptstyle \frown $}} \over t} _{\boldsymbol{I}}^{\boldsymbol{W}} }- \boldsymbol{t}_{\boldsymbol{i}}^{\boldsymbol{W}} - \boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}} \end{equation}

where the variables to be optimized are as follows:

(23) \begin{equation} \boldsymbol{\chi}_{\boldsymbol{i}} = \{\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}},\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}},\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}\} \end{equation}

Similarly, since the rotation matrix $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ is not closed under the addition and subtraction, the derivative of the residue function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ with respect to the $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ cannot be computed directly. It is required to convert the rotation matrix $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ into Lie algebra form for derivation. The rotation matrix $\boldsymbol{R}$ is known as the special orthogonal group SO(3), and the Lie algebra form of SO(3) is $\boldsymbol{\phi }$ .

Thus, the Lie algebra form of rotation matrix $\boldsymbol{R}_{\boldsymbol{Li}}^{\boldsymbol{W}}$ in the residual function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ (Eq. (22)) is defined as $\boldsymbol{\phi}_{\boldsymbol{i}}$ . Then, the derivative of the residue function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right )$ with respect to the $\boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}$ can be denoted as:

(24) \begin{equation} \frac{{\partial \boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})}}{{\partial \boldsymbol{\phi}_{\boldsymbol{i}}}} = \frac{{\partial ( - \boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}}\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}})}}{{\partial \boldsymbol{\phi}_{\boldsymbol{i}}}} = \frac{{\partial ( - \exp (\boldsymbol{\phi}_{\boldsymbol{i}}^ \wedge )\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}})}}{{\partial \boldsymbol{\phi}_{\boldsymbol{i}}}} \end{equation}

By introducing the left perturbation terms ${\boldsymbol\delta\boldsymbol\phi}_{\boldsymbol{i}}$ , the derivative of the residual function $\boldsymbol{e}_{\boldsymbol{I}}\left ({\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{x}_{\boldsymbol{i}}} \right )$ with respect to the rotation $\boldsymbol{\phi}_{\boldsymbol{i}}$ is equal to:

(25) \begin{equation} \frac{{\partial \boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})}}{{\partial \boldsymbol{{\phi _i}}}} = \mathop{\lim }\limits _{{{\boldsymbol\delta}{\boldsymbol\phi _{\boldsymbol{i}}}} \to 0} \frac{{ - \exp \left ({{{\left ( \boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}} \right )}^ \wedge }} \right )\exp \left ({\boldsymbol{\phi}_{\boldsymbol{i}}^ \wedge } \right )\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}- \exp \left ({\boldsymbol{\phi}_{\boldsymbol{i}}^ \wedge } \right )\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}}}{\boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}}} \end{equation}

Expanding $\exp \left ({{{\left ( \boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}} \right )}^ \wedge }} \right )$ in Eq. (25) with Taylor function, the derivation $\frac{{\partial \boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})}}{{\partial \boldsymbol{{\phi _{\boldsymbol{i}}}}}}$ can be simplified as:

(26) \begin{align}\begin{aligned} \dfrac{{\partial \boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})}}{{\partial \boldsymbol{\delta{\phi _{\boldsymbol{i}}}}}} &= \mathop{\lim }\limits _{\boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}} \to 0} \dfrac{{ - \left ({\boldsymbol{E}_{\textbf{3}} +{{\left ( \boldsymbol{\delta{\phi _{\boldsymbol{i}}}} \right )}^ \wedge }} \right )\exp \left ({\boldsymbol{\phi}_{\boldsymbol{i}}^ \wedge } \right )\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}} - \exp \left ({\boldsymbol{\phi}_{\boldsymbol{i}}^ \wedge } \right )\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}}}{\boldsymbol{\delta{\phi _{\boldsymbol{i}}}}}\\&{\rm{ = }}\mathop{\lim }\limits _{\boldsymbol{\delta{\phi _{\boldsymbol{i}}}} \to 0} \dfrac{{ -{{\left ( \boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}} \right )}^ \wedge }\exp \left ({\boldsymbol{\phi}_{\boldsymbol{i}}^ \wedge } \right )\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}}}{\boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}}}\\&{\rm{ = }}\mathop{\lim }\limits _{\boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}} \to 0} \dfrac{{\boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}}{{\left ({\exp \left ({\boldsymbol{\phi}_{\boldsymbol{i}}^ \wedge } \right )\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}} \right )}^ \wedge }}}{\boldsymbol{\delta{\boldsymbol\phi _{\boldsymbol{i}}}}}\\&{\rm{ = }}{\left ( {\boldsymbol{R}_{{\boldsymbol{i}}}^{\boldsymbol{W}}{\boldsymbol{t}}_{\boldsymbol{I}}^{{\boldsymbol{i}}}} \right )^ \wedge } \end{aligned}\end{align}

In addition, according to Eq. (22), the derivatives of the residual function $\partial \boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})$ with respect to the optimization variables $\boldsymbol{T}_{\boldsymbol{i}}^{\boldsymbol{W}}$ and $\boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}$ are as follows:

(27) \begin{equation} \frac{{\partial \boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})}}{{\partial \boldsymbol{t}_{\boldsymbol{i}}^{\boldsymbol{W}}}} = - \boldsymbol{E}_{\textbf{3}} \end{equation}
(28) \begin{equation} \frac{{\partial \boldsymbol{e}_{\boldsymbol{I}}(\boldsymbol{z}_{\boldsymbol{I}}^{\boldsymbol{W}},\boldsymbol{\chi}_{\boldsymbol{i}})}}{{\partial \boldsymbol{t}_{\boldsymbol{I}}^{\boldsymbol{i}}}} = - \boldsymbol{R}_{\boldsymbol{i}}^{\boldsymbol{W}} \end{equation}

To construct the intersection pose constraint factor, it is also important to analyze the uncertainty of the intersection observation by estimating the covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{I}}$ . Based on the registered points correspondence between the current keyframe and the matched intersection in the semantic knowledge database, the covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{I}}$ can be calculated. Assuming $\boldsymbol{{}^{\boldsymbol{W}}{\boldsymbol{p}}_{{\boldsymbol{I}}_{\boldsymbol{f}}}}$ and $\boldsymbol{^{\boldsymbol{i}}\boldsymbol{p}_{\boldsymbol{I}_{\boldsymbol{f}}}}$ are a pair of registered feature points in the semantic knowledge database and the $i$ th keyframe, they will obey the following projection function:

(29) \begin{equation} \boldsymbol{{}^{{\boldsymbol{i}}}{\boldsymbol{p}_{{\boldsymbol{I}_{\boldsymbol{f}}}}}} = \boldsymbol{R}_{\boldsymbol{W}}^{\boldsymbol{i}}\left ({{{}^{\boldsymbol{W}}{\boldsymbol{p}_{{\boldsymbol{I}_{\boldsymbol{f}}}}}} - \boldsymbol{t}_{\boldsymbol{i}}^{\boldsymbol{W}}} \right ) \end{equation}

Then, the covariance matrix $\boldsymbol{\Omega}_{\boldsymbol{I}}$ can be estimated by calculating the sum of the distance among all matched feature points between the $i$ th keyframe and intersection in the semantic knowledge database:

(30) \begin{equation} \boldsymbol{\Omega}_{\boldsymbol{I}}\left ({\boldsymbol{{}^{\boldsymbol{W}}{\boldsymbol{z}_{\boldsymbol{I}}}},\boldsymbol{\chi}_{\boldsymbol{i}}} \right ) = \sum \limits _{m = 1}^{{N_{{I_f}}}}{{\boldsymbol{Q}^T}\boldsymbol{\Lambda}_{\boldsymbol{I}}^{ - 1}} \boldsymbol{Q} \end{equation}

where $N_{I_f}$ is the total number of matched point pairs. $\boldsymbol{\Lambda}_{\boldsymbol{I}}$ is the zero-mean Gaussian noise matrix of matched intersections. Based on Eq. 29, the Jocabi matrix $\boldsymbol{Q}$ is defined as:

(31) \begin{equation} \boldsymbol{Q} = \left [{\begin{array}{*{20}{c}}{{{\left ( {{}^{\boldsymbol{W}}{\boldsymbol{p}_{{\boldsymbol{I}_{\boldsymbol{f}}}}}} \right )}^ \wedge }}&{ - \boldsymbol{R}_{\boldsymbol{W}}^{\boldsymbol{i}}} \end{array}} \right ] \end{equation}

Based on the PWCV descriptor generation process proposed in [Reference Chen, Feng, Wang and Liang40], the nonzero portion of the wall contour component of the final formed PWCV descriptor is tightly correlated with the relative pose between the current UAV and the detected intersection geometric invariant point. Therefore, the completeness of the wall contour component of the PWCV descriptor is proposed to estimate the noise matrix $\boldsymbol{\Lambda}_{\boldsymbol{I}}$ . Figure 5(a) shows the PWCV descriptor of an example intersection in the semantic database, and its nonzero eigenvalue dimensions of the wall contour component is $k_b$ . Figure 5(b) is the PWCV descriptor of the detected intersection during UAV flight, and its nonzero eigenvalue dimensions of the wall contour component is $k_c$ . The noise matrix $\boldsymbol{\Lambda}_{\boldsymbol{I}}$ is defined as:

(32) \begin{equation} \boldsymbol{\Lambda}_{\boldsymbol{I}} = \left [{\begin{array}{*{20}{c}}{\dfrac{{{k_c}}}{{k_b}}}&0&0\\ 0&{\dfrac{{{k_c}}}{{k_b}}}&0\\ 0&0&{\dfrac{{{k_c}}}{{k_b}}} \end{array}} \right ] \end{equation}

Thus, the relative pose constraint factor residual function (Eq. (11)) and its covariance matrix (Eq. (16)), and the intersection pose constraint factor residual function (Eq. (22)) and its covariance matrix (Eq. (30)) have been established. By substituting the residual functions and their covariance matrices to the UAV pose factor graph model (Eq. (7)), the UAV pose can be optimized in nonlinear iterations based on the results of deviations (Eq. (13), Eq. (14), Eq. (26), and Eq. (27)).

4. Experiments and results

In this section, two UAV localization tests are conducted in the large-scale Edgar Mine and a mine-like indoor corridor environment. In the two experiments, we compared the localization accuracy with two typical LiDAR-based localization methods, that is, LOAM and Scan Context. LOAM is an open-loop laser odometry localization method. LOAM + Scan Context adds the Scan Context global point cloud descriptor to LOAM for scene recognition. The loop constraints based on Scan Context are added to correct the cumulative localization error during UAV flight after recognizing the same scene.

4.1. UAV localization in Edgar Mine

The first UAV localization experiment is conducted in the simulated Edgar Mine. The simulated Edgar Mine is developed by importing the Edgar Mine environment model file from DARPA [Reference Rogers, Gregory, Fink and Stump48] on the ROS Gazebo platform. As shown in Figure 5 (a), the total length of Edgar Mine is 2.8 km and it consists of a number of narrow roadways and several intersections. The DJI UAV platform equipped with Velodyne 16 LiDAR and Realsense D435 camera is added to the ROS Gazebo for mine exploration, as shown in Figure 5 (b).

Figure 5. UAV flight localization experiment in the Edgar Mine environment. (a) The Edgar Mine deployed in the ROS Gazebo. (b) The UAV simulation platform in the ROS Gazebo.

First, the UAV is controlled manually to fly slowly along the Edgar Mine tunnel for one cycle to record the LiDAR data and RGB-D image data. The Edgar Mine point cloud map is reconstructed based on our previous work [Reference Chen, Feng, Zhao, Wang and Liang39] and the obtained sensor data. Furthermore, the semantic topology information is segmented from the pre-built map to construct a semantic knowledge database for the Edgar Mine. Finally, a long-distance flight trajectory is recorded for UAV flight localization accuracy comparison. The trajectory is shown in Figure 6. The total length of this trajectory is 5.68 km. The LiDAR point cloud data and the ground truth pose are recorded during the UAV flight. In particular, the ground truth pose is obtained by a high-precision IMU sensor without zero drift.

Figure 6. The UAV flight trajectory.

4.1.1. Semantic knowledge database construction

Figure 7. The reconstructed environmental map of Edgar Mine.

With reference to the point cloud fusion framework [Reference Chen, Feng, Zhao, Wang and Liang39], the LiDAR data and depth image are fused based on the Bayesian Kriging model to reconstruct a single-frame high-precision dense point cloud of the mine roadway. Furthermore, the ISS3D key points and FPFH descriptors are extracted from the reconstructed single-frame dense point cloud for spatial feature coarse matching and ICP registration. The reconstructed point cloud map of Edgar Mine is shown in Figure 7.

The roadway of Edgar Mine is complicated and contains several loops and intersections (the ID of each intersection is shown in the yellow box in Figure 7). To construct the intersection semantic knowledge database, the intersection dense point cloud is first segmented from the reconstructed map. Then, the intersection type, geometric invariant points location, the PWCV descriptors, and CWC descriptors are generated for each intersection based on our previous work [Reference Chen, Feng, Wang and Liang40]. The constructed intersection semantic knowledge database is shown in Figure 8.

Figure 8. The intersection semantic knowledge database of the Edgar Mine.

4.1.2. UAV localization accuracy analysis

Figure 9 shows the estimated pose trajectory of UAV based on competing methods and the proposed localization method. Comparing the trajectories plotted in Figure 9, it can be found that the trajectory estimated by LOAM gradually deviates from the ground truth trajectory. This is because the LOAM localization algorithm only relies on the neighboring point cloud registration for pose estimation, which results in error accumulation during long-distance flights in underground mines with similar geometry. LOAM + Scan Context shows a large localization deviation. This is due to the fact that the similar mine roadways cause Scan Context to recognize the scene incorrectly. In contrast, the localization method proposed in this paper can accurately recognize different intersections without false scene recognition. Therefore, in the case where the prior map is not pre-built, the accumulated error at the same intersection can be eliminated, resulting in a reduction of the whole trajectory error. Furthermore, by establishing the semantic knowledge database, the intersection pose constraint between the current frame and the intersection in the semantic knowledge database can be added. Once a stable intersection is detected during flight, the accumulated localization error between the last detected intersection and the current intersection can be eliminated immediately.

Figure 9. The estimated trajectories during UAV flight based on different methods in Edgar Mine.

To quantitatively analyze the localization error of each method [Reference Zhang and Scaramuzza49], the maximum error (MAE), root mean square error (RMSE), and the relative error percentage (REP) are used to evaluate the localization accuracy. REP-1, REP-2, REP-3, REP-4, and REP-5 stand for 20%, 40%, 60%, 80%, and 100% of the trajectory. The localization errors are listed in Table I.

As listed in Table I, the MAE, RMSE, and REP-5 of LOAM are 39.15 m, 10.15 m, and 0.51 %, respectively. The MAE, RMSE, and REP-5 of LOAM + Scan Context are 6.62 m, 2.63 m, and 0.22 %, which cannot completely eliminate the accumulated error of LOAM. In comparison, the MAE, RMSE, and REP-5 of the proposed method are 2.22 m, 1.22 m, and 0.17 %. By adding the intersection pose constraint based on the semantic knowledge database with the pre-built map, the MAE, RMSE, and REP-5 of the proposed method are 2.06 m, 0.60 m, and 0.13 %, which shows over three times performance improvement compared to LOAM. It can be concluded that the proposed method can achieve accurate localization by optimizing the accumulative error of open-loop LOAM.

Table I. UAV flight trajectory errors analysis in Edgar Mine environment.

To further analyze the relationship between the distribution of localization error and the flight trajectory, the localization error maps of the comparison methods and the proposed method are plotted in Figure 10.

Figure 10. The localization error maps with different methods. (a) The error distribution of LOAM. (b) The error distribution of LOAM + Scan context. (c) The error distribution of ours(no map). (d) The error distribution of ours (map).

It can be seen from Figure 10(a) that the localization error of LOAM continuously increases with the increase in the UAV flight distance. The error distribution map in Figure 10(b) shows that the LOAM + Scan Context can detect a loop in the same scene with a small pose change. However, the localization error still accumulates with the fight distance, resulting in a maximum error of up to 6.6 m. As shown in Figure 10(c), based on detecting the same intersection from different directions without a pre-built map, the proposed method can establish stable loop constraints. Then, the maximum error is reduced to 2.2 m. Figure 10(d) shows the error distribution map of the proposed method based on the semantic knowledge database. When an intersection in the semantic knowledge database is detected during UAV flight, the localization error is eliminated immediately. The localization error only remains between two intersections.

4.2. UAV localization in mine-like indoor corridor

To evaluate the localization accuracy and adaptability of the proposed method in real environments, a UAV hardware platform same as the simulated UAV platform is designed and applied to conduct localization experiments in the mine-like indoor corridor. The components of the UAV hardware platform are shown in Figure 11.

Figure 11. The UAV hardware platform.

Since the ground truth pose of UAV cannot be directly measured in a narrow indoor corridor, a localization accuracy evaluation method based on multiple Apriltag [Reference Wang and Olson50] is applied to compare the localization error of different methods. As shown in Figure 12(a), the first experiment was conducted in an indoor corridor. The total length of the indoor corridor is 210 m and its minimum width is 1.3 m. As shown in Figure 12(b), 20 Apriltags were deployed for positioning accuracy evaluation. The position of each Apriltag is listed in Table II. Figure 12(c) shows an example of Apriltag pasted on the wall. To ensure the safety of the localization experiment, the UAV was mounted on a mobile tripod, making it easy to adjust the pose change during flight.

Figure 12. The experiment of UAV localization in the indoor corridor environment. (a) The 2D map of the indoor corridor. (b) The distribution of apriltag in indoor corridor. (c) An example of Apriltag pasted on the wall of the indoor corridor. (d) The process of data recording.

4.2.1. Semantic knowledge database construction

Firstly, the indoor corridor prior map can be reconstructed with the point cloud data scanned by the Velodyne 16 and depth images measured by the Realsense D435, based on our previous work [Reference Chen, Feng, Zhao, Wang and Liang39]. The reconstructed point cloud map of the indoor corridor is shown in Figure 13.

It can be seen from Figure 13 that the indoor corridor contains five intersections, which are numbered 1 to 5. The dense point cloud of each intersection is segmented from the map for constructing the semantic knowledge database, where the segmented point cloud is shown in Figure 14. It can be found from Figure 14 that the geometric structure of intersections 1, 2, and 5 are similar, resulting in the difficulty of distinguishing them only relying on the point cloud data.

Based on our previous work [Reference Chen, Feng, Wang and Liang40], the geometric invariant point of each intersection is detected for generating PWCV and CWC descriptors, which are shown in Figure 15. Correspondingly, the PWCV and CWC similarity matrices are computed for recognizing different intersections. The results are shown in Figure 16. It can be seen that the PWCV and CWC similarities of different intersections are less than 0.8. In the UAV localization experiment, the minimum similarity threshold is set as 0.8 so that the different intersections can be recognized correctly. Based on the segmented intersection point cloud, detected invariant point, and the descriptor, the semantic knowledge database of the indoor corridor is constructed, where the structure is the same as Figure 8.

Table II. The 3D location of Apriltag QR codes on the wall surfaces.

Figure 13. The reconstructed point cloud map of the indoor corridor environment.

Figure 14. The dense point clouds of each intersection in the indoor corridor environment. (a)– (e) The point cloud of intersection 1 to 5, respectively.

Figure 15. The generated descriptors of each intersection. (a) The PWCV descriptors. (b) The CWC descriptors.

Figure 16. The similarities of PWCV descriptors and CWC descriptors in the indoor corridor.

4.2.2. UAV localization accuracy analysis

The UAV platform was driven in the indoor corridor for data recording, and the UAV pose was changed by adjusting the tripod to simulate the flight process. The driving trajectory is shown as the red line of Figure 17. The trajectory passed sequentially through the Apriltags numbered 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-14-13-12-11-10-9-8-7-6-5-16-17-18-19-0, and the total length of the trajectory is 420 m.

Figure 17. The trajectory of UAV in indoor corridor.

Based on the recorded LiDAR data, the UAV pose is estimated by LOAM, LOAM + Scan Context, and the proposed method. The localization errors are computed by calculating the difference between the estimated locations of Apriltags and the ground truth locations of Apriltags, which are listed in Table II. The localization error curves of each method are plotted in Figure 18. The horizontal coordinate represents the IDs of the 31 Apriltags passed by the UAV in sequence, and the vertical coordinate represents the calculated localization error.

Figure 18. The estimated trajectory error curves with different methods in the indoor corridor.

As shown in Figure 18, the localization error of LOAM is increasing continuously with the increase of driving distance, where the average error is 14.18 m and the maximum error is 51.8 m. This is due to the similar geometric structure of the indoor corridor. At the same time, the similar geometric structure also results in mismatches of LOAM + Scan Context, with an average error of 20.9 m, which is higher than that of the open-loop LOAM positioning method. As the red curve shows in Figure 18, the proposed localization method can decrease the average localization error to 5.0 m without a prior map. The proposed localization method (no map) can accurately recognize the same intersection from different directions, thereby constructing stable loop constraints. Furthermore, after adding the intersection constraint factor provided by the prior map, the proposed method can eliminate the accumulated localization error from the previous intersection to the current intersection when passing through the intersection in the semantic knowledge database. Thus, the localization error of the proposed method (with map) is optimized in segments, which greatly improves the accuracy of pose estimation and reduces the average positioning error to 1.7 m.

5. Conclusion

In this paper, a semantic knowledge database-based localization method is proposed for UAV inspection in the perceptually degraded underground mine. First, the relative pose constraint factor is constructed based on the spatial geometry features between neighboring LiDAR keyframes to realize the UAV local pose estimation. Furthermore, the dense point cloud of each intersection is extracted from the prior map. The geometrical invariant point, PWCV, and CWC descriptors are generated for constructing the semantic knowledge database. Moreover, the intersection pose constraint factor is constructed by comparing the semantic topology of the current LiDAR keyframe with the intersections in the semantic knowledge database. Based on the pose factor graph model, the relative pose constraint factor and the intersection pose constraint factor are combined to optimize the UAV flight pose. Finally, the experimental results in the Edgar Mine and the mine-like indoor corridor demonstrate that the proposed UAV localization method proposed in this paper can realize the segmentation elimination of accumulative error, achieve high localization accuracy, and meet the needs of underground inspection and positioning.

Author contributions

Qinghua Liang and Min Chen wrote the main manuscript. Min Chen and Minghui Zhao conducted the experiment and analyze the data. Shigang Wang revised the manuscript. All authors reviewed the manuscript.

Financial support

This work is supported by “Research on key technologies of UAV in a coal mine,” whose project number is 2019-TD-2-CXY007.

Competing interests

The authors declare no conflicts of interest exist.

Ethical approval

Not applicable.

References

Mur-Artal, R., Montiel, J. M. M. and Tardos, J. D., “Orb-slam: A versatile and accurate monocular slam system,” IEEE T. Robot. 31(5), 11471163 (2015).CrossRefGoogle Scholar
Mur-Artal, R. and Tardós, J. D., “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE T. Robot. 33(5), 12551262 (2017).CrossRefGoogle Scholar
Campos, C., Elvira, R., Rodríguez, J. J. Gómez, Montiel, J.é M. M. and Tardós, J. D., “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE T. Robot. 37(6), 18741890 (2021).CrossRefGoogle Scholar
Rogers, J. G., Gregory, J. M., Fink, J. and Stump, E.. Test your Slam! The Subt-Tunnel Dataset and Metric for Mapping. In 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE (2020) pp. 955961.Google Scholar
Özaslan, T., Shen, S., Mulgaonkar, Y., Michael, N. and Kumar, V., “Inspection of Penstocks and Featureless Tunnel-Like Environments Using Micro Uavs,” In: Field and Service Robotics, Tracts in Advanced Robotics (Springer, 2015) vol. 5, pp. 123136.CrossRefGoogle Scholar
Özaslan, T., Mohta, K., Keller, J., Mulgaonkar, Y., Taylor, C. J., Kumar, V., Wozencraft, J. M. and Hood, T.. “Towards Fully Autonomous Visual Inspection of Dark Featureless Dam Penstocks using Mavs.” In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE (2016) pp. 49985005.Google Scholar
Özaslan, T., Loianno, G., Keller, J., Taylor, C. J., Kumar, V., Wozencraft, J. M. and Hood, T., “Autonomous navigation and mapping for inspection of penstocks and tunnels with mavs,” IEEE Robot. Auto. Lett. 2(3), 17401747 (2017).CrossRefGoogle Scholar
Shin, J., Kim, S., Kang, S., Lee, S.-W., Paik, J., Abidi, B. and Abidi, M., “Optical flow-based real-time object tracking using non-prior training active feature model,” Real-Time Imaging 11(3), 204218 (2005).CrossRefGoogle Scholar
Jacobson, A., Zeng, F., Smith, D., Boswell, N., Peynot, T. and Milford, M.. “Semi-Supervised Slam: Leveraging Low-Cost Sensors on Underground Autonomous Vehicles for Position Tracking.” In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE (2018) pp. 39703977.Google Scholar
Zhu, S. S., Wang, G., Blum, H., Liu, J., Song, L., Pollefeys, M. and Wang, H., “SNI-SLAM: Semantic     ,Neural Implicit SLAM”, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), Seattle, WA, USA, 2024, pp. 21167–21177. doi: 10.1109/CVPR52733.2024.02000.Google Scholar
Kramer, A., Kasper, M. and Heckman, C., “Vi-Slam for Subterranean Environments,” In: Field and Service Robotics, Proceedings in Advanced Robotics (Springer, 2021) vol.16, pp. 159172.CrossRefGoogle Scholar
Leutenegger, S., Lynen, S., Bosse, M., Siegwart, R. and Furgale, P., “Keyframe-based visual–inertial odometry using nonlinear optimization,” Int. J. Robot. Res. 34(3), 314334 (2015).CrossRefGoogle Scholar
Chen, L. J., Henawy, J., Kocer, B. B. and Seet, G. G. L.. “Aerial Robots on the Way to Underground: An Experimental Evaluation of Vins-Mono on Visual-Inertial Odometry Camera.” In 2019 International Conference on Data Mining Workshops (ICDMW), IEEE (2019) pp. 9196.Google Scholar
Qin, T., Li, P. and Shen, S., “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE T. Robot. 34(4), 10041020 (2018).CrossRefGoogle Scholar
Papachristos, C., Mascarich, F. and Alexis, K.. “Thermal-inertial localization for autonomous navigation of aerial robots through obscurants.” In 2018 International Conference on Unmanned Aircraft Systems (ICUAS), IEEE (2018) pp. 394399.Google Scholar
Khattak, S., Mascarich, F., Dang, T., Papachristos, C. and Alexis, K.. “Robust thermal-inertial localization for aerial robots: A case for direct methods.” 2019 International Conference on Unmanned Aircraft Systems (ICUAS), IEEE (2019) pp. 10611068.Google Scholar
Khattak, S., Papachristos, C. and Alexis, K.. “Visual-Thermal Landmarks and Inertial Fusion for Navigation in Degraded Visual Environments.” In 2019 IEEE Aerospace Conference, IEEE (2019) pp. 19.Google Scholar
Khattak, S., Papachristos, C. and Alexis, K.. “Keyframe-Based Direct Thermal–Inertial Odometry.” In 2019 International Conference on Robotics and Automation (ICRA), IEEE (2019) pp. 35633569.Google Scholar
Fasiolo, D. T., Scalera, L. and Maset, E., “Comparing lidar and imu-based slam approaches for 3D robotic mapping,” Robotica 41(9117 (2023).Google Scholar
Bakambu, J. N. and Polotski, V., “Autonomous system for navigation and surveying in underground mines,” J. Field Robot. 24(10), 829847 (2007).CrossRefGoogle Scholar
Thrun, S., Thayer, S., Whittaker, W., Baker, C. R., Burgard, W., Ferguson, D., Hähnel, D., Montemerlo, M. D., Morris, A., Omohundro, Z. and Reverte, C. F., “Autonomous exploration and mapping of abandoned mines.” IEEE Robotics & Automation Magazine 11, 79–91 (2004).CrossRefGoogle Scholar
Zhang, J. and Singh, S., “Loam: Lidar Odometry and Mapping in Real-Time,” In: Robotics: Science and Systems. vol. 2, Berkeley, CA, (2014) pp. 19.Google Scholar
Shan, T. and Englot, B.. “Lego-Loam: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain.” In. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE (2018) pp. 47584765.Google Scholar
Li, M., Zhu, H., You, S., Wang, L. and Tang, C., “Efficient laser-based 3d slam for coal mine rescue robots,” IEEE Access 7, 1412414138 (2018).CrossRefGoogle Scholar
Papachristos, C., Khattak, S., Mascarich, F. and Alexis, K.. “Autonomous navigation and mapping in underground mines using aerial robots.” In 2019 IEEE Aerospace Conference, IEEE (2019) pp. 18.Google Scholar
Chow, J. F., Kocer, B. B., Henawy, J., Seet, G., Li, Z., Yau, W. Y. and Pratama, M., “Toward underground localization: Lidar inertial odometry enabled aerial robot navigation.” (2019) [J]. CoRR abs/1910.13085.Google Scholar
Kohlbrecher, S., Von Stryk, O., Meyer, J. and Klingauf, U.. “A flexible and scalable slam system with full 3d motion estimation.” In 2011 IEEE international symposium on safety, security, and rescue robotics, IEEE (2011) pp. 155160.Google Scholar
Grisetti, G., Stachniss, C. and Burgard, W.. Improving Grid-Based Slam with Rao-Blackwellized particle Filters by Adaptive Proposals and Selective Resampling. In: Proceedings of the 2005 IEEE international conference on robotics and automation, IEEE (2005) pp. 24322437.Google Scholar
Hess, W., Kohler, D., Rapp, H. and Andor, D.. " Real-Time Loop Closure in 2D Lidar Slam.” 2016 IEEE international conference on robotics and automation (ICRA), IEEE (2016) pp. 12711278.Google Scholar
Koval, A., Kanellakis, C. and Nikolakopoulos, G., “Evaluation of lidar-based 3d slam algorithms in subt environment,” IFAC-PapersOnLine 55(38), 126131 (2022).CrossRefGoogle Scholar
Wang, G., Wu, X., Liu, Z. and Wang, H.. “Pwclo-net: Deep Lidar Odometry in 3D Point Clouds using Hierarchical Embedding Mask Optimization.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2021) pp. 1591015919.Google Scholar
Alexis, K.. “Resilient Autonomous Exploration and Mapping of Underground Mines using Aerial Robots.” In 2019 19th International Conference on Advanced Robotics (ICAR), IEEE (2019) pp. 18.Google Scholar
Jacobson, A., Zeng, F., Smith, D., Boswell, N., Peynot, T. and Milford, M., “What localizes beneath: A metric multisensor localization and mapping system for autonomous underground mining vehicles,” J. Field Robot. 38(1), 527 (2021).CrossRefGoogle Scholar
Lavigne, N. J. and Marshall, J. A., “A landmark-bounded method for large-scale underground mine mapping,” J. Field Robot. 29(6), 861879 (2012).CrossRefGoogle Scholar
Li, M.-G., Zhu, H., You, S.-Z. and Tang, C.-Q., “Uwb-based localization system aided with inertial sensor for underground coal mine applications,” IEEE Sens. J. 20(12), 66526669 (2020).CrossRefGoogle Scholar
Wang, G., Wang, W., Ding, P., Liu, Y., Wang, H., Fan, Z., Bai, H., Hongbiao, Z. and Du, Z., “Development of a search and rescue robot system for the underground building environment,” J. Field Robot. 40(3), 655683 (2023).CrossRefGoogle Scholar
Li, S., Gu, J., Li, Z., Li, S., Guo, B., Gao, S., Zhao, F., Yang, Y., Li, G. and Dong, L., “A visual slam-based lightweight multi-modal semantic framework for an intelligent substation robot,” Robotica, 42(7), 21692183 (2024). doi: 10.1017/S0263574724000511.CrossRefGoogle Scholar
Ma, T., Jiang, G., Ou, Y. and Xu, S., “Semantic geometric fusion multi-object tracking and lidar odometry in dynamic environment,” Robotica 42(3), 891910 (2024). doi: 10.1017/S0263574723001868.CrossRefGoogle Scholar
Chen, M., Feng, Y., Zhao, M., Wang, S. and Liang, Q., “Fusion of sparse lidar and uncertainty-based 3d vision for accurate depth estimation with bayesian kriging,” Opt. Eng. 61(1), 013106013106 (2022).CrossRefGoogle Scholar
Chen, M., Feng, Y., Wang, S. and Liang, Q., “A mine intersection recognition method based on geometric invariant point detection using 3D point cloud,” IEEE Robot. Auto. Lett. 7(4), 1193411941 (2022).CrossRefGoogle Scholar
Zhou, L., Koppel, D. and Kaess, M., “Lidar slam with plane adjustment for indoor environment,” IEEE Robot. Auto. Lett. 6(4), 70737080 (2021).CrossRefGoogle Scholar
Zhang, Y., “Lilo: A novel lidar–imu slam system with loop optimization,” IEEE T. Aero. Elec. Sys. 58(4), 26492659 (2021).CrossRefGoogle Scholar
Chalvatzaras, A., Pratikakis, I. and Amanatiadis, A. A., “A survey on map-based localization techniques for autonomous vehicles,” IEEE T. Intell. Veh. 8(2), 15741596 (2022).CrossRefGoogle Scholar
Liu, Q., Di, X. and Xu, B., “Autonomous vehicle self-localization in urban environments based on 3D curvature feature points–monte carlo localization,” Robotica 40(3), 817833 (2022).CrossRefGoogle Scholar
Kschischang, F. R., Frey, B. J. and Loeliger, H.-A., “Factor graphs and the sum-product algorithm,” IEEE T. Inform. Theory 47(2), 498519 (2001).CrossRefGoogle Scholar
Thrun, S., “Probabilistic robotics,” Commun. ACM 45(3), 5257 (2002).CrossRefGoogle Scholar
Gao, X., Zhang, T., Gao, X. and Zhang, T., “Lie Group and Lie Algebra,” In: Introduction to Visual SLAM: From Theory to Practice, (2021) pp. 6386.CrossRefGoogle Scholar
Rogers, J. G., Gregory, J. M., Fink, J. and Stump, E.. “Test your Slam! The Subt-Tunnel Dataset and Metric for Mapping.” In 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE (2020) pp. 955961.Google Scholar
Zhang, Z. and Scaramuzza, D.. “A Tutorial on Quantitative Trajectory Evaluation for Visual (-Inertial) Odometry.” In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE (2018) pp. 72447251.Google Scholar
Wang, J. and Olson, E.. “Apriltag 2: Efficient and Robust Fiducial Detection.” In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE (2016) pp. 41934198.Google Scholar
Figure 0

Figure 1. The framework for UAV pose estimation based on semantic knowledge database.

Figure 1

Figure 2. The pose factor graph model of UAV.

Figure 2

Figure 3. The procedure for intersection location constraints constructing.

Figure 3

Figure 4. The 3D location observation of intersection.

Figure 4

Figure 5. UAV flight localization experiment in the Edgar Mine environment. (a) The Edgar Mine deployed in the ROS Gazebo. (b) The UAV simulation platform in the ROS Gazebo.

Figure 5

Figure 6. The UAV flight trajectory.

Figure 6

Figure 7. The reconstructed environmental map of Edgar Mine.

Figure 7

Figure 8. The intersection semantic knowledge database of the Edgar Mine.

Figure 8

Figure 9. The estimated trajectories during UAV flight based on different methods in Edgar Mine.

Figure 9

Table I. UAV flight trajectory errors analysis in Edgar Mine environment.

Figure 10

Figure 10. The localization error maps with different methods. (a) The error distribution of LOAM. (b) The error distribution of LOAM + Scan context. (c) The error distribution of ours(no map). (d) The error distribution of ours (map).

Figure 11

Figure 11. The UAV hardware platform.

Figure 12

Figure 12. The experiment of UAV localization in the indoor corridor environment. (a) The 2D map of the indoor corridor. (b) The distribution of apriltag in indoor corridor. (c) An example of Apriltag pasted on the wall of the indoor corridor. (d) The process of data recording.

Figure 13

Table II. The 3D location of Apriltag QR codes on the wall surfaces.

Figure 14

Figure 13. The reconstructed point cloud map of the indoor corridor environment.

Figure 15

Figure 14. The dense point clouds of each intersection in the indoor corridor environment. (a)– (e) The point cloud of intersection 1 to 5, respectively.

Figure 16

Figure 15. The generated descriptors of each intersection. (a) The PWCV descriptors. (b) The CWC descriptors.

Figure 17

Figure 16. The similarities of PWCV descriptors and CWC descriptors in the indoor corridor.

Figure 18

Figure 17. The trajectory of UAV in indoor corridor.

Figure 19

Figure 18. The estimated trajectory error curves with different methods in the indoor corridor.