1. Introduction
Gastrointestinal cancer is the second leading cause of cancer death worldwide and accounts for about 35% of all cancer-related deaths [Reference Bernhardt, Nicolau, Soler and Doignon1, Reference Shao, Pei, Chen, Zhu, Wu, Sun and Zhang2]. Some hospitals are now equipped with two-dimensional endoscopic instruments, such as the da Vinci® surgical system (Intuitive Surgical, Inc., Sunnyvale, CA), to assist doctors in performing minimally invasive surgery (MIS) of the gastrointestinal tract, abdominal cavity, chest cavity, and throat. The most direct and effective screening methods for gastrointestinal cancers are two-dimensional endoscopic examinations, such as capsule endoscopy, upper gastrointestinal endoscopy, and colonoscopy [Reference Feuerstein3–Reference Low, Morris, Matsumoto, Stokken, O’Brien and Choby6].
In traditional endoscopic MIS procedures, the position of diseased tissue is generally estimated by visually examining 2D endoscope images. However, endoscope images usually lack sufficient texture; combined with irregular illumination, large areas of similar appearance, and low contrast, this makes it difficult for surgeons to quickly and accurately locate lesions. Other problems due to hand-eye coordination and visual misdirection may also occur during operation [Reference Afifi, Takada, Yoshimura and Nakaguchi7]. Recently, computer vision-based algorithms have attracted much attention for their success in stereoscopic endoscope position tracking and intraoperative reconstruction of surgical scenes. Tatar et al. [Reference Tatar, Mollinger, Den Dulk, van Duyl, Goosen and Bossche8] attempted to use a depth camera combined with a time-of-flight method to locate the positions of surgical instruments. Lamata et al. [Reference Lamata, Morvan and Reimers9] investigated the features (mutual reflection, diffuse reflection, highlights, and colors) of human liver photographs based on a Lambertian reflectance model and attempted to reconstruct a 3D model of the liver by adjusting the albedo and light intensity of the endoscopic images. Wu et al. [Reference Wu, Sun and Chang10] aimed to track geometric constraints of surgical instruments and reconstruct 3D structures from 2D endoscopic images with a constrained decomposition method. Seshamani et al. [Reference Seshamani, Lau and Hager11] combined a video mosaic method and an online processing technique to expand the field of view to better assist surgeons in performing surgeries and diagnosing lesions. Due to the complex features of the intestinal environment, endoscopic images often exhibit strong illumination variation and feature sparsity, making it difficult for the aforementioned methods to achieve precise 3D organ reconstruction and lesion localization.
Recently, the Structure from Motion (SfM) approach was proposed to construct high-quality 3D models of human organs based on endoscopic images. The SfM approach mainly consists of feature extraction, keypoint matching, pose estimation, and bundle adjustment. Based on the SfM technique, Thormaehlen et al. [Reference Thormahlen, Broszio and Meier12] generated a 3D model of the human colon with surface texture features. Koppel et al. [Reference Koppel, Chen and Wang13] developed an automated SfM approach to reconstruct a 3D model of the colon from endoscopic images to assist surgeons in surgical planning. Mirota et al. [Reference Mirota, Wang and Taylor14] proposed a direct SfM approach to track endoscope position using video data to improve the accuracy of endonasal skull base surgery navigation. Kaufman et al. [Reference Kaufman and Wang15] applied a direct Shape from Shading (SfS) algorithm to better extract detailed surface texture information from endoscopic images and combined it with the SfM method to reconstruct a refined 3D model of human organs. Assisted by manual drawing of the outlines of the major colonic folds, Hong et al. [Reference Hong, Tavanapong and Wong16] reconstructed a virtual colon segment from an individual colonoscopy image to aid surgeons in detecting colorectal cancer lesions. However, accurate reconstruction of human organs based on SfM methods requires stable camera motion, since feature points must be matched between multiple images to calculate the camera pose. Furthermore, the data obtained from sensors such as monocular cameras, Inertial Measurement Units, ultrasonic sensors, and lidar are usually large, requiring substantial computing resources for batch processing. Hence, SfM techniques are usually applied offline. In actual surgical operation, real-time feedback plays an important role in providing surgeons with timely and accurate information, allowing them to make optimal decisions and adapt their approach as necessary during the procedure. A real-time online computer vision-based algorithm is hence highly desirable to improve the accuracy and precision of surgical interventions and reduce the risk of complications or adverse outcomes.
The Visual Simultaneous Localization and Mapping (VSLAM) method is a real-time online data processing technique that requires fewer computing resources than the SfM approach. VSLAM utilizes endoscopic video or image sequences to estimate the pose and location of the endoscope and to reconstruct the abdominal cavity and other MIS scenes [Reference Jang, Han, Yoon, Jai and Choi17–Reference Xie, Yao, Wang and Liu19]. The goal of VSLAM is to improve the visual perception of surgeons, and it plays an important role in developing online surgical navigation systems and medical augmented reality technology. Much research in recent years has focused on improving the accuracy and efficiency of VSLAM methods for medical applications, particularly in the context of MIS systems. Mountney et al. [Reference Mountney, Stoyanov and Davison20] first explored the application of VSLAM in MIS by extending the Extended Kalman Filter SLAM (EKF-SLAM) framework to handle complex light reflection and low-texture environments. However, the obtained point clouds were too sparse and could not represent the 3D shapes and detailed surface textures of human organs. Mountney and Yang [Reference Mountney and Yang21] proposed a novel VSLAM method to estimate tissue deformation and laparoscopic camera motion online by establishing a periodic tissue deformation parameter model and generating a joint registered 3D map with preoperative data. However, the slow speed of the system’s map-building algorithm can lead to poor real-time tracking and loss of feature points. In [Reference Klein and Murray22], Klein and Murray proposed the Parallel Tracking and Mapping (PTAM) algorithm, a monocular VSLAM approach based on keyframes. PTAM can run in real time on a single CPU and handle large-scale environments and a variety of lighting conditions. However, it requires high-quality feature detection and feature matching for camera localization and scene mapping.
The aforementioned methods are generally based on monocular endoscopes, making it difficult to process endoscopic images with small viewing angles and rapid frame transitions. Lin et al. [Reference Lin, Johnson and Qian23] extended the application scope of PTAM to stereo endoscopy, allowing simultaneous stereoscopic tracking, 3D reconstruction, and detection of deformation points in the MIS setting, and generating denser 3D maps than EKF-SLAM methods. However, this stereo system suffers from time-consuming feature point matching. Later, Lin et al. [Reference Lin, Sun, Sanchez and Qian24] improved texture feature selection, green channel selection, and reflective area processing of the endoscopic images and proposed a revised VSLAM method to restore the surface structure of a 3D scene of abdominal surgery. However, the proposed method relies heavily on vascular texture on the tissue surface; in cases where the imaged tissue has little or no vascularity, it may not detect distinctive features effectively. Recently, Mur-Artal et al. [Reference Mur-Artal, Montiel J.M. and Tardos25] provided the ORB-SLAM system, constructed via a robust camera tracking and mapping estimator with remarkable camera relocation capabilities. Mahmoud et al. [Reference Mahmoud, Cirauqui and Hostettler26] applied the ORB-SLAM algorithm to track the position of the endoscope without additional tracking elements and provide 3D reconstruction in real time, extending ORB-SLAM to reconstruct semi-dense maps of soft organs. However, although these two feature-point-based ORB-SLAM methods reduce computational complexity, the accompanying loss of information relative to the original images can lead to inaccurate camera localization and visceral surface texture mapping.
Feature point detection is a fundamental and important processing step in Visual Odometry (VO) and VSLAM. Local features for camera pose estimation, such as the Scale Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB), are commonly hand-crafted by calling OpenCV algorithms from a third-party function library. However, the feature points extracted by these algorithms are often unevenly distributed, with large amounts of useful data lost, resulting in inaccurate camera positioning and scene mapping [Reference Mahmoud, Collins, Hostettler, Soler, Doignon and Montiel27–Reference Rublee, Rabaud and Konolige29]. Moreover, the surface of the human viscera often has poor texture, and endoscope images typically have a small field of view and are commonly taken under varying lighting and specular reflection (Fig. 1). Weak textures and specular reflections pose challenges to VSLAM [Reference Mahmoud, Collins, Hostettler, Soler, Doignon and Montiel27], making many SfM or SLAM frameworks such as ORB-SLAM3 [Reference Campos, Elvira and Rodriguez30] ineffective in these situations. In this paper, a self-supervised feature extraction method, “SuperPoint” [Reference Detone, Malisiewicz and Rabinovich31], and a feature-matching technique, “SuperGlue,” are applied to address challenges such as illumination changes, weak textures, and specular reflections in the human viscera. Moreover, this approach accelerates Convolutional Neural Network (CNN) computations to enable real-time endoscopic pose estimation and viscera surface map construction.
Feature matching is another critical step in feature-based VO or SLAM techniques. This involves finding the same features in two images and establishing correspondences between them to achieve camera pose estimation and map updates. The performance of the feature-matching process directly affects the accuracy and stability of the VO or SLAM system. Chang et al. [Reference Chang, Stoyanov and Davison32] used feature matching to perform heart surface sparse reconstruction through structural propagation. The algorithm obtained parallax data between point pairs to estimate stereo parallax of each frame and motion information between consecutive frames. However, the method obtained a sparse parallax field, and further complex interpolation calculations were required to obtain a denser reconstructed scene of the heart surface. Lin et al. [Reference Lin, Sun and Sanchez33] utilized a vessel-based line-matching approach based on block-matching geometry to avoid pixel-wise matching. However, the application of local characteristics of image features of the viscera can lead to mismatched point pairs and thus incorrect camera location. Direct methods such as DSO [Reference Engel, Koltun and Cremers34] or DSM [Reference Zubizarreta, Aguinaga and Montiel35] and hybrid methods such as SVO [Reference Forster, Pizzoli and Scaramuzza36] assume that ambient illumination remains constant, which is difficult to ensure due to severe illumination variations of endoscopic images. The Self-Supervised Learning (SSL) approach can match images by using image content itself as supervision, without requiring explicit labels or annotations. SSL methods have shown promising performance in image-matching tasks such as stereo matching and optical flow estimation of real-life scenarios and have enhanced robustness to local illumination changes [Reference Sarlin, Detone and Malisiewicz37]. However, the performance of SSL in endoscopic image matching is unknown and remains to be studied. This paper proposes an improved SSL method with adaptive deep learning to address data association between endoscopic images.
This paper introduces SPSVO, a self-supervised surgical perception stereo visual odometer for endoscope pose (position and rotation) estimation and scene reconstruction. The proposed method overcomes the adverse effects of endoscopic images on feature extraction and tracking, such as irregular illumination, poor surface texture, low contrast, and extensive similar areas. The main contributions of this paper are as follows:
• A VO system is proposed that integrates a CNN-based SuperPoint feature extraction method and a SuperGlue feature-matching network. The SPSVO system extracts enriched feature points compared with common hand-crafted local feature-detecting methods, such as ORB, SIFT, and SURF.
• An image illumination pre-processing technique is proposed to address mirror reflection and illumination variations of endoscopic images.
• The SPSVO system includes image pre-processing, feature extraction, stereo matching, feature tracking, keyframe selection, and pose graph optimization.
• The performance of the proposed system is evaluated based on a public dataset: “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38]. Results indicate that the proposed system outperforms ORB-SLAM2 [Reference Mur-Artal and Tardós39] and ORB_SLAM2_Endoscopy [40] methods in feature detection and tracking. ORB-SLAM2 cannot extract sufficient feature points to initialize the scene map of viscera and thus results in loss of the endoscope track.
• The proposed system is capable of accurate and rapid operation within the human viscera; the SPSVO system requires as little as 131 ms per frame, enabling real-time surgical navigation.
The rest of this paper is organized as follows: Section 2 presents related work on endoscopic VSLAM methods. Section 3 presents the proposed SPSVO system. Section 4 presents experimental results and analysis. Finally, conclusions are drawn in Section 5.
2. Related work
2.1. VSLAM and VO for endoscopy
VSLAM is a technique that uses camera vision for simultaneous robot self-locating and scene map construction [Reference Durrant-Whyte and Bailey41]. It enables autonomous robot exploration in unknown or partially unknown environments. The architecture of a classical VSLAM system typically includes a front-end visual odometer, backend optimization, loop closure detection, and finally mapping, as shown in Fig. 2.
VSLAM has the potential to estimate the relative pose of the endoscope camera and construct a viscera surface texture map, which is important for lesion localization and surgical navigation. However, complicated intraoperative scenes (e.g., deformable targets, surface texture, sparsity of visual features, viscera specular reflection, etc.) and strict accuracy requirements have posed challenges to the application of VSLAM in minimally invasive surgery. Recently, Lamarca et al. [Reference Bartoli, Montiel and Lamarca42] proposed a monocular non-rigid SLAM method that combines Shape from Template (SfT) and Non-Rigid Structure from Motion (NRSfM) methods for non-rigid scene map construction. However, this method is susceptible to variations in illumination and does not perform well under poor visual texture conditions, rendering it unsuitable for reconstruction of viscera with non-isometric deformations. Later, Gong et al. [Reference Gong, Chen and Li43] constructed an online tracking and relocation framework that employs a rotation-invariant Haar-like descriptor and a simplified random forest discriminator to select and track the target region in gastrointestinal biopsy images. Song et al. [Reference Song, Wang and Zhao44] constructed a real-time SLAM system to address scope-tracking problems through an improved localization technique. Much work has focused on adapting VSLAM to endoscopic scenes, addressing problems such as poor texture [Reference Wei, Feng and Li45, Reference Song, Zhu and Lin46], narrow field of view [Reference Seshamani, Lau and Hager11], and specular reflections [Reference Wei, Yang and Shi47]. Still, the variable illumination problem remains unaddressed. Intraoperative scenarios require accurate camera localization; complex viscera images can lead to mismatched point pairs and thus incorrect camera location. Data association also remains a challenging problem for VSLAM systems in MIS scenarios [Reference Yadav and Kala48]. This paper focuses on addressing the problems of variable illumination and data association for intraoperative scenes.
2.2. SLAM based on SuperPoint and SuperGlue
CNNs have made outstanding achievements in computer vision to aid lesion diagnosis and intraoperative scene reconstruction [Reference Yadav and Kala48–Reference Li, Shi and Long52]. Researchers have studied and improved many aspects of VSLAM with learning-based feature extraction techniques to address variable illumination and poor visceral surface texture in complex surgical scenarios [Reference Liu, Zheng and Killeen53, Reference Liu, Li and Ishii54]. Bruno et al. [Reference Bruno H.M. and Colombini49] presented a novel hybrid VSLAM algorithm that uses a Learned Invariant Feature Transform network for feature extraction within a traditional ORB-SLAM-based backend. Li et al. [Reference Li, Shi and Long52] attempted to use an end-to-end deep CNN in VSLAM to extract local and global descriptors from endoscopic images for pose estimation. Schmidt et al. [Reference Schmidt and Salcudean50] proposed the Real-Time Rotated descriptor (ReTRo), which was more effective than classical descriptors and allowed for the development of surgical tracking and mapping frameworks. However, the aforementioned methods rely on traditional Fast Library for Approximate Nearest Neighbors (FLANN) techniques to track keypoints and match extracted features. FLANN does not perform well at feature point matching of high-similarity images, resulting in mismatches between extracted new features and potential features. Its performance is even worse under variable illumination; therefore, FLANN is not always applicable for MIS [Reference Muja and Lowe55].
This paper proposes to apply a SuperPoint approach for keypoint detection and to utilize the SuperGlue technique to deal with complex data associations in intraoperative scenes. SuperPoint [Reference Detone, Malisiewicz and Rabinovich31] is a self-supervised framework for detecting features and describing points of interest, while SuperGlue [Reference Sarlin, Detone and Malisiewicz37] is a network that can simultaneously filter outliers and match features. Recently, researchers have studied the effectiveness of SuperPoint and SuperGlue in VSLAM systems for MIS [Reference Barbed, Chadebecq and Morlana56, Reference Sarlin, Cadena and Siegwart57]. Barbed et al. [Reference Barbed, Chadebecq and Morlana56] demonstrated that SuperPoint delivers better feature detection in VSLAM than hand-crafted local features. Oliva Maza et al. [Reference Oliva Maza, Steidle and Klodmann58] applied SuperPoint to a monocular VSLAM system to estimate the pose of the ureteroscope tip. Sarlin et al. [Reference Sarlin, Cadena and Siegwart57] proposed a Hierarchical Feature Network (HF-Net) algorithm based on SuperPoint and SuperGlue to predict local features and global descriptors for 6-DoF camera localization. However, existing algorithms require substantial computing power to run in real time, which presents a significant obstacle to building maps in real time. In this work, an SPSVO algorithm is proposed that accelerates the CNN to realize real-time endoscopic pose estimation and viscera surface map construction.
3. Proposed SPSVO approach
3.1. System overview
The proposed SPSVO approach consists of four main modules: feature extraction, stereo matching, keyframe selection, and pose graph optimization, as shown in Fig. 3. SPSVO performs feature matching and keypoint tracking between stereo images and between images in different frames, and avoids incorrect data associations by using the matching results of relevant key points. For real-time performance, SPSVO performs feature tracking only on left-eye images to reduce computation time. The Nvidia TensorRT Toolkit is used to accelerate feature extraction and matching. On the backend, SPSVO uses a traditional pose graph optimization framework for map construction. These modules are designed to enable real-time application of SPSVO within the human intestinal tract and to achieve accurate tracking by combining the efficiency of traditional optimization methods with the robustness of learning-based techniques.
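To make the data flow between these modules concrete, the following structural sketch outlines how the stages could be chained for each stereo pair. All type and function names are hypothetical placeholders for illustration, not the authors’ implementation:

```cpp
#include <vector>

// Structural sketch of the SPSVO data flow described above.
struct StereoFrame { /* left/right images, keypoints, descriptors, pose */ };

class SPSVOPipeline {
public:
    void processStereoPair(StereoFrame& frame) {
        preprocess(frame);             // CLAHE + specular suppression (Sec. 3.2)
        extractFeatures(frame);        // SuperPoint via TensorRT (Sec. 3.3)
        matchStereo(frame);            // SuperGlue left-right matching (Sec. 3.4)
        trackAgainstKeyframes(frame);  // left-eye-only temporal tracking
        if (isKeyframe(frame)) {       // criteria of Sec. 3.5
            keyframes_.push_back(frame);
            optimizePoseGraph();       // LM pose graph back-end (Sec. 3.6)
        }
    }

private:
    void preprocess(StereoFrame&);
    void extractFeatures(StereoFrame&);
    void matchStereo(StereoFrame&);
    void trackAgainstKeyframes(StereoFrame&);
    bool isKeyframe(const StereoFrame&);
    void optimizePoseGraph();
    std::vector<StereoFrame> keyframes_;
};
```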
3.2. Image pre-processing
For image pre-processing, SPSVO uses Contrast-Limited Adaptive Histogram Equalization (CLAHE) [Reference Zuiderveld59] to enhance the contrast, brightness, details, and texture of the input image. Due to severe illumination variability in optical colonoscopy, some parts of the image are overexposed in the L-channel of the color space, producing specular reflections, while other parts are underexposed and appear as dark areas. In this work, pixels with a luminance greater than 50 are marked as reflective regions, and pixel values in a reflective region are set to the average of the surrounding pixels. Residual noise is eliminated by a morphological closing operation. The performance of CLAHE is demonstrated in Fig. 4: the pre-processing effectively improves the uniformity of illumination and the contrast of endoscopic images. Because the endoscope light source is close to the inner wall of the organ and the endoscope moves rapidly, specular highlights are common; this pre-processing step allows the system to suppress mismatches caused by specular reflections.
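As an illustration of this stage, the following OpenCV sketch thresholds the luminance channel, closes the resulting mask morphologically, fills the reflective pixels from their surroundings, and applies CLAHE. The CLAHE parameters, the kernel size, and the use of inpainting as a stand-in for local averaging are our assumptions, not the authors’ exact settings:

```cpp
#include <opencv2/opencv.hpp>

// Illustrative pre-processing sketch following Section 3.2.
cv::Mat preprocessEndoscopicImage(const cv::Mat& bgr)
{
    cv::Mat lab;
    cv::cvtColor(bgr, lab, cv::COLOR_BGR2Lab);
    std::vector<cv::Mat> ch;
    cv::split(lab, ch);

    // Mark reflective (over-exposed) pixels on the L channel. OpenCV stores
    // L scaled to [0, 255]; the paper's threshold of 50 refers to the
    // [0, 100] Lab range, hence the factor 255/100.
    cv::Mat specMask;
    cv::threshold(ch[0], specMask, 50.0 * 255.0 / 100.0, 255, cv::THRESH_BINARY);

    // Morphological closing removes small holes and noise in the mask.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, {5, 5});
    cv::morphologyEx(specMask, specMask, cv::MORPH_CLOSE, kernel);

    // Fill reflective pixels from their surroundings.
    cv::Mat repaired;
    cv::inpaint(bgr, specMask, repaired, 5.0, cv::INPAINT_TELEA);

    // Contrast-Limited Adaptive Histogram Equalization on the L channel.
    cv::cvtColor(repaired, lab, cv::COLOR_BGR2Lab);
    cv::split(lab, ch);
    cv::Ptr<cv::CLAHE> clahe = cv::createCLAHE(2.0, {8, 8});
    clahe->apply(ch[0], ch[0]);
    cv::merge(ch, lab);

    cv::Mat out;
    cv::cvtColor(lab, out, cv::COLOR_Lab2BGR);
    return out;
}
```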
3.3. Proposed SuperPoint model
The SuperPoint network consists of four parts: an encoding network, a feature point detection network, a descriptor detection network, and a loss function. The encoder network converts the input image into a high-dimensional tensor representation for the decoders, making it easier to detect and describe key points. The feature point detection network is a decoding structure that calculates the keypoint probability for each pixel and embeds a sub-pixel convolution algorithm to reduce computational effort. The descriptor detection network is also a decoding structure; it first extracts semi-dense descriptors, then performs bicubic interpolation to obtain full descriptors, and finally applies L2-normalization to obtain unit-length descriptors. The loss function measures the difference between the network output and the ground truth label, guiding the network to optimize and improve its performance in detecting and describing the key points of the input image. This provides better performance for related applications such as VSLAM, 3D reconstruction, and autonomous navigation. The SuperPoint network is trained in PyTorch. The input of the SuperPoint network is a single image I with $I\in R^{H\times W}$ , where $H$ is the height and $W$ is the width of the image, in pixels. The outputs of the network are the positions of the key points extracted in each image and their corresponding descriptors.
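For clarity, the decoding performed by the SuperPoint detection head (channel-wise softmax over 65 channels, removal of the “dustbin” channel, then a depth-to-space reshape from the coarse grid to full resolution) can be sketched as below; the channel-major memory layout is an assumption made for illustration:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Decode the SuperPoint detection head output (Hc x Wc x 65 logits,
// channel-major) into a full-resolution (8*Hc x 8*Wc) keypoint heatmap.
std::vector<float> decodeKeypointHeatmap(const std::vector<float>& logits,
                                         int Hc, int Wc)
{
    const int C = 65;                       // 64 cell positions + 1 "dustbin"
    std::vector<float> heat(8 * Hc * 8 * Wc, 0.f);
    for (int y = 0; y < Hc; ++y) {
        for (int x = 0; x < Wc; ++x) {
            // Numerically stable softmax over the 65 channels of this cell.
            float maxv = -1e30f, sum = 0.f;
            float p[C];
            for (int c = 0; c < C; ++c)
                maxv = std::max(maxv, logits[(c * Hc + y) * Wc + x]);
            for (int c = 0; c < C; ++c) {
                p[c] = std::exp(logits[(c * Hc + y) * Wc + x] - maxv);
                sum += p[c];
            }
            // Drop the dustbin channel and scatter the remaining 64
            // probabilities into the corresponding 8x8 pixel block.
            for (int c = 0; c < 64; ++c) {
                int py = y * 8 + c / 8, px = x * 8 + c % 8;
                heat[py * (8 * Wc) + px] = p[c] / sum;
            }
        }
    }
    return heat;                            // per-pixel keypoint probability
}
```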
Based on Barbed et al. [Reference Barbed, Chadebecq and Morlana56], the loss function can be expressed as
$$\mathcal{L}\left(X,X',D,D';Y,Y',S\right)=\mathcal{L}_{p}(X,Y)+\mathcal{L}_{p}\left(X',Y'\right)+\lambda \mathcal{L}_{d}\left(D,D',S\right)$$
where X and X′ are the outputs of the original detection header for image I and the warped image I′, respectively, and Y and Y′ are the associated detection pseudo-labels; D and D′ are the outputs of the raw description header. $S\in R^{H\mathit{/}8\times W\mathit{/}8\times H\mathit{/}8\times W\mathit{/}8}$ is the homography estimation matrix. $\mathcal{L}_{p}$ represents the loss of feature points during detection, which measures the difference between the detected outputs and the pseudo-labels. $\mathcal{L}_{d}$ is the loss function of the descriptor, and λ is a weight parameter used to balance $\mathcal{L}_{p}$ and $\mathcal{L}_{d}$.
As shown in Fig. 1(b), multiple specular reflection areas (white spot areas) generally exist in an endoscopic image. Most existing feature detection methods tend to detect many feature points around contour areas or specular reflection areas [Reference Barbed, Chadebecq and Morlana56]. For VSLAM, the more evenly the feature points are distributed in the image, the more accurately feature matching can estimate the spatial pose relation. To make the feature points extracted by SuperPoint evenly distributed in the region of interest, a specularity loss ($\mathcal{L}_{s}$), which reweights all extracted key points in specular regions, is proposed. The revised loss function is defined as
$$\mathcal{L}_{total}=\mathcal{L}_{p}(X,Y)+\mathcal{L}_{p}\left(X',Y'\right)+\lambda \mathcal{L}_{d}\left(D,D',S\right)+\lambda _{s}\left(\mathcal{L}_{s}(X,I)+\mathcal{L}_{s}\left(X',I'\right)\right)$$
in which $\lambda _{s}$ is a scale weighting factor determined by the characteristics of the dataset and the contribution of each objective function to the model performance. In this work, $\lambda _{s}=100$. The $\mathcal{L}_{s}$ is defined as
$$\mathcal{L}_{s}(X,I)=-\frac{1}{HW}\sum _{h=1}^{H}\sum _{w=1}^{W}m(I)_{hw}\log \left(1-\mathrm{d2s}\left(\mathrm{softm}_{d}(X)\right)_{hw}+\varepsilon \right)$$
where $\mathrm{softm}_{d}(\cdot )$ is the channel-wise SoftMax of the detection output with the dustbin channel removed, and $\mathrm{d2s}(\cdot )$ is the depth-to-space reshaping to full image resolution [Reference Detone, Malisiewicz and Rabinovich31, Reference Barbed, Chadebecq and Morlana56]. The $\varepsilon$ is a constant with $\varepsilon =10^{-10}$. The $m(I)_{hw}$ is a weighting mask, where $m(I)_{hw}\gt 0$ for pixels near a specularity and 0 otherwise. The value of $\mathcal{L}_{s}$ is close to zero when there is no key point at that location.
The default thresholds of the parameters of ORB-SLAM2 and SPSVO are determined based on [Reference Campos, Elvira and Rodriguez30, Reference Xu, Hao and Wang51], as shown in Table 1. The algorithms were first run with the default thresholds, which were then calibrated against the ground truth results by increasing or decreasing the thresholds; in this work, variations of ±40% with respect to the default thresholds were examined. Figure 5 compares the number of keypoints matched per keyframe with a feature point threshold of 1600; the proposed SPSVO outperforms ORB-SLAM2 in terms of matched feature points (approximately 700 versus 500 points).
A comparison of the distribution of feature points extracted by the SPSVO algorithm and ORB-SLAM2 on the “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38] is shown in Fig. 6; the image resolution is $480\times 640$ . Based on the results of Table 1, the upper threshold of feature point extraction is set to 1600 to ensure that both algorithms have the potential to achieve their best performance in most scenarios. It can be seen that SPSVO extracts more effective features than ORB-SLAM2. A large number of evenly distributed feature points provides more scene information, thus improving the accuracy of camera localization. Furthermore, the feature points extracted by SPSVO are evenly distributed and located in textured areas, which benefits subsequent VSLAM tasks such as keypoint matching, camera localization, map construction, and path planning.
3.4. Feature matching
The SuperGlue algorithm is commonly applied to simultaneously address feature matching and outlier filtering for real-time pose estimation in indoor and outdoor environments [Reference Jang, Yoon and Kim60–Reference Su and Yu62]. In this work, SuperGlue is trained on ground truth trajectories in the abdominal cavity so that it adapts to the intra-abdominal environment. A bi-directional brute-force matching algorithm is utilized to establish correspondences between features in consecutive frames of an image sequence. Additionally, SPSVO uses the Random Sample Consensus (RANSAC) algorithm to remove false feature point matches for robust geometric estimation (see Algorithm 1 and the sketch below). Figure 7 shows the results of the proposed algorithm for stereo matching, with successfully matched feature pairs connected by lines. It can be seen that SPSVO accurately matches a large number of key points. Moreover, SPSVO shows good consistency in feature matching between frames: a feature point can be consistently matched across multiple frames. Such consistent matching indicates that the proposed SPSVO can effectively estimate camera position.
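A minimal sketch of the bi-directional matching and RANSAC filtering stages is given below, using generic OpenCV primitives as stand-ins for the learned SuperGlue matcher (the descriptor layout, one row per keypoint, is an assumption):

```cpp
#include <opencv2/opencv.hpp>

// Bi-directional brute-force matching plus RANSAC outlier rejection.
std::vector<cv::DMatch> matchAndFilter(const cv::Mat& desc1, const cv::Mat& desc2,
                                       const std::vector<cv::KeyPoint>& kp1,
                                       const std::vector<cv::KeyPoint>& kp2)
{
    // crossCheck=true keeps only mutually consistent (bi-directional) matches.
    cv::BFMatcher matcher(cv::NORM_L2, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(desc1, desc2, matches);

    std::vector<cv::Point2f> p1, p2;
    for (const auto& m : matches) {
        p1.push_back(kp1[m.queryIdx].pt);
        p2.push_back(kp2[m.trainIdx].pt);
    }
    if (p1.size() < 8) return matches;  // too few points for RANSAC

    // RANSAC on the fundamental matrix removes geometrically
    // inconsistent pairs.
    std::vector<uchar> inlier;
    cv::findFundamentalMat(p1, p2, cv::FM_RANSAC, 1.0, 0.999, inlier);

    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < matches.size(); ++i)
        if (inlier[i]) good.push_back(matches[i]);
    return good;
}
```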
3.5. Keyframe selection
Keyframe selection plays an important role in reducing computational cost, decreasing redundant information, and improving the accuracy of VSLAM [Reference Klein and Murray22, Reference Mur-Artal, Montiel J.M. and Tardos25, Reference Strasdat, Montiel and Davison63]. The general criteria for keyframe selection are that (1) the distribution of keyframes should be neither too dense nor too sparse and (2) the keyframes should generate sufficient local map points [Reference Liu, Li and Ishii54]. Unlike other SLAM or VO systems, SPSVO integrates a learning-based matching method that can effectively match frames with large differences in baseline length. Therefore, during feature matching, SPSVO only matches the current frame with keyframes, which reduces tracking error. The keyframe selection criteria should take into account the movement between frames, information gain, tracking stability, and previous experience. Based on the keyframe selection principles of [Reference Campos, Elvira and Rodriguez30, Reference Xu, Hao and Wang51], the keyframe selection criteria corresponding to the matching process of SPSVO are defined as follows (a minimal implementation sketch is given after the list):
• The distance between the current frame and the nearest keyframe ( $L$ ) satisfies the condition of $L\gt D_{f}$ ;
• The angle between the current frame and the nearest keyframe ( $\theta$ ) satisfies the condition of $\theta \gt \theta _{f}$ ;
• The number of map points ( $N_{A}$ ) tracked by the current frame satisfies the condition $N_{1}^{u}\lt N_{A}\lt N_{2}^{l}$ ;
• The number of the map points ( $N_{B}$ ) tracked by the current frame satisfies the condition $N_{B}\lt N_{3}$ ;
• The number of frames since the last keyframe inserted ( $N_{C}$ ) satisfies the condition of $N_{C}\gt N_{4}$ .
in which $D_{f},\theta _{f},N_{1}^{u},N_{2}^{l},N_{3},N_{4}$ are preset thresholds. A frame is selected as a keyframe if it meets any of the above conditions, see Algorithm 2. The proposed keyframe selection criteria consider both image quality and keypoint quality, which helps to filter useless or incorrect information and to avoid adverse impacts on endoscope localization and scene mapping.
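The decision logic can be summarized by the following sketch; all threshold values are hypothetical placeholders for the presets of Table 1, not the paper’s actual settings:

```cpp
// Keyframe decision of Section 3.5 with placeholder thresholds.
struct KeyframeThresholds {
    double Df = 0.05;        // min translation to nearest keyframe (D_f)
    double thetaF = 0.1;     // min rotation to nearest keyframe (theta_f)
    int N1u = 50, N2l = 200; // band on tracked map points N_A
    int N3 = 40;             // lower bound on tracked map points N_B
    int N4 = 20;             // max frames since the last keyframe N_C
};

bool isKeyframe(double L, double theta, int NA, int NB, int NC,
                const KeyframeThresholds& t)
{
    if (L > t.Df) return true;                  // moved far enough
    if (theta > t.thetaF) return true;          // rotated far enough
    if (NA > t.N1u && NA < t.N2l) return true;  // N_1^u < N_A < N_2^l
    if (NB < t.N3) return true;                 // tracking is weakening
    if (NC > t.N4) return true;                 // too long since last keyframe
    return false;                               // otherwise, a regular frame
}
```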
3.6. Pose graph optimization
The Levenberg-Marquardt (LM) algorithm is used as the optimization solver in the backend of the proposed SPSVO to construct the covisibility graph. In each optimization iteration, once the LM optimization converges, both the inputs and outputs of the optimization process are used as inputs to the loss function for decoding network training. The optimization variables are keyframes and map points, and the corresponding constraints are the monocular and stereo constraints.
3.6.1 The monocular constraint
If a 3D map point ${}^{w}P_{i}$ is observed by the left camera, the reprojection error $e_{k,i}$ of the i-th point in the k-th frame is defined as
$$e_{k,i}=\hat{p}_{i}-\pi _{i}\left(R_{k}\,{}^{w}P_{i}+t_{k}\right)$$
where ${}^{w}P_{i}$ is the i-th point observed by frame k, w is the world coordinate system, and c is the camera coordinate system. $R_{k}$ and $t_{k}$ are the rotation and translation of the camera. $\hat{p}_{i}=(\hat{u}_{i},\hat{v}_{i})$ is the observation of the map point on the frame, and $\pi _{i}(\cdot )$ is the camera projection model giving the coordinates of the 3D map point projected onto the left-eye image, expressed as
$$\pi _{i}\left(\left[\begin{array}{c}x_{i}\\ y_{i}\\ z_{i}\end{array}\right]\right)=\left[\begin{array}{c}f_{x}\dfrac{x_{i}}{z_{i}}+c_{x}\\ f_{y}\dfrac{y_{i}}{z_{i}}+c_{y}\end{array}\right]$$
where $[\begin{array}{lll} x_{i} & y_{i} & z_{i} \end{array}]^{\mathrm{T}}$ are the coordinates of point ${}^{w}P_{i}$ expressed in the camera coordinate system, and $f_{x},f_{y},c_{x},c_{y}$ are the intrinsic parameters of the camera.
3.6.2 The stereo constraint
If a 3D map point ${}^{w}P_{j}$ is observed by both the left and right cameras at the same time, the reprojection error is defined as
$$e_{k,j}=\hat{p}_{j}-\pi _{j}\left(R_{k}\,{}^{w}P_{j}+t_{k}\right)$$
where $\hat{p}_{j}=(\hat{u}_{j},\hat{v}_{j},\hat{r}_{j})$ is the observation of the map point on the k-th stereo frame, and $\hat{r}_{j}$ is the horizontal coordinate of the observation in the right image. $\pi _{j}(\cdot )$ is the stereo camera projection model representing the 3D map point projection on the stereo image pair, defined as
$$\pi _{j}\left(\left[\begin{array}{c}x_{j}\\ y_{j}\\ z_{j}\end{array}\right]\right)=\left[\begin{array}{c}f_{x}\dfrac{x_{j}}{z_{j}}+c_{x}\\ f_{y}\dfrac{y_{j}}{z_{j}}+c_{y}\\ f_{x}\dfrac{x_{j}-b}{z_{j}}+c_{x}\end{array}\right]$$
where b represents the baseline of the stereo camera, and $[\begin{array}{lll} x_{j} & y_{j} & z_{j} \end{array}]^{\mathrm{T}}$ are the coordinates of point ${}^{w}P_{j}$ expressed in the left camera coordinate system.
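The two projection models can be written compactly as follows; this is a sketch using Eigen, in which the map point is assumed to have already been transformed into the camera frame, i.e., $P_{c}=R_{k}\,{}^{w}P+t_{k}$:

```cpp
#include <Eigen/Dense>

// Monocular pinhole projection: camera-frame point -> left image pixel.
Eigen::Vector2d projectMono(const Eigen::Vector3d& Pc,
                            double fx, double fy, double cx, double cy)
{
    return { fx * Pc.x() / Pc.z() + cx,
             fy * Pc.y() / Pc.z() + cy };
}

// Stereo projection: camera-frame point -> (u_left, v_left, u_right).
Eigen::Vector3d projectStereo(const Eigen::Vector3d& Pc, double b,
                              double fx, double fy, double cx, double cy)
{
    const double u  = fx * Pc.x() / Pc.z() + cx;        // left image u
    const double v  = fy * Pc.y() / Pc.z() + cy;        // left image v
    const double ur = fx * (Pc.x() - b) / Pc.z() + cx;  // right image u
    return { u, v, ur };
}
```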
3.6.3 Graph optimization
Assuming that the distribution of key points satisfies a Gaussian distribution [Reference Szeliski64], the final cost function of the proposed SPSVO can be defined as
$$C=\sum _{k}\sum _{i}\rho _{k,i}\left((e_{k,i})^{\mathrm{T}}(\Sigma _{k,i})^{-1}e_{k,i}\right)+\sum _{k}\sum _{j}\rho _{k,j}\left((e_{k,j})^{\mathrm{T}}(\Sigma _{k,j})^{-1}e_{k,j}\right)$$
where $\rho _{k,i}$ and $\rho _{k,j}$ are robust kernel functions that further reduce the impact of any possible outliers; $(e_{k,i})^{\mathrm{T}}$ and $(e_{k,j})^{\mathrm{T}}$ are the transposes of the error vectors $e_{k,i}$ and $e_{k,j}$, respectively; and $\Sigma _{k,i}$ and $\Sigma _{k,j}$ are covariance matrices, with $(\Sigma _{k,i})^{-1}$ and $(\Sigma _{k,j})^{-1}$ their inverses.
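As an illustration, a single robustified term of this cost can be evaluated as follows, assuming a Huber kernel for $\rho$ (the specific robust kernel is not stated in the text) and an assumed tuning constant delta:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Huber-robustified Mahalanobis cost of one reprojection residual e with
// covariance Sigma, i.e., rho(e^T Sigma^{-1} e).
double robustTerm(const Eigen::VectorXd& e, const Eigen::MatrixXd& Sigma,
                  double delta = 1.0)
{
    // Solve Sigma * w = e instead of forming Sigma^{-1} explicitly.
    const Eigen::VectorXd w = Sigma.ldlt().solve(e);
    const double s2 = e.dot(w);                         // e^T Sigma^{-1} e
    if (s2 <= delta * delta) return s2;                 // quadratic region
    return 2.0 * delta * std::sqrt(s2) - delta * delta; // linear region
}
```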
4. Experimental validation of the proposed SPSVO method
In this section, the performance of the proposed SPSVO is evaluated on the “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38] and compared with ORB-SLAM2. SPSVO is a stereo VO system without a loop closure detection module. Furthermore, the colon_reconstruction_dataset does not involve scene re-identification or map closure situations, so the impact of a loop closure detection module on the algorithm comparison is very limited. Therefore, to ensure a fair and accurate comparison, loop closure detection was turned off in ORB-SLAM2. The frame threshold is defined as the number of times a map point is observed by a keyframe for the monocular and stereo constraints in graph optimization.
4.1. Dataset
The “colon_reconstruction_dataset” contains 16 stereo colonoscope sequences (named Case 0 to Case 15, with a total of 17,362 frames) with corresponding depth and ego-motion ground truth.
4.2. Implementation details
The proposed SPSVO algorithm runs in a C++ environment on a laptop with an i7-10750H CPU and an NVIDIA GTX 1650 Ti GPU. SPSVO uses the Nvidia TensorRT Toolkit to accelerate the feature extraction and matching networks and uses the LM algorithm of the g2o library for nonlinear least-squares optimization. OpenCV and the Ceres library are applied to implement computer vision functions and statistical estimation, respectively.
4.3. Results on the colon reconstruction dataset
The performance of ORB-SLAM2 and SPSVO was tested on the “colon_reconstruction_dataset”; however, ORB-SLAM2 could only successfully obtain the endoscope trajectory of “Case 0.” Results are shown in Figs. 8 and 9 and Table 2. The “Case 0” sequence contains 4751 frames of images for each of the left and right cameras and has slower camera motion and smaller translation and rotation amplitudes compared to “Case 1” to “Case 10.” It can be observed from Fig. 8 that ORB-SLAM2 has a larger drift error than the proposed SPSVO method.
Comparisons between the estimated and true trajectories of the endoscope are shown in Fig. 10. Colored solid lines represent the estimated trajectories of SPSVO; gray dotted lines represent the real motion trajectories of the endoscope corresponding to the “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38]. Statistics for SPSVO are shown in Table 3. The average measurement error of SPSVO for the 10 cases is between 0.058 and 0.740 mm, with RMSE between 0.278 and 0.690 mm, indicating that the proposed SPSVO method can accurately track the true trajectory of the endoscope. Figure 11 shows the variation of the absolute pose error between estimated and true trajectories with respect to time. It can be observed that the proposed SPSVO method has high accuracy and reliability for endoscope trajectory estimation. ORB-SLAM2 cannot extract enough feature points to initialize the viscera scene map, resulting in a loss of feature tracking and failure to construct endoscopic trajectories. Therefore, quantitative results for ORB-SLAM2 on Case 1 to Case 10 are not presented.
4.4. Computational cost
The computational time per frame of SPSVO and ORB-SLAM2 on the Case 0 sequence of the “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38] is shown in Table 4. For a fair comparison, 1000 points were extracted in this experiment, and the loop closure, relocalization, and visualization parts were disabled. Keypoint detection takes 25 ms per stereo image. Stereo matching and feature tracking between frames require 29 ms. Pose estimation is fast, costing only 8 ms per image. Therefore, SPSVO can operate at 14 fps; this speed can be further boosted by a parallel implementation. It can be observed that the proposed SPSVO method has a faster processing speed than ORB-SLAM2.
5. Conclusions
An important goal of VSLAM for medical applications is accurate estimation of endoscopic pose to better assist surgeons in locating and diagnosing lesions. Extreme illumination variations and the weak texture of endoscopy images make accurate estimation of camera motion and scene reconstruction difficult. This paper proposed a novel self-supervised Surgical Perception Stereo Visual Odometer (SPSVO) framework for real-time endoscopic pose estimation and viscera surface map construction. The proposed SPSVO method reduces the adverse effects of local illumination variability and specular reflections by using a self-supervised learning (SSL) approach for feature extraction and matching, as well as image illumination pre-processing. In the proposed SPSVO, keyframe selection strategies and the Nvidia TensorRT Toolkit were applied to accelerate computation for real-time lesion localization and surgical navigation. Comparisons between the estimated and ground truth trajectories of the endoscope were obtained from the colon_reconstruction_dataset. From the experimental tests, the following conclusions are drawn:
1. The proposed SPSVO system achieves superior performance in variable illumination environments and can track key points in the human intestinal tract and intraperitoneal cavity. Simulation results show that SPSVO has an average tracking error of 0.058–0.704 mm with respect to the true camera trajectories in the given dataset. Comparison with existing methods also indicates that the proposed method outperforms ORB-SLAM2.
2. The proposed SPSVO system combines advantages of traditional optimization and learning-based methods and demonstrates an operating speed of 14 frames per second on a normal computer. This is adequate for real-time navigation in surgical procedures.
3. The proposed method can effectively eliminate effects of irregular illumination and specular reflections and can accurately estimate the position of the endoscope.
Acknowledgments
We would first like to thank Dr Qimin Li and Dr Yang Luo, whose expertise was invaluable in formulating the research questions and methodology. Their insightful feedback pushed us to sharpen our thinking and brought the work to a higher level.
Author contribution
Junjie Zhao: Conceptualization, Methodology, Software. Yang Luo: Data curation, Writing – Original draft preparation. Qimin Li: Supervision. Natalie Baddour: Writing – Reviewing and Editing. Md Sulayman Hossen: Writing- Reviewing and Editing.
Financial support
This research is funded by the Open Fund of Guangdong Provincial Key Laboratory of Precision Gear Digital Manufacturing Equipment Technology Enterprises (Grant No. 2021B1212050012-04), with contributions from Zhongshan MLTOR Numerical Control Technology Co., LTD and South China University of Technology, as well as the Innovation Group Science Fund of Chongqing Natural Science Foundation (No. cstc2019jcyj-cxttX0003).
Competing interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Ethical approval
Not applicable.
Data availability statement
The dataset “colon_reconstruction_dataset” used in this study can be found at https://github.com/zsustc/colon_reconstruction_dataset.