1. Introduction
Tracking odor/gas plumes and searching for the release source with mobile robots is advantageous in various scenarios, for example, searching for toxic gas leaks. Compared with professional firefighters and search animals, robots do not require much training, can keep working for long periods, and are less threatened by dangerous surroundings [Reference Chen and Huang1–Reference Ma, Mao, Tan, Gao, Zhang and Xie3].
Robotic odor source localization utilizes a mobile robot or a team of robots to search for the odor release source. The robots usually integrate an odor concentration sensor and a wind sensor to perceive the environment and carry out some bio-inspired or information-gathering behaviors, including chemotaxis (climbing the odor concentration gradient) [Reference Larsch, Flavell, Liu, Gordus, Albrecht and Bargmann4], anemotaxis (moving upwind if odor plumes are detected) [Reference Chen and Huang5], and Infotaxis [Reference Vergassola, Villermaux and Shraiman6, Reference Chen, Marjovi, Huang and Martinoli7].
Conventional odor source searching behaviors showed their capability in laminar and steady airflows, because the odor concentration is of a smooth gradient and can be relatively easily modeled as a pseudo-Gaussian distribution [Reference Arya8]. However, the odor concentration distribution in turbulent airflows can hardly be calculated analytically. It is hard to follow a smooth concentration gradient to locate the odor source. The robot is very likely to lose the plumes and wander in the searching area.
In recent years, thanks to the development of hardware with higher computational capabilities, some fuzzy logic control methods [Reference Chen and Huang9–Reference Wang, Pang and Li11] and learning-based methods [Reference Wang, Pang and Li12–Reference Hu, Song and Chen14] have been proposed for odor source localization in complex and dynamic environments. Fuzzy logic control fuzzifies the measured signals with preset rules and is flexible for various application scenarios. However, the performance of a fuzzy controller depends heavily on its rules, which are hard to tune manually to an optimum. An Adaptive Neural Fuzzy Inference System can be utilized to tune the behavior rules by learning from preset searching strategies [Reference Wang and Pang10]. However, the performance of the preset searching strategies cannot be guaranteed when the robot works in an unknown environment where prior knowledge of the odor concentration and wind field is not available.
Existing reinforcement learning-based methods use neural networks to model an action-value function and let the robot work under an optimal action policy. Compared with fuzzy logic rules that are set manually or learned from preset behaviors, models well trained through deep reinforcement learning adapt better to unknown and complex environments. However, the interpretability of deep reinforcement learning remains a challenging problem. Learning models are essentially black boxes because it is difficult to explain the knowledge acquired during the training process. The interpretability of learning models has attracted rapidly growing research interest in the past few years. Some researchers adopted elaborately designed methods to explain existing learning algorithms. Others tried to explain linguistically the knowledge a model has acquired during training. Takagi–Sugeno–Kang (TSK) fuzzy systems are a good alternative to neural networks for this purpose. Compared with neural networks [Reference Chen, Zhang, Leng, Chen and Fu15–Reference Fang, Long, Sun, Liu, Zhang and Fang18], fuzzy rules are closer to the human decision-making methodology [Reference Li, Cao and Ding19, Reference Yang, Jiang, Na, Li, Cheng and Su20] and can even be initialized with human prior knowledge to make the training process faster [Reference Chen, Leng and Fu21]. Human experts can therefore intervene proactively in the design and tuning process of TSK fuzzy systems, which is an advantage of TSK fuzzy systems over neural network models. Thanks to this flexibility, fuzzy system-based controllers have been widely applied in robotics [Reference Salehi, Pishkenari and Zohoor22–Reference Veysi, Soltanpour and Khooban26].
To the best of our knowledge, no previous work has used a TSK fuzzy system to model an action policy for robotic odor plume tracking and tuned it with reinforcement learning. Since the TSK fuzzy system may suffer from weak generalization ability, poor training performance on big data, and a low convergence rate, some optimization methods [Reference Wu, Yuan, Huang and Tan27], including Layer Normalization (LN) and DropRule, are applied in the structure of the fuzzy system and the training process. Moreover, the outputs of the proposed fuzzy system include multiple continuous variables, which differs from the discrete action spaces widely used in reinforcement learning [Reference Chen, Fu and Huang13]. This feature can make the fuzzy system-based controller more adaptive and speed up the odor source searching process in real-world scenarios.
The proposed reinforcement learning fuzzy system is also promising to be applied to other robotic problems, for example, human-robot interactions [Reference Su, Qi, Schmirander, Ovur, Cai and Xiong28–Reference Fang, Ding, Sun, Shan, Wang, Wang and Zhang31], environment adaptation of robots [Reference Zhang, Luo, Xiao, Zhang, Liu, Zhu, Lu, Rong, de Silva and Fu32, Reference Chen, Chen, Wang, Yang, Ma, Leng and Fu33], calibration [Reference Guo, Song, Tang, Zhou and Jiang34–Reference Guo, Tang, Zhou, Song, Jiang, Xie and Ye36], path planning and control [Reference Cao, Huang, Xiong, Wu, Zhang, Li and Hasegawa37–Reference Fang, Sun, Wu, Liu, Wang, Huang, Huang, Liu and Wen40]. The fuzzy system-based controllers can be initialized with expert knowledge, trained with reinforcement learning in simulated scenarios, and finally adapted to the real world.
The contributions of this paper are threefold:
(1) A multi-continuous-output TSK fuzzy system is designed and integrated into the framework of reinforcement learning to model the action policy of the robot. The structure of the fuzzy system and the training process are optimized to achieve faster training and robust performance.
(2) The proposed fuzzy system is applied for odor plume tracking control in dynamic airflow. The influence of the reward settings on the trained Multi-Continuous-Output TSK (MCOTSK)-based controller is investigated.
(3) The performance of the proposed odor plume tracking method is compared with a benchmark method and the results are analyzed to investigate how reinforcement learning can promote the searching performance.
The remainder of the paper is organized as follows: Section 2 presents the structure of the proposed multi-continuous-output fuzzy inference system and how reinforcement learning is utilized to tune the system. Section 3 presents the filament-based dynamic plumes and how the proposed system is trained and applied in the odor plume tracking task. Section 4 compares the proposed MCOTSK-based controller trained with two different reward settings in a simulated large-dimension scenario with dynamic plumes and analyzes the results. The proposed method is also compared with a benchmark method. Section 5 validates the trained controller on a real robot in odor plume tracking tasks. Section 6 concludes the paper.
2. Methods
2.1. General TSK fuzzy system structure
TSK fuzzy systems are widely used machine learning models for regression problems [Reference Wu, Yuan, Huang and Tan27, Reference Nguyen, Taniguchi, Eciolaza, Campos, Palhares and Sugeno41]. They map the relationships between inputs and outputs through fuzzy logic theory. Setting the parameters of the system does not require prior expert knowledge; instead, learning algorithms such as evolutionary algorithms [Reference Wu and Tan42] and gradient descent [Reference Wang and Mendel43] can be applied to tune them. Figure 1 shows a five-layer TSK fuzzy system architecture.
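Assume the input vector of a TSK fuzzy system can be expressed as $\mathbf{x}=\left (x_{1}, \ldots, x_{M}\right )^{T} \in \mathbb{R}^{M \times 1}$ . The input is fuzzified by $R$ rules:
$$\text{Rule } r:\ \text{IF } x_{1} \text{ is } A_{r,1} \text{ and } \cdots \text{ and } x_{M} \text{ is } A_{r,M},\ \text{THEN } y_r(\mathbf{x}) = b_{r,0} + \sum_{m=1}^{M} b_{r,m}\, x_{m}$$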
where $A_{r,m} (r = 1,\ldots,R;\,m = 1,\ldots,M)$ are fuzzy sets, $y_r(\mathbf{x})$ is the output of Rule $r$ , and $b_{r,0}$ and $b_{r,m}$ are the weight parameters.
The first layer of the TSK fuzzy system is the fuzzification layer, of which the output can be expressed as:
where $\mu _{A_{r,m}}$ are the membership functions (MFs) of the fuzzy sets $A_{r,m} (r = 1,\ldots,R;\,m = 1,\ldots,M)$ and are set to be Gaussian MFs, because they are widely used and their derivatives are easier to compute. $a_{r,m}$ and $c_{r,m}$ are parameters related to the shape of the Gaussian MFs.
In the second layer, all nodes are fixed and marked as $\pi$ . To draw conclusions from the set of rules defined for a TSK fuzzy system, this layer calculates the strength of the premise of each rule, referred to as the “firing strength” of the premise, from the input values $\left (x_{1}, \ldots, x_{M}\right )^{T}$ and their membership grades $\mu _{A_{r,m}}(x_m)$ in the fuzzy sets $ A_{r,m} (r = 1,\ldots,R;\,m = 1,\ldots,M)$ . The output of this layer for Rule $r$ is the product of its inputs, as given in Eq. (3), where the superscript denotes the layer index rather than a power:
$$\theta _r^2 = \prod _{m=1}^{M} \mu _{A_{r,m}}(x_m)$$
If the membership grade of any input of a rule is close to zero, which means that the rule is hardly satisfied, the product of all its membership grades is also close to zero. The rule is then almost inactive in the following layers and contributes little to the decision.
The third layer is the normalization layer. Its outputs are the normalized firing strengths, which represent the contribution of Rule $r$ to the sum of the firing strengths of all rules:
$$\theta _r^3 = \frac{\theta _r^2}{\sum _{i=1}^{R} \theta _i^2}$$
The fourth layer is an adaptive-parameter layer. The output of this layer is the product of the normalized firing strength $\theta _r^3$ and $y_r(\mathbf{x})$ :
$$\theta _r^4 = \theta _r^3\, y_r(\mathbf{x})$$
The output of the last layer is the sum of all the input signals:
$$y(\mathbf{x}) = \sum _{r=1}^{R} \theta _r^4$$
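As an illustration of the five layers described above, the following is a minimal pure-Python sketch of a single-output TSK forward pass. It is not the authors' implementation. The Gaussian membership function form `exp(-(x - c)^2 / (2 a^2))`, with `c` as the center and `a` as the width, is an assumption, since the text only states that $a_{r,m}$ and $c_{r,m}$ shape the Gaussian MFs; the function and argument names are hypothetical.

```python
import math


def tsk_forward(x, centers, widths, b0, b):
    """Single-output TSK forward pass for one input vector x of length M.

    centers[r][m], widths[r][m]: assumed centre c_{r,m} and width a_{r,m}
        of the Gaussian MF of fuzzy set A_{r,m} (parameterisation assumed).
    b0[r], b[r][m]: consequent weights b_{r,0} and b_{r,m} of Rule r.
    Returns (output, normalised firing strengths, layer-4 outputs).
    """
    num_rules, num_inputs = len(centers), len(x)

    # Layer 1: membership grade of each input in each fuzzy set.
    mu = [
        [
            math.exp(-((x[m] - centers[r][m]) ** 2) / (2.0 * widths[r][m] ** 2))
            for m in range(num_inputs)
        ]
        for r in range(num_rules)
    ]

    # Layer 2: firing strength of a rule is the product of its grades.
    firing = [math.prod(mu[r]) for r in range(num_rules)]

    # Layer 3: normalise so that the strengths sum to one.
    # This assumes at least one rule has a non-zero firing strength.
    total = sum(firing)
    normalised = [f / total for f in firing]

    # Rule consequents y_r(x) = b_{r,0} + sum_m b_{r,m} x_m.
    y_rule = [
        b0[r] + sum(b[r][m] * x[m] for m in range(num_inputs))
        for r in range(num_rules)
    ]

    # Layer 4: weight each consequent by its normalised firing strength.
    theta4 = [normalised[r] * y_rule[r] for r in range(num_rules)]

    # Layer 5: sum over rules.
    return sum(theta4), normalised, theta4
```

With a single rule, the normalized firing strength is one and the output reduces to the linear consequent of that rule, which is a convenient sanity check.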
2.2. Adapt the TSK fuzzy system for multiple continuous outputs and reinforcement learning
As mentioned in the introduction, a large action space of the robot is required in real-world odor source searching scenarios. The fuzzy inference system (FIS) is expected to have multiple outputs and generate continuous control commands. A greater challenge is that there is usually little expert knowledge that can be utilized to tune the FIS. In this case, the robot is expected to conduct a “trial-and-error” process and dynamically tune the TSK fuzzy system.
To address these issues, the generic structure of the TSK fuzzy system is adapted into the “MCOTSK fuzzy system” in this paper. The first four layers of MCOTSK are the same as in the generic structure. The fifth layer consists of $P$ nodes, of which the outputs can be expressed as:
$$y_p = \boldsymbol{\Omega }_p^{T}\, \boldsymbol{\Theta }_4$$
where $\boldsymbol{\Theta }_4 = \left (\theta _1^4, \ldots, \theta _R^4\right )^{T}$ , and $\boldsymbol{\Omega }_p= \left (\omega _{p,1}, \ldots, \omega _{p,R}\right )^{T}$ , $(p = 1,\ldots,P)$ . $P$ is the total number of the outputs. The structure of MCOTSK and the scheme of tuning the MCOTSK-based controller using a typical reinforcement learning algorithm, Deep Deterministic Policy Gradient (DDPG), are presented in Fig. 2.
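The multi-output fifth layer described above can be sketched as a set of $P$ weighted sums over the layer-4 outputs. This is a minimal illustration under the assumption that each output node computes the inner product of $\boldsymbol{\Omega }_p$ and $\boldsymbol{\Theta }_4$; the function name `mcotsk_outputs` is hypothetical.

```python
def mcotsk_outputs(theta4, omega):
    """Compute the P outputs of the fifth layer.

    theta4: list of R layer-4 outputs (Theta_4).
    omega: list of P weight vectors, omega[p] being Omega_p of length R.
    Each output is assumed to be the inner product of Omega_p and Theta_4.
    """
    return [
        sum(weight * value for weight, value in zip(omega_p, theta4))
        for omega_p in omega
    ]
```

If every entry of a weight vector is one, the corresponding output equals the plain sum used in the generic single-output structure, so the generic TSK system is a special case of this layer.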
In the framework of the DDPG algorithm, an “Actor” is required to map the state $s$ of the environment to the action $a$ of the robot. In the odor plume tracking task, the state of the environment can be the measured wind direction, wind velocity, odor concentration, etc. The action means control commands for the robot, which can be the turning angle, the movement length, the movement velocity, etc. In this paper, the MCOTSK serves as the Actor model, the inputs of which are the measured states and the outputs are the parameters of the odor plume tracking controller. In order to optimize the MCOTSK-based Actor, the adaptive parameters $a_{r,m}$ , $b_{r,m}$ , $b_{r,0}$ , $c_{r,m}$ and $\omega _{p,r}$ $(r = 1,\ldots,R;\,m = 1,\ldots,M;\,p = 1,\ldots,P)$ need to be tuned. The DDPG algorithm also involves a “Critic” model as the action-value function $q(s,a)$ , which estimates the expected cumulative future reward of the current action and state. In addition to the Actor model and the Critic model, a “Target actor” model and a “Target critic” model are initialized identically to the Actor model and the Critic model, respectively.
At each time step $t$ during the training process, an action command $a_t$ is calculated from the input state $s_t$ with the proposed MCOTSK, and the robot takes a corresponding movement. After the robot interacts with the environment and takes another observation, an updated state $s_{t+1}$ is obtained and serves as the input of the MCOTSK at the next step.
Meanwhile, $s_{t+1}$ is sent to the Target actor model to calculate the action command $a_{\text{targ},t+1}$ . The reward $r_t$ the robot gets at time step $t$ and the action value $q_{\text{targ}}$ calculated with the Target critic model are used to calculate the target action value $r_t+\gamma q_{\text{targ}}(s_{t+1},a_{\text{targ},t+1})$ . The Temporal-Difference error between the action value $q(s_t,a_t)$ and the target action value is used to adjust the Critic model by minimizing the loss $L(\phi, \mathcal{D})$ with stochastic gradient descent:
where $\phi$ is the parameters of the Critic model, and $\mathcal{D}$ is the replay buffer storing previous experience $\left (s_{t}, a_{t}, r_{t}, s_{t+1}\right )$ . The Actor model is optimized by maximizing the action value $\underset{(s,a)\sim \mathcal{D}}{\textrm{E}}[q(s,a)]$ .
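The Critic update described above can be illustrated with a small sketch that computes the target action value and a batch loss. The mean squared form of the loss and the default discount factor `gamma=0.99` are assumptions, since neither is stated in the text; the callables `q`, `q_targ`, and `actor_targ` are hypothetical stand-ins for the Critic, Target critic, and Target actor.

```python
def td_target(reward, q_targ_next, gamma):
    """Target action value r_t + gamma * q_targ(s_{t+1}, a_targ_{t+1})."""
    return reward + gamma * q_targ_next


def critic_loss(batch, q, q_targ, actor_targ, gamma=0.99):
    """Mean squared Temporal-Difference error over a batch (assumed form).

    batch: list of experience tuples (s, a, r, s_next).
    q, q_targ: callables (state, action) -> action value.
    actor_targ: callable state -> action.
    """
    squared_errors = []
    for s, a, r, s_next in batch:
        a_next = actor_targ(s_next)  # action proposed by the Target actor
        target = td_target(r, q_targ(s_next, a_next), gamma)
        squared_errors.append((q(s, a) - target) ** 2)
    return sum(squared_errors) / len(squared_errors)
```

In a PyTorch implementation the same quantity would typically be computed on tensors, with the target treated as a constant so that gradients flow only through the Critic.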
The parameters $\phi _{\text{targ}}$ of the Target actor and critic models are updated through a soft updating policy at each training step to make the training process more stable:
where $\rho$ is set as $0.9$ in this paper.
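A soft update of this kind is commonly written as a Polyak average of the target and online parameters. The sketch below assumes the form `target = rho * target + (1 - rho) * online`, which is consistent with $\rho = 0.9$ but is not confirmed by the text; parameters are represented as flat lists of floats for simplicity.

```python
def soft_update(target_params, online_params, rho=0.9):
    """Return new target parameters after one soft update.

    Assumed Polyak form: each target parameter keeps a fraction rho of its
    old value and takes a fraction (1 - rho) of the online parameter.
    """
    return [
        rho * target + (1.0 - rho) * online
        for target, online in zip(target_params, online_params)
    ]
```

Under this assumed form, a larger $\rho$ makes the target models change more slowly, which is what stabilizes the target action value.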
The DropRule technique [Reference Wu, Yuan, Huang and Tan27] is applied in the training process of the MCOTSK-based actor model to reduce overfitting and increase generalization. DropRule randomly drops some fuzzy rules during the training process; that is, at each iteration of training, the firing strength of a fuzzy rule is set to zero with probability $P\in (0,1)$ and remains unchanged with probability $1-P$ . By randomly discarding some fuzzy rules, each rule is forced to work robustly with a randomly remaining subset of rules, and in this way, each rule maximizes its own modeling capability instead of relying on other rules. Besides, LN is used to normalize the firing strength of the rules. Similar to the LN layer in the Transformer, the LN layer added in the MCOTSK model can alleviate the vanishing-gradient problem and improve the performance [Reference Cui44].
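The DropRule step described above can be sketched as a random mask applied to the firing strengths before normalization. This is a minimal illustration rather than the cited implementation; keeping at least one rule when all would be dropped is an added safeguard (an assumption) so that the normalization stays defined, and the function names are hypothetical.

```python
import random


def drop_rule(firing, drop_prob, rng):
    """Zero each firing strength with probability drop_prob during training.

    firing: list of firing strengths, one per rule.
    rng: a random.Random instance, passed in for reproducibility.
    """
    kept = [0.0 if rng.random() < drop_prob else f for f in firing]
    if all(k == 0.0 for k in kept):
        # Safeguard (assumption): restore one rule so the sum is non-zero.
        index = rng.randrange(len(firing))
        kept[index] = firing[index]
    return kept


def normalize(firing):
    """Normalise the remaining firing strengths so that they sum to one."""
    total = sum(firing)
    return [f / total for f in firing]
```

At test time no rules are dropped, which corresponds to calling `normalize` directly on the unmasked firing strengths.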
3. Application in odor plume tracking
Since it is hard and time-consuming to generate variable dynamic odor plumes with controllable parameters to train the models in the real world, the proposed models are trained in simulated environments in this paper.
In this section, the filament-based dynamic plume model is introduced and utilized to generate random odor plume tracking tasks. An MCOTSK-based Lévy Taxis plume tracking controller is designed. Two reward settings of the plume tracking process are designed and used in the training process of the MCOTSK model.
3.1. Filament-based dynamic odor plume model
In ref. [Reference Farrell, Murlis, Long, Li and Cardé45], a filament-based odor plume model is presented to simulate plumes dispersed in dynamically changing airflow. The modeled odor concentration distribution is intermittent, and the spatial gradient changes rapidly. This model resembles plumes in real-world outdoor scenarios well. The plumes are modeled as a large number of filaments released from the odor source and dispersed by the airflow (illustrated as the red puffs in Fig. 3). In this paper, this model is used to build a simulated environment, in which the dimension of the searching area is $40$ m $\times \,10\,$ m, and the coordinate system is presented in Fig. 3. The position of the odor releasing source is $(5\,\text{m},\,0\,\text{m})$ . The wind velocity is set as $1\,$ m/s. The wind direction is aligned with the X-axis at $t=0$ and changes at each time step. The noise gain on the wind direction is set to be $5$ to simulate dynamic airflow.
The concentration at location $\mathbf{p}$ contributed by the $i$ -th filament is modeled as:
where $Q$ is the filament release rate, $\mathbf{p}_{i}(t)$ is the spatial extent of the $i$ -th filament at time step $t$ , $R_i(t)$ is the dispersion radius of the filament, and $\zeta$ is the growth rate of the filaments.
3.2. MCOTSK-based Lévy Taxis plume tracking controller
The plume tracking algorithm in this paper is a modified version of Lévy Taxis, which was originally a random walk-based plume finding method proposed by Pasternak et al [Reference Pasternak, Bartumeus and Grasso46]. The Lévy Taxis algorithm was modified as Adaptive Lévy Taxis [Reference Emery, Rahbar, Marjovi and Martinoli47] and Fuzzy Lévy Taxis [Reference Chen and Huang9] to work as plume tracking algorithms. With the Lévy Taxis plume tracking algorithm, as soon as the robot starts its odor plume tracking task from a random position in the searching area, it conducts random walk behaviors: at each step, the robot turns its heading $\theta _{\text{a}}$ to the angle $T_a$ and moves forward for a length $M_l$ . $T_a$ and $M_l$ are determined by the distributions presented in Eq. (13) and Eq. (14):
where $\text{rnd}$ follows a uniform distribution $\text{rnd}\sim u(0,1)$ . The variables $\alpha(0\leq \alpha \leq 1)$ and $\mu (1\lt \mu \leq 3)$ are two key parameters adjusting the shapes of the above distributions. $L_{\min}$ is the minimum step length and is $0.5\,\text{m}$ in the training process of MCOTSK. $\text{bias}$ is a function of the upwind angle $\theta _{\text{u}}$ and the robot heading $\theta _{\text{a}}$ , which keeps the center of $T_a$ ’s distribution as a weighted sum of the upwind direction and the current robot heading [formulated as Eq. (15)] to mimic the bio-inspired anemotaxis behaviors. Figure 4 presents an illustration of the wind direction, the upwind angle $\theta _{\text{u}}$ , and the robot heading $\theta _{\text{a}}$ .
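One random-walk step of a Lévy Taxis controller can be sketched as follows. Because Eqs. (13) to (15) are not reproduced in this text, the expressions used here are assumptions: a wrapped-Cauchy-style turning angle `2 * atan((1 - alpha) / (1 + alpha) * tan(pi * (rnd - 0.5))) + bias`, a power-law step length `L_min * rnd ** (1 / (1 - mu))`, and a bias equal to `beta * theta_u + (1 - beta) * theta_a`. The function name and the naive weighting of angles without wrap-around handling are also illustrative choices.

```python
import math
import random


def levy_taxis_step(theta_a, theta_u, alpha, beta, mu, l_min, rng):
    """Draw one turning angle T_a and movement length M_l (assumed forms).

    theta_a: current robot heading (rad); theta_u: upwind angle (rad).
    alpha in [0, 1] and mu in (1, 3] shape the two distributions.
    beta weights the upwind direction against the current heading.
    rng: a random.Random instance.
    """
    # Centre of the turning-angle distribution (assumed weighted sum).
    bias = beta * theta_u + (1.0 - beta) * theta_a

    # Turning angle: concentrated around bias as alpha approaches 1.
    rnd_turn = rng.random()
    spread = (1.0 - alpha) / (1.0 + alpha)
    turn = 2.0 * math.atan(spread * math.tan(math.pi * (rnd_turn - 0.5))) + bias

    # Step length: heavy-tailed, never shorter than l_min.
    rnd_len = 1.0 - rng.random()  # uniform in (0, 1], avoids 0 ** negative
    length = l_min * rnd_len ** (1.0 / (1.0 - mu))
    return turn, length
```

Under these assumed forms, $\alpha = 1$ removes the randomness from the turning angle so that the robot heads exactly along the bias direction, while smaller $\mu$ produces longer occasional steps.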
In order to determine the key parameters $\alpha$ , $\beta$ , and $\mu$ in the Lévy Taxis controller, Adaptive Lévy Taxis [Reference Emery, Rahbar, Marjovi and Martinoli47] formulated the parameters as fixed functions of the concentration gradient $\nabla C = C_{\text{c}} - C_{\text{p}}$ , where $C_{\text{c}}$ and $C_{\text{p}}$ are the odor concentration values measured in the current step and the previous step, respectively. In order to enhance the flexibility of the plume tracking controller, ref. [Reference Chen and Huang9] set the parameters as the outputs of a Mamdani-type fuzzy system, of which the inputs are $\nabla C$ and $C_{\text{c}}$ . The results in [Reference Chen and Huang9] demonstrated that the Lévy Taxis controller based on the fuzzy system can achieve faster odor source localization in various scenarios, but the rules of the fuzzy system and its membership functions are tuned manually, which requires prior expert knowledge of promising plume tracking behaviors.
In this paper, the MCOTSK model was utilized to determine the parameters $\alpha$ , $\beta$ , and $\mu$ of the Lévy Taxis controller. At each iteration, the robot measures the current odor concentration $C_{\text{c}}$ at its location and calculates the concentration gradient $\nabla C$ . The state vector of the environment $\mathbf{s}=\left (C_{\text{c}}, \nabla C\right )^{T}$ serves as the input of the MCOTSK model. The outputs of MCOTSK go through a Tanh activation layer and are rescaled to their proper range. The rescaled outputs are the determined parameters, and the action of the robot including the turning angle $T_a$ and the movement length $M_l$ can be calculated with the Lévy Taxis controller in Eqs. (13) and (14). The robot will keep moving according to the controller until it finds the odor source or reaches the step limit.
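The Tanh activation and rescaling step described above can be sketched as a linear map from the interval $[-1, 1]$ to each parameter range. The ranges for $\alpha$ and $\mu$ follow the bounds given for the Lévy Taxis controller; the range for $\beta$ is an assumption, as are the names `rescale_tanh`, `PARAM_RANGES`, and `controller_params`.

```python
import math

# Parameter ranges; the beta range is an assumption not stated in the text.
PARAM_RANGES = {"alpha": (0.0, 1.0), "beta": (0.0, 1.0), "mu": (1.0, 3.0)}


def rescale_tanh(raw, low, high):
    """Squash a raw output with tanh and map it linearly into [low, high]."""
    return low + 0.5 * (math.tanh(raw) + 1.0) * (high - low)


def controller_params(raw_outputs):
    """Map the raw fuzzy-system outputs to alpha, beta, and mu, in that order."""
    return {
        name: rescale_tanh(raw, *PARAM_RANGES[name])
        for name, raw in zip(PARAM_RANGES, raw_outputs)
    }
```

Since $\mu$ must be strictly greater than 1, a practical implementation may need a small margin above the lower bound, because the hyperbolic tangent can saturate numerically at $-1$ for large negative inputs.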
3.3. The training process of MCOTSK
In this paper, the MCOTSK is automatically tuned by the DDPG reinforcement learning algorithm. Each trial of the odor plume tracking task is a training episode in the DDPG algorithm. A trial stops when the robot enters the stopping area (represented by the yellow round patch in Fig. 3), hits the boundaries of the simulated area, or the number of searching steps exceeds a limit, which is 60 steps in this paper. At each searching step $t$ , the experience of the robot $\left (s_{t}, a_{t}, r_{t}, s_{t+1}\right )$ is stored in an experience replay buffer $\mathcal{D}$ , of which the size is $5000$ . A batch of experience (batch size = 32) randomly selected from $\mathcal{D}$ is used to tune the Actor and the Critic at each step. An artificial neural network was used to model the Critic while the proposed MCOTSK was used to model the Actor. The number of rules $R$ is 10, and the DropRule rate $P = 0.2$ . The learning rate of the Actor is $0.001$ and that of the Critic is $0.002$ . The robot gets the reward $r_t$ in time step $t$ :
In this paper, two reward settings are used to train the models, respectively. In the first setting, the $r(C_{\text{c}},\theta _{\text{a}},\theta _{\text{u}})$ term is formulated as Eq. (17), where $C_0$ is a constant and set to be 30 in this paper. Since it is designed to let the robot learn the bio-inspired anemotaxis and chemotaxis behaviors, this setting is called the behavior-oriented reward setting in the rest of the paper. In the other reward setting, the $r(C_{\text{c}},\theta _{\text{a}},\theta _{\text{u}})$ term is set to be a constant 0. The robot will learn to reach the odor source in as few steps as possible; therefore, this setting is called the result-oriented reward setting.
The DDPG reinforcement learning algorithm was implemented with PyTorch and run on a computer with an AMD Ryzen 5 2600 six-core processor, an 8 GB memory chip (DDR3 SDRAM), and a GeForce GTX 1050 Ti graphics card. The randomly changing wind field and the filament-based odor plumes were used to train the models. The source code can be found at https://github.com/cxxacxx/MCOTSK. The models were trained for 1000 episodes. During training, we recorded the reward the robot obtained in each episode. Figure 5(a) and (b) present the average reward in every 20 episodes during training with the above two reward settings, respectively. With both reward settings, the average reward started from around −50, and the average reward curves converged to around 5 and −10, respectively, after around 400 episodes. The increasing average rewards show that the robot can learn to track the plumes in dynamic airflow and reach the odor source with the proposed MCOTSK model and the reinforcement learning algorithm.
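The experience replay step of the training procedure in this section can be sketched with a fixed-size buffer. The capacity of 5000 and the batch size of 32 come from the text; discarding the oldest experience once the buffer is full is an assumption, and the class name `ReplayBuffer` is hypothetical rather than taken from the released source code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) tuples with uniform sampling."""

    def __init__(self, capacity=5000, seed=None):
        # A deque with maxlen drops the oldest item when full (assumed policy).
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        """Append one experience tuple."""
        self._data.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        """Draw a batch uniformly without replacement.

        Returns fewer items if the buffer holds less than batch_size.
        """
        size = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), size)

    def __len__(self):
        return len(self._data)
```

Sampling uniformly from past experience breaks the correlation between consecutive searching steps, which is the usual motivation for a replay buffer in DDPG.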
4. Performance evaluation in simulation
In this section, the MCOTSK-based plume tracking controllers trained with two different reward settings are compared with the Fuzzy Lévy Taxis method, which was designed with expert knowledge and proven to be adaptive in various environment settings in ref. [Reference Chen and Huang9]. The test settings are presented, and the results are discussed.
4.1. Simulation settings
To investigate the influence of the reward settings on the MCOTSK-based plume tracking controllers and compare the proposed algorithm with the Fuzzy Lévy Taxis method, plume tracking tests were conducted in a simulated testing environment that is different from the training environment. The robot starts from random positions in the rectangle area shown in Fig. 3 and tracks the odor plumes with the three controllers respectively: (1) the MCOTSK-based controller trained with the behavior-oriented reward setting (MCOTSK-BOR), (2) the MCOTSK-based controller trained with the result-oriented reward setting (MCOTSK-ROR), and (3) Fuzzy Lévy Taxis. For each controller, 200 trials are conducted.
4.2. Evaluation metrics
Three metrics are utilized to evaluate the controllers. The first is the success rate: the proportion of trials in which the robot enters the stopping area near the odor source. The second metric is the number of tracking steps in all successful trials. The third is the distance overhead, which is the distance traveled from the starting position to the stopping position divided by the straight-line distance between them in the successful trials. The latter two metrics reflect the odor source searching efficiency.
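The three metrics defined above can be computed from recorded trajectories as in the following sketch. It assumes that a trajectory is a list of 2-D positions with one entry per searching step plus the starting position, so that the number of steps is the number of positions minus one; the function names are hypothetical.

```python
import math


def path_length(trajectory):
    """Total traveled distance along a list of (x, y) positions."""
    return sum(math.dist(p, q) for p, q in zip(trajectory, trajectory[1:]))


def distance_overhead(trajectory):
    """Traveled distance divided by the straight-line start-to-stop distance."""
    return path_length(trajectory) / math.dist(trajectory[0], trajectory[-1])


def summarize(trials):
    """Aggregate metrics over trials given as (success, trajectory) pairs.

    Step count and distance overhead are averaged over successful trials
    only; at least one successful trial is assumed.
    """
    successful = [traj for success, traj in trials if success]
    return {
        "success_rate": len(successful) / len(trials),
        "mean_steps": sum(len(t) - 1 for t in successful) / len(successful),
        "mean_overhead": sum(distance_overhead(t) for t in successful)
        / len(successful),
    }
```

A distance overhead of 1 corresponds to a perfectly straight path from the starting position to the stopping position, so values close to 1 indicate efficient tracking.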
4.3. Simulation results and discussions
The results of the Monte Carlo tests are shown in Fig. 6. The success rates of the three controllers are very similar, which means that they all show sufficient capability for odor plume tracking. In terms of the efficiency of the searching process, Fig. 6(b) and (c) show that the MCOTSK trained with the behavior-oriented reward setting achieves a lower distance overhead and a lower number of steps than the other two controllers. A possible explanation is that the robot has learned from the elaborately designed reward setting and conducts bio-inspired and well-tuned plume tracking behaviors. However, the design of the behavior-oriented reward setting still requires some expert knowledge. The MCOTSK trained with the result-oriented reward setting requires little expert knowledge, but it still shows better efficiency than the benchmark method. Figure 7 presents a typical plume tracking trajectory of the robot with each controller. The MCOTSK achieves a noticeably more direct tracking trajectory, which is almost aligned with the wind direction. As expected, the Fuzzy Lévy Taxis generates a more meandering trajectory than the trained controllers.
From the results, it can be demonstrated that tuning the MCOTSK model with the DDPG reinforcement learning algorithm is feasible and the trained controllers can achieve even better results than the manually tuned fuzzy controller, which requires expert knowledge. The reward settings indeed can affect the performance of the trained controller, which can provide some inspiration for the following work to design the reward settings elaborately to let the robot learn expected behaviors.
5. Experiments
In this section, the MCOTSK-based plume tracking controller trained with the behavior-oriented reward is validated through robotic experiments. The adaptation of the controller from the simulated environment to the real environment is introduced. The experiment results are presented and discussed.
5.1. Experiment setup
The experiments were conducted in a laboratory at Huazhong University of Science and Technology, of which the size was $3.04$ m $\times \,3.75\,$ m (shown in Fig. 8(a)). A smoke machine was placed in the laboratory to generate smoke plumes. Two electric fans were utilized to generate indoor turbulent airflow to disperse the plumes.
The olfactory robot described in ref. [Reference Chen and Huang5] was employed to conduct the plume tracking tasks in this paper. The robot (shown in Fig. 8(b)) was modified from a Turtlebot 3. A Gill WindSonic sensor and a Plantower PMS7003 sensor were mounted on the robot to measure the wind direction and the particulate matter concentration, respectively. The sampling rates of the wind sensor and the particulate matter sensor are 4 Hz and 1 Hz, respectively. Because the proposed controller works in a discrete-time manner and takes observations at the beginning of each searching step before moving the robot to a new position, the sampling rates only affect the duration of each observation stage, but do not affect the performance (e.g., the success rate, the distance overhead) of the proposed controller. A Raspberry Pi was mounted on the robot to communicate with the remote PC through User Datagram Protocol (UDP) unicast and to send movement commands to the OpenCR board through a serial connection. The wheels of the robot were actuated by the OpenCR board, which executed the movement commands received from the Raspberry Pi. The real-time position of the robot during experiments was captured by a camera mounted on the ceiling of the laboratory, which recognized the red and green LED markers on the top of the robot with SwisTrack [Reference Lochmatter, Roduit, Cianci, Correll, Jacot and Martinoli48]. The captured position of the robot was used to record the ground truth trajectories during plume tracking.
5.2. Sim to real adaptations
Since the model is trained in simulated environments, but needs to be deployed on a real robot, some adaptations to the plume tracking algorithm are required.
The first one is to rescale the measured particulate matter concentration to a suitable range of the inputs for the MCOTSK. The measured number of particles with a diameter beyond 0.3 μm (PM0.3) in 0.1 L of air around the robot (denoted by $n_{0.3}$ in this paper) varies from around 2000 to more than 30,000 during the plume tracking process. But the range of the input $C_{\text{c}}$ in the simulated environment is from 0 to around 30. Therefore, in the robotic experiments, the input $C_{\text{c}}$ for the MCOTSK model is calculated by:
where $n_{\text{baseline}}$ is the number of PM0.3 measured in clean air and is set as 1888 in this paper.
The second adaptation is that the minimum movement length $L_{\text{min}}$ at each step is set to be $0.05$ m, and the maximum of the movement length is set to be $0.2$ m. The moving speed of the robot is 0.079 m/s. This setting can ensure the safety of the robot in the small searching area and prevent it from getting burned by the smoke machine. Besides, the radius of the stopping area in the experiment is set to be 0.5 m.
5.3. Plume tracking experiments and results
In order to validate that the relative position of the smoke leakage source and the robot does not affect the performance of the proposed controller, the smoke machine was placed at three different positions in the experiments (see Table I), and for each source position, the robot started from three different positions to track the smoke plumes and search for the smoke leakage source.
The results of the nine plume tracking experiments are presented in Table I. Figure 9 shows the robot’s trajectories during the nine experiments. The mean and median distance overhead are 1.1385 and 1.0544, respectively, which are close to 1 and match the simulation results well. The results and the trajectories demonstrate that the proposed MCOTSK-based controller makes the robot track the smoke plumes and find the smoke source along an almost straight path. In these experiments, the relative position of the smoke leakage source and the robot did not noticeably affect the performance of the proposed controller.
Videos recorded during experiments were attached to the manuscript.
6. Conclusions
In this paper, a multi-continuous-output TSK fuzzy system was designed and tuned with reinforcement learning. The structure of the fuzzy system and the training process were optimized with advanced techniques in machine learning, including DropRule and LN. The trained fuzzy system-based plume tracking controllers can achieve a success rate of around 85%, which is similar to that of a manually tuned benchmark method, together with higher odor source searching efficiency. The results also showed that a well-designed reward setting in the training process can further improve the performance of the controller. The controller was validated through experiments on a real robot, and the experiment results matched the simulation results well.
In our future work, the influence of the optimization techniques on the reinforcement learning TSK fuzzy system will be analyzed. To achieve a robust performance of the fuzzy system-based controller, more rigorous mathematical reasoning and stability analysis are also required [Reference Li, Zhao, Zhang, Wu, Zhang, Li, Li and Su49–Reference Li, Li and Kan52].
Supplementary materials
To view supplementary material for this article, please visit https://doi.org/10.1017/S0263574722001321.
Author contributions
X C and B Y contributed to the conception and implementation of the study. J H contributed to providing the experiment devices and facilities. Y L and C F contributed to supervising the study, reviewing and revising the manuscript.
Financial support
This work was supported by the National Natural Science Foundation of China [Grant U1913205, 62103180, and 52175272]; Guangdong Innovative and Entrepreneurial Research Team Program [Grant 2016ZT06G587]; the China Postdoctoral Science Foundation (2021M701577); the Science, Technology and Innovation Commission of Shenzhen Municipality [ZDSYS20200811143601004 and KYTDPT20181011104007]; the Stable Support Plan Program of Shenzhen Natural Science Fund [Grant 20200925174640002]; and Centers for Mechanical Engineering Research and Education at MIT and SUSTech.
Conflicts of interest
The authors declare no conflicts of interest.