Nomenclature
- $a_t$ : an action
- $\hat{A}_t$ : advantage function estimate
- DQN : Deep Q-Network
- E : expectation
- $\hat{E}$ : expectation estimate
- J : objective function
- $J^{CLIP}$ : PPO surrogate objective function
- LSTM : Long Short-Term Memory
- MLP : Multi-layer Perceptron
- $\mathcal{N}$ : normal distribution
- PPO : Proximal Policy Optimisation
- P : probability
- q : pitch rate
- R : return
- RL : Reinforcement Learning
- RNN : Recurrent Neural Network
- $r_t$ : reward
- S : state space
- $s_0$ : initial state
- $s_t$ : state at time step t
- t : discrete time step
- T : final time step
- TRPO : Trust Region Policy Optimisation
- u, w : velocities in aircraft body axes
- $\mathcal{U}$ : uniform distribution
- $u_r$ : airspeed component
- UAV : Uncrewed Aerial Vehicle
- VTOL : Vertical Take-Off and Landing
- x : aircraft state vector
- $x_e$, $z_e$ : position in earth coordinates (z down)

Greek symbols
- $\epsilon$ : PPO hyperparameter
- $\eta$ : elevator angle
- $\theta$ : neural network parameters
- $\theta_e$ : pitch angle
- $\Lambda$ : wing sweep
- $\mu$ : dynamics parameters
- $\pi$ : policy parameterised by $\theta$
- $\tau$ : trajectory
1.0 Introduction
Fixed-wing uncrewed aerial vehicles (UAVs) typically have several advantages over multirotor vehicles, such as greater range and flight endurance. These attributes make them more suited to long-range surveillance or monitoring tasks. However, due to the nature of fixed-wing flight, these operations tend to be restricted to open environments. Runways or equipment such as catapults and arresting systems are often required to operate the UAV. In addition, conventional flight control systems for UAVs are designed for restricted flight envelopes, with limited ability to control highly agile manoeuvres. As such, they are not capable of sustained high-alpha flight or rapid avoidance manoeuvres. These constraints limit the ability for such fixed-wing aircraft to operate in complex environments, such as urban areas.
Vertical Take-Off and Landing (VTOL) aircraft, such as quad-plane and tailsitter configurations, are one method for landing in limited spaces. Another is to use a bio-inspired, perched landing approach, as described by Waldock et al. [Reference Waldock, Greatwood, Salama and Richardson1]. A perched landing uses a dynamic stall that decelerates the aircraft to safely land. This bio-inspired manoeuvre can be used to land small, fixed-wing aircraft in confined or challenging environments. Perching UAVs, both multirotor and fixed-wing, have been explored by others. For example, Meckstroth et al. [Reference Meckstroth and Reich2] and Moore et al. [Reference Moore, Cory and Tedrake3] both demonstrated experimental perching manoeuvres with small fixed-wing UAVs. Meckstroth et al. used a motion capture system to measure pitch-up manoeuvres, generating a numerical model of perching based on component buildup methods. Moore et al. demonstrated perching at height, controlled using Linear Quadratic Regulator trees. More recently, Novati et al. [Reference Novati, Mahadevan and Koumoutsakos4] employ model-free reinforcement learning (RL) for controlled gliding and perching for a simulated elliptical body. Here, RL was able to generate robust controllers that could generalise to previously unseen initial conditions.
This paper builds upon previous work carried out by the Bristol Flight Lab, where a perched landing manoeuvre is performed by a custom variable-sweep aircraft using an RL controller. Greatwood et al. created the aircraft [Reference Greatwood, Waldock and Richardson5] and Waldock et al. demonstrated the efficacy of a DQN-based schedule optimiser when compared against a non-linear optimisation algorithm [Reference Waldock, Greatwood, Salama and Richardson1]. RL was used as a schedule optimiser in an open-loop manner, where the DQN algorithm was used to pre-generate a schedule of actions [Reference Waldock, Greatwood, Salama and Richardson1]. Clarke et al. trained an RL controller in simulation, then deployed it on the aircraft to form a closed-loop RL controller [Reference Clarke, Fletcher, Greatwood, Waldock and Richardson6]. The aircraft state from the flight controller was sent to the deployed network to select an action during flight. Fletcher et al. investigated the addition of atmospheric disturbances to the simulation to reduce the gap between simulation and reality and improve real-world performance [Reference Fletcher, Clarke, Richardson and Hansen7].
Modern reinforcement learning algorithms, using neural networks as non-linear function approximators, have been applied across several control tasks, though mainly in virtual environments, such as video games [Reference Berner, Brockman, Chan, Cheung, Dębiak, Dennison, Farhi, Fischer, Hashme, Hesse, Józefowicz, Gray, Olsson, Pachocki, Petrov, Pinto, Raiman, Salimans, Schlatter, Schneider, Sidor, Sutskever, Tang, Wolski and Zhang8] and simulated robotics tasks [Reference Brockman, Cheung, Pettersson, Schneider, Schulman, Tang and Zaremba9]. RL control has been applied to several aerial robotics applications in recent years. For example, Koch et al. used RL to generate flight controllers for the attitude control of multirotors, demonstrating both simulated and real-world operation [Reference Koch, Mancuso, West and Bestavros10]. For fixed-wing aircraft, Bohn et al. used the Proximal Policy Optimisation (PPO) RL algorithm to make an attitude controller for an X8 UAV, demonstrating superior performance to that of a tuned PID [Reference Bohn, Coates, Moe and Johansen11].
Real-world applications, such as robotics, pose a greater challenge than simulated tasks. Current RL methods typically require a significant amount of experience to be collected during training, based on an agent’s interactions with an environment. One approach is to learn in situ, using experience gained by the robot in the real world. This can be infeasible for an aerial robot due to the time required and the likelihood of damaging hardware. A more practical approach is to learn in simulation and then transfer the trained agent to the robot for real-world testing. However, it is unlikely that the simulator will precisely model reality. This is particularly true for highly dynamic environments such as agile fixed-wing flight. For example, a numerical model generated from wind tunnel testing will have several simplifications and sources of error. The cumulative discrepancies between the simulated environment and the real world are known as the reality gap.
This project investigates several enhancements and modifications to the reinforcement learning process to improve the performance of the perched landing controller, both in simulation and during flight testing. These enhancements span three key categories: input observation modifications, modelling refinements and changes to the reinforcement learning process and architecture. Results from simulation are presented, demonstrating the effect of each modification individually and the cumulative effect when combined. The objective is to find a combination of improvements that achieves greater performance than the baseline model. Results from flight testing are then shown, demonstrating the impact of these improvements on the real-world performance of the RL controller.
2.0 Aircraft model
The test platform for this work is a custom sweep-wing aircraft, based on a standard Bixler 2 model aircraft, seen in Fig. 1. The Bixler 2 is an off-the-shelf foam trainer model aircraft. The aircraft was modified with the installation of a servo and custom mechanism. The modifications allow the sweep angle of the wings to be rapidly changed during flight. Wingtip incidence control was also added. Rapidly varying sweep can generate pitching moments by changing the centre of lift relative to the centre of mass. These pitching moments are far in excess of what the small elevator alone can generate. Full details of the hardware modifications and airframe design are provided in Greatwood et al. [Reference Greatwood, Waldock and Richardson5].
As well as wing modifications, a suite of avionics hardware is installed, enabling flight testing. This hardware includes a Pixhawk flight controller, running a custom fork of the ArduPlane firmware [Reference Tridgell, Ferreira, Morphett, Walser, De Marchi, du Breuil, Barker, Mackay, Pittenger, Geyer, Olson, Castelnuovo, Shamaev, Staroselskii, de Sousa, Beraud, Hall, Lawrence, Badaire, Denecke, Riseborough, Kancir, Mayoral Vilches and Lucas12], as well as sensors such as an airspeed sensor and GPS module. A Raspberry Pi is onboard to run the RL controller, and the Pi communicates with the Pixhawk over the MAVLink protocol.
The numerical model of the aircraft is based on wind tunnel data. A non-linear, longitudinal model was developed previously, with aerodynamic coefficients based on aircraft state. A full description of the wind tunnel testing and aerodynamic coefficients is given by Waldock et al. [Reference Waldock, Greatwood, Salama and Richardson1]. No propulsion is modelled as the perched landing manoeuvre does not require thrust. The ability to model atmospheric disturbances, in the form of steady-state winds and Dryden gusts, was later added to better resemble real-world flight conditions [Reference Fletcher, Clarke, Richardson and Hansen7]. The numerical model is implemented in Python and forms the simulated environment for the reinforcement learning process. The model is wrapped as an OpenAI Gym [Reference Brockman, Cheung, Pettersson, Schneider, Schulman, Tang and Zaremba9] environment, making it compatible with popular RL libraries, such as Stable Baselines [Reference Hill, Raffin, Ernestus, Gleave, Traore, Dhariwal, Hesse, Klimov, Nichol, Plappert, Radford, Schulman, Sidor and Wu13].
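As a rough illustration of this wrapping step, the sketch below shows how a longitudinal flight dynamics model of this kind might be exposed through the Gym interface. The model object, its method names (initial_state, step_dynamics, is_terminal, terminal_reward), the state bounds, time step and action encoding are illustrative assumptions rather than the project's actual implementation.

```python
import gym
import numpy as np
from gym import spaces


class PerchedLandingEnv(gym.Env):
    """Minimal sketch of wrapping a longitudinal flight dynamics model as a Gym environment.

    The `model` object and its methods are placeholders for the wind-tunnel-derived
    numerical model described above; bounds, time step and action encoding are illustrative.
    """

    def __init__(self, model, dt=0.05):
        super().__init__()
        self.model = model    # numerical flight dynamics model
        self.dt = dt          # simulation time step [s]
        # Observation: [x_e, z_e, theta_e, u, w, q, sweep, elevator]
        high = np.full(8, np.inf, dtype=np.float32)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        # Action: demanded wing sweep and elevator rates (normalised)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)

    def reset(self):
        self.state = self.model.initial_state()   # e.g. level flight, 40m from the target
        return np.asarray(self.state, dtype=np.float32)

    def step(self, action):
        self.state = self.model.step_dynamics(self.state, action, self.dt)
        done = self.model.is_terminal(self.state)  # ground contact or boundary breach
        reward = self.model.terminal_reward(self.state) if done else 0.0  # sparse reward
        return np.asarray(self.state, dtype=np.float32), reward, done, {}
```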
3.0 Reinforcement learning
Reinforcement learning is a machine learning paradigm with the objective of training an agent from experience. An agent maps states or observations to actions. A state is a complete description of the state of the world, with no hidden information. In robotics, a complete state is not usually available; instead, an incomplete state, or observation, is gathered from the robot’s sensors. During RL training, an agent interacts with an environment. The environment can be simulated, such as a video game or robotics simulator, or can be the real world. In this work, the training environment is the flight dynamics model of the aircraft. At each time step of the learning process, the agent receives an observation from the environment. Based on the observation, the agent uses a policy to determine the action to take. The action space describes how the agent interacts with an environment. In this work, the action space relates to the movement of the aircraft’s control surfaces. The agent’s policy defines this mapping from observation to action and is updated based on a reward signal emitted by the environment.
Reinforcement learning is formalised as a Markov Decision Process (MDP), described by a four-element tuple: (S, A, R, P), where S is the set of all valid states, A is the set of valid actions, R is the reward function, and P is the transition probability function. At each time step t, the agent receives a state $s_t$ and selects an action $a_t$ based on the policy, $\pi_\theta(a_t|s_t)$. Commonly, the policy will be stochastic and is parameterised by $\theta$, the weights of a neural network. Based on the state, or state-action pair, the agent will receive a reward $r_t$ from the reward function, $r_t = R(s_{t+1},a_t)$. The transition function $P(s_{t+1}|s_t, a_t)$ defines the probability of transitioning to $s_{t+1}$ when starting from $s_t$ and taking action $a_t$. Generally, in the RL setting, the transition probabilities will be unknown. This scenario is known as model-free learning, where the agent does not have access to a model to predict the next state from the current one. In an episodic task, the goal of reinforcement learning is to find a policy that maximises the expected return, Equation (1).
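In its standard undiscounted, finite-horizon form (an assumption consistent with the episodic, sparse-reward task described later), this objective can be written as the expected return over trajectories generated by the policy:

$$J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\right], \qquad R(\tau) = \sum_{t=0}^{T} r_t$$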
RL is an area of active research interest, with a wide array of algorithms and approaches. A key differentiator is model-free and model-based algorithms. In model-based RL, the agent has access to, or learns, a dynamics model of the environment, which can then be used in a planning algorithm. In model-free RL, the agent has no access to the underlying model and instead directly maps states to actions. Model-free learning has had much more research interest and is more widely applied in robotics literature. Model-based RL tends to be more sample efficient than model-free, however it is more complex and challenging to train [Reference Moerland, Broekens, Plaat and Jonker14]. Clarke et al. used a Deep Q-Network (DQN), a model-free, value-based algorithm [Reference Clarke, Fletcher, Greatwood, Waldock and Richardson6]. DQN, developed by Mnih et al., was the first deep-RL algorithm to be successfully demonstrated. A neural network is used as a non-linear function approximator for the Q-function [Reference Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, Petersen, Beattie, Sadik, Antonoglou, King, Kumaran, Wierstra, Legg and Hassabis15]. Whilst DQN has demonstrated efficacy for some tasks, it has several limitations, including being limited to only discrete action spaces. More modern algorithms are compatible with continuous action spaces and bring other improvements such as reduced complexity and higher performance. Proximal policy optimisation (PPO), developed by Schulman et al. [Reference Schulman, Wolski, Dhariwal, Radford and Klimov16] is a more modern algorithm and has been widely applied across a range of tasks and domains. For example, it has demonstrated high levels of performance for real-world dexterous manipulation [Reference Andrychowicz, Baker, Chociej, Józefowicz, McGrew, Pachocki, Petron, Plappert, Powell, Ray, Schneider, Sidor, Tobin, Welinder, Weng and Zaremba17] and quadrupedal walking projects [Reference Tan, Zhang, Coumans, Iscen, Bai, Hafner, Bohez and Vanhoucke18]. As well as being compatible with discrete and continuous actions, PPO tends to be less sensitive to hyperparameter selection and is less complex than other algorithms.
PPO is an on-policy, model-free RL algorithm. Unlike DQN, which is a value-based method, PPO is a policy-gradient method that aims to optimise a surrogate objective function using gradient ascent. Being an on-policy algorithm, this optimisation can only update with data collected while acting according to the most recent version of the policy.
For a generic policy gradient algorithm, the aim is to maximise an objective function, Equation (2).
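In their usual general form, and assuming an initial state distribution $\rho_0(s_0)$, this objective and the associated trajectory probability (the forms conventionally taken by Equations (2) and (3)) are:

$$J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\right] = \int_\tau P(\tau|\pi_\theta)\,R(\tau)\,d\tau$$

$$P(\tau|\pi_\theta) = \rho_0(s_0)\prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\,\pi_\theta(a_t|s_t)$$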
where $\pi$ is a stochastic policy parameterised by $\theta$ . For deep reinforcement learning, this is a neural network with weights $\theta$ . $P(\tau|\pi_\theta)$ represents the probability of a trajectory $\tau$ following policy $\pi_\theta$ , given by Equation (3). The transition model $P(s_{t+1}|s_t, a_t)$ is determined by the dynamics of the environment, and $\pi_{\theta}(a_t | s_t)$ is the probability of policy $\pi_{\theta}$ suggesting $a_t$ given $s_t$ .
For policy gradient algorithms, the policy is optimised through gradient ascent to find the optimal policy $\pi^*$. To compute the policy gradient numerically, the objective function is commonly expressed in the form of Equation (4), with the policy gradient given by Equation (5).
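The sample-based objective and policy gradient referred to as Equations (4) and (5) are conventionally written as:

$$J(\theta) = \hat{E}_t\left[\log \pi_\theta(a_t|s_t)\,\hat{A}_t\right]$$

$$\nabla_\theta J(\theta) = \hat{E}_t\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,\hat{A}_t\right]$$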
Here, $\hat{A}_t$ is the estimate of the advantage function, a measure of how good an action is when compared with other actions on average, relative to the current policy. An actor-critic network architecture is commonly used to estimate $\hat{A}_t$ , with a separate critic network trained to predict the value function. Employing the advantage function helps to reduce the variance of the gradient estimates.
This form of the policy gradient seen in Equations (4) and (5) is used in the vanilla policy gradient algorithm. However, performing optimisation using such a methodology can lead to large policy updates that cause instability during training. As such, Trust Region Policy Optimisation (TRPO) was created to limit the size of policy updates by enforcing Kullback–Leibler constraints [Reference Schulman, Levine, Moritz, Jordan and Abbeel19]. PPO was created to achieve similar benefits and performance as TRPO whilst being less complex and easier to implement. PPO uses a clipped surrogate objective function to constrain the update’s size. The surrogate objective function is given by Equation (6). Here, $r_t(\theta)$ is the probability ratio between the new and old policies, Equation (7), and $\epsilon$ is a hyperparameter that requires tuning.
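For reference, the clipped surrogate objective and probability ratio, as defined by Schulman et al. [Reference Schulman, Wolski, Dhariwal, Radford and Klimov16], take the form:

$$J^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\;\text{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$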
4.0 Perched landing
4.1 Manoeuvre description
Pure fixed-wing vehicles typically require a significant length of clear ground to land intact unless specialist equipment such as arresting gear can be used. This imposes significant restrictions on their use in congested environments or minimally prepared locations. However, many birds demonstrate the ability to land in unprepared locations and within a minimal distance when landing on a perch. If a similar manoeuvre could be carried out by a fixed-wing vehicle, it would allow for significantly increased flexibility in operations.
The ability of a bird to change the shape of its wings far outweighs that available to conventional fixed-wing aircraft, imparting a significant advantage in available control effort relative to their size. The many degrees of freedom that enable this are currently impractical to integrate into a fixed-wing platform. Therefore, the vehicle used in this work only has the additional ability to vary the sweep of its wings. This means the required trajectory is likely to differ significantly from that of a bird. Because of this, rather than specifying a desired trajectory for the perching manoeuvre, only the desired final state is specified.
The aircraft in this work can use the variable-sweep wings and elevator in unison to generate significant pitching moments, enabling a rapid increase in drag. This can be used to quickly slow the vehicle, helping to achieve the desired final state. Figure 2 shows a representation of the path and attitude of the vehicle throughout such a manoeuvre. The dynamics of the aircraft throughout this manoeuvre are highly non-linear, particularly the dynamic stall effects caused by the rapid change in main wing pitch [Reference Corke and Thomas20]. This poses a significant challenge for a conventional flight controller. Waldock et al. approached the problem by using a set of pre-computed, open-loop schedules to drive the wing sweep and elevator. Closed-loop control was then used to correct for deviations in pitch rate from the pre-computed schedule, with control effected by additional elevator deflection. The schedule optimised through reinforcement learning was shown to perform better than that generated by a more conventional optimiser. More recently, as in this work, an RL agent has been directly used as a closed-loop controller.
Before performing a perched landing, it is assumed that the aircraft is 40m away from the desired landing site, heading towards it in level flight and at a constant airspeed. The manoeuvre is unpowered, so the throttle is set to zero at the start of the manoeuvre. In earlier work, the manoeuvre was started at a height of 2m above the target. However, when facing headwinds, this was too limiting, so the starting height was increased to 5m.
4.2 Reward function
The objective of the perched landing manoeuvre is to land flat on the ground, with velocities close to zero. As such, the reward is a function of the five relevant longitudinal aircraft states, $(x_e, z_e, \theta_e, u, w)$ . The reward function generates a scalar output from these parameters. The reward is only returned at the end of an episode, when the terminal state is reached. This is known as a sparse reward.
The terminal state is reached when the aircraft height relative to the target, $z_e$ , becomes zero, simulating a landing on the floor. The reward is calculated based on the error between the final and target states. In this case, the target value for each state element is zero. The episode will end early if the aircraft breaches boundary conditions, such as maximum airspeed or pitch angle. In these cases, the reward for the episode is 0.
Waldock et al. used a reward function based on a weighted dot product of the mean square error for each parameter. However, with the introduction of atmospheric disturbances, this function resulted in undesirable multi-modal behaviour, with high rewards given to trajectories that would not reach the target position intact. Experimentation with modifications and alternative functions was performed in previous work, leading to the reward function that has been retained for this work [Reference Fletcher, Clarke, Richardson and Hansen7]. The reward function is based on the product of a Gaussian applied to the error in each state element. This reward function is crafted such that a high reward will only be returned if all the target states are within acceptable bounds, otherwise the reward is near zero. Each state is normalised before being passed through the Gaussian function, with the normalisation values found from iterative experimentation until the desired behaviour was achieved. The reward function is defined in Equation (8).
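Based on this description, Equation (8) can be expected to take the form of a product of unnormalised Gaussian kernels applied to the normalised error in each state element; a sketch of that form, assuming no additional scaling is applied, is:

$$R = \prod_{i}\exp\left(-\frac{\left(s_{t_i} - \mu_i\right)^2}{2\sigma^2}\right)$$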
where $s_{t_i}$ represents each state element, $\mu$ is the target value for each element, in this case zero for all, and $\sigma$ is 0.4.
5.0 Training improvements
This paper follows a similar methodology to previous work, where reinforcement learning models are trained in simulation and then deployed onto the real-world vehicle for flight testing. A series of modifications to the training process were investigated to improve real-world flight performance by minimising terminal state error.
5.1 Input observation
Previous flight testing of the perched landing manoeuvre demonstrated that atmospheric disturbances, in the form of steady winds and gusts, significantly impact real-world performance. Simulating wind and gusts showed an increase in mean reward over the baseline models, though with room for improvement. The baseline input observation, Equation (9), comprises the longitudinal states of the aircraft: the longitudinal position of the aircraft in the world frame, $x_e, z_e$, the pitch angle $\theta_e$, longitudinal body velocities u, w, pitch rate q, and wing sweep and elevator angles, $\Lambda$ and $\eta$.
This observation, however, does not include any information about the state of the air. It was hypothesised that learning, and real-world performance, could be improved by augmenting the input state with air mass data. While the wind vector could be added to the observation in simulation, this information is challenging to measure in the real world. The test aircraft has an airspeed sensor, which the flight controller uses during automated flight. As such, this is a state that can be used in the RL controller. During training, the component of airspeed along the body x-axis is added to the input vector to give the augmented input vector, Equation (10).
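Written out explicitly, the baseline and augmented observation vectors of Equations (9) and (10) therefore contain the following elements (the ordering and the $x_{aug}$ notation are assumptions for presentation):

$$x = \left[x_e,\, z_e,\, \theta_e,\, u,\, w,\, q,\, \Lambda,\, \eta\right]$$

$$x_{aug} = \left[x_e,\, z_e,\, \theta_e,\, u,\, w,\, q,\, \Lambda,\, \eta,\, u_r\right]$$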
5.2 Improving simulation
The reality gap is the difference between simulation and reality and poses a significant problem when performing experimental flight tests. Any uncertainty or error in the simulator can lead to poor real-world performance. Sources of discrepancy include unmodelled real-world dynamics, incorrect parameters, and stochasticity in the real environment. Accurately modelling fixed-wing flight is particularly challenging, with a typical numerical model requiring several assumptions and simplifications. For example, a numerical model constructed from wind tunnel data will have sources of error such as wall interference effects and loadcell measurement noise.
Domain randomisation is one technique used to improve the sim-to-real transfer of models trained in simulation. A policy trained using domain randomisation will be more robust to modelling uncertainty and error by experiencing variations of the domain. Whilst training the RL policy during simulation, various dynamics parameters are sampled. Peng et al. and Tan et al. have shown it to be effective for generating robust controllers for robotic arm manipulation tasks [Reference Peng, Andrychowicz, Zaremba and Abbeel21] and locomotion for quadrupedal robots [Reference Tan, Zhang, Coumans, Iscen, Bai, Hafner, Bohez and Vanhoucke18]. Formally, the objective function in Equation (2) is modified to produce Equation (11), to maximise the expected return across a distribution of dynamics models $\rho_\mu$ , where $\mu$ is a set of dynamics parameters [Reference Peng, Andrychowicz, Zaremba and Abbeel21].
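In the notation of Peng et al., this randomised objective (Equation (11)) takes the form of a nested expectation, first over the dynamics parameters $\mu$ drawn from the distribution $\rho_\mu$ and then over trajectories generated under those dynamics:

$$J(\theta) = E_{\mu \sim \rho_\mu}\left[E_{\tau \sim P(\tau|\pi_\theta,\,\mu)}\left[R(\tau)\right]\right]$$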
Domain randomisation was previously introduced for this perched landing scenario by randomising wind and gusts during training [Reference Fletcher, Clarke, Richardson and Hansen7]. Steady-state winds were randomised at the start of each episode, sampled from a uniform distribution, and gusts were injected as Dryden noise. Further sources of domain randomisation are introduced in this work.
Sensor noise is simulated using a similar method to that suggested by Peng et al. [Reference Peng, Andrychowicz, Zaremba and Abbeel21]. Gaussian noise is applied to the observation at each time step, with a mean of zero and a standard deviation of 5% of the running standard deviation for each observation feature. A small amount of acceleration noise, in the form of Gaussian noise, is applied to the acceleration term of the numerical model. This is similar to the noise applied by Waldock et al., where it was found to improve performance in the real world. This noise augments the Dryden gusts already added, helping to generate models that are more robust to numerical model inaccuracies and disturbances during flight. A noise level of 0.3m/s$^2$ is used in these experiments.
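The sketch below illustrates how these two noise sources might be injected, assuming a Gym-style environment; the wrapper class, the Welford running-statistics bookkeeping and the helper function are illustrative rather than the project's actual implementation.

```python
import gym
import numpy as np


class SensorNoiseWrapper(gym.ObservationWrapper):
    """Zero-mean Gaussian observation noise with a standard deviation of 5% of the
    running standard deviation of each observation feature (Welford's algorithm)."""

    def __init__(self, env, noise_fraction=0.05):
        super().__init__(env)
        self.noise_fraction = noise_fraction
        n = env.observation_space.shape[0]
        self.count = 0
        self.mean = np.zeros(n)
        self.m2 = np.zeros(n)

    def observation(self, obs):
        # Update the running per-feature mean and variance.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1))
        # Corrupt the observation seen by the agent.
        return obs + np.random.normal(0.0, self.noise_fraction * std)


def noisy_acceleration(accel, sigma=0.3):
    """Gaussian noise (standard deviation 0.3 m/s^2) added to the model's acceleration terms."""
    return accel + np.random.normal(0.0, sigma, size=np.shape(accel))
```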
Whilst the flight controller aims to reach an initial target state before handing over control to the RL model, in reality, there is variation in these start conditions. Previous experiments demonstrated variation in initial airspeed and pitch angle in particular. As such, initial state noise was simulated between episodes, with initial pitch angle and airspeed selected from normal distributions. Table 1 details all of the domain randomisation parameters and values defined in this section.
Peng et al. identified action latency and observation noise as having a significant impact on RL training performance for controlling a robotic arm [Reference Peng, Andrychowicz, Zaremba and Abbeel21]. Ibarz et al. suggest that actuator dynamics and the lack of latency modelling are the leading causes of modelling error for learning quadrupedal walking [Reference Ibarz, Tan, Finn, Kalakrishnan, Pastor and Levine22]. A characterisation of the action latency on the test aircraft was performed. One source of latency for the real system is communication between the flight controller and RL model. At each time step, the flight controller sends the input observation to the onboard computer over a serial connection, which uses the observation to select an action from the trained model. This action is then sent back to the flight controller over the serial connection to be performed by the servos. This communication and computation latency was characterised by timing the interval between the flight controller sending the state and receiving the control action from the model. This latency model does not consider mechanical latency.
Table 1 summarises all of the domain randomisation parameters used in this work, including the steady-state wind and gust parameters. Similar to previous work, steady-state winds are selected from a uniform distribution. Only headwinds are modelled in this work, up to 8m/s, due to the poor performance of tailwind scenarios during previous flight testing. The modelling of Dryden gusts remains unchanged, using the process and parameters for low altitude, light turbulence defined by Beard and McLain [Reference Beard and McLain23] and Langelaan et al. [Reference Langelaan, Alley and Neidhoefer24].
5.3 Hyperparameter optimisation
Investigations into improving the performance of the underlying RL algorithm were performed. Current RL algorithms are highly sensitive to appropriate hyperparameter selection and require significant optimisation for maximum performance. Henderson et al., for example, demonstrated the effect of changing several hyperparameters for several algorithms, including PPO [Reference Henderson, Islam, Bachman, Pineau, Precup and Meger25].
In previous research, the hyperparameters and network structure remained as the default values defined in the baseline implementation of PPO in Stable-Baselines [Reference Hill, Raffin, Ernestus, Gleave, Traore, Dhariwal, Hesse, Klimov, Nichol, Plappert, Radford, Schulman, Sidor and Wu13]. These settings are defined to achieve good performance across a range of environments, particularly the OpenAI Gym and Atari environments [Reference Brockman, Cheung, Pettersson, Schneider, Schulman, Tang and Zaremba9]. With sufficient computational capacity, a full hyperparameter sweep would allow for the optimisation of each hyperparameter. Instead, with the resources available, a more limited investigation was performed to identify the parameters that would significantly impact training performance. An initial hyperparameter sweep was performed using the Weights and Biases sweep tool [Reference Biewald26]. A Bayesian sweep was performed on ten parameters, based on the mean reward after 5 million time steps, for a total of 102 runs. The hyperparameter sweep showed that changing the n_steps parameter had a significant effect on training performance. This parameter is defined as the number of time steps to run for each environment before running the PPO update [Reference Hill, Raffin, Ernestus, Gleave, Traore, Dhariwal, Hesse, Klimov, Nichol, Plappert, Radford, Schulman, Sidor and Wu13]. Figure 3 shows a training history plot, comparing the reward per time step for several n_steps parameters. The plot shows a general trend of the reward increasing as n_steps increases, with the highest reward obtained when the parameter is set at 2,048, the highest value used in the sweep. With a higher number of steps, the policy update will be more generalised, instead of fluctuating between extreme policies that may be present if n_steps is lower. For the models in this paper, this is the value used, yet future investigations with n_steps greater than 2,048 could lead to further improvement.
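A Bayesian sweep of this kind can be configured with the Weights and Biases Python API along roughly the following lines; the metric name, parameter list and value ranges below are illustrative placeholders (only n_steps reflects values discussed above), and the training function body is elided.

```python
import wandb

# Illustrative Bayesian sweep configuration; only n_steps reflects the values
# discussed in the text, the other parameters and ranges are placeholders.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "mean_reward", "goal": "maximize"},
    "parameters": {
        "n_steps": {"values": [128, 256, 512, 1024, 2048]},
        "nminibatches": {"values": [4, 8, 16, 32]},
        "learning_rate": {"values": [1e-4, 2.5e-4, 1e-3]},
    },
}


def train():
    run = wandb.init()
    # ... build the environment, train PPO for 5 million time steps using
    # run.config, then report the evaluation metric ...
    wandb.log({"mean_reward": 0.0})  # placeholder value


sweep_id = wandb.sweep(sweep_config, project="perched-landing")
wandb.agent(sweep_id, function=train, count=102)
```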
As well as the sweep, tuned hyperparameters from other environments were identified from literature, such as those provided by the Stable Baselines Zoo collection [27]. This research suggested that increasing the nminibatches hyperparameter from the default value should lead to improved performance. This parameter is defined as the number of training minibatches per update. From the research and investigations carried out, it was decided that these two parameters would be varied in a number of test cases. Table 2 summarises these values.
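A minimal sketch of training with these adjusted hyperparameters in Stable Baselines is shown below; the make_env factory and the nminibatches value are illustrative placeholders (the values actually used are those summarised in Table 2).

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv


def make_env():
    # Placeholder: in practice this returns the perched landing environment.
    return gym.make("Pendulum-v0")


# Eight environments per update, matching the multi-worker setup described in Section 6.
env = DummyVecEnv([make_env for _ in range(8)])

model = PPO2(
    "MlpPolicy",
    env,
    n_steps=2048,      # steps per environment between PPO updates, from the sweep
    nminibatches=32,   # increased from the default of 4; illustrative value
    verbose=1,
)
model.learn(total_timesteps=40_000_000)
```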
5.4 Network architecture
Previous work has used a fully connected multi-layer perceptron (MLP) neural network architecture with two layers of 64 nodes. This is the default network architecture for an MLP as defined in the Stable-Baselines library [Reference Hill, Raffin, Ernestus, Gleave, Traore, Dhariwal, Hesse, Klimov, Nichol, Plappert, Radford, Schulman, Sidor and Wu13]. One potential change is to increase the size of the network, such as changing the number of layers or the number of nodes of each layer. Raffin et al. and Henderson et al. both show performance increases when using a larger neural network [Reference Raffin, Kober and Stulp28,Reference Henderson, Islam, Bachman, Pineau, Precup and Meger25]. As such, investigations with a more extensive network architecture, two layers of 256 nodes, were performed in this work.
With the use of further domain randomisation as described in Section 5.2, there is a significant change in the dynamics of the environment during training. Peng et al. suggest that a policy with a form of memory will perform better than a standard MLP in such a scenario [Reference Peng, Andrychowicz, Zaremba and Abbeel21]. One method of adding memory is the use of recurrent neural networks (RNNs), such as the Long Short-Term Memory (LSTM) architecture used by OpenAI [Reference Andrychowicz, Baker, Chociej, Józefowicz, McGrew, Pachocki, Petron, Plappert, Powell, Ray, Schneider, Sidor, Tobin, Welinder, Weng and Zaremba17]. However, RNNs are usually more challenging to train, requiring finer tuning of the underlying hyperparameters and increased computational time. An alternative approach, as used by Haarnoja et al. [Reference Haarnoja, Zhou, Hartikainen, Tucker, Ha, Tan, Kumar, Zhu, Gupta, Abbeel and Levine29] and Xiao et al. [Reference Xiao, Jang, Kalashnikov, Levine, Ibarz, Hausman and Herzog30], is frame stacking. Instead of a single observation, the observation at each time step is augmented by a window of $n-1$ previous observations, where n is the number of frames in the stack. Frame stacking can be used with a standard MLP structure, requires minimal modification to the existing architecture, and is easier to train than an LSTM. Initial investigations were conducted to identify the number of frames n to use, summarised in Fig. 4. The plot shows that using a stack of four frames results in a greater reward during evaluation than a stack of eight. As such, test cases using frame stacking in Section 7 use an n of 4.
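A sketch of combining the larger network and frame stacking in Stable Baselines is given below; the make_env factory and the choice of shared layers are assumptions for illustration.

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv, VecFrameStack


def make_env():
    # Placeholder: in practice this returns the perched landing environment.
    return gym.make("Pendulum-v0")


# Stack the four most recent observations (n = 4) so the MLP sees a short history.
env = VecFrameStack(DummyVecEnv([make_env]), n_stack=4)

model = PPO2(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[256, 256]),  # two layers of 256 nodes (shown here as shared layers)
    verbose=1,
)
```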
5.5 Continuous actions
Discrete action spaces have been used previously with both Deep Q-Network [Reference Clarke, Fletcher, Greatwood, Waldock and Richardson6] and PPO [Reference Fletcher, Clarke, Richardson and Hansen7]. The actions are encoded as a pair of arrays of angular rates for the wing sweep and elevator, with the action effectively selecting a pair of indices into these arrays. Equation (12) shows the discrete action arrays.
PPO is compatible with continuous action spaces, where ranges of actions are defined, and the model selects a continuous value in that range. In this case, the continuous action space was encoded as the minimum and maximum angular rates for both the sweep and elevator. It was hypothesised that, with a finer level of control over actions at each time step, the agent would be able to achieve greater performance levels. However, the use of continuous actions expands the size of the action space and can increase learning difficulty. Equation (13) shows a continuous representation of actions.
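The two encodings can be illustrated with Gym action spaces as below; the angular rate values and bounds are placeholders for illustration, not the arrays of Equation (12) or the limits of Equation (13).

```python
import numpy as np
from gym import spaces

# Discrete formulation: each action indexes a pair of pre-defined angular rates
# for wing sweep and elevator. The rate values here are illustrative only.
sweep_rates = np.array([-90.0, -45.0, 0.0, 45.0, 90.0])      # deg/s, placeholder
elevator_rates = np.array([-60.0, -30.0, 0.0, 30.0, 60.0])   # deg/s, placeholder
discrete_space = spaces.MultiDiscrete([len(sweep_rates), len(elevator_rates)])

# Continuous formulation: the agent selects any rate between the minimum and
# maximum bounds for each actuator (bounds again illustrative).
continuous_space = spaces.Box(
    low=np.array([-90.0, -60.0]), high=np.array([90.0, 60.0]), dtype=np.float32
)
```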
6.0 Simulation methodology
For each test case, five instances are trained using different random seeds. Each instance is trained for 40 million time steps on eight CPU cores of the University of Bristol’s BluePebble HPC, with each instance taking approximately 72 hours to train. This represents approximately 1 million simulated attempts at performing the manoeuvre. Multiple workers are employed, where experience from eight environments per time step is used to train the agent.
Once trained, each test case was evaluated in simulation. This was performed by running each trained model through the simulator to obtain time histories and simulated rewards. Two methods of evaluation are used in the simulated investigations. The baseline evaluation represents the evaluation scenario used in previous work, with randomisation of only steady winds and gusts. Enhanced evaluation employs further domain randomisation, such as variable initial conditions, latency and sensor noise, as described in Section 5.2. This presents a more challenging environment, with trained models expected to receive less reward in this scenario than when evaluated under baseline conditions. Table 1 provides details of the domain randomisation parameters used in the two evaluation scenarios. During evaluation, each model is evaluated against a range of steady wind speeds. To account for the stochastic nature of the evaluation environments, the reward presented at each wind speed for each test case is the mean of 100 evaluations for the five instances.
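The evaluation loop can be summarised with a sketch of the following form, where make_env(wind) is a placeholder for constructing the evaluation environment with a given steady headwind and the relevant noise sources enabled.

```python
import numpy as np


def evaluate(model, make_env, wind_speeds, n_episodes=100):
    """Mean episode reward at each steady headwind, averaged over repeated episodes.

    `make_env(wind)` is a placeholder for building the evaluation environment with the
    requested steady wind and noise sources enabled.
    """
    results = {}
    for wind in wind_speeds:
        env = make_env(wind)
        rewards = []
        for _ in range(n_episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, _ = env.step(action)
                total += reward
            rewards.append(total)
        results[wind] = float(np.mean(rewards))
    return results
```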
7.0 Simulation results
The first set of experiments looked to compare the performance of adding each improvement in succession, with the baseline test case representing a model without any improvements. Table 3 shows the test cases for this set of investigations.
Figure 5 shows the training history of the test cases from Table 3. Each line represents the average training performance for the five instances of each test case. The plot shows the reward obtained per time step. An overall increase in reward with time step is expected to show that the models are training successfully. An exponential moving average with a smoothing factor of 0.7 is applied to the data. The baseline and Test Case 1, in general, obtain the highest reward per episode, due to the lack of additional variation of the domain during training. Using domain randomisation for the other test cases leads to a more challenging scenario with greater variance between episodes, hence lower reward. For the remaining test cases described, Test Cases 5 and 6 demonstrate generally higher rewards per episode, with Test Cases 2 and 3 obtaining the lowest reward per episode during training. This suggests that improvements such as frame stacking and tuning of hyperparameters lead to greater training performance when additional noise is added.
Figure 6 shows a plot of the mean rewards for the two evaluation scenarios, with higher reward showing superior performance. Figure 6 shows that, when evaluated against the baseline scenario, the baseline model and Test Case 1 obtain significantly higher rewards than the other test cases. Test Cases 2–6 add additional noise to the environment, and there is a general decrease in mean reward. The baseline and Test Case 1, trained without noise, have more experience during training of this simpler environment and obtain higher rewards when evaluated against the baseline scenario. However, these models perform poorly when evaluated against the enhanced scenario. These models cannot generalise to the noisier environment, as they have insufficient experience of these states during training. In general, across all test cases, there is a decrease in reward from the baseline evaluation to the enhanced evaluation, due to the more challenging environment.
Test Case 1 shows that the addition of airspeed into the observation leads to a slight increase in mean reward across both evaluations. This suggests that this additional information is useful to the model when learning. The addition of noise during training in Test Case 2 results in an increase in reward, as these models have experience of a more comprehensive array of states and can generalise more successfully. Test Case 5 obtains the highest mean reward of 0.178 under enhanced evaluation. This test case employs all previous improvements yet still uses discrete actions. Continuous actions in Test Case 6 lead to a decrease in mean reward compared to Test Case 5. This is most likely a product of the training constraints. With sufficient training time and a comprehensive parameter tuning process, the models with continuous actions would likely show performance above that of the discrete-action models. Test Case 3 also shows worse performance than Test Case 2, despite being identical except for its use of a larger network. Whilst this result suggests that the smaller default network size should be used, further investigation showed that performance is sensitive to the combination of improvements. As such, the best performing model may not be that with all improvements enabled.
7.1 Combinations of modifications
Based on the first set of simulation results, a series of tests were conducted with various combinations of improvements to identify the best performing model. This section provides results from a selection of the experiments conducted. Table 4 describes each of the test cases, with the baseline included for comparison. Figure 7 shows the learning curves for these test cases. Similar to Fig. 5, the baseline models show the highest reward during training, due to a simpler training environment. With additional domain randomisation applied to the other test cases, there is a general decrease in reward per time step of training. Test Case 7 adds additional domain randomisation during training, yet without any of the learning improvements, resulting in the worst learning performance. The remaining test cases from Table 4 have learning improvements such as frame stacking and tuned parameters for improved learning performance. Some test cases are identical to the previous section, are given the same name, and are included for a complete comparison.
In Table 5, the baseline case obtains the lowest overall mean reward of 0.099, as well as the lowest reward at each evaluation headwind. With the lack of domain randomisation and with default training parameters and architecture, these models are least able to generalise to the evaluation environment detailed in Section 7. Using a larger neural network led to a decrease in mean evaluation reward in the previous set of results. However, in Table 5, Test Case 5 obtains a slightly higher mean reward of 0.178 compared to 0.176 for Test Case 10. Similarly, Test Case 9 obtains a reward of 0.175, whilst Test Case 8 receives a mean reward of 0.170. Test Cases 5 and 10 are identical except for network size. This relationship is also true for Test Cases 9 and 8. This suggests that having a larger neural network is advantageous to training performance for these more complex models. Table 5 shows that Test Case 5 still receives the highest overall reward of 0.178. Similar to Fig. 6, the inclusion of airspeed in the observation results in a slightly higher mean reward. The difference in mean reward between Test Cases 3 and 5 emphasises the importance of the learning improvements, such as frame stacking and hyperparameter tuning.
7.2 Impact of individual improvements
The results presented so far suggest that the implemented improvements are of varying levels of importance. In the previous set of results, there is a small increase in reward when airspeed is added to the observation, and a more significant increase in reward is seen with the use of frame stacking and change of hyperparameters. Figure 8 shows the effect of adding each improvement individually to the baseline model. Five models are trained for each test case using the same methodology as previously and then evaluated against the enhanced evaluation scenario described previously. The plots show the percentage difference in mean reward for each test case compared to the baseline model. The data show that the most significant increase in mean reward comes from changing the hyperparameters. Even though only two hyperparameters have been adjusted, a significant increase in reward above the baseline is observed. It is likely that with a complete hyperparameter optimisation, an even greater increase in training performance will be seen. The impact of adding acceleration noise in training is also significant, as models trained with only the acceleration noise modification show a considerable increase in reward when compared to the baseline. Using a larger neural network and using continuous actions both result in a decrease in reward. The model of latency used in this work has negligible impact, with only a slight increase in reward when included during training. It is likely that training with a more comprehensive characterisation of latency, including mechanical effects, would be required before significantly impacting performance.
Figures 9 and 10 show example time history plots during simulated evaluation. They represent the best performing model of the baseline and Test Case 5, respectively. Each model is evaluated against a series of steady-state winds, with all noise sources enabled. The plots are from a single evaluation with random seeds and so only provide example trajectories. There will be variance between evaluations with domain randomisation, so the plots do not accurately represent performance. Present in the plots are the key longitudinal states, such as pitch angle and body velocities, and the actions. The plots show the variance present and the effect of wind speed on performance. For example, when facing a headwind of 8m/s, both models fall well short of the target position. As the headwind decreases, the aircraft lands closer to its target position.
8.0 Experimental flight testing
Over two separate sessions, experimental flight testing was conducted to validate the simulation results. These flight tests use an automated testing process, as used and described in previous experimentation by the authors [Reference Fletcher, Clarke, Richardson and Hansen7]. The autopilot flies pre-generated waypoints based on the prevailing wind direction. The waypoints are generated using a script to create a racetrack pattern, with parameters to adjust the size and location of the pattern. A heading parameter is used to specify a wind direction, with the pattern rotated such that the straight legs of the circuit are parallel to the wind direction.
At a designated waypoint, the autopilot hands over control of the elevator and wing sweep to the RL model running on the onboard computer. The ArduPilot firmware was modified to include a new waypoint type to specify the point at which the perching manoeuvre begins and the aircraft switches to RL control. An interface script transforms the state reported by the autopilot, such that the initial position and direction match those used for training; in this case, $x_e, z_e = (-40, -5)$ .
When the aircraft passes through ground level in this transformed state, i.e. 5m below the RL start altitude, the RL controller disengages, and the autopilot attempts to recover the aircraft, fly to the next waypoint and repeat the pattern. The test pilot only has to launch and land the aircraft during optimal operation, with the autopilot performing the rest. However, the pilot retains manual override to take over control of the aircraft if necessary. The autopilot aims for an airspeed of 13m/s for the manoeuvre entry waypoint.
For each test case being evaluated during flight testing, the model that obtained the highest reward in simulation was deployed on the vehicle. The first set of tests compared the baseline models with Test Case 7 from Table 4, a model trained with domain randomisation, as described in Section 7. These tests were conducted on a day with inconsistent winds from variable directions, gusting up to approximately 5m/s. Table 6 shows the results from these first tests. During a flight, the aircraft is launched, performs several attempts at the manoeuvre, and then lands when the battery level is low. Due to inconsistent power usage, there is variance in the number of manoeuvre attempts per flight.
In Table 6, the Test Case 7 model obtains the higher overall mean reward of 0.011, compared to 0.006 for the baseline. However, the individual rewards and greater standard deviation suggest that this overall mean is skewed by a single attempt, with a reward of 0.248 obtained on the first attempt of flight 1. If this result is ignored, the mean for Test Case 7 drops to 0.001. This suggests that the baseline model performed better than Test Case 7 during these tests when this outlying result is ignored. This is supported by Table 7, which shows the mean final states across all attempts for both models. Table 7 shows that the baseline model has lower final state error for both position and pitch angle. However, the Test Case 7 model attains a lower final mean velocity.
Figures 11 and 12 show time history plots from the experiments, with each line corresponding to a manoeuvre attempt. The plots show the key longitudinal states of the aircraft, as well as the surface deflections. Of particular interest is the difference between the chosen actions for the two models and the effect on the corresponding trajectories. Figure 11 shows that the policy for the baseline model opts to initially sweep the wings rearwards as it pitches the aircraft downwards into a dive, and then shifts them forward as it pitches up into a flare. For the remainder of each manoeuvre, both actuators are then used to attempt to maintain pitch close to zero degrees. Figure 12 demonstrates a different policy. The wing sweeps forward early in the manoeuvre, and the vehicle pitches up. For many of the attempts, the aircraft maintains maximum forward sweep for the manoeuvre duration. Many of the attempts fall short of the target position, falling vertically after losing lift, suggesting the aircraft has stalled. Therefore, it is likely that the underperformance of the Test Case 7 model in the real world is due to the reality gap. The early stalling behaviour and subsequent lower reward are not seen during simulation, suggesting there are underlying real-world aerodynamic effects not present in the numerical model. The behaviour shown by the baseline model may be more stable. This discrepancy is likely exacerbated by the aircraft’s velocity, with the Test Case 7 model achieving a lower mean final velocity than the baseline.
The second session of flight tests compared the performance of the baseline model against Test Cases 5 and 9. On this day, there were steady winds of approximately 5m/s gusting up to 8m/s, with consistent wind direction. Both Test Cases were trained with domain randomisation, increased network size, frame stacking and tuned hyperparameters. Both also used discrete actions, while only Test Case 5 has airspeed as part of its observation. These experiments used the same baseline model as the first testing session. Table 8 provides a summary of the rewards obtained for each manoeuvre attempt for each test case across a number of flights. Due to time constraints, only one set of flights for Test Case 5 could be obtained. The baseline model shows a slight increase from the first flight tests, with an increase in reward from 0.006 to 0.008. The Test Case 9 model achieves a greater mean reward of 0.044. Test Case 5 achieves a mean reward of 0.036; however, this is across only eight attempts.
Table 9 shows the average final states of the aircraft. Compared with Table 7, the baseline attempts show a greater position error of 20.78m, compared to 15.71m in the first flight tests. However, there is a decrease in final pitch and velocity error, leading to the overall higher reward. Test Case 9 has reduced mean position and velocity errors of 14.61m and 6.02m/s respectively, yet does have a greater pitch error of 5.72 degrees. Test Case 5 generally showed the greatest mean error level across the three states, however this was with only eight attempts.
Figure 13 shows the state history for the baseline model. The plots show similar characteristics to Fig. 11, with the policy of the model selecting similar sequences of actions with similar resultant trajectories. Compared with Fig. 13, Fig. 14, which shows the state histories for Test Case 9, demonstrates a general trend of landing closer to the target position of 0m, as well as reduced body velocities. Like the baseline model, this policy opts to pitch downwards at the start of the manoeuvre and then flare later on for a perched landing. Figures 11 through 14 demonstrate the level of variance between attempts from the same model. Even for the best performing Test Case 9, shown in Fig. 14, there is significant variance in the selected actions and the resultant trajectories. This is likely due to the nature of the perched landing manoeuvre: the aircraft is highly sensitive to gusts once it has committed to a flare trajectory, especially when lacking throttle.
9.0 Conclusions
Using the PPO algorithm, reinforcement learning was used to generate a series of models to perform a perched landing manoeuvre. A series of training enhancements were identified to improve real-world performance, measured primarily through the final reward achieved. These enhancements include passing additional information into the input observation, introducing further noise into the training through domain randomisation, and modifying the network architecture and underlying RL algorithm. A series of investigations were conducted, assessing the impact of each enhancement on the final reward, both individually and in combination. Simulations demonstrated that the most significant individual increase in reward came from tuning two of the key hyperparameters of the PPO algorithm. Other modifications, such as using a continuous action space instead of discrete, decreased the final reward. After testing several combinations, test cases with higher performance than the baseline model were identified for flight testing.
These trained models were deployed on the flight test aircraft. The results from several repeat attempts at the manoeuvre for each model were collected using an automated flight testing process. The first flight testing session compared a baseline model to a model trained with domain randomisation, Test Case 7. These flights demonstrated the limitations of the current approach, as the test case that performed better in simulation performed worse in reality. This is likely due to the reality gap. The Test Case 7 model chooses a policy that works in simulation and receives a high reward yet leads to premature stall in reality. This suggests that either improvement of the numerical model through further data collection is required, or further domain randomisation of the aerodynamic parameters could be attempted.
The second session of flight testing compared a baseline model with models from Test Cases 5 and 9. These test cases were two of the best performing test cases from simulation, with their properties detailed in Table 4. Test Case 5 adds airspeed to the observation, whereas Test Case 9 uses the standard observation. Test Case 9 achieves higher mean rewards than the baseline model and generally achieves a lower final state error. Due to time constraints, only a limited number of Test Case 5 manoeuvre attempts could be collected. From the limited data set, Test Case 5 achieves a mean reward greater than the baseline yet lower than Test Case 9. However, the average final state errors are higher than the other two test cases. This suggests that further reward shaping may be required to achieve the desired perching behaviour across a range of real-world conditions.
Overall, enhanced models demonstrate improved performance compared to the baseline in both simulation and the real world. However, even the best performing model still lands somewhat short of the target position. It is likely that any further improvement in performance cannot come from changes to the RL process alone. One potential future pathway is to integrate throttle control into the RL controller. This would allow the aircraft to perform the manoeuvre when facing stronger headwinds instead of having insufficient energy and falling short, as is currently the case. Refinement and expansion of the underlying numerical model is another area of potential future work. The results suggest that, even with the domain randomisation performed in this work, there is still a significant reality gap. Further wind tunnel testing, real-world system identification methods, and model-based RL methods are future research paths that could improve simulation-to-reality transfer.
Acknowledgements
This work was part-funded by the Defence Science and Technology Laboratory (DSTL), Ministry of Defence. This work was partially supported by the EPSRC Centre for Doctoral Training in Future Autonomous and Robotic Systems (FARSCOPE) at the Bristol Robotics Laboratory. The authors would like to thank all contributors to ArduPilot [Reference Tridgell, Ferreira, Morphett, Walser, De Marchi, du Breuil, Barker, Mackay, Pittenger, Geyer, Olson, Castelnuovo, Shamaev, Staroselskii, de Sousa, Beraud, Hall, Lawrence, Badaire, Denecke, Riseborough, Kancir, Mayoral Vilches and Lucas12], and the pymavlink and stable-baselines [Reference Hill, Raffin, Ernestus, Gleave, Traore, Dhariwal, Hesse, Klimov, Nichol, Plappert, Radford, Schulman, Sidor and Wu13] libraries, all of which have enabled this work. This work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol – http://www.bris.ac.uk/acrc/