Nomenclature
- 3D: Three-dimensional
- RL: Reinforcement learning
- DL: Deep learning
- MDP: Markov decision process
- DNN: Deep neural network
- DPG: Deterministic policy gradient
- ODPDAC: Offline deep policy deterministic actor-critic
- $r$: geocentric distance (m)
- $\lambda $: longitude (deg)
- $\phi $: latitude (deg)
- $V$: velocity (m/s)
- $\theta $: flight-path angle (deg)
- $\sigma $: heading angle (deg)
- $\upsilon $: bank angle (deg)
- $g$: Earth’s gravitational acceleration (m/s$^2$)
- $\mu $: Earth’s gravitational constant (m$^3$/s$^2$)
- ${C_\sigma }$ , ${C_\theta }$: Coriolis accelerations
- ${\tilde C_\sigma }$ , ${\tilde C_\theta }$ , ${\tilde C_V}$: centripetal accelerations caused by the Earth’s rotation
- ${\omega _{\rm{e}}}$: angular velocity of Earth’s rotation (rad/s)
- $Y$: lift (N)
- $D$: drag (N)
- $M$: mass of the hypersonic vehicle (kg)
- ${S_r}$: reference area (m$^2$)
- $q$: dynamic pressure (Pa)
- $\rho $: atmospheric density (kg/m$^3$)
- ${C_Y}$: lift coefficient
- ${C_D}$: drag coefficient
- $\alpha $: angle-of-attack (deg)
- $Ma$: Mach number
- $\boldsymbol{{x}}$: motion state vector
- $\dot Q$: heat rate (W/m$^2$)
- $n$: overload
- ${k_{\dot Q}}$: heat rate constant
- ${{\boldsymbol{{x}}}_0}$: initial states
- ${{\boldsymbol{{x}}}_{\boldsymbol{{f}}}}$: terminal states
- $J$: objective function
- ${C_1}$ , ${C_2}$: weighting factors
- ${r_f}$: actual terminal geocentric distance (m)
- ${\lambda _f}$: actual terminal longitude (deg)
- ${\phi _f}$: actual terminal latitude (deg)
- ${r_T}$: preset terminal geocentric distance (m)
- ${\lambda _T}$: preset terminal longitude (deg)
- ${\phi _T}$: preset terminal latitude (deg)
- ${\alpha _{\max }}$: maximum angle-of-attack (deg)
- ${\alpha _{\max Y/D}}$: angle-of-attack at the maximum lift-to-drag ratio (deg)
- ${V_1}$ , ${V_2}$: velocity constants (m/s)
- $a$: control action
- ${\upsilon _{\max }}$: maximum amplitude of bank angle (deg)
- ${\dot \upsilon _{\max }}$: maximum rate of bank angle (deg/s)
- S: state space
- A: action space
- P: state transition function
- R: reward function
- $\gamma $: discount factor
- ${\mathop{\rm H}\nolimits}\!\left( x \right)$: Heaviside function
- $\dot \theta $: rate of change of flight-path angle (deg/s)
- ${S_r}$: normalisation constant of flight range (m)
- ${h_r}$: normalisation constant of altitude (m)
- $\Delta {h_f}$: terminal altitude error (m)
- $\Delta {S_f}$: terminal position error (m)
- ${p_1},{p_2},{p_3},{p_4},{p_5}$: constants
- ${V_t}$: terminal velocity (m/s)
- ${G_t}$: reward-to-go starting at time $t$
- $Q\!\left( {s,a\left| {{\theta _Q}} \right.} \right)$: action-value function
- $s'$: next state
- $\mu \!\left( {s\left| {{\theta _\mu }} \right.} \right)$: policy function
- ${\theta _\mu }$: network parameters of the actor network
- ${\theta _Q}$: network parameters of the critic network
- $Q'\!\left( {s,a\left| {{\theta ^{Q^{\prime}}}} \right.} \right)$: target critic network
- $\mu^{\prime}\!\left( {s\left| {{\theta ^{\mu^{\prime}}}} \right.} \right)$: target actor network
- ${\theta _{\mu^{\prime}}}$: network parameters of the target actor network
- ${\theta _{Q^{\prime}}}$: network parameters of the target critic network
- ${L_s}$: mean square error
- ${N_b}$: number of batch samples
- $\delta ^{TD}$: temporal difference error
- $\Delta \lambda $: deviation step of longitude (deg)
- $\Delta \phi $: deviation step of latitude (deg)
- $\Delta h$: deviation step of altitude (m)
- ${s_I}$: actual flight states
- $\bar s$: arithmetic mean of the input samples
- $\bar \upsilon $: arithmetic mean of the output samples
- $\Delta s$: maximum difference of the input samples
- $\Delta \upsilon $: maximum difference of the output samples
- $L$: loss function
- ${L_2}$: regularised loss
- $\Delta {h_0}$: initial deviation of altitude (m)
- $\Delta {\lambda _0}$: initial deviation of longitude (deg)
- $\Delta {\phi _0}$: initial deviation of latitude (deg)
- $N\!\left( {\mu ,{\sigma ^2}} \right)$: normal distribution
- $\mu $: mean value
- ${\sigma ^2}$: variance
- CEP: circular error probable, the radius of a circle centred on the target point (m)
1.0 Introduction
The goal of the reentry glide phase of a hypersonic vehicle is to accurately guide the vehicle from its current position to the predetermined terminal area energy management (TAEM) interface under multiple constraints, strong nonlinearities and uncertainty [Reference Bao, Wang and Tang1]. However, this is difficult to achieve because of flight uncertainty and external disturbances in the glide phase [Reference Bao, Wang and Tang2–Reference He, Liu, Tang and Bao4]. Therefore, it is critical to develop an efficient, stable and less computationally intensive onboard approach that generates optimal trajectories with greater autonomy and better generalisation. The trajectory generation problem is regarded as an optimal control problem, which is traditionally solved by indirect or direct methods. The indirect method transforms the trajectory optimisation problem into a Hamiltonian two-point boundary value problem in the state and costate variables [Reference Wei, Han, Pu, Li and Huang5]. However, its strict requirements on the initial guess make the convergence of the numerical iteration uncertain. In contrast, the direct method does not rely on the necessary conditions for the optimal solution but approximates the optimal control problem as a nonlinear programming problem through parameterisation [Reference Dancila and Botez6]. Common direct methods include the pseudospectral method [Reference Chai, Tsourdos, Savvaris, Chai and Xia7, Reference Rizvi, Linshu, Dajun and Shah8] and the convex optimisation method [Reference Kwon, Jung, Cheon and Bang9–Reference Sagliano, Heidecker, Macés Hernández, Farì, Schlotterer, Woicke, Seelbinder and Dumont11].
The development of artificial intelligence algorithms represented by reinforcement learning (RL) and deep learning (DL) has offered new routes to intelligent flight control technologies [Reference Shirobokov, Trofimov and Ovchinnikov12]. Deep neural networks (DNNs) can theoretically approximate any nonlinear system [Reference Schmidhuber13, Reference Basturk and Cetek14], and they have been investigated for accurately mapping inputs to outputs in optimal control and in the fundamentals of optimisation models [Reference Nie, Li and Zhang15]. Shi and Wang [Reference Shi and Wang16] reported a DL-based method for onboard optimal 2D trajectory generation of hypersonic vehicles, in which the DNN captured the functional relationship between the flight state and the optimal action of the reentry trajectory. Although a DNN is an attractive option for onboard trajectory generation, its dependence on samples means that the trajectories it generates cannot surpass the original samples or improve on them further [Reference Sánchez and Izzo17, Reference Cheng, Wang, Jiang and Li18].
As one of the core techniques of artificial intelligence, RL is considered a more attractive option for onboard trajectory generation, because it gives the agent self-supervised learning capabilities through trial-and-error mechanisms and an exploration-exploitation balance [Reference Tenenbaum, Kemp, Griffiths and Goodman19–Reference Han, Zheng, Liu, Wang, Cheng, Fan and Wang21]. Although RL algorithms have high computational costs during training, they show much lower computational costs than optimal control methods when deployed [Reference Shi, Zhao, Wang and Jin22]. Gaudet et al. applied RL to develop a new guidance system [Reference Gaudet, Linares and Furfaro23, Reference Gaudet, Linares and Furfaro24] and an adaptive integrated guidance, navigation, and control system [Reference Gaudet, Linares and Furfaro25] by reinforcement meta-learning for Mars and asteroid landing. Conceptually similar works on designing autonomous closed-loop guidance have also been carried out in Refs (Reference Zavoli and Federici26–Reference Xu and Chen29). Most prior research on RL-based trajectory planning is aimed at spacecraft, while hypersonic vehicles remain little discussed [Reference Zhou, Zhou, Chen and Shi30].
RL essentially solves a sequential decision problem, while onboard trajectory generation requires optimal control commands in real time according to the current state. RL is therefore, in principle, an effective approach to online trajectory generation, but two difficulties must be resolved:
- (1) How to design an appropriate RL algorithm for high-dimensional continuous state space problems such as trajectory generation? Recently, investigators have highlighted the effects of deep RL. The DeepMind team implemented the Deep Q-learning algorithm by using a DNN to approximate the action-value function, in order to handle RL problems in high-dimensional continuous state space [Reference Mnih, Kavukcuoglu, Silver and Rusu31]. Concerning the RL algorithm, Silver et al. [Reference Silver, Lever and Heess32] presented the deterministic policy gradient (DPG) algorithm and showed that it performs more effectively than the stochastic policy gradient algorithm in high-dimensional continuous action space. Combining the advantages of the DNN and the policy gradient, an offline deep policy deterministic actor-critic (ODPDAC) algorithm is constructed and discussed in this paper to address trajectory generation for hypersonic vehicles in the glide phase.
- (2) How to reasonably initialise the RL actor network to accelerate learning and convergence? In theory, RL can reach the optimal policy through interaction with the environment without any prior information. However, if the initial parameters of the network are quite different from the final optimal parameters, the training difficulty and time increase significantly and training may even fail. The first-generation AlphaGo in 2016 addressed this by obtaining a pre-trained network through supervised learning of human game data and using it to initialise the subsequent RL network [Reference Silver, Huang, Maddison, Guez, Sifre, van den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, Dieleman, Grewe, Nham, Kalchbrenner, Sutskever, Lillicrap, Leach, Kavukcuoglu, Graepel and Hassabis33]; this initialisation method is taken as the reference for the RL actor network here.
Based on the above analysis, this paper proposes a novel RL-based onboard trajectory generation approach for hypersonic vehicles with better generalisation and autonomy, as shown in Fig. 1. Initially, a large number of glide trajectory samples are generated with the convex optimisation method. Then, the trajectory samples are used for supervised learning to pre-train the initial RL actor network. Based on the ODPDAC algorithm, the actor network is trained end to end and optimised to directly learn the mapping between the states and the optimal action. Finally, the resulting RL actor network is tested in gliding flight to realise high-precision online trajectory generation.
The originality of this paper is described in the following three aspects:
- (1) It is the first report on the design of a real-time online 3D trajectory generation method for hypersonic vehicles during the gliding phase based upon end-to-end RL.
- (2) Initialisation of the RL actor network by DL improves the stability and convergence of training while also promoting the success of policy optimisation.
- (3) The 3D trajectory generator optimised by RL reveals strong validity and generalisation for onboard real-time trajectory generation.
The remainder of this paper is organised as follows. Section 2 describes the reentry trajectory generation problem and models for the glide phase of the hypersonic vehicle. Section 3 provides the design of the RL algorithm. The tests and simulations are carried out in Section 4 and some discussions are presented in Section 5. Finally, the main work of this paper is summarised in the last section.
2.0 Problem formulation and generation model
2.1 Reentry dynamics
The motion equations of the three-dimensional unpowered flight for hypersonic vehicles over a spherical, rotating Earth are expressed as
In Equation (1), $t$ is the time, $r$ is the geocentric distance, $\lambda $ and $\phi $ are the longitude and latitude, respectively. $V$ is the velocity, $\theta $ is the flight-path angle, and $\sigma $ is the heading angle measured clockwise from the north. $\upsilon $ is the bank angle. $g = \mu /{r^2}$ is the Earth’s gravitational acceleration, where $\mu $ is the Earth’s gravitational constant. The Coriolis accelerations ${C_\sigma }$ and ${C_\theta }$ , and the centripetal accelerations ${\tilde C_\sigma }$ , ${\tilde C_\theta }$ and ${\tilde C_V}$ caused by the Earth’s rotation are given by
where ${\omega _{\rm{e}}}$ is the angular velocity of the Earth’s rotation. $Y$ and $D$ are the lift and drag accelerations, given by
where $M$ is the mass of the hypersonic vehicle and ${S_r}$ is the reference area. ${C_Y}$ and ${C_D}$ are the lift and drag coefficients, respectively, and both are functions of the angle-of-attack $\alpha $ and Mach number $Ma$ . $q$ is the dynamic pressure calculated by $q = 0.5\rho {V^2}$ . The atmospheric density $\rho $ is modelled as
where ${\rho _0} = 1.225\,{\rm{kg/}}{{\rm{m}}^{\rm{3}}}$ , ${h_s} = 7110\,{\rm{m}}$ is the scale height constant, and $h$ is the altitude.
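As a rough illustration of the atmosphere model, the sketch below assumes the usual exponential form $\rho = {\rho _0}\exp \left( { - h/{h_s}} \right)$, which is what the constants ${\rho _0}$ and ${h_s}$ above suggest, and combines it with the dynamic pressure relation $q = 0.5\rho {V^2}$. The function names and the sample flight condition are illustrative assumptions, not values from the paper.

```python
import math

RHO_0 = 1.225  # sea-level density rho_0 from the text (kg/m^3)
H_S = 7110.0   # scale height h_s from the text (m)


def atmospheric_density(h: float) -> float:
    """Exponential atmosphere rho = rho_0 * exp(-h / h_s) (assumed form)."""
    return RHO_0 * math.exp(-h / H_S)


def dynamic_pressure(h: float, v: float) -> float:
    """Dynamic pressure q = 0.5 * rho * V^2 in Pa."""
    return 0.5 * atmospheric_density(h) * v ** 2


if __name__ == "__main__":
    # Hypothetical flight condition (50 km, 5000 m/s), chosen only for illustration.
    print(atmospheric_density(50_000.0), dynamic_pressure(50_000.0, 5_000.0))
```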
2.2 Reentry trajectory generation models
Equation (1) is expressed as
where ${\boldsymbol{{x}}}$ denotes the motion state vector of $[\begin{matrix} r & \lambda & \phi & V & \theta & \sigma \end{matrix} ]^{\rm{T}}$ . The path constraints of heat rate $\dot Q$ , dynamic pressure $q$ , and overload $n$ are considered to prevent aerodynamic thermal ablation and structural damage to the vehicle as follows
where ${k_{\dot Q}}$ is a heat rate constant related to the vehicle shape. The quasi-equilibrium gliding condition is imposed as a soft constraint by approximating the flight-path angle rate as zero while neglecting the Earth’s rotation, as
The boundary constraints include initial and terminal states ${{\boldsymbol{{x}}}_0}$ and ${{\boldsymbol{{x}}}_{\boldsymbol{{f}}}}$ .
The minimum relative distance between the vehicle target point and the preset target point is expressed with the objective function $J$ as
where $r_f^{}$ , $\lambda _f^{}$ and $\phi _f^{}$ are the actual terminal geocentric distance, longitude and latitude of the vehicle, and $r_T^{}$ , $\lambda _T^{}$ and $\phi _T^{}$ are the corresponding preset target values. ${C_1}$ and ${C_2}$ are the weighting factors. A fixed angle-of-attack scheme with a piecewise linear function is given by
where ${\alpha _{\max }}$ and ${\alpha _{\max Y/D}}$ are the maximum angle-of-attack and the angle-of-attack at the maximum lift-to-drag ratio, respectively. ${V_1} = 5000\,{\rm{m/s}}$ and ${V_2} = 3500\,{\rm{m/s}}$ are the segmentation velocities. The bank angle is the only control variable, with magnitude and rate constraints as follows
To sum up, the glide-phase trajectory generation problem for the hypersonic vehicle may be summarised as:
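As a side illustration of the control-side ingredients of Section 2.2, the sketch below implements one common reading of the fixed angle-of-attack scheme (constant at ${\alpha _{\max }}$ above ${V_1}$, constant at the maximum lift-to-drag angle below ${V_2}$, and linear in between) together with a magnitude and rate limiter for the bank angle. The exact piecewise form is an assumption, and the default values of `alpha_max`, `bank_max` and `rate_max` are hypothetical placeholders; only ${V_1}$, ${V_2}$ and the 10 degree maximum lift-to-drag angle come from the text.

```python
def angle_of_attack(v, alpha_max=20.0, alpha_max_ld=10.0, v1=5000.0, v2=3500.0):
    """Piecewise-linear angle-of-attack schedule in degrees (assumed form)."""
    if v >= v1:
        return alpha_max
    if v <= v2:
        return alpha_max_ld
    # Linear blend between the two plateaus for V2 < V < V1.
    return alpha_max_ld + (alpha_max - alpha_max_ld) * (v - v2) / (v1 - v2)


def limit_bank_angle(cmd, prev, dt, bank_max=80.0, rate_max=10.0):
    """Apply the rate limit first, then the magnitude limit (degrees)."""
    max_step = rate_max * dt
    step = max(-max_step, min(max_step, cmd - prev))
    return max(-bank_max, min(bank_max, prev + step))


if __name__ == "__main__":
    print(angle_of_attack(4250.0), limit_bank_angle(50.0, 0.0, 0.1))
```

Applying the rate limit before the magnitude limit keeps the commanded change per step bounded even when the raw command jumps far outside the admissible range.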
3.0 RL-based trajectory generation method
3.1 Markov decision process (MDP) modeling
The mathematical model of RL is usually described as an MDP composed of five elements (S, A, P, R, $\gamma $ ), where S and A are the state space and action space of the agent, respectively, P is the state transition function of the environment, R is the reward function and $\gamma $ is the discount factor. The agent is the system composed of the hypersonic vehicle, the action policy $\mu $ and the action-value function $Q$ . The environment is the system dynamics. The motion state vector in Equation (1) is selected as the state S
and bank angle $\upsilon $ is the action A as
Because the physical system is deterministic, the transition probability P is 1. To achieve the minimum terminal position error of the trajectory generation under the complex constraints of the glide phase, the reward function is set as
where ${p_i}\!\left( {i = 1,2, \cdots ,5} \right)$ are constants and ${\mathop{\rm H}\nolimits} \!\left( x \right)$ is the Heaviside step function, defined as follows
In Equation (15), $\dot \theta $ is the flight-path angle rate and the larger $\left| {\dot \theta } \right|$ denotes more intense altitude oscillations and larger punishment (negative reward). $\Delta {h_f}$ is the terminal altitude error defined by $\Delta {h_f} = \left| {{h_f} - {h_T}} \right|$ . $\Delta {S_f}$ is the terminal position error defined by [Reference Zhou, Zhang, Xie, Tang and Bao34]
where ${r_f}$ is the terminal geocentric distance, ${S_r}$ and ${h_r}$ are the normalisation constants in the reward function, and ${V_t}$ is the terminal velocity. According to Equation (15), while $V \gt {V_t}$ , a punishment $ - {p_1}$ is assigned whenever the motion states exceed any path constraint on heat rate, dynamic pressure or overload. When no constraint is exceeded, a path punishment $ - {p_2}\left| {\dot \theta } \right|$ is assigned according to the severity of the oscillation. When $V = {V_t}$ , trajectory generation stops and a terminal reward is determined by the terminal position and altitude errors, with smaller errors leading to greater rewards. The trajectory favoured by the reward function in Equation (15) is therefore the gliding trajectory that satisfies the path constraints with the smallest terminal position and altitude errors. The discount factor $\gamma \in \left[ {0,{\rm{ }}1} \right]$ weights the future rewards of the agent relative to the immediate reward. If $\gamma = 0$ , the agent cares about the immediate reward only. If $\gamma = 1$ , the agent weights all future rewards equally. Ideally, the agent should consider the rewards of as many steps as possible, but too large a $\gamma $ may make convergence difficult. The design principle is therefore to choose $\gamma $ as large as convergence allows. Considering the large number of decision steps in gliding trajectory generation, $\gamma $ should be relatively large so that more decision steps are taken into account. In summary, the MDP model of gliding trajectory generation for hypersonic vehicles is described as
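For illustration only, the per-step part of the reward described in Section 3.1 (the branch that applies while $V \gt {V_t}$) can be sketched as below. The terminal reward is deliberately left out because its exact form is not restated in the text. The `PathLimits` values are hypothetical placeholders for the limits in Table 1, while the defaults ${p_1} = {p_2} = 100$ follow the values reported later in Section 4.2.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathLimits:
    # Hypothetical limits used only for this sketch; the paper lists its own in Table 1.
    heat_rate_max: float = 2.0e6     # W/m^2
    dyn_pressure_max: float = 1.0e5  # Pa
    overload_max: float = 3.0        # dimensionless load factor


def path_reward(heat_rate, dyn_pressure, overload, theta_dot,
                limits=PathLimits(), p1=100.0, p2=100.0):
    """Per-step reward while V > V_t, following the verbal description."""
    violated = (heat_rate > limits.heat_rate_max
                or dyn_pressure > limits.dyn_pressure_max
                or overload > limits.overload_max)
    if violated:
        return -p1                   # fixed punishment for any path-constraint violation
    return -p2 * abs(theta_dot)      # oscillation punishment otherwise


if __name__ == "__main__":
    print(path_reward(1.0e6, 5.0e4, 1.0, -0.02))
```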
3.2 RL algorithm for trajectory generation
3.2.1 RL overview
During RL, the agent’s goal is to learn a policy that maximises its total expected rewards, which is defined as [Reference Tenenbaum, Kemp, Griffiths and Goodman19]
where ${G_t}$ is the reward-to-go starting at the time $t$ . The action-value function $Q$ is defined as [Reference Tsitsiklis20]
The DPG algorithm has been developed for a high-dimensional continuous space problem such as trajectory generation. DPG maps the state $s$ to a deterministic action $a$ by expressing the policy as a policy function ${\mu _w}\!\left( s \right)$ ( $w$ is the policy parameter). When the policy ${\mu _w}\!\left( s \right)$ is deterministic, the Bellman Equation is used to calculate the action-value function $Q$ as [Reference Tsitsiklis20]
where $s'$ is the next state.
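The role of the discount factor can be made concrete with a small numerical sketch. It assumes the common convention ${G_t} = \sum\nolimits_k {{\gamma ^k}{R_{t + k}}}$ (the paper's indexing may differ by one step) and evaluates it backwards through the recursion ${G_t} = {R_t} + \gamma {G_{t + 1}}$, which mirrors the Bellman equation above. The function name `rewards_to_go` is illustrative.

```python
def rewards_to_go(rewards, gamma):
    """Return G_t = sum_k gamma**k * R_{t+k} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    episode = [1.0, 1.0, 1.0]
    for gamma in (0.0, 0.5, 1.0):
        print(gamma, rewards_to_go(episode, gamma))
```

With $\gamma = 0$ each return equals the immediate reward, and with $\gamma = 1$ it is the plain sum of all remaining rewards.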
Theorem. DPG theorem [Reference Silver, Lever and Heess32].
In an MDP model, assume that $\varpi \!\left( s \right)$ , $P\!\left( {s'|s,a} \right)$ , ${\nabla _a}P\!\left( {s'|s,a} \right)$ , ${\mu _w}\!\left( s \right)$ , ${\nabla _w}{\mu _w}\!\left( s \right)$ , $R\!\left( {s,a} \right)$ , ${\nabla _a}R\!\left( {s,a} \right)$ and $\rho \!\left( s \right)$ exist and are continuous functions of $s$ , $s'$ , $a$ and $w$ , where $\rho \!\left( s \right)$ denotes the initial state probability distribution and $P\!\left( {s'|s,a} \right)$ denotes the state transition probability. These conditions ensure that ${\nabla _w}{\mu _w}\!\left( s \right)$ and ${\nabla _a}Q\!\left( {s,a} \right)$ exist. Then the deterministic policy gradient exists and is given by
where $J$ is the objective function of RL. To explore other regions of the state space during the RL process, an offline (off-policy) policy gradient method is implemented; that is, the action policy is stochastic, while the evaluation policy is deterministic. The DPG algorithm follows the actor-critic learning framework, where the actor outputs the policy $\mu $ from the input state $s$ based on the deterministic policy and the critic evaluates $Q$ . The critic uses a differentiable approximation function to estimate $Q$ , and the actor updates the policy parameters $w$ along the gradient of $Q$ .
3.2.2 ODPDAC algorithm
Considering their advantages in high-dimensional, continuous nonlinear spaces, DNNs are used to approximate the action-value function (critic network) and the policy function (actor network) for end-to-end learning. Here ${w^\mu }$ and ${w^Q}$ denote the parameters of the actor network $\mu $ and critic network $Q$ . Based on Equation (22), the optimal policy function is iteratively solved by [Reference Silver, Lever and Heess32]
However, because of the Markovian property of RL data, consecutive samples are correlated, so the assumption of independent and identically distributed samples does not hold and the training of the networks is unstable. Experience replay is therefore introduced to store the data accumulated by the agent, while it continuously interacts with the environment, in a memory B. The data is stored in units of time steps, such as $\left( {{s_i},{a_i},{R_i},{{s'}_i}} \right)$ . To update the parameters of the neural network, the data is extracted from B by uniform random sampling. Because the samples are randomly selected, experience replay breaks the correlation between the data, which improves stability and convergence during training. Combining the above analysis, ODPDAC is shown in Fig. 2.
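A minimal sketch of such an experience replay memory is given below. The class name `ReplayMemory`, the fixed capacity with oldest-first eviction and the optional seed are illustrative assumptions; the text only specifies that transitions $\left( {{s_i},{a_i},{R_i},{{s'}_i}} \right)$ are stored per time step and drawn by uniform random sampling.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (s, a, R, s') transitions with uniform sampling."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self._rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self._buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=3, seed=0)
    for step in range(5):
        memory.store(step, 0.0, -1.0, step + 1)
    print(len(memory), memory.sample(2))
```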
The critic network is updated by minimising the loss function as
where ${N_b}$ is the batch size and $\delta _{TD}^{}$ denotes the temporal difference error, which is combined with Equation (21) as [Reference Zavoli and Federici26]
The actor network optimises and updates the network parameters according to the policy gradient ascent method as
where ${\nabla _a}Q$ is the gradient of the action-value function $Q$ with respect to the action $a$ in the critic network, and ${\nabla _{{w_\mu }}}\mu \!\left( {s\left| {{w_\mu }} \right.} \right)$ is the gradient of the policy function $\mu $ output by the actor network with respect to the network parameters ${w_\mu }$ . Both are expressed as
The algorithm is given as follows.
It is noted that the actor network is initialised by the DL pre-trained DNN obtained by supervised learning in Section 4.1.2 to improve the learning efficiency and convergence. Then the continuous interaction training between the hypersonic vehicle and the environment is carried out by ODPDAC to achieve the optimisation of the actor network.
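The critic and actor updates described in Section 3.2.2 can be sketched in a framework-free way as follows. This is an illustration under stated assumptions rather than the paper's implementation: the temporal difference error is taken in the usual form $\delta = R + \gamma Q'\!\left( {s',\mu '\!\left( {s'} \right)} \right) - Q\!\left( {s,a} \right)$ using the target networks from the nomenclature, terminal-state handling is omitted, and the networks are passed in as plain callables with hypothetical names.

```python
import numpy as np


def td_errors(batch, q, q_target, mu_target, gamma):
    """delta_i = R_i + gamma * Q'(s'_i, mu'(s'_i)) - Q(s_i, a_i) for each sample."""
    return np.array([
        r + gamma * q_target(s_next, mu_target(s_next)) - q(s, a)
        for s, a, r, s_next in batch
    ])


def critic_loss(deltas):
    """Mean square TD error over the N_b samples of the batch."""
    return float(np.mean(np.square(deltas)))


def actor_gradient(states, mu, dq_da, dmu_dw):
    """Sample mean of grad_a Q(s, a) at a = mu(s) times grad_w mu(s)."""
    return np.mean([dq_da(s, mu(s)) * np.asarray(dmu_dw(s)) for s in states], axis=0)


if __name__ == "__main__":
    # Toy scalar problem: Q(s, a) = -(a - s)^2 and mu(s) = w * s with w = 0.5.
    w = 0.5
    grad = actor_gradient([1.0, 2.0], lambda s: w * s,
                          lambda s, a: -2.0 * (a - s), lambda s: s)
    print(grad)  # a positive value means gradient ascent moves w towards 1
```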
4.0 Tests and analysis
4.1 DL pre-training
4.1.1 Multiple trajectories generation and sample processing
The hypersonic vehicle model for simulation is based on the published high-performance Common Aero Vehicle (CAV-H) model [Reference Phillips35], which has a maximum lift-to-drag ratio of 3.5 at an angle-of-attack of 10 degrees. The reference area of CAV-H is $0.48{{\rm{m}}^2}$ , and the mass is 907 kg. The convex optimisation method referred to in Ref. (Reference Zhou, He, Zhang, Tang and Bao36) is implemented to generate the standard trajectories offline. The initial states, terminal states and path constraints are shown in Table 1.
Using CVX software on an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz with 16 GB RAM and an NVIDIA GeForce RTX 2060 SUPER GPU, multiple trajectories are generated with different initial position deviations as follows
where $\lambda _i^{},\phi _j^{}$ and $h_k^{}$ are the longitude, latitude and altitude, respectively; $i,j$ and $k$ are the corresponding numbers of deviation steps; and $\Delta \lambda $ , $\Delta \phi $ and $\Delta h$ are the corresponding deviation step sizes. There are 125 initial positions $\left( {\lambda _i^{},\phi _j^{},h_k^{}} \right)$ obtained by combining different values of $\left( {i,j,k} \right)$ . Then, the convex optimisation method is used to solve Equation (12) to obtain the control input of the bank angle. Velocity is set as the trajectory termination condition. Substituting the bank angle commands into the equations of motion for numerical integration generates 125 reentry trajectories. All of them begin at different spatial locations, enveloping a certain area in the longitudinal and latitudinal planes, and finally arrive at the preset target location. The average time required to compute a complete trajectory is 20.83 s. None of these 125 trajectories exceeds the path constraint limits on heat rate, dynamic pressure and overload. A total of 5,051,254 sequence pairs of motion state and control input are sampled on the trajectories at an interval of 0.5 s and shuffled for subsequent DL training. To promote the efficiency and accuracy of DL, the samples are normalised to make them dimensionless with the same magnitude by
where $s$ is the motion state vector used as the DNN input and $a$ is the DNN output. ${s_I}$ and $\upsilon $ are the state vector and bank angle of the samples, respectively. $\bar s$ and $\bar \upsilon $ represent the means of the input and output samples, while $\Delta s$ and $\Delta \upsilon $ represent the maximum differences of the input and output samples. According to the 90/10 standard of 10-fold cross-validation, the normalised sample data is divided into a training set with 4,546,128 sequence pairs and a test set with 505,126 sequence pairs.
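The sample preparation steps of Section 4.1.1 can be sketched as follows. Several details are assumptions made for illustration: the grid uses five symmetric steps per axis (which yields the 125 combinations), each initial position is the nominal value plus an integer multiple of the step size, and the "maximum difference" is read as the range (maximum minus minimum) of each column. All function names are hypothetical.

```python
from itertools import product

import numpy as np


def initial_positions(lam0, phi0, h0, d_lam, d_phi, d_h, steps=(-2, -1, 0, 1, 2)):
    """All (longitude, latitude, altitude) combinations of the deviation grid."""
    return [(lam0 + i * d_lam, phi0 + j * d_phi, h0 + k * d_h)
            for i, j, k in product(steps, steps, steps)]


def normalise(samples):
    """Centre each column on its mean and scale by its range (max - min)."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    spread = samples.max(axis=0) - samples.min(axis=0)  # assumed nonzero
    return (samples - mean) / spread, mean, spread


def split_90_10(n_samples, seed=0):
    """Shuffle the sample indices and split them 90/10 into train and test."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = n_samples * 9 // 10
    return order[:n_train], order[n_train:]


if __name__ == "__main__":
    print(len(initial_positions(0.0, 0.0, 60_000.0, 0.1, 0.1, 500.0)))
    print(normalise([[0.0, 10.0], [2.0, 30.0]])[0])
```

Returning the mean and range alongside the normalised data matters in practice, because the same constants are needed to de-normalise the network output back into a bank angle.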
4.1.2 DNN structure and hyperparameters
With the samples collected above, DL is employed to train a DNN by learning a nonlinear mapping between the input of the motion state and the output of the control variable. To approximate the nonlinearity between the input and output, hidden layers 2 through 5 use the ReLU function, while the hidden layer 1 and the output layer use the tanh function to realise the nonlinear transformation. After comparative tests, the DNN structure and hyperparameters for trajectory generation are briefly summarised in Table 2.
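The activation layout described above (tanh on hidden layer 1 and on the output layer, ReLU on hidden layers 2 through 5) can be sketched as a plain NumPy forward pass. The hidden width of 64 and the random initialisation are hypothetical choices for illustration, since the actual structure is given in Table 2; the six inputs and single output correspond to the motion state vector and the normalised bank angle.

```python
import numpy as np


def init_actor(rng, n_in=6, n_hidden=64, n_layers=5, n_out=1):
    """Random weights and zero biases for a fully connected network."""
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return [(rng.normal(0.0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def actor_forward(params, s):
    """Map normalised states (..., 6) to a normalised bank angle in [-1, 1]."""
    x = np.asarray(s, dtype=float)
    last = len(params) - 1
    for i, (w, b) in enumerate(params):
        z = x @ w + b
        # tanh on hidden layer 1 and the output layer, ReLU on hidden layers 2-5.
        x = np.tanh(z) if i in (0, last) else np.maximum(z, 0.0)
    return x


if __name__ == "__main__":
    params = init_actor(np.random.default_rng(0))
    print(actor_forward(params, np.zeros(6)))
```

The tanh output layer bounds the result to $[-1, 1]$, which pairs naturally with a normalised bank angle target.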
For training and testing the DNN on the samples while avoiding overfitting, the loss function is defined as
In Equation (31), $m$ is the number of training samples, and $\hat \upsilon $ and $\upsilon $ denote the label and the network output, respectively. The first term on the right side of Equation (31) is the mean square error and ${L_2}$ is the regularised loss given in Equation (32), where $n$ is the number of weights, $w$ represents the weight coefficients, and $\lambda $ is the regularisation coefficient. The regularisation term ${L_2}$ constrains the weight coefficients of the network to reduce the complexity of the model and improve its generalisation ability. The learning rate and regularisation coefficient $\lambda $ are set to 0.005 and 0.0015, respectively, in DL training using Python 3.7.
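A sketch of a loss of this kind is given below. The mean square error term follows the description directly, while the regularisation term is assumed to take the common form ${L_2} = \frac{\lambda }{{2n}}\sum {{w^2}}$; the scaling actually used in Equation (32) may differ. The function name and argument layout are illustrative.

```python
import numpy as np


def regularised_loss(labels, outputs, weights, lam=0.0015):
    """Mean square error plus an L2 penalty on all weight coefficients."""
    labels = np.asarray(labels, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    mse = np.mean((labels - outputs) ** 2)
    n_weights = sum(w.size for w in weights)  # n in the text
    # Assumed scaling lambda / (2 n); only the weights are penalised, not the biases.
    l2 = lam / (2.0 * n_weights) * sum(np.sum(w ** 2) for w in weights)
    return float(mse + l2)


if __name__ == "__main__":
    layer_weights = [np.ones((2, 2)), np.ones((2, 1))]
    print(regularised_loss([0.1, -0.2], [0.0, -0.1], layer_weights))
```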
4.1.3 Results of DL
After 1,000 epochs, the variation of the loss function is displayed in Fig. 3 as follows.
It is inferred from Fig. 3 that the loss function converges to a small value in the DL process and fulfills the error precision requirement at the end of training. To further confirm the DL training results, the trained DNN is embedded in the simulation of the trajectory generation as shown in Fig. 4.
As can be seen from Fig. 4, the state variables are input to the DL pre-trained DNN, which outputs the bank angle to generate the trajectory in real time. The simulation conditions are the same as those in Section 4.1.1.
From Table 3, it can be inferred that the trajectory generated by the DL pre-trained DNN is within the path and terminal constraints. Because it is difficult to determine the optimal learning parameters for DL, overfitting or underfitting results in a terminal position error of 106.27 km. Since the trajectory can be reliably generated, the pre-trained DNN obtained by DL is appropriate for the initialisation of the subsequent RL actor network.
4.2 Tests of RL
4.2.1 Network structures and hyperparameters
As shown in Fig. 2, ODPDAC contains two DNNs: actor and critic. The structure and hyperparameters of the actor network are consistent with the DL pre-trained DNN in Section 4.1.2 to facilitate the initialisation of the network parameters. The settings of the critic network follow those of the actor network. Through repeated attempts and comparisons, the structure and hyperparameters of the actor and critic networks are presented in Table 4. The values of ${p_i}\!\left( {i = 1,2, \cdots ,5} \right)$ are set as 100, 100, 30, 1.5, and 0.5, respectively. The values of the normalisation constants ${S_r}$ and ${h_r}$ are ${1^ \circ }$ and $100{\rm{m}}$ , respectively. The learning rate and the total number of training episodes are $2.5 \times {10^{ - 5}}$ and 3,500, respectively.
4.2.2 Training results
To verify the effect of the DL pre-training on RL, two cases, RL with pre-training and RL without pre-training, are set up for comparative analysis. After 3,000 episodes of RL training, the variations of the terminal reward and of the terminal position and altitude errors of the trajectories with training episodes are shown in Fig. 5.
In Fig. 5, during the training process of RL with DL pre-training, the reward grows gradually with the number of training episodes. In the initial phase, the exploration is large and the optimisation of the policy is insufficient. Through continuous learning, the reward and the actor network are optimised rapidly within the first thousand training episodes. After 2,500 episodes, the reward stably converges to a larger value and finally reaches a maximum of 46.63. The terminal position and altitude errors converge to smaller values of 19.14 km and 44.2 m, respectively. However, the RL without DL pre-training fails, with unacceptable terminal position and altitude errors, because it is extremely difficult to learn a feasible solution in a short time for large-scale continuous space problems such as glide trajectory generation. Therefore, the feasible but not optimal initialisation from pre-training can effectively improve the stability and accelerate the convergence of RL.
4.2.3 Effectiveness of the actor network
To verify the effectiveness of the actor network optimised by RL training, it is substituted into the dynamics model of Equation (1), as shown in Fig. 4, to implement the numerical simulation of real-time trajectory generation. The integration time step of the 6-DOF dynamic simulation is set to 0.1 s. The simulation conditions are the same as those in Section 4.1.1 and the test results are presented in Fig. 6.
From Fig. 6(a) and (b), the RL actor network completes real-time trajectory generation in the glide phase. The average computing time is only 0.4329ms within an integral time step of 0.1s. The network output can be solved quickly by simple vector/matrix multiplication of the input state with the parameters in each hidden layer. Therefore, the online planning time of the RL policy network satisfies the onboard planning time requirement of a hypersonic vehicle. Then, the total time required to compute a complete trajectory is 9.28s which is less than the 20.83s required by the convex optimisation method. It indicates the very low online computational burden of policy networks with the DNN model. The hypersonic vehicle reaches the preset target position and altitude within terminal boundary constraints. Figure 6(c) shows that the velocity of the vehicle decreases gently to terminal velocity, without sharp changes or oscillation. The flight-path angle is kept near zero except in the initial descent phase where the values change drastically owing to the lack of lift. The heading angle changes slowly throughout the flight without oscillation, and the direction does not vary frequently. Figure 6(d) presents that the heat rate, dynamic pressure, and overload are kept below the maximum constraint during the whole flight. As shown in Fig. 6(e), the change of the bank angle output by the RL actor network is improved to enhance the effectiveness and reward of trajectory generation. The terminal errors and maximum constraint values in the glide phase with the RL actor network are reported in Table 5.
It is apparent from Table 5 that the trajectory meets the path constraints throughout the flight and that the terminal position error is only 19.57 km, a significant reduction from the 106.27 km shown in Table 3. The results demonstrate the effectiveness of the trajectory generator based on ODPDAC. The actor network is optimised according to the reward, which considerably improves the performance of trajectory generation. Moreover, the time required by the actor network during each simulation step is so short that the control variable is generated in near real time.
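The forward pass described above, a chain of matrix products through the hidden layers, can be illustrated with a minimal NumPy sketch. The layer sizes, the tanh activations, the random weights and the 80 deg bank-angle limit are all assumptions made for illustration; they are not the trained network or the parameters of this paper. The timing printed at the end depends on the machine and is not comparable with the reported 0.4329 ms.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 6 motion states in, one bank-angle command out.
LAYER_SIZES = [6, 64, 64, 1]
BANK_MAX_DEG = 80.0  # assumed amplitude limit, not the paper's value

# Random weights stand in for the trained actor parameters.
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
]


def actor_forward(state, params, bank_max_deg=BANK_MAX_DEG):
    """Map a normalised state vector to a bounded bank-angle command (deg)."""
    h = np.asarray(state, dtype=float)
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)  # hidden layer: matrix product plus nonlinearity
    w, b = params[-1]
    # tanh keeps the magnitude of the output within bank_max_deg
    return bank_max_deg * np.tanh(h @ w + b)


state = rng.uniform(-1.0, 1.0, size=6)  # placeholder normalised state
n_calls = 1000
start = time.perf_counter()
for _ in range(n_calls):
    command = actor_forward(state, params)
elapsed_ms = 1e3 * (time.perf_counter() - start) / n_calls
print(f"bank command: {command[0]:.3f} deg, mean time per call: {elapsed_ms:.4f} ms")
```

Because each call involves only a few small matrix products, the cost per step is fixed and does not depend on the flight condition, which is consistent with the low online burden reported above.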
4.2.4 Generalisation of actor networks
(1) Monte Carlo experiment
The generalisation of the RL actor network for online trajectory generation is tested under 500 initial position deviations randomly generated based on the Monte Carlo sampling principle. The initial deviations of altitude $\Delta h_0^{}\!\left( {\rm{m}} \right)$ , longitude $\Delta \lambda _0^{}\!\left( {{\rm{deg}}} \right)$ , and latitude $\Delta \phi _0^{}\!\left( {{\rm{deg}}} \right)$ are given as follows.
In Equation (33), $N\!\left( {\mu ,{\sigma ^2}} \right)$ denotes the normal distribution, where $\mu $ is the mean value and ${\sigma ^2}$ is the variance. To assess the effect of RL, the DL pre-trained DNN obtained in Section 4.1 is also tested using the same Monte Carlo experiments, and the final results of the two networks are compared in Fig. 7.
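As an illustration of the sampling step, the following sketch draws 500 sets of initial deviations from independent normal distributions. The zero means and the standard deviations used here are placeholder assumptions; the actual distribution parameters are those defined in Equation (33). Note that `Generator.normal` takes the standard deviation, not the variance, as its scale argument.

```python
import numpy as np

N_SAMPLES = 500  # number of Monte Carlo cases stated in the text

# Hypothetical standard deviations; Equation (33) defines the actual values.
SIGMA_ALT_M = 1000.0
SIGMA_LON_DEG = 0.5
SIGMA_LAT_DEG = 0.5


def sample_initial_deviations(n, seed=0):
    """Return an (n, 3) array of [altitude (m), longitude (deg), latitude (deg)] deviations."""
    rng = np.random.default_rng(seed)
    dh = rng.normal(0.0, SIGMA_ALT_M, n)      # zero mean is an assumption
    dlon = rng.normal(0.0, SIGMA_LON_DEG, n)
    dlat = rng.normal(0.0, SIGMA_LAT_DEG, n)
    return np.column_stack([dh, dlon, dlat])


deviations = sample_initial_deviations(N_SAMPLES)
print(deviations.shape)
print("sample mean:", deviations.mean(axis=0))
print("sample std: ", deviations.std(axis=0, ddof=1))
```

Fixing the seed lets both networks be tested on exactly the same set of deviations, which is what a like-for-like comparison requires.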
As shown in Fig. 7(a), the RL actor network generates online trajectories in the glide phase that reach the preset position and altitude range despite the initial position uncertainty. In Fig. 7(b), the velocity of all trajectories decreases slowly without sharp changes or oscillation. Figure 7(c) shows that the heat rate, dynamic pressure, and overload of all trajectories are kept within the maximum constraints. Figure 7(d) shows that the bank angle profiles differ considerably from one another, as the network adapts online trajectory generation to the different initial position deviations. According to Fig. 7(e) to (h), under the same initial position uncertainty, the remaining trajectories also complete online trajectory generation during the flight, but the distributions of the process states and terminal positions vary widely.
(2) Comparative analysis
To further compare the generalisation of the RL actor network and the DL pre-trained DNN, the terminal position errors, terminal altitude errors, and rewards of the two groups of trajectories are illustrated in Fig. 8 and Table 6. In Fig. 8(a), CEP (circular error probable) is the radius of a circle centred on the target point such that, over multiple Monte Carlo experiments, the vehicle has a 50% probability of falling within the circle [Reference Bao, Wang and Tang2].
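Under this definition, an empirical CEP can be estimated as the median of the radial miss distances over the Monte Carlo runs. The sketch below assumes the terminal misses are already expressed as east and north components in kilometres relative to the target; that representation and the function name are illustrative assumptions, and the three data points are made up.

```python
import numpy as np


def circular_error_probable(miss_east_km, miss_north_km):
    """Empirical CEP: the median radial miss distance from the target (km)."""
    radial = np.hypot(miss_east_km, miss_north_km)
    return float(np.median(radial))


# Made-up example: three terminal misses with radial distances 5, 10 and 1 km.
east = np.array([3.0, 6.0, 0.0])
north = np.array([4.0, 8.0, 1.0])
cep = circular_error_probable(east, north)
print(cep)
```

Half of the sampled landing points lie inside the circle of this radius, so a smaller CEP indicates a tighter cluster around the target.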
Figure 8(a) shows that the CEP of RL is less than half that of DL, which demonstrates the high landing accuracy of the RL actor network under large initial state uncertainties. Meanwhile, the mean terminal reward of RL presented in Fig. 8(b) is larger than that of DL, whereas its standard deviation is smaller. Accordingly, Fig. 8(c) shows that the absolute value of the mean and the standard deviation of the terminal position error of RL are both smaller than the corresponding values of DL. As can be seen in Fig. 8(d), although the mean altitude errors of RL and DL do not differ greatly, the standard deviation of the altitude error of RL is significantly smaller than that of DL, which implies that the altitude control of DL is more sensitive to initial state uncertainties than that of RL.
In summary, the statistical results of RL and DL compared in Table 6 lead to the conclusion that the actor network obtained by RL greatly improves the generalisation of online trajectory planning, whereas the DL pre-trained DNN has difficulty adapting to situations involving large initial state uncertainties.
5.0 Discussions
Based on the above design and simulation, there are three interesting points worthy of further discussion as follows.
-
(1) The flight target preset in the simulation is used for the position estimation of the terminal area energy management (TAEM) phase. Since the terminal position error requirement for the TAEM is not strict, a terminal position error of 19.57 km is reasonable and acceptable. The subsequent flight in the dive phase will accurately guide the hypersonic vehicle to the predetermined ground target. If high terminal position accuracy is required in the reentry gliding phase, the position error can be eliminated by terminal guidance methods, including the range and azimuth error guidance [Reference Bao, Wang and Tang2] and the relative line of sight angle guidance [Reference Bao, Wang and Tang1].
-
(2) The simulation results indicate that the landing accuracy and generalisation of the RL actor network are notably better than those of the DL pre-trained DNN. A possible explanation is that DL only learns the mapping between inputs and outputs in the samples and aims to minimise the error between its outputs and the samples. In contrast, RL aims to obtain the maximum reward in the environment rather than to minimise the output error, so RL is less prone to the overfitting seen in DL, which improves the stability and reliability of the actor network. Through the exploration vs. exploitation mechanism of RL, the actor network also explores unknown states through random actions, which DL does not do, and this promotes generalisation and autonomy under unknown uncertainties.
-
(3) The model basis for achieving intelligent properties such as intelligent planning and decision-making for hypersonic vehicles will still involve neural networks, because end-to-end learning can capture the intrinsic mapping between observed states and decisive actions to the greatest extent possible, allowing the vehicle to cope autonomously with various unknown and uncertain situations. When the neural network is deployed online after end-to-end learning, its computation time is much shorter than that of complex optimisation and control algorithms. In this sense, it is particularly apt for online real-time planning and control.
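Point (2) above refers to exploring unknown states through random actions. One common way to do this with a deterministic actor is to add noise to its output and clip the result to the admissible range; the sketch below uses additive Gaussian noise for illustration. The noise model, its standard deviation and the bank-angle limit are assumptions, and this is not necessarily the exploration scheme used by ODPDAC.

```python
import numpy as np

BANK_MAX_DEG = 80.0   # assumed bank-angle amplitude limit
NOISE_STD_DEG = 5.0   # assumed exploration noise level


def explore(action_deg, rng, noise_std_deg=NOISE_STD_DEG, bank_max_deg=BANK_MAX_DEG):
    """Perturb a deterministic action with Gaussian noise, then clip to the limits."""
    noisy = action_deg + rng.normal(0.0, noise_std_deg)
    return float(np.clip(noisy, -bank_max_deg, bank_max_deg))


rng = np.random.default_rng(1)
deterministic_action = 78.0  # placeholder actor output near the upper limit (deg)
samples = [explore(deterministic_action, rng) for _ in range(5)]
print(samples)
```

During training the perturbed actions visit states that the sample trajectories never contain, while at deployment the noise is dropped and the actor output is used directly.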
6.0 Conclusions
In this paper, an onboard 3D trajectory generation method is designed based on the RL algorithm. To accelerate convergence and improve the success rate of RL, the pre-trained DNN is utilised to initialise the RL actor network. Based on the ODPDAC algorithm and a reward function guided by the highest terminal accuracy, an onboard trajectory generator is established through end-to-end learning. Simulation results show that the actor network can directly output trajectory control commands in about 0.43 ms according to the motion state observed online. The RL-based planning method significantly improves the terminal accuracy of the trajectory. In the case of a biased initial position state, the RL actor network shows better generalisation than the DL pre-trained DNN.
Future work will focus on the challenges of small-sample learning, additional flight constraints, and variable target points in trajectory generation by RL. Considering environmental uncertainties will make the problem more complex still. Overall, this work offers a new perspective on, and a deeper understanding of, the application of artificial intelligence algorithms in flight control, and its results are encouraging. These initial results will contribute to the development of intelligent control of hypersonic vehicles.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Grant No. 62003355).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest concerning the research, authorship, and/or publication of this paper.