1. Introduction
Human emotions can be perceived not only through explicit facial expressions [Reference Teijeiro-Mosquera, Biel, Alba-Castro and Gatica-Perez1], voice information [Reference Korayem, Azargoshasb, Korayem and Tabibian2], or text cues [Reference Liu, Zhou, Ji, Zhao and Wan3], but also through implicit body language, including eye movements [Reference Yun4], body postures [Reference Liu, Khan, Farooq, Hao and Arshad5], and gait traits [Reference Xue, Li, Wang and Zhu6]. Nonverbal communication plays a major role in recent human–robot interaction (HRI) [Reference Göngör and Tutsoy7]. Body language delivers nonverbal signals that can provide important cues to a person’s mental and physiological state and intentions. Gait is a unique biometric trait that can be obtained from a distance without individuals’ attention or cooperation [Reference Jain, Semwal and Kaushik8]. Meanwhile, ref. [Reference Cutting and Kozlowski9] has reported that a human’s walking pattern is difficult to imitate or intentionally disguise. Human gait conveys significant information that can be used to identify people and recognize emotions [Reference Sheng and Li10]. HRI can transfer not only mechanical power [Reference Li, Ren, Zhao, Deng and Feng11, Reference Li, Xu, Wei, Shi and Su12] but also emotional signals [Reference Narayanan, Manoghar, Dorbala, Manocha and Bera13] between humans and robots. Emotion is a ubiquitous element of HRI. Compared to traditional emotion detection biometrics, such as facial expression, voice, and physiological signals, gait provides a new source and can be obtained from a long distance without the subject’s cooperation. Gait therefore fills a gap in emotion recognition when other traits cannot be observed at long distances. A recent paper [Reference Xu, Fang, Hu, Ngai, Guo, Leung, Cheng and Hu14] presented a review of current gait emotion recognition research and possible future developments.
Gait-based emotion recognition has many application scenarios, such as psychological diagnosis, emotionally aware robots [Reference Narayanan, Manoghar, Dorbala, Manocha and Bera13], customer services, interactive games, and virtual reality [Reference Bhattacharya, Rewkowski, Guhan, Williams, Mittal, Bera and Manocha15]. The field still has considerable room for improvement to support a broader range of applications.
Understanding human emotion through facial expressions has been well studied [Reference Göngör and Tutsoy7]. However, the ability to rely on body language to perceive emotion becomes important when a person is not directly facing the robot, or facial expressions are not visible from a distance. Recent work [Reference Li, Li and Kan16] observed that collaborative robots can improve interaction and performance by understanding the movement intentions of human operators. For example, in space-sharing application scenarios such as hospitals, airports, and shopping malls, robots can understand the intention of pedestrians through gait recognition of emotional states and determine whether to provide friendly navigation services or to wisely avoid causing untimely disturbances (as illustrated in Fig. 1). It is expected that the emotionally aware robot can navigate safely through crowds without causing discomfort to nearby pedestrians. Meanwhile, identity recognition is a prerequisite for robots to provide personalized services. Since each person’s emotional expression will have individual differences, having personalized emotion understanding capability is the key to achieving intelligent HRI. Gait-based identity and emotion recognition as an aspect of nonverbal communication can help analyze and understand human intentions.
Previous work [Reference Peri, Parthasarathy, Bradshaw and Sundaram17] discovered that variations in a person’s emotional state between the training and testing datasets can degrade recognition performance in a gait-based identity verification task. Moreover, several studies [Reference Liang, Liu, Zhou, Jiang, Zhang and Wang18, Reference Zhang, Provost and Essl19] have indicated that, through multi-task learning (MTL), the emotion recognition task can benefit from training with secondary related tasks. However, most existing works learn identity representations and emotional features separately and treat them as independent of each other. In ref. [Reference Sheng and Li10], models trained with MTL for gait-based emotion and identity recognition showed additional performance improvements. The authors argued that gait-based identity and emotion recognition are interrelated tasks that are well suited to joint learning. MTL models entangle information between the tasks to capture the joint dependencies from the multiple labels of the training data [Reference Yu, Xu, Zhang and Ou20]. However, there has been a noticeable absence of studies on MTL for emotional gait, mainly due to the lack of gait datasets annotated with both emotion and identity labels.
Deep learning models often require a large quantity of training data to obtain good prediction or classification performance. Nevertheless, collecting gait samples is often costly and time-consuming, making it very difficult to obtain a well-annotated dataset with sufficient samples [Reference Sheng and Li21]. The problem is particularly prominent in gait emotion recognition tasks because the annotation of emotional categories is ambiguous and vulnerable to subjective factors [Reference Yi and Mak22]. To reduce the impact of personal subjectivity, it is often necessary to recruit multiple annotators to strengthen annotation reliability. However, in some cases, it is impossible to ensure the accuracy of the results even for experienced annotators [Reference Huang23]. Therefore, the insufficient data problem severely hinders the application of gait emotion recognition in practice.
With the increasing application of deep learning to emotion recognition tasks, augmenting the original training set with samples produced by generative adversarial networks (GANs) may offer a solution to this challenge by improving recognition results. Using a data augmentation strategy similar to ours, ref. [Reference Bhattacharya, Mittal, Chandra, Randhavane, Bera and Manocha24] recorded hundreds of annotated gait videos and augmented them with synthetic gaits generated by a conditional variational autoencoder (CVAE) to increase the emotion classification accuracy.
Traditional methods for data augmentation are generally based on GANs or autoencoders, such as conditional GANs (cGANs) [Reference Mirza and Osindero25] or conditional VAE (CVAE) [Reference Sohn, Lee and Yan26]. The decoder of CVAE produces random samples from a conditional distribution and generates synthetic data to learn different distributions for the specific categories [Reference Gao, Chakraborty, Tembine and Olaleye27]. Pix2pix [Reference Isola, Zhu, Zhou and Efros28] can generate high-quality image results in the case of paired training data using a cGAN to implement the mapping function. To train with unpaired data, CycleGAN [Reference Zhu, Park, Isola and Efros29], DiscoGAN [Reference Kim, Cha, Kim, Lee and Kim30], MUNIT [Reference Huang, Liu, Belongie and Kautz31], and StarGAN [Reference Choi, Uh, Yoo and Ha32] exploit cycle consistency to constrain the training process. Applying data enhancement to gait emotion recognition, ref. [Reference Bhattacharya, Mittal, Chandra, Randhavane, Bera and Manocha24] designed a gait generation network STEP, based on CVAE to generate thousands of synthetic samples.
Motivated by the achievements of emotional conversion in voice [Reference Rizos, Baird, Elliott and Schuller33, Reference Su and Lee34] and facial expression [Reference Zhu, Gao, Song and Mao35], we propose an emotional gait conversion approach that transforms natural gaits into emotional gaits by separating identity and emotion representations for data augmentation. The contributions of this work can be summarized as follows:
- We introduce an MTL discriminator for joint learning of gait identity and emotion, which takes nonverbal communication cues into account to enhance HRI.
- We propose a novel emotional gait conversion model with adversarial loss and cycle consistency loss to realize the mutual transformation between natural gait and emotional gait.
- We propose two data augmentation strategies based on the emotional conversion model to increase the amount and diversity of the existing restricted dataset.
- We present an augmented synthetic dataset of human emotional gait; validated with a multitask classifier, it yields absolute increases of 2.1% and 6.8% in identity recognition and emotion recognition, respectively.
2. The proposed method
The main idea of this work is to increase the amount and diversity of the original limited dataset by transforming natural gaits into emotional gaits. We first extract gait trajectories from the original videos to represent the discriminative gait features. Then two autoencoders are trained to separate latent identity embedding and emotion-specific embedding using two auxiliary classifiers to guarantee the minimal mutual information related to each other. In the second stage, we propose a novel cycle consistency GAN to realize the synthesis of the separated identity and emotion features from different samples. After carrying out this data generation process, we can train an enhanced gait emotion classifier on the augmented dataset to obtain a significantly improved performance. Figure 2 illustrates how we incorporate our data augmentation method for gait emotion and identity recognition into an end-to-end emotionally guided navigation pipeline.
2.1. Gait trajectories generation
In this work, the gait data were recorded by two Microsoft Azure Kinect DK sensors placed in front of and to the side of the subjects. Kinect DK is a convenient body tracking toolkit that captures RGB images, depth information, and human skeleton coordinates all at once, reducing the need for sophisticated model extraction processes. Using the body tracking function, we can extract a real-time data stream of the body joints, represented by 25 joint coordinates in 3D space. We selected 20 joints with relatively large ranges of motion to represent the gait movement. Then, we concatenated the coordinates of each joint across time to form a continuous trajectory. Finally, to eliminate the impact of distance variations between people and cameras, we normalized the coordinates using the distance between a subject’s hip and neck.
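The normalization step can be illustrated with a minimal sketch. The text only states that coordinates are normalized by the hip-to-neck distance; centering on the hip, the per-frame scaling, and the joint indices `hip_idx` and `neck_idx` are assumptions made here for illustration.

```python
import numpy as np

def normalize_gait(seq, hip_idx=0, neck_idx=1):
    """Scale a gait sequence by the per-frame hip-to-neck distance.

    seq: array of shape (frames, joints, 3) with 3D joint coordinates.
    hip_idx, neck_idx: hypothetical joint indices; the actual values
    depend on the skeleton layout of the sensor.
    """
    seq = np.asarray(seq, dtype=float)
    hip = seq[:, hip_idx:hip_idx + 1, :]      # (frames, 1, 3)
    neck = seq[:, neck_idx:neck_idx + 1, :]   # (frames, 1, 3)
    # Hip-to-neck distance per frame, kept broadcastable over joints.
    scale = np.linalg.norm(neck - hip, axis=-1, keepdims=True)
    scale = np.maximum(scale, 1e-8)           # guard against division by zero
    # Assumption: coordinates are also centered on the hip joint.
    return (seq - hip) / scale
```

With this choice, a subject recorded closer to or farther from the camera (a uniform scaling and shift of all coordinates) maps to the same normalized trajectory.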
2.2. Learning separated representations
Let $x \in \mathcal{X}$ be a gait trajectory sequence and $\mathcal{X}$ be the collection of all the trajectories in the training data. In stage 1, $E_{id}$ denotes the identity encoder and $E_{em}$ denotes the emotion encoder. To learn separated identity and emotion representations, we employ two classifiers $C_{id}$ and $C_{em}$ with adversarial learning constraints on the feature encoders. These constraints ensure that changes in one factor cannot be predicted from another factor to realize independence between them. Based on the adversarial training concept, $E_{em}$ maximizes the retention of emotional information and discards identity information by minimizing the negative log probability to differentiate the identities. On the other hand, the classifier $C_{em}$ is trained adversarially to induce the encoder $E_{id}$ to extract only identity-related features. We thus apply the loss:
To perform random sampling at test time, we restrict the emotion feature representation to a conditionally independent Gaussian distribution, by introducing KL divergence loss to match the posterior distribution $p(z_{em}|x)$ to the prior $N(0, I)$ . We thus apply the loss:
where ${KL}(p\|q)$ represents the Kullback–Leibler Divergence score and quantifies the difference between two given probability distributions $p$ and $q$ .
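As a hedged illustration of the KL term, the sketch below uses the well-known closed form for the divergence between a diagonal Gaussian and the standard normal prior $N(0, I)$. It assumes that the emotion encoder outputs a mean `mu` and a log-variance `log_var` per latent dimension, which the text does not state explicitly.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch.

    mu, log_var: arrays of shape (batch, latent_dim).
    Assumption: the posterior p(z_em | x) is a diagonal Gaussian
    parameterized by its mean and log-variance.
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Closed form: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1) per sample.
    kl_per_sample = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)
    return float(kl_per_sample.mean())
```

The loss is zero exactly when the posterior equals the prior, which is what allows emotion vectors to be drawn from $N(0, I)$ at test time.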
The generator $\mathrm{G}$ is trained to generate $x^{\prime }$ which is a reconstruction of $x$ from the concatenation of emotion representation $z_{em}$ and identity representation $z_{id}$ , given the original emotion label $c^x$ and target emotion label $c^{x^{\prime }}$ :
By using both original and target label as conditional information, this restriction encourages all the converted data to be close to real data. The mean absolute error is minimized in training the generator. So the reconstruction loss is given:
The full objective in stage 1 is deployed by the following equation:
which integrates the above losses, and the hyperparameters $\lambda _{1}$ control the importance of each term. The encoders and the discriminators are trained alternately.
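The reconstruction term (mean absolute error) and the weighted combination of loss terms can be sketched as follows. The function names and the dictionary keys are hypothetical, and the sketch does not encode the signs with which the adversarial terms enter the stage 1 objective, since those are not recoverable from the text.

```python
import numpy as np

def reconstruction_loss(x, x_rec):
    """Mean absolute error between a gait sample and its reconstruction."""
    x = np.asarray(x, dtype=float)
    x_rec = np.asarray(x_rec, dtype=float)
    return float(np.mean(np.abs(x - x_rec)))

def weighted_objective(losses, weights):
    """Weighted sum of named loss terms.

    losses, weights: dicts with identical keys (hypothetical names such
    as "rec" or "kl"); weights play the role of the lambda hyperparameters.
    """
    if losses.keys() != weights.keys():
        raise ValueError("losses and weights must have identical keys")
    return sum(weights[name] * losses[name] for name in losses)
```

For example, `weighted_objective({"rec": 2.0, "kl": 0.5}, {"rec": 1.0, "kl": 0.1})` combines a reconstruction value of 2.0 with a down-weighted KL value of 0.5.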
2.3. Cycle-consistent GANs
Here, to learn an emotional gait conversion with paired emotional gait samples using the separated representation of identity in stage 1, we propose a cycle consistency technique to exploit the further features for cyclic reconstruction. Let $x, y \in \mathcal{X}$ be the two sampled gait trajectory sequences (as illustrated in Fig. 3). $c_{em}^{x}$ and $c_{id}^{x}$ denote the emotion label and identity label of sequence $x$ , respectively, and $c_{em}^{y}$ and $c_{id}^{y}$ denote the labels of sequence $y$ . We encode them into vector $\{v_{id}^x\}$ and $\{v_{id}^y, v_{em}^y\}$ by the pretrained encoders $E_{em}$ and $E_{id}$ . We then perform the generation process by reassembling the extracted identity vector $v_{id}^x$ and the emotion vector $v_{em}^y$ into a combined representation of a synthetic sample $z$ :
We further encode $z$ into $\{v_{em}^z, v_{id}^z\}$ . Then, a cycle consistency loss $\mathcal{L}^{id}_{cycl}$ for $v_{id}^x$ , $v_{id}^y$ , and $v_{id}^z$ , the same structure as triplet loss [Reference Schroff, Kalenichenko and Philbin36], is designed to enforce identity preservation:
where $\alpha$ is the value of the margin in two terms. Another cycle consistency loss $\mathcal{L}^{em}_{cycl}$ between $v_{em}^y$ and $v_{em}^z$ is used to enforce emotion preservation:
We employ the reconstruction loss $\mathcal{L}_{rec}$ only when $c_{id}^{x} = c_{id}^{y}$ :
We also impose domain adversarial losses by a unified MTL discriminator $D_{MTL}$ to discriminate between natural gaits and generated gaits in each conversion process and distinguish the generated data in both the emotion and identity domains. This adversarial MTL loss can be expressed as:
Here, we also restrict the emotion attribute representation to a conditionally independent Gaussian distribution, by introducing KL divergence loss $L_{\mathrm{KL}}$ . The overall loss is a weighted sum of the above losses:
where the hyperparameters $\lambda _{2}$ are the regularization weights.
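The two cycle consistency terms of this stage can be illustrated with a sketch under stated assumptions. The identity term below is the standard single-hinge triplet loss with $v_{id}^z$ as anchor, $v_{id}^x$ as positive, and $v_{id}^y$ as negative; the original formulation is described as having two margin terms, whose exact form is not shown here. The L1 distance for the emotion term and the default margin value are likewise assumptions.

```python
import numpy as np

def identity_cycle_loss(v_id_x, v_id_y, v_id_z, margin=0.2):
    """Triplet-style loss: pull v_id_z toward v_id_x, push it from v_id_y.

    All inputs have shape (batch, dim). z is synthesized from the identity
    of x and the emotion of y, so x supplies the positive example.
    Assumption: x and y come from different identities.
    """
    v_id_x, v_id_y, v_id_z = (np.asarray(v, dtype=float) for v in (v_id_x, v_id_y, v_id_z))
    d_pos = np.sum((v_id_z - v_id_x) ** 2, axis=-1)  # distance to source identity
    d_neg = np.sum((v_id_z - v_id_y) ** 2, axis=-1)  # distance to the other identity
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

def emotion_cycle_loss(v_em_y, v_em_z):
    """L1 distance between the emotion vector of y and that re-encoded from z."""
    v_em_y = np.asarray(v_em_y, dtype=float)
    v_em_z = np.asarray(v_em_z, dtype=float)
    return float(np.mean(np.abs(v_em_y - v_em_z)))
```

The identity term vanishes once the synthetic sample is closer to the source identity than to the emotion donor by at least the margin, while the emotion term vanishes only when the emotion embedding survives the conversion unchanged.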
2.4. Gait-based recognition with data augmentation
We design two data augmentation strategies according to the specific shortcomings of each training dataset. For a small-scale dataset with complete labels, data augmentation is implemented by disentangling and recomposing the emotion and identity feature vectors from different people, as illustrated in Fig. 4. In this strategy, we synthesize each target sample from three alternative emotion vectors and the specific identity vector, generating the same number of samples for each emotion. For a large-scale dataset with restricted labels, data augmentation is implemented by random emotion sampling, as shown in Fig. 5. With a random emotion vector, we can generate different variants of emotion-labeled samples to increase the amount and diversity of the original dataset.
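The two strategies can be sketched at the level of index bookkeeping, leaving out the encoders and generator. In the sketch below, `recombination_pairs` and `random_emotion_vectors` are hypothetical helper names; choosing one random donor per emotion class, and drawing random emotion vectors from a standard normal prior, are assumptions consistent with the description above rather than the exact procedure.

```python
import numpy as np

def recombination_pairs(id_labels, em_labels, rng):
    """Strategy 1: pair each sample's identity with emotion donors.

    Returns (identity_index, donor_index, emotion_label) triples, one per
    emotion class, where the donor belongs to a different person.
    """
    id_labels = np.asarray(id_labels)
    em_labels = np.asarray(em_labels)
    pairs = []
    for i in range(len(id_labels)):
        for label in np.unique(em_labels):
            donors = np.flatnonzero((em_labels == label) & (id_labels != id_labels[i]))
            if donors.size == 0:
                continue  # no other person exhibits this emotion
            pairs.append((i, int(rng.choice(donors)), int(label)))
    return pairs

def random_emotion_vectors(n, dim, rng):
    """Strategy 2: sample emotion vectors from the N(0, I) prior."""
    return rng.standard_normal((n, dim))
```

Each triple would then be turned into a synthetic sample by feeding the identity vector of the first index and the emotion vector of the donor to the generator.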
After applying data augmentation strategies, we can easily train a multitask discriminator on the augmented and original dataset as our recognition model and then assess the quality of these synthetic samples through the discriminator. As illustrated in Fig. 3(c), the discriminator $D_{MTL}$ attempts to discriminate between natural gaits and generated gaits in each conversion process and distinguish the generated data in both of the emotion and identity domains.
3. Experiment
3.1. Data preparation
To evaluate our approach and measure the quality of the synthetic dataset, we conducted several experiments for verification tasks on the public UPCV gait (K1&K2) dataset and multi-class labeled EmoGait3d dataset.
The UPCV gait dataset contains 60 subjects in total from two subsets: UPCV gait K1 [Reference Kastaniotis, Theodorakopoulos, Theoharatos, Economou and Fotopoulos37] and UPCV gait K2 [Reference Kastaniotis, Theodorakopoulos, Economou and Fotopoulos38]. The former contains five gait sequences for each of 30 participants captured using the Microsoft Kinect V1 sensor, and the latter, captured by the Kinect V2 sensor, contains a total of 300 sequences from 30 walkers. Each person walks in a straight line at a normal speed. The sensor maintains a fixed viewpoint in the walking direction at a frame rate of 30 fps. However, samples in UPCV gait are annotated only with identity labels, and their emotion categories can hardly be perceived from walking characteristics. Here, we regard the dataset as a large-scale restricted dataset and annotate all the samples with the emotion label of a neutral state. Because the gait sequences vary in temporal duration, we extract 32-frame subsequences with a three-frame interval from each original sequence. With the pose estimation algorithm, we estimate the joint coordinates from each continuous 32-frame image sequence to obtain a $32\times 20\times 3$ trajectory array as a gait sample. From the UPCV gait dataset, we obtain a set of 15,053 samples as the original dataset. By applying data augmentation with random emotion sampling, each neutral sample can be converted into positive, neutral, and negative samples. We finally obtained a set of $15053\times 3$ synthetic samples as the augmented dataset of UPCV gait.
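The subsequence extraction can be illustrated with a short sketch. It assumes that the "three-frame interval" denotes the stride between the start frames of consecutive 32-frame windows; the function name `extract_windows` is hypothetical.

```python
import numpy as np

def extract_windows(seq, window=32, stride=3):
    """Cut a variable-length gait sequence into fixed-length windows.

    seq: array of shape (frames, joints, 3).
    Returns an array of shape (num_windows, window, joints, 3).
    Assumption: `stride` is the offset between consecutive window starts.
    """
    seq = np.asarray(seq)
    if len(seq) < window:
        # Sequence too short to yield even one window.
        return np.empty((0, window) + seq.shape[1:])
    starts = range(0, len(seq) - window + 1, stride)
    return np.stack([seq[s:s + window] for s in starts])
```

Under this reading, a 50-frame sequence with 20 joints yields seven overlapping samples of shape $32\times 20\times 3$.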
The EmoGait3d dataset is built to validate the effectiveness of the MTL structure by jointly training on multiple gait-related tasks. It consists of 1484 real-world gait videos annotated with identity labels and emotion labels. We recruited 27 volunteers (10 female and 17 male, aged 18–35 years) from campuses and took RGB and depth videos with two Microsoft Azure Kinect DK sensors. Each participant was asked to walk multiple times under three emotions (shown in Fig. 6). Participants’ emotions were elicited by watching emotional movie clips, which were selected prior to the experiments based on their questionnaires. After completing the data collection, subjects were required to rate their emotional state during walking with a value on a scale from 1 to 10. When the emotion evoked by the film was consistent with the subject’s self-assessment emotion, and the rating score was higher than 8, the video could be labeled as the elicited emotion. Otherwise, it would be marked as an invalid video. With the proposed data augmentation method, we generated $1484\times 3$ synthetic emotional samples (shown in Fig. 7), by separating identity and emotion representations from the original EmoGait3d dataset for each of the three emotion categories.
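The labeling rule described above can be written as a small sketch. The emotion names and the function name `label_video` are placeholders chosen for illustration; the threshold of 8 on the 1 to 10 scale follows the text.

```python
from typing import Optional

def label_video(elicited: str, self_assessed: str, rating: int,
                threshold: int = 8) -> Optional[str]:
    """Return the emotion label of a video, or None if the video is invalid.

    A video keeps the elicited emotion as its label only when the subject's
    self-assessed emotion matches it and the rating exceeds the threshold.
    """
    if elicited == self_assessed and rating > threshold:
        return elicited
    return None
```

For instance, a clip meant to elicit a positive emotion that the subject rates 9 as positive is kept, whereas the same clip rated 8, or self-assessed as neutral, is discarded.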
3.2. Implementation details
The network architecture is illustrated in Fig. 3 with details listed in Table I. The encoders take 32-dimensional gait skeleton sequences as input and learn disentangled identity and emotion representations. In the emotion encoder, we apply instance normalization (IN) to remove the identity information while preserving the emotion information. The identity encoder provides the global identity information $\mu _i$ and $\sigma _i$ to the generator through an adaptive instance normalization (AdaIN) layer before activation. $\mu _e$ and $\sigma _e$ denote the channel-wise mean and standard deviation of the emotion feature vector $e$ . The formula for a layer is given as follows:
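The layer equation is not reproduced in this text. As a hedged sketch, the standard AdaIN form, written with the symbols defined above, is $\mathrm{AdaIN}(e) = \sigma_i \frac{e - \mu_e}{\sigma_e} + \mu_i$; whether the layer used here matches this form exactly is an assumption. The channel-first layout and the small constant `eps` are also illustrative choices.

```python
import numpy as np

def adain(e, mu_i, sigma_i, eps=1e-5):
    """Adaptive instance normalization of an emotion feature map.

    e: array of shape (channels, time), the emotion features.
    mu_i, sigma_i: arrays of shape (channels,), the identity statistics.
    The features are first normalized per channel with their own mean and
    standard deviation, then rescaled and shifted with the identity ones.
    """
    e = np.asarray(e, dtype=float)
    mu_i = np.asarray(mu_i, dtype=float)
    sigma_i = np.asarray(sigma_i, dtype=float)
    mu_e = e.mean(axis=-1, keepdims=True)
    sigma_e = e.std(axis=-1, keepdims=True)
    normalized = (e - mu_e) / (sigma_e + eps)
    return sigma_i[:, None] * normalized + mu_i[:, None]
```

After this layer, each channel carries the mean and standard deviation supplied by the identity encoder, while the temporal shape of the emotion features is preserved.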
The generator and encoders are implemented with recurrent layers and 1d convolutional layers to capture temporal dependencies and spatial patterns, respectively. Then, the temporal and spatial features are combined to represent a more discriminative embedding vector to feed the dense layers.
The experiments are conducted on a system with two GTX TITAN XP GPUs. We first train the encoders to learn separated identity and emotion representations from 32-dimensional gait skeletal sequences. The separated features are then combined by dense layers to generate the synthetic emotional sample. We use the Adam optimizer with a learning rate of 0.001. The batch size is set to 128. To reduce overfitting, we use dropout with a rate of 0.5. The discriminator and generator are updated with a 1:5 iteration frequency. We selected the parameters using an early stopping criterion: if the validation error stops improving before the maximum number of training epochs is reached, training is terminated early. We first pretrained the identity and emotion classifiers with $\mathcal{L}^{emo}_{cls}$ and $\mathcal{L}^{id}_{cls}$ in Eqs. (1) and (2) for 10,000 mini-batches. Then we trained the models in stage 1 and stage 2 successively for 30,000 mini-batches and 20,000 mini-batches, respectively. Inference speed is also an important aspect of model evaluation. The preprocessing for pose estimation takes most of the time. The network inference itself is comparatively fast, taking about 0.17 ms per frame. Our model has low complexity but still needs to be optimized for real-world applications.
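The early stopping criterion can be illustrated with a patience-based sketch. The patience mechanism and its default value are assumptions; the text only states that training ends early when the validation error stops improving.

```python
def early_stopping_epoch(val_errors, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of validation errors.

    Training stops at the first epoch that is `patience` epochs past the
    best validation error seen so far; otherwise it runs to the last epoch.
    Assumption: improvement means a strictly lower validation error.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_errors) - 1, best_epoch
```

The parameters saved at `best_epoch` would be the ones selected for evaluation.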
3.3. Objective evaluation
We evaluate the quality of the synthetic samples by comparing the recognition performance on the original and augmented EmoGait3d using the same MTL classifier settings. As shown in Table II, augmenting the original dataset yields noticeable performance improvements of 2.1% and 6.8%. The experimental results show that samples generated by our model carry discriminative information that contributes to consistently higher performance for gait-based identity and emotion recognition. There is no emotion annotation in the original UPCV gait dataset, so emotion recognition results cannot be obtained for it. After data augmentation, however, the UPCV gait dataset is converted into an emotional gait dataset with no significant reduction in the discriminative identity features.
To highlight the effectiveness of our model, we also trained MTL classifiers for identity and emotion recognition using augmented data from CVAE, CGAN, CVAE-GAN, CycleGAN, StarGAN, and MUNIT and compared their performance, as shown in Table III. For a fair comparison, all settings of the baseline generative data augmentation approaches and classifiers are the same as ours. Our model obtains the best results among them. In contrast to these generative models, our model, which employs separated features and a cycle consistency loss, clearly outperforms all the others, especially on the gait emotion recognition task, where it is 1.3% better than the baseline model MUNIT in average recognition accuracy. We can also observe that the model’s performance declines significantly without stage 1 or the disentanglement learning process, which demonstrates the contribution of the two-stage emotional gait conversion model.
Both CVAE and CGAN can generate synthetic data similar to the training data. For CVAE, the generated gait samples are relatively stable, but the curves tend to be straight lines that cheat the discriminator. For CGAN, the diversity of the generated samples is better, but their naturalness is poor. Since CVAE-GAN combines a variational autoencoder with a GAN, the quality of its generated data is better than that of CVAE and CGAN. However, lacking a cycle loss such as that of CycleGAN, the CVAE-GAN model fails to capture the temporal details of gait trajectories. Due to the absence of a feature separation process, the quality of the synthetic samples generated by CycleGAN or StarGAN is also not ideal. MUNIT adopts a weaker form of cycle consistency constraint between the content and style spaces, and its generated samples are deficient in temporal details.
3.4. Subjective evaluation and Discussion
We also performed subjective human evaluations for the synthetic gait. Twenty subjects were given pairs of converted samples in random order and asked which one they preferred in terms of two measures: the naturalness and the similarity in emotional characteristics of the converted gait trajectories. We computed the distance between 600 pairs of synthetic gait trajectories converted from 200 real samples. As shown in Fig. 8, we calculated average preference scores on these synthetic samples from source to target emotion. Higher values indicate higher quality of the synthetic sample after emotional conversion. The proposed model achieves the highest scores in terms of the naturalness and the similarity in emotional characteristics of the converted gait samples.
To evaluate the effect of our model, we further visualize the feature distribution of each emotion class from the original and enhanced EmoGait3d datasets. As shown in Fig. 9, we observe that almost all of the identity and emotion features for each type of synthetic sample are well generated, and the synthetic samples are well aligned with the authentic samples. It shows the effectiveness of learned features intuitively. The well-aligned data distributions are key in increasing the amount and diversity of the original EmoGait3d dataset to achieve improved accuracy for gait emotion recognition.
4. Conclusion
This paper proposes a novel emotional gait conversion model with adversarial loss and cycle consistency loss as a data augmentation method to overcome the insufficient data problem in gait emotion recognition. To our knowledge, this is the first work to realize the mutual transformation between natural gait and emotional gait. With the emotional gait conversion model, we generated numerous synthetic gait samples that enhance the diversity of the original datasets. Experimental results show that our emotion classifiers, trained on the augmented dataset, are competitive with state-of-the-art gait emotion recognition systems. It is expected that integrating emotion recognition as an aspect of nonverbal communication will enhance HRI. We identify only three emotional states from gait information, while human emotions are extremely diverse. In the future, we will gather gait data covering more emotions to investigate the fine-grained space of gait-based emotions. Moreover, different modalities can complement each other to represent more discriminative features. We will try to incorporate appearance information to improve the performance of gait-based recognition.
Author contributions
Weijie Sheng and Xinde Li conceived and designed the study. Weijie Sheng and Xiaoyan Lu conducted data gathering. Weijie Sheng performed statistical analyses. Weijie Sheng and Xiaoyan Lu wrote the article.
Financial support
This work was supported in part by the National Natural Science Foundation of China under Grant 62233003 and 62073072, and in part by the Key Projects of Key R&D Program of Jiangsu Province under Grant BE2020006 and Grant BE2020006-1 and in part by Shenzhen Natural Science Foundation under Grant JCYJ20210324132202005 and JCYJ20220818101206014.
Conflicts of interest
The authors declare that no conflicts of interest exist.
Ethical approval
Not applicable.