
Learning vision-based robotic manipulation tasks sequentially in offline reinforcement learning settings

Published online by Cambridge University Press:  02 May 2024

Sudhir Pratap Yadav*
Affiliation:
iHub Drishti Foundation, Jodhpur, India
Rajendra Nagar
Affiliation:
Department of Electrical Engineering, IIT Jodhpur, Jodhpur, India
Suril V. Shah
Affiliation:
Department of Mechanical Engineering, IIT Jodhpur, Jodhpur, India
*Corresponding author: Sudhir Pratap Yadav; Email: yadav.1@iitj.ac.in

Abstract

With the rise of deep reinforcement learning (RL) methods, many complex robotic manipulation tasks are being solved. However, harnessing the full power of deep learning requires large datasets. Online RL does not fit readily into this paradigm because agent-environment interaction is costly and time-consuming. Therefore, many offline RL algorithms have recently been proposed to learn robotic tasks. However, such methods mainly focus on single-task or multitask learning, which requires retraining whenever a new task must be learned. Continually learning tasks without forgetting previous knowledge, combined with the power of offline deep RL, would allow us to scale the number of tasks by adding them one after another. This paper investigates the effectiveness of regularisation-based methods like synaptic intelligence for sequentially learning image-based robotic manipulation tasks in an offline-RL setup. We evaluate the performance of this combined framework against common challenges of sequential learning: catastrophic forgetting and forward knowledge transfer. We performed experiments with different task combinations to analyse the effect of task ordering. We also investigated the effect of the number of object configurations and the density of robot trajectories. We found that learning tasks sequentially helps in the retention of knowledge from previous tasks, thereby reducing the time required to learn a new task. Regularisation-based approaches for continual learning, like the synaptic intelligence method, help mitigate catastrophic forgetting but have shown only limited transfer of knowledge from previous tasks.

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. Introduction

Robotics has experienced a significant transformation with the integration of deep Reinforcement Learning (RL), which has revolutionised robot capabilities in manipulation tasks. Unlike traditional control architectures that depend on fixed rules and explicit programming, RL enables robots to learn adaptively from observations, modifying their behaviour in response to contextual cues. This advancement allows robots to adjust and optimise actions for new tasks, from handling rigid objects in various industrial operations to managing deformable items. Despite this progress, challenges persist in the scalability and efficiency of these learning models.

A significant challenge emerges when a single agent, such as a domestic robot, must learn new tasks as different situations arise. Existing multitask frameworks lack the flexibility to incorporate new tasks without retraining the agent on all existing tasks. This paper uses a sequential learning approach where the robot acquires tasks one after the other. This method enables the robot to adapt to new circumstances without the necessity for comprehensive retraining. We use offline RL as the base framework to learn a single image-based robotic manipulation task and then use a regularisation-based continual learning approach for learning tasks sequentially. This combined framework forms the main contribution of this work.

We mainly focus on two challenges of continual learning: catastrophic forgetting and forward knowledge transfer. Catastrophic forgetting is the tendency of artificial neural networks to forget previous information when new information is provided. In a continual learning scenario, the neural network’s accuracy on previous tasks drops significantly as it tries to learn a new task. Forward knowledge transfer tries to capture the improvement in the performance of the current task based on the learning from previous tasks. For example, once a robot learns to pick up an object, it can reuse this knowledge in other tasks that require picking up objects without the need to learn it again. An effective continual learning algorithm should be able to increase its performance on the current task (based on previous tasks) while maintaining its performance on previous tasks.

The developed framework, combining offline reinforcement learning with synaptic intelligence for continual learning, offers a promising approach to overcoming challenges associated with adapting robots to new tasks. Additionally, we analyse the effects of task ordering and the number of object configurations on both forgetting and the knowledge transfer between tasks.

1.1. Related work

In the past several years, the field of robotics has witnessed substantial progress, particularly marked by robots gaining a variety of manipulation skills via deep Reinforcement Learning (RL). The introduction of the Deep Q-Network (DQN) [Reference Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra and Riedmiller1] marked a significant advancement in robotic manipulation, paving the way for the development of advanced end-to-end policy training methods [Reference Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan and Vanhoucke2–Reference Haarnoja, Pong, Zhou, Dalal, Abbeel and Levine5].

1.1.1 Single-task RL

These developments enabled robots to effectively perform a range of rigid object manipulations, including pick-and-place operations [Reference Gualtieri, ten Pas and Platt6, Reference Berscheid, Meißner and Kröger7], stacking [Reference Lee, Devin, Zhou, Lampe, Bousmalis, Springenberg, Byravan, Abdolmaleki, Gileadi and Khosid8], sorting [Reference Bao, Zhang, Peng, Shao and Song9], insertion tasks [Reference Wu, Zhang, Qin and Xu10–Reference Schoettler, Nair, Luo, Bahl, Ojea, Solowjow and Levine12], as well as more complex challenges such as opening doors [Reference Nemec, Žlajpah and Ude13], opening cabinets [Reference Chen, Zeng, Wang, Lu and Yang14], using electric drills [Reference Sun, Naito, Namiki, Liu, Matsuzawa and Takanishi15], and completing assembly tasks [Reference Apolinarska, Pacher, Li, Cote, Pastrana, Gramazio and Kohler16–Reference Kulkarni, Kober, Babuška and Santina18]. Additionally, the application of RL expanded to include the manipulation of deformable objects, such as manipulating ropes [Reference Nair, Chen, Agrawal, Isola, Abbeel, Malik and Levine19] and folding clothes [Reference Lee, Ward, Cosgun, Dasagi, Corke and Leitner20], with learning systems typically acquiring these skills from task-specific datasets.

1.1.2 Multitask RL

However, for such an approach to be more effective across a broader range of tasks, it necessitates the collection and use of specific data for each task, along with training distinct networks. A promising solution to this challenge is multitask reinforcement learning (RL). In this approach, an agent undergoes simultaneous training across multiple tasks. Throughout the training process, the algorithm has access to data (sampled trajectories) from all tasks and optimises them jointly. This strategy enables the agent to develop generalised skills applicable across various tasks. Multitask RL has been applied successfully to learn robotic manipulation tasks [Reference Gupta, Yu, Zhao, Kumar, Rovinsky, Xu, Devlin and Levine21–Reference Yu, Kumar, Gupta, Levine, Hausman and Finn25]. However, in multitask RL, the set of tasks and the distribution of task-related data remain constant. Consequently, the agent requires retraining from the beginning for any new task, even if there is significant overlap with previously learned tasks. Such a requirement for retraining renders the scaling of this approach to human-equivalent mastery of all manipulation tasks impractical. In contrast, humans leverage their experience from prior tasks to facilitate new task learning, avoiding the need to start from scratch. The sequential (or continual) learning model attempts to address this limitation by providing a framework where an agent learns tasks sequentially. As a result, when encountering a new task, the agent does not require complete retraining.

1.1.3 Online continual RL

Continual learning research began with a focus on classification tasks using datasets like MNIST and CIFAR [Reference Goodfellow, Mirza, Xiao, Courville and Bengio26–Reference Li and Hoiem28]. Recently, the field has expanded to include continual reinforcement learning (RL), with applications in Atari games [Reference Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, Hassabis, Clopath, Kumaran and Hadsell29] and GYM environments [Reference Brockman, Cheung, Pettersson, Schneider, Schulman, Tang and Zaremba30], and extended to robotic manipulation tasks [Reference Wołczyk, Zajac, Pascanu, Kuciński and Miłoś31], [Reference Caccia, Mueller, Kim, Charlin and Fakoor32]. Wołczyk et al. [Reference Wołczyk, Zajac, Pascanu, Kuciński and Miłoś31] introduced a benchmark for continual learning in robotic manipulation, providing baselines for key continual learning methods in online RL settings, particularly using the soft actor-critic (SAC) method [Reference Haarnoja, Zhou, Abbeel and Levine33]. However, this research primarily focuses on online-continual RL with low-dimensional observation spaces, such as joint and task space data, under the assumption of complete access to the simulator. In contrast, our study emphasises offline-continual RL with high-dimensional observation space, specifically images, in the sequential learning of robotic manipulation tasks.

1.1.4 Catastrophic forgetting

In sequential deep reinforcement learning, neural networks trained on one task often experience performance degradation when retrained on another, a phenomenon known as catastrophic forgetting. This issue, central to connectionist models, arises from a stability-plasticity dilemma: if we strive to make the network flexible enough to accommodate new information, it tends to lose its stability, consequently resulting in a degradation of its performance on previously learned tasks. Conversely, if we lean towards enhancing its stability, the network may struggle to effectively acquire the new task, as presented in ref. [Reference French34]. One strategy to address catastrophic forgetting involves the adoption of penalty-based methods. These methods apply constraints on neural network parameters, ensuring that the weights of the neural network stay closer to the solutions derived from previous tasks. Notably, Elastic Weight Consolidation (EWC) [Reference Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, Hassabis, Clopath, Kumaran and Hadsell29] and Synaptic Intelligence (SI) [Reference Zenke, Poole and Ganguli35] have made significant contributions in this area. Kirkpatrick et al. (2017) [Reference Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, Hassabis, Clopath, Kumaran and Hadsell29] introduced EWC, offering a regularisation-based solution to catastrophic forgetting; however, its calculation of parameter importance is not localised. This paper adopts the Synaptic Intelligence approach, as proposed by Zenke et al. (2017) [Reference Zenke, Poole and Ganguli35], due to its localised assessment of synaptic importance (weights in Neural Networks). The local nature of this computation aids in maintaining solution generality, unaffected by specific problem characteristics. 
Furthermore, SI boasts advantages in computational speed and simplicity of implementation compared to EWC, which necessitates the computation of the Fisher Information Matrix.

Singh et al. (2020) [Reference Singh, Yu, Yang, Zhang, Kumar and Levine36] applied Offline-RL to image-based manipulation tasks, focusing on initial condition generalisation without exploring sequential task learning. In contrast, to the best of our knowledge, this study is the first to explore sequential learning in image-based robotic manipulation within offline-RL settings.

2. Learning image-based robotic manipulation tasks sequentially

In this section, we formulate our RL agent and environment interaction setup to learn robotic manipulation tasks. We then discuss the problem of sequential task learning and present an approach to solve this problem.

2.1. RL formulation for learning image-based robotic manipulation tasks

Agent and environment interaction is formally defined by the Markov Decision Process (MDP). A Markov Decision Process is a discrete-time stochastic control process. In RL, we formally define the MDP as a tuple $\langle \mathcal{S},\mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ . Here, $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $\mathcal{P}$ is the state transition probability matrix, $\mathcal{R}$ is the set of rewards for all state-action pairs, and $\gamma$ is the discount factor. A stochastic policy is defined as a probability distribution over actions given the state, that is, the probability of taking each action in every state: $\pi (\boldsymbol{a}_t|\boldsymbol{s}_t) = \mathbb{P}[\boldsymbol{a}_t \in \mathcal{A}|\boldsymbol{s}_t \in \mathcal{S}]$ . We formulate the vision-based robotic manipulation tasks in the MDP framework as below.

  • Environment: It consists of a WidowX 250, a five-axis robot arm equipped with a gripper. We place a table in front of the robot and a camera in the environment in an eye-to-hand configuration. Every task consists of an object placed on the table, which needs to be manipulated to complete the task successfully.

  • State: The state $\boldsymbol{s}_t$ represents the RGB image of the environment captured at time step $t$ . We use images of size $48\times 48\times 3$ .

  • Action: We define the action at the time step $t$ as a 7-dimensional vector $\boldsymbol{a}_t=\begin{bmatrix}\Delta \boldsymbol{x}_t&\Delta \boldsymbol{o}_t&g_t\end{bmatrix}^\top$ . Here, $\Delta \boldsymbol{x}_t\in \mathbb{R}^3$ , $\Delta \boldsymbol{o}_t\in \mathbb{R}^3$ , $g_t\in \{0,1\}$ denotes the change in position, change in orientation, and gripper command (open/close), respectively, at time step $t$ .

  • Reward: The reward $r(\boldsymbol{s}_t,\boldsymbol{a}_t)\in \{0,1\}$ is a binary variable which is equal to 1 if the task is successful and 0 otherwise.

The reward is kept simple and not shaped according to the tasks so that the same reward framework can be used while scaling for a large number of tasks. Also, giving a reward at each time step, instead of at the end of the episode, makes the sum of rewards during an episode dependent on time steps. Therefore, if the agent completes a task in fewer steps, the total reward for that episode will be more.
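The dependence of the episode return on completion time can be illustrated with a minimal sketch. It assumes that a reward of 1 is given on every step from the first successful step until the end of the episode (as described for the press-button data-collection policy in Section 4.1.3) and uses the episode length of 20 from the same section. The function name `episode_return` and the example step indices are hypothetical.

```python
def episode_return(success_step, horizon=20):
    """Undiscounted return under a per-step binary reward.

    success_step: 0-based index of the first step at which the task counts
        as successful, or None if the task is never completed.
    horizon: episode length (20 steps in the datasets described later).
    """
    if success_step is None:
        return 0
    # Reward is 1 on every step from success onward, 0 before that.
    rewards = [1 if t >= success_step else 0 for t in range(horizon)]
    return sum(rewards)


# An agent that succeeds earlier accumulates a larger return.
fast = episode_return(success_step=5)
slow = episode_return(success_step=15)
never = episode_return(success_step=None)
```

With these illustrative inputs, the faster episode collects three times the return of the slower one, which is the effect the paragraph above describes.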

2.2. Sequential learning problem and solution

We define the sequential task learning problem as follows. The agent is required to learn $N$ tasks, with the condition that the tasks are given to the agent sequentially and not simultaneously. Therefore, when the agent is learning to perform a particular task, it can only access the data of the current task. This learning process resembles how a human learns. Let a sequence of robotic manipulation tasks $T_1, T_2, \ldots, T_N$ be given. We assume that each task has the same type of state and action space. Each task has its own data in the typical offline reinforcement learning format $\langle \boldsymbol{s}_t, \boldsymbol{a}_t, r_t, \boldsymbol{s}_{t+1}\rangle$ . The agent has to learn a single policy $\pi$ , a mapping from state to action, for all tasks. If we naively train a neural network in a sequential manner, the problem of catastrophic forgetting will occur, which means performance on the previous task will decrease drastically as soon as the neural network starts learning a new task.

We use a regularisation-based approach presented in ref. [Reference Zenke, Poole and Ganguli35] to mitigate the problem of catastrophic forgetting. Fig. 1 describes the framework we developed to solve this problem.

3. Integrating sequential task learning with offline RL

In this section, we first discuss the SAC-CQL [Reference Kumar, Zhou, Tucker and Levine37] offline RL algorithm used for learning a single robotic manipulation task and its implementation details. We then discuss the Synaptic Intelligence (SI) regularisation method for continual learning and provide details to integrate these methods to learn sequential robotic manipulation tasks.

Figure 1. The SAC-CQL-SI method (soft actor-critic - conservative Q-learning - synaptic intelligence) for sequential learning is depicted in the block diagram. This method begins with a dataset comprising tasks 1 to N, from which one task is selected sequentially, as denoted by the task index $\mathbf{k}$ . Initially, the algorithm samples a mini-batch from the dataset of the current task. This batch is then processed by the soft actor-critic conservative Q-learning (SAC-CQL) algorithm, which calculates the losses for both the actor (policy network) and the critic (Q-network). In instances where the task index exceeds one, quadratic regularisation [Reference Zenke, Poole and Ganguli35] is integrated into the actor loss to mitigate forgetting. These calculated losses are subsequently used to update the neural networks that embody the policy (actor-network) and the Q-value function (critic network). The process involves continuous sampling of subsequent batches, with training on the current task persisting until a predetermined number of training steps is reached. After completing training for a task, the agent transitions to the next task by loading its data and incrementing the task index. This cycle is repeated, allowing the agent to progress through and learn each task successively.
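The control flow described in the Figure 1 caption can be sketched as a short loop. This is only a skeleton under stated assumptions: `train_sequentially` and the `update_fn` callback are hypothetical names, `update_fn` stands in for one SAC-CQL update of the actor and critic, and integers stand in for transitions in the toy usage.

```python
import random


def train_sequentially(task_datasets, steps_per_task, batch_size, update_fn, seed=0):
    """Run the task-by-task loop described in the Figure 1 caption.

    task_datasets: list with one list of transitions per task (tasks 1..N).
    update_fn(batch, task_index, use_si): hypothetical callback standing in
        for one SAC-CQL update of the actor and critic networks.
    Returns a list of (task_index, use_si) pairs, one per completed task.
    """
    rng = random.Random(seed)
    completed = []
    for k, dataset in enumerate(task_datasets, start=1):
        # The quadratic SI term is added to the actor loss only when k > 1.
        use_si = k > 1
        for _ in range(steps_per_task):
            batch = rng.sample(dataset, min(batch_size, len(dataset)))
            update_fn(batch, k, use_si)
        completed.append((k, use_si))
    return completed


# Toy usage with integers standing in for transitions.
calls = []
completed = train_sequentially(
    task_datasets=[[1, 2, 3], [4, 5, 6]],
    steps_per_task=2,
    batch_size=2,
    update_fn=lambda batch, k, use_si: calls.append((k, use_si, len(batch))),
)
```

Only the current task's dataset is touched inside the inner loop, which mirrors the constraint that data from other tasks is unavailable during training.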

3.1. SAC-CQL algorithm for offline RL

There are two frameworks, namely online and offline learning, to train an RL agent. In the case of an online RL training framework, an RL agent interacts with the environment to collect experience, update itself (train), interact again, and so on. Simply put, the environment is always available for the RL agent to evaluate and improve itself further. This interaction loop is repeated for many episodes during training until the RL agent gets good enough to perform the task successfully. This dynamic approach allows the agent to adapt to unforeseen circumstances but may be computationally expensive and less sample-efficient. In offline RL settings, by contrast, we collect data once, after which no further interaction with the environment is required. These data can be collected by executing a hand-designed policy or by a human controlling the robot (human demonstration). The data are a sequence of $\langle \boldsymbol{s}_t, \boldsymbol{a}_t, r_t, \boldsymbol{s}_{t+1}\rangle$ tuples. Offline RL poses unique challenges such as data distribution shifts and the need to balance exploration and exploitation without the ability to collect additional data during the learning process. While it may lack the adaptability of online RL, offline RL is computationally efficient and allows for systematically exploring individual tasks with carefully curated datasets.

In recent years, SAC [Reference Haarnoja, Zhou, Abbeel and Levine33] has emerged as one of the more robust methods for training RL agents in continuous action spaces (when the action is a real vector), which typically is the case in robotics. SAC is an off-policy entropy-based actor-critic method for continuous action MDPs. Entropy-based methods add an entropy term to the existing optimisation goal of maximising expected reward. In addition to maximising expected reward, the RL agent also needs to maximise the entropy of the overall policy. This makes the policy inherently exploratory and helps prevent it from getting stuck in a local optimum. Haarnoja et al. [Reference Haarnoja, Zhou, Abbeel and Levine33] define the RL objective in maximum entropy RL settings as in (1).

(1) \begin{equation} J(\pi ) = \sum \limits _{t=0}^{T} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t)\sim \rho _\pi }[r(\mathbf{s}_t, \mathbf{a}_t)+\alpha \mathcal{H}(\pi (\!\cdot |\mathbf{s}_t))]. \end{equation}

Here, $\rho _\pi (\mathbf{s}_t, \mathbf{a}_t)$ denotes the joint distribution of the state and actions over all trajectories that the agent could take and $\mathcal{H}(\pi (\!\cdot |\mathbf{s}_t))$ is the entropy of the policy for state $\mathbf{s}_t$ as defined in (2). $\alpha$ is the temperature parameter controlling the entropy in the policy. $\mathbb{E}$ represents the expectation over all state and action pairs sampled from the trajectory distribution $\rho _\pi$ . Overall, this objective function tries to maximise the expected sum of rewards along with the entropy of the policy.

(2) \begin{equation} \mathcal{H}(\pi (\!\cdot |\mathbf{s}_t)) = \mathbb{E}[-\text{log}(f_\pi (\!\cdot |\mathbf{s}_t))]. \end{equation}

Here, $\pi (\!\cdot |\mathbf{s}_t)$ is a probability distribution over actions and $f_\pi (\!\cdot |\mathbf{s}_t)$ is the probability density function of the policy $\pi$ . We have selected a Gaussian distribution to represent the policy.
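As an illustration of (2), the sketch below estimates the entropy of a one-dimensional Gaussian by averaging $-\text{log}\, f_\pi$ over samples and compares it with the standard closed form $\frac{1}{2}\text{log}(2\pi e\sigma ^2)$. It covers only a plain, unsquashed Gaussian in a single action dimension; the function names, sample count and the values of $\mu$ and $\sigma$ are assumptions made for the example.

```python
import math
import random


def gaussian_log_pdf(a, mu, sigma):
    """Log-density of a univariate Gaussian N(mu, sigma^2) at a."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)


def entropy_closed_form(sigma):
    """Differential entropy of a univariate Gaussian."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def entropy_monte_carlo(mu, sigma, n_samples=100_000, seed=0):
    """Estimate E[-log f(a)] with samples drawn from the policy itself."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mu, sigma)
        total += -gaussian_log_pdf(a, mu, sigma)
    return total / n_samples


estimate = entropy_monte_carlo(mu=0.0, sigma=0.5)
exact = entropy_closed_form(sigma=0.5)
```

A larger $\sigma$ gives a larger entropy, so the entropy bonus in (1) rewards policies that keep their action distribution wide.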

SAC provides an actor-critic framework where the actor separately represents the policy, and the critic only helps in improving the actor, thus limiting the role of the critic to training. As our state ( $\boldsymbol{s}$ ) is an image, we use convolutional neural networks (CNNs) to represent both actor and critic. Also, instead of using a single Q-value network for the critic, we use two Q-value networks and take their minimum to better estimate the Q-value, as proposed in ref. [Reference Van Hasselt, Guez and Silver38]. To stabilise the learning, we use two more neural networks to represent target Q-values for each critic network, as described in DQN [Reference Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra and Riedmiller1]. Therefore, we use five CNNs to implement the SAC algorithm: one actor, two critics, and two target critics.

Figure 2. CNN architecture of policy network. It has three convolutional layers with max-pooling following the first two. The convolutional layers are followed by a four-layer MLP (multilayer perceptron). The output layer is multiheaded, with the number of heads corresponding to the number of tasks. Based on the current task index, only one head is enabled during training and testing. Given an input state (an RGB image), the network produces two 7-dimensional vectors, $\mu$ and $\sigma$ , representing the mean and standard deviation of the stochastic policy modelled by a Gaussian distribution.

Figure 3. The Q-network’s CNN architecture is similar to that of the policy network, with two notable distinctions: firstly, the action vector is incorporated into the first layer of the MLP, and secondly, the network’s output is a scalar Q-value. This network takes state (an RGB image) and action (a 7-dimensional vector) and produces scalar Q-value, that is, $Q(\mathbf{s},\mathbf{a})$ .

Figures 2 and 3 illustrate the architectures of the policy and the Q-value neural networks, respectively. As we parameterise the policy and Q-value using neural networks, $\phi$ represents the set of weights of the policy network. $\theta _1$ , $\theta _2$ , $\hat{\theta }_1$ , and $\hat{\theta }_2$ represent the set of weights of two Q-value networks and two target Q-value networks for the critic, respectively. Since our policy is stochastic, we use a tanh-Gaussian policy, as used in ref. [Reference Singh, Yu, Yang, Zhang, Kumar and Levine36]. The policy network takes the state as input and outputs the mean ( $\mu$ ) and standard deviation ( $\sigma$ ) of the Gaussian distribution for each action dimension. The action is then sampled from this distribution and passed through the tanh function to bound it within $(\!-\!1, 1)$ . The target Q-value is defined as

(3) \begin{equation} \hat{Q}_{\hat{\theta }_1,\hat{\theta }_2}(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}) = r_t + \gamma \mathbb{E}_{(\mathbf{s}_{t+1} \sim \mathcal{D}, \mathbf{a}_{t+1} \sim \pi _\phi (\!\cdot |\mathbf{s}_{t+1}))}[\hat{Q}_{\text{min}}(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}) - \alpha \text{log}(\pi _\phi (\mathbf{a}_{t+1}|\mathbf{s}_{t+1}))], \end{equation}

where $\hat{Q}_{\text{{min}}}$ represents the minimum Q-value of both target Q-networks and is given by

(4) \begin{equation} \hat{Q}_{\text{{min}}}(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}) = \text{ min}[Q_{\hat{\theta }_1}(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}), Q_{\hat{\theta }_2}(\mathbf{s}_{t+1}, \mathbf{a}_{t+1})] \end{equation}
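For a single sampled next action, (3) and (4) reduce to a few arithmetic operations, sketched below. The function name `target_q` and the default values of $\gamma$ and $\alpha$ are placeholders chosen for illustration, not values reported in this paper.

```python
def target_q(reward, q1_target_next, q2_target_next, log_prob_next,
             gamma=0.99, alpha=0.2):
    """Single-sample estimate of the target in (3).

    q1_target_next, q2_target_next: outputs of the two target Q-networks
        at (s_{t+1}, a_{t+1}), with a_{t+1} sampled from the current policy.
    log_prob_next: log pi(a_{t+1} | s_{t+1}) for that sampled action.
    gamma, alpha: placeholder discount factor and temperature.
    """
    # Equation (4): pessimistic estimate from the two target networks.
    q_min = min(q1_target_next, q2_target_next)
    # Equation (3): reward plus discounted soft value of the next state.
    return reward + gamma * (q_min - alpha * log_prob_next)


example = target_q(reward=1.0, q1_target_next=10.0, q2_target_next=8.0,
                   log_prob_next=-2.0)
```

Taking the minimum of the two target estimates counteracts overestimation, while the $-\alpha \,\text{log}\,\pi$ term raises the target for actions the policy considers less likely.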

The target Q-value in (3) is then used to calculate Q-loss for each critic network as

(5) \begin{equation} J_Q({\theta _i}) = \frac{1}{2} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \mathcal{D}}[(\hat{Q}_{\hat{\theta }_1,\hat{\theta }_2}(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}) - Q_{\theta _i}(\mathbf{s}_t, \mathbf{a}_t))^2], \end{equation}

where $i\in \{1,2\}$ , $\mathbf{a}_t^\pi$ is the action sampled from policy $\pi _\phi$ for state $\mathbf{s}_t$ and $\mathcal{D}$ represents the current task data. Further, policy-loss for actor-network is defined as

(6) \begin{equation} J_\pi (\phi ) =\mathbb{E}_{(\mathbf{s}_t \sim \mathcal{D}, \mathbf{a}_t \sim \pi _\phi (\!\cdot |\mathbf{s}_t))}[\alpha \text{log}(\pi _\phi (\mathbf{a}_t|\mathbf{s}_t)) -\text{ min}[Q_{\theta _1}(\mathbf{s}_t, \mathbf{a}_t^\pi ), Q_{\theta _2}(\mathbf{s}_t, \mathbf{a}_t^\pi )]] \end{equation}
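A minimal sketch of how one sample of the integrand in (6) could be evaluated with a tanh-Gaussian policy is given below. It assumes independent action dimensions and applies the usual change-of-variables correction for the tanh squashing, $\text{log}\,\pi (\mathbf{a}|\mathbf{s}) = \text{log}\,\mathcal{N}(\mathbf{u}) - \sum _j \text{log}(1-\tanh ^2(u_j))$. The function names, the small constant inside the logarithm and the value of $\alpha$ are assumptions made for the example; the Q-values are passed in as plain numbers rather than computed by a network.

```python
import math
import random


def sample_tanh_gaussian(mu, sigma, rng):
    """Sample a bounded action and its log-probability, one dimension at a time."""
    action, log_prob = [], 0.0
    for m, s in zip(mu, sigma):
        u = rng.gauss(m, s)          # pre-squash Gaussian sample
        a = math.tanh(u)             # squashed into (-1, 1)
        log_n = -0.5 * math.log(2 * math.pi * s * s) - (u - m) ** 2 / (2 * s * s)
        # Change-of-variables term for the tanh squashing (1e-6 avoids log(0)).
        log_prob += log_n - math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob


def policy_loss_sample(log_prob, q1, q2, alpha=0.2):
    """Single-sample value of the integrand in (6)."""
    return alpha * log_prob - min(q1, q2)


rng = random.Random(0)
action, log_prob = sample_tanh_gaussian(mu=[0.0] * 7, sigma=[1.0] * 7, rng=rng)
loss = policy_loss_sample(log_prob=-3.0, q1=5.0, q2=4.0)
```

Minimising this quantity pushes the policy towards actions with a high pessimistic Q-value while the log-probability term keeps the action distribution from collapsing.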

For offline-RL, we use the non-Lagrange version of the conservative Q-learning (CQL) approach proposed in ref. [Reference Kumar, Zhou, Tucker and Levine37] as it only requires adding a regularisation loss to already well-established continuous RL methods like SAC. Upon adding this CQL-loss to (5), total Q-loss becomes

(7) \begin{equation} J_Q^{\text{total}}({\theta _i}) = J_Q({\theta _i})+\alpha _{\text{cql}} \mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}}[\text{log}\sum \limits _{\mathbf{a}_t} \text{exp}(Q_{\theta _i}(\mathbf{s}_t, \mathbf{a}_t))-\mathbb{E}_{\mathbf{a}_t \sim \mathcal{D}}[Q_{\theta _i}(\mathbf{s}_t, \mathbf{a}_t)]], \end{equation}

where $i\in \{1,2\}$ , $\alpha _{\text{cql}}$ controls the amount of CQL-loss to be added to Q-loss to penalise actions that are too far away from the existing trajectories, thus keeping the policy conservative in the sense of exploration. These losses are then used to update actor and critic networks using the Adam [Reference Kingma and Ba39] optimisation algorithm.
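The CQL term in (7) can be sketched for a single state as follows. In a continuous action space the sum over actions cannot be enumerated, so this sketch assumes it is approximated with a finite set of sampled actions; how the actions are sampled is not specified here, and the function names and the default $\alpha _{\text{cql}}$ are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_sampled_actions, q_dataset_action, alpha_cql=1.0):
    """Per-state CQL regulariser from (7).

    q_sampled_actions: Q-values of actions sampled for this state, standing
        in for the sum over all actions (an approximation).
    q_dataset_action: Q-value of the action actually stored in the dataset.
    """
    return alpha_cql * (logsumexp(q_sampled_actions) - q_dataset_action)


# The penalty shrinks as the dataset action's Q-value rises relative to
# the Q-values of other actions.
high = cql_penalty([1.0, 2.0, 3.0], q_dataset_action=3.0)
low = cql_penalty([1.0, 2.0, 3.0], q_dataset_action=5.0)
```

Minimising this term lowers Q-values for actions outside the dataset and raises them for dataset actions, which is what keeps the learned policy close to the collected trajectories.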

3.2. Applying synaptic intelligence in offline RL

Synaptic intelligence is a regularisation-based algorithm proposed in ref. [Reference Zenke, Poole and Ganguli35] for sequential task learning. It regularises the loss function of a task with a quadratic loss function as defined in (8) to reduce catastrophic forgetting.

(8) \begin{equation} L_\mu = \sum \limits _{k} \Omega ^\mu _k(\tilde{\phi }_k-\phi _k)^2 \end{equation}

Here, $L_\mu$ is the SI loss for the current task being learned with index $\mu$ , $\phi _k$ is the $k$ -th weight of the policy network, and $\tilde{\phi }_k$ is the reference weight corresponding to the policy network parameters at the end of the previous task. $\Omega ^\mu _k$ is the per-parameter regularisation strength. For more details on calculating $\Omega ^\mu _k$ , refer to [Reference Zenke, Poole and Ganguli35]. The SI algorithm penalises neural network weights based on their contributions to the change in the overall loss function. Weights that contributed more to the previous tasks are penalised more and thus do not deviate much from their original values, while the other weights help in learning new tasks. SI defines the importance of a weight as the sum, over the training trajectory, of the product of its gradient and its update, as this approximates the contribution of that weight to the reduction in the overall loss function. We use a similar approach to apply SI to offline RL as presented in ref. [Reference Wołczyk, Zajac, Pascanu, Kuciński and Miłoś31]. Although the authors did not use SI or offline RL, the approach is similar to applying any regularisation-based continual learning method for the actor-critic RL framework. We regularise the actor to reduce forgetting on previous tasks while learning new tasks using offline reinforcement learning. We add the quadratic loss as defined in ref. [Reference Zenke, Poole and Ganguli35] to the policy-loss term in the SAC-CQL algorithm. The overall policy-loss then becomes as described in (9)

(9) \begin{equation} J_{\pi }^{\text{total}}(\phi ) = J_\pi (\phi ) + c L_\mu \end{equation}

Here, $c$ is the regularisation strength. Another aspect of continual learning is providing the current task index to the neural network. There are many approaches to tackle this problem, from 1-hot encoding to recognising the task from context. We chose the most straightforward option of a multi-head neural network. Each head of the neural network represents a separate task. Therefore, we select the head corresponding to the given task.
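Returning to the regulariser, (8) and (9) can be sketched on flat lists of weights as below. The third function shows one way the per-parameter strength might be accumulated at the end of a task, following the form given in the SI paper (path integral divided by the squared total weight change plus a damping constant); this part should be checked against ref. [Reference Zenke, Poole and Ganguli35]. All function names and the damping value `xi` are assumptions made for the example.

```python
def si_penalty(phi, phi_ref, omega):
    """Quadratic surrogate loss of (8): sum_k Omega_k * (phi_ref_k - phi_k)^2."""
    return sum(o * (r - p) ** 2 for p, r, o in zip(phi, phi_ref, omega))


def total_policy_loss(j_pi, phi, phi_ref, omega, c):
    """Regularised policy loss of (9): J_pi + c * L_mu."""
    return j_pi + c * si_penalty(phi, phi_ref, omega)


def accumulate_omega(omega, path_integral, phi_start, phi_end, xi=1e-3):
    """Assumed end-of-task update of the per-parameter strengths.

    path_integral: per-weight running sum of -(gradient * weight update)
        collected while training the task that has just finished.
    xi: damping constant that avoids division by zero (placeholder value).
    """
    return [o + w / ((e - s) ** 2 + xi)
            for o, w, s, e in zip(omega, path_integral, phi_start, phi_end)]


# Only the first weight moved away from its reference, so only it is penalised.
penalty = si_penalty(phi=[1.0, 2.0], phi_ref=[0.5, 2.0], omega=[4.0, 10.0])
loss = total_policy_loss(j_pi=-2.0, phi=[1.0, 2.0], phi_ref=[0.5, 2.0],
                         omega=[4.0, 10.0], c=0.5)
new_omega = accumulate_omega(omega=[0.0, 0.0], path_integral=[0.2, 0.0],
                             phi_start=[0.0, 0.0], phi_end=[1.0, 0.0])
```

Because the penalty is a weighted squared distance, weights with a large $\Omega ^\mu _k$ are held near their reference values while weights with a small $\Omega ^\mu _k$ remain free to adapt to the new task.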

4. Experiments, results, and discussion

In this section, we first discuss the RL environment setup and provide details of data collection for offline RL. Further, we evaluate the performance of SI with varying numbers of object configurations and densities for different task ordering.

4.1. Experimental setup

Our experimental setup is based on a simulated environment, Roboverse, used in ref. [Reference Singh, Yu, Yang, Zhang, Kumar and Levine36]. It is a GYM [Reference Brockman, Cheung, Pettersson, Schneider, Schulman, Tang and Zaremba30]-like environment based upon the open-source physics simulator PyBullet [Reference Erwin and Yunfei40]. We collected data for three tasks using this simulated environment.

4.1.1. Object space

We define object space as a subset of the robot’s workspace where the task’s target object is to be placed. In our case, it is a rectangular area on the table in front of the robot. When initialising the task, the target object is randomly placed in the object space. In Fig. 4, the object space of all three tasks is visible as a rectangle on the table.

Table I. Details of the collected 6 datasets for each task. The number of trajectories decreases as the area of the object space decreases to maintain a consistent object-space density.

4.1.2. Tasks definitions

We selected three tasks for all our experiments with some similarities. In each task, an object is placed in front of the robot on the table. At the start of every new episode, the object’s initial position is randomly changed within the object-space area. Completion of a task requires some interaction with the object. Task definitions are given below.

  1. Press Button: A button is placed in the object space. The objective of the task is to press the button. This is the easiest task, as it can be seen as a go-to-goal task where the goal point is the point on the button in the pressed configuration.

  2. Pick Shed: This task aims to pick the object successfully. Thus, the robot also needs to learn to close the gripper at a specific position, apart from reaching the object.

  3. Open Drawer: The objective of this task is to open the drawer.

4.1.3. Data collection

To examine the impact of the object-space area and object placement density, we collected a dataset for each combination of the object-space area ( $40\,\textrm{cm}^{2}$ , $360\,\textrm{cm}^{2}$ , and $1000\,\textrm{cm}^{2}$ ) and the object placement density (10 and 20 placements per $\textrm{cm}^{2}$ ), resulting in a total of six datasets for each task. Table I shows the quantitative details of the six datasets for each task: the number of trajectories and images collected for all combinations of object density and size of area. The length of a single episode (single trajectory) is 20; therefore, 20 data points per trajectory are collected. The format of each data point is $\langle \boldsymbol{s}_t, \boldsymbol{a}_t, r_t, \boldsymbol{s}_{t+1}\rangle$ , where $\boldsymbol{s}_t$ is a $48\times 48\times 3$ RGB image. Fig. 4 displays trajectories (in green colour) and reward distribution across object space in the dataset for all three tasks, with an object-space area of 360 cm $^2$ and a density of 20 object configurations per cm $^2$ . It can be seen that when the object is placed closer to the robot, the reward is high as the task is completed in a few steps, while it becomes low as the object moves away.
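The data-point format and the per-trajectory count above can be written down as a small sketch. The `Transition` type, the helper names and the trajectory count of 100 are hypothetical, and the byte estimate assumes each image is stored as unsigned 8-bit values and counts only the state images, which is not stated in the text.

```python
from typing import NamedTuple, Sequence

IMAGE_SHAPE = (48, 48, 3)   # height, width, RGB channels
ACTION_DIM = 7              # position delta, orientation delta, gripper command
EPISODE_LENGTH = 20         # data points per trajectory


class Transition(NamedTuple):
    """One data point <s_t, a_t, r_t, s_{t+1}>."""
    state: Sequence        # image of shape IMAGE_SHAPE
    action: Sequence       # vector of length ACTION_DIM
    reward: int            # 0 or 1
    next_state: Sequence   # image of shape IMAGE_SHAPE


def num_transitions(num_trajectories):
    """Number of data points for a given number of trajectories."""
    return num_trajectories * EPISODE_LENGTH


def raw_state_bytes(num_trajectories):
    """Storage for the state images alone, assuming one byte per channel."""
    height, width, channels = IMAGE_SHAPE
    return num_transitions(num_trajectories) * height * width * channels


# Hypothetical dataset of 100 trajectories.
points = num_transitions(100)
size = raw_state_bytes(100)
```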

Figure 4. Top row displays sampled trajectories and bottom row displays the scatter plot of the reward distribution for tasks button-press, pick-shed and open drawer with object-space area 1000 $\text{cm}^2$ and density 20 object configurations per $\text{cm}^2$ . The robot’s base is at (0.6m, 0.0m), which is shown as a black semicircle.

Hand-designed policies

We collect data by employing hand-designed policies. The core of each hand-designed policy is action selection, where the policy decides the next action based on the robot’s current state and the task’s requirements. These actions include moving the end-effector, closing/opening the gripper, lifting the end-effector upward, or halting all movement. The policy has full access to the simulation and thus can use various parameters required for task completion, which are otherwise unavailable to the RL agent. We use the following hand-designed policies:

  1. Press-button Policy: The policy first calculates the distance between the gripper and the button. It moves the gripper towards the top of the button until the distance falls below a certain threshold. Once the gripper is above the button, it is moved down to press it. The state of the button is monitored at every step. The task is considered successful if the button is pressed, and a reward of 1 is awarded per step until the end of the episode.

  2. Pick-shed Policy: The policy calculates the distance between the gripper and the object. If the distance is greater than a threshold, the gripper is moved in the direction of the object. If the gripper and object are close enough, the policy issues the action to close the gripper. Then, the object is lifted in the z direction until a certain height threshold is reached.

  3. Open Drawer Policy: Similar to the above policy, this policy also calculates the distance between the drawer handle and the gripper. The gripper is moved towards the drawer handle until the distance falls below a certain threshold. Once the gripper is above the handle, the gripper is closed. Finally, the gripper is moved in the direction of opening the drawer. This direction depends on the drawer’s orientation, which is provided to the policy from the simulation.
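The threshold logic of the pick-shed policy can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `pick_shed_policy`, the threshold values, and the simplified 4-dimensional action (dx, dy, dz, grip) are all hypothetical, whereas the agent in this work uses a 7-dimensional action.

```python
import math

# Hypothetical values (metres); the paper does not report the thresholds used.
REACH_THRESHOLD = 0.02
LIFT_HEIGHT = 0.15
STEP_SIZE = 0.05


def pick_shed_policy(gripper_pos, object_pos, gripper_closed):
    """Return an illustrative action (dx, dy, dz, grip) for the pick task.

    grip is +1.0 to close the gripper and -1.0 to keep it open.
    Positions are (x, y, z) tuples read directly from the simulator,
    which is privileged information the RL agent does not receive.
    """
    delta = [o - g for o, g in zip(object_pos, gripper_pos)]
    distance = math.sqrt(sum(d * d for d in delta))

    if gripper_closed:
        # Object is grasped: lift along z until the height threshold, then halt.
        if gripper_pos[2] < LIFT_HEIGHT:
            return (0.0, 0.0, STEP_SIZE, 1.0)
        return (0.0, 0.0, 0.0, 1.0)

    if distance > REACH_THRESHOLD:
        # Far from the object: take a bounded step along the direction to it.
        scale = min(STEP_SIZE, distance) / distance
        return (delta[0] * scale, delta[1] * scale, delta[2] * scale, -1.0)

    # Close enough: stop moving and close the gripper.
    return (0.0, 0.0, 0.0, 1.0)
```

The press-button and open-drawer policies would follow the same pattern, with the final phase replaced by a downward press or a pull along the drawer's opening direction.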

Our data collection procedure, as defined in Algorithm 1, follows the standard agent-environment loop. We initiate an outer loop to iterate over the total number of trajectories (or episodes) $N_T$, which is determined by considering the area and density of the object space, ensuring a consistent density across different areas. Each episode consists of 20 time steps. During each time step, the hand-designed policy generates an action ($\boldsymbol{a}_t$) based on the current environment state. Then, Gaussian noise ($N$) is added to the action. The purpose of this noise is twofold: it reduces the policy’s accuracy from 100% to approximately 80% so that we get both successful and failed cases, and it introduces variations in the trajectories. The noisy action is then provided to the simulator, which produces the next state ($\boldsymbol{s}_{t+1}$) and the corresponding reward ($r_t$). We store this information as the tuple $\langle \boldsymbol{s}_t, \boldsymbol{a}_t, r_t, \boldsymbol{s}_{t+1}\rangle$, which is commonly used in reinforcement learning.

Algorithm 1. Data Collection Procedure.
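The loop described above can be sketched in a few lines of Python. The sketch is a toy stand-in rather than the authors' code: `ToyReachEnv`, `scripted_policy`, `collect_dataset`, and the noise level are hypothetical names and values, the state is a pair of scalars instead of a $48\times 48\times 3$ image, and storing the noisy action in the tuple is an assumption.

```python
import random


class ToyReachEnv:
    """Hypothetical 1-D stand-in for the simulated scene used in the paper."""

    def __init__(self, rng):
        self.rng = rng
        self.pos = 0.0
        self.goal = 0.0

    def reset(self):
        self.pos = 0.0
        self.goal = self.rng.uniform(0.2, 1.0)  # random object placement
        return (self.pos, self.goal)

    def step(self, action):
        self.pos += action
        reward = 1.0 if abs(self.goal - self.pos) < 0.05 else 0.0
        return (self.pos, self.goal), reward


def scripted_policy(state):
    """Move towards the goal with a bounded step (uses privileged state)."""
    pos, goal = state
    return max(-0.1, min(0.1, goal - pos))


def collect_dataset(num_trajectories, episode_length=20, noise_std=0.02, seed=0):
    """Roll out the scripted policy with Gaussian action noise and log transitions."""
    rng = random.Random(seed)
    env = ToyReachEnv(rng)
    dataset = []
    for _ in range(num_trajectories):            # outer loop over N_T episodes
        state = env.reset()
        for _ in range(episode_length):          # 20 time steps per episode
            action = scripted_policy(state)
            action += rng.gauss(0.0, noise_std)  # Gaussian noise on the action
            next_state, reward = env.step(action)
            # Assumption: the noisy action is the one stored in the tuple.
            dataset.append((state, action, reward, next_state))
            state = next_state
    return dataset
```

With `episode_length=20`, a call such as `collect_dataset(5)` yields 100 transitions, matching the 20 data points per trajectory stated above.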

4.2. Empirical results and analysis

For one sequential learning experiment, we select a sequence of two tasks from the set of three tasks described in the previous section. This selection yields six possible combinations: button-shed, button-drawer, shed-button, shed-drawer, drawer-shed, and drawer-button. We perform two sets of experiments for each doublet sequence, one with SI regularisation and another without SI regularisation. Each set contains six experiments obtained by varying the area and density of the object space. Therefore, in total, we performed 72 sequential learning experiments. Apart from these 72 experiments, we also trained the agent on single tasks using SAC-CQL to obtain a reference baseline performance for evaluating forward transfer. We perform behaviour cloning for the initial 5k steps to learn faster, as we have a limited compute budget. We use the metrics from ref. [31] to evaluate the performance of a continual learning agent. Each task is trained for $\Delta = 100\text{K}$ steps. The total number of tasks in a sequence is $N=2$, so the total number of steps is $T = 2 \cdot \Delta$. The $i\text{-th}$ task is trained for $t \in [(i-1) \cdot \Delta, i \cdot \Delta ]$.
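As a quick check of the experiment count and the training schedule, the following sketch enumerates the ordered task pairs and the training interval of each task. The names `TASKS`, `task_interval`, and `enumerate_experiments` are illustrative and not taken from the released code.

```python
from itertools import permutations, product

TASKS = ("button", "shed", "drawer")
AREAS_CM2 = (40, 360, 1000)
DENSITIES = (10, 20)
DELTA = 100_000  # training steps per task


def task_interval(i, delta=DELTA):
    """Training-step interval [(i-1)*delta, i*delta] of the i-th task (1-indexed)."""
    return ((i - 1) * delta, i * delta)


def enumerate_experiments():
    """All (task sequence, use_si, area, density) combinations."""
    sequences = list(permutations(TASKS, 2))  # ordered pairs of distinct tasks
    return list(product(sequences, (True, False), AREAS_CM2, DENSITIES))
```

Here `permutations(TASKS, 2)` gives the six ordered doublets, and combining them with two SI settings, three areas, and two densities gives the 72 experiments.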

4.2.1 Task accuracy

We evaluate the agent after every 1000 training steps by sampling ten trajectories from the environment for each task. The agent’s accuracy $p_i(t)$ for a task $i$ is defined as the fraction of successful trajectories among those ten trials. Fig. 5 shows the accuracy of three experiments corresponding to the button-shed, button-drawer, and drawer-button task combinations for an area size of $40\,\textrm{cm}^{2}$ with a density of 10 and 20 object configurations per $\textrm{cm}^{2}$. The top row represents sequential learning with SI, while the bottom row represents sequential learning without SI. SI is working better, as evidenced by the overlapping Task-1 and Task-2 accuracy. We observed that SI was most helpful in the button-shed task doublet due to the overlapping nature of these tasks, as both require reaching the object. This shows the benefit of using SI for overlapping tasks.
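The evaluation protocol amounts to a success-rate estimate at fixed intervals. A minimal sketch, assuming each evaluation rollout returns a boolean success flag (the helper names are hypothetical):

```python
def task_accuracy(successes):
    """Fraction of successful evaluation trajectories, given boolean flags."""
    if not successes:
        raise ValueError("need at least one evaluation trajectory")
    return sum(bool(s) for s in successes) / len(successes)


def evaluation_steps(total_steps, every=1000):
    """Training steps at which the agent is evaluated."""
    return list(range(every, total_steps + 1, every))
```

With ten trials per evaluation, the accuracy can only take values in steps of 0.1, which is worth keeping in mind when reading small differences between curves.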

Figure 5. Task accuracy for tasks button-shed, button-drawer and drawer-button. The top row is with SI, and the bottom row is without SI. In each plot, the X-axis represents the number of gradient update steps, and the Y-axis represents accuracy.

4.2.2 Forgetting

Forgetting measures the decrease in accuracy on a task as we train on subsequent tasks and is defined as $F_i := p_i(i \cdot \Delta) - p_i(T)$. Here, $p_i(t) \in [0,1]$ is the success rate of task $i$ at time $t$. Fig. 6 shows the forgetting of Task-1 after training Task-2. We can see that SI performed better or equal in all cases. In fact, in some cases, such as button-shed, forgetting is negative, which means that the performance of Task-1 improved after training on Task-2. This indicates backward knowledge transfer from Task-2 to Task-1. This phenomenon is not seen in the case of sequential learning without SI, which indicates that SI helps in reducing catastrophic forgetting. No significant trends are observed with the variation of object-space area, but forgetting increases with increased object-space density. This might be due to the limited training budget (100K steps) per task, as tasks with a larger area and density would require more training to show good results.
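The forgetting metric can be computed directly from a logged accuracy curve. The sketch below assumes the curve is a list sampled every `eval_every` training steps, so that entry `k` holds the accuracy at step `(k + 1) * eval_every`; this storage layout and the function name are assumptions for illustration.

```python
def forgetting(accuracy_curve, task_index, delta, eval_every=1000):
    """F_i = p_i(i * delta) - p_i(T) for a 1-indexed task.

    accuracy_curve[k] holds p_i at training step (k + 1) * eval_every,
    and the last entry corresponds to the final step T.
    """
    end_of_task = task_index * delta // eval_every - 1
    return accuracy_curve[end_of_task] - accuracy_curve[-1]
```

A positive value means accuracy on the task dropped by the end of training, while a negative value means it improved after later tasks were trained.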

Figure 6. Forgetting matrix (0-button, 1-shed, 2-drawer). The top row is with SI regularisation, and the bottom row is without regularisation. Measuring units for area and density are cm $^2$ and objects/cm $^2$ respectively for every task.

4.2.3 Forward transfer

Forward transfer measures knowledge transfer by comparing the performance on a given task when it is trained individually with the performance when it is learned after the network has already been trained on previous tasks. It is defined as

(10) \begin{equation} FT_i := \frac{\text{AUC}_i - \text{AUC}_i^b}{1- \text{AUC}_i^b}, \end{equation}

where $\text{AUC}_i = \frac{1}{\Delta } \int _{(i-1) \cdot \Delta }^{i \cdot \Delta } p_i(t)\text{d}t$ represents the area under the accuracy curve of task $i$, $\text{AUC}_i^b = \frac{1}{\Delta } \int _{0}^{\Delta } p_i^b(t)\text{d}t$ represents the area under the curve of the reference baseline task, and $p_i^b(t)$ represents the reference baseline performance. We use single-task training performance as the reference for Task-2 while evaluating forward transfer. Fig. 7 shows forward transfer for Task-2 after the network is trained on Task-1. We observed that in most cases, training without SI gives a better transfer ratio than training with SI. Since reducing catastrophic forgetting is the primary objective of the sequential learning framework, we set a high value of SI regularisation strength. This restricts the movement of weights away from the solution of the previous task, which helps to reduce catastrophic forgetting but also hinders the ability to learn a new task, thus reducing forward transfer. This can also be noticed in Fig. 5, where the accuracy of Task-2 is lower with SI than for its non-SI counterpart. This highlights the stability-plasticity problem: any method that tries to make learning more stable to reduce forgetting inadvertently also restricts the flexibility of the connectionist model to learn a new task.
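The forward transfer ratio in Eq. (10) can be approximated from sampled accuracy curves. The sketch below uses the trapezoidal rule on equally spaced samples; the function names and the choice of integration rule are assumptions, since the text does not state how the integral is evaluated.

```python
def normalised_auc(curve, dt=1.0):
    """Trapezoidal area under an accuracy curve, divided by the time span."""
    if len(curve) < 2:
        raise ValueError("need at least two samples")
    area = sum((a + b) / 2.0 * dt for a, b in zip(curve, curve[1:]))
    return area / (dt * (len(curve) - 1))


def forward_transfer(sequential_curve, baseline_curve):
    """FT_i = (AUC_i - AUC_i^b) / (1 - AUC_i^b).

    sequential_curve: accuracy of task i during its own training interval
    after earlier tasks were learned; baseline_curve: accuracy of the same
    task trained from scratch.
    """
    auc = normalised_auc(sequential_curve)
    auc_b = normalised_auc(baseline_curve)
    if auc_b >= 1.0:
        raise ValueError("baseline AUC of 1 leaves no room for transfer")
    return (auc - auc_b) / (1.0 - auc_b)
```

A value of 0 means sequential training matched the single-task baseline, positive values indicate faster or better learning than the baseline, and negative values indicate that prior training slowed learning.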

4.2.4 Training time

We used NVIDIA DGX A100 GPUs for training. Training time for one experiment with three sequential tasks on 1 GPU is 18h (6h per task). We used 8 GPUs in parallel for training and testing all 72 experiments, which took approximately 7 days. Apart from these metrics, we observed that the agent requires 14k, 10k, and 16k steps on average to achieve its first success on Task-2 when trained directly, sequentially without SI, and sequentially with SI, respectively. This shows the advantage of the sequential training framework: the agent learns the task faster when trained sequentially (without SI) than when trained on the task directly, although it slows down slightly when we add SI to reduce forgetting.

Figure 7. Forward transfer matrix (0-button, 1-shed, 2-drawer). The top row is with SI regularisation, and the bottom row is without regularisation. Measuring units for area and density are cm $^2$ and objects/cm $^2$ respectively for every task.

Figure 8. Accuracy for sequentially learning pick-shed and press-button tasks (area = 360cm $^2$ , density = 20objects/cm $^2$ ). The left column is with SI, and the right column is without SI. In each plot, the X-axis represents the number of gradient update steps, and the Y-axis represents accuracy.

Fig. 8 shows another interesting observation we made in the case of sequential learning of the pick-shed and press-button tasks (area = 360 cm$^2$, density = 20 objects/cm$^2$). While training for Task-1 (pick shed), the agent showed some success on Task-2 (press button) even before achieving any success on Task-1 itself. This might be due to the nature of the tasks, as the trajectory of the press-button task is shared with the other task. Therefore, the agent tends to acquire knowledge for similar tasks. This may also result from behaviour cloning for the initial 5k steps, where the agent tries to mimic the data collection policy. We also observed that increasing the object-space area (keeping the density the same) helps in knowledge transfer, as seen in the increase in average forward transfer with area size.

5. Conclusion and future work

We investigated catastrophic forgetting and forward knowledge transfer for sequentially learning image-based robotic manipulation tasks by combining a continual learning approach with an offline RL framework. We used SAC-CQL as the offline deep RL framework with synaptic intelligence (SI) to mitigate catastrophic forgetting. A multiheaded CNN was used to provide knowledge of the current task index to the neural network. We performed a series of experiments with different task combinations and with varying object-space areas and densities. We found that SI is useful for reducing forgetting. However, it showed a limited transfer of knowledge from previous tasks. We also found that the ordering of tasks significantly affects the performance of sequential task learning. Experiments also suggest the importance of prior knowledge for continual learning. An agent trained only with state-action pairs of many diverse tasks (even without reward) may provide better prior knowledge.

In addition to the findings presented in this work, there are several promising directions for extending the research on sequential learning of image-based robotic manipulation tasks. Firstly, the scope of our framework can be expanded to encompass a more comprehensive array of robotic manipulation tasks, thereby demonstrating its versatility and applicability across diverse manipulation domains. This extension would involve testing the agent’s ability to continuously acquire new skills, even when the task set evolves. Secondly, the order in which tasks are presented to the agent can significantly influence the learning process. Therefore, investigating curriculum learning strategies that dynamically arrange the sequence of tasks to optimise knowledge transfer represents a valuable avenue for future exploration. Finally, we used a simulator for data collection. However, simulation may not account for varying lighting conditions, the distribution of natural images, unmodelled dynamics, and the uncertainty of the real world. These factors introduce a discrepancy between the simulated and real-world scenarios, known as the reality gap. As such, future work will also focus on datasets collected from real hardware for a more comprehensive understanding of continual learning in robotic manipulation in real-world scenarios.

Author contributions

All three authors, Sudhir Pratap Yadav, Rajendra Nagar, and Suril V. Shah, were involved in conceptualising the problem, discussing solutions, and writing the paper. Sudhir Pratap Yadav additionally carried out data collection and experiments.

Financial support

This work was done in collaboration with IIT Jodhpur and the iHub Drishti Foundation, IIT Jodhpur.

Competing interests

The author(s) declare none.

Data availability statement

The data and code supporting this study’s findings are openly available at https://github.com/sudhirpratapyadav/sac-cql-si; further inquiries can be directed to the corresponding author/s.

References

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M., “Playing atari with deep reinforcement learning,” (2013). arXiv preprint arXiv: 1312.5602.
Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M. and Vanhoucke, V., “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” (2018). arXiv preprint arXiv: 1806.10293.
Devin, C., Gupta, A., Darrell, T., Abbeel, P. and Levine, S., “Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer,” In: International Conference on Robotics and Automation (ICRA), (2017) pp. 2169-2176.
Gu, S., Holly, E., Lillicrap, T. and Levine, S., “Deep reinforcement learning for robotic manipulation,” (2016). arXiv preprint arXiv: 1610.00633.
Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P. and Levine, S., “Composable Deep Reinforcement Learning for Robotic Manipulation,” In: International Conference on Robotics and Automation (ICRA), (2018) pp. 6244-6251.
Gualtieri, M., ten Pas, A. and Platt, R., “Category level pick and place using deep reinforcement learning,” Computing Research Repository (2017). arXiv preprint arXiv: 1707.05615.
Berscheid, L., Meißner, P. and Kröger, T., “Self-supervised learning for precise pick-and-place without object model,” Robot Automa Lett 5(3), 4828-4835 (2020).
Lee, A., Devin, C., Zhou, Y., Lampe, T., Bousmalis, K., Springenberg, J., Byravan, A., Abdolmaleki, A., Gileadi, N. and Khosid, D., “Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes,” In: Conference on Robot Learning, (2021).
Bao, J., Zhang, G., Peng, Y., Shao, Z. and Song, A., “Learn multi-step object sorting tasks through deep reinforcement learning,” Robotica 40(11), 3878-3894 (2022).
Wu, X., Zhang, D., Qin, F. and Xu, D., “Deep reinforcement learning of robotic precision insertion skill accelerated by demonstrations,” In: International Conference on Automation Science and Engineering (CASE), (2019) pp. 1651-1656.
Yasutomi, A., Mori, H. and Ogata, T., “A Peg-in-Hole Task Strategy for Holes in Concrete,” In: International Conference on Robotics and Automation (ICRA), (2021) pp. 2205-2211.
Schoettler, G., Nair, A., Luo, J., Bahl, S., Ojea, J., Solowjow, E. and Levine, S., “Deep Reinforcement Learning for Industrial Insertion Tasks with Visual Inputs and Natural Rewards,” In: International Conference on Intelligent Robots and Systems (IROS), (2020) pp. 5548-5555.
Nemec, B., Žlajpah, L. and Ude, A., “Door Opening by Joining Reinforcement Learning and Intelligent Control,” In: International Conference on Advanced Robotics (ICAR), (2017) pp. 222-228.
Chen, Y., Zeng, C., Wang, Z., Lu, P. and Yang, C., “Zero-shot sim-to-real transfer of reinforcement learning framework for robotics manipulation with demonstration and force feedback,” Robotica 41(3), 1015-1024 (2023).
Sun, X., Naito, H., Namiki, A., Liu, Y., Matsuzawa, T. and Takanishi, A., “Assist system for remote manipulation of electric drills by the robot WAREC-1R using deep reinforcement learning,” Robotica 40(2), 365-376 (2022).
Apolinarska, A., Pacher, M., Li, H., Cote, N., Pastrana, R., Gramazio, F. and Kohler, M., “Robotic assembly of timber joints using reinforcement learning,” Automat Constr 125, 103569 (2021).
Neves, M. and Neto, P., “Deep reinforcement learning applied to an assembly sequence planning problem with user preferences,” Int J Adv Manuf Tech 122(11-12), 4235-4245 (2022).
Kulkarni, P., Kober, J., Babuška, R. and Santina, C. D., “Learning assembly tasks in a few minutes by combining impedance control and residual recurrent reinforcement learning,” Adv Intell Syst 4(1), 2100095 (2022).
Nair, A., Chen, D., Agrawal, P., Isola, P., Abbeel, P., Malik, J. and Levine, S., “Combining Self-Supervised Learning and Imitation for Vision-based Rope Manipulation,” In: International Conference on Robotics and Automation (ICRA), (2017) pp. 2146-2153.
Lee, R., Ward, D., Cosgun, A., Dasagi, V., Corke, P. and Leitner, J., “Learning arbitrary-goal fabric folding with one hour of real robot experience,” (2020). arXiv preprint arXiv: 2010.03209.
Gupta, A., Yu, J., Zhao, T., Kumar, V., Rovinsky, A., Xu, K., Devlin, T. and Levine, S., “Reset-Free Reinforcement Learning via Multi-Task Learning: Learning Dexterous Manipulation Behaviors Without Human Intervention,” In: International Conference on Robotics and Automation (ICRA), (2021) pp. 6664-6671.
Kalashnikov, D., Varley, J., Chebotar, Y., Swanson, B., Jonschkowski, R., Finn, C., Levine, S. and Hausman, K., “Scaling Up Multi-Task Robotic Reinforcement Learning,” In: Conference on Robot Learning (CoRL), (2021).
Sodhani, S., Zhang, A. and Pineau, J., “Multi-Task Reinforcement Learning with Context-based Representations,” In: International Conference on Machine Learning, (2021) pp. 9767-9779.
Teh, Y., Bapst, V., Czarnecki, W., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N. and Pascanu, R., “Distral: Robust Multitask Reinforcement Learning,” In: Advances in Neural Information Processing Systems, (2017).
Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K. and Finn, C., “Gradient surgery for multi-task learning,” Adv Neur Infor Pro Syst 33, 5824-5836 (2020).
Goodfellow, I., Mirza, M., Xiao, D., Courville, A. and Bengio, Y., “An empirical investigation of catastrophic forgetting in gradient-based neural networks,” (2013). arXiv preprint arXiv: 1312.6211.
Mallya, A. and Lazebnik, S., “Packnet: Adding Multiple Tasks to a Single Network by Iterative Pruning,” In: Computer Vision and Pattern Recognition, (2018) pp. 7765-7773.
Li, Z. and Hoiem, D., “Learning without forgetting,” IEEE Trans Patt Anal Mach Intell 40(12), 2935-2947 (2017).
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D. and Hadsell, R., “Overcoming catastrophic forgetting in neural networks,” Proceed Nat Acad Sci 114(13), 3521-3526 (2017).
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W., “OpenAI gym,” Comp Res Reposit (2016). arXiv preprint arXiv: 1606.01540.
Wołczyk, M., Zajac, M., Pascanu, R., Kuciński, Ł. and Miłoś, P., “Continual world: A robotic benchmark for continual reinforcement learning,” Adv Neur Infor Pro Syst 34, 28496-28510 (2021).
Caccia, M., Mueller, J., Kim, T., Charlin, L. and Fakoor, R., “Task-agnostic continual reinforcement learning: In praise of a simple baseline,” (2022). arXiv preprint arXiv: 2205.14495.
Haarnoja, T., Zhou, A., Abbeel, P. and Levine, S., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” In: International Conference on Machine Learning, (2018) pp. 1861-1870.
French, R., “Catastrophic forgetting in connectionist networks,” Trends Cogn Sci 3(4), 128-135 (1999).
Zenke, F., Poole, B. and Ganguli, S., “Continual Learning through Synaptic Intelligence,” In: International Conference on Machine Learning, (2017) pp. 3987-3995.
Singh, A., Yu, A., Yang, J., Zhang, J., Kumar, A. and Levine, S., “Cog: Connecting new skills to past experience with offline reinforcement learning,” (2020). arXiv preprint arXiv: 2010.14500.
Kumar, A., Zhou, A., Tucker, G. and Levine, S., “Conservative q-learning for offline reinforcement learning,” Adv Neur Info Pro Syst 33, 1179-1191 (2020).
Van Hasselt, H., Guez, A. and Silver, D., “Deep reinforcement learning with double Q-learning,” Proceed AAAI Conf Arti Intell 30(1), (2016). doi: 10.1609/aaai.v30i1.10295
Kingma, D. and Ba, J., “Adam: A method for stochastic optimization,” (2014). arXiv preprint arXiv: 1412.6980.
Erwin, C. and Yunfei, B., PyBullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning (PyBullet, 2016).

Figure 1. The SAC-CQL-SI method (soft actor-critic - conservative Q-learning - synaptic intelligence) for sequential learning is depicted in the block diagram. This method begins with a dataset comprising tasks 1 to N, from which one task is selected sequentially, as denoted by the task index $\mathbf{k}$. Initially, the algorithm samples a mini-batch from the dataset of the current task. This batch is then processed by the soft actor-critic conservative Q-learning (SAC-CQL) algorithm, which calculates the losses for both the actor (policy network) and the critic (Q-network). In instances where the task index exceeds one, quadratic regularisation [35] is integrated into the actor loss to mitigate forgetting. These calculated losses are subsequently used to update the neural networks that embody the policy (actor-network) and the Q-value function (critic network). The process involves continuous sampling of subsequent batches, with training on the current task persisting until a predetermined number of training steps is reached. After completing training for a task, the agent transitions to the next task by loading its data and incrementing the task index. This cycle is repeated, allowing the agent to progress through and learn each task successively.
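The quadratic regularisation term mentioned in the caption can be sketched as follows. This follows the general form of the synaptic intelligence penalty from Zenke et al. and is not taken from the authors' code: the function names, the flat parameter lists, and the damping value `xi` are assumptions for illustration.

```python
def si_importance(path_integrals, weight_changes, xi=1e-3):
    """Per-parameter importance omega_k / (delta_k ** 2 + xi).

    path_integrals: accumulated contribution of each parameter to the loss
    decrease over the previous task; weight_changes: total change of each
    parameter over that task; xi: damping term that avoids division by zero.
    """
    return [w / (d * d + xi) for w, d in zip(path_integrals, weight_changes)]


def si_penalty(params, anchor_params, importances, strength):
    """Quadratic penalty: strength * sum_k Omega_k * (theta_k - theta_k_star) ** 2.

    anchor_params are the parameter values at the end of the previous task.
    """
    return strength * sum(
        omega * (p - a) ** 2
        for p, a, omega in zip(params, anchor_params, importances)
    )
```

In the setup described by the caption, a term of this form would be added to the actor loss whenever the task index exceeds one, with `strength` controlling the trade-off between retaining the previous task and learning the new one.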

Figure 2. CNN architecture of policy network. It has three convolutional layers with max-pooling following the first two. The convolutional layers are followed by a four-layer MLP (multilayer perceptron). The output layer is multiheaded, with the number of heads corresponding to the number of tasks. Based on the current task index, only one head is enabled during training and testing. Given an input state (an RGB image), the network produces two 7-dimensional vectors, $\mu$ and $\sigma$, representing the mean and standard deviation of the stochastic policy modelled by a Gaussian distribution.

Figure 3. The Q-network’s CNN architecture is similar to that of the policy network, with two notable distinctions: firstly, the action vector is incorporated into the first layer of the MLP, and secondly, the network’s output is a scalar Q-value. This network takes state (an RGB image) and action (a 7-dimensional vector) and produces scalar Q-value, that is, $Q(\mathbf{s},\mathbf{a})$.

Table I. Details of the collected 6 datasets for each task. The number of trajectories decreases as the area of the object space decreases to maintain a consistent object-space density.
