1. Introduction
“Picking out the impurities” is a typical scenario on production lines that is both time-consuming and laborious for workers. Typical applications include separating slates from coal on coal production lines, rejecting stones from mushrooms in mushroom sorting, etc. In these situations, there is a great need for robots to take over this work from humans. Although robot grasping has been widely applied on production lines, most existing systems only work in structured environments and are unable to handle dynamic situations, especially picking out diverse objects in dense clutter [Reference Deng, Guo and Wei1].
Recently, target-oriented grasping and push-grasping have attracted great attention in robotics. References [2–Reference Danielczuk, Kurenkov and Balakrishna4] mainly focused on picking up an object of a specific, user-selected semantic category, which is also called semantic grasping. To adapt quickly to target-oriented tasks, target information must be effectively incorporated, such as accurate target masks [Reference Danielczuk, Angelova, Vanhoucke and Goldberg5], which are often obtained by a pre-trained segmentation module that requires expensive labeling. In the problem of grasping impurities, however, it is difficult to pre-define all the categories of impurities needed to obtain such segmentation masks.
The challenge of grasping impurities lies not only in perception but also in manipulation capabilities. In early studies, the problem of manipulation capabilities was mainly addressed with hand-crafted features and pre-defined sequences of actions [Reference Gupta and Sukhatme6, Reference Dogar and Srinivasa7]. Such hard-coded heuristics limit the types of synergistic behaviors that can be performed between multiple primitive actions. Recently, reinforcement learning methods have been introduced into robot manipulation, and many studies focus on training policies with multiple primitive actions through self-supervised trial and error [Reference Boularias, Bagnell and Stentz8, 9]. Although great progress has been achieved, many challenges remain when applying the synergies of different primitive actions to target-oriented grasping, for example, useless pushing actions in regions with no target and low grasping efficiency [Reference Liu, Yuan and Deng10].
In this article, we propose a novel target-oriented push-grasping system that actively discovers and grasps targets in dense clutter through the synergy between pushing and grasping actions, as shown in Fig. 1. The main contributions are summarized as follows:
• We present a target-oriented attention module, which includes target saliency detection and density-based occluded-region inference. The attention module not only locates the visible targets but also predicts the regions where targets are most likely to be occluded. Through the attention module, the robot can quickly discover where the targets are and perform policies with primitive actions to grasp them efficiently.
• We propose an active pushing mechanism based on a novel metric, the Target-Centric Dispersion Degree (TCDD). TCDD estimates the dispersion degree around the targets and is used to calculate the rewards for the pushing network. With this metric, the pushing network is trained to actively guide the robot to separate the target from non-target objects and to reduce the rate of useless pushes.
• We introduce an integrated push-grasping system framework in which the robot performs the synergy between pushing and grasping actions with a Deep Q-Learning Network (DQN). The robot properly selects pushes or grasps to discover the impurities, separate them from other objects, and then grasp them out. Experimental results demonstrate the superior performance of our approach when picking out impurities in dense clutter.
2. Related Work
Grasping is one of the key research directions in robotics and has been widely studied for decades [Reference Sahbani, El-Khoury and Bidaud11–Reference Marwan, Chua and Kwek13]. Marwan et al. [Reference Marwan, Chua and Kwek13] thoroughly reviewed recent approaches for robot reaching and grasping. Early work on robotic grasping mainly focused on analytical methods and 3D reasoning to predict grasp locations and configurations [Reference Ding and Liu14–Reference Ponce, Sullivan, Boissonnat and Merlet16]. Although analytical methods make grasping feasible, they require physical modeling of the grasped objects, which has high computational complexity and is difficult to apply to unstructured real-world environments. More recent data-driven methods explore the prospects of training model-agnostic deep grasping policies [Reference Jiang, Moseson and Saxena17–Reference Zeng, Song and Yu20]. These methods detect grasps by exploiting learned visual features without explicitly using object-specific knowledge (e.g., shape, pose, dynamics) [Reference Choi, Schwarting, DelPreto and Rus21, Reference Mahler, Matl, Liu, Li, Gealy and Goldberg22]. Chu et al. [Reference Chu, Xu and Patricio23] proposed a deep network architecture that predicts multiple grasp candidates when none, one, or multiple objects are in view; the identification of grasp configurations is broken down into a grasp detection process followed by a more refined grasp orientation classification process. Fang et al. [Reference Fang, Wang, Gou and Lu24] contributed a large-scale grasp pose detection dataset, GraspNet-$1$Billion, which contains $97,280$ RGB-D images with more than $1$ billion grasp poses. Based on this dataset, they proposed an end-to-end grasp pose prediction network which achieved state-of-the-art performance in real-world experiments. Compared with analytical methods, data-driven methods learn grasping without physical modeling of the objects. Our approach is likewise data-driven and is fully self-supervised through trial and error.
To effectively grasp objects in cluttered environments, many methods focus on combining prehensile and non-prehensile manipulation policies. Most early work used hand-crafted features and pre-defined sequences of actions [Reference Gupta and Sukhatme6, Reference Dogar and Srinivasa7, Reference Omrčen, Böge, Asfour, Ude and Dillmann25]. Recently, reinforcement learning has been introduced to train end-to-end deep networks that learn complementary prehensile and non-prehensile manipulation policies [Reference Boularias, Bagnell and Stentz8, 9, 26]. Target-oriented grasping [Reference Jang, Devin, Vanhoucke and Levine27–Reference Kurenkov, Taglic and Kulkarni29] focuses on specific target objects. It is often assumed that the targets are visible, since the visibility of the targets is required to learn object representations through interactions [Reference Jang, Devin, Vanhoucke and Levine27] or semantic segmentation [Reference Fang, Bai, Hinterstoisser, Savarese and Kalakrishnan28]. Danielczuk et al. [Reference Danielczuk, Kurenkov and Balakrishna4] proposed an action-heuristic method to select among grasping, sucking, and pushing actions to retrieve targets occluded by clutter, and further introduced X-Ray [Reference Danielczuk, Angelova, Vanhoucke and Goldberg5], which can efficiently extract a fully or partially occluded target object from a heap. Their networks are trained with a dataset of $100$ k RGB-D images labeled with occluded target objects. Yang et al. [Reference Yang, Liang and Choi30] separately learned a Bayesian-based policy to search for the targets and a classifier-based policy to coordinate target-oriented pushing and grasping to grasp targets in clutter.
Different from common target-oriented grasping, where the categories of the grasping targets are specified in advance, our work focuses on picking out the “impurities” in dense clutter. In our approach, the grasping targets are not limited to specific classes but are visually different from the other objects in the clutter. This scenario is common on industrial production lines.
3. Problem Formulation
In target-oriented grasping, the goal is to grasp the target ($X^{*}$) from a physical environment ($E$) containing various objects $X$ within the task scope $H$. The push-grasping problem is then formulated as a Markov Decision Process (MDP) defined by the tuple ($S,A,T,R,\gamma$):
• State (${{s}_{t}}\in S$): In a bounded environment $E$ containing $N$ objects, the state ${s}_{t}$ is represented as an RGB-D heightmap image of the environment. The RGB-D heightmap shown in Fig. 1 is obtained by orthographically back-projecting the RGB-D image upwards along the gravity direction [26]. In our work, the workspace of the robot is a $448\,\mathrm{mm}\times 448\,\mathrm{mm}$ surface, evenly divided into $224\times 224$ action grids, so each heightmap pixel covers $2\,\mathrm{mm}\times 2\,\mathrm{mm}$ of the workspace.
• Action (${{a}_{t}}\in A$): A fixed set of parameterized motion primitives. We define two actions, push and grasp. The action $a\in \{grasp,push\}$ is parametrized as a vector $(x,y,z,\psi )$, where $(x, y, z)$ denotes the center position of the gripper and $\psi \in [0,2\pi ]$ denotes the rotation of the gripper in the table plane (see the sketch after this list). For a grasp, the gripper moves to $(x, y, z)$, rotates by $\psi$, and then closes the fingers. For a push, the gripper first closes the fingers, moves to $(x, y, z)$, and then makes a 10 cm linear movement along the direction $\psi$.
• Transition ($T$): The transition probability distribution of the task environment, $P\colon S\times A\times S\to \left [ 0, 1 \right ]$. In model-free methods, $T$ is unknown and the robot interacts with the environment directly.
• Reward ($R$): ${{r}_{t}}=r({{s}_{t}},{{a}_{t}})\in R$ is defined as a binary reward indicating whether the target is successfully grasped or whether a pushing action is effective according to the TCDD metric.
• Future discount ($\gamma$): The goal in reinforcement learning is to learn an optimal policy ${\pi }^{*}$ that maximizes the expected future return ${{R}_{t}}=\sum \limits _{i=t}^{T}{{{\gamma }^{i-t}}}{{r}_{i}}$ at time $t$, with the discount factor $\gamma \in \left [ 0,1 \right ]$. Our future discount $\gamma$ is set to the constant $0.5$.
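For illustration, the discretized action space above can be sketched as follows; this is only a minimal example under the stated discretization ($224\times 224$ grid, $16$ rotations), and the class and helper names (`PushGraspAction`, `to_pose`) are illustrative rather than taken from our implementation:

```python
# Minimal sketch of the discretized push/grasp action space (illustrative names).
import math
from dataclasses import dataclass

WORKSPACE_MM = 448.0   # 448 mm x 448 mm workspace surface
GRID_SIZE = 224        # 224 x 224 action grids -> 2 mm per grid cell
NUM_ROTATIONS = 16     # discrete gripper orientations covering [0, 2*pi)

@dataclass
class PushGraspAction:
    primitive: str     # "push" or "grasp"
    row: int           # heightmap pixel row
    col: int           # heightmap pixel column
    rot_idx: int       # index into the 16 rotation bins

    def to_pose(self, z_mm: float):
        """Convert the grid action into a workspace pose (x, y, z, psi)."""
        cell = WORKSPACE_MM / GRID_SIZE
        x = (self.col + 0.5) * cell          # mm, workspace frame
        y = (self.row + 0.5) * cell
        psi = 2.0 * math.pi * self.rot_idx / NUM_ROTATIONS
        return x, y, z_mm, psi
```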
4. Overview of the System
First, as shown in Fig. 2, the RGB-D images are obtained from a fixed-mounted camera and fed into the attention module. In the attention module, the saliency detection sub-module indicates the area containing the target with high saliency, while the density-based occluded-region inference sub-module identifies high-density clutter as the place where the targets are most likely to be occluded. If a target is in view, the attention module outputs a cropped image centered on the location with the highest saliency. If no target is found, it outputs a cropped image centered on the location with the highest density as the candidate area in which to discover a target.
Second, when no target is found, the pushing network takes the output of the attention module as input and computes the best pushing action to scatter the objects in the designated local region. The robot repeats the pushing action until the density of the objects in view becomes uniform, meaning the target objects are unlikely to be occluded. When targets are found, the pushing and grasping networks are both activated. Through the synergy between pushing and grasping, the target objects are isolated from the surrounding objects until they are grasped. The above process is repeated until all targets in the environment are picked out.
The pushing and grasping networks use fully convolutional action-value functions (FCAVF) to map heightmap images to action-value tables in a Q-learning framework. The synergy between pushing and grasping actions is achieved by selecting the primitive action with the highest Q value from the pixel-wise Q tables.
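For illustration, the greedy selection over the pixel-wise Q maps can be sketched as follows (a simplified example; the function name and array shapes are illustrative):

```python
import numpy as np

def select_primitive(q_push: np.ndarray, q_grasp: np.ndarray):
    """Pick the primitive, rotation, and pixel with the highest Q value.

    q_push and q_grasp have shape (16, 224, 224): one Q map per rotation.
    Returns (primitive, rotation_index, row, col).
    """
    best_push = np.unravel_index(np.argmax(q_push), q_push.shape)
    best_grasp = np.unravel_index(np.argmax(q_grasp), q_grasp.shape)
    if q_push[best_push] >= q_grasp[best_grasp]:
        return ("push",) + tuple(int(i) for i in best_push)
    return ("grasp",) + tuple(int(i) for i in best_grasp)
```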
5. Method
In the following, we first introduce the details of the attention module in our proposed system. Then, we present our target-centric push–grasp synergy framework.
5.1. Attention module
Our attention module consists of two components: target saliency detection and density-based occluded-region inference. The two components run in parallel in the proposed system, with saliency detection having higher priority than occluded-region inference: the attention module outputs the location of a clutter only if no target is found in the current view. Algorithm 1 summarizes the details of the attention module.
5.1.1. Target saliency detection
Although the impurities are difficult to pick out when they are occluded by other objects, they are easy to identify when there is no occlusion. Since the non-target objects are all of the same type, the impurities are visually different in appearance from the surrounding objects. Based on this intuition, we introduce visual saliency detection to detect the salient objects in the view.
Visual saliency detection [Reference Cong, Lei, Fu, Cheng, Lin and Huang31] simulates the mechanism of human visual attention, by which salient objects are easily identified by the human visual system. Through visual saliency detection, we can focus on targets in highly salient regions that are noticeably different from the other objects in the view. In our approach, we adopt VOCUS2 [Reference Frintrop, Werner and García32] to obtain the saliency map. VOCUS2 follows the FIT-based approach [Reference Itti, Koch and Niebur33] and achieves state-of-the-art performance for salient object segmentation in terms of the weighted F-measure, while also running in real time.
As shown in Fig. 3, we first compute the saliency map $M$ from the color image $I_{c}$. There may exist multiple salient regions $(m_{0},\ldots,m_{k})$. The target mask image $I_{m}$, of size $112 \times 112$ pixels and centered on the region with the highest saliency $m_{t}$, is then returned. The RGB image $I_{c}$ and the corresponding depth image are cropped into $I_{tcd}$, centered on the target and of the same size as $I_{m}$. $I_{m}$ and $I_{tcd}$ are used as the input to the pushing and grasping networks.
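A simplified sketch of this cropping step is shown below; the thresholding used to binarize the mask is an assumption for illustration only (the actual mask comes from the saliency segmentation):

```python
import numpy as np

def crop_around_peak(saliency, color, depth, size=112):
    """Crop size x size windows centered on the most salient pixel.

    saliency: single-channel saliency map (e.g., from VOCUS2);
    color/depth: images aligned with the saliency map.
    Returns a binary target mask and the cropped color/depth images.
    """
    r, c = np.unravel_index(np.argmax(saliency), saliency.shape)
    half = size // 2
    # Clamp the window so it stays inside the image bounds.
    r0 = int(np.clip(r - half, 0, saliency.shape[0] - size))
    c0 = int(np.clip(c - half, 0, saliency.shape[1] - size))
    win = (slice(r0, r0 + size), slice(c0, c0 + size))
    mask = (saliency[win] > 0.5 * saliency.max()).astype(np.uint8)  # I_m (assumed threshold)
    return mask, color[win], depth[win]
```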
5.1.2. Density-based occluded region inference
If no target is visible in the cluttered scene, it is necessary to determine whether targets are occluded somewhere. In dense clutter, the objects scattered around the scene exhibit different density distributions in different regions. Obviously, the higher the density of a region, the higher the possibility that targets are occluded there. To this end, we use DBSCAN [Reference Bäcklund and Hedblom34] (Density-Based Spatial Clustering of Applications with Noise) to cluster the objects in the view based on the depth image.
Different from other clustering methods, DBSCAN defines a cluster as the largest set of density-connected points and is able to discover clusters of arbitrary shape in spatial data with noise. As shown in Fig. 4, with DBSCAN we obtain the clusters $C$ in the depth image $I_{d}$. The largest cluster $C_{t}$ is selected first as the region to explore, and an all-one mask $I_{dm}$ centered on the cluster $C_{t}$ is output. $I_{dm}$ has the same size as $I_{m}$. The depth image $I_{d}$ and the corresponding RGB image are then cropped into the image $I_{ccd}$, centered on the cluster and of the same size as $I_{dm}$. $I_{dm}$ and $I_{ccd}$ are also used as the input to the pushing and grasping networks. As shown in Fig. 5, if the height of a cluster is lower than a certain threshold or its area is smaller than a certain threshold, the scene is considered to contain no target.
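The clustering step can be sketched as follows; the DBSCAN parameters and the table-height test are illustrative values, not the settings used in our experiments:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def largest_cluster_center(depth, table_height, eps=3.0, min_samples=20):
    """Return the (row, col) centroid of the largest object cluster in a depth
    heightmap, i.e., the candidate region where a target may be occluded."""
    rows, cols = np.nonzero(depth > table_height)   # object pixels above the table
    if rows.size == 0:
        return None                                 # empty scene
    points = np.stack([rows, cols], axis=1).astype(float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    valid = labels >= 0                             # label -1 marks DBSCAN noise
    if not valid.any():
        return None
    largest = np.bincount(labels[valid]).argmax()   # index of the largest cluster C_t
    return points[labels == largest].mean(axis=0)
```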
5.2. Target-centric push–grasp synergy
5.2.1. DQN with fully convolutional action-value functions (FCAVF)
DQN [Reference Mnih, Kavukcuoglu and Silver35] with a Fully Convolutional Action-Value Function [26] uses a feed-forward fully convolutional network ${{Q}^{\pi }}\left ( s,a;\theta \right )$ as the Q-function estimator to approximate the Q function at the pixel level. The network weights $\theta$ are updated through gradient descent on the temporal difference error between the predicted value $Q\left ({{s}_{t}},{{a}_{t}};{{\theta }_{i}}\right )$ and the Q-target value $y_{t}^{Q}=r+\gamma \underset{{{a}'}}{\mathop{\max }}\,Q\left ({s}',{a}';{{\theta }_{i-1}} \right )$ calculated by the target network; the loss used for this update is detailed in Section 6.2.
As shown in Fig. 6, in our approach, the target masks $I_{dm}$ and $I_{m}$ obtained from the attention module are first transformed into masked color and depth images $I_{cc}$ and $I_{ct}$ by combining them with the cropped RGB-D images $I_{ccd}$ and $I_{tcd}$, respectively, through a pixel-wise $and$ operation. Then, from the target-centric RGB-D images $I_{ccd}$ and $I_{tcd}$, we obtain two orthographic RGB-D heightmaps (the target-mask heightmap and the target-centric heightmap shown in Fig. 2) after reprojection. In the case of occluded-region inference, the target-mask heightmap is identical to the target-centric heightmap.
The pushing and grasping networks take the two RGB-D heightmaps as input and infer pixel-wise predictions of Q values. The input heightmaps are first rotated into $16$ orientations so that only horizontal pushes (to the right) and horizontal grasps need to be considered in the rotated images. For each heightmap, the RGB channels and the channel-wise cloned depth channel are fed as two separate inputs into two DenseNet towers [Reference Huang, Liu, Maaten and Weinberger36], followed by channel-wise concatenation and $2$ additional convolutional layers with bilinear interpolation (as shown in Fig. 2). The total output is $32$ pixel-wise maps of Q values ($16$ for pushes in different directions and $16$ for grasps in different orientations). The Q value at each pixel represents the expected return for performing the pushing or grasping action at the corresponding location and rotation angle, and the action with the highest Q value across all maps is chosen.
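A condensed sketch of one such action-value head is shown below; the DenseNet variant and the head layer sizes are illustrative and follow the structure described above rather than the exact implementation:

```python
import torch
import torch.nn as nn
import torchvision

class ActionValueHead(nn.Module):
    """One fully convolutional Q head (push or grasp): two DenseNet towers,
    channel-wise concatenation, two conv layers, and bilinear upsampling."""
    def __init__(self):
        super().__init__()
        self.color_tower = torchvision.models.densenet121().features  # 1024-channel features
        self.depth_tower = torchvision.models.densenet121().features
        self.head = nn.Sequential(
            nn.Conv2d(2048, 64, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, color, depth3):
        # depth3 is the depth channel cloned channel-wise to 3 channels.
        feat = torch.cat([self.color_tower(color), self.depth_tower(depth3)], dim=1)
        q = self.head(feat)                                  # (B, 1, H/32, W/32)
        return nn.functional.interpolate(q, scale_factor=32, mode="bilinear",
                                         align_corners=False)
```

Applying such a head to the heightmaps rotated into $16$ orientations yields the $16$ Q maps per primitive described above.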
5.2.2. Learning active pushing for grasping
For the pushing action in a target-agnostic grasping task, rewarding only the changes of the environment after pushing can be an effective solution [26]. However, it cannot guarantee that the targets are isolated from their dense surroundings, and it sometimes even pushes the objects closer to each other. To this end, we introduce a novel metric, the Target-Centric Dispersion Degree (TCDD), which is specifically designed to measure how spread out the target is from the surrounding objects. With this metric, the pushing network can be trained to guide the robot to separate the target from the surrounding objects efficiently.
During training in simulation, we first calculate the distance between the target and each surrounding object in the scene. If this distance is greater than the finger-opening distance of the gripper $\eta$, the target will not affect that object when being grasped. Therefore, the dispersion distance between objects is defined as follows (Fig. 7(a)):
Here, ${p}_{t}$ and ${p}_{j}$ denote the center coordinates of the target $t$ and object $j$, respectively. If there are $K$ objects in the scene, the TCDD ${\alpha }_{t}$ is defined as
According to the formula defined above, the larger ${\alpha }_{t}$ is, the greater the dispersion degree of the target object $t$ in the scene. If ${\alpha }_{t}=1$, the target $t$ is considered to be $isolated$ (Fig. 7(b)) and can be grasped freely. Based on TCDD, we train the pushing network in simulation and apply the learned pushing policy at test time.
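For illustration, TCDD can be computed as follows, assuming it is the fraction of surrounding objects whose centers lie farther than the finger-opening distance $\eta$ from the target center (a simplified sketch consistent with the isolation condition ${\alpha }_{t}=1$; not the exact implementation):

```python
import numpy as np

def tcdd(target_center, object_centers, eta):
    """Target-Centric Dispersion Degree (assumed form): fraction of surrounding
    objects whose center is farther than eta from the target center.
    Returns 1.0 when the target is isolated."""
    object_centers = np.asarray(object_centers, dtype=float)
    if object_centers.size == 0:
        return 1.0                      # no surrounding objects: trivially isolated
    dists = np.linalg.norm(object_centers - np.asarray(target_center, dtype=float), axis=1)
    return float(np.mean(dists > eta))
```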
5.2.3. Reward
The rewards for grasping and pushing actions are specified separately. We set the grasping reward ${{R}_{g}}\left ({{s}_{t}},{{s}_{t+1}} \right )=1$ if the grasp is successful (the target is grasped at a specified location). The pushing reward ${{R}_{p}}\left ({{s}_{t}},{{s}_{t+1}} \right )=0.5$ if ${{\alpha }_{t+1}}-{{\alpha }_{t}}>\delta$, which means the objects become more scattered after the pushing action. In all other cases, the action is considered failed and the reward is set to $0$. To reduce the effect of noise, we set $\delta =0.005$.
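The reward computation described above is summarized in the short sketch below (the default `delta` is the value $0.005$ used in our experiments; the function name is illustrative):

```python
def push_grasp_reward(primitive, grasp_succeeded, alpha_before, alpha_after,
                      delta=0.005):
    """Reward: 1 for a successful target grasp, 0.5 for a push that increases
    TCDD by more than delta, and 0 otherwise."""
    if primitive == "grasp":
        return 1.0 if grasp_succeeded else 0.0
    return 0.5 if (alpha_after - alpha_before) > delta else 0.0
```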
6. Experimental Results and Analysis
Extensive experiments are carried out to evaluate the performance of our proposed system. The goals of our experiments are to answer the following questions: (1) Can our proposed system solve the problem of grasping impurities in different experimental settings? (2) Does our approach improve task performance compared to the baseline methods? (3) How do the different components of our proposed system influence the performance in different settings?
6.1. Simulation experimental setup
We use a UR5 robot arm with an RG2 parallel gripper and a statically mounted camera in V-REP [Reference Rohmer, Singh and Freese37] with Bullet $V2.83$ in simulation. The RG2 gripper is a flexible collaborative gripper with a $2$ kg payload, a built-in Quick Changer, and a stroke of up to $110\,\textrm{mm}$. $n$ targets and $m$ basic blocks are added to the workspace to simulate the dense cluttered environment, and the target objects have colors different from the other blocks to simulate the impurities. The robot needs to grasp the target objects via a sequence of pushing and grasping actions. Once all targets have been grasped, the scene is reset and objects are randomly dropped again.
To validate the effectiveness of our approach, we set up three experimental scenarios: (1) Random Case. As shown in Fig. 8(a), objects are randomly dropped onto the table. (2) Challenging Case. Objects are laid closely side by side, which are difficult to grasp without the synergy of actions; an example configuration is shown in Fig. 8(b). (3) Invisible Case. There are multiple closely arranged heaps, which make the targets invisible; this case is shown in Fig. 8(c).
For each test, we execute $20$ test runs for each scene and use two metrics to evaluate the performance: (1) the average grasp success rate per completion, and (2) the action efficiency, defined as $N_{object}/N_{action}$, which describes the efficiency of the policy in completing the task, where ${N}_{object}$ is the number of targets in testing and ${N}_{action}$ is the number of actions per completion.
6.2. Training details
In our experiments, the models are trained by self-supervision. The agent takes the output of the attention module as input to learn the policies, and the pushing and grasping policies are learned jointly through trial and error. Our push–grasp action policy is trained to minimize the temporal difference error ${{\delta }_{i}}=Q\left ({{s}_{t}},{{a}_{t}};{{\theta }_{i}} \right )-y_{t}^{\theta _{i}^{-}}$ at each iteration $i$ using the Huber loss function
\begin{equation*} L_{i}=\begin{cases} \dfrac{1}{2}\delta _{i}^{2}, & \left | \delta _{i} \right |<1 \\ \left | \delta _{i} \right |-\dfrac{1}{2}, & \text{otherwise}, \end{cases} \end{equation*}
where ${\theta }_{i}$ denotes the parameters of the neural network at iteration $i$. The target network parameters $\theta _{i}^{-}$ are held fixed between individual updates. At each iteration, the gradient is passed only through the single pixel $p$ at which the action ${a}_{t}$ was executed.
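In PyTorch, this single-pixel Huber update can be sketched as follows (`q_map` and `y_target` are placeholders for the predicted Q map of the executed rotation and the precomputed Q-target value):

```python
import torch.nn.functional as F

def td_loss(q_map, pixel, y_target):
    """Huber (smooth L1) loss on the TD error at the single executed pixel.

    q_map: predicted Q map for the executed rotation, shape (H, W);
    pixel: (row, col) where the action was executed;
    y_target: scalar tensor r + gamma * max_a' Q(s', a'; theta^-).
    """
    q_sa = q_map[pixel[0], pixel[1]]
    # Gradients flow only through this single pixel of the Q map.
    return F.smooth_l1_loss(q_sa, y_target.detach())
```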
We train our model in PyTorch version $0.3$. The model is optimized by Stochastic Gradient Descent (SGD) with a momentum of $0.9$ and a weight decay of $2\times{{10}^{-5}}$, and the learning rate is set to ${10}^{-5}$. The network uses prioritized experience replay [Reference Schaul, Quan, Antonoglou and Silver39] with stochastic rank-based prioritization during training. We train our model only in the “Random Case” and test in all three settings. Figure 9 illustrates the training process of our policy learning; the performance during training is measured by the success rate over the last $300$ attempts.
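For reference, these optimizer settings correspond to the following PyTorch call (a minimal sketch; the placeholder `model` stands in for the combined pushing and grasping networks):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)   # placeholder for the push/grasp networks
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-5, momentum=0.9, weight_decay=2e-5)
```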
6.3. Simulation results
We verify the effectiveness of our proposed push-grasping system by evaluating its performance in a series of tests. In these test runs, the system needs to successfully pick the targets up from the desktop in different scene settings. For better training, we change the scene if the same action fails for $3$ consecutive times.
6.3.1. Performance
We compare the performance of our approach with three baselines: (1) VPG [26], which uses push–grasp synergies to grasp objects. (2) S-VPG, an extension of VPG that uses the visual saliency detection mask as the input to both the pushing and grasping networks; the action with the highest Q value is executed. (3) S-RGM, which adopts BASNet [Reference Qin, Zhang, Huang, Gao, Dehghan and Jagersand38] for visual saliency detection and predicts the $5$-dimensional grasp rectangle of the objects proposed in ref. [Reference Chu, Xu and Patricio23]; a heuristic pushing mechanism is used if the grasping action fails.
The comparison results in the three experimental settings are summarized in Fig. 10. In the random and challenging cases, VPG performs poorly, with success rates of $34.6\%$ and $25.4\%$; since VPG is target agnostic, the targets are only picked out by chance. S-VPG performs better than VPG and achieves success rates of $52.6\%$ and $32\%$ in the random and challenging cases, respectively, because it has access to the target information. However, S-VPG directly applies the target mask to the results of VPG, which cannot be done when the targets are unknown, and it lacks the ability to actively isolate the target object from the surrounding objects, which makes it inefficient in our task. S-RGM achieves performance similar to S-VPG in the random case ($46.2\%$) and to VPG in the challenging case ($22.4\%$). The saliency detection in S-RGM guides the robot to focus on the targets, but the predicted grasp rectangle is not accurate enough for successful grasping, which lowers the success rate. Our proposed method achieves the best performance, exceeding S-VPG in task success rate by more than $25\%$ and $37\%$ in the random and challenging cases, respectively. The action efficiency of all three baselines is not very high ($18.9\%$, $26.5\%$, and $15.4\%$ in the random case and $5.3\%$, $4.4\%$, and $6.5\%$ in the challenging case) and becomes lower as the difficulty increases from the random case to the challenging case. In contrast, our method maintains high efficiency ($46.3\%$ in the random case and $45.1\%$ in the challenging case) and is not affected by the change of scenario.
In the invisible case, the performance of VPG and S-VPG drops significantly: the success rates are about $5\%$ and the action efficiencies only about $1\%$. This is mainly because these two methods have no effective mechanism to actively find the targets when no target is visible. S-RGM achieves better performance ($20\%$ success rate) than VPG and S-VPG since it adopts a heuristic mechanism, similar to our proposed method, that guides the robot to push the area with higher cluster density, which increases the possibility of finding the targets. However, S-RGM is still not efficient enough ($2.2\%$ action efficiency). Our method has the highest success rate ($78.5\%$) and achieves performance similar to or better than that in the random and challenging cases. The action efficiency drops to $16.8\%$ because more pushing actions are needed to find the targets. In a nutshell, our method focuses on discovering and grasping the targets, and the number of useless actions in irrelevant regions is reduced. With our approach, the robot is able to fully utilize active pushing actions to improve the grasping success rate of the targets.
6.3.2. Ablation studies
We perform ablation experiments to evaluate the importance of each component in our proposed method. The following conditions are considered: (1) No Target Saliency Detection (No-TSD). Without saliency detection, the whole RGB-D image of the scene with $224\times 224$ pixels is re-projected onto orthographic RGB-D heightmaps and serves as the representation of the current state. (2) No Occluded Region Inference (No-OI). If no target is detected in the scene, images centered on the maximum value of the saliency map are used instead as the input to the push-grasping policy. (3) No TCDD (No-T). The metric for training the pushing network is replaced by checking whether the scene changes before and after the pushing action [26]: if the difference after the action execution exceeds a certain threshold $\tau$ ($\tau \ge 300 \,\textrm{pixels}$), the push reward is set to $1$; otherwise it is $0$.
The detailed experimental results are shown in Table I. In the case of No-TSD, the robot performs poorly in all cases: as the difficulty level increases, the grasp success rate quickly drops from $34.6\%$ to $5.1\%$ and the action efficiency drops from $18.9\%$ to $0.6\%$. This is because the policy of No-TSD lacks the target information, and the problem degenerates to target-agnostic grasping. In the case of No-OI, the grasp success rate ($53.6\%$ and $49.7\%$) and action efficiency ($34.5\%$ and $29.3\%$) are not greatly affected in the random and challenging cases, but drop sharply in the invisible case ($15.7\%$ and $8.9\%$), because No-OI cannot predict the candidate regions where the targets are likely to be occluded. In the case of No-T, the policy performs better than No-OI and remains satisfactory in the random and challenging cases, but it is still unsatisfactory in the invisible case, because its metric is ineffective when the targets are occluded by the surrounding objects. Therefore, Target Saliency Detection (TSD) is the most important component of our method: it guides the robot to quickly locate the targets. Occluded-region inference also greatly helps the robot to discover targets in the invisible case, and TCDD shows its effectiveness when grasping the targets from the heaps. Figure 11 shows the simulation results in all three cases.
6.4. Results on a physical robot
We also validate the effectiveness of the learned policy on a real-world UR5 robotic arm with a Robotiq $85$ parallel gripper. The Robotiq 85 has a maximum grip stroke of 85 mm (adjustable) and a structure similar to the RG$2$, which makes it possible to directly apply the grasping policy learned in simulation to real-world experiments. To better transfer the learned policy from the simulation environment to the real world, we adopt a domain adaptation method [Reference Schaul, Quan, Antonoglou and Silver39] to deal with the reality gap. As shown in Fig. 12, the image from the real-world environment (b) becomes more similar in color distribution to the image in simulation (a) after domain adaptation. With domain adaptation, our policy trained in simulation can be transferred to the real-world experiments directly.
The learned policy is evaluated in all three cases in two real-world experiments (toy blocks and real-world objects), shown in Fig. 13. By changing the initial locations and distributions of the objects, we run $10$ groups of experiments in each case, with the maximum number of steps set to $15$ in each group. Experimental results are shown in Table II.
From Table II, we can see that our method performs stably and satisfactorily in both real-world experiments. In the random case, our method uses fewer pushes than in the other cases. In the challenging case, the robot needs more pushes for better grasping. In the invisible case, the robot needs even more pushes to scatter the surrounding objects and discover the target. Figures 14 and 15 show the results of the learned policy in the three settings.
7. Conclusions
In this article, we addressed the problem of “picking out the impurities” in dense clutter and proposed an attention-based push-grasp synergy system with deep reinforcement learning. In our approach, the proposed attention module quickly locates the targets and also predicts the regions where the targets are most likely to be occluded. The new metric TCDD effectively guides the robot to isolate the target from its surroundings, and the number of useless pushes is reduced. Our system is trained in simulation with self-supervision. Extensive experimental results in simulation and real-world environments demonstrate that our proposed method can effectively pick out the impurities in dense clutter with better exploration efficiency and a higher grasp success rate than the baseline approaches. Moreover, the trained policies generalize to challenging environments with noise.
Acknowledgement
This work was supported by the National Key R&D Program of China (grant 2019YFB1311901) and the National Natural Science Foundation of China under Grants U1713222 and 61773378.
Conflicts of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.