OrbitGrasp: SE(3)-Equivariant Grasp Learning

Boce Hu1, Xupeng Zhu$\star$1, Dian Wang$\star$1, Zihao Dong$\star$1, Haojie Huang$\star$1, Chenghao Wang$\star$1, Robin Walters1, Robert Platt1 2
1 Northeastern University
2 Boston Dynamics AI Institute
$\star$equal contribution
Conference on Robot Learning (CoRL) 2024

Abstract

While grasp detection is an important part of any robotic manipulation pipeline, reliable and accurate grasp detection in $SE(3)$ remains a research challenge. Many robotics applications in unstructured environments such as the home or warehouse would benefit significantly from more reliable grasping. This paper proposes a novel framework for detecting $SE(3)$ grasp poses from point cloud input. Our main contribution is an $SE(3)$-equivariant model that maps each point in the cloud to a continuous grasp quality function over the 2-sphere $S^2$ using a spherical harmonic basis. Compared with reasoning about a finite set of samples, this formulation improves the accuracy and efficiency of our model when a large number of samples would otherwise be needed. To accomplish this, we propose a novel variation of EquiFormerV2 that leverages a UNet-style encoder-decoder architecture to increase the number of points the model can handle. Our resulting method, which we name $\textit{OrbitGrasp}$, significantly outperforms baselines in both simulation and physical experiments.

Introduction

Summary of OrbitGrasp

In this paper, we propose $\textbf{OrbitGrasp}$, an $SE(3)$-equivariant grasp learning framework that uses spherical harmonics for 6-DoF grasp detection. Our method leverages an $SE(3)$-equivariant network that maps each point in a point cloud to a grasp quality function over the 2-sphere $S^2$. For each point in the cloud, this function encodes the grasp quality over possible hand approach directions toward that point. By applying geometric constraints, we reduce the action space to an $\textit{orbit}$ (i.e., an $S^1$ manifold embedded in $S^2$) of approach directions, defined relative to the surface normal at each contact point. As shown below:

We infer an orbit of grasps (yellow ellipse) defined relative to the surface normal (red arrow) at the contact point (pink dot). Since our model is equivariant over $SO(3)$, the optimal pose (represented by the solid gripper) on the orbit rotates consistently with the scene (the left and right panels show a rotation of 90 degrees).

OrbitGrasp divides the input point cloud into several sub-point clouds $B_i$ (neighborhoods around center points $c_i$), processes each through the network, and outputs a grasp quality function $f_p\colon S^2 \to \mathbb{R}$ for each point $p$ in $B_i$. The model produces Fourier coefficients for each $p$ (represented as different channels in the network output), which are used to reconstruct $f_p$ in a spherical harmonic basis. The Orbit Pose Sampler generates multiple candidate approach directions for each $p$, each perpendicular to the surface normal $n_p$, and queries $f_p(\cdot)$ to evaluate the grasp quality along the orbit. The grasp with the highest quality is then selected, yielding the optimal grasp pose $a^*$, as shown on the right. A sketch of this decoding step follows.
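As a concrete illustration, below is a minimal sketch (not the authors' implementation) of how per-point spherical harmonic coefficients could be decoded into a grasp quality function $f_p$ and queried at candidate directions. The coefficient tensor, the maximum degree lmax, and the use of e3nn's real spherical harmonics are illustrative assumptions.

# Minimal sketch: decode per-point SH coefficients into f_p and query it.
# Assumptions: e3nn is available; `coeffs`, `lmax`, and the random inputs are illustrative.
import torch
from e3nn import o3

lmax = 3                               # hypothetical maximum spherical harmonic degree
n_coeffs = (lmax + 1) ** 2             # number of real SH basis functions up to lmax

# Stand-in for the coefficients the network would predict for a single point p
# (in practice: one coefficient vector per point in the sub-cloud B_i).
coeffs = torch.randn(n_coeffs)

def grasp_quality(directions: torch.Tensor) -> torch.Tensor:
    """Evaluate f_p(v) = sum_{l,m} c_{lm} Y_{lm}(v) at unit vectors v, shape (N, 3)."""
    # Real spherical harmonics evaluated at each direction: shape (N, n_coeffs).
    Y = o3.spherical_harmonics(
        list(range(lmax + 1)), directions, normalize=True, normalization="component"
    )
    return Y @ coeffs                  # (N,) grasp quality, one value per direction

# Query a few candidate approach directions (in OrbitGrasp these would be the orbit samples).
dirs = torch.nn.functional.normalize(torch.randn(8, 3), dim=-1)
print(grasp_quality(dirs))

In the actual pipeline, the queried directions are not arbitrary: they lie on the orbit perpendicular to the surface normal $n_p$, as described next.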

Grasp Pose Representation

Since our model only infers grasp quality over $S^2$, we must obtain the remaining orientation DoF. We accomplish this by constraining one of the two gripper fingers to make contact such that the object surface normal at the contact point is parallel to the gripper closing direction (see figure below). Specifically, for a point $p$ in region $B_i$ with object surface normal $n_p$, we constrain the hand $y$-axis (the gripper closing direction) to be parallel to $n_p$. Therefore, valid hand orientations form a submanifold of $SO(3)$ homeomorphic to a 1-sphere $S^1$, which we call the $\textit{orbit}$ at $p$:

\begin{equation} O_{p} = \{R = [r_1, n_p, r_3] \in SO(3) \}, \label{eqn:grasp_pose_orthogonal} \end{equation}

where $r_1, n_p, r_3$ are the columns of the $3 \times 3$ rotation matrix $R$. Valid orientations are determined by the $z$-axis of the gripper (the approach direction of the hand), which may be any unit vector perpendicular to $n_p$. We may thus specify valid grasps by their approach vector $r_3 \in \overline{O}_{p} = \{ r_3 \in S^2 : n_p^\top r_3 = 0 \}$, since $r_1 = n_p \times r_3$. In the figure below, green and blue denote the $y$ and $z$ directions of the hand, and $n_p$ (red) is the surface normal at $p$. Black denotes the orbit of the approach direction.
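For concreteness, the following is a minimal sketch (NumPy only; the function name and the 36-sample resolution are illustrative choices, not taken from the paper) of an orbit pose sampler that enumerates rotations in $O_p$ by sweeping the approach direction $r_3$ around the circle perpendicular to $n_p$.

# Minimal sketch of an orbit pose sampler: rotations whose y-axis (closing
# direction) equals n_p and whose approach axis r_3 sweeps the circle
# perpendicular to n_p. Names and the sample count are illustrative.
import numpy as np

def orbit_poses(p: np.ndarray, n_p: np.ndarray, num: int = 36) -> np.ndarray:
    """Return `num` homogeneous grasp poses of shape (num, 4, 4) on the orbit O_p."""
    n_p = n_p / np.linalg.norm(n_p)
    # Build an orthonormal tangent basis (t1, t2) of the plane perpendicular to n_p.
    helper = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(helper, n_p)) > 0.9:          # avoid a helper nearly parallel to n_p
        helper = np.array([0.0, 1.0, 0.0])
    t1 = np.cross(n_p, helper)
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n_p, t1)

    poses = np.zeros((num, 4, 4))
    for i, theta in enumerate(np.linspace(0.0, 2.0 * np.pi, num, endpoint=False)):
        r3 = np.cos(theta) * t1 + np.sin(theta) * t2   # approach direction, perpendicular to n_p
        r1 = np.cross(n_p, r3)                          # completes a right-handed frame
        R = np.stack([r1, n_p, r3], axis=1)             # columns: x, y (closing), z (approach)
        poses[i, :3, :3] = R
        poses[i, :3, 3] = p
        poses[i, 3, 3] = 1.0
    return poses

Each returned pose keeps its $y$-axis aligned with $n_p$ and differs from its neighbors only by a rotation of the approach direction about the normal, which is exactly the $S^1$ structure of the orbit.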

Simulation Experiments

We evaluated our method on a widely used grasping benchmark with 303 training objects and 40 test objects drawn from various object datasets. Two camera configurations were tested: a single-view setup, with a camera randomly positioned on a spherical region around the workspace, and a three-camera multi-view setup. We assessed two tasks: $\textbf{\textit{Pile}}$ (shown left), where objects are randomly dropped into the workspace, and $\textbf{\textit{Packed}}$ (shown right), where objects are placed upright in random poses.

$\textbf{Visualization of Pile Scene}$

$\textbf{Visualization of Packed Scene}$

We compared OrbitGrasp with various baselines in terms of grasp success rate (GSR) and declutter rate (DR). OrbitGrasp (3M) is trained on the downsampled 3M dataset, and OrbitGrasp (6M) on the full 6M dataset. The results indicate that our method outperforms all baselines across both camera settings and both tasks in terms of GSR and DR, with both the 3M and 6M training sets. The high GSR indicates that our model predicts accurate grasp quality, while the high DR indicates that our model infers accurate grasp poses that do not push objects outside of the workspace.
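For reference, the two metrics are simple ratios; the definitions below follow the standard convention for this benchmark family and are stated here as an assumption rather than quoted from the paper.

# Minimal sketch of the two evaluation metrics (assumed standard definitions).
def grasp_success_rate(successful_grasps: int, attempted_grasps: int) -> float:
    """GSR: fraction of attempted grasps that succeed."""
    return successful_grasps / attempted_grasps

def declutter_rate(objects_removed: int, objects_presented: int) -> float:
    """DR: fraction of presented objects removed from the workspace."""
    return objects_removed / objects_presented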

Physical Experiments

To assess our method's real-world performance, we conducted physical experiments involving two tasks under two camera settings, replicating those used in simulation. We directly transferred the trained models from simulation to the real world to evaluate the sim-to-real performance gap.

$\textbf{Real-World Experiment Setup.}$ (a) Robot platform setup. (b) Top: Packed object set with 10 objects. Bottom: Packed scene. (c) Top: Pile object set with 25 objects. Bottom: Pile scene.

$\textbf{Result Visualization}$

Packed Scene 1

Packed Scene 2

Pile Scene 1

Pile Scene 2

Pile Scene 3

We compared the results of OrbitGrasp with VNEdgeGrasp using the same metrics as in the simulation experiments.

$\textbf{Quantitative Results}$

Video

Citation


@inproceedings{
   huorbitgrasp,
   title={OrbitGrasp: SE(3)-Equivariant Grasp Learning},
   author={Hu, Boce and Zhu, Xupeng and Wang, Dian and Dong, Zihao and Huang, Haojie and Wang, Chenghao and Walters, Robin and Platt, Robert},
   booktitle={8th Annual Conference on Robot Learning},
   year={2024},
   url={https://openreview.net/forum?id=clqzoCrulY}
}