We delineate the architecture of C2P-Net in Fig. 2. It consists of two components dedicated to two stages: initial rigid registration and pyramid non-rigid registration. Given an ex vivo point cloud as a template extracted from a
\(\mu \)CT model:
\(P_{exv}={\{x_i\in {{\mathcal {R}}^{3}}\}_{i=1,2,...,N}}\), and a partial point cloud of the simulated in vivo shape variant:
\(P_{inv}={\{y_j\in {{\mathcal {R}}^{3}}\}_{j=1,2,...,M}}\), we adopt the Neighborhood-aware Geometric Encoding Network (NgeNet) [21] to solve the initial rigid registration task. This stage is formulated as:
$$\begin{aligned} \tau ,\sigma =\textrm{NgeNet}(P_{exv}, P_{inv})\quad \text {where}\ (u, v) \in \sigma \end{aligned}$$
(1)
where
\(\tau \in {SE(3)} \) is the rigid transformation matrix which aligns
\(P_{exv}\) with
\(P_{inv}\), and
\(\sigma \) is the sparse correspondence set, in which each pair \((u, v) \in \sigma \) gives the indices
u and
v of matched points in
\(P_{exv}\) and
\(P_{inv}\), respectively. Thanks to its multi-scale structure and a voting mechanism that integrates features from different resolutions, NgeNet handles noise well and predicts correspondences robustly.
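To make the outputs of Eq. (1) concrete, the following NumPy sketch shows one way they could be consumed downstream: applying \(\tau \) to \(P_{exv}\) and reading matched pairs from \(\sigma \). It assumes that \(\tau \) is stored as a 4x4 homogeneous matrix and \(\sigma \) as an integer array with one (u, v) pair per row. The function name `apply_rigid_transform` and all toy values are hypothetical and are not part of NgeNet.

```python
import numpy as np


def apply_rigid_transform(points, tau):
    """Apply a 4x4 homogeneous rigid transform to an (N, 3) point array."""
    rotation = tau[:3, :3]
    translation = tau[:3, 3]
    # Row-vector convention: (R p)^T = p^T R^T
    return points @ rotation.T + translation


# Toy stand-in for the ex vivo template P_exv (assumed values).
P_exv = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

# Assumed tau: 90 degree rotation about z followed by a translation (1, 2, 3).
tau = np.array([[0.0, -1.0, 0.0, 1.0],
                [1.0, 0.0, 0.0, 2.0],
                [0.0, 0.0, 1.0, 3.0],
                [0.0, 0.0, 0.0, 1.0]])

# Rigidly aligned source point cloud.
P_exv_tilde = apply_rigid_transform(P_exv, tau)

# Toy partial in vivo cloud P_inv and a sparse correspondence set sigma,
# where each row (u, v) indexes into P_exv and P_inv respectively.
P_inv = np.array([[0.0, 2.0, 3.0],
                  [1.0, 2.0, 3.0]])
sigma = np.array([[0, 1],
                  [2, 0]])

# Per-pair residual after rigid alignment (zero for this constructed example).
residuals = np.linalg.norm(P_exv_tilde[sigma[:, 0]] - P_inv[sigma[:, 1]], axis=1)
```

In this constructed example the matched points coincide after alignment, so every residual is zero; on real data the residuals would reflect the remaining non-rigid deformation.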
Based on the previously predicted correspondence set and the rigidly aligned source and target point clouds, we employ the Neural Deformation Pyramid (NDP) [22] to predict the non-rigid deformation of the given point cloud pair. NDP formulates non-rigid registration as a hierarchical motion decomposition problem. At each pyramid level, the input points from the previous level are mapped to sinusoidal encodings with a level-dependent frequency:
\(\Gamma (p^{k-1})=(\textrm{sin}(2^{k+k_0}p^{k-1}), \textrm{cos}(2^{k+k_0}p^{k-1}))\), where
k is the current level number,
\(k_0\) controls the initial frequency, and
\(p^{k-1}\) is an output point from the previous level. Lower frequencies at shallower levels represent rigid sub-motions, while higher frequencies at deeper levels emphasize non-rigid deformations. In this way, a sequence of sub-motions is estimated from rigid to non-rigid, and the final displacement field is the combination of this sequence. Formally, we denote this stage as:
$$\begin{aligned} \phi _{est}=\textrm{NDP}_{n,m}({\tilde{P}}_{exv}, P_{inv}, \sigma ) \end{aligned}$$
(2)
where
\({\tilde{P}}_{exv}\) is the source point cloud
\({P}_{exv}\) transformed by
\(\tau \),
n is the number of pyramid layers of the NDP neural network,
m is the maximum number of iterations within a single pyramid layer, and
\(\phi _{est}\) is the predicted displacement field describing how each point should move to the target. At each iteration, a combined loss consisting of a correspondence loss and a regularization loss is computed and back-propagated to update the weights of each MLP. The correspondence loss
\(L_{CD}\) is defined as the Chamfer distance (Eq. 3) between
\({{\tilde{P}}}_{exv}\), masked by the correspondence set
\(\sigma \), and
\(P_{inv}\):
$$\begin{aligned} \textrm{CD}(A,B) = \frac{1}{|A|}\sum _{x_i\in A}\min _{y_j\in B}|x_i-y_j| + \frac{1}{|B|}\sum _{y_j\in B}\min _{x_i\in A}|x_i-y_j| \end{aligned}$$
(3)
$$\begin{aligned} L_{CD} =\textrm{CD}({{\tilde{P}}}_{exv}^{\sigma }, P_{inv}) \end{aligned}$$
(4)
Here, \({{\tilde{P}}}_{exv}^{\sigma }=\{{{\tilde{P}}}_{exv}[u] \mid \exists {v}:(u,v)\in \sigma \}\) denotes the masked ex vivo points, i.e., those that have a correspondence in the target in vivo point cloud.
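The two formulas used in this stage, the per-level encoding \(\Gamma \) and the masked Chamfer loss of Eqs. (3) and (4), can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the NDP implementation: \(|x_i-y_j|\) is read as the Euclidean norm, \(\sigma \) is assumed to be an integer array of (u, v) rows, and the function names `sinusoidal_encoding`, `chamfer_distance`, and `correspondence_loss` are hypothetical.

```python
import numpy as np


def sinusoidal_encoding(points, k, k0):
    """Gamma(p) = (sin(2^(k+k0) p), cos(2^(k+k0) p)) for an (N, 3) array."""
    freq = 2.0 ** (k + k0)
    return np.concatenate([np.sin(freq * points), np.cos(freq * points)], axis=-1)


def chamfer_distance(A, B):
    """Symmetric Chamfer distance of Eq. (3), assuming Euclidean point distances."""
    # Pairwise distance matrix of shape (|A|, |B|).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # Mean nearest-neighbour distance from A to B plus from B to A.
    return d.min(axis=1).mean() + d.min(axis=0).mean()


def correspondence_loss(P_exv_tilde, P_inv, sigma):
    """L_CD of Eq. (4): Chamfer distance between the masked source and the target."""
    # Keep each source index u that appears in at least one pair (u, v).
    u = np.unique(sigma[:, 0])
    return chamfer_distance(P_exv_tilde[u], P_inv)


# Toy example with assumed values.
A = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0]])
sigma = np.array([[0, 0]])

encoded = sinusoidal_encoding(A, k=1, k0=-1)   # frequency 2^0 = 1, shape (2, 6)
cd_full = chamfer_distance(A, B)               # 0.5
cd_masked = correspondence_loss(A, B, sigma)   # 0.0
```

The toy values show the effect of the mask: the unmatched source point at (1, 0, 0) contributes to the unmasked Chamfer distance but is excluded from the correspondence loss.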