Hidden Gems: 4D Radar Scene Flow Learning Using
Cross-Modal Supervision

Presentation Video

Demo Video


Figure 1. Cross-modal supervised learning pipeline for 4D radar scene flow estimation. Given two consecutive radar point clouds as input, the two-stage model architecture (blue/orange block colours for stages 1 and 2) outputs the final scene flow together with a motion segmentation and a rigid ego-motion transformation. Cross-modal supervision signals retrieved from co-located modalities, i.e., LiDAR, RGB camera and odometer, are used to constrain the outputs with various loss functions. This essentially leads to a multi-task learning problem.


This work proposes a novel approach to 4D radar-based scene flow estimation via cross-modal learning. Our approach is motivated by the co-located sensing redundancy in modern autonomous vehicles. Such redundancy implicitly provides various forms of supervision cues for radar scene flow estimation. Specifically, we introduce a multi-task model architecture for the identified cross-modal learning problem and propose loss functions to opportunistically engage scene flow estimation using multiple cross-modal constraints for effective model training. Extensive experiments show the state-of-the-art performance of our method and demonstrate the effectiveness of cross-modal supervised learning for inferring more accurate 4D radar scene flow. We also show its usefulness for two subtasks: motion segmentation and ego-motion estimation.
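Conceptually, the multi-task training objective combines per-task loss terms into a single scalar. The sketch below illustrates this with a simple weighted sum; the term names and uniform weights are hypothetical placeholders, not the paper's actual formulation.

```python
def multi_task_loss(flow_loss, seg_loss, ego_loss, weights=(1.0, 1.0, 1.0)):
    """Combine three per-task losses into one training objective.

    In a cross-modal setup, the individual terms would be constrained by
    different co-located sensors (e.g. LiDAR for flow, camera for
    segmentation, odometer for ego-motion). The weights here are
    illustrative placeholders, not values from the paper.
    """
    w_flow, w_seg, w_ego = weights
    return w_flow * flow_loss + w_seg * seg_loss + w_ego * ego_loss
```

In practice the relative weights balance the gradient magnitudes of the tasks and are typically tuned on a validation split.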

Figure 2. Cross-modal supervision cues are retrieved from co-located odometer, LiDAR and camera sensors to benefit 4D radar scene flow learning. The source point cloud (red) is warped with our estimated scene flow and moves closer to the target point cloud (blue).
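The warping operation in Figure 2 is simply a per-point translation of the source cloud by its estimated flow vector; a good flow estimate reduces the distance between the warped source and the target cloud. A minimal sketch with a synthetic cloud (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
source = rng.normal(size=(64, 3))            # source radar point cloud (N, 3)
flow = np.tile([1.0, 0.0, 0.0], (64, 1))     # estimated per-point scene flow
target = source + np.array([1.0, 0.0, 0.0])  # target frame (pure translation here)

warped = source + flow                       # warp source with estimated flow

def nn_dist(a, b):
    """Mean nearest-neighbour distance from each point of cloud a to cloud b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```

Distances of this kind (e.g. Chamfer-style nearest-neighbour losses) are a common way to supervise flow without per-point correspondences.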

Qualitative results

We evaluate our approach on the public View-of-Delft dataset. Apart from scene flow estimation, our multi-task model can also predict a motion segmentation and a rigid ego-motion transformation as by-products. Below are some GIFs showing qualitative results of scene flow estimation and the two subtasks. For more results, please see our demo video or the supplementary material.

Scene Flow Estimation

Scene flow estimation visualization. The left column shows the corresponding camera images with radar points projected onto them. The middle and right columns show our estimated and the ground truth scene flow in the Bird's Eye View (BEV). The color of each point in the BEV images represents the magnitude and direction of its scene flow vector. See the color wheel legend in the bottom right.
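A color-wheel encoding of this kind typically maps the flow direction to hue and the flow magnitude to saturation. A minimal sketch of one such mapping (the specific scheme is an assumption, not necessarily the one used in our figures):

```python
import colorsys
import math

def flow_to_rgb(vx, vy, max_mag=1.0):
    """Map a BEV flow vector to an RGB color-wheel value:
    direction -> hue, magnitude -> saturation (capped at max_mag)."""
    hue = (math.atan2(vy, vx) % (2.0 * math.pi)) / (2.0 * math.pi)
    sat = min(math.hypot(vx, vy) / max_mag, 1.0)
    return colorsys.hsv_to_rgb(hue, sat, 1.0)
```

Under this mapping, stationary points render as white and faster points appear more saturated, with hue indicating the motion direction.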

Motion Segmentation

Visualization of motion segmentation results. The left column shows radar points from the source frame projected onto the corresponding RGB image. The other two columns show our and the ground truth motion segmentation results in the BEV, where moving and stationary points are rendered in orange and blue, respectively.
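One common way to separate moving from stationary points is to compensate for the ego-motion and threshold the residual flow. The sketch below illustrates this idea; the function, threshold, and decision rule are illustrative assumptions, not the paper's actual segmentation head.

```python
import numpy as np

def segment_moving(points, flow, R, t, thresh=0.1):
    """Label a point as moving when its estimated flow deviates from the
    flow induced by the rigid ego-motion (R, t).

    thresh (metres) is a hypothetical placeholder value.
    """
    ego_flow = points @ R.T + t - points          # flow explained by ego-motion
    residual = np.linalg.norm(flow - ego_flow, axis=1)
    return residual > thresh                      # True = moving
```

For a stationary scene observed from a moving platform, the residual is near zero everywhere, so only independently moving objects exceed the threshold.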

Ego-motion Estimation

Qualitative results of ego-motion estimation. The left column compares our odometry results, obtained as a by-product of our approach, with the results from ICP. The ground truth is generated using RTK-GPS/IMU measurements. The right column shows the corresponding scene flow estimation. We plot the results on two challenging test sequences.
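A rigid ego-motion transformation of the kind predicted here can be recovered in closed form from corresponding stationary points with the Kabsch/SVD algorithm. This is a generic sketch of that classical solver, not the network's actual ego-motion head:

```python
import numpy as np

def fit_rigid(src, dst):
    """Least-squares rigid transform (Kabsch/SVD) so that dst ~ src @ R.T + t."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)                 # cross-covariance of centred clouds
    U, _, Vt = np.linalg.svd(H)
    # reflection guard: force det(R) = +1 so R is a proper rotation
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cd - R @ cs
    return R, t
```

ICP, the baseline compared against in the figure, alternates this closed-form fit with nearest-neighbour correspondence search.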


@InProceedings{Ding_2023_CVPR,
    author    = {Ding, Fangqiang and Palffy, Andras and Gavrila, Dariu M. and Lu, Chris Xiaoxuan},
    title     = {Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {9340-9349}
}


This research is supported by the EPSRC, as part of the CDT in Robotics and Autonomous Systems at Heriot-Watt University and The University of Edinburgh (EP/S023208/1).