Motivation and Problem
Stereo matching has undergone rapid evolution over the past decade thanks to deep learning, enabling high-accuracy and high-resolution depth maps crucial for autonomous driving, 3D scene reconstruction, augmented reality, and robotic navigation.
Recent deep stereo models have achieved remarkable performance and even demonstrated zero-shot generalization capabilities, thanks to large quantities of labeled data -- millions of annotated synthetic and real images.
However, learning accurate depth from event cameras presents significant challenges:
Limited labeled data: Despite the growing interest in event-based stereo matching, the availability of labeled datasets remains very limited compared to the traditional frame-based domain. Acquiring such data requires massive effort and large computing resources.
Costly and sparse annotations: Capturing accurate ground truth for event streams typically requires active sensors like LiDARs, which are expensive, yield sparse data -- if not accumulated -- and suffer from calibration issues.
|
Limitations of LiDAR-supervised real-world datasets.
Despite their popularity, LiDAR annotations remain sparse (A), poorly capture dynamic scenes (B-C), are prone to reprojection errors (D), and struggle on transparent or reflective surfaces (E).
|
As a result, learning accurate and generalizable stereo depth from event data remains a significant challenge -- especially in the absence of large-scale annotated datasets.
|
|
Our Solution: EventHub
We introduce EventHub, a novel framework for training deep event stereo networks effortlessly and without any ground truth from costly active sensors. Our key innovation is a data factory that generates high-quality training data through two complementary approaches:
-
Synthetic Data Generation via Novel View Synthesis: We leverage state-of-the-art novel view synthesis (SVRaster) to generate stereo event streams and proxy depth labels. This approach requires only standard RGB images -- for example, captured by a smartphone -- and can generate large-scale synthetic training sets.
-
Real Data Distillation: When paired RGB stereo images and event stereo data are available, we distill the knowledge of stereo foundation models -- like FoundationStereo or StereoAnywhere -- to annotate the latter. This enables us to leverage the robustness of pre-trained RGB stereo models without requiring expensive LiDAR annotations.
Additionally, we demonstrate how to repurpose state-of-the-art stereo models from the RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. This bi-directional knowledge transfer also enables event models to improve RGB stereo performance in challenging nighttime conditions.
|
|
Method Overview
Our framework is designed to generate training data for event-based stereo networks without expensive ground-truth annotations. Instead of relying on costly LiDAR sensors or manual labeling, we leverage two complementary data generation strategies.
|
EventHub Framework Overview.
Our framework employs two complementary approaches: (i) Event Data Factory: SVRaster generates synthetic event stereo pairs and depth labels from sparse RGB images via virtual camera trajectories; (ii) Stereo Cross-Modal Distillation: existing RGB stereo models produce proxy depth labels for real event data in calibrated RGB-Event stereo setups. Both data sources are then combined in EventHub to train/adapt event stereo networks.
|
1. Event Data Factory via Novel View Synthesis
For datasets with only RGB images (e.g., NeRF-Stereo or ScanNet), we employ a novel pipeline that leverages SVRaster, a state-of-the-art efficient radiance field algorithm. Given sparse RGB images of a static scene:
Image Capture and Camera Calibration: We collect multi-view RGB images of static scenes and rely on COLMAP to extract accurate camera intrinsics and poses.
Regularized Dense 3D Optimization: We employ SVRaster's fast training pipeline, enhancing it with specific constraints (such as normal consistency and DepthAnythingV2 priors) to produce highly precise depth maps alongside the novel views.
Virtual Trajectory Construction: To simulate the ego-motion required to trigger an event camera, we design smoothly continuous global and local virtual camera trajectories exploring the reconstructed 3D space.
Motion-Adaptive Stereo Rendering: Leveraging continuous trajectories, we generate event streams by dynamically adapting the rendering framerate according to the observed pixel optical flow, computing disparity and confidence maps alongside the stereo events.
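The last step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: we assume an ESIM-style event model (an event fires whenever the per-pixel log intensity drifts by more than a contrast threshold `C` from a reference value), and a simple rule that picks the render framerate so the fastest pixel moves at most one pixel between consecutive renders. The threshold value, the `max_px_step` rule, and all function names are our assumptions.

```python
import numpy as np

C = 0.2  # assumed contrast threshold: an event fires when |delta log I| exceeds C


def adaptive_timestamps(max_flow_mag, t0, t1, max_px_step=1.0):
    """Choose render times in [t0, t1] so that the fastest-moving pixel
    travels at most `max_px_step` pixels between consecutive renders."""
    n = max(2, int(np.ceil(max_flow_mag / max_px_step)) + 1)
    return np.linspace(t0, t1, n)


def events_from_frames(log_frames, timestamps, C=C):
    """ESIM-style event synthesis from rendered log-intensity frames:
    emit (x, y, t, polarity) for every threshold crossing at each pixel."""
    events = []
    ref = log_frames[0].copy()  # per-pixel reference log intensity
    for frame, t in zip(log_frames[1:], timestamps[1:]):
        diff = frame - ref
        n_cross = np.floor(np.abs(diff) / C).astype(int)  # crossings per pixel
        ys, xs = np.nonzero(n_cross)
        for y, x in zip(ys, xs):
            pol = 1 if diff[y, x] > 0 else -1
            for _ in range(n_cross[y, x]):
                events.append((x, y, t, pol))
            ref[y, x] += pol * n_cross[y, x] * C  # move reference toward frame
    return events
```

In a full pipeline, `log_frames` would come from rendering the reconstructed scene along the virtual trajectory at the adaptive timestamps, for both stereo viewpoints.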
2. Cross-Modal Distillation from RGB Stereo
For datasets with calibrated RGB-Event stereo pairs (e.g., DSEC), we leverage pre-trained RGB stereo foundation models like FoundationStereo. These models process the RGB pairs to estimate proxy depth, which is then reprojected and aligned to the event cameras to supervise event-based model training. This approach eliminates the need for LiDAR annotations entirely.
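To make the distillation step concrete, here is a minimal sketch under simplifying assumptions (not the paper's actual alignment procedure): both rigs are rectified and share a viewpoint, so the teacher's disparity transfers to the event rig through the identity d = f·B/Z, and proxy labels are used through a confidence-masked L1 loss. The function names, the pure-scaling reprojection, and the confidence threshold are our assumptions; a real pipeline also warps pixels between the two camera geometries.

```python
import numpy as np


def reproject_disparity(disp_rgb, f_rgb, b_rgb, f_evt, b_evt):
    """Convert disparity predicted by an RGB stereo teacher into the event
    rig's disparity space. Since d = f * B / Z, for the same depth Z:
    d_evt = d_rgb * (f_evt * B_evt) / (f_rgb * B_rgb)."""
    scale = (f_evt * b_evt) / (f_rgb * b_rgb)
    return disp_rgb * scale


def distill_loss(pred, proxy, conf, thr=0.5):
    """Confidence-masked L1 loss between the event network prediction and
    the proxy labels distilled from the RGB teacher."""
    mask = conf > thr
    if not mask.any():
        return 0.0
    return float(np.abs(pred - proxy)[mask].mean())
```

The confidence mask is what lets the student ignore regions where the teacher is unreliable (e.g., occlusions), instead of inheriting its errors.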
|
Qualitative examples of events and proxy annotations by EventHub.
From top to bottom, examples obtained from NeRF-Stereo, ScanNet++ through novel view synthesis, and from DSEC through cross-modal distillation.
|
3. Adapting RGB Stereo Models to Events
We adapt state-of-the-art RGB stereo models to process event data by employing event representations such as Tencode. We explore fine-tuning pre-trained RGB stereo networks (FoundationStereo, StereoAnywhere) on the event domain, demonstrating that with high-quality proxy labels, these foundation models can effectively transfer to event data and achieve state-of-the-art performance with unprecedented generalization capabilities.
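The key enabler for reusing RGB backbones is packing events into an image-like tensor. Below is a simplified stand-in for an encoding in the spirit of Tencode; the exact channel semantics here (latest normalized positive-event timestamp, a soft event count, latest normalized negative-event timestamp) are our assumption, not necessarily Tencode's definition.

```python
import numpy as np


def encode_events(events, height, width, t0, t1):
    """Map a slice of events (x, y, t, polarity) into a 3-channel image that
    an RGB stereo backbone can consume: channel 0 holds the most recent
    normalized timestamp of positive events, channel 2 the same for negative
    events, and channel 1 a clipped per-pixel event count."""
    img = np.zeros((3, height, width), dtype=np.float32)
    span = max(t1 - t0, 1e-9)
    for x, y, t, p in events:
        tn = (t - t0) / span                    # normalize timestamp to [0, 1]
        ch = 0 if p > 0 else 2
        img[ch, y, x] = max(img[ch, y, x], tn)  # keep the latest event per pixel
        img[1, y, x] = min(img[1, y, x] + 1.0 / 8.0, 1.0)  # soft event count
    return img
```

With events encoded this way for both stereo views, a pre-trained RGB network can be fine-tuned on the proxy-labeled data with no architectural changes.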
|
|
Ablation Study - Training Data Sources.
Qualitative results on the DSEC dataset comparing different training protocols: Events only, Photometric loss, EV-SceneFlow, and our EventHub training strategies (MIX 3, MIX 4), alongside LiDAR ground-truth supervision. Our approach achieves performance comparable to or exceeding fully supervised methods without requiring expensive active sensor annotations.
|
|
|
Generalization to MVSEC Dataset.
Qualitative results demonstrating zero-shot generalization on MVSEC, showing how models trained on EventHub data generalize effectively to unseen datasets with diverse motion patterns and camera setups.
|
|
|
Generalization to M3ED Dataset.
Qualitative results on M3ED showing diverse challenging scenarios including nighttime operation, dynamic objects, and rapid motion. EventHub-trained models achieve impressive generalization across these difficult conditions.
|
|
BibTeX
@InProceedings{Bartolomei_2026_CVPR,
  title     = {{EventHub}: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors},
  author    = {Bartolomei, Luca and Tosi, Fabio and Poggi, Matteo and Mattoccia, Stefano and Gallego, Guillermo},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}