This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods can produce visually accurate environments, objects are often incomplete due to occlusions and limited observations, making them unsuitable for physics simulation. To address this limitation, we propose SAM3D-Phys, a framework that integrates scene reconstruction with the generative 3D priors of SAM3D to recover physically simulatable objects. Our approach first reconstructs the scene from multi-view images to obtain scene geometry and partial observations of objects. We then leverage SAM3D to infer complete object geometry from these partial observations. To keep the recovered objects consistent with the reconstructed scene, we restore their scene-consistent states through two complementary strategies: a physics-constrained spatial optimization algorithm that iteratively aligns each recovered object to its original location, and a mask-guided appearance distillation module that refines texture fidelity based on the observed images. By recovering complete object geometry and restoring its pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics-based simulation, enabling simultaneous, physically consistent interactive simulation of multiple objects within a reconstructed scene.
[Figure: qualitative comparison with paired panels labeled "Ours" and "Feature Splatting".]
Unlike the baseline (i.e., Feature Splatting), which depends on natural language for localization and segmentation, fails in multi-object scenarios, and suffers from severe tearing and blur, our method uses generative priors to resolve object-separation gaps in 3D reconstruction. By jointly exploiting metric cues and appearance information from the reconstruction process, our method achieves accurate spatial alignment and visual fidelity, ultimately enabling the generation of physically plausible motion.
Two examples of render-and-compare refinement for spatial alignment. Each column shows the intermediate result at increasing optimization steps. (a) Example of mask-based alignment for the bear statue, where the rendered object progressively aligns with the observed mask region in the image. (b) Example of pose refinement for the ball object, where the object position gradually converges to the correct placement within the scene. As optimization proceeds, the rendered geometry becomes increasingly consistent with the mask constraint and the scene context, resulting in more accurate spatial alignment for downstream physical simulation.
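To make the render-and-compare idea concrete, the following is a minimal sketch and not the SAM3D-Phys implementation. It assumes a toy "renderer" that rasterizes a square footprint, an observed binary mask given as a set of pixels, and a greedy search over integer image-plane translations scored by mask IoU. The names `render_box_mask`, `mask_iou`, and `refine_position` are hypothetical, and the physics constraints described in the paper are omitted.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def render_box_mask(cx: int, cy: int, half_size: int) -> Set[Pixel]:
    """Toy stand-in for a renderer: a square footprint centered at (cx, cy)."""
    return {
        (x, y)
        for x in range(cx - half_size, cx + half_size + 1)
        for y in range(cy - half_size, cy + half_size + 1)
    }


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def refine_position(
    observed: Set[Pixel],
    start: Tuple[int, int],
    half_size: int,
    max_steps: int = 100,
) -> Tuple[Tuple[int, int], float]:
    """Greedy render-and-compare: accept any 1-pixel move that raises the IoU."""
    cx, cy = start
    best = mask_iou(render_box_mask(cx, cy, half_size), observed)
    for _ in range(max_steps):
        improved = False
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            candidate = render_box_mask(cx + dx, cy + dy, half_size)
            score = mask_iou(candidate, observed)
            if score > best:
                best, cx, cy, improved = score, cx + dx, cy + dy, True
        if not improved:
            break  # no neighboring placement matches the mask better
    return (cx, cy), best
```

Starting from a placement that already overlaps the observed mask, each accepted step increases the overlap, mirroring the progressive alignment shown across the columns. A real system would optimize a full 3D pose through a differentiable or sampled renderer rather than a 2D pixel shift.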
We invited 12 participants and showed each of them the rendered videos generated by the different methods across all scenes. As shown in the table, our method significantly outperforms the baselines in both motion realism and visual quality. Additionally, since Large Multimodal Models (LMMs) have recently emerged as a powerful tool for assessing the quality of rendered videos, we also evaluate the videos with an LMM. The LMM's ratings align closely with human judgments, further validating the effectiveness of our approach.
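One way to quantify how closely LMM ratings track human ratings is a rank correlation over per-video scores. The sketch below is an assumption for illustration only, since the text does not state which agreement measure was used. It computes Spearman's rank correlation with average ranks for ties, using hypothetical helper names `average_ranks`, `pearson`, and `spearman`.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(human: Sequence[float], lmm: Sequence[float]) -> float:
    """Spearman rank correlation between two equally long score lists."""
    if len(human) != len(lmm) or len(human) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    return pearson(average_ranks(human), average_ranks(lmm))
```

Here `human` would hold the mean participant rating for each video and `lmm` the corresponding LMM rating. A value near 1 indicates that both raters order the videos similarly.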