Ziyang Song

I am a researcher at Tencent Video, working on multimodal video generation & world models for filmmaking.

I received my PhD (2021–2025) from The Hong Kong Polytechnic University, where I was fortunate to be supervised by Prof. Bo Yang. My PhD research focused on 3D reconstruction and scene understanding. Before that, I received my M.Eng and B.Eng degrees (Honors Youth Program) from Xi'an Jiaotong University.

During my PhD, I interned at TikTok (San Jose, CA) with Xinyu Gong. During my M.Eng studies, I interned at SenseTime with Dongliang Wang and at Tencent Robotics X with Wanchao Chi.

Email  /  CV  /  LinkedIn  /  Google Scholar  /  Github

Selected Publications
OSN: Infinite Representations of Dynamic 3D Scenes from Monocular Videos
Ziyang Song, Jinxi Li, Bo Yang
International Conference on Machine Learning (ICML), 2024
arXiv / Video / Code

The first framework to represent dynamic 3D scenes in infinitely many ways from a monocular RGB video.

Unsupervised 3D Object Segmentation of Point Clouds by Geometry Consistency
Ziyang Song, Bo Yang
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
IEEE Xplore / Video / Code

The extended journal version of our OGC (NeurIPS 2022), with additional experiments and analysis.

NVFi: Neural Velocity Fields for 3D Physics Learning from Dynamic Videos
Jinxi Li, Ziyang Song, Bo Yang
Advances in Neural Information Processing Systems (NeurIPS), 2023
arXiv / Code

A novel representation of dynamic 3D scenes by disentangling physical velocities from geometry and appearance, enabling: 1) future frame extrapolation, 2) unsupervised semantic scene decomposition, and 3) velocity transfer.

ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation
Liang Xu*, Ziyang Song*, Dongliang Wang, Jing Su, Zhicheng Fang, Chenjing Ding, Weihao Gan, Yichao Yan, Xin Jin, Xiaokang Yang, Wenjun Zeng, Wei Wu
International Conference on Computer Vision (ICCV), 2023
arXiv / Project Page / Code

(* denotes equal contribution)

A GAN-based Transformer for general action-conditioned 3D human motion generation, including single-person actions and multi-person interactive actions.

OGC: Unsupervised 3D Object Segmentation from Rigid Dynamics of Point Clouds
Ziyang Song, Bo Yang
Advances in Neural Information Processing Systems (NeurIPS), 2022
arXiv / Video / Code

We propose the first unsupervised 3D object segmentation method, which learns from dynamic motion patterns in point cloud sequences.


Last updated: January 2026. Thanks for visiting.