I am a prospective Ph.D. applicant and currently a research intern at ByteDance Seed, where I work on vision-language-action (VLA) models for embodied AI.
This work grows out of a broader question that motivates me: how can multimodal agents develop spatial intelligence, not only to recognize and describe the world, but also to understand space, anticipate change, and act with grounded common sense? SpatialTree (CVPR 2026 Highlight) is my attempt to frame this as a hierarchy from perception to action, and my current VLA work pushes that hierarchy toward real-world interaction.
For my Ph.D., I hope to pursue this question at the intersection of embodied AI, multimodal learning, and world models. Please feel free to reach out if my work resonates with yours.
I study spatial intelligence as a bridge from multimodal perception to embodied action. My recent work spans evaluating and post-training multimodal large language models (MLLMs) for spatial abilities, building geometry-aware world models, and developing VLA systems that connect vision-language reasoning with real-world interaction.