Sheng’s homepage

Who I am

I’m Sheng Zhao (赵晟), and you can call me Alex. I’m a senior at Tsinghua University, Beijing, China 🏫, where I am pursuing a B.Sc. in Mathematics and Physics and a B.Eng. in Electrical Engineering 🎓 at Weiyang College. For more information, please refer to my Curriculum Vitae.

Education

Tsinghua University
2021.9 - Present (2025.7 Expected)
B.Eng. in Electrical Engineering
B.Sc. in Fundamental Sciences (Mathematics and Physics)

Research Experience

PI Lab, Tsinghua University
2023.4 - 2024.6
Advisor: Professor Xin Yi.
C3I Lab, Tsinghua University
2023.11 - 2024.6
Advisor: Professor Bowen Zhou (also a manager at Shanghai AI Lab).
Bear Lab, University of Rochester
2024.1 - 2024.9
Advisor: Professor Yukang Yan.

I am a human-centered technical researcher working at the intersection of eXtended Reality (XR) and Generative Artificial Intelligence (GenAI). I focus on understanding human behaviors when interacting with digital objects, considering visual representations, linguistic information, and interaction setups. To this end, I implement computational prototypes, conduct empirical studies, and deploy systems that advance our capabilities in perception, cognition, and interaction. Through my research, I aim to enhance human well-being in various ways, including improving emotional states [3], fostering collaboration [4], enriching communication experiences [2], sparking creativity [5], and advancing usable privacy and security [1], all within digital interactive environments such as XR, with the help of GenAI.

Peer-reviewed Papers

[1]. Sheng Zhao*, Junrui Zhu*, Xueyang Wang, Hongyi Li, Suning Zhang, Xin Yi, and Hewu Li. CoordAuth: Hands-Free Two-Factor Authentication in Virtual Reality Leveraging Head-Eye Coordination.
Accepted as a conference paper, The 32nd IEEE Conference on Virtual Reality and 3D User Interfaces (VR’25). (* Equal Contribution)

[2]. Xueyang Wang*, Sheng Zhao*, Yihe Wang, Ziyu Han, Xinge Liu, Xin Tong, and Xin Yi. Raise Your Eyebrows Higher: Facilitating Emotional Communication in Social Virtual Reality Through Region-specific Facial Expression Scaling.
Revise & Resubmit (Accept with Minor Revision from 1AC), 2025 ACM CHI Conference on Human Factors in Computing Systems (CHI’25). (* Equal Contribution)

[3]. Sheng Zhao, Janet Johnson, Xin Yi, and Yukang Yan. Open Your Heart: Investigating the Impact of Dynamic Conversational Strategies and the Multiple Virtual Agent Setting on Emotional Support.
Revise & Resubmit, 2025 ACM CHI Conference on Human Factors in Computing Systems (CHI’25).

[4]. Janet Johnson, Macarena Peralta, Mansanjam Kaur, Sophia Huang, Sheng Zhao, Hannah Guan, Shwetha Rajaram, and Michael Nebeling. Exploring Collaborative GenAI Agents in Synchronous Group Settings: Eliciting Team Perceptions and Design Considerations for the Future of Work.
Under review, The 28th ACM SIGCHI Conference on Computer-Supported Cooperative Work & Social Computing (CSCW’25).

Repository

Single Image Guided Novel View Image Synthesis by SV3D.
Facilitating more controllable and flexible novel view generation from a single image through precise camera parameters and enhanced attention blocks.
[Github]
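
The camera-conditioning idea above can be illustrated with a toy PyTorch module. This is only my own simplified sketch, not the repository's actual code; the names (CameraConditionedAttention, cam_dim, etc.) are made up for illustration. It embeds (elevation, azimuth, radius) as one extra token that the latent image tokens cross-attend to.

    # Toy sketch (not the repository's code): condition a denoising block on
    # explicit camera parameters by embedding (elevation, azimuth, radius) as an
    # extra token that the latent image tokens cross-attend to.
    import torch
    import torch.nn as nn

    class CameraConditionedAttention(nn.Module):
        def __init__(self, dim: int = 320, cam_dim: int = 3, num_heads: int = 8):
            super().__init__()
            # Map raw camera parameters to a single conditioning token.
            self.cam_embed = nn.Sequential(
                nn.Linear(cam_dim, dim), nn.SiLU(), nn.Linear(dim, dim)
            )
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, image_tokens, camera):
            # image_tokens: (B, N, dim) latent tokens; camera: (B, 3) parameters.
            cam_token = self.cam_embed(camera).unsqueeze(1)        # (B, 1, dim)
            context = torch.cat([image_tokens, cam_token], dim=1)  # append camera token
            out, _ = self.attn(image_tokens, context, context)     # attend to camera
            return image_tokens + out                              # residual update

    block = CameraConditionedAttention()
    tokens = torch.randn(2, 64, 320)                    # fake latent tokens
    camera = torch.tensor([[10.0, 30.0, 1.5],           # elevation, azimuth, radius
                           [10.0, 60.0, 1.5]])
    print(block(tokens, camera).shape)                  # torch.Size([2, 64, 320])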

Misc

👉 Question: Why do I focus on XR? What kind of research do I want to conduct in XR?

From my understanding, XR’s 3D digital representations have revolutionized how humans interact with digital objects, enabling spatial and immersive multimodal experiences. However, current XR technologies face significant challenges, such as the bulky weight of head-mounted displays (HMDs), the lack of physical feedback when interacting with digital entities, and high rendering requirements. These limitations prevent XR from becoming a universal environment for computing and interaction.

This reminds me of the initial skepticism surrounding touch-based smartphones when they first emerged. However, with improvements in interaction methods like text entry and advances in hardware capabilities, the iPhone achieved great success. Could XR have similar potential? As a human-centered researcher, I aim to identify where XR's 3D virtual interactions can benefit people the most, and in which areas or downstream tasks, despite these existing limitations. Optimizing HMD rendering methods, imaging, and weight will probably have to be left to the experts in optics and hardware. 😈


👉 Question: Why am I interested in working on GenAI?

(Well, I could say it's because it's trendy... 😅) But it isn't really about the trend. My true motivation lies in exploring the broader context of human interaction with the digital world: specifically, what GenAI has brought to humanity and how it might impact XR research (whether positively or negatively).

In my view, XR offers richer visual representations and a wider range of interactive settings, such as opportunities for collaboration. GenAI, including large language models (LLMs) that provide rich linguistic information, diffusion models capable of generating high-resolution visual representations, and vision-language models (VLMs) that align visual content with linguistic prompts, presents promising opportunities for advancing intelligent, multimodal interactive environments. So far, however, GenAI seems to benefit 2D mobile phone interactions more, thanks to their portability and controllability; it appears harder for XR to benefit from GenAI.

Recently, Meta's Orion glasses provided a naive yet efficient answer: the context-aware ability of "Meta Look." This approach takes a photo with the camera, then transmits the prompt and image to a small-parameter VLM for processing. But it doesn't fully leverage the potential of AR (XR); after all, in most scenarios this can be done just as easily with a mobile phone, as projects like WorldScribe (Best Paper @ UIST 2024) have shown.
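
For concreteness, a "look and ask" round trip of this kind can be sketched in a few lines. This is my own approximation, not Meta's implementation: it assumes the small VLM is served behind an OpenAI-compatible endpoint, and the base_url, model name ("small-vlm"), and look_and_ask helper are placeholders.

    # Sketch of a "look and ask" round trip: capture an image, send it with a
    # prompt to a small VLM. Assumes an OpenAI-compatible serving endpoint; the
    # URL and model name are placeholders, not Meta's actual stack.
    import base64
    from openai import OpenAI

    def look_and_ask(image_path: str, question: str) -> str:
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
        response = client.chat.completions.create(
            model="small-vlm",  # placeholder for a small-parameter VLM
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    # e.g. look_and_ask("glasses_frame.jpg", "What am I looking at?")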

LLMR (Honorable Mention @ CHI 2024) seems to offer another solution, but it doesn't appear to be fully direct: it's not a true end-to-end generation process. Instead, it serializes visual representations into textual form and then uses LLM agents (as tools) to infer executable code for scene compilation. I've been thinking: could there be a more efficient encoder for representing XR information (such as 3D assets, avatars, or data representations)? We could then use a diffusion model for controlled generation. Such an end-to-end approach would likely be more intuitive and efficient.
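
To make that indirection concrete, here is a toy sketch of the serialize-then-prompt step, written from the description above rather than from LLMR's actual code; SceneObject, serialize_scene, and build_prompt are illustrative names. An end-to-end encoder + diffusion approach would skip this textual round trip entirely.

    # Toy sketch of the indirection described above (not LLMR's code): flatten
    # the scene into text, ask an LLM for executable scene-update code, and let
    # the runtime compile it.
    import json
    from dataclasses import dataclass

    @dataclass
    class SceneObject:
        name: str
        position: tuple  # (x, y, z) in meters
        prefab: str

    def serialize_scene(objects):
        # Serialize 3D state into a textual form an LLM can reason over.
        return json.dumps([vars(o) for o in objects], indent=2)

    def build_prompt(scene_text, instruction):
        return ("Current scene:\n" + scene_text +
                "\n\nUser request: " + instruction +
                "\nReturn only code that updates the scene.")

    scene = [SceneObject("table", (0.0, 0.0, 1.2), "WoodenTable"),
             SceneObject("lamp", (0.3, 0.9, 1.2), "DeskLamp")]
    prompt = build_prompt(serialize_scene(scene), "Move the lamp onto the floor.")
    # prompt would be sent to an LLM; the returned code is then compiled/executed.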

Moreover, with the emergence of various diffusion model variants (e.g., image-to-image translation, video generation, 3D reconstruction, editing, control), can XR environments benefit from diffusion models? Or could XR become a useful tool for advancing diffusion techniques? For example, by visualizing the latent space of a diffusion model for machine learning engineers while they code with diffusion models, run inference, or train and debug them.
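
As a rough sketch of that latent-visualization idea: recent versions of Hugging Face diffusers expose a callback_on_step_end hook, so intermediate latents can be decoded at each denoising step and handed to an XR renderer. The hand-off below is just a stub, and the model id is only an example.

    # Rough sketch: decode intermediate latents at every denoising step via the
    # diffusers callback_on_step_end hook, so an XR view could stream the
    # trajectory through latent space. The append() is a stand-in for the
    # actual hand-off to a renderer; the model id is only an example.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    ).to("cuda")

    previews = []

    def capture_latents(pipeline, step, timestep, callback_kwargs):
        latents = callback_kwargs["latents"]
        with torch.no_grad():
            # Decode latents to image space; scaling_factor matches the VAE config.
            image = pipeline.vae.decode(
                latents / pipeline.vae.config.scaling_factor
            ).sample
        previews.append((step, image.cpu()))  # hand off to the XR renderer here
        return callback_kwargs

    result = pipe(
        "a desk lamp on a wooden table",
        callback_on_step_end=capture_latents,
        callback_on_step_end_tensor_inputs=["latents"],
    )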

Additionally, I believe that multimodal representations in XR have great potential in areas like accessibility (a11y), generative tools, and visualization tools, which would be excellent directions for XR research (XAIR, Honorable Mention @ CHI 2023).


👉 Question: What have I learned from my past research on XR + GenAI?

Leveraging the powerful language capabilities of LLMs as the backend for multiple embodied agents in XR has the potential for various applications, such as in collaborative or individual settings tailored to specific scenarios (e.g., emotional support, collaboration dynamics). This is reflected in my work, including the manuscript for my CHI 2025 revision & resubmission, as well as my submissions to CSCW 2025 and CHI 2025 LBW.

However, despite the impressive capabilities of LLMs, their ability to enhance human well-being is still limited. LLM-supported avatars fall short compared to humans in recognizing and responding to human values. Additionally, due to the uncanny valley effect, mitigating the negative impacts of real-time embodied interactions requires significant effort (specific measures are discussed in my CHI 2025 paper, accepted with minor revisions).

This calls for advancements in both the NLP and computer graphics fields. After all, HCI is an applied discipline that requires foundational theoretical research and practical development to truly shine. Perhaps I will also attempt to address (or at least mitigate) these significant challenges from a more theoretical and technical perspective.


👉 Question: What do I want to do in the future?

This is a difficult question to answer. There are too many complex and obscure issues that need to be addressed. Over the past two years, I have worked on several projects (covering submissions to IMWUT, CHI, CSCW, VR, etc.), but when I reflect on them, I realize that they struggle to fundamentally solve the problems I mentioned earlier. At best, they offer some limited insights. In the future, I need more time to think and reflect on what exactly I need to do. It's a painful process, but it is also rewarding. ✨😊

Overall, I aim to improve the usability of XR: understanding human behavior and needs when interacting with XR, enhancing XR systems' access to the physical world (through sensors, cameras, and fabrication), and, from a GenAI perspective, exploring both how XR can support the well-being of GenAI developers and how GenAI can enhance the diversity and practicality of XR applications, considering visual and textual representations as well as interactive settings.

I am also open to exploring other research opportunities, such as using diffusion models for 3D reconstruction or building multimodal datasets for interactions between humans and the digital or physical world. Ultimately, I want to pursue research that solves real-world problems and makes tangible contributions to improving human life.

Hobbies

  • 🎵 Favorite musicians: Stefanie Sun (孙燕姿) and Jay Chou (周杰伦).
  • 📚 Favorite book: Dream of the Red Chamber (红楼梦).
  • 🚴‍♂️ Passionate about outdoor activities and a proud member of the Tsinghua Cycling Team. Strava account here.