Real-Time Synchronized Interaction Framework for Emotion-Aware Humanoid Robots

Author: Yanrong Chen

Yanrong.Chen21@student.xjtlu.edu.cn

Supervisor: Xihan Bian

Xihan.Bian@xjtlu.edu.cn

Abstract

As humanoid robots are increasingly introduced into social settings, achieving emotionally synchronized multimodal interaction remains a significant challenge. To further the integration of humanoid robots into service roles, we present a real-time framework for NAO robots that synchronizes speech prosody with full-body gestures through three innovations: (1) a dual-channel emotion engine in which an LLM simultaneously generates context-aware text responses and biomechanically feasible motion descriptors, constrained by a structured joint movement library; (2) dynamic time warping enhanced with duration-aware sequencing to temporally align speech output with kinematic motion keyframes; and (3) closed-loop feasibility verification that keeps gestures within NAO's physical joint limits through real-time adaptation. Evaluations show 21% higher emotional alignment than rule-based systems, achieved by coordinating vocal pitch (valence-driven) with upper-limb kinematics while maintaining lower-body stability. By enabling seamless sensorimotor coordination, this framework advances the deployment of context-aware social robots in dynamic applications such as personalized healthcare, interactive education, and responsive customer service platforms.
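To make the dual-channel output concrete, the minimal sketch below shows one way such a response could be structured: a spoken reply paired with timed motion descriptors that reference joints from the movement library. The field names and angle values are illustrative placeholders, not the exact schema produced by our engine.

# Illustrative dual-channel response: one channel carries the spoken reply,
# the other a list of timed motion descriptors drawn from the joint movement
# library. Field names and values are placeholders, not the exact schema.
example_response = {
    "text": "Hey buddy! I just got the job I wanted. Isn't that amazing?",
    "emotion": "happy",              # high-level label steering prosody and gesture style
    "keyframes": [
        {"time": 0.0,                # seconds, relative to speech onset
         "joints": {"LShoulderPitch": 1.4, "RShoulderPitch": 1.4}},
        {"time": 0.8,                # raise both arms for an excited gesture
         "joints": {"LShoulderPitch": 0.2, "RShoulderPitch": 0.2,
                    "LElbowRoll": -1.0, "RElbowRoll": 1.0}},
    ],
}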

Method

Our framework synchronizes speech and gesture in real time through three stages: LLM-based planning of text responses and motion descriptors, duration-aware timing alignment, and physical feasibility checks against NAO's joint limits.
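As a rough illustration of the last two stages, the sketch below rescales keyframe timestamps to the measured duration of the synthesized speech (a simplified stand-in for the duration-aware DTW alignment) and clamps every target angle to the robot's joint range before execution. The joint-limit values are approximate figures for a few NAO arm joints and the helper names are our own; a verified trajectory could then be dispatched through NAOqi's ALMotion interface.

# Minimal sketch of timing alignment and feasibility checking (illustrative only).
# Joint limits below are approximate values for a few NAO arm joints, in radians.
JOINT_LIMITS = {
    "LShoulderPitch": (-2.08, 2.08),
    "RShoulderPitch": (-2.08, 2.08),
    "LElbowRoll": (-1.54, -0.03),
    "RElbowRoll": (0.03, 1.54),
}

def align_keyframes(keyframes, speech_duration):
    """Rescale keyframe timestamps so the gesture spans the spoken utterance.

    This linear rescaling is a simplified stand-in for the duration-aware
    DTW alignment used in the full system.
    """
    last_t = keyframes[-1]["time"] or 1.0
    scale = speech_duration / last_t
    return [dict(kf, time=kf["time"] * scale) for kf in keyframes]

def clamp_to_limits(keyframes):
    """Project every target angle onto the robot's feasible joint range."""
    safe = []
    for kf in keyframes:
        joints = {}
        for name, angle in kf["joints"].items():
            lo, hi = JOINT_LIMITS.get(name, (angle, angle))
            joints[name] = min(max(angle, lo), hi)
        safe.append({"time": kf["time"], "joints": joints})
    return safe

For instance, applying align_keyframes to the example response above with a 2.4 s utterance stretches its two keyframes from 0.0 s and 0.8 s to 0.0 s and 2.4 s before the angles are clamped.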

Experiments



Emotion-Driven Gesture Comparison

In this section, we compare gesture generation performance across different emotional states using real speech audio. Each case includes both predefined gestures and those generated by our system.

Emotion: Neutral

Input Audio: 03-01-01-01-02-01-01.wav

Predefined Gesture

ReSIn-HR (Ours) Generated Gesture

Emotion: Happy

Input Audio: 03-01-03-02-01-01-01.wav

Predefined Gesture

ReSIn-HR (Ours) Generated Gesture

Emotion: Fearful

Input Audio: 03-01-06-02-01-01-01.wav

Predefined Gesture

ReSIn-HR (Ours) Generated Gesture

Emotion-Driven Gesture Demonstration

Using manually written emotional utterances, we show how our system adapts gestures to speech content and emotional tone.

Happy: "Hey buddy! I just got the job I wanted. Isn’t that amazing?"

Sad: "I studied so hard, but still didn’t pass the exam. I feel really down."

Angry: "Why didn’t anyone reply to my emails? I’m getting really frustrated."

Failures and Limitations

Some failure cases of our policy reveal that the robot occasionally loses stability, particularly when performing wide arm swings or taking exaggerated steps. These failures are primarily caused by shifts in the center of mass, underscoring the need for balance control mechanisms during dynamic gesture execution; a minimal sketch of such a check is given after the failure cases below. Future directions include personalized gesture adaptation, latency minimization, and extensions to richer multimodal inputs. Overall, this work advances expressive and embodied human-robot interaction, and we will release our code and models to the community.

Down and Up Swing: Balance is affected by vertical motion.

Backward Arm Movement: Shifting mass backwards causes instability.

Forward Swing: Excessive forward force compromises stability.
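A natural starting point for the balance control mentioned above is a static check before each keyframe is executed: if the projected center of mass leaves a conservative support region around the feet, the pose is scaled back toward a neutral posture. The sketch below illustrates this idea with assumed support bounds and a CoM position supplied by an external estimator; it is a conceptual illustration, not the controller evaluated in our experiments.

# Illustrative static stability check: keep the ground projection of the
# center of mass (CoM) inside a conservative rectangle around the feet.
# SUPPORT_X / SUPPORT_Y are assumed bounds in meters, in the foot frame.
SUPPORT_X = (-0.04, 0.08)   # backward / forward margin
SUPPORT_Y = (-0.05, 0.05)   # left / right margin

def is_statically_stable(com_xy):
    """Return True if the CoM ground projection lies inside the support region."""
    x, y = com_xy
    return SUPPORT_X[0] <= x <= SUPPORT_X[1] and SUPPORT_Y[0] <= y <= SUPPORT_Y[1]

def scale_toward_neutral(joints, neutral, factor=0.5):
    """Blend a risky pose toward a neutral posture to pull the CoM back."""
    return {name: neutral.get(name, 0.0) + factor * (angle - neutral.get(name, 0.0))
            for name, angle in joints.items()}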