# Talk2Scene
Audio-driven intelligent animation generation: from dialogue to visual storytelling.
Talk2Scene is an audio-driven intelligent animation tool. It automatically parses voice dialogue files, recognizes the spoken text and its timestamps, and uses AI to recommend matching character stances (STA), expressions (EXP), actions (ACT), backgrounds (BG), and CG illustrations inserted at the right moments. It produces structured scene-event data and composes preview videos showing AI characters performing dynamically across scenes.
Designed for content creators, educators, virtual streamers, and AI enthusiasts, Talk2Scene turns audio into engaging visual narratives for interview videos, AI interactive demos, educational presentations, and more.
## Why Talk2Scene
Manually composing visual scenes for dialogue-driven content is tedious and error-prone. Talk2Scene automates the entire workflow: feed in audio or a transcript, and the pipeline produces time-synced scene events, ready for browser playback or video export, without touching a single frame by hand.
## Architecture
```mermaid
flowchart LR
    A[Audio] --> B[Transcription\nWhisper / OpenAI API]
    T[Text JSONL] --> C
    B --> C[Scene Generation\nLLM]
    C --> D[JSONL Events]
    D --> E[Browser Viewer]
    D --> F[Static PNG Render]
    D --> G[Video Export\nffmpeg]
```
Scenes are composed from five asset types; four of them are stacked bottom-up as layers:
```mermaid
flowchart LR
    BG --> STA --> ACT --> EXP
```
The fifth type, a CG illustration, replaces the entire layered scene when it is active.
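For illustration, a single scene event in the generated JSONL might look like the line below. The field names are an assumption made for this sketch (the actual schema is not shown here); the asset codes are the ones listed in the Asset Layers table further down.

```json
{"start": 0.0, "end": 3.2, "text": "Welcome to the lab!", "bg": "BG_Lab_Modern", "sta": "STA_Stand_Front", "exp": "EXP_Smile_EyesClosed", "act": "ACT_WaveGreeting", "cg": null}
```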
## Example Output
### Example Video
### Rendered Scenes
Left: Basic scene (Lab + Stand Front + Neutral) · Center: Cafe scene (Cafe + Stand Front + Thinking) · Right: CG mode (Pandora's Tech)
### Asset Layers
Each scene is composed by stacking transparent asset layers on a background. Below is one sample from each category:
| Layer | Sample | Code | Description |
|---|---|---|---|
| BG | ![]() | `BG_Lab_Modern` | Background (opaque) |
| STA | ![]() | `STA_Stand_Front` | Stance / pose (transparent) |
| EXP | ![]() | `EXP_Smile_EyesClosed` | Expression overlay (transparent) |
| ACT | ![]() | `ACT_WaveGreeting` | Action overlay (transparent) |
| CG | ![]() | `CG_PandorasTech` | Full-scene illustration (replaces all layers) |
## Install
> **Important**
> Requires Python 3.11+, uv, and FFmpeg.
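A minimal setup sketch, assuming a standard uv-managed checkout (the repository URL and directory name below are placeholders, not taken from this page):

```bash
git clone <repository-url>   # Talk2Scene repository
cd Talk2Scene
uv sync                      # install Python dependencies into a local virtual environment
```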
Set your OpenAI API key:
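For example, assuming the project reads the standard `OPENAI_API_KEY` environment variable used by the OpenAI SDK:

```bash
export OPENAI_API_KEY="your-api-key"   # picked up by the OpenAI client at runtime
```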
## Usage
### Text Mode
Generate scenes from a pre-transcribed JSONL file, where each line pairs a text segment with its start and end timestamps:
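A hypothetical input line (the field names are an assumption for this sketch; the pipeline needs the text content plus timestamps, as described above):

```json
{"start": 0.0, "end": 3.2, "text": "Welcome to the lab!"}
```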
### Batch Mode
Process an audio file end-to-end (place the audio file in `input/`):
### Video Mode
Render a completed session into video:
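The export step composes rendered frames and the source audio with FFmpeg (see the architecture diagram). As a rough illustration of the kind of command involved, with the paths and frame rate here being assumptions rather than the project's actual invocation:

```bash
# Illustrative only: mux numbered PNG frames with the dialogue audio into an MP4.
ffmpeg -framerate 30 -i output/frames/%05d.png -i input/dialogue.wav \
  -c:v libx264 -pix_fmt yuv420p -c:a aac -shortest output/preview.mp4
```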
### Stream Mode
Consume audio or pre-transcribed text from Redis in real time:
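As an illustration only, one way to feed pre-transcribed text into a Redis-backed pipeline is to push JSON messages onto a list; the key name and message schema below are assumptions, not the project's documented interface:

```bash
# Hypothetical: enqueue one transcribed segment for the pipeline to consume.
redis-cli LPUSH talk2scene:text '{"start": 0.0, "end": 2.5, "text": "Hello and welcome!"}'
```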
## Contact
- Email: hobart.yang@qq.com
- Issues: Open an issue on GitHub
## License
Licensed under the Apache License 2.0.




