Boson x Eigen AI Hackathon 2026
Dhvani
Real-time YouTube dubbing with voice cloning and emotion preservation.
Every video. Any language. Instantly.
Higgs ASR 3 · GPT-OSS 120B · Higgs TTS 2.5 · Voice Clone
The scale of the problem
YouTube is massive. Dubbing is not.
The numbers tell the story
500hrs
of video uploaded
every single minute
That's 720,000 hours per day
56%
of YouTube content
is English-only
Yet English speakers are only ~17% of the global population
$35/min
average cost of
professional dubbing
A 10-min video costs $350+ to dub manually
The growing gap
Non-English internet users are exploding.
Content supply hasn't caught up.
Internet Users (billions) vs English Content Share on YouTube (%)
Language accessibility gap
Billions of users. Almost no dubbed content.
YouTube content availability vs. global speaker population
% of YouTube content available vs. % of global internet users — by language
The opportunity
Real-time dubbing. No waiting. Lower cost.
Dhvani solves this end-to-end — paste a URL, pick a language, hear it now
<2s
first dubbed chunk
starts playing
3
languages supported
ES · JA · ZH
100%
voice cloned from
original speaker
Live Demo
Dhvani
System design
The full pipeline
Every stage runs on a different service — orchestrated in Python
📺
YouTube
pytubefix + ffmpeg
Download audio
Convert to 16kHz mono WAV
🎙️
ASR
Higgs ASR 3
English transcription
+ caption validation
🌐
Translation
GPT-OSS 120B
Translate to target lang
→ Higgs AST fallback
🔊
TTS
Higgs TTS 2.5
Voice clone + emotion
→ WAV output
🌐
Browser
WebSocket + Web Audio
Timestamp-synced
playback
FastAPI backend  ·  asyncio.gather()  ·  WebSocket streaming  ·  base64 WAV transport
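Dubbed chunks travel from the backend to the browser as base64 WAV over the WebSocket. A minimal framing sketch of that transport, assuming a simple JSON message schema (the field names here are illustrative, not the actual wire format):

```python
import base64
import json


def encode_chunk_message(wav_bytes: bytes, start_s: float, dur_s: float) -> str:
    """Frame one dubbed chunk as a JSON text message for the WebSocket.
    base64 keeps the binary WAV JSON-safe; timestamps ride alongside
    so the browser can schedule playback against the video clock."""
    return json.dumps({
        "audio_b64": base64.b64encode(wav_bytes).decode("ascii"),
        "start_s": start_s,
        "dur_s": dur_s,
    })


def decode_chunk_message(msg: str) -> tuple[bytes, float, float]:
    """Inverse of encode_chunk_message, as the browser side would do it."""
    d = json.loads(msg)
    return base64.b64decode(d["audio_b64"]), d["start_s"], d["dur_s"]
```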
Chunking strategy
Audio sliced into 3-second windows
Fixed-size chunks enable parallel processing and timestamp-accurate playback
Each chunk = 3s × 16kHz × 16-bit mono = ~96KB PCM
Normal speech
Voice reference (loudest of first 5)
Silence (skipped)
Currently dubbing
Silence Detection
RMS < 0.005 → chunk skipped entirely. No wasted API calls.
Min chunk size
Segments < 1600 samples (0.1s) are discarded — too short to transcribe.
Timestamp metadata
Every chunk carries start_s + dur_s → browser syncs audio to video clock.
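The chunking rules above can be sketched in a few lines. This is a simplified illustration, not the production code: the thresholds (RMS < 0.005 for silence, 1,600-sample minimum) come from the slides, while the function names and the in-memory float-PCM representation are assumptions:

```python
import math

SAMPLE_RATE = 16_000           # 16 kHz mono, per the pipeline
CHUNK_SECONDS = 3
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS
SILENCE_RMS = 0.005            # chunks below this RMS are skipped
MIN_SAMPLES = 1_600            # < 0.1 s is too short to transcribe


def rms(samples: list[float]) -> float:
    """Root-mean-square energy of a float PCM buffer in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def chunk_audio(samples: list[float]) -> list[dict]:
    """Slice float PCM into 3 s windows with timestamp metadata,
    dropping silent and too-short segments before any API call."""
    chunks = []
    for i in range(0, len(samples), CHUNK_SAMPLES):
        window = samples[i:i + CHUNK_SAMPLES]
        if len(window) < MIN_SAMPLES:
            continue                      # too short to transcribe
        if rms(window) < SILENCE_RMS:
            continue                      # silence: skip, no wasted call
        chunks.append({
            "samples": window,
            "start_s": i / SAMPLE_RATE,   # browser syncs to video clock
            "dur_s": len(window) / SAMPLE_RATE,
        })
    return chunks
```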
Concurrency
3 stages. Running simultaneously.
While Voice Output plays chunk N → Translation processes N+1 → Transcription processes N+2
Transcription
Translation
Voice Output
→ Browser
Transcription
Translation
Voice Output
Playing
await asyncio.gather(transcription_stage(), translation_stage(), voice_output_stage())  # bounded queues: maxsize=6
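The three-stage overlap can be sketched with bounded `asyncio.Queue`s connecting the stages, so a slow stage applies backpressure instead of buffering unboundedly. The stage names come from the slides; the `fake_*` coroutines are stand-ins for the real Higgs ASR 3 / GPT-OSS 120B / Higgs TTS 2.5 calls:

```python
import asyncio


async def fake_asr(chunk):        # stand-in for Higgs ASR 3
    return f"text-{chunk}"


async def fake_translate(text):   # stand-in for GPT-OSS 120B
    return text.upper()


async def fake_tts(text):         # stand-in for Higgs TTS 2.5
    return text.encode()


async def run_pipeline(chunks):
    """Three stages run concurrently; while TTS renders chunk N,
    translation works on N+1 and transcription on N+2."""
    asr_q: asyncio.Queue = asyncio.Queue(maxsize=6)
    tts_q: asyncio.Queue = asyncio.Queue(maxsize=6)
    out: list[bytes] = []

    async def transcription_stage():
        for chunk in chunks:
            await asr_q.put((chunk, await fake_asr(chunk)))
        await asr_q.put(None)                 # sentinel: stream done

    async def translation_stage():
        while (item := await asr_q.get()) is not None:
            chunk, text = item
            await tts_q.put((chunk, await fake_translate(text)))
        await tts_q.put(None)

    async def voice_output_stage():
        while (item := await tts_q.get()) is not None:
            _chunk, text = item
            out.append(await fake_tts(text))

    await asyncio.gather(transcription_stage(), translation_stage(), voice_output_stage())
    return out
```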
Quality control
ASR hallucination filter
ASR models sometimes invent text that was never spoken. We catch it using YouTube's own captions.
Higgs ASR 3 output
"subscribe to my channel and hit the bell"
word overlap: 22% — FAIL
YouTube caption (ground truth)
"the solution scales linearly with input size"
decision
Overlap < 30% → use caption instead
vs
Higgs ASR 3 output
"the solution scales linearly with input"
word overlap: 83% — PASS
YouTube caption
"the solution scales linearly with input size"
decision
Overlap ≥ 30% → use ASR output
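The 30% threshold decision can be sketched as below. The exact overlap formula Dhvani uses isn't shown on the slide, so this set-based version is one plausible metric (its percentages won't match the slide's numbers exactly); the function names are illustrative:

```python
def word_overlap(asr_text: str, caption: str) -> float:
    """Fraction of ASR words that also appear in the caption.
    One plausible metric; the exact production formula is an assumption."""
    asr_words = set(asr_text.lower().split())
    cap_words = set(caption.lower().split())
    if not asr_words:
        return 0.0
    return len(asr_words & cap_words) / len(asr_words)


def pick_transcript(asr_text: str, caption: str, threshold: float = 0.30) -> str:
    """Below the overlap threshold, distrust the ASR and use the caption."""
    return asr_text if word_overlap(asr_text, caption) >= threshold else caption
```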
Voice cloning
Capture the speaker. Clone the voice.
Picks the loudest of the first 5 chunks as the voice reference for Higgs TTS 2.5
Loudest chunk →
🎤 Voice fingerprint extracted
Higgs TTS 2.5 voice_reference_file
Dubbed in same voice
RMS threshold
If loudest RMS < 0.01 (near silence), voice clone disabled — TTS falls back to default voice.
Hallucination guard
TTS output > 500KB for a 3s chunk → dropped. Higgs TTS 2.5 can hallucinate long audio on CJK.
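Both guards above fit in a short sketch. The thresholds (loudest-of-first-5, RMS < 0.01, 500 KB cap) are from the slides; the function names and float-PCM chunk representation are assumptions:

```python
import math

VOICE_REF_CANDIDATES = 5       # search only the first 5 chunks
MIN_VOICE_RMS = 0.01           # below this, fall back to default voice
MAX_TTS_BYTES = 500_000        # a 3 s chunk should never need more


def rms(samples: list[float]) -> float:
    """RMS energy of a float PCM buffer in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pick_voice_reference(chunks: list[list[float]]):
    """Return the loudest of the first 5 chunks to use as the
    voice reference, or None (clone disabled) if even the loudest
    is near-silent."""
    candidates = chunks[:VOICE_REF_CANDIDATES]
    if not candidates:
        return None
    loudest = max(candidates, key=rms)
    return loudest if rms(loudest) >= MIN_VOICE_RMS else None


def tts_output_ok(wav_bytes: bytes) -> bool:
    """Guard against the TTS hallucinating long audio (seen on CJK text)."""
    return len(wav_bytes) <= MAX_TTS_BYTES
```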
Performance
Latency per 3-second chunk
End-to-end from audio chunk ready → dubbed audio playing in browser
~400ms
ASR
Higgs ASR 3 (cloud)
~600ms
Translation
GPT-OSS 120B
~3200ms
TTS
Higgs TTS 2.5 (voice clone)
Sequential vs Parallel pipeline total latency (ms per chunk)
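The arithmetic behind the sequential-vs-parallel comparison, using the per-stage numbers above: run sequentially, every chunk pays the sum of all three stages; pipelined, steady-state throughput is bounded only by the slowest stage, since the stages overlap on successive chunks.

```python
# Per-chunk stage latencies from the measurements above (ms).
ASR_MS, TRANSLATE_MS, TTS_MS = 400, 600, 3200

# Sequential: each chunk pays the full sum of stage latencies.
sequential_per_chunk = ASR_MS + TRANSLATE_MS + TTS_MS

# Pipelined: after the first chunk, throughput is bounded by the
# slowest stage, because the three stages overlap across chunks.
pipelined_steady_state = max(ASR_MS, TRANSLATE_MS, TTS_MS)

print(sequential_per_chunk, pipelined_steady_state)  # 4200 3200
```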
Built in 48 hours
Dhvani
ध्वनि — Sanskrit for "sound"
What we built
End-to-end real-time dubbing pipeline
Voice cloning from source speaker
Emotion-preserving TTS speed control
Hallucination filter via caption validation
Fully async 3-queue concurrent pipeline
What's next
More languages — Hindi, Korean, Arabic
Lip-sync video generation
Live stream dubbing (not just on-demand)
Browser extension
Speaker diarization for multi-person videos
Built by
Phani Sai Ram Munipalli
&
Vinay Mokidi