Boson x Eigen AI Hackathon 2026
Dhvani
Real-time YouTube dubbing with voice cloning and emotion preservation.
Every video. Any language. Instantly.
Higgs ASR 3 · GPT-OSS 120B · Higgs TTS 2.5 · Voice Clone
The scale of the problem
YouTube is massive. Dubbing is not.
The numbers tell the story
500hrs
of video uploaded
every single minute
That's 720,000 hours per day
56%
of YouTube content
is English-only
Yet English speakers are only ~17% of the global population
$35/min
average cost of
professional dubbing
A 10-min video costs $350+ to dub manually
The growing gap
Non-English internet users are exploding.
Content supply hasn't caught up.
Internet Users (billions) vs English Content Share on YouTube (%)
Language accessibility gap
Billions of users. Almost no dubbed content.
YouTube content availability vs. global speaker population
% of YouTube content available vs. % of global internet users — by language
The opportunity
Real-time dubbing. No waiting. Lower cost.
Dhvani solves this end-to-end — paste a URL, pick a language, hear it now
<2s
first dubbed chunk
starts playing
3
languages supported
ES · JA · ZH
100%
voice cloned from
original speaker
Live Demo
Dhvani
System design
The full pipeline
Every stage runs on a different service — orchestrated in Python
📺
YouTube
pytubefix + ffmpeg
Download audio
Convert to 16kHz mono WAV
🎙️
ASR
Higgs ASR 3
English transcription
+ caption validation
🌐
Translation
GPT-OSS 120B
Translate to target lang
→ Higgs AST fallback
🔊
TTS
Higgs TTS 2.5
Voice clone + emotion
→ WAV output
🌐
Browser
WebSocket + Web Audio
Timestamp-synced
playback
FastAPI backend  ·  asyncio.gather()  ·  WebSocket streaming  ·  base64 WAV transport
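Dubbed chunks travel from the backend to the browser as base64 WAV over the WebSocket. A minimal framing sketch of that transport, assuming a simple JSON message schema (the field names here are illustrative, not the actual wire format):

```python
import base64
import json


def encode_chunk_message(wav_bytes: bytes, start_s: float, dur_s: float) -> str:
    """Frame one dubbed chunk as a JSON text message for the WebSocket.
    base64 keeps the binary WAV JSON-safe; timestamps ride alongside
    so the browser can schedule playback against the video clock."""
    return json.dumps({
        "audio_b64": base64.b64encode(wav_bytes).decode("ascii"),
        "start_s": start_s,
        "dur_s": dur_s,
    })


def decode_chunk_message(msg: str) -> tuple[bytes, float, float]:
    """Inverse of encode_chunk_message, as the browser side would do it."""
    d = json.loads(msg)
    return base64.b64decode(d["audio_b64"]), d["start_s"], d["dur_s"]
```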
Chunking strategy
Audio sliced into 3-second windows
Fixed-size chunks enable parallel processing and timestamp-accurate playback
Each chunk = 3s × 16kHz × 16-bit mono = ~96KB PCM
Normal speech
Voice reference (loudest of first 5)
Silence (skipped)
Currently dubbing
Silence Detection
RMS < 0.005 → chunk skipped entirely. No wasted API calls.
Min chunk size
Segments < 1600 samples (0.1s) are discarded — too short to transcribe.
Timestamp metadata
Every chunk carries start_s + dur_s → browser syncs audio to video clock.
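The chunking rules above can be sketched in a few lines. This is a simplified illustration, not the production code: the thresholds (RMS < 0.005 for silence, 1,600-sample minimum) come from the slides, while the function names and the in-memory float-PCM representation are assumptions:

```python
import math

SAMPLE_RATE = 16_000           # 16 kHz mono, per the pipeline
CHUNK_SECONDS = 3
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS
SILENCE_RMS = 0.005            # chunks below this RMS are skipped
MIN_SAMPLES = 1_600            # < 0.1 s is too short to transcribe


def rms(samples: list[float]) -> float:
    """Root-mean-square energy of a float PCM buffer in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def chunk_audio(samples: list[float]) -> list[dict]:
    """Slice float PCM into 3 s windows with timestamp metadata,
    dropping silent and too-short segments before any API call."""
    chunks = []
    for i in range(0, len(samples), CHUNK_SAMPLES):
        window = samples[i:i + CHUNK_SAMPLES]
        if len(window) < MIN_SAMPLES:
            continue                      # too short to transcribe
        if rms(window) < SILENCE_RMS:
            continue                      # silence: skip, no wasted call
        chunks.append({
            "samples": window,
            "start_s": i / SAMPLE_RATE,   # browser syncs to video clock
            "dur_s": len(window) / SAMPLE_RATE,
        })
    return chunks
```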
Concurrency
3 stages. Running simultaneously.
While Voice Output plays chunk N → Translation processes N+1 → Transcription processes N+2
Transcription
Translation
Voice Output
→ Browser
Transcription
Translation
Voice Output
Playing
await asyncio.gather(transcription_stage(), translation_stage(), voice_output_stage())  # bounded queues: maxsize=6
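The three-stage overlap can be sketched with bounded `asyncio.Queue`s connecting the stages, so a slow stage applies backpressure instead of buffering unboundedly. The stage names come from the slides; the `fake_*` coroutines are stand-ins for the real Higgs ASR 3 / GPT-OSS 120B / Higgs TTS 2.5 calls:

```python
import asyncio


async def fake_asr(chunk):        # stand-in for Higgs ASR 3
    return f"text-{chunk}"


async def fake_translate(text):   # stand-in for GPT-OSS 120B
    return text.upper()


async def fake_tts(text):         # stand-in for Higgs TTS 2.5
    return text.encode()


async def run_pipeline(chunks):
    """Three stages run concurrently; while TTS renders chunk N,
    translation works on N+1 and transcription on N+2."""
    asr_q: asyncio.Queue = asyncio.Queue(maxsize=6)
    tts_q: asyncio.Queue = asyncio.Queue(maxsize=6)
    out: list[bytes] = []

    async def transcription_stage():
        for chunk in chunks:
            await asr_q.put((chunk, await fake_asr(chunk)))
        await asr_q.put(None)                 # sentinel: stream done

    async def translation_stage():
        while (item := await asr_q.get()) is not None:
            chunk, text = item
            await tts_q.put((chunk, await fake_translate(text)))
        await tts_q.put(None)

    async def voice_output_stage():
        while (item := await tts_q.get()) is not None:
            _chunk, text = item
            out.append(await fake_tts(text))

    await asyncio.gather(transcription_stage(), translation_stage(), voice_output_stage())
    return out
```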
Quality control
ASR hallucination filter
ASR models sometimes invent text that was never spoken. We catch it using YouTube's own captions.
Higgs ASR 3 output
"subscribe to my channel and hit the bell"
word overlap: 22% — FAIL
YouTube caption (ground truth)
"the solution scales linearly with input size"
decision
Overlap < 30% → use caption instead
vs
Higgs ASR 3 output
"the solution scales linearly with input"
word overlap: 83% — PASS
YouTube caption
"the solution scales linearly with input size"
decision
Overlap ≥ 30% → use ASR output
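The 30% threshold decision can be sketched as below. The exact overlap formula Dhvani uses isn't shown on the slide, so this set-based version is one plausible metric (its percentages won't match the slide's numbers exactly); the function names are illustrative:

```python
def word_overlap(asr_text: str, caption: str) -> float:
    """Fraction of ASR words that also appear in the caption.
    One plausible metric; the exact production formula is an assumption."""
    asr_words = set(asr_text.lower().split())
    cap_words = set(caption.lower().split())
    if not asr_words:
        return 0.0
    return len(asr_words & cap_words) / len(asr_words)


def pick_transcript(asr_text: str, caption: str, threshold: float = 0.30) -> str:
    """Below the overlap threshold, distrust the ASR and use the caption."""
    return asr_text if word_overlap(asr_text, caption) >= threshold else caption
```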
Voice cloning
Capture the speaker. Clone the voice.
Picks the loudest of the first 5 chunks as the voice reference for Higgs TTS 2.5
Loudest chunk →
🎤 Voice fingerprint extracted
Higgs TTS 2.5 voice_reference_file
Dubbed in same voice
RMS threshold
If loudest RMS < 0.01 (near silence), voice clone disabled — TTS falls back to default voice.
Hallucination guard
TTS output > 500KB for a 3s chunk → dropped. Higgs TTS 2.5 can hallucinate long audio on CJK.
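Both guards above fit in a short sketch. The thresholds (loudest-of-first-5, RMS < 0.01, 500 KB cap) are from the slides; the function names and float-PCM chunk representation are assumptions:

```python
import math

VOICE_REF_CANDIDATES = 5       # search only the first 5 chunks
MIN_VOICE_RMS = 0.01           # below this, fall back to default voice
MAX_TTS_BYTES = 500_000        # a 3 s chunk should never need more


def rms(samples: list[float]) -> float:
    """RMS energy of a float PCM buffer in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pick_voice_reference(chunks: list[list[float]]):
    """Return the loudest of the first 5 chunks to use as the
    voice reference, or None (clone disabled) if even the loudest
    is near-silent."""
    candidates = chunks[:VOICE_REF_CANDIDATES]
    if not candidates:
        return None
    loudest = max(candidates, key=rms)
    return loudest if rms(loudest) >= MIN_VOICE_RMS else None


def tts_output_ok(wav_bytes: bytes) -> bool:
    """Guard against the TTS hallucinating long audio (seen on CJK text)."""
    return len(wav_bytes) <= MAX_TTS_BYTES
```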
Performance
Latency per 3-second chunk
End-to-end from audio chunk ready → dubbed audio playing in browser
~400ms
ASR
Higgs ASR 3 (cloud)
~600ms
Translation
GPT-OSS 120B
~3200ms
TTS
Higgs TTS 2.5 (voice clone)
Sequential vs Parallel pipeline total latency (ms per chunk)
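The arithmetic behind the sequential-vs-parallel comparison, using the per-stage numbers above: run sequentially, every chunk pays the sum of all three stages; pipelined, steady-state throughput is bounded only by the slowest stage, since the stages overlap on successive chunks.

```python
# Per-chunk stage latencies from the measurements above (ms).
ASR_MS, TRANSLATE_MS, TTS_MS = 400, 600, 3200

# Sequential: each chunk pays the full sum of stage latencies.
sequential_per_chunk = ASR_MS + TRANSLATE_MS + TTS_MS

# Pipelined: after the first chunk, throughput is bounded by the
# slowest stage, because the three stages overlap across chunks.
pipelined_steady_state = max(ASR_MS, TRANSLATE_MS, TTS_MS)

print(sequential_per_chunk, pipelined_steady_state)  # 4200 3200
```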
Built in 48 hours
Dhvani
ध्वनि — Sanskrit for "sound"
What we built
End-to-end real-time dubbing pipeline
Voice cloning from source speaker
Emotion-preserving TTS speed control
Hallucination filter via caption validation
Fully async 3-queue concurrent pipeline
What's next
More languages — Hindi, Korean, Arabic
Lip-sync video generation
Live stream dubbing (not just on-demand)
Browser extension
Speaker diarization for multi-person videos
Built by
Phani Sai Ram Munipalli
&
Vinay Mokidi