Soundoftext.co.id
Meta has introduced Voicebox, a sophisticated speech generation model that excels at text-to-speech (TTS) synthesis across six languages and demonstrates superior noise removal capabilities.
Trained on over 50,000 hours of audio data, Voicebox outperforms previous TTS models on benchmarks.
It predicts masked sections in audio inputs, allowing tasks like noise removal and cross-lingual style transfer.
Voicebox uses a flow-matching architecture, which distinguishes it from autoregressive models.
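The two ideas above, predicting masked audio spans and training with flow matching rather than autoregressive decoding, can be illustrated with a toy sketch. This is not Meta's implementation: the function names, the 100x80 random "mel" array, and the `sigma_min` value are assumptions chosen for illustration, and the path formula is the straight-line (optimal-transport) form commonly used in the flow-matching literature.

```python
import numpy as np


def make_span_mask(num_frames, start, length):
    """Boolean mask that is True for the frames the model must infill."""
    mask = np.zeros(num_frames, dtype=bool)
    mask[start:start + length] = True
    return mask


def flow_matching_pair(x1, x0, t, sigma_min=1e-5):
    """Return (x_t, target velocity) on a straight path from noise x0 to data x1.

    Assumed path: x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    Its time derivative, the regression target, is x1 - (1 - sigma_min) * x0.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0
    return x_t, target


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked frames only."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))


# Hypothetical stand-in for 100 frames of 80-bin audio features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 80))   # "clean" target features
x0 = rng.normal(size=x1.shape)    # Gaussian noise sample

# Hide frames 30..69; the visible frames act as the audio context.
mask = make_span_mask(100, start=30, length=40)
context = np.where(mask[:, None], 0.0, x1)

# A real model would take (x_t, context, text, t) and predict the velocity;
# here we only build the training pair and the masked loss.
x_t, target = flow_matching_pair(x1, x0, t=0.5)
```

Because the loss is taken over the masked frames, the same setup covers both generating missing speech and re-generating a noisy segment from its surrounding context. All frames are produced in parallel by integrating the learned velocity field, not one token at a time as in an autoregressive model.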
Meta refrains from open-sourcing Voicebox, citing safety concerns.
Although trained on audiobooks in multiple languages rather than on these specific tasks, Voicebox exhibits in-context learning for style transfer and noise removal.
To balance openness and responsibility, Meta shares audio samples and a detailed research paper.
Commentators have debated Meta's decision, noting that the model could potentially be replicated using the abundant training data available from audiobooks, podcasts, and broadcast archives.
For safety, Meta has introduced a classifier that detects synthesized speech, reaffirming its commitment to ethical AI development.