VALL-E: This AI model can copy your voice from just 3 seconds of audio! Here's how it works

Microsoft recently announced VALL-E, a new text-to-speech AI model that can closely mimic a person's voice when given just a three-second audio sample. The developers of VALL-E believe that, combined with other generative AI models such as GPT-3, it could be used for high-quality text-to-speech applications and for speech editing, where a recording of a person is altered simply by editing the text transcript.

According to Microsoft, VALL-E is primarily a "neural codec language model" and is built on EnCodec, an audio codec that Meta introduced in October 2022. Unlike typical text-to-speech methods, which synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and a short acoustic prompt. It analyzes the pitch and intonation of a person's voice, uses EnCodec to break that audio into discrete components (called "tokens"), and then uses what it learned from its training data to predict how that voice would sound speaking new text.
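
To make the "token" idea concrete, here is a minimal sketch of extracting discrete codes from a short clip with Meta's open-source encodec Python package. The file name and bandwidth value are illustrative assumptions, and this is not VALL-E's own code, which Microsoft has not released.

    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    # Load the pretrained 24 kHz EnCodec model and pick a target bandwidth.
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)

    # Load a short voice sample and resample it to the codec's format.
    wav, sr = torchaudio.load("three_second_sample.wav")  # illustrative path
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)

    # Encode the waveform into discrete codes ("tokens").
    with torch.no_grad():
        encoded_frames = model.encode(wav.unsqueeze(0))
    codes = torch.cat([frame_codes for frame_codes, _ in encoded_frames], dim=-1)
    print(codes.shape)  # (batch, num_codebooks, time_steps) of integer tokens

Each integer in that tensor indexes an entry in a learned codebook, and VALL-E is trained to predict sequences of these codes rather than raw waveform samples.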

In this way, the system captures both the sound of a person's voice and the tone of their speech, and can then read out any typed text in that person's voice and speaking style.
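
Because VALL-E has not been released, that final step can only be sketched conceptually. In the pseudocode below, phonemize, CodecLanguageModel and EncodecDecoder are hypothetical stand-ins for the components described above, not real APIs.

    # Hypothetical sketch of the described flow; none of these names
    # correspond to a released library or to Microsoft's actual code.
    def speak_like(sample_codes, text):
        phonemes = phonemize(text)                 # text -> phoneme tokens
        lm = CodecLanguageModel.load_pretrained()  # neural codec language model
        # Condition on the 3-second prompt's acoustic tokens plus the phonemes,
        # then predict acoustic tokens for the new utterance in the same voice.
        new_codes = lm.generate(prompt=sample_codes, phonemes=phonemes)
        return EncodecDecoder.decode(new_codes)    # tokens -> waveform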

Microsoft trained VALL-E's speech synthesis capabilities on Meta's LibriLight audio library, which contains more than 60,000 hours of English speech from over 7,000 speakers, drawn mostly from public-domain LibriVox audiobooks. For VALL-E to produce a good result, the voice in the three-second sample must closely resemble a voice in its training data.

Microsoft has not made the VALL-E code available to others, in order to prevent the model from being misused. The researchers appear to be aware of the potential social harm this technology could cause.
