According to Microsoft, VALL-E is primarily a “neural codec language model” and is based on EnCodec, which was introduced by Meta in October 2022. Rather than synthesizing speech by directly manipulating waveforms, as conventional text-to-speech systems do, VALL-E generates discrete audio codec codes from text and a short acoustic prompt. It captures the pitch and intonation of a person’s voice, using EnCodec to break the audio into discrete components (called ‘tokens’) that it matches against its training data.
In this way, the system learns both a person’s voice and the tone of their speech, and can then read any typed text aloud in that person’s voice and speaking style.
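The tokenization step above relies on EnCodec’s residual vector quantization (RVQ): each audio frame’s embedding is matched to the nearest entry in a codebook, and the leftover residual is quantized again by further codebooks, yielding a small stack of discrete token indices per frame. The following is a minimal sketch of that idea with toy random codebooks, not the real EnCodec model or its API:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: quantize a frame with a stack of
    codebooks, each one encoding the residual left by the previous."""
    codes = []
    residual = frame.astype(float)
    for book in codebooks:
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(book - residual, axis=1)))
        codes.append(idx)
        residual = residual - book[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the frame by summing the chosen codewords."""
    return sum(book[idx] for idx, book in zip(codes, codebooks))

rng = np.random.default_rng(0)
dim, book_size, n_books = 8, 16, 4   # toy sizes, not EnCodec's
codebooks = [rng.normal(size=(book_size, dim)) for _ in range(n_books)]

frame = rng.normal(size=dim)          # stand-in for one audio-frame embedding
codes = rvq_encode(frame, codebooks)  # discrete token indices, one per codebook
approx = rvq_decode(codes, codebooks)
```

A language model like VALL-E then predicts sequences of such token indices from text, and a decoder turns the predicted tokens back into a waveform.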
Microsoft trained VALL-E’s speech synthesis capabilities using Meta’s LibriLight audio library, which contains over 60,000 hours of English speech from more than 7,000 speakers, drawn primarily from public-domain audiobooks on LibriVox. For VALL-E to produce a good result, the voice in the three-second sample must resemble a voice present in its training data.
Microsoft has not made the VALL-E code available to others, to prevent the technology from being misused. It seems the researchers are aware of the potential social harm this technology could cause.