Alongside the buzz around ChatGPT, Microsoft has other promising artificial intelligence projects, including the text-to-speech model VALL-E. Its biggest selling point is that, given only the target text and a 3-second voice sample, the model can produce speech that closely resembles the sampled voice. VALL-E is still at an early training stage, but the English speech training data provided by the development team already totals 60,000 hours.
Microsoft's development team said it trains VALL-E, a neural codec language model, on the discrete codes of an existing neural audio codec model, and treats text-to-speech as a conditional language modeling task. Given the text input and the 3-second voice prompt, VALL-E generates discrete audio codec codes that correspond to both the text and the target voice.
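To make the idea of "text-to-speech as conditional language modeling" concrete, here is a minimal toy sketch of the generation loop. It is not Microsoft's implementation: the scoring function `toy_next_token_scores`, the codebook size, the end-of-sequence token and the single stream of codes are all assumptions made for illustration, and the "model" is just a deterministic pseudo-random stand-in for a trained network.

```python
import random

CODEBOOK_SIZE = 1024   # assumed number of discrete codes in the audio codec
EOS = CODEBOOK_SIZE    # hypothetical end-of-sequence token id


def toy_next_token_scores(text_tokens, audio_tokens):
    """Hypothetical stand-in for a trained language model.

    Returns one score per codec code plus EOS, conditioned on the text and on
    every audio token so far (voice prompt + generated codes). A real model
    would be a neural network; this one only derives a seed from the context.
    """
    seed = sum(text_tokens) * 1_000_003 + sum(audio_tokens) * 31 + len(audio_tokens)
    rng = random.Random(seed)
    return [rng.random() for _ in range(CODEBOOK_SIZE + 1)]


def generate_codec_tokens(text_tokens, prompt_audio_tokens, max_new_tokens=20):
    """Autoregressively extend the voice prompt with new codec codes."""
    audio_tokens = list(prompt_audio_tokens)  # copy so the prompt is not mutated
    generated = []
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(text_tokens, audio_tokens)
        # Greedy decoding: pick the highest-scoring code.
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == EOS:
            break
        generated.append(next_token)
        audio_tokens.append(next_token)
    return generated


if __name__ == "__main__":
    text = [5, 9, 2]          # placeholder ids for the target text
    prompt = [17, 300, 42]    # placeholder codes for the 3-second voice sample
    print(generate_codec_tokens(text, prompt, max_new_tokens=10))
```

In a real system, the generated codes would then be passed to the audio codec's decoder to reconstruct a waveform; that step is omitted here.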
In terms of speech naturalness and speaker similarity, Microsoft said that VALL-E outperforms the existing state-of-the-art model and can preserve the speaker's emotion and acoustic environment. There is still room for improvement, however: some words are not pronounced clearly, and the model cannot imitate accented voices. The development team believes that VALL-E could be used directly in a range of speech synthesis applications in the future, including zero-shot text-to-speech and speech editing, or combined with artificial intelligence models such as GPT-3 to generate more content.
Source of information and pictures: Ars Technica