VOCALOID:AI (ボーカロイド:エーアイ) is a vocal synthesis technology that was first demonstrated using the voice of the late Misora Hibari in a live performance, shown as part of a documentary broadcast by NHK. The singer's samples were provided by Nippon Columbia.

This is the third VOCALOID-related project to use a deceased singer as its basis, Ueki-loid and hide being the previous two.

It was first announced with VOCALOID3.

About

VOCALOID:AI is a "live" performance technology that allows VOCALOID to be used as a "live" singer. While "VOCALOID" covers all of Yamaha's singing synthesis technology, "VOCALOID:AI" refers specifically to the branch of that technology developed around AI.

Creation process

Alongside the voice, a 3D image was used to represent Misora Hibari.

As its name suggests, the software uses artificial intelligence to achieve its results. The technology was developed to run alongside 4K/3D imagery of the singer. Through a process known as "deep learning", it can learn the traits of a singer over time.

GobouP noticed that the editor version used was VOCALOID3 or VOCALOID4 and that the voice being adapted by VOCALOID:AI was VY1.[1] An ITmedia article noted that no new samples could be recorded for Misora Hibari, due to her death in 1989.[2] Since new samples could not be recorded, a different approach was adopted: a DNN (deep neural network, a part of AI deep-learning processing) was used instead. This meant there was no need to cut samples and feed them into VOCALOID.[2]
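
Yamaha has not published the architecture it used, but the general idea of replacing sample concatenation with a neural network can be sketched. The following is a minimal, hypothetical example in PyTorch: a small network learns a mapping from per-frame score features to acoustic features, standing in for a DNN trained on the singer's archived recordings. All names, dimensions, and feature choices are assumptions for illustration, not details of VOCALOID:AI.

```python
# Hypothetical sketch of a DNN acoustic model for singing synthesis.
# This is NOT Yamaha's VOCALOID:AI implementation; the feature choices
# and dimensions are illustrative assumptions only.
import torch
import torch.nn as nn

class ScoreToAcousticNet(nn.Module):
    """Maps per-frame score features (pitch, phoneme, timing) to acoustic
    features (e.g. mel-spectrogram frames) that a vocoder could render."""
    def __init__(self, score_dim=64, acoustic_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(score_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, score_frames):
        # (time, score_dim) -> (time, acoustic_dim)
        return self.net(score_frames)

model = ScoreToAcousticNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy training step on random tensors standing in for (score, recording)
# pairs extracted from the singer's archived performances.
score = torch.randn(200, 64)    # 200 frames of score features
target = torch.randn(200, 80)   # matching acoustic frames from a recording
loss = nn.functional.mse_loss(model(score), target)
loss.backward()
optimizer.step()
print(f"toy training loss: {loss.item():.4f}")
```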

Note: for comparison, to make Ueki-loid, Ueki's son Kouichi provided any missing information where the data taken from his father's vocal performances was incomplete. This included adapting the son's vocal samples to sound more like his father.[3][4]

The AI learns the traits of the vocalist and mimics them to give a one-off performance, as a real singer would in a live performance. To use this technology, data must be collected in advance, with bad data filtered out of the results. The voice created is the result of the AI learning the singer's timbre and singing style; it even picks up on the singer's nuances.
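
The article does not say how unusable material was screened out; a very simple automated filter can illustrate the idea of discarding bad data before training. The sketch below is hypothetical; the signal-to-noise estimate and the thresholds are assumptions, not Yamaha's actual criteria.

```python
# Hypothetical pre-training filter: drop clips that are too short or too
# noisy before they reach the learning stage. The SNR estimate and the
# thresholds are illustrative assumptions, not Yamaha's actual criteria.
import numpy as np

def estimate_snr_db(signal, frame=1024, floor=1e-6):
    """Crude SNR: energy of the loudest frames vs. the quietest frames
    (the quiet gaps between phrases are assumed to hold the noise floor)."""
    frames = np.array_split(signal, max(1, len(signal) // frame))
    energies = np.array([float(np.mean(f ** 2)) for f in frames])
    loud = max(np.percentile(energies, 90), floor)
    quiet = max(np.percentile(energies, 10), floor)
    return 10.0 * np.log10(loud / quiet)

def filter_clips(clips, sample_rate=44100, min_seconds=1.0, min_snr_db=15.0):
    """Keep only clips that are long enough and clean enough to train on."""
    return [c for c in clips
            if len(c) / sample_rate >= min_seconds
            and estimate_snr_db(c) >= min_snr_db]

# Toy usage: a "sung phrase plus quiet gap" clip passes, pure hiss does not.
rng = np.random.default_rng(0)
sr = 44100
t = np.arange(sr) / sr
phrase = 0.5 * np.sin(2 * np.pi * 440 * t)     # one second of "singing"
gap = 0.001 * rng.standard_normal(sr)          # one second of room noise
clean = np.concatenate([phrase, gap])
hiss = 0.05 * rng.standard_normal(2 * sr)      # tape hiss, no usable vocal
print(len(filter_clips([clean, hiss])), "clip(s) kept")
```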

The deep-learning process took place over time, but it was noted that the basics of the voice could be learned within a few hours even without the GPU. Also, though the technology is listed as "VOCALOID:AI", Yamaha was not sure whether it would become its own product; it carried the "VOCALOID" name while still in an early prototype stage.[2]

Many of the results are based on feedback from those familiar with the vocalist, as well as trial and error across different models.[5]

DNN is a technology that has gained popularity since 2013. It gained attention with the introduction of “Sinsy” in 2016, and Microsoft Japan released its own DNN-based “VoiceText”. DNN is currently the most significant development in speech synthesis technology.[5]

Problems with the project

In their interview with ITmedia, Yamaha's engineers described the decades-old recordings themselves as a form of "bottleneck" restricting the technology going forward.

One of the drawbacks of the original process was that Misora Hibari's vocals existed only on analog tape recordings. This made things difficult, as vocal effects originally had to be applied at the beginning, onto the tapes themselves, which limited how the sound engineering could work, compared to modern vocal processing where effects can be added later. In short, the recordings of Misora Hibari came with vocal effects already applied to the tapes. However, digital technology introduced during the project later allowed these vocal effects to be removed entirely, and her later recordings were much clearer to begin with.[2]

Another issue is that her voice at her debut was different from how it sounded later in her career. She was said to have "seven colors of voice", meaning she could adapt her voice in different ways to fit a song, which also led to inconsistency within her recordings; she sounded different singing jazz than she did performing enka.[2]

These variations caused problems: if all the results were simply pasted together, the vocal became difficult to hear. The machine had to keep the vocal performances separate from each other, and the AI had to be programmed to use one set of data over another when certain conditions were met. This included separating the earlier analog recordings from the later digital ones. In this way it could make an enka song that sounds like a "70s" recording, but it could equally make her vocal sound more like her later recordings.[5]
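
The article does not describe how this conditional selection was implemented. One common way to let a single model switch between bodies of data is to feed it a condition flag alongside the musical input; the sketch below illustrates that idea in a hypothetical form, not Yamaha's actual design.

```python
# Hypothetical era conditioning: one network, with a one-hot "recording era"
# flag appended to the score features so it can favour one body of data
# (early analog) or the other (later digital). Labels and sizes are made up.
import torch
import torch.nn as nn

ERAS = {"early_analog": 0, "later_digital": 1}

class ConditionedSinger(nn.Module):
    def __init__(self, score_dim=64, acoustic_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(score_dim + len(ERAS), hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, score_frames, era):
        # Append the same era flag to every frame of the phrase.
        flag = torch.zeros(score_frames.shape[0], len(ERAS))
        flag[:, ERAS[era]] = 1.0
        return self.net(torch.cat([score_frames, flag], dim=-1))

model = ConditionedSinger()
score = torch.randn(100, 64)                 # one phrase of score features
as_70s = model(score, "early_analog")        # rendered with the older colour
as_later = model(score, "later_digital")     # rendered with the later colour
print(as_70s.shape, as_later.shape)
```

Keeping the eras as explicit conditions is what would let the same phrase come out sounding like a "70s" recording or a later one, as described above.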

The next problem is that the machine itself can make mistakes and use the wrong recordings due to its own limitations, which is uncomfortable for the listener. DNN, however, can sort this out to a higher degree of accuracy and is less likely to cause such problems. In the music-making process, a change in tempo causes a ripple effect that affects how the lyrics of the entire performance act and sound; mastering this gives the vocal a more human-like quality. The DNN itself has bolstered the quality of the sound and word formation.[5]
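
The ripple effect described here is, at its simplest, a timing dependency: change the tempo and every later syllable starts at a different moment and lasts a different length of time. A minimal, hypothetical illustration in plain Python (not Yamaha's engine):

```python
# Minimal illustration of the "ripple effect": changing the tempo shifts the
# onset of every later syllable. The phrase and tempos are made-up values.
from dataclasses import dataclass

@dataclass
class Note:
    lyric: str
    beats: float   # duration in beats; the onset in seconds depends on tempo

def onsets_in_seconds(notes, bpm):
    """Return (lyric, onset) pairs for a phrase at a given tempo."""
    seconds_per_beat = 60.0 / bpm
    t, out = 0.0, []
    for note in notes:
        out.append((note.lyric, round(t, 3)))
        t += note.beats * seconds_per_beat
    return out

phrase = [Note("ka", 1.0), Note("wa", 0.5), Note("no", 0.5), Note("na", 2.0)]
print(onsets_in_seconds(phrase, 120))   # original tempo
print(onsets_in_seconds(phrase, 90))    # slower tempo: every later onset moves
```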

The resulting performance

The technology allows a virtual singer powered by it to perform live without the need to pre-render the results, as the AI is capable of learning as it goes along. The AI bases its results only on the highest-quality data produced. The process was done not to fool listeners, but to draw the same reaction from the performance that Misora Hibari herself once drew.[5]

The technology is currently evolving quickly according to Yamaha.[6]

One of the show's own questions was whether the vocal performance could give the same effect as the singer herself once had: could AI move an audience?[7] The performance drew a mixed reaction from its audience.[8] The technology is very realistic and produces an effect close to having a singer inside a machine. There is very little robotic quality in the singing results, aside from normal VOCALOID engine restrictions, which include VOCALOID's issues with weak consonants and machine-like qualities.[2] As commentators since the showing have noted, it is almost as if the actual singer herself is in the machine. However, the performance lacked the emotions of an actual human being.

As a result of the DNN applied in this demonstration, the sound engineers have noted they have much to discuss with Yamaha about applying it elsewhere in voice synthesis processing. It currently produces the most impressive results; it is not just speech synthesis, but a remarkable form of it.[5]

References
