CEDEC2021, Japan's largest conference for people involved in computer entertainment development, was held from Tuesday, August 24 to Thursday, August 26. From this event, which featured many notable speakers, we report on the session on the application and practical operation of natural speech synthesis based on deep learning, held on August 24.
The theme of this session was the "natural voice required in games." The speakers explained the history and requirements of speech synthesis technology from the past to the present, as well as methods for actually implementing it.
The speakers were Youichiro Miyake of Square Enix, who has been engaged in the development and research of artificial intelligence in digital games since 2004 and has numerous books and awards to his name, and Yoshinori Kurata of Toshiba Digital Solutions, who, after working on the development of AIBO and the bipedal robot QRIO, is now involved in the planning and development of voice dialogue technology.
First, Mr. Miyake talked about the history of speech synthesis, the demand for it, and the level of quality required.
Square Enix AI&Arts Alchemy, where Mr. Miyake works, carries out daily research and development aimed at entertainment AI that converses through natural language. Fusing game CG technology with AI technology, the company pursues research on character conversation beyond games and on entertainment AI for broad use in the world, striving so that the technology the game industry has cultivated can contribute in many fields.
Turning to the history of speech synthesis, Miyake explained that while the game industry has long managed without speech synthesis, it is becoming an essential technology for future work.
"The speech synthesis has not been used in the game industry. The writer writes the script, the voice actor records, and plays the recorded voice from the time the game started.However, the development of AI technology has changed so that lines and response are being created on the spot. As a result, there is a request to generate audio on the spot, and attention has been paid to voice synthesis.There are, in short, I would like to make a conversation created on the spot with voice. Of course, I can record and use the decided conversations, but I want to pick up the behavior of the user and respond to it, or want to create a new sentence.So, speech synthesis will be an indispensable technology as a future work. "(Mr. Miyake).
In addition, speech synthesis has the advantage of allowing flexible responses, including in terms of cost, at development sites marked by diversity and constant change.
"Development always changes, so after voice recording, I wanted to change this, let me say something like this, and it was not always unusual to get it back in the late development.Depending on the situation, I think that if you could not respond to the extension line, you may have given up. You want to make a flexible character in game development without relying on such recording, so you can speak voice synthesis in the sense of expanding the lines.It is becoming a flow of wanting to introduce it.
Furthermore, by having characters speak about information unique to the user, such as 'you cleared the cave in 4 minutes 53 seconds,' the user feels understood. Instead of merely programmed conversation, this creates the realism and immersion of being recognized and addressed by the characters in that place. Speech synthesis is needed to create that kind of depth in a game. To expand game design, more and more people regard it as a core technology." (Mr. Miyake)
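To make the idea concrete (this sketch is our illustration, not code from the talk), a line can be assembled from player-specific state at runtime and handed to a synthesizer instead of playing back a fixed recording. The GameState structure and the synthesize_speech function are hypothetical placeholders for whatever engine and TTS backend a studio actually uses:

```python
from dataclasses import dataclass

@dataclass
class GameState:
    """Hypothetical per-player state an NPC could reference."""
    player_name: str
    cave_clear_seconds: int

def build_npc_line(state: GameState) -> str:
    # Compose a line on the spot from user-specific information,
    # rather than playing back a pre-recorded fixed message.
    minutes, seconds = divmod(state.cave_clear_seconds, 60)
    return (f"{state.player_name}, you cleared the cave in "
            f"{minutes} minutes {seconds} seconds. Impressive!")

def synthesize_speech(text: str) -> bytes:
    """Placeholder for a real TTS engine call; returns audio data."""
    raise NotImplementedError("wire up your TTS backend here")

line = build_npc_line(GameState(player_name="Traveler", cave_clear_seconds=293))
# audio = synthesize_speech(line)  # would be voiced at runtime
print(line)
```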
On the other hand, Miyake noted that the quality required in the game industry is high, and he expressed the levels required of speech synthesis in five stages of his own devising.
"There is a trend that speech synthesis has recently been incorporated in the game industry, but the quality that the game industry is looking for is probably higher than other industries. The character is simply an announcement or station.There is an aspect that you have to have individuality reality instead of home information. "(Mr. Miyake)
These five stages, Miyake says, represent what the game industry wants. The required level depends on the role: whether the voice belongs to a background NPC or to a character who adventures alongside the player. At level 5, the player feels no discomfort even when the character is with them at all times; in between sit, for example, townspeople, merchants, and innkeepers. The required rank rises with the character's role.
In the last part of his talk, Miyake discussed the current issues in producing speech synthesis and his vision for the near future. This included a way of breaking free from the spell of fixed messages that has bound RPGs.
"If you want to act directly, you can instruct it like this, but in the case of speech synthesis, you will learn the sound by machine learning once, so humans do not always customize.I think that it is necessary to make a sound to make one know -how.
There is also a demand to adjust what has been completed: this voice is certainly natural, but we want to raise the ending, or this character is from the countryside, so we want to shift the intonation a little. In addition, we want to accumulate and reuse what has been recorded once and, if possible, use it across multiple projects. In other words, there is a demand to accumulate the speech synthesis system itself as an asset that can be used for various games, building it as a company-wide system rather than one per title.
Our vision for three to four years from now includes being able to handle such requests easily and, on the AI side, giving NPCs language-generating AI, that is, a conversation generation function. For example, RPG conversations have repeated the same content for nearly 30 years, but we expect conversations to be generated to match the user's ever-changing situation. Recent RPGs are voiced, so speech synthesis enters there. I would like to gradually raise in-game conversation so that speech synthesis can keep up with that level. If characters can think and speak autonomously, speech synthesis will be required for that too. Aiming for this will automatically enrich what has been the root of the RPG, and that is what is being asked of this technology." (Mr. Miyake)
Next, the lecture was handed over to Mr. Kurata of Toshiba Digital Solutions. Content using games and digital characters now calls not for fixed lines but for free dialogue; specifically, it is necessary to master "creating voices," "prosody transfer," "reducing the difference between raw and synthesized audio," and "reusing created expressions." These points were explained alongside live demonstrations, going into the details of the technology and the machine learning behind it.
According to Kurata, speech synthesis has been evolving for about 20 years, and in recent years very high-quality speech synthesis has been announced by various companies. In that evolution, however, the overall structure has not changed much; what has changed in many cases is how the synthesizer and language analysis modules are built.
"A learning method by HMM (Hidden Markov Model, Hidden Markov Model) has been developed from the waveform connection method that connects the phoneme split conveniently, and the synthesizer has been introduced to the generator.It was the HMM method to make the dictionary of the phonetic edition used in machine learning. In addition, since 2013, a deep -planning DNN (deep neural network) method has begun to be developed, and based on a large amount of sound.After learning and adding additional sounds to change the voice, the sound quality may have changed a lot depending on the results of the learning.
With both the HMM and DNN methods, the voice quality itself is now basically on par with a human voice. In instrumental terms the core of the voice nearly matches human performance, but we are at the stage where further evolution beyond that is required." (Mr. Kurata)
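As a rough, assumption-laden sketch of what a DNN-based synthesizer learns (our illustration, not the speakers' system): a network maps per-frame linguistic features to acoustic features, which a vocoder then renders as a waveform. The toy network below, in plain numpy, only shows the shape of that mapping; real systems train deep stacks on large speech corpora.

```python
import numpy as np

# Toy illustration of the DNN idea: map linguistic features
# (phoneme identity, accent, position in phrase, ...) to acoustic
# features (e.g. mel-spectrogram frames) that a vocoder renders.
rng = np.random.default_rng(0)

LINGUISTIC_DIM = 50   # assumed size of the per-frame linguistic feature vector
ACOUSTIC_DIM = 80     # assumed mel-spectrogram dimensionality

# One hidden layer stands in for the deep stack a real system would
# train on a large amount of recorded speech.
W1 = rng.normal(scale=0.1, size=(LINGUISTIC_DIM, 256))
W2 = rng.normal(scale=0.1, size=(256, ACOUSTIC_DIM))

def acoustic_model(linguistic_frames: np.ndarray) -> np.ndarray:
    """Map (T, LINGUISTIC_DIM) linguistic features to (T, ACOUSTIC_DIM)."""
    hidden = np.tanh(linguistic_frames @ W1)
    return hidden @ W2

frames = rng.normal(size=(100, LINGUISTIC_DIM))  # stand-in for real features
mel = acoustic_model(frames)                      # would feed a vocoder
print(mel.shape)  # (100, 80)
```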
Next, the specific production process and its key points were presented. The first step, and the most important in the whole process, is creating the "source of the voice."
"The first step is to create a voice model, a voice dictionary, and the voice of the voice. In what direction, how do you make voice synthesis? In fact.It is not an exaggeration to say that the process will determine the goal of speech synthesis. It is not an exaggeration. What kind of data to record and what kind of voice is collected from and what voice quality.The point of making it affects the last sound quality.
At the same time, it is very important to go into the studio and secure as much source audio as possible. When the recordings come from trained, practiced speakers such as voice actors, announcers, and narrators, the resulting voice dictionaries and voice models tend to turn out well. If an amateur speaks in a vague, unfocused way, the result does not come together neatly.
The machine learning in the third step is something of a black box, with each company applying its own methods and ingenuity. How well the data can be made learnable varies from company to company and changes the final result. After that, the finished voice is produced through each company's own adjustment and polishing. That is the flow of making voice dictionaries and voice models.
This yields one voice. However, voice actors often command many voice qualities; plenty of voice actors are said to have 'seven-colored voices,' but those seven colors cannot all be captured in a single recording. It is very important to settle the direction of the voice when planning it and have the actor speak consistently toward that result. If instead the actor pours their soul into a free performance, the result may not fit well into machine learning. So if you want to make a variety of voices, you have to repeat this process many times, which ultimately drives up cost." (Mr. Kurata)
In speech synthesis, recording audio uniformly in line with the intended voice direction and intonation is important, and even for the same phrase, the synthesized audio is used rather than the raw recording. Kurata illustrated this by playing actual samples.
"The key may be a little different between the raw voice and the synthesized audio, but the voice quality is quite close. It is quite well -formed.I think that the atmosphere is different when listening to things. It is an SSML tag as a means of adding an accent from here, wanting to put a time, and wanting to talk strongly. It also increases readability from SSML.There is also a phonetic string, and the accent is provided by defining the upper comma, or it is defined that it is this symbol to open the space. The point is not adjusted in the original script, so the table.You may think that you have an accent in the sound character string "(Mr. Kurata)
Manual work still remains here, and in practice this adjustment has the drawback of consuming a great deal of working time. That is where the new method, "prosody transfer," comes in.
"There is a limit to what can be done with a phonetic string, and if you adjust it with this, the time will pass in a blink of an eye. There has been a method of imitating the expression that people speak as a new method.We are saying Ryugi -show (Iinryushi) and Ryugi Metastas, but there are functions that capture and analyze the expressions and talk similarly.
Until now, the best result produced by learning was simply generated as-is; this is a new technique for when you want a different reading from that. It has the advantage of giving creators more ways in when they want to realize a particular expression, and of shortening working time." (Mr. Kurata)
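A rough sketch of the analysis half of such a technique (an assumption about the general approach, not Toshiba's actual pipeline): extract the pitch contour of a reference recording, then hand it to the synthesizer as the target intonation. The apply_prosody call at the end is a hypothetical engine hook:

```python
import librosa
import numpy as np

def extract_pitch_contour(path: str) -> np.ndarray:
    """Estimate the F0 contour of a reference utterance."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    # Keep only voiced frames; these carry the intonation to imitate.
    return np.where(voiced_flag, f0, np.nan)

# contour = extract_pitch_contour("reference_take.wav")
# apply_prosody(voice_model, text, contour)  # hypothetical engine hook
```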
Finally, Kurata spoke about future prospects and expectations for speech synthesis.
"In speech synthesis, you can re -use the same expression to another character. For example, it is made for a character A, but you can also convey the character B with the same intonation.
What we have talked about so far still leaves a very large share to manual work, using existing technology. In the age of AI, I think the most important goal is for speech synthesis itself to actively change its expression, choosing expressions on its own and feeding its accumulated past work back into learning. Right now we are standing at the entrance, but I hope that, borrowing the strength of game creators and sound creators, we can keep building up know-how and turn this into something genuinely good to use.
I would be glad if, by making the most of everyone's ideas and passion, we can deliver surprising experiences to game users." (Mr. Kurata)