
Generate Response Audio

Chidi Ewenike edited this page Dec 13, 2020 · 4 revisions

When the user asks the voice assistant a question, the voice assistant must provide an audio response to answer it. The audio responses are pre-generated with the Google Cloud Platform Text-to-Speech API and stored for later use. A JSON file is also generated that maps each answer string to its audio file name. This approach works because the answerable questions are static (they do not change and require no database look-ups) and the voice assistant does not have internet access.
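As a minimal sketch of how such a pre-generated mapping could be used at runtime (the file name `answer_to_file.json` and the fallback answer are assumptions for illustration, not the project's confirmed names):

```python
import json

def load_answer_map(path="response_data/answer_to_file.json"):
    """Load the pre-generated mapping of answer strings to audio file names.
    The path is an assumed location based on the "response_data" directory
    mentioned on this page."""
    with open(path) as f:
        return json.load(f)

def audio_for_answer(answer_map, answer):
    """Return the audio file for an answer string, falling back to the
    "Sorry, I do not know." clip when the answer is not in the map."""
    return answer_map.get(answer, answer_map.get("Sorry, I do not know."))
```

Because every answerable question is static, this lookup is a plain dictionary access with no network or database involved.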

Generating Audio Responses

When new QA pairs are generated and used to train the Rasa model, these changes must be made for the audio responses as well. This section highlights the setup and execution to generate the files needed for audio responses.

Prerequisites

Access to Google Cloud Platform's Text-to-Speech API is required to run this program. Instructions for setting up an account are available on the Google Cloud Platform Text-to-Speech site.

In order to generate the audio correctly, correctly formatted CSVs are required. In practice, any CSV files can be used as long as they follow this format:

  • The first column holds the "Type" of a row and the second column holds the "Detail", the information itself.
  • For an audio response to be generated, the "Type" column of the row must contain the word "UTTER" or "GENERIC".
  • Rows whose "Type" contains "GENERIC" must follow a row of "Type" "GENNAME", which declares that the "GENERIC" details after it belong to the generic type named by that "GENNAME" row.
  • Note that the program strips the first row of the CSV (the "Fields" header row).
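The rules above can be illustrated with a short parsing sketch. The sample rows and the `collect_strings` helper are illustrative assumptions; the actual generator may parse the CSVs differently:

```python
import csv
import io

# Sample rows in the format described above: the first row is the "Fields"
# header that the program strips, "GENNAME" names a generic type, and the
# "GENERIC" rows that follow belong to that type.
SAMPLE = """Type,Detail
UTTER,The ranch is 3200 acres.
GENNAME,REJECT
GENERIC,No
GENERIC,No thanks
"""

def collect_strings(csv_text):
    """Collect utterances and generic phrase lists from a formatted CSV."""
    rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # strip the Fields row
    utterances, generics, current = [], {}, None
    for type_col, detail in rows:
        if "UTTER" in type_col:
            utterances.append(detail)
        elif type_col == "GENNAME":
            current = detail            # following GENERIC rows use this name
            generics[current] = []
        elif "GENERIC" in type_col and current is not None:
            generics[current].append(detail)
    return utterances, generics
```

Every string this pass collects is one piece of audio to synthesize.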

Run Speech Generator

To generate the response audio, the user must provide a path to the Google authentication JSON and a list of correctly formatted CSVs containing the strings for which audio should be generated.

python Speech_Generator.py --json /path/to/auth.json --csv utter.csv stories.csv misc.csv

Optional arguments include the accent and the voice of the speaker. Sample voices and accents, along with their naming conventions, can be found in the Supported voices and languages table.

python Speech_Generator.py --json /path/to/auth.json --csv utter.csv stories.csv misc.csv --accent US --speaker B

The correct accent and speaker names can be found in the "Voice name" column of the aforementioned table. For the "en-AU-Wavenet-C" voice name, the accent is "AU" and the speaker name is "C". Each entry includes sample audio of the speaker's voice. All voices used are "Wavenet" voices rather than "Standard" ones, since "Wavenet" sounds more natural.
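The naming convention above can be sketched as a pair of small helpers. These functions are illustrative (assuming English, Wavenet-only voices as described on this page), not part of the actual script:

```python
def wavenet_voice_name(accent, speaker):
    """Build a Google TTS "Voice name" such as "en-AU-Wavenet-C"
    from an accent code (e.g. "AU") and a speaker letter (e.g. "C")."""
    return f"en-{accent}-Wavenet-{speaker}"

def split_voice_name(name):
    """Recover the (accent, speaker) pair from a full voice name."""
    _, accent, _, speaker = name.split("-")
    return accent, speaker
```

The `--accent US --speaker B` example above would therefore select the "en-US-Wavenet-B" voice.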

Audio files will be stored in the "response_data" directory with the JSON to map answer strings to audio files. The JSON also contains the different types of generic responses and the list of words for each generic response.

answer_to_file = {
    "<<GENERICS>>" : ["REJECT"],
    "<<GEN>>REJECT" : ["No", "No thanks"],
    "The ranch is 3200 acres." : "response_0.wav",
    "Sorry, I do not know." : "response_1.wav"
}
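A consumer of this JSON might resolve generic responses as sketched below, assuming the "<<GENERICS>>" and "<<GEN>>" key prefixes shown above (the helper names are hypothetical):

```python
import random

def generic_phrases(answer_map, generic_type):
    """Return the list of phrasings for a generic type such as "REJECT",
    validating it against the "<<GENERICS>>" registry first."""
    if generic_type not in answer_map.get("<<GENERICS>>", []):
        raise KeyError(f"unknown generic type: {generic_type}")
    return answer_map["<<GEN>>" + generic_type]

def pick_generic(answer_map, generic_type):
    """Pick one phrasing at random, for variety across conversations."""
    return random.choice(generic_phrases(answer_map, generic_type))
```

Listing the generic types under "<<GENERICS>>" lets a consumer enumerate them without scanning every key for the "<<GEN>>" prefix.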