Home
Hark Visualizer is a web application that performs speech recognition and sound localization on 8-channel audio files (FLAC or WAV). It does this by sending the uploaded audio file to Hark SaaS by Honda Research, which performs the localization and produces a separated audio file for each utterance in the original file. These utterance audio files are then sent to the Bing Speech API for transcription, and the results are presented to the user in real time over the WebSocket protocol from a Tornado server running in EC2. Download the test.wav audio file from the master branch and upload it to Hark Visualizer to see what it can do!
The visualizations consist of two charts and a text transcription box. The Duration in Seconds per Azimuth chart represents how many seconds of speech were detected at each azimuth; essentially, sound localization. The Duration in Seconds per Voice chart represents the aggregate time of all utterances detected for each speaker. Transcription represents the speech recognition results from the Bing Speech API, including transcription errors, which are represented by 'Inaudible'. These are typically sounds that were identified as speech by Hark SaaS and sent to Bing for recognition, but were actually background noise. Each of the representations on this page loads and updates dynamically as the results become available to the backend server. Each chart is interactive, with the ability for the user to cross-filter the results by clicking on an element in the chart (a slice in the pie chart, or a bar in the bar chart).
The Hark Visualizer backend was written primarily in Python with the Tornado web server framework. Tornado allows for asynchronous web requests and web sockets, and has support for concurrency and pipelining of asynchronous tasks. The main function in the Tornado server is the application IOLoop, which is where I've included the logic for sending results to the user's browser asynchronously. HTTP requests sent to the Tornado server are served by an HTTP server on port 80, and web socket connections and data are handled on the same port.

When a user uploads an audio file, a POST request is sent to the server with the specified audio file, and within this POST request is an asynchronous call to upload the audio file. Because this call uses Tornado's built-in support for coroutines, the main event loop does not have to wait for the upload to complete. The main event loop then proceeds to serve the user the visualization HTML page, which, when loaded, initializes a web socket with the backend server. Once this web socket is established, the main event loop polls Hark for analysis results of the previously uploaded audio file and sends them to the browser as they become available. Only once all the analysis results have been served and Hark marks the analysis session as complete do the separated audio files become available from Hark; any call for these files before that time returns a 404 from Hark. Transcription therefore cannot be done on the separated audio files until Hark has completed the analysis. By that time, the browser has already received the rest of the data, and the server proceeds to download each separated audio file, send it to the Bing Speech API for recognition, and return the result to the browser.
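A minimal sketch of that wiring, assuming handler and route names invented here for illustration (only the /websocket path is taken from the page; everything else is an assumption):

import tornado.ioloop
import tornado.web
import tornado.websocket

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # GET requests are served directly on the main event loop
        self.write('upload form is rendered from the templates directory')

class ResultSocket(tornado.websocket.WebSocketHandler):
    def open(self):
        # visualization.html connects here; Hark and Bing results are pushed through this socket
        pass

if __name__ == '__main__':
    # HTTP requests and websocket traffic share the same port (80)
    application = tornado.web.Application([
        (r'/', MainHandler),
        (r'/websocket', ResultSocket),
    ])
    application.listen(80)
    tornado.ioloop.IOLoop.instance().start()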
STAGING_AREA = '/tmp/'
STATIC_PATH = 'static'
HTML_TEMPLATE_PATH = 'templates'
LISTEN_PORT = 80
LANGUAGE = 'ja-JP'
- STAGING_AREA is the working space for processing audio files.
- STATIC_PATH is the location of the JavaScript and CSS files to be served.
- HTML_TEMPLATE_PATH is the location of the HTML files to be rendered by the Tornado web server.
- LISTEN_PORT is the port on which the HTTP server and websocket listen.
- LANGUAGE is the locale used by the Bing Speech API for speech recognition.
settings = {
    'static_path': os.path.join(os.path.dirname(__file__), STATIC_PATH),
    'template_path': os.path.join(os.path.dirname(__file__), HTML_TEMPLATE_PATH)
}
These settings are passed to the Tornado application instance to map the static file and template directories.
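Roughly speaking, they get unpacked into the Application constructor alongside the route table, something like this (handlers stands for the route table, as in the sketch above):

application = tornado.web.Application(
    handlers,    # the route table, e.g. the '/' and '/websocket' routes
    **settings   # static_path and template_path from the dict above
)
application.listen(LISTEN_PORT)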
default_hark_config = {
'processType': 'batch',
'params': {
'numSounds': 2,
'roomName': 'sample room',
'micName': 'dome',
'thresh': 21
},
'sources': [
{'from': 0, 'to': 180},
{'from': -180, 'to': 0},
]
}
This is the default configuration metadata that the Hark session is initialized with. It presumes the uploaded audio file is an 8-channel file with two unique speakers.
The handler for HTTP requests sent to the server. GET requests are asynchronous and can be handled concurrently. The POST request handles the upload of the audio file in a subprocess via a coroutine, which is non-blocking; however, the web socket writes Hark and speech recognition data back to the browser on the same port, so HTTP requests may be queued when the port is very busy. Future work is to set up an Nginx server behind a load balancer, and to reduce contention for the port.
def async_upload(file):
A function called asynchronously for non-blocking uploads
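The POST handler that invokes it might look roughly like the sketch below; the form field name, the executor, and the template name are assumptions, not taken from the source:

from concurrent.futures import ProcessPoolExecutor

from tornado import gen
import tornado.web

# Worker pool for running the upload outside the main process (illustrative)
executor = ProcessPoolExecutor(max_workers=2)

class UploadHandler(tornado.web.RequestHandler):
    @gen.coroutine
    def post(self):
        # Pull the uploaded audio out of the multipart form body
        upload = self.request.files['file'][0]
        # Hand the Hark upload off to a worker process; yielding the future
        # keeps the ioloop free to serve other requests in the meantime
        yield executor.submit(async_upload, upload)
        # Serve the visualization page, which opens the websocket on load
        self.render('visualization.html')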
A simple wrapper around PyHarkSaas to inject logging and additional logic
A wrapper around the Speech Recognition module to inject additional logic. Speech Recognition is a module that supports multiple recognition APIs. Hark Visualizer uses the Bing Speech API, which is backed by an Azure instance hosting the API in the West US region (at the time, the only region where it was available).
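Under the hood, that likely reduces to something like the following use of the SpeechRecognition package; the key variable and the error handling are illustrative, not taken from the source:

import speech_recognition as sr

BING_API_KEY = 'your-azure-key-here'  # assumption: how the key is supplied is not shown

def translate(file_name, language=LANGUAGE):
    recognizer = sr.Recognizer()
    # Load the separated FLAC utterance that was downloaded from Hark
    with sr.AudioFile(file_name) as source:
        audio = recognizer.record(source)
    try:
        # Send the clip to the Bing Speech API for recognition in the configured locale
        return recognizer.recognize_bing(audio, key=BING_API_KEY, language=language)
    except sr.UnknownValueError:
        # Hark flagged it as speech, but Bing could not understand it
        return 'Inaudible'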
This is where the main websocket work is done. A websocket is initiated in JavaScript by the browser when the user navigates to visualization.html:
// Connect to the remote websocket server
var connection = new WebSocket("ws://harkvisualizer.com/websocket");
This triggers the Tornado web server to send the analysis results from Hark to the browser via this socket:
def open(self):
    log.info('Web socket connection established')
    # Do not hold packets for bandwidth optimization
    self.set_nodelay(True)
    # ioloop waits one second before attempting to send data
    tornado.ioloop.IOLoop.instance().add_timeout(timedelta(seconds=1),
                                                 self.send_data)
send_data is called from the socket's open method. It initiates polling Hark for results and sending them to the browser. There are two main states associated with this (which would be better arranged as a state machine than the if/else statements below). The first state occurs the first time it is called by open(). In this state, the main event loop retrieves results from Hark every second and memoizes any new results. This state completes when Hark indicates processing is complete, at which point the second state is entered. In the second state, the audio file for each source ID in the original results is downloaded, one at a time, and sent to the Bing Speech API to be transcribed. The result of each transcription is sent to the browser as soon as it completes. Once all audio files have been transcribed, the web socket is closed.
def send_data(self, utterances_memo = []):
    if hark.client.getSessionID():
        results = hark.client.getResults()
        utterances = results['context']
        # If result contains more utterances than memo
        if len(utterances) > len(utterances_memo):
            # Must iterate since new utterances
            # could be anywhere in the result
            for utterance in utterances:
                utterance_id = utterance['srcID']
                # If utterance is new
                if utterance_id not in utterances_memo:
                    # Memoize the srcID
                    utterances_memo.append(utterance_id)
                    self.write_message(json.dumps(utterance))
                    log.info("Utterance %d written to socket", utterance_id)
        if hark.client.isFinished():
            # If we have all the utterances, transcribe, then close the socket
            if sum(results['scene']['numSounds'].values()) == len(utterances_memo):
                for srcID in range(len(utterances_memo)):
                    random_string = ''.join(choice(ascii_uppercase) for i in range(10))
                    file_name = '{0}{1}_part{2}.flac'.format(STAGING_AREA, random_string, srcID)
                    hark.get_audio(srcID, file_name)
                    transcription = speech.translate(file_name)
                    utterance = utterances[srcID]
                    seconds, milliseconds = divmod(utterance['startTimeMs'], 1000)
                    minutes, seconds = divmod(seconds, 60)
                    self.write_message(json.dumps(
                        '{0} at ({1}:{2}:{3}):'.format(utterance['guid'], minutes, seconds, milliseconds)))
                    self.write_message(json.dumps(transcription, ensure_ascii=False))
                del utterances_memo[:]
                self.close()
        else:
            tornado.ioloop.IOLoop.instance().add_timeout(timedelta(seconds=1), self.send_data)
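As noted above, the two phases could be made explicit rather than inferred from nested conditionals. One possible shape, purely illustrative: poll_hark and transcribe_all are hypothetical helpers that would hold the bodies of the two branches above, and self.state is an attribute that would be initialized in open().

# Possible explicit states for the websocket sender (illustrative sketch)
POLLING, TRANSCRIBING, DONE = range(3)

def send_data(self):
    if self.state == POLLING:
        # Hypothetical helper: forward any new utterances to the browser
        self.poll_hark()
        if hark.client.isFinished():
            self.state = TRANSCRIBING
    if self.state == TRANSCRIBING:
        # Hypothetical helper: download each separated file, run Bing recognition, push results
        self.transcribe_all()
        self.state = DONE
    if self.state == DONE:
        self.close()
    else:
        # Still polling or transcribing: check again in a second
        tornado.ioloop.IOLoop.instance().add_timeout(timedelta(seconds=1), self.send_data)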
### Amazon AWS
Used for hosting the web server in Japan.
### Tornado
Used as the web server for serving the webapp and delivering data to the browser via websockets.
### Microsoft Cognitive Services/Azure
Used for hosting the Bing Speech API server instance.
### Hark SaaS
Used for analyzing the audio files.
### Speech Recognition
Used for transcribing the audio files via a wrapper around the Bing Speech API, as the Google Speech API is not available anymore.
### d3.js
Used for creating real-time data visualizations in the browser.
### crossfilter.js
Used for n-dimensional filtering of multivariate datasets across D3 charts.
### c3.js
A wrapper around D3.js for building charts quickly.