The visualization consists of two charts and a text transcription box. The Utterances at Azimuth chart plots every utterance across the range of azimuths detected in the audio file. The Speech in Seconds chart shows the aggregate speaking time of all utterances detected for each speaker. The Transcription box shows the speech recognition results from Bing Speech API, including transcription errors, which are marked 'Inaudible'. These are typically sounds that Hark SaaS identified as speech and sent to Bing for recognition, but that were actually background noise. Each representation on this page loads and updates dynamically as results become available to the backend server. Each chart is interactive: the user can cross-filter the results by clicking on an element in a chart (a slice in the pie chart, or a bar in the bar chart).
The Hark Visualizer backend is written primarily in Python with the Tornado web server framework. Tornado supports asynchronous web requests and web sockets, along with concurrency and pipelining of asynchronous tasks. The main function in the Tornado server runs the application ioloop, which is where I've included the logic for sending results to the user's browser asynchronously. HTTP requests sent to the Tornado server are served by an HTTP server on port 80, and the web socket connections and data are handled on the same port.

When a user uploads an audio file, the browser issues a POST request to the server with the specified audio file, and within this POST request is an asynchronous call to upload the file. Because this call uses Tornado's built-in support for coroutines, the main event loop does not have to wait for the upload to complete. The main event loop then proceeds to serve the user the visualization HTML page, which on load initializes a web socket with the backend server. Once this web socket is established, the main event loop polls Hark for analysis results of the previously uploaded audio file and sends them to the browser as they become available.

Only after all the analysis results have been served and Hark marks the analysis session as complete do the separated audio files become available from Hark; any request for these files before that time returns a 404 from Hark. Transcription therefore cannot be done on the separated audio files until Hark has completed the analysis. By that time, the browser has already received the rest of the data, and the server proceeds to download each separated audio file, send it to Bing Speech API for recognition, and return the result to the browser.
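As a rough sketch of this wiring (the handler, route, and template names below are assumptions for illustration, not the project's actual source), the upload page and the websocket endpoint are registered on the same Tornado application and served from port 80 by the ioloop:

```python
import os

import tornado.ioloop
import tornado.web
import tornado.websocket

# Minimal stand-ins for the real handlers described further down this page
class UploadHandler(tornado.web.RequestHandler):
    def get(self):
        self.render('upload.html')  # assumed template name

class WebSocketHandler(tornado.websocket.WebSocketHandler):
    def open(self):
        pass  # the real handler pushes analysis results from here

settings = {
    'static_path': os.path.join(os.path.dirname(__file__), 'static'),
    'template_path': os.path.join(os.path.dirname(__file__), 'templates'),
}

application = tornado.web.Application([
    (r'/', UploadHandler),              # serves the upload page and accepts the POST
    (r'/websocket', WebSocketHandler),  # pushes Hark and transcription data back
], **settings)

if __name__ == '__main__':
    application.listen(80)  # HTTP and websocket traffic share port 80
    tornado.ioloop.IOLoop.instance().start()
```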
```python
STAGING_AREA = '/tmp/'
STATIC_PATH = 'static'
HTML_TEMPLATE_PATH = 'templates'
LISTEN_PORT = 80
LANGUAGE = 'ja-JP'
```
- STAGING_AREA is the working space for processing audio files.
- STATIC_PATH is the location of the JavaScript and CSS files to be served.
- HTML_TEMPLATE_PATH is the location of the HTML files to be rendered by the Tornado web server.
- LISTEN_PORT is the port on which the HTTP server and websocket listen.
- LANGUAGE is the locale used by Bing Speech API for speech recognition.
```python
settings = {
    'static_path': os.path.join(os.path.dirname(__file__), STATIC_PATH),
    'template_path': os.path.join(os.path.dirname(__file__), HTML_TEMPLATE_PATH)
}
```
These settings are passed to the Tornado application instance to map the static file and template locations.
```python
default_hark_config = {
    'processType': 'batch',
    'params': {
        'numSounds': 2,
        'roomName': 'sample room',
        'micName': 'dome',
        'thresh': 21
    },
    'sources': [
        {'from': 0, 'to': 180},
        {'from': -180, 'to': 0},
    ]
}
```
This is the default configuration metadata that the Hark session is initialized with. It presumes the uploaded audio file is an 8-channel file with two unique speakers.
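For illustration only (this variant does not appear in the project), a recording with four speakers seated in separate quadrants could presumably be described by raising numSounds and narrowing the sources ranges:

```python
# Hypothetical variant: four speakers, one per quadrant of the azimuth range
four_speaker_config = {
    'processType': 'batch',
    'params': {
        'numSounds': 4,
        'roomName': 'sample room',
        'micName': 'dome',
        'thresh': 21
    },
    'sources': [
        {'from': 0, 'to': 90},
        {'from': 90, 'to': 180},
        {'from': -180, 'to': -90},
        {'from': -90, 'to': 0},
    ]
}
```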
The handler for HTTP requests sent to the server. GET requests are asynchronous and can be handled concurrently. The POST request handles the upload of the audio file in a subprocess via coroutine, which is non-blocking; however, because the web socket writes Hark and speech recognition data back to the browser on the same port, HTTP requests may be queued when the port is very busy. Future work is to set up an Nginx server behind a load balancer to reduce contention for the port.
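A minimal sketch of such a handler (fleshing out the UploadHandler stub from the wiring sketch above) might look like this; the form-field and template names are assumptions, and async_upload is the helper defined just below:

```python
import tornado.gen
import tornado.web

class UploadHandler(tornado.web.RequestHandler):
    def get(self):
        # GET requests simply render pages and are cheap to serve
        self.render('upload.html')  # assumed template name

    @tornado.gen.coroutine
    def post(self):
        # 'file' is an assumed form-field name for the uploaded audio
        uploaded = self.request.files['file'][0]
        # The ioloop keeps serving other requests while the upload runs
        yield async_upload(uploaded)
        # Serve the visualization page, which opens the websocket on load
        self.render('visualization.html')
```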
```python
def async_upload(file):
```

A function called asynchronously for non-blocking uploads.
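The body of this function is not reproduced on this page. One plausible shape, assuming a thread pool keeps the file write and the hand-off to the Hark SaaS wrapper off the ioloop (the executor, the staged file naming, and the hark.upload call are all assumptions; STAGING_AREA and hark refer to the constant and wrapper described elsewhere on this page):

```python
from concurrent.futures import ThreadPoolExecutor

from tornado import gen

# Small pool so blocking work never stalls the ioloop (size is an assumption)
_executor = ThreadPoolExecutor(max_workers=4)

@gen.coroutine
def async_upload(file):
    # file is a tornado.httputil.HTTPFile from self.request.files
    staged_path = '{0}{1}'.format(STAGING_AREA, file['filename'])

    def stage_and_upload():
        # Stage the uploaded bytes on disk, then hand them to Hark SaaS
        with open(staged_path, 'wb') as staged:
            staged.write(file['body'])
        hark.upload(staged_path)  # hypothetical wrapper call, not from the source

    # Run the blocking work on the executor; yielding the future does not block
    yield _executor.submit(stage_and_upload)
```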
- A simple wrapper around PyHarkSaas to inject logging and additional logic.
- A wrapper around the Speech Recognition module to inject additional logic. SpeechRecognition is a Python module that supports multiple recognition APIs. Hark Visualizer uses the Bing Speech API, backed by an Azure instance hosting the API in the West US region (the only availability zone at the time).
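A minimal sketch of such a wrapper, assuming the SpeechRecognition package's Recognizer.recognize_bing call and mapping failed recognitions to the 'Inaudible' marker described earlier (the class shape and key handling are assumptions):

```python
import speech_recognition as sr

class Speech(object):
    def __init__(self, api_key, language='ja-JP'):
        self.recognizer = sr.Recognizer()
        self.api_key = api_key
        self.language = language

    def translate(self, file_name):
        # Load a separated FLAC file produced by Hark
        with sr.AudioFile(file_name) as source:
            audio = self.recognizer.record(source)
        try:
            # Send the audio to Bing Speech API for transcription
            return self.recognizer.recognize_bing(audio, key=self.api_key,
                                                  language=self.language)
        except sr.UnknownValueError:
            # Hark flagged this as speech, but Bing could not transcribe it
            return 'Inaudible'
```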
This is where the main websocket work is done. A websocket is initiated in JavaScript by the browser when the user navigates to visualization.html:
```javascript
// Connect to the remote websocket server
var connection = new WebSocket("ws://harkvisualizer.com/websocket");
```
This triggers the Tornado web server to send the analysis results from Hark to the browser via this socket:
```python
# Invoked when socket is opened by browser
def open(self):
    log.info('Web socket connection established')
    # Do not hold packets for bandwidth optimization
    self.set_nodelay(True)
    # Give the ioloop a moment before attempting to send data
    tornado.ioloop.IOLoop.instance().add_timeout(timedelta(seconds=1),
                                                 self.send_data)

def send_data(self, utterances_memo=[]):
    if hark.client.getSessionID():
        results = hark.client.getResults()
        utterances = results['context']
        # If the result contains more utterances than the memo
        if len(utterances) > len(utterances_memo):
            # Must iterate since new utterances
            # could be anywhere in the result
            for utterance in utterances:
                utterance_id = utterance['srcID']
                # If the utterance is new
                if utterance_id not in utterances_memo:
                    # Memoize the srcID
                    utterances_memo.append(utterance_id)
                    self.write_message(json.dumps(utterance))
                    log.info("Utterance %d written to socket", utterance_id)
        if hark.client.isFinished():
            # If we have all the utterances, transcribe, then close the socket
            if sum(results['scene']['numSounds'].values()) == len(utterances_memo):
                for srcID in range(len(utterances_memo)):
                    random_string = ''.join(choice(ascii_uppercase) for i in range(10))
                    file_name = '{0}{1}_part{2}.flac'.format(STAGING_AREA, random_string, srcID)
                    hark.get_audio(srcID, file_name)
                    transcription = speech.translate(file_name)
                    utterance = utterances[srcID]
                    seconds, milliseconds = divmod(utterance['startTimeMs'], 1000)
                    minutes, seconds = divmod(seconds, 60)
                    self.write_message(json.dumps(
                        '{0} at ({1}:{2}:{3}):'.format(utterance['guid'], minutes, seconds, milliseconds)))
                    self.write_message(json.dumps(transcription, ensure_ascii=False))
                del utterances_memo[:]
                self.close()
        else:
            tornado.ioloop.IOLoop.instance().add_timeout(timedelta(seconds=1), self.send_data)
```
### Amazon AWS
Used for hosting the web server in Japan.

### Tornado
Used as the web server for serving the webapp and delivering data to the browser via websockets.

### Microsoft Cognitive Services/Azure
Used for hosting the Bing Speech API server instance.

### Hark SaaS
Used for analyzing the audio files.

### Speech Recognition
Used for transcribing the audio file via a wrapper around the Bing Speech API, as Google Speech API is not available anymore.

### d3.js
Used for creating real-time data visualizations in the browser.

### crossfilter.js
Used for n-dimensional filtering of multivariate datasets across D3 charts.

### c3.js
A wrapper around D3.js for building charts quickly.