WebSocket API usage
The Docker image exposes port 8080, which is used for communication with the underlying service.
The container exposes WebSocket API methods to support the streaming audio processing capabilities present in IDVoice.
There is no formal documentation for the WebSocket API, since it implements a low-level protocol whose semantics are not as strict as those of HTTP. However, we did our best to provide as much information on WebSocket API usage as possible.
The following WebSocket endpoints are exposed:
- /speech_endpoint_detector - speech endpoint detection in an audio stream (a part of the media component)
- /speech_endpoint_detector_opus - speech endpoint detection in an Opus packets stream (a part of the media component)
- /speech_summary_stream - voice activity detection and speech length estimation in an audio stream (a part of the media component)
- /speech_summary_stream_opus - voice activity detection and speech length estimation in an Opus packets stream (a part of the media component)
- /voice_verify_stream - voice verification in an audio stream (a part of the verify component)
- /voice_verify_stream_opus - voice verification in an Opus packets stream (a part of the verify component)
All endpoints accept string data for parameters and binary data for the audio stream.
The common order of working with WebSocket endpoints is the following:
- First, sequentially pass the required initialization parameters. Each parameter should be passed as a separate string. The connection returns an HTTP-like result code for every parameter that is set: "200" means the parameter value is valid and was accepted, while "500" means that either the parameter value is invalid or another internal error occurred. For more details on the error, examine the container logs.
- As soon as you've successfully set all the required parameters, you can start passing audio stream data chunks to the endpoint and obtaining processing results. As mentioned above, the audio stream data should be in binary representation. Each endpoint accepts either a byte representation of sequential Opus packets (endpoints with the _opus suffix) or PCM16 audio samples, i.e. a byte representation of a 16-bit integer array (endpoints without the _opus suffix); see the packing sketch right after this list. After initialization the endpoint accepts audio samples and yields processing results. It is not necessary to retrieve a result after every send: you can send as many samples as you want and receive the accumulated results in a loop afterwards.
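For the PCM16 endpoints, "byte representation of a 16-bit integer array" simply means the raw bytes of the sample values packed back to back. Below is a minimal packing sketch; the sample values and the little-endian byte order are assumptions for illustration.

import struct

# PCM16 samples as plain Python integers in the range [-32768, 32767]
# (the values here are made up for illustration)
samples = [0, 1024, -2048, 512]

# Pack them into a contiguous byte string; "<h" means little-endian
# 16-bit signed integer, which is an assumption about the expected byte order
chunk = struct.pack(f"<{len(samples)}h", *samples)

# `chunk` is what gets sent to the endpoint as a binary WebSocket message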
The important part is that in most cases the results of stream processing are JSON string representations, which can be parsed and treated as JSON objects.
Python-like pseudocode example of abstract WS endpoint usage:

connection = create_ws_connection("<WebSocket endpoint>")

# Assuming that this endpoint requires the signal sampling rate
# parameter for initialization
sample_rate = 16000

# Send parameter as string
connection.send(str(sample_rate))

# Receive result code ("200" on success, "500" on failure)
result_code = connection.receive()

# It is assumed that this object yields byte array chunks
pcm16_audio_stream_source = create_audio_stream_source()

# You may receive results as soon as they have been produced
while pcm16_audio_stream_source.has_data():
    chunk_to_send = pcm16_audio_stream_source.get_next_chunk()
    connection.send(chunk_to_send)
    result = connection.receive()
    print(result)

# Or you may send everything first and receive all the results at once
while pcm16_audio_stream_source.has_data():
    chunk_to_send = pcm16_audio_stream_source.get_next_chunk()
    connection.send(chunk_to_send)

while connection.can_receive():
    result = connection.receive()
    print(result)
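As a more concrete sketch, the same flow can be written with a real WebSocket client. The example below assumes the third-party websocket-client Python package, a container listening on localhost:8080, and a raw PCM16 file named speech.pcm; it targets /speech_summary_stream, which takes a single sampling rate parameter (see the endpoint documentation below).

import websocket  # third-party websocket-client package

# Connect to a streaming endpoint of the container exposed on port 8080
ws = websocket.create_connection("ws://localhost:8080/speech_summary_stream")

# /speech_summary_stream expects a single parameter: the sampling rate
ws.send("16000")
if ws.recv() != "200":
    raise RuntimeError("Parameter rejected, check the container logs")

# Stream PCM16 audio (bytes of a 16-bit integer array) in arbitrary chunks
with open("speech.pcm", "rb") as audio:
    while True:
        chunk = audio.read(4096)
        if not chunk:
            break
        ws.send_binary(chunk)
        # Each chunk yields a JSON string with the current speech summary
        print(ws.recv())

ws.close()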
Endpoints documentation

/speech_endpoint_detector and /speech_endpoint_detector_opus
Parameters:
- Minimum speech length required to start endpoint detection (milliseconds).
- Silence after utterance length required to detect speech endpoint (milliseconds).
- Input audio stream sampling rate.
Result:
A string representation of a boolean value: "true" if a speech endpoint is detected, "false" otherwise.
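An initialization and processing sketch for this endpoint is shown below. The parameter values (300 ms of minimum speech, 500 ms of trailing silence, 8000 Hz sampling rate), the pcm16_chunks audio source and the websocket-client package are assumptions for illustration; parameters are sent as strings in the order listed above.

import websocket  # third-party websocket-client package

ws = websocket.create_connection("ws://localhost:8080/speech_endpoint_detector")

# Initialization parameters, one string per message, in the documented order
for value in ("300", "500", "8000"):
    ws.send(value)
    if ws.recv() != "200":
        raise RuntimeError("Parameter rejected, check the container logs")

# Feed PCM16 chunks until the endpoint reports "true"
for chunk in pcm16_chunks:  # pcm16_chunks is an assumed byte-chunk source
    ws.send_binary(chunk)
    if ws.recv() == "true":
        print("Speech endpoint detected")
        break

ws.close()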
/speech_summary_stream and /speech_summary_stream_opus
Example of speech summary estimation result:
{
  "current_background_length": 160.0,
  "speech_events": [
    {
      "is_voice": true,
      "audio_interval": {
        "start_sample": 8000,
        "end_sample": 8320,
        "start_time": 1000,
        "end_time": 1040,
        "sample_rate": 8000
      }
    }
  ],
  "total_speech_info": {
    "total_length_ms": 1200.0,
    "speech_length_ms": 40.0,
    "background_length_ms": 1160.0
  }
}
Parameters:
- Input audio stream sampling rate.
Result:
A JSON representation of a SpeechSummary class instance.
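Because the result arrives as a JSON string, it can be parsed into an ordinary dictionary. Below is a minimal sketch, with field names taken from the example above and `message` standing for one received result string.

import json

# `message` is one result string received from /speech_summary_stream
summary = json.loads(message)

totals = summary["total_speech_info"]
print("Speech length (ms):", totals["speech_length_ms"])
print("Background length (ms):", totals["background_length_ms"])

# Individual voice activity events carry sample and time intervals
for event in summary["speech_events"]:
    interval = event["audio_interval"]
    print(event["is_voice"], interval["start_time"], interval["end_time"])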
/voice_verify_stream and /voice_verify_stream_opus
Example of continuous voice verification result:
[
  {
    "audio_interval": {
      "end_sample": 48000,
      "end_time": 3000,
      "sample_rate": 16000,
      "start_sample": 0,
      "start_time": 0
    },
    "verify_result": {
      "probability": 0.9999368786811829,
      "score": 0.6484680771827698
    }
  }
]
Parameters:
- Base64 string representation of a voice template created with one of the methods from the /voice_template_factory/* endpoints.
- Input audio stream sampling rate.
- Stream audio context length (seconds). This parameter defines the "memory" of the stream and also provides a way to manage the level of stream "confidence" (i.e. how much well-matching audio is required for the stream to be confident about the claimed speaker identity).
- (for /voice_verify_stream only) Sliding window length (seconds). This parameter defines the length of audio passed to template creation during stream processing; the default value is 3 seconds.
Result:
An array of JSON representations of VerifyStreamResult class instances.
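Putting the parameters together, a verification session could look like the sketch below. The websocket-client package, the load_template_string helper, the 10-second context length, the 0.95 probability threshold and the pcm16_chunks audio source are all assumptions for illustration; the result field names come from the example above.

import json
import websocket  # third-party websocket-client package

ws = websocket.create_connection("ws://localhost:8080/voice_verify_stream")

# Base64 voice template created earlier via /voice_template_factory/*;
# load_template_string() is a hypothetical helper returning that string
template_b64 = load_template_string()

# Initialization parameters in the documented order: template, sampling rate,
# context length (seconds) and sliding window length (seconds)
for value in (template_b64, "16000", "10", "3"):
    ws.send(value)
    if ws.recv() != "200":
        raise RuntimeError("Parameter rejected, check the container logs")

# Stream PCM16 audio and inspect the continuous verification results
for chunk in pcm16_chunks:  # pcm16_chunks is an assumed byte-chunk source
    ws.send_binary(chunk)
    results = json.loads(ws.recv())  # an array of VerifyStreamResult objects
    for item in results:
        if item["verify_result"]["probability"] > 0.95:
            print("Speaker verified on interval", item["audio_interval"])

ws.close()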
Also take a look at the Python and JavaScript examples for the WebSocket API in the ID R&D GitHub repository.