
WebSocket API usage

The Docker image exposes port 8080, which is used for communication with the underlying service.

The container exposes WebSocket API methods in order to support the streaming audio processing capabilities present in IDVoice.

There is no formal documentation for the WebSocket API, as it implements a low-level protocol whose semantics are not as strict as those of HTTP. However, we did our best to provide as much information on WebSocket API usage as possible.

The following WebSocket endpoints are exposed:

  • /speech_endpoint_detector - speech endpoint detection in audio stream (a part of media component)
  • /speech_endpoint_detector_opus - speech endpoint detection in Opus packets stream (a part of media component)
  • /speech_summary_stream - voice activity detection and speech length estimation in audio stream (a part of media component)
  • /speech_summary_stream_opus - voice activity detection and speech length estimation in Opus packets stream (a part of media component)
  • /voice_verify_stream - voice verification in audio stream (a part of verify component)
  • /voice_verify_stream_opus - voice verification in Opus packets stream (a part of verify component)

All endpoints accept parameters as string messages and audio stream data as binary messages.

The typical order of working with a WebSocket endpoint is the following:

  1. First, sequentially pass the required initialization parameters. Each parameter should be sent as a separate string message. The connection returns an HTTP-like result code for every parameter it receives: the "200" result code means that the parameter value is valid and was accepted, while the "500" error code means that either the parameter value is invalid or another internal error occurred. For more details on the error, examine the container logs.
  2. Once all the required parameters have been set successfully, you can start passing audio stream data chunks to the endpoint and obtaining processing results. As mentioned above, the audio stream data should be sent in binary form. Each endpoint accepts either a byte representation of sequential Opus packets (endpoints with the _opus suffix) or PCM16 audio samples, i.e. a byte representation of an array of 16-bit integers (endpoints without the _opus suffix); see the packing snippet after this list.

    After initialization the endpoint accepts audio samples and yields processing results. It is not necessary to retrieve a result after every send: you can send as many chunks as you want and receive the accumulated results in a loop afterwards.

    Note that in most cases the results of stream processing are JSON strings, which can be parsed and treated as JSON objects.
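For illustration, here is a minimal sketch of packing PCM16 audio samples into the expected binary form (the use of NumPy and the sample values are assumptions made for this example; the standard struct module would work as well):

import numpy as np

# Hypothetical array of PCM16 audio samples, e.g. captured from a microphone
# or decoded from a WAV file
samples = np.array([0, 123, -456, 789], dtype=np.int16)

# Byte representation of the 16-bit integer array, ready to be sent
# as a binary WebSocket message
chunk_to_send = samples.tobytes()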

Python-like pseudocode example of abstract WebSocket endpoint usage:

connection = create_ws_connection("<WebSocket endpoint>")

# Assuming that this endpoint requires signal sampling rate
# parameter for initialization

sample_rate = 16000

# Send parameter as string
connection.send(str(sample_rate))

# Receive result code ("200" on success, "500" on failure)
result_code = connection.receive()

# It is assumed that this object yields byte array chunks
pcm16_audio_stream_source = create_audio_stream_source()

# You may receive results as soon as they have been produced
while pcm16_audio_stream_source.has_data():
    chunk_to_send = pcm16_audio_stream_source.get_next_chunk()

    connection.send(chunk_to_send)

    result = connection.receive()

    print(result)

# Or you may receive all the results at once
while pcm16_audio_stream_source.has_data():
    chunk_to_send = pcm16_audio_stream_source.get_next_chunk()
    connection.send(chunk_to_send)

while connection.can_receive():
    result = connection.receive()
    print(result)
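The same flow can be expressed as a more concrete sketch using the third-party websocket-client package. The host, port, endpoint choice (/speech_summary_stream, which expects a single sampling rate parameter) and the read_pcm16_chunks() helper are assumptions made for this example:

import websocket  # pip install websocket-client

# Hypothetical helper that yields binary PCM16 chunks from a raw audio file
def read_pcm16_chunks(path="audio.pcm", chunk_size=4096):
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

connection = websocket.create_connection("ws://localhost:8080/speech_summary_stream")

# Send the sampling rate parameter as a string and check the result code
connection.send("16000")
assert connection.recv() == "200", "parameter was rejected, see container logs"

# Stream audio chunks as binary messages and read a result after each one
for chunk in read_pcm16_chunks():
    connection.send_binary(chunk)
    print(connection.recv())

connection.close()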

Endpoint documentation

/speech_endpoint_detector and /speech_endpoint_detector_opus

Parameters:

  1. Minimum speech length required to start endpoint detection (milliseconds).
  2. Length of silence after the utterance required to detect the speech endpoint (milliseconds).
  3. Input audio stream sampling rate.

Result:

A string representation of a boolean value: "true" if the speech endpoint is detected, "false" otherwise.
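As an illustration, the initialization sequence and result handling for this endpoint might look like the following sketch (the parameter values, host and port are assumptions; read_pcm16_chunks() is the hypothetical helper from the sketch above):

import websocket  # pip install websocket-client

connection = websocket.create_connection("ws://localhost:8080/speech_endpoint_detector")

# 1. Minimum speech length required to start endpoint detection, in milliseconds
connection.send("500")
print(connection.recv())  # "200" if the parameter was accepted
# 2. Silence length after the utterance required to detect the endpoint, in milliseconds
connection.send("350")
print(connection.recv())
# 3. Input audio stream sampling rate
connection.send("16000")
print(connection.recv())

# Stream PCM16 chunks until the detector returns "true"
for chunk in read_pcm16_chunks():
    connection.send_binary(chunk)
    if connection.recv() == "true":
        print("Speech endpoint detected")
        break

connection.close()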

/speech_summary_stream and /speech_summary_stream_opus

Example of speech summary estimation result:

{
  "current_background_length": 160.0,
  "speech_events": [
    {
      "is_voice": true,
      "audio_interval": {
        "start_sample": 8000,
        "end_sample": 8320,
        "start_time": 1000,
        "end_time": 1040,
        "sample_rate": 8000
      }
    }
  ],
  "total_speech_info": {
    "total_length_ms": 1200.0,
    "speech_length_ms": 40.0,
    "background_length_ms": 1160.0
  }
}

Parameters:

  1. Input audio stream sampling rate.

Result:

A JSON representation of a SpeechSummary class instance.
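Since the result is a JSON string, it can be parsed with the standard json module. A small sketch, using the field names from the example above and the connection object from the earlier streaming sketch:

import json

summary = json.loads(connection.recv())  # one SpeechSummary result from the stream

total = summary["total_speech_info"]
print("speech length, ms:", total["speech_length_ms"])
print("background length, ms:", total["background_length_ms"])

for event in summary["speech_events"]:
    interval = event["audio_interval"]
    kind = "voice" if event["is_voice"] else "background"
    print(kind, interval["start_time"], "-", interval["end_time"], "ms")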

/voice_verify_stream and /voice_verify_stream_opus

Example of continuous voice verification result:

[
  {
    "audio_interval": {
      "end_sample": 48000,
      "end_time": 3000,
      "sample_rate": 16000,
      "start_sample": 0,
      "start_time": 0
    },
    "verify_result": {
      "probability": 0.9999368786811829,
      "score": 0.6484680771827698
    }
  }
]

Parameters:

  1. Base64 string representation of a voice template created with one of the methods of the /voice_template_factory/* endpoints.
  2. Input audio stream sampling rate.
  3. Stream audio context length (seconds). This parameter defines the "memory" of the stream and also provides a way to manage the stream's "confidence" level (how much well-matching audio is required for the stream to be confident about the claimed speaker identity).
  4. (for /voice_verify_stream only) Sliding window length (seconds). This parameter defines the length of audio passed to template creation during stream processing; the default value is 3 seconds.

Result:

An array of JSON representations of VerifyStreamResult class instances.
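A sketch of working with /voice_verify_stream, assuming a voice template has already been created and Base64-encoded elsewhere (the parameter values, host, port and the read_pcm16_chunks() helper are assumptions made for this example):

import json
import websocket  # pip install websocket-client

# Base64 template obtained from one of the /voice_template_factory/* methods (placeholder value)
template_b64 = "<Base64-encoded voice template>"

connection = websocket.create_connection("ws://localhost:8080/voice_verify_stream")

# Send the parameters one by one; each should be acknowledged with "200"
for parameter in (template_b64,  # 1. Base64 voice template
                  "16000",       # 2. input audio stream sampling rate
                  "10",          # 3. stream audio context length, seconds
                  "3"):          # 4. sliding window length, seconds
    connection.send(parameter)
    assert connection.recv() == "200", "parameter was rejected, see container logs"

# Stream PCM16 chunks and print verification results as they arrive
for chunk in read_pcm16_chunks():
    connection.send_binary(chunk)
    for item in json.loads(connection.recv()):
        print("probability:", item["verify_result"]["probability"],
              "score:", item["verify_result"]["score"])

connection.close()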

Also take a look at the Python and JavaScript examples for the WebSocket API in the ID R&D GitHub repository.