Caption (Speech-To-Text) support for Twilio Programmable Video

Has anybody been able to add captions (speech-to-text) to Twilio Programmable Video, using either the Twilio ecosystem or Google Cloud?

Any insights would be appreciated.


  • sbrijmohan
    sbrijmohan admin
    edited August 19

    Hi @sotheara, this isn't something I've tried yet. I think it is going to be tricky: it would probably require a good deal of WebRTC knowledge so that you can extract the raw audio from the participants and send it to a speech-to-text service. Technically, it should be possible to use Google Cloud Speech-to-Text (there is more information on it in this article), but I'm not entirely sure how much of a lift that is.

    @pnash do you have any insight on this?

  • Hey @sotheara, this is something I've thought about before, but I've not actually built myself, so all I have is ideas.

    I am assuming that what you want here is to capture the audio from each participant in a conversation, send that off to a speech-to-text service, then take the result and send it to the other participants in the room so that it can be displayed on their screen, over the video as a caption. If you are looking for something different, then let me know, but that's what I'm going with.

    Twilio doesn't have an in-house speech-to-text capability, so we will want to use another service, like Google Cloud's Speech-to-Text. In the Chrome browser you can actually access this service for free using the Web Speech API. I wrote an article that shows you how to convert speech to text in browsers that support the Web Speech API here. In browsers that don't support it, you will need to capture the audio and send it off to the transcription service yourself. This seems like quite a good blog post that explains how to do that.
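
    As a rough sketch of the browser-side step, assuming a browser that exposes `SpeechRecognition` or the prefixed `webkitSpeechRecognition`: the helper names `readTranscript` and `startCaptions`, the callback shape, and the small interfaces are all hypothetical, declared here only so the example compiles without DOM typings. It has not been tested against a live Twilio Video room.

    ```typescript
    // Minimal structural types standing in for the browser's Web Speech typings.
    interface RecognitionAlternative { transcript: string }
    interface RecognitionResult {
      isFinal: boolean;
      length: number;
      [index: number]: RecognitionAlternative;
    }
    interface RecognitionEvent {
      resultIndex: number;
      results: { length: number; [index: number]: RecognitionResult };
    }
    interface Recognition {
      continuous: boolean;
      interimResults: boolean;
      lang: string;
      onresult: ((event: RecognitionEvent) => void) | null;
      start(): void;
    }

    // Collect the text that arrived in this event, split into final and interim parts.
    function readTranscript(event: RecognitionEvent): { final: string; interim: string } {
      let final = "";
      let interim = "";
      for (let i = event.resultIndex; i < event.results.length; i++) {
        const result = event.results[i];
        const text = result?.[0]?.transcript ?? "";
        if (result?.isFinal) final += text;
        else interim += text;
      }
      return { final: final.trim(), interim: interim.trim() };
    }

    // Start recognition and report captions through the callback.
    // Returns null when the browser has no Web Speech support, so the caller
    // can fall back to capturing audio and calling a transcription service.
    function startCaptions(
      onCaption: (text: string, isFinal: boolean) => void,
      lang = "en-US",
    ): Recognition | null {
      const g = globalThis as any;
      const Ctor = g.SpeechRecognition ?? g.webkitSpeechRecognition;
      if (!Ctor) return null;

      const recognition: Recognition = new Ctor();
      recognition.continuous = true;      // keep listening across pauses
      recognition.interimResults = true;  // surface partial text while speaking
      recognition.lang = lang;
      recognition.onresult = (event) => {
        const { final, interim } = readTranscript(event);
        if (final) onCaption(final, true);
        else if (interim) onCaption(interim, false);
      };
      recognition.start();
      return recognition;
    }
    ```

    Interim results let a caption appear while the person is still speaking, and the final result can then replace it once the recognizer settles.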

    Once you receive the result for each participant, you need to then send the transcribed text to the other participants in the room. The Video SDK provides a way to send arbitrary data to other participants using the DataTrack API. There's a good blog post on how to connect participants using the DataTrack API here.
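
    A minimal sketch of what the caption payload could look like on the wire, assuming captions are sent as JSON strings. The `CaptionMessage` shape and the `encodeCaption`, `decodeCaption`, and `sendCaption` helpers are hypothetical; `CaptionSender` is a structural stand-in for anything with a `send(data)` method, which is how the Video SDK's `LocalDataTrack` accepts outgoing data.

    ```typescript
    // Hypothetical message shape for one caption update.
    interface CaptionMessage {
      type: "caption";
      speaker: string;
      text: string;
      isFinal: boolean;
      sentAt: number; // milliseconds since the epoch
    }

    // Structural stand-in for a data track: only the send method is needed here.
    interface CaptionSender {
      send(data: string): void;
    }

    // Serialize a caption so it can travel as a single string message.
    function encodeCaption(speaker: string, text: string, isFinal: boolean, sentAt = Date.now()): string {
      const message: CaptionMessage = { type: "caption", speaker, text, isFinal, sentAt };
      return JSON.stringify(message);
    }

    // Parse an incoming message; returns null for anything that is not a caption,
    // since the same track may carry other kinds of data.
    function decodeCaption(data: unknown): CaptionMessage | null {
      if (typeof data !== "string") return null;
      let parsed: unknown;
      try {
        parsed = JSON.parse(data);
      } catch {
        return null;
      }
      if (typeof parsed !== "object" || parsed === null) return null;
      const { type, speaker, text, isFinal, sentAt } = parsed as Record<string, unknown>;
      if (
        type !== "caption" ||
        typeof speaker !== "string" ||
        typeof text !== "string" ||
        typeof isFinal !== "boolean" ||
        typeof sentAt !== "number"
      ) {
        return null;
      }
      return { type: "caption", speaker, text, isFinal, sentAt };
    }

    // Send one caption over the given track.
    function sendCaption(track: CaptionSender, speaker: string, text: string, isFinal: boolean): void {
      track.send(encodeCaption(speaker, text, isFinal));
    }
    ```

    On the receiving side, the decoded caption would be read from the remote data track's `message` event and drawn over that participant's video. Validating the payload keeps non-caption messages from being rendered by mistake.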

    The DataTrack API is ephemeral, so it doesn't store the text you are sending. If you want something more permanent you could add the Conversations SDK to the application and send the transcribed messages as if they were chat messages. This blog post will show you how to add Conversations to a Twilio Video room.
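
    One way this could look, as a sketch only: store just the finalized captions as chat messages so the history is not flooded with partial text. `ChatSender` is a structural stand-in for a conversation object with a `sendMessage` method, and `formatCaptionLine` and `persistCaption` are hypothetical helper names, as is the `[caption]` prefix.

    ```typescript
    // Structural stand-in for a conversation: only sendMessage is needed here.
    interface ChatSender {
      sendMessage(body: string): Promise<unknown>;
    }

    // Format a caption so it is recognizable among ordinary chat messages.
    function formatCaptionLine(speaker: string, text: string): string {
      return `[caption] ${speaker}: ${text}`;
    }

    // Persist a caption only when it is final and non-empty.
    // Returns the pending send, or null when nothing was sent.
    function persistCaption(
      chat: ChatSender,
      speaker: string,
      text: string,
      isFinal: boolean,
    ): Promise<unknown> | null {
      const trimmed = text.trim();
      if (!isFinal || trimmed === "") return null;
      return chat.sendMessage(formatCaptionLine(speaker, trimmed));
    }
    ```

    Interim captions can still go over the DataTrack for the live overlay, with only the final text stored.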

    So, like I said, I haven't done most of this and these are suggestions. I hope that it perhaps points you in the right direction though.

  • @pnash Thank you so much for your ideas. I'll bring them to the team and see what we can do. I'll post an update once we've made some progress. :smile: