Chirp Model with Multi-Channel Recordings Seems Broken

I am investigating the benefits of upgrading from the V1 Phone Call "Enhanced" model to Chirp Telephony in order to generate more accurate transcriptions of audio that originates from phone calls. I used the Google Console UI to test the two models, and was quite surprised to find that the Chirp model appears to be broken.

I used a 52-second call recording with 2 channels to test this.

The V1 model results appear to be fine. They are split out by channel and timestamped as expected, and punctuation is added as well.
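
For anyone who wants to reproduce the comparison outside the Console UI, this is roughly the V1 configuration I am comparing against, sketched with the Python client. The bucket URI, encoding, and sample rate below are placeholders rather than my actual values.

```python
from google.cloud import speech

client = speech.SpeechClient()

# V1 "phone_call" enhanced model with per-channel recognition and punctuation.
# The GCS URI, encoding, and sample rate are placeholders.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    model="phone_call",
    use_enhanced=True,
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,  # split results by channel
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,                 # word-level timestamps
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/call-recording.wav")

response = client.recognize(config=config, audio=audio)
for result in response.results:
    alternative = result.alternatives[0]
    print(result.channel_tag, alternative.transcript)
```

With enable_separate_recognition_per_channel set, each result carries a channel_tag, which is the per-channel split shown in the screenshot below.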

[Screenshot attachment: image.png — V1 model output]

Here is the output of the Chirp Telephony model with the same recording. You can see that it is remarkably worse in comparison. Beyond the lack of punctuation, the model doesn't appear to be splitting the transcript out by channel at all.

[Screenshot attachment: image (1).png — Chirp Telephony model output]

This is so bad that I have to wonder, am I doing something wrong? Am I not understanding the purpose of "Chirp" as a drop-in replacement for V1 Speech models?

1 REPLY

Hello,

Based on the Google public documentation:

“Chirp processes speech in much larger chunks than other models do. This means it might not be suitable for true, real-time use.”

Currently, many Speech-to-Text features are not supported by the Chirp model. See below for the specific restrictions:

  • Confidence scores: The API returns a value, but it isn't truly a confidence score.
  • Speech adaptation: No adaptation features supported.
  • Diarization: Automatic diarization isn't supported.
  • Forced normalization: Not supported.
  • Word level confidence: Not supported.
  • Language detection: Not supported.

Chirp does support the following features:

  • Automatic punctuation: The punctuation is predicted by the model. It can be disabled.
  • Word timings: Optionally returned.
  • Language-agnostic audio transcription: The model automatically infers the spoken language in your audio file and adds it to the results.
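
If you want to exercise those supported features programmatically rather than through the Console, a request along the following lines should do it. This is only a rough sketch with the Speech-to-Text V2 Python client; the project ID, region, and file name are placeholders, and you should confirm in the documentation that the chirp_telephony model identifier and per-channel recognition (multi_channel_mode) are available for your chosen region.

```python
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "my-project"  # placeholder
REGION = "us-central1"     # placeholder; Chirp models are served from specific regions

# Chirp models must be called through a regional endpoint matching the recognizer location.
client = SpeechClient(
    client_options=ClientOptions(api_endpoint=f"{REGION}-speech.googleapis.com")
)

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_telephony",
    features=cloud_speech.RecognitionFeatures(
        enable_automatic_punctuation=True,  # supported by Chirp
        enable_word_time_offsets=True,      # optional word timings
        # Per-channel recognition exists in V2; verify in the docs that your
        # chosen Chirp model honors it before relying on the split.
        multi_channel_mode=cloud_speech.RecognitionFeatures.MultiChannelMode.SEPARATE_RECOGNITION_PER_CHANNEL,
    ),
)

with open("call-recording.wav", "rb") as f:  # placeholder file name
    audio_bytes = f.read()

request = cloud_speech.RecognizeRequest(
    recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/_",
    config=config,
    content=audio_bytes,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.channel_tag, result.alternatives[0].transcript)
```

Note that the client is created with an explicit regional api_endpoint because Chirp models are only available in certain regions and the endpoint has to match the recognizer location.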

On the other hand, STT V1 transcription models such as phone_call are trained specifically to recognize speech recorded over the phone and remain the best fit for that kind of audio, so they produce more accurate transcription results for it.

Hope this helps.