OpenAI-Whisper API

OpenAI Whisper large-v2 powered scalable API

Access a scalable, affordable, and highly available REST API for on-demand speech-to-text transcription using the OpenAI Whisper large-v2 model.

Using this API you can perform speech to text transcription for any audio or video file at scale.

We have extended the service to also perform diarization on demand. With diarization enabled, users can retrieve the transcript segmented by speaker. This feature is still experimental.

Monster API provides access to large-v2, the biggest version of the Whisper model, at a fraction of the cost of the official OpenAI Whisper API.

The large-v2 model version offers superior transcription quality.

A few benefits of performing speech-to-text transcription using Monster API:

✅ Superior transcription quality

✅ Up to 90% lower cost than the OpenAI Whisper API

✅ No audio file size limits (for now)

✅ Diarization for speaker identification

Monster API can be accessed via this workflow:

  1. Send request API: Use this API to send a request for audio transcription

  2. Fetch status API: Use this API to fetch status of your audio transcription request

Refer to our API docs for the speech-to-text use case.

# Example cURL request for sending a speech-to-text transcription request

curl --location 'https://api.monsterapi.ai/apis/add-task' \
--header 'x-api-key: 123' \
--header 'Authorization: Bearer 456' \
--header 'Content-Type: application/json' \
--data '{
    "model": "whisper",
    "data": {
        "file": "https://upload.wikimedia.org/wikipedia/commons/2/2d/A_J_Cook_Speech_from_Lansbury%27s_Labour_Weekly.ogg",
        "transcription_format": "srt",
        "prompt": "",
        "language": "en"
    }
}'

# Example cURL request for getting the status of your request

curl --location 'https://api.monsterapi.ai/apis/task-status' \
--header 'x-api-key: 123' \
--header 'Authorization: Bearer 456' \
--header 'Content-Type: application/json' \
--data '{
    "process_id" :  "f0ca09e9-aad1-11ed-aed8-5d93517c8890"
}'

These two API calls are enough to get a transcription of the provided audio or video file.

Description of parameters for a Whisper API request:
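
The two calls above can also be scripted. The following is a minimal Python sketch, not an official client: it only prepares the two requests with the standard library, mirroring the cURL examples. The helper names `build_request` and `send`, the `Content-Type` header, and the assumption that responses are JSON are illustrative and not taken from the API docs; `123` and `456` are the same placeholder credentials used above.

```python
import json
import urllib.request

BASE_URL = "https://api.monsterapi.ai/apis"


def build_request(endpoint, body, api_key, bearer_token):
    """Build (but do not send) a POST request for the given endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "Authorization": f"Bearer {bearer_token}",
            # Assumption: the body is JSON, so we declare it as such.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Step 1: the transcription request, same body as the cURL example.
add_task = build_request(
    "add-task",
    {
        "model": "whisper",
        "data": {
            "file": "https://upload.wikimedia.org/wikipedia/commons/2/2d/"
                    "A_J_Cook_Speech_from_Lansbury%27s_Labour_Weekly.ogg",
            "transcription_format": "srt",
            "prompt": "",
            "language": "en",
        },
    },
    api_key="123",
    bearer_token="456",
)

# Step 2: the status request for a previously returned process ID.
task_status = build_request(
    "task-status",
    {"process_id": "f0ca09e9-aad1-11ed-aed8-5d93517c8890"},
    api_key="123",
    bearer_token="456",
)
```

Calling `send(add_task)` would submit the job and `send(task_status)` would fetch its status. The exact shape of both responses is not described here, so inspect them before relying on specific fields.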

  1. "file":

    1. Required Parameter.

    2. This parameter represents an audio or video file URL provided by you for transcription.

    3. The URL must end in one of these supported file formats: m4a, mp3, mp4, mpeg, mpga, wav, webm, ogg

  2. "diarize":

    1. Optional Parameter.

    2. Defaults to true.

    3. When the diarize option is set to true, an embedding model will be employed to identify speakers, along with their respective transcripts and durations.

  3. "transcription_format":

    1. Optional Parameter (default: text)

    2. This parameter defines the formatting of transcription output.

    3. Three commonly needed formats are supported: "text", "srt", "word"

  4. "prompt":

    1. Optional Parameter (default: '')

    2. This parameter provides an initial prompt that helps the Whisper model recognize words correctly. You can pass a comma-separated list of words. Prompts can be very helpful for correcting specific words or acronyms that the model often misrecognizes in the audio. For example, a prompt containing the words DALL·E and GPT-3 improves their transcription when they would otherwise be written as "GDP 3" and "DALI".

  5. "language":

    1. Optional Parameter (default: "en")

    2. This parameter specifies the language in which you'd like to get the transcription. The default is English. The following language codes are supported:

      'af', 'am', 'ar', 'as', 'az', 'ba', 'be', 'bg', 'bn', 'bo', 'br', 'bs', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fo', 'fr', 'gl', 'gu', 'ha', 'haw', 'he', 'hi', 'hr', 'ht', 'hu', 'hy', 'id', 'is', 'it', 'ja', 'jw', 'ka', 'kk', 'km', 'kn', 'ko', 'la', 'lb', 'ln', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'nn', 'no', 'oc', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'sa', 'sd', 'si', 'sk', 'sl', 'sn', 'so', 'sq', 'sr', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tk', 'tl', 'tr', 'tt', 'uk', 'ur', 'uz', 'vi', 'yi', 'yo', 'zh'

  6. "remove_silence":

    1. Optional Parameter (Boolean. Default: false)

    2. If set to true, a VAD (Voice Activity Detection) filter removes the silent parts of the audio, and only the audible parts are transcribed.
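
The parameter rules above can be checked locally before a request is sent. The following is a rough Python sketch under stated assumptions: `build_whisper_payload` is a hypothetical helper, not part of any Monster API client, and it only enforces the rules listed above (supported file extensions and transcription formats). The language code is passed through unchecked to keep the sketch short, and `diarize` is left out of the payload unless given, so the service default applies.

```python
SUPPORTED_EXTENSIONS = ("m4a", "mp3", "mp4", "mpeg", "mpga", "wav", "webm", "ogg")
SUPPORTED_FORMATS = ("text", "srt", "word")


def build_whisper_payload(file_url, transcription_format="text", prompt="",
                          language="en", diarize=None, remove_silence=False):
    """Build the JSON body for a transcription request (hypothetical helper)."""
    # The docs say the URL must end in one of the supported file formats.
    suffixes = tuple("." + ext for ext in SUPPORTED_EXTENSIONS)
    if not file_url.lower().endswith(suffixes):
        raise ValueError(f"file URL must end in one of: {', '.join(SUPPORTED_EXTENSIONS)}")
    if transcription_format not in SUPPORTED_FORMATS:
        raise ValueError(f"transcription_format must be one of: {', '.join(SUPPORTED_FORMATS)}")

    data = {
        "file": file_url,
        "transcription_format": transcription_format,
        "prompt": prompt,
        "language": language,
        "remove_silence": remove_silence,
    }
    # Only send "diarize" when the caller sets it explicitly.
    if diarize is not None:
        data["diarize"] = diarize
    return {"model": "whisper", "data": data}


payload = build_whisper_payload(
    "https://upload.wikimedia.org/wikipedia/commons/2/2d/"
    "A_J_Cook_Speech_from_Lansbury%27s_Labour_Weekly.ogg",
    transcription_format="srt",
)
```

Rejecting an unsupported extension or format locally avoids spending a request on input the parameter rules already rule out.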

Examples:

Example input for a speech-to-text API request with diarization:

  • file: "https://upload.wikimedia.org/wikipedia/commons/1/1f/%22DayBreak%22_with_Jay_Young_on_the_USA_Radio_Network.ogg"

  • diarize: "true"

Expected output:

{
  "text": {
    "Sequence": [
      {
        "SpeakerID": 1,
        "Starttime": "0:00:00",
        "Endtime": "0:00:11",
        "transcription": "We're joined once again by our travel expert and also author of America's Top Roller Coasters in Amoeba Parks, Pete Trubucco. Good morning and welcome back to Daybreak USA."
      },
      {
        "SpeakerID": 2,
        "Starttime": "0:00:11",
        "Endtime": "0:00:14",
        "transcription": "Well, thanks for having me on."
      },
      {
        "SpeakerID": 1,
        "Starttime": "0:00:14",
        "Endtime": "0:00:27",
        "transcription": "If someone's lucky enough to go on vacation to an exotic location and then maybe not so lucky to have some kind of a disaster happen while they're there, maybe some civil unrest, what should they do now? What's the next step?"
      }
    ]
  }
}
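
A diarized result like the one above is straightforward to post-process. As a small illustration, the Python sketch below totals the speaking time per speaker. It assumes the response has exactly the structure shown in the example (`text` → `Sequence`, with `SpeakerID`, `Starttime`, and `Endtime` in `H:MM:SS` form); the function names are illustrative, and the transcriptions are omitted from the sample for brevity.

```python
from collections import defaultdict


def to_seconds(timestamp):
    """Convert an "H:MM:SS" timestamp to whole seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speaking_time(result):
    """Sum the duration of every segment, grouped by speaker ID."""
    totals = defaultdict(int)
    for segment in result["text"]["Sequence"]:
        duration = to_seconds(segment["Endtime"]) - to_seconds(segment["Starttime"])
        totals[segment["SpeakerID"]] += duration
    return dict(totals)


# Same speakers and timestamps as the example output above.
sample = {
    "text": {
        "Sequence": [
            {"SpeakerID": 1, "Starttime": "0:00:00", "Endtime": "0:00:11"},
            {"SpeakerID": 2, "Starttime": "0:00:11", "Endtime": "0:00:14"},
            {"SpeakerID": 1, "Starttime": "0:00:14", "Endtime": "0:00:27"},
        ]
    }
}

totals = speaking_time(sample)
print(totals)  # {1: 24, 2: 3}
```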

Example input for a speech to text API request for plain text format:

Expected output:

Well, they should find out. I would like to know why was it suspended? Why is Mark suspended? What are the violations? Okay, because if the Church of Scientology is just paying them off, that's kind of weird.

The output above is in plain text format and does not include any timestamps.

Looking for sequence-level or word-level timestamps? Check out the next examples:

Example input for a speech to text API request for srt format:

Expected output:

00:00:00,000 00:00:08,000 Well, they should find out. I would like to know why was it suspended? Why is Mark suspended? What are the violations?

00:00:08,000 00:00:12,000 Okay, because if the Church of Scientology is just paying them off, that's kind of weird.

We received text output in srt format with sequence timestamps.
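
To work with these timestamps programmatically, each entry can be split into a start time, an end time, and the text. The Python sketch below assumes every entry arrives on one line in the layout displayed above (start, end, then text, separated by spaces); if the service returns a different SRT layout, the pattern needs adjusting. `parse_entries` and `to_ms` are illustrative names.

```python
import re

# One entry per line: "HH:MM:SS,mmm HH:MM:SS,mmm text"
LINE_PATTERN = re.compile(
    r"^(\d{2}:\d{2}:\d{2},\d{3})\s+(\d{2}:\d{2}:\d{2},\d{3})\s+(.*)$"
)


def to_ms(timestamp):
    """Convert "HH:MM:SS,mmm" to milliseconds."""
    clock, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def parse_entries(output):
    """Return a list of (start_ms, end_ms, text) tuples, skipping other lines."""
    entries = []
    for line in output.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            start, end, text = match.groups()
            entries.append((to_ms(start), to_ms(end), text))
    return entries


srt_output = """\
00:00:00,000 00:00:08,000 Well, they should find out. I would like to know why was it suspended? Why is Mark suspended? What are the violations?

00:00:08,000 00:00:12,000 Okay, because if the Church of Scientology is just paying them off, that's kind of weird.
"""

entries = parse_entries(srt_output)
for start_ms, end_ms, text in entries:
    print(f"{start_ms / 1000:.1f}s to {end_ms / 1000:.1f}s: {text[:30]}")
```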

The Speech to Text API also supports word-level timestamps.

You can send a request as follows:

Example input for a speech to text API request for word level timestamp format:

Expected output:

00:00:00,000 00:00:00,360 Well,

00:00:00,420 00:00:00,540 they

00:00:00,540 00:00:00,880 should

...

...

...

00:00:11,220 00:00:11,480 that's

00:00:11,480 00:00:11,619 kind

00:00:11,619 00:00:11,720 of

00:00:11,720 00:00:12,000 weird.

We received output with each word carrying its start and end timestamp.
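
Word-level timestamps make it possible to locate where a given word is spoken. The Python sketch below is illustrative only: it assumes the one-word-per-line layout displayed above, uses only the entries visible in the example (the elided middle section is left out), and `find_word` is a hypothetical helper name.

```python
import string


def find_word(entries, word):
    """Return (start, end) timestamps for every occurrence of a word.

    Matching ignores case and surrounding punctuation, so "weird" matches "weird.".
    """
    target = word.lower()
    return [
        (start, end)
        for start, end, token in entries
        if token.strip(string.punctuation).lower() == target
    ]


word_output = """\
00:00:00,000 00:00:00,360 Well,
00:00:00,420 00:00:00,540 they
00:00:00,540 00:00:00,880 should
00:00:11,220 00:00:11,480 that's
00:00:11,480 00:00:11,619 kind
00:00:11,619 00:00:11,720 of
00:00:11,720 00:00:12,000 weird.
"""

# Each non-empty line splits into start, end, and the word itself.
entries = [tuple(line.split(maxsplit=2)) for line in word_output.splitlines() if line.strip()]

print(find_word(entries, "weird"))  # [('00:00:11,720', '00:00:12,000')]
```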

Choose whichever transcription format best suits your use case.
