To allow the Hugging Face version of Whisper to predict word-level timestamps, a new property alignment_heads must be added to the GenerationConfig object. This is a list of [layer, head] pairs that select the cross-attention heads that are highly correlated to word-level timing.
If your Whisper checkpoint does not have the alignment_heads property yet, it can be added in two possible ways.
Method 1. Change the model.generation_config property:
# load the model
model = WhisperForConditionalGeneration.from_pretrained("your_checkpoint")
# set the new property
model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]Method 2. Add a new line to the generation_config.json file:
"alignment_heads": [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]],After you're done, use push_to_hub to make these changes permanent:
model.push_to_hub("your_pretrained_checkpoint", use_auth_token="your_token_if_not_logged_in", create_pr=True)The correct values for alignment_heads depend on the size of the model. Here are the appropriate values for the different Whisper model sizes. These are taken from the OpenAI checkpoints. If you fine-tuned your own checkpoint, you may need to inspect the cross-attention weights to find the appropriate layers and attention heads.
whisper-tiny: [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
whisper-tiny.en: [[1, 0], [2, 0], [2, 5], [3, 0], [3, 1], [3, 2], [3, 3], [3, 4]]
whisper-base: [[3, 1], [4, 2], [4, 3], [4, 7], [5, 1], [5, 2], [5, 4], [5, 6]]
whisper-base.en: [[3, 3], [4, 7], [5, 1], [5, 5], [5, 7]]
whisper-small: [[5, 3], [5, 9], [8, 0], [8, 4], [8, 7], [8, 8], [9, 0], [9, 7], [9, 9], [10, 5]]
whisper-small.en: [[6, 6], [7, 0], [7, 3], [7, 8], [8, 2], [8, 5], [8, 7], [9, 0], [9, 4], [9, 8], [9, 10], [10, 0], [10, 1], [10, 2], [10, 3], [10, 6], [10, 11], [11, 2], [11, 4]]
whisper-medium: [[13, 15], [15, 4], [15, 15], [16, 1], [20, 0], [23, 4]]
whisper-medium.en: [[11, 4], [14, 1], [14, 12], [14, 14], [15, 4], [16, 0], [16, 4], [16, 9], [17, 12], [17, 14], [18, 7], [18, 10], [18, 15], [20, 0], [20, 3], [20, 9], [20, 14], [21, 12]]
whisper-large-v1: [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]]
whisper-large-v2: [[10, 12], [13, 17], [16, 11], [16, 12], [16, 13], [17, 15], [17, 16], [18, 4], [18, 11], [18, 19], [19, 11], [21, 2], [21, 3], [22, 3], [22, 9], [22, 12], [23, 5], [23, 7], [23, 13], [25, 5], [26, 1], [26, 12], [27, 15]]
whisper-large: same as large-v2
checkout this repo for the best individual alignment heads of the large v2 and v3 variants.
https://github.com/nyrahealth/CrisperWhisper/blob/develop/run_experiments/experiments/head_results.json
The Repo also contains useful code to calculate these head results for other models as long as you are in the posession of a dataset with high quality timestamps like timit ( https://paperswithcode.com/dataset/timit).
The Idea here is to just evaluate the timings output by DTW for individual alignment heads against ground truth data and see which ones score highest. This variant improves upon this by specifically training alignment heads.
https://github.com/nyrahealth/CrisperWhisper