
wav2vec Fine-Tuning

Can exposing a speech model to non-speech sounds improve how it discriminates phonemes?

2024 · Technical lead — pipeline, fine-tuning, evaluation · Completed
wav2vec 2.0 · PyTorch · audio ML · transformers

The Problem

Humans get better at distinguishing speech sounds partly from hearing non-speech — rain, traffic, music, whatever. The perceptual system sharpens phoneme categories through general audio exposure, not just speech. So: does the same trick work on a self-supervised speech model? If you fine-tune wav2vec on non-speech audio, does it get better at telling phonemes apart — or does it forget what it already knows?

The Approach

Took wav2vec 2.0 Base (pretrained on 960 hours of LibriSpeech) and fine-tuned it on 12.5 hours of non-speech AudioSet — 108 categories, everything from rain to animal calls. Built the pipeline from scratch: automated download, 16 kHz resampling, format normalization across thousands of clips.
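The per-clip normalization step looks roughly like this. A minimal sketch: the real pipeline used a proper resampler (e.g. torchaudio's), and `normalize_clip` is an illustrative name, not the project's actual code — plain NumPy linear interpolation stands in here so the logic is visible.

```python
import numpy as np

TARGET_SR = 16_000  # wav2vec 2.0 expects 16 kHz mono input

def normalize_clip(samples: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono, peak-normalize, and resample to 16 kHz.

    Linear interpolation stands in for a real polyphase resampler
    (e.g. torchaudio.transforms.Resample in a production pipeline).
    """
    if samples.ndim == 2:            # (channels, time) -> mono
        samples = samples.mean(axis=0)
    peak = np.abs(samples).max()
    if peak > 0:
        samples = samples / peak     # peak-normalize to [-1, 1]
    if sr != TARGET_SR:
        n_out = int(round(len(samples) * TARGET_SR / sr))
        old_t = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
        new_t = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        samples = np.interp(new_t, old_t, samples)
    return samples.astype(np.float32)

# A 1-second stereo clip at 48 kHz becomes 16 000 mono samples:
clip = np.random.randn(2, 48_000)
out = normalize_clip(clip, 48_000)
print(out.shape)  # (16000,)
```

Running every AudioSet clip through one function like this is what makes thousands of heterogeneous downloads look identical to the model.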

Added a multilabel classification head (768→108) on top of wav2vec's transformer output. Trained with gradient checkpointing and dynamic batch sizing because the GPU was a bottleneck — I had real-time temperature monitoring that would throttle before thermal shutdown. Evaluated with the ABX phoneme discrimination task, which is the standard way to measure whether a model can tell minimal pairs of speech sounds apart.
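The head itself is small. A sketch under stated assumptions: the encoder is omitted (the real project fed it wav2vec 2.0's transformer output, which is 768-dimensional for the Base model), mean-pooling over time is one reasonable choice rather than necessarily the one used, and the class name is illustrative. The multilabel part is the key detail — one AudioSet clip can carry several sound categories at once, so the loss is per-label binary cross-entropy rather than softmax cross-entropy.

```python
import torch
import torch.nn as nn

class MultilabelHead(nn.Module):
    """Mean-pools the encoder's frame outputs, then projects
    768 hidden dims to 108 AudioSet category logits."""
    def __init__(self, hidden: int = 768, n_labels: int = 108):
        super().__init__()
        self.proj = nn.Linear(hidden, n_labels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, hidden) from wav2vec 2.0's transformer
        pooled = frames.mean(dim=1)   # (batch, hidden)
        return self.proj(pooled)      # (batch, n_labels)

head = MultilabelHead()
frames = torch.randn(4, 249, 768)     # ~5 s of 16 kHz audio -> ~249 frames
logits = head(frames)
print(logits.shape)                   # torch.Size([4, 108])

# Per-label BCE: each of the 108 categories is an independent yes/no.
targets = torch.randint(0, 2, (4, 108)).float()
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

Gradient checkpointing (in Hugging Face models, `model.gradient_checkpointing_enable()`) trades recomputation for activation memory, which is what made training viable at all on the constrained GPU.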

The Outcome

The baseline won. Not statistically significant (p = 0.276), but the direction was clear — fine-tuning on non-speech hurt rather than helped. The more revealing number: the fine-tuned model's delta-value standard deviation collapsed from 31.67 to 0.80. The model didn't learn sharper distinctions. It learned flatter ones. Its discriminative range compressed instead of expanding.
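The two numbers above come from the same computation. In each ABX trial, A and X share a phoneme category and B differs; the trial is correct if X's representation sits closer to A than to B, and the delta (distance to B minus distance to A) measures how decisively it does. A sketch, assuming cosine distance and a hypothetical `abx_scores` helper — the actual evaluation followed the standard ABX protocol:

```python
import numpy as np

def abx_scores(a, b, x):
    """One ABX trial per row: a and x share a phoneme category, b differs.
    Returns (accuracy, std of delta values)."""
    def cos_dist(u, v):
        return 1.0 - np.sum(u * v, axis=1) / (
            np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    # delta > 0 means x is closer to a (correct); its magnitude is
    # how sharply the model separates the two categories.
    deltas = cos_dist(x, b) - cos_dist(x, a)
    return float(np.mean(deltas > 0)), float(np.std(deltas))

# Toy embeddings where the categories are genuinely separable:
rng = np.random.default_rng(0)
a = rng.normal(loc=+1.0, size=(200, 768))
x = rng.normal(loc=+1.0, size=(200, 768))
b = rng.normal(loc=-1.0, size=(200, 768))
acc, delta_std = abx_scores(a, b, x)
```

A delta-value std collapsing from 31.67 to 0.80 means every trial produced nearly the same margin: the model stopped making some distinctions sharply and others weakly, and made all of them equally faintly.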

Baseline ABX accuracy: 79.17%
Fine-tuned ABX accuracy: 70.14%
Delta-value std dev: 31.67 → 0.80
The fine-tuned model was more consistent — but consistently worse.

Why It Failed (And Why That's Fine)

Training reached 0.48 epochs before GPU temps forced a stop. I had 12.5 hours of non-speech audio; the reference study used 600 hours. That's not a rounding error — I couldn't give the model enough signal to properly test the hypothesis. The hypothesis might still be right. But the engineering pipeline — AudioSet filtering, wav2vec modification, ABX evaluation wired end to end — works, and that was the actual deliverable.