
wav2vec Fine-Tuning

Can exposing a speech model to non-speech sounds improve how it discriminates phonemes?

2024 · Technical lead — pipeline, fine-tuning, evaluation · Completed
wav2vec 2.0 · PyTorch · audio ML · transformers

The Problem

Humans get better at distinguishing speech sounds partly from hearing non-speech — rain, traffic, music, whatever. The perceptual system sharpens phoneme categories through general audio exposure, not just speech. So: does the same trick work on a self-supervised speech model? If you fine-tune wav2vec on non-speech audio, does it get better at telling phonemes apart — or does it forget what it already knows?

The Approach

Took wav2vec 2.0 Base (pretrained on 960 hours of LibriSpeech) and fine-tuned it on 12.5 hours of non-speech AudioSet — 108 categories, everything from rain to animal calls. Built the pipeline from scratch: automated download, 16 kHz resampling, format normalization across thousands of clips.
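The per-clip normalization step looks roughly like this. A minimal sketch: the real pipeline used a proper resampler (e.g. torchaudio's), and `normalize_clip` is an illustrative name, not the project's actual code — plain NumPy linear interpolation stands in here so the logic is visible.

```python
import numpy as np

TARGET_SR = 16_000  # wav2vec 2.0 expects 16 kHz mono input

def normalize_clip(samples: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono, peak-normalize, and resample to 16 kHz.

    Linear interpolation stands in for a real polyphase resampler
    (e.g. torchaudio.transforms.Resample in a production pipeline).
    """
    if samples.ndim == 2:            # (channels, time) -> mono
        samples = samples.mean(axis=0)
    peak = np.abs(samples).max()
    if peak > 0:
        samples = samples / peak     # peak-normalize to [-1, 1]
    if sr != TARGET_SR:
        n_out = int(round(len(samples) * TARGET_SR / sr))
        old_t = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
        new_t = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        samples = np.interp(new_t, old_t, samples)
    return samples.astype(np.float32)

# A 1-second stereo clip at 48 kHz becomes 16 000 mono samples:
clip = np.random.randn(2, 48_000)
out = normalize_clip(clip, 48_000)
print(out.shape)  # (16000,)
```

Running every AudioSet clip through one function like this is what makes thousands of heterogeneous downloads look identical to the model.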

Added a multilabel classification head (768→108) on top of wav2vec's transformer output. Trained with gradient checkpointing and dynamic batch sizing because the GPU was a bottleneck — I had real-time temperature monitoring that would throttle before thermal shutdown. Evaluated with the ABX phoneme discrimination task, which is the standard way to measure whether a model can tell minimal pairs of speech sounds apart.
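The head itself is small. A sketch under stated assumptions: the encoder is omitted (the real project fed it wav2vec 2.0's transformer output, which is 768-dimensional for the Base model), mean-pooling over time is one reasonable choice rather than necessarily the one used, and the class name is illustrative. The multilabel part is the key detail — one AudioSet clip can carry several sound categories at once, so the loss is per-label binary cross-entropy rather than softmax cross-entropy.

```python
import torch
import torch.nn as nn

class MultilabelHead(nn.Module):
    """Mean-pools the encoder's frame outputs, then projects
    768 hidden dims to 108 AudioSet category logits."""
    def __init__(self, hidden: int = 768, n_labels: int = 108):
        super().__init__()
        self.proj = nn.Linear(hidden, n_labels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, hidden) from wav2vec 2.0's transformer
        pooled = frames.mean(dim=1)   # (batch, hidden)
        return self.proj(pooled)      # (batch, n_labels)

head = MultilabelHead()
frames = torch.randn(4, 249, 768)     # ~5 s of 16 kHz audio -> ~249 frames
logits = head(frames)
print(logits.shape)                   # torch.Size([4, 108])

# Per-label BCE: each of the 108 categories is an independent yes/no.
targets = torch.randint(0, 2, (4, 108)).float()
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

Gradient checkpointing (in Hugging Face models, `model.gradient_checkpointing_enable()`) trades recomputation for activation memory, which is what made training viable at all on the constrained GPU.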

The Outcome

The baseline won. Not statistically significant (p = 0.276), but the direction was clear — fine-tuning on non-speech hurt rather than helped. The more revealing number: the fine-tuned model's delta-value standard deviation collapsed from 31.67 to 0.80. The model didn't learn sharper distinctions. It learned flatter ones. Its discriminative range compressed instead of expanding.
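The two numbers above come from the same computation. In each ABX trial, A and X share a phoneme category and B differs; the trial is correct if X's representation sits closer to A than to B, and the delta (distance to B minus distance to A) measures how decisively it does. A sketch, assuming cosine distance and a hypothetical `abx_scores` helper — the actual evaluation followed the standard ABX protocol:

```python
import numpy as np

def abx_scores(a, b, x):
    """One ABX trial per row: a and x share a phoneme category, b differs.
    Returns (accuracy, std of delta values)."""
    def cos_dist(u, v):
        return 1.0 - np.sum(u * v, axis=1) / (
            np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    # delta > 0 means x is closer to a (correct); its magnitude is
    # how sharply the model separates the two categories.
    deltas = cos_dist(x, b) - cos_dist(x, a)
    return float(np.mean(deltas > 0)), float(np.std(deltas))

# Toy embeddings where the categories are genuinely separable:
rng = np.random.default_rng(0)
a = rng.normal(loc=+1.0, size=(200, 768))
x = rng.normal(loc=+1.0, size=(200, 768))
b = rng.normal(loc=-1.0, size=(200, 768))
acc, delta_std = abx_scores(a, b, x)
```

A delta-value std collapsing from 31.67 to 0.80 means every trial produced nearly the same margin: the model stopped making some distinctions sharply and others weakly, and made all of them equally faintly.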

Baseline ABX accuracy: 79.17%
Fine-tuned ABX accuracy: 70.14%
Delta-value std dev: 31.67 → 0.80
The fine-tuned model was more consistent — but consistently worse.

Why It Failed (And Why That's Fine)

Training reached 0.48 epochs before GPU temps forced a stop. I had 12.5 hours of non-speech audio; the reference study used 600 hours. That's not a rounding error — I couldn't give the model enough signal to properly test the hypothesis. The hypothesis might still be right. But the engineering pipeline — AudioSet filtering, wav2vec modification, ABX evaluation wired end to end — works, and that was the actual deliverable.