wav2vec Fine-Tuning
Can exposing a speech model to non-speech sounds improve how it discriminates phonemes?
The Problem
Humans get better at distinguishing speech sounds partly from hearing non-speech — rain, traffic, music, whatever. The perceptual system sharpens phoneme categories through general audio exposure, not just speech. So: does the same trick work on a self-supervised speech model? If you fine-tune wav2vec on non-speech audio, does it get better at telling phonemes apart — or does it forget what it already knows?
The Approach
Took wav2vec 2.0 Base (pretrained on 960 hours of LibriSpeech) and fine-tuned it on 12.5 hours of non-speech AudioSet — 108 categories, everything from rain to animal calls. Built the pipeline from scratch: automated download, 16 kHz resampling, format normalization across thousands of clips.
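The resampling and format-normalization steps can be sketched roughly as below. This is a minimal illustration, not the actual pipeline: the function names are hypothetical, and a real pipeline would use a proper polyphase resampler (e.g. torchaudio or librosa) rather than linear interpolation.

```python
import numpy as np

def resample_to_16k(wave: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Toy linear-interpolation resampler (illustrative only; a real
    pipeline would use a polyphase resampler with anti-aliasing)."""
    if orig_sr == target_sr:
        return wave
    duration = len(wave) / orig_sr
    n_out = int(round(duration * target_sr))
    t_in = np.arange(len(wave)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, wave)

def normalize_format(wave: np.ndarray) -> np.ndarray:
    """Collapse to mono float32, peak-normalized to [-1, 1] —
    the single-channel float input wav2vec 2.0 expects."""
    if wave.ndim == 2:                # (channels, samples) -> mono
        wave = wave.mean(axis=0)
    wave = wave.astype(np.float32)
    peak = np.abs(wave).max()
    return wave / peak if peak > 0 else wave
```

Applied clip by clip, this flattens AudioSet's mix of sample rates and channel layouts into one uniform format before training.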
Added a multilabel classification head (768→108) on top of wav2vec's transformer output. Trained with gradient checkpointing and dynamic batch sizing because GPU memory and heat were the binding constraints; real-time temperature monitoring throttled training before thermal shutdown. Evaluated with the ABX phoneme discrimination task, the standard way to measure whether a model can tell minimal pairs of speech sounds apart.
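The head itself is a small module on top of the transformer stack. A minimal PyTorch sketch, assuming mean-pooling over time (the pooling strategy is my assumption; the 768→108 dimensions come from the setup above):

```python
import torch
import torch.nn as nn

class MultilabelHead(nn.Module):
    """Sketch of a multilabel head over wav2vec 2.0 Base frame features.
    hidden=768 matches the Base transformer width; n_labels=108 matches
    the AudioSet category subset used here."""
    def __init__(self, hidden: int = 768, n_labels: int = 108):
        super().__init__()
        self.proj = nn.Linear(hidden, n_labels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 768) transformer output
        pooled = frames.mean(dim=1)   # mean-pool over time -> (batch, 768)
        return self.proj(pooled)      # (batch, 108) logits, one per category

# Multilabel training pairs these logits with a per-category sigmoid:
# loss = nn.BCEWithLogitsLoss()(logits, targets)  # targets: multi-hot (batch, 108)
```

BCE-with-logits (rather than softmax cross-entropy) is the natural choice because an AudioSet clip can carry several labels at once.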
The Outcome
The baseline won. The difference wasn't statistically significant (p = 0.276), but the direction was consistent: fine-tuning on non-speech hurt rather than helped. The more revealing number: the fine-tuned model's delta-value standard deviation collapsed from 31.67 to 0.80. The model didn't learn sharper distinctions. It learned flatter ones. Its discriminative range compressed instead of expanding.
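For context on what collapsed: each ABX trial compares a token X against A (same phoneme category as X) and B (different category), and the delta is the distance margin in favor of A. The spread of these deltas across trials is the standard deviation that fell from 31.67 to 0.80. A toy sketch of one trial, assuming cosine distance on pooled representations (real ABX evaluations typically use frame-wise, DTW-aligned distances):

```python
import numpy as np

def abx_delta(a: np.ndarray, b: np.ndarray, x: np.ndarray) -> float:
    """One ABX trial: A and X share a phoneme category, B differs.
    Positive delta means the model places X closer to A (correct).
    Cosine distance on pooled vectors is a simplifying assumption."""
    def cos_dist(u, v):
        return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cos_dist(b, x) - cos_dist(a, x)
```

Taking np.std over thousands of such deltas gives the spread statistic: a wide spread means some contrasts are strongly separated, a near-zero spread means the model treats almost every contrast the same.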
Why It Failed (And Why That's Fine)
Training reached 0.48 epochs before GPU temps forced a stop. I had 12.5 hours of non-speech audio. The reference study used 600 hours. That's not a rounding error: the model never saw enough signal to properly test the hypothesis. The hypothesis might still be right. But the engineering pipeline — AudioSet filtering, wav2vec modification, ABX evaluation wired end to end — that works, and that was the actual deliverable.