Oh, it gets better.
So hlfaceposer.exe embeds phoneme data in WAV files to make the game's automatic lipsync feature work.
It starts out promising – the data seems to be embedded in a custom chunk, which is a perfectly fine way to extend a chunk-based format like WAV.
It should be chunk type, followed by size, followed by chunk contents. I assume most implementations will ignore chunks they don't understand; at least that's how PNG handles it, and the fact that the improperly formatted file plays in all software I've tested backs that up. So using the non-standard "VDAT" ("Valve Data"?) chunk is fine.
Next up: size.
The chunk type identifier is four bytes ("VDAT"), the size is another four bytes, an unsigned little-endian integer. It's 54 02 00 00, meaning a length of 596 bytes (excluding the chunk type and the length field itself). And that does check out.
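To make the header layout concrete, here's a rough Python sketch of walking the chunks and decoding that size field. The names `iter_chunks` and `size_from_hex` are made up for illustration, and it assumes the file follows the standard RIFF layout (12-byte RIFF/WAVE header, then chunks padded to an even length). I haven't checked how hlfaceposer.exe itself reads this.

```python
import struct

def iter_chunks(data: bytes):
    """Yield (chunk_id, payload) for each top-level chunk of a RIFF/WAVE file.

    Assumes standard RIFF layout; not verified against Valve's tools.
    """
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12  # skip "RIFF", overall size, "WAVE"
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        # 4-byte unsigned little-endian size, excluding the 8-byte header
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield chunk_id, data[pos + 8:pos + 8 + size]
        # RIFF chunks are word-aligned: odd sizes get one pad byte
        pos += 8 + size + (size & 1)

def size_from_hex(hex_bytes: str) -> int:
    """Decode a hex dump of the 4-byte size field, e.g. '54 02 00 00'."""
    return struct.unpack("<I", bytes.fromhex(hex_bytes))[0]

print(size_from_hex("54 02 00 00"))  # 596
```

Grabbing the phoneme data would then be something like `dict(iter_chunks(raw)).get(b"VDAT")`, where `raw` is the file's bytes.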
The contents of the chunk though?
It's the same pseudo-JSON Valve uses for all their stuff, including choreography data. It's like JSON if you stripped away quotation marks, colons, and commas and relied solely on spaces and newlines as delimiters.
Shitty idea.
The format consists solely of keywords (like "WORDS" or "WORD"), words and phonemes ("aa", "uw", etc.), and timestamps, all written into the file as plain ASCII text, separated only by spaces.
That's right, spaces. Not even 00 bytes – ASCII spaces. That, and curly braces to separate words, but values inside each word and phoneme? Spaces.
jfc...
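For what it's worth, a format like the one described above can at least be tokenized with a plain whitespace split plus brace nesting. This is only a sketch: the `sample` string is invented, not copied from a real file, the field order in it is a guess, and the code assumes braces always show up as their own whitespace-separated tokens. `parse_block` and `parse_vdat_text` are hypothetical names.

```python
def parse_block(tokens, pos=0):
    """Turn a flat token list into nested lists.

    '{' opens a child list, '}' closes the current one.
    Returns (items, position after the last consumed token).
    """
    items = []
    while pos < len(tokens):
        tok = tokens[pos]
        pos += 1
        if tok == "{":
            child, pos = parse_block(tokens, pos)
            items.append(child)
        elif tok == "}":
            return items, pos
        else:
            items.append(tok)  # keyword, word, phoneme, or timestamp
    return items, pos

def parse_vdat_text(text: str):
    """Split on any whitespace (spaces or newlines) and nest by braces."""
    tree, _ = parse_block(text.split())
    return tree

# Invented sample; the real field order may differ.
sample = "WORDS { WORD hello 0.10 0.55 { aa 0.10 0.30 uw 0.30 0.55 } }"
print(parse_vdat_text(sample))
```

Everything comes back as strings, so telling a word from a timestamp is still up to whoever reads the tree, which is kind of the whole problem.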