I work adjacent to a group that does speech recognition. There’s a massive amount of variation in regional dialects, and that’s before you get to non-native speakers. Then you have people like my mother-in-law, who doesn’t have an accent, but her diction and grammar are… unique.
If someone is speaking in sentences, you can use context clues to infer intent, but it’s a lot more challenging when you’re just getting isolated spoken commands.
I suspect it’s a training/sample gap, but it’s likely going to be really hard to get to 100%.
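To make the context-clue point concrete, here’s a minimal sketch of rescoring a recognizer’s n-best list with a language model, so surrounding words can rescue an ambiguous command. GPT-2 is just a stand-in LM, and the hypotheses are invented for illustration:

```python
# Minimal sketch: rescore an ASR n-best list with a language model so
# sentence context can pick the plausible reading of an ambiguous command.
# GPT-2 is a stand-in LM; the hypotheses below are made up for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def lm_nll(text: str) -> float:
    """Mean negative log-likelihood per token; lower means more plausible."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Hypothetical n-best output from the recognizer for one utterance.
hypotheses = [
    "turn off the lights in the den",
    "turn off the lights in the then",
    "turnip the lights in the den",
]
print(min(hypotheses, key=lm_nll))
# -> "turn off the lights in the den": context makes the other two unlikely
```

With bare commands there’s no surrounding sentence to rescore against, which is exactly why they’re harder.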
Btw, why is there no speech recognition yet that uses an LLM to recognize words and meaning better?
And I can’t really google it; the results are flooded with Alexa and Siri and co., which is the reverse (speech feeding an assistant, rather than a language model improving the recognition itself).
If I’m understanding your question correctly, here’s an example model.
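For example, with something like OpenAI’s Whisper (a transformer-based recognizer trained end to end on audio–text pairs), usage is only a few lines; the model size and file name below are placeholders:

```python
# Minimal sketch with OpenAI's Whisper (pip install openai-whisper).
# "base" and "command.wav" are placeholders; ffmpeg must be on the PATH.
import whisper

model = whisper.load_model("base")        # downloads the checkpoint on first run
result = model.transcribe("command.wav")  # decodes and resamples via ffmpeg
print(result["text"])
```

It’s plain PyTorch underneath, so the same script runs on Windows or Linux, CPU or GPU.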
Exactly something like this for Windows/Linux.