The Problem with Pronunciation Feedback in AI Language Apps
Your app says you pronounced it correctly. But you didn't. And the app has no idea.
Here's what happens every time you practice pronunciation on Duolingo Max, Speak, ELSA, Praktika, or TalkPal:
- You speak a sentence
- The app transcribes what you said into text (this is where information gets lost)
- The app's language model reads that text and generates a response
- You get feedback based on the text, not on what you actually sounded like

Except... the app never heard what you actually sounded like. An STT model heard your speech and converted it to a text approximation. Then a language model read that text. At no point did the system actually process your pronunciation.

This creates a specific failure mode that destroys pronunciation learning: the app confidently reinforces errors it never heard.
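The whole pipeline can be sketched in a few lines of Python. Everything here is an illustrative placeholder, not any real app's API; the point is where the acoustic information disappears:

```python
# Illustrative sketch of the STT -> LLM feedback pipeline described above.
# All function names and values are hypothetical, not a real app's API.

def transcribe(audio: bytes) -> str:
    """STT step: collapses the acoustic signal to a text guess.
    Stress, vowel quality, accent, and timing are all discarded here."""
    return "perro"  # the model's best guess at the words spoken

def generate_feedback(transcript: str, target: str) -> str:
    """LLM step: only ever sees text, never the audio."""
    if transcript == target:
        return "Correct! You said 'perro' correctly."
    return f"It sounded like you said '{transcript}', not '{target}'."

audio = b"...learner audio with a flat English R..."  # stand-in for real audio
print(generate_feedback(transcribe(audio), "perro"))
# The feedback is computed from text alone; the pronunciation error
# in the audio never reaches the feedback step.
```

Notice that `generate_feedback` has no parameter through which acoustic information could even arrive.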
The Transcription Problem: Why STT Models Are Bad at Detecting Pronunciation Errors
Let's be very specific about what STT models do:
A speech-to-text model is trained to convert audio to text. That's it. It's optimized for one task: "Given this audio, what text was spoken?"
It's not trained to give you pronunciation feedback. It's not trained to detect accent. It's not trained to notice that you're struggling with a particular sound.
It's trained to guess what words you said.
Here's the problem: if the STT model guesses the correct text, the downstream feedback system assumes you pronounced everything correctly.
Example in Spanish. You're learning to roll your R's. The word is "perro" (dog). You attempt it but your pronunciation is terrible — your R is barely rolled, almost a normal English R.
- What actually happened: You mispronounced the R significantly
- What the STT model hears: Audio that's close enough to the word "perro" to transcribe it correctly
- What happens next: The transcribed text is "perro," which is correct
- What the feedback system says: "Correct! You said 'perro' correctly"
- What you learned: That your terrible R pronunciation is acceptable
The system trained you to be confident in a mistake.
This is especially problematic because STT models are trained primarily on native speakers. They're optimized for clear, fluent, correct pronunciation. The exact thing learners don't have.
When an STT model encounters learner speech — with hesitation, accent interference, substituted sounds, incorrect stress patterns — the model's job is to guess "what word did they probably mean to say?" Not "how did they actually pronounce this?"
A learner with an American accent trying to pronounce Portuguese nasalization might produce something an STT model transcribes as the correct word, simply because recognizing the word is all the model does. It has no way to evaluate the actual phonetic accuracy.
The Acoustic Information That Gets Lost
Here's what happens when audio gets converted to text:
Your pronunciation → [STT processes audio] → Text transcript → [LLM reads text] → Feedback
Notice what's missing: the pronunciation itself.
The text "café" contains no information about:
- Whether you stressed the first or second syllable
- Whether your vowels were tense or lax
- Whether you added a glottal stop where you shouldn't have
- Whether you muddied the consonant
- Whether you pronounced it with a heavy accent or naturally
- Whether you hesitated before the word
- Whether you substituted a similar sound from your native language
Text just says: "café"
An STT model might notice some of these things while processing the audio. But the output is text. That acoustic information doesn't make it to the feedback system.
So the feedback system can only evaluate: "Did the STT model transcribe a word that matches what should have been said?"
It can't evaluate: "Did you actually pronounce it correctly?"
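One way to see the information loss is to imagine two acoustically different attempts at "café" side by side. The feature names and numbers below are invented for illustration; a real system would measure them from the waveform:

```python
# Two acoustically different attempts at "café" -- hypothetical feature
# values for illustration only, not real measurements.
good_attempt = {
    "transcript": "café",
    "stressed_syllable": 2,      # ca-FÉ, the target stress
    "final_vowel_ms": 140,       # full-length final vowel
    "glottal_stop_inserted": False,
}
bad_attempt = {
    "transcript": "café",
    "stressed_syllable": 1,      # CA-fe, English-style stress
    "final_vowel_ms": 60,        # clipped final vowel
    "glottal_stop_inserted": True,
}

def text_only_feedback(attempt: dict, target: str) -> str:
    """A text-based system can only ever look at the transcript field."""
    return "Correct!" if attempt["transcript"] == target else "Try again."

# Both attempts get identical feedback, because every acoustic
# difference between them was discarded at transcription.
print(text_only_feedback(good_attempt, "café"))  # Correct!
print(text_only_feedback(bad_attempt, "café"))   # Correct!
```

Two very different pronunciations, one transcript, one verdict.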
What Pronunciation Feedback Actually Requires
Real pronunciation feedback requires processing the actual acoustic features of speech:
Phoneme-level accuracy: Did you produce the correct sound? Not "did the STT model transcribe the right word?" but "did the actual frequencies in your acoustic signal match the target phoneme?"
Suprasegmental features: Did you stress the right syllable? Did you use the right intonation pattern? Did you maintain the right rhythm? These are acoustic properties that don't map to individual phonemes.
Accent and dialect features: Are you using the target language's native vowel system? Are you producing the target language's consonants natively, or substituting from your L1?
Timing and duration: Did you make the vowel long enough? Did you hold the consonant for the right duration? Were your pauses in the right places?
None of this information lives in a text transcript. It lives in the acoustic signal itself.
Real pronunciation feedback requires the system to actually listen to the audio and compare it phonetically to the target pronunciation.
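To picture what phoneme-level comparison involves, here is a toy sketch. The feature names and thresholds are entirely hypothetical (a real system would extract features from the raw waveform with an acoustic model), but the shape of the logic is the point — compare acoustic properties, not transcripts:

```python
# Hypothetical acoustic features for a native reference of "perro".
# Names and numbers are illustrative, not real measurements.
REFERENCE = {
    "r_trill_contacts": 3,     # tongue-tip contacts in the rolled R
    "stressed_syllable": 1,    # PE-rro
    "first_vowel_ms": 110,
}

def acoustic_feedback(features: dict) -> list[str]:
    """Compare learner features against the reference, property by property."""
    issues = []
    if features["r_trill_contacts"] < 2:
        issues.append("Your R is not rolled enough — it's closer to an English R.")
    if features["stressed_syllable"] != REFERENCE["stressed_syllable"]:
        issues.append("Stress the first syllable: PE-rro.")
    if abs(features["first_vowel_ms"] - REFERENCE["first_vowel_ms"]) > 40:
        issues.append("Your first vowel length is off.")
    return issues or ["Good — matches the reference pronunciation."]

# A learner with an un-rolled R but correct stress and vowel length:
learner = {"r_trill_contacts": 0, "stressed_syllable": 1, "first_vowel_ms": 105}
for issue in acoustic_feedback(learner):
    print(issue)
```

Each check here operates on an acoustic property that a text transcript simply does not contain.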
This is why traditional pronunciation assessment tools like Microsoft Azure's pronunciation assessment or Speechling use phoneme-level acoustic analysis. They're not just checking if you said the right word. They're analyzing the acoustic features of your speech and comparing them to reference pronunciations.
But here's the key limitation: these acoustic assessment tools are often bolted onto STT-based systems as a secondary layer. They work better than STT-only feedback, but they're still downstream of transcription. The primary pipeline is still text-based.
And they're expensive. Adding sophisticated acoustic analysis to every utterance costs money and processing power. Most commercial apps don't do it.
The Research: Evidence of This Failure
There's actual research on this. Studies of STT-based language learning apps found a specific pattern:
The app's feedback was sometimes LESS accurate than feedback from humans.
In one study published in Frontiers in Psychology, learners used apps with STT-based pronunciation assessment. When compared to feedback from human evaluators:
- The app sometimes marked clearly incorrect pronunciations as acceptable
- The app missed subtle suprasegmental errors (stress, intonation)
- The app's feedback was inconsistent — the same pronunciation produced different feedback on different attempts
The telling detail: when human evaluators were given only the text transcript (the same input STT-based systems work from), their feedback was about as inaccurate as the app's. The bottleneck is the transcription step, not the intelligence reading the transcript.
The Contrast: What Audio-Native Feedback Actually Does
Here's what changes when a system processes audio natively instead of transcribing to text:
The model receives the full acoustic signal. It hears:
- The exact frequency of your vowels
- The presence or absence of voicing in consonants
- The duration of sounds
- The stress pattern
- The intonation contour
- Your accent features
- Your hesitations
It can then compare this directly to reference pronunciations and provide feedback on what's actually different.
"Your R is not rolled enough — it's closer to an English R sound. Here's what a native roll sounds like, and here's the difference in the acoustic features."
That's real pronunciation feedback. It's based on what you actually sounded like, not on what an STT model guessed you said.
This is why Yapr's audio-native pipeline gives better pronunciation feedback than STT-based apps. The model isn't guessing from text. It's analyzing the actual acoustic features of your speech.
Why This Matters More Than You Think
Here's the subtle but crucial problem: When an app confidently reinforces your errors, you don't realize you're making mistakes.
You practice with consistent, positive feedback. Your confidence goes up. You feel like you're improving.
Then you talk to a native speaker and realize they can barely understand you.
This is the dark side of AI language learning with bad pronunciation feedback: you can rack up hours of practice feeling successful while actually training your mouth to do the wrong thing.
This is especially brutal for heritage speakers and advanced learners. At higher levels of proficiency, suprasegmental features and accent become more important, not less. A native speaker can tell you're a learner by your stress patterns or intonation long before they notice individual mispronounced phonemes.
If the app can't assess pronunciation at that level, it's not teaching you advanced fluency. It's telling you you're correct when you're not.
The Competitive Landscape
Let's be honest about who's doing what:
ELSA is the best in the space for English pronunciation feedback specifically. They've built sophisticated acoustic analysis on top of STT. Their feedback is detailed and mostly accurate for English. But they're English-only, and the STT transcription step still causes some misses.
Speak (the $20/mo app with $162M funding) uses STT-LLM-TTS and has basic pronunciation feedback. It's functional but not detailed. And they only support 3 languages, so the acoustic models are language-specific but limited in scope.
Praktika ($38M funded) added pronunciation feedback but it's still STT-based. They have avatars and a polished interface, but the core feedback mechanism is limited by transcription.
Duolingo Max ($30/mo for speaking) has minimal pronunciation feedback. Speaking is secondary to their core gamification engine. The feedback exists but it's basic.
TalkPal ($6/mo) has almost no pronunciation feedback. It's a cheap STT-LLM-TTS wrapper with generic language model responses.
Langua ($10-15/mo) has cloned native voices but still uses STT on your input, so pronunciation feedback is limited by transcription.
None of these have audio-native pronunciation feedback because none of them use audio-native architectures.
The Test You Can Do Yourself
Want to test whether a language app actually evaluates your pronunciation?
The Mispronunciation Test:
- Pick a difficult sound in your target language
- Deliberately mispronounce it the same way across multiple attempts
- Make it obviously wrong — different from the target sound
Example: In Spanish, deliberately roll your R incorrectly (use an English R) every time.
What happens with STT-based feedback:
- First attempt: "Correct! You said [word] correctly"
- Second attempt (same mispronunciation): "Correct! You said [word] correctly"
- Third attempt (identical mispronunciation): "Correct! You said [word] correctly"
The STT model transcribed the word correctly because you're producing a recognizable approximation, even though your pronunciation is wrong.
What happens with audio-native feedback:
- First attempt: "Your R is not rolled correctly. It sounds more like an English R. Here's what a native roll sounds like..."
- Second attempt: "Same issue with the R. You're still not rolling it..."
- Third attempt: "You're making the same R error again..."
The system is actually listening to how you sound, not just whether you said the right word.
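The contrast in the transcripts above can be simulated with a toy harness. Both functions below are illustrative stand-ins for the two architectures, not real app code:

```python
# Toy simulation of the mispronunciation test: three identical attempts
# at "perro", each with a flat English R. All names are hypothetical.
attempt = {"transcript": "perro", "r_is_rolled": False}

def stt_based_feedback(a: dict) -> str:
    # Text-based system: only the transcript is checked.
    if a["transcript"] == "perro":
        return "Correct! You said 'perro' correctly."
    return "Try again."

def audio_native_feedback(a: dict) -> str:
    # Audio-native system: the acoustic property itself is checked.
    if not a["r_is_rolled"]:
        return "Your R is not rolled correctly — it sounds like an English R."
    return "Good roll!"

for i in range(3):
    print(f"Attempt {i + 1} (STT-based):   {stt_based_feedback(attempt)}")
    print(f"Attempt {i + 1} (audio-native): {audio_native_feedback(attempt)}")
```

Run the same mispronunciation three times: the text-based checker praises every attempt, while the acoustic checker flags the same R error every time.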
What This Means for Your Learning
If you're using an app with STT-based pronunciation feedback, understand what you're actually getting:
You're getting word-level accuracy, not pronunciation-level accuracy.
The app can tell you whether you said the right word. It can't reliably tell you whether you pronounced it correctly.
For beginning learners, this might be okay. At A1 level, getting the words right is the priority. But as soon as you care about sounding natural — which is immediately if you want to talk to actual native speakers — STT-based feedback becomes useless.
You need a system that actually analyzes your acoustic features and compares them to native pronunciation targets.
That requires native audio processing, not transcription-based processing.
Sources:
- [How to use pronunciation assessment in the Microsoft Foundry portal | Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/pronunciation-assessment-tool)
- [The impact of automatic speech recognition technology on second language pronunciation | Frontiers in Psychology](https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2023.1210187/full)
- [Correct Pronunciation and Why It May Sound Wrong to Natives | FutureBee AI](https://www.futurebeeai.com/knowledge-hub/correct-pronunciation-sounds-wrong)
- [How accurate is speech-to-text in 2026? | AssemblyAI](https://www.assemblyai.com/blog/how-accurate-speech-to-text)
- [Use pronunciation assessment - Foundry Tools | Azure AI](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-pronunciation-assessment)
Yapr is a voice-first language learning app that gives pronunciation feedback on your actual acoustic features, not on what an STT model guesses you said. Audio in, audio out, with real-time feedback on how you actually sound. Try it free at [yapr.ca](https://yapr.ca).