The best part is, you can test this out now.
Researchers from the University of Washington have created a system called “Target Speech Hearing” (TSH). This AI-powered creation allows you to zero in on a single speaker’s voice in a noisy environment – just by looking at them.
It’s not yet commercially available. But the great news is, the code for the proof-of-concept device is openly accessible for others to build upon and experiment with.
“We tend to think of AI now as web-based chatbots that answer questions. But in this project, we develop AI to modify the auditory perception of anyone wearing headphones, given their preferences,” said senior author Shyam Gollakota.
“With our devices, you can now hear a single speaker clearly even if you are in a noisy environment with lots of other people talking.”
How the AI-Powered Target Speech Hearing System Works
The TSH system uses AI to separate a target speaker’s voice through a simple process.
You just need to look at the person for three to five seconds while tapping a button. This activates the binaural microphones, which pick out the speaker’s voice within a 16-degree margin of error. By doing so, the system “enrolls” the speaker and remembers the unique sound of their voice.
That captured audio then goes to an embedded computer, where machine-learning software analyzes the speaker’s vocal patterns. This lets TSH strip away all other environmental noise, leaving you with a clear audio channel for the enrolled speaker – even as they move or turn away.
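To make the enroll-then-separate idea concrete, here is a deliberately toy sketch in Python. It is not the researchers’ neural network: the “voice” is a pure sine tone, the “embedding” is just a dominant frequency, and separation is a crude spectral mask. The function names (`enroll`, `separate`) and all signal parameters are hypothetical, invented for illustration only.

```python
import numpy as np

SR = 16_000  # sample rate in Hz (assumed for this toy example)

def enroll(audio: np.ndarray) -> float:
    """'Enroll' a speaker from a short clip by finding the dominant
    frequency -- a toy stand-in for a learned voice embedding."""
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1 / SR)
    return freqs[np.argmax(spectrum)]

def separate(mixture: np.ndarray, target_freq: float,
             width: float = 50.0) -> np.ndarray:
    """Keep only spectral content near the enrolled 'voice'; zero the rest."""
    spectrum = np.fft.rfft(mixture)
    freqs = np.fft.rfftfreq(len(mixture), d=1 / SR)
    mask = np.abs(freqs - target_freq) <= width
    return np.fft.irfft(spectrum * mask, n=len(mixture))

t = np.arange(SR) / SR                      # one second of audio
target = np.sin(2 * np.pi * 440 * t)        # the "enrolled speaker"
noise = 0.8 * np.sin(2 * np.pi * 1000 * t)  # an interfering "voice"

ref = enroll(target[: SR // 2])             # ~0.5 s enrollment clip
clean = separate(target + noise, ref)       # filter the noisy mixture

def snr_db(signal: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-noise ratio of the estimate, in decibels."""
    return 10 * np.log10(np.sum(signal**2) / np.sum((signal - estimate)**2))

print(f"enrolled frequency: {ref:.0f} Hz")
print(f"SNR after separation: {snr_db(target, clean):.1f} dB")
```

The real system replaces the dominant-frequency trick with a neural embedding of the speaker’s voice, and the spectral mask with a learned separation network – but the two-phase shape (short enrollment, then continuous filtering) is the same.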
What’s great is that this technology utilizes off-the-shelf headphones.
For example, the researchers tested it with a modified pair of Sony WH-1000XM4 headphones, fitted with binaural microphones (Sonic Presence SP15C) and an Orange Pi 5B embedded computer for processing.
By doing so, the researchers showed that TSH can be integrated into just about any consumer audio device.
Testing and Results
The Target Speech Hearing system was presented at the ACM CHI Conference on Human Factors in Computing Systems in Honolulu, where the researchers reported the results of tests with 21 subjects.
On average, the subjects rated the clarity of the enrolled speaker’s voice nearly twice as high as that of the unfiltered audio.
Plus, the system’s performance improved with continued speech, as it gathered more audio data to refine its model of the speaker’s voice.
Enrollment also proved effective even in noisy environments: the TSH system achieved a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio – only a 0.4 dB drop compared to quieter environments.
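To put that 7.01 dB figure in perspective, decibels are a logarithmic scale, so the gain translates to roughly a fivefold improvement in signal power. A two-line check:

```python
# Convert a decibel gain to a linear power ratio: ratio = 10^(dB / 10).
gain_db = 7.01
power_ratio = 10 ** (gain_db / 10)
print(f"{gain_db} dB ≈ {power_ratio:.1f}x signal power")  # ≈ 5.0x
```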
Unfortunately, the system isn’t perfect: it can struggle with interference from another overpowering voice coming from the same direction as the target.
But, in such cases, the user can simply re-enroll the speaker’s voice to isolate it better.
Real-World Applications and Future Plans
The TSH system holds promise for various real-world applications.
From enabling clearer communications in crowded venues to improving hearing aids, this tech could redefine how we experience sound in noisy environments.
Excitingly, the researchers already plan to integrate TSH into earbuds and hearing aids. Their vision also covers the hardware side, with AI chips potentially costing under $10 per unit at scale.
Plus, in true open-source spirit, the TSH system’s code, neural networks, and AI algorithms are openly available on GitHub. This means AI enthusiasts and developers can experiment and expand on this foundation, which the researchers hope will help them improve the tech further.
This is great news for those of us who can’t hear the person we’re interested in within a crowd. It’s frustrating not to fully hear or understand their conversation. I try to read lips, but that fails most of the time.