Turning Thoughts into Spoken Words

Sophie Fessl, Ph.D.
March 6, 2019

In the operating room at Northwestern Memorial Hospital, the surgeon opens the patient’s skull and cuts through the dura, the membrane that envelops the brain. Carefully, he places a grid of electrodes onto the surface of the brain, to map where the brain areas for the most important functions end and where the glioma, the brain tumor about to be taken out, starts. While the surgeon plans where to cut and what to spare, a rare opportunity for neuroscientists opens up: 11 minutes to convert brain activity into intelligible sound.

Turning inner speech into sound

Inner speech, our inner voice that isn’t spoken out loud, is the new frontier for alternative communication devices. These devices help people who are unable to speak to relay their thoughts, wants, and needs. With a twitch of the cheek, or even direct brain activity, users control a cursor and steer a speech device. One of the most famous people to use such a device was cosmologist Stephen Hawking: Rendered mostly immobile by ALS, he used an infrared switch mounted on his glasses that captured the slightest twitch in his cheek to move a cursor or select words on his speech device. Several groups of neuroscientists around the world are working to develop a more intuitive way to communicate: converting the brain signals of what users would like to say into sound using a voice synthesizer, turning their thoughts into words.

Three recent papers posted on the preprint server bioRxiv describe how researchers have taken steps towards such a “mind reader,” or brain-computer interface (BCI), turning data gathered from electrodes placed directly on the brain’s surface, so-called electrocorticography arrays, into computer-generated speech. None of the techniques in these studies was able to decipher speech that participants merely imagined.

Instead, they deciphered brain signals while participants read out loud, silently mouthed words, or listened to stories. However, this is a step towards deciphering imagined speech, says Stephanie Martin, a neuroscientist working on speech BCIs who was not involved in the new studies. “These studies are very exciting,” she said. “They are the first studies that show that you can actually understand speech that is synthesized from brain activity.”

For Christian Herff, a neuroscientist at Maastricht University who works with data generated during glioma surgery at Northwestern University, converting thoughts into sound is a step towards a more natural style of conversation for people who are unable to speak. “Textual representation just doesn’t give you a good conversation. A natural conversation contains so much more – I’m able to interrupt you, I can convey irony or sarcasm. So, I think this would be an even further step towards natural conversation for this type of patients.”

Neural networks aid in the search for a translation

One of the challenges in converting the pattern of which neurons turn on and off into sound is that there is no easy one-to-one translation. The translation is different for each of us, a little like needing a different dictionary for speaking to each person we meet. In previous attempts at deciphering brain activity, the BCI could generate sound, but the “words” weren’t understandable. And data is scarce, because it can only be collected invasively.

Figure: Cartoon of the experimental setup. Data from electrodes placed on the brain are matched to acoustic data picked up by a microphone at the same time. Image courtesy of Christian Herff (see also video, below).

“When we record in the break during glioma removal, the most we got in our latest experiments were 11 minutes of data,” says Christian Herff. “The other possibility is to collect data from epilepsy patients who are fitted with an electrode grid to find the source of their seizures, while they wait for a seizure to occur in the hospital. But also here, groups collect about half an hour, 40 minutes of data. Siri [Apple’s voice-machine interface], on the other hand, is trained on several hundred thousand hours.”

To make the most of scarce data, the groups used “neural network” models. Neural networks process input information by passing it through successive layers of artificial neurons, and the connections between them can be adjusted, or weighted, says Nima Mesgarani, a computer scientist at Columbia University and author of one of the new studies. “Which neurons are connected to which, and how much weight is given to each, is what the network learns from the data.” In the experiments, the neural network models were fed data about the participants’ brain activity and the audio they heard or produced, to find ways to translate between them. Using these models, researchers were able to generate words from brain activity that were, in some cases, intelligible to listeners.
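What “learning the weights” means can be shown with a toy example. The sketch below is an illustration only, not the models used in the studies: it trains a network with one small hidden layer by plain gradient descent on synthetic numbers. The function name `train_tiny_network`, the two-channel “brain” input, and the made-up “sound” target are all assumptions chosen for this example.

```python
import math
import random

def train_tiny_network(inputs, targets, hidden=4, lr=0.05, epochs=300, seed=0):
    """Train a one-hidden-layer network with plain gradient descent.

    Returns (predict, losses): a prediction function and the mean
    squared error measured during each pass over the data.
    """
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # Connection weights start out random; training adjusts them.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        # Layer 1: weighted sums passed through tanh; layer 2: weighted sum.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        return h, sum(w * hi for w, hi in zip(w2, h)) + b2

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, t in zip(inputs, targets):
            h, y = forward(x)
            err = y - t
            total += err * err
            # Backpropagation: nudge every weight against the gradient
            # of half the squared error for this example.
            for j in range(hidden):
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
        losses.append(total / len(inputs))

    def predict(x):
        return forward(x)[1]

    return predict, losses

# Synthetic stand-ins (hypothetical): two "electrode" values per sample
# and one "sound" value that depends on them.
rng = random.Random(1)
brain = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
sound = [0.8 * a - 0.5 * b for a, b in brain]

predict, losses = train_tiny_network(brain, sound)
print(f"error in first pass: {losses[0]:.4f}, in last pass: {losses[-1]:.4f}")
```

The error shrinks from pass to pass because each weight is repeatedly adjusted in the direction that reduces it. Real speech decoders work on far larger inputs and networks, but the principle of learned weights is the same.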

New studies improve intelligibility of decoded words

The group of Tanja Schultz at the University of Bremen asked patients undergoing surgery for glioma removal to read out individual words during the preparation time, when the surgeon was mapping which brain areas to remove. While electrodes recorded signals from the brain’s speech planning and motor areas, a microphone captured the words spoken out loud. Christian Herff and Miguel Angrick, first authors and now at Maastricht University, trained their neural network model on one part of the data, correlating what was said with neural activity. When they then fed their model previously unseen data on neural activity from the same patient, the model reconstructed what was said. A computerized scoring system for intelligibility judged that about 40 percent of the reconstructed words were intelligible.
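The logic of training on one part of the data and testing on unseen data can be sketched in a few lines. The example below is a simplified stand-in under stated assumptions: it swaps the neural network for a straight-line fit, swaps the intelligibility scoring for a Pearson correlation, and uses synthetic numbers in place of recordings. The helper names `fit_line` and `pearson` are invented for this illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

# Synthetic stand-ins (hypothetical): one "neural" value per time step
# and a noisy "audio" value that depends on it.
rng = random.Random(42)
neural = [rng.uniform(0, 1) for _ in range(100)]
audio = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in neural]

split = 80  # first 80 samples for training, last 20 held out
slope, intercept = fit_line(neural[:split], audio[:split])

# Reconstruct audio for data the model never saw during training.
reconstructed = [slope * x + intercept for x in neural[split:]]
score = pearson(reconstructed, audio[split:])
print(f"correlation on held-out data: {score:.3f}")
```

Scoring only the held-out portion matters: a model can match its training data closely and still fail on new recordings, so the unseen data gives the more honest estimate.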

Edward Chang and his group at the University of California, San Francisco, recorded data from patients during epilepsy monitoring. While three participants spoke several hundred sentences out loud, the researchers recorded brain activity from speech and motor areas in the cortex, and also measured the dynamics of the vocal tract, i.e., the movement of mouth and throat while speaking. Their model then decoded sound from the measure of vocal tract dynamics. After recording and decoding, the researchers asked 166 people to identify which of 10 sentences corresponded to the decoded audio that they heard; 83 percent of them correctly identified each sentence. In addition, Chang and his team recorded cortical signals while the patients silently mouthed words, and were able to reconstruct audio from this data.

Martin views this as an important step towards decoding inner speech. “This is, in my view, the most encouraging study. They achieved an intermediate level from speaking out loud, to mouthing words, to just thinking words. It is impressive to see that you can reconstruct the sound that patients intended to share through articulation only.”

Finally, Nima Mesgarani took a different approach in a study that was also recently published in Scientific Reports. A previous study suggested that neural activity is similar during listening to speech and imagining speaking. Mesgarani and his team therefore recorded signals from the auditory cortex while five epilepsy patients listened to recordings of stories and of people naming digits from zero to nine. Similar to the work of the Schultz group, the researchers used neural networks to turn the recorded brain signals into audio. When they played the reconstructed audio, 75 percent of listeners correctly identified which number was heard.

This is an improvement, says Mesgarani: “Previously, if you didn’t know what the original sound was, you wouldn’t have been able to tell. By combining deep neural networks and the latest innovations in speech synthesis technology, we can hear the sound of the audio recording, decoded from the listener’s brain activity. But ultimately, the motivation is that you are able to decode this inner voice so as to restore speech to people who have lost it.”

Challenges remain

One of the challenges faced in realizing a thought-to-speech BCI is the need to use very invasive recording techniques, says Martin. “Non-invasive recording techniques, like EEG or fMRI, do not have the necessary spatial and temporal resolution to decode the correlates between speech and neural activity.” And it doesn’t look like EEG recordings can realistically be used to decode thoughts, says Herff: “The problem with speech in EEG is that it is not very localized. Our skull is a fantastic low-pass filter, but all the speech decoding work is in the high gamma band, above 70 Hertz. We are not able to measure that outside of the skull – it is all filtered out.”
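What a low-pass filter does to fast signals can be illustrated with a simple simulation. The sketch below is not a model of the skull: it applies a basic single-pole low-pass filter to two sine waves, one slow (10 Hertz) and one in the range Herff mentions (100 Hertz). The 30 Hertz cutoff, the 1,000 Hertz sampling rate, and the function names are assumptions chosen for the illustration.

```python
import math

def low_pass(signal, cutoff_hz, sample_rate_hz):
    """Single-pole low-pass filter (exponential smoothing)."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, previous = [], 0.0
    for value in signal:
        # Each output moves only part of the way toward the new input,
        # so rapid swings are smoothed away.
        previous += alpha * (value - previous)
        out.append(previous)
    return out

def sine(freq_hz, sample_rate_hz, seconds=1.0):
    """Unit-amplitude sine wave sampled at sample_rate_hz."""
    n = int(sample_rate_hz * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]

def amplitude(signal):
    """Peak absolute value over the second half, after the filter settles."""
    return max(abs(v) for v in signal[len(signal) // 2:])

SAMPLE_RATE = 1000  # samples per second (assumed)
CUTOFF = 30         # filter cutoff in Hertz (assumed)

slow = amplitude(low_pass(sine(10, SAMPLE_RATE), CUTOFF, SAMPLE_RATE))
fast = amplitude(low_pass(sine(100, SAMPLE_RATE), CUTOFF, SAMPLE_RATE))
print(f"10 Hz keeps {slow:.2f} of its amplitude, 100 Hz keeps {fast:.2f}")
```

Both inputs start with the same amplitude, but the fast wave comes out much weaker than the slow one. That is the general effect Herff describes for high-frequency activity measured outside the skull.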

Although electrocorticography is invasive, Herff sees it as a possible option for people who need it in order to be heard: “Implanting an electrocorticography grid to detect a seizure is very similar to the implantation of deep brain stimulation devices to treat Parkinson’s disease. And Parkinson’s isn’t as severe a condition as locked-in syndrome. So I can easily see this type of electrode being implanted for a speech prosthesis, if it actually helps the patients.”

Martin has, in previous work, been able to distinguish between two words that participants were asked to think. “This is, for now, the best we can do with inner speech. But decoding one word versus another is already useful in a clinical setting. You can ask: are you hungry? And the patient can say yes or no in a very intuitive way. But we are still far from decoding natural inner speech.”

When trying to decode inner speech, researchers are running into different types of problems. “When you speak out loud, you can record – in synchrony – the sound that you produce and the brain activity,” Martin says. “But that is not possible with inner speech. We don’t know when the thought starts or when it ends. And it is more difficult to control the way people produce thoughts: some people are more visual, some are more auditory, some think in a more abstract way.”

Martin’s lab is trying to expand the repertoire of words its algorithm can decode from thought, aiming to distinguish among ten clinically relevant words. Mesgarani, too, is taking the leap to decoding inner speech in follow-up research.

For Herff, the next improvement is to establish a closed-loop device, in which participants hear the decoded output. “With closed-loop applications, patients can imagine speech and immediately hear the output. Machine and patient can co-adapt: The algorithm learns to predict the user better and the user learns how to modulate their brain activity better to work with the algorithm.”

While thought BCIs are a “hot topic,” Martin says, and several companies are considering throwing their weight behind this quest, it will still be a while until they reach clinics. “Probably closer to decades than years,” she says.

Video posted to YouTube by gtec medical engineering