Most people take for granted their ability to listen to a single voice in a space crowded with other voices. But this “cocktail party effect,” a form of selective auditory attention, is far beyond the capability of machine-based auditory systems, and scientists have spent decades trying to understand how the brain accomplishes the feat. In a report published online Aug. 21 in the Proceedings of the National Academy of Sciences, researchers at Boston University described a set of experiments that bring this phenomenon into sharper focus.
“If I pay attention to some auditory object out in the world, I get better and better at listening to it exclusively and filtering out competing information from other signals,” says neuroscientist Barbara Shinn-Cunningham, director of the lab where the study was conducted. With postdoctoral researcher Virginia Best and two grad students, Shinn-Cunningham showed that this perceptual tuning process relies on both spatial and voice cues, and that it can take more than a second to lock onto a voice amid a noisy background.
“To my knowledge it’s really the first time that anyone has been able to examine in detail how well people are able to control their auditory attention in these complex situations,” says Steven Yantis, a Johns Hopkins University neuroscientist who works in this area and was not involved in the study.
In the experiments, five subjects—including the two grad students—were asked to listen to recorded sequences of numbers from five loudspeakers arranged horizontally in front of them. For each round of the test, the five loudspeakers simultaneously played different sequences of four spoken numbers, simulating five people saying different things all at once. At each point in the sequence, a light-emitting diode (LED) above one of the loudspeakers cued the listeners to which number to attend to. After the sequence had ended, the subject tried to type the correct target numbers, in order, on a keypad.
In the rounds of testing, sometimes the target loudspeaker was the same for all four numbers; sometimes it was different for each number. In one set of tests, the target voice stayed the same; in another, it was a different voice for each number. The delays between the numbers varied from a quarter second to a full second; in some tests, the LEDs provided advance warning of a shift in the target loudspeaker.
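The trial structure described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' stimulus code; the function name, parameters, and digit alphabet are assumptions.

```python
import random

N_SPEAKERS = 5  # five loudspeakers in a horizontal row
SEQ_LEN = 4     # four spoken numbers per trial

def make_trial(switch_location=False, rng=None):
    """Build one trial: a digit sequence per loudspeaker plus LED cues.

    switch_location=False keeps the target loudspeaker fixed across all
    four digits; True draws a (possibly different) target for each digit.
    """
    rng = rng or random.Random()
    # Each loudspeaker plays its own sequence of four spoken digits.
    streams = [[rng.randint(0, 9) for _ in range(SEQ_LEN)]
               for _ in range(N_SPEAKERS)]
    # The LED cue marks which loudspeaker is the target at each position.
    if switch_location:
        cues = [rng.randrange(N_SPEAKERS) for _ in range(SEQ_LEN)]
    else:
        fixed = rng.randrange(N_SPEAKERS)
        cues = [fixed] * SEQ_LEN
    # The correct response is the cued loudspeaker's digit at each position.
    targets = [streams[cues[i]][i] for i in range(SEQ_LEN)]
    return streams, cues, targets

streams, cues, targets = make_trial(switch_location=True,
                                    rng=random.Random(0))
print(cues, targets)
```

Scoring a subject's keypad response against `targets` position by position would then yield the accuracy measure the study compares across fixed- and switched-target conditions.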
As expected, the listeners did best when the voice and loudspeaker were the same for every number in the sequence. When the target voice or location changed, the listeners incurred an attention-switching “cost”: they recalled the target sequence less accurately.
What Shinn-Cunningham and her students didn’t expect was how large that cost could be. Even when listeners had a full second of advance warning about a new target loudspeaker position, their accuracy didn’t rise to the level seen when the position remained the same.
To Shinn-Cunningham, this suggests that the process of attending to one sound source among many has two distinct aspects. The first, selective attention, is a relatively centralized brain function that can effectively shift the focus of neuronal activity from one target to another within a few tenths of a second. The second aspect is how the brain brings that target into sharper perceptual focus.
Researchers have argued that the brain tends to separate the data in a visual scene into “objects,” each of which has a specific neural representation in the brain. Broadly speaking, visual attention to an object causes relevant sensory information to be enhanced and irrelevant information to be suppressed. In a recent paper Shinn-Cunningham proposed that auditory perception works much the same way, creating auditory “objects” or sound sources that can then be enhanced by attention.
According to Shinn-Cunningham, that fine-tuning, object-building process in the auditory domain can take a surprisingly long time, an order of magnitude longer than the basic selective attentional disengagement and re-engagement process: “The act of building up [the object] spanned these multiple digits that were spread apart already in time by a second,” she says.
The experiments also suggest that this buildup process is optimized when the streams of data on the object location and quality—in this case, the voice quality—are more or less continuous. Shinn-Cunningham says that one of the next steps is to test whether a similar buildup process occurs in dynamic visual perception.
Yantis agrees, noting that most visual attention studies to date have used only snapshot-type images or sequences of snapshots, whereas Shinn-Cunningham’s study looked at more dynamic attentional processes. “Auditory stimuli are by their very nature dynamic; speech consists of sequences of pitches that change over time,” he says. “So there’s both a methodological and a theoretical potential for this paper to influence what people are doing in the visual domain.”
Shinn-Cunningham’s work is sponsored by the Office of Naval Research, and a significant portion of the research on the cocktail party effect has been published by military scientists. Shinn-Cunningham says that the near-term interest is to better define the limitations of human operators, such as pilots or air traffic controllers, in processing complex “scenes” of sensory information.
“The long-term goal is to create displays for these operators that are efficient, to ensure that they can really extract the important information out of a complex display,” she says. Such research, she acknowledges, is also potentially relevant to the development of automatic listening systems “that can monitor multiple scenes, pick out sources and localize and identify them.”