By Byron Spice

In this ghost story, there is no haunted mansion. No candles mysteriously snuffed out. No strange things going bump in the night or glimpses of fleeting figures.
But there are noises. Papers rustle. Fingers peck at a keyboard. Water splashes into a glass. So who — or what — is making these sounds?
It isn't the work of ghosts, but of Carnegie Mellon University researchers exploring how people react to artificial intelligence agents that sound like they are physically in the same room with the humans using them.
"The question becomes, 'If I had an AI assistant, what would happen if I made the audio component more like an actual human?'" said David Lindlbauer, an assistant professor in the Human-Computer Interaction Institute (HCII). In the end, the answer would surprise even the researchers.
A team in CMU's School of Computer Science worked with experts in the Department of Psychology and other universities to develop an interface between humans and chatbots that relies only on audio cues. They aimed to more fully engage the user by making the chatbot seem as if it were physically present.
Humans rely heavily on vision to communicate and interact with the environment, so a considerable amount of research has understandably focused on interfaces people can see, such as avatars or robots, said HCII Ph.D. student Yi Fei Cheng. But the necessary equipment might not always be available or suitable, so an audio-only interface could be essential in some situations, such as when using smart glasses that include microphones and cameras but no displays.
The research team used spatialization and Foley effects to create an audio-only interface. Spatialization helps place an AI agent in the room as it speaks, moves around, completes tasks or makes other noises. Foley effects are the sound effects typically added to movies and television shows in post-production. For this research, the Foley effects included typing on a laptop, riffling through papers and pouring a glass of water.
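The spatialization half of that recipe can be approximated with off-the-shelf browser audio tools. As a loose illustration only, and not the researchers' actual system, the TypeScript sketch below uses the Web Audio API's HRTF panner to play a Foley clip from a fixed point in the room relative to the listener (the clip path and function name are invented for the example).

```ts
// Minimal sketch: placing a sound at a 3D position with the Web Audio API.
// Illustrative only; the study's actual rendering pipeline is not described
// in this article.
const ctx = new AudioContext(); // in browsers, must be resumed after a user gesture

async function playSpatialized(url: string, x: number, y: number, z: number) {
  const response = await fetch(url);
  const buffer = await ctx.decodeAudioData(await response.arrayBuffer());

  const source = new AudioBufferSourceNode(ctx, { buffer });

  // HRTF panning models how a sound from a point in space reaches each ear,
  // which is what lets a listener localize the "agent" in the room.
  const panner = new PannerNode(ctx, {
    panningModel: "HRTF",
    positionX: x, // meters to the listener's right
    positionY: y, // meters above the listener
    positionZ: z, // Web Audio default: negative z is in front of the listener
  });

  source.connect(panner).connect(ctx.destination);
  source.start();
}

// e.g., a hypothetical "typing" Foley clip played from a desk two meters ahead:
playSpatialized("foley/typing.wav", 0, 0, -2);
```

Animating the panner's position parameters over time is what would create the sense of an agent walking around the room; Foley clips like typing or page flips are simply audio buffers routed through the same panner.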
"When a movie star sits down on a bar stool in his leather jacket, you expect a leathery rustle and a squeaky bar stool and the sounds of his hands hitting the bar," said Laurie Heller, a psychology professor who studies auditory perception and cognition. "These sounds happen in real life, and if they aren't part of the movie soundtrack, it doesn't seem realistic. It doesn't immerse you."
To test the audio-only interface, study participants spoke to AI agents that used different combinations of spatialized and Foley effects. The participants were first told to acquaint themselves with the room by locating a laptop, blocks, a whiteboard, books and other items paired with audio effects. Then they were seated in the center of the room and told to converse with the AI agent, which gave the impression of moving about the room, typing, flipping through a book or drinking water as it chatted. Afterward, they completed questionnaires and structured interviews to share their impressions of the experience.
"We found that, yes, the audio interface made the AI assistant seem more humanlike," Lindlbauer said. "We have statistically clear results demonstrating that adding spatial and Foley effects increases your engagement."
But the perception of the interface as humanlike had an unexpected side effect. The users also expected this seemingly human interface to follow human social norms.
"As soon as the participants felt like their agent was engaged in something else, such as if the agent was talking and typing at the same time, or rustling papers, the participant actually felt like 'This is not cool, my agent is not paying attention to me. My agent is distracted.' They considered this rude," Lindlbauer said. "To me, this seemed like a remarkably odd characterization of a computational system."
In the study, many of the Foley effects were automated and not tied directly to the conversation between the agent and the participant. Designing the audio cues to be more aware of conversations might reduce this sense of distraction, Cheng suggested.
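The article leaves the mechanism open, but one simple reading of Cheng's suggestion is a scheduler that holds automated Foley cues back whenever the agent is speaking. The TypeScript sketch below is a hypothetical illustration of that idea, not the study's implementation; the ConversationAwareFoley class and FoleyCue type are invented names.

```ts
// Hypothetical design sketch: defer automated Foley cues while the agent is
// speaking, so ambient sounds never overlap the conversation and read as
// "distraction".
type FoleyCue = { name: string; play: () => void };

class ConversationAwareFoley {
  private agentSpeaking = false;
  private deferred: FoleyCue[] = [];

  // Called by the dialogue system when agent speech starts or stops.
  setAgentSpeaking(speaking: boolean) {
    this.agentSpeaking = speaking;
    if (!speaking) {
      // Flush any cues that were held back during speech.
      for (const cue of this.deferred.splice(0)) cue.play();
    }
  }

  // Called by the ambient-sound scheduler whenever a cue is due.
  trigger(cue: FoleyCue) {
    if (this.agentSpeaking) this.deferred.push(cue);
    else cue.play();
  }
}
```

A richer version might weight cues by conversational relevance, but even a simple gate like this would keep the agent from seeming to rustle papers mid-sentence.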
Though the experiment included an interface that was specific to the office environment, Lindlbauer said a final system might not need to be so specialized.
"My gut feeling is that I could design a good number of audio effects that are independent of the space, which do not require a lot of knowledge of the space, and I could still get this boost in engagement," he said.
A participant might look toward the voice, or glance at the laptop when they heard typing, and see nothing happening there. But this mismatch between what their eyes and ears told them didn't seem to ruin the overall effect.
"Based on the data from this study, the sounds still had an effect on people consistent with another human being there," Heller said.
No one tried to convince users that the agent was real. But even the researchers themselves could get a little confused.
"Recording all this audio has given me a deep respect for the Foley artists out there. It was a meticulous process," said Jarod Bloch, an SCS junior majoring in artificial intelligence who recorded the Foley effects used in the study. "There's a whole technique for flipping pages in books that I had to master. You had to be intently flipping."
Playing the effects over and over desensitized him to them, or so he thought.
"There was a point where we adjusted the blocks effect to a level where I would react, 'Oh, somebody's playing with the blocks,' Bloch recalled, "and then I would be like … oh, wait."
"It is a little like having a ghost there," Heller said.
The researchers will present their findings at the upcoming Association for Computing Machinery Conference on Human Factors in Computing Systems (CHI 2026) in Barcelona. In addition to Lindlbauer, Heller, Cheng and Bloch, the authors include Alexander Wang, a Ph.D. student in the HCII; and colleagues at South Korea's KAIST; the University of Sydney, Australia; and the University of Michigan.