Programmers trying to teach AI bots how to play video games may be missing one vital component in their models: sound.

Sound is an important feature in video games. Music helps create certain moods: a string of dissonant notes, for example, makes the sort of spooky soundtrack that suits a zombie game. The sound effects of growls or the rhythmic shuffling of feet are also a good indicator that an army of the undead is nearby, alerting the player to watch out.

But in most machine learning research where agents are taught to play a specific game, whether it's Super Mario or Dota 2, sound is left out. Developers focus on encoding pixels so that bots receive information only visually, meaning the agents effectively live in a mute world.

A group of researchers from the University of Eastern Finland (UEF), however, have found that computer agents can perform better at specific tasks if they’re given audio inputs as well as visual stimuli. They conducted a series of crude experiments to test the idea on bots playing Doom, the popular 1990s first-person shooter.

Doom is an easy environment to set up for reinforcement learning (RL), an area of machine learning that coaxes bots into performing certain tasks by incentivising them with rewards, since many researchers have already built simulations on top of the ViZDoom application. The team found that agents, implemented as neural networks, are much better at reaching their targets if sound is included.

In the experiments, the bots are given the task of walking to a target region of the map. It sounds simple enough, but agents need a lot of help ambling in the right direction when there are many rooms. They're awarded a point if they reach the goal and docked a point if they keep wandering around aimlessly.
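The sparse reward scheme described above can be sketched in a few lines. This is a hypothetical illustration; the function name and the exact penalty values are assumptions, not taken from the paper's code:

```python
def step_reward(reached_target: bool) -> int:
    """Hypothetical sketch of the paper's reward scheme:
    +1 for entering the target region, -1 for wandering."""
    return 1 if reached_target else -1

def episode_return(steps_taken: int, reached: bool) -> int:
    """Total reward over an episode: every wandering step
    costs a point, and reaching the goal earns one back."""
    return -steps_taken + (2 if reached else 0)
```

With a scheme like this, an agent that reaches the goal in fewer steps accumulates a higher total reward, which is what pushes it toward efficient routes.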

“Learning using only visual information may not always be easy for the learning agent. For example, it is difficult for the agent to reach the target using only visual information in scenarios where there are many rooms and there is no direct line of sight between the agent and the target,” the researchers explained in a paper emitted via arXiv this week.

So, the boffins decided to help the bots out by adding sounds. The agents start out by spawning facing a random direction in one of five different rooms, and have to seek out a red pillar somewhere in the game's map. As they move around, they're fed noise samples. The pitch of the sound changes depending on whether they're closer to or further from the target.
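A minimal sketch of such a distance-dependent cue might look like the following. Everything here is an assumption for illustration: the frequency range, the distance scale, and the sine-tone synthesis are not from the paper, which does not publish these details:

```python
import math

# Illustrative constants (assumptions, not from the paper):
F_NEAR = 880.0    # Hz heard when the agent is at the target
F_FAR = 220.0     # Hz heard at maximum distance
MAX_DIST = 100.0  # distance beyond which the pitch stops falling

def cue_frequency(distance: float) -> float:
    """Map distance to pitch: the closer the agent, the higher the tone."""
    closeness = max(0.0, 1.0 - min(distance, MAX_DIST) / MAX_DIST)
    return F_FAR + (F_NEAR - F_FAR) * closeness

def tone_samples(distance: float, n: int = 512, rate: int = 16000) -> list:
    """One frame of sine-wave samples the agent would 'hear' at this step."""
    f = cue_frequency(distance)
    return [math.sin(2 * math.pi * f * t / rate) for t in range(n)]
```

Each frame of samples would then be fed to the agent's network alongside the screen pixels, giving it an audio channel whose pitch rises as it closes in on the pillar.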

What’s that racket all about?

“In our experiments, the object [the] agent is looking for is making a sound that changes as [the agent] gets closer or moves away from the object. It is kind of [like playing with] ‘colder’, ‘hotter’ signals,” Ville Hautamäki, the principal investigator of the project and a PhD student at UEF, told Dr. Rami Shaheen on Wednesday.

Humans know that “colder” implies getting further away from a target and “hotter” means getting closer to it. But in this case, the computers were not given this information and had to learn it over time. “We did not explicitly tell the agent what the sound means. So it had to figure it out on its own,” Hautamäki said.

The team found that when they provided visual cues only, the agents were successful at finding the red pillars about 43 per cent of the time. When sound was included, however, the success rate rose to 86 per cent. The bots were also more likely to reach the target location in fewer steps.

“In this work, we were mainly interested to show the usefulness of audio for a simple RL task. As this was an initial [experiment] to adding aural to visual inputs, we tried to keep the task as simple as possible,” Hautamäki explained.

The same idea should extend to more complex audio cues in more difficult games, the team reckoned. “Games with distinct audio clues should be easier to learn like first-person shooter with footsteps and gunshots,” he added.

“In Dota 2 and Starcraft, the audio-track includes highlights of events like being attacked, losing crucial structures and [information about] resources. These are especially useful when the screen image does not provide the same information. For example, when an enemy attacks your base and the game narrator says ‘Your base is under attack!’ which would be the case in real-time strategy games like Starcraft 2.”

Understanding what these sounds mean could help agents glean more information about their environments so they can navigate them better. It could even prove useful in more practical scenarios, such as self-driving cars, Hautamäki believed. Aggressive honking could help a car understand it's stuck in traffic, and the sound of sirens could prompt it to move out of the way of approaching police vehicles.

“In this work, we [were] mainly interested to show the usefulness of audio for RL task in video games. But, in the future, we would assess the impact of audio features in other video games and high-fidelity audio simulations,” Hautamäki concluded. ®
