On Tech & Vision Podcast

BenVision: Navigating with Music


On Tech and Vision Podcast with Dr. Cal Roberts

This podcast is about big ideas on how technology is making life better for people with vision loss.

When it comes to navigation technology for people who are blind or visually impaired, many apps utilize voice commands, loud tones or beeps, or haptic feedback. In an effort to create a more natural, seamless experience, the team at BenVision has created a different type of system that allows users to navigate using musical cues instead!

For this episode, Dr. Cal spoke with BenVision’s CEO and co-founder, Patrick Burton, along with its Technology Lead, Aaditya Vaze. They shared the inspiration behind BenVision, how they’re able to create immersive soundscapes that double as navigation aids, and the exciting future applications this technology could offer.

The episode also features BenVision’s other co-founder and Audio Director, Soobin Ha. Soobin described her creative process for designing BenVision’s soundscapes, how she harnesses the power of AI, and her bold vision of what’s to come.

Lighthouse Guild volunteer Shanell Matos tested BenVision herself and shares her thoughts on the experience. As you’ll hear, this technology is transformative!



Podcast Transcription

Roberts: In 1936, the Russian composer Sergei Prokofiev released a symphonic tale for children called Peter and the Wolf. The storyline is quite simple. A young boy named Peter captures an evil wolf with the help of a songbird, a duck, and a cat.

But the way it’s told is quite complex, because it’s all communicated with the help of music. In the story, each instrument in the orchestra represents a character. For example, Peter is played by the violin. The wolf is the French horn. The duck is an oboe. And so on for each character. 

You might think various instruments interacting as characters would sound like a jumbled mess, but the human brain is a remarkable thing. When it comes to music, we’re able to enjoy all the sounds of the instruments as a whole or focus on just one aspect of it.

Put it all together and it creates a beautiful, cohesive story. And there’s so much more to music than meets the ear. Musical tones can wield incredible power as an impetus for technological innovation, and we’re only at the overture of its vast potential.

I’m Doctor Cal Roberts and this is On Tech and Vision. Today’s big idea is technology that uses music to guide people with vision impairment through a space. It’s called BenVision, an app that transforms an area into a musical soundscape. Similar to the ways that different instruments in Peter and the Wolf represent various characters, BenVision uses musical cues to help people familiarize themselves with their environment, alert them to the location of objects in space and assist them in finding something or moving through a room.

At Lighthouse Guild, we’ve had the opportunity to test this exciting new navigation tool and I caught up with the BenVision team to tell me more about it.

When we work with people who are blind and visually impaired, what we’re trying to do in a big sense is use their other senses to compensate for their reduced vision. One of the concerns that we often have is, what is too much? What is sensory overload? Can you be talking to people too much?

We try to find opportunities to complement what they are hearing in terms of description with what they’re feeling or sensing. And so my guests from BenVision have a unique way of describing their environment to people who are visually impaired, and so let me welcome you both, Aaditya Vaze and Patrick Burton.

OK, who wants to get started? Tell me what you guys are doing.

Burton: Yeah. Thank you so much, first of all, Cal, for having us on your show. I’m a big fan of this podcast in particular and very excited to have a chance to tell our story.

I suppose I’ll start. So, BenVision originally was a hackathon project; we originated at the MIT Reality Hack of 2023, and that was cool because we were all strangers originally. We all came from different backgrounds and different parts of the world, but we were all there with the common goal to create impact using augmented reality.

And we realized that augmented reality is a little bit ableist. It’s a visual medium. So we decided that we were going to create an application for the visually impaired. So, we researched the problem and we learned of the story of Ben Underwood, who was a remarkable young blind man. Unfortunately, he’s no longer with us, but he was one of a handful of people who taught himself echolocation after cancer took his eyesight away.

So we were inspired by him and that’s why our company and our flagship app are both named after him. Our app is called Ben, which is short for the Binaural Experience Navigator.

But anyway, we thought that if the human brain can learn echolocation and we have this amazing technology that’s available to us, why can’t we make echolocation a little bit more intuitive and perhaps a little bit more pleasant? So, we have a woman named Soobin who’s now our co-founder. She’s an incredible musician and an incredible sound designer. She has over 10 years of experience composing film scores, and she does video game sound design.

And so I thought, why not try music? We created an AR application that anchored musical cues to objects that it detected, and it worked out really well. Originally, our prototype was made for AR glasses, but now it’s a mobile app. We showed this application to our mentor, whose name is Chris McNally. He’s a low vision individual himself and an assistive tech evangelist, and he told us we should try to put this on the market because he saw a lot of great potential to use it in navigation.

What I did after the competition is I started volunteering for a charity for the blind, and I saw firsthand a lot of the gaps that are left behind by traditional solutions. In one case I happened on a woman who had been standing outside in the cold with no jacket for over 10 minutes because she had left her dorm in the morning to grab a cup of coffee and then, even using her white cane, she wasn’t able to find the doorway to get back in. That’s an extreme example, but I realized then that there are ways for us to use technology to help.

So, we became a company in order to do that.

Roberts: Explain the software to us.

Burton: Sure. I’ll hand that over to Aaditya who is our technology lead.

Vaze: Basically, Ben, what it does is it recognizes the space around a user and transforms it into a spatial soundscape. So, you can think of it like a bunch of virtual speakers placed at different locations around the user, so like a speaker on a door or a couch or a chair.

And then there are sounds coming from all these virtual speakers at the same time. So, it’s a different way, rather than a speech-based solution, to describe the space near a user, and it is interactive in the sense that if you start walking towards an object, the sound keeps getting louder and louder. And if you turn your head, or if you face in a different direction, it’s going to spatialize based on the location of the object.
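To make the interactive behavior Aaditya describes a little more concrete, here is a minimal sketch in Python of how an object’s distance and direction relative to the listener could drive a cue’s loudness and left-right placement. This is an illustration only, not BenVision’s actual code; the 2-D coordinates, the 10-meter falloff, and the simple sine-based panning are all assumptions.

import math

def cue_parameters(obj_xy, listener_xy, listener_heading_rad, max_distance=10.0):
    # Returns (gain, pan) for one "virtual speaker".
    # gain: 0.0 (silent at max_distance) .. 1.0 (right at the object)
    # pan: -1.0 (hard toward one ear) .. 1.0 (hard toward the other)
    dx = obj_xy[0] - listener_xy[0]
    dy = obj_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    gain = max(0.0, 1.0 - distance / max_distance)        # louder as you walk toward it
    bearing = math.atan2(dy, dx) - listener_heading_rad   # angle relative to where you face
    pan = math.sin(bearing)                               # shifts as you turn your head
    return gain, pan

# Example: a door about 3 meters away, at a slight angle from straight ahead.
print(cue_parameters((3.0, 1.0), (0.0, 0.0), 0.0))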

In the current state there are three modes. The first mode, which we call Object Mode, works anywhere without an Internet connection and basically uses the phone’s camera feed to recognize objects. We have started with important objects like doors and chairs and convert them into spatial sound cues, so you’re easily able to find a door or a chair or a plant, and we keep expanding those object categories so you have an enhanced awareness of wherever you are.

The other two modes, which are called Waypoint Mode and Region Mode, only work in spaces that we have partnered with, which we call Benabled spaces. Essentially, what we do is divide large indoor spaces into smaller regions so that the user is able to focus on a specific region. For example, if they want to find the reception area, they can hear a sound coming from it and try to follow it, using their guide dog or white cane or whatever navigation assistance works best with this technology, and find their way to that space.

The other way they could use this mode is to hear all the sounds at the same time and have enhanced awareness of the complex environment they are situated in.

Currently, the way the user interacts with the app is through VoiceOver functionality. For example, when they enter a new building, a list is automatically populated with areas like the reception area, lounge area, and restroom area, so that they are able to swipe through, get an understanding of the available regions in the space, and then navigate to a specific space by isolating its sound.
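For readers curious how the region list in a Benabled space might be represented under the hood, here is a small illustrative sketch in Python. The Region class, the example region names, and the isolate() helper are hypothetical stand-ins, not BenVision’s actual data model.

from dataclasses import dataclass

@dataclass
class Region:
    name: str      # label the screen reader announces
    cue_id: str    # identifier of the musical cue assigned to this region

regions = [
    Region("Reception area", "cue_reception"),
    Region("Lounge area", "cue_lounge"),
    Region("Restroom area", "cue_restroom"),
]

def isolate(regions, chosen_name):
    # Keep the chosen region's cue audible and mute the rest.
    return [(r.name, 1.0 if r.name == chosen_name else 0.0) for r in regions]

for name, volume in isolate(regions, "Reception area"):
    print(name, "->", "playing" if volume else "muted")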

Roberts: I remember as a child listening to the music of Peter and the Wolf. If you remember how that worked, they started out by telling you that this is the story of Peter and the Wolf, and in this way, each of the animals is a different instrument, and you can hear the story of Peter and the Wolf by knowing which is the flute and which is the bassoon and which animal is what. Once you understood the premise of Peter and the Wolf, you actually could see the play through just the audio.

Burton: Yes, and that’s precisely why we use music. We did take a look at some of the other assistive tech applications that are on the market, and we realized a lot of them are based on speech, and spoken directions are great in some scenarios. What makes Ben unique is that it leverages the universal language of music, and it quickly gives the user a comprehensive spatial understanding of their surroundings. So, if you have kids, I don’t, but I know people who do. If you have kids, then you know that it’s difficult to take in more than one conversation at a time. But when you’re sitting in an orchestra or when you’re listening to Peter and the Wolf, you can very easily point out, OK, here’s the percussion. Here’s the synth. Here’s the winds, the strings, and you can understand it all comprehensively. You know what direction they’re all coming from, and your brain not only understands everything at the same time, but it’s a delightful experience, given that everybody’s playing their instruments the way that they should.

Roberts: So is this a language that you have come up with in terms of “doors sound like this” and “walls sound like this” and “people sound like this?”

Burton: Yeah, yeah, precisely. And so, as Henry Wadsworth Longfellow once said, music is the universal language, and we have an amazing audio engineer, or audio designer, rather, who’s our co-founder. She works very hard and spends a lot of time thinking of what is the most intuitive sound that we could assign to this object or this waypoint or this region.

Roberts: BenVision’s co-founder and the team’s audio designer, Soobin Ha, spoke with us about her process.

Ha: First and foremost, we recognize the potential of music to prevent cognitive overload. Music, much like an orchestra, is an excellent medium for conveying complex information simultaneously.

When we listen to an orchestra, it sounds like a unified piece, yet we can still focus on individual sections such as the woodwinds, brass, percussion, or strings if we choose.

Our goal was to design assistive technology that not only conveys essential information, but also remains emotionally engaging. 

Designing something intuitive for users was challenging. In the early stages I focused on keeping things simple. We began by brainstorming ways to extract information from sound that feel natural. One of the first things we explored was pitch. For example, when we hear a dinosaur in a movie, we intuitively expect it to have a low-pitched roar rather than a high-pitched voice.

And this led us to consider using pitch to represent the size of objects, and the concept worked well. Next, we considered distance, which is naturally indicated by changes in volume. We mapped volume to reflect distance: as an object moves further away, the volume decreases. We also incorporated a technique from video game sound design, specifically layering sound.

In games I often layer music pads or musical synths with foley effects to represent the characteristics of things. For instance, overlaying sounds like water splashes or wood clicks onto the objects can give the user a more intuitive sense of what’s in front of them.
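As a rough illustration of the mappings Soobin describes, the sketch below pairs object size with pitch, distance with volume, and a foley layer with object type. The constants and the tiny foley lookup table are made up for the example; they are not the values BenVision actually uses.

def pitch_for_size(size_m, base_hz=440.0):
    # Bigger object -> lower pitch (a dinosaur roars, it doesn't squeak).
    return base_hz / max(size_m, 0.25)

def volume_for_distance(distance_m, max_distance=10.0):
    # Farther away -> quieter, fading to silence at max_distance.
    return max(0.0, 1.0 - distance_m / max_distance)

def layered_cue(object_kind, size_m, distance_m):
    # Layer a short foley sound on top of the musical tone for character.
    foley = {"sink": "water_splash", "door": "wood_click"}.get(object_kind, "")
    return {
        "pitch_hz": round(pitch_for_size(size_m), 1),
        "volume": round(volume_for_distance(distance_m), 2),
        "foley_layer": foley,
    }

# Example: a sink roughly 0.8 m wide, 2.5 m away from the listener.
print(layered_cue("sink", size_m=0.8, distance_m=2.5))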

Through testing, we found that presenting more than five distinct sounds simultaneously can be overwhelming. Currently, we are focusing on common indoor objects like couches, chairs, plants, and doors, and determining which are most critical for users to navigate comfortably.

Our goal is to refine this object list based on user feedback. The audio engine we use is Wwise, a game audio industry standard. I design each sound and then map it to parameters connected to the object detection and camera feeds for real-time control. This dynamic implementation allows us to create smooth transitions, like moving from a living room to a kitchen, and even add stingers when reaching specific destinations, much like in games.
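To show the general shape of the real-time control Soobin mentions, here is a hedged sketch of the game-audio pattern: detection data drives per-cue parameters each update, and a one-shot stinger fires when the user reaches a destination. The AudioEngine stub and every name in it are placeholders; this is not the actual Wwise API or BenVision’s integration.

class AudioEngine:
    # Stand-in for a game audio engine that accepts parameter updates and events.
    def set_parameter(self, name, value, target):
        print(f"[param] {target}: {name}={value:.2f}")

    def post_event(self, event):
        print(f"[event] {event}")

def update_audio(engine, detections, arrived):
    # detections: (cue_id, distance_in_meters, is_destination) per detected object.
    for cue_id, distance, is_destination in detections:
        # Real-time parameter control: distance drives each cue's volume and filtering.
        engine.set_parameter("distance", distance, target=cue_id)
        # Short musical flourish the first time the user reaches a destination.
        if is_destination and distance < 1.0 and cue_id not in arrived:
            engine.post_event(f"stinger_arrival_{cue_id}")
            arrived.add(cue_id)

engine, arrived = AudioEngine(), set()
update_audio(engine, [("cue_reception", 0.8, True), ("cue_door", 4.2, False)], arrived)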

And we are also exploring how to gamify this experience so that users find it not only useful, but also enjoyable.

Roberts: As I mentioned before, the team at Lighthouse Guild had a chance to try BenVision ourselves.

Volunteer Shanell Matos describes her experience.

Matos: So, my vision loss is pretty extensive. I only have light perception, so I can see some colors if they are large enough and they are in the form of light. For example, if a light bulb or something like that is blue, I can usually tell that it’s blue. However, I don’t have color distinction in a very broad range. If it’s light blue or dark blue, or even a green that’s too close to blue, I would never know the difference. It’s all the same color to me.

I call it color families rather than color clarity. BenVision is an app that is designed to use music to navigate a space that you’re in. So essentially what it’s doing is creating a unique musical sound that is going to represent every type of obstacle you could encounter. So if you are navigating, for example, a kitchen, there may be a sound like a water drop that will let you know that you have encountered a sink, or a hum; a very loud hum compared to a lower, softer hum lets you know the difference between coming across a microwave versus a refrigerator.

So what’s really cool with this is that the sounds are very harmonious. The nice part with this is that it’s such a harmonious sound, because it’s designed by someone with an actual music background, actual music theory understanding, and classical training, that the music feels like it’s part of the background you’re in. It becomes almost natural to experience.

And the nice part of it is that every sound they’ve created is something that is instinctual for most people to understand. Like I said, for example, the water drop sound. It’s probably realistically more like a wood block instrument type of thing, but the realism of that water drop makes you feel like it sounds like a water drop, and then when you think water, you immediately think, well, where would the water be in the room?

So, you can do this almost without thinking. And that’s really cool, because if you are navigating a space and there’s a lot of people talking, for example, maybe you’re at a party, or maybe you’re in a new space that you’ve never been in, you don’t have to wonder what those sounds mean. A water drip is a water drip anywhere in the world.

Roberts: Shanell made a great point about using BenVision in a crowded space. I asked Aaditya to elaborate on how their team considered that aspect of the experience.

Vaze: So, what we’re trying to do is the equivalent of augmented reality for audio perception. Things that already make loud sounds, like people talking or things moving, are something that our users already perceive, and we don’t need to augment that and make it distracting.

So, we are trying to see which objects usually don’t make sounds and would help to provide enhanced awareness through audio perception. That’s the goal and aim: to filter out objects which already make sounds. But for crowded environments, we suggest people use bone conduction headphones so that they are not giving up the perception of the spatial soundscape which they already have from physical sounds, and look at this like an augmented reality solution.

Roberts: So, some of the navigation technologies that we see work better indoors, some work better outdoors.

Vaze: So, indoors is a little more challenging than outdoors because of the amount of GPS information and spatial map data we have for outdoor environments; it’s easier outdoors. Indoors brings different sorts of navigation challenges, like different floors or elevators or staircases, which make it more complex.

But for outdoors and indoors, basically our Object Mode works everywhere, and our other modes, Waypoint Mode and Region Mode, where users are able to navigate to different areas and find interesting waypoints, would work in different places. So we are also partnering with parks and gardens, and at the same time we’re partnering with hotels and museums. So, we believe that it works very well in both scenarios.

Roberts: Soobin spoke about the different challenges between indoor and outdoor navigation as well.

Ha: Our concept initially considered both indoor and outdoor environments, and we realized there is a gap in indoor-focused navigation tools, which led us to focus BenVision on indoor use. Indoor settings also provide a quieter environment where we can truly showcase the technology’s potential in BenVision.

However, we are always looking for ways to expand to outdoor settings such as bus stations, amusement parks, or large events like the World Expo. We’re actively collaborating with haptics and AR glasses partners to enhance this experience.

It would be incredibly useful for users to identify doors or key waypoints without being overwhelmed by excessive information, while engaging in conversations or other activities.

For example, they could quickly locate a door when needed while talking to friends, by hearing augmented ambient sound cues from the door. This aligns with our decision to use bone conduction headphones, which allow users to keep their ears free. We are not aiming to replace guide dogs or white canes, but to address what those tools might lack. Our approach allows users to detect multiple objects without overloading their senses. They can continue interacting with others while still being aware of where the door is using subtle sound cues. If they choose, they can focus on a specific waypoint and navigate accordingly, offering an interactive experience that adapts to their needs.

Roberts: Shanell really appreciated the ability to have BenVision augment her perceptions rather than trying to replace them.

Matos: I preferred the audio of BenVision because it was unobtrusive. The talking and the beeps can be very distracting for me. I have a hard time with what I call multi-listening in very large spaces. So if there’s a lot of people, there’s a lot of conversations happening, there’s a lot of sounds happening, a lot of movement happening, it’s really difficult to keep up with what everything is doing.

Whereas with music, it’s not as difficult to pick out a layer. If you’re listening to a song you can usually listen to the words. Listen to the trumpet. Listen to the drums. Listen to the guitar. Listen to the bass. Listen to the chimes, whatever, and pick out little pieces all at the same time and it doesn’t tax your brain or your stress levels to do that. Whereas if you’ve ever tried to listen to multiple conversations and try to keep up with all of them at the same time, you usually get a little bit frustrated and you drop the ball.

You don’t have to deal with that with the music. Now with the beeping, that’s definitely a subjective thing, because some people find beeping obnoxious. Desensitizing, if you would call it that, because it’s a little bit like, after a while your hand starts to feel like it’s going numb and you’re like, am I feeling the right thing anymore? I’ve been listening to this haptic feedback and walking for an hour at this point, and I don’t know anymore. I just don’t know.

So for me, I prefer the BenVision.

Roberts: But navigation is just one aspect this technology can be used for. The BenVision team is thinking about how their technology could be used to enhance everyday life, not just for people who are blind or low vision, but for everyone.

Burton: Designing for the disabled actually drives innovation for everybody. Closed captioning is a great example of that. 80%, I think, of Netflix users now watch their content with closed captioning turned on, even though far less than 80% of them are hard of hearing.

And likewise, even LLMs like ChatGPT originated from speech-based assistance for people with communication disabilities. So I think that Ben is not unique in that regard, in that we one day will reach a point where even sighted people, even fully abled people, find a lot of value in using Ben as an experiential navigator.

Last week we participated in something called the Augmented World Expo. It’s a big AR conference, and we had the chance to demo Ben to a lot of people who had full vision.

And what we discovered is that a lot of them came back to us with the feedback that, yeah, I have no visual impairment at all, but I would definitely use this; it would help calm my nerves. I get social anxiety, and being able to navigate using music would give me an entirely new experience that would encourage me to explore my surroundings to their fullest. It really adds a new dimension of discovery for people with full vision, and for people who may be on the autism spectrum who relate to music more than they might to the visual world. We see a widespread application.

Roberts: How do you envision Ben long term? Where do you see this going?

Burton: So long term, and maybe this is just my background as a cinematographer coming out, but if you think about it, light and sound are both just waves that we capture and interpret using different senses. I personally don’t see any reason why people in the future wouldn’t literally be able to translate light into sound and process it with the same acuity. Long term, I see a future where, say, five years from now, I anticipate that there’s going to be some sort of mass adoption of AR glasses. I think eventually we’re going to figure it out.

And so I see everybody, not just the visually impaired, but everyone, having a pair of fashionable glasses that are running the Ben application and using them to glean all sorts of nuanced information from their surroundings in real time through the distractionless medium of music, and not only understanding, but taking delight in their understanding of a new dimension, so to speak. Not just that there’s a painting over there, but what does the painting depict? Not just that there’s a car, but what is the color of the car? What’s the make and model? How fast is the car going?

What do you think, Aaditya?

Vaze: Yeah, there are so many visual properties when it comes to vision, and being able to perceive them so quickly. And you’re definitely right: there is some version in which you would be able to translate those visual properties into sounds. But I also believe that our vision is to find the best user experience for all the latest tech that’s happening. If you look at a few examples of what the latest LLMs and AIs have been showing, like continuous speech-based help, it’s not the best UX. You cannot be completely aware of your environment continuously if someone is talking to you, and it’s so hard to describe the locations of 10 objects around you at the same time, in real time, through speech.

It depends on choosing the right hardware: whether smart glasses are the right hardware, or whether we need to partner with bone conduction headphones that have cameras on them. It’s also about looking into different modes of output, like sensory substitutions and augmentations with haptics, sounds, and speech, and finding the best combination of what works together to translate the visual properties and best convey them to people, making them aware of their environments.

That’s our goal, and we are working really hard to get there.

Roberts: Has AI impacted what you do and how can you see AI affecting the future of Ben?

Vaze: So since the very beginning, the application has been completely based on the latest AI, and we are trying to stay on top of cutting-edge research so that we are able to provide the best technology in the best form of experience for our users. So we are using AI for object detection. We are using AI for understanding the scene around the user.

We also have our companion app, which is called Speakaboo, which also uses multimodal LLMs to help provide instant support to users about their immediate environment. So we are definitely closely tied to the latest AI research and keep updating our apps as things improve with time. AI has definitely informed a lot of what we are building here, but we also feel like there’s so much more that we can bring to that field in terms of the experience side of things.

With these tools becoming better and better over time, there are different ways in which everybody would be using them.
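BenVision has not said which detection models it uses, so purely as an illustration of the “camera feed to detected objects” step Aaditya describes, here is a short sketch using an off-the-shelf Ultralytics YOLO model as a stand-in. The image file name and the set of object classes of interest are assumptions; doors, for example, are not in the standard COCO classes and would need a custom-trained model.

from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n.pt")    # small pretrained COCO detector, a stand-in
results = model("room.jpg")   # hypothetical single camera frame

INTERESTING = {"chair", "couch", "potted plant"}  # doors would need a custom class
for box in results[0].boxes:
    label = model.names[int(box.cls)]
    if label in INTERESTING:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        # The horizontal center of the box is what a spatializer would turn into a direction.
        print(f"{label}: center x = {(x1 + x2) / 2:.0f}px")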

Roberts: As Aaditya mentioned, AI will play a huge role in the future of technology like BenVision, and Soobin has big ideas on how it could be used.

Ha: The role of AI in this project is something I’m constantly exploring. Building a vision synthesizer that adapts with AI could open up many, many possibilities. Using AI, we could automate much of the sound design, even allowing users to create interesting variations. It would be amazing if users could assign specific sounds to objects like their house key. Imagine being able to set your favorite song or ambient sound as the cue for your key, with the sound dynamically altering based on your behavior or an action. With so many advancements happening in generative AI right now, why not apply them here?

Roberts: With technology like BenVision, the future is full of promise. When we dare to think big, we unlock a world of so many possibilities. The BenVision team is dedicated to making life easier for people who are blind or have low vision.

Like the instruments of an orchestra coming together to create a symphony, artists, engineers and AI unite. This harmony shows that when we join forces for the common good, we can orchestrate something greater than we ever imagined.

Did this episode spark ideas for you? Let us know at podcasts@lighthouseguild.org and if you liked this episode, please subscribe, rate and review us on Apple Podcasts or wherever you get your podcasts.

I’m Doctor Cal Roberts. On Tech and Vision is produced by Lighthouse Guild. For more information, visit www.lighthouseguild.org. On Tech and Vision with Doctor Cal Roberts is produced at Lighthouse Guild by my colleagues Jane Schmidt and Anne Marie O’Hearn. My thanks to Podfly for their production support.

Join our Mission

Lighthouse Guild is dedicated to providing exceptional services that inspire people who are visually impaired to attain their goals.