On Tech & Vision Podcast

Seeing with Your Ears: Translating the Visual World to Audio

On Tech & Vision with Dr. Cal Roberts Podcast

Today’s big idea is seeing with your ears: exploring technology that uses audio to connect people with vision loss to the world around them. Dr. Cal Roberts speaks with Dr. Yonatan Wexler about the OrCam MyEye device. They explore whether the brain can recreate a picture using audio alone, how well hearing can replace lost sight, and what the discovery process is for using audio description technology as a substitute for vision.

Episode 1: Seeing with Your Ears: Translating the Visual World to Audio

Here’s the one-one.  Pitch is grounded toward the middle, backhanded by Urena.  Throws to Burston…

Roberts

When you listen to a baseball game, do you see the game?

Karen

Yes, I see the game.

Roberts

This is Karen Gerald.  She’s a huge baseball fan from Norfolk, VA.

Karen

It’s as if I’m sitting behind home plate and I see Aaron running, running with his big tall self toward first base, and I see the ball going toward the outfield and if John says it is high, it is far, it is gone…

That ball is high.  That is far.  That is gone to win the game!

I’m picturing the ball just sailing away.

Roberts

Only one thing.  Karen’s been blind since early childhood.  She’s never seen a baseball game.

Karen

Far over the head of the outfielders who are trying to get it, and one of them jumps high but it goes over the wall before he can grab it.

A phenomenal eighth inning rally.

Roberts

This is Lighthouse Guild’s podcast On Tech & Vision with Dr. Cal Roberts, where I talk with people about big ideas about how technology can make life better for people with vision loss.

I’m Dr. Cal Roberts, and today’s big idea is seeing with your ears.  Specifically, we’ll explore technology that uses audio to connect people with vision loss to the world around them.

We know a person’s brain gets information from its five senses:  sight, touch, hearing, smell and taste.  And it uses that information to create an understanding of the world around us.  Just as we enter information into a computer from varying input devices: a keyboard, a mouse, a scanner or voice commands, so do each of our senses act as input devices to our brain.

The brain aggregates all this information and like a computer processes it to generate the thoughts and mental images the person demands.  Sighted people rely on their eyes to see, just as people who can hear depend on their ears.

But the beauty of the brain is that it can generate sight using information from the ears like Karen did when she listened to baseball.  Just as when our keyboards are not connected to our computers, input still comes from our mouse or scanner.

What we’re exploring today is: when a person has had sight and lost it, like Karen has, can the brain recreate a picture using sound alone?  How well can hearing replace lost sight?  And, for developers of technologies for people who are blind or people who have vision loss, what are the benefits and challenges of using sound as a substitute sense?

Karen

I picture the game as sort of a schematic.  I have an idea of how the ball field is shaped, where the players are standing and what they do.  Since I’ve never actually seen them, it’s not a fully fleshed out picture, but it’s a schematic that helps me know what everyone is doing at any particular moment.

Roberts

When it comes to helping people see with their ears, the famed radio announcer of the New York Yankees, John Sterling, is a master.  I asked John if he felt like his job was converting a three-dimensional image into words that people would then turn back into a picture in their minds.

Sterling

Well, that’s the idea certainly.  I hope that’s true.  One day a guy who wrote for the Atlanta Constitution, before a game, he said, I just want you to know I can see the plays developing under your play-by-play.  I got a big kick out of that, that he could see the play being run because of the way I described it.

Roberts

And so, how do you describe it that brings it to life?

Sterling

I hate to say this.  You’re asking questions that have complexity, but for me there’s no complexity.

Roberts

Asking John Sterling how he narrates a baseball game is like asking a duck how it swims.  But he summed it up nicely.

Sterling

I do describe things; I hope I describe them well.  But basically, play-by-play is “following the ball.”  Now in baseball, there is so much more time that you can talk about many different things in between pitches.  I always make the point and tell people for home runs where it lands just so they get a feel of an idea.  I love that fact.  I love to see where home runs land in different ball parks.

But basically, in play-by-play you’re just following the ball.  One of the great powers of radio, and I don’t mean just sports, but I mean the different Jack Benny shows, or Bob Hope shows, radio allowed you to use your imagination.  It allowed you to imagine Benny walking up the path to his house and ringing the doorbell.  So in some ways radio is even better.

Roberts

We know we can use our ears to watch a baseball game, but how about to read our mail?  To identify objects in a room?  To recognize people’s faces?

OrCam is an Israeli company that has been tackling this issue using a camera to talk to people who are visually impaired.  Dr. Yonatan Wexler, a longtime researcher in the fields of artificial intelligence and computer vision, is also a big fan of audio technology.

Wexler

One of my first memories: it was the year when record players were becoming an old technology and people were just throwing them away, and my brother used to collect them, and my hobby was to try and fix them.  It was great to look at how things worked on the inside.

Roberts

After Dr. Wexler investigated a lot of record players, he studied optical character recognition, OCR, teaching computers to decipher hand-written letters.  He then worked at an airport figuring out how to digitize carbon copies of tickets.

Wexler

And they said, okay, here are the red stubs, see that you can scan them properly.  It was a black and white scanner.  For us humans, once you learn how to read, you don’t think much of it.  If you look at a piece of text your brain reads it for you.  Once you try to have a computer do it, I found this is an interesting problem here.

Roberts

Fast forward.  Thanks to people like Dr. Wexler, machine vision technology has developed dramatically.  Now we have self-driving cars.  We have face recognition.  And we have tiny, wearable machines like the OrCam device which sees on behalf of people who are blind or have vision loss, and then uses audio to paint a picture for users of the world around them.

OrCam was founded by Professor Amnon Shashua who also founded Mobileye, an autonomous driving company.  So, before we dive too deeply into the use of audio as an input for blind people, I asked Dr. Wexler how the OrCam technology is similar or different from a self-driving car technology.

Wexler

It’s seemingly the same thing.  It’s like using artificial computer vision in order to solve daily tasks that people do, but the specifics are quite different.  We have the same hardware, but it’s specialized in a different way.  Mobileye is working on a camera that works with cars, so it has to have a very high response time for quite a few different things.  It has to recognize cars and roads.  It has to do it very reliably in a car.

Whereas, in OrCam we have to deal with thousands and thousands and thousands of objects and things.  Even if you just take letters, we’re dealing with letters in 35 countries and 20-some languages, tens of thousands of letters.  On top of that we recognize people and recognize gestures.  So, we have to deal with real time recognition of thousands of classes of objects.  So, in that sense it’s very different.  And we also have to work in a wearable device.  It’s a tiny thing that works on a battery. It has to be very comfortable, very reliable on you.

Unlike what’s going on in a car that has infinite power – the car battery is infinite.  So the basic technology is quite different.

Roberts

When trying to help people who are blind, is there a difference between people who lost vision versus people who were born blind?

Wexler

There is quite a bit of trauma when losing sight.  When we look at all our senses, the sight gives us the most amount of information to the brain, environmental information.  After that I’d say touch, and after that there’s hearing.  And once your brain stops getting this information it kind of shifts gears.

Now, with what I did with audio, I read a lot of research.  One interesting piece of research shows that if you are losing your hearing you’re more likely to suffer from dementia.  This is very interesting because what it means to me is that as soon as the brain realizes there is no interesting information it kind of slows down.

What I see with people who are losing their sight, they are thirsty for information for a while.  They really want to keep on doing.  I got an email from one of our customers.  She sent me a picture of her on the floor with the New York Times weekend, and she said, “this weekend I read again the whole paper back to front.”  She said, “I got my thirst of getting the information.”

That’s what I see is the biggest hurdle for people who are losing their sight.

Roberts

And does the device work differently for people with no vision versus people with low vision?

Wexler

We designed the technology to fit the whole spectrum, but there are tweaks for the user interface to accommodate.  I’ll give you one example.  We have the ability to teach the device the faces of your friends, family and co-workers.  People who cannot see at all, they want to know if there is someone in front of them.  Whether it recognizes them or not, it will say there’s a person in front of you.  And you don’t have to do anything.

People with remaining vision or low vision, they say, I know when there’s someone in front of me.  I only want to know if it’s someone that I know.  It was an interesting distinction because the need is different.  You want the device to give you the part that you cannot do.
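
The distinction Dr. Wexler describes here, announcing any face for users with no vision but only known faces for users with low vision, can be sketched in a few lines of Python.  This is a hypothetical illustration, not OrCam’s implementation: the names VisionProfile and face_announcement, the exact phrases, and the choice to speak a recognized name for both groups are all assumptions.

```python
from enum import Enum
from typing import Optional


class VisionProfile(Enum):
    """Hypothetical user setting; not taken from any real device."""
    NO_VISION = "no_vision"
    LOW_VISION = "low_vision"


def face_announcement(profile: VisionProfile,
                      recognized_name: Optional[str]) -> Optional[str]:
    """Return the phrase to speak for a detected face, or None to stay quiet."""
    if recognized_name is not None:
        # Assumption: both groups want to hear about people they know.
        return f"{recognized_name} is in front of you."
    if profile is VisionProfile.NO_VISION:
        # Users with no vision want to know someone is there, known or not.
        return "There is a person in front of you."
    # Users with low vision can already tell someone is there, so stay quiet.
    return None
```

With this sketch, an unknown face produces speech only for the NO_VISION profile, which matches the idea that the device should give you the part that you cannot do.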

Roberts

So now we know how the OrCam MyEye works.  It’s a small, wearable device.  You wear it on your glasses, and through machine vision it recognizes thousands of everyday objects including words and faces.  But how did the OrCam team decide that audio would be the feedback mechanism, and how did you figure out how to make it work for users?

Wexler

We decided early on we wanted the device to recognize everything that’s around you and convey it in spoken language.  We were speaking of blind people – they’re used to getting the information in speech.  It became apparent that we have to be very selective of the things that we say because the last thing that you need is someone chattering in your ear constantly.  And we started working on choosing the important issues, the important subjects and objects that you would like to know about so that you can walk around and the device is quiet and it’s ready for you and waiting for your cue.

We were thinking for quite a while, how do you do that?  Let’s say you’re low vision and I’m there next to you and I want to help you.  How do I know that you need the help?  How do I know when to start talking?  How do I know how to react to your will with minimal intervention and minimal work on your side?

One of the ideas we had is to use the natural gesture of pointing.  We said the device will constantly scan the environment and will constantly be ready for you, but it won’t tell you.  It will wait for you to point at something and then it’ll say, well, I can see that you’re pointing at a book, or a bank note, or a product.  And then I know that you want information about that thing that you’re holding.
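
The behavior described above, scanning every frame but speaking only when the user points, might look roughly like the following Python sketch.  It is an illustration under stated assumptions, not OrCam’s code: the Frame type, the respond and run functions, and the spoken phrase are hypothetical, and the recognition step itself is assumed to have already happened.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """One camera frame after (hypothetical) recognition has already run."""
    objects: List[str]          # everything recognized in the scene
    pointed_at: Optional[str]   # the object the user is pointing at, if any


def respond(frame: Frame) -> Optional[str]:
    """Stay silent unless the user gives the pointing cue."""
    if frame.pointed_at is None:
        return None
    return f"You are pointing at a {frame.pointed_at}."


def run(frames: List[Frame]) -> List[str]:
    """Scan every frame, but collect speech only for frames with a pointing cue."""
    spoken = []
    for frame in frames:
        message = respond(frame)
        if message is not None:
            spoken.append(message)
    return spoken
```

The design choice is that recognition and speech are decoupled: the scene is always being analyzed, and the gesture only decides when the result is spoken.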

Roberts

We all know that development is not in a straight line.  We go in one direction, it doesn’t work.  We go in another direction until we finally find the way to make this work.  You mentioned something about trying lasers.  What was it you were trying to do with the lasers?

Wexler

We prepared a space to build a laser white cane.  And as we were playing with it and we had various ideas how to use it, we got to the realization that you cannot replace a physical white cane, as far as I know.  Because the amount of information you get from your fingers is way more than you can get from spoken language.  You can feel goosebumps.  How do you really describe them in words?  And you feel the same thing through the white cane as well.  And we just gave up on that for the moment and decided to focus on the other aspects.

Roberts

This whole episode is about translating the visual world into audio.  I asked Dr. Wexler why he felt it was so challenging to translate input from the sense of touch into audio.

Wexler

There’s a deep, mathematical reason that I can try to explain.  The world has three dimensions.  There are many more dimensions in the texture and feel of things.  We’re talking about a lot of dimensions.  If you’re trying to map all these dimensions to one dimension, which is audio, you’re bound to fail.

We know, as humans, if you go see a sunset, and I write about it, it’s probably not going to sound so great.  But if you let Ernest Hemingway write about it you’ll read it and you’ll shed a tear.  You’ll say wow, this is wonderful, it’s probably better than the real thing.  It takes real genius to convey reality into words.  It’s not a trivial task.  And most of us humans cannot do it, and obviously a machine can’t do it yet.

Just like you can touch something and feel it, and it gives you way more information than anything you can describe in words.  The goal of conveying two-dimensional information like an image, or three dimensional or even more, higher dimensional signal into one dimensional speech is very, very hard.  In many cases it’s impossible.  Within one stroke with your hand you can feel something that will take you a long, long time to explain.

There’s even an age-old phrase that says a picture is worth a thousand words, right?  In a blink of an eye you can look at a picture and it’ll take you several minutes to talk about it.  That is the information overload.  If there were no white canes, that would be a worthy thing to replace.  But I think, technologically, we’re not there yet.  There’s not yet a way to transfer this minute and fine-grained detail that you get from touch into something else.  I haven’t seen it yet.  So I think it’s not there yet because the touch is amazing.  You can, with your fingers, you can feel so many things.  You can touch your arm with your hand and you know exactly if something is not right.  You can touch your child and say, I think you have a fever.  There’s no way that I know to convey that information accurately and concisely.

Roberts

And so there is a limit to the computing power of our brain.

Wexler

There’s definitely a limit, but there’s something that our brain cannot decipher.  I’ll give you another example.  There is a machine that converts pictures into sounds.  Most young people don’t know this machine, but you must know it.  We used to have it in all of the offices.  It was the fax machine.  And some people could hear the sound and have some guess of what’s going to be in that document, but you can definitely not make the fine details.  There’s some way that you have to make the information accessible to our brain.

If you lead with your sense of touch for 30 years, your brain can read the minute changes in the sense of touch.  If you now try to get the same information through the ears, I don’t think the brain can do it.  But even if it could, it doesn’t have all the years of experience of doing it.

There have been several attempts to convert pictures into music and something that is more palatable than fax noise.  But the way you encode the information, how you take two-dimensional information, which is a picture, and convert it to a one-dimensional signal, which is a soundtrack or music or words, is not trivial.

We realized early on that in order to make a device like that useful, we have to bring the information to a level that a human can perceive.  For most blind people and low vision people, the way they used to get that information is by someone else telling them.  My grandmother would tell me, “could you please read that for me.  I can’t make out the details.”  I would tell her.  She was a clever woman, and her brain would know how to decipher that information.  I could tell her, and she would say okay, I see, I understand.  But if I started beeping or playing music or interpretive dance – there’s so many ways to understand it – it’s not very informative.

So you have to have a way to bring the information into that new domain.  You have to convert the picture or the video or the 3D into something that a human can perceive.  And us humans, one of the things that makes humans humans is our amazing ability to speak and hear.  From early on, from the dawn of our time, we have managed to convey information through sound in some form of language.  And there are many languages, many ways to describe things.  And the other person that listens to that explanation can rebuild that mental image of what you are saying via words.

Back to back home runs by Gardner and Ford.  Back to back!  And a belly to belly!  And the Yankees come from four nothing down in the eighth to win on back to back…

The words or the language is a very powerful medium.  Any device that can look at the environment has to translate it into language.  Language that people can understand.  That was a very big step technologically.  It took us a while to do that, as well.

Roberts

So this technology that you have is spectacular.

Wexler

Thank you.

Roberts

When I speak to people with vision loss, certain things are so important to them.  Number one is that they want to be independent.  They want to be able to function on their own.  Second is they want mobility.  They want to be able to navigate their homes.  They want to be able to navigate their neighborhood.  And number three, they want to feel autonomy.  They want to feel powerful.

We know that data and information are critical to that feeling of autonomy or that feeling of power.  To me, OrCam has really concentrated on number three.  So explain to me why that’s important to OrCam.

Wexler

I just got a message from one of our customers.  Around 18 she started losing her sight.  She was just about to enroll in university.  She postponed it for a while.  She thought, maybe I’ll manage or not.  Then she got the device.  She wrote to me and said, I can now read things so quickly and it’s no longer a problem.  I was worried that I couldn’t catch up with all the studies, but all of a sudden I can do it.  All of a sudden my loss of vision is not the one thing that’s defining me because I can very easily go through all the material and do it.

The chances of her becoming a very successful professional person and building a career are much, much higher.  I hope she’ll have great success and do the things she really cares about and succeed in them.

All of a sudden that one thing that was about to stop her from fulfilling her potential is no longer a terminal hurdle.  It’s a hurdle.  We love that concept.  We love to allow people to keep doing what they do.  What makes us creative and human.

Roberts

People don’t want to be defined by their vision loss.

Wexler

It’s like they’re shell-shocked.  It’s very, very hard.

Roberts

What’s next for OrCam?

Wexler

One of the things people ask, they say “Can I have the device store some of the stuff that I read so I can hear it later?”  We said, sure.  There’s no problem.  Typically, the device doesn’t store anything.  This way there’s no issue of privacy and you know that the device reads to you whatever you want to hear.  But people say they would like to have some memories in the device.  We said sure, technically we can do that, but how would you find them?  Then we started developing a whole sub-system for the device so that it can hear natural language.  So you can speak to it.

All of a sudden we can have features that are not limited by the number of buttons that we have on the device.  The latest version that we’re just releasing this year has a feature that is actually super human.  So you can hold the newspaper, you can open it wide and say “read the headlines.”  And it will read the headlines.  And then you can say, “read the article about football” and it will go and read the article.  And it will do it faster than a person can read the newspaper.

Or I get the bills in the mail and I have to read all the fine print, but I don’t care about the fine print.  I want to know the core details of the bill.  With the latest version you can hold the bill and say, give me the total.  And, what’s the due date?  And if you’re not happy, read the phone numbers and you can just hear the phone number and dial the company to inquire about it.
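
The bill example can be illustrated with a toy Python sketch in which keyword matching stands in for natural language understanding and plain text stands in for the camera and text recognition steps.  None of this reflects how the MyEye actually works: the answer function, the command keywords, the regular expressions, and the sample bill text are all invented for illustration.

```python
import re
from typing import Optional

# Made-up sample text standing in for what a camera and OCR step might produce.
BILL_TEXT = """Example Utility Co.
Total due: $84.17
Due date: March 3
Questions? Call 555-0100
"""


def answer(command: str, text: str) -> Optional[str]:
    """Map a spoken-style command to one detail pulled out of the text."""
    command = command.lower()
    if "total" in command:
        # First dollar amount found in the text.
        match = re.search(r"\$\d+(?:\.\d{2})?", text)
        return match.group(0) if match else None
    if "due date" in command:
        # Whatever follows "Due date:" on the same line.
        match = re.search(r"due date:\s*(.+)", text, flags=re.IGNORECASE)
        return match.group(1).strip() if match else None
    if "phone" in command:
        # A simple NNN-NNNN pattern; real phone formats vary widely.
        match = re.search(r"\b\d{3}-\d{4}\b", text)
        return match.group(0) if match else None
    return None  # unknown command: say nothing
```

For example, answer("give me the total", BILL_TEXT) picks out the dollar amount and skips the rest of the fine print, which is the kind of shortcut the passage describes.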

All of a sudden we created an ability that is super human.  A person with MyEye can find information that they want faster than a sighted person.  That is very exciting.  It has opened up a whole new direction of speech.

Roberts

By direction of speech, Dr. Wexler is referring to the ability of the user to speak to the device, and the device’s ability to understand human speech.  He also added that OrCam’s audio technology will soon be put to use in a device for the hearing impaired.

Wexler

It can listen to audio and it can read lips and it can filter out only the person that you want to hear.  This is quite remarkable.  People with hearing aids they have a problem that whenever they’re in a cocktail party, all the sounds mix together and they cannot distinguish the person that they want to talk to.  Again, it’s very hard.  You go to a social event and all of a sudden you cannot really join the conversation.

With this technology that combines visual processing to read lips and audio processing, you hear only the person you want to hear, as if you were in a room, just the two of you by yourselves.

Roberts

And OrCam is coming out with a personal assistant device that could be used by anyone.

Wexler

With our sight, our brain is busy ignoring information way more than actually noticing it.  It happens with all our senses.  So people who have paid too much attention to the details often miss out on the big picture.  How can we very quickly analyze the scene, find things that will be useful for our customer, so that they’re ready for whenever the user needs them?

One of the things that most of us have a problem with, and that us humans have bad hardware for, is memory.  We all find it hard to memorize things.  If you ask the average person where they were last Wednesday, it will take them time to recall.  When we show people the MyEye face recognition feature, they say, I want the device just so I can remember faces.  People who schmooze a lot, or work in a big firm, or have a lot of employees or a lot of colleagues, they want something that reminds them of the names.

We started thinking that if we take this part and combine it with some other form factor that will help people remember names, this is a step forward for everybody.  Everybody would like that.  I believe in a year or so we’ll have another solution that is broader.

Roberts

And you’re working to try and make us all a little more superhuman.

Wexler

To give more time to what makes us human.

Karen

He’s been warming up now, he’s going to throw his first pitch.

Roberts

I asked Karen, what is it about John Sterling’s announcing that helps her see the game with her mind’s eye?

Karen

It’s the words.  It’s the pitch of the voice.  It’s the pacing.  It’s the tone of voice.  It’s the pauses.  It’s like listening to a great piece of music.

Gardner homered to tie.  Ford pinch hit for Fraser and homered to win.  Mike Ford’s tenth home run.

Roberts

This has been Lighthouse Guild’s On Tech & Vision with Dr. Cal Roberts.  In this episode we explored how people who are blind can still see with their ears.  We talked about how our senses interact with new technologies, and how developers have to be mindful not to overload our senses.  And we dove deep into the development of the OrCam MyEye, one technology that uses audio to help people with vision impairment not just to navigate, but to see their worlds.

We exist on the forefront of a tech revolution.  Every day technologists come up with tools that benefit all of us.  At Lighthouse Guild, we believe creative entrepreneurs and developers, maybe you, will build on these advancements and deliver solutions for people who are visually impaired.

Did this episode spark ideas for you?  Let us know at podcasts@lighthouseguild.org

I’m Dr. Cal Roberts.  On Tech & Vision is produced at Lighthouse Guild by my colleagues Cathleen Wirts, Jaine Schmidt and Annemarie O’Hearn.  My thanks to Podfly for their production support.  For more information, please visit www.lighthouseguild.org.