On Tech & Vision Podcast

Generative AI: The World in the Palm of Your Hand


When it comes to emerging technology, there’s no hotter topic than artificial intelligence. Programs like ChatGPT and Midjourney are becoming more popular and are inspiring people to explore the possibilities of what AI can achieve – including when it comes to accessible technology for people who are blind or visually impaired.

One of those people is Saqib Shaikh, an engineering manager at Microsoft. Saqib leads a team that developed an app called Seeing AI, which utilizes the latest generation of artificial intelligence, known as generative AI. Dr. Cal Roberts spoke with Saqib about how Generative AI works, his firsthand experience using an app like Seeing AI, and how it has improved his daily life.

This episode also features Alice Massa, an occupational therapist at Lighthouse Guild. Alice described the many benefits of generative AI and how it helps her clients better engage with their world.

Saqib and Alice both agreed that the current state of AI is only the beginning of its potential. They shared their visions of what it could achieve in the future – and it doesn’t seem that far off.



Podcast Transcription

Female AI Voice: The image captures a close-up view of a small rocky pool of water inhabited by various starfish. The pool is surrounded by rocks and the water is clear, allowing a detailed view of the starfish and the rocky bottom of the pool. The starfish are in a variety of colors and sizes, including some with a reddish orange hue and others that are beige with brown patterns.

The starfish are scattered throughout the pool, some partially submerged in water, while others are fully visible. Their arms are spread out, showcasing their iconic star shape. The rocky bottom of the pool is visible through the clear water.

The rocks are small and appear to be smooth, with colors ranging from dark brown to light beige. There are also larger, irregularly shaped rocks that are reddish brown. There is a yellow sign on a rock on the right side of the image with the text: “But please don’t touch.”

Roberts: What you just heard is the audio description of a starfish tank at an aquarium. Judging from the details, you might think it’s a prerecorded clip for an audio tour, but it’s not. This clip was generated on the spot by artificial intelligence, or AI.

I’m Doctor Cal Roberts and this is On Tech and Vision. Today’s big idea is technology that can literally put the world in the palm of your hand. I’m talking about the latest generation of artificial intelligence, called Generative AI. It is revolutionizing how people who are blind or visually impaired interact with their environment: not just simple object identification and directions, but enabling users to experience their surroundings fully and engage on a much deeper level.

To learn more about this new evolution, I spoke with Saqib Shaikh, an engineering manager at Microsoft. Saqib has developed an app called Seeing AI and has firsthand experience using AI in his daily life.

It’s a pleasure to welcome to On Tech and Vision Saqib Shaikh. Saqib was the Pisart Award winner at Lighthouse Guild in October of 2023 and just wowed us with his vision for the future of AI and how AI will impact the lives of people who are blind and visually impaired. And so it’s just a great honor to welcome Saqib to On Tech and Vision. Welcome.

Shaikh: Thank you so much. It’s an absolute honor to be here, thank you very much.

Roberts: So, our audience is getting familiar with this term, AI, artificial intelligence. People hear it all the time. Now we have a new term for them, Generative AI. Explain. What’s Generative AI?

Shaikh: In many ways, it’s just the latest wave of artificial intelligence. So, artificial intelligence has been around for many decades, and when I started with Seeing AI at Microsoft maybe seven years ago, there was a whole new wave back then of deep learning. It really improved what was possible. And we’re now at this new Generative AI wave. So, in a sense, the term doesn’t matter. You might have heard the term large language model in the press, and again this is a tech word that just means we have really big AI systems which are trained on really large computers and a huge amount of data, in a way that was never thought possible before. And the results, which are really what matters, are just remarkable. And this is the technology behind tools like ChatGPT and of course now Seeing AI as well.
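
To make that concrete, here is a minimal sketch of the “text in, text out” pattern behind chat tools like ChatGPT, using the publicly documented OpenAI Python client. The model name and prompt are illustrative assumptions only; nothing here reflects Seeing AI’s actual internals, which are not public.

```python
# Minimal "text in, text out" call to a large language model.
# Assumes: the openai package is installed and OPENAI_API_KEY is set;
# the model name is an illustrative choice, not what Seeing AI uses.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model would do for this sketch
    messages=[
        {"role": "user",
         "content": "Summarize in one sentence why large language models "
                    "can generate fluent descriptions."}
    ],
)
print(response.choices[0].message.content)
```

The same request-and-response shape underlies the summarization and copilot scenarios mentioned later in the conversation.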

Roberts: So compared to past generations of AI, what does the user experience?

Shaikh: We are able to get a lot more detail because the AI, as “generative” sounds, is able to generate the descriptions, or generate language in general, from having observed huge amounts of data on the Internet. And so it is able to give the impression, in the sense of a chatbot, of responding like a human, and in the case of describing images, to give way more detail than was ever possible before.

Roberts: So this is information, as you say, on the Internet. How about people’s personal experience? How does Generative AI help someone with the repetitive tasks that they have to perform to function on a daily basis?

Shaikh: We’re still at the beginning of this, but some of the ways are: Microsoft has these tools called copilots, because they sit alongside the person, the human, and, like a copilot, they can help you with everyday tasks. And again, it’s just the beginning. But I’m excited to see that they can summarize information or generate reports of interest to someone who’s blind. Maybe there’s a huge amount of data that can be analyzed, summarized, or trend-spotted.

And in the case of Seeing AI, we were talking about the real world with images. It can go from giving a one-sentence overview of what’s in an image to giving you a whole paragraph describing intricate details of what’s in the image, again in a way that we never thought, even a year or two ago, would be possible.

Roberts: So take us through your own personal journey of how you got interested in AI.

Shaikh: So, I’ve been interested in technology since I was a kid. I sort of stumbled onto learning to code when I was in my early teens and I just loved it. I loved this idea that you could have an idea, do some thinking, some typing, and you could make something out of nothing.

And from that came an appreciation of the transformative ability of technology to improve people’s lives. Growing up as a blind kid, I was surrounded by software to generate Braille or to print out the Braille and generate raised diagrams, and so much more. So assistive technology was what enabled me at school, and of course, beyond that as well.

And so that took me to studying computer science at university, and then I specialized in artificial intelligence in my postgraduate studies, and eventually came to Microsoft, where I’ve done a whole bunch of things. But it was about seven years ago that we had the opportunity to do a hackathon. The CEO said that there’s a one-week period where you can do whatever catches your interest, and for me I thought, I want to bring together my interest in artificial intelligence and some of the needs I identified individually as a blind person. That’s where it started.

And we spent a week prototyping some solutions for AI to help blind people, building on this vision of: what if there was a visual assistant with you, like when you have a sighted guide, who understood you and understood the world around you and told you what was going on, what was interesting to you, and could answer your questions. And we’re still quite a way from that vision, but we’re closer than we ever have been.

And so that brought me to the Seeing AI journey, which I’ve been on for, like I say, about seven years or so now, which feels like a remarkably long time, but there are so many exciting things going on, it doesn’t feel that long at all.

Roberts: So for those who don’t know the Seeing AI app, explain it. What does it do and how does someone use it?

Shaikh: Yes, it’s a free mobile app which you can download from the App Store or Play Store, and it’s a visual assistant. We sometimes talk about it as the talking camera: you just hold up your phone, point it, and it will start reading things to you. It can also identify your friends, describe what’s in a picture, or recognize products. And there’s a whole bunch of different tasks it can assist with in one’s daily life.

Most recently, some of the interesting things powered by Generative AI are, as I said, going from taking a photo, say, from your photo gallery, to reliving memories from your vacation, or even just what’s in front of you right now. You can go from it saying it’s a man sitting on a chair in a room to it actually giving you maybe a whole paragraph describing what’s in the room, what’s on the shelf, what’s in the background, what’s through the window, even. And it’s just remarkable.

I work on this every day. I understand technology, yet as an end user, I still am surprised and delighted by what this generation of AI is capable of telling me.

Roberts: Alice Massa is an occupational therapist. She provides therapy to people who are visually impaired, including how they incorporate vision technology into their lives. And she’s a user of vision technology herself.

Massa: My name is Alice Deluise-Massa. I’m an occupational therapist. I’ve been in practice 50 years. Most people are not always sure what occupational therapy is. My view of occupational therapy is that it’s an empowering profession that looks at people’s strengths, looks at the barriers that are preventing them from doing what they want to do, and offers strategies to meet their goals and to empower them to be able to do the things they choose to do.

So I am now working at the Lighthouse, and I see clients for occupational therapy. What we do is what we call an occupational profile: what people’s lives are like, what their strengths are, what their interests are, what they are struggling with, what they would like to do. And then we try to match them with strategies, techniques, and compensations so that they can do the things they want to do, and that’s where the electronics came into play.

The reason I love my iPhone, and specifically the Seeing AI app, is that it’s been developed really with visually impaired people in mind. Seeing AI is so intuitive for somebody who has vision loss. There are a lot of things that you do in a day that other people take for granted. You know, walking in the street, knowing when to cross the street, knowing which street you’re on, going into a store and buying regular milk instead of low-fat milk. Those are all the kind of day-to-day things that people really take for granted.

I was not able to shop alone before I had Seeing AI. Now, I go into TJ Maxx, I pick up a blouse. I can see what size it is. I can see what the price is. I can see what it’s made of, just by putting the tag in front of my phone. It’ll also read the menu at a restaurant.

Roberts: In fact, Seeing AI can do more than just read the menu at a restaurant. It can help you decide what to order. Here’s a recording from Saqib showing how it’s done.

Man: After scanning this restaurant menu in Seeing AI, what are the categories?

AI Voice: The categories in the document are antipasti, salads, pizza with subcategories red and white, specials, and dessert.

Man: What antipasto dishes are there?

AI Voice: There are three antipasti dishes mentioned in the document. One: oven roasted asparagus with paprika aioli, red onion, tomatoes, arugula and lemon zest. Two: polenta with marinara sauce, granorino, homemade meatballs and herbs. Three: oven roasted Castelvetrano olives marinated with chilies, garlic and lemon.

Man: How much does the asparagus antipasti cost?

AI Voice: The asparagus antipasti costs $12.00.

Roberts: Now we’ll hear Alice tell us more about the incredible things Generative AI can help us do.

Massa: I have a client at the Lighthouse who really was very uninterested in doing anything. The only thing he did on his phone was answer a call from his pastor and call his pastor. I was able to put Seeing AI on his phone, and his wife said that for the first time in two years she saw a smile on his face, because now he could read his Bible by himself.

He just literally had to put the phone in front of the Bible and it would read the page. So for him this was life changing. I call my phone my sister, because my phone is the person I go to when I’m on the street. If I’m walking in Manhattan – the other day I was walking, I was meeting someone on 47th St. and I wasn’t sure which block I was on. All I did was open Seeing AI’s Short Text channel, hold it up to the street sign, and it told me I was on West 46th St.

So those are the kinds of things where it really makes a difference.

Roberts: As we’ve pointed out in many of our podcast episodes, involving the users of technology in its development is vital. Saqib agrees.

Shaikh: I view my work in some ways as a conversation between the community and the scientists. And one of the most fun things about that is that, you know, I just really enjoy hearing from people. What are they using Seeing AI for? Of course, what could it do in the future? But often we’re surprised even by the ways people are using it today, because everyone has different tasks and different requirements in their lives. And we are making a set of tools to help as many people as possible.

A story that comes to mind: Seeing AI can now recognize currency bills, but it did not at the very beginning, and we got an e-mail from a group of users who had decided they were going to use the face recognition feature to train Seeing AI to recognize all the presidents and faces on the bank notes, like Lincoln or whoever, and they were using that to recognize currency. And I was like, that is so innovative. Our users are taking these tools and they themselves are inventing, by finding the ways that it’s going to enable them in their daily lives.

Roberts: And an incredible innovation like this is only scratching the surface of Generative AI’s capabilities.

The term multimodal capabilities comes up a lot in discussions about Generative AI. Can you speak more on that?

Shaikh: Yes. So a lot of generative AI began with the concept that you’d have something similar to a chat bot like ChatGPT, where you’re typing text and you’re reading text. Multimodal means: what if there were multiple modes of input, like audio or a picture, or potentially even video? So, of interest to someone who can’t see, there’s this idea that you can give Generative AI systems an image, and maybe a question about the image or some instructions about what you want to know about the image, and you’ll be able to get textual or audio output describing the image or answering your question. That’s the crux of the multimodal part.
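
In code, the pattern Saqib describes is just “image plus question in, text out.” Here is a minimal sketch using the publicly documented OpenAI Python client; the model name and file path are placeholder assumptions, and this illustrates only the general multimodal call shape, not how Seeing AI is implemented.

```python
# Minimal multimodal request: one image plus one question in, text out.
# Assumes: the openai package, OPENAI_API_KEY in the environment, and a
# local file "photo.jpg"; the model name is illustrative.
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local photo as a base64 data URL so it can travel in the request.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # a vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image in detail for a blind user."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Richer image descriptions of the kind Saqib mentions follow this same shape: the more capable the underlying model, the more detail comes back from a single photo.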

Roberts: And so multimodal can be not only the sources of input, but it could also be the means of output, so not everything has to be audio. Our listeners to On Tech and Vision have learned a lot over the years about haptics and about being able to feel data, not just always hear data.

Shaikh: Yeah, absolutely. And today, I don’t know of much work with sort of these multimodal models doing haptics, but why not? In a sense, we’ve seen systems doing music generation and other audio and you know haptics is not that different to audio.

And also generating images. I’m quite excited by this idea of: what if someone who cannot see could actually generate good-looking images? So I think that we are at the cusp of this new future where, in everyday life but also in productivity at work and school, we haven’t yet discovered all the possibilities of how this can transform the types of tasks that someone who cannot see could independently carry out.

Roberts: So we care a lot about assistive technology for people who are blind. But we also recognize that there are other forms of disability that could benefit from AI, particularly people who can’t hear.

Shaikh: Absolutely. And in recent years, we’ve seen the technology for that audience come a long, long way as well. Speech recognition used to be very hit or miss. But now speech recognition, dictation, or identifying who is speaking, these aspects of AI have really become quite accurate. So they can be usable in everyday life and work very fast, so that you can even use them in real time. So you’re absolutely right.

It’s not just one form of disability, and it doesn’t have to be limited to disabilities, though I have this philosophy that people with disabilities are often at the forefront of new technologies, because we have the most to gain and are the most invested in experimenting with this.

So, if you look at history, there are so many innovations which were initially created for people with disabilities, whether that be the telephone or the iPhone touchscreen, on-screen keyboards, text-to-speech and speech recognition, flatbed scanners and talking books, and the list just goes on.

It’s this idea that people with disabilities have the most to gain, and so technologies which later become commonplace are often explored and created in the research phases for people with disabilities. And so if you take that one step further and say, OK, if we look to the future, what are the needs of people with disabilities that are not being met, that this new wave of AI can address? That’s incredibly exciting. And I have this idea of what I call an assistive agent or an assistive companion.

So what if AI could understand you as a human: What are your capabilities? What are your limitations at any moment in time, whether that’s due to a disability or your preferences or something else? And understand the environment, the world you’re in, or the task you’re doing on the computer, or whatever. And then, can we use the AI to close that gap and enable everyone to do more and realize their full potential?

So right now, that’s my big vision. And yes, we’re probably still a ways away, but we’re getting closer all the time. So, it’s just this idea that we are all different; every single one of us has our own needs, whether we call it disability or not.

Roberts: Alice Massa also has a vision for the future of AI.

Massa: What I would love is if it could describe an action scene. Let’s say I was at a baseball game and the guy hit the ball and the shortstop is throwing the ball to third base. I believe that eventually it will be able to describe those kinds of scenes. I think that would be ideal, and the reason I say that is I have worked with some young people who really do want to socialize by going to sporting events and things like that. So, I always think of Phil Rizzuto: like listening to a ball game as it’s happening right in front of them, almost like a radio announcer.

When I was at the theater the other night, it would have been nice if I could have had some description of the theater. The movies have audio description, but that’s done by someone as they watch them, because the movie never changes. But in real life things are not scripted, so AI would be able to react to what’s actually happening rather than just be a scripted description of the action.

One of the places in reality, other than theater and entertainment, is walking in the street. Last night we were walking in Times Square at night. Talk about having vision problems and feeling a little bit challenged. It would be wonderful if there was, like, a button I could wear that, as I’m walking in the street, is sort of saying to me: there’s a crowd of five people in front of you; four people are walking toward you, directly in your path; a car is coming on your left. You know, if there was some way that you could use artificial intelligence to cue you, so that when you’re moving in a communal area you would be able to navigate much more easily.

Roberts: Saqib has the same hope. I asked him to tell me more about how Generative AI could someday function as a navigational tool, and much more.

Shaikh: Personally, I imagine having this visual assistant; I sometimes think of a friend sitting on my shoulder, whispering in my ear. When I’m with friends and family, maybe I wouldn’t need such a thing. We’re just together, enjoying each other’s company in the moment. But there are always those times when there aren’t other people around, so I really want that thing which does the equivalent of what my friends and family do, where they’re telling me that, oh, there’s someone walking towards you who you know. Oh, there’s a new shop just opened up on your left. Or, be careful, there’s a cleaning trolley down the corridor.

These are things which are just what a friend or colleague would do without even thinking about it, and I really want the AI to be able to take on that role when there aren’t other people around.

Roberts: So recently, legislators have wanted to talk about the safety of AI: Could AI be dangerous? Could the computer take over for people? And they’ve created a large level of fear. Talk to me about this subject. What are the risks associated with AI? Do we need to be establishing curbs on the advancement of the technology?

Shaikh: I’m absolutely not an expert in this particular area, but many of my colleagues at Microsoft are working on our responsible AI standards, and I’m really glad that policymakers are thinking about this, because it is really important for society as a whole that we do start thinking about what the potential harms are and put things in place so that, you know, we detect and prevent any problems before they happen. I also think, on a personal level, as someone with a disability, that we should also just make sure that we can continue to leverage these technologies in the ways that can benefit people. So I’m very happy that people are looking into this. It’s not my area of expertise, but responsible AI, and making sure that we innovate in a way that does not cause those unintended consequences, is critical.

Roberts: Alice agrees. She thinks there is much more to gain from AI than to fear from it.

Massa: There is a lot of anxiety about AI. Everybody right now is thinking, oh my God, AI is going to take over the world. But the truth of the matter is, AI really has a lot to offer for people who need new strategies, who need other strategies. And the people who need those strategies are often fearful, and I just hope people can get over that fear and recognize the value.

Roberts: As we plunge into this AI-driven era, the sheer magnitude of what generative AI can achieve becomes crystal clear. It’s not merely interpreting text or offering navigational support. It’s an ever-evolving force that adapts and grows with an expanding data universe. Picture a future where AI not only keeps pace with our needs but anticipates them, constantly pushing the boundaries of innovation. Prepare to be blown away by the endless possibilities.

Did this episode spark ideas for you? Let us know at podcasts@lighthouseguild.org and if you liked this episode, please subscribe, rate and review us on Apple Podcasts or wherever you get your podcasts.

I’m Doctor Cal Roberts. On Tech and Vision is produced by Lighthouse Guild. For more information, visit www.lighthouseguild.org. On Tech and Vision with Doctor Cal Roberts is produced at Lighthouse Guild by my colleagues Jane Schmidt and Anne Marie O’Hearn. My thanks to Podfly for their production support.
