Which AI Models Create Accurate Alt Text for Picture Books?

Abstract

Over the last decade, the development and mainstream adoption of Artificial Intelligence (AI) systems that generate textual descriptions of images have surged. However, only a few of these systems, such as Microsoft’s SeeingAI, are specifically tailored to the needs of blind people who use screen readers, and none has been applied to the particular challenges faced by blind parents who want image descriptions of children’s picture books. Such images have distinct qualities, yet no research has explored the current state of the art or the opportunities to improve image-to-text AI systems for this problem domain. We conducted a content analysis of the image descriptions generated for a sample of 20 images selected from 17 recently published children’s picture books, using five AI systems: asticaVision, BLIP, SeeingAI, TapTapSee, and VertexAI. We found that descriptions varied widely in their accuracy and completeness, with only 13% meeting both criteria. Overall, our findings suggest a need for AI image-to-text generation systems that are trained on the types, contents, styles, and layouts characteristic of children’s picture book images, in order to increase accessibility for blind parents.