OpenAI Announces Enhanced Capabilities for ChatGPT: Speech, Listening, and Image Processing Now Supported

By Maggie on Sep 25, 2023

OpenAI has unveiled a major update to ChatGPT, the most significant since the introduction of GPT-4. The upgrade gives ChatGPT the ability to “see, hear and speak”: it can understand spoken language, respond in synthetic voices, and process images. The company made the announcement on Monday.

Users of the ChatGPT mobile app will now be able to engage in voice conversations with a choice of five different synthetic voices for the bot’s responses. In addition, users can share images with ChatGPT, provide the bot with visual context, and even highlight specific areas for analysis, such as asking, “What kinds of clouds are these?”

OpenAI plans to roll out these changes to paying users within the next two weeks. While the voice features will be available exclusively on the iOS and Android apps, the image processing capabilities will be available on all platforms.

The feature enhancement comes amid ongoing competition among AI chatbot leaders such as OpenAI, Microsoft, Google, and Anthropic. These companies are looking to integrate generative AI more deeply into consumers’ daily lives by introducing not only new chatbot applications but also new features, particularly this summer, when Google unveiled a series of updates to its Bard chatbot and Microsoft incorporated visual search into Bing.

Microsoft invested an additional $10 billion in OpenAI earlier this year, the largest AI investment of the year, according to PitchBook. In April, OpenAI reportedly closed a $300 million stock sale, valuing the startup at between $27 billion and $29 billion, with contributions from firms such as Sequoia Capital and Andreessen Horowitz.

However, the introduction of synthetic voices generated by artificial intelligence has raised concerns. While they can offer a more natural user experience, they also have the potential to facilitate convincing deepfakes. Cyber threat actors and researchers are already exploring how deepfakes can be used to breach cybersecurity systems.

OpenAI acknowledged those concerns in its Monday announcement, stressing that the synthetic voices were created in collaboration with voice actors the company worked with directly, rather than collected from strangers.

The announcement provided little information about how OpenAI intends to use consumer voice input or how it will secure that data if it is used. OpenAI’s terms of service state that consumers own their inputs “to the extent permitted by applicable law.”

When it comes to voice interactions, OpenAI directs users to its guidelines, which state that audio clips are not stored or used to improve models. However, the guidelines also note that transcriptions are considered inputs and can be used to improve the models.