How Multimodal AI Systems and 'Smart Information' Will Shape the Future

Being early in my career and seeing things through a startup lens have allowed me to focus on the big picture. I’m not a manager in a large company. I’m not always on the lookout for technical advances that yield returns within one or two years. Rather, I like to identify trends that will unfold over the next 20, 30, or even 50 years and consider how my generation fits into the picture.

This analysis is about one such trend. I can’t guarantee immediate tactical applications, but I will argue that this trend will influence us for decades. It is a paradigm shift in how we think about information and, if adopted early, can pay dividends for years to come.

If you had told someone about ChatGPT’s impact two or three years ago, they would have laughed. What we once considered science fiction may well become tomorrow’s reality. If you’re caught unprepared, it may sink your business.

The Age of Smart Information

I recently read “The Age of Smart Information” by M. Pell, which delivers powerful ideas on the future of information and how we consume it.

Picture a university lecture. Some students are following the professor and understanding the material; others aren’t. Some struggle because they’re not as comfortable with the topic or forgot to do last night’s reading. Others still prefer to learn through diagrams and images, but the professor never uses them. For many students, the information is easily lost due to its inflexibility. The professor can only create one version of the lecture.

That’s how most information exists today: passive, static, one-size-fits-all. Once a piece of information is created, neither its content nor its “container,” as Pell calls it, can be changed without great cost or effort.

Up to this point, information has not been personalized because individualized content is difficult to create. Imagine if the professor were to hold several unique lectures on the same topic, each tailored to the specific needs of one student. One version might rely more on simplified diagrams; another might dive right into details.

However, the era of one-size-fits-all information is coming to an end; content is becoming more of a two-way conversation between the creator and the recipient. Microsoft’s AI-powered Bing is an excellent example of this. Instead of suggesting existing web pages and static content, Microsoft’s chat-powered search generates information in real time, in a form suited to you.

This is what Pell calls ‘smart information,’ and it’s being enabled by the rise of generative AI. Tools that rely on text, like Bing’s new form of search and ChatGPT, are just the beginning. Imagine a generative AI program that could watch the recording of one lecture and automatically create another version that explains the same underlying concepts but in 50 different ways, blending various media types, with each way tailored to the specific student watching. In this model, everyone gets a lot of value from the lecture.

This is only possible because the cost of producing content has dropped to nearly zero, allowing infinite versions of a piece of information to be created in real time. The Internet has eliminated the distribution costs of content; now AI is disrupting the supply side.

Data artifacts will also be able to convey their own meaning and have contextual awareness. You could track who is paying attention, and in what setting, to determine the content and container most likely to be understood. A simplified example of this contextual awareness is responsive web pages that know if you’re reading on a phone or a computer screen, adjusting the layout accordingly to improve your viewing experience. A future iteration of this is an article that recognizes if you’re reading on your phone with headphones connected and suggests generating a conversational audio file version of that article instead.
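
The headphone example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ReadingContext` fields, the `choose_container` function, and the container names are hypothetical, not part of any real product or of Pell's book.

```python
from dataclasses import dataclass


@dataclass
class ReadingContext:
    """Hypothetical signals an article might know about its reader."""
    device: str                 # assumed values: "phone" or "desktop"
    headphones_connected: bool


def choose_container(ctx: ReadingContext) -> str:
    """Pick a presentation ("container") for the same underlying content."""
    if ctx.device == "phone" and ctx.headphones_connected:
        return "audio"          # suggest a conversational audio version
    if ctx.device == "phone":
        return "compact_text"   # single-column layout for a small screen
    return "full_text"          # default layout for a larger screen


print(choose_container(ReadingContext("phone", True)))
```

The content stays the same in every branch; only the container changes. A real system would presumably draw on far richer context than two fields.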

The Real Impact of Multimodal AI Systems

Multimodal AI systems “combine multiple modes of input, such as text, images, videos, and audio, to perform a task or solve a problem,” per ChatGPT. Our current generative AI tools are limited in that each converts one modality into one other (text → text, text → image, image → 3D model, sound → text). I’m certain that we will start seeing systems that can convert between many modalities while still accurately communicating the underlying concept. These next-gen systems will be the ones to make information adaptive and context-aware. New expressions or formats will be generated on the fly depending on the needs of the recipient.

Bing’s chat-powered search is just the beginning. As of now you can only input text and receive (mostly) text-based responses. As multimodal systems improve, however, both inputs and outputs will be a fusion of images, sounds, videos, 3D environments, and more.

An Illustration of Smart Information in the Year 2038

Imagine you’re interested in buying a house. As you’re parked on the street, you see one that looks nice. So, you activate your smart glasses and put the house’s address into your new-age multimodal AI-powered search engine. Let’s call it Shmoogle. It immediately knows what house you’re referencing and uses a combination of satellite images, street views, data scraped from the web, and information on real estate listing sites to recommend 10 similar houses nearby that are for sale.

First, Shmoogle gives you an executive overview of each property in the form of a list. Properties that don’t fit your other preferences, which it pulled from a query you ran a few days ago, aren’t included.

You find one that catches your eye, so you want to learn more. The underlying multimodal AI system gathers images from street view and websites that the house is listed on to auto-generate a 3D replica of the house’s interior. You use your extended reality (XR) headset to automatically step inside and explore the space as if you were there. You can tap on the counter to see what it’s made of, or see the square footage of each room.

Last week, you bought a furniture set from IKEA and want to see if it might fit in the living room. You tap on the existing furniture to ‘erase’ it from view, pull in 3D models with embedded metadata of the furniture set that IKEA uploaded, and see exactly how the set would look in the new living room. Perfect match. With the click of one button, you can see what the walls would look like if you painted them a different color.

Next, you ask Shmoogle to show you the market value of the house in the past 10 years in a graph. That graph didn’t exist before, but the underlying AI scraped data from across the Internet and real estate portfolios to produce its best estimate. It also extrapolates the property value based on underlying market data.

Now, if you laid new tiles in the kitchen, how would that impact the market value, and how much would it cost? Is that a recommended investment? The AI queries hundreds of data sources to help you make that decision.

Meanwhile, your car communicates with your XR headset and recognizes that you’re momentarily distracted. At your request, made through the headset, the car drives itself to the property you’re exploring in virtual reality. By the time you’re done with your research, you can step outside and see the house in person.

Welcome to the future.

Final Thoughts

Invisible information becomes visible. Anything can have infinite depth and many connections to other pieces of information. Like the hyperlinks on a Wikipedia page, AI can create rabbit holes and data layers for you to explore in real time. Content and containers are fluid and adapt based on your needs. Data objects become like LEGO bricks as they mix and match to construct a conversational answer to your query.

In the hypothetical example above, notice that data objects remember your past interactions with them. This is critical in creating a conversation. Additionally, smart data objects combine with each other in useful ways, like the data object that automatically imported the furniture into your new potential home.

This may seem like a foreign concept, but it’s a natural progression of what happens in your brain every second of the day. Events, ideas, people, and things all exist in a web of cause and effect, but that reality is impossible to convey using the Internet’s discrete information vehicles like articles, podcasts, videos, or images. These ‘vehicles’ are flawed simply because they are rigid, discrete, over-simplified, and one-size-fits-all. Generative AI is changing that by giving rise to what Pell calls a living ‘information organism.’

I know these things sound like science fiction. For now, they might be. But I predict they’ll remain so for a lot less time than you think.

Toni Witt
