What Is Multimodal AI? Video Transcript:
"Welcome to the AI Research Lab Explained, the Salesforce series, where we give you a look into the new, emerging AI techniques our team is experimenting with. We break down complex concepts and share real-world insights, all from the forefront of research. My name is Juan Carlos Niebles and I lead an AI research team here at Salesforce.
One of the most exciting frontiers in AI is multimodality, the ability for AI models to understand not only text but also sounds and images. That will be our focus for this segment. Now, why is our research focusing on multimodality? Well, think about how you interact with the world: texting, sharing photos, recording videos, sending voice notes. Every day you seamlessly switch between different types of data or modalities. Shouldn't AI do the same? Real-world intelligence just isn't one-dimensional.
That's where multimodal AI comes in. Unlike traditional AI models that only process one type of input, like text-only chatbots or image classifiers, multimodal models fuse multiple sources of data to create a richer, more contextual understanding of the world. For example, let's say you give a multimodal AI system a video and ask, "What is happening here?" Instead of analyzing just the audio or a single frame, it processes both the audio and the sequence of video frames at once, identifying objects, recognizing speech, and understanding the full scene to generate a meaningful response.
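To make that flow concrete, here is a minimal sketch of the input side, assuming frames are sampled with OpenCV and the model itself is a placeholder: the `model.generate` call at the end stands in for whatever video-language model is being queried and is not a real library API.

```python
# Sketch: sample a few frames from a video, then hand frames + audio + a
# question to a (hypothetical) multimodal model. Only the frame sampling
# below uses a real library (OpenCV); the model call is a placeholder.
import cv2  # pip install opencv-python


def sample_frames(video_path: str, num_frames: int = 8):
    """Grab up to `num_frames` evenly spaced RGB frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames[:num_frames]


# Placeholder usage with a hypothetical video-language model object:
# frames = sample_frames("clip.mp4")
# answer = model.generate(frames=frames, audio="clip.wav",
#                         prompt="What is happening here?")
# print(answer)
```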
So how do we actually build multimodal AI? Well, multimodal AI works by integrating multiple models, such as visual models, speech models, and text models, and training them together so they interconnect effectively. Conceptually, this architecture works as follows: we start with a language-based AI like an LLM, which already understands text tokens. To enable multimodal capabilities, we introduce neural network modules that act as translators. The role of each translator is to convert inputs from one modality, such as pixel data from images, into a format that the LLM can understand. Now we have an AI system that integrates the LLM and this additional module to achieve understanding of both text and images. After that, we can introduce one new module for each additional modality, such as audio, video, or sensor data, resulting in a multimodal AI system that can understand multiple types of data.
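As a rough illustration of that architecture (not the design of any particular Salesforce model), the toy PyTorch sketch below adds an `ImageToTokens` "translator" that projects image features into the same embedding space as the LLM's text tokens, so both flow through one shared backbone. All sizes, layer counts, and names are illustrative assumptions.

```python
# Sketch of the "translator" idea in PyTorch: image features are projected
# into the LLM's token-embedding space so the language model can treat them
# like extra tokens. All sizes and names here are illustrative.
import torch
import torch.nn as nn

TEXT_VOCAB, EMBED_DIM, VISION_DIM, NUM_IMG_TOKENS = 32_000, 768, 1024, 16


class ImageToTokens(nn.Module):
    """Translator module: vision features -> a few pseudo-token embeddings."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(VISION_DIM, EMBED_DIM),
            nn.GELU(),
            nn.Linear(EMBED_DIM, EMBED_DIM * NUM_IMG_TOKENS),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, VISION_DIM) -> (batch, NUM_IMG_TOKENS, EMBED_DIM)
        return self.proj(vision_features).view(-1, NUM_IMG_TOKENS, EMBED_DIM)


class TinyMultimodalLM(nn.Module):
    """Toy LLM whose input sequence can mix text and image 'tokens'."""

    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(TEXT_VOCAB, EMBED_DIM)
        self.image_translator = ImageToTokens()
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(EMBED_DIM, TEXT_VOCAB)

    def forward(self, text_ids, vision_features):
        img_tokens = self.image_translator(vision_features)
        txt_tokens = self.text_embed(text_ids)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)  # image tokens first, then text
        return self.lm_head(self.backbone(seq))


# Smoke test with random inputs.
model = TinyMultimodalLM()
logits = model(torch.randint(0, TEXT_VOCAB, (2, 10)), torch.randn(2, VISION_DIM))
print(logits.shape)  # torch.Size([2, 26, 32000])
```

In a real system, the vision features would come from a pretrained image encoder and the backbone from a pretrained LLM; the essential point is only that the translator's output lives in the same embedding space as the text tokens.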
Now, to make this concept work, we have to take our AI system through training, a machine learning process that uses example data to optimize a system. Each translation module, like the image-to-token module, can be trained independently or alongside the others. This training ensures that the module can accurately map image content to a vector space that the LLM understands. Once trained, the LLM gains a new input pathway, allowing it to process images and generate responses based on their content. Again, this process can be repeated to add more modalities like sound or sensor data.
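Continuing the toy example above, one possible sketch of that training step keeps the language backbone frozen and updates only the image translator against a caption-prediction loss on paired image-text data; the tensors here are random stand-ins for a real dataset.

```python
# Sketch of training only the translator, reusing TinyMultimodalLM from the
# previous snippet. The LLM backbone and text embeddings stay frozen; only the
# image-to-token projection learns to map images into the LLM's vector space.
import torch
import torch.nn.functional as F

model = TinyMultimodalLM()

# Freeze everything except the translator.
for p in model.parameters():
    p.requires_grad = False
for p in model.image_translator.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.image_translator.parameters(), lr=1e-4)

# One toy step on fake "image + caption" pairs; real training would iterate
# over a dataset of images paired with text.
vision_features = torch.randn(4, VISION_DIM)          # stand-in image encoder output
caption_ids = torch.randint(0, TEXT_VOCAB, (4, 12))   # stand-in tokenized captions

logits = model(caption_ids[:, :-1], vision_features)  # score caption tokens given the image
text_logits = logits[:, NUM_IMG_TOKENS:, :]           # keep only the caption positions
loss = F.cross_entropy(text_logits.reshape(-1, TEXT_VOCAB),
                       caption_ids[:, 1:].reshape(-1))  # predict the next caption token
loss.backward()
optimizer.step()
print(f"caption loss: {loss.item():.3f}")
```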
Interestingly, after training, our research shows that this multimodal AI system doesn't just understand each modality separately. It can also answer cross-modal questions. For example, given an image of a soccer field and a sound clip of a guitar, we can ask the AI system, "Which input is relevant to a musician?" This cross-modal reasoning is a key capability of multimodal AI.
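Sticking with the same toy model, a cross-modal query can be sketched by adding a second translator for audio and concatenating image tokens, audio tokens, and the question text into a single sequence. The feature vectors and token IDs below are random placeholders for real encoder and tokenizer outputs.

```python
# Cross-modal sketch: add an audio "translator" next to the image one from the
# previous snippets, then feed image tokens + audio tokens + the question text
# through the same backbone so the model can reason across all three.
import torch
import torch.nn as nn

AUDIO_DIM = 512  # illustrative size of an audio encoder's output


class AudioToTokens(nn.Module):
    """Second translator: audio features -> pseudo-token embeddings."""

    def __init__(self, num_tokens: int = NUM_IMG_TOKENS):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(AUDIO_DIM, EMBED_DIM * num_tokens)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_features).view(-1, self.num_tokens, EMBED_DIM)


model = TinyMultimodalLM()
audio_translator = AudioToTokens()

image_feat = torch.randn(1, VISION_DIM)              # e.g. the soccer-field image
audio_feat = torch.randn(1, AUDIO_DIM)               # e.g. the guitar sound clip
question_ids = torch.randint(0, TEXT_VOCAB, (1, 8))  # tokenized question

seq = torch.cat([model.image_translator(image_feat),
                 audio_translator(audio_feat),
                 model.text_embed(question_ids)], dim=1)
answer_logits = model.lm_head(model.backbone(seq))   # one sequence, three modalities
print(answer_logits.shape)
```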
So how does multimodality fit into agentic AI systems? Well, by integrating multimodal AI capabilities into the AI agent framework, we can effectively enable agents to interact with more types of data. For example, a multimodal agent could analyze an image and reason about its content, like counting objects or identifying patterns across multiple images. A multimodal agent could also interact with a webpage visually, reading text, clicking buttons, and filling out forms in real time. Furthermore, such a multimodal AI agent will be key to serving as a robotic brain, enabling useful robots in the future that can sense the environment with image sensors and can also talk to human users via language.
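As a purely hypothetical outline of such a web-browsing agent loop: the `take_screenshot`, `click`, and `type_text` helpers, and the `model.generate` call, are stand-ins for a real browser-automation stack and vision-language model, not any specific API.

```python
# Hypothetical multimodal web-agent loop: screenshot the page, ask a
# vision-language model for the next action, execute it, repeat.
import json


# --- Placeholder I/O layer (stand-ins for a real browser-automation stack) ---
def take_screenshot() -> bytes:
    return b""          # would return a PNG of the current page


def click(target: str) -> None:
    print(f"[stub] click {target}")


def type_text(target: str, text: str) -> None:
    print(f"[stub] type '{text}' into {target}")


def run_agent(model, goal: str, max_steps: int = 10) -> str:
    """Screenshot -> ask the multimodal model for the next action -> execute."""
    for _ in range(max_steps):
        reply = model.generate(          # `model` is any vision-language wrapper
            image=take_screenshot(),
            prompt=(f"Goal: {goal}. Respond with JSON: "
                    '{"action": "click" | "type" | "done", "target": "...", "text": "..."}'),
        )
        action = json.loads(reply)
        if action["action"] == "click":
            click(action["target"])
        elif action["action"] == "type":
            type_text(action["target"], action["text"])
        else:                            # "done" (or anything unrecognized) ends the loop
            return action.get("text", "")
    return "step budget exhausted"
```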
That's the power of multimodal AI, combining different types of data to lead to deeper insights and better outcomes. Multimodal AI is more than text, images, and sounds. It is about helping AI interpret the world the way we do. For a deeper dive on the technical components of the multimodal AI research at Salesforce, you can see the links in the description. One particular work to highlight from our team is xGen-MM, also known as BLIP-3. It is a model that integrates text, image, and video processing for advanced visual language understanding, and you can read all about it below. If you found this helpful, give the video a like and subscribe for more from the AI Research Lab. Thanks for watching and see you next time!"