Junnan Li
Director, AI Research Singapore

Junnan Li is a Research Director at Salesforce. He joined Salesforce in 2019 as the founding researcher of the Singapore AI research team. In 2024, he co-founded Rhymes.ai as its chief scientist; the company was soft-acquired by Salesforce AI Research in 2025. Junnan is an expert in multimodal AI, LLMs, and agentic research. His papers are well-cited and his work is widely adopted in both industry and academia. In particular, his BLIP series of papers is among the most-cited in AI, with over 15,000 citations combined.
Most agents can respond to a prompt, but ask them to click a button in your enterprise software, and suddenly their limitations show. In the age of generative AI, everyone’s racing to build…
The landscape of AI agent development has evolved rapidly, with developers needing robust frameworks to build, test, and benchmark intelligent systems. MCP-Universe emerges as a comprehensive solution, providing a modular framework designed around…
Time series forecasting plays a central role in data-driven decision making. Yet, adapting forecasting models across different domains and temporal resolutions often requires custom engineering. This increases both development and maintenance costs —…
TL;DR: CodeT5+ is a new family of open code large language models (LLMs) with improved model architectures and training techniques. CodeT5+ achieves the state-of-the-art performance among the open-source LLMs on many challenging code…
BLIP-2: Scalable Pre-training of Multimodal Foundation Models for the World's First Open-source Multimodal Chatbot
TL;DR: LAVIS (short for LAnguage-VISion) is an open-source deep learning library for language-vision research and applications, offering comprehensive support for a wide range of tasks, datasets, and state-of-the-art models. Featuring a unified interface…
TL;DR: We propose ALPRO, a new video-and-language representation learning framework which achieves state-of-the-art performance on video-text retrieval and video question answering by learning fine-grained alignment between video regions and textual entities via entity…
TL;DR: BLIP is a new pre-training framework for unified vision-language understanding and generation, which achieves state-of-the-art results on a wide range of vision-language tasks. Background For a review of some terms and definitions…
TL;DR: We propose a new vision-language representation learning framework which achieves state-of-the-art performance by first aligning the unimodal representations before fusing them. Vision and language are two of the most fundamental channels…