June 17, 2020
Salesforce Research Develops New Search Engine to Support the Fight Against COVID-19
By Andre Esteva, PhD, Head of Medical AI, Salesforce Research, and Anuprit Kale, Lead Data Scientist, Salesforce
From February to May 2020, the number of scientific papers published on COVID-19 skyrocketed from 29,000 to more than 138,000. As people around the world step up to help, the number will continue to grow exponentially and is projected to swell to more than 1,000,000 by the end of 2020.
That’s good news for the medical community and policymakers working on vaccines and treatments — but only if they’re able to efficiently search the growing body of research. As papers are data-rich and can be hundreds of pages long, finding what you're looking for, in the time crunch of a global pandemic, can be a challenge.
Introducing COVID-19 Search
Salesforce Research used the data from the CORD-19 Challenge to create COVID-19 Search, an AI-powered search engine that equips scientists and researchers with the most relevant COVID-19 research. Sponsored by the White House and involving a number of leading AI and health policy groups (the NIH, Georgetown, AI2, CZI, and MSR), the ongoing CORD-19 Challenge aims to catalyze the development of search algorithms and engines that help researchers and policymakers better understand and combat COVID-19. It maintains the growing corpus of coronavirus-related publications and makes it easily accessible to the public.
This type of initiative aligns nicely with the goals of Salesforce Research. In addition to developing technology that powers Salesforce Einstein’s product line, a core part of our team works on applying AI to Social Good, including areas such as Healthcare. We believe that by advancing the field of AI, we can serve our broader community and improve the state of the world.
With deep experience in natural language processing (NLP), Salesforce Research pulled together a team of our experts to develop a search engine that would support research efforts as more information pours into public archives. In a few months, Salesforce Research developed COVID-19 Search to help users easily search through rigorous scientific information in their efforts to stem the tide of the global pandemic.
Getting technical: A novel virus requires a unique search experience
Searching scientific publications requires different techniques from traditional keyword-matching search engines. It’s critical that a COVID-19 search engine interpret the proper meaning in a given search, going beyond finding results based on the frequency with which words appear in documents. And with long documents, it’s valuable to quickly surface relevant passages in search results.
COVID-19 Search addresses this by combining text retrieval and NLP (including semantic search, state-of-the-art question answering, and abstractive summarization) to better understand the question and surface the most relevant scientific results.
The order of words in a scientific search is very specific, and a slight change in that order can produce a drastically different meaning. For example, searching for “What expression pathways does SARS-CoV-2 induce?” is substantially different from “What is the expression pathway of SARS-CoV-2?” The results need to align with the context of the query.
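As a rough illustration of why word counts alone fall short, the sketch below compares two hypothetical queries (invented for this example, not taken from CORD-19) that use exactly the same words in a different order. A bag-of-words representation, which only counts words, cannot tell them apart.

```python
from collections import Counter
import re


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens, discarding word order entirely."""
    return Counter(re.findall(r"[a-z0-9\-]+", text.lower()))


# Hypothetical queries: same words, opposite meaning.
query_a = "Does the virus inhibit the protein?"
query_b = "Does the protein inhibit the virus?"

same_counts = bag_of_words(query_a) == bag_of_words(query_b)
same_order = query_a.lower().split() == query_b.lower().split()

print(same_counts)  # True: a keyword-count model sees identical queries
print(same_order)   # False: the word order, and so the meaning, differs
```

A model that reads the query as a sequence, rather than as a set of counts, is what allows the two questions to receive different results.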
So we combined information retrieval (IR) search with our strengths in NLP to emphasize semantic search that models the meaning behind the query. Leveraging recent work in sentence correspondence (Reimers et al, 2019), we split scientific publications into pairs of paragraphs and citations that could be used to train algorithms to determine whether the title of a citation was referenced by a paragraph. The same AI can then take a query and find the paragraphs in a document set that address it.
Above: Learning to associate queries to text. A publication is split into paragraphs and citation titles, which define training pairs (tuples). These are fed into an AI model that learns to associate them with each other.
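The pairing step described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the `Paragraph` type, the `build_training_pairs` function, and the toy publication are all hypothetical, and using uncited titles from the same paper as negative examples is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Paragraph:
    text: str
    cited_ids: List[str]  # bibliography keys cited in this paragraph


def build_training_pairs(
    paragraphs: List[Paragraph],
    bibliography: Dict[str, str],  # key -> title of the cited work
) -> List[Tuple[str, str, int]]:
    """Return (paragraph_text, citation_title, label) tuples.

    label is 1 when the paragraph cites the title and 0 otherwise.
    Treating uncited titles as negatives is an assumption of this
    sketch, not something the article specifies.
    """
    pairs = []
    for para in paragraphs:
        cited = set(para.cited_ids)
        for key, title in bibliography.items():
            pairs.append((para.text, title, 1 if key in cited else 0))
    return pairs


# Toy publication with made-up content.
bibliography = {
    "ref1": "Receptor binding of coronaviruses",
    "ref2": "Clinical outcomes in hospitalized patients",
}
paragraphs = [
    Paragraph("The spike protein binds a host cell receptor [ref1].", ["ref1"]),
    Paragraph("Most patients recovered within weeks [ref2].", ["ref2"]),
]

pairs = build_training_pairs(paragraphs, bibliography)
for text, title, label in pairs:
    print(label, "|", title, "|", text)
```

Tuples like these could then be fed to a model that learns to score how well a short piece of text (a title, or later a query) corresponds to a paragraph.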
Semantic search combs through the massive population of documents and returns a subset, perhaps 100 or 1,000. We run these documents through a question-answering AI that treats the user’s query as a question and does its best to generate an answer from the retrieved documents. If an answer is contained in any single document, COVID-19 Search can re-rank the document list to surface this document. This is made possible by our recent work in multi-hop question answering (Asai et al, 2020), which searches across multiple documents to find answers.
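The retrieve-then-re-rank flow can be sketched as follows. This is an illustration under stated assumptions: `rerank_by_answer` and `toy_answer_fn` are hypothetical names, and the toy function is only a stand-in for a trained question-answering model.

```python
from typing import Callable, List, Optional, Tuple

# A QA function takes (question, document) and returns an answer string
# plus a confidence score, or None when it finds no answer.
QAFunction = Callable[[str, str], Optional[Tuple[str, float]]]


def rerank_by_answer(
    question: str, retrieved: List[str], answer_fn: QAFunction
) -> List[str]:
    """Move documents that contain an answer to the front of the list.

    Answer-bearing documents are ordered by confidence (highest first);
    the rest keep their original retrieval order.
    """
    with_answer = []
    without_answer = []
    for doc in retrieved:
        result = answer_fn(question, doc)
        if result is None:
            without_answer.append(doc)
        else:
            _, confidence = result
            with_answer.append((confidence, doc))
    # list.sort is stable, so ties keep their retrieval order.
    with_answer.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in with_answer] + without_answer


def toy_answer_fn(question: str, doc: str) -> Optional[Tuple[str, float]]:
    """Stand-in for a trained QA model: 'answers' only if the document
    mentions the word 'receptor'. Purely illustrative."""
    if "receptor" in doc.lower():
        return doc, 0.9
    return None


retrieved = [
    "Case counts rose sharply in the spring.",
    "The virus binds a receptor on the host cell surface.",
    "Hospital capacity varied by region.",
]
reranked = rerank_by_answer(
    "How does the virus enter cells?", retrieved, toy_answer_fn
)
print(reranked[0])
```

Keeping the question-answering model behind a plain function argument means the re-ranking logic stays the same if the stand-in is swapped for a real model.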
For example, you may have a query on COVID-19 (the illness) that actually relates to SARS-CoV-2 (the virus), such as “How does COVID-19 enter the cells of a patient?” The question-answering AI module first finds a paragraph in one document that explains how COVID-19 is related to SARS-CoV-2 and then finds a paragraph in a different document that explains how SARS-CoV-2 enters cells. By searching across different documents, COVID-19 Search can help users find more accurate results.
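The two-hop idea can be illustrated with a deliberately simple sketch. It swaps the learned multi-hop model for plain term overlap, so it shows only the shape of the search, not the actual method; the function names, the tiny stopword list, and the toy paragraphs are all assumptions made for this example.

```python
import re
from typing import List, Set

STOPWORDS = {"a", "by", "does", "how", "is", "of", "the", "to"}


def terms(text: str) -> Set[str]:
    """Lowercase word tokens minus a tiny illustrative stopword list."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower())) - STOPWORDS


def best_paragraph(
    query_terms: Set[str], paragraphs: List[str], skip: Set[int]
) -> int:
    """Index of the paragraph sharing the most terms with the query,
    or -1 if no remaining paragraph shares any term."""
    best_index, best_overlap = -1, 0
    for i, para in enumerate(paragraphs):
        if i in skip:
            continue
        overlap = len(query_terms & terms(para))
        if overlap > best_overlap:
            best_index, best_overlap = i, overlap
    return best_index


def two_hop_search(question: str, paragraphs: List[str]) -> List[int]:
    """Hop 1 matches the question; hop 2 matches the question expanded
    with the terms of the hop-1 paragraph (the 'bridge')."""
    first = best_paragraph(terms(question), paragraphs, skip=set())
    if first == -1:
        return []
    expanded = terms(question) | terms(paragraphs[first])
    second = best_paragraph(expanded, paragraphs, skip={first})
    return [first] if second == -1 else [first, second]


# Toy paragraphs, each standing in for a different document.
paragraphs = [
    "Hospital capacity varied by region.",
    "COVID-19 is the illness caused by the virus SARS-CoV-2.",
    "SARS-CoV-2 binds a surface receptor to get inside host tissue.",
]
question = "How does COVID-19 enter the cells of a patient?"
hops = two_hop_search(question, paragraphs)
print(hops)  # [1, 2]
```

In this toy setup the third paragraph shares no terms with the question itself, so a single-hop search misses it; it becomes reachable only through the bridging term SARS-CoV-2 found in the first hop.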
COVID-19 Search applies an abstractive summarizer (Kryscinski et al, 2018) that reads a single document or a set of documents and then generates a summary of those documents. We leverage recent advances in language modeling to generate a short summary, and then re-rank results based on the documents that most closely match it. Think of it like the abstract of a scientific paper, which captures the key search results in a brief paragraph.
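The re-ranking half of this step can be sketched as below. The summary is passed in as a plain string because an abstractive summarizer cannot be reproduced in a few lines, and cosine similarity over word counts is a stand-in chosen for this sketch, not necessarily the matching method used in COVID-19 Search. All names and texts are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def term_counts(text: str) -> Counter:
    """Lowercase word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9\-]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_by_summary(summary: str, documents: List[str]) -> List[str]:
    """Order documents by how closely they match the summary."""
    summary_vec = term_counts(summary)
    return sorted(
        documents,
        key=lambda doc: cosine(summary_vec, term_counts(doc)),
        reverse=True,
    )


# The summary would come from the summarizer; here it is written by hand.
summary = "The virus enters cells through a surface receptor."
documents = [
    "Hospital capacity varied by region.",
    "A surface receptor lets the virus enter cells.",
    "The virus spreads through droplets.",
]
reranked = rerank_by_summary(summary, documents)
print(reranked[0])
```

A production system would more plausibly compare learned embeddings than raw word counts, but the ordering logic would look much the same.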
Competition that inspires collaboration
In response to the CORD-19 Challenge, the TREC conference has formed the TREC-COVID Information Retrieval (IR) Challenge. This competition, created to objectively evaluate COVID-19 search engines, has catalyzed collaboration among a community of NLP and IR researchers, allowing them to build on each other’s work and develop techniques much faster. We hope that other teams in the community take our work and expand on it further.
COVID-19 Search is designed to serve those on the front lines of medicine and policymaking to accelerate the search for effective vaccines and treatments. CORD-19 and TREC-COVID are just the beginning. The computer science community is highly collaborative and we will continue working together and sharing our research to help the larger community develop better search engines for this pandemic and for future challenges.
If you’re interested in learning more about the projects we’re working on at Salesforce Research, visit our website.
This search engine is the result of hard work and dedication from a number of Salesforce teammates, including Anuprit Kale, Romain Paulus, Kazuma Hashimoto, and Wenpeng Yin, who built and deployed the various synchronized AIs; Melvin Gruesbeck, who designed the site; and Dragomir Radev, Jayesh Govindarajan, Caiming Xiong, and Richard Socher, who advised the project.