Semantic Scholar Cuts Through the Clutter of Biomedical Research

Estimates indicate that global scientific output doubles every nine years. For many researchers, this translates to hours spent every week just trying to keep up with the latest discoveries. As a result, scientists are overwhelmed with information and left with a sense that they may be missing something. In some cases, they may be right.

At the Allen Institute for Artificial Intelligence, our mission is to pursue AI for the common good. We started Semantic Scholar, a free academic search engine, to help scientists find research that accelerates their work and to help the public get informed on advanced research topics. Using AI, we are creating a scientific knowledge graph out of the insights trapped in over 40 million PDFs to understand the research on important topics and how they interrelate.

Delayed Discoveries: The Cost of Information Overload

To understand the impact of missed connections on medical research, look no further than the development of trastuzumab, a lifesaving drug for treating HER2-positive breast cancer. The most important advance in breast cancer therapy in the last thirty years could have been discovered seven years earlier had technology been able to extract and identify the relevant information.

Trastuzumab, known under the brand name Herceptin, was approved by the FDA in 1998 for the treatment of HER2-positive breast cancer. Of all breast cancer cases, 25% are “HER2-positive,” meaning that the HER2 gene is overexpressed; it’s one of the most aggressive types of cancer in humans, and for a long time the prognosis was grim. Trastuzumab reduces the risk of death by 33%, an astonishing breakthrough that rested on the discovery of a single gene, a protein, and its antibody.

Trastuzumab was developed at Genentech based on research by Axel Ullrich and his colleague from UCLA, Dennis Slamon. Ullrich and Slamon established that a specific gene, which they called HER2, was overexpressed in a certain type of breast cancer and that an antibody to the HER2 protein had potential as a treatment. But they weren’t the first to make this discovery: the same gene had been discovered years earlier at a different lab.

In 1982, Lakshmi Padhy was a postdoc in Robert Weinberg’s lab at MIT. Weinberg had discovered that oncogenes cause cancer, and his lab was working to identify the specific genes that might be responsible. Padhy extracted DNA from neurological tumors in rats and injected it into normal mouse cells, which then turned cancerous. Padhy then discovered that these cancerous mouse cells triggered an immune response in the mice, creating specific antibodies to the proteins produced by these rat-tumor genes. He called this gene and antibody neu, but never followed up on it.


Although Padhy’s discovery was published in Cell, a high-profile scientific journal, nobody noticed that they might have stumbled on a potential anticancer drug because the neu-binding antibody was buried in an obscure figure in the article. This is the kind of buried information that AI has the potential to bring to the surface for researchers, and that Semantic Scholar tries to foreground: discoveries that may not be recognized as important at the time, but are essential to the advancement of science overall.

Important Discoveries in Obscure Research

Some breakthroughs turn an entire field on its head. One of these actually inspired me to join the Semantic Scholar team. My personal experience made very real for me the cost of delay in spreading research findings to the broader medical community.

In 2002, I was working as a software developer for a small startup. Things were going well until my health took a turn for the worse. Everything I ate or drank caused intense stomach pain and heartburn — even a glass of water made me double over in pain. I saw a gastroenterologist who did an endoscopy and the results weren’t good — I had an inflamed stomach lining (gastritis) and two duodenal ulcers.

I was shocked! I was young and healthy, and none of the usual ulcer causes explained my condition. The doctor handed me a six-month prescription for Prevacid, a drug that reduces stomach acid, and said I would need to be on it for the rest of my life. The medication made my symptoms go away, but I was worried about the idea of taking a pill for the rest of my life, especially when the cause of my condition was unknown.

I saw a second gastroenterologist and heard the same story. At my wits’ end, I searched online for medical research papers and learned that Helicobacter pylori (H. pylori) causes ulcers and gastritis. I saw a third doctor, armed with research about H. pylori. The treatment was a 2-week antibiotic cocktail, and she agreed to let me try it.

Since that course of treatment I have been completely cured. Just a few years later, I learned the amazing story of two Australian doctors, Barry Marshall and Robin Warren, who were awarded a Nobel Prize in 2005 for their work on H. pylori. Marshall had to prove that H. pylori causes ulcers through drastic measures: by experimenting on himself. Their work was originally published in the Medical Journal of Australia in 1985 and languished for another 10 years before it started gaining acceptance in the US. Today antibiotics are the standard of care for treating ulcers caused by H. pylori, and eradicating the infection also helps prevent stomach cancer, which has become far less common in the Western world.

Helping Researchers and Patients Connect the Dots

With these examples in mind, we set out to turn Semantic Scholar into a tool that could surface interesting connections in the literature and help researchers and patients alike discover important new insights. In particular, we focused on a problem that kept popping up as we talked to biomedical researchers: finding information about an unfamiliar topic without getting overwhelmed.

What we did

To help people get up to speed on new research topics, we first needed to identify which topics are important in research papers and how the topics relate to each other. Most existing work like this has identified people, organizations, and locations. However, identifying scientific topics is harder than identifying proper nouns because there are fewer lexical cues like capitalization. For example, in the paper “Acute Lymphoblastic Leukemia in Children,” we need to identify “acute lymphoblastic leukemia” as the important topic — not just “leukemia” — and we also need to know that “acute lymphocytic leukemia” is the same disease.
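The two requirements in that example (prefer the longest phrase, and map synonyms to one disease) can be illustrated with a small dictionary lookup. This is only a simplified sketch, not the learned system Semantic Scholar uses; the SYNONYMS table and the find_topics function are hypothetical names made up for illustration.

```python
import re

# Hypothetical synonym table: surface form -> canonical topic name.
SYNONYMS = {
    "acute lymphoblastic leukemia": "acute lymphoblastic leukemia",
    "acute lymphocytic leukemia": "acute lymphoblastic leukemia",
    "leukemia": "leukemia",
}


def find_topics(text, synonyms=SYNONYMS):
    """Return canonical topics found in text, preferring the longest match."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    max_len = max(len(key.split()) for key in synonyms)
    topics = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first so that "acute lymphoblastic
        # leukemia" wins over the shorter "leukemia".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in synonyms:
                topics.append(synonyms[phrase])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here; move to the next token
    return topics


print(find_topics("Acute Lymphoblastic Leukemia in Children"))
print(find_topics("Outcomes in acute lymphocytic leukemia"))
```

Both calls report the same canonical topic. A fixed table like this only covers phrases someone has already listed, which is why a model that generalizes from labeled examples is needed for real papers.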

We started by enlisting medical experts to spend hundreds of hours labeling topics like diseases, genes, and research techniques in papers. These examples were used to teach our algorithm to identify medical topics in any paper. This was an expensive and time-consuming endeavor, so the challenge was to build an algorithm that could understand the meaning of many varied medical topics without human-annotated examples for each subject area.

PubMed includes research from 80+ medical fields, so it would be prohibitive to collect labeled data for each field. To get around this restriction, we used two neural networks: one to capture the grammatical context of information in papers, and a second to use that language understanding to make predictions based on the examples annotated by our medical experts. The result was highly accurate identification of salient topics in papers with less training data than might normally be needed for a project of this scope. Our algorithm also relies on several knowledge bases, in addition to the grammatical context and labeled data, to help it understand the meanings of words. For example, in medical literature the word “hedgehog” can refer to the signaling protein or the animal. If you want to know more, don’t worry: the Semantic Scholar Research team is working on a paper for publication that will describe this system in more depth.
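To make the “hedgehog” example concrete, here is a minimal sketch of how a knowledge base can help pick a word sense from context. It uses a simple word-overlap heuristic (in the spirit of the Lesk algorithm), which is a stand-in and not the neural approach described above. The KNOWLEDGE_BASE entries and the disambiguate function are hypothetical and exist only for illustration.

```python
import re

# Hypothetical knowledge-base entries: each sense lists a few related terms.
KNOWLEDGE_BASE = {
    "hedgehog": {
        "signaling protein": {"signaling", "pathway", "sonic", "gene", "protein", "tumor"},
        "animal": {"mammal", "spines", "nocturnal", "hibernation", "species", "habitat"},
    }
}


def disambiguate(word, sentence, kb=KNOWLEDGE_BASE):
    """Pick the sense whose related terms overlap most with the sentence."""
    context = set(re.findall(r"[a-z]+", sentence.lower())) - {word}
    # Count shared words between the sentence and each sense's related terms.
    scores = {sense: len(context & related) for sense, related in kb[word].items()}
    best = max(scores, key=scores.get)
    # With no overlap at all there is no evidence, so return None.
    return best if scores[best] > 0 else None


print(disambiguate("hedgehog", "Hedgehog signaling pathway activation in tumor cells"))
print(disambiguate("hedgehog", "The hedgehog is a nocturnal mammal"))
```

The first sentence resolves to the protein sense and the second to the animal sense. A learned model can weigh context far more flexibly than counting shared words, but the role of the knowledge base is the same: it supplies the candidate meanings.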

How it works

Once we’ve identified topics and scored how important they are in a particular research paper, we use that information to decide what to show on the site when someone does a search. Topics show up in two ways:

  • Suggested terms to augment or refine your search
  • Topic pages that define the topic and offer a summary of recent and review literature
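As a rough illustration of the first bullet, suggested terms could be produced by adding up per-paper topic scores across the search results and showing the highest totals. This is an assumption about one plausible design, not a description of the production ranking; the suggest_terms function, the shape of the results records, and the scores are all invented for the example.

```python
from collections import defaultdict


def suggest_terms(results, query, top_n=3):
    """Rank topics by their summed importance across the result papers."""
    totals = defaultdict(float)
    for paper in results:
        for topic, score in paper["topics"].items():
            # Suggesting the query itself would not help refine the search.
            if topic.lower() != query.lower():
                totals[topic] += score
    # Highest total first; ties are broken alphabetically for stable output.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:top_n]]


# Made-up scores for three hypothetical papers returned by a search.
results = [
    {"topics": {"stomach ulcer": 0.9, "Helicobacter pylori": 0.8}},
    {"topics": {"Helicobacter pylori": 0.7, "gastritis": 0.4}},
    {"topics": {"proton pump inhibitor": 0.5, "gastritis": 0.3}},
]
print(suggest_terms(results, "stomach ulcer", top_n=2))
```

Summing scores favors topics that are important in many of the matching papers, so a topic that dominates one paper does not automatically outrank one that recurs across the results.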

To take an earlier example, here’s what we show on a search for “stomach ulcer”: