Checking in on GenBank

When high schoolers in New York City were curious whether the sushi in their local market contained the kind of fish that was advertised, GenBank held the clues.

When historians wanted to identify the sequence of the lethal pandemic 1918 influenza virus that killed millions worldwide, they found the answer through GenBank.

When physicians were perplexed about why a 14-year-old boy was in a coma and all the tests, including a brain biopsy, were inconclusive, GenBank confirmed the evidence of Leptospira santarosai infection.

When scientists needed a retrospective genomic characterization of the 2018 Ebola outbreak in Équateur Province, Democratic Republic of the Congo, GenBank came through.

You might have heard about these cases, but you might not have heard about the role that NLM’s GenBank played.

Even the people who use GenBank may not realize the importance of the world’s largest genetic sequence data repository.

“Our role is to help make genetic sequence data available to researchers so that they can make new discoveries with it,” says Colleen Bollin, leader of the technical development team for processing submissions to GenBank. “When large quantities of good data are shared, the power of individual researchers is amplified.”

Amplifying that power are the biologists, along with computational physicists, engineers, and computer scientists, who update and maintain GenBank every day. They are as passionate about their work as they are scientifically and technically talented.

“This is not a nine-to-five kind of job,” says Jonathan Kans, PhD, a developer of GenBank. “It’s the kind of work that grabs you because it’s interesting—and important.”

Data Driven

GenBank contains over 1.65 billion sequences and over 6.26 trillion base pairs. It includes sequences for viruses, human pathogens, microorganisms, bacteria, animals and plants.

GenBank is used by an average of 60,000 people daily.

Most users who need GenBank also rely on PubMed, even though not everyone who uses PubMed relies on GenBank.

“I think GenBank and PubMed go hand-in-hand. You need the data, but it’s not as useful without the literature, so you had the sequences in GenBank and those are referenced by publications that researchers write and that’s why we have the two most popular resources at NCBI [National Center for Biotechnology Information],” explains Yoon Choi, PhD, developer.

“As NLM prepares for a data-driven future, GenBank is a premier example of ‘open data’ and ‘open science,’” says Ilene Karsch-Mizrachi, PhD, GenBank coordinator and program head for Sequence Submission and Archives. “The work we do is central to biology in the 21st century.”

It’s also global.

From the beginning, GenBank was and remains an international collaborative enterprise. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan, the European Nucleotide Archive, and GenBank at NCBI. These three organizations exchange data daily.

A Growing Bank

As the cost of sequencing goes down, the rate of submissions goes up. Submissions to GenBank come from all over the world, with the majority from the US, China and India.

But the submissions aren’t added to the database automatically.

“Sequencing may be cheap and easy, but analyzing the data is not,” says Linda Yankie, PhD, biochemist and indexer.

Submissions must be reviewed for scientific accuracy. This work may include checking for vector contamination and sequencing errors.

“The database is only as good as it is accurate,” says Yankie. “If we put out bad data and you generate good data and compare it to bad data, you’re going to think yours is wrong.”

The data are also examined to confirm that everything is where it’s supposed to be. This step is crucial, Yankie says, because “people compute off of the data and aren’t actually human reading the data, so if they look for everything from the Ebola outbreak from 2016 to 2019 and a collection date when that sample was collected is not in the right place, then the file format will never pull that data out and they won’t be able to access it.”
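Yankie's point about a misplaced collection date can be illustrated with a small sketch. GenBank flat files carry a /collection_date qualifier on the source feature, and software typically reads that field rather than free text. The two trimmed records below, their labels (REC_A, REC_B), and the helper names are hypothetical, made up only for illustration; the parsing handles just the DD-Mmm-YYYY layout and is not how GenBank's own tools work.

```python
import re
from datetime import date

# Hypothetical, heavily trimmed records in GenBank flat-file style.
# REC_A stores the date in the structured /collection_date qualifier;
# REC_B buries the same date in a free-text /note.
RECORDS = {
    "REC_A": '''     source          1..120
                     /organism="Zaire ebolavirus"
                     /collection_date="12-May-2018"
''',
    "REC_B": '''     source          1..120
                     /organism="Zaire ebolavirus"
                     /note="sample collected 12-May-2018"
''',
}

QUALIFIER = re.compile(r'/collection_date="([^"]+)"')
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def collection_date(record_text):
    """Return the date in the /collection_date qualifier, or None."""
    match = QUALIFIER.search(record_text)
    if match is None:
        return None  # no structured date field in this record
    parts = match.group(1).split("-")
    if len(parts) != 3 or parts[1] not in MONTHS:
        return None  # other date layouts are out of scope for this sketch
    try:
        return date(int(parts[2]), MONTHS[parts[1]], int(parts[0]))
    except ValueError:
        return None


def collected_between(records, start, end):
    """List the labels of records whose structured date falls in the range."""
    found = []
    for label, text in records.items():
        when = collection_date(text)
        if when is not None and start <= when <= end:
            found.append(label)
    return found


hits = collected_between(RECORDS, date(2016, 1, 1), date(2019, 12, 31))
print(hits)  # ['REC_A']
```

Both records describe a sample collected on the same day, but only REC_A is returned. A human reader would spot the date in REC_B's note; a program looking in the expected field would not.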

They also do what Yankie calls “making it pretty,” an effort that may include fixing misspellings and making sure there aren’t long paragraphs in the middle of a file.

Sometimes data must be analyzed on a tight timetable.

“Whenever there are outbreaks, we try to respond as quickly as possible,” says Yankie. That could mean posting sequencing information within an hour.

That was the case with the 2009 flu pandemic, or swine flu outbreak, which lasted from early 2009 to late 2010. “This was interesting because influenza had been sequenced for a long time, but this was a brand-new strain that had not been identified before,” says Yankie. “We decided that the sequences related to the flu needed to get out as quickly as possible. You never know who needs to see what.”

As Ebola has peaked for a second time, the GenBank team has focused on getting information out fast. They also work swiftly on possible foodborne viruses. “Researchers need GenBank to help them figure out if the outbreak of a foodborne illness is the same in California as it is in North Dakota,” explains Yankie. “Analyzing the data helps narrow the focus. . . . Sometimes you don’t want to eat the spinach.”

GenBank History

Before GenBank, it was costly and time consuming to sequence proteins and DNA, so researchers usually limited sequencing to only those genes and proteins in which they had a specific interest.

By the late 1970s, demand was growing for an international computer database of nucleic acid sequence data, and 1982 marked the official beginning of GenBank. When the Basic Local Alignment Search Tool (BLAST) was introduced in 1990, it became possible to search GenBank for similar sequences in seconds.

In 1992, GenBank became part of NLM. Through the years, GenBank has been integrated with dozens of other biological databases, as well as with the scientific literature via NLM’s PubMed and PubMed Central. In addition, sequence data from the Human Genome Project were added as soon as they were generated.

By Kathryn McKay, NLM in Focus editor.

2 thoughts on “Checking in on GenBank”

Mark Cavanaugh says:

October 4, 2019 at 11:24 am

As of release 233.0 in August 2019, the size of GenBank is actually 6.26 terabases and 1.65 billion records, far larger than the figures mentioned in this article. See:
https://ncbiinsights.ncbi.nlm.nih.gov/2019/08/30/genbank-release-233/

Guest Author says:

October 10, 2019 at 2:11 pm

Mark,

Wow! Thank you so much! We’ve updated the story to reflect these impressive stats.

Kathryn McKay, NLM in Focus editor