Maintaining functional annotation databases in a genomic era

How can sequence databases providing functional annotations remain sustainable as sequence data production explodes?

Nicole Wheeler
Jan 18, 2021
Photo by Marcel Strauß on Unsplash

The amount of genomic sequencing data produced each year is growing exponentially. To do useful work with this data, we need to understand what function a sequenced piece of DNA might perform. We rely on large, curated functional annotation databases for much of this analysis. Many of them were established in the early days of genomic sequencing, when a small team could do much of the curation manually. How are these databases coping with the subsequent explosion in sequencing data, and how do they continue to provide functional annotations for sequences as they appear in the public domain?

Example databases

Pfam

Pfam, established in 1995, aims to curate conserved domains within proteins. These domains encode distinct functional or structural subunits in a protein that can often be combined in different sequences to encode proteins that perform different functions. What started as a small database became unsustainable as sequence data grew exponentially, requiring a reorganization and rethink of how the data were curated.

UniProt

The UniProt Knowledgebase (UniProtKB), launched as part of UniProt in 2003, is a protein database partially curated by experts, consisting of two sections: Swiss-Prot (containing reviewed, manually annotated entries) and TrEMBL (containing unreviewed, automatically annotated entries). UniProt integrates, interprets, and standardizes data from multiple selected resources to add biological knowledge and associated metadata to protein records and acts as a central hub from which users can link out to 180 other resources. The number of sequences in UniProtKB has risen to approximately 190 million, despite continued work to reduce sequence redundancy at the proteome level, requiring continual improvement in the resource’s ability to process and curate this data.

CARD

The Comprehensive Antibiotic Resistance Database (“CARD”) was created in 2013. It provides data, models, and algorithms relating to the molecular basis of antimicrobial resistance. CARD curation occurs continuously, with monthly updates released by a team of biocurators.

Below are some examples of the major strategies large functional annotation databases are using to stay on top of the genomic data deluge.

Community input

An example information page for the Pfam Piwi domain, which links to the domain’s Wikipedia page

Pfam

The Pfam team regularly receives feedback from users about families or domains missing from Pfam and typically adds many user-submitted families at each release. To provide functional descriptions for all of these domains, each domain is linked to its own Wikipedia page where possible, or otherwise to a page shared across a family of domains. The community can then edit the page to add relevant information describing the function of that domain. Some domains do not meet Wikipedia’s criteria for notability and therefore cannot have their own page, in which case they retain their original Pfam description. Curators review each Wikipedia revision before it is displayed on the Pfam website to guard against vandalism, although almost all cases of vandalism are corrected by the community before they reach curators. Because Pfam has many community contributors, the team recently linked the authorship of all Pfam entries to the corresponding authors’ ORCID identifiers. This lets authors claim credit for their Pfam curation on their ORCID record.

UniProt

UniProt has implemented a credit-based publication submission interface that allows the community to contribute publications and annotations to UniProt entries. The significant number of community contributions improving the resource prompted the development of the ‘Community submission’ page. The page lets researchers add articles they consider relevant to an entry and, optionally, provide basic annotation by selecting the topics pertinent to each paper from a controlled list and/or adding short statements about protein name, function, and disease, as described in the publication. An experienced curator performs a minimal check of each submission before it is added to the Publications section of the record. As with Pfam, this is facilitated by contributors supplying an ORCID.

CARD

In response to the 2019 European Commission’s Joint Research Centre (JRC) AMR Databases Workshop, the ‘AMR_Curation’ public repository was established for collective curation of AMR genes and mutations. It involves the majority of AMR database curators (e.g., CARD, NCBI, Resfinder, MEGARes). The repository has an active and monitored curation issue tracker, a parallel AMR curation mailing list, an editable Google Spreadsheet list of AMR databases and software, and a curated Wikipedia list of AMR databases, all accessible at https://github.com/arpcard/amr_curation.

Automated annotation

Pfam

An automated procedure has been implemented for generating Wikipedia articles from InterPro and Pfam data. It populates a page with information, links to databases, and available images. Once a curator has reviewed an automatically generated article, it is moved from the Sandbox to Wikipedia proper.

UniProt

Annotation of each Swiss-Prot entry in UniProt is done by a team of biologists and comes primarily from journal articles reporting the sequencing, and sometimes the characterization, of the protein. This process is time- and labor-intensive, so it must be complemented by an automated one. That complement is TrEMBL, which contains unreviewed, automatically annotated entries. The unreviewed records in UniProtKB/TrEMBL, which largely describe predicted proteins, make up by far the largest and most rapidly growing section of UniProtKB.

An example of conditions underpinning UniRules.

Rather than leaving these records with only basic sequence and source information, UniProt has taken steps to add features and annotation automatically. Unreviewed TrEMBL records are enriched with functional annotation by systems that use the protein classification tool InterPro, which classifies sequences at superfamily, family, and sub-family levels and predicts the occurrence of functional domains and important sites. UniRule, a semi-automated, rule-based computational annotation system, annotates experimentally uncharacterized proteins based on their similarity to experimentally characterized proteins, adding properties such as protein name, functional annotation, catalytic activity, pathway, GO terms, and subcellular location.
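To make the idea of rule-based annotation concrete, here is a minimal Python sketch. It is not UniRule’s actual data model or code: the `Rule` and `Protein` classes, the `apply_rules` function, and the signature IDs are all hypothetical, and the sketch assumes a rule is simply a set of required signatures plus a taxonomic restriction.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """Hypothetical annotation rule: conditions plus the annotations to add."""
    required_signatures: frozenset  # made-up signature IDs, not real InterPro entries
    taxon: str                      # taxonomic group the rule is restricted to
    annotations: tuple              # (property, value) pairs to attach


@dataclass
class Protein:
    accession: str
    signatures: set
    lineage: tuple
    annotations: dict = field(default_factory=dict)


def apply_rules(protein, rules):
    """Attach annotations from every rule whose conditions the protein meets."""
    for rule in rules:
        has_signatures = rule.required_signatures <= protein.signatures
        in_taxon = rule.taxon in protein.lineage
        if has_signatures and in_taxon:
            for prop, value in rule.annotations:
                protein.annotations.setdefault(prop, []).append(value)
    return protein


# Placeholder rule and protein, invented purely for illustration.
rules = [
    Rule(
        required_signatures=frozenset({"SIG_0001"}),
        taxon="Bacteria",
        annotations=(("protein_name", "Example hydrolase"), ("go_term", "GO:EXAMPLE")),
    ),
]
protein = Protein("DEMO_1", {"SIG_0001", "SIG_0002"}, ("Bacteria", "Proteobacteria"))
annotated = apply_rules(protein, rules)
print(annotated.annotations)
```

A protein that lacks the required signature, or falls outside the rule’s taxon, is left without annotation, which is the behaviour you would want from a conservative annotation rule.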

To complement the expert-guided process of creating UniRules, UniProt has recently introduced the Association-Rule-Based Annotator (ARBA), a multiclass, self-training annotation system for automatic classification and annotation of UniProtKB proteins. ARBA is trained on UniProtKB/Swiss-Prot, then uses rule mining techniques to generate concise annotation models with the highest representativeness and coverage, based on InterPro group membership and taxonomy properties.
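As a rough illustration of what rule mining means here, the toy Python sketch below derives "features imply annotation" rules from a small reviewed set using plain support and confidence thresholds. This is a generic association-rule calculation, not ARBA’s algorithm; the `mine_rules` function, its thresholds, and the feature labels are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def mine_rules(entries, min_support=2, min_confidence=0.9, max_features=2):
    """Return (features, annotation, support, confidence) tuples.

    entries: list of (set_of_features, annotation) pairs from a reviewed set.
    """
    antecedent_counts = Counter()
    pair_counts = Counter()
    for features, annotation in entries:
        for size in range(1, max_features + 1):
            for combo in combinations(sorted(features), size):
                antecedent_counts[combo] += 1
                pair_counts[(combo, annotation)] += 1

    rules = []
    for (combo, annotation), support in pair_counts.items():
        # Confidence: how often this feature set carries this annotation.
        confidence = support / antecedent_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, annotation, support, confidence))
    # Prefer rules that cover more entries, then shorter feature sets.
    rules.sort(key=lambda r: (-r[2], len(r[0]), r[0]))
    return rules


# Invented reviewed entries: domain membership and taxonomy as features.
reviewed = [
    ({"domain:A", "taxon:Bacteria"}, "kinase"),
    ({"domain:A", "taxon:Bacteria"}, "kinase"),
    ({"domain:A", "taxon:Eukaryota"}, "transporter"),
    ({"domain:B", "taxon:Bacteria"}, "transporter"),
]
mined = mine_rules(reviewed)
for features, annotation, support, confidence in mined:
    print(features, "->", annotation, support, confidence)
```

In this toy data, neither the domain nor the taxon alone predicts the annotation reliably, but the two together do, which is why combining group membership with taxonomy can yield concise, high-confidence rules.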

In the case of Pfam-B and TrEMBL, annotators did not want to dilute the quality of their other resources, Pfam-A and Swiss-Prot, with automated annotations, so they keep these as distinct resources.

Automated literature scanning

An example of automatically generated literature hits for a protein query from PaperBLAST.

CARD

A large part of CARD’s value lies in expert human biocuration of AMR sequence data and its relationship to antibiotics. However, more than 5000 AMR publications have appeared in PubMed each year for the last ten years, so keeping CARD both comprehensive and up to date is a daunting task. CARD addresses this problem using several strategies, one of which is computer-assisted literature triage. In 2017, the CARD*Shark text-mining algorithm was introduced for this purpose. CARD*Shark assigns relevance-based priority scores to publications returned by a general PubMed Medical Subject Headings (MeSH) search and assigns the results to a CARD biocurator for manual review.
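The general shape of literature triage (score, rank, hand off to a curator) can be sketched in a few lines of Python. The scoring below is a simple weighted keyword count chosen for illustration; it is not how CARD*Shark computes its scores, and the term weights, identifiers, and function names are all hypothetical.

```python
import re

# Hypothetical term weights; not CARD*Shark's actual vocabulary or scoring.
TERM_WEIGHTS = {
    "resistance": 2.0,
    "beta-lactamase": 3.0,
    "efflux": 2.0,
    "mutation": 1.0,
}


def priority_score(abstract, weights=TERM_WEIGHTS):
    """Sum the weight of each term once per occurrence in the abstract."""
    text = abstract.lower()
    return sum(
        weight * len(re.findall(re.escape(term), text))
        for term, weight in weights.items()
    )


def triage(papers, curators):
    """Rank papers by score and hand them out to curators in rotation."""
    ranked = sorted(papers, key=lambda p: priority_score(p["abstract"]), reverse=True)
    return [
        (curators[i % len(curators)], paper["id"], priority_score(paper["abstract"]))
        for i, paper in enumerate(ranked)
    ]


# Invented abstracts with placeholder identifiers.
papers = [
    {"id": "paper_A", "abstract": "A novel beta-lactamase confers resistance to carbapenems."},
    {"id": "paper_B", "abstract": "Growth of soil bacteria under drought."},
    {"id": "paper_C", "abstract": "An efflux pump mutation in a laboratory strain."},
]
queue = triage(papers, ["curator_1", "curator_2"])
for curator, paper_id, score in queue:
    print(curator, paper_id, score)
```

The point of a scheme like this is not to replace the curator but to make sure the most promising papers reach a human first.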

PaperBLAST

In contrast to CARD*Shark, which scans the literature for new findings relating to a topic, PaperBLAST is a tool that could be adopted to enhance our ability to characterize new proteins as they appear. PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes and takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific papers. Given a protein of interest, PaperBLAST quickly finds similar proteins discussed in the literature and presents text snippets from relevant articles or from the curators. The tool has already been adopted by GapMind, a web-based tool for annotating amino acid biosynthesis in bacteria and archaea.

Summary

The ongoing maintenance of functional annotation databases by small teams of qualified people is becoming increasingly intractable. This has created the requirement for a rethink in how data are curated and annotated, with an increasing move toward decentralization, automation and methods of directing curator attention to where it is needed most. With steady improvements in our ability to process “big data”, this automation is likely to improve over time. However, inferring a biological function from biological data is still a thorny issue, hence the continued separation of smaller, human-curated annotation sources and the larger collections of automated annotations.

Nicole Wheeler

Bioinformatician + data scientist, building machine learning algorithms for the detection of emerging infectious threats to human health