Authors: Ashish Baghudana and Juan Miguel Cejuela (juanmi@jmcejuela.com)
We present the relna corpus, currently being prepared at Rostlab, Technical University of Munich. The corpus contains annotations of transcription factors (TFs) and genes or proteins (in general GGPs, for Gene or Gene Product) and of their interactions (typically "TF transcribes gene"). The TF and GGP entities are annotated with offsets, and both are normalized to Entrez Gene and UniProtKB. Annotation is semi-automatic; for manual annotation we use the web editor tagtog. The documents considered are PubMed abstracts. At the time of writing, we have annotated 120 abstracts, and we plan to annotate at least 200 PubMed abstracts in total by December 2015. Our goal for the BLAH2 hackathon is to publicly release the corpus and store it on PubAnnotation. We will take special care with often overlooked details such as offset alignment and identifiable database normalizations. Some team members already participated in BLAH1 and released the LocText corpus to PubAnnotation, so we expect the conversion to be achievable within the hackathon week.
Because the corpus is still under development, the annotation guidelines are gradually being adapted. We document the changes and the latest version on an open GitHub wiki. The following is a summary of the annotation guidelines:
- Document selection: SwissProt is filtered for proteins from Homo sapiens that have the GO term GO:0003700 (transcription factor activity, sequence-specific DNA binding). We collect these proteins' citations and further filter them to those that contain the keyword INTERACTION WITH .... Finally, we randomly sample from the resulting set.
- Entities annotation: Entities are first tagged automatically using GNormPlus (via its public API). Gene families and protein families are discarded. GNormPlus makes no distinction between gene, protein, or mRNA, so we mark these entities as GGP. The tagged entities are originally normalized to Entrez Gene IDs, and we convert these to UniProt IDs using UniProtKB's mapping API. We then programmatically search these UniProt IDs on SwissProt and, if they are annotated with the GO term GO:0003700, automatically mark them as TF. Next, we upload the entity-annotated documents to the web editor tagtog. Finally, the documents are manually reviewed to 1) verify the automatic conversion of GGP entities to TF and 2) correct small tagging inconsistencies originating in GNormPlus.
- Relationships annotation: as the last step, we manually annotate the TF to GGP relationships.
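The automatic part of the entity annotation step can be illustrated with a minimal sketch. It is not the project's actual code: the `Entity` class, the `type_entities` function, and all identifiers in the example data (`E1`, `U1`, `GO:0000000`, and so on) are hypothetical placeholders. The two dictionaries stand in for the results one would obtain from UniProtKB's mapping API and from a SwissProt lookup, so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# GO term used in the guidelines to recognize transcription factors.
TF_GO_TERM = "GO:0003700"


@dataclass
class Entity:
    text: str                      # mention as it appears in the abstract
    start: int                     # character offset of the mention
    entrez_id: str                 # normalization produced by the tagger
    is_family: bool = False        # gene/protein family mentions are discarded
    uniprot_id: Optional[str] = None
    label: str = "GGP"             # default label before the TF check


def type_entities(
    entities: List[Entity],
    entrez_to_uniprot: Dict[str, str],
    go_terms_by_uniprot: Dict[str, Set[str]],
) -> List[Entity]:
    """Drop families, map Entrez Gene IDs to UniProt IDs, and label TFs."""
    typed = []
    for ent in entities:
        if ent.is_family:
            continue  # families are not part of the corpus
        uniprot_id = entrez_to_uniprot.get(ent.entrez_id)
        go_terms = go_terms_by_uniprot.get(uniprot_id, set())
        label = "TF" if TF_GO_TERM in go_terms else "GGP"
        typed.append(
            Entity(ent.text, ent.start, ent.entrez_id, False, uniprot_id, label)
        )
    return typed


# Hypothetical example data (placeholder IDs, not real database entries).
entities = [
    Entity("geneA", 0, "E1"),
    Entity("geneB", 20, "E2"),
    Entity("familyC", 40, "E3", is_family=True),
]
entrez_to_uniprot = {"E1": "U1", "E2": "U2"}
go_terms_by_uniprot = {"U1": {TF_GO_TERM}, "U2": {"GO:0000000"}}

typed = type_entities(entities, entrez_to_uniprot, go_terms_by_uniprot)
for ent in typed:
    print(ent.text, ent.start, ent.uniprot_id, ent.label)
```

In this sketch an entity without a UniProt mapping simply stays a GGP. Whatever the automatic labels are, they remain subject to the manual review on tagtog described above.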
Note that entity annotations are primarily done automatically (plus manually reviewed) whereas relationship annotations are done completely manually. An annotated document sample visualized on tagtog is shown in the following figure.
Note: the license policy for the corpus is still undecided. However, we will likely release it under a Creative Commons license or another open license such as the MIT license.
- Juan Miguel Cejuela, Peter McQuilton, Laura Ponting, Steven J. Marygold, Raymund Stefancsik, Gillian H. Millburn, Burkhard Rost, and the FlyBase Consortium -- tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles -- Database, 2014
- Tatyana Goldberg, Shrikant Vinchurkar, Juan Miguel Cejuela, Lars Juhl Jensen and Burkhard Rost -- Linked annotations: a middle ground for manual curation of biomedical databases and text corpora -- BMC Proceedings, 2015
- Wei C-H, Kao H-Y, Lu Z -- GNormPlus: An Integrative Approach for Tagging Gene, Gene Family and Protein Domain -- BioMed Research International Journal, Text Mining for Translational Bioinformatics special issue, in press, 2015
