@juanmirocks
Last active October 19, 2015 16:48
relna: corpus of Relations of Transcription Factors to Genes or Proteins


Authors: Ashish Baghudana and Juan Miguel Cejuela (juanmi@jmcejuela.com)

We present the relna corpus, currently being prepared at Rostlab, Technical University of Munich. The corpus contains annotations of transcription factors (TFs) and of genes or proteins (collectively GGPs, for Gene or Gene Product), together with their interactions (typically "TF transcribes gene"). The TF and GGP entities are annotated with text offsets and normalized to both Entrez Gene and UniProtKB. Annotation is semi-automatic; for the manual part we use the web editor tagtog. The documents considered are PubMed abstracts. At the time of writing, we have annotated 120 abstracts, and we plan to annotate at least 200 in total by December 2015. Our goal for the BLAH2 hackathon is to publicly release the corpus and store it on PubAnnotation. We will take special care with often-overlooked details such as offset alignment and identifiable database normalizations. Some team members already participated in BLAH1 and released the LocText corpus to PubAnnotation, so we consider a successful conversion achievable within the hackathon week.

Annotation Guidelines

Because the corpus is still under development, the annotation guidelines are gradually being adapted. We document the changes and the latest version on an open GitHub wiki. The following is a summary of the annotation guidelines:

  • Document selection: SwissProt is filtered for Homo sapiens proteins that have the GO term GO:0003700 (transcription factor activity, sequence-specific DNA binding). We collect the citations of these proteins and keep only those that contain the keyword INTERACTION WITH .... Finally, we randomly sample from the resulting set.

  • Entity annotation: Entities are first tagged automatically using GNormPlus (via its public API). Gene families and protein families are discarded. GNormPlus makes no distinction between gene, protein, or mRNA, so we mark all of these entities as GGP. The tagged entities are originally normalized to Entrez Gene IDs; we convert these to UniProt IDs using UniProtKB's mapping API. We then programmatically look up these UniProt IDs in SwissProt and, if they are annotated with the GO term GO:0003700, automatically mark them as TF. Next, we upload the entity-annotated documents to the web editor tagtog. Finally, the documents are manually reviewed to 1) verify the automatic conversion of GGP entities to TF and 2) correct any small tagging inconsistencies introduced by GNormPlus.

  • Relationship annotation: As the last step, we manually annotate the TF-to-GGP relationships.
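
The automatic parts of the guidelines above (document selection and the GGP-to-TF relabelling) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the names `Entity`, `select_documents`, and `mark_transcription_factors`, the citation field `scope`, and the in-memory lookup tables are all hypothetical. In practice the Entrez-to-UniProt mapping and the GO annotations would come from UniProtKB's mapping API and SwissProt, which are replaced here by small toy dictionaries.

```python
import random
from dataclasses import dataclass
from typing import Optional

TF_GO_TERM = "GO:0003700"  # transcription factor activity, sequence-specific DNA binding


@dataclass
class Entity:
    """A tagged mention (hypothetical structure, for illustration only)."""
    start: int                       # character offset of the mention in the abstract
    text: str                        # surface form of the mention
    entrez_id: str                   # normalization produced by the tagger
    uniprot_id: Optional[str] = None
    label: str = "GGP"               # every entity starts as a generic GGP


def select_documents(citations, sample_size, seed=0):
    """Keep citations whose (assumed) 'scope' text mentions INTERACTION WITH,
    then draw a random sample of PubMed IDs from the filtered set."""
    candidates = sorted(
        {c["pmid"] for c in citations if "INTERACTION WITH" in c["scope"].upper()}
    )
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(sample_size, len(candidates)))


def mark_transcription_factors(entities, entrez_to_uniprot, go_terms_by_uniprot):
    """Map Entrez Gene IDs to UniProt IDs and relabel an entity as TF when its
    UniProt entry carries the TF GO term; otherwise it stays GGP."""
    for entity in entities:
        entity.uniprot_id = entrez_to_uniprot.get(entity.entrez_id)
        go_terms = go_terms_by_uniprot.get(entity.uniprot_id, set())
        entity.label = "TF" if TF_GO_TERM in go_terms else "GGP"
    return entities


# Toy lookup tables standing in for the UniProtKB mapping API and SwissProt.
entrez_to_uniprot = {"7157": "P04637", "4193": "Q00987"}
go_terms_by_uniprot = {
    "P04637": {"GO:0003700", "GO:0005515"},
    "Q00987": {"GO:0005515"},
}

entities = mark_transcription_factors(
    [Entity(0, "p53", "7157"), Entity(20, "MDM2", "4193")],
    entrez_to_uniprot,
    go_terms_by_uniprot,
)
labels = [e.label for e in entities]  # ['TF', 'GGP']
```

Entities whose Entrez Gene ID cannot be mapped simply keep the GGP label in this sketch; the manual review step on tagtog remains necessary to verify the automatic labels.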

Note that entity annotations are primarily done automatically (plus manually reviewed) whereas relationship annotations are done completely manually. An annotated document sample visualized on tagtog is shown in the following figure.

[Figure: relna annotations sample on tagtog]

Note: the license policy for the corpus is still undecided. However, we will likely release it under a Creative Commons license or another open license such as the MIT license.

Publications
