Annotation fetching

A very common bioinformatics task is that of obtaining current annotations for a given gene or protein. There is a great deal of flexibility in how annotations may be obtained from the database and some strategies are more or less appropriate for a specific application. First, however it is important that you understand how the Gene Ontology (GO) information is stored in the database.

Fetching annotations with a list of NCBI or UniProt identifiers

First, we need to connect to the database.

>>> from htsint.database import fetch_annotations,db_connect
>>> session, engine = db_connect()
>>> conn = engine.connect()

The conn or connection is for usage with the SQLAlchemy core API and is not necessary for basic annotation fetching. If you have a list of NCBI gene identifiers or UniProt identifiers (UniprotAc) then the easiest way to get the annotations is to

>>> annotations = fetch_annotations(['31251'],engine,idType='ncbi',useIea=False,aspect='biological_process')
>>> print(annotations)
>>> print(annotations['31251'][0])
('GO:0042752', 'regulation of circadian rhythm')

Simply use idType=’uniprot’ if the ids are from that namespace. As you can see the results are contained in a dictionary with a key for each NCBI identifier. The results associated with each identifier are a unique list of tuples each containing the GO identifier and GO name. So if you wanted only the GO names simply use a list comprehension.

print([a[1] for a in annotations['31251']])

A results list can be quickly sorted with:

sortby_inplace(annotations['31251'], 1)

Note

For large lists it may be faster to search for all annotations using fetch_taxa_annotations (see below) then to trim the list

Fetching directly with SQLAlchemy

There are two API’s from SQLAlchemy the ORM and the Core.

Using the ORM

>>> subq1 = session.query(GoTerm).\
            filter(GoTerm.aspect==aspect).\
            subquery()
>>> subq2 = session.query(GoAnnotation).\
          filter(GoAnnotation.gene_id==gquery['id']).\
          subquery()
>>> q = session.query(subq1.c.name).\
        filter(subq1.c.id==subq2.c.go_term_id).all()

Using the Core

>>> s = select([GoTerm.name],GoAnnotation.go_term_id==GoTerm.id).\
      where(GoAnnotation.gene_id == gquery['id']).\
      where(GoTerm.aspect==aspect)
>>> _results = conn.execute(s)
>>>  results = [tuple(map(str,r)) for r in _results.fetchall()]

Fetching annotations by taxa

The syntax is similar to individual gene/protein identifiers.

>>> from htsint.database import fetch_taxa_annotations,db_connect
>>> session, engine = db_connect()
>>> conn = engine.connect()
>>> geneAnnots,uniprotAnnots = fetch_taxa_annotations(['7091'],engine,useIea=False,verbose=True)
>>> print(uniprotAnnots['Q9NL89'])
['GO:0045088']