.. main file for lpedit documentation Database ====================== The schema for the database. .. figure:: dbschema.png :scale: 99% :align: center :alt: database schema :figclass: align-center The main uses of the database are documented here: * :doc:`database cookbook ` * :doc:`annotation fetching ` Setting up your database ----------------------------- There are 4 step that you need to carry out in order to have a working database. 1. Install `PostgreSQL `_ and create an empty database 2. Modify the `htsint` config file 3. Fetch all necessary data files 4. Populate the datbase (1) PostgreSQL setup ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Install `PostgreSQL `_ on you system. For example on Ubuntu .. code-block:: bash ~$ sudo apt-get install postgresql Then you need to create a database. Here we do this with a database named `htsintegrate`, a user named `dbuser` and a password `somepassword`. You should change these parameters to fit your system. .. code-block:: bash ~$ sudo su - postgres ~$ psql -U postgres CREATE USER dbuser WITH ENCRYPTED PASSWORD 'somepassword'; CREATE DATABASE htsintegrate WITH OWNER dbuser; \q Visit the `PostgreSQL documentation `_ to learn more. (2) Modify the config file ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Set the database name, user, password and port >>> from htsint import Configure >>> config = Configure() >>> config.log['dbuser'] = 'dbuser' >>> config.log['dbname'] = 'htsintegrate' >>> config.log['dbpass'] = 'somepassword' >>> config.save() The other variables than can be modified are: >>> print(config.log.keys()) ['dbport', 'taxa', 'dbhost', 'dbpass', 'dbuser', 'blast', 'data', 'dbname'] Check that all taxa that you would like annotation information for are present. You can print the default taxa with >>> print(config.log['taxa']) The default taxa are: .. code-block:: none 3702, Arabidopsis thaliana 4932, Saccharomyces cerevisiae 5476, Candida albicans 7227, Drosophila melanogaster 7955, Danio rerio 8355, Xenopus laevis 8364, Xenopus (Silurana) tropicalis 9031, Gallus gallus 9606, Homo sapiens 10090, Mus musculus 10566, Human papillomavirus 10116, Rattus norvegicus 28377, Anolis carolinensis If you would like to add, for example, *Solanum lycopersicum* then use typical list syntax. You need to save any changes you make. >>> config.log['taxa'].append('4081') >>> config.save() There is a good chance you will want to specify the directory where all the downloaded data is stored. This can be done with any valid full path. The default is >>> print(config.log['data']) '/usr/local/share/htsint' Alternatively, it is possible to edit this file directly. To locate the directory where it is stored you may type the following. >>> import os >>> os.path.join(os.path.expanduser('~'),".htsint") '/home/adam/.htsint' The dbport (default '5432') and dbhost (default 'localhost') may also be configured. .. note:: ``htsint`` will only populate annotation information for taxa in the *taxa* variable so make sure all species are present **before** database population. (3) Fetch the necessary data files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The rest of the process is automated assuming you have created your PostgreSQL database and modified you config file. >>> from htsint.database import DatabaseFetch >>> fetch = DatabaseFetch() >>> fetch.run() This class only currently works under Linux/OSX operating systems. For other systems the following files could be downloaded by hand and placed in the 'data' directory. * `go.obo `_ * `taxdump.tar.gz `_ * `gene_info.gz `_ * `gene2go.gz `_ * `gene_association.goa_uniprot.gz `_ * `idmapping.dat.gz `_ * `uniprot_sprot.fasta.gz `_ The fetching can take several hours depending on the speed of your connection. The compressed files total less than 15GB, but be aware that the uncompressed versions will take up over 100GB of space. If space is an issue all files may be erased except ``uniprot_sprot.fasta.*`` and ``go.obo`` as the former is used for BLAST and the latter is not stored directly in the database and is used as part of most analysis pipelines. A logfile is produced and stored in your data directory. (4) Populate the database ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Finally, the database can be populated with the following class. >>> from htsint.database import DatabaseCreate >>> db = DatabaseCreate() >>> db.run() A typical database will take a little over an hour to populate. A logfile is produced and stored in your data directory. A summary can be produced at any time using which will produce a similar output. >>> from htsint.database import print_db_summary >>> print_db_summary() .. code-block:: none DATABASE - htsintegrate - SUMMARY There are 1262260 entries in the taxa table There are 681732 entries in the genes table There are 777608 entries in the uniprot table There are 42627 entries in the go_terms table There are 7463568 entries in the go_annotations table Additional Notes ----------------- What exactly is stored in the database? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * All taxa from `NCBI taxonomy `_ * Gene, UniProt and GO annotation information for only the specified taxa * All information about GO terms Database updating ^^^^^^^^^^^^^^^^^^^^^^ Because of the challenges that can arise through naming conflicts when updating NCBI and UniProt data it is recommended that you run the fetch and create steps again, which will create a clean updated version. The fetch step will check if a current file is the newest and only download a new one if necessary. Database portability ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You may want to copy your database to another computer instead of waiting for the db to populate. This can be done as follows. To create a file that may be transferred to another computer .. code-block:: bash ~$ pg_dump -h localhost -U dbuser dbname > htsint.sql To add the database to another server .. code-block:: bash ~$ sudo su - postgres ~$ psql -U postgres CREATE USER dbuser WITH ENCRYPTED PASSWORD 'somepassword'; CREATE DATABASE newdbname WITH OWNER dbuser; \q ~$ psql newdbname < htsint.sql See the :doc:`database cookbook ` for more information on getting started with the database.