.. pipeline example Pathway example ====================== Lets start with a pathway downloaded from the KEGG database. * :download:`hsa00860.txt` The Python code in this example is available as a script. * :download:`pathway-example.py` First lets define a parsing function for the KEGG formatted pathways file. .. code-block:: python def get_genes(pathway): fid = open(os.path.join(".",pathway+".txt"),'r') isGene = False gene2symbol = {} for linja in fid: linja = linja[:-1] if re.search("^GENE",linja): isGene = True elif re.search("^[A-Z]+",linja): isGene = False if isGene: geneIds = linja.split(";")[0] _geneIds = [re.sub("\s+","",gi) for gi in geneIds.split(" ")] geneIds = [] for gid in _geneIds: if len(gid) > 0 and gid != 'GENE': geneIds.append(gid) if len(geneIds) != 2: print("...%s"%str(geneIds)) raise Exception("Could not parse gene names") ncbiId,symbol = geneIds gene2symbol[ncbiId] = symbol fid.close() return gene2symbol Extract the genes into a list >>> pathwayFile = "hsa00860.txt" >>> pathway = pathwayFile[:-4] >>> geneList = get_genes(pathway) >>> print(geneList.items()[:3]) [('7363', 'UGT2B4'), ('212', 'ALAS2'), ('210', 'ALAD')] Specify a location for the analysis files. >>> gsaDir = os.path.join(".","gsa-path") >>> if not os.path.exists(gsaDir): >>> os.mkdir(gsaDir) Make imports and specify variables >>> from htsint import GeneOntology,TermDistances,GeneDistances >>> useIea = True >>> aspect = "biological_process" >>> _aspect = 'bp' >>> taxaList = ['9606'] >>> go = GeneOntology(taxaList,useIea=useIea,aspect=aspect) >>> termsPath = os.path.join(gsaDir,"go-terms-%s.pickle"%(_aspect)) >>> graphPath = os.path.join(gsaDir,"go-graph-%s.pickle"%(_aspect)) By default ``htsint`` works at the genome level so to get term distances with respect to a subset of genes we need to specify an `accepted` list. The rest of the process is the same as was shown in the gsa example. Create the gene-term dictionaries >>> geneIds = geneList.keys() >>> go.create_dicts(termsPath,accepted=geneIds) >>> gene2go,go2gene = go.load_dicts(termsPath) >>> print("pathway genes with terms: %s/%s"%(len(gene2go.keys()),len(geneIds))) Create the term specific graph >>> G = go.create_gograph(termsPath=termsPath,graphPath=graphPath) >>> print("Term graph for with %s nodes successfully created."%(len(G.nodes()))) Calculate the distances between terms >>> termDistancePath = os.path.join(gsaDir,"term-distances-%s.npy"%(_aspect)) >>> td = TermDistances(termsPath,graphPath) >>> print("total distances to evaluate: %s"%td.totalDistances) >>> td.run_with_multiprocessing(termDistancePath,cpus=4) Map the term distances into gene space >>> geneDistancePath = os.path.join(gsaDir,"gene-distances-%s.csv"%(_aspect)) >>> gd = GeneDistances(termsPath,termDistancePath,outFile=geneDistancePath) >>> gd.run() Run the parameter search for spectral clustering >>> silvalFile = re.sub("\.csv","-scparams-sv.csv",geneDistancePath) >>> clustersFile = re.sub("\.csv","-scparams-cl.csv",geneDistancePath) >>> scps = SpectralClusterParamSearch(geneDistancePath,dtype='distance') >>> scps.run(chunks=5,kRange=range(3,11)) Plot the parameter search >>> psFigureFile = os.path.join(gsaDir,"param-scan-%s.png"%(_aspect)) >>> scr = SpectralClusterResults(silvalFile,clustersFile) >>> scr.plot(figName=psFigureFile) .. figure:: ./gsa-path/param-scan-bp.png :scale: 25% :align: center :alt: parameter scan :figclass: align-center Run spectral clustering with selected parameters >>> k = 3 >>> sigma = 0.43 >>> labelsPath = os.path.join(gsaDir,"sc-labels-%s.csv"%(_aspect)) >>> sc = SpectralCluster(geneDistancePath,dtype='distance') >>> sc.run(k,sk=None,sigma=sigma,verbose=True) >>> sc.save(labelsPath=labelsPath) The genes and their labels are saved in `labelsPath`.