# Awesome Computational Biology [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

A curated collection of databases, software, and papers related to computational biology.

> Computational biology involves the development and application of data-analytical and theoretical methods,
> mathematical modelling and computational simulation techniques to the study of biological, ecological,
> behavioural, and social systems. - [Wikipedia](https://en.wikipedia.org/wiki/Computational_biology)

## Contents

- [Databases](#databases)
  - [scRNA](#scrna)
  - [Compound](#compound)
  - [Pathway](#pathway)
  - [Mass Spectra](#mass-spectra)
  - [Protein](#protein)
  - [Genome](#genome)
  - [Disease](#disease)
  - [Interaction](#interaction)
  - [Clinical Trial](#clinical-trial)
- [API](#api)
- [Preprocess](#preprocess)
- [Machine Learning Tasks and Models](#machine-learning-tasks-and-models)
  - [Drug Response Prediction](#drug-response-prediction)
  - [Drug Repurposing](#drug-repurposing)
  - [Drug Target Interaction](#drug-target-interaction)
  - [Compound Protein Interaction](#compound-protein-interaction)
  - [Pre-trained embedding](#pre-trained-embedding)
  - [LLM for biology](#llm-for-biology)

## Databases

### scRNA

- [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/) - Public functional genomics database.
- [Single Cell Portal](https://singlecell.broadinstitute.org/single_cell) - Public database for single-cell RNA data.
- [Single Cell Expression Atlas](https://www.ebi.ac.uk/gxa/sc/home) - Public database for single-cell RNA data.

### Compound

- [PubChem](https://pubchem.ncbi.nlm.nih.gov/) - One of the largest chemical databases, covering compounds, genes, and proteins.
- [ChEBI](https://www.ebi.ac.uk/chebi/) - Chemical database focused on small chemical compounds.
- [ChEMBL](https://www.ebi.ac.uk/chembl/) - Database of bioactive molecules with drug-like properties.
- [ChemSpider](http://www.chemspider.com/) - Chemical structure database.
- [KEGG COMPOUND](https://www.genome.jp/kegg/compound/) - Collection of small molecules and biopolymers.
- [LIPID MAPS](https://www.lipidmaps.org/databases/lmsd/overview) - Database of lipids.
- [Rhea](https://www.rhea-db.org/) - Database of chemical reactions.
- [Drug Repurposing Hub](https://repo-hub.broadinstitute.org/repurposing#download-data) - Collection of drug repurposing data, including drugs, mechanisms of action (MoA), and targets.
- [Therapeutic Target Database](https://idrblab.net/ttd/full-data-download) - Collection of drug-target, target-disease, and drug-disease datasets.
- [ZINC ligand discovery database](https://zinc.docking.org/) - Free database of commercially available compounds for virtual screening.
- [MoleculeNet](http://moleculenet.ai/) - Benchmark for molecular machine learning.
- [Ames Mutagenicity dataset](https://www.sciencedirect.com/science/article/abs/pii/S0166354220302412) - Dataset for predicting mutagenicity.
- [ADCdb](https://www.antibody-drug.com/) - Database of antibody-drug conjugates.

### Pathway

- [PathwayCommons](https://www.pathwaycommons.org/) - Database of pathways and interactions.
- [KEGG PATHWAY](https://www.genome.jp/kegg/pathway.html) - Collection of manually drawn pathway maps.
- [WikiPathways](https://wikipathways.org/) - Database of biological pathways.

### Mass Spectra

- [MassBank](http://www.massbank.jp/) - Open source databases and tools for mass spectrometry reference spectra.
- [MoNA (MassBank of North America)](https://mona.fiehnlab.ucdavis.edu/) - Meta-database of metabolite mass spectra, metadata, and associated compounds.

### Protein

- [The Human Protein Atlas](https://www.proteinatlas.org/) - One of the largest human protein databases, covering cells, tissues, and organs.
- [Protein Data Bank](https://www.rcsb.org/) - Database of the 3D shapes of proteins, nucleic acids, and complex assemblies.
- [UniProt](https://www.uniprot.org/) - Collection of functional information on proteins.
- [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/api-docs) - Database of 3D protein structures.
- [RCSB Protein Data Bank (PDB)](https://www.rcsb.org/) - Repository of 3D structural data of large biological molecules.
- [Critical Assessment of Structure Prediction (CASP)](https://predictioncenter.org/) - Experiment for advancing methods of predicting protein structure from sequence.
- [Uniclust](https://uniclust.mmseqs.com/) - Collection of clustered protein sequence databases.
- [CATH database](https://www.cathdb.info/) - Hierarchical classification of protein domain structures.

### Genome

- [Human Genome Resources at NCBI](https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml) - Database of imaging, proteomics, transcriptomics, and systems biology data.
- [GenBank](https://www.ncbi.nlm.nih.gov/genbank/) - Genetic sequence database offered by NCBI.
- [UCSC Genome Browser](https://genome.ucsc.edu/) - Genome browser offered by UCSC.
- [cBioPortal](https://www.cbioportal.org/) - Database of cancer genomics that provides a high-level overview of data from many patients.
- [10x Genomics Dataset](https://www.10xgenomics.com/resources/datasets) - Collection of single-cell datasets.
- [The Genotype-Tissue Expression (GTEx)](https://gtexportal.org/home/) - Resource for studying human gene expression and regulation.
- [Dependency Map (DepMap)](https://depmap.org/portal/) - Genome-wide CRISPR-Cas9 screens in cancer cell lines.
- [Catalogue Of Somatic Mutations In Cancer (COSMIC)](https://cancer.sanger.ac.uk/cosmic) - Comprehensive resource for exploring somatic mutations in human cancers.
- [MGnify](https://www.ebi.ac.uk/metagenomics/) - Free resource for archiving, analysis, and browsing of metagenomic and metatranscriptomic data.
- [JASPAR](http://jaspar.genereg.net/) - Open-access database of curated, non-redundant transcription factor binding profiles.
### Disease

- [KEGG DRUG](https://www.genome.jp/kegg/drug/) - Comprehensive drug information resource for approved drugs.
- [DrugBank](https://www.drugbank.com/) - Database of drugs and drug targets maintained by the University of Alberta.

### Interaction

- Drug Gene Interaction
  - [DGIdb](https://www.dgidb.org/) - Database of drug-gene interactions and the druggable genome.
  - [Comparative Toxicogenomics Database](http://ctdbase.org/) - Database of chemical-gene interactions, chemical-disease associations, gene-disease associations, and chemical-phenotype associations.
  - [SNAP](https://snap.stanford.edu/biodata/datasets/10002/10002-ChG-Miner.html#:~:text=Dataset%20information,or%20activation%20of%20the%20drug.) - Dataset of drug-gene interactions.
  - [Therapeutics Data Commons](https://tdcommons.ai/) - Collection of datasets for many tasks, such as drug-target interaction, drug response, and drug-drug interaction.
- Drug (-Cell line) Response
  - [NCI60](https://dtp.cancer.gov/discovery_development/nci-60/) - Database focused on 60 cancer cell lines screened against many drugs.
  - [Genomics of Drug Sensitivity in Cancer (GDSC)](https://www.cancerrxgene.org/) - Database of drug sensitivity covering about 1,000 human cancer cell lines and hundreds of compounds.
  - [Cancer Cell Line Encyclopedia](https://sites.broadinstitute.org/ccle/) - Database of about 1,000 cancer cell lines.
  - [CellMiner Cross Database (CellMinerCDB)](https://discover.nci.nih.gov/cellminercdb/) - Integration of multiple cancer cell line databases.
- Chemical Protein Interaction
  - [STITCH](http://stitch.embl.de/) - Database of chemical-protein interactions.
  - [BindingDB](https://www.bindingdb.org/rwd/bind/index.jsp) - Database of compounds and targets.
  - [PDBBind](http://www.pdbbind.org.cn/) - Database of experimentally measured binding affinity data for biomolecular complexes.
  - [CrossDocked2020](https://arxiv.org/abs/2001.01037) - Large-scale dataset for machine learning in structure-based virtual screening.
- Protein-Protein Interaction
  - [STRING](https://string-db.org/) - Protein-protein interaction networks for several organisms.
  - [BioGRID](https://thebiogrid.org/) - Database of protein, genetic, and chemical interactions.
  - [HIPPIE](http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/) - Human protein-protein interaction database.
- Knowledge Graph
  - [Drug Mechanism Database (DrugMechDB)](https://github.com/SuLab/DrugMechDB/tree/2.0.1) - Database of mechanisms of action linking a drug to a disease.
  - [DRKG](https://github.com/gnn4dr/DRKG) - Biological knowledge graph for drug repurposing.

### Clinical Trial

- [ClinicalTrials.gov](https://clinicaltrials.gov/) - Database of privately and publicly funded clinical studies.
- [ICD10](https://icd.who.int/browse10/2019/en) - International Classification of Diseases, 10th revision.
- [EU Drug Regulating Authorities Clinical Trials DB (EudraCT)](https://eudract.ema.europa.eu/) - European database of clinical trials.
- [MIMIC-IV](https://mimic.mit.edu/) - Freely accessible critical care database.

## API

- [PubMed esearch](https://www.nlm.nih.gov/dataguide/edirect/esearch.html) - API for searching articles in PubMed.

## Preprocess

- [Chemistry Development Kit](https://github.com/cdk/cdk) - Software for cheminformatics and machine learning.
- [RDKit](https://github.com/rdkit/rdkit) - Software for cheminformatics and machine learning.
- [Scanpy](https://scanpy.readthedocs.io/en/stable/) - scRNA analysis library in Python.
- [Seurat](https://satijalab.org/seurat/) - scRNA analysis library in R.

## Machine Learning Tasks and Models

### Drug Response Prediction

- [drGAT](https://github.com/inoue0426/drGAT): A model for drug response prediction that uses an attention mechanism for gene-level explainability.
- [MOFGCN](https://github.com/weiba/MOFGCN/tree/main): GCN + heterogeneous network.
- [DeepDSC](https://ieeexplore-ieee-org.ezp2.lib.umn.edu/stamp/stamp.jsp?tp=&arnumber=8723620&tag=1): Autoencoder + fully connected NN.
- [DGDRP](https://github.com/minwoopak/heteronet): Multi-view embedding NN.
- [DeepAEG](https://github.com/zhejiangzhuque/DeepAEG): GNN embedding + attention.

### Drug Repurposing

- [DeepPurpose](https://github.com/kexinhuang12345/DeepPurpose) - A deep learning library for drug repurposing.

### Drug Target Interaction

- [NeoDTI](https://github.com/FangpingWan/NeoDTI) - A library for drug-target interaction prediction.

### Compound Protein Interaction

- [MCPINN](https://github.com/mhlee0903/multi_channels_PINN) - A library for drug discovery using compound-protein interaction and machine learning.
- [TransformerCPI](https://github.com/lifanchen-simm/transformerCPI) - A library for compound-protein interaction prediction using a Transformer.

### Pre-trained embedding

- [Evolutionary Scale Modeling](https://github.com/facebookresearch/esm) - A library for protein embeddings.
- [ChemBERTa-2](https://github.com/seyonechithrananda/bert-loves-chemistry) - A library for chemical embedding and prediction.

### LLM for biology

- [AI4Chem/ChemLLM-7B-Chat](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat) - LLM for chemistry and molecular science.
- [BioGPT](https://github.com/microsoft/BioGPT) - LLM for biomedical text generation.
- [GeneGPT](https://github.com/ncbi/GeneGPT) - LLM for biomedical information that uses several APIs.
- [GenePT](https://github.com/yiqunchen/GenePT) - Foundation LLM for single-cell data.
- [scPRINT](https://github.com/cantinilab/scPRINT) - Model pretrained on 50M cells to denoise and perform zero imputation of any single-cell RNA-seq profile.

[computationalbiology.md GitHub](https://github.com/inoue0426/awesome-computational-biology)