<div data-align="center">
<img src="https://github.com/caufieldjh/awesome-bioie/blob/main/images/abie_head.png" alt="Awesome BioIE Logo"/>
<br>
<a href="https://awesome.re">
<img src="https://awesome.re/badge-flat2.svg" alt="Awesome">
</a>
<br>
How to extract information from unstructured biomedical data and text.
<br>
</div>
<p>What is BioIE? It includes any effort to extract structured
information from <em>unstructured</em> (or at least inconsistently
structured) biological, clinical, or other biomedical data. The data
source is often some collection of text documents written in technical
language. If the resulting information is verifiable and consistent
across sources, we may then consider it <em>knowledge</em>. Extracting
information and producing knowledge from bio data requires adapting
methods developed for other types of unstructured data.</p>
<p>BioIE has undergone massive changes since the introduction of
language models like BERT and the more recently created Large Language
Models (LLMs; e.g., GPT-3/4, Llama 2/3, and Gemini).</p>
<p>Resources included here are preferentially those available at no
monetary cost and with limited license requirements. Methods and datasets
should be publicly accessible and actively maintained.</p>
<p>See also <a
href="https://github.com/keon/awesome-nlp">awesome-nlp</a>, <a
href="https://github.com/raivivek/awesome-biology">awesome-biology</a>
and <a
href="https://github.com/danielecook/Awesome-Bioinformatics">Awesome-Bioinformatics</a>.</p>
<p><em>Please read the <a href="contributing.md">contribution
guidelines</a> before contributing. Please add your favourite resource
by raising a <a
href="https://github.com/caufieldjh/awesome-bioie/pulls">pull
request</a>.</em></p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#research-overviews">Research Overviews</a></li>
<li><a href="#groups-active-in-the-field">Groups Active in the
Field</a></li>
<li><a href="#organizations">Organizations</a></li>
<li><a href="#journals-and-events">Journals and Events</a>
<ul>
<li><a href="#journals">Journals</a></li>
<li><a href="#conferences-and-other-events">Conferences and Other
Events</a></li>
<li><a href="#challenges">Challenges</a></li>
</ul></li>
<li><a href="#tutorials">Tutorials</a>
<ul>
<li><a href="#llm-guides">LLM Guides</a></li>
<li><a href="#pre-llm-guides-lectures-and-courses">Pre-LLM Guides,
Lectures, and Courses</a></li>
</ul></li>
<li><a href="#code-libraries">Code Libraries</a>
<ul>
<li><a href="#repos-for-specific-datasets">Repos for Specific
Datasets</a></li>
</ul></li>
<li><a href="#tools-platforms-and-services">Tools, Platforms, and
Services</a>
<ul>
<li><a href="#annotation-tools">Annotation Tools</a></li>
</ul></li>
<li><a href="#techniques-and-models">Techniques and Models</a></li>
<li><a href="#datasets">Datasets</a>
<ul>
<li><a href="#biomedical-text-sources">Biomedical Text Sources</a></li>
<li><a href="#annotated-text-data">Annotated Text Data</a></li>
<li><a
href="#protein-protein-interaction-annotated-corpora">Protein-protein
Interaction Annotated Corpora</a></li>
<li><a href="#other-datasets">Other Datasets</a></li>
</ul></li>
<li><a href="#ontologies-and-controlled-vocabularies">Ontologies and
Controlled Vocabularies</a></li>
<li><a href="#data-models">Data Models</a></li>
<li><a href="#credits">Credits</a></li>
</ul>
<h2 id="research-overviews">Research Overviews</h2>
<h3 id="llms-in-biomedical-ie">LLMs in Biomedical IE</h3>
<ul>
<li><a href="http://dx.doi.org/10.1101/2024.04.24.24306315">Large
language models in healthcare: A comprehensive benchmark</a> - a
statistical and human evaluation of sixteen different LLMs applied to
medical language tasks.</li>
<li><a href="https://doi.org/10.1186/s12911-024-02459-6">Assessing the
research landscape and clinical utility of large language models: a
scoping review</a> - a high-level review of LLM applications in medicine
as of March 2024.</li>
<li><a href="https://doi.org/10.1016/s2589-7500(24)00061-x">Ethical and
regulatory challenges of large language models in medicine</a> - a
review of ethical issues arising from applications of LLMs in
biomedicine.</li>
<li><a href="http://dx.doi.org/10.1145/3442188.3445922">On the Dangers
of Stochastic Parrots: Can Language Models Be Too Big? 🦜</a> - a
frequently referenced and still relevant work concerning the roles,
applications, and risks of language models.</li>
</ul>
<h3 id="pre-llm-overviews">Pre-LLM Overviews</h3>
<ul>
<li><a
href="https://www.ahajournals.org/doi/full/10.1161/CIRCRESAHA.117.310967">Biomedical
Informatics on the Cloud: A Treasure Hunt for Advancing Cardiovascular
Medicine</a> - An overview of how BioIE and bioinformatics workflows can
be applied to questions in cardiovascular health and medicine
research.</li>
<li><a
href="https://www.sciencedirect.com/science/article/pii/S1532046417302563">Clinical
information extraction applications: A literature review</a> - A review
of clinical IE papers published as of September 2016. From Mayo Clinic
group (see below).</li>
<li><a
href="https://www.sciencedirect.com/science/article/pii/S1532046417301909">Literature
Based Discovery: Models, methods, and trends</a> - A review of
Literature Based Discovery (LBD), or the philosophy that meaningful
connections may be found between seemingly unrelated scientific
literature.
<ul>
<li>For some historical context on LBD, see papers by University of
Chicago's Don Swanson and Neil Smalheiser, including <a
href="https://www.jstor.org/stable/4307965"><em>Undiscovered Public
Knowledge</em></a> (paywalled) and <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5771422/"><em>Rediscovering
Don Swanson: the Past, Present and Future of Literature-Based
Discovery</em></a>.</li>
</ul></li>
<li><a href="https://arxiv.org/abs/1702.03222">Mining Electronic Health
Records (EHRs): A Survey</a> - A review of the methods and philosophy
behind mining electronic health records, including using them for
adverse event detection. See Table 2 for a list of relevant papers as of
mid-2017.</li>
<li><a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6250990/">Capturing
the Patient's Perspective: a Review of Advances in Natural Language
Processing of Health-Related Text</a> - A 2017 review of natural
language processing methods applied to information extraction in health
records and social media text. An important note from this review: “One
of the main challenges in the field is the availability of data that can
be shared and which can be used by the community to push the development
of methods based on comparable and reproducible studies”.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="groups-active-in-the-field">Groups Active in the Field</h2>
<ul>
<li><a
href="http://www.childrenshospital.org/research/labs/natural-language-processing-laboratory">Boston
Children's Hospital Natural Language Processing Laboratory</a> - Led by
Dr. Guergana Savova, formerly at Mayo Clinic and the Apache cTAKES
project.</li>
<li><a
href="https://www.brown.edu/academics/medical/about-us/research/centers-institutes-and-programs/biomedical-informatics/">Brown
Center for Biomedical Informatics</a> - Based at Brown University and
directed by Dr. Neil Sarkar, whose research group works on topics in
clinical NLP and IE.</li>
<li><a
href="http://compbio.ucdenver.edu/Hunter_lab/CCP_website/index.html">Center
for Computational Pharmacology NLP Group</a> - based at University of
Colorado, Denver and led by Larry Hunter - <a
href="https://github.com/UCDenver-ccp">see their GitHub repos
here.</a></li>
<li>Groups at U.S. National Institutes of Health (NIH) / National
Library of Medicine (NLM):
<ul>
<li><a
href="https://www.lhncbc.nlm.nih.gov/personnel/dina-demner-fushman">Demner-Fushman
group at NLM</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/research/bionlp/">BioNLP group
at NCBI</a> - Develops improvements to biomedical literature search and
curation (e.g., through PubMed), led by Dr. Zhiyong Lu.</li>
</ul></li>
<li><a href="https://jensenlab.org/">JensenLab</a> - Based at the Novo
Nordisk Foundation Center for Protein Research at the University of
Copenhagen, Denmark.</li>
<li><a href="http://www.nactem.ac.uk/">National Centre for Text Mining
(NaCTeM)</a> - Based at the University of Manchester and led by
Prof. Sophia Ananiadou, NaCTeM is concerned with text mining in general
but has a particular focus on biomedical applications.</li>
<li><a
href="https://www.mayo.edu/research/departments-divisions/department-health-sciences-research/medical-informatics/projects">Mayo
Clinic's clinical natural language processing program</a> - Several
groups at Mayo Clinic have made major contributions to BioIE (for
example, the Apache cTAKES platform) over the past 20 years.</li>
<li><a href="https://monarchinitiative.org/">Monarch Initiative</a> - A
joint effort between groups at Oregon State University, Oregon Health
&amp; Science University, Lawrence Berkeley National Lab, The Jackson
Laboratory, and several others, seeking to “integrate biological
information using semantics, and present it in a novel way, leveraging
phenotypes to bridge the knowledge gap”.</li>
<li><a href="https://turkunlp.org/">TurkuNLP</a> - Based at the
University of Turku and concerned with NLP in general with a focus on
BioNLP and clinical applications.</li>
<li><a href="https://sbmi.uth.edu/nlp/">UTHealth Houston Biomedical
Natural Language Processing Lab</a> - Based in the University of Texas
Health Science Center at Houston, School of Biomedical Informatics and
led by Dr. Hua Xu.</li>
<li><a href="https://nlp.cs.vcu.edu/">VCU Natural Language Processing
Lab</a> - Based at Virginia Commonwealth University and led by
Dr. Bridget McInnes.</li>
<li><a href="http://zaklab.org">Zaklab</a> - Group led by Dr. Isaac
Kohane at Harvard Medical School's Department of Biomedical Informatics
(Dr. Kohane is also a steward of the n2c2 (formerly i2b2) datasets - see
<a href="#datasets">Datasets</a> below).</li>
<li><a href="https://www.dbmi.columbia.edu/">Columbia University
Department of Biomedical Informatics</a> - Led by Drs. George Hripcsak
and Noémie Elhadad.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="organizations">Organizations</h2>
<ul>
<li><a href="https://www.amia.org/">AMIA</a> - Many—but certainly not
all—individuals studying biomedical informatics are members of the
American Medical Informatics Association. AMIA publishes a journal,
JAMIA (see below).</li>
<li><a href="https://imia-medinfo.org/">IMIA</a> - The International
Medical Informatics Association. Publishes the IMIA Yearbook of Medical
Informatics.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="journals-and-events">Journals and Events</h2>
<p>The interdisciplinary nature of BioIE means researchers in this space
may share their findings and tools in a variety of ways. They may
publish papers in journals, as is common in the biomedical and life
sciences. They may publish conference papers and, upon acceptance, give
a poster and/or oral presentation at an event; this is common practice
in computer science and engineering fields. Conference papers are often
published in collections of proceedings. Preprint publication is an
increasingly popular and institutionally-accepted way to publish
findings as well. Surrounding these formal, written products are the
ideas of <a href="https://en.wikipedia.org/wiki/Open_science">open
science</a>, open data, and open source: the code, data, and software
BioIE researchers develop are valuable resources to the community.</p>
<h3 id="journals">Journals</h3>
<p>For preprints, try <a href="https://arxiv.org">arXiv</a>, especially
the subjects Computation and Language (cs.CL) and Information Retrieval
(cs.IR); <a href="https://www.biorxiv.org/">bioRxiv</a>; or <a
href="https://www.medrxiv.org/">medRxiv</a>, especially the Health
Informatics subject area.</p>
<ul>
<li><a href="https://academic.oup.com/database">Database</a> - Its
subtitle is “The Journal of Biological Databases and Curation”. Open
access.</li>
<li><a href="https://academic.oup.com/nar">NAR</a> - Nucleic Acids
Research. Has a broad biomolecular focus but is particularly notable for
its annual database issue.</li>
<li><a href="https://academic.oup.com/jamia">JAMIA</a> - The Journal of
the American Medical Informatics Association. Concerns “articles in the
areas of clinical care, clinical research, translational science,
implementation science, imaging, education, consumer health, public
health, and policy”.</li>
<li><a
href="https://www.sciencedirect.com/journal/journal-of-biomedical-informatics">JBI</a>
- The Journal of Biomedical Informatics. Not open access by default,
though it does have an open-access “X” version.</li>
<li><a href="https://www.nature.com/sdata/">Scientific Data</a> - An
open-access Springer Nature journal publishing “descriptions of
scientifically valuable datasets, and research that advances the sharing
and reuse of scientific data”.</li>
</ul>
<h3 id="conferences-and-other-events">Conferences and Other Events</h3>
<ul>
<li><a href="http://acm-bcb.org/">ACM-BCB</a> - The ACM Conference on
Bioinformatics, Computational Biology, and Health Informatics. Held
annually since 2010.</li>
<li><a href="http://ieeebibm.org/BIBM2019/">BIBM</a> - The IEEE
International Conference on Bioinformatics and Biomedicine.</li>
<li><a href="https://www.iscb.org/about-ismb">ISMB</a> - The
International Conference on Intelligent Systems for Molecular Biology is
an annual conference hosted by the International Society for
Computational Biology since 1993. Much of its focus has concerned
bioinformatics and computational biology without an explicit clinical
focus, though it has included an increasing amount of text mining
content (e.g., the 2019 meeting included a <a
href="http://cosi.iscb.org/wiki/TextMining:Home">full-day special
session on Text Mining for Biology and Healthcare</a>). The meeting is
combined with that of the European Conference on Computational Biology
(ECCB) in odd-numbered years.</li>
<li><a href="https://psb.stanford.edu/">PSB</a> - The Pacific Symposium
on Biocomputing.</li>
</ul>
<h3 id="challenges">Challenges</h3>
<p>Some events in BioIE are organized around formal tasks and challenges
in which groups develop their own computational solutions, given a
dataset.</p>
<ul>
<li><a href="http://bioasq.org/">BioASQ</a> - Challenges on biomedical
semantic indexing and question answering. Challenges and workshops held
annually since 2013.</li>
<li><a href="https://biocreative.bioinformatics.udel.edu/">BioCreAtIvE
workshop</a> - These workshops have been organized since 2004, with
BioCreative VI happening February 2017 and the <a
href="https://sites.google.com/view/ohnlp2018/home">BioCreative/OHNLP
Challenge</a> held in 2018. See <a href="#datasets">Datasets</a>
below.</li>
<li><a href="http://alt.qcri.org/semeval2020/">SemEval workshop</a> -
Tasks and evaluations in computational semantic analysis. Tasks vary by
year but frequently cover scientific and/or biomedical language,
e.g. the <a
href="https://competitions.codalab.org/competitions/19948">SemEval-2019
Task 12 on Toponym Resolution in Scientific Papers</a>.</li>
<li><a
href="https://knowledge-learning.github.io/ehealthkd-2019/">eHealth-KD</a>
- Challenges for encouraging “development of software technologies to
automatically extract a large variety of knowledge from eHealth
documents written in the Spanish Language”. Previously held as part of
<a href="http://www.sepln.org/workshops/tass/">TASS</a>, an annual
workshop for semantic analysis in Spanish.</li>
<li><a
href="https://www.synapse.org/#!Synapse:syn18405991/wiki/589657">EHR
DREAM Challenge</a> - Held along with several other <a
href="http://dreamchallenges.org/">more bioinformatics-focused
challenges</a>, this challenge opened in October 2019 and focuses on
using electronic health record data to predict patient mortality. Uses a
synthetic data set rather than real EHR contents.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="tutorials">Tutorials</h2>
<p>The field changes rapidly enough that tutorials any older than a few
years are missing crucial details. A few more recent educational
resources are listed below. A good foundational understanding of text
mining techniques is very helpful, as is some basic experience with the
Python and/or R languages. The best option may be to learn by doing.</p>
<h3 id="llm-guides">LLM Guides</h3>
<p><em>TBD - watch this space!</em></p>
<h3 id="pre-llm-guides-lectures-and-courses">Pre-LLM Guides, Lectures,
and Courses</h3>
<ul>
<li><a
href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0040020">Getting
Started in Text Mining</a> - A brief introduction to bio-text mining
from Cohen and Hunter. More than ten years old but still quite relevant.
See also an <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1702322/">earlier
paper by the same authors</a>.</li>
<li><a
href="https://link.springer.com/book/10.1007/978-1-4939-0709-0">Biomedical
Literature Mining</a> - A (non-free) volume of Methods in Molecular
Biology from 2014. Chapters cover introductory principles in text
mining, applications in the biological sciences, and potential for use
in clinical or medical safety scenarios.</li>
<li><a
href="https://www.coursera.org/learn/mining-medical-data">Coursera -
Foundations of mining non-structured medical data</a> - About three
hours' worth of video lectures on working with medical data of various
types and structures, including text and image data. Appears fairly
high-level and intended for beginners.</li>
<li><a href="https://jensenlab.org/training/textmining/">JensenLab text
mining exercises</a></li>
<li><a
href="https://www.bits.vib.be/training-list/111-bits/training/previous-trainings/183-text-mining">VIB
text mining and curation training</a> - This training workshop
happened in 2013 but the slides are still online.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="code-libraries">Code Libraries</h2>
<ul>
<li><a href="https://biopython.org/">Biopython</a> - <a
href="http://dx.doi.org/10.1093/bioinformatics/btp163">paper</a> - <a
href="https://github.com/biopython/biopython">code</a> - Python tools
primarily intended for bioinformatics and computational molecular
biology purposes, but also a convenient way to obtain data, including
documents/abstracts from PubMed (see Chapter 9 of the
documentation).</li>
<li><a href="https://github.com/kilicogluh/Bio-SCoRes">Bio-SCoRes</a> -
<a
href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148538">paper</a>
- A framework for biomedical coreference resolution.</li>
<li><a href="https://github.com/NLPatVCU/medaCy">medaCy</a> - A system
for building predictive medical natural language processing models.
Built on the <a href="https://spacy.io/">spaCy</a> framework.</li>
<li><a href="https://github.com/allenai/SciSpaCy">ScispaCy</a> - <a
href="https://arxiv.org/abs/1902.07669">paper</a> - A version of the <a
href="https://spacy.io/">spaCy</a> framework for scientific and
biomedical documents.</li>
<li><a href="https://github.com/ropensci/rentrez">rentrez</a> - R
utilities for accessing NCBI resources, including PubMed.</li>
<li><a
href="https://medium.com/@kormilitzin/med7-clinical-information-extraction-system-in-python-and-spacy-5e6f68ab1c68">Med7</a>
- <a href="https://arxiv.org/abs/2003.01271">paper</a> - <a
href="https://github.com/kormilitzin/med7">code</a> - a Python package
and model (for use with spaCy) for doing NER with medication-related
concepts.</li>
</ul>
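<p>Libraries such as Biopython and rentrez wrap NCBI's E-utilities web
service when retrieving PubMed records. As a rough illustration of what
such a request looks like, the sketch below only builds an
<code>efetch</code> URL with the Python standard library; it makes no
network call. The function name <code>build_efetch_url</code> and the
example PMIDs are hypothetical, chosen for illustration.</p>

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids, rettype="abstract", retmode="text"):
    """Return an efetch URL requesting PubMed records for the given PMIDs.

    build_efetch_url is a hypothetical helper for illustration only;
    it is not part of Biopython or rentrez.
    """
    if not pmids:
        raise ValueError("at least one PMID is required")
    params = {
        "db": "pubmed",                          # query the PubMed database
        "id": ",".join(str(p) for p in pmids),   # comma-separated PMIDs
        "rettype": rettype,                      # record type to return
        "retmode": retmode,                      # output format
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder PMIDs, used only to show the shape of the URL.
    print(build_efetch_url([12345678, 23456789]))
```

<p>In practice a client library is usually the better choice, since it
can also handle request throttling and response parsing.</p>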
<h3 id="repos-for-specific-datasets">Repos for Specific Datasets</h3>
<ul>
<li><a href="https://github.com/MIT-LCP/mimic-code">mimic-code</a> -
Code associated with the MIMIC-III dataset (see below). Includes some
helpful <a
href="https://github.com/MIT-LCP/mimic-code/tree/master/tutorials">tutorials</a>.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="tools-platforms-and-services">Tools, Platforms, and
Services</h2>
<ul>
<li><a href="https://ctakes.apache.org/">cTAKES</a> - <a
href="https://academic.oup.com/jamia/article/17/5/507/830823">paper</a>
- <a href="https://github.com/apache/ctakes">code</a> - A system for
processing the text in electronic medical records. Widely used and open
source.</li>
<li><a href="https://clamp.uth.edu/">CLAMP</a> - <a
href="https://academic.oup.com/jamia/article/25/3/331/4657212">paper</a>
- A natural language processing toolkit intended for use with the text
in clinical reports. Check out their <a
href="https://clamp.uth.edu/clampdemo.php">live demo</a> first to see
what it does. Usable at no cost for academic research.</li>
<li><a href="https://github.com/DeepPhe/DeepPhe-Release">DeepPhe</a> - A
system for processing documents describing cancer presentations. Based
on cTAKES (see above).</li>
<li><a
href="https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/dnorm/">DNorm</a>
- <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3810844/">paper</a> -
A method for disease normalization, i.e., linking mentions of disease
names and acronyms to unique concept identifiers. Downloadable version
includes the NCBI Disease Corpus and BC5CDR (see Annotated Text Data
below).</li>
<li><a href="https://www.ncbi.nlm.nih.gov/research/pubtator/">PubTator
Central</a> - <a
href="https://academic.oup.com/nar/article/47/W1/W587/5494727">paper</a>
- A web platform that identifies five different types of biomedical
concepts in PubMed articles and PubMed Central full texts. The full
annotation sets are downloadable (see <a
href="#annotated-text-data">Annotated Text Data</a> below).</li>
<li><a href="https://github.com/jakelever/pubrunner">Pubrunner</a> - A
framework for running text mining tools on the newest set(s) of
documents from PubMed.</li>
<li><a href="https://github.com/CogStack/CogStack-SemEHR">SemEHR</a> -
<a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6019046/">paper</a> -
an IE infrastructure for electronic health records (EHR). Built on the
<a href="https://github.com/CogStack">CogStack project</a>.</li>
<li><a
href="https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/taggerone/">TaggerOne</a>
- <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018376/">paper</a> -
Performs concept normalization (see also DNorm above). Can be trained
for specific concept types and can perform NER independent of other
normalization functions.</li>
<li><a href="https://github.com/nikolamilosevic86/TabInOut">TabInOut</a>
- <a
href="https://link.springer.com/article/10.1007/s10032-019-00317-0">paper</a>
- a framework for IE from tables in the literature.</li>
</ul>
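<p>Concept normalization, as performed by tools like DNorm and TaggerOne
above, links a text mention to a unique concept identifier. The sketch
below shows only the simplest possible form of that idea, an exact
dictionary lookup after string cleanup; it is not the method those tools
use. The lexicon entries and the <code>CONCEPT:</code> identifiers are
made-up placeholders, not real vocabulary identifiers.</p>

```python
import re

# Hypothetical mini-lexicon mapping surface forms to concept identifiers.
# The identifiers are placeholders, not real MeSH or OMIM IDs.
LEXICON = {
    "myocardial infarction": "CONCEPT:0001",
    "heart attack": "CONCEPT:0001",
    "type 2 diabetes": "CONCEPT:0002",
    "t2d": "CONCEPT:0002",
}


def make_key(mention):
    """Lowercase a mention and collapse punctuation and whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", mention.lower())
    return " ".join(cleaned.split())


def normalize(mention, lexicon=LEXICON):
    """Return the concept identifier for a mention, or None if unknown."""
    return lexicon.get(make_key(mention))


if __name__ == "__main__":
    for text in ["Heart  Attack", "Type-2 diabetes", "influenza"]:
        print(text, normalize(text))
```

<p>Exact lookup misses spelling variants and ambiguous acronyms, which is
one reason trained normalization tools exist.</p>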
<h3 id="annotation-tools">Annotation Tools</h3>
<ul>
<li><a href="https://github.com/weitechen/anafora">Anafora</a> - <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657237/">paper</a> -
An annotation tool with adjudication and progress tracking
features.</li>
<li><a href="https://brat.nlplab.org/">brat</a> - <a
href="https://www.aclweb.org/anthology/E12-2021/">paper</a> - <a
href="https://github.com/nlplab/brat">code</a> - The brat rapid
annotation tool. Supports producing text annotations visually, through
the browser. Not subject specific; appropriate for many annotation
projects. Visualization is based on that of the <a
href="https://github.com/nlplab/stav/"><em>stav</em> tool</a>.</li>
<li><a href="https://ohnlp.github.io/MedTator/">MedTator</a> - <a
href="https://academic.oup.com/bioinformatics/article-abstract/38/6/1776/6496915">paper</a>
- <a href="https://github.com/OHNLP/MedTator">code</a> - An annotation
tool designed to have minimal dependencies.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="techniques-and-models">Techniques and Models</h2>
<h3 id="large-language-models">Large Language Models</h3>
<p><em>TBD - watch this space!</em></p>
<h3 id="bert-models">BERT models</h3>
<ul>
<li><a href="https://github.com/naver/biobert-pretrained">BioBERT</a> -
<a href="https://arxiv.org/abs/1901.08746">paper</a> - <a
href="https://github.com/dmis-lab/biobert">code</a> - A PubMed and
PubMed Central-trained version of the <a
href="https://arxiv.org/abs/1810.04805">BERT language model</a>.</li>
<li>ClinicalBERT - Two language models trained on clinical text have
similar names. Both are BERT models trained on the text of clinical
notes from the MIMIC-III dataset.
<ul>
<li><a href="https://github.com/EmilyAlsentzer/clinicalBERT">Alsentzer
et al Clinical BERT</a> - <a
href="https://www.aclweb.org/anthology/W19-1909/">paper</a></li>
<li><a href="https://github.com/kexinhuang12345/clinicalBERT">Huang et
al ClinicalBERT</a> - <a
href="https://arxiv.org/abs/1904.05342">paper</a></li>
</ul></li>
<li><a href="https://github.com/allenai/scibert">SciBERT</a> - <a
href="https://arxiv.org/abs/1903.10676">paper</a> - A BERT model trained
on &gt;1M papers from the Semantic Scholar database.</li>
<li><a href="https://github.com/ncbi-nlp/bluebert">BlueBERT</a> - <a
href="https://arxiv.org/abs/1906.05474">paper</a> - A BERT model
pre-trained on PubMed text and MIMIC-III notes.</li>
<li><a
href="https://microsoft.github.io/BLURB/models.html">PubMedBERT</a> - <a
href="https://arxiv.org/abs/2007.15779">paper</a> - A BERT model trained
from scratch on PubMed, with versions trained on abstracts+full texts
and on abstracts alone.</li>
</ul>
<h3 id="gpt-2-models">GPT-2 models</h3>
<ul>
<li><a href="https://github.com/microsoft/BioGPT">BioGPT</a> - <a
href="https://doi.org/10.1093/bib/bbac409">paper</a> - A GPT-2 model
pre-trained on 15 million PubMed abstracts, along with fine-tuned
versions for several biomedical tasks.</li>
</ul>
<h3 id="other-models">Other models</h3>
<ul>
<li><a href="https://github.com/zalandoresearch/flair/pull/519">Flair
embeddings from PubMed</a> - A language model available through the
Flair framework and embedding method. Trained over a 5% sample of PubMed
abstracts until 2015, or &gt; 1.2 million abstracts in total.</li>
</ul>
<h3 id="text-embeddings">Text Embeddings</h3>
<ul>
<li><a
href="https://www.sciencedirect.com/science/article/pii/S1532046418301825">This
paper from Hongfang Liu's group at Mayo Clinic</a> demonstrates how text
embeddings trained on biomedical or clinical text can, but don't always,
perform better on biomedical natural language processing tasks. That
being said, pre-trained embeddings may be appropriate for your needs,
especially as training domain-specific embeddings can be computationally
intensive.</li>
<li><a
href="http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts">BioASQword2vec</a>
- <a href="http://bioasq.lip6.fr/info/BioASQword2vec/">paper</a> - Word
embeddings derived from biomedical text (&gt;10 million PubMed
abstracts) using the popular <a
href="https://code.google.com/archive/p/word2vec/">word2vec</a>
tool.</li>
<li><a
href="https://figshare.com/articles/Improving_Biomedical_Word_Embeddings_with_Subword_Information_and_MeSH_Ontology/6882647">BioWordVec</a>
- <a href="https://www.nature.com/articles/s41597-019-0055-0">paper</a>
- <a href="https://github.com/ncbi-nlp/BioWordVec">code</a> - Word
embeddings derived from biomedical text (&gt;27 million PubMed titles
and abstracts), including a subword embedding model based on MeSH.</li>
</ul>
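<p>Word embeddings like those listed above are typically compared with
cosine similarity. The following is a minimal, dependency-free sketch of
that comparison. The three-dimensional vectors are invented toy values,
not taken from any of the listed embedding sets, which use far higher
dimensions.</p>

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Similarity is undefined for a zero vector; report 0.0 by convention.
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings (assumed values).
    toy_vectors = {
        "aspirin": [0.9, 0.1, 0.0],
        "ibuprofen": [0.8, 0.2, 0.1],
        "genome": [0.0, 0.1, 0.9],
    }
    print(cosine_similarity(toy_vectors["aspirin"], toy_vectors["ibuprofen"]))
    print(cosine_similarity(toy_vectors["aspirin"], toy_vectors["genome"]))
```

<p>With real embeddings, the vectors would be loaded from the
distributed model files rather than written by hand.</p>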
<p><a href="#contents">Back to Top</a></p>
<h2 id="datasets">Datasets</h2>
<p>Some of the datasets listed below require a <a
href="https://www.nlm.nih.gov/databases/umls.html#license_request">UMLS
Terminology Services (UTS) account</a> to access. Please note that the
license granted with the UTS account requires users to submit an annual
report about their use of UMLS resources. This is less challenging than
it sounds.</p>
<h3 id="biomedical-text-sources">Biomedical Text Sources</h3>
<p>The following resources contain indexed text documents in the
biomedical sciences.</p>
<ul>
<li><a
href="http://davis.wpi.edu/xmdv/datasets/ohsumed.html">OHSUMED</a> - <a
href="https://dl.acm.org/citation.cfm?id=188557">paper</a> - 348,566
MEDLINE entries (title and sometimes abstract) from between 1987 and
1991. Includes MeSH labels. Primarily of historical significance.</li>
<li><a
href="https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/">PubMed Central
Open Access Subset</a> - A set of PubMed Central articles usable under
licenses other than traditional copyright, though the exact licenses
vary by publication and source. Articles are available as PDF and
XML.</li>
<li><a href="https://github.com/allenai/cord19">CORD-19</a> - A corpus of
scholarly manuscripts concerning COVID-19. Articles are primarily from
PubMed Central and preprint servers, though the set also includes
metadata on papers without full-text availability.</li>
</ul>
<h3 id="annotated-text-data">Annotated Text Data</h3>
<ul>
<li><a
href="https://bionlp.nlm.nih.gov/tac2017adversereactions/">SPL-ADR-200db</a>
- <a href="https://www.nature.com/articles/sdata20181">paper</a> - A
pilot dataset containing standardised information, and annotations of
occurrence in text, about ~5,000 known adverse reactions for 200
FDA-approved drugs.</li>
<li><a
href="https://sourceforge.net/projects/biocreative/files/">BioCreAtIvE
1</a> - <a
href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-S1-S1">paper</a>
- 15,000 sentences (10,000 training and 5,000 test) annotated for
protein and gene names. 1,000 full text biomedical research articles
annotated with protein names and Gene Ontology terms.</li>
<li><a
href="https://sourceforge.net/projects/biocreative/files/">BioCreAtIvE
2</a> - <a
href="https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-s2-s1">paper</a>
- 15,000 sentences (10,000 training and 5,000 test, different from the
first corpus) annotated for protein and gene names. 542 abstracts linked
to EntrezGene identifiers. A variety of research articles annotated for
features of protein-protein interactions.</li>
<li><a
href="https://biocreative.bioinformatics.udel.edu/accounts/login/?next=/resources/corpora/biocreative-v-cdr-corpus/">BioCreAtIvE
V CDR Task Corpus (BC5CDR)</a> - <a
href="https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414">paper</a>
- 1,500 articles (title and abstract) published in 2014 or later,
annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical-disease
interactions. Requires registration.</li>
<li><a
href="https://biocreative.bioinformatics.udel.edu/resources/corpora/chemprot-corpus-biocreative-vi/#chemprot-corpus-biocreative-vi:downloads">BioCreative
VI CHEMPROT Corpus</a> - <a
href="https://pdfs.semanticscholar.org/eed7/81f498b563df5a9e8a241c67d63dd1d92ad5.pdf">paper</a>
- &gt;2,400 articles annotated with chemical-protein interactions of a
variety of relation types. Requires registration.</li>
<li><a href="https://github.com/UCDenver-ccp/CRAFT">CRAFT</a> - <a
href="https://link.springer.com/chapter/10.1007/978-94-024-0881-2_53">paper</a>
- 67 full-text biomedical articles annotated in a variety of ways,
including for concepts and coreferences. Now on version 5, including
annotations linking concepts to the MONDO disease ontology.</li>
<li><a
href="https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/">n2c2
(formerly i2b2) Data</a> - The Department of Biomedical Informatics
(DBMI) at Harvard Medical School manages data for the National NLP
Clinical Challenges and the Informatics for Integrating Biology and the
Bedside challenges, which have run since 2006. Registration is required
before access and use. The datasets cover a variety of topics. See the <a
href="https://portal.dbmi.hms.harvard.edu/data-challenges/">list of data
challenges</a> for individual descriptions.</li>
<li><a
href="https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/">NCBI
Disease Corpus</a> - <a
href="https://www.sciencedirect.com/science/article/pii/S1532046413001974">paper</a>
- A corpus of 793 biomedical abstracts annotated with names of diseases
and related concepts from MeSH and <a
href="https://omim.org/">OMIM</a>.</li>
<li><a href="https://www.ncbi.nlm.nih.gov/research/pubtator/">PubTator
Central datasets</a> - <a
href="https://academic.oup.com/nar/article/47/W1/W587/5494727">paper</a>
- Accessible through a RESTful API or FTP download. Includes annotations
for &gt;29 million abstracts and 3 million full-text documents.</li>
<li><a href="https://wsd.nlm.nih.gov/">Word Sense Disambiguation
(WSD)</a> - <a
href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-223">paper</a>
- 203 ambiguous words and 37,888 automatically extracted instances of
their use in biomedical research publications. Requires a UTS
account.</li>
<li><a
href="https://www.nlm.nih.gov/databases/download/CQC.html">Clinical
Questions Collection</a> - Also known as CQC or the Iowa collection.
Several thousand questions posed by physicians during office visits,
along with the associated answers.</li>
<li><a href="http://2013.bionlp-st.org/">BioNLP ST 2013 datasets</a> -
data from six shared tasks, though some may not be easily accessible;
try the CG task set (BioNLP2013CG) for extensive entity and event
annotations.</li>
<li><a href="https://rgai.inf.u-szeged.hu/node/105">BioScope</a> - <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2586758/">paper</a> -
a corpus of sentences from medical and biological documents, annotated
for negation, speculation, and linguistic scope.</li>
<li><a href="https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/">BioRED</a> -
<a href="https://arxiv.org/abs/2204.04263">paper</a> - a set of &gt;6.5K
biomedical relation annotations, plus labels for novel findings.</li>
</ul>
<h3 id="protein-protein-interaction-annotated-corpora">Protein-protein
Interaction Annotated Corpora</h3>
<p>Protein-protein interactions are abbreviated as PPI. The following
sets are available in <a href="http://bioc.sourceforge.net/">BioC
format</a>. The older sets (AIMed, BioInfer, HPRD50, IEPA, and LLL) are
available courtesy of the <a
href="http://corpora.informatik.hu-berlin.de">WBI corpora repository</a>
and were derived from the original sets by a <a
href="http://mars.cs.utu.fi/PPICorpora/">group at Turku
University</a>.</p>
<ul>
<li><a
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/aimed_bioc.xml.zip">AIMed</a>
- <a href="https://www.ncbi.nlm.nih.gov/pubmed/15811782">paper</a> - 225
MEDLINE abstracts annotated for PPI.</li>
<li><a
href="http://bioc.sourceforge.net/BioC-BioGRID.html">BioC-BioGRID</a> -
<a
href="https://academic.oup.com/database/article/doi/10.1093/database/baw147/2884890">paper</a>
- 120 full-text articles annotated for PPI and genetic interactions.
Used in the BioCreative V BioC task.</li>
<li><a
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/bioinfer_bioc.xml.zip">BioInfer</a>
- <a
href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50">paper</a>
- 1,100 sentences from biomedical research abstracts annotated for
relationships (including PPI), named entities, and syntactic
dependencies. <a href="http://mars.cs.utu.fi/BioInfer/">Additional
information and download links are here.</a></li>
<li><a
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hprd50_bioc.xml.zip">HPRD50</a>
- <a
href="https://academic.oup.com/bioinformatics/article/23/3/365/236564">paper</a>
- 50 scientific abstracts referenced by the Human Protein Reference
Database, annotated for PPI.</li>
<li><a
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/iepa_bioc.xml.zip">IEPA</a>
- <a
href="http://psb.stanford.edu/psb-online/proceedings/psb02/abstracts/p326.html">paper</a>
- 486 sentences from biomedical research abstracts annotated for pairs
of co-occurring chemicals, including proteins (hence, PPI
annotations).</li>
<li><a
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/lll_bioc.xml.zip">LLL</a>
- <a
href="https://www.semanticscholar.org/paper/Learning-Language-in-Logic-Genic-Interaction-Nedellec/0863a9d71955341b7e1a6a6877d44d4f0bb22671">paper</a>
- 77 sentences from research articles about the bacterium <em>Bacillus
subtilis</em>, annotated for proteingene interactions (so, fairly close
to PPI annotations). <a
href="http://genome.jouy.inra.fr/texte/LLLchallenge/#task1">Additional
information is here.</a></li>
</ul>
<h3 id="other-datasets">Other Datasets</h3>
<ul>
<li><a href="http://cohd.io">Columbia Open Health Data</a> - <a
href="https://www.nature.com/articles/sdata2018273">paper</a> - A
database of prevalence and co-occurrence frequencies of conditions,
drugs, procedures, and patient demographics extracted from electronic
health records. Does not include original record text.</li>
<li><a href="https://ctdbase.org/">Comparative Toxicogenomics
Database</a> - <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323936/">paper</a> -
A database of manually curated associations between chemicals, gene
products, phenotypes, diseases, and environmental exposures. Useful for
assembling ontologies of the related concepts, such as types of
chemicals.</li>
<li><a href="https://mimic.physionet.org/">MIMIC-III</a> - <a
href="https://www.nature.com/articles/sdata201635">paper</a> -
Deidentified health data from ~60,000 intensive care unit admissions.
Requires completion of an online training course (CITI training) and
acceptance of a data use agreement prior to use.</li>
<li><a
href="https://physionet.org/content/mimic-cxr/2.0.0/">MIMIC-CXR</a> -
The MIMIC Chest X-Ray database. Contains more than 377,000 radiographic
images and accompanying free-text radiology reports. As with MIMIC-III,
requires acceptance of a data use agreement.</li>
<li><a
href="https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html">UMLS
Knowledge Sources</a> - <a
href="https://www.ncbi.nlm.nih.gov/books/NBK9676/">reference manual</a>
- A large and comprehensive collection of biomedical terminology and
identifiers, as well as accompanying tools and scripts. Depending on
your purposes, the single file MRCONSO.RRF may be sufficient, as this
file contains unique identifiers and names for all concepts in the UMLS
Metathesaurus. See also the Ontologies and Controlled Vocabularies
section below.</li>
<li><a href="https://mimic-iv.mit.edu/">MIMIC-IV</a> - An update to
MIMIC-IIIs multimodal patient data, now covering more recent years of
admissions, plus a new data structure, emergency department records, and
links to MIMIC-CXR images.</li>
<li><a href="https://eicu-crd.mit.edu/">eICU Collaborative Research
Database</a> - <a
href="https://www.nature.com/articles/sdata2018178">paper</a> - a
database of observations from more than 200,000 intensive care unit
admissions, with consistent structure. Requires registration, training
course completion, and data use agreement.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="ontologies-and-controlled-vocabularies">Ontologies and
Controlled Vocabularies</h2>
<ul>
<li><a href="http://www.disease-ontology.org/">Disease Ontology</a> - <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383880/">paper</a> -
An ontology of human diseases. Has cross-links to MeSH, ICD, NCI
Thesaurus, SNOMED, and OMIM. Public domain. Available on <a
href="https://github.com/DiseaseOntology/HumanDiseaseOntology">GitHub</a>
and on the <a href="http://www.obofoundry.org/ontology/doid.html">OBO
Foundry</a>.</li>
<li><a
href="https://www.nlm.nih.gov/research/umls/rxnorm/index.html">RxNorm</a>
- <a
href="https://academic.oup.com/jamia/article/18/4/441/734170">paper</a>
- Normalized names for clinical drugs and drug packs, with combined
ingredients, strengths, and form, and assigned types from the Semantic
Network (see below). Released monthly.</li>
<li><a
href="https://lexsrv3.nlm.nih.gov/Specialist/Summary/lexicon.html">SPECIALIST
Lexicon</a> - <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2247735/">paper</a> -
A general English lexicon that includes many biomedical terms. Updated
yearly since 1994 and still updated as of 2019. Part of UMLS but does
not require a UTS account to download.</li>
<li><a
href="https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html">UMLS
Metathesaurus</a> - <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC308795/">paper</a> -
Mappings between &gt;3.8 million concepts, 14 million concept names, and
&gt;200 sources of biomedical vocabulary and identifiers. Its big. It
may help to prepare a subset of the Metathesaurus with the <a
href="https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html">MetamorphoSys
installation tool</a>, but we’re still talking about ~30 GB of disk space
required for the 2019 release. <a
href="https://www.ncbi.nlm.nih.gov/books/NBK9684/">See the manual
here</a>. Requires a UTS account.</li>
<li><a href="https://semanticnetwork.nlm.nih.gov/">UMLS Semantic
Network</a> - <a
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2447396/">paper</a> -
Lists of 133 semantic types and 54 semantic relationships covering
biomedical concepts and vocabulary. Is the Metathesaurus too complex for
your needs? Try this. Does not require a UTS account to download.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="data-models">Data Models</h2>
<p>Do you need a <a href="https://en.wikipedia.org/wiki/Data_model">data
model</a>? If you are working with biomedical data, then the answer is
probably “Yes”.</p>
<ul>
<li><a href="https://biolink.github.io/biolink-model/">Biolink</a> - <a
href="https://github.com/biolink/biolink-model">code</a> - A data model
of biological entities. Provided as a <a
href="https://yaml.org/">YAML</a> file.</li>
<li><a href="http://wiki.biouml.org/index.php/BioUML">BioUML</a> - <a
href="https://academic.oup.com/nar/article/47/W1/W225/5498754">paper</a>
- An architecture for biomedical data analysis, integration, and
visualization. Conceptually based on the visual modeling language <a
href="https://www.uml.org/what-is-uml.htm">UML</a>.</li>
<li><a href="https://github.com/OHDSI/CommonDataModel">OMOP Common Data
Model</a> - a standard for observational healthcare data.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="credits">Credits</h2>
<p><a href="./CREDITS.md">Credits</a> for curators and sources.</p>
<h2 id="license">License</h2>
<p><a href="https://creativecommons.org/publicdomain/zero/1.0"><img
src="https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg"
alt="CC0" /></a></p>
<p><a href="./LICENSE">License</a></p>
<p><a href="https://github.com/caufieldjh/awesome-bioie">bioie.md
GitHub</a></p>