<div data-align="center">
<img src="https://github.com/caufieldjh/awesome-bioie/blob/main/images/abie_head.png" alt="Awesome BioIE Logo"/>
<br>
<a href="https://awesome.re">
<img src="https://awesome.re/badge-flat2.svg" alt="Awesome">
</a>
<br>
How to extract information from unstructured biomedical data and text.
<br>
</div>
<p>What is BioIE? It includes any effort to extract structured
information from <em>unstructured</em> (or at least inconsistently
structured) biological, clinical, or other biomedical data. The data
source is often some collection of text documents written in technical
language. If the resulting information is verifiable and consistent
across sources, we may then consider it <em>knowledge</em>. Extracting
information and producing knowledge from bio data requires adapting
methods developed for other types of unstructured data.</p>
<p>BioIE has undergone massive changes since the introduction of
language models like BERT and the more recently created Large Language
Models (LLMs; e.g., GPT-3/4, Llama 2/3, and Gemini).</p>
<p>Resources included here are preferentially those available at no
monetary cost and with limited license requirements. Methods and datasets
should be publicly accessible and actively maintained.</p>
<p>See also <a
href="https://github.com/keon/awesome-nlp">awesome-nlp</a>, <a
href="https://github.com/raivivek/awesome-biology">awesome-biology</a>
and <a
href="https://github.com/danielecook/Awesome-Bioinformatics">Awesome-Bioinformatics</a>.</p>
<p><em>Please read the <a href="contributing.md">contribution
guidelines</a> before contributing. Please add your favourite resource
by raising a <a
href="https://github.com/caufieldjh/awesome-bioie/pulls">pull
request</a>.</em></p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#research-overviews">Research Overviews</a></li>
<li><a href="#groups-active-in-the-field">Groups Active in the
Field</a></li>
<li><a href="#organizations">Organizations</a></li>
<li><a href="#journals-and-events">Journals and Events</a>
<ul>
<li><a href="#journals">Journals</a></li>
<li><a href="#conferences-and-other-events">Conferences and Other
Events</a></li>
<li><a href="#challenges">Challenges</a></li>
</ul></li>
<li><a href="#tutorials">Tutorials</a>
<ul>
<li><a href="#llm-guides">LLM Guides</a></li>
<li><a href="#pre-llm-guides-lectures-and-courses">Pre-LLM Guides,
Lectures, and Courses</a></li>
</ul></li>
<li><a href="#code-libraries">Code Libraries</a>
<ul>
<li><a href="#repos-for-specific-datasets">Repos for Specific
Datasets</a></li>
</ul></li>
<li><a href="#tools-platforms-and-services">Tools, Platforms, and
Services</a>
<ul>
<li><a href="#annotation-tools">Annotation Tools</a></li>
</ul></li>
<li><a href="#techniques-and-models">Techniques and Models</a></li>
<li><a href="#datasets">Datasets</a>
<ul>
<li><a href="#biomedical-text-sources">Biomedical Text Sources</a></li>
<li><a href="#annotated-text-data">Annotated Text Data</a></li>
<li><a
href="#protein-protein-interaction-annotated-corpora">Protein-protein
Interaction Annotated Corpora</a></li>
<li><a href="#other-datasets">Other Datasets</a></li>
</ul></li>
<li><a href="#ontologies-and-controlled-vocabularies">Ontologies and
Controlled Vocabularies</a></li>
<li><a href="#data-models">Data Models</a></li>
<li><a href="#credits">Credits</a></li>
</ul>
<h2 id="research-overviews">Research Overviews</h2>
|
||
<h3 id="llms-in-biomedical-ie">LLMs in Biomedical IE</h3>
|
||
<ul>
|
||
<li><a href="http://dx.doi.org/10.1101/2024.04.24.24306315">Large
|
||
language models in healthcare: A comprehensive benchmark</a> - a
|
||
statistical and human evaluation of sixteen different LLMs applied to
|
||
medical language tasks.</li>
|
||
<li><a href="https://doi.org/10.1186/s12911-024-02459-6">Assessing the
|
||
research landscape and clinical utility of large language models: a
|
||
scoping review</a> - a high-level review of LLM applications in medicine
|
||
as of March 2024.</li>
|
||
<li><a href="https://doi.org/10.1016/s2589-7500(24)00061-x">Ethical and
|
||
regulatory challenges of large language models in medicine</a> - a
|
||
review of ethical issues arising from applications of LLMs in
|
||
biomedicine.</li>
|
||
<li><a href="http://dx.doi.org/10.1145/3442188.3445922">On the Dangers
|
||
of Stochastic Parrots: Can Language Models Be Too Big? 🦜</a> - a
|
||
frequently referenced but still relevant work concerning the roles,
|
||
applications, and risks of language models.</li>
|
||
</ul>
|
||
<h3 id="pre-llm-overviews">Pre-LLM Overviews</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://www.ahajournals.org/doi/full/10.1161/CIRCRESAHA.117.310967">Biomedical
|
||
Informatics on the Cloud: A Treasure Hunt for Advancing Cardiovascular
|
||
Medicine</a> - An overview of how BioIE and bioinformatics workflows can
|
||
be applied to questions in cardiovascular health and medicine
|
||
research.</li>
|
||
<li><a
|
||
href="https://www.sciencedirect.com/science/article/pii/S1532046417302563">Clinical
|
||
information extraction applications: A literature review</a> - A review
of clinical IE papers published as of September 2016. From a Mayo Clinic
group (see below).</li>
|
||
<li><a
|
||
href="https://www.sciencedirect.com/science/article/pii/S1532046417301909">Literature
|
||
Based Discovery: Models, methods, and trends</a> - A review of
Literature Based Discovery (LBD), the idea that meaningful
connections may be found between seemingly unrelated bodies of
scientific literature.
|
||
<ul>
|
||
<li>For some historical context on LBD, see papers by University of
|
||
Chicago’s Don Swanson and Neil Smalheiser, including <a
|
||
href="https://www.jstor.org/stable/4307965"><em>Undiscovered Public
|
||
Knowledge</em></a> (paywalled) and <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5771422/"><em>Rediscovering
|
||
Don Swanson: the Past, Present and Future of Literature-Based
|
||
Discovery</em></a>.</li>
|
||
</ul></li>
|
||
<li><a href="https://arxiv.org/abs/1702.03222">Mining Electronic Health
|
||
Records (EHRs): A Survey</a> - A review of the methods and philosophy
|
||
behind mining electronic health records, including using them for
|
||
adverse event detection. See Table 2 for a list of relevant papers as of
|
||
mid-2017.</li>
|
||
<li><a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6250990/">Capturing
|
||
the Patient’s Perspective: a Review of Advances in Natural Language
|
||
Processing of Health-Related Text</a> - A 2017 review of natural
|
||
language processing methods applied to information extraction in health
|
||
records and social media text. An important note from this review: “One
|
||
of the main challenges in the field is the availability of data that can
|
||
be shared and which can be used by the community to push the development
|
||
of methods based on comparable and reproducible studies”.</li>
|
||
</ul>
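<p>As a rough illustration of the Literature Based Discovery idea
mentioned above, the following Python sketch links terms that never
co-occur directly but share intermediate terms. The function names and
the toy documents are hypothetical, and the term sets are only loosely
modeled on Swanson's well-known fish oil and Raynaud's syndrome example;
real LBD systems use far richer extraction and ranking.</p>

```python
from collections import defaultdict


def build_cooccurrence(documents):
    """Map each term to the set of terms it shares a document with."""
    neighbors = defaultdict(set)
    for terms in documents:
        for term in terms:
            neighbors[term].update(t for t in terms if t != term)
    return neighbors


def abc_candidates(neighbors, a_term):
    """Return {C: bridging B terms} for C terms never seen alongside a_term."""
    direct = neighbors.get(a_term, set())
    candidates = defaultdict(set)
    for b_term in direct:
        for c_term in neighbors.get(b_term, set()):
            # Keep only indirect links: C must not already co-occur with A.
            if c_term != a_term and c_term not in direct:
                candidates[c_term].add(b_term)
    return dict(candidates)


# Toy "documents", each reduced to a set of extracted terms (illustrative only).
docs = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "raynaud syndrome"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "raynaud syndrome"},
]

links = abc_candidates(build_cooccurrence(docs), "fish oil")
print(links)
```

<p>Here the only candidate for "fish oil" is "raynaud syndrome", reached
through both intermediate terms. In practice the candidates would still
need filtering and expert review.</p>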
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="groups-active-in-the-field">Groups Active in the Field</h2>
|
||
<ul>
|
||
<li><a
|
||
href="http://www.childrenshospital.org/research/labs/natural-language-processing-laboratory">Boston
|
||
Children’s Hospital Natural Language Processing Laboratory</a> - Led by
|
||
Dr. Guergana Savova, formerly at Mayo Clinic and the Apache cTAKES
|
||
project.</li>
|
||
<li><a
|
||
href="https://www.brown.edu/academics/medical/about-us/research/centers-institutes-and-programs/biomedical-informatics/">Brown
|
||
Center for Biomedical Informatics</a> - Based at Brown University and
|
||
directed by Dr. Neil Sarkar, whose research group works on topics in
|
||
clinical NLP and IE.</li>
|
||
<li><a
|
||
href="http://compbio.ucdenver.edu/Hunter_lab/CCP_website/index.html">Center
|
||
for Computational Pharmacology NLP Group</a> - based at University of
|
||
Colorado, Denver and led by Larry Hunter - <a
|
||
href="https://github.com/UCDenver-ccp">see their GitHub repos
|
||
here.</a></li>
|
||
<li>Groups at U.S. National Institutes of Health (NIH) / National
|
||
Library of Medicine (NLM):
|
||
<ul>
|
||
<li><a
|
||
href="https://www.lhncbc.nlm.nih.gov/personnel/dina-demner-fushman">Demner-Fushman
|
||
group at NLM</a></li>
|
||
<li><a href="https://www.ncbi.nlm.nih.gov/research/bionlp/">BioNLP group
|
||
at NCBI</a> - Develops improvements to biomedical literature search and
|
||
curation (e.g., through PubMed), led by Dr. Zhiyong Lu.</li>
|
||
</ul></li>
|
||
<li><a href="https://jensenlab.org/">JensenLab</a> - Based at the Novo
|
||
Nordisk Foundation Center for Protein Research at the University of
|
||
Copenhagen, Denmark.</li>
|
||
<li><a href="http://www.nactem.ac.uk/">National Centre for Text Mining
|
||
(NaCTeM)</a> - Based at the University of Manchester and led by
|
||
Prof. Sophia Ananiadou, NaCTeM is concerned with text mining in general
|
||
but has a particular focus on biomedical applications.</li>
|
||
<li><a
|
||
href="https://www.mayo.edu/research/departments-divisions/department-health-sciences-research/medical-informatics/projects">Mayo
|
||
Clinic’s clinical natural language processing program</a> - Several
|
||
groups at Mayo Clinic have made major contributions to BioIE (for
|
||
example, the Apache cTAKES platform) over the past 20 years.</li>
|
||
<li><a href="https://monarchinitiative.org/">Monarch Initiative</a> - A
|
||
joint effort between groups at Oregon State University, Oregon Health
|
||
& Science University, Lawrence Berkeley National Lab, The Jackson
|
||
Laboratory, and several others, seeking to “integrate biological
|
||
information using semantics, and present it in a novel way, leveraging
|
||
phenotypes to bridge the knowledge gap”.</li>
|
||
<li><a href="https://turkunlp.org/">TurkuNLP</a> - Based at the
|
||
University of Turku and concerned with NLP in general with a focus on
|
||
BioNLP and clinical applications.</li>
|
||
<li><a href="https://sbmi.uth.edu/nlp/">UTHealth Houston Biomedical
|
||
Natural Language Processing Lab</a> - Based in the University of Texas
|
||
Health Science Center at Houston, School of Biomedical Informatics and
|
||
led by Dr. Hua Xu.</li>
|
||
<li><a href="https://nlp.cs.vcu.edu/">VCU Natural Language Processing
|
||
Lab</a> - Based at Virginia Commonwealth University and led by
|
||
Dr. Bridget McInnes.</li>
|
||
<li><a href="http://zaklab.org">Zaklab</a> - Group led by Dr. Isaac
|
||
Kohane at Harvard Medical School’s Department of Biomedical Informatics
|
||
(Dr. Kohane is also a steward of the n2c2 (formerly i2b2) datasets - see
|
||
<a href="#datasets">Datasets</a> below).</li>
|
||
<li><a href="https://www.dbmi.columbia.edu/">Columbia University
|
||
Department of Biomedical Informatics</a> - Led by Drs. George Hripcsak
|
||
and Noémie Elhadad.</li>
|
||
</ul>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="organizations">Organizations</h2>
|
||
<ul>
|
||
<li><a href="https://www.amia.org/">AMIA</a> - Many—but certainly not
|
||
all—individuals studying biomedical informatics are members of the
|
||
American Medical Informatics Association. AMIA publishes a journal,
|
||
JAMIA (see below).</li>
|
||
<li><a href="https://imia-medinfo.org/">IMIA</a> - The International
|
||
Medical Informatics Association. Publishes the IMIA Yearbook of Medical
|
||
Informatics.</li>
|
||
</ul>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="journals-and-events">Journals and Events</h2>
|
||
<p>The interdisciplinary nature of BioIE means researchers in this space
|
||
may share their findings and tools in a variety of ways. They may
|
||
publish papers in journals, as is common in the biomedical and life
|
||
sciences. They may publish conference papers and, upon acceptance, give
|
||
a poster and/or oral presentation at an event; this is common practice
|
||
in computer science and engineering fields. Conference papers are often
|
||
published in collections of proceedings. Preprint publication is an
|
||
increasingly popular and institutionally-accepted way to publish
|
||
findings as well. Surrounding these formal, written products are the
|
||
ideas of <a href="https://en.wikipedia.org/wiki/Open_science">open
|
||
science</a>, open data, and open source: the code, data, and software
|
||
BioIE researchers develop are valuable resources to the community.</p>
|
||
<h3 id="journals">Journals</h3>
|
||
<p>For preprints, try <a href="https://arxiv.org">arXiv</a>, especially
|
||
the subjects Computation and Language (cs.CL) and Information Retrieval
|
||
(cs.IR); <a href="https://www.biorxiv.org/">bioRxiv</a>; or <a
|
||
href="https://www.medrxiv.org/">medRxiv</a>, especially the Health
|
||
Informatics subject area.</p>
|
||
<ul>
|
||
<li><a href="https://academic.oup.com/database">Database</a> - Its
|
||
subtitle is “The Journal of Biological Databases and Curation”. Open
|
||
access.</li>
|
||
<li><a href="https://academic.oup.com/nar">NAR</a> - Nucleic Acids
|
||
Research. Has a broad biomolecular focus but is particularly notable for
|
||
its annual database issue.</li>
|
||
<li><a href="https://academic.oup.com/jamia">JAMIA</a> - The Journal of
|
||
the American Medical Informatics Association. Concerns “articles in the
|
||
areas of clinical care, clinical research, translational science,
|
||
implementation science, imaging, education, consumer health, public
|
||
health, and policy”.</li>
|
||
<li><a
|
||
href="https://www.sciencedirect.com/journal/journal-of-biomedical-informatics">JBI</a>
|
||
- The Journal of Biomedical Informatics. Not open access by default,
|
||
though it does have an open-access “X” version.</li>
|
||
<li><a href="https://www.nature.com/sdata/">Scientific Data</a> - An
|
||
open-access Springer Nature journal publishing “descriptions of
|
||
scientifically valuable datasets, and research that advances the sharing
|
||
and reuse of scientific data”.</li>
|
||
</ul>
|
||
<h3 id="conferences-and-other-events">Conferences and Other Events</h3>
|
||
<ul>
|
||
<li><a href="http://acm-bcb.org/">ACM-BCB</a> - The ACM Conference on
|
||
Bioinformatics, Computational Biology, and Health Informatics. Held
|
||
annually since 2010.</li>
|
||
<li><a href="http://ieeebibm.org/BIBM2019/">BIBM</a> - The IEEE
|
||
International Conference on Bioinformatics and Biomedicine.</li>
|
||
<li><a href="https://www.iscb.org/about-ismb">ISMB</a> - The
|
||
International Conference on Intelligent Systems for Molecular Biology is
|
||
an annual conference hosted by the International Society for
|
||
Computational Biology since 1993. Much of its focus has concerned
|
||
bioinformatics and computational biology without an explicit clinical
|
||
focus, though it has included an increasing amount of text mining
|
||
content (e.g., the 2019 meeting included a <a
|
||
href="http://cosi.iscb.org/wiki/TextMining:Home">full-day special
|
||
session on Text Mining for Biology and Healthcare</a>). The meeting is
|
||
combined with that of the European Conference on Computational Biology
|
||
(ECCB) on odd-numbered years.</li>
|
||
<li><a href="https://psb.stanford.edu/">PSB</a> - The Pacific Symposium
|
||
on Biocomputing.</li>
|
||
</ul>
|
||
<h3 id="challenges">Challenges</h3>
|
||
<p>Some events in BioIE are organized around formal tasks and challenges
|
||
in which groups develop their own computational solutions, given a
|
||
dataset.</p>
|
||
<ul>
|
||
<li><a href="http://bioasq.org/">BioASQ</a> - Challenges on biomedical
|
||
semantic indexing and question answering. Challenges and workshops held
|
||
annually since 2013.</li>
|
||
<li><a href="https://biocreative.bioinformatics.udel.edu/">BioCreAtIvE
|
||
workshop</a> - These workshops have been organized since 2004, with
BioCreative VI held in February 2017 and the <a
|
||
href="https://sites.google.com/view/ohnlp2018/home">BioCreative/OHNLP
|
||
Challenge</a> held in 2018. See <a href="#datasets">Datasets</a>
|
||
below.</li>
|
||
<li><a href="http://alt.qcri.org/semeval2020/">SemEval workshop</a> -
|
||
Tasks and evaluations in computational semantic analysis. Tasks vary by
|
||
year but frequently cover scientific and/or biomedical language,
|
||
e.g. the <a
|
||
href="https://competitions.codalab.org/competitions/19948">SemEval-2019
|
||
Task 12 on Toponym Resolution in Scientific Papers</a>.</li>
|
||
<li><a
|
||
href="https://knowledge-learning.github.io/ehealthkd-2019/">eHealth-KD</a>
|
||
- Challenges for encouraging “development of software technologies to
|
||
automatically extract a large variety of knowledge from eHealth
|
||
documents written in the Spanish Language”. Previously held as part of
|
||
<a href="http://www.sepln.org/workshops/tass/">TASS</a>, an annual
|
||
workshop for semantic analysis in Spanish.</li>
|
||
<li><a
|
||
href="https://www.synapse.org/#!Synapse:syn18405991/wiki/589657">EHR
|
||
DREAM Challenge</a> - Held along with several other <a
|
||
href="http://dreamchallenges.org/">more bioinformatics-focused
|
||
challenges</a>, this challenge opened in October 2019 and focuses on
|
||
using electronic health record data to predict patient mortality. Uses a
|
||
synthetic data set rather than real EHR contents.</li>
|
||
</ul>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="tutorials">Tutorials</h2>
|
||
<p>The field changes rapidly enough that tutorials more than a few
years old are likely to be missing crucial details. A few more recent
educational resources are listed below. A good foundational
understanding of text mining techniques is very helpful, as is some
basic experience with Python and/or R. The best option may be to learn
by doing.</p>
|
||
<h3 id="llm-guides">LLM Guides</h3>
|
||
<p><em>TBD - watch this space!</em></p>
|
||
<h3 id="pre-llm-guides-lectures-and-courses">Pre-LLM Guides, Lectures,
|
||
and Courses</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0040020">Getting
|
||
Started in Text Mining</a> - A brief introduction to bio-text mining
|
||
from Cohen and Hunter. More than ten years old but still quite relevant.
|
||
See also an <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1702322/">earlier
|
||
paper by the same authors</a>.</li>
|
||
<li><a
|
||
href="https://link.springer.com/book/10.1007/978-1-4939-0709-0">Biomedical
|
||
Literature Mining</a> - A (non-free) volume of Methods in Molecular
|
||
Biology from 2014. Chapters cover introductory principles in text
|
||
mining, applications in the biological sciences, and potential for use
|
||
in clinical or medical safety scenarios.</li>
|
||
<li><a
|
||
href="https://www.coursera.org/learn/mining-medical-data">Coursera -
|
||
Foundations of mining non-structured medical data</a> - About three
|
||
hours' worth of video lectures on working with medical data of various
|
||
types and structures, including text and image data. Appears fairly
|
||
high-level and intended for beginners.</li>
|
||
<li><a href="https://jensenlab.org/training/textmining/">JensenLab text
|
||
mining exercises</a></li>
|
||
<li><a
|
||
href="https://www.bits.vib.be/training-list/111-bits/training/previous-trainings/183-text-mining">VIB
|
||
text mining and curation training</a> - This training workshop
|
||
happened in 2013 but the slides are still online.</li>
|
||
</ul>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="code-libraries">Code Libraries</h2>
|
||
<ul>
|
||
<li><a href="https://biopython.org/">Biopython</a> - <a
|
||
href="http://dx.doi.org/10.1093/bioinformatics/btp163">paper</a> - <a
|
||
href="https://github.com/biopython/biopython">code</a> - Python tools
|
||
primarily intended for bioinformatics and computational molecular
|
||
biology purposes, but also a convenient way to obtain data, including
|
||
documents/abstracts from PubMed (see Chapter 9 of the
|
||
documentation).</li>
|
||
<li><a href="https://github.com/kilicogluh/Bio-SCoRes">Bio-SCoRes</a> -
|
||
<a
|
||
href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148538">paper</a>
|
||
- A framework for biomedical coreference resolution.</li>
|
||
<li><a href="https://github.com/NLPatVCU/medaCy">medaCy</a> - A system
|
||
for building predictive medical natural language processing models.
|
||
Built on the <a href="https://spacy.io/">spaCy</a> framework.</li>
|
||
<li><a href="https://github.com/allenai/SciSpaCy">ScispaCy</a> - <a
|
||
href="https://arxiv.org/abs/1902.07669">paper</a> - A version of the <a
|
||
href="https://spacy.io/">spaCy</a> framework for scientific and
|
||
biomedical documents.</li>
|
||
<li><a href="https://github.com/ropensci/rentrez">rentrez</a> - R
|
||
utilities for accessing NCBI resources, including PubMed.</li>
|
||
<li><a
|
||
href="https://medium.com/@kormilitzin/med7-clinical-information-extraction-system-in-python-and-spacy-5e6f68ab1c68">Med7</a>
|
||
- <a href="https://arxiv.org/abs/2003.01271">paper</a> - <a
|
||
href="https://github.com/kormilitzin/med7">code</a> - a Python package
|
||
and model (for use with spaCy) for named entity recognition (NER) of medication-related
|
||
concepts.</li>
|
||
</ul>
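<p>Several of the libraries above (for example Biopython's
<code>Bio.Entrez</code> module and rentrez) wrap the NCBI E-utilities
web service. As a minimal sketch of what such a request looks like, the
standard-library Python below only builds an <code>efetch</code> URL for
plain-text PubMed abstracts and does not send it. The function name and
the PMIDs are placeholders chosen for illustration.</p>

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_pubmed_abstract_url(pmids, email=None):
    """Return an efetch URL asking for plain-text abstracts of the given PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    if email:
        # NCBI asks clients to identify themselves; optional in this sketch.
        params["email"] = email
    return f"{EFETCH_URL}?{urlencode(params)}"


# Placeholder PMIDs, not references to specific papers.
url = build_pubmed_abstract_url([12345, 67890])
print(url)
```

<p>The resulting URL can be fetched with any HTTP client. For anything
beyond a quick test, a dedicated library is likely the better choice,
since it can handle details such as rate limits and response
parsing.</p>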
|
||
<h3 id="repos-for-specific-datasets">Repos for Specific Datasets</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/MIT-LCP/mimic-code">mimic-code</a> -
|
||
Code associated with the MIMIC-III dataset (see below). Includes some
|
||
helpful <a
|
||
href="https://github.com/MIT-LCP/mimic-code/tree/master/tutorials">tutorials</a>.</li>
|
||
</ul>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="tools-platforms-and-services">Tools, Platforms, and
|
||
Services</h2>
|
||
<ul>
|
||
<li><a href="https://ctakes.apache.org/">cTAKES</a> - <a
|
||
href="https://academic.oup.com/jamia/article/17/5/507/830823">paper</a>
|
||
- <a href="https://github.com/apache/ctakes">code</a> - A system for
|
||
processing the text in electronic medical records. Widely used and open
|
||
source.</li>
|
||
<li><a href="https://clamp.uth.edu/">CLAMP</a> - <a
|
||
href="https://academic.oup.com/jamia/article/25/3/331/4657212">paper</a>
|
||
- A natural language processing toolkit intended for use with the text
|
||
in clinical reports. Check out their <a
|
||
href="https://clamp.uth.edu/clampdemo.php">live demo</a> first to see
|
||
what it does. Usable at no cost for academic research.</li>
|
||
<li><a href="https://github.com/DeepPhe/DeepPhe-Release">DeepPhe</a> - A
|
||
system for processing documents describing cancer presentations. Based
|
||
on cTAKES (see above).</li>
|
||
<li><a
|
||
href="https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/dnorm/">DNorm</a>
|
||
- <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3810844/">paper</a> -
|
||
A method for disease normalization, i.e., linking mentions of disease
|
||
names and acronyms to unique concept identifiers. Downloadable version
|
||
includes the NCBI Disease Corpus and BC5CDR (see Annotated Text Data
|
||
below).</li>
|
||
<li><a href="https://www.ncbi.nlm.nih.gov/research/pubtator/">PubTator
|
||
Central</a> - <a
|
||
href="https://academic.oup.com/nar/article/47/W1/W587/5494727">paper</a>
|
||
- A web platform that identifies five different types of biomedical
|
||
concepts in PubMed articles and PubMed Central full texts. The full
|
||
annotation sets are downloadable (see <a
|
||
href="#annotated-text-data">Annotated Text Data</a> below).</li>
|
||
<li><a href="https://github.com/jakelever/pubrunner">Pubrunner</a> - A
|
||
framework for running text mining tools on the newest set(s) of
|
||
documents from PubMed.</li>
|
||
<li><a href="https://github.com/CogStack/CogStack-SemEHR">SemEHR</a> -
|
||
<a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6019046/">paper</a> -
|
||
an IE infrastructure for electronic health records (EHR). Built on the
|
||
<a href="https://github.com/CogStack">CogStack project</a>.</li>
|
||
<li><a
|
||
href="https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/taggerone/">TaggerOne</a>
|
||
- <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018376/">paper</a> -
|
||
Performs concept normalization (see also DNorm above). Can be trained
|
||
for specific concept types and can perform NER independent of other
|
||
normalization functions.</li>
|
||
<li><a href="https://github.com/nikolamilosevic86/TabInOut">TabInOut</a>
|
||
- <a
|
||
href="https://link.springer.com/article/10.1007/s10032-019-00317-0">paper</a>
|
||
- a framework for IE from tables in the literature.</li>
|
||
</ul>
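<p>To make the idea of concept normalization concrete (linking mentions
to unique concept identifiers, as described for DNorm and TaggerOne
above), here is a deliberately simple dictionary-lookup sketch in
Python. It is not how DNorm or TaggerOne work internally; the lexicon,
the <code>DEMO:</code> identifiers, and the function names are all
made up for illustration.</p>

```python
import re

# Hypothetical lexicon: surface forms mapped to placeholder concept IDs.
LEXICON = {
    "breast cancer": "DEMO:0001",
    "breast carcinoma": "DEMO:0001",
    "type 2 diabetes": "DEMO:0002",
    "t2dm": "DEMO:0002",
}


def _canonical(text):
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def normalize_mentions(text, lexicon=LEXICON):
    """Return (term, concept_id) pairs for lexicon terms found in the text."""
    padded = f" {_canonical(text)} "
    hits = []
    for term, concept_id in lexicon.items():
        # Padding with spaces restricts matches to whole-token boundaries.
        if f" {_canonical(term)} " in padded:
            hits.append((term, concept_id))
    return hits


mentions = normalize_mentions("Patients with T2DM and a breast-cancer history")
print(mentions)
```

<p>Exact lookup like this misses abbreviations, misspellings, and
paraphrases that are absent from the lexicon, which is one reason the
tools listed above rely on trained models instead.</p>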
|
||
<h3 id="annotation-tools">Annotation Tools</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/weitechen/anafora">Anafora</a> - <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657237/">paper</a> -
|
||
An annotation tool with adjudication and progress tracking
|
||
features.</li>
|
||
<li><a href="https://brat.nlplab.org/">brat</a> - <a
|
||
href="https://www.aclweb.org/anthology/E12-2021/">paper</a> - <a
|
||
href="https://github.com/nlplab/brat">code</a> - The brat rapid
|
||
annotation tool. Supports producing text annotations visually, through
|
||
the browser. Not subject specific; appropriate for many annotation
|
||
projects. Visualization is based on that of the <a
|
||
href="https://github.com/nlplab/stav/"><em>stav</em> tool</a>.</li>
|
||
<li><a href="https://ohnlp.github.io/MedTator/">MedTator</a> - <a
|
||
href="https://academic.oup.com/bioinformatics/article-abstract/38/6/1776/6496915">paper</a>
|
||
- <a href="https://github.com/OHNLP/MedTator">code</a> - An annotation
|
||
tool designed to have minimal dependencies.</li>
|
||
</ul>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="techniques-and-models">Techniques and Models</h2>
|
||
<h3 id="large-language-models">Large Language Models</h3>
|
||
<p><em>TBD - watch this space!</em></p>
|
||
<h3 id="bert-models">BERT models</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/naver/biobert-pretrained">BioBERT</a> -
|
||
<a href="https://arxiv.org/abs/1901.08746">paper</a> - <a
|
||
href="https://github.com/dmis-lab/biobert">code</a> - A PubMed and
|
||
PubMed Central-trained version of the <a
|
||
href="https://arxiv.org/abs/1810.04805">BERT language model</a>.</li>
|
||
<li>ClinicalBERT - Two language models trained on clinical text have
|
||
similar names. Both are BERT models trained on the text of clinical
|
||
notes from the MIMIC-III dataset.
|
||
<ul>
|
||
<li><a href="https://github.com/EmilyAlsentzer/clinicalBERT">Alsentzer
|
||
et al Clinical BERT</a> - <a
|
||
href="https://www.aclweb.org/anthology/W19-1909/">paper</a></li>
|
||
<li><a href="https://github.com/kexinhuang12345/clinicalBERT">Huang et
|
||
al ClinicalBERT</a> - <a
|
||
href="https://arxiv.org/abs/1904.05342">paper</a></li>
|
||
</ul></li>
|
||
<li><a href="https://github.com/allenai/scibert">SciBERT</a> - <a
|
||
href="https://arxiv.org/abs/1903.10676">paper</a> - A BERT model trained
|
||
on >1M papers from the Semantic Scholar database.</li>
|
||
<li><a href="https://github.com/ncbi-nlp/bluebert">BlueBERT</a> - <a
|
||
href="https://arxiv.org/abs/1906.05474">paper</a> - A BERT model
|
||
pre-trained on PubMed text and MIMIC-III notes.</li>
|
||
<li><a
|
||
href="https://microsoft.github.io/BLURB/models.html">PubMedBERT</a> - <a
|
||
href="https://arxiv.org/abs/2007.15779">paper</a> - A BERT model trained
|
||
from scratch on PubMed, with versions trained on abstracts+full texts
|
||
and on abstracts alone.</li>
|
||
</ul>
|
||
<h3 id="gpt-2-models">GPT-2 models</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/microsoft/BioGPT">BioGPT</a> - <a
|
||
href="https://doi.org/10.1093/bib/bbac409">paper</a> - A GPT-2 model
|
||
pre-trained on 15 million PubMed abstracts, along with fine-tuned
|
||
versions for several biomedical tasks.</li>
|
||
</ul>
|
||
<h3 id="other-models">Other models</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/zalandoresearch/flair/pull/519">Flair
|
||
embeddings from PubMed</a> - A language model available through the
|
||
Flair framework and embedding method. Trained over a 5% sample of PubMed
|
||
abstracts until 2015, or > 1.2 million abstracts in total.</li>
|
||
</ul>
|
||
<h3 id="text-embeddings">Text Embeddings</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://www.sciencedirect.com/science/article/pii/S1532046418301825">This
|
||
paper from Hongfang Liu’s group at Mayo Clinic</a> demonstrates how text
|
||
embeddings trained on biomedical or clinical text can, but don’t always,
|
||
perform better on biomedical natural language processing tasks. That
|
||
being said, pre-trained embeddings may be appropriate for your needs,
|
||
especially as training domain-specific embeddings can be computationally
|
||
intensive.</li>
|
||
<li><a
|
||
href="http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts">BioASQword2vec</a>
|
||
- <a href="http://bioasq.lip6.fr/info/BioASQword2vec/">paper</a> - Word
|
||
embeddings derived from biomedical text (>10 million PubMed
|
||
abstracts) using the popular <a
|
||
href="https://code.google.com/archive/p/word2vec/">word2vec</a>
|
||
tool.</li>
|
||
<li><a
|
||
href="https://figshare.com/articles/Improving_Biomedical_Word_Embeddings_with_Subword_Information_and_MeSH_Ontology/6882647">BioWordVec</a>
|
||
- <a href="https://www.nature.com/articles/s41597-019-0055-0">paper</a>
|
||
- <a href="https://github.com/ncbi-nlp/BioWordVec">code</a> - Word
|
||
embeddings derived from biomedical text (>27 million PubMed titles
|
||
and abstracts), including a subword embedding model based on MeSH.</li>
|
||
</ul>
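<p>Word embeddings such as those listed above are typically compared
with cosine similarity. The Python sketch below shows that comparison
using three-dimensional toy vectors invented for this example; real
embeddings have hundreds of dimensions and would be loaded from the
files the projects distribute.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the (other_word, similarity) pair closest to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return max(scores, key=lambda pair: pair[1])


# Toy vectors, invented for illustration only.
toy_vectors = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "femur": [0.0, 0.1, 0.9],
}

nearest = most_similar("aspirin", toy_vectors)
print(nearest[0])
```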
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="datasets">Datasets</h2>
|
||
<p>Some of the datasets listed below require a <a
|
||
href="https://www.nlm.nih.gov/databases/umls.html#license_request">UMLS
|
||
Terminology Services (UTS) account</a> to access. Please note that the
|
||
license granted with the UTS account requires users to submit an annual
|
||
report about their use of UMLS resources. This is less challenging than
|
||
it sounds.</p>
|
||
<h3 id="biomedical-text-sources">Biomedical Text Sources</h3>
|
||
<p>The following resources contain indexed text documents in the
biomedical sciences.</p>
<ul>
<li><a
href="http://davis.wpi.edu/xmdv/datasets/ohsumed.html">OHSUMED</a> - <a
href="https://dl.acm.org/citation.cfm?id=188557">paper</a> - 348,566
MEDLINE entries (title and sometimes abstract) from between 1987 and
1991. Includes MeSH labels. Primarily of historical significance.</li>
<li><a
href="https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/">PubMed Central
Open Access Subset</a> - A set of PubMed Central articles usable under
licenses other than traditional copyright, though the exact licenses
vary by publication and source. Articles are available as PDF and
XML.</li>
<li><a href="https://github.com/allenai/cord19">CORD-19</a> - A corpus of
scholarly manuscripts concerning COVID-19. Articles are primarily from
PubMed Central and preprint servers, though the set also includes
metadata on papers without full-text availability.</li>
</ul>
|
||
<h3 id="annotated-text-data">Annotated Text Data</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://bionlp.nlm.nih.gov/tac2017adversereactions/">SPL-ADR-200db</a>
|
||
- <a href="https://www.nature.com/articles/sdata20181">paper</a> - A
|
||
pilot dataset containing standardised information on ~5,000 known
adverse reactions for 200 FDA-approved drugs, along with annotations of
their occurrence in text.</li>
|
||
<li><a
|
||
href="https://sourceforge.net/projects/biocreative/files/">BioCreAtIvE
|
||
1</a> - <a
|
||
href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-S1-S1">paper</a>
|
||
- 15,000 sentences (10,000 training and 5,000 test) annotated for
|
||
protein and gene names. 1,000 full text biomedical research articles
|
||
annotated with protein names and Gene Ontology terms.</li>
|
||
<li><a
|
||
href="https://sourceforge.net/projects/biocreative/files/">BioCreAtIvE
|
||
2</a> - <a
|
||
href="https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-s2-s1">paper</a>
|
||
- 15,000 sentences (10,000 training and 5,000 test, different from the
|
||
first corpus) annotated for protein and gene names. 542 abstracts linked
|
||
to EntrezGene identifiers. A variety of research articles annotated for
|
||
features of protein–protein interactions.</li>
|
||
<li><a
|
||
href="https://biocreative.bioinformatics.udel.edu/accounts/login/?next=/resources/corpora/biocreative-v-cdr-corpus/">BioCreAtIvE
|
||
V CDR Task Corpus (BC5CDR)</a> - <a
|
||
href="https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414">paper</a>
|
||
- 1,500 articles (title and abstract) published in 2014 or later,
|
||
annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical–disease
|
||
interactions. Requires registration.</li>
|
||
<li><a
|
||
href="https://biocreative.bioinformatics.udel.edu/resources/corpora/chemprot-corpus-biocreative-vi/#chemprot-corpus-biocreative-vi:downloads">BioCreative
|
||
VI CHEMPROT Corpus</a> - <a
|
||
href="https://pdfs.semanticscholar.org/eed7/81f498b563df5a9e8a241c67d63dd1d92ad5.pdf">paper</a>
|
||
- >2,400 articles annotated with chemical-protein interactions of a
|
||
variety of relation types. Requires registration.</li>
|
||
<li><a href="https://github.com/UCDenver-ccp/CRAFT">CRAFT</a> - <a
|
||
href="https://link.springer.com/chapter/10.1007/978-94-024-0881-2_53">paper</a>
|
||
- 67 full-text biomedical articles annotated in a variety of ways,
|
||
including for concepts and coreferences. Now on version 5, including
|
||
annotations linking concepts to the MONDO disease ontology.</li>
|
||
<li><a
|
||
href="https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/">n2c2
|
||
(formerly i2b2) Data</a> - The Department of Biomedical Informatics
|
||
(DBMI) at Harvard Medical School manages data for the National NLP
|
||
Clinical Challenges and the Informatics for Integrating Biology and the
|
||
Bedside challenges running since 2006. They require registration before
|
||
access and use. Datasets include a variety of topics. See the <a
|
||
href="https://portal.dbmi.hms.harvard.edu/data-challenges/">list of data
|
||
challenges</a> for individual descriptions.</li>
|
||
<li><a
|
||
href="https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/">NCBI
|
||
Disease Corpus</a> - <a
|
||
href="https://www.sciencedirect.com/science/article/pii/S1532046413001974">paper</a>
|
||
- A corpus of 793 biomedical abstracts annotated with names of diseases
|
||
and related concepts from MeSH and <a
|
||
href="https://omim.org/">OMIM</a>.</li>
|
||
<li><a href="https://www.ncbi.nlm.nih.gov/research/pubtator/">PubTator
|
||
Central datasets</a> - <a
|
||
href="https://academic.oup.com/nar/article/47/W1/W587/5494727">paper</a>
|
||
- Accessible through a RESTful API or FTP download. Includes annotations
|
||
for &gt;29 million abstracts and ~3 million full-text documents.</li>
|
||
<li><a href="https://wsd.nlm.nih.gov/">Word Sense Disambiguation
|
||
(WSD)</a> - <a
|
||
href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-223">paper</a>
|
||
- 203 ambiguous words and 37,888 automatically extracted instances of
|
||
their use in biomedical research publications. Requires UTS
|
||
account.</li>
|
||
<li><a
|
||
href="https://www.nlm.nih.gov/databases/download/CQC.html">Clinical
|
||
Questions Collection</a> - also known as CQC or the Iowa collection,
|
||
these are several thousand questions posed by physicians during office
|
||
visits along with the associated answers.</li>
|
||
<li><a href="http://2013.bionlp-st.org/">BioNLP ST 2013 datasets</a> -
|
||
data from six shared tasks, though some may not be easily accessible;
|
||
try the CG task set (BioNLP2013CG) for extensive entity and event
|
||
annotations.</li>
|
||
<li><a href="https://rgai.inf.u-szeged.hu/node/105">BioScope</a> - <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2586758/">paper</a> -
|
||
a corpus of sentences from medical and biological documents, annotated
|
||
for negation, speculation, and linguistic scope.</li>
|
||
<li><a href="https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/">BioRED</a> -
|
||
<a href="https://arxiv.org/abs/2204.04263">paper</a> - a set of >6.5K
|
||
biomedical relation annotations, plus labels for novel findings.</li>
|
||
</ul>
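<p>Several of the corpora above, including BC5CDR and the PubTator
Central downloads, are commonly distributed in the plain-text PubTator
format. The sketch below assumes that layout: a
<code>PMID|t|title</code> line, a <code>PMID|a|abstract</code> line, then
tab-separated annotation lines of PMID, start offset, end offset,
mention, type, and identifier. The function name
<code>parse_pubtator</code> and the sample record are hypothetical and
only illustrate the format; check the files you download before relying
on it.</p>

```python
from collections import namedtuple

# One entity annotation line from a PubTator-format file.
Annotation = namedtuple("Annotation", "pmid start end mention etype identifier")

# Made-up sample record, used only to illustrate the assumed layout.
SAMPLE = (
    "12345|t|Aspirin and headache\n"
    "12345|a|Aspirin relieved headache in this made-up example.\n"
    "12345\t0\t7\tAspirin\tChemical\tD001241\n"
    "12345\t12\t20\theadache\tDisease\tD006261\n"
)


def _empty_doc():
    return {"title": "", "abstract": "", "annotations": []}


def parse_pubtator(text):
    """Return {pmid: {"title", "abstract", "annotations"}} from PubTator text."""
    docs = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            doc = docs.setdefault(head[0], _empty_doc())
            doc["title" if head[1] == "t" else "abstract"] = head[2]
            continue
        fields = line.split("\t")
        if len(fields) not in (5, 6):
            continue  # e.g. relation lines, which this sketch ignores
        pmid, start, end, mention, etype = fields[:5]
        identifier = fields[5] if len(fields) == 6 else ""
        doc = docs.setdefault(pmid, _empty_doc())
        doc["annotations"].append(
            Annotation(pmid, int(start), int(end), mention, etype, identifier)
        )
    return docs


if __name__ == "__main__":
    for pmid, doc in parse_pubtator(SAMPLE).items():
        for ann in doc["annotations"]:
            print(pmid, ann.etype, ann.mention, ann.identifier)
```

<p>Offsets in this format count characters across the title followed by
the abstract, so a mention in the title can be checked by slicing the
title with <code>start</code> and <code>end</code>.</p>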
|
||
<h3 id="protein-protein-interaction-annotated-corpora">Protein-protein
|
||
Interaction Annotated Corpora</h3>
|
||
<p>Protein-protein interactions are abbreviated as PPI. The following
|
||
sets are available in <a href="http://bioc.sourceforge.net/">BioC
|
||
format</a>. The older sets (AIMed, BioInfer, HPRD50, IEPA, and LLL) are
|
||
available courtesy of the <a
|
||
href="http://corpora.informatik.hu-berlin.de">WBI corpora repository</a>
|
||
and were derived from the original sets by a <a
|
||
href="http://mars.cs.utu.fi/PPICorpora/">group at Turku
|
||
University</a>.</p>
|
||
<ul>
|
||
<li><a
|
||
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/aimed_bioc.xml.zip">AIMed</a>
|
||
- <a href="https://www.ncbi.nlm.nih.gov/pubmed/15811782">paper</a> - 225
|
||
MEDLINE abstracts annotated for PPI.</li>
|
||
<li><a
|
||
href="http://bioc.sourceforge.net/BioC-BioGRID.html">BioC-BioGRID</a> -
|
||
<a
|
||
href="https://academic.oup.com/database/article/doi/10.1093/database/baw147/2884890">paper</a>
|
||
- 120 full text articles annotated for PPI and genetic interactions.
|
||
Used in the BioCreative V BioC task.</li>
|
||
<li><a
|
||
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/bioinfer_bioc.xml.zip">BioInfer</a>
|
||
- <a
|
||
href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50">paper</a>
|
||
- 1,100 sentences from biomedical research abstracts annotated for
|
||
relationships (including PPI), named entities, and syntactic
|
||
dependencies. <a href="http://mars.cs.utu.fi/BioInfer/">Additional
|
||
information and download links are here.</a></li>
|
||
<li><a
|
||
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hprd50_bioc.xml.zip">HPRD50</a>
|
||
- <a
|
||
href="https://academic.oup.com/bioinformatics/article/23/3/365/236564">paper</a>
|
||
- 50 scientific abstracts referenced by the Human Protein Reference
|
||
Database, annotated for PPI.</li>
|
||
<li><a
|
||
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/iepa_bioc.xml.zip">IEPA</a>
|
||
- <a
|
||
href="http://psb.stanford.edu/psb-online/proceedings/psb02/abstracts/p326.html">paper</a>
|
||
- 486 sentences from biomedical research abstracts annotated for pairs
|
||
of co-occurring chemicals, including proteins (hence, PPI
|
||
annotations).</li>
|
||
<li><a
|
||
href="http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/lll_bioc.xml.zip">LLL</a>
|
||
- <a
|
||
href="https://www.semanticscholar.org/paper/Learning-Language-in-Logic-Genic-Interaction-Nedellec/0863a9d71955341b7e1a6a6877d44d4f0bb22671">paper</a>
|
||
- 77 sentences from research articles about the bacterium <em>Bacillus
|
||
subtilis</em>, annotated for protein–gene interactions (so, fairly close
|
||
to PPI annotations). <a
|
||
href="http://genome.jouy.inra.fr/texte/LLLchallenge/#task1">Additional
|
||
information is here.</a></li>
|
||
</ul>
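<p>Because these corpora are distributed as BioC XML, a standard XML
parser is enough to pull out annotated interactions. The following is a
minimal sketch, assuming the usual BioC element names
(<code>document</code>, <code>passage</code>, <code>annotation</code>,
<code>relation</code>, <code>node</code>). The sample document, the
<code>infon</code> values, and the role names are invented for
illustration; the actual keys differ between corpora, so inspect a file
before adapting this.</p>

```python
import xml.etree.ElementTree as ET

# Invented BioC snippet; real corpora use their own infon keys and roles.
SAMPLE_BIOC = """<collection>
  <source>illustrative</source>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>ProtA binds ProtB.</text>
      <annotation id="T1">
        <infon key="type">protein</infon>
        <location offset="0" length="5"/>
        <text>ProtA</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">protein</infon>
        <location offset="12" length="5"/>
        <text>ProtB</text>
      </annotation>
      <relation id="R1">
        <infon key="type">PPI</infon>
        <node refid="T1" role="Arg1"/>
        <node refid="T2" role="Arg2"/>
      </relation>
    </passage>
  </document>
</collection>"""


def extract_relations(xml_text):
    """Return a list of (document id, tuple of mention texts) per relation."""
    root = ET.fromstring(xml_text)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        # Map annotation ids to their surface text within this document.
        mentions = {
            ann.get("id"): ann.findtext("text")
            for ann in document.iter("annotation")
        }
        for relation in document.iter("relation"):
            args = tuple(
                mentions.get(node.get("refid")) for node in relation.iter("node")
            )
            results.append((doc_id, args))
    return results


if __name__ == "__main__":
    for doc_id, args in extract_relations(SAMPLE_BIOC):
        print(doc_id, " - ".join(str(a) for a in args))
```

<p>Resolving <code>refid</code> values per document rather than globally
avoids collisions when different documents reuse annotation ids such as
<code>T1</code>.</p>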
|
||
<h3 id="other-datasets">Other Datasets</h3>
|
||
<ul>
|
||
<li><a href="http://cohd.io">Columbia Open Health Data</a> - <a
|
||
href="https://www.nature.com/articles/sdata2018273">paper</a> - A
|
||
database of prevalence and co-occurrence frequencies of conditions,
|
||
drugs, procedures, and patient demographics extracted from electronic
|
||
health records. Does not include original record text.</li>
|
||
<li><a href="https://ctdbase.org/">Comparative Toxicogenomics
|
||
Database</a> - <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323936/">paper</a> -
|
||
A database of manually curated associations between chemicals, gene
|
||
products, phenotypes, diseases, and environmental exposures. Useful for
|
||
assembling ontologies of the related concepts, such as types of
|
||
chemicals.</li>
|
||
<li><a href="https://mimic.physionet.org/">MIMIC-III</a> - <a
|
||
href="https://www.nature.com/articles/sdata201635">paper</a> -
|
||
Deidentified health data from ~60,000 intensive care unit admissions.
|
||
Requires completion of an online training course (CITI training) and
|
||
acceptance of a data use agreement prior to use.</li>
|
||
<li><a
|
||
href="https://physionet.org/content/mimic-cxr/2.0.0/">MIMIC-CXR</a> -
|
||
The MIMIC Chest X-Ray database. Contains more than 377,000 radiographic
|
||
images and accompanying free-text radiology reports. As with MIMIC-III,
|
||
requires acceptance of a data use agreement.</li>
|
||
<li><a
|
||
href="https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html">UMLS
|
||
Knowledge Sources</a> - <a
|
||
href="https://www.ncbi.nlm.nih.gov/books/NBK9676/">reference manual</a>
|
||
- A large and comprehensive collection of biomedical terminology and
|
||
identifiers, as well as accompanying tools and scripts. Depending on
|
||
your purposes, the single file MRCONSO.RRF may be sufficient, as this
|
||
file contains unique identifiers and names for all concepts in the UMLS
|
||
Metathesaurus. See also the Ontologies and Controlled Vocabularies
|
||
section below.</li>
|
||
<li><a href="https://mimic-iv.mit.edu/">MIMIC-IV</a> - An update to
|
||
MIMIC-III’s multimodal patient data, now covering more recent years of
|
||
admissions, plus a new data structure, emergency department records, and
|
||
links to MIMIC-CXR images.</li>
|
||
<li><a href="https://eicu-crd.mit.edu/">eICU Collaborative Research
|
||
Database</a> - <a
|
||
href="https://www.nature.com/articles/sdata2018178">paper</a> - a
|
||
database of observations from more than 200,000 intensive care unit
|
||
admissions, with consistent structure. Requires registration, training
|
||
course completion, and data use agreement.</li>
|
||
</ul>
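<p>For the UMLS entry above, reading concept identifiers and names from
<code>MRCONSO.RRF</code> takes only a few lines. The file is
pipe-delimited with a trailing delimiter on each line. The sketch below
assumes the standard 18-column layout described in the UMLS reference
manual; the function name <code>iter_mrconso</code> and the sample rows
are hypothetical, and the atom, string, and lexical identifiers in them
are placeholders rather than real UMLS values.</p>

```python
import io

# Column order assumed from the UMLS reference manual for MRCONSO.RRF.
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Placeholder rows for illustration only; not copied from a UMLS release.
SAMPLE_ROWS = (
    "C0004057|ENG|P|L0000001|PF|S0000001|Y|A0000001|||D001241|MSH|MH|D001241|Aspirin|0|N||\n"
    "C0004057|SPA|P|L0000002|PF|S0000002|Y|A0000002|||D001241|MSHSPA|MH|D001241|Aspirina|3|N||\n"
)


def iter_mrconso(handle, language="ENG"):
    """Yield (CUI, name) pairs for rows in the requested language."""
    for line in handle:
        values = line.rstrip("\n").split("|")
        # zip() drops the empty value produced by the trailing delimiter.
        row = dict(zip(MRCONSO_FIELDS, values))
        if row.get("LAT") == language:
            yield row["CUI"], row["STR"]


if __name__ == "__main__":
    # With a real release, pass open("MRCONSO.RRF", encoding="utf-8") instead.
    for cui, name in iter_mrconso(io.StringIO(SAMPLE_ROWS)):
        print(cui, name)
```

<p>Streaming the file line by line keeps memory use flat, which matters
because the full file holds millions of rows.</p>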
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="ontologies-and-controlled-vocabularies">Ontologies and
|
||
Controlled Vocabularies</h2>
|
||
<ul>
|
||
<li><a href="http://www.disease-ontology.org/">Disease Ontology</a> - <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383880/">paper</a> -
|
||
An ontology of human diseases. Has cross-links to MeSH, ICD, NCI
|
||
Thesaurus, SNOMED, and OMIM. Public domain. Available on <a
|
||
href="https://github.com/DiseaseOntology/HumanDiseaseOntology">GitHub</a>
|
||
and on the <a href="http://www.obofoundry.org/ontology/doid.html">OBO
|
||
Foundry</a>.</li>
|
||
<li><a
|
||
href="https://www.nlm.nih.gov/research/umls/rxnorm/index.html">RxNorm</a>
|
||
- <a
|
||
href="https://academic.oup.com/jamia/article/18/4/441/734170">paper</a>
|
||
- Normalized names for clinical drugs and drug packs, with combined
|
||
ingredients, strengths, and form, and assigned types from the Semantic
|
||
Network (see below). Released monthly.</li>
|
||
<li><a
|
||
href="https://lexsrv3.nlm.nih.gov/Specialist/Summary/lexicon.html">SPECIALIST
|
||
Lexicon</a> - <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2247735/">paper</a> -
|
||
A general English lexicon that includes many biomedical terms. Updated
|
||
yearly since 1994 and still updated as of 2019. Part of UMLS but does
|
||
not require UTS account to download.</li>
|
||
<li><a
|
||
href="https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html">UMLS
|
||
Metathesaurus</a> - <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC308795/">paper</a> -
|
||
Mappings between >3.8 million concepts, 14 million concept names, and
|
||
>200 sources of biomedical vocabulary and identifiers. It’s big. It
|
||
may help to prepare a subset of the Metathesaurus with the <a
|
||
href="https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html">MetamorphoSys
|
||
installation tool</a> but we’re still talking about ~30 GB of disk space
|
||
required for the 2019 release. <a
|
||
href="https://www.ncbi.nlm.nih.gov/books/NBK9684/">See the manual
|
||
here</a>. Requires UTS account.</li>
|
||
<li><a href="https://semanticnetwork.nlm.nih.gov/">UMLS Semantic
|
||
Network</a> - <a
|
||
href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2447396/">paper</a> -
|
||
Lists of 133 semantic types and 54 semantic relationships covering
|
||
biomedical concepts and vocabulary. Is the Metathesaurus too complex for
|
||
your needs? Try this. Does not require UTS account to download.</li>
|
||
</ul>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="data-models">Data Models</h2>
|
||
<p>Do you need a <a href="https://en.wikipedia.org/wiki/Data_model">data
|
||
model</a>? If you are working with biomedical data, then the answer is
|
||
probably “Yes”.</p>
|
||
<ul>
|
||
<li><a href="https://biolink.github.io/biolink-model/">Biolink</a> - <a
|
||
href="https://github.com/biolink/biolink-model">code</a> - A data model
|
||
of biological entities. Provided as a <a
|
||
href="https://yaml.org/">YAML</a> file.</li>
|
||
<li><a href="http://wiki.biouml.org/index.php/BioUML">BioUML</a> - <a
|
||
href="https://academic.oup.com/nar/article/47/W1/W225/5498754">paper</a>
|
||
- An architecture for biomedical data analysis, integration, and
|
||
visualization. Conceptually based on the visual modeling language <a
|
||
href="https://www.uml.org/what-is-uml.htm">UML</a>.</li>
|
||
<li><a href="https://github.com/OHDSI/CommonDataModel">OMOP Common Data
|
||
Model</a> - a standard for observational healthcare data.</li>
|
||
</ul>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h2 id="credits">Credits</h2>
|
||
<p><a href="./CREDITS.md">Credits</a> for curators and sources.</p>
|
||
<h2 id="license">License</h2>
|
||
<p><a href="https://creativecommons.org/publicdomain/zero/1.0"><img
|
||
src="https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg"
|
||
alt="CC0" /></a></p>
|
||
<p><a href="./LICENSE">License</a></p>
|
||
<p><a href="https://github.com/caufieldjh/awesome-bioie">bioie.md
|
||
GitHub</a></p>
|