<img src="https://github.com/caufieldjh/awesome-bioie/blob/main/images/abie_head.png" alt="Awesome BioIE Logo"/>
<br>
<a href="https://awesome.re">
<img src="https://awesome.re/badge-flat2.svg" alt="Awesome">
</a>
<br>
How to extract information from unstructured biomedical data and text.
<br>
What is BioIE? It includes any effort to extract structured
information from unstructured (or at least inconsistently
structured) biological, clinical, or other biomedical data. The data
source is often some collection of text documents written in technical
language. If the resulting information is verifiable and consistent
across sources, we may then consider it knowledge. Extracting
information and producing knowledge from bio data requires adapting
methods developed for other types of unstructured data.
BioIE has undergone massive changes since the introduction of
language models like BERT and the more recently created Large Language
Models (LLMs; e.g., GPT-3/4, Llama 2/3, and Gemini).
Resources included here are preferentially those available at no
monetary cost and with limited license requirements. Methods and datasets
should be publicly accessible and actively maintained.
See also awesome-nlp, awesome-biology
and Awesome-Bioinformatics.
Please read the contribution
guidelines before contributing. Please add your favourite resource
by raising a pull
request.
Contents
Research Overviews
LLMs in Biomedical IE
Pre-LLM Overviews
Back to Top
Groups Active in the Field
- Boston
Children's Hospital Natural Language Processing Laboratory - Led by
Dr. Guergana Savova, formerly at Mayo Clinic and the Apache cTAKES
project.
- Brown
Center for Biomedical Informatics - Based at Brown University and
directed by Dr. Neil Sarkar, whose research group works on topics in
clinical NLP and IE.
- Center
for Computational Pharmacology NLP Group - based at University of
Colorado, Denver and led by Larry Hunter - see their GitHub repos
here.
- Groups at U.S. National Institutes of Health (NIH) / National
Library of Medicine (NLM):
- JensenLab - Based at the Novo
Nordisk Foundation Center for Protein Research at the University of
Copenhagen, Denmark.
- National Centre for Text Mining
(NaCTeM) - Based at the University of Manchester and led by
Prof. Sophia Ananiadou, NaCTeM is concerned with text mining in general
but has a particular focus on biomedical applications.
- Mayo
Clinic's clinical natural language processing program - Several
groups at Mayo Clinic have made major contributions to BioIE (for
example, the Apache cTAKES platform) over the past 20 years.
- Monarch Initiative - A
joint effort between groups at Oregon State University, Oregon Health
& Science University, Lawrence Berkeley National Lab, The Jackson
Laboratory, and several others, seeking to "integrate biological
information using semantics, and present it in a novel way, leveraging
phenotypes to bridge the knowledge gap".
- TurkuNLP - Based at the
University of Turku and concerned with NLP in general with a focus on
BioNLP and clinical applications.
- UTHealth Houston Biomedical
Natural Language Processing Lab - Based in the University of Texas
Health Science Center at Houston, School of Biomedical Informatics and
led by Dr. Hua Xu.
- VCU Natural Language Processing
Lab - Based at Virginia Commonwealth University and led by
Dr. Bridget McInnes.
- Zaklab - Group led by Dr. Isaac
Kohane at Harvard Medical School's Department of Biomedical Informatics
(Dr. Kohane is also a steward of the n2c2 (formerly i2b2) datasets - see
Datasets below).
- Columbia University
Department of Biomedical Informatics - Led by Drs. George Hripcsak
and Noémie Elhadad.
Back to Top
Organizations
- AMIA - Many, but certainly not
all, individuals studying biomedical informatics are members of the
American Medical Informatics Association. AMIA publishes a journal,
JAMIA (see below).
- IMIA - The International
Medical Informatics Association. Publishes the IMIA Yearbook of Medical
Informatics.
Back to Top
Journals and Events
The interdisciplinary nature of BioIE means researchers in this space
may share their findings and tools in a variety of ways. They may
publish papers in journals, as is common in the biomedical and life
sciences. They may publish conference papers and, upon acceptance, give
a poster and/or oral presentation at an event; this is common practice
in computer science and engineering fields. Conference papers are often
published in collections of proceedings. Preprint publication is an
increasingly popular and institutionally accepted way to publish
findings as well. Surrounding these formal, written products are the
ideas of open
science, open data, and open source: the code, data, and software
BioIE researchers develop are valuable resources to the community.
Journals
For preprints, try arXiv, especially
the subjects Computation and Language (cs.CL) and Information Retrieval
(cs.IR); bioRxiv; or medRxiv, especially the Health
Informatics subject area.
- Database - Its
subtitle is "The Journal of Biological Databases and Curation". Open
access.
- NAR - Nucleic Acids
Research. Has a broad biomolecular focus but is particularly notable for
its annual database issue.
- JAMIA - The Journal of
the American Medical Informatics Association. Concerns "articles in the
areas of clinical care, clinical research, translational science,
implementation science, imaging, education, consumer health, public
health, and policy".
- JBI
- The Journal of Biomedical Informatics. Not open access by default,
though it does have an open-access "X" version.
- Scientific Data - An
open-access Springer Nature journal publishing "descriptions of
scientifically valuable datasets, and research that advances the sharing
and reuse of scientific data".
Conferences and Other Events
- ACM-BCB - The ACM Conference on
Bioinformatics, Computational Biology, and Health Informatics. Held
annually since 2010.
- BIBM - The IEEE
International Conference on Bioinformatics and Biomedicine.
- ISMB - The
International Conference on Intelligent Systems for Molecular Biology is
an annual conference hosted by the International Society for
Computational Biology since 1993. Much of its focus has concerned
bioinformatics and computational biology without an explicit clinical
focus, though it has included an increasing amount of text mining
content (e.g., the 2019 meeting included a full-day special
session on Text Mining for Biology and Healthcare). The meeting is
combined with that of the European Conference on Computational Biology
(ECCB) in odd-numbered years.
- PSB - The Pacific Symposium
on Biocomputing.
Challenges
Some events in BioIE are organized around formal tasks and challenges
in which groups develop their own computational solutions, given a
dataset.
- BioASQ - Challenges on biomedical
semantic indexing and question answering. Challenges and workshops held
annually since 2013.
- BioCreAtIvE
workshop - These workshops have been organized since 2004, with
BioCreative VI held in February 2017 and the BioCreative/OHNLP
Challenge held in 2018. See Datasets
below.
- SemEval workshop -
Tasks and evaluations in computational semantic analysis. Tasks vary by
year but frequently cover scientific and/or biomedical language,
e.g. the SemEval-2019
Task 12 on Toponym Resolution in Scientific Papers.
- eHealth-KD
- Challenges for encouraging "development of software technologies to
automatically extract a large variety of knowledge from eHealth
documents written in the Spanish Language". Previously held as part of
TASS, an annual
workshop for semantic analysis in Spanish.
- EHR
DREAM Challenge - Held along with several other more bioinformatics-focused
challenges, this challenge opened in October 2019 and focuses on
using electronic health record data to predict patient mortality. Uses a
synthetic data set rather than real EHR contents.
Back to Top
Tutorials
The field changes rapidly enough that tutorials any older than a few
years are missing crucial details. A few more recent educational
resources are listed below. A good foundational understanding of text
mining techniques is very helpful, as is some basic experience with the
Python and/or R languages. The best option may be to learn by doing.
LLM Guides
TBD - watch this space!
Pre-LLM Guides, Lectures,
and Courses
Back to Top
Code Libraries
- Biopython - paper - code - Python tools
primarily intended for bioinformatics and computational molecular
biology purposes, but also a convenient way to obtain data, including
documents/abstracts from PubMed (see Chapter 9 of the
documentation).
- Bio-SCoRes -
paper
- A framework for biomedical coreference resolution.
- medaCy - A system
for building predictive medical natural language processing models.
Built on the spaCy framework.
- ScispaCy - paper - A version of the spaCy framework for scientific and
biomedical documents.
- rentrez - R
utilities for accessing NCBI resources, including PubMed.
- Med7
- paper - code - a Python package
and model (for use with spaCy) for doing NER with medication-related
concepts.
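
Both Biopython and rentrez retrieve PubMed records through NCBI's E-utilities. As a rough illustration of what such a request looks like, the sketch below builds an `efetch` URL for a set of PubMed IDs using only the Python standard library. The function name `build_efetch_url` and the example PMIDs are hypothetical, and no network request is made; in practice the libraries above handle rate limits and parsing for you.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(pmids, rettype="abstract", retmode="text"):
    """Return an efetch URL for the given PubMed IDs (hypothetical helper)."""
    if not pmids:
        raise ValueError("at least one PMID is required")
    params = {
        "db": "pubmed",
        # efetch accepts a comma-separated list of identifiers.
        "id": ",".join(str(p) for p in pmids),
        "rettype": rettype,
        "retmode": retmode,
    }
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    # Placeholder PMIDs, used only to show the URL shape.
    print(build_efetch_url([123, 456]))
```

Fetching the resulting URL (for example with `urllib.request.urlopen`) should return plain-text abstracts, but check NCBI's usage guidelines before sending requests in bulk.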
Repos for Specific Datasets
- mimic-code -
Code associated with the MIMIC-III dataset (see below). Includes some
helpful tutorials.
Back to Top
- cTAKES - paper
- code - A system for
processing the text in electronic medical records. Widely used and open
source.
- CLAMP - paper
- A natural language processing toolkit intended for use with the text
in clinical reports. Check out their live demo first to see
what it does. Usable at no cost for academic research.
- DeepPhe - A
system for processing documents describing cancer presentations. Based
on cTAKES (see above).
- DNorm
- paper -
A method for disease normalization, i.e., linking mentions of disease
names and acronyms to unique concept identifiers. Downloadable version
includes the NCBI Disease Corpus and BC5CDR (see Annotated Text Data
below).
- PubTator
Central - paper
- A web platform that identifies five different types of biomedical
concepts in PubMed articles and PubMed Central full texts. The full
annotation sets are downloadable (see Annotated Text Data below).
- Pubrunner - A
framework for running text mining tools on the newest set(s) of
documents from PubMed.
- SemEHR -
paper -
an IE infrastructure for electronic health records (EHR). Built on the
CogStack project.
- TaggerOne
- paper -
Performs concept normalization (see also DNorm above). Can be trained
for specific concept types and can perform NER independent of other
normalization functions.
- TabInOut
- paper
- a framework for IE from tables in the literature.
- Anafora - paper -
An annotation tool with adjudication and progress tracking
features.
- brat - paper - code - The brat rapid
annotation tool. Supports producing text annotations visually, through
the browser. Not subject specific; appropriate for many annotation
projects. Visualization is based on that of the stav tool.
- MedTator - paper
- code - An annotation
tool designed to have minimal dependencies.
Back to Top
Techniques and Models
Large Language Models
TBD - watch this space!
BERT models
- BioBERT -
paper - code - A PubMed and
PubMed Central-trained version of the BERT language model.
- ClinicalBERT - Two language models trained on clinical text have
similar names. Both are BERT models trained on the text of clinical
notes from the MIMIC-III dataset.
- SciBERT - paper - A BERT model trained
on >1M papers from the Semantic Scholar database.
- BlueBERT - paper - A BERT model
pre-trained on PubMed text and MIMIC-III notes.
- PubMedBERT - paper - A BERT model trained
from scratch on PubMed, with versions trained on abstracts+full texts
and on abstracts alone.
GPT-2 models
- BioGPT - paper - A GPT-2 model
pre-trained on 15 million PubMed abstracts, along with fine-tuned
versions for several biomedical tasks.
Other models
- Flair
embeddings from PubMed - A language model available through the
Flair framework and embedding method. Trained over a 5% sample of PubMed
abstracts until 2015, or > 1.2 million abstracts in total.
Text Embeddings
- This
paper from Hongfang Liu's group at Mayo Clinic demonstrates how text
embeddings trained on biomedical or clinical text can, but don't always,
perform better on biomedical natural language processing tasks. That
being said, pre-trained embeddings may be appropriate for your needs,
especially as training domain-specific embeddings can be computationally
intensive.
- BioASQword2vec
- paper - Word
embeddings derived from biomedical text (>10 million PubMed
abstracts) using the popular word2vec
tool.
- BioWordVec
- paper
- code - Word
embeddings derived from biomedical text (>27 million PubMed titles
and abstracts), including a subword embedding model based on MeSH.
Back to Top
Datasets
Some of the datasets listed below require a UMLS
Terminology Services (UTS) account to access. Please note that the
license granted with the UTS account requires users to submit an annual
report about their use of UMLS resources. This is less challenging than
it sounds.
Biomedical Text Sources
The following resources contain indexed text documents in the
biomedical sciences.
- OHSUMED - paper - 348,566
MEDLINE entries (title and sometimes abstract) from between 1987 and
1991. Includes MeSH labels. Primarily of historical significance.
- PubMed Central
Open Access Subset - A set of PubMed Central articles usable under
licenses other than traditional copyright, though the exact licenses
vary by publication and source. Articles are available as PDF and XML.
- CORD-19 - A corpus of
scholarly manuscripts concerning COVID-19. Articles are primarily from
PubMed Central and preprint servers, though the set also includes
metadata on papers without full-text availability.
Annotated Text Data
- SPL-ADR-200db
- paper - A
pilot dataset containing standardised information, and annotations of
occurrence in text, about ~5,000 known adverse reactions for 200
FDA-approved drugs.
- BioCreAtIvE
1 - paper
- 15,000 sentences (10,000 training and 5,000 test) annotated for
protein and gene names. 1,000 full text biomedical research articles
annotated with protein names and Gene Ontology terms.
- BioCreAtIvE
2 - paper
- 15,000 sentences (10,000 training and 5,000 test, different from the
first corpus) annotated for protein and gene names. 542 abstracts linked
to EntrezGene identifiers. A variety of research articles annotated for
features of protein-protein interactions.
- BioCreAtIvE
V CDR Task Corpus (BC5CDR) - paper
- 1,500 articles (title and abstract) published in 2014 or later,
annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical-disease
interactions. Requires registration.
- BioCreative
VI CHEMPROT Corpus - paper
- >2,400 articles annotated with chemical-protein interactions of a
variety of relation types. Requires registration.
- CRAFT - paper
- 67 full-text biomedical articles annotated in a variety of ways,
including for concepts and coreferences. Now on version 5, including
annotations linking concepts to the MONDO disease ontology.
- n2c2
(formerly i2b2) Data - The Department of Biomedical Informatics
(DBMI) at Harvard Medical School manages data for the National NLP
Clinical Challenges and the Informatics for Integrating Biology and the
Bedside challenges running since 2006. They require registration before
access and use. Datasets include a variety of topics. See the list of data
challenges for individual descriptions.
- NCBI
Disease Corpus - paper
- A corpus of 793 biomedical abstracts annotated with names of diseases
and related concepts from MeSH and OMIM.
- PubTator
Central datasets - paper
- Accessible through a RESTful API or FTP download. Includes annotations
for >29 million abstracts and ~3 million full text documents.
- Word Sense Disambiguation
(WSD) - paper
- 203 ambiguous words and 37,888 automatically extracted instances of
their use in biomedical research publications. Requires UTS
account.
- Clinical
Questions Collection - also known as CQC or the Iowa collection,
these are several thousand questions posed by physicians during office
visits along with the associated answers.
- BioNLP ST 2013 datasets -
data from six shared tasks, though some may not be easily accessible;
try the CG task set (BioNLP2013CG) for extensive entity and event
annotations.
- BioScope - paper -
a corpus of sentences from medical and biological documents, annotated
for negation, speculation, and linguistic scope.
- BioRED -
paper - a set of >6.5K
biomedical relation annotations, plus labels for novel findings.
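
Several of the corpora above (BC5CDR and the PubTator Central downloads among them) are commonly distributed in the tab-delimited PubTator text format. The following is a minimal sketch of a reader for that format, assuming the usual layout: `PMID|t|title` and `PMID|a|abstract` lines followed by tab-separated annotation lines of the form PMID, start, end, mention, type, identifier. The class and function names are hypothetical, the sample record is made up for illustration, and relation lines (which have fewer columns) are simply skipped.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int          # character offset where the mention begins
    end: int            # character offset just past the mention
    mention: str
    entity_type: str
    identifier: str


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    annotations: list = field(default_factory=list)


def parse_pubtator(text):
    """Parse PubTator-format text into Document objects (a sketch)."""
    docs = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            doc = docs.setdefault(head[0], Document(pmid=head[0]))
            if head[1] == "t":
                doc.title = head[2]
            else:
                doc.abstract = head[2]
            continue
        cols = line.split("\t")
        # Entity annotations have at least six columns with numeric offsets.
        if len(cols) >= 6 and cols[1].isdigit() and cols[2].isdigit():
            doc = docs.setdefault(cols[0], Document(pmid=cols[0]))
            doc.annotations.append(
                Annotation(int(cols[1]), int(cols[2]), cols[3], cols[4], cols[5])
            )
    return list(docs.values())


# Made-up record for illustration only; not taken from any corpus.
SAMPLE = (
    "12345|t|Aspirin and headache\n"
    "12345|a|Aspirin relieved the headache.\n"
    "12345\t0\t7\tAspirin\tChemical\tD001241\n"
    "12345\t12\t20\theadache\tDisease\tD006261\n"
)

if __name__ == "__main__":
    for doc in parse_pubtator(SAMPLE):
        for ann in doc.annotations:
            print(doc.pmid, ann.mention, ann.entity_type, ann.identifier)
```

Offsets in this format are usually counted over the title and abstract joined together, so it is worth checking a few mentions against the text before relying on them.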
Protein-protein
Interaction Annotated Corpora
Protein-protein interactions are abbreviated as PPI. The following
sets are available in BioC
format. The older sets (AIMed, BioInfer, HPRD50, IEPA, and LLL) are
available courtesy of the WBI corpora repository
and were derived from the original sets by a group at the University of
Turku.
- AIMed
- paper - 225
MEDLINE abstracts annotated for PPI.
- BioC-BioGRID -
paper
- 120 full text articles annotated for PPI and genetic interactions.
Used in the BioCreative V BioC task.
- BioInfer
- paper
- 1,100 sentences from biomedical research abstracts annotated for
relationships (including PPI), named entities, and syntactic
dependencies. Additional
information and download links are here.
- HPRD50
- paper
- 50 scientific abstracts referenced by the Human Protein Reference
Database, annotated for PPI.
- IEPA
- paper
- 486 sentences from biomedical research abstracts annotated for pairs
of co-occurring chemicals, including proteins (hence, PPI
annotations).
- LLL
- paper
- 77 sentences from research articles about the bacterium Bacillus
subtilis, annotated for protein-gene interactions (so, fairly close
to PPI annotations). Additional
information is here.
Other Datasets
- Columbia Open Health Data - paper - A
database of prevalence and co-occurrence frequencies of conditions,
drugs, procedures, and patient demographics extracted from electronic
health records. Does not include original record text.
- Comparative Toxicogenomics
Database - paper -
A database of manually curated associations between chemicals, gene
products, phenotypes, diseases, and environmental exposures. Useful for
assembling ontologies of the related concepts, such as types of
chemicals.
- MIMIC-III - paper -
Deidentified health data from ~60,000 intensive care unit admissions.
Requires completion of an online training course (CITI training) and
acceptance of a data use agreement prior to use.
- MIMIC-CXR -
The MIMIC Chest X-Ray database. Contains more than 377,000 radiographic
images and accompanying free-text radiology reports. As with MIMIC-III,
requires acceptance of a data use agreement.
- UMLS
Knowledge Sources - reference manual
- A large and comprehensive collection of biomedical terminology and
identifiers, as well as accompanying tools and scripts. Depending on
your purposes, the single file MRCONSO.RRF may be sufficient, as this
file contains unique identifiers and names for all concepts in the UMLS
Metathesaurus. See also the Ontologies and Controlled Vocabularies
section below.
- MIMIC-IV - An update to
MIMIC-III's multimodal patient data, now covering more recent years of
admissions, plus a new data structure, emergency department records, and
links to MIMIC-CXR images.
- eICU Collaborative Research
Database - paper - a
database of observations from more than 200,000 intensive care unit
admissions, with consistent structure. Requires registration, training
course completion, and data use agreement.
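
The UMLS Knowledge Sources entry above notes that the single file MRCONSO.RRF may be enough for many purposes. As a sketch of how that file can be read, the code below splits its pipe-delimited rows into named fields and collects one English name per concept. The column order follows the UMLS Reference Manual as best I recall it, the "preferred name" rule (TS = P, STT = PF, ISPREF = Y) is a common heuristic rather than an official definition, and the sample rows are synthetic placeholders, not real UMLS content.

```python
# Column names for MRCONSO.RRF, in file order (assumed from the UMLS manual).
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def iter_mrconso(lines, language="ENG"):
    """Yield one dict per row in the requested language."""
    for line in lines:
        # Each row ends with a trailing pipe, so split() yields one extra
        # empty value; zip() drops it by stopping at the last field name.
        values = line.rstrip("\n").split("|")
        row = dict(zip(MRCONSO_FIELDS, values))
        if row.get("LAT") == language:
            yield row


def preferred_names(lines):
    """Map each CUI to one English name using a simple heuristic."""
    names = {}
    for row in iter_mrconso(lines):
        if row["TS"] == "P" and row["STT"] == "PF" and row["ISPREF"] == "Y":
            names.setdefault(row["CUI"], row["STR"])
    return names


# Synthetic rows for illustration only.
SAMPLE = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|example concept|0|N||",
    "C0000001|ENG|S|L0000002|PF|S0000002|Y|A0000002||||SRC|SY|X1|example synonym|0|N||",
    "C0000001|SPA|P|L0000003|PF|S0000003|Y|A0000003||||SRC|PT|X1|concepto de ejemplo|0|N||",
]

if __name__ == "__main__":
    print(preferred_names(SAMPLE))
```

Because `iter_mrconso` accepts any iterable of lines, the same code can be pointed at an open file handle (for example `open("MRCONSO.RRF", encoding="utf-8")`) and will stream through the file without loading it into memory.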
Back to Top
Ontologies and
Controlled Vocabularies
- Disease Ontology - paper -
An ontology of human diseases. Has cross-links to MeSH, ICD, NCI
Thesaurus, SNOMED, and OMIM. Public domain. Available on GitHub
and on the OBO
Foundry.
- RxNorm
- paper
- Normalized names for clinical drugs and drug packs, with combined
ingredients, strengths, and form, and assigned types from the Semantic
Network (see below). Released monthly.
- SPECIALIST
Lexicon - paper -
A general English lexicon that includes many biomedical terms. Updated
yearly since 1994 and still updated as of 2019. Part of UMLS but does
not require UTS account to download.
- UMLS
Metathesaurus - paper -
Mappings between >3.8 million concepts, 14 million concept names, and
>200 sources of biomedical vocabulary and identifiers. It's big. It
may help to prepare a subset of the Metathesaurus with the MetamorphoSys
installation tool but we're still talking about ~30 GB of disk space
required for the 2019 release. See the manual
here. Requires UTS account.
- UMLS Semantic
Network - paper -
Lists of 133 semantic types and 54 semantic relationships covering
biomedical concepts and vocabulary. Is the Metathesaurus too complex for
your needs? Try this. Does not require UTS account to download.
Back to Top
Data Models
Do you need a data
model? If you are working with biomedical data, then the answer is
probably "Yes".
- Biolink - code - A data model
of biological entities. Provided as a YAML file.
- BioUML - paper
- An architecture for biomedical data analysis, integration, and
visualization. Conceptually based on the visual modeling language UML.
- OMOP Common Data
Model - a standard for observational healthcare data.
Back to Top
Credits
Credits for curators and sources.
License
