update lists

2025-07-18 22:22:32 +02:00
parent 55bed3b4a1
commit 5916c5c074
3078 changed files with 331679 additions and 357255 deletions

@@ -1,5 +1,5 @@
<div data-align="center">
<pre><code>&lt;img src=&quot;https://github.com/caufieldjh/awesome-bioie/blob/master/images/abie_head.png&quot; alt=&quot;Awesome BioIE Logo&quot;/&gt;
<pre><code>&lt;img src=&quot;https://github.com/caufieldjh/awesome-bioie/blob/main/images/abie_head.png&quot; alt=&quot;Awesome BioIE Logo&quot;/&gt;
&lt;br&gt;
&lt;a href=&quot;https://awesome.re&quot;&gt;
&lt;img src=&quot;https://awesome.re/badge-flat2.svg&quot; alt=&quot;Awesome&quot;&gt;
@@ -16,6 +16,9 @@ language. If the resulting information is verifiable and consistent
across sources, we may then consider it <em>knowledge</em>. Extracting
information and producing knowledge from bio data requires adaptations
upon methods developed for other types of unstructured data.</p>
<p>BioIE has undergone massive changes since the introduction of
language models like BERT and the more recently created Large Language
Models (LLMs; e.g., GPT-3/4, LLAMA2/3, Gemini, etc.).</p>
<p>Resources included here are preferentially those available at no
monetary cost and limited license requirements. Methods and datasets
should be publicly accessible and actively maintained.</p>
@@ -58,12 +61,7 @@ Services</a>
<ul>
<li><a href="#annotation-tools">Annotation Tools</a></li>
</ul></li>
<li><a href="#techniques">Techniques</a>
<ul>
<li><a href="#text-embeddings">Text Embeddings</a></li>
<li><a href="#word-embeddings">Word Embeddings</a></li>
<li><a href="#language-models">Language Models</a></li>
</ul></li>
<li><a href="#techniques-and-models">Techniques and Models</a></li>
<li><a href="#datasets">Datasets</a>
<ul>
<li><a href="#biomedical-text-sources">Biomedical Text Sources</a></li>
@@ -79,6 +77,26 @@ Controlled Vocabularies</a></li>
<li><a href="#credits">Credits</a></li>
</ul>
<h2 id="research-overviews">Research Overviews</h2>
<h3 id="llms-in-biomedical-ie">LLMs in Biomedical IE</h3>
<ul>
<li><a href="http://dx.doi.org/10.1101/2024.04.24.24306315">Large
language models in healthcare: A comprehensive benchmark</a> - a
statistical and human evaluation of sixteen different LLMs applied to
medical language tasks.</li>
<li><a href="https://doi.org/10.1186/s12911-024-02459-6">Assessing the
research landscape and clinical utility of large language models: a
scoping review</a> - a high-level review of LLM applications in medicine
as of March 2024.</li>
<li><a href="https://doi.org/10.1016/s2589-7500(24)00061-x">Ethical and
regulatory challenges of large language models in medicine</a> - a
review of ethical issues arising from applications of LLMs in
biomedicine.</li>
<li><a href="http://dx.doi.org/10.1145/3442188.3445922">On the Dangers
of Stochastic Parrots: Can Language Models Be Too Big? 🦜</a> - a
frequently referenced but still relevant work concerning the roles,
applications, and risks of language models.</li>
</ul>
<h3 id="pre-llm-overviews">Pre-LLM Overviews</h3>
<ul>
<li><a
href="https://www.ahajournals.org/doi/full/10.1161/CIRCRESAHA.117.310967">Biomedical
@@ -129,20 +147,6 @@ href="http://www.childrenshospital.org/research/labs/natural-language-processing
Children's Hospital Natural Language Processing Laboratory</a> - Led by
Dr. Guergana Savova, formerly at Mayo Clinic and the Apache cTAKES
project.</li>
<li><a href="https://commonfund.nih.gov/bd2k">BD2K</a> - The U.S.
National Institutes of Health (NIH) funded 13 Centers of Excellence
through their Big Data to Knowledge (BD2K) program, several of which
developed tools and resources for BioIE.
<ul>
<li><a href="http://www.heartbd2k.org/">HeartBD2K</a> - Based at
University of California, Los Angeles (UCLA). Led by Dr. Peipei
Ping.</li>
<li><a href="https://knoweng.org/about/people/">KnowEng</a> - Based an
University of Illinois at Urbana-Champaign (UIUC). Led by Dr. Jiawei
Han.</li>
<li><a href="http://mobilize.stanford.edu/">Mobilize</a> - Based at
Stanford. Led by Dr. Scott Delp.</li>
</ul></li>
<li><a
href="https://www.brown.edu/academics/medical/about-us/research/centers-institutes-and-programs/biomedical-informatics/">Brown
Center for Biomedical Informatics</a> - Based at Brown University and
@@ -314,15 +318,11 @@ synthetic data set rather than real EHR contents.</li>
years are missing crucial details. A few more recent educational
resources are listed below. A good foundational understanding of text
mining techniques is very helpful, as is some basic experience with the
Python and/or R languages. Starting with the <a
href="https://www.nltk.org/book/">NLTK tutorials</a> and then trying out
the tutorials for the <a
href="https://github.com/zalandoresearch/flair">Flair framework</a> will
provide excellent examples of natural language processing, text mining,
and modern machine learning-driven methods, all in Python. Most of the
examples don't include anything biomedical, however, so the best option
may be to learn by doing.</p>
<h3 id="guides">Guides</h3>
Python and/or R languages. The best option may be to learn by doing.</p>
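<p>For a first hands-on step, the sketch below tokenizes a biomedical
sentence with NLTK. It is a minimal example only; the sentence is
invented, and the <code>TreebankWordTokenizer</code> is used because it
requires no extra data downloads.</p>
<pre><code>from nltk.tokenize import TreebankWordTokenizer

# An invented biomedical sentence, for illustration only.
text = "Mutations in BRCA1 are associated with hereditary breast cancer."

# Treebank-style word tokenization; no corpus download needed.
tokens = TreebankWordTokenizer().tokenize(text)
print(tokens)</code></pre>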
<h3 id="llm-guides">LLM Guides</h3>
<p><em>TBD - watch this space!</em></p>
<h3 id="pre-llm-guides-lectures-and-courses">Pre-LLM Guides, Lectures,
and Courses</h3>
<ul>
<li><a
href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0040020">Getting
@@ -337,10 +337,6 @@ Literature Mining</a> - A (non-free) volume of Methods in Molecular
Biology from 2014. Chapters cover introductory principles in text
mining, applications in the biological sciences, and potential for use
in clinical or medical safety scenarios.</li>
</ul>
<h3 id="video-lectures-and-online-courses">Video Lectures and Online
Courses</h3>
<ul>
<li><a
href="https://www.coursera.org/learn/mining-medical-data">Coursera -
Foundations of mining non-structured medical data</a> - About three
@@ -463,34 +459,10 @@ href="https://academic.oup.com/bioinformatics/article-abstract/38/6/1776/6496915
tool designed to have minimal dependencies.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="techniques">Techniques</h2>
<h3 id="text-embeddings">Text Embeddings</h3>
<p><a
href="https://www.sciencedirect.com/science/article/pii/S1532046418301825">This
paper from Hongfang Liu's group at Mayo Clinic</a> demonstrates how text
embeddings trained on biomedical or clinical text can, but don't always,
perform better on biomedical natural language processing tasks. That
being said, pre-trained embeddings may be appropriate for your needs,
especially as training domain-specific embeddings can be computationally
intensive.</p>
<h3 id="word-embeddings">Word Embeddings</h3>
<ul>
<li><a
href="http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts">BioASQword2vec</a>
- <a href="http://bioasq.lip6.fr/info/BioASQword2vec/">paper</a> - Qord
embeddings derived from biomedical text (&gt;10 million PubMed
abstracts) using the popular <a
href="https://code.google.com/archive/p/word2vec/">word2vec</a>
tool.</li>
<li><a
href="https://figshare.com/articles/Improving_Biomedical_Word_Embeddings_with_Subword_Information_and_MeSH_Ontology/6882647">BioWordVec</a>
- <a href="https://www.nature.com/articles/s41597-019-0055-0">paper</a>
- <a href="https://github.com/ncbi-nlp/BioWordVec">code</a> - Word
embeddings derived from biomedical text (&gt;27 million PubMed titles
and abstracts), including a subword embedding model based on MeSH.</li>
</ul>
<h3 id="language-models">Language Models</h3>
<h4 id="bert-models">BERT models</h4>
<h2 id="techniques-and-models">Techniques and Models</h2>
<h3 id="large-language-models">Large Language Models</h3>
<p><em>TBD - watch this space!</em></p>
<h3 id="bert-models">BERT models</h3>
<ul>
<li><a href="https://github.com/naver/biobert-pretrained">BioBERT</a> -
<a href="https://arxiv.org/abs/1901.08746">paper</a> - <a
@@ -520,20 +492,44 @@ href="https://arxiv.org/abs/2007.15779">paper</a> - A BERT model trained
from scratch on PubMed, with versions trained on abstracts+full texts
and on abstracts alone.</li>
</ul>
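<p>Checkpoints like these are typically published on the Hugging Face
Hub and loaded with the <code>transformers</code> library. A minimal
sketch for extracting contextual token embeddings, assuming the
<code>dmis-lab/biobert-v1.1</code> checkpoint name (verify against the
linked repositories):</p>
<pre><code>import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name assumed from the Hugging Face Hub; check the
# linked BioBERT repository for the current identifier.
name = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("EGFR mutations confer sensitivity to gefitinib.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input token.
print(outputs.last_hidden_state.shape)</code></pre>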
<h4 id="gpt-models">GPT models</h4>
<h3 id="gpt-2-models">GPT-2 models</h3>
<ul>
<li><a href="https://github.com/microsoft/BioGPT">BioGPT</a> - <a
href="https://doi.org/10.1093/bib/bbac409">paper</a> - A GPT-2 model
pre-trained on 15 million PubMed abstracts, along with fine-tuned
versions for several biomedical tasks.</li>
</ul>
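<p>BioGPT is likewise distributed through the Hugging Face Hub; a
minimal generation sketch, assuming the <code>microsoft/biogpt</code>
checkpoint name:</p>
<pre><code>from transformers import pipeline

# Checkpoint name assumed; see the linked BioGPT repository.
generator = pipeline("text-generation", model="microsoft/biogpt")
result = generator("Aspirin is commonly used to", max_new_tokens=25)
print(result[0]["generated_text"])</code></pre>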
<h4 id="other-models">Other models</h4>
<h3 id="other-models">Other models</h3>
<ul>
<li><a href="https://github.com/zalandoresearch/flair/pull/519">Flair
embeddings from PubMed</a> - A language model available through the
Flair framework and embedding method. Trained over a 5% sample of PubMed
abstracts until 2015, or &gt; 1.2 million abstracts in total.</li>
</ul>
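<p>A minimal sketch of using these embeddings through Flair, assuming
they are registered under the <code>pubmed-forward</code> identifier
(check the linked pull request for the exact name):</p>
<pre><code>from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# Identifier assumed from the linked pull request; verify before use.
embedding = FlairEmbeddings("pubmed-forward")
sentence = Sentence("Thrombin inhibitors reduce clot formation.")
embedding.embed(sentence)

# Each token now carries a character-level language model embedding.
print(sentence[0].embedding.shape)</code></pre>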
<h3 id="text-embeddings">Text Embeddings</h3>
<ul>
<li><a
href="https://www.sciencedirect.com/science/article/pii/S1532046418301825">This
paper from Hongfang Liu's group at Mayo Clinic</a> demonstrates how text
embeddings trained on biomedical or clinical text can, but don't always,
perform better on biomedical natural language processing tasks. That
being said, pre-trained embeddings may be appropriate for your needs,
especially as training domain-specific embeddings can be computationally
intensive.</li>
<li><a
href="http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts">BioASQword2vec</a>
- <a href="http://bioasq.lip6.fr/info/BioASQword2vec/">paper</a> - Qord
embeddings derived from biomedical text (&gt;10 million PubMed
abstracts) using the popular <a
href="https://code.google.com/archive/p/word2vec/">word2vec</a>
tool.</li>
<li><a
href="https://figshare.com/articles/Improving_Biomedical_Word_Embeddings_with_Subword_Information_and_MeSH_Ontology/6882647">BioWordVec</a>
- <a href="https://www.nature.com/articles/s41597-019-0055-0">paper</a>
- <a href="https://github.com/ncbi-nlp/BioWordVec">code</a> - Word
embeddings derived from biomedical text (&gt;27 million PubMed titles
and abstracts), including a subword embedding model based on MeSH.</li>
</ul>
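<p>Pre-trained vectors in word2vec format, such as BioWordVec, can be
loaded with gensim. A minimal sketch; the file name below is
illustrative, so match it to whichever distribution file you
download:</p>
<pre><code>from gensim.models import KeyedVectors

# Illustrative file name; use the actual BioWordVec download.
vectors = KeyedVectors.load_word2vec_format(
    "bio_embedding_intrinsic.bin", binary=True)
print(vectors.most_similar("aspirin", topn=5))</code></pre>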
<p><a href="#contents">Back to Top</a></p>
<h2 id="datasets">Datasets</h2>
<p>Some of the datasets listed below require a <a
@@ -809,3 +805,5 @@ Model</a> - a standard for observational healthcare data.</li>
src="https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg"
alt="CC0" /></a></p>
<p><a href="./LICENSE">License</a></p>
<p><a href="https://github.com/caufieldjh/awesome-bioie">bioie.md
Github</a></p>