update lists
html/bioie.html
@@ -1,5 +1,5 @@
<div data-align="center">
<pre><code><img src="https://github.com/caufieldjh/awesome-bioie/blob/master/images/abie_head.png" alt="Awesome BioIE Logo"/>
<pre><code><img src="https://github.com/caufieldjh/awesome-bioie/blob/main/images/abie_head.png" alt="Awesome BioIE Logo"/>
<br>
<a href="https://awesome.re">
<img src="https://awesome.re/badge-flat2.svg" alt="Awesome">
@@ -16,6 +16,9 @@ language. If the resulting information is verifiable and consistent
across sources, we may then consider it <em>knowledge</em>. Extracting
information and producing knowledge from bio data requires adaptations
upon methods developed for other types of unstructured data.</p>
<p>BioIE has undergone massive changes since the introduction of
language models like BERT and, more recently, Large Language
Models (LLMs; e.g., GPT-3/4, Llama 2/3, Gemini).</p>
<p>Resources included here are preferentially those available at no
monetary cost and with limited license requirements. Methods and
datasets should be publicly accessible and actively maintained.</p>
@@ -58,12 +61,7 @@ Services</a>
<ul>
<li><a href="#annotation-tools">Annotation Tools</a></li>
</ul></li>
<li><a href="#techniques">Techniques</a>
<ul>
<li><a href="#text-embeddings">Text Embeddings</a></li>
<li><a href="#word-embeddings">Word Embeddings</a></li>
<li><a href="#language-models">Language Models</a></li>
</ul></li>
<li><a href="#techniques-and-models">Techniques and Models</a></li>
<li><a href="#datasets">Datasets</a>
<ul>
<li><a href="#biomedical-text-sources">Biomedical Text Sources</a></li>
@@ -79,6 +77,26 @@ Controlled Vocabularies</a></li>
<li><a href="#credits">Credits</a></li>
</ul>
<h2 id="research-overviews">Research Overviews</h2>
<h3 id="llms-in-biomedical-ie">LLMs in Biomedical IE</h3>
<ul>
<li><a href="http://dx.doi.org/10.1101/2024.04.24.24306315">Large
language models in healthcare: A comprehensive benchmark</a> - a
statistical and human evaluation of sixteen different LLMs applied to
medical language tasks.</li>
<li><a href="https://doi.org/10.1186/s12911-024-02459-6">Assessing the
research landscape and clinical utility of large language models: a
scoping review</a> - a high-level review of LLM applications in medicine
as of March 2024.</li>
<li><a href="https://doi.org/10.1016/s2589-7500(24)00061-x">Ethical and
regulatory challenges of large language models in medicine</a> - a
review of ethical issues arising from applications of LLMs in
biomedicine.</li>
<li><a href="http://dx.doi.org/10.1145/3442188.3445922">On the Dangers
of Stochastic Parrots: Can Language Models Be Too Big? 🦜</a> - a
frequently referenced but still relevant work concerning the roles,
applications, and risks of language models.</li>
</ul>
<h3 id="pre-llm-overviews">Pre-LLM Overviews</h3>
<ul>
<li><a
href="https://www.ahajournals.org/doi/full/10.1161/CIRCRESAHA.117.310967">Biomedical
@@ -129,20 +147,6 @@ href="http://www.childrenshospital.org/research/labs/natural-language-processing
Children’s Hospital Natural Language Processing Laboratory</a> - Led by
Dr. Guergana Savova, formerly at Mayo Clinic and the Apache cTAKES
project.</li>
<li><a href="https://commonfund.nih.gov/bd2k">BD2K</a> - The U.S.
National Institutes of Health (NIH) funded 13 Centers of Excellence
through their Big Data to Knowledge (BD2K) program, several of which
developed tools and resources for BioIE.
<ul>
<li><a href="http://www.heartbd2k.org/">HeartBD2K</a> - Based at
University of California, Los Angeles (UCLA). Led by Dr. Peipei
Ping.</li>
<li><a href="https://knoweng.org/about/people/">KnowEng</a> - Based at
University of Illinois at Urbana-Champaign (UIUC). Led by Dr. Jiawei
Han.</li>
<li><a href="http://mobilize.stanford.edu/">Mobilize</a> - Based at
Stanford. Led by Dr. Scott Delp.</li>
</ul></li>
<li><a
href="https://www.brown.edu/academics/medical/about-us/research/centers-institutes-and-programs/biomedical-informatics/">Brown
Center for Biomedical Informatics</a> - Based at Brown University and
@@ -314,15 +318,11 @@ synthetic data set rather than real EHR contents.</li>
years are missing crucial details. A few more recent educational
resources are listed below. A good foundational understanding of text
mining techniques is very helpful, as is some basic experience with the
Python and/or R languages. Starting with the <a
href="https://www.nltk.org/book/">NLTK tutorials</a> and then trying out
the tutorials for the <a
href="https://github.com/zalandoresearch/flair">Flair framework</a> will
provide excellent examples of natural language processing, text mining,
and modern machine learning-driven methods, all in Python. Most of the
examples don’t include anything biomedical, however, so the best option
may be to learn by doing.</p>
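<p>As a taste of what those tutorials cover, here is a minimal sketch of
tokenization and part-of-speech tagging with NLTK; the example sentence
is ours, not from the tutorials, and the resource names passed to
nltk.download() vary slightly across NLTK versions.</p>
<pre><code># Assumes: pip install nltk, plus one-time model downloads.
import nltk

nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS tagger model

text = "Imatinib inhibits the BCR-ABL tyrosine kinase."
tokens = nltk.word_tokenize(text)

# Token/part-of-speech pairs, e.g. ('Imatinib', 'NNP'):
print(nltk.pos_tag(tokens))</code></pre>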
<h3 id="guides">Guides</h3>
Python and/or R languages. The best option may be to learn by doing.</p>
<h3 id="llm-guides">LLM Guides</h3>
<p><em>TBD - watch this space!</em></p>
<h3 id="pre-llm-guides-lectures-and-courses">Pre-LLM Guides, Lectures,
and Courses</h3>
<ul>
<li><a
href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0040020">Getting
@@ -337,10 +337,6 @@ Literature Mining</a> - A (non-free) volume of Methods in Molecular
Biology from 2014. Chapters cover introductory principles in text
mining, applications in the biological sciences, and potential for use
in clinical or medical safety scenarios.</li>
</ul>
<h3 id="video-lectures-and-online-courses">Video Lectures and Online
Courses</h3>
<ul>
<li><a
href="https://www.coursera.org/learn/mining-medical-data">Coursera -
Foundations of mining non-structured medical data</a> - About three
@@ -463,34 +459,10 @@ href="https://academic.oup.com/bioinformatics/article-abstract/38/6/1776/6496915
tool designed to have minimal dependencies.</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<h2 id="techniques">Techniques</h2>
<h3 id="text-embeddings">Text Embeddings</h3>
<p><a
href="https://www.sciencedirect.com/science/article/pii/S1532046418301825">This
paper from Hongfang Liu’s group at Mayo Clinic</a> demonstrates how text
embeddings trained on biomedical or clinical text can, but don’t always,
perform better on biomedical natural language processing tasks. That
being said, pre-trained embeddings may be appropriate for your needs,
especially as training domain-specific embeddings can be computationally
intensive.</p>
<h3 id="word-embeddings">Word Embeddings</h3>
<ul>
<li><a
href="http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts">BioASQword2vec</a>
- <a href="http://bioasq.lip6.fr/info/BioASQword2vec/">paper</a> - Word
embeddings derived from biomedical text (>10 million PubMed
abstracts) using the popular <a
href="https://code.google.com/archive/p/word2vec/">word2vec</a>
tool.</li>
<li><a
href="https://figshare.com/articles/Improving_Biomedical_Word_Embeddings_with_Subword_Information_and_MeSH_Ontology/6882647">BioWordVec</a>
- <a href="https://www.nature.com/articles/s41597-019-0055-0">paper</a>
- <a href="https://github.com/ncbi-nlp/BioWordVec">code</a> - Word
embeddings derived from biomedical text (>27 million PubMed titles
and abstracts), including a subword embedding model based on MeSH.</li>
</ul>
<h3 id="language-models">Language Models</h3>
<h4 id="bert-models">BERT models</h4>
<h2 id="techniques-and-models">Techniques and Models</h2>
<h3 id="large-language-models">Large Language Models</h3>
<p><em>TBD - watch this space!</em></p>
<h3 id="bert-models">BERT models</h3>
<ul>
<li><a href="https://github.com/naver/biobert-pretrained">BioBERT</a> -
<a href="https://arxiv.org/abs/1901.08746">paper</a> - <a
@@ -520,20 +492,44 @@ href="https://arxiv.org/abs/2007.15779">paper</a> - A BERT model trained
from scratch on PubMed, with versions trained on abstracts+full texts
and on abstracts alone.</li>
</ul>
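<p>As a minimal sketch of how a BERT-family model like those above can
be used, the snippet below loads BioBERT with the Hugging Face
transformers library; the dmis-lab/biobert-base-cased-v1.1 checkpoint
name is our assumption, not taken from this list, so verify it against
the BioBERT repository.</p>
<pre><code># Assumes: pip install transformers torch, and that the
# dmis-lab/biobert-base-cased-v1.1 checkpoint exists on the Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

inputs = tokenizer("EGFR mutations confer sensitivity to gefitinib.",
                   return_tensors="pt")
outputs = model(**inputs)

# One contextual embedding per wordpiece token:
print(outputs.last_hidden_state.shape)</code></pre>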
<h4 id="gpt-models">GPT models</h4>
<h3 id="gpt-2-models">GPT-2 models</h3>
<ul>
<li><a href="https://github.com/microsoft/BioGPT">BioGPT</a> - <a
href="https://doi.org/10.1093/bib/bbac409">paper</a> - A GPT-2 model
pre-trained on 15 million PubMed abstracts, along with fine-tuned
versions for several biomedical tasks.</li>
</ul>
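<p>A hedged sketch of generating text with BioGPT via the transformers
library, which ships BioGPT-specific classes; the microsoft/biogpt
checkpoint name is our assumption, so check it against the BioGPT
repository linked above.</p>
<pre><code># Assumes: pip install transformers sacremoses torch
from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("Metformin is a drug that", return_tensors="pt")

# Greedy decoding of a short biomedical continuation:
output_ids = model.generate(**inputs, max_new_tokens=25)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))</code></pre>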
<h4 id="other-models">Other models</h4>
<h3 id="other-models">Other models</h3>
<ul>
<li><a href="https://github.com/zalandoresearch/flair/pull/519">Flair
embeddings from PubMed</a> - A language model available through the
Flair framework and embedding method. Trained over a 5% sample of PubMed
abstracts until 2015, or > 1.2 million abstracts in total.</li>
</ul>
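<p>A minimal sketch of using those PubMed-trained Flair embeddings; it
assumes the 'pubmed-forward' model identifier that the Flair framework
uses for these weights, and an example sentence of our own.</p>
<pre><code># Assumes: pip install flair
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# Character-level language model trained on PubMed abstracts:
embedding = FlairEmbeddings("pubmed-forward")

sentence = Sentence("Aspirin inhibits platelet aggregation.")
embedding.embed(sentence)

for token in sentence:
    # Each token now carries a contextual embedding vector.
    print(token.text, token.embedding.shape)</code></pre>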
<h3 id="text-embeddings">Text Embeddings</h3>
<ul>
<li><a
href="https://www.sciencedirect.com/science/article/pii/S1532046418301825">This
paper from Hongfang Liu’s group at Mayo Clinic</a> demonstrates how text
embeddings trained on biomedical or clinical text can, but don’t always,
perform better on biomedical natural language processing tasks. That
being said, pre-trained embeddings may be appropriate for your needs,
especially as training domain-specific embeddings can be computationally
intensive.</li>
<li><a
href="http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts">BioASQword2vec</a>
- <a href="http://bioasq.lip6.fr/info/BioASQword2vec/">paper</a> - Word
embeddings derived from biomedical text (>10 million PubMed
abstracts) using the popular <a
href="https://code.google.com/archive/p/word2vec/">word2vec</a>
tool.</li>
<li><a
href="https://figshare.com/articles/Improving_Biomedical_Word_Embeddings_with_Subword_Information_and_MeSH_Ontology/6882647">BioWordVec</a>
- <a href="https://www.nature.com/articles/s41597-019-0055-0">paper</a>
- <a href="https://github.com/ncbi-nlp/BioWordVec">code</a> - Word
embeddings derived from biomedical text (>27 million PubMed titles
and abstracts), including a subword embedding model based on MeSH.</li>
</ul>
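<p>Pre-trained vectors like the two sets above are typically distributed
in word2vec format and can be loaded with gensim; the file name below is
hypothetical, standing in for whichever vector file you download.</p>
<pre><code># Assumes: pip install gensim; "biomedical_vectors.bin" is a
# placeholder for a downloaded word2vec-format file (e.g., BioWordVec).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("biomedical_vectors.bin",
                                            binary=True)

# Nearest neighbors in the embedding space:
print(vectors.most_similar("glucose", topn=5))</code></pre>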
<p><a href="#contents">Back to Top</a></p>
<h2 id="datasets">Datasets</h2>
<p>Some of the datasets listed below require a <a
@@ -809,3 +805,5 @@ Model</a> - a standard for observational healthcare data.</li>
src="https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg"
alt="CC0" /></a></p>
<p><a href="./LICENSE">License</a></p>
<p><a href="https://github.com/caufieldjh/awesome-bioie">bioie.md
Github</a></p>