<h1 id="awesome-nlp">awesome-nlp</h1>
<p><a href="https://github.com/sindresorhus/awesome"><img
src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"
alt="Awesome" /></a></p>
<p>A curated list of resources dedicated to Natural Language
Processing</p>
<figure>
<img src="/images/logo.jpg" alt="Awesome NLP Logo" />
<figcaption aria-hidden="true">Awesome NLP Logo</figcaption>
</figure>
<p>Read this in <a href="./README.md">English</a>, <a
href="./README-ZH-TW.md">Traditional Chinese</a></p>
<p><em>Please read the <a href="contributing.md">contribution
guidelines</a> before contributing. Please add your favourite NLP
resource by raising a <a
href="https://github.com/keonkim/awesome-nlp/pulls">pull
request</a></em></p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#research-summaries-and-trends">Research Summaries and
Trends</a></li>
<li><a href="#prominent-nlp-research-labs">Prominent NLP Research
Labs</a></li>
<li><a href="#tutorials">Tutorials</a>
<ul>
<li><a href="#reading-content">Reading Content</a></li>
<li><a href="#videos-and-online-courses">Videos and Courses</a></li>
<li><a href="#books">Books</a></li>
</ul></li>
<li><a href="#libraries">Libraries</a>
<ul>
<li><a href="#node-js">Node.js</a></li>
<li><a href="#python">Python</a></li>
<li><a href="#c++">C++</a></li>
<li><a href="#java">Java</a></li>
<li><a href="#kotlin">Kotlin</a></li>
<li><a href="#scala">Scala</a></li>
<li><a href="#R">R</a></li>
<li><a href="#clojure">Clojure</a></li>
<li><a href="#ruby">Ruby</a></li>
<li><a href="#rust">Rust</a></li>
<li><a href="#NLP++">NLP++</a></li>
<li><a href="#julia">Julia</a></li>
</ul></li>
<li><a href="#services">Services</a></li>
<li><a href="#annotation-tools">Annotation Tools</a></li>
<li><a href="#datasets">Datasets</a></li>
<li><a href="#nlp-in-korean">NLP in Korean</a></li>
<li><a href="#nlp-in-arabic">NLP in Arabic</a></li>
<li><a href="#nlp-in-chinese">NLP in Chinese</a></li>
<li><a href="#nlp-in-german">NLP in German</a></li>
<li><a href="#nlp-in-polish">NLP in Polish</a></li>
<li><a href="#nlp-in-spanish">NLP in Spanish</a></li>
<li><a href="#nlp-in-indic-languages">NLP in Indic Languages</a></li>
<li><a href="#nlp-in-thai">NLP in Thai</a></li>
<li><a href="#nlp-in-danish">NLP in Danish</a></li>
<li><a href="#nlp-in-vietnamese">NLP in Vietnamese</a></li>
<li><a href="#nlp-for-dutch">NLP for Dutch</a></li>
<li><a href="#nlp-in-indonesian">NLP in Indonesian</a></li>
<li><a href="#nlp-in-urdu">NLP in Urdu</a></li>
<li><a href="#nlp-in-persian">NLP in Persian</a></li>
<li><a href="#nlp-in-ukrainian">NLP in Ukrainian</a></li>
<li><a href="#nlp-in-hungarian">NLP in Hungarian</a></li>
<li><a href="#nlp-in-portuguese">NLP in Portuguese</a></li>
<li><a href="#other-languages">Other Languages</a></li>
<li><a href="#credits">Credits</a></li>
</ul>
<h2 id="research-summaries-and-trends">Research Summaries and
Trends</h2>
<ul>
<li><a href="https://nlpoverview.com/">NLP-Overview</a> is an up-to-date
overview of deep learning techniques applied to NLP, including theory,
implementations, applications, and state-of-the-art results. This is a
great Deep NLP Introduction for researchers.</li>
<li><a href="https://nlpprogress.com/">NLP-Progress</a> tracks the
progress in Natural Language Processing, including the datasets and the
current state-of-the-art for the most common NLP tasks</li>
<li><a href="https://thegradient.pub/nlp-imagenet/">NLPs ImageNet
moment has arrived</a></li>
<li><a href="http://ruder.io/acl-2018-highlights/">ACL 2018 Highlights:
Understanding Representation and Evaluation in More Challenging
Settings</a></li>
<li><a
href="https://www.abigailsee.com/2017/08/30/four-deep-learning-trends-from-acl-2017-part-1.html">Four
deep learning trends from ACL 2017. Part One: Linguistic Structure and
Word Embeddings</a></li>
<li><a
href="https://www.abigailsee.com/2017/08/30/four-deep-learning-trends-from-acl-2017-part-2.html">Four
deep learning trends from ACL 2017. Part Two: Interpretability and
Attention</a></li>
<li><a
href="http://blog.aylien.com/highlights-emnlp-2017-exciting-datasets-return-clusters/">Highlights
of EMNLP 2017: Exciting Datasets, Return of the Clusters, and
More!</a></li>
<li><a
href="https://tryolabs.com/blog/2017/12/12/deep-learning-for-nlp-advancements-and-trends-in-2017/?utm_campaign=Revue%20newsletter&amp;utm_medium=Newsletter&amp;utm_source=The%20Wild%20Week%20in%20AI">Deep
Learning for Natural Language Processing (NLP): Advancements &amp;
Trends</a></li>
<li><a href="https://arxiv.org/abs/1703.09902">Survey of the State of
the Art in Natural Language Generation</a></li>
</ul>
<h2 id="prominent-nlp-research-labs">Prominent NLP Research Labs</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a href="http://nlp.cs.berkeley.edu/index.shtml">The Berkeley NLP
Group</a> - Notable contributions include a tool to reconstruct long
dead languages, referenced <a
href="https://www.bbc.com/news/science-environment-21427896">here</a>
and by taking corpora from 637 languages currently spoken in Asia and
the Pacific and recreating their descendant.</li>
<li><a href="http://www.cs.cmu.edu/~nasmith/nlp-cl.html">Language
Technologies Institute, Carnegie Mellon University</a> - Notable
projects include <a href="http://www.cs.cmu.edu/~avenue/">Avenue
Project</a>, a syntax driven machine translation system for endangered
languages like Quechua and Aymara and previously, <a
href="http://www.cs.cmu.edu/~ark/">Noahs Ark</a> which created <a
href="http://www.cs.cmu.edu/~ark/AQMAR/">AQMAR</a> to improve NLP tools
for Arabic.</li>
<li><a href="http://www1.cs.columbia.edu/nlp/index.cgi">NLP research
group, Columbia University</a> - Responsible for creating BOLT
(interactive error handling for speech translation systems) and an
unnamed project to characterize laughter in dialogue.</li>
<li><a href="http://clsp.jhu.edu/">The Center or Language and Speech
Processing, John Hopkins University</a> - Recently in the news for
developing speech recognition software to create a diagnostic test or
Parkinsons Disease, <a
href="https://www.clsp.jhu.edu/2019/03/27/speech-recognition-software-and-machine-learning-tools-are-being-used-to-create-diagnostic-test-for-parkinsons-disease/#.XNFqrIkzYdU">here</a>.</li>
<li><a
href="https://wiki.umiacs.umd.edu/clip/index.php/Main_Page">Computational
Linguistics and Information Processing Group, University of Maryland</a>
- Notable contributions include <a
href="http://www.umiacs.umd.edu/~jbg/projects/IIS-1652666">Human-Computer
Cooperation or Word-by-Word Question Answering</a> and modeling
development of phonetic representations.</li>
<li><a href="https://nlp.cis.upenn.edu/">Penn Natural Language
Processing, University of Pennsylvania</a> - Famous for creating the <a
href="https://www.seas.upenn.edu/~pdtb/">Penn Treebank</a>.</li>
<li><a href="https://nlp.stanford.edu/">The Stanford Nautral Language
Processing Group</a>- One of the top NLP research labs in the world,
notable for creating <a
href="https://nlp.stanford.edu/software/corenlp.shtml">Stanford
CoreNLP</a> and their <a
href="https://nlp.stanford.edu/software/dcoref.shtml">coreference
resolution system</a></li>
</ul>
<h2 id="tutorials">Tutorials</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="reading-content">Reading Content</h3>
<p>General Machine Learning</p>
<ul>
<li><a
href="https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/edit?usp=sharing">Machine
Learning 101</a> from Google's Senior Creative Engineer explains Machine
Learning for engineers and executives alike</li>
<li><a href="https://aiplaybook.a16z.com/">AI Playbook</a> - a16z AI
playbook is a great link to forward to your managers or content for your
presentations</li>
<li><a href="http://ruder.io/#open">Ruders Blog</a> by <a
href="https://twitter.com/seb_ruder">Sebastian Ruder</a> for commentary
on the best of NLP Research</li>
<li><a href="https://www.lighttag.io/how-to-label-data/">How To Label
Data</a> - A guide to managing larger linguistic annotation projects</li>
<li><a href="https://www.depends-on-the-definition.com/">Depends on the
Definition</a> - A collection of blog posts covering a wide array of NLP
topics with detailed implementations</li>
</ul>
<p>Introductions and Guides to NLP</p>
<ul>
<li><a
href="https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/">Understand
&amp; Implement Natural Language Processing</a></li>
<li><a href="http://github.com/NirantK/nlp-python-deep-learning">NLP in
Python</a> - Collection of Github notebooks</li>
<li><a
href="https://academic.oup.com/jamia/article/18/5/544/829676">Natural
Language Processing: An Introduction</a> - Oxford</li>
<li><a
href="https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html">Deep
Learning for NLP with Pytorch</a></li>
<li><a href="https://github.com/hb20007/hands-on-nltk-tutorial">Hands-On
NLTK Tutorial</a> - NLTK Tutorials, Jupyter notebooks</li>
<li><a href="https://www.nltk.org/book/">Natural Language Processing
with Python Analyzing Text with the Natural Language Toolkit</a> - An
online and print book introducing NLP concepts using NLTK. The books
authors also wrote the NLTK library.</li>
<li><a href="https://huggingface.co/blog/how-to-train">Train a new
language model from scratch</a> - Hugging Face 🤗</li>
<li><a href="https://notebooks.quantumstat.com/">The Super Duper NLP
Repo (SDNLPR)</a>: Collection of Colab notebooks covering a wide array
of NLP task implementations.</li>
</ul>
<p>Blogs and Newsletters</p>
<ul>
<li><a
href="https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">Deep
Learning, NLP, and Representations</a></li>
<li><a href="https://jalammar.github.io/illustrated-bert/">The
Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)</a>
and <a href="https://jalammar.github.io/illustrated-transformer/">The
Illustrated Transformer</a></li>
<li><a href="https://nlpers.blogspot.com/">Natural Language
Processing</a> by Hal Daumé III</li>
<li><a href="https://arxiv.org/pdf/1103.0398.pdf">arXiv: Natural
Language Processing (Almost) from Scratch</a></li>
<li><a
href="https://karpathy.github.io/2015/05/21/rnn-effectiveness">Karpathys
The Unreasonable Effectiveness of Recurrent Neural Networks</a></li>
<li><a
href="https://machinelearningmastery.com/category/natural-language-processing">Machine
Learning Mastery: Deep Learning for Natural Language Processing</a></li>
<li><a href="https://amitness.com/categories/#nlp">Visual NLP Paper
Summaries</a></li>
</ul>
<h3 id="videos-and-online-courses">Videos and Online Courses</h3>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a href="https://people.cs.umass.edu/~miyyer/cs685_f20/">Advanced
Natural Language Processing</a> - CS 685, UMass Amherst CS</li>
<li><a href="https://github.com/oxford-cs-deepnlp-2017/lectures">Deep
Natural Language Processing</a> - Lectures series from Oxford</li>
<li><a href="https://web.stanford.edu/class/cs224n/">Deep Learning for
Natural Language Processing (cs224-n)</a> - Richard Socher and
Christopher Manning's Stanford course</li>
<li><a href="http://phontron.com/class/nn4nlp2017/">Neural Networks for
NLP</a> - from the Language Technologies Institute, Carnegie Mellon University</li>
<li><a href="https://github.com/yandexdataschool/nlp_course">Deep NLP
Course</a> by Yandex Data School, covering important ideas from text
embedding to machine translation including sequence modeling, language
models and so on.</li>
<li><a href="https://www.fast.ai/2019/07/08/fastai-nlp/">fast.ai
Code-First Intro to Natural Language Processing</a> - This covers a
blend of traditional NLP topics (including regex, SVD, naive bayes,
tokenization) and recent neural network approaches (including RNNs,
seq2seq, GRUs, and the Transformer), as well as addressing urgent
ethical issues, such as bias and disinformation. Find the Jupyter
Notebooks <a href="https://github.com/fastai/course-nlp">here</a></li>
<li><a
href="https://www.youtube.com/playlist?list=PL8P_Z6C4GcuWfAq8Pt6PBYlck4OprHXsw">Machine
Learning University - Accelerated Natural Language Processing</a> -
Lectures go from introduction to NLP and text processing to Recurrent
Neural Networks and Transformers. Material can be found <a
href="https://github.com/aws-samples/aws-machine-learning-university-accelerated-nlp">here</a>.</li>
<li><a
href="https://www.youtube.com/playlist?list=PLH-xYrxjfO2WyR3pOAB006CYMhNt4wTqp">Applied
Natural Language Processing</a>- Lecture series from IIT Madras taking
from the basics all the way to autoencoders and everything. The github
notebooks for this course are also available <a
href="https://github.com/Ramaseshanr/anlp">here</a></li>
</ul>
<h3 id="books">Books</h3>
<ul>
<li><a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and
Language Processing</a> - free, by Prof. Dan Jurafsky</li>
<li><a href="https://github.com/jacobeisenstein/gt-nlp-class">Natural
Language Processing</a> - free, NLP notes by Dr. Jacob Eisenstein at
GeorgiaTech</li>
<li><a href="https://github.com/joosthub/PyTorchNLPBook">NLP with
PyTorch</a> - Delip Rao &amp; Brian McMahan</li>
<li><a href="https://www.tidytextmining.com">Text Mining in R</a></li>
<li><a href="https://www.nltk.org/book/">Natural Language Processing
with Python</a></li>
<li><a
href="https://www.oreilly.com/library/view/practical-natural-language/9781492054047/">Practical
Natural Language Processing</a></li>
<li><a
href="https://www.oreilly.com/library/view/natural-language-processing/9781492047759/">Natural
Language Processing with Spark NLP</a></li>
<li><a
href="https://www.manning.com/books/deep-learning-for-natural-language-processing">Deep
Learning for Natural Language Processing</a> by Stephan Raaijmakers</li>
<li><a
href="https://www.manning.com/books/real-world-natural-language-processing">Real-World
Natural Language Processing</a> - by Masato Hagiwara</li>
<li><a
href="https://www.manning.com/books/natural-language-processing-in-action-second-edition">Natural
Language Processing in Action, Second Edition</a> - by Hobson Lane and
Maria Dyshel</li>
</ul>
<h2 id="libraries">Libraries</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a id="node-js"><strong>Node.js and Javascript</strong> - Node.js
Libaries for NLP</a> | <a href="#contents">Back to Top</a>
<ul>
<li><a href="https://github.com/twitter/twitter-text">Twitter-text</a> -
A JavaScript implementation of Twitter's text processing library</li>
<li><a href="https://github.com/benhmoore/Knwl.js">Knwl.js</a> - A
Natural Language Processor in JS</li>
<li><a href="https://github.com/retextjs/retext">Retext</a> - Extensible
system for analyzing and manipulating natural language</li>
<li><a href="https://github.com/spencermountain/compromise">NLP
Compromise</a> - Natural Language processing in the browser</li>
<li><a href="https://github.com/NaturalNode/natural">Natural</a> -
general natural language facilities for node</li>
<li><a href="https://github.com/synyi/poplar">Poplar</a> - A web-based
annotation tool for natural language processing (NLP)</li>
<li><a href="https://github.com/axa-group/nlp.js">NLP.js</a> - An NLP
library for building bots</li>
<li><a
href="https://github.com/huggingface/node-question-answering">node-question-answering</a>
- Fast and production-ready question answering w/ DistilBERT in
Node.js</li>
</ul></li>
<li><a id="python"> <strong>Python</strong> - Python NLP Libraries</a> |
<a href="#contents">Back to Top</a>
<ul>
<li><a
href="https://github.com/sloev/sentimental-onix">sentimental-onix</a>
Sentiment models for spaCy using ONNX</li>
<li><a href="https://github.com/QData/TextAttack">TextAttack</a> -
Adversarial attacks, adversarial training, and data augmentation in
NLP</li>
<li><a href="http://textblob.readthedocs.org/">TextBlob</a> - Providing
a consistent API for diving into common natural language processing
(NLP) tasks. Stands on the giant shoulders of <a
href="https://www.nltk.org/">Natural Language Toolkit (NLTK)</a> and <a
href="https://github.com/clips/pattern">Pattern</a>, and plays nicely
with both :+1:</li>
<li><a href="https://github.com/explosion/spaCy">spaCy</a> - Industrial
strength NLP with Python and Cython :+1:</li>
<li><a
href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster">Speedster</a>
- Automatically apply SOTA optimization techniques to achieve the
maximum inference speed-up on your hardware
<ul>
<li><a href="https://github.com/chartbeat-labs/textacy">textacy</a> -
Higher level NLP built on spaCy</li>
</ul></li>
<li><a href="https://radimrehurek.com/gensim/index.html">gensim</a> -
Python library to conduct unsupervised semantic modelling from plain
text :+1:</li>
<li><a
href="https://github.com/JasonKessler/scattertext">scattertext</a> -
Python library to produce d3 visualizations of how language differs
between corpora</li>
<li><a href="https://github.com/dmlc/gluon-nlp">GluonNLP</a> - A deep
learning toolkit for NLP, built on MXNet/Gluon, for research prototyping
and industrial deployment of state-of-the-art models on a wide range of
NLP tasks.</li>
<li><a href="https://github.com/allenai/allennlp">AllenNLP</a> - An NLP
research library, built on PyTorch, for developing state-of-the-art deep
learning models on a wide variety of linguistic tasks.</li>
<li><a href="https://github.com/PetrochukM/PyTorch-NLP">PyTorch-NLP</a>
- NLP research toolkit designed to support rapid prototyping with better
data loaders, word vector loaders, neural network layer representations,
common NLP metrics such as BLEU</li>
<li><a
href="https://github.com/columbia-applied-data-science/rosetta">Rosetta</a>
- Text processing tools and wrappers (e.g. Vowpal Wabbit)</li>
<li><a href="https://github.com/proycon/pynlpl">PyNLPl</a> - Python
Natural Language Processing Library. General purpose NLP library for
Python, handles some specific formats like ARPA language models, Moses
phrasetables, GIZA++ alignments.</li>
<li><a href="https://github.com/proycon/foliapy">foliapy</a> - Python
library for working with <a
href="https://proycon.github.io/folia/">FoLiA</a>, an XML format for
linguistic annotation.</li>
<li><a href="https://github.com/sergioburdisso/pyss3">PySS3</a> - Python
package that implements a novel white-box machine learning model for
text classification, called SS3. Since SS3 has the ability to visually
explain its rationale, this package also comes with easy-to-use
interactive visualizations tools (<a href="http://tworld.io/ss3/">online
demos</a>).</li>
<li><a href="https://github.com/datquocnguyen/jPTDP">jPTDP</a> - A
toolkit for joint part-of-speech (POS) tagging and dependency parsing.
jPTDP provides pre-trained models for 40+ languages.</li>
<li><a href="https://github.com/bigartm/bigartm">BigARTM</a> - a fast
library for topic modelling</li>
<li><a href="https://github.com/snipsco/snips-nlu">Snips NLU</a> - A
production ready library for intent parsing</li>
<li><a href="https://github.com/chakki-works/chazutsu">Chazutsu</a> - A
library for downloading &amp; parsing standard NLP research datasets</li>
<li><a href="https://github.com/gutfeeling/word_forms">Word Forms</a> -
Word forms can accurately generate all possible forms of an English
word</li>
<li><a
href="https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA">Multilingual
Latent Dirichlet Allocation (LDA)</a> - A multilingual and extensible
document clustering pipeline</li>
<li><a href="https://www.nltk.org/">Natural Language Toolkit (NLTK)</a>
- A library containing a wide variety of NLP functionality, supporting
over 50 corpora.</li>
<li><a href="https://github.com/NervanaSystems/nlp-architect">NLP
Architect</a> - A library for exploring the state-of-the-art deep
learning topologies and techniques for NLP and NLU</li>
<li><a href="https://github.com/zalandoresearch/flair">Flair</a> - A
very simple framework for state-of-the-art multilingual NLP built on
PyTorch. Includes BERT, ELMo and Flair embeddings.</li>
<li><a href="https://github.com/BrikerMan/Kashgari">Kashgari</a> -
Simple, Keras-powered multilingual NLP framework, allows you to build
your models in 5 minutes for named entity recognition (NER),
part-of-speech tagging (PoS) and text classification tasks. Includes
BERT and word2vec embedding.</li>
<li><a href="https://github.com/deepset-ai/FARM">FARM</a> - Fast &amp;
easy transfer learning for NLP. Harvesting language models for the
industry. Focus on Question Answering.</li>
<li><a href="https://github.com/deepset-ai/haystack">Haystack</a> -
End-to-end Python framework for building natural language search
interfaces to data. Leverages Transformers and the State-of-the-Art of
NLP. Supports DPR, Elasticsearch, Hugging Face's Model Hub, and much
more!</li>
<li><a href="https://github.com/zaibacu/rita-dsl">Rita DSL</a> - a DSL,
loosely based on <a href="https://uima.apache.org/ruta.html">RUTA on
Apache UIMA</a>. Lets you define language patterns (rule-based NLP)
which are then translated into <a href="https://spacy.io/">spaCy</a> or,
if you prefer fewer features and something lightweight, regex patterns.</li>
<li><a
href="https://github.com/huggingface/transformers">Transformers</a> -
Natural Language Processing for TensorFlow 2.0 and PyTorch.</li>
<li><a href="https://github.com/huggingface/tokenizers">Tokenizers</a> -
Tokenizers optimized for Research and Production.</li>
<li><a href="https://github.com/pytorch/fairseq">fairSeq</a> Facebook AI
Research implementations of SOTA seq2seq models in Pytorch.</li>
<li><a
href="https://github.com/gregversteeg/corex_topic">corex_topic</a> -
Hierarchical Topic Modeling with Minimal Domain Knowledge</li>
<li><a href="https://github.com/awslabs/sockeye">Sockeye</a> - Neural
Machine Translation (NMT) toolkit that powers Amazon Translate.</li>
<li><a href="https://github.com/xhlulu/dl-translate">DL Translate</a> -
A deep learning-based translation library for 50 languages, built on
<code>transformers</code> and Facebook's mBART-Large.</li>
<li><a href="https://github.com/obss/jury">Jury</a> - Evaluation of NLP
model outputs offering various automated metrics.</li>
<li><a href="https://github.com/proycon/python-ucto">python-ucto</a> -
Unicode-aware regular-expression based tokenizer for various languages.
Python binding to C++ library, supports <a
href="https://proycon.github.io/folia">FoLiA format</a>.</li>
</ul></li>
<li><a id="c++"><strong>C++</strong> - C++ Libraries</a> | <a
href="#contents">Back to Top</a>
<ul>
<li><a href="https://github.com/chncwang/InsNet">InsNet</a> - A neural
network library for building instance-dependent NLP models with
padding-free dynamic batching.</li>
<li><a href="https://github.com/mit-nlp/MITIE">MIT Information
Extraction Toolkit</a> - C, C++, and Python tools for named entity
recognition and relation extraction</li>
<li><a href="https://taku910.github.io/crfpp/">CRF++</a> - Open source
implementation of Conditional Random Fields (CRFs) for
segmenting/labeling sequential data &amp; other Natural Language
Processing tasks.</li>
<li><a href="http://www.chokkan.org/software/crfsuite/">CRFsuite</a> -
CRFsuite is an implementation of Conditional Random Fields (CRFs) for
labeling sequential data.</li>
<li><a href="https://github.com/BLLIP/bllip-parser">BLLIP Parser</a> -
BLLIP Natural Language Parser (also known as the Charniak-Johnson
parser)</li>
<li><a href="https://github.com/proycon/colibri-core">colibri-core</a> -
C++ library, command line tools, and Python binding for extracting and
working with basic linguistic constructions such as n-grams and
skipgrams in a quick and memory-efficient way.</li>
<li><a href="https://github.com/LanguageMachines/ucto">ucto</a> -
Unicode-aware regular-expression based tokenizer for various languages.
Tool and C++ library. Supports FoLiA format.</li>
<li><a href="https://github.com/LanguageMachines/libfolia">libfolia</a>
- C++ library for the <a href="https://proycon.github.io/folia/">FoLiA
format</a></li>
<li><a href="https://github.com/LanguageMachines/frog">frog</a> -
Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser,
dependency parser, NER, shallow parser, morphological analyzer.</li>
<li><a href="https://github.com/meta-toolkit/meta">MeTA</a> - <a
href="https://meta-toolkit.org/">MeTA : ModErn Text Analysis</a> is a
C++ Data Sciences Toolkit that facilitates mining big text data.</li>
<li><a href="https://taku910.github.io/mecab/">Mecab (Japanese)</a></li>
<li><a href="http://statmt.org/moses/">Moses</a></li>
<li><a
href="https://github.com/facebookresearch/StarSpace">StarSpace</a> - a
library from Facebook for creating embeddings of word-level,
paragraph-level, document-level and for text classification</li>
</ul></li>
<li><a id="java"><strong>Java</strong> - Java NLP Libraries</a> | <a
href="#contents">Back to Top</a>
<ul>
<li><a href="https://nlp.stanford.edu/software/index.shtml">Stanford
NLP</a></li>
<li><a href="https://opennlp.apache.org/">OpenNLP</a></li>
<li><a href="https://emorynlp.github.io/nlp4j/">NLP4J</a></li>
<li><a
href="https://deeplearning4j.org/docs/latest/deeplearning4j-nlp-word2vec">Word2vec
in Java</a></li>
<li><a href="https://github.com/knowitall/reverb/">ReVerb</a> Web-Scale
Open Information Extraction</li>
<li><a href="https://github.com/knowitall/openregex">OpenRegex</a> An
efficient and flexible token-based regular expression language and
engine.</li>
<li><a href="https://github.com/CogComp/cogcomp-nlp">CogcompNLP</a> -
Core libraries developed by the University of Illinois Cognitive Computation
Group.</li>
<li><a href="http://mallet.cs.umass.edu/">MALLET</a> - MAchine Learning
for LanguagE Toolkit - package for statistical natural language
processing, document classification, clustering, topic modeling,
information extraction, and other machine learning applications to
text.</li>
<li><a
href="https://github.com/datquocnguyen/RDRPOSTagger">RDRPOSTagger</a> -
A robust POS tagging toolkit available (in both Java &amp; Python)
together with pre-trained models for 40+ languages.</li>
</ul></li>
<li><a id="kotlin"><strong>Kotlin</strong> - Kotlin NLP Libraries</a> |
<a href="#contents">Back to Top</a>
<ul>
<li><a href="https://github.com/pemistahl/lingua/">Lingua</a> A language
detection library for Kotlin and Java, suitable for long and short text
alike</li>
<li><a href="https://github.com/meiblorn/kotidgy">Kotidgy</a> — an
index-based text data generator written in Kotlin</li>
</ul></li>
<li><a id="scala"><strong>Scala</strong> - Scala NLP Libraries</a> | <a
href="#contents">Back to Top</a>
<ul>
<li><a href="https://github.com/CogComp/saul">Saul</a> - Library for
developing NLP systems, including built in modules like SRL, POS,
etc.</li>
<li><a href="https://github.com/ispras/atr4s">ATR4S</a> - Toolkit with
state-of-the-art <a
href="https://en.wikipedia.org/wiki/Terminology_extraction">automatic
term recognition</a> methods.</li>
<li><a href="https://github.com/ispras/tm">tm</a> - Implementation of
topic modeling based on regularized multilingual <a
href="https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis">PLSA</a>.</li>
<li><a
href="https://github.com/Refefer/word2vec-scala">word2vec-scala</a> -
Scala interface to word2vec model; includes operations on vectors like
word-distance and word-analogy.</li>
<li><a href="https://github.com/dlwh/epic">Epic</a> - Epic is a high
performance statistical parser written in Scala, along with a framework
for building complex structured prediction models.</li>
<li><a href="https://github.com/JohnSnowLabs/spark-nlp">Spark NLP</a> -
Spark NLP is a natural language processing library built on top of
Apache Spark ML that provides simple, performant &amp; accurate NLP
annotations for machine learning pipelines that scale easily in a
distributed environment.</li>
</ul></li>
<li><a id="R"><strong>R</strong> - R NLP Libraries</a> | <a
href="#contents">Back to Top</a>
<ul>
<li><a href="https://github.com/dselivanov/text2vec">text2vec</a> - Fast
vectorization, topic modeling, distances and GloVe word embeddings in
R.</li>
<li><a href="https://github.com/bmschmidt/wordVectors">wordVectors</a> -
An R package for creating and exploring word2vec and other word
embedding models</li>
<li><a href="https://github.com/mimno/RMallet">RMallet</a> - R package
to interface with the Java machine learning tool MALLET</li>
<li><a href="https://github.com/agoldst/dfr-browser">dfr-browser</a> -
Creates d3 visualizations for browsing topic models of text in a web
browser.</li>
<li><a href="https://github.com/agoldst/dfrtopics">dfrtopics</a> - R
package for exploring topic models of text.</li>
<li><a
href="https://github.com/kevincobain2000/sentiment_classifier">sentiment_classifier</a>
- Sentiment Classification using Word Sense Disambiguation and WordNet
Reader</li>
<li><a
href="https://github.com/kevincobain2000/jProcessing">jProcessing</a> -
Japanese Natural Language Processing Libraries, with Japanese sentiment
classification</li>
<li><a
href="https://kgjerde.github.io/corporaexplorer/">corporaexplorer</a> -
An R package for dynamic exploration of text collections</li>
<li><a href="https://github.com/juliasilge/tidytext">tidytext</a> - Text
mining using tidy tools</li>
<li><a href="https://github.com/quanteda/spacyr">spacyr</a> - R wrapper
to spaCy NLP</li>
<li><a
href="https://github.com/cran-task-views/NaturalLanguageProcessing/">CRAN
Task View: Natural Language Processing</a></li>
</ul></li>
<li><a id="clojure"><strong>Clojure</strong></a> | <a
href="#contents">Back to Top</a>
<ul>
<li><a
href="https://github.com/dakrone/clojure-opennlp">Clojure-openNLP</a> -
Natural Language Processing in Clojure (opennlp)</li>
<li><a
href="https://github.com/r0man/inflections-clj">Infections-clj</a> -
Rails-like inflection library for Clojure and ClojureScript</li>
<li><a href="https://github.com/fekr/postagga">postagga</a> - A library
to parse natural language in Clojure and ClojureScript</li>
</ul></li>
<li><a id="ruby"><strong>Ruby</strong></a> | <a href="#contents">Back to
Top</a>
<ul>
<li>Kevin Dias's <a href="https://github.com/diasks2/ruby-nlp">A
collection of Natural Language Processing (NLP) Ruby libraries, tools
and software</a></li>
<li><a href="https://github.com/arbox/nlp-with-ruby">Practical Natural
Language Processing done in Ruby</a></li>
</ul></li>
<li><a id="rust"><strong>Rust</strong></a> | <a href="#contents">Back to
Top</a>
<ul>
<li><a href="https://github.com/greyblake/whatlang-rs">whatlang</a>
Natural language recognition library based on trigrams</li>
<li><a href="https://github.com/snipsco/snips-nlu-rs">snips-nlu-rs</a> -
A production ready library for intent parsing</li>
<li><a href="https://github.com/guillaume-be/rust-bert">rust-bert</a> -
Ready-to-use NLP pipelines and Transformer-based models</li>
</ul></li>
<li><a id="NLP++"><strong>NLP++</strong> - NLP++ Language</a> | <a
href="#contents">Back to Top</a>
<ul>
<li><a
href="https://marketplace.visualstudio.com/items?itemName=dehilster.nlp">VSCode
Language Extension</a> - NLP++ Language Extension for VSCode</li>
<li><a href="https://github.com/VisualText/nlp-engine">nlp-engine</a> -
NLP++ engine to run NLP++ code on Linux including a full English
parser</li>
<li><a href="http://visualtext.org">VisualText</a> - Homepage for the
NLP++ Language</li>
<li><a
href="http://wiki.naturalphilosophy.org/index.php?title=NLP%2B%2B">NLP++
Wiki</a> - Wiki entry for the NLP++ language</li>
</ul></li>
<li><a id="julia"><strong>Julia</strong></a> | <a href="#contents">Back
to Top</a>
<ul>
<li><a
href="https://github.com/JuliaText/CorpusLoaders.jl">CorpusLoaders</a> -
A variety of loaders for various NLP corpora</li>
<li><a href="https://github.com/JuliaText/Languages.jl">Languages</a> -
A package for working with human languages</li>
<li><a
href="https://github.com/JuliaText/TextAnalysis.jl">TextAnalysis</a> -
Julia package for text analysis</li>
<li><a href="https://github.com/JuliaText/TextModels.jl">TextModels</a>
- Neural Network based models for Natural Language Processing</li>
<li><a
href="https://github.com/JuliaText/WordTokenizers.jl">WordTokenizers</a>
- High performance tokenizers for natural language processing and other
related tasks</li>
<li><a href="https://github.com/JuliaText/Word2Vec.jl">Word2Vec</a> -
Julia interface to word2vec</li>
</ul></li>
</ul>
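<p>As a quick orientation to the Python libraries above, here is a
minimal, hedged usage sketch combining spaCy and Hugging Face
Transformers. It assumes the packages are installed and the small
English spaCy model has been downloaded with
<code>python -m spacy download en_core_web_sm</code>; the example text
and the default sentiment model are illustrative choices, not
recommendations.</p>
<pre><code># Minimal sketch: tokenization, POS tags and entities with spaCy,
# plus a default sentiment-analysis pipeline from Transformers.
# Assumes: pip install spacy transformers, and
#          python -m spacy download en_core_web_sm
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace wrote the first program in London.")

for token in doc:            # tokens with part-of-speech tags
    print(token.text, token.pos_)

for ent in doc.ents:         # named entities
    print(ent.text, ent.label_)

# The library picks a default sentiment model; the first run downloads it.
classifier = pipeline("sentiment-analysis")
print(classifier("This curated list is genuinely useful."))</code></pre>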
<h3 id="services">Services</h3>
<p>NLP as a service (APIs) with higher-level functionality such as NER,
topic tagging and so on (a minimal example request follows the list) |
<a href="#contents">Back to Top</a></p>
<ul>
<li><a href="https://github.com/wit-ai/wit">Wit-ai</a> - Natural
Language Interface for apps and devices</li>
<li><a
href="https://github.com/watson-developer-cloud/natural-language-understanding-nodejs">IBM
Watson's Natural Language Understanding</a> - API and GitHub demo</li>
<li><a href="https://aws.amazon.com/comprehend/">Amazon Comprehend</a> -
NLP and ML suite covers most common tasks like NER, tagging, and
sentiment analysis</li>
<li><a href="https://cloud.google.com/natural-language/">Google Cloud
Natural Language API</a> - Syntax Analysis, NER, Sentiment Analysis, and
Content tagging in at least 9 languages including English and Chinese
(Simplified and Traditional).</li>
<li><a
href="https://www.paralleldots.com/text-analysis-apis">ParallelDots</a>
- High level Text Analysis API Service ranging from Sentiment Analysis
to Intent Analysis</li>
<li><a
href="https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/">Microsoft
Cognitive Service</a></li>
<li><a href="https://www.textrazor.com/">TextRazor</a></li>
<li><a href="https://www.rosette.com/">Rosette</a></li>
<li><a href="https://www.textalytic.com">Textalytic</a> - Natural
Language Processing in the Browser with sentiment analysis, named entity
extraction, POS tagging, word frequencies, topic modeling, word clouds,
and more</li>
<li><a href="https://nlpcloud.io">NLP Cloud</a> - SpaCy NLP models
(custom and pre-trained ones) served through a RESTful API for named
entity recognition (NER), POS tagging, and more.</li>
<li><a href="https://cloudmersive.com/nlp-api">Cloudmersive</a> -
Unified and free NLP APIs that perform actions such as speech tagging,
text rephrasing, language translation/detection, and sentence
parsing</li>
</ul>
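<p>Most of the services above are ordinary HTTPS/JSON APIs. As an
illustration only, the sketch below calls the Google Cloud Natural
Language API's v1 entity-analysis endpoint with the Python
<code>requests</code> library; the API key placeholder is yours to
supply, and the exact request and response fields should be checked
against the provider's current documentation, since they may differ by
version.</p>
<pre><code># Hedged sketch of an NLP-as-a-service call (Google Cloud Natural Language,
# v1 documents:analyzeEntities). Replace YOUR_API_KEY with a real key.
import requests

url = "https://language.googleapis.com/v1/documents:analyzeEntities"
payload = {
    "document": {"type": "PLAIN_TEXT",
                 "content": "Marie Curie was born in Warsaw."},
    "encodingType": "UTF8",
}
resp = requests.post(url, params={"key": "YOUR_API_KEY"}, json=payload,
                     timeout=30)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["name"], entity["type"])</code></pre>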
<h3 id="annotation-tools">Annotation Tools</h3>
<ul>
<li><a href="https://gate.ac.uk/overview.html">GATE</a> - General
Architecture and Text Engineering is 15+ years old, free and open
source</li>
<li><a href="https://github.com/weitechen/anafora">Anafora</a> is free
and open source, web-based raw text annotation tool</li>
<li><a href="https://brat.nlplab.org/">brat</a> - brat rapid annotation
tool is an online environment for collaborative text annotation</li>
<li><a href="https://github.com/chakki-works/doccano">doccano</a> -
doccano is free, open-source, and provides annotation features for text
classification, sequence labeling and sequence to sequence</li>
<li><a href="https://inception-project.github.io">INCEpTION</a> - A
semantic annotation platform offering intelligent assistance and
knowledge management</li>
<li><a href="https://www.tagtog.net/">tagtog</a>, team-first web tool to
find, create, maintain, and share datasets - costs $</li>
<li><a href="https://prodi.gy/">prodigy</a> is an annotation tool
powered by active learning, costs $</li>
<li><a href="https://lighttag.io">LightTag</a> - Hosted and managed text
annotation tool for teams, costs $</li>
<li><a
href="https://corpling.uis.georgetown.edu/rstweb/info/">rstWeb</a> -
open source local or online tool for discourse tree annotations</li>
<li><a href="https://corpling.uis.georgetown.edu/gitdox/">GitDox</a> -
open source server annotation tool with GitHub version control and
validation for XML data and collaborative spreadsheet grids</li>
<li><a href="https://www.heartex.ai/">Label Studio</a> - Hosted and
managed text annotation tool for teams, freemium based, costs $</li>
<li><a href="https://datasaur.ai/">Datasaur</a> support various NLP
tasks for individual or teams, freemium based</li>
<li><a href="https://konfuzio.com/en/">Konfuzio</a> - team-first hosted
and on-prem text, image and PDF annotation tool powered by active
learning, freemium based, costs $</li>
<li><a href="https://ubiai.tools/">UBIAI</a> - Easy-to-use text
annotation tool for teams with most comprehensive auto-annotation
features. Supports NER, relations and document classification as well as
OCR annotation for invoice labeling, costs $</li>
<li><a href="https://github.com/AI4Bharat/Shoonya-Backend">Shoonya</a> -
Shoonya is free and open source data annotation platform with wide
varities of organization and workspace level management system. Shoonya
is data agnostic, can be used by teams to annotate data with various
level of verification stages at scale.</li>
<li><a href="https://www.johnsnowlabs.com/annotation-lab/">Annotation
Lab</a> - Free End-to-End No-Code platform for text annotation and DL
model training/tuning. Out-of-the-box support for Named Entity
Recognition, Classification, Relation extraction and Assertion Status
Spark NLP models. Unlimited support for users, teams, projects,
documents. Not FOSS.</li>
<li><a href="https://github.com/proycon/flat">FLAT</a> - FLAT is a
web-based linguistic annotation environment based around the <a
href="http://proycon.github.io/folia">FoLiA format</a>, a rich XML-based
format for linguistic annotation. Free and open source.</li>
</ul>
<h2 id="techniques">Techniques</h2>
<h3 id="text-embeddings">Text Embeddings</h3>
<h4 id="word-embeddings">Word Embeddings</h4>
<ul>
<li><p>Thumb Rule: <strong>fastText &gt;&gt; GloVe &gt;
word2vec</strong> (see the sketch after this list)</p></li>
<li><p><a
href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">word2vec</a>
- <a
href="https://code.google.com/archive/p/word2vec/">implementation</a> -
<a
href="http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">explainer
blog</a></p></li>
<li><p><a href="https://nlp.stanford.edu/pubs/glove.pdf">glove</a> - <a
href="https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/">explainer
blog</a></p></li>
<li><p>fasttext - <a
href="https://github.com/facebookresearch/fastText">implementation</a> -
<a href="https://arxiv.org/abs/1607.04606">paper</a> - <a
href="https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3">explainer
blog</a></p></li>
</ul>
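<p>A minimal sketch of the similarity queries these embeddings support,
using pretrained GloVe vectors fetched through gensim's downloader (the
<code>glove-wiki-gigaword-100</code> name is one of the models
distributed via the gensim-data repository listed under Datasets); a
fastText or word2vec model can be swapped in the same way.</p>
<pre><code># Nearest-neighbour and analogy queries on pretrained word vectors.
# Assumes: pip install gensim; the first call downloads roughly 130 MB.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")    # 100-d GloVe vectors
print(vectors.most_similar("language", topn=5))  # nearest neighbours
print(vectors.similarity("king", "queen"))       # cosine similarity
# The classic analogy: king - man + woman should land near queen.
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))</code></pre>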
<h4 id="sentence-and-language-model-based-word-embeddings">Sentence and
Language Model Based Word Embeddings</h4>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li>ElMo - <a href="https://arxiv.org/abs/1802.05365">Deep
Contextualized Word Representations</a> - <a
href="https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md">PyTorch
implmentation</a> - <a href="https://github.com/allenai/bilm-tf">TF
Implementation</a></li>
<li>ULMFiT - <a href="https://arxiv.org/abs/1801.06146">Universal
Language Model Fine-tuning for Text Classification</a> by Jeremy Howard
and Sebastian Ruder</li>
<li>InferSent - <a href="https://arxiv.org/abs/1705.02364">Supervised
Learning of Universal Sentence Representations from Natural Language
Inference Data</a> by Facebook</li>
<li>CoVe - <a href="https://arxiv.org/abs/1708.00107">Learned in
Translation: Contextualized Word Vectors</a></li>
<li>Paragraph vectors - from <a
href="https://cs.stanford.edu/~quocle/paragraph_vector.pdf">Distributed
Representations of Sentences and Documents</a>. See <a
href="https://rare-technologies.com/doc2vec-tutorial/">doc2vec tutorial
at gensim</a></li>
<li><a href="https://arxiv.org/abs/1511.06388">sense2vec</a> - on word
sense disambiguation</li>
<li><a href="https://arxiv.org/abs/1506.06726">Skip Thought Vectors</a>
- word representation method</li>
<li><a href="https://arxiv.org/abs/1502.07257">Adaptive skip-gram</a> -
similar approach, with adaptive properties</li>
<li><a
href="https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf">Sequence
to Sequence Learning</a> - word vectors for machine translation</li>
</ul>
<h3 id="question-answering-and-knowledge-extraction">Question Answering
and Knowledge Extraction</h3>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a href="https://github.com/facebookresearch/DrQA">DrQA</a> - Open
Domain Question Answering work by Facebook Research on Wikipedia
data</li>
<li><a href="https://github.com/allenai/document-qa">Document-QA</a> -
Simple and Effective Multi-Paragraph Reading Comprehension by
AllenAI</li>
<li><a
href="https://www.usna.edu/Users/cs/nchamber/pubs/acl2011-chambers-templates.pdf">Template-Based
Information Extraction without the Templates</a></li>
<li><a
href="https://www.sebastianzimmeck.de/zimmeckAndBellovin2014Privee.pdf">Privee:
An Architecture for Automatically Analyzing Web Privacy
Policies</a></li>
</ul>
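<p>To get a feel for what the reading-comprehension systems listed above
do, the hedged sketch below runs extractive question answering with a
Hugging Face Transformers pipeline (listed in the Python libraries
section); the default model it downloads is chosen by the library and is
an assumption here, not part of DrQA or Document-QA.</p>
<pre><code># Minimal extractive QA sketch with the Transformers question-answering
# pipeline. Assumes: pip install transformers plus a PyTorch or TensorFlow
# backend; the first run downloads a default SQuAD-finetuned model.
from transformers import pipeline

qa = pipeline("question-answering")
context = ("The Penn Treebank was created at the University of Pennsylvania "
           "and is widely used for training part-of-speech taggers and parsers.")
result = qa(question="Where was the Penn Treebank created?", context=context)
print(result["answer"], result["score"])</code></pre>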
<h2 id="datasets">Datasets</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a href="https://github.com/niderhoff/nlp-datasets">nlp-datasets</a>
great collection of nlp datasets</li>
<li><a
href="https://github.com/RaRe-Technologies/gensim-data">gensim-data</a>
- Data repository for pretrained NLP models and NLP corpora.</li>
</ul>
<h2 id="multilingual-nlp-frameworks">Multilingual NLP Frameworks</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a href="https://github.com/ufal/udpipe">UDPipe</a> is a trainable
pipeline for tokenizing, tagging, lemmatizing and parsing Universal
Treebanks and other CoNLL-U files. Primarily written in C++, it offers a
fast and reliable solution for multilingual NLP processing.</li>
<li><a href="https://github.com/adobe/NLP-Cube">NLP-Cube</a> : Natural
Language Processing Pipeline - Sentence Splitting, Tokenization,
Lemmatization, Part-of-speech Tagging and Dependency Parsing. New
platform, written in Python with Dynet 2.0. Offers standalone
(CLI/Python bindings) and server functionality (REST API).</li>
<li><a href="https://github.com/mikahama/uralicNLP">UralicNLP</a> is an
NLP library mostly for many endangered Uralic languages such as Sami
languages, Mordvin languages, Mari languages, Komi languages and so on.
Also some non-endangered languages are supported such as Finnish
together with non-Uralic languages such as Swedish and Arabic. UralicNLP
can do morphological analysis, generation, lemmatization and
disambiguation.</li>
</ul>
<h2 id="nlp-in-korean">NLP in Korean</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="libraries">Libraries</h3>
<ul>
<li><a href="http://konlpy.org">KoNLPy</a> - Python package for Korean
natural language processing.</li>
<li><a href="https://eunjeon.blogspot.com/">Mecab (Korean)</a> - C++
library for Korean NLP</li>
<li><a href="https://koalanlp.github.io/koalanlp/">KoalaNLP</a> - Scala
library for Korean Natural Language Processing.</li>
<li><a href="https://cran.r-project.org/package=KoNLP">KoNLP</a> - R
package for Korean Natural language processing</li>
</ul>
<h3 id="blogs-and-tutorials">Blogs and Tutorials</h3>
<ul>
<li><a href="https://dsindex.github.io/">dsindexs blog</a></li>
<li><a href="http://cs.kangwon.ac.kr/~leeck/NLP/">Kangwon Universitys
NLP course in Korean</a></li>
</ul>
<h3 id="datasets-1">Datasets</h3>
<ul>
<li><a
href="http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus">KAIST
Corpus</a> - A corpus from the Korea Advanced Institute of Science and
Technology in Korean.</li>
<li><a href="https://github.com/e9t/nsmc/">Naver Sentiment Movie Corpus
in Korean</a></li>
<li><a href="http://srchdb1.chosun.com/pdf/i_archive/">Chosun Ilbo
archive</a> - dataset in Korean from one of the major newspapers in
South Korea, the Chosun Ilbo.</li>
<li><a href="https://github.com/songys/Chatbot_data">Chat data</a> -
Chatbot data in Korean</li>
<li><a href="https://github.com/akngs/petitions">Petitions</a> - Collect
expired petition data from the Blue House National Petition Site.</li>
<li><a href="https://github.com/j-min/korean-parallel-corpora">Korean
Parallel corpora</a> - Neural Machine Translation (NMT) Dataset for
<strong>Korean to French</strong> &amp; <strong>Korean to
English</strong></li>
<li><a href="https://korquad.github.io/">KorQuAD</a> - Korean SQuAD
dataset with Wiki HTML source. Mentions both v1.0 and v2.1 at the time
of adding to Awesome NLP</li>
</ul>
<h2 id="nlp-in-arabic">NLP in Arabic</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="libraries-1">Libraries</h3>
<ul>
<li><a href="https://github.com/01walid/goarabic">goarabic</a> - Go
package for Arabic text processing</li>
<li><a href="https://github.com/ejtaal/jsastem">jsastem</a> - Javascript
for Arabic stemming</li>
<li><a href="https://pypi.org/project/PyArabic/">PyArabic</a> - Python
libraries for Arabic</li>
<li><a href="https://github.com/amir-zeldes/RFTokenizer">RFTokenizer</a>
- trainable Python segmenter for Arabic, Hebrew and Coptic</li>
</ul>
<h3 id="datasets-2">Datasets</h3>
<ul>
<li><a
href="https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces">Multidomain
Datasets</a> - Largest Available Multi-Domain Resources for Arabic
Sentiment Analysis</li>
<li><a href="https://github.com/mohamedadaly/labr">LABR</a> - LArge
Arabic Book Reviews dataset</li>
<li><a href="https://github.com/mohataher/arabic-stop-words">Arabic
Stopwords</a> - A list of Arabic stopwords from various resources</li>
</ul>
<h2 id="nlp-in-chinese">NLP in Chinese</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="libraries-2">Libraries</h3>
<ul>
<li><a href="https://github.com/fxsjy/jieba#jieba-1">jieba</a> - Python
package for Words Segmentation Utilities in Chinese</li>
<li><a href="https://github.com/isnowfy/snownlp">SnowNLP</a> - Python
package for Chinese NLP</li>
<li><a href="https://github.com/FudanNLP/fnlp">FudanNLP</a> - Java
library for Chinese text processing</li>
<li><a href="https://github.com/hankcs/HanLP">HanLP</a> - The
multilingual NLP library</li>
</ul>
<h3 id="anthology">Anthology</h3>
<ul>
<li><a href="https://github.com/fighting41love/funNLP">funNLP</a> -
Collection of NLP tools and resources mainly for Chinese</li>
</ul>
<h2 id="nlp-in-german">NLP in German</h2>
<ul>
<li><a href="https://github.com/adbar/German-NLP">German-NLP</a> -
Curated list of open-access/open-source/off-the-shelf resources and
tools developed with a particular focus on German</li>
</ul>
<h2 id="nlp-in-polish">NLP in Polish</h2>
<ul>
<li><a
href="https://github.com/ksopyla/awesome-nlp-polish">Polish-NLP</a> - A
curated list of resources dedicated to Natural Language Processing (NLP)
in Polish: models, tools, and datasets.</li>
</ul>
<h2 id="nlp-in-spanish">NLP in Spanish</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="libraries-3">Libraries</h3>
<ul>
<li><a href="https://github.com/jfreddypuentes/spanlp">spanlp</a> -
Python library to detect, censor and clean profanity, vulgarities,
hateful words, racism, xenophobia and bullying in texts written in
Spanish. It contains data from 21 Spanish-speaking countries.</li>
</ul>
<h3 id="data">Data</h3>
<ul>
<li><a
href="https://github.com/dav009/LatinamericanTextResources">Columbian
Political Speeches</a></li>
<li><a
href="https://mbkromann.github.io/copenhagen-dependency-treebank/">Copenhagen
Treebank</a></li>
<li><a href="https://github.com/crscardellino/sbwce">Spanish Billion
words corpus with Word2Vec embeddings</a></li>
<li><a
href="https://github.com/josecannete/spanish-unannotated-corpora">Compilation
of Spanish Unannotated Corpora</a></li>
</ul>
<h3 id="word-and-sentence-embeddings">Word and Sentence Embeddings</h3>
<ul>
<li><a
href="https://github.com/dccuchile/spanish-word-embeddings">Spanish Word
Embeddings Computed with Different Methods and from Different
Corpora</a></li>
<li><a href="https://github.com/BotCenter/spanishWordEmbeddings">Spanish
Word Embeddings Computed from Large Corpora and Different Sizes Using
fastText</a></li>
<li><a href="https://github.com/BotCenter/spanishSent2Vec">Spanish
Sentence Embeddings Computed from Large Corpora Using sent2vec</a></li>
<li><a href="https://github.com/dccuchile/beto">Beto - BERT for
Spanish</a></li>
</ul>
<h2 id="nlp-in-indic-languages">NLP in Indic languages</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="data-corpora-and-treebanks">Data, Corpora and Treebanks</h3>
<ul>
<li><a href="https://ltrc.iiit.ac.in/treebank_H2014/">Hindi Dependency
Treebank</a> - A multi-representational multi-layered treebank for Hindi
and Urdu</li>
<li><a
href="https://universaldependencies.org/treebanks/hi_hdtb/index.html">Universal
Dependencies Treebank in Hindi</a>
<ul>
<li><a
href="http://universaldependencies.org/treebanks/hi_pud/index.html">Parallel
Universal Dependencies Treebank in Hindi</a> - A smaller part of the
above-mentioned treebank.</li>
</ul></li>
<li><a href="https://www.isical.ac.in/~fire/data/">ISI FIRE Stopwords
List (Hindi and Bangla)</a></li>
<li><a href="https://github.com/6/stopwords-json">Peter Grahams
Stopwords List</a></li>
<li><a href="https://www.nltk.org/book/ch02.html">NLTK Corpus</a> 60k
Words POS Tagged, Bangla, Hindi, Marathi, Telugu</li>
<li><a href="https://github.com/goru001/nlp-for-hindi">Hindi Movie
Reviews Dataset</a> ~1k Samples, 3 polarity classes</li>
<li><a
href="https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1">BBC
News Hindi Dataset</a> 4.3k Samples, 14 classes</li>
<li><a href="https://github.com/pnisarg/ABSA">IIT Patna Hindi ABSA
Dataset</a> 5.4k Samples, 12 Domains, 4k aspect terms, aspect and
sentence level polarity in 4 classes</li>
<li><a href="https://github.com/AtikRahman/Bangla_Datasets_ABSA">Bangla
ABSA</a> 5.5k Samples, 2 Domains, 10 aspect terms</li>
<li><a href="https://www.iitp.ac.in/~ai-nlp-ml/resources.html">IIT Patna
Movie Review Sentiment Dataset</a> 2k Samples, 3 polarity labels</li>
</ul>
<h4
id="corporadatasets-that-need-a-loginaccess-can-be-gained-via-email">Corpora/Datasets
that need a login (access can be gained via email)</h4>
<ul>
<li><a href="http://amitavadas.com/SAIL/">SAIL 2015</a> Twitter and
Facebook labelled sentiment samples in Hindi, Bengali, Tamil,
Telugu.</li>
<li><a
href="http://www.cfilt.iitb.ac.in/Sentiment_Analysis_Resources.html">IIT
Bombay NLP Resources</a> Sentiwordnet, Movie and Tourism parallel
labelled corpora, polarity labelled sense annotated corpus, Marathi
polarity labelled corpus.</li>
<li><a
href="https://tdil-dc.in/index.php?option=com_catalogue&amp;task=viewTools&amp;id=83&amp;lang=en">TDIL-IC
aggregates a lot of useful resources and provides access to otherwise
gated datasets</a></li>
</ul>
<h3 id="language-models-and-word-embeddings">Language Models and Word
Embeddings</h3>
<ul>
<li><a href="https://nirantk.com/hindi2vec/">Hindi2Vec</a> and <a
href="https://github.com/goru001/nlp-for-hindi">nlp-for-hindi</a> ULMFIT
style languge model</li>
<li><a href="https://www.iitp.ac.in/~ai-nlp-ml/resources.html">IIT Patna
Bilingual Word Embeddings Hi-En</a></li>
<li><a href="https://fasttext.cc/docs/en/crawl-vectors.html">Fasttext
word embeddings in a whole bunch of languages, trained on Common
Crawl</a></li>
<li><a href="https://github.com/Kyubyong/wordvectors">Hindi and Bengali
Word2Vec</a></li>
<li><a href="https://github.com/HIT-SCIR/ELMoForManyLangs">Hindi and
Urdu Elmo Model</a></li>
<li><a
href="https://huggingface.co/surajp/albert-base-sanskrit">Sanskrit
Albert</a> Trained on Sanskrit Wikipedia and OSCAR corpus</li>
</ul>
<h3 id="libraries-and-tooling">Libraries and Tooling</h3>
<ul>
<li><a href="https://github.com/Saurav0074/mt-dma">Multi-Task Deep
Morphological Analyzer</a> Deep Network based Morphological Parser for
Hindi and Urdu</li>
<li><a
href="https://github.com/anoopkunchukuttan/indic_nlp_library">Indic NLP
Library</a> by Anoop Kunchukuttan - 18 languages, a whole host of
features from tokenization to translation</li>
<li><a href="http://sivareddy.in/downloads">SivaReddys Dependency
Parser</a> Dependency Parser and Pos Tagger for Kannada, Hindi and
Telugu. <a
href="https://github.com/CalmDownKarm/sivareddydependencyparser">Python3
Port</a></li>
<li><a href="https://github.com/goru001/inltk">iNLTK</a> - A Natural
Language Toolkit for Indic Languages (Indian subcontinent languages)
built on top of Pytorch/Fastai, which aims to provide out of the box
support for common NLP tasks.</li>
</ul>
<h2 id="nlp-in-thai">NLP in Thai</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="libraries-4">Libraries</h3>
<ul>
<li><a href="https://github.com/PyThaiNLP/pythainlp">PyThaiNLP</a> -
Thai NLP in Python Package</li>
<li><a href="https://github.com/wittawatj/jtcc">JTCC</a> - A character
cluster library in Java</li>
<li><a href="https://github.com/pucktada/cutkum">CutKum</a> - Word
segmentation with deep learning in TensorFlow</li>
<li><a href="https://pypi.python.org/pypi/tltk/">Thai Language
Toolkit</a> - Based on a paper by Wirote Aroonmanakun in 2002, with an
included dataset</li>
<li><a href="https://github.com/KenjiroAI/SynThai">SynThai</a> - Word
segmentation and POS tagging using deep learning in Python</li>
</ul>
<h3 id="data-1">Data</h3>
<ul>
<li><a
href="https://www.nectec.or.th/corpus/index.php?league=pm">Inter-BEST</a>
- A text corpus with 5 million words with word segmentation</li>
<li><a
href="https://github.com/PyThaiNLP/lexicon-thai/tree/master/thai-corpus/Prime%20Minister%2029">Prime
Minister 29</a> - Dataset containing speeches of the current Prime
Minister of Thailand</li>
</ul>
<h2 id="nlp-in-danish">NLP in Danish</h2>
<ul>
<li><a href="https://github.com/ITUnlp/daner">Named Entity Recognition
for Danish</a></li>
<li><a href="https://github.com/alexandrainst/danlp">DaNLP</a> - NLP
resources in Danish</li>
<li><a href="https://github.com/fnielsen/awesome-danish">Awesome
Danish</a> - A curated list of awesome resources for Danish language
technology</li>
</ul>
<h2 id="nlp-in-vietnamese">NLP in Vietnamese</h2>
<h3 id="libraries-5">Libraries</h3>
<ul>
<li><a
href="https://github.com/undertheseanlp/underthesea">underthesea</a> -
Vietnamese NLP Toolkit</li>
<li><a href="https://github.com/phuonglh/vn.vitk">vn.vitk</a> - A
Vietnamese Text Processing Toolkit</li>
<li><a href="https://github.com/vncorenlp/VnCoreNLP">VnCoreNLP</a> - A
Vietnamese natural language processing toolkit</li>
<li><a href="https://github.com/VinAIResearch/PhoBERT">PhoBERT</a> -
Pre-trained language models for Vietnamese</li>
<li><a href="https://github.com/trungtv/pyvi">pyvi</a> - Python
Vietnamese Core NLP Toolkit</li>
</ul>
<h3 id="data-2">Data</h3>
<ul>
<li><a
href="https://vlsp.hpda.vn/demo/?page=resources&amp;lang=en">Vietnamese
treebank</a> - 10,000 sentences for the constituency parsing task</li>
<li><a href="https://arxiv.org/pdf/1710.05519.pdf">BKTreeBank</a> - a
Vietnamese Dependency Treebank</li>
<li><a
href="https://github.com/UniversalDependencies/UD_Vietnamese-VTB">UD_Vietnamese</a>
- Vietnamese Universal Dependency Treebank</li>
<li><a href="https://ailab.hcmus.edu.vn/vivos/">VIVOS</a> - a free
Vietnamese speech corpus consisting of 15 hours of recorded speech, by
AILab</li>
<li><a
href="http://viet.jnlp.org/download-du-lieu-tu-vung-corpus">VNTQcorpus(big).txt</a>
- 1.75 million sentences in news</li>
<li><a href="https://github.com/VinAIResearch/ViText2SQL">ViText2SQL</a>
- A dataset for Vietnamese Text-to-SQL semantic parsing (EMNLP-2020
Findings)</li>
<li><a href="https://github.com/qhungngo/EVBCorpus">EVB Corpus</a> -
20 million words from 15 bilingual books, 100 parallel
English-Vietnamese / Vietnamese-English texts, 250 parallel law and
ordinance texts, 5,000 news articles, and 2,000 film subtitles.</li>
</ul>
<h2 id="nlp-for-dutch">NLP for Dutch</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a href="https://github.com/proycon/python-frog">python-frog</a> -
Python binding to Frog, an NLP suite for Dutch (PoS tagging,
lemmatisation, dependency parsing, NER).</li>
<li><a href="https://github.com/rfdj/SimpleNLG-NL">SimpleNLG_NL</a> -
Dutch surface realiser used for Natural Language Generation in Dutch,
based on the SimpleNLG implementation for English and French.</li>
<li><a href="https://github.com/rug-compling/alpino">Alpino</a> -
Dependency parser for Dutch (also does PoS tagging and
Lemmatisation).</li>
<li><a
href="https://github.com/opensource-spraakherkenning-nl/Kaldi_NL">Kaldi
NL</a> - Dutch Speech Recognition models based on <a
href="http://kaldi-asr.org/">Kaldi</a>.</li>
<li><a href="https://spacy.io/">spaCy</a> - <a
href="https://spacy.io/models/nl">Dutch model</a> available. -
Industrial strength NLP with Python and Cython.</li>
</ul>
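<p>To give a sense of how the Dutch spaCy model is used, here is a
minimal sketch. It assumes spaCy and the small Dutch pipeline
<code>nl_core_news_sm</code> have been installed (<code>pip install
spacy</code> and <code>python -m spacy download nl_core_news_sm</code>);
the example sentence is arbitrary.</p>
<pre><code class="language-python"># A minimal sketch: PoS tags, lemmas, dependencies and entities for Dutch.
import spacy

nlp = spacy.load("nl_core_news_sm")
doc = nlp("Amsterdam is de hoofdstad van Nederland.")
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)
for ent in doc.ents:
    print(ent.text, ent.label_)
</code></pre>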
<h2 id="nlp-in-indonesian">NLP in Indonesian</h2>
<h3 id="datasets-3">Datasets</h3>
<ul>
<li>Kompas and Tempo collections at <a
href="http://ilps.science.uva.nl/resources/bahasa/">ILPS</a></li>
<li><a
href="http://www.panl10n.net/english/outputs/Indonesia/UI/0802/UI-1M-tagged.zip">PANL10N
for PoS tagging</a>: 39K sentences and 900K word tokens</li>
<li><a href="https://github.com/famrashel/idn-tagged-corpus">IDN for PoS
tagging</a>: This corpus contains 10K sentences and 250K word
tokens</li>
<li><a href="https://github.com/famrashel/idn-treebank">Indonesian
Treebank</a> and <a
href="https://github.com/UniversalDependencies/UD_Indonesian-GSD">Universal
Dependencies-Indonesian</a></li>
<li><a href="https://github.com/kata-ai/indosum">IndoSum</a> for text
summarization and classification both</li>
<li><a href="http://wn-msa.sourceforge.net/">Wordnet-Bahasa</a> - large,
free, semantic dictionary</li>
<li>IndoBenchmark <a
href="https://github.com/indobenchmark/indonlu">IndoNLU</a> includes
pre-trained language model (IndoBERT), FastText model, Indo4B corpus,
and several NLU benchmark datasets</li>
</ul>
<h3 id="libraries-embedding">Libraries &amp; Embedding</h3>
<ul>
<li>Natural language toolkit <a
href="https://github.com/kangfend/bahasa">bahasa</a></li>
<li><a
href="https://github.com/galuhsahid/indonesian-word-embedding">Indonesian
Word Embedding</a></li>
<li>Pretrained <a
href="https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.id.zip">Indonesian
fastText word embeddings</a> trained on Wikipedia (loading sketch after
this list)</li>
<li>IndoBenchmark <a
href="https://github.com/indobenchmark/indonlu">IndoNLU</a> includes
pretrained language model (IndoBERT), FastText model, Indo4B corpus, and
several NLU benchmark datasets</li>
</ul>
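<p>As a sketch of how the pretrained Indonesian fastText vectors above
can be loaded, the snippet below uses gensim's
<code>KeyedVectors</code>. It assumes gensim is installed and that the
<code>wiki.id.vec</code> file from the archive has been downloaded and
unzipped locally; the file path and query word are assumptions.</p>
<pre><code class="language-python"># A minimal sketch: load the text-format vectors and query nearest neighbours.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("wiki.id.vec")  # assumed local path
print(vectors.most_similar("jakarta", topn=5))
</code></pre>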
<h2 id="nlp-in-urdu">NLP in Urdu</h2>
<h3 id="datasets-4">Datasets</h3>
<ul>
<li><a href="https://github.com/mirfan899/Urdu">Collection of Urdu
datasets</a> for POS, NER and NLP tasks</li>
</ul>
<h3 id="libraries-6">Libraries</h3>
<ul>
<li><a href="https://github.com/urduhack/urduhack">Natural Language
Processing library</a> for ( 🇵🇰)Urdu language</li>
</ul>
<h2 id="nlp-in-persian">NLP in Persian</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="libraries-7">Libraries</h3>
<ul>
<li><a href="https://github.com/roshan-research/hazm">Hazm</a> - Persian
NLP Toolkit.</li>
<li><a href="https://github.com/ICTRC/Parsivar">Parsivar</a>: A Language
Processing Toolkit for Persian</li>
<li><a href="https://github.com/AlirezaTheH/perke">Perke</a>: Perke is a
Python keyphrase extraction package for Persian language. It provides an
end-to-end keyphrase extraction pipeline in which each component can be
easily modified or extended to develop new models.</li>
<li><a href="https://github.com/jonsafari/perstem">Perstem</a>: Persian
stemmer, morphological analyzer, transliterator, and partial
part-of-speech tagger</li>
<li><a
href="https://github.com/NarimanN2/ParsiAnalyzer">ParsiAnalyzer</a>:
Persian Analyzer For Elasticsearch</li>
<li><a href="https://github.com/aziz/virastar">virastar</a>: Cleaning up
Persian text!</li>
</ul>
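<p>As a quick illustration of Hazm listed above, the following minimal
sketch normalizes, tokenizes and lemmatizes a short Persian sentence.
It assumes the package has been installed with <code>pip install
hazm</code>; the example sentence is arbitrary.</p>
<pre><code class="language-python"># A minimal sketch using Hazm's Normalizer, word_tokenize and Lemmatizer.
from hazm import Normalizer, word_tokenize, Lemmatizer

normalizer = Normalizer()
text = normalizer.normalize("ما هم برای وصل کردن آمدیم")
tokens = word_tokenize(text)  # word segmentation
lemmatizer = Lemmatizer()
print([lemmatizer.lemmatize(token) for token in tokens])
</code></pre>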
<h3 id="datasets-5">Datasets</h3>
<ul>
<li><a href="https://dbrg.ut.ac.ir/بیژن%E2%80%8Cخان/">Bijankhan
Corpus</a>: Bijankhan corpus is a tagged corpus that is suitable for
natural language processing research on the Persian (Farsi) language.
This collection is gathered form daily news and common texts. In this
collection all documents are categorized into different subjects such as
political, cultural and so on. Totally, there are 4300 different
subjects. The Bijankhan collection contains about 2.6 millions manually
tagged words with a tag set that contains 40 Persian POS tags.</li>
<li><a
href="https://sites.google.com/site/mojganserajicom/home/upc">Uppsala
Persian Corpus (UPC)</a>: A large, freely available Persian corpus. It
is a modified version of the Bijankhan corpus with additional sentence
segmentation and consistent tokenization; it contains 2,704,028 tokens
annotated with 31 part-of-speech tags. The tags are listed with
explanations in <a
href="https://sites.google.com/site/mojganserajicom/home/upc/Table_tag.pdf">this
table</a>.</li>
<li><a href="http://hdl.handle.net/11234/1-3195">Large-Scale Colloquial
Persian</a>: Large Scale Colloquial Persian Dataset (LSCP) is
hierarchically organized in asemantic taxonomy that focuses on
multi-task informal Persian language understanding as a comprehensive
problem. LSCP includes 120M sentences from 27M casual Persian tweets
with its dependency relations in syntactic annotation, Part-of-speech
tags, sentiment polarity and automatic translation of original Persian
sentences in English (EN), German (DE), Czech (CS), Italian (IT) and
Hindi (HI) spoken languages. Learn more about this project at <a
href="https://iasbs.ac.ir/~ansari/lscp/">LSCP webpage</a>.</li>
<li><a
href="https://github.com/HaniehP/PersianNER">ArmanPersoNERCorpus</a>:
The dataset includes 250,015 tokens and 7,682 Persian sentences in
total. It is available in 3 folds to be used in turn as training and
test sets. Each file contains one token, along with its manually
annotated named-entity tag, per line. Sentences are separated by a
newline. The NER tags are in IOB format.</li>
<li><a href="https://github.com/Text-Mining/Persian-NER">FarsiYar
PersianNER</a>: The dataset includes about 25,000,000 tokens and about
1,000,000 Persian sentences in total based on <a
href="https://github.com/Text-Mining/Persian-Wikipedia-Corpus">Persian
Wikipedia Corpus</a>. The NER tags are in IOB format. More than 1000
volunteers contributed tag improvements to this dataset via web panel or
android app. They release updated tags every two weeks.</li>
<li><a href="http://farsbase.net/PERLEX.html">PERLEX</a>: The first
Persian dataset for relation extraction, which is an expert translated
version of the “Semeval-2010-Task-8” dataset. Link to the relevant
publication.</li>
<li><a href="http://dadegan.ir/catalog/perdt">Persian Syntactic
Dependency Treebank</a>: This treebank is supplied for free
noncommercial use. For commercial uses feel free to contact us. The
number of annotated sentences is 29,982 sentences including samples from
almost all verbs of the Persian valency lexicon.</li>
<li><a href="http://stp.lingfil.uu.se/~mojgan/UPDT.html">Uppsala Persian
Dependency Treebank (UPDT)</a>: Dependency-based syntactically annotated
corpus.</li>
<li><a href="https://dbrg.ut.ac.ir/hamshahri/">Hamshahri</a>: Hamshahri
collection is a standard reliable Persian text collection that was used
at Cross Language Evaluation Forum (CLEF) during years 2008 and 2009 for
evaluation of Persian information retrieval systems.</li>
</ul>
<h2 id="nlp-in-ukrainian">NLP in Ukrainian</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a
href="https://github.com/asivokon/awesome-ukrainian-nlp">awesome-ukrainian-nlp</a>
- a curated list of Ukrainian NLP datasets, models, etc.</li>
<li><a
href="https://github.com/Helsinki-NLP/UkrainianLT">UkrainianLT</a> -
another curated list with a focus on machine translation and speech
processing</li>
</ul>
<h2 id="nlp-in-hungarian">NLP in Hungarian</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a
href="https://github.com/oroszgy/awesome-hungarian-nlp">awesome-hungarian-nlp</a>:
A curated list of free resources dedicated to Hungarian Natural Language
Processing.</li>
</ul>
<h2 id="nlp-in-portuguese">NLP in Portuguese</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a
href="https://github.com/ajdavidl/Portuguese-NLP">Portuguese-NLP</a> - a
list of resources and tools developed with a focus on Portuguese.</li>
</ul>
<h2 id="other-languages">Other Languages</h2>
<ul>
<li>Russian: <a href="https://github.com/kmike/pymorphy2">pymorphy2</a>
- a morphological analyzer and PoS tagger for Russian (usage sketch
after this list)</li>
<li>Asian Languages: <a
href="https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html">ICU
Tokenizer</a> plugin for Elasticsearch, with support for Thai, Lao,
Chinese, Japanese, and Korean</li>
<li>Ancient Languages: <a href="https://github.com/cltk/cltk">CLTK</a>:
The Classical Language Toolkit is a Python library and collection of
texts for doing NLP in ancient languages</li>
<li>Hebrew: <a
href="https://github.com/NLPH/NLPH_Resources">NLPH_Resources</a> - A
collection of papers, corpora and linguistic resources for NLP in
Hebrew</li>
</ul>
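<p>As a quick illustration of pymorphy2 listed above, the following
minimal sketch runs a morphological analysis of a single Russian word
form. It assumes the package and its bundled Russian dictionaries are
installed (<code>pip install pymorphy2</code>); the example word is
arbitrary.</p>
<pre><code class="language-python"># A minimal sketch using pymorphy2's MorphAnalyzer.
import pymorphy2

morph = pymorphy2.MorphAnalyzer()
for parse in morph.parse("стали"):       # candidate analyses, most likely first
    print(parse.normal_form, parse.tag)  # lemma and grammatical tag
</code></pre>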
<p><a href="#contents">Back to Top</a></p>
<p><a href="./CREDITS.md">Credits</a> for initial curators and
sources</p>
<h2 id="license">License</h2>
<p><a href="./LICENSE">License</a> - CC0</p>