<h1 id="awesome-nlp">awesome-nlp</h1>
|
||
<p><a href="https://github.com/sindresorhus/awesome"><img
|
||
src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"
|
||
alt="Awesome" /></a></p>
|
||
<p>A curated list of resources dedicated to Natural Language
|
||
Processing</p>
|
||
<figure>
|
||
<img src="/images/logo.jpg" alt="Awesome NLP Logo" />
|
||
<figcaption aria-hidden="true">Awesome NLP Logo</figcaption>
|
||
</figure>
|
||
<p>Read this in <a href="./README.md">English</a>, <a
|
||
href="./README-ZH-TW.md">Traditional Chinese</a></p>
|
||
<p><em>Please read the <a href="contributing.md">contribution
|
||
guidelines</a> before contributing. Please add your favourite NLP
|
||
resource by raising a <a
|
||
href="https://github.com/keonkim/awesome-nlp/pulls">pull
|
||
request</a></em></p>
|
||
<h2 id="contents">Contents</h2>
|
||
<ul>
|
||
<li><a href="#research-summaries-and-trends">Research Summaries and
|
||
Trends</a></li>
|
||
<li><a href="#prominent-nlp-research-labs">Prominent NLP Research
|
||
Labs</a></li>
|
||
<li><a href="#tutorials">Tutorials</a>
|
||
<ul>
|
||
<li><a href="#reading-content">Reading Content</a></li>
|
||
<li><a href="#videos-and-online-courses">Videos and Online Courses</a></li>
|
||
<li><a href="#books">Books</a></li>
|
||
</ul></li>
|
||
<li><a href="#libraries">Libraries</a>
|
||
<ul>
|
||
<li><a href="#node-js">Node.js</a></li>
|
||
<li><a href="#python">Python</a></li>
|
||
<li><a href="#c++">C++</a></li>
|
||
<li><a href="#java">Java</a></li>
|
||
<li><a href="#kotlin">Kotlin</a></li>
|
||
<li><a href="#scala">Scala</a></li>
|
||
<li><a href="#R">R</a></li>
|
||
<li><a href="#clojure">Clojure</a></li>
|
||
<li><a href="#ruby">Ruby</a></li>
|
||
<li><a href="#rust">Rust</a></li>
|
||
<li><a href="#NLP++">NLP++</a></li>
|
||
<li><a href="#julia">Julia</a></li>
|
||
</ul></li>
|
||
<li><a href="#services">Services</a></li>
|
||
<li><a href="#annotation-tools">Annotation Tools</a></li>
|
||
<li><a href="#datasets">Datasets</a></li>
|
||
<li><a href="#nlp-in-korean">NLP in Korean</a></li>
|
||
<li><a href="#nlp-in-arabic">NLP in Arabic</a></li>
|
||
<li><a href="#nlp-in-chinese">NLP in Chinese</a></li>
|
||
<li><a href="#nlp-in-german">NLP in German</a></li>
|
||
<li><a href="#nlp-in-polish">NLP in Polish</a></li>
|
||
<li><a href="#nlp-in-spanish">NLP in Spanish</a></li>
|
||
<li><a href="#nlp-in-indic-languages">NLP in Indic Languages</a></li>
|
||
<li><a href="#nlp-in-thai">NLP in Thai</a></li>
|
||
<li><a href="#nlp-in-danish">NLP in Danish</a></li>
|
||
<li><a href="#nlp-in-vietnamese">NLP in Vietnamese</a></li>
|
||
<li><a href="#nlp-for-dutch">NLP for Dutch</a></li>
|
||
<li><a href="#nlp-in-indonesian">NLP in Indonesian</a></li>
|
||
<li><a href="#nlp-in-urdu">NLP in Urdu</a></li>
|
||
<li><a href="#nlp-in-persian">NLP in Persian</a></li>
|
||
<li><a href="#nlp-in-ukrainian">NLP in Ukrainian</a></li>
|
||
<li><a href="#nlp-in-hungarian">NLP in Hungarian</a></li>
|
||
<li><a href="#nlp-in-portuguese">NLP in Portuguese</a></li>
|
||
<li><a href="#other-languages">Other Languages</a></li>
|
||
<li><a href="#credits">Credits</a></li>
|
||
</ul>
|
||
<h2 id="research-summaries-and-trends">Research Summaries and
|
||
Trends</h2>
|
||
<ul>
|
||
<li><a href="https://nlpoverview.com/">NLP-Overview</a> is an up-to-date
|
||
overview of deep learning techniques applied to NLP, including theory,
|
||
implementations, applications, and state-of-the-art results. This is a
|
||
great Deep NLP Introduction for researchers.</li>
|
||
<li><a href="https://nlpprogress.com/">NLP-Progress</a> tracks the
|
||
progress in Natural Language Processing, including the datasets and the
|
||
current state-of-the-art for the most common NLP tasks</li>
|
||
<li><a href="https://thegradient.pub/nlp-imagenet/">NLP’s ImageNet
|
||
moment has arrived</a></li>
|
||
<li><a href="http://ruder.io/acl-2018-highlights/">ACL 2018 Highlights:
|
||
Understanding Representation and Evaluation in More Challenging
|
||
Settings</a></li>
|
||
<li><a
|
||
href="https://www.abigailsee.com/2017/08/30/four-deep-learning-trends-from-acl-2017-part-1.html">Four
|
||
deep learning trends from ACL 2017. Part One: Linguistic Structure and
|
||
Word Embeddings</a></li>
|
||
<li><a
|
||
href="https://www.abigailsee.com/2017/08/30/four-deep-learning-trends-from-acl-2017-part-2.html">Four
|
||
deep learning trends from ACL 2017. Part Two: Interpretability and
|
||
Attention</a></li>
|
||
<li><a
|
||
href="http://blog.aylien.com/highlights-emnlp-2017-exciting-datasets-return-clusters/">Highlights
|
||
of EMNLP 2017: Exciting Datasets, Return of the Clusters, and
|
||
More!</a></li>
|
||
<li><a
|
||
href="https://tryolabs.com/blog/2017/12/12/deep-learning-for-nlp-advancements-and-trends-in-2017/?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=The%20Wild%20Week%20in%20AI">Deep
|
||
Learning for Natural Language Processing (NLP): Advancements &
|
||
Trends</a></li>
|
||
<li><a href="https://arxiv.org/abs/1703.09902">Survey of the State of
|
||
the Art in Natural Language Generation</a></li>
|
||
</ul>
|
||
<h2 id="prominent-nlp-research-labs">Prominent NLP Research Labs</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li><a href="http://nlp.cs.berkeley.edu/index.shtml">The Berkeley NLP
|
||
Group</a> - Notable contributions include a tool to reconstruct
long-dead languages, described <a
href="https://www.bbc.com/news/science-environment-21427896">here</a>,
which took corpora from 637 languages currently spoken in Asia and the
Pacific and recreated their ancestor language.</li>
|
||
<li><a href="http://www.cs.cmu.edu/~nasmith/nlp-cl.html">Language
|
||
Technologies Institute, Carnegie Mellon University</a> - Notable
projects include the <a href="http://www.cs.cmu.edu/~avenue/">Avenue
Project</a>, a syntax-driven machine translation system for endangered
languages such as Quechua and Aymara, and, previously, <a
href="http://www.cs.cmu.edu/~ark/">Noah’s Ark</a>, which created <a
href="http://www.cs.cmu.edu/~ark/AQMAR/">AQMAR</a> to improve NLP tools
for Arabic.</li>
|
||
<li><a href="http://www1.cs.columbia.edu/nlp/index.cgi">NLP research
|
||
group, Columbia University</a> - Responsible for creating BOLT
(interactive error handling for speech translation systems) and an
unnamed project to characterize laughter in dialogue.</li>
|
||
<li><a href="http://clsp.jhu.edu/">The Center for Language and Speech
Processing, Johns Hopkins University</a> - Recently in the news for
developing speech recognition software to create a diagnostic test for
Parkinson’s disease, <a
href="https://www.clsp.jhu.edu/2019/03/27/speech-recognition-software-and-machine-learning-tools-are-being-used-to-create-diagnostic-test-for-parkinsons-disease/#.XNFqrIkzYdU">here</a>.</li>
|
||
<li><a
|
||
href="https://wiki.umiacs.umd.edu/clip/index.php/Main_Page">Computational
|
||
Linguistics and Information Processing Group, University of Maryland</a>
|
||
- Notable contributions include <a
|
||
href="http://www.umiacs.umd.edu/~jbg/projects/IIS-1652666">Human-Computer
|
||
Cooperation for Word-by-Word Question Answering</a> and modeling the
development of phonetic representations.</li>
|
||
<li><a href="https://nlp.cis.upenn.edu/">Penn Natural Language
|
||
Processing, University of Pennsylvania</a> - Famous for creating the <a
|
||
href="https://www.seas.upenn.edu/~pdtb/">Penn Treebank</a>.</li>
|
||
<li><a href="https://nlp.stanford.edu/">The Stanford Natural Language
Processing Group</a> - One of the top NLP research labs in the world,
notable for creating <a
href="https://nlp.stanford.edu/software/corenlp.shtml">Stanford
CoreNLP</a> and their <a
href="https://nlp.stanford.edu/software/dcoref.shtml">coreference
resolution system</a>.</li>
|
||
</ul>
|
||
<h2 id="tutorials">Tutorials</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h3 id="reading-content">Reading Content</h3>
|
||
<p>General Machine Learning</p>
|
||
<ul>
|
||
<li><a
|
||
href="https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/edit?usp=sharing">Machine
|
||
Learning 101</a> from Google’s Senior Creative Engineer explains machine
learning for engineers and executives alike</li>
|
||
<li><a href="https://aiplaybook.a16z.com/">AI Playbook</a> - a16z AI
|
||
playbook is a great link to forward to your managers or content for your
|
||
presentations</li>
|
||
<li><a href="http://ruder.io/#open">Ruder’s Blog</a> by <a
|
||
href="https://twitter.com/seb_ruder">Sebastian Ruder</a> for commentary
|
||
on the best of NLP Research</li>
|
||
<li><a href="https://www.lighttag.io/how-to-label-data/">How To Label
|
||
Data</a> guide to managing larger linguistic annotation projects</li>
|
||
<li><a href="https://www.depends-on-the-definition.com/">Depends on the
|
||
Definition</a> collection of blog posts covering a wide array of NLP
|
||
topics with detailed implementation</li>
|
||
</ul>
|
||
<p>Introductions and Guides to NLP</p>
|
||
<ul>
|
||
<li><a
|
||
href="https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/">Understand
|
||
& Implement Natural Language Processing</a></li>
|
||
<li><a href="http://github.com/NirantK/nlp-python-deep-learning">NLP in
|
||
Python</a> - Collection of GitHub notebooks</li>
|
||
<li><a
|
||
href="https://academic.oup.com/jamia/article/18/5/544/829676">Natural
|
||
Language Processing: An Introduction</a> - Oxford</li>
|
||
<li><a
|
||
href="https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html">Deep
|
||
Learning for NLP with Pytorch</a></li>
|
||
<li><a href="https://github.com/hb20007/hands-on-nltk-tutorial">Hands-On
|
||
NLTK Tutorial</a> - NLTK Tutorials, Jupyter notebooks</li>
|
||
<li><a href="https://www.nltk.org/book/">Natural Language Processing
|
||
with Python – Analyzing Text with the Natural Language Toolkit</a> - An
|
||
online and print book introducing NLP concepts using NLTK. The book’s
|
||
authors also wrote the NLTK library.</li>
|
||
<li><a href="https://huggingface.co/blog/how-to-train">Train a new
|
||
language model from scratch</a> - Hugging Face 🤗</li>
|
||
<li><a href="https://notebooks.quantumstat.com/">The Super Duper NLP
|
||
Repo (SDNLPR)</a>: Collection of Colab notebooks covering a wide array
|
||
of NLP task implementations.</li>
|
||
</ul>
|
||
<p>Blogs and Newsletters</p>
|
||
<ul>
|
||
<li><a
|
||
href="https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">Deep
|
||
Learning, NLP, and Representations</a></li>
|
||
<li><a href="https://jalammar.github.io/illustrated-bert/">The
|
||
Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)</a>
|
||
and <a href="https://jalammar.github.io/illustrated-transformer/">The
|
||
Illustrated Transformer</a></li>
|
||
<li><a href="https://nlpers.blogspot.com/">Natural Language
|
||
Processing</a> by Hal Daumé III</li>
|
||
<li><a href="https://arxiv.org/pdf/1103.0398.pdf">arXiv: Natural
|
||
Language Processing (Almost) from Scratch</a></li>
|
||
<li><a
|
||
href="https://karpathy.github.io/2015/05/21/rnn-effectiveness">Karpathy’s
|
||
The Unreasonable Effectiveness of Recurrent Neural Networks</a></li>
|
||
<li><a
|
||
href="https://machinelearningmastery.com/category/natural-language-processing">Machine
|
||
Learning Mastery: Deep Learning for Natural Language Processing</a></li>
|
||
<li><a href="https://amitness.com/categories/#nlp">Visual NLP Paper
|
||
Summaries</a></li>
|
||
</ul>
|
||
<h3 id="videos-and-online-courses">Videos and Online Courses</h3>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li><a href="https://people.cs.umass.edu/~miyyer/cs685_f20/">Advanced
|
||
Natural Language Processing</a> - CS 685, UMass Amherst CS</li>
|
||
<li><a href="https://github.com/oxford-cs-deepnlp-2017/lectures">Deep
|
||
Natural Language Processing</a> - Lectures series from Oxford</li>
|
||
<li><a href="https://web.stanford.edu/class/cs224n/">Deep Learning for
|
||
Natural Language Processing (CS224n)</a> - Richard Socher and
Christopher Manning’s Stanford course</li>
|
||
<li><a href="http://phontron.com/class/nn4nlp2017/">Neural Networks for
|
||
NLP</a> - Carnegie Mellon Language Technologies Institute</li>
|
||
<li><a href="https://github.com/yandexdataschool/nlp_course">Deep NLP
|
||
Course</a> by Yandex Data School, covering important ideas from text
|
||
embedding to machine translation including sequence modeling, language
|
||
models and so on.</li>
|
||
<li><a href="https://www.fast.ai/2019/07/08/fastai-nlp/">fast.ai
|
||
Code-First Intro to Natural Language Processing</a> - This covers a
|
||
blend of traditional NLP topics (including regex, SVD, naive bayes,
|
||
tokenization) and recent neural network approaches (including RNNs,
|
||
seq2seq, GRUs, and the Transformer), as well as addressing urgent
|
||
ethical issues, such as bias and disinformation. Find the Jupyter
|
||
Notebooks <a href="https://github.com/fastai/course-nlp">here</a></li>
|
||
<li><a
|
||
href="https://www.youtube.com/playlist?list=PL8P_Z6C4GcuWfAq8Pt6PBYlck4OprHXsw">Machine
|
||
Learning University - Accelerated Natural Language Processing</a> -
|
||
Lectures go from introduction to NLP and text processing to Recurrent
|
||
Neural Networks and Transformers. Material can be found <a
|
||
href="https://github.com/aws-samples/aws-machine-learning-university-accelerated-nlp">here</a>.</li>
|
||
<li><a
|
||
href="https://www.youtube.com/playlist?list=PLH-xYrxjfO2WyR3pOAB006CYMhNt4wTqp">Applied
|
||
Natural Language Processing</a> - Lecture series from IIT Madras
covering everything from the basics to autoencoders. The GitHub
notebooks for this course are also available <a
href="https://github.com/Ramaseshanr/anlp">here</a>.</li>
|
||
</ul>
|
||
<h3 id="books">Books</h3>
|
||
<ul>
|
||
<li><a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and
|
||
Language Processing</a> - free, by Prof. Dan Jurafsky</li>
|
||
<li><a href="https://github.com/jacobeisenstein/gt-nlp-class">Natural
|
||
Language Processing</a> - free, NLP notes by Dr. Jacob Eisenstein at
|
||
Georgia Tech</li>
|
||
<li><a href="https://github.com/joosthub/PyTorchNLPBook">NLP with
|
||
PyTorch</a> - Brian McMahan and Delip Rao</li>
|
||
<li><a href="https://www.tidytextmining.com">Text Mining in R</a></li>
|
||
<li><a href="https://www.nltk.org/book/">Natural Language Processing
|
||
with Python</a></li>
|
||
<li><a
|
||
href="https://www.oreilly.com/library/view/practical-natural-language/9781492054047/">Practical
|
||
Natural Language Processing</a></li>
|
||
<li><a
|
||
href="https://www.oreilly.com/library/view/natural-language-processing/9781492047759/">Natural
|
||
Language Processing with Spark NLP</a></li>
|
||
<li><a
|
||
href="https://www.manning.com/books/deep-learning-for-natural-language-processing">Deep
|
||
Learning for Natural Language Processing</a> by Stephan Raaijmakers</li>
|
||
<li><a
|
||
href="https://www.manning.com/books/real-world-natural-language-processing">Real-World
|
||
Natural Language Processing</a> - by Masato Hagiwara</li>
|
||
<li><a
|
||
href="https://www.manning.com/books/natural-language-processing-in-action-second-edition">Natural
|
||
Language Processing in Action, Second Edition</a> - by Hobson Lane and
|
||
Maria Dyshel</li>
</ul>
<h2 id="libraries">Libraries</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li><a id="node-js"><strong>Node.js and Javascript</strong> - Node.js
|
||
Libraries for NLP</a> | <a href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a href="https://github.com/twitter/twitter-text">Twitter-text</a> -
|
||
A JavaScript implementation of Twitter’s text processing library</li>
|
||
<li><a href="https://github.com/benhmoore/Knwl.js">Knwl.js</a> - A
|
||
Natural Language Processor in JS</li>
|
||
<li><a href="https://github.com/retextjs/retext">Retext</a> - Extensible
|
||
system for analyzing and manipulating natural language</li>
|
||
<li><a href="https://github.com/spencermountain/compromise">NLP
|
||
Compromise</a> - Natural language processing in the browser</li>
<li><a href="https://github.com/NaturalNode/natural">Natural</a> -
General natural language facilities for Node.js</li>
|
||
<li><a href="https://github.com/synyi/poplar">Poplar</a> - A web-based
|
||
annotation tool for natural language processing (NLP)</li>
|
||
<li><a href="https://github.com/axa-group/nlp.js">NLP.js</a> - An NLP
|
||
library for building bots</li>
|
||
<li><a
|
||
href="https://github.com/huggingface/node-question-answering">node-question-answering</a>
|
||
- Fast and production-ready question answering w/ DistilBERT in
|
||
Node.js</li>
|
||
</ul></li>
|
||
<li><a id="python"> <strong>Python</strong> - Python NLP Libraries</a> |
|
||
<a href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a
|
||
href="https://github.com/sloev/sentimental-onix">sentimental-onix</a>
- Sentiment models for spaCy using ONNX</li>
|
||
<li><a href="https://github.com/QData/TextAttack">TextAttack</a> -
|
||
Adversarial attacks, adversarial training, and data augmentation in
|
||
NLP</li>
|
||
<li><a href="http://textblob.readthedocs.org/">TextBlob</a> - Providing
|
||
a consistent API for diving into common natural language processing
|
||
(NLP) tasks. Stands on the giant shoulders of <a
|
||
href="https://www.nltk.org/">Natural Language Toolkit (NLTK)</a> and <a
|
||
href="https://github.com/clips/pattern">Pattern</a>, and plays nicely
|
||
with both 👍</li>
|
||
<li><a href="https://github.com/explosion/spaCy">spaCy</a> - Industrial
|
||
strength NLP with Python and Cython 👍</li>
|
||
<li><a
|
||
href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster">Speedster</a>
|
||
- Automatically apply SOTA optimization techniques to achieve the
|
||
maximum inference speed-up on your hardware</li>
<li><a href="https://github.com/chartbeat-labs/textacy">textacy</a> -
Higher level NLP built on spaCy</li>
|
||
<li><a href="https://radimrehurek.com/gensim/index.html">gensim</a> -
|
||
Python library to conduct unsupervised semantic modelling from plain
|
||
text 👍</li>
|
||
<li><a
|
||
href="https://github.com/JasonKessler/scattertext">scattertext</a> -
|
||
Python library to produce d3 visualizations of how language differs
|
||
between corpora</li>
|
||
<li><a href="https://github.com/dmlc/gluon-nlp">GluonNLP</a> - A deep
|
||
learning toolkit for NLP, built on MXNet/Gluon, for research prototyping
|
||
and industrial deployment of state-of-the-art models on a wide range of
|
||
NLP tasks.</li>
|
||
<li><a href="https://github.com/allenai/allennlp">AllenNLP</a> - An NLP
|
||
research library, built on PyTorch, for developing state-of-the-art deep
|
||
learning models on a wide variety of linguistic tasks.</li>
|
||
<li><a href="https://github.com/PetrochukM/PyTorch-NLP">PyTorch-NLP</a>
|
||
- NLP research toolkit designed to support rapid prototyping with better
|
||
data loaders, word vector loaders, neural network layer representations,
|
||
common NLP metrics such as BLEU</li>
|
||
<li><a
|
||
href="https://github.com/columbia-applied-data-science/rosetta">Rosetta</a>
|
||
- Text processing tools and wrappers (e.g. Vowpal Wabbit)</li>
|
||
<li><a href="https://github.com/proycon/pynlpl">PyNLPl</a> - Python
|
||
Natural Language Processing Library. General purpose NLP library for
|
||
Python, handles some specific formats like ARPA language models, Moses
|
||
phrasetables, GIZA++ alignments.</li>
|
||
<li><a href="https://github.com/proycon/foliapy">foliapy</a> - Python
|
||
library for working with <a
|
||
href="https://proycon.github.io/folia/">FoLiA</a>, an XML format for
|
||
linguistic annotation.</li>
|
||
<li><a href="https://github.com/sergioburdisso/pyss3">PySS3</a> - Python
|
||
package that implements a novel white-box machine learning model for
|
||
text classification, called SS3. Since SS3 has the ability to visually
|
||
explain its rationale, this package also comes with easy-to-use
|
||
interactive visualizations tools (<a href="http://tworld.io/ss3/">online
|
||
demos</a>).</li>
|
||
<li><a href="https://github.com/datquocnguyen/jPTDP">jPTDP</a> - A
|
||
toolkit for joint part-of-speech (POS) tagging and dependency parsing.
|
||
jPTDP provides pre-trained models for 40+ languages.</li>
|
||
<li><a href="https://github.com/bigartm/bigartm">BigARTM</a> - a fast
|
||
library for topic modelling</li>
|
||
<li><a href="https://github.com/snipsco/snips-nlu">Snips NLU</a> - A
|
||
production ready library for intent parsing</li>
|
||
<li><a href="https://github.com/chakki-works/chazutsu">Chazutsu</a> - A
|
||
library for downloading and parsing standard NLP research datasets</li>
|
||
<li><a href="https://github.com/gutfeeling/word_forms">Word Forms</a> -
|
||
Word forms can accurately generate all possible forms of an English
|
||
word</li>
|
||
<li><a
|
||
href="https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA">Multilingual
|
||
Latent Dirichlet Allocation (LDA)</a> - A multilingual and extensible
|
||
document clustering pipeline</li>
|
||
<li><a href="https://www.nltk.org/">Natural Language Toolkit (NLTK)</a>
|
||
- A library containing a wide variety of NLP functionality, supporting
|
||
over 50 corpora.</li>
|
||
<li><a href="https://github.com/NervanaSystems/nlp-architect">NLP
|
||
Architect</a> - A library for exploring the state-of-the-art deep
|
||
learning topologies and techniques for NLP and NLU</li>
|
||
<li><a href="https://github.com/zalandoresearch/flair">Flair</a> - A
|
||
very simple framework for state-of-the-art multilingual NLP built on
|
||
PyTorch. Includes BERT, ELMo and Flair embeddings.</li>
|
||
<li><a href="https://github.com/BrikerMan/Kashgari">Kashgari</a> -
|
||
Simple, Keras-powered multilingual NLP framework, allows you to build
|
||
your models in 5 minutes for named entity recognition (NER),
|
||
part-of-speech tagging (PoS) and text classification tasks. Includes
|
||
BERT and word2vec embedding.</li>
|
||
<li><a href="https://github.com/deepset-ai/FARM">FARM</a> - Fast &
|
||
easy transfer learning for NLP. Harvesting language models for the
|
||
industry. Focus on Question Answering.</li>
|
||
<li><a href="https://github.com/deepset-ai/haystack">Haystack</a> -
|
||
End-to-end Python framework for building natural language search
|
||
interfaces to data. Leverages Transformers and state-of-the-art NLP
techniques. Supports DPR, Elasticsearch, Hugging Face’s Model Hub, and
much more!</li>
|
||
<li><a href="https://github.com/zaibacu/rita-dsl">Rita DSL</a> - a DSL,
|
||
loosely based on <a href="https://uima.apache.org/ruta.html">RUTA on
|
||
Apache UIMA</a>. Lets you define language patterns (rule-based NLP)
which are then translated into <a href="https://spacy.io/">spaCy</a>
patterns or, if you prefer a lighter-weight option with fewer features,
regex patterns.</li>
|
||
<li><a
|
||
href="https://github.com/huggingface/transformers">Transformers</a> -
|
||
Natural Language Processing for TensorFlow 2.0 and PyTorch.</li>
|
||
<li><a href="https://github.com/huggingface/tokenizers">Tokenizers</a> -
|
||
Tokenizers optimized for Research and Production.</li>
|
||
<li><a href="https://github.com/pytorch/fairseq">fairseq</a> - Facebook AI
Research implementations of SOTA seq2seq models in PyTorch.</li>
|
||
<li><a
|
||
href="https://github.com/gregversteeg/corex_topic">corex_topic</a> -
|
||
Hierarchical Topic Modeling with Minimal Domain Knowledge</li>
|
||
<li><a href="https://github.com/awslabs/sockeye">Sockeye</a> - Neural
|
||
Machine Translation (NMT) toolkit that powers Amazon Translate.</li>
|
||
<li><a href="https://github.com/xhlulu/dl-translate">DL Translate</a> -
|
||
A deep learning-based translation library for 50 languages, built on
|
||
<code>transformers</code> and Facebook’s mBART Large.</li>
|
||
<li><a href="https://github.com/obss/jury">Jury</a> - Evaluation of NLP
|
||
model outputs offering various automated metrics.</li>
|
||
<li><a href="https://github.com/proycon/python-ucto">python-ucto</a> -
|
||
Unicode-aware regular-expression based tokenizer for various languages.
|
||
Python binding to C++ library, supports <a
|
||
href="https://proycon.github.io/folia">FoLiA format</a>.</li>
|
||
</ul></li>
|
||
<li><a id="c++"><strong>C++</strong> - C++ Libraries</a> | <a
|
||
href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a href="https://github.com/chncwang/InsNet">InsNet</a> - A neural
|
||
network library for building instance-dependent NLP models with
|
||
padding-free dynamic batching.</li>
|
||
<li><a href="https://github.com/mit-nlp/MITIE">MIT Information
|
||
Extraction Toolkit</a> - C, C++, and Python tools for named entity
|
||
recognition and relation extraction</li>
|
||
<li><a href="https://taku910.github.io/crfpp/">CRF++</a> - Open source
|
||
implementation of Conditional Random Fields (CRFs) for
|
||
segmenting/labeling sequential data & other Natural Language
|
||
Processing tasks.</li>
|
||
<li><a href="http://www.chokkan.org/software/crfsuite/">CRFsuite</a> -
|
||
CRFsuite is an implementation of Conditional Random Fields (CRFs) for
|
||
labeling sequential data.</li>
|
||
<li><a href="https://github.com/BLLIP/bllip-parser">BLLIP Parser</a> -
|
||
BLLIP Natural Language Parser (also known as the Charniak-Johnson
|
||
parser)</li>
|
||
<li><a href="https://github.com/proycon/colibri-core">colibri-core</a> -
|
||
C++ library, command line tools, and Python binding for extracting and
|
||
working with basic linguistic constructions such as n-grams and
|
||
skipgrams in a quick and memory-efficient way.</li>
|
||
<li><a href="https://github.com/LanguageMachines/ucto">ucto</a> -
|
||
Unicode-aware regular-expression based tokenizer for various languages.
|
||
Tool and C++ library. Supports FoLiA format.</li>
|
||
<li><a href="https://github.com/LanguageMachines/libfolia">libfolia</a>
|
||
- C++ library for the <a href="https://proycon.github.io/folia/">FoLiA
|
||
format</a></li>
|
||
<li><a href="https://github.com/LanguageMachines/frog">frog</a> -
|
||
Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser,
|
||
dependency parser, NER, shallow parser, morphological analyzer.</li>
|
||
<li><a href="https://github.com/meta-toolkit/meta">MeTA</a> - <a
|
||
href="https://meta-toolkit.org/">MeTA : ModErn Text Analysis</a> is a
|
||
C++ Data Sciences Toolkit that facilitates mining big text data.</li>
|
||
<li><a href="https://taku910.github.io/mecab/">Mecab (Japanese)</a></li>
|
||
<li><a href="http://statmt.org/moses/">Moses</a></li>
|
||
<li><a
|
||
href="https://github.com/facebookresearch/StarSpace">StarSpace</a> - A
library from Facebook for creating word-level, paragraph-level, and
document-level embeddings, and for text classification</li>
|
||
</ul></li>
|
||
<li><a id="java"><strong>Java</strong> - Java NLP Libraries</a> | <a
|
||
href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a href="https://nlp.stanford.edu/software/index.shtml">Stanford
|
||
NLP</a></li>
|
||
<li><a href="https://opennlp.apache.org/">OpenNLP</a></li>
|
||
<li><a href="https://emorynlp.github.io/nlp4j/">NLP4J</a></li>
|
||
<li><a
|
||
href="https://deeplearning4j.org/docs/latest/deeplearning4j-nlp-word2vec">Word2vec
|
||
in Java</a></li>
|
||
<li><a href="https://github.com/knowitall/reverb/">ReVerb</a> - Web-Scale
Open Information Extraction</li>
<li><a href="https://github.com/knowitall/openregex">OpenRegex</a> - An
efficient and flexible token-based regular expression language and
engine.</li>
|
||
<li><a href="https://github.com/CogComp/cogcomp-nlp">CogcompNLP</a> -
|
||
Core libraries developed by the University of Illinois’ Cognitive
Computation Group.</li>
|
||
<li><a href="http://mallet.cs.umass.edu/">MALLET</a> - MAchine Learning
|
||
for LanguagE Toolkit - package for statistical natural language
|
||
processing, document classification, clustering, topic modeling,
|
||
information extraction, and other machine learning applications to
|
||
text.</li>
|
||
<li><a
|
||
href="https://github.com/datquocnguyen/RDRPOSTagger">RDRPOSTagger</a> -
|
||
A robust POS tagging toolkit available (in both Java & Python)
|
||
together with pre-trained models for 40+ languages.</li>
|
||
</ul></li>
|
||
<li><a id="kotlin"><strong>Kotlin</strong> - Kotlin NLP Libraries</a> |
|
||
<a href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a href="https://github.com/pemistahl/lingua/">Lingua</a> - A language
|
||
detection library for Kotlin and Java, suitable for long and short text
|
||
alike</li>
|
||
<li><a href="https://github.com/meiblorn/kotidgy">Kotidgy</a> - An
|
||
index-based text data generator written in Kotlin</li>
|
||
</ul></li>
|
||
<li><a id="scala"><strong>Scala</strong> - Scala NLP Libraries</a> | <a
|
||
href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a href="https://github.com/CogComp/saul">Saul</a> - Library for
|
||
developing NLP systems, including built-in modules such as SRL, POS,
|
||
etc.</li>
|
||
<li><a href="https://github.com/ispras/atr4s">ATR4S</a> - Toolkit with
|
||
state-of-the-art <a
|
||
href="https://en.wikipedia.org/wiki/Terminology_extraction">automatic
|
||
term recognition</a> methods.</li>
|
||
<li><a href="https://github.com/ispras/tm">tm</a> - Implementation of
|
||
topic modeling based on regularized multilingual <a
|
||
href="https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis">PLSA</a>.</li>
|
||
<li><a
|
||
href="https://github.com/Refefer/word2vec-scala">word2vec-scala</a> -
|
||
Scala interface to word2vec model; includes operations on vectors like
|
||
word-distance and word-analogy.</li>
|
||
<li><a href="https://github.com/dlwh/epic">Epic</a> - Epic is a high
|
||
performance statistical parser written in Scala, along with a framework
|
||
for building complex structured prediction models.</li>
|
||
<li><a href="https://github.com/JohnSnowLabs/spark-nlp">Spark NLP</a> -
|
||
Spark NLP is a natural language processing library built on top of
|
||
Apache Spark ML that provides simple, performant & accurate NLP
|
||
annotations for machine learning pipelines that scale easily in a
|
||
distributed environment.</li>
|
||
</ul></li>
|
||
<li><a id="R"><strong>R</strong> - R NLP Libraries</a> | <a
|
||
href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a href="https://github.com/dselivanov/text2vec">text2vec</a> - Fast
|
||
vectorization, topic modeling, distances and GloVe word embeddings in
|
||
R.</li>
|
||
<li><a href="https://github.com/bmschmidt/wordVectors">wordVectors</a> -
|
||
An R package for creating and exploring word2vec and other word
|
||
embedding models</li>
|
||
<li><a href="https://github.com/mimno/RMallet">RMallet</a> - R package
|
||
to interface with the Java machine learning tool MALLET</li>
|
||
<li><a href="https://github.com/agoldst/dfr-browser">dfr-browser</a> -
|
||
Creates d3 visualizations for browsing topic models of text in a web
|
||
browser.</li>
|
||
<li><a href="https://github.com/agoldst/dfrtopics">dfrtopics</a> - R
|
||
package for exploring topic models of text.</li>
|
||
<li><a
|
||
href="https://github.com/kevincobain2000/sentiment_classifier">sentiment_classifier</a>
|
||
- Sentiment Classification using Word Sense Disambiguation and WordNet
|
||
Reader</li>
|
||
<li><a
|
||
href="https://github.com/kevincobain2000/jProcessing">jProcessing</a> -
|
||
Japanese Natural Language Processing Libraries, with Japanese sentiment
|
||
classification</li>
|
||
<li><a
|
||
href="https://kgjerde.github.io/corporaexplorer/">corporaexplorer</a> -
|
||
An R package for dynamic exploration of text collections</li>
|
||
<li><a href="https://github.com/juliasilge/tidytext">tidytext</a> - Text
|
||
mining using tidy tools</li>
|
||
<li><a href="https://github.com/quanteda/spacyr">spacyr</a> - R wrapper
|
||
to spaCy NLP</li>
|
||
<li><a
|
||
href="https://github.com/cran-task-views/NaturalLanguageProcessing/">CRAN
|
||
Task View: Natural Language Processing</a></li>
|
||
</ul></li>
|
||
<li><a id="clojure"><strong>Clojure</strong></a> | <a
|
||
href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a
|
||
href="https://github.com/dakrone/clojure-opennlp">Clojure-openNLP</a> -
|
||
Natural Language Processing in Clojure (opennlp)</li>
|
||
<li><a
|
||
href="https://github.com/r0man/inflections-clj">Inflections-clj</a> -
|
||
Rails-like inflection library for Clojure and ClojureScript</li>
|
||
<li><a href="https://github.com/fekr/postagga">postagga</a> - A library
|
||
to parse natural language in Clojure and ClojureScript</li>
|
||
</ul></li>
|
||
<li><a id="ruby"><strong>Ruby</strong></a> | <a href="#contents">Back to
|
||
Top</a>
|
||
<ul>
|
||
<li>Kevin Dias’s <a href="https://github.com/diasks2/ruby-nlp">A
|
||
collection of Natural Language Processing (NLP) Ruby libraries, tools
|
||
and software</a></li>
|
||
<li><a href="https://github.com/arbox/nlp-with-ruby">Practical Natural
|
||
Language Processing done in Ruby</a></li>
|
||
</ul></li>
|
||
<li><a id="rust"><strong>Rust</strong></a> | <a href="#contents">Back to
|
||
Top</a>
|
||
<ul>
|
||
<li><a href="https://github.com/greyblake/whatlang-rs">whatlang</a> -
|
||
Natural language recognition library based on trigrams</li>
|
||
<li><a href="https://github.com/snipsco/snips-nlu-rs">snips-nlu-rs</a> -
|
||
A production ready library for intent parsing</li>
|
||
<li><a href="https://github.com/guillaume-be/rust-bert">rust-bert</a> -
|
||
Ready-to-use NLP pipelines and Transformer-based models</li>
|
||
</ul></li>
|
||
<li><a id="NLP++"><strong>NLP++</strong> - NLP++ Language</a> | <a
|
||
href="#contents">Back to Top</a>
|
||
<ul>
|
||
<li><a
|
||
href="https://marketplace.visualstudio.com/items?itemName=dehilster.nlp">VSCode
|
||
Language Extension</a> - NLP++ Language Extension for VSCode</li>
|
||
<li><a href="https://github.com/VisualText/nlp-engine">nlp-engine</a> -
|
||
NLP++ engine to run NLP++ code on Linux including a full English
|
||
parser</li>
|
||
<li><a href="http://visualtext.org">VisualText</a> - Homepage for the
|
||
NLP++ Language</li>
|
||
<li><a
|
||
href="http://wiki.naturalphilosophy.org/index.php?title=NLP%2B%2B">NLP++
|
||
Wiki</a> - Wiki entry for the NLP++ language</li>
|
||
</ul></li>
|
||
<li><a id="julia"><strong>Julia</strong></a> | <a href="#contents">Back
|
||
to Top</a>
|
||
<ul>
|
||
<li><a
|
||
href="https://github.com/JuliaText/CorpusLoaders.jl">CorpusLoaders</a> -
|
||
A variety of loaders for various NLP corpora</li>
|
||
<li><a href="https://github.com/JuliaText/Languages.jl">Languages</a> -
|
||
A package for working with human languages</li>
|
||
<li><a
|
||
href="https://github.com/JuliaText/TextAnalysis.jl">TextAnalysis</a> -
|
||
Julia package for text analysis</li>
|
||
<li><a href="https://github.com/JuliaText/TextModels.jl">TextModels</a>
|
||
- Neural Network based models for Natural Language Processing</li>
|
||
<li><a
|
||
href="https://github.com/JuliaText/WordTokenizers.jl">WordTokenizers</a>
|
||
- High performance tokenizers for natural language processing and other
|
||
related tasks</li>
|
||
<li><a href="https://github.com/JuliaText/Word2Vec.jl">Word2Vec</a> -
|
||
Julia interface to word2vec</li>
|
||
</ul></li>
|
||
</ul>
|
||
<h3 id="services">Services</h3>
|
||
<p>NLP as API with higher level functionality such as NER, Topic tagging
|
||
and so on | <a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li><a href="https://github.com/wit-ai/wit">Wit-ai</a> - Natural
|
||
Language Interface for apps and devices</li>
|
||
<li><a
|
||
href="https://github.com/watson-developer-cloud/natural-language-understanding-nodejs">IBM
|
||
Watson’s Natural Language Understanding</a> - API and Github demo</li>
|
||
<li><a href="https://aws.amazon.com/comprehend/">Amazon Comprehend</a> -
|
||
NLP and ML suite that covers most common tasks like NER, tagging, and
|
||
sentiment analysis</li>
|
||
<li><a href="https://cloud.google.com/natural-language/">Google Cloud
|
||
Natural Language API</a> - Syntax Analysis, NER, Sentiment Analysis, and
|
||
Content tagging in at least 9 languages, including English and Chinese
|
||
(Simplified and Traditional).</li>
|
||
<li><a
|
||
href="https://www.paralleldots.com/text-analysis-apis">ParallelDots</a>
|
||
- High level Text Analysis API Service ranging from Sentiment Analysis
|
||
to Intent Analysis</li>
|
||
<li><a
|
||
href="https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/">Microsoft
|
||
Cognitive Service</a></li>
|
||
<li><a href="https://www.textrazor.com/">TextRazor</a></li>
|
||
<li><a href="https://www.rosette.com/">Rosette</a></li>
|
||
<li><a href="https://www.textalytic.com">Textalytic</a> - Natural
|
||
Language Processing in the Browser with sentiment analysis, named entity
|
||
extraction, POS tagging, word frequencies, topic modeling, word clouds,
|
||
and more</li>
|
||
<li><a href="https://nlpcloud.io">NLP Cloud</a> - SpaCy NLP models
|
||
(custom and pre-trained ones) served through a RESTful API for named
|
||
entity recognition (NER), POS tagging, and more.</li>
|
||
<li><a href="https://cloudmersive.com/nlp-api">Cloudmersive</a> -
|
||
Unified and free NLP APIs that perform actions such as part-of-speech tagging,
|
||
text rephrasing, language translation/detection, and sentence
|
||
parsing</li>
|
||
</ul>
|
||
<h3 id="annotation-tools">Annotation Tools</h3>
|
||
<ul>
|
||
<li><a href="https://gate.ac.uk/overview.html">GATE</a> - General
|
||
Architecture for Text Engineering is 15+ years old, free and open
|
||
source</li>
|
||
<li><a href="https://github.com/weitechen/anafora">Anafora</a> is a free
|
||
and open source, web-based raw text annotation tool</li>
|
||
<li><a href="https://brat.nlplab.org/">brat</a> - brat rapid annotation
|
||
tool is an online environment for collaborative text annotation</li>
|
||
<li><a href="https://github.com/chakki-works/doccano">doccano</a> -
|
||
doccano is free, open-source, and provides annotation features for text
|
||
classification, sequence labeling and sequence to sequence</li>
|
||
<li><a href="https://inception-project.github.io">INCEpTION</a> - A
|
||
semantic annotation platform offering intelligent assistance and
|
||
knowledge management</li>
|
||
<li><a href="https://www.tagtog.net/">tagtog</a>, team-first web tool to
|
||
find, create, maintain, and share datasets - costs $</li>
|
||
<li><a href="https://prodi.gy/">prodigy</a> is an annotation tool
|
||
powered by active learning, costs $</li>
|
||
<li><a href="https://lighttag.io">LightTag</a> - Hosted and managed text
|
||
annotation tool for teams, costs $</li>
|
||
<li><a
|
||
href="https://corpling.uis.georgetown.edu/rstweb/info/">rstWeb</a> -
|
||
open source local or online tool for discourse tree annotations</li>
|
||
<li><a href="https://corpling.uis.georgetown.edu/gitdox/">GitDox</a> -
|
||
open source server annotation tool with GitHub version control and
|
||
validation for XML data and collaborative spreadsheet grids</li>
|
||
<li><a href="https://www.heartex.ai/">Label Studio</a> - Hosted and
|
||
managed text annotation tool for teams, freemium based, costs $</li>
|
||
<li><a href="https://datasaur.ai/">Datasaur</a> supports various NLP
|
||
tasks for individuals or teams, freemium based</li>
|
||
<li><a href="https://konfuzio.com/en/">Konfuzio</a> - team-first hosted
|
||
and on-prem text, image and PDF annotation tool powered by active
|
||
learning, freemium based, costs $</li>
|
||
<li><a href="https://ubiai.tools/">UBIAI</a> - Easy-to-use text
|
||
annotation tool for teams with most comprehensive auto-annotation
|
||
features. Supports NER, relations and document classification as well as
|
||
OCR annotation for invoice labeling, costs $</li>
|
||
<li><a href="https://github.com/AI4Bharat/Shoonya-Backend">Shoonya</a> -
|
||
Shoonya is a free and open source data annotation platform with a wide
variety of organization- and workspace-level management features. Shoonya
is data agnostic and can be used by teams to annotate data at scale, with
multiple verification stages.</li>
|
||
<li><a href="https://www.johnsnowlabs.com/annotation-lab/">Annotation
|
||
Lab</a> - Free End-to-End No-Code platform for text annotation and DL
|
||
model training/tuning. Out-of-the-box support for Named Entity
|
||
Recognition, Classification, Relation extraction and Assertion Status
|
||
Spark NLP models. Unlimited support for users, teams, projects,
|
||
documents. Not FOSS.</li>
|
||
<li><a href="https://github.com/proycon/flat">FLAT</a> - FLAT is a
|
||
web-based linguistic annotation environment based around the <a
|
||
href="http://proycon.github.io/folia">FoLiA format</a>, a rich XML-based
|
||
format for linguistic annotation. Free and open source.</li>
|
||
</ul>
|
||
<h2 id="techniques">Techniques</h2>
|
||
<h3 id="text-embeddings">Text Embeddings</h3>
|
||
<h4 id="word-embeddings">Word Embeddings</h4>
|
||
<ul>
|
||
<li><p>Rule of thumb: <strong>fastText &gt;&gt; GloVe &gt;
|
||
word2vec</strong></p></li>
|
||
<li><p><a
|
||
href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">word2vec</a>
|
||
- <a
|
||
href="https://code.google.com/archive/p/word2vec/">implementation</a> -
|
||
<a
|
||
href="http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">explainer
|
||
blog</a></p></li>
|
||
<li><p><a href="https://nlp.stanford.edu/pubs/glove.pdf">GloVe</a> - <a
|
||
href="https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/">explainer
|
||
blog</a></p></li>
|
||
<li><p>fastText - <a
|
||
href="https://github.com/facebookresearch/fastText">implementation</a> -
|
||
<a href="https://arxiv.org/abs/1607.04606">paper</a> - <a
|
||
href="https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3">explainer
|
||
blog</a></p></li>
|
||
</ul>
|
||
<h4 id="sentence-and-language-model-based-word-embeddings">Sentence and
|
||
Language Model Based Word Embeddings</h4>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li>ELMo - <a href="https://arxiv.org/abs/1802.05365">Deep
|
||
Contextualized Word Representations</a> - <a
|
||
href="https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md">PyTorch
|
||
implementation</a> - <a href="https://github.com/allenai/bilm-tf">TF
|
||
Implementation</a></li>
|
||
<li>ULMFiT - <a href="https://arxiv.org/abs/1801.06146">Universal
|
||
Language Model Fine-tuning for Text Classification</a> by Jeremy Howard
|
||
and Sebastian Ruder</li>
|
||
<li>InferSent - <a href="https://arxiv.org/abs/1705.02364">Supervised
|
||
Learning of Universal Sentence Representations from Natural Language
|
||
Inference Data</a> by Facebook</li>
|
||
<li>CoVe - <a href="https://arxiv.org/abs/1708.00107">Learned in
|
||
Translation: Contextualized Word Vectors</a></li>
|
||
<li>Paragraph vectors - from <a
|
||
href="https://cs.stanford.edu/~quocle/paragraph_vector.pdf">Distributed
|
||
Representations of Sentences and Documents</a>. See <a
|
||
href="https://rare-technologies.com/doc2vec-tutorial/">doc2vec tutorial
|
||
at gensim</a></li>
|
||
<li><a href="https://arxiv.org/abs/1511.06388">sense2vec</a> - on word
|
||
sense disambiguation</li>
|
||
<li><a href="https://arxiv.org/abs/1506.06726">Skip Thought Vectors</a>
|
||
- sentence representation method</li>
|
||
<li><a href="https://arxiv.org/abs/1502.07257">Adaptive skip-gram</a> -
|
||
similar approach, with adaptive properties</li>
|
||
<li><a
|
||
href="https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf">Sequence
|
||
to Sequence Learning</a> - word vectors for machine translation</li>
|
||
</ul>
|
||
<h3 id="question-answering-and-knowledge-extraction">Question Answering
|
||
and Knowledge Extraction</h3>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li><a href="https://github.com/facebookresearch/DrQA">DrQA</a> - Open
|
||
Domain Question Answering work by Facebook Research on Wikipedia
|
||
data</li>
|
||
<li><a href="https://github.com/allenai/document-qa">Document-QA</a> -
|
||
Simple and Effective Multi-Paragraph Reading Comprehension by
|
||
AllenAI</li>
|
||
<li><a
|
||
href="https://www.usna.edu/Users/cs/nchamber/pubs/acl2011-chambers-templates.pdf">Template-Based
|
||
Information Extraction without the Templates</a></li>
|
||
<li><a
|
||
href="https://www.sebastianzimmeck.de/zimmeckAndBellovin2014Privee.pdf">Privee:
|
||
An Architecture for Automatically Analyzing Web Privacy
|
||
Policies</a></li>
|
||
</ul>
|
||
<h2 id="datasets">Datasets</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li><a href="https://github.com/niderhoff/nlp-datasets">nlp-datasets</a>
|
||
- A great collection of NLP datasets</li>
|
||
<li><a
|
||
href="https://github.com/RaRe-Technologies/gensim-data">gensim-data</a>
|
||
- Data repository for pretrained NLP models and NLP corpora.</li>
|
||
</ul>
|
||
<h2 id="multilingual-nlp-frameworks">Multilingual NLP Frameworks</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li><a href="https://github.com/ufal/udpipe">UDPipe</a> is a trainable
|
||
pipeline for tokenizing, tagging, lemmatizing and parsing Universal
|
||
Treebanks and other CoNLL-U files. Primarily written in C++, it offers a
|
||
fast and reliable solution for multilingual NLP processing.</li>
|
||
<li><a href="https://github.com/adobe/NLP-Cube">NLP-Cube</a>: Natural
|
||
Language Processing Pipeline - Sentence Splitting, Tokenization,
|
||
Lemmatization, Part-of-speech Tagging and Dependency Parsing. New
|
||
platform, written in Python with Dynet 2.0. Offers standalone
|
||
(CLI/Python bindings) and server functionality (REST API).</li>
|
||
<li><a href="https://github.com/mikahama/uralicNLP">UralicNLP</a> is an
NLP library aimed mainly at endangered Uralic languages such as the Sami,
Mordvin, Mari, and Komi languages. It also supports some non-endangered
Uralic languages such as Finnish, as well as non-Uralic languages such as
Swedish and Arabic. UralicNLP can do morphological analysis, generation,
lemmatization and disambiguation.</li>
|
||
</ul>
|
||
<h2 id="nlp-in-korean">NLP in Korean</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h3 id="libraries">Libraries</h3>
|
||
<ul>
|
||
<li><a href="http://konlpy.org">KoNLPy</a> - Python package for Korean
|
||
natural language processing.</li>
|
||
<li><a href="https://eunjeon.blogspot.com/">Mecab (Korean)</a> - C++
|
||
library for Korean NLP</li>
|
||
<li><a href="https://koalanlp.github.io/koalanlp/">KoalaNLP</a> - Scala
|
||
library for Korean Natural Language Processing.</li>
|
||
<li><a href="https://cran.r-project.org/package=KoNLP">KoNLP</a> - R
|
||
package for Korean Natural language processing</li>
|
||
</ul>
|
||
<h3 id="blogs-and-tutorials">Blogs and Tutorials</h3>
|
||
<ul>
|
||
<li><a href="https://dsindex.github.io/">dsindex’s blog</a></li>
|
||
<li><a href="http://cs.kangwon.ac.kr/~leeck/NLP/">Kangwon University’s
|
||
NLP course in Korean</a></li>
|
||
</ul>
|
||
<h3 id="datasets-1">Datasets</h3>
|
||
<ul>
|
||
<li><a
|
||
href="http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus">KAIST
|
||
Corpus</a> - A corpus from the Korea Advanced Institute of Science and
|
||
Technology in Korean.</li>
|
||
<li><a href="https://github.com/e9t/nsmc/">Naver Sentiment Movie Corpus
|
||
in Korean</a></li>
|
||
<li><a href="http://srchdb1.chosun.com/pdf/i_archive/">Chosun Ilbo
|
||
archive</a> - dataset in Korean from one of the major newspapers in
|
||
South Korea, the Chosun Ilbo.</li>
|
||
<li><a href="https://github.com/songys/Chatbot_data">Chat data</a> -
|
||
Chatbot data in Korean</li>
|
||
<li><a href="https://github.com/akngs/petitions">Petitions</a> - Collect
|
||
expired petition data from the Blue House National Petition Site.</li>
|
||
<li><a href="https://github.com/j-min/korean-parallel-corpora">Korean
|
||
Parallel corpora</a> - Neural Machine Translation (NMT) dataset for
|
||
<strong>Korean to French</strong> &amp; <strong>Korean to
|
||
English</strong></li>
|
||
<li><a href="https://korquad.github.io/">KorQuAD</a> - Korean SQuAD
|
||
dataset with Wiki HTML source. The site listed both v1.0 and v2.1 at the time
|
||
of adding to Awesome NLP</li>
|
||
</ul>
|
||
<h2 id="nlp-in-arabic">NLP in Arabic</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h3 id="libraries-1">Libraries</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/01walid/goarabic">goarabic</a> - Go
|
||
package for Arabic text processing</li>
|
||
<li><a href="https://github.com/ejtaal/jsastem">jsastem</a> - JavaScript
|
||
for Arabic stemming</li>
|
||
<li><a href="https://pypi.org/project/PyArabic/">PyArabic</a> - Python
|
||
libraries for Arabic</li>
|
||
<li><a href="https://github.com/amir-zeldes/RFTokenizer">RFTokenizer</a>
|
||
- trainable Python segmenter for Arabic, Hebrew and Coptic</li>
|
||
</ul>
|
||
<h3 id="datasets-2">Datasets</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces">Multidomain
|
||
Datasets</a> - Largest Available Multi-Domain Resources for Arabic
|
||
Sentiment Analysis</li>
|
||
<li><a href="https://github.com/mohamedadaly/labr">LABR</a> - LArge
|
||
Arabic Book Reviews dataset</li>
|
||
<li><a href="https://github.com/mohataher/arabic-stop-words">Arabic
|
||
Stopwords</a> - A list of Arabic stopwords from various resources</li>
|
||
</ul>
|
||
<h2 id="nlp-in-chinese">NLP in Chinese</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h3 id="libraries-2">Libraries</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/fxsjy/jieba#jieba-1">jieba</a> - Python
|
||
package for Chinese word segmentation</li>
|
||
<li><a href="https://github.com/isnowfy/snownlp">SnowNLP</a> - Python
|
||
package for Chinese NLP</li>
|
||
<li><a href="https://github.com/FudanNLP/fnlp">FudanNLP</a> - Java
|
||
library for Chinese text processing</li>
|
||
<li><a href="https://github.com/hankcs/HanLP">HanLP</a> - The
|
||
multilingual NLP library</li>
|
||
</ul>
|
||
<h3 id="anthology">Anthology</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/fighting41love/funNLP">funNLP</a> -
|
||
Collection of NLP tools and resources mainly for Chinese</li>
|
||
</ul>
|
||
<h2 id="nlp-in-german">NLP in German</h2>
|
||
<ul>
|
||
<li><a href="https://github.com/adbar/German-NLP">German-NLP</a> -
|
||
Curated list of open-access/open-source/off-the-shelf resources and
|
||
tools developed with a particular focus on German</li>
|
||
</ul>
|
||
<h2 id="nlp-in-polish">NLP in Polish</h2>
|
||
<ul>
|
||
<li><a
|
||
href="https://github.com/ksopyla/awesome-nlp-polish">Polish-NLP</a> - A
|
||
curated list of resources dedicated to Natural Language Processing (NLP)
|
||
in Polish. Models, tools, datasets.</li>
|
||
</ul>
|
||
<h2 id="nlp-in-spanish">NLP in Spanish</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h3 id="libraries-3">Libraries</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/jfreddypuentes/spanlp">spanlp</a> -
|
||
Python library to detect, censor and clean profanity, vulgarities,
|
||
hateful words, racism, xenophobia and bullying in texts written in
|
||
Spanish. It contains data of 21 Spanish-speaking countries.</li>
|
||
</ul>
|
||
<h3 id="data">Data</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://github.com/dav009/LatinamericanTextResources">Colombian
|
||
Political Speeches</a></li>
|
||
<li><a
|
||
href="https://mbkromann.github.io/copenhagen-dependency-treebank/">Copenhagen
|
||
Treebank</a></li>
|
||
<li><a href="https://github.com/crscardellino/sbwce">Spanish Billion
|
||
words corpus with Word2Vec embeddings</a></li>
|
||
<li><a
|
||
href="https://github.com/josecannete/spanish-unannotated-corpora">Compilation
|
||
of Spanish Unannotated Corpora</a></li>
|
||
</ul>
|
||
<h3 id="word-and-sentence-embeddings">Word and Sentence Embeddings</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://github.com/dccuchile/spanish-word-embeddings">Spanish Word
|
||
Embeddings Computed with Different Methods and from Different
|
||
Corpora</a></li>
|
||
<li><a href="https://github.com/BotCenter/spanishWordEmbeddings">Spanish
|
||
Word Embeddings Computed from Large Corpora and Different Sizes Using
|
||
fastText</a></li>
|
||
<li><a href="https://github.com/BotCenter/spanishSent2Vec">Spanish
|
||
Sentence Embeddings Computed from Large Corpora Using sent2vec</a></li>
|
||
<li><a href="https://github.com/dccuchile/beto">Beto - BERT for
|
||
Spanish</a></li>
|
||
</ul>
|
||
<h2 id="nlp-in-indic-languages">NLP in Indic languages</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h3 id="data-corpora-and-treebanks">Data, Corpora and Treebanks</h3>
|
||
<ul>
|
||
<li><a href="https://ltrc.iiit.ac.in/treebank_H2014/">Hindi Dependency
|
||
Treebank</a> - A multi-representational multi-layered treebank for Hindi
|
||
and Urdu</li>
|
||
<li><a
|
||
href="https://universaldependencies.org/treebanks/hi_hdtb/index.html">Universal
|
||
Dependencies Treebank in Hindi</a>
|
||
<ul>
|
||
<li><a
|
||
href="http://universaldependencies.org/treebanks/hi_pud/index.html">Parallel
|
||
Universal Dependencies Treebank in Hindi</a> - A smaller part of the
|
||
above-mentioned treebank.</li>
|
||
</ul></li>
|
||
<li><a href="https://www.isical.ac.in/~fire/data/">ISI FIRE Stopwords
|
||
List (Hindi and Bangla)</a></li>
|
||
<li><a href="https://github.com/6/stopwords-json">Peter Graham’s
|
||
Stopwords List</a></li>
|
||
<li><a href="https://www.nltk.org/book/ch02.html">NLTK Corpus</a> 60k
|
||
Words POS Tagged, Bangla, Hindi, Marathi, Telugu</li>
|
||
<li><a href="https://github.com/goru001/nlp-for-hindi">Hindi Movie
|
||
Reviews Dataset</a> ~1k Samples, 3 polarity classes</li>
|
||
<li><a
|
||
href="https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1">BBC
|
||
News Hindi Dataset</a> 4.3k Samples, 14 classes</li>
|
||
<li><a href="https://github.com/pnisarg/ABSA">IIT Patna Hindi ABSA
|
||
Dataset</a> 5.4k Samples, 12 Domains, 4k aspect terms, aspect and
|
||
sentence level polarity in 4 classes</li>
|
||
<li><a href="https://github.com/AtikRahman/Bangla_Datasets_ABSA">Bangla
|
||
ABSA</a> 5.5k Samples, 2 Domains, 10 aspect terms</li>
|
||
<li><a href="https://www.iitp.ac.in/~ai-nlp-ml/resources.html">IIT Patna
|
||
Movie Review Sentiment Dataset</a> 2k Samples, 3 polarity labels</li>
|
||
</ul>
|
||
<h4
|
||
id="corporadatasets-that-need-a-loginaccess-can-be-gained-via-email">Corpora/Datasets
|
||
that need a login; access can be gained via email</h4>
|
||
<ul>
|
||
<li><a href="http://amitavadas.com/SAIL/">SAIL 2015</a> Twitter and
|
||
Facebook labelled sentiment samples in Hindi, Bengali, Tamil,
|
||
Telugu.</li>
|
||
<li><a
|
||
href="http://www.cfilt.iitb.ac.in/Sentiment_Analysis_Resources.html">IIT
|
||
Bombay NLP Resources</a> Sentiwordnet, Movie and Tourism parallel
|
||
labelled corpora, polarity labelled sense annotated corpus, Marathi
|
||
polarity labelled corpus.</li>
|
||
<li><a
|
||
href="https://tdil-dc.in/index.php?option=com_catalogue&task=viewTools&id=83&lang=en">TDIL-IC
|
||
aggregates a lot of useful resources and provides access to otherwise
|
||
gated datasets</a></li>
|
||
</ul>
|
||
<h3 id="language-models-and-word-embeddings">Language Models and Word
|
||
Embeddings</h3>
|
||
<ul>
|
||
<li><a href="https://nirantk.com/hindi2vec/">Hindi2Vec</a> and <a
|
||
href="https://github.com/goru001/nlp-for-hindi">nlp-for-hindi</a> ULMFIT
|
||
style language model</li>
|
||
<li><a href="https://www.iitp.ac.in/~ai-nlp-ml/resources.html">IIT Patna
|
||
Bilingual Word Embeddings Hi-En</a></li>
|
||
<li><a href="https://fasttext.cc/docs/en/crawl-vectors.html">Fasttext
|
||
word embeddings in a whole bunch of languages, trained on Common
|
||
Crawl</a></li>
|
||
<li><a href="https://github.com/Kyubyong/wordvectors">Hindi and Bengali
|
||
Word2Vec</a></li>
|
||
<li><a href="https://github.com/HIT-SCIR/ELMoForManyLangs">Hindi and
|
||
Urdu ELMo Model</a></li>
|
||
<li><a
|
||
href="https://huggingface.co/surajp/albert-base-sanskrit">Sanskrit
|
||
Albert</a> Trained on Sanskrit Wikipedia and OSCAR corpus</li>
|
||
</ul>
|
||
<h3 id="libraries-and-tooling">Libraries and Tooling</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/Saurav0074/mt-dma">Multi-Task Deep
|
||
Morphological Analyzer</a> Deep Network based Morphological Parser for
|
||
Hindi and Urdu</li>
|
||
<li><a
|
||
href="https://github.com/anoopkunchukuttan/indic_nlp_library">Anoop
|
||
Kunchukuttan</a> 18 Languages, whole host of features from tokenization
|
||
to translation</li>
|
||
<li><a href="http://sivareddy.in/downloads">SivaReddy’s Dependency
|
||
Parser</a> Dependency Parser and POS Tagger for Kannada, Hindi and
|
||
Telugu. <a
|
||
href="https://github.com/CalmDownKarm/sivareddydependencyparser">Python3
|
||
Port</a></li>
|
||
<li><a href="https://github.com/goru001/inltk">iNLTK</a> - A Natural
|
||
Language Toolkit for Indic Languages (Indian subcontinent languages)
|
||
built on top of Pytorch/Fastai, which aims to provide out of the box
|
||
support for common NLP tasks.</li>
|
||
</ul>
|
||
<h2 id="nlp-in-thai">NLP in Thai</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<h3 id="libraries-4">Libraries</h3>
|
||
<ul>
|
||
<li><a href="https://github.com/PyThaiNLP/pythainlp">PyThaiNLP</a> -
|
||
Thai NLP package in Python</li>
|
||
<li><a href="https://github.com/wittawatj/jtcc">JTCC</a> - A character
|
||
cluster library in Java</li>
|
||
<li><a href="https://github.com/pucktada/cutkum">CutKum</a> - Word
|
||
segmentation with deep learning in TensorFlow</li>
|
||
<li><a href="https://pypi.python.org/pypi/tltk/">Thai Language
|
||
Toolkit</a> - Based on a paper by Wirote Aroonmanakun in 2002 with
|
||
included dataset</li>
|
||
<li><a href="https://github.com/KenjiroAI/SynThai">SynThai</a> - Word
|
||
segmentation and POS tagging using deep learning in Python</li>
|
||
</ul>
|
||
<h3 id="data-1">Data</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://www.nectec.or.th/corpus/index.php?league=pm">Inter-BEST</a>
|
||
- A text corpus with 5 million words with word segmentation</li>
|
||
<li><a
|
||
href="https://github.com/PyThaiNLP/lexicon-thai/tree/master/thai-corpus/Prime%20Minister%2029">Prime
|
||
Minister 29</a> - Dataset containing speeches of the current Prime
|
||
Minister of Thailand</li>
|
||
</ul>
|
||
<h2 id="nlp-in-danish">NLP in Danish</h2>
|
||
<ul>
|
||
<li><a href="https://github.com/ITUnlp/daner">Named Entity Recognition
|
||
for Danish</a></li>
|
||
<li><a href="https://github.com/alexandrainst/danlp">DaNLP</a> - NLP
|
||
resources in Danish</li>
|
||
<li><a href="https://github.com/fnielsen/awesome-danish">Awesome
|
||
Danish</a> - A curated list of awesome resources for Danish language
|
||
technology</li>
|
||
</ul>
|
||
<h2 id="nlp-in-vietnamese">NLP in Vietnamese</h2>
|
||
<h3 id="libraries-5">Libraries</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://github.com/undertheseanlp/underthesea">underthesea</a> -
|
||
Vietnamese NLP Toolkit</li>
|
||
<li><a href="https://github.com/phuonglh/vn.vitk">vn.vitk</a> - A
|
||
Vietnamese Text Processing Toolkit</li>
|
||
<li><a href="https://github.com/vncorenlp/VnCoreNLP">VnCoreNLP</a> - A
|
||
Vietnamese natural language processing toolkit</li>
|
||
<li><a href="https://github.com/VinAIResearch/PhoBERT">PhoBERT</a> -
|
||
Pre-trained language models for Vietnamese</li>
|
||
<li><a href="https://github.com/trungtv/pyvi">pyvi</a> - Python
|
||
Vietnamese Core NLP Toolkit</li>
|
||
</ul>
|
||
<h3 id="data-2">Data</h3>
|
||
<ul>
|
||
<li><a
|
||
href="https://vlsp.hpda.vn/demo/?page=resources&lang=en">Vietnamese
|
||
treebank</a> - 10,000 sentences for the constituency parsing task</li>
|
||
<li><a href="https://arxiv.org/pdf/1710.05519.pdf">BKTreeBank</a> - a
|
||
Vietnamese Dependency Treebank</li>
|
||
<li><a
|
||
href="https://github.com/UniversalDependencies/UD_Vietnamese-VTB">UD_Vietnamese</a>
|
||
- Vietnamese Universal Dependency Treebank</li>
|
||
<li><a href="https://ailab.hcmus.edu.vn/vivos/">VIVOS</a> - a free
|
||
Vietnamese speech corpus consisting of 15 hours of recorded speech by
|
||
AILab</li>
|
||
<li><a
|
||
href="http://viet.jnlp.org/download-du-lieu-tu-vung-corpus">VNTQcorpus(big).txt</a>
|
||
- 1.75 million sentences from news</li>
|
||
<li><a href="https://github.com/VinAIResearch/ViText2SQL">ViText2SQL</a>
|
||
- A dataset for Vietnamese Text-to-SQL semantic parsing (EMNLP-2020
|
||
Findings)</li>
|
||
<li><a href="https://github.com/qhungngo/EVBCorpus">EVB Corpus</a> -
|
||
20,000,000 words (20 million) from 15 bilingual books, 100 parallel
|
||
English-Vietnamese / Vietnamese-English texts, 250 parallel law and
|
||
ordinance texts, 5,000 news articles, and 2,000 film subtitles.</li>
|
||
</ul>
|
||
<h2 id="nlp-for-dutch">NLP for Dutch</h2>
|
||
<p><a href="#contents">Back to Top</a></p>
|
||
<ul>
|
||
<li><a href="https://github.com/proycon/python-frog">python-frog</a> -
|
||
Python binding to Frog, an NLP suite for Dutch. (POS tagging,
|
||
lemmatisation, dependency parsing, NER)</li>
|
||
<li><a href="https://github.com/rfdj/SimpleNLG-NL">SimpleNLG_NL</a> -
|
||
Dutch surface realiser used for Natural Language Generation in Dutch,
|
||
based on the SimpleNLG implementation for English and French.</li>
|
||
<li><a href="https://github.com/rug-compling/alpino">Alpino</a> -
|
||
Dependency parser for Dutch (also does PoS tagging and
|
||
Lemmatisation).</li>
|
||
<li><a
|
||
href="https://github.com/opensource-spraakherkenning-nl/Kaldi_NL">Kaldi
|
||
NL</a> - Dutch Speech Recognition models based on <a
|
||
href="http://kaldi-asr.org/">Kaldi</a>.</li>
|
||
<li><a href="https://spacy.io/">spaCy</a> - <a
|
||
href="https://spacy.io/models/nl">Dutch model</a> available. -
|
||
Industrial strength NLP with Python and Cython.</li>
|
||
</ul>
|
||
<h2 id="nlp-in-indonesian">NLP in Indonesian</h2>
|
||
<h3 id="datasets-3">Datasets</h3>
|
||
<ul>
|
||
<li>Kompas and Tempo collections at <a
|
||
href="http://ilps.science.uva.nl/resources/bahasa/">ILPS</a></li>
|
||
<li><a
|
||
href="http://www.panl10n.net/english/outputs/Indonesia/UI/0802/UI-1M-tagged.zip">PANL10N
|
||
for PoS tagging</a>: 39K sentences and 900K word tokens</li>
|
||
<li><a href="https://github.com/famrashel/idn-tagged-corpus">IDN for PoS
|
||
tagging</a>: This corpus contains 10K sentences and 250K word
|
||
tokens</li>
|
||
<li><a href="https://github.com/famrashel/idn-treebank">Indonesian
|
||
Treebank</a> and <a
|
||
href="https://github.com/UniversalDependencies/UD_Indonesian-GSD">Universal
|
||
Dependencies-Indonesian</a></li>
|
||
<li><a href="https://github.com/kata-ai/indosum">IndoSum</a> for text
|
||
summarization and classification</li>
|
||
<li><a href="http://wn-msa.sourceforge.net/">Wordnet-Bahasa</a> - large,
|
||
free, semantic dictionary</li>
|
||
<li>IndoBenchmark <a
|
||
href="https://github.com/indobenchmark/indonlu">IndoNLU</a> includes
|
||
pre-trained language model (IndoBERT), FastText model, Indo4B corpus,
|
||
and several NLU benchmark datasets</li>
|
||
</ul>
|
||
<h3 id="libraries-embedding">Libraries &amp; Embedding</h3>
|
||
<ul>
|
||
<li>Natural language toolkit <a
|
||
href="https://github.com/kangfend/bahasa">bahasa</a></li>
|
||
<li><a
|
||
href="https://github.com/galuhsahid/indonesian-word-embedding">Indonesian
|
||
Word Embedding</a></li>
|
||
<li>Pretrained <a
|
||
href="https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.id.zip">Indonesian
|
||
fastText Text Embedding</a> trained on Wikipedia</li>
|
||
<li>IndoBenchmark <a
|
||
href="https://github.com/indobenchmark/indonlu">IndoNLU</a> includes
|
||
pretrained language model (IndoBERT), FastText model, Indo4B corpus, and
|
||
several NLU benchmark datasets</li>
|
||
</ul>
|
||
<h2 id="nlp-in-urdu">NLP in Urdu</h2>
<h3 id="datasets-4">Datasets</h3>
<ul>
<li><a href="https://github.com/mirfan899/Urdu">Collection of Urdu
datasets</a> for POS, NER and NLP tasks</li>
</ul>
<h3 id="libraries-6">Libraries</h3>
<ul>
<li><a href="https://github.com/urduhack/urduhack">Natural Language
Processing library</a> for the Urdu (🇵🇰) language</li>
</ul>
<h2 id="nlp-in-persian">NLP in Persian</h2>
<p><a href="#contents">Back to Top</a></p>
<h3 id="libraries-7">Libraries</h3>
<ul>
<li><a href="https://github.com/roshan-research/hazm">Hazm</a> - Persian
NLP Toolkit.</li>
<li><a href="https://github.com/ICTRC/Parsivar">Parsivar</a>: A Language
Processing Toolkit for Persian</li>
<li><a href="https://github.com/AlirezaTheH/perke">Perke</a>: A Python
keyphrase extraction package for the Persian language. It provides an
end-to-end keyphrase extraction pipeline in which each component can be
easily modified or extended to develop new models.</li>
<li><a href="https://github.com/jonsafari/perstem">Perstem</a>: Persian
stemmer, morphological analyzer, transliterator, and partial
part-of-speech tagger</li>
<li><a
href="https://github.com/NarimanN2/ParsiAnalyzer">ParsiAnalyzer</a>:
Persian Analyzer For Elasticsearch</li>
<li><a href="https://github.com/aziz/virastar">virastar</a>: Cleaning up
Persian text!</li>
</ul>
<h3 id="datasets-5">Datasets</h3>
<ul>
<li><a href="https://dbrg.ut.ac.ir/بیژن%E2%80%8Cخان/">Bijankhan
Corpus</a>: A tagged corpus suitable for natural language processing
research on the Persian (Farsi) language. The collection is gathered
from daily news and common texts, and all documents are categorized into
subjects such as political, cultural and so on. In total, there are
4,300 different subjects. The Bijankhan collection contains about 2.6
million manually tagged words with a tag set of 40 Persian POS
tags.</li>
<li><a
href="https://sites.google.com/site/mojganserajicom/home/upc">Uppsala
Persian Corpus (UPC)</a>: A large, freely available Persian corpus. It
is a modified version of the Bijankhan corpus with additional sentence
segmentation and consistent tokenization, containing 2,704,028 tokens
annotated with 31 part-of-speech tags. The part-of-speech tags are
listed with explanations in <a
href="https://sites.google.com/site/mojganserajicom/home/upc/Table_tag.pdf">this
table</a>.</li>
<li><a href="http://hdl.handle.net/11234/1-3195">Large-Scale Colloquial
Persian</a>: The Large Scale Colloquial Persian Dataset (LSCP) is
hierarchically organized in a semantic taxonomy that treats multi-task
informal Persian language understanding as a comprehensive problem. LSCP
includes 120M sentences from 27M casual Persian tweets, annotated with
dependency relations, part-of-speech tags, sentiment polarity, and
automatic translations of the original Persian sentences into English
(EN), German (DE), Czech (CS), Italian (IT) and Hindi (HI). Learn more
about this project at the <a
href="https://iasbs.ac.ir/~ansari/lscp/">LSCP webpage</a>.</li>
<li><a
href="https://github.com/HaniehP/PersianNER">ArmanPersoNERCorpus</a>:
The dataset includes 250,015 tokens and 7,682 Persian sentences in
total. It is available in 3 folds to be used in turn as training and
test sets. Each file contains one token per line, along with its
manually annotated named-entity tag. Sentences are separated by a
newline. The NER tags are in IOB format.</li>
<li><a href="https://github.com/Text-Mining/Persian-NER">FarsiYar
PersianNER</a>: The dataset includes about 25,000,000 tokens and about
1,000,000 Persian sentences in total, based on the <a
href="https://github.com/Text-Mining/Persian-Wikipedia-Corpus">Persian
Wikipedia Corpus</a>. The NER tags are in IOB format. More than 1000
volunteers contributed tag improvements to this dataset via a web panel
or Android app. Updated tags are released every two weeks.</li>
<li><a href="http://farsbase.net/PERLEX.html">PERLEX</a>: The first
Persian dataset for relation extraction, which is an expert-translated
version of the “Semeval-2010-Task-8” dataset. Link to the relevant
publication.</li>
<li><a href="http://dadegan.ir/catalog/perdt">Persian Syntactic
Dependency Treebank</a>: This treebank is supplied for free
noncommercial use; for commercial use, contact its authors. It contains
29,982 annotated sentences, including samples from almost all verbs of
the Persian valency lexicon.</li>
<li><a href="http://stp.lingfil.uu.se/~mojgan/UPDT.html">Uppsala Persian
Dependency Treebank (UPDT)</a>: Dependency-based syntactically annotated
corpus.</li>
<li><a href="https://dbrg.ut.ac.ir/hamshahri/">Hamshahri</a>: A
standard, reliable Persian text collection that was used at the Cross
Language Evaluation Forum (CLEF) in 2008 and 2009 for evaluation of
Persian information retrieval systems.</li>
</ul>
<h2 id="nlp-in-ukrainian">NLP in Ukrainian</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a
href="https://github.com/asivokon/awesome-ukrainian-nlp">awesome-ukrainian-nlp</a>
- a curated list of Ukrainian NLP datasets, models, etc.</li>
<li><a
href="https://github.com/Helsinki-NLP/UkrainianLT">UkrainianLT</a> -
another curated list with a focus on machine translation and speech
processing</li>
</ul>
<h2 id="nlp-in-hungarian">NLP in Hungarian</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a
href="https://github.com/oroszgy/awesome-hungarian-nlp">awesome-hungarian-nlp</a>:
A curated list of free resources dedicated to Hungarian Natural Language
Processing.</li>
</ul>
<h2 id="nlp-in-portuguese">NLP in Portuguese</h2>
<p><a href="#contents">Back to Top</a></p>
<ul>
<li><a
href="https://github.com/ajdavidl/Portuguese-NLP">Portuguese-nlp</a> - a
list of resources and tools developed with a focus on Portuguese.</li>
</ul>
<h2 id="other-languages">Other Languages</h2>
<ul>
<li>Russian: <a href="https://github.com/kmike/pymorphy2">pymorphy2</a>
- a good POS tagger for Russian</li>
<li>Asian Languages: Thai, Lao, Chinese, Japanese, and Korean <a
href="https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html">ICU
Tokenizer</a> implementation in Elasticsearch</li>
<li>Ancient Languages: <a href="https://github.com/cltk/cltk">CLTK</a>:
The Classical Language Toolkit is a Python library and collection of
texts for doing NLP in ancient languages</li>
<li>Hebrew: <a
href="https://github.com/NLPH/NLPH_Resources">NLPH_Resources</a> - A
collection of papers, corpora and linguistic resources for NLP in
Hebrew</li>
</ul>
<p><a href="#contents">Back to Top</a></p>
<p><a href="./CREDITS.md">Credits</a> for initial curators and
sources</p>
<h2 id="license">License</h2>
<p><a href="./LICENSE">License</a> - CC0</p>