awesome-awesomeness/terminal/linguistics2
2025-07-18 23:13:11 +02:00

Awesome Linguistics
!Awesome (https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg) (https://github.com/sindresorhus/awesome)
A curated list of anything remotely related to linguistics, sorted in alphabetical order.
- Programming (#programming)
- **Platforms and toolkits** (#platforms-and-toolkits)
- **Algorithms** (#algorithms) 
- **Data sets** (#data-sets) 
- Resources (#resources)
- **Deep learning models and transformers** (#deep-learning-models-and-transformers)
- **On Wikipedia** (#on-wikipedia) 
- **On YouTube** (#on-youtube)
- **Books** (#books) 
 - **Free** (#free) 
 - **Non free** (#non-free) 
 - **Lists** (#lists) 
- Standards (#standards)
- Lists (#lists)
- Communities (#communities)
Programming
Libraries, frameworks and applications useful for building language-processing software.
Platforms and toolkits
⟡ CLARIN-D web tools (https://www.clarin-d.net/en/analysing) - Tools for Analysing Research Data 
⟡ CorpusExplorer (https://notes.jan-oliver-ruediger.de/software/corpusexplorer-overview/) - Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 50 interactive visualizations under a user-friendly interface.
⟡ Haxe-linguistics (https://github.com/sexybiggetje/haxe-linguistics) - Early-stage linguistic analysis and natural language processing library for Haxe.
⟡ Natural (https://github.com/NaturalNode/natural) - General natural language tools for Node.js.
⟡ Natural Language Toolkit (NLTK) (http://www.nltk.org/) - A leading platform for building Python programs to work with human language data.
⟡ Snowball (https://snowballstem.org/) - Snowball is a language in which stemming algorithms can be easily represented.
⟡ spaCy (https://spacy.io/) - Industrial-strength Natural Language Processing in Python.
⟡ Mate Tools (http://hdl.handle.net/11022/1007-0000-0000-8E4E-A) - Available as a web service via WebLicht.
⟡ UBIAI (https://ubiai.tools/) - Easy-to-use text annotation tool for teams with auto-annotation features. Supports NER, relations and document classification, as well as OCR annotation for invoice labeling.
⟡ textblob-de (https://github.com/markuskiller/textblob-de) - German language support for TextBlob; an alternative to spaCy (see above).
⟡ tyo (https://github.com/mongsvo/tyo) - A utility for finding Typo-Bridges.
⟡ UralicNLP (https://github.com/mikahama/uralicNLP) - An open source Python library for processing morphologically rich and, for the most part, endangered Uralic languages. It can do morphological analysis, generation, lemmatization, disambiguation and lexical lookup for a great many Uralic languages.
Algorithms
⟡ Stemming algorithms for various European languages (http://snowball.tartarus.org/texts/stemmersoverview.html) - Overview of the stemming algorithms implemented in Snowball.
⟡ The Porter Stemmer Algorithm (http://tartarus.org/martin/PorterStemmer/) - The official home page for distribution of the Porter Stemming Algorithm, written and maintained by its author, Martin Porter.
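To give a feel for what the stemmers linked above do, here is a minimal sketch of step 1a of the original Porter algorithm, which handles plural suffixes. The function name `porter_step_1a` is made up for this illustration; a complete stemmer has several more steps, so use the linked implementations for real work.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of the original Porter stemmer (plural suffixes).

    Rules, tried in order (the first matching suffix wins):
        SSES -> SS
        IES  -> I
        SS   -> SS
        S    -> (removed)
    """
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "caress", "cats", "dog"]:
        print(w, "->", porter_step_1a(w))
```

The result is a stem, not necessarily a dictionary word: "ponies" becomes "poni", which later steps of the full algorithm leave as-is.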
Data sets
⟡ EuroRomCom Data (https://github.com/kirkins/euroromcom) - JSON-formatted Pan-Romance word lists.
⟡ Araneum Germanicum (http://aranea.juls.savba.sk/aranea_about/_germanicum.html)
⟡ CEHugeWebCorpus (https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2638) - German corpus based on CommonCrawl
⟡ Digitales Wörterbuch der deutschen Sprache (DWDS) (https://dwds.de)
⟡ GC4 Corpus (https://german-nlp-group.github.io/projects/gc4-corpus.html) (CommonCrawl)
⟡ IDS Corpora (https://www1.ids-mannheim.de/kl/projekte/korpora) - German Reference Corpus
⟡ Leipzig Corpora Collection (https://wortschatz.uni-leipzig.de/en/download/) - Sampled sentences in different languages.
⟡ SdeWaC (https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/sdewac.en.html) - Large German web corpus.
⟡ C-WEP (http://lingured.info/linguistic-resources/cwep/)
⟡ DysList (list of dyslexic errors) (https://github.com/Rauschii/DysListGerman)
⟡ Falko (https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko)
⟡ Litkey (https://www.linguistics.ruhr-uni-bochum.de/litkeycorpus/)
⟡ OpinionSpam (https://github.com/hdaSprachtechnologie/OpinionSpam)
Resources
⟡ Low Resource Languages (https://github.com/RIchardLitt/low-resource-languages) - A list of resources for conservation, development, and documentation of low resource (human) languages.
⟡ Language Science Press (https://langsci-press.org/) - Language Science Press is a born-digital scholar-led open access publisher in linguistics.
Deep learning models and transformers
⟡ dbmdz BERT models (https://github.com/dbmdz/berts)
⟡ Deepset German BERT model (https://deepset.ai/german-bert)
⟡ Evaluating German Transformer Language Models with Syntactic Agreement Tests (https://github.com/DFKI-NLP/gevalm)
⟡ German ELMo Model (https://github.com/t-systems-on-site-services-gmbh/german-elmo-model)
⟡ german-transformer-training (https://github.com/PhilipMay/german-transformer-training)
⟡ GermLM (https://github.com/tonianelope/Multilingual-BERT) (NER exploration)
⟡ GerPT2 (https://github.com/bminixhofer/gerpt2)
⟡ Sentence Transformers (https://github.com/UKPLab/sentence-transformers)
On Wikipedia
⟡ Bag of words model (https://en.wikipedia.org/wiki/Bag-of-words_model)
⟡ Document classification (https://en.wikipedia.org/wiki/Document_classification)
⟡ Language models (https://en.wikipedia.org/wiki/Language_model)
⟡ Naive Bayes classification (https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
⟡ Natural language processing (https://en.wikipedia.org/wiki/Natural_language_processing)
⟡ Outline of natural language processing (https://en.wikipedia.org/wiki/Outline_of_natural_language_processing)
⟡ Part-of-speech tagging (https://en.wikipedia.org/wiki/Part-of-speech_tagging)
⟡ Sentiment analysis (https://en.wikipedia.org/wiki/Sentiment_analysis)
⟡ Term frequency - inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
⟡ Vector space model (https://en.wikipedia.org/wiki/Vector_space_model)
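Two of the concepts linked above, the bag-of-words model and tf-idf weighting, fit in a few lines of standard-library Python. This is a sketch under stated assumptions: tokens are split on whitespace, term frequency is the count divided by document length, and idf is `log(N / df)`. Other tf-idf variants exist, and the function names `bag_of_words` and `tf_idf` are made up for this illustration.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())


def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document.

    Assumed weighting (one of several common variants):
        tf  = term count / total tokens in the document
        idf = log(number of documents / documents containing the term)
    """
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Document frequency: each term is counted once per document it occurs in.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


if __name__ == "__main__":
    docs = ["the cat sat", "the dog sat down"]
    for doc, scores in zip(docs, tf_idf(docs)):
        print(doc, scores)
```

Terms that occur in every document, such as "the" in the sample, get a weight of zero, while terms unique to one document get the highest weights.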
On YouTube
⟡ Computational Linguistics Lecture Playlist (YouTube) (https://www.youtube.com/playlist?list=PLegWUnz91WfuPebLI97-WueAP90JO-15i) - Lectures for a University of Maryland class on computational linguistics.
⟡ The Virtual Linguistics Campus (https://www.youtube.com/channel/UCaMpov1PPVXGcKYgwHjXB3g) - CC-licensed educational videos interconnected with Marburg University's e-learning platform of the same name.
Books
Some of the more interesting and complete books.
Free
⟡ Essentials of Linguistics, 2nd edition (https://ecampusontario.pressbooks.pub/essentialsoflinguistics2/) - An introductory textbook.
⟡ Introduction to Linguistics (https://linguistics.ucla.edu/people/Kracht/courses/ling20-fall07/ling-intro.pdf)
⟡ Natural Language Processing with Python (https://www.nltk.org/book/) - The book accompanying the NLTK package.
⟡ Text Mining with R (https://www.tidytextmining.com)
Non free
⟡ Foundations of Computational Linguistics (https://books.google.com/books?id=o9iGAgAAQBAJ&dq=Foundations+of+Computational+Linguistics&hl=nl&source=gbs_navlinks_s)
⟡ Foundations of Statistical Natural Language Processing (https://books.google.nl/books?id=YiFDxbEX3SUC)
⟡ Semisupervised Learning for Computational Linguistics (https://books.google.com/books/about/Semisupervised_Learning_for_Computationa.html?id=VCd67cGB_rAC&redir_esc=y)
⟡ Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (https://books.google.nl/books?id=fZmj5UNK8AQC)
⟡ The Oxford Handbook of Computational Linguistics (https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199276349.001.0001/oxfordhb-9780199276349)
Standards
⟡ DTA Basisformat (https://www.deutschestextarchiv.de/doku/basisformat/)
⟡ ISO TC 37 SC 4 (https://www.iso.org/committee/297592.html)
⟡ UIMA (https://docs.oasis-open.org/uima/v1.0/os/uima-spec-os.html)
Lists
⟡ 15 most popular natural language processing books on Goodreads (https://www.goodreads.com/shelf/show/natural-language-processing)
⟡ GitHub topics corpus-linguistics (https://github.com/topics/corpus-linguistics) & nlp (https://github.com/topics/nlp)
⟡ nlp-datasets (https://github.com/niderhoff/nlp-datasets)
⟡ NLP-progress (https://github.com/sebastianruder/NLP-progress)
⟡ /r/LanguageTechnology/ (https://www.reddit.com/r/LanguageTechnology/)
⟡ awesome-nlp (https://github.com/keon/awesome-nlp)
⟡ Awesome Community-Curated NLP List (https://github.com/alvations/awesome-community-curated-nlp)
⟡ awesome-chinese-nlp (https://github.com/crownpku/Awesome-Chinese-NLP)
⟡ awesome-danish (https://github.com/fnielsen/awesome-danish)
⟡ awesome-hungarian-nlp (https://github.com/oroszgy/awesome-hungarian-nlp)
⟡ awesome Information Retrieval (https://github.com/harpribot/awesome-information-retrieval)
⟡ Indonesian NLP (https://github.com/kmkurn/id-nlp-resource)
⟡ Norwegian NLP resources (https://github.com/web64/norwegian-nlp-resources)
⟡ German NLP resources (https://github.com/adbar/German-NLP/)
⟡ awesome-nlp-polish (https://github.com/ksopyla/awesome-nlp-polish)
⟡ awesome-spanish-nlp (https://github.com/dav009/awesome-spanish-nlp)
⟡ M. Weisser's list of NLP/Computational Linguistics Resources (https://martinweisser.org/corpora_site/comp_ling_resources.html)
Communities
⟡ Linguistics Stack Exchange (https://linguistics.stackexchange.com/)
⟡ Untranslatable.co, Multilingual urban dictionary (https://untranslatable.co/)
Awesome Linguistics on GitHub: https://github.com/theimpossibleastronaut/awesome-linguistics