<h1 id="awesome-information-retrieval-awesome">Awesome Information
Retrieval <a href="https://github.com/sindresorhus/awesome"><img
src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"
alt="Awesome" /></a></h1>
<p><a
href="https://gitter.im/awesome-information-retrieval/Lobby?utm_source=badge&amp;utm_medium=badge&amp;utm_campaign=pr-badge&amp;utm_content=badge"><img
src="https://badges.gitter.im/awesome-information-retrieval/Lobby.svg"
alt="Join the chat at https://gitter.im/awesome-information-retrieval/Lobby" /></a></p>
<p>Curated list of information retrieval and web search resources from
all around the web.</p>
<h2 id="introduction">Introduction</h2>
<p><a
href="https://en.wikipedia.org/wiki/Information_retrieval">Information
Retrieval</a> involves finding relevant information for user queries,
ranging from the simple domain of database search to the complicated aspects of
web search (e.g., Google, Bing, Yahoo). Currently, researchers are
developing algorithms to address the <a
href="https://en.wikipedia.org/wiki/Information_needs">Information
Need</a> of users by maximizing the <a
href="https://en.wikipedia.org/wiki/Relevance_(information_retrieval)">User
and Topic Relevance</a> of retrieved results while minimizing <a
href="https://en.wikipedia.org/wiki/Information_overload">Information
Overload</a> and retrieval time.</p>
<h2 id="contributing">Contributing</h2>
<p>Please feel free to
send me <a
href="https://github.com/harpribot/awesome-information-retrieval/pulls">pull
requests</a> or <a
href="mailto:harshal.priyadarshi@utexas.edu">email</a> me to
add new links. I am very open to suggestions and corrections. Please
look at the <a href="contributing.md">contributions guide</a>.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#books">Books</a></li>
<li><a href="#courses">Courses</a></li>
<li><a href="#software">Software</a></li>
<li><a href="#datasets">Datasets</a></li>
<li><a href="#talks">Talks</a></li>
<li><a href="#conferences">Conferences</a></li>
<li><a href="#blogs">Blogs</a>
<ul>
<li><a href="#interesting-reads">Interesting Reads</a></li>
</ul></li>
</ul>
<h2 id="books">Books</h2>
<ul>
<li><a href="http://www-nlp.stanford.edu/IR-book/">Introduction to
Information Retrieval</a> - C.D. Manning, P. Raghavan, H. Schütze.
Cambridge UP, 2008. (First book for getting started with Information
Retrieval).</li>
<li><a href="http://ciir.cs.umass.edu/downloads/SEIRiP.pdf">Search
Engines: Information Retrieval in Practice</a> - Bruce Croft, Don
Metzler, and Trevor Strohman. 2009. (Great book for readers interested
in knowing how Search Engines work. The book is very detailed).</li>
<li><a href="http://people.ischool.berkeley.edu/~hearst/irbook/">Modern
Information Retrieval</a> - R. Baeza-Yates, B. Ribeiro-Neto.
Addison-Wesley, 1999.</li>
<li><a href="http://www.search-engines-book.com/">Information Retrieval
in Practice</a> - B. Croft, D. Metzler, T. Strohman. Pearson Education,
2009.</li>
<li><a href="http://www.cse.iitb.ac.in/%7Esoumen/mining-the-web/">Mining
the Web: Analysis of Hypertext and Semi Structured Data</a> - S.
Chakrabarti. Morgan Kaufmann, 2002.</li>
<li><a
href="http://www.springer.com/prod/b/1-4020-1216-0?referer=www.wkap.nl">Language
Modeling for Information Retrieval</a> - W.B. Croft, J. Lafferty.
Springer, 2003. (Covers the language modeling approach to information
retrieval and details the probabilistic perspective on the
domain).</li>
<li><a
href="http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf">Information
Retrieval: A Survey</a> - Ed Greengrass, 2000. (Comprehensive survey of
conventional information retrieval, before the deep learning era).</li>
<li><a
href="https://www.amazon.com/Introduction-Modern-Information-Retrieval-Third/dp/185604694X">Introduction
to Modern Information Retrieval</a> - G.G. Chowdhury. Neal-Schuman,
2003. (Intended for students of library and information studies).</li>
<li><a
href="https://www.amazon.com/Information-Retrieval-Systems-Library-Hardcover/dp/0123694124">Text
Information Retrieval Systems</a> - C.T. Meadow, B.R. Boyce, D.H. Kraft,
C.L. Barry. Academic Press, 2007 (library/information science
perspective).</li>
</ul>
<h2 id="courses">Courses</h2>
<ul>
<li><a
href="http://courses.ischool.utexas.edu/Lease_Matt/2016/Fall/INF384H/">INF384H
/ CS395T / INF350E: Concepts of Information Retrieval (and Web
Search)</a> - Matthew Lease (University of Texas at Austin).</li>
<li><a href="http://web.stanford.edu/class/cs276/">CS 276 / LING 286:
Information Retrieval and Web Search</a> - Chris Manning and Pandu Nayak
(Stanford University).</li>
<li><a href="https://www.cs.utexas.edu/~mooney/ir-course/">CS 371R:
Information Retrieval and Web Search</a> - Raymond J. Mooney (University
of Texas at Austin).</li>
<li><a href="http://www.cs.ucr.edu/~vagelis/classes/CS172/">CS 172:
Introduction to Information Retrieval</a> - Vagelis Hristidis
(University of California - Riverside).</li>
<li><a
href="http://www2.sims.berkeley.edu/academics/courses/is240/s06/">SIMS
240: Principles of Information Retrieval</a> - Ray R. Larson (UC
Berkeley).</li>
<li><a href="http://boston.lti.cs.cmu.edu/classes/11-642/">11-442 /
11-642: Search Engines</a> - Jamie Callan (CMU).</li>
<li><a href="http://www.cs.jhu.edu/%7Eyarowsky/cs466.html">600.466:
Information Retrieval and Web Agents</a> - David Yarowsky (Johns Hopkins
University).</li>
<li><a
href="http://www.cs.princeton.edu/courses/archive/spring06/cos435/">CS
435: Information Retrieval, Discovery, and Delivery</a> - Andrea LaPaugh
(Princeton University).</li>
<li><a
href="https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/teaching/winter-semester-201516/information-retrieval-and-data-mining/">Information
Retrieval and Data Mining</a> - Dr. Jilles Vreeken, Prof. Dr. Gerhard
Weikum (MPI).</li>
<li><a href="https://www.coursera.org/learn/text-retrieval">Coursera -
Text Retrieval and Search Engines</a> - Prof. ChengXiang Zhai
(University of Illinois at Urbana-Champaign).</li>
</ul>
<h2 id="software">Software</h2>
<ul>
<li><a href="http://lucene.apache.org/core/">Apache Lucene</a> - Open-source
search library that can be used to test information retrieval
algorithms. Twitter uses this core for its real-time search.</li>
<li><a href="http://www.lemurproject.org">The Lemur Project</a> - The
Lemur Project develops search engines, browser toolbars, text analysis
tools, and data resources that support research and development of
information retrieval and text mining software.
<ul>
<li><a href="http://www.lemurproject.org/indri.php">Indri Search
Engine</a> - Another open-source search engine and an alternative to Apache
Lucene.</li>
<li><a href="http://www.lemurproject.org/lemur.php">Lemur Toolkit</a> -
Open Source Toolkit for research in Language Modeling, filtering and
categorization.</li>
</ul></li>
</ul>
<h2 id="datasets">Datasets</h2>
<h4 id="standard-ir-collections">Standard IR Collections</h4>
<ul>
<li><a href="http://wiki.dbpedia.org/Downloads2015-10">DBpedia</a> -
Linked data extracted from the web.</li>
<li><a
href="http://ir.dcs.gla.ac.uk/resources/test_collections/cran/">Cranfield
Collections</a> - One of the first collections in the IR domain.
The dataset is too small for statistical significance
analysis but is nevertheless suitable for pilot runs.</li>
<li><a href="http://trec.nist.gov/data.html">TREC Collections</a> - TREC
provides the benchmark datasets used to evaluate most IR and Web search algorithms. It
has several tracks, each of which consists of datasets for testing a
specific task. The tracks, along with suggested use cases, are:
<ul>
<li><a href="http://trec.nist.gov/data/blog.html">Blog</a> - Explore
information seeking behavior in the blogosphere.</li>
<li><a href="http://trec.nist.gov/data/chem-ir.html">Chemical IR</a> -
Address challenges in building large chemical testbeds for chemical
IR.</li>
<li><a href="http://trec.nist.gov/data/clinical.html">Clinical Decision
Support</a> - Investigate techniques to link medical cases to
information relevant for patient care.</li>
<li><a href="http://trec.nist.gov/data/confusion.html">Confusion</a> -
Study <a href="http://trec.nist.gov/data/confusion/t5confusion.ps">Known
Item Searching</a> problem.</li>
<li><a href="http://trec.nist.gov/data/context.html">Contextual
Suggestion</a> - Investigate search techniques for complex information
needs (context and user interests based).</li>
<li><a href="http://trec.nist.gov/data/crowd.html">Crowdsourcing</a> -
Explore crowdsourcing methods for performing and evaluating search.</li>
<li><a href="http://trec.nist.gov/data/enterprise.html">Enterprise</a> -
Study search over an organization's data.</li>
<li><a href="http://trec.nist.gov/data/entity.html">Entity</a> - Perform
entity-related search (find entities and their properties) on Web
data.</li>
<li><a href="http://trec.nist.gov/data/filtering.html">Filtering</a> -
Make a binary decision on whether to retrieve each new incoming document, given a stable
information need.</li>
<li><a href="http://trec.nist.gov/data/federated.html">Federated Web
Search</a> - Study merge performance for results from various search
services.</li>
<li><a href="http://trec.nist.gov/data/genomics.html">Genomics</a> -
Study retrieval efficiency of genomics data and corresponding
documentation.</li>
<li><a href="http://trec.nist.gov/data/hard.html">HARD</a> - Obtain High
Accuracy Retrieval from Documents by leveraging the searcher's context.</li>
<li><a href="http://trec.nist.gov/data/interactive.html">Interactive
Track</a> - Study user interaction with text retrieval systems.</li>
<li><a href="http://trec.nist.gov/data/kba.html">Knowledge base
acceleration</a> - Study algorithms that improve the efficiency of human
knowledge base curation.</li>
<li><a href="http://trec.nist.gov/data/legal.html">Legal Track</a> -
Study retrieval systems that achieve high recall for the legal document use
case.</li>
<li><a href="http://trec.nist.gov/data/medical.html">Medical Track</a> -
Explore unstructured search performance over patient record data.</li>
<li><a href="http://trec.nist.gov/data/microblog.html">Microblog
Track</a> - Examine how well real-time information needs are satisfied on
microblogging sites.</li>
<li><a href="http://trec.nist.gov/data/million.query.html">Million Query
Track</a> - Explore ad-hoc retrieval over a large set of queries.</li>
<li><a href="http://trec.nist.gov/data/novelty.html">Novelty Track</a> -
Investigate systems' abilities to locate new (non-redundant)
information.</li>
<li><a href="http://trec.nist.gov/data/qamain.html">Question Answering
Track</a> - Test systems that scale beyond document retrieval, to
retrieve answers to factoid, list and definition type questions.</li>
<li><a
href="http://trec.nist.gov/data/relevance.feedback.html">Relevance
Feedback Track</a> - For deep evaluation of relevance feedback
processes.</li>
<li><a href="http://trec.nist.gov/data/robust.html">Robust Track</a> -
Study individual topic effectiveness.</li>
<li><a href="http://trec.nist.gov/data/session.html">Session Track</a> -
Develop methods for measuring multiple-query sessions where information
needs drift.</li>
<li><a href="http://trec.nist.gov/data/spam.html">SPAM Track</a> -
Benchmark spam filtering approaches.</li>
<li><a href="http://trec.nist.gov/data/tasks.html">Tasks Track</a> -
Test whether systems can infer the possible tasks users might be trying to
accomplish with a query.</li>
<li><a href="http://trec.nist.gov/data/tempsumm.html">Temporal
Summarization Track</a> - Develop systems that allow users to
efficiently monitor the information associated with an event over
time.</li>
<li><a href="http://trec.nist.gov/data/terabyte.html">Terabyte Track</a>
- Test scalability of IR systems to large-scale collections.</li>
<li><a href="http://trec.nist.gov/data/webmain.html">Web Track</a> -
Explore information seeking behaviors common in general web search.</li>
</ul></li>
<li><a
href="http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm">GOV2
Test Collection</a> - One of the largest Web collections, consisting of
documents obtained from a crawl of government websites by Charlie Clarke
and Ian Soboroff using NIST hardware and network, then formatted by
Nick Craswell.</li>
<li><a href="http://research.nii.ac.jp/ntcir/data/data-en.html">NTCIR
Test Collection</a> - A collection of a wide variety of datasets,
ranging from ad-hoc collections and Chinese IR collections to mobile
clickthrough and medical collections. The focus of this
collection is mostly on East Asian languages and cross-language
information retrieval.
<ul>
<li><a
href="http://research.nii.ac.jp/ntcir/permission/ntcir-6/perm-en-CLIR.html">CLIR
Test Collections</a> - This dataset can be used for cross-lingual IR
between CJKE (Chinese-Japanese-Korean-English) languages. It is suitable
for the following tasks:
<ul>
<li>Multilingual CLIR</li>
<li>Bilingual CLIR</li>
<li>Single Language CLIR</li>
</ul></li>
<li><a
href="http://research.nii.ac.jp/ntcir/permission/ntcir-6/perm-en-CLQA.html">Cross
Language Q&amp;A (CLQA) dataset collection</a> - It supports the following
bilingual and monolingual language pairs:
<ul>
<li>Bilingual
<ul>
<li>Japanese to English.</li>
<li>Chinese to English.</li>
<li>English to Japanese.</li>
<li>English to Chinese.</li>
</ul></li>
<li>Monolingual
<ul>
<li>Chinese to Chinese.</li>
<li>Japanese to Japanese.</li>
<li>English to English.</li>
</ul></li>
</ul></li>
<li><a
href="http://research.nii.ac.jp/ntcir/permission/ntcir-8/perm-en-ACLIA.html">Advanced
Cross Lingual Information Retrieval and Question Answering (ACLIA)</a> -
The dataset is used for cross-lingual question answering, but
the task is more complex than with the CLQA dataset.</li>
</ul></li>
<li><a
href="http://www.clef-initiative.eu/dataset/test-collection">Conference
and Labs of the Evaluation Forum (CLEF) dataset</a> - It contains a
multi-lingual document collection. The test suite includes:
<ul>
<li>AdHoc - News Test suite.</li>
<li>Domain Specific Test Suite - On collections of scientific
articles.</li>
<li>Question Answering Test Suite.</li>
</ul></li>
<li><a href="http://trec.nist.gov/data/reuters/reuters.html">Reuters
Corpora</a> - The corpora are now available through NIST. They
include the following:
<ul>
<li>RCV1 (Reuters Corpus Volume 1) - Consists only of English-language
news stories.</li>
<li>RCV2 (Reuters Corpus Volume 2) - Consists of stories in 13
languages (Dutch, French, German, Chinese, Japanese, Russian,
Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian,
and Swedish). Note that the stories are not parallel.</li>
<li>TRC (Thomson Reuters Text Research Collection) - This is a fairly
recent corpus consisting of 1,800,370 news stories covering the period
from 2008-01-01 00:00:03 to 2009-02-28 23:54:14.</li>
</ul></li>
<li><a
href="https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html">20
Newsgroup dataset</a> - This data set consists of 20,000 newsgroup
messages (posts) taken from 20 newsgroup topics.</li>
<li><a href="https://catalog.ldc.upenn.edu/LDC2011T07">English Gigaword
Fifth Edition</a> - This data set is a comprehensive archive of English
newswire text data including headlines, datelines and articles.</li>
<li><a href="http://www-nlpir.nist.gov/projects/duc/data.html">Document
Understanding Conference (DUC) datasets</a> - Past newswire/paper
datasets (DUC 2001 - DUC 2007) are available upon request.</li>
</ul>
<h4 id="external-curation-links">External Curation Links</h4>
<ul>
<li><a href="http://boston.lti.cs.cmu.edu/callan/Data/#DIR">CMU
List</a></li>
<li><a
href="http://nlp.stanford.edu/IR-book/html/htmledition/standard-test-collections-1.html">Stanford
List</a></li>
<li><a href="http://web.eecs.utk.edu/research/lsi/corpa.html">University
of Tennessee, Knoxville</a></li>
</ul>
<h2 id="talks">Talks</h2>
<h4 id="technical-talks">Technical Talks</h4>
<ul>
<li><a href="https://youtu.be/1X71fTx1LKA">Extreme Classification: A New
Paradigm for Ranking &amp; Recommendation</a> - Manik Varma (Microsoft
Research)</li>
<li><a
href="https://www.ted.com/talks/tim_berners_lee_on_the_next_web">The
next web</a> - Tim Berners-Lee (TED Talk) [Tim Berners-Lee invented the
World Wide Web. He leads the World Wide Web Consortium (W3C), overseeing
the Web's standards and development].</li>
<li><a
href="https://www.ted.com/talks/gary_flake_is_pivot_a_turning_point_for_web_exploration?utm_source=tedcomshare&amp;utm_medium=referral&amp;utm_campaign=tedspread">Is
Pivot a turning point for web exploration?</a> - Gary Flake, Technical
Fellow at Microsoft (TED Talks).</li>
<li><a href="http://videolectures.net/wsdm09_dean_cblirs/">Challenges in
Building Large-Scale Information Retrieval Systems</a> - Jeff Dean (WSDM
Conference, 2009).</li>
<li><a href="https://youtu.be/NFCZuzA4cFc">Knowledge-based Information
Retrieval with Wikipedia</a> - David Milne (The University of Waikato,
2008).</li>
<li><a
href="https://www.youtube.com/watch?v=SghMq1xBJPI&amp;list=PLdktw5AjQqP2gpQNgHRJaSgEkHiaVLfTi&amp;index=24">Music
Information Retrieval Using Locality Sensitive Hashing</a> - Steve Tjoa
(RackSpace Developers) [This talk shows that IR is not just text and
images].</li>
<li><a href="https://youtu.be/u6oqr3gMyxk">The Functional Web The
Future of Apps and the Web</a> - Liron Shapira (Box Tech Talk).</li>
<li><a href="https://youtu.be/EnvtsbCfiAI">Information Experience -
Solution to Information Overload on Web</a> - Doug Imbruce (TechCrunch
Disrupt) [Doug Imbruce is the founder of Qwiki, Inc., a technology startup
in New York, NY, acquired by Yahoo! in 2013].</li>
<li><a href="https://youtu.be/tnsyhKHalGs">Internet Privacy</a> -
Dr. Alma Whitten (Google Brussels Tech Talk).</li>
</ul>
<h4 id="philosophical-talks">Philosophical Talks</h4>
<ul>
<li><a
href="https://www.ted.com/talks/andreas_ekstrom_the_moral_bias_behind_your_search_results">The
moral bias behind your search results</a> - Andreas Ekström (Swedish
Author &amp; Journalist, TED Talk).</li>
<li><a
href="https://www.ted.com/talks/eli_pariser_beware_online_filter_bubbles?language=en">Beware
online “filter bubbles”</a> - Eli Pariser (Author of The Filter Bubble,
TED Talk).</li>
<li><a
href="https://www.ted.com/talks/andy_yen_think_your_email_s_private_think_again">Think
your email's private? Think again</a> - Andy Yen (CERN, TED Talk) [This
talk covers privacy, which search engines intrude upon, and how
people can protect it].</li>
<li><a href="https://youtu.be/YO0lbdhF30g">Do we have the right to be
forgotten?</a> - Michael Douglas [TEDx SouthBank].</li>
<li><a
href="https://www.ted.com/talks/christopher_m00t_poole_the_case_for_anonymity_online?utm_source=tedcomshare&amp;utm_medium=referral&amp;utm_campaign=tedspread">The
case for anonymity online</a> - Christopher “moot” Poole (TED Talk)
[Christopher “moot” Poole is the founder of 4chan, an online imageboard
whose anonymous denizens have spawned the web's most bewildering and
influential subculture].</li>
</ul>
<h2 id="conferences">Conferences</h2>
<ul>
<li>Web Search and Data Mining Conference - <a
href="http://www.wsdm-conference.org">WSDM</a>.</li>
<li>Special Interest Group on Information Retrieval - <a
href="http://sigir.org">SIGIR</a>.</li>
<li>Text REtrieval Conference - <a
href="http://trec.nist.gov">TREC</a>.</li>
<li>European Conference on Information Retrieval - <a
href="http://irsg.bcs.org/ecir.php">ECIR</a>.</li>
<li>World Wide Web Conference - <a
href="http://www.iw3c2.org">WWW</a>.</li>
<li>Conference on Information and Knowledge Management - <a
href="http://www.cikmconference.org">CIKM</a>.</li>
<li>Forum for Information Retrieval Evaluation - <a
href="http://fire.irsi.res.in/fire/2016/home">FIRE</a>.</li>
<li>Conference and Labs of the Evaluation Forum - <a
href="http://www.clef-initiative.eu/">CLEF</a>.</li>
<li>NII Testbeds and Community for Information access Research - <a
href="http://research.nii.ac.jp/ntcir/index-en.html">NTCIR</a>.</li>
</ul>
<h2 id="blogs">Blogs</h2>
<ul>
<li><a
href="http://research.google.com/pubs/InformationRetrievalandtheWeb.html">Information
Retrieval and the Web</a> - Google Research.</li>
<li><a href="https://irthoughts.wordpress.com">IR Thoughts</a> -
Dr. Edel Garcia.</li>
</ul>
<h4 id="interesting-reads">Interesting Reads</h4>
<ul>
<li><a
href="https://www.technologyreview.com/s/602807/deep-neural-network-learns-to-judge-books-by-their-covers/?utm_campaign=socialflow&amp;utm_source=facebook&amp;utm_medium=post">Deep
Neural Network Learns to Judge Books by Their Covers</a> - Information
Extraction.</li>
<li><a
href="http://www.theverge.com/2016/11/7/13551210/ai-deep-learning-lip-reading-accuracy-oxford">Can
Deep Learning Help Solve Lip Reading?</a> - Information Retrieval from
Lip Reading.</li>
<li><a
href="https://enterprisersproject.com/article/2016/9/reduce-biases-machine-learning-start-openly-discussing-problem?sc_cid=70160000000q8YTAAY">To
reduce biases in machine learning start with openly discussing the
problem</a> - Bias in Relevance.</li>
<li><a
href="https://www.wired.com/2016/11/woah-googles-ai-really-good-pictionary/">Whoa,
Google's AI Is Really Good at Pictionary</a> - Sketch-based search.</li>
<li><a
href="https://www.technologyreview.com/s/602955/neural-network-learns-to-identify-criminals-by-their-faces/?utm_campaign=socialflow&amp;utm_source=facebook&amp;utm_medium=post">Neural
Network Learns to Identify Criminals by Their Faces</a> - Information
Extraction.</li>
</ul>
<h2 id="license">License</h2>
<p><a href="https://creativecommons.org/publicdomain/zero/1.0/"><img
src="http://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg"
alt="CC0" /></a></p>
<p>To the extent possible under law, <a
href="http://www.harshalpriyadarshi.com">Harshal Priyadarshi</a> and all
the contributors have waived all copyright and related or neighboring
rights to this work.</p>
<p><a
href="https://github.com/harpribot/awesome-information-retrieval">informationretrieval.md
GitHub</a></p>