
Curated list of information retrieval and web search resources from
all around the web.

## Introduction

Information Retrieval involves finding relevant information for user queries,
ranging from the simple domain of database search to the complicated aspects of
web search (e.g., Google, Bing, Yahoo). Currently, researchers are
developing algorithms to address the Information Need of users by maximizing
the User and Topic Relevance of retrieved results, while minimizing Information
Overload and retrieval time.

## Contributing

Please feel free to send me pull requests or
[email](mailto:harshal.priyadarshi@utexas.edu) me to
add new links. I am very open to suggestions and corrections. Please
look at the contributions guide.
Contents
Books
- Introduction to
Information Retrieval - C.D. Manning, P. Raghavan, H. Schütze.
Cambridge UP, 2008. (First book for getting started with Information
Retrieval).
- Search
Engines: Information Retrieval in Practice - Bruce Croft, Don
Metzler, and Trevor Strohman. 2009. (Great book for readers interested
in knowing how Search Engines work. The book is very detailed).
- Modern
Information Retrieval - R. Baeza-Yates, B. Ribeiro-Neto.
Addison-Wesley, 1999.
- Information Retrieval
in Practice - B. Croft, D. Metzler, T. Strohman. Pearson Education,
2009.
- Mining
the Web: Analysis of Hypertext and Semi Structured Data - S.
Chakrabarti. Morgan Kaufmann, 2002.
- Language
Modeling for Information Retrieval - W.B. Croft, J. Lafferty.
Springer, 2003. (Covers the language modeling aspect of Information
Retrieval. It also details the probabilistic perspective in this
domain extensively, which is interesting).
- Information
Retrieval: A Survey - Ed Greengrass, 2000. (Comprehensive survey of
conventional Information Retrieval, before the Deep Learning era).
- Introduction
to Modern Information Retrieval - G.G. Chowdhury. Neal-Schuman,
2003. (Intended for students of library and information studies).
- Text
Information Retrieval Systems - C.T. Meadow, B.R. Boyce, D.H. Kraft,
C.L. Barry. Academic Press, 2007 (library/information science
perspective).
Courses
Software
- Apache Lucene - Open
source search engine that can be used to test Information Retrieval
algorithms. Twitter uses this core for its real-time search.
- The Lemur Project - The
Lemur Project develops search engines, browser toolbars, text analysis
tools, and data resources that support research and development of
information retrieval and text mining software.
- Indri Search
Engine - Another open source search engine, a competitor to Apache
Lucene.
- Lemur Toolkit -
Open Source Toolkit for research in Language Modeling, filtering and
categorization.
Datasets
Standard IR Collections
- DBPedia -
Linked data web.
- Cranfield
Collections - This is one of the first collections in the IR domain.
The dataset is too small for any statistical significance
analysis, but it is nevertheless suitable for pilot runs.
- TREC Collections - TREC
is the benchmark dataset used by most IR and Web search algorithms. It
has several tracks, each of which consists of a dataset for testing a
specific task. The tracks, along with their suggested use cases, are:
    - Blog - Explore information-seeking behavior in the blogosphere.
    - Chemical IR - Address challenges in building large chemical
      testbeds for chemical IR.
    - Clinical Decision Support - Investigate techniques to link medical
      cases to information relevant for patient care.
    - Confusion - Study the Known Item Searching problem.
    - Contextual Suggestion - Investigate search techniques for complex
      information needs (based on context and user interests).
    - Crowdsourcing - Explore crowdsourcing methods for performing and
      evaluating search.
    - Enterprise - Study search over an organization's data.
    - Entity - Perform entity-related search (find entities and their
      properties) on Web data.
    - Filtering - Make a binary decision on whether to retrieve each new
      incoming document, given a stable information need.
    - Federated Web Search - Study merge performance for results from
      various search services.
    - Genomics - Study retrieval efficiency of genomics data and
      corresponding documentation.
    - HARD - Obtain High Accuracy Retrieval from Documents by leveraging
      the searcher’s context.
    - Interactive Track - Study user interaction with text retrieval
      systems.
    - Knowledge base acceleration - Study algorithms that improve the
      efficiency of human knowledge base curators.
    - Legal Track - Study retrieval systems that achieve high recall for
      the legal documents use case.
    - Medical Track - Explore unstructured search performance over
      patient record data.
    - Microblog Track - Examine satisfaction of real-time information
      needs on microblogging sites.
    - Million Query Track - Explore ad-hoc retrieval over a large set of
      queries.
    - Novelty Track - Investigate systems’ abilities to locate new
      (non-redundant) information.
    - Question Answering Track - Test systems that go beyond document
      retrieval to retrieve answers to factoid, list, and definition-type
      questions.
    - Relevance Feedback Track - Deep evaluation of relevance feedback
      processes.
    - Robust Track - Study individual topic effectiveness.
    - Session Track - Develop methods for measuring multiple-query
      sessions where information needs drift.
    - SPAM Track - Benchmark spam filtering approaches.
    - Tasks Track - Test whether systems can induce the possible tasks
      users might be trying to accomplish for a query.
    - Temporal Summarization Track - Develop systems that allow users to
      efficiently monitor the information associated with an event over
      time.
    - Terabyte Track - Test the scalability of IR systems to large-scale
      collections.
    - Web Track - Explore information-seeking behaviors common in general
      web search.
- GOV2
Test Collection - This is one of the largest Web collections of
documents, obtained from a crawl of government websites by Charlie Clarke
and Ian Soboroff using NIST hardware and network, then formatted by
Nick Craswell.
- NTCIR
Test Collection - This is a collection of a wide variety of datasets,
ranging from ad-hoc collections, Chinese IR collections, and mobile
clickthrough collections to medical collections. The focus of this
collection is mostly on East Asian languages and cross-language
information retrieval.
- Conference
and Labs of the Evaluation Forum (CLEF) dataset - It contains a
multilingual document collection. The test suites include:
    - AdHoc - News test suite.
    - Domain Specific Test Suite - On collections of scientific
      articles.
    - Question Answering Test Suite.
- Reuters
Corpora - The corpora are now available through NIST. They
include the following:
    - RCV1 (Reuters Corpus Volume 1) - Consists only of English-language
      news stories.
    - RCV2 (Reuters Corpus Volume 2) - Consists of stories in 13
      languages (Dutch, French, German, Chinese, Japanese, Russian,
      Portuguese, Spanish, Latin American Spanish, Italian, Danish,
      Norwegian, and Swedish). Note that the stories are not parallel.
    - TRC (Thomson Reuters Text Research Collection) - This is a fairly
      recent corpus consisting of 1,800,370 news stories covering the
      period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14.
- 20
Newsgroup dataset - This data set consists of 20,000 newsgroup
messages (posts) taken from 20 newsgroup topics.
- English Gigaword
Fifth Edition - This data set is a comprehensive archive of English
newswire text data including headlines, datelines and articles.
- Document
Understanding Conference (DUC) datasets - Past newswire/paper
datasets (DUC 2001 - DUC 2007) are available upon request.
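
Test collections such as the TREC tracks above are typically used by comparing a system's ranked results against relevance judgments. As a rough illustration, the sketch below parses judgments in the four-column TREC qrels layout (`topic iteration doc_id relevance`) and computes precision at k and average precision for a single topic. The function names, document IDs, and judgments are hypothetical and made up for this example; for real experiments, an established evaluation tool is the safer choice.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines: 'topic iteration doc_id relevance'."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        qrels[topic][doc_id] = int(relevance)
    return dict(qrels)


def precision_at_k(ranking, judged, k):
    """Fraction of the top-k ranked documents judged relevant (relevance > 0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranking[:k] if judged.get(doc_id, 0) > 0)
    return hits / k


def average_precision(ranking, judged):
    """Mean of the precision values at each rank holding a relevant document."""
    num_relevant = sum(1 for rel in judged.values() if rel > 0)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if judged.get(doc_id, 0) > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant


# Hypothetical judgments and system ranking, invented for illustration only.
SAMPLE_QRELS = [
    "301 0 DOC-A 1",
    "301 0 DOC-B 0",
    "301 0 DOC-C 1",
    "301 0 DOC-D 0",
]
SAMPLE_RANKING = ["DOC-A", "DOC-B", "DOC-C", "DOC-D"]

if __name__ == "__main__":
    judged = parse_qrels(SAMPLE_QRELS)["301"]
    print("P@2:", precision_at_k(SAMPLE_RANKING, judged, k=2))
    print("AP: ", average_precision(SAMPLE_RANKING, judged))
```

Unjudged documents are treated as non-relevant here, which is a common simplification but not the only possible convention.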
External Curation Links
Talks
Technical Talks
Philosophical Talks
Conferences
- Web Search and Data Mining Conference - WSDM.
- Special Interest Group on Information Retrieval - SIGIR.
- Text REtrieval Conference - TREC.
- European Conference on Information Retrieval - ECIR.
- World Wide Web Conference - WWW.
- Conference on Information and Knowledge Management - CIKM.
- Forum for Information Retrieval Evaluation - FIRE.
- Conference and Labs of the Evaluation Forum - CLEF.
- NII Testbeds and Community for Information access Research - NTCIR.
Blogs
Interesting Reads
License

To the extent possible under law, Harshal Priyadarshi and all
the contributors have waived all copyright and related or neighboring
rights to this work.