Abstract
Information retrieval (IR) is the identification of documents or other units of information in a collection that are relevant to a particular information need, that is, a set of questions to which someone would like to find an answer. This chapter presents the basic strategies of IR at length along with their analysis, with particular emphasis on the vector space and probabilistic models of IR and worked examples in each category. It gives detailed coverage of index construction and maintenance, and of parallel processing of the index. Fuzzy logic-based and concept-based retrieval techniques are presented with their algorithms and worked examples, and automatic query expansion is dealt with at length. The application of Bayesian networks to IR, and inference using them, is demonstrated. The recently emerged semantic web and its applications to future IR are introduced, and the design aspects of distributed IR, suited to today's distributed information resources, are treated in depth. The chapter ends with a summary and a set of practice exercises.
Notes
1. Precision and recall are measures of the performance of an IR system.
2. Polysemy is an ambiguity in an individual word or phrase, such that the word can be used (in different contexts) to express two or more different meanings.
3. A set that contains not only the elements but also a number associated with each element, indicating the strength of that element's membership.
4. A Wikipedia article is an article about some topic; for example, the Wikipedia collection contains articles on websites, the Internet, the WWW, etc.
5. Polysemy: A single term with several meanings.
Exercises
1. Show how the vector space model can be modeled using an inference network.
2. Consider a document collection of 100 documents. Given a query \(q\), the set of documents relevant to the user is \(D^* = \{d_4, d_{15}, d_{34}, d_{56}, d_{98}\}\). An IR system retrieves the following documents: \(D = \{d_4, d_{15}, d_{35}, d_{56}, d_{66}, d_{88}, d_{95}\}\).
   a. Compute the number of true negatives, true positives, false negatives, and false positives.
   b. Compute precision, recall, and F-measure.
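The quantities named in Exercise 2 follow directly from set operations on the relevant and retrieved sets. The sketch below is one possible way to compute them, assuming the balanced F-measure (F1). The function name `evaluate` and the toy collection are hypothetical and deliberately differ from the exercise data.

```python
def evaluate(relevant, retrieved, collection_size):
    """Confusion-matrix counts plus precision, recall, and F1 for one query."""
    relevant, retrieved = set(relevant), set(retrieved)
    tp = len(relevant & retrieved)          # relevant and retrieved
    fp = len(retrieved - relevant)          # retrieved but not relevant
    fn = len(relevant - retrieved)          # relevant but missed
    tn = collection_size - tp - fp - fn     # everything else in the collection
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # assumption: balanced F-measure
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: 10 documents, 3 relevant, 4 retrieved.
result = evaluate(relevant={1, 2, 3}, retrieved={2, 3, 4, 5}, collection_size=10)
print(result)
```

In this toy case two of the four retrieved documents are relevant and one relevant document is missed, so precision is 0.5 and recall is 2/3.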
3. Consider the following IR scenario. A hospital has found that the results of blood tests taken on a specific day are unreliable for diabetic patients because of an equipment malfunction. The hospital uses an IR system to identify these patients. Suppose the collection of patients' records contains 10,000 documents, 500 of which are relevant to the query. The system returns 350 documents, 225 of which are relevant to the query. Answer the following for this scenario:
   a. Calculate the precision and recall for this system.
   b. Based on your results above, how well would you say the hospital's IR system is working? Explain.
   c. Knowing about the precision-recall trade-off, what is likely to happen if an IR system is tuned to aim for 100% precision?
   d. Knowing about the precision-recall trade-off, what is likely to happen if an IR system is tuned to aim for 100% recall?
   e. For the given scenario, which measure do you think is more important, precision or recall? Why?
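One way to see the precision-recall trade-off mentioned in Exercise 3 is to walk down a ranked result list and recompute both measures at every cutoff. The sketch below does this for a hypothetical ranking. The function name `precision_recall_at_k` and the relevance flags are assumptions for illustration, not data from the exercise.

```python
def precision_recall_at_k(ranked_relevance, total_relevant):
    """Return (k, precision@k, recall@k) for every cutoff k of a ranked list.

    ranked_relevance: list of booleans, True if the document at that rank is relevant.
    total_relevant:   number of relevant documents in the whole collection.
    """
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += int(is_relevant)
        points.append((k, hits / k, hits / total_relevant))
    return points


# Hypothetical ranking of 8 results; 4 relevant documents exist in total.
ranking = [True, True, False, True, False, False, True, False]
for k, p, r in precision_recall_at_k(ranking, total_relevant=4):
    print(f"k={k}  precision={p:.2f}  recall={r:.2f}")
```

Recall can only stay the same or grow as the cutoff moves down the list, while precision typically falls once non-relevant documents start to appear.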
4. You are looking for information on “Economic growth in India” over the last 3 years in a large document collection. You decide to search with the terms India, banks, growth, economy, business, and agriculture, using an IR system that recommends the three documents below, listed with their term frequencies.

   | Document | Economy | India | Growth | Banks | Business | Agriculture |
   |---|---|---|---|---|---|---|
   | Document-1 | 15 | 10 | 3 | 4 | 2 | 9 |
   | Document-2 | 0 | 0 | 9 | 8 | 7 | 8 |
   | Document-3 | 4 | 2 | 4 | 4 | 6 | 10 |

   There is no additional information about the documents. Use each of the following models to determine the relevance of the documents to the query.
   a. Boolean model
   b. Vector space model
   c. tf.idf model
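The three models named in Exercise 4 can be contrasted on a small example. The sketch below is a minimal illustration under stated assumptions: conjunctive (AND) Boolean matching, cosine similarity for the vector space model, and \(idf = \log(N/df)\) for tf.idf weighting. The toy term frequencies and all function names are hypothetical and are not the table from the exercise.

```python
import math

# Hypothetical toy term frequencies (NOT the table from the exercise).
DOCS = {
    "d1": {"india": 2, "growth": 1},
    "d2": {"banks": 3, "growth": 1},
    "d3": {"india": 1, "banks": 1, "growth": 2},
}
QUERY = ["india", "growth"]  # every query term must occur in the vocabulary
VOCAB = sorted({t for tf in DOCS.values() for t in tf})


def boolean_and(docs, terms):
    """Boolean model (assumption: AND query): documents containing every term."""
    return sorted(d for d, tf in docs.items() if all(tf.get(t, 0) > 0 for t in terms))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def idf(term, docs):
    df = sum(1 for tf in docs.values() if tf.get(term, 0) > 0)
    return math.log(len(docs) / df)  # df > 0 for every vocabulary term


def vector(tf, docs, use_idf):
    """Raw tf vector, or tf * idf vector when use_idf is True."""
    return [tf.get(t, 0) * (idf(t, docs) if use_idf else 1.0) for t in VOCAB]


def rank(docs, query_terms, use_idf):
    """Rank documents by cosine similarity to the query, best first."""
    q_vec = vector({t: 1 for t in query_terms}, docs, use_idf)
    scores = {d: cosine(vector(tf, docs, use_idf), q_vec) for d, tf in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


print("Boolean AND :", boolean_and(DOCS, QUERY))
print("Vector space:", rank(DOCS, QUERY, use_idf=False))
print("tf.idf      :", rank(DOCS, QUERY, use_idf=True))
```

In this toy collection the term "growth" occurs in every document, so its idf is zero and it stops contributing to the tf.idf scores. The Boolean model, by contrast, returns an unranked set.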
5. Take any three small documents of approximately 100 words each.
   a. Build a matrix of an inverted index for these documents, in the format shown in Fig. 18.5.
   b. Weight terms by their presence/absence (binary), and also by \(tf\times idf\) (with estimated IDFs).
   c. Compute the memory requirements for this inverted index. Make necessary assumptions for character size, pointer size, etc.
   d. Construct a suitable query, and calculate document–query similarity for the following scenarios:
      i. Cosine (with normalization)
      ii. Inner product (i.e., cosine without normalization)
      iii. Does normalization have any effect? Justify.
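As a starting point for Exercise 5, an inverted index can be held as a mapping from each term to its postings. The sketch below is one possible in-memory layout, in which each posting stores a document identifier and a term frequency. It does not reproduce the format of Fig. 18.5, and the tokenizer, function names, and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to a postings dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return dict(index)


# Hypothetical miniature documents (far shorter than the 100 words asked for).
docs = {
    1: "Banks support growth",
    2: "Growth of the economy",
    3: "Banks and banks again",
}
index = build_inverted_index(docs)

for term in sorted(index):
    postings = index[term]
    # Document frequency is the length of the postings list; binary weight is mere presence.
    print(f"{term:10s} df={len(postings)} postings={postings}")
```

Binary weights can be read off as presence in the postings, and the stored term frequencies together with the document frequencies are the inputs needed for \(tf\times idf\) weighting.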
6. Suppose we submit queries to search engines to find information on the WWW.
   a. Does the search process use a stop-word list?
   b. Can you search “The”, “The a”, “An a”, etc.? Justify.
   c. Is it a practice to search the above terms?
   d. Does the search process use stemming?
   e. Are there different results for the two queries “Human body” and “Humanly body”? Justify your answer.
   f. Does it normalize words to lower case?
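The three preprocessing steps that Exercise 6 asks about (case folding, stop-word removal, and stemming) can be sketched as a toy query pipeline. Everything below is an assumption for illustration: the stop-word list is tiny, and `naive_stem` is a crude suffix stripper, not the Porter stemmer or any real search engine's method. Whether a given search engine applies these steps is what the exercise asks you to find out.

```python
STOP_WORDS = {"the", "a", "an", "of", "and"}   # tiny illustrative list (assumption)
SUFFIXES = ("ly", "ing", "ed", "s")            # crude suffix stripping, NOT a real stemmer


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(query, use_stop_words=True, use_stemming=True):
    """Lowercase the query, then optionally drop stop words and stem."""
    tokens = query.lower().split()
    if use_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


for q in ["Human body", "Humanly body", "The a"]:
    print(repr(q), "->", normalize(q),
          "| no stemming:", normalize(q, use_stemming=False),
          "| no stop list:", normalize(q, use_stop_words=False))
```

Under these toy rules, two queries that differ only by a suffix reduce to the same tokens when stemming is on, and a query made up entirely of stop words becomes empty when the stop list is applied.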
7. “Having the knowledge of the sense of a query term may help a document retrieval system, especially for short queries.” Why is this not true for longer queries?
8. Comment on the validity of the following statements for the Boolean model:
   a. “Stemming does not lower the precision of a Boolean retrieval system.”
   b. “Stemming does not lower the recall of a Boolean retrieval system.”
9. Answer the following in brief:
   a. Why is the idf of a term always finite?
   b. What is the idf of a term that occurs in every document?
   c. What is the idf of a term that appears in one document only?
   d. What is the idf of a term that appears in no document?
10. Answer the following in brief:
    a. Name three criteria for evaluating a search engine.
    b. What is an easy way to maximize the recall of a search engine?
    c. What is an easy way to maximize the precision of a search engine?
11. What is the difference between clustering and classification? How can they be used in a complete IR system?
12. Discuss the merits and demerits of the following, and suggest which one will provide the better response time.
    a. Document-distributed architecture.
    b. Term-distributed architecture.
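The two architectures in Exercise 12 differ in how an inverted index is split across nodes. The sketch below shows both splits on a toy index, assuming round-robin assignment of documents and terms to nodes; real systems may partition differently. All names and data are hypothetical.

```python
def partition_by_document(index, num_nodes):
    """Document-distributed: each node indexes only its own subset of documents."""
    nodes = [dict() for _ in range(num_nodes)]
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            # Assumption: documents are assigned round-robin by their integer id.
            nodes[doc_id % num_nodes].setdefault(term, {})[doc_id] = tf
    return nodes


def partition_by_term(index, num_nodes):
    """Term-distributed: each node holds complete postings for its own subset of terms."""
    nodes = [dict() for _ in range(num_nodes)]
    for i, term in enumerate(sorted(index)):
        # Assumption: terms are assigned round-robin in alphabetical order.
        nodes[i % num_nodes][term] = dict(index[term])
    return nodes


def nodes_contacted(nodes, query_terms):
    """Indices of the nodes holding postings for at least one query term."""
    return [i for i, node in enumerate(nodes) if any(t in node for t in query_terms)]


# Hypothetical toy index: term -> {doc_id: term frequency}.
index = {"banks": {1: 1, 3: 2}, "growth": {1: 1, 2: 1}, "economy": {2: 1}}
by_doc = partition_by_document(index, num_nodes=2)
by_term = partition_by_term(index, num_nodes=2)

print("document-distributed:", by_doc)
print("term-distributed    :", by_term)
print("nodes for 'growth' (by document):", nodes_contacted(by_doc, ["growth"]))
print("nodes for 'economy' (by term)   :", nodes_contacted(by_term, ["economy"]))
```

Counting how many nodes a query must contact, and how much postings data each node returns, is one concrete way to begin comparing the two designs.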
Copyright information
© 2020 Springer Nature India Private Limited
About this chapter
Cite this chapter
Chowdhary, K.R. (2020). Information Retrieval. In: Fundamentals of Artificial Intelligence. Springer, New Delhi. https://doi.org/10.1007/978-81-322-3972-7_18
DOI: https://doi.org/10.1007/978-81-322-3972-7_18
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-3970-3
Online ISBN: 978-81-322-3972-7
eBook Packages: Computer Science (R0)