Information Retrieval

Fundamentals of Artificial Intelligence

Abstract

Information retrieval (IR) is the identification of documents or other units of information in a collection that are relevant to a particular information need, that is, a set of questions to which someone would like to find an answer. This chapter presents the basic strategies of IR at length, along with their analysis, with particular emphasis on the vector space and probabilistic models of IR and worked examples in each category. It gives detailed coverage of the construction and maintenance of an index and of its parallel processing. Fuzzy logic-based and concept-based retrieval techniques are presented with their algorithms and worked examples, and automatic query expansion is dealt with at length. The application of Bayesian networks to IR, and inference using them, is demonstrated. The newly emerged semantic web for futuristic IR and its applications are introduced, and the design aspects of distributed IR, suited to currently distributed information resources, are treated in depth. The chapter ends with a summary and a set of practice exercises.

Notes

  1. Precision and recall are parameters that represent the performance of any IR system.

  2. Polysemy is an ambiguity in an individual word or phrase, such that the word can be used (in different contexts) to express two or more different meanings.

  3. A set that contains not only its elements but also a number associated with each element, indicating the strength of membership of that term.

  4. A Wikipedia article is an article about a single topic; for example, the Wikipedia collection includes articles on websites, the Internet, the WWW, etc.

  5. Polysemy: A single term with several meanings.

Author information

Corresponding author

Correspondence to K. R. Chowdhary.

Exercises

  1. Show how the vector space model can be modeled using an inference network.

  2. Consider a collection of 100 documents. Given a query \(q\), the set of documents relevant to the user is \(D^* = \{d_4, d_{15}, d_{34}, d_{56}, d_{98}\}\). An IR system retrieves the documents \(D = \{d_4, d_{15}, d_{35}, d_{56}, d_{66}, d_{88}, d_{95}\}\).

    a. Compute the number of True-Negatives, True-Positives, False-Negatives, and False-Positives.

    b. Compute Precision, Recall, and F-measure.

  3. Consider the following IR scenario. A hospital has found that the results of blood tests taken on a specific day are unreliable for diabetic patients because of an equipment malfunction. The hospital uses an IR system to identify these patients. Suppose the collection of patients’ records contains 10,000 documents, 500 of which are relevant to the query. The system returns 350 documents, 225 of which are relevant to the query. Answer the following for this scenario:

    a. Calculate the precision and recall for this system.

    b. Based on your results above, explain how well the hospital’s IR system is working.

    c. Knowing about the precision-recall trade-off, what is likely to happen if an IR system is tuned to aim for 100% precision?

    d. Knowing about the precision-recall trade-off, what is likely to happen if an IR system is tuned to aim for 100% recall?

    e. For the given scenario, which measure do you think is more important, precision or recall? Why?

  4. You are looking for information on “Economic growth in India” during the last 3 years in a large document collection. You decide to search using the terms India, banks, growth, economy, business, and agriculture. The IR system recommends the three documents below, shown with their term frequencies.

    Term        Economy  India  Growth  Banks  Business  Agriculture
    Document-1  15       10     3       4      2         9
    Document-2  0        0      9       8      7         8
    Document-3  4        2      4       4      6         10

    There is no additional information about the documents. Use each of the following models to determine the relevance of the documents to the query.

    a. Boolean model

    b. Vector space model

    c. tf.idf model

  5. Take any three small documents of approximately 100 words each.

    a. Build a matrix of an inverted index for these documents, in the format shown in Fig. 18.5.

    b. Weight terms by their presence/absence (binary), and also by \(tf\times idf\) (with estimated IDFs).

    c. Compute the memory requirements for this inverted index. Make necessary assumptions for character size, pointer size, etc.

    d. Construct a suitable query, and calculate document–query similarity for the following scenarios:

      i. Cosine (with normalization)

      ii. Inner product (i.e., cosine without normalization)

      iii. Does the normalization have any effect? Justify.

  6. Suppose we submit queries to search engines to find information on the WWW.

    a. Does the search process use a stop-word list?

    b. Can you search for “The”, “The a”, “An a”, etc.? Justify.

    c. Is it common practice to search for the above terms?

    d. Does the search process use stemming?

    e. Are there different results for the two queries “Human body” and “Humanly body”? Justify your answer.

    f. Does it normalize words to lower case?

  7. “Having the knowledge of the sense of a query term may help a document retrieval system, especially for short queries.” Why is this not true for longer queries?

  8. Comment on the validity of the following statements for the Boolean model:

    a. “Stemming does not lower the precision of a Boolean retrieval system.”

    b. “Stemming does not lower the recall of a Boolean retrieval system.”

  9. Answer the following in brief:

    a. Why is the idf of a term always finite?

    b. What is the idf of a term that occurs in every document?

    c. What is the idf of a term that appears in one document only?

    d. What is the idf of a term that appears in no document?

  10. Answer the following in brief:

    a. Name three criteria for evaluating a search engine.

    b. What is an easy way to maximize the recall of a search engine?

    c. What is an easy way to maximize the precision of a search engine?

  11. What is the difference between clustering and classification? How can they be used in a complete IR system?

  12. Discuss the merits and demerits of the following, and suggest which one will provide the better response time.

    a. Document-distributed architecture.

    b. Term-distributed architecture.
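
The evaluation measures asked for in Exercises 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own method: the function names `confusion_counts` and `precision_recall_f1` are hypothetical, documents are represented simply by their integer indices, and the F-measure is assumed to be the balanced F1 (the harmonic mean of precision and recall), since the exercise does not state a weighting.

```python
def confusion_counts(relevant, retrieved, collection_size):
    """Return (TP, FP, FN, TN) for one query.

    relevant and retrieved are sets of document ids; collection_size is the
    total number of documents in the collection.
    """
    tp = len(relevant & retrieved)       # relevant and retrieved
    fp = len(retrieved - relevant)       # retrieved but not relevant
    fn = len(relevant - retrieved)       # relevant but missed
    tn = collection_size - tp - fp - fn  # neither relevant nor retrieved
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); a zero denominator yields 0.0 (an assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Data from Exercise 2: 100 documents, relevant set D*, retrieved set D.
relevant = {4, 15, 34, 56, 98}
retrieved = {4, 15, 35, 56, 66, 88, 95}

tp, fp, fn, tn = confusion_counts(relevant, retrieved, 100)
print(tp, fp, fn, tn)
print(precision_recall_f1(tp, fp, fn))
```

For Exercise 3 only the counts are given, so `precision_recall_f1` can be called directly with the counts derived from the scenario. True negatives do not enter precision, recall, or F1, which is why the second function does not take them.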

Copyright information

© 2020 Springer Nature India Private Limited

About this chapter

Cite this chapter

Chowdhary, K.R. (2020). Information Retrieval. In: Fundamentals of Artificial Intelligence. Springer, New Delhi. https://doi.org/10.1007/978-81-322-3972-7_18

  • DOI: https://doi.org/10.1007/978-81-322-3972-7_18

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-3970-3

  • Online ISBN: 978-81-322-3972-7

  • eBook Packages: Computer Science (R0)