Investigating the Optimise k-Dimensions and Threshold Values of Latent Semantic Indexing Retrieval Performance for Small Malay Language Corpus

  • Conference paper

Abstract

Presenting users with relevant feedback is the core aim of information retrieval (IR). Because the simple exact term-matching technique returns poor relevance feedback, latent semantic indexing (LSI) based IR was introduced to overcome this drawback and improve retrieval effectiveness. In other words, LSI-based IR aims to satisfy users rather than merely to satisfy a given query. However, developing an LSI-based IR application requires choosing parameters that produce relevant feedback and optimise precision and recall in the retrieval process. This paper therefore investigates two important parameters that characterise retrieval performance: the optimal k-dimension used to represent terms and documents in the corpus, and the optimal threshold value at which documents are accepted, judged and returned as relevant for a given term query. A small Malay corpus comprising 1395 Malay language documents and terms was used as the test collection. The analyses suggest that effective retrieval performance, which both satisfies and balances precision and recall, is obtained with a k-dimension of k = 4 and a threshold value of ε = 0.8. The study helps software developers, particularly IR application developers, in designing a search engine and choosing the optimal values of the k-dimension and the threshold.
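As a rough illustration of how the two parameters interact, the sketch below truncates the SVD of a term-document matrix to k dimensions, folds a term query into that space, and returns the documents whose cosine similarity reaches the threshold ε. It is not the authors' implementation: the function names `lsi_retrieve` and `precision_recall`, the three-document toy matrix, and the relevance set are hypothetical, and k = 2 is used only because the toy matrix is too small for the paper's k = 4.

```python
import numpy as np

def lsi_retrieve(term_doc, query_vec, k, threshold):
    """Return indices of documents whose cosine similarity to the query,
    measured in the rank-k latent space, is at least `threshold`.
    Assumes k does not exceed the rank of `term_doc`."""
    # Truncated SVD: term_doc is approximated by U_k * diag(s_k) * Vt_k.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    docs_k = Vt_k.T                    # one k-dimensional row per document
    q_k = (query_vec @ U_k) / s_k      # standard query fold-in: q^T U_k S_k^-1
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    sims = (docs_k @ q_k) / np.where(norms == 0, 1.0, norms)
    return [int(i) for i in np.nonzero(sims >= threshold)[0]]

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a relevant set."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy corpus: rows are terms t0..t2, columns are documents d0..d2.
# d1 does not contain t0, but shares t1 with d0.
A = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0])      # single-term query for t0

retrieved = lsi_retrieve(A, query, k=2, threshold=0.8)
print(retrieved, precision_recall(retrieved, relevant={0, 1}))
```

In this toy case d1 is returned for the query t0 even though it never contains that term, which is the behaviour exact term matching cannot provide. Raising the threshold generally trades recall for precision, and sweeping k and ε over a judged test collection is one way to look for the balance the abstract describes.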

Corresponding author

Correspondence to Roslan Sadjirin.

Copyright information

© 2016 Springer Science+Business Media Singapore

About this paper

Cite this paper

Sadjirin, R., Nordin, N.M., Md Raus, M.I., Sahri, Z. (2016). Investigating the Optimise k-Dimensions and Threshold Values of Latent Semantic Indexing Retrieval Performance for Small Malay Language Corpus. In: Yacob, N., Mohamed, M., Megat Hanafiah, M. (eds) Regional Conference on Science, Technology and Social Sciences (RCSTSS 2014). Springer, Singapore. https://doi.org/10.1007/978-981-10-0534-3_31
