Abstract
Presenting users with relevant results is the core aim of information retrieval (IR). Because simple exact term matching returns poorly relevant results, latent semantic indexing (LSI) based IR was introduced to overcome this drawback and improve retrieval effectiveness. In other words, LSI-based IR aims to satisfy the user rather than merely the literal query. However, developing an LSI-based IR application requires choosing parameters that yield relevant results and optimise the precision and recall of the retrieval process. This paper therefore investigates two important parameters that characterise retrieval performance: the optimal k-dimension used to represent the terms and documents in the corpus, and the optimal threshold value at which documents are accepted, judged and returned as relevant for a given term query. A small Malay corpus comprising 1395 Malay-language documents and terms was used as the test collection. The analyses suggest that effective retrieval performance, which both satisfies and balances precision and recall, is obtained with k-dimension k = 4 and threshold value ε = 0.8. The study helps software developers, particularly IR application developers, to design search engines and choose the optimal values of the k-dimension and the threshold.
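To make the roles of the two parameters concrete, the following is a minimal sketch of LSI retrieval with a rank-k truncation and a similarity threshold. It is not the authors' implementation: the abstract does not say how similarity is computed, so cosine similarity and the standard query folding-in step are assumptions, and the function names `lsi_retrieve` and `precision_recall`, the toy term-by-document matrix and the relevance judgements are all hypothetical.

```python
import numpy as np

def lsi_retrieve(term_doc, query, k, threshold):
    """Indices of documents whose cosine similarity to the query,
    measured in the rank-k LSI space, is at least `threshold`."""
    # Thin SVD of the term-by-document matrix: term_doc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs_k = Vt[:k, :].T           # one k-dimensional row per document
    query_k = (query @ U_k) / s_k  # fold the query into the same space
    # Cosine similarity; the small floor avoiding division by zero
    # is an assumption of this sketch.
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(query_k)
    sims = (docs_k @ query_k) / np.maximum(norms, 1e-12)
    return np.flatnonzero(sims >= threshold).tolist()

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against relevance judgements.
    An empty retrieved set is scored as precision 0.0 by convention."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy data: 5 terms (rows) by 4 documents (columns).
toy = np.array([[1, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 1, 0]], dtype=float)
query = np.array([0, 1, 0, 0, 0], dtype=float)  # query holding only term 1
relevant = [0, 1]                               # hypothetical judgements

# Sweep both parameters and report precision and recall for each pair.
for k in (1, 2, 3):
    for eps in (0.2, 0.5, 0.8):
        found = lsi_retrieve(toy, query, k, eps)
        print(k, eps, found, precision_recall(found, relevant))
```

The sweep mirrors the kind of search the abstract describes: for each candidate k and ε, retrieve, score against the judgements, and pick the pair that best balances precision and recall. The values k = 4 and ε = 0.8 reported above apply to the authors' Malay corpus and should not be expected to transfer to this toy matrix.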
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Sadjirin, R., Nordin, N.M., Md Raus, M.I., Sahri, Z. (2016). Investigating the Optimise k-Dimensions and Threshold Values of Latent Semantic Indexing Retrieval Performance for Small Malay Language Corpus. In: Yacob, N., Mohamed, M., Megat Hanafiah, M. (eds) Regional Conference on Science, Technology and Social Sciences (RCSTSS 2014). Springer, Singapore. https://doi.org/10.1007/978-981-10-0534-3_31
DOI: https://doi.org/10.1007/978-981-10-0534-3_31
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-0532-9
Online ISBN: 978-981-10-0534-3
eBook Packages: Business and Management, Business and Management (R0)