A Semantic Text Retrieval for Indonesian Using Tolerance Rough Sets Models

Part of the book series: Lecture Notes in Computer Science (TRS, volume 8988)

Abstract

Research on the Tolerance Rough Sets Model (TRSM) conducted so far has followed the rational approach of the AI perspective. This article presents studies that follow the contrary path, i.e. a cognitive approach, with the objective of a modular framework for a semantic text retrieval system based on TRSM, specifically for Indonesian. In addition to the proposed framework, this article proposes three methods based on TRSM: the automatic tolerance value generator, thesaurus optimization, and lexicon-based document representation. All methods were developed using our own corpus, namely ICL-corpus, and evaluated on an available Indonesian corpus, called Kompas-corpus. The endeavor of a semantic information retrieval system is the effort to retrieve information and not merely terms with similar meaning. This article is a baby step toward that objective.

This work was partially supported by the Polish National Science Centre grant DEC-2012/05/B/ST6/03215.

Notes

  1. Key statistical highlights: ITU data release June 2012. URL: http://www.itu.int. Accessed on 25 October 2012.

  2. Utterances may include sound, marks, gesture, grunts, and groans (anything that can signal an intention).

  3. The reason is that, in the context of speech acts, we are not concerned with whether the speaker's belief is true or not; rather, we are concerned with the speaker's intention, i.e. what he/she wants to represent by his/her utterance. Thus, a speaker might represent a false belief as a true belief to the audience, e.g. a speaker utters ‘it is raining’ while in fact ‘it is a sunny day’.

  4. In other words, ‘the mind to fit the world’. This is because a belief, like a statement, can be true or false; if the statement is false, then it is the fault of the statement, not the world. The world-to-mind direction of fit applies to psychological modes such as desire or promise; if the promise is broken, it is the fault of the promiser.

  5. BPS-Statistics Indonesia. URL: http://www.bps.go.id/. Accessed on 25 October 2012.

  6. July 2012 estimation of The World Factbook. URL: https://www.cia.gov. Accessed on 25 October 2012.

  7. Portal Nasional Indonesia (National Portal of Indonesia). URL: http://www.indonesia.go.id. Accessed on 25 October 2012.

  8. Key statistical highlights: International Telecommunication Union (ITU) data release June 2012. URL: http://www.itu.int. Accessed on 25 October 2012.

  9. URL: http://www.internetworldstats.com. Accessed on 25 October 2012.

  10. The graph was taken from the International Telecommunication Union (ITU). URL: http://www.itu.int/ITU-D/ict/statistics/explorer/index.html. Accessed on 25 October 2012.

  11. Appendix A provides an explanation of the TF*IDF weighting scheme.

  12. Cognitive modeling is an approach employed in Cognitive Science (CS). Cognitive science is an interdisciplinary study of mental representations and computations and of the physical systems that support those processes [18, p. xv].

  13. An explanation of all corpora used in this article is available in Appendix C.

  14. TREC is a forum for the IR community which provides the infrastructure necessary to evaluate an IR system on a broad range of problems. URL: http://trec.nist.gov/.

  15. Appendix B explains the Cosine similarity measure as a document ranking algorithm.

  16. Consistent with VSM, GVSM interprets index term vectors as linearly independent; however, they are not orthogonal.

  17. ICL-corpus consists of 1,000 documents taken from an Indonesian choral mailing list, while WORDS-corpus consists of 1,000 documents created from ICL-corpus in an annotation process conducted by human experts. Further explanation of these corpora is available in Appendix C.1.

  18. We collaborated with 3 choral experts during the annotation process. Their backgrounds can be reviewed in Appendix C.3.

  19. We used the CS stemmer and Vega’s stopword list in all of the studies presented in this article.

  20. Please see Appendix C.1 for an explanation of the annotation process.

  21. Please see Appendix C.1.

  22. If the size of the tolerance classes is smaller, then the size of the upper sets will be smaller, and vice versa.

  23. These values are for the process with the stemming task.

  24. Most of the foreign terms were English.

  25. It comes from the English term workshop and the Indonesian suffix -nya.

  26. An inverted index was applied for document representations in all experiments in this article.

  27. It is an open source project implemented in Java and licensed under the liberal Apache Software License [40]. We used Lucene 3.1.0 in our study. URL for download: http://lucene.apache.org/core/downloads.html.

  28. JAMA has been developed by the MathWorks and NIST. It provides user-level classes for constructing and manipulating real, dense matrices. We used JAMA 1.0.2 in this study. URL: http://math.nist.gov/javanumerics/jama/.

  29. We used trec_eval.9.0, which is publicly available at http://trec.nist.gov/trec_eval/.

  30. WORDS-corpus is generated from ICL-corpus; hence they dwell in a single domain.

  31. Base method means that we employed only the TF*IDF weighting scheme, without the TRSM implementation.

  32. Please see Appendix C.2.

  33. An explanation of Cosine as a document ranking method is available in Appendix B.

  34. In fact, we found the same results for ICL_1000 and ICL_1000 + WORDS_1000 in all calculations we made, such as R-Precision, Precision@10, Precision@20, and Precision@30.

  35. It is an Indonesian lexicon created by the University of Indonesia, described in a study by Nazief and Adriani in 1996 [43], which consists of 29,337 Indonesian root words. The lexicon has been used in other studies [10, 38].

  36. KBBI is a dictionary copyrighted by Pusat Bahasa (in English: Language Center), Indonesian Ministry of Education, which consists of 27,828 root words.

  37. The index terms of the thesaurus are single terms; hence we chose the term partitur as the representative of the karya musik concept.

  38. Figure 35 serves as a basis for the choice of \(\theta \) values in which the TRSM-representation, LEX-representation, TRSM-representation, and TFIDF-representation outperform the other representations at \(\theta \) = 2, \(\theta \) = 8, \(\theta \) = 41, and \(\theta \) = 88 in respective order. However, particularly at \(\theta \) = 88, the TFIDF-representation only performs better than the LEX-representation.

  39. The base model means that we employed the TF*IDF weighting scheme without the TRSM implementation or the mapping process.

  40. Kompas-corpus is a TREC-like Indonesian testbed which is composed of 3,000 newswire articles and is accompanied by 20 topics. Please see Appendix C.4 for more explanation.

  41. Big data is a term describing the enormity of data, both structured and unstructured, in volume, velocity, and variety [45].

  42. Please see Appendix C.4 for more explanation about Kompas-corpus.

  43. Indonesian Wikipedia: http://id.wikipedia.org/wiki/Halaman_Utama.

  44. DBpedia is a community project which was started and is administered by research groups from Universität Leipzig, Freie Universität Berlin, and OpenLink Software. The project is an effort to extract information from Wikipedia, make this information available on the Web under an open license, and interlink the DBpedia dataset with other open datasets on the Web. The Indonesian short abstracts of DBpedia were downloaded from http://downloads.dbpedia.org/3.7/id/.

  45. Kompas. URL: http://www.kompas.com.

References

  1. Büttcher, S., Clarke, C.L.A., Cormack, G.V.: Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge (2010)

  2. Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F.J.: Text Mining - Predictive Methods for Analyzing Unstructured Information. Springer, New York (2005)

  3. Eifring, H., Theil, R.: Linguistics for Students of Asian and African Languages (2005)

  4. Grandy, R.E., Warner, R.: Paul Grice. http://plato.stanford.edu/entries/grice/, May 2006. Accessed 02 Oct 2012

  5. Searle, J.R.: Intentionality: An Essay in the Philosophy of Mind. Cambridge University Press, Cambridge (1983)

  6. Grice, H.P.: Studies in the Way of Words. Harvard University Press, Cambridge (1989)

  7. Haugh, M., Jaszczolt, K.M.: Speaker intentions and intentionality. In: Allan, K., Jaszczolt, K.M. (eds.) The Cambridge Handbook of Pragmatics, pp. 87–112. Cambridge University Press, Cambridge (2012)

  8. Akand, M.: Grice and Searle on meaning. Copula - J. Philos. Dept XXVIII, 51–58 (2011)

  9. Adriani, M., Manurung, R.: A survey of bahasa Indonesia NLP research conducted at the University of Indonesia. In: Proceedings of the 2nd International MALINDO Workshop (2008)

  10. Asian, J.: Effective techniques for Indonesian text retrieval. Ph.D. thesis, School of Computer Science and Information Technology, RMIT University (March 2007)

  11. Asian, J., Williams, H.E., Tahaghoghi, S.M.M.: A testbed for Indonesian text retrieval. In: Bruza, P., Moffat, A., Turpin, A. (eds.) ADCS, pp. 55–58. University of Melbourne, Department of Computer Science (2004)

  12. Sneddon, J.: The Indonesian Language: Its History and Role in Modern Society. UNSW Press, Sydney (2003)

  13. Kawasaki, S., Nguyen, N.B., Ho, T.-B.: Hierarchical document clustering based on tolerance rough set model. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 458–463. Springer, Heidelberg (2000)

  14. Ho, T.B., Nguyen, N.B.: Nonhierarchical document clustering based on a tolerance rough set model. Int. J. Intell. Syst. 17(2), 199–212 (2002)

  15. Nguyen, H.S., Ho, T.B.: Rough document clustering and the internet. In: Handbook of Granular Computing, pp. 987–1003. Wiley, Hoboken (2008)

  16. Wu, Y., Ding, Y., Wang, X., Xu, J.: On-line hot topic recommendation using tolerance rough set based topic clustering. J. Comput. 5, 549–556 (2010)

  17. Gaoxiang, Y., Heping, H., Zhengding, L., Ruixuan, L.: A novel web query automatic expansion based on rough set. Wuhan Univ. J. Nat. Sci. 11(5), 1167–1171 (2006)

  18. Bly, B.M., Rumelhart, D.E. (eds.): Cognitive Science: Handbook of Perception and Cognition, 2nd edn. Academic Press, Millbrae (1999)

  19. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Pearson Education Inc., Upper Saddle River (2010)

  20. Voorhees, E.M., Harman, D.: Overview of the ninth text retrieval conference (TREC-9). In: Proceedings of the Ninth Text Retrieval Conference (TREC-9), National Institute of Standards and Technology (NIST), pp. 1–14 (2000)

  21. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999)

  22. Chomsky, N.: Language and Mind, 3rd edn. Cambridge University Press, New York (2006)

  23. Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E.: Information retrieval using a singular value decomposition model of latent semantic structure. In: Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 1988, New York, NY, USA, pp. 465–480. ACM (1988)

  24. Grossman, D.A., Frieder, O.: Information Retrieval: Algorithms and Heuristics, 2nd edn. Springer, Netherlands (2004)

  25. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence. IJCAI 2007, San Francisco, CA, USA, pp. 1606–1611. Morgan Kaufmann Publishers Inc (2007)

  26. Gottron, T., Anderka, M., Stein, B.: Insights into explicit semantic analysis. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. CIKM 2011, New York, NY, USA, pp. 1961–1964. ACM (2011)

  27. Wong, S.K.M., Ziarko, W., Wong, P.C.N.: Generalized vector spaces model in information retrieval. In: Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 1985, New York, NY, USA, pp. 18–25. ACM (1985)

  28. Nguyen, S.H., Świeboda, W., Jaśkiewicz, G.: Extended document representation for search result clustering. In: Bembenik, R., Skonieczny, L., Rybiński, H., Niezgodka, M. (eds.) Intelligent Tools for Building a Scient. Info. Plat. SCI, vol. 390, pp. 77–95. Springer, Heidelberg (2012)

  29. Nguyen, S.H., Jaśkiewicz, G., Świeboda, W., Nguyen, H.S.: Enhancing search result clustering with semantic indexing. In: Proceedings of the Third Symposium on Information and Communication Technology. SoICT 2012, New York, NY, USA, pp. 71–80. ACM (2012)

  30. Szczuka, M., Janusz, A., Herba, K.: Semantic clustering of scientific articles with use of DBpedia knowledge base. In: Bembenik, R., Skonieczny, L., Rybiński, H., Niezgodka, M. (eds.) Intelligent Tools for Building a Scient. Info. Plat. SCI, vol. 390, pp. 61–76. Springer, Heidelberg (2012)

  31. Pawlak, Z.: Rough sets. Int. J. Comput. Inf. Sci. 11(5), 341–356 (1982)

  32. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough Sets: A Tutorial, pp. 3–98. Springer, Singapore (1998)

  33. Pawlak, Z.: Some issues on rough sets. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 1–58. Springer, Heidelberg (2004)

  34. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundam. Inf. 27, 245–253 (1996)

  35. Lassila, O., McGuinness, D.: The role of frame-based representation on the semantic web. Technical report, Knowledge System Laboratory, Stanford University (2001)

  36. Virginia, G., Nguyen, H.S.: Lexicon-based document representation. Fundamenta Informaticae 124, 27–45 (2013, to appear)

  37. Vega, V.B.: Information retrieval for the Indonesian language. Master’s thesis, National University of Singapore, Unpublished (2001)

  38. Adriani, M., Asian, J., Nazief, B., Tahaghoghi, S.M.M., Williams, H.E.: Stemming Indonesian: a confix-stripping approach. ACM Trans. Asian Lang. Inf. Process. 6, 1–33 (2007)

  39. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

  40. McCandless, M., Hatcher, E., Gospodnetić, O.: Lucene in Action. Manning Publications Co., Greenwich (2010)

  41. Virginia, G., Nguyen, H.S.: An algorithm for tolerance value generator in tolerance rough sets model. In: Na, M.G., Toro, C., Posada, J., Howlett, R.J., Jain, L.C. (eds.) Advances in Knowledge-Based and Intelligent Information and Engineering Systems. KES 2012, Netherlands, pp. 595–604. IOS Press (2012)

  42. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)

  43. Adriani, M., Nazief, B.: Confix-Stripping: Approach to Stemming Algorithm for Bahasa Indonesia. Internal Publication, Depok (1996)

  44. Obadi, G., Dráždilová, P., Hlaváček, L., Martinovič, J., Snášel, V.: A tolerance rough set based overlapping clustering for the DBLP data. In: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and International Conference on Intelligent Agent Technology - Workshops. WI-IAT 2010, vol. 3, pp. 57–60. IEEE (2010)

  45. Troester, M.: Big data meets big data analytics. http://www.sas.com/resources/whitepaper/wp_46345.pdf (2012). SAS Institute Inc. Accessed 22 Feb 2013

  46. Ingwersen, P.: Information Retrieval Interaction, 1st edn. Taylor Graham, London (1992)

  47. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)

  48. Manola, F., Miller, E.: RDF primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ (2004). W3C. Accessed 12 Jan 2013

Acknowledgments

This work is partially supported by (1) Specific Grant Agreement Number-2008-4950/001-001-MUN-EWC from European Union Erasmus Mundus “External Cooperation Window” EMMA, (2) the National Centre for Research and Development (NCBiR) under Grant No. SP/I/1/77065/10 by the Strategic Scientific Research and Experimental Development Program: “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”, (3) grant from Ministry of Science and Higher Education of the Republic of Poland N N516 077837, and (4) grant from Yayasan Arsari Djojohadikusumo (YAD) based on Addendum Agreement No. 029/C10/UKDW/2012. We thank Faculty of Computer Science, University of Indonesia, for the permission of using the CS stemmer.

Author information

Corresponding author

Correspondence to Gloria Virginia.

Appendices

A Weighting Scheme: The TF*IDF

Salton and Buckley summarised clearly in their paper [47] the insights gained in automatic term weighting and provided baseline single-term-indexing models with which other more elaborate content analysis procedures can be compared. The main function of a term-weighting system is the enhancement of retrieval effectiveness where this result depends crucially on the choice of effective term-weighting systems. Recall and Precision are two measures normally used to assess the ability of a system to retrieve the relevant and reject the non-relevant items of a collection. Considering the trade-off between recall and precision, in practice compromises are normally made by using terms that are broad enough to achieve a reasonable recall level without at the same time producing unreasonably low precision.
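The two measures mentioned above can be sketched in a few lines of Python. This is a minimal illustration assuming set-based (unranked) retrieval for a single query; the function name and the document identifiers are hypothetical and do not come from the original study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Retrieving more documents can only keep or raise recall, but it tends to lower precision, which is the trade-off described above.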

Salton and Buckley further explained that, with regard to the differing recall and precision requirements, three main considerations appear important:

  1. 1.

    Term frequency (tf). The frequent terms in individual documents appear to be useful as recall-enhancing devices.

  2. 2.

    Inverse document frequency (idf). The idf factor varies inversely with the number of documents \(df_t\) to which a term t is assigned in a collection of N documents. It favors terms concentrated in a few documents of a collection and avoids the effect of high frequency terms which are widespread in the entirety of documents.

  3. 3.

    Normalisation. Normally, all relevant documents should be treated as equally important for retrieval purposes. The normalisation factor is suggested to equalise the length of the document vectors.

Table A.1. Term-weighting components with SMART notation [39]. Here, \(tf_{t,d}\) is the term frequency of term t in document d, N is the size of the document collection, \(df_t\) is the document frequency of term t, \(w_i\) is the weight of term t in document i, u is the number of unique terms in document d, and CharLength is the number of characters in the document.

Table A.1 summarises some of the term weighting schemes together with their mnemonics, sometimes called SMART notation. One example of a mnemonic is lnc.ltc. The first triplet (i.e. lnc) represents the weighting combination for the document vector, while the second triplet (i.e. ltc) represents the weighting combination for the query vector. Each triplet describes the form of the tf component, the idf component, and the normalisation component being used. Thus, the mnemonic lnc.ltc means that the document vector employs log-weighted term frequency, no idf for the collection component, and cosine normalisation, while the query vector employs log-weighted term frequency, idf weighting for the collection component, and cosine normalisation. Equation A.1 is the common weighting scheme used for a term in a document, i.e. mnemonic ntn, which is called the TF*IDF weighting scheme.

$$\begin{aligned} w_{t,d} = tf \cdot idf = tf_{t,d} \cdot \log \frac{N}{df_t} \end{aligned}$$
(A.1)
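As a minimal sketch of Equation A.1, the snippet below computes the ntn weights for a toy collection. The base of the logarithm is not stated in the equation; base 10 is an assumption made here. The function name and the tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute w[t, d] = tf[t, d] * log10(N / df[t]) for tokenised documents."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency (the "n" in ntn)
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights


# Toy collection (hypothetical tokens, assumed already stemmed and stopword-free).
docs = [
    ["paduan", "suara", "konser"],
    ["paduan", "suara", "lomba", "lomba"],
    ["partitur", "lagu"],
]
w = tfidf_weights(docs)
print(w[1])
```

A term that occurs twice in one document and nowhere else ("lomba") receives a much larger weight than a term shared by two of the three documents ("paduan"), which is the behaviour the idf factor is meant to produce.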

B Document Ranking Method: The Cosine Measure

Manning et al. [39] stated that cosine similarity is fundamental to IR systems that use any form of vector space scoring. Given a query vector and a set of document vectors in a high dimensional space, we may rank the documents by comparing the angle between the query vector and each document vector; the smaller the angle, the more similar the vectors. In linear algebra, the angle \(\theta \) between two vectors, \(\overrightarrow{x}\) and \(\overrightarrow{y}\), can be measured as follows:

$$\begin{aligned} \overrightarrow{x} \cdot \overrightarrow{y} = |\overrightarrow{x}| * |\overrightarrow{y}| * cos(\theta ) \end{aligned}$$
(B.1)

where \(\overrightarrow{x} \cdot \overrightarrow{y}\) represents the dot product while \(|\overrightarrow{x}|\) and \(|\overrightarrow{y}|\) represent the length of the vectors. The dot product \(\overrightarrow{x} \cdot \overrightarrow{y}\) of two vectors is defined as \(\sum _{j=1}^{M}x_{j} * y_{j}\) and the Euclidean length of a vector \(|\overrightarrow{x}|\) is defined as \(\sqrt{\sum _{j=1}^{M}(x_{j})^2}\). Thus, formula (B.2) can be used to measure the similarity between a query vector Q and a document vector D:

$$\begin{aligned} similarity(Q, D) = \frac{\sum _{j=1}^{M}w_{qj} * w_{dj} }{\sqrt{ \sum _{j=1}^{M}(w_{qj})^2 * \sum _{j=1}^{M}(w_{dj})^2}} \end{aligned}$$
(B.2)
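Formula (B.2) can be sketched as follows, assuming that the query and document vectors are stored sparsely as dictionaries mapping terms to weights. The function names, document identifiers, and weights are hypothetical; returning 0 for an all-zero vector is a convention chosen for this sketch, not something the formula specifies.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between sparse weight vectors q and d (dicts term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_q * norm_d)


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    scores = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights, for illustration only.
query = {"paduan": 1.0, "suara": 1.0}
documents = {
    "DR-1": {"paduan": 2.0, "suara": 2.0},
    "DR-2": {"paduan": 1.0, "lomba": 3.0},
    "DR-3": {"partitur": 1.0},
}
ranking = rank(query, documents)
print(ranking)
```

The first document points in the same direction as the query and therefore scores close to 1 regardless of its length, while a document sharing no term with the query scores 0.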

C The Corpora

C.1 ICL-Corpus and WORDS-Corpus

Our original corpus, called ICL-corpus, consists of the first 1,000 emails of the Indonesian Choral Lovers (ICL) Yahoo! Groups, formatted according to the Text REtrieval Conference (TREC) format [20]. Our test collection therefore consists of three parts (a set of documents, a set of information needs, and relevance judgments), and all documents are marked up in a TREC-like format: each document is enclosed by <DOC> and </DOC> tags, the document number by <DOCNO> and </DOCNO> tags, the subject of the email by <SUBJECT> and </SUBJECT> tags, the date of the email by <DATE> and </DATE> tags, the sender by <FROM> and </FROM> tags, and the text body by <TEXT> and </TEXT> tags.
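A document marked up with the tags listed above could be read with a few regular expressions. The sketch below is only an illustration: the order of the tags, the field values, and the function name are assumptions, and the sample content is a placeholder, not text from ICL-corpus.

```python
import re

# Placeholder document; only the tag names come from the description above.
SAMPLE = """<DOC>
<DOCNO>DR-480</DOCNO>
<SUBJECT>contoh subjek</SUBJECT>
<DATE>2012-01-01</DATE>
<FROM>pengirim</FROM>
<TEXT>
isi email
</TEXT>
</DOC>
"""

FIELDS = ("DOCNO", "SUBJECT", "DATE", "FROM", "TEXT")


def parse_trec(text):
    """Yield one dict per <DOC>...</DOC> block, keyed by tag name."""
    for block in re.findall(r"<DOC>(.*?)</DOC>", text, flags=re.DOTALL):
        doc = {}
        for tag in FIELDS:
            m = re.search(rf"<{tag}>(.*?)</{tag}>", block, flags=re.DOTALL)
            doc[tag] = m.group(1).strip() if m else ""  # missing field -> empty string
        yield doc


docs = list(parse_trec(SAMPLE))
print(docs[0]["DOCNO"], "|", docs[0]["TEXT"])
```

Regular expressions are used instead of an XML parser because TREC-style files usually have no single root element and may contain unescaped characters.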

We worked intensively with two choral experts in the annotation process in order to construct the information needs and relevance judgments for our testbed. The annotation process consisted of two tasks: a) topic assignment, where the human experts assigned topic(s) to each document within the original corpus; and b) keyword determination, where they determined the terms considered highly related to the topic(s) given. The annotation process aimed to capture how the topic(s) could be assigned to a particular document, which was mainly described by the keywords determined. We use these keywords as the list of terms closely related to the topic of the document, as well as to the document itself, and assume that the other terms not listed are less important. The first step, topic assignment, yielded 127 topics and the keyword determination yielded a new corpus, called WORDS-corpus.

Fig. C.1. The content of corpora. The picture on the left is an example of an ICL-corpus document, which consists of the original document, while the picture on the right is an example of a WORDS-corpus document, which consists of keywords given manually by a human expert for a particular ICL-corpus document, in this case the ICL-corpus document numbered “DR-480” shown on the left.

Consult Fig. C.1 to see the content of both corpora. Notice that the main difference between documents in ICL-corpus and WORDS-corpus lies in the text body, i.e. the document of ICL-corpus consists of a body of emails while the document of WORDS-corpus consists of keywords defined by human experts. Figure C.2 shows the relationship between both corpora.

Fig. C.2. Corpus relationship. The WORDS-corpus was produced by human experts in the annotation process over ICL-corpus.

Table C.1. List of topics. This is a list of 127 topics of ICL-corpus and the total number (document frequency) of relevant documents for each topic with ID 0 to 35.

As mentioned above, the topic assignment yielded 127 topics, many of which have a low document frequency: 81.10 % of them have document frequency \(<\) 10 and 32.28 % of them have document frequency 1. We further processed the 127 topics, as shown in Tables C.1 and C.2, and came up with 28 topics as listed in Table C.3. Thus, we have two versions of relevance judgments: a) relevance judgments consisting of 127 topics; and b) relevance judgments consisting of 28 topics.

Table C.2. List of topics. This is a list of 127 topics of ICL-corpus and the total number (document frequency) of relevant documents for each topic with ID 36 to 126.
Table C.3. List of topics. This is a list of 28 topics of ICL-corpus and the total number (document frequency) of relevant documents for each topic.

For the 127 topics, the distribution of topics is shown in Table C.4, while the list of topics with document frequency \(\ge \) 10 is shown in Table C.5. In all the tables here, the ID column gives the topic identifier, the Topic column gives the topic in Indonesian, and the DF column gives the document frequency, i.e. the total number of relevant documents for the topic.

Table C.4. Topic distribution. This table shows the total number of topics which have document frequency \(<\) 10 out of 127 topics.
Table C.5. List of topics. This table presents topics of ICL-corpus with document frequency \(\ge \) 10 out of 127 topics.

Following the TREC format, Fig. C.3 is an example of the relevance judgment file while Fig. C.4 is an example of the information needs file. In the relevance judgment file, the first column defines the topic identifier, the third column defines the document identifier, and the fourth column defines the relevancy, i.e. 1 if the document is relevant to the topic, and 0 otherwise. The second column is an arbitrary string and in this case carries no information. The information needs file consists of topics (the string between <TITLE> and </TITLE> tags), each with its description (the string between <DESC> and </DESC> tags) and narrative (the string between <NARR> and </NARR> tags). It follows the TREC format, so each topic is enclosed by <TOP> and </TOP> tags.
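The four-column layout of the relevance judgment file can be read as sketched below. The sample lines, the whitespace separator, and the function name are assumptions made for illustration; they are not taken from the actual file shown in Fig. C.3.

```python
from collections import defaultdict

# Hypothetical judgment lines: topic id, unused string, document id, relevance flag.
SAMPLE_QRELS = """\
0 0 DR-480 1
0 0 DR-481 0
1 0 DR-480 1
1 0 DR-12 1
"""


def load_qrels(text):
    """Map each topic id to the set of documents judged relevant (flag 1)."""
    relevant = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, _unused, doc_id, flag = line.split()
        if flag == "1":
            relevant[topic_id].add(doc_id)
    return relevant


qrels = load_qrels(SAMPLE_QRELS)
# Document frequency of a topic = number of documents relevant to it.
df = {topic: len(doc_ids) for topic, doc_ids in qrels.items()}
print(df)
```

Counting the relevant documents per topic in this way gives the document frequency (DF) reported for each topic in the tables of this appendix.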

Fig. C.3. The relevance judgment file. This picture is an excerpt of the relevance judgment file. Columns 1 to 4 are, respectively, the topic identifier, an arbitrary string, the document identifier, and the relevancy of the document to the topic.

Fig. C.4. The information needs file. This picture is an excerpt of the information needs file.

Annotation Process

We mentioned above that the annotation process consisted of two tasks, namely topic assignment and keyword determination, and yielded WORDS-corpus and two lists of topics (127-topics and 28-topics). This was a collaborative work with three choral experts in which four phases were carried out, as presented in Fig. C.5.

Fig. C.5. Annotation process. The annotation process had four phases: determination, revision, decision, and categorization.

First of all, the first expert did the topic assignment and keyword determination for the 1,000 documents of ICL-corpus. Taking his work (Result_1) into consideration, we did the same thing and came up with a different result (Result_2). Based on Result_1 and Result_2, the first expert revised his previous result and produced a new result (Result_3). The second expert made a decision (Result_4) by analyzing Result_1, Result_2, and Result_3.

At this stage, we had 127 topics and decided to make the list smaller by categorizing it. Thus, we analyzed those topics and agreed on 28 topics. With reference to the 28 topics, the third expert reassigned each document of ICL-corpus.

In addition to the construction process, this is another main difference between our corpus and the Indonesian corpus made by Jelita Asian from Kompas newswire articles (Kompas-corpus; see Note 42). In ICL-corpus, each document must be assigned at least one topic, while in Kompas-corpus this is not the case, i.e. there are documents that are not assigned to any topic.

C.2 WIKI_1800

WIKI-1800 is a corpus consisting of 1,800 text documents in the music domain, which are the short abstracts of Indonesian Wikipedia articles (see Note 43). The full version of the corpus consists of 85,601 short abstracts on a variety of topics and was downloaded from DBpedia (see Note 44). The WIKI_1800 employed in this study was obtained by filtering the 85,610 abstracts for the music domain, a task conducted by our third expert.

Figure C.6 shows a small chunk of a WIKI-1800 document. Each document is represented in RDF triple notation, which contains three components (i.e. subject, predicate, and object), plus the URL of the Web page. In Fig. C.6, <http://dbpedia.org/resource/Indonesia_Raya>, which acts as the subject, is a URI reference to the resource Indonesia Raya. <http://www.w3.org/2000/01/rdf-schema#comment> (or rdfs:comment for short), which acts as the predicate, is a URI reference to the property used to provide a human-readable description of a resource; R rdfs:comment L states that L is a human-readable description of R [48]. Therefore, the string inside the quotes next to rdfs:comment is the human-readable description of Indonesia Raya, which is actually the short abstract of the Indonesia Raya article. Finally, <http://id.wikipedia.org/wiki/Indonesia_Raya#> is the URL of the Web page of Indonesia Raya.
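The structure described above (subject, predicate, quoted object, and page URL) can be pulled apart with a single regular expression. The sketch below assumes one record per line, a terminating dot, and an optional language tag after the literal; these details, the placeholder abstract text, and the function name are assumptions, since the exact serialisation is only visible in Fig. C.6.

```python
import re

# Hypothetical line in the shape described above; the abstract text is a placeholder.
LINE = (
    '<http://dbpedia.org/resource/Indonesia_Raya> '
    '<http://www.w3.org/2000/01/rdf-schema#comment> '
    '"teks abstrak singkat"@id '
    '<http://id.wikipedia.org/wiki/Indonesia_Raya#> .'
)

PATTERN = re.compile(
    r'^<(?P<subject>[^>]+)>\s+'
    r'<(?P<predicate>[^>]+)>\s+'
    r'"(?P<object>(?:[^"\\]|\\.)*)"(?:@(?P<lang>[A-Za-z-]+))?\s+'
    r'<(?P<source>[^>]+)>\s*\.\s*$'
)


def parse_line(line):
    """Return the four components as a dict, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


record = parse_line(LINE)
print(record["subject"].rsplit("/", 1)[-1], "->", record["object"])
```

For real data, a dedicated RDF parser would be more robust than a regular expression, for instance with respect to escape sequences inside the literal.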

Fig. C.6. WIKI-1800. An example of a WIKI-1800 document.

1.3 C.3 The Choral Experts

In the data preparation for our study, we collaborated with three people who have years of experience in choral music. They were Agastya Rama Listya, Kristoforus Kuntarahadi, and Inke Kusumastuti; in Appendix C.1, we referred to them as the first, second, and third expert, respectively. Figure C.7 shows their pictures.

Fig. C.7. The choral experts. The choral experts involved in the annotation process of our study: (1) Agastya Rama Listya, (2) Kristoforus Kuntarahadi, and (3) Inke Kusumastuti.

Agastya Rama Listya was born in Yogyakarta on February 18, 1968, and now lives in Salatiga, Central Java, Indonesia. He obtained his Bachelor of Arts in Theory and Music Composition from the Indonesian Arts Institute, Yogyakarta, Indonesia, in 1992. In 2001, he received his Master of Sacred Music in Choral Conducting from Luther Seminary and St. Olaf College, Minnesota, USA. He was the Dean of the Faculty of Performing Arts, Satya Wacana Christian University at Salatiga for two periods (2009–2011) and served as a committee member of Badan Kerjasama Gereja-Gereja se-Salatiga (2007–2010), Lembaga Pengembangan Pesparawi Daerah Jawa Tengah (2007–2010), and Badan Pembina Seni Mahasiswa Indonesia Jawa Tengah (2008–2010). Agastya has published 7 books, 6 journal articles, and 16 essays. He is a productive composer and arranger, and many of his choral works have been performed by numerous choirs in Indonesia. He is also an active choral coach for a number of choirs that, under his direction, have made prominent achievements regionally, nationally, and internationally. Individually, he was the winner of 4 different national choral composition contests during 1998–2009 and the winner of the Yazeed Djamin Award for Piano Composition Contest in 2006. Agastya Rama Listya’s name was included in the 30th Pearl Anniversary edition of Marquis Who’s Who in the World (November 2012).

Kristoforus Kuntarahadi was born in Yogyakarta on January 14, 1979. He is now a staff member in the office of the Bishop’s Conference of Indonesia in Jakarta. He was a student of several well-known Indonesian vocalists and choristers, i.e. Avip Priatna, Lucia Kusumawardhani, Yoseph Chang, and Tommy Prabowo. He has been an active singer in several choirs since 1990, including the famous Indonesian choir Batavia Madrigal Singers in Jakarta, and has performed as a tenor soloist in some concerts. He earned several achievements at regional singing festivals during 1993–1997. Nationally, as a classical singer, he was the runner-up of Bintang Radio dan Televisi (a national radio and television singing competition) in 1995 and the third prize winner of PEKSIMINAS V (a national singing competition for students) in 1999. He received an award from the Governor of Yogyakarta as an outstanding vocal artist in 1997.

Table C.6. List of topics. This is a list of 20 topics of Kompas-corpus and the document frequency, DF, of relevant documents for each topic.

Inke Kusumastuti is a medical doctor who is currently continuing her education in Psychiatry at Udayana University, Denpasar, Bali. She was born in Blitar on April 17, 1986. She did not receive any formal education in music, but she is a motivated self-learner when it comes to singing. She won numerous prizes in individual regional singing contests while she was in elementary school (1992–2001). In 2001–2004 she was the vocalist of a band that won several regional competitions. In 2003, she worked as a cafe singer for a year. After that, while pursuing her medical education, she was an active soprano in several choirs, including the Eternal Choir, a well-known semi-professional small choir in Yogyakarta. As a chorister, she took part in numerous concerts and choral competitions and earned some achievements. In 2007, she attended a conducting workshop given by Andrew deQuadros at the First Asian Choir Games and, in 2010, she joined a choral clinic given by Marc Anthony Carpio, a choirmaster of the Philippine Madrigal Singers. Recently, in 2012, she won third prize in Bintang Radio RRI Jember (a singing contest conducted by the national radio of Indonesia at Jember). Her favorite artist is The Real Group, a world-acclaimed Swedish a cappella group, which has significantly shaped her current musical interests, and her dream is to be able to employ music as part of therapy for people with mental disorders.

1.4 C.4 Kompas-Corpus

Kompas-corpus [11] is a set of newswire articles collected from the well-known Indonesian newspaper KompasFootnote 45 and published between January and June 2002. It consists of 3,000 documents constructed following the TREC format and is therefore accompanied by a file of information needs and a file of relevance judgments. The 20 topics were chosen by a native speaker, after reading each document, to represent user information needs. Those topics are listed in Table C.6 together with the total number of relevant documents for each topic. Of the 3,000 documents, only 433 are assigned to one or more topics.
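Both the per-topic document frequency and the number of documents assigned to at least one topic can be derived from a relevance-judgment file. The following is a minimal sketch, assuming the common four-column TREC qrels layout (topic id, iteration, document id, relevance); whether the Kompas-corpus file uses exactly this layout is an assumption. The function names and the identifiers in `SAMPLE_QRELS` are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict


def relevant_docs_by_topic(qrels_lines):
    """Map each topic id to the set of document ids judged relevant."""
    by_topic = defaultdict(set)
    for line in qrels_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        if int(relevance) > 0:
            by_topic[topic].add(doc_id)
    return dict(by_topic)


def summarize(by_topic):
    """Return (document frequency per topic, number of assigned documents)."""
    doc_freq = {topic: len(docs) for topic, docs in by_topic.items()}
    # A document relevant to several topics is counted only once here.
    assigned = set().union(*by_topic.values()) if by_topic else set()
    return doc_freq, len(assigned)


# Hypothetical judgments: topic, iteration, document id, relevance.
SAMPLE_QRELS = [
    "1 0 DOC-0001 1",
    "1 0 DOC-0002 0",
    "1 0 DOC-0003 1",
    "2 0 DOC-0003 1",
    "2 0 DOC-0004 1",
]

if __name__ == "__main__":
    doc_freq, n_assigned = summarize(relevant_docs_by_topic(SAMPLE_QRELS))
    print(doc_freq, n_assigned)
```

Because a document may be relevant to more than one topic, the sum of the per-topic frequencies can exceed the number of assigned documents, which is why the two quantities are computed separately.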

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Virginia, G., Nguyen, H.S. (2015). A Semantic Text Retrieval for Indonesian Using Tolerance Rough Sets Models. In: Peters, J., Skowron, A., Ślęzak, D., Nguyen, H., Bazan, J. (eds) Transactions on Rough Sets XIX. Lecture Notes in Computer Science, vol 8988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47815-8_9

  • DOI: https://doi.org/10.1007/978-3-662-47815-8_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-47814-1

  • Online ISBN: 978-3-662-47815-8

  • eBook Packages: Computer Science, Computer Science (R0)
