A Semantic Text Retrieval for Indonesian Using Tolerance Rough Sets Models

Virginia, Gloria; Nguyen, Hung Son

doi:10.1007/978-3-662-47815-8_9

Gloria Virginia and Hung Son Nguyen

Chapter
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science ((TRS,volume 8988))

Abstract

Previous research on the Tolerance Rough Sets Model (TRSM) has followed the rational approach of the AI perspective. This article presents studies that take the contrary path, i.e. a cognitive approach, with the objective of a modular framework for a semantic text retrieval system based on TRSM, specifically for Indonesian. In addition to the framework, the article proposes three TRSM-based methods: an automatic tolerance value generator, thesaurus optimization, and lexicon-based document representation. All methods were developed using our own corpus, called ICL-corpus, and evaluated on an available Indonesian corpus, called Kompas-corpus. The goal of a semantic information retrieval system is to retrieve information, not merely terms with similar meaning. This article is a small step toward that goal.

This work was partially supported by the Polish National Science Centre grant DEC-2012/05/B/ST6/03215.

Notes

1.
Key statistical highlights: ITU data release June 2012. URL: http://www.itu.int. Accessed on 25 October 2012.
2.
Utterances may include sound, marks, gesture, grunts, and groans (anything that can signal an intention).
3.
The reason is that, in the context of a speech act, we are not concerned with whether the speaker's belief is true, but with the speaker's intention, i.e. what he/she wants to represent by the utterance. Thus, a speaker might represent a false belief as a true belief to the audience, e.g. a speaker utters ‘it is raining’ while in fact ‘it is a sunny day’.
4.
In other words, ‘the mind to fit the world’. It is because a belief is like a statement, can be true or false; if the statement is false then it is the fault of the statement, not the world. The world-to-mind direction of fit is applied for the psychological mode such as desire or promise; if the promise is broken, it is the fault of the promiser.
5.
BPS-Statistics Indonesia. URL: http://www.bps.go.id/. Accessed on 25 October 2012.
6.
July 2012 estimation of The World Factbook. URL: https://www.cia.gov. Accessed on 25 October 2012.
7.
Portal Nasional Indonesia (National Portal of Indonesia). URL: http://www.indonesia.go.id. Accessed on 25 October 2012.
8.
Key statistical highlights: International Telecommunication Union (ITU) data release June 2012. URL: http://www.itu.int. Accessed on 25 October 2012.
9.
URL: http://www.internetworldstats.com. Accessed on 25 October 2012.
10.
The graph was taken from the International Telecommunication Union (ITU). URL: http://www.itu.int/ITU-D/ict/statistics/explorer/index.html. Accessed on 25 October 2012.
11.
Appendix A provides an explanation about the TF*IDF weighting scheme.
12.
Cognitive modeling is an approach employed in cognitive science (CS). Cognitive science is an interdisciplinary study of mental representations and computations and of the physical systems that support those processes [18, p. xv].
13.
Explanation about all corpora used in this article is available in Appendix C.
14.
TREC is a forum for IR community which provides an infrastructure necessary to evaluate an IR system on a broad range of problems. URL: http://trec.nist.gov/.
15.
Appendix B provides explanation about Cosine similarity measure as a document ranking algorithm.
16.
Consistent with VSM, GVSM interprets index term vectors as linearly independent; however, they are not orthogonal.
17.
ICL-corpus consists of 1,000 documents taken from an Indonesian choral mailing list, while WORDS-corpus consists of 1,000 documents created from ICL-corpus in an annotation process conducted by human experts. Further explanation of these corpora is available in Appendix C.1.
18.
We collaborated with three choral experts during the annotation process. Their backgrounds are described in Appendix C.3.
19.
We used CS stemmer and Vega’s stopword in all of our studies presented in this article.
20.
Please see Appendix C.1 for explanation of annotation process.
21.
Please see Appendix C.1.
22.
If the tolerance classes are smaller, then the upper sets will be smaller, and vice versa.
23.
These values are for the process with stemming task.
24.
Most of the foreign terms were English.
25.
It comes from an English term workshop and an Indonesian suffix -nya.
26.
Inverted index was applied for document representations in all experiments in this article.
27.
It is an open source project implemented in Java licensed under the liberal Apache Software License [40]. We used Lucene 3.1.0 in our study. URL for download: http://lucene.apache.org/core/downloads.html.
28.
JAMA has been developed by the MathWorks and NIST. It provides user-level classes for constructing and manipulating real, dense matrices. We used JAMA 1.0.2 in this study. URL: http://math.nist.gov/javanumerics/jama/.
29.
We used the trec_eval.9.0 which is publicly available on http://trec.nist.gov/trec_eval/.
30.
WORDS-corpus is generated based on ICL-corpus hence they dwell in a single domain.
31.
Base method means that we employed the TF*IDF weighting scheme only without TRSM implementation.
32.
Please see Appendix C.2.
33.
Explanation about Cosine as a document ranking is available in Appendix B.
34.
In fact, we found the same result between ICL_1000 and ICL_1000 + WORDS_1000 in all calculations we made, such as in R-Precision, Precision@10, Precision@20, and Precision@30.
35.
It is an Indonesian lexicon created by the University of Indonesia described in a study of Nazief and Adriani in 1996 [43] which consists of 29,337 Indonesian root words. The lexicon has been used in other studies [10, 38].
36.
KBBI is a dictionary copyrighted by Pusat Bahasa (in English: Language Center), Indonesian Ministry of Education, which consists of 27,828 root words.
37.
The index terms of thesaurus are in the form of single term, hence we choose term partitur as the representative of the karya musik concept.
38.
Figure 35 serves as a basis for the choice of $\theta $ values in which the TRSM-representation, LEX-representation, TRSM-representation, and TFIDF-representation outperform the other representations at $\theta $ = 2, $\theta $ = 8, $\theta $ = 41, and $\theta $ = 88 in respective order. However, particularly at $\theta $ = 88, the TFIDF-representation only performs better than the LEX-representation.
39.
The base model means that we employed the TF*IDF weighting scheme with neither the TRSM implementation nor the mapping process.
40.
Kompas-corpus is a TREC-like Indonesian testbed which is composed of 3,000 newswire articles and is accompanied by 20 topics. Please see Appendix C.4 for more explanation.
41.
Big data is a term to describe the enormity of data, both structured and unstructured, in volume, velocity, and variety [45].
42.
Please see Appendix C.4 for more explanation about Kompas-corpus.
43.
Indonesian Wikipedia: http://id.wikipedia.org/wiki/Halaman_Utama.
44.
DBpedia is a community project that was started and is administered by research groups from Universität Leipzig, Freie Universität Berlin, and OpenLink Software. The project is an effort to extract information from Wikipedia, make this information available on the Web under an open license, and interlink the DBpedia dataset with other open datasets on the Web. The Indonesian short abstracts of DBpedia were downloaded from http://downloads.dbpedia.org/3.7/id/.
45.
Kompas. URL: http://www.kompas.com.

References

1. Büttcher, S., Clarke, C.L.A., Cormack, G.V.: Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge (2010)
2. Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F.J.: Text Mining - Predictive Methods for Analyzing Unstructured Information. Springer, New York (2005)
3. Eifring, H., Theil, R.: Linguistics for Students of Asian and African Languages (2005)
4. Grandy, R.E., Warner, R.: Paul Grice. http://plato.stanford.edu/entries/grice/, May 2006. Accessed 02 Oct 2012
5. Searle, J.R.: Intentionality: An Essay in the Philosophy of Mind. Cambridge University Press, Cambridge (1983)
6. Grice, H.P.: Studies in the Way of Words. Harvard University Press, Cambridge (1989)
7. Haugh, M., Jaszczolt, K.M.: Speaker intentions and intentionality. In: Allan, K., Jaszczolt, K.M. (eds.) The Cambridge Handbook of Pragmatics, pp. 87–112. Cambridge University Press, Cambridge (2012)
8. Akand, M.: Grice and Searle on meaning. Copula - J. Philos. Dept XXVIII, 51–58 (2011)
9. Adriani, M., Manurung, R.: A survey of Bahasa Indonesia NLP research conducted at the University of Indonesia. In: Proceedings of the 2nd International MALINDO Workshop (2008)
10. Asian, J.: Effective techniques for Indonesian text retrieval. Ph.D. thesis, School of Computer Science and Information Technology, RMIT University (March 2007)
11. Asian, J., Williams, H.E., Tahaghoghi, S.M.M.: A testbed for Indonesian text retrieval. In: Bruza, P., Moffat, A., Turpin, A. (eds.) ADCS, pp. 55–58. University of Melbourne, Department of Computer Science (2004)
12. Sneddon, J.: The Indonesian Language: Its History and Role in Modern Society. UNSW Press, Sydney (2003)
13. Kawasaki, S., Nguyen, N.B., Ho, T.-B.: Hierarchical document clustering based on tolerance rough set model. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 458–463. Springer, Heidelberg (2000)
14. Ho, T.B., Nguyen, N.B.: Nonhierarchical document clustering based on a tolerance rough set model. Int. J. Intell. Syst. 17(2), 199–212 (2002)
15. Nguyen, H.S., Ho, T.B.: Rough document clustering and the internet. In: Handbook of Granular Computing, pp. 987–1003. Wiley, Hoboken (2008)
16. Wu, Y., Ding, Y., Wang, X., Xu, J.: On-line hot topic recommendation using tolerance rough set based topic clustering. J. Comput. 5, 549–556 (2010)
17. Gaoxiang, Y., Heping, H., Zhengding, L., Ruixuan, L.: A novel web query automatic expansion based on rough set. Wuhan Univ. J. Nat. Sci. 11(5), 1167–1171 (2006)
18. Bly, B.M., Rumelhart, D.E. (eds.): Cognitive Science: Handbook of Perception and Cognition, 2nd edn. Academic Press, Millbrae (1999)
19. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Pearson Education Inc., Upper Saddle River (2010)
20. Voorhees, E.M., Harman, D.: Overview of the ninth text retrieval conference (TREC-9). In: Proceedings of the Ninth Text Retrieval Conference (TREC-9), National Institute of Standards and Technology (NIST), pp. 1–14 (2000)
21. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999)
22. Chomsky, N.: Language and Mind, 3rd edn. Cambridge University Press, New York (2006)
23. Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E.: Information retrieval using a singular value decomposition model of latent semantic structure. In: Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 1988, New York, NY, USA, pp. 465–480. ACM (1988)
24. Grossman, D.A., Frieder, O.: Information Retrieval: Algorithms and Heuristics, 2nd edn. Springer, Netherlands (2004)
25. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence. IJCAI 2007, San Francisco, CA, USA, pp. 1606–1611. Morgan Kaufmann Publishers Inc (2007)
26. Gottron, T., Anderka, M., Stein, B.: Insights into explicit semantic analysis. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. CIKM 2011, New York, NY, USA, pp. 1961–1964. ACM (2011)
27. Wong, S.K.M., Ziarko, W., Wong, P.C.N.: Generalized vector spaces model in information retrieval. In: Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 1985, New York, NY, USA, pp. 18–25. ACM (1985)
28. Nguyen, S.H., Świeboda, W., Jaśkiewicz, G.: Extended document representation for search result clustering. In: Bembenik, R., Skonieczny, L., Rybiński, H., Niezgodka, M. (eds.) Intelligent Tools for Building a Scient. Info. Plat. SCI, vol. 390, pp. 77–95. Springer, Heidelberg (2012)
29. Nguyen, S.H., Jaśkiewicz, G., Świeboda, W., Nguyen, H.S.: Enhancing search result clustering with semantic indexing. In: Proceedings of the Third Symposium on Information and Communication Technology. SoICT 2012, New York, NY, USA, pp. 71–80. ACM (2012)
30. Szczuka, M., Janusz, A., Herba, K.: Semantic clustering of scientific articles with use of DBpedia knowledge base. In: Bembenik, R., Skonieczny, L., Rybiński, H., Niezgodka, M. (eds.) Intelligent Tools for Building a Scient. Info. Plat. SCI, vol. 390, pp. 61–76. Springer, Heidelberg (2012)
31. Pawlak, Z.: Rough sets. Int. J. Comput. Inf. Sci. 11(5), 341–356 (1982)
32. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough Sets: A Tutorial, pp. 3–98. Springer, Singapore (1998)
33. Pawlak, Z.: Some issues on rough sets. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 1–58. Springer, Heidelberg (2004)
34. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundam. Inf. 27, 245–253 (1996)
35. Lassila, O., Mcguinness, D.: The role of frame-based representation on the semantic web. Technical report, Knowledge System Laboratory, Stanford University (2001)
36. Virginia, G., Nguyen, H.S.: Lexicon-based document representation. Fundamenta Informaticae 124, 27–45 (2013, to appear)
37. Vega, V.B.: Information retrieval for the Indonesian language. Master’s thesis, National University of Singapore, Unpublished (2001)
38. Adriani, M., Asian, J., Nazief, B., Tahaghoghi, S.M.M., Williams, H.E.: Stemming Indonesian: a confix-stripping approach. ACM Trans. Asian Lang. Inf. Process. 6, 1–33 (2007)
39. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
40. McCandless, M., Hatcher, E., Gospodnetić, O.: Lucene in Action. Manning Publications Co., Greenwich (2010)
41. Virginia, G., Nguyen, H.S.: An algorithm for tolerance value generator in tolerance rough sets model. In: Na, M.G., Toro, C., Posada, J., Howlett, R.J., Jain, L.C. (eds.) Advances in Knowledge-Based and Intelligent Information and Engineering Systems. KES 2012, Netherlands, pp. 595–604. IOS Press (2012)
42. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
43. Adriani, M., Nazief, B.: Confix-Stripping: Approach to Stemming Algorithm for Bahasa Indonesia. Internal Publication, Depok (1996)
44. Obadi, G., Dráždilová, P., Hlaváček, L., Martinovič, J., Snášel, V.: A tolerance rough set based overlapping clustering for the DBLP data. In: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and International Conference on Intelligent Agent Technology - Workshops. WI-IAT 2010, vol. 3, pp. 57–60. IEEE (2010)
45. Troester, M.: Big data meets big data analytics. http://www.sas.com/resources/whitepaper/wp_46345.pdf (2012). SAS Institute Inc. Accessed 22 Feb 2013
46. Ingwersen, P.: Information Retrieval Interaction, 1st edn. Taylor Graham, London (1992)
47. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
48. Manola, F., Miller, E.: RDF primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ (2004). W3C. Accessed 12 Jan 2013

Acknowledgments

This work is partially supported by (1) Specific Grant Agreement Number-2008-4950/001-001-MUN-EWC from European Union Erasmus Mundus “External Cooperation Window” EMMA, (2) the National Centre for Research and Development (NCBiR) under Grant No. SP/I/1/77065/10 by the Strategic Scientific Research and Experimental Development Program: “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”, (3) grant from Ministry of Science and Higher Education of the Republic of Poland N N516 077837, and (4) grant from Yayasan Arsari Djojohadikusumo (YAD) based on Addendum Agreement No. 029/C10/UKDW/2012. We thank Faculty of Computer Science, University of Indonesia, for the permission of using the CS stemmer.

Author information

Authors and Affiliations

Informatics Engineering Department, Duta Wacana Christian University, Dr. Wahidin Sudirohusodo 5-25, 55224, Yogyakarta, Indonesia
Gloria Virginia
Institute of Mathematics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland
Hung Son Nguyen

Authors

Gloria Virginia
Hung Son Nguyen

Corresponding author

Correspondence to Gloria Virginia.

Editor information

Editors and Affiliations

Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba, Canada
James F. Peters
University of Warsaw, Warsaw, Poland
Andrzej Skowron
University of Warsaw, Warsaw, Poland
Dominik Ślȩzak
University of Warsaw, Warsaw, Poland
Hung Son Nguyen
University of Rzeszów, Rzeszów, Poland
Jan G. Bazan

Appendices

A Weighting Scheme: The TF*IDF

Salton and Buckley summarised clearly in their paper [47] the insights gained in automatic term weighting and provided baseline single-term-indexing models with which other more elaborate content analysis procedures can be compared. The main function of a term-weighting system is the enhancement of retrieval effectiveness where this result depends crucially on the choice of effective term-weighting systems. Recall and Precision are two measures normally used to assess the ability of a system to retrieve the relevant and reject the non-relevant items of a collection. Considering the trade-off between recall and precision, in practice compromises are normally made by using terms that are broad enough to achieve a reasonable recall level without at the same time producing unreasonably low precision.
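The two measures can be illustrated with a minimal Python sketch. The function name and the toy document identifiers are hypothetical and are not taken from the study; the sketch only shows how precision and recall are computed for a single query from a set of retrieved documents and a set of relevant documents.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy query: 4 documents retrieved, 3 relevant in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Retrieving more documents can only keep or raise recall, but it usually lowers precision, which is the trade-off described above.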

Salton and Buckley further explained that, with regard to the differing recall and precision requirements, three main considerations appear important:

1.
Term frequency (tf). The frequent terms in individual documents appear to be useful as recall-enhancing devices.
2.
Inverse document frequency (idf). The idf factor varies inversely with the number of documents $df_t$ to which a term t is assigned in a collection of N documents. It favors terms concentrated in a few documents of a collection and avoids the effect of high frequency terms which are widespread in the entirety of documents.
3.
Normalisation. Normally, all relevant documents should be treated as equally important for retrieval purposes. The normalisation factor is suggested to equalise the length of the document vectors.

Table A.1. Term-weighting components with SMART notation [39]. Here, $tf_{t,d}$ is the term frequency of term t in document d, N is the size of document collection, $df_t$ is document frequency of term t, $w_i$ is the weight of term t in document i, u is the number of unique terms in document d, and CharLength is the number of characters in the document.

Table A.1 summarises some of the term-weighting schemes together with their mnemonics, sometimes called SMART notation. One example of a mnemonic is lnc.ltc. The first triplet (lnc) represents the weighting combination for the document vector, while the second triplet (ltc) represents the weighting combination for the query vector. Each triplet describes the form of the tf component, the idf component, and the normalisation component being used. Thus, the mnemonic lnc.ltc means that the document vector employs log-weighted term frequency, no idf for the collection component, and cosine normalisation, while the query vector employs log-weighted term frequency, idf weighting for the collection component, and cosine normalisation. Equation A.1 is the common weighting scheme used for a term in a document, i.e. mnemonic ntn, which is called the TF*IDF weighting scheme.

$$\begin{aligned} w_{t,d} = tf \cdot idf = tf_{t,d} \cdot \log \frac{N}{df_t} \end{aligned}$$

(A.1)
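Equation A.1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the study: the function name and the toy token lists are hypothetical, and the base-10 logarithm is an assumption, since Equation A.1 does not fix the base.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute ntn weights (raw tf times idf, no normalisation), as in Eq. A.1.

    docs: list of token lists, one per document.
    Returns a list of {term: weight} dicts, one per document.
    """
    n_docs = len(docs)
    # df_t: number of documents that contain term t at least once
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # tf_{t,d}: occurrences of t in this document
        # Base-10 logarithm is an assumption of this sketch.
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy collection of three tokenised documents.
toy_docs = [
    ["paduan", "suara", "konser"],
    ["paduan", "suara", "lomba", "lomba"],
    ["partitur", "konser"],
]
weights = tfidf_weights(toy_docs)
print(weights[1])
```

In the toy collection, "lomba" occurs twice in one document and nowhere else, so it receives the largest weight, while terms shared by two of the three documents receive a small idf factor.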

B Document Ranking Method: The Cosine Measure

Manning et al. [39] stated that cosine similarity is fundamental to IR systems that use any form of vector space scoring. Given a query vector and a set of document vectors in a high dimensional space, we may rank the documents by comparing the angle between the query vector and each document vector; the smaller the angle, the more similar the vectors. In linear algebra, the angle $\theta $ between two vectors, $\overrightarrow{x}$ and $\overrightarrow{y}$, can be measured as follows:

$$\begin{aligned} \overrightarrow{x} \cdot \overrightarrow{y} = |\overrightarrow{x}| * |\overrightarrow{y}| * \cos (\theta ) \end{aligned}$$

(B.1)

where $\overrightarrow{x} \cdot \overrightarrow{y}$ represents the dot product, while $|\overrightarrow{x}|$ and $|\overrightarrow{y}|$ represent the lengths of the vectors. The dot product $\overrightarrow{x} \cdot \overrightarrow{y}$ of two vectors is defined as $\sum _{j=1}^{M}x_{j} * y_{j}$ and the Euclidean length of a vector $|\overrightarrow{x}|$ is defined as $\sqrt{\sum _{j=1}^{M}(x_{j})^2}$. Thus, formula (B.2) can be used to measure the similarity between a query vector Q and a document vector D:

$$\begin{aligned} similarity(Q, D) = \frac{\sum _{j=1}^{M}w_{qj} * w_{dj} }{\sqrt{ \sum _{j=1}^{M}(w_{qj})^2 * \sum _{j=1}^{M}(w_{dj})^2}} \end{aligned}$$

(B.2)
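Formula (B.2) can be sketched in Python with sparse vectors stored as dictionaries. The function names, the zero-vector convention, and the toy weights are assumptions made for this illustration; they are not taken from the study, which used an inverted index.

```python
import math


def cosine_similarity(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Numerator of (B.2): only terms present in both vectors contribute.
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    sum_q2 = sum(w * w for w in q.values())
    sum_d2 = sum(w * w for w in d.values())
    if sum_q2 == 0.0 or sum_d2 == 0.0:
        return 0.0  # convention chosen in this sketch for empty vectors
    # Denominator of (B.2): square root of the product of the squared sums.
    return dot / math.sqrt(sum_q2 * sum_d2)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    scores = {doc_id: cosine_similarity(query, vec) for doc_id, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy weights.
query = {"paduan": 1.0, "suara": 1.0}
doc_vectors = {
    "d1": {"paduan": 1.0, "suara": 1.0},
    "d2": {"paduan": 1.0, "konser": 1.0},
    "d3": {"partitur": 2.0},
}
ranking = rank(query, doc_vectors)
print(ranking)
```

A document that shares no term with the query scores 0, and a document whose vector points in the same direction as the query scores 1, regardless of vector length.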

C The Corpora

C.1 ICL-Corpus and WORDS-Corpus

Our original corpus, called ICL-corpus, consists of the first 1,000 emails of the Indonesian Choral Lovers (ICL) Yahoo! Groups, formatted according to the Text REtrieval Conference (TREC) format [20]. Our test collection therefore consists of three parts (a set of documents, a set of information needs, and relevance judgments), and all documents are marked up in a TREC-like format: each document is enclosed in <DOC> and </DOC> tags, the document number in <DOCNO> and </DOCNO> tags, the email subject in <SUBJECT> and </SUBJECT> tags, the email date in <DATE> and </DATE> tags, the sender in <FROM> and </FROM> tags, and the text body in <TEXT> and </TEXT> tags.
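A minimal Python sketch of reading such a TREC-like file is given below. The tag names come from the description above; the function name, the sample document, and its field values are invented for illustration and do not come from ICL-corpus.

```python
import re

# Field tags named in the text; the sample content below is invented.
FIELDS = ("DOCNO", "SUBJECT", "DATE", "FROM", "TEXT")


def parse_trec_docs(raw):
    """Yield one {field: value} dict per <DOC>...</DOC> block."""
    for block in re.findall(r"<DOC>(.*?)</DOC>", raw, flags=re.DOTALL):
        doc = {}
        for field in FIELDS:
            match = re.search(rf"<{field}>(.*?)</{field}>", block, flags=re.DOTALL)
            # A missing field is stored as an empty string.
            doc[field.lower()] = match.group(1).strip() if match else ""
        yield doc


sample = """
<DOC>
<DOCNO>ICL-0001</DOCNO>
<SUBJECT>Latihan paduan suara</SUBJECT>
<DATE>2001-01-15</DATE>
<FROM>someone@example.com</FROM>
<TEXT>
Latihan dimulai pukul tujuh malam.
</TEXT>
</DOC>
"""

docs = list(parse_trec_docs(sample))
print(docs[0]["docno"], "|", docs[0]["subject"])
```

Regular expressions are used here because TREC-style files are not guaranteed to be well-formed XML; the sketch assumes that tags are not nested within a field.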

We worked intensively with two choral experts in the annotation process in order to construct the information needs and relevance judgments for our testbed. The annotation process consisted of two tasks: a) topic assignment, where the human experts assigned one or more topics to each document in the original corpus; and b) keyword determination, where they determined the terms considered highly related to the assigned topic(s). The annotation process aimed to capture how topics could be assigned to a particular document, as described mainly by the determined keywords. We use these keywords as the list of terms closely related to the topic of the document, as well as to the document itself, and assume that the terms not listed are less important. Topic assignment yielded 127 topics, and keyword determination yielded a new corpus, called WORDS-corpus.

Consult Fig. C.1 to see the content of both corpora. Notice that the main difference between documents in ICL-corpus and WORDS-corpus lies in the text body: each ICL-corpus document contains the body of an email, while each WORDS-corpus document contains the keywords defined by the human experts. Figure C.2 shows the relationship between both corpora.

Table C.1. List of topics. This is a list of 127 topics of ICL-corpus and the total number (document frequency) of relevant documents for each topic with ID 0 to 35.

As mentioned above, topic assignment yielded 127 topics, many of which have a low document frequency: 81.10% of them have document frequency $<$ 10 and 32.28% have document frequency 1. We further processed the 127 topics, shown in Tables C.1 and C.2, and arrived at the 28 topics listed in Table C.3. Thus, we have two versions of the relevance judgments: a) one that consists of 127 topics; and b) one that consists of 28 topics.

Table C.2. List of topics. This is a list of 127 topics of ICL-corpus and the total number (document frequency) of relevant documents for each topic with ID 36 to 126.

Table C.3. List of topics. This is a list of 28 topics of ICL-corpus and the total number (document frequency) of relevant documents for each topic.

For the 127 topics, the distribution of topics is shown in Table C.4, while the topics with document frequency $\ge $ 10 are listed in Table C.5. In all tables here, the ID column gives the topic identifier, the Topic column gives the topic in Indonesian, and the DF column gives the document frequency, i.e. the total number of relevant documents for the topic.

Table C.4. Topic distribution. This table shows the total number of topics that have document frequency $<$ 10 out of 127 topics.

Table C.5. List of topics. This table presents topics of ICL-corpus with document frequency $\ge $ 10 out of 127 topics.

Following the TREC format, Fig. C.3 is an example of a relevance judgment file, while Fig. C.4 is an example of an information needs file. In the relevance judgment file, the first column defines the topic identifier, the third column defines the document identifier, and the fourth column defines the relevance, i.e. 1 if the document is relevant to the topic and 0 otherwise. The second column is an arbitrary string and carries no information in this case. The information needs file consists of topics (the string between <TITLE> and </TITLE> tags), each with its description (between <DESC> and </DESC> tags) and narrative (between <NARR> and </NARR> tags). It follows the TREC format, in which each topic is enclosed by <TOP> and </TOP> tags.
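The four-column relevance judgment layout described above can be read with a short Python sketch. The function name and the sample lines are hypothetical; only the column meanings are taken from the text.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each topic id to the set of document ids judged relevant.

    Each line has four whitespace-separated columns:
    topic id, an unused string, document id, relevance (1 or 0).
    """
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic_id, _unused, doc_id, relevance = parts
        docs = relevant[topic_id]  # keeps the topic even if nothing is relevant
        if relevance == "1":
            docs.add(doc_id)
    return dict(relevant)


# Hypothetical sample lines in the four-column layout.
sample_qrels = [
    "0 0 ICL-0001 1",
    "0 0 ICL-0002 0",
    "1 0 ICL-0002 1",
]
qrels = parse_qrels(sample_qrels)
print(qrels)
```

The resulting mapping gives, for each topic, the set of relevant documents needed to compute measures such as precision and recall.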

Annotation Process

As mentioned above, the annotation process consisted of two tasks, namely topic assignment and keyword determination, and yielded WORDS-corpus and two lists of topics (127 topics and 28 topics). This was a collaborative work with three choral experts, carried out in four phases as presented in Fig. C.5.

First, the first expert performed topic assignment and keyword determination for the 1,000 documents of ICL-corpus. Considering his work (Result_1), we did the same and arrived at a different result (Result_2). Based on Result_1 and Result_2, the first expert revised his previous result and produced a new result (Result_3). The second expert then made a decision (Result_4) by analyzing Result_1, Result_2, and Result_3.

At this stage, we had 127 topics and decided to shorten the list by categorizing it. We analyzed those topics and agreed on 28 topics. With reference to the 28 topics, the third expert reassigned each document of ICL-corpus.

Besides the construction process, there is another main difference between our corpus and the Indonesian corpus built by Jelita Asian from Kompas newswire articles (Kompas-corpus; see Note 42). In ICL-corpus, each document must be assigned at least one topic, whereas in Kompas-corpus this is not the case, i.e. there are documents that are not assigned to any topic.

C.2 WIKI_1800

WIKI_1800 is a corpus of 1,800 text documents in the music domain, which are short abstracts of Indonesian Wikipedia articles (see Note 43). The full version of the corpus consists of 85,601 short abstracts on a variety of topics and was downloaded from DBpedia (see Note 44). The WIKI_1800 corpus employed in this study was obtained by filtering the 85,610 abstracts for the music domain, a task conducted by our third expert.

Figure C.6 shows a small chunk of a WIKI_1800 document. Each document is represented in RDF triple notation, which contains three components (subject, predicate, and object), plus the URL of the Web page. In Fig. C.6, <http://dbpedia.org/resource/Indonesia_Raya>, which acts as the subject, is a URI reference to the resource Indonesia Raya. <http://www.w3.org/2000/01/rdf-schema#comment> (rdfs:comment for short), which acts as the predicate, is a URI reference to the property used to provide a human-readable description of a resource; R rdfs:comment L states that L is a human-readable description of R [48]. Therefore, the string inside the quotes following rdfs:comment is the human-readable description of Indonesia Raya, which is the short abstract of the Indonesia Raya article. Finally, <http://id.wikipedia.org/wiki/Indonesia_Raya#> is the URL of the Web page of Indonesia Raya.
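As an illustration, one such line can be split into its components with a regular expression. The exact line layout assumed here (subject, predicate, a quoted literal with an optional language tag, the page URL, and a closing dot) is an assumption based on the description above, and the abstract text in the sample is a placeholder rather than the content of Fig. C.6.

```python
import re

# Assumed layout: <subject> <predicate> "literal"@lang <page-url> .
LINE = re.compile(
    r'^<(?P<subject>[^>]+)>\s+'
    r'<(?P<predicate>[^>]+)>\s+'
    r'"(?P<object>(?:[^"\\]|\\.)*)"(?:@(?P<lang>[A-Za-z-]+))?\s+'
    r'<(?P<url>[^>]+)>\s*\.\s*$'
)


def parse_abstract_line(line):
    """Return the named parts of one line, or None if it does not match."""
    match = LINE.match(line.strip())
    return match.groupdict() if match else None


# Placeholder abstract text; the URIs are those quoted in the text.
sample_line = (
    '<http://dbpedia.org/resource/Indonesia_Raya> '
    '<http://www.w3.org/2000/01/rdf-schema#comment> '
    '"Indonesia Raya adalah lagu kebangsaan Republik Indonesia."@id '
    '<http://id.wikipedia.org/wiki/Indonesia_Raya#> .'
)
record = parse_abstract_line(sample_line)
print(record["subject"])
print(record["object"])
```

For production use, a dedicated RDF parser would be a safer choice than a hand-written pattern, since literals may contain escapes and datatypes that this sketch does not decode.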

C.3 The Choral Experts

In the data preparation for our study, we collaborated with three people who have years of choral experience: Agastya Rama Listya, Kristoforus Kuntarahadi, and Inke Kusumastuti. In Appendix C.1, we called them the first expert, second expert, and third expert respectively. Figure C.7 shows their pictures.

Agastya Rama Listya was born in Yogyakarta on February 18, 1968, and now lives in Salatiga, Central Java, Indonesia. He obtained his Bachelor of Arts in Theory and Music Composition from the Indonesian Arts Institute, Yogyakarta, Indonesia, in 1992. In 2001, he received his Master of Sacred Music in Choral Conducting from Luther Seminary and St. Olaf College, Minnesota, USA. He was the Dean of the Faculty of Performing Arts, Satya Wacana Christian University at Salatiga, for two periods (2009–2011) and was a committee member of Badan Kerjasama Gereja-Gereja se-Salatiga (2007–2010), Lembaga Pengembangan Pesparawi Daerah Jawa Tengah (2007–2010), and Badan Pembina Seni Mahasiswa Indonesia Jawa Tengah (2008–2010). Agastya has published 7 books, 6 journal articles, and 16 essays. He is a productive music composer and arranger, and many of his choral works have been performed by numerous choirs in Indonesia. He is also an active choral coach of a number of choirs, which under his direction have made prominent achievements regionally, nationally, and internationally. Individually, he was the winner of 4 different national choral composition contests during 1998–2009 and the winner of the Yazeed Djamin Award for Piano Composition Contest in 2006. Agastya Rama Listya’s name was included in the 30th Pearl Anniversary of Marquis Who’s Who in The World (November 2012).

Kristoforus Kuntarahadi was born in Yogyakarta on January 14, 1979. He is now a staff member in the office of the Bishops' Conference of Indonesia in Jakarta. He was a student of several well-known Indonesian vocalists and choristers, namely Avip Priatna, Lucia Kusumawardhani, Yoseph Chang, and Tommy Prabowo. He has been an active singer in several choirs since 1990, including the famous Indonesian choir Batavia Madrigal Singers in Jakarta, and has performed as a tenor soloist in several concerts. He earned several achievements at regional singing festivals during 1993–1997. Nationally, as a classical singer, he was the runner-up of Bintang Radio dan Televisi (a national radio and television singing competition) in 1995 and the third-prize winner of PEKSIMINAS V (a national singing competition for students) in 1999. He received an award from the Governor of Yogyakarta as an outstanding vocal artist in 1997.

Table C.6. List of topics. This is a list of 20 topics of Kompas-corpus and the document frequency, DF, of relevant documents for each topic.

Inke Kusumastuti is a medical doctor who is currently continuing her education in Psychiatry at Udayana University, Denpasar, Bali. She was born in Blitar on April 17, 1986. She has not received any formal education in music, but she is a motivated self-learner when it comes to singing. She won numerous prizes in individual regional singing contests while she was in elementary school (1992–2001). In 2001–2004 she was the vocalist of a band that won several regional competitions. In 2003, she worked as a cafe singer for a year. After that, while pursuing her medical education, she was an active soprano in several choirs, including the Eternal Choir, a well-known semi-professional small choir in Yogyakarta. As a chorister, she took part in numerous concerts and choral competitions and earned several achievements. In 2007, she attended a conducting workshop given by Andrew deQuadros at the First Asian Choir Games and, in 2010, she joined a choral clinic given by Marc Anthony Carpio, a choirmaster of the Philippine Madrigal Singers. Recently, in 2012, she won the third prize in Bintang Radio RRI Jember (a singing contest conducted by the national radio of Indonesia at Jember). Her favorite artist is The Real Group, a world-acclaimed Swedish a cappella group, which has significantly shaped her current musical interests, and her dream is to be able to employ music as part of therapy for people with mental disorders.

1.4 C.4 Kompas-Corpus

Kompas-corpus [11] is a set of newswire articles collected from the well-known Indonesian newspaper Kompas^{Footnote 45}, published between January and June 2002. It consists of 3,000 documents constructed following the TREC format, and is therefore accompanied by a file of information needs and a file of relevance judgments. There are 20 topics, chosen by a native speaker after reading each document, that represent user information needs. Those topics are listed in Table C.6 together with the total number of relevant documents for each topic. Out of the 3,000 documents, only 433 are assigned one or more topics.
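As a rough illustration of how the per-topic counts in Table C.6 relate to a relevance-judgment file, the following Python sketch counts relevant documents per topic. It assumes the judgments use the common four-column TREC qrels layout (topic, iteration, document identifier, judgment); the source states only that the corpus follows the TREC format, so this layout, the function name `relevant_docs_per_topic`, and the toy document identifiers are assumptions rather than details of Kompas-corpus.

```python
from collections import defaultdict


def relevant_docs_per_topic(qrels_lines):
    """Map each topic id to the set of documents judged relevant for it."""
    relevant = defaultdict(set)
    for line in qrels_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, judgment = fields
        if int(judgment) > 0:
            relevant[topic].add(docno)
    return dict(relevant)


# Toy judgments with made-up identifiers, not taken from Kompas-corpus.
toy_qrels = [
    "1 0 DOC-001 1",
    "1 0 DOC-002 0",
    "1 0 DOC-003 1",
    "2 0 DOC-003 1",
]

by_topic = relevant_docs_per_topic(toy_qrels)

# Document frequency (DF) of relevant documents for each topic.
df = {topic: len(docs) for topic, docs in by_topic.items()}

# Distinct documents assigned at least one topic.
docs_with_topic = set().union(*by_topic.values())
```

Because a document may be relevant to more than one topic, the per-topic counts can sum to more than the number of distinct documents with a topic, which is why the two quantities are computed separately.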

Cite this chapter

Virginia, G., Nguyen, H.S. (2015). A Semantic Text Retrieval for Indonesian Using Tolerance Rough Sets Models. In: Peters, J., Skowron, A., Ślȩzak, D., Nguyen, H., Bazan, J. (eds) Transactions on Rough Sets XIX. Lecture Notes in Computer Science, vol 8988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47815-8_9

DOI: https://doi.org/10.1007/978-3-662-47815-8_9
Published: 05 July 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-47814-1
Online ISBN: 978-3-662-47815-8
eBook Packages: Computer Science, Computer Science (R0)