Disambiguation of partial cognates
- Cite this article as:
- Frunza, O. & Inkpen, D. Lang Resources & Evaluation (2008) 42: 325. doi:10.1007/s10579-008-9072-x
Partial cognates are pairs of words in two languages that have the same meaning in some, but not all contexts. Detecting the actual meaning of a partial cognate in context can be useful for Machine Translation tools and for Computer-Assisted Language Learning tools. We propose a supervised and a semi-supervised method to disambiguate partial cognates between two languages: French and English. The methods use only automatically-labeled data; therefore they can be applied to other pairs of languages as well. The aim of our work is to automatically detect the meaning of a French partial cognate word in a specific context.
Keywords: Partial cognates; Word sense disambiguation; Monolingual bootstrapping; Bilingual bootstrapping
Cognates—words that have similar spelling and meaning in two or more languages—can accelerate vocabulary acquisition and facilitate the reading comprehension task. A student has to pay attention to the pairs of words that look and sound similar but have different meanings—false friends, and especially to pairs of words that share meanings in some but not all contexts—partial cognates.
Our goal is to present a method to disambiguate partial cognates between French and English. This task can be seen as a coarse-grained cross-language Word-Sense Disambiguation (WSD) task. A lot of work has been done on monolingual WSD systems that use supervised and unsupervised methods and report good results on Senseval data, but there is less work on cross-language WSD.
Although French and English belong to different branches of the Indo-European family of languages, their vocabularies share a great number of similarities due to the geographical, historical, and cultural contact over many centuries. Most of these borrowings have changed their orthography and most likely their meaning as well.
Second language learners of French, native speakers of English, can be assisted by a partial-cognate disambiguation system during the learning process. Claims that false friends can be a hindrance in second language learning are supported by Carroll (1992). She suggested that a cognate pairing process between two words that look alike happens faster in the learner’s mind than a false-friend pairing. Experiments with second language learners of different stages conducted by Heuven et al. (1998) suggest that missing false-friend recognition can be corrected when cross-language activation is used.
Besides second language learning, Machine Translation (MT) systems can also benefit from extra information when translating a certain word in context. Knowing if a French word is a cognate or a false friend with an English word can improve translation results. Cross-Language Information Retrieval systems can also use the knowledge of the sense of certain words in a query.
We describe a supervised and a semi-supervised method to discriminate the senses of a partial cognate in a French text (according to its English cognate or false-friend sense). The methods are based on Machine Learning (ML) techniques. The semi-supervised method uses a monolingual and bilingual bootstrapping technique. We use parallel corpora to automatically create training data/seeds for the bootstrapping techniques. Our methods are independent of the language pair at hand; they can be applied to any pair of languages for which a parallel corpus, two monolingual text collections, and an MT system are available.
2 Related work
Previous work on automatic cognate identification is mostly related to bilingual corpora and translation lexicons (Simard et al. 1992). Brew and McKelvie (1996) extracted French-English cognates and false friends from aligned bitexts using simple orthographic similarity measures. Kondrak (2001) identified cognates between various pairs of languages, paying attention to phonetic aspects, especially for genetic cognates—words in related languages that derive directly from the same word in the ancestor (proto)-language.
For French and English, substantial work on cognate detection was done manually. LeBlanc and Séguin (1996) concluded that cognates appear to make up over 30% of the French vocabulary. Inkpen et al. (2005) looked at different combinations of orthographic similarity measures using ML techniques to identify cognates and false friends between French and English.
From the wealth of publications on WSD we have chosen to briefly discuss only those that are related to our work. Determining the sense of an ambiguous word using bootstrapping and texts from a different language was done by Yarowsky (1995), Hearst (1991), Diab and Resnik (2001), and Li and Li (2004). Yarowsky (1995) used a few seeds and untagged sentences in a bootstrapping algorithm based on decision lists. He added two constraints: words tend to have one sense per discourse and one sense per collocation. Hearst (1991) used monolingual bootstrapping with a small set of hand-labeled data as seeds and a larger unlabeled corpus for training a noun disambiguation system for English. Diab and Resnik (2001) used cross-language lexicalization for an English monolingual unsupervised WSD system.
Our approach differs from the ones mentioned above in three ways: our technique uses the whole sentences from the parallel text, not only the target words (the translations of certain English words) as in Diab and Resnik (2001); our focus is not only on nouns as in Hearst (1991); and we look at words that have closely related senses, not only at words with distinct senses as in Li and Li (2004) and Yarowsky (1995).
Our task, disambiguating partial cognates between two languages, is new. It differs from the Word Translation Disambiguation task because we do not treat each translation as a different sense of a target word (two or more possible translations can have the same meaning). We perform a coarse-grained cross-lingual disambiguation into two senses: cognate and false friend. We use automatically-collected training data, which eliminates the costly effort of manual annotation, together with off-the-shelf ML and MT tools and existing parallel corpora.
3 Data for partial cognate disambiguation
French partial cognate; English cognate; English false-friend translations
blanc; blank; white, livid
circulation; circulation; traffic
client; client; customer, patron, patient, spectator, user, shopper
corps; corps; body, corpse
détail; detail; retail
mode; mode; fashion, trend, style, vogue
note; note; mark, grade, bill, check, account
police; police; policy, insurance, font, face
responsable; responsible; in charge, responsible party, official, representative, person in charge, executive, officer
route; route; road, roadside
Fr (PC:COG) Je note, par exemple, que l’accusé a fait une autre déclaration très incriminante à Hall environ deux mois plus tard.
En (COG) I note, for instance, that he made another highly incriminating statement to Hall two months later.
Fr (PC:FF) S’il gèle les gens ne sont pas capables de régler leur note de chauffage.
En (FF) If there is a hard frost, people are unable to pay their bills.
Table: Number of parallel sentences used as seeds
Because our goal is to disambiguate partial cognates in general, not only in the particular domain of Hansard and EuroParl, we created and experimented with another set of automatically extracted and labeled sentences from a 1.5-million-word multi-domain parallel corpus of magazine articles, modern fiction, texts from international organizations, and academic textbooks (we will call this corpus MDC). The number of extracted parallel sentences for the two senses varied from zero to a maximum of 288.
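The automatic labeling of parallel sentences can be illustrated with a minimal sketch. The rule used here is an assumption inferred from the examples above: a French sentence containing the partial cognate is labeled COG when its English translation contains the cognate, FF when it contains one of the false-friend translations, and is skipped otherwise. The names `NOTE`, `tokenize`, and `label_pair` are hypothetical, and the plural handling is deliberately naive.

```python
import re

# Hypothetical lexicon entry for one partial cognate, taken from the list in
# this section: French word, English cognate, English false-friend translations.
NOTE = {
    "french": "note",
    "cognate": "note",
    "false_friends": ["mark", "grade", "bill", "check", "account"],
}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def label_pair(french, english, entry):
    """Return 'COG', 'FF', or None for one aligned sentence pair.

    Assumed rule: the French side must contain the partial cognate. The pair
    is COG if the English side contains the cognate, FF if it contains a
    false-friend translation, and is skipped if it contains neither or both.
    """
    if entry["french"] not in tokenize(french):
        return None
    en_tokens = tokenize(english)
    has_cog = entry["cognate"] in en_tokens
    # Naive plural handling: also accept the translation followed by "s".
    has_ff = any(w in en_tokens or w + "s" in en_tokens
                 for w in entry["false_friends"])
    if has_cog == has_ff:
        return None  # ambiguous or uninformative pair
    return "COG" if has_cog else "FF"
```

This sketch matches single-word translations only; multi-word false-friend translations such as "in charge" would need phrase matching.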
In this section we describe our supervised and semi-supervised methods. The goal is to determine which of the two senses (cognate or false friend) of a partial cognate is present in a French test sentence. The classes into which we classify a sentence are therefore COG (cognate) and FF (false friend).
4.1 Supervised method
For both the supervised and semi-supervised methods we used the bag-of-words (BOW) approach to modeling context, with binary values for the features. The features are words from the training corpus that appeared at least 3 times after removing the stopwords. We also ran experiments in which we kept the stopwords as features, but the results did not improve.
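A minimal sketch of this feature extraction, assuming a simple regular-expression tokenizer and a tiny illustrative stopword list (neither is specified in the text, and the function names are hypothetical):

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"le", "la", "les", "de", "des", "un", "une", "et", "à", "que"}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def build_vocabulary(sentences, min_count=3, stopwords=STOPWORDS):
    """Keep non-stopword tokens that occur at least min_count times in training."""
    counts = Counter(
        tok for s in sentences for tok in tokenize(s) if tok not in stopwords
    )
    return sorted(w for w, c in counts.items() if c >= min_count)


def to_binary_vector(sentence, vocabulary):
    """Binary bag-of-words: 1 if the vocabulary word occurs in the sentence."""
    present = set(tokenize(sentence))
    return [1 if w in present else 0 for w in vocabulary]
```

The vocabulary is built from the training sentences only, so a test sentence is represented by the presence or absence of training words.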
As a baseline for the experiments that we present we used the ZeroR classifier from WEKA, which predicts the class that is the most frequent in the training corpus. The classifier for which we report results is Naïve Bayes with a kernel estimator (NB-K). We performed experiments with other classifiers as well, with no better results. The supervised method consists of training the classifiers on the automatically-collected training seed sentences for each partial cognate and then testing their performance on the test set.
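The comparison between the baseline and a Naïve Bayes classifier can be sketched in plain Python. This is not WEKA's implementation: the `BernoulliNB` class below is a simple stand-in with add-one smoothing over binary features, used here in place of NB-K, and all names are hypothetical.

```python
import math
from collections import Counter


def zero_r(train_labels):
    """ZeroR baseline: always predict the most frequent training class."""
    return Counter(train_labels).most_common(1)[0][0]


class BernoulliNB:
    """Naive Bayes over binary features with add-one smoothing (a stand-in
    for the NB-K classifier named in the text, not the same model)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.p_feature = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(y))
            # Smoothed probability that feature j equals 1 given class c.
            self.p_feature[c] = [
                (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)
            ]
        return self

    def predict_proba(self, x):
        log_scores = {}
        for c in self.classes:
            s = self.log_prior[c]
            for xj, p in zip(x, self.p_feature[c]):
                s += math.log(p if xj else 1 - p)
            log_scores[c] = s
        # Normalize in a numerically stable way.
        m = max(log_scores.values())
        exp = {c: math.exp(s - m) for c, s in log_scores.items()}
        total = sum(exp.values())
        return {c: v / total for c, v in exp.items()}

    def predict(self, x):
        probs = self.predict_proba(x)
        return max(probs, key=probs.get)


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

The baseline accuracy is obtained by predicting `zero_r(train_labels)` for every test sentence and passing the result to `accuracy`.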
4.2 Semi-supervised method
For the semi-supervised method we add unlabeled examples, an average of 200 sentences for each of the senses, from monolingual corpora: the French newspaper Le Monde 1994, 1995 (LM), and the BNC corpus. These corpora come from different domains than the seeds. The procedure for adding and using this unlabeled data is described below.
4.3 Monolingual bootstrapping
The monolingual bootstrapping algorithm used on French sentences (MB-F) and on English sentences (MB-E) is:
1. Train a classifier on the training seeds.
2. Apply the classifier to unlabeled data: sentences that contain the French word from the partial-cognate pair, extracted from Le Monde (MB-F), or the English word, extracted from the BNC (MB-E).
3. Take the first few newly classified sentences from both the COG and FF classes and add them to the training seeds.
4. Rerun the experiments, training on the new training set.
For the first step of the algorithm we used the NB-K classifier because it consistently performed better than the others. We tried the method both with and without attribute selection on the features and obtained better results with attribute selection. This sub-step was performed with the WEKA tool; Chi-Square attribute selection was chosen because it is commonly used for text processing tasks. In the second step of the MB algorithm, the classifier trained on the training seeds was used to classify the unlabeled data collected from the two additional resources. For MB on the French side, we trained the classifier on the French side of the training seeds and then applied it to the sentences extracted from Le Monde that contained the French word of the partial cognate pair. The same approach was used for MB on the English side, except that we trained the classifier on the English side of the training seeds and extracted new examples from the BNC corpus. In fact, the MB-E step is needed only for the BB method. Only the sentences that were classified with a probability greater than 0.85 (an experimentally chosen value) were selected for the bootstrapping algorithm.
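One round of the monolingual bootstrapping procedure can be sketched as follows. The classifier is passed in as two callables, `train` and `predict_proba`, which are hypothetical interfaces rather than WEKA calls. The 0.85 threshold comes from the text; the `per_class` cap is an assumed value for "the first few" sentences, and attribute selection is omitted.

```python
def monolingual_bootstrap(seeds, unlabeled, train, predict_proba,
                          threshold=0.85, per_class=5):
    """Run one round of monolingual bootstrapping.

    seeds: list of (sentence, label) pairs with labels 'COG' or 'FF'.
    unlabeled: list of sentences containing the target word.
    train(labeled) -> model and predict_proba(model, sentence) -> {label: prob}
    are hypothetical classifier interfaces supplied by the caller.
    """
    # Step 1: train on the seeds.
    model = train(seeds)

    # Step 2: classify the unlabeled sentences and keep the confident ones.
    confident = {}
    for sentence in unlabeled:
        probs = predict_proba(model, sentence)
        label = max(probs, key=probs.get)
        if probs[label] > threshold:
            confident.setdefault(label, []).append(sentence)

    # Step 3: add the first few confident sentences of each class to the seeds.
    augmented = list(seeds)
    for label, sentences in confident.items():
        augmented += [(s, label) for s in sentences[:per_class]]

    # Step 4: retrain on the enlarged training set.
    return train(augmented), augmented
```

The same function covers MB-F and MB-E: only the language of the seeds and of the unlabeled sentences changes.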
4.4 Bilingual bootstrapping
1. Translate the English sentences that were collected in the MB-E step into French using an online MT tool and add them to the French training data.
2. Execute the MB-F step (in order to re-train the classifier on the new labeled data and the original seeds).
The BB algorithm uses the sentences that were selected in the MB-E experiments as a new source of knowledge. It has been shown (Li and Li 2004) that two languages are more informative than one, and since the task that we need to solve is similar to cross-language word sense disambiguation, the idea of using knowledge from English was straightforward. The MT tool, even with potential translation errors, adds useful information to our classification task.
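The two BB steps can be sketched as a small wrapper. Both callables are hypothetical: `translate` stands in for whatever MT tool is used (no real API is implied), and `run_mb_f` stands in for the MB-F step applied to a labeled French training set.

```python
def bilingual_bootstrap(french_seeds, english_selected, translate, run_mb_f):
    """Sketch of bilingual bootstrapping under the assumptions stated above.

    french_seeds: list of (french_sentence, label) pairs.
    english_selected: (english_sentence, label) pairs kept by the MB-E step.
    translate: hypothetical callable, English sentence -> French sentence.
    run_mb_f: hypothetical callable that runs MB-F on a labeled French set.
    """
    # Step 1: translate the selected English sentences, keeping their labels,
    # and add them to the French training data.
    translated = [(translate(s), label) for s, label in english_selected]
    french_training = list(french_seeds) + translated

    # Step 2: run MB-F on the original seeds plus the translated sentences.
    return run_mb_f(french_training)
```

The labels assigned on the English side are carried over unchanged, so translation errors affect only the features, not the class of each added sentence.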
5 Evaluation and results
Table: Results for the supervised method (SM), monolingual bootstrapping (MB), and bilingual bootstrapping (BB) methods on the initial test set data and on the multi-domain corpus
5.1 Discussion of the results
The results of the experiments show that we can successfully learn from unlabeled data, and that the noise introduced by the automatic seed collection is tolerated by the ML techniques that we use. The supervised method improves over the baseline by 20% on the test set and by 15% on the MDC corpus.
The BB method improved the results of the NB-K classifier by 3.24% compared with the supervised method (no bootstrapping) when we tested only on the test set, the one that represents 1/3 of the initially-collected parallel sentences. BB with NB-K brought an improvement of 1.95% over no bootstrapping when tested on the multi-domain corpus (the AVERAGE_MDC row). According to a t-test, this improvement is statistically significant.
For some experiments MB did better, and for others BB improved performance more. For some combinations of experiments (additional experiments in which the multi-domain corpus was also part of the training data, and experiments in which we combined the two semi-supervised methods), MB together with BB worked best. The semi-supervised methods always improved over the supervised method. This observation also holds for the experiments with different combinations of training and test sets that we conducted for our task.
Another positive aspect of our experiments is that the number of features extracted from the seeds more than doubled in each MB and BB experiment. This shows that even though we started with seeds from a restricted domain, the method is able to capture knowledge from different domains as well. Besides the change in the number of features, the domain of the features also changed from the parliamentary one to other, more general ones, which suggests that the method will be able to disambiguate sentences where the partial cognates cover different types of context.
Unlike previous work that has been done with monolingual or bilingual bootstrapping, we tried to disambiguate not only words that have senses that are very different, e.g., plant with a sense of biological plant or with the sense of factory. In our set of partial cognates the French word route is a difficult word to disambiguate even for humans: it has a cognate sense when it refers to a maritime or trade route and a false-friend sense when it is used as road. The same observation applies to client (the cognate sense is client, and the false-friend sense is customer, patron, or patient) and to circulation (cognate in air or blood circulation, false friend in street traffic).
6 Conclusion and future work
We showed that with simple methods and available tools we can achieve good results in the task of partial cognate disambiguation. The accuracy might be increased by using dependency relations, lemmatization, part-of-speech tagging (to extract sentences where the partial cognate has the same POS in both languages), and other types of data representation combined with other semantic tools. In future work we plan to try different representations of the data, to use knowledge of the relations that exist between the partial cognate and the context words, and to run experiments in which we iterate the MB and BB steps more than once.
The MDC corpus was provided by Prof. Raphael Salkie, Brighton University, UK.