1 Introduction

1.1 Background

Globalization has caused large-scale human migration across borders and thus increased the demand for multilingual communication in intercultural cities. Although multilingual communication support is one of the most significant applications in smart cities, it is difficult to customize such applications for each user activity because the language resources that serve as service components are fragmented and distributed and do not provide a common access method. To address these challenges, multilingual service platforms such as the Language Grid and the European Language Grid have been constructed. These platforms allow users to share language resources, combine them, and integrate them into their applications.

However, the platforms mainly focus on official languages and do not sufficiently support low-resource languages, especially ethnic languages. There are more than 7,000 languages around the world, one-third of which are spoken in Asia [36]. The linguistic diversity in Asia is greater than that in Europe, and many multiethnic countries are located in Asia. For example, Indonesia, a typical multiethnic country, is said to have almost 700 ethnic languages, most of which lack digital language resources and face digital extinction. Therefore, a multilingual service platform is required as a unifying umbrella [29] and is expected to support multilingual communication between ethnic groups in local cities as well as between global citizens in intercultural cities.

1.2 Approach

This monograph aims to construct comprehensive language resources in low-resource languages by combining crowdsourced human tasks and machine induction methods. The crowdsourced tasks create new language resources, while the machine induction augments the created language resources. To seamlessly integrate these two components, we regard each as a service and propose a dynamic service composition method that can address the uncertainties arising in each service invocation. Existing service composition methods are classified into two types: vertical service composition, which achieves a user's goal by combining the functional requirements of services, and horizontal service composition, which selects the best combination of services to execute a given plan while considering non-functional requirements. Unlike these methods, the proposed method optimizes the total cost by choosing the next service invocation from crowdsourced human services and machine induction services according to the results of the previous invocation as well as functional and non-functional requirements.

This section outlines the following steps for applying a collaboration of crowdsourced human services and machine induction services to the construction of multilingual services in low-resource languages.

Firstly, to improve the accuracy of the crowdsourced human services, we establish a crowdsourced workflow that makes crowdsourcing services highly reliable. Quality assurance of crowdsourcing is a significant issue in an environment where a variable number of workers participate. In particular, in tasks that create language resources in low-resource languages, it is difficult to secure enough highly reliable workers because few workers are bilingual in low-resource language pairs. Therefore, a crowdsourced workflow is necessary that can identify a small number of highly reliable workers and preferentially allocate the creation tasks to them [4].

Secondly, to inductively create a new bilingual dictionary from two crowdsourced bilingual dictionaries, we adopt a pivot-based approach. This approach constructs a graph that connects the two bilingual dictionaries via a pivot language and identifies correct translation pairs from this graph. We formalize the identification task as a weighted MaxSAT problem. To ensure accuracy, we introduce semantic constraints based on language similarity. By solving this problem, we improve recall while maintaining the precision achieved by the existing inverse consultation method [21, 22]. Furthermore, to augment translation pairs in a bilingual dictionary, we employ a neural network-based approach that learns transformation rules between source words and target words [31, 32]. The learned rules are applied to translate a list of source words into the target language.

Finally, to achieve comprehensive coverage of bilingual dictionaries for closely related languages while minimizing total costs, it is essential to select the most suitable language pairs. This selection process entails a sequence of decisions, each with uncertainty. This uncertainty arises from the variability in dictionary induction accuracy and the size of the generated dictionary, both of which depend on language similarity and the size of pre-existing dictionaries. Therefore, we formalize the planning phase as a Markov Decision Process, enabling the generation of optimal plans [23,24,25].

1.3 Structure of This Chapter

Section 2 briefly introduces the background of multilingual service platforms for Asia and Europe. This section also discusses the requirements for each multilingual service platform by comparing Europe, which focuses on multilingualism and where people move across borders, with Asia, which has a large number of ethnic languages, most of which face digital extinction. Asia, the focus of this chapter, requires a platform that involves various ethnic groups in collaboratively creating language resources.

Section 3 designs a human–machine service composition for multilingual service creation. The composite service creates translation pairs as seed data by iterating creation and evaluation tasks performed by crowdsourced services. Then, AI services are applied to induce new translation pairs, followed by evaluation with crowdsourced human services. This section also explains a crowdsourcing platform for collaboratively creating and evaluating translation pairs. This platform allows speakers of low-resource languages to collaboratively create and evaluate bilingual dictionaries between low-resource languages, for which it is difficult to recruit bilingual workers.

Section 4 establishes a crowdsourced workflow to realize reliable crowdsourced human services. Even in an environment with a small number of highly reliable workers, this workflow can aggregate evaluation results more accurately by utilizing hyper-questions, each a set of single questions. Moreover, by scoring the reliability of workers based on the evaluation results, this workflow preferentially assigns more tasks to reliable workers to improve the quality of the created language resources. We conduct experiments on simulated data to validate the workflow. The experimental results show that the workflow achieves higher accuracy than other methods, regardless of the ratio of highly reliable workers.

Section 5 presents two types of AI-based services to augment language resources. One is a pivot-based method to induce translation pairs. This method creates a new bilingual dictionary between closely related languages by combining two bilingual dictionaries that share a pivot language. To increase recall compared to the existing pivot-based approach, the method is generalized to obtain translation pairs by relaxing constraints and implementing iterative induction, in which each cycle of induction builds on the previous induction results. The generalized method (64% average F-score) substantially outperforms the existing method (41% average F-score). The other is a neural network-based method to acquire spelling transformation rules between closely related languages. This method employs a two-layer Bi-LSTM encoder and an LSTM decoder and compares character-based tokenization with BPE-based tokenization. Experimental results show that both tokenizations achieve almost 80% precision in generating translation pairs between Indonesian and Minangkabau.

Section 6 proposes a dynamic service composition with a Markov Decision Process. Manual creation of translation pairs needs to complement machine creation because low-resource languages do not have enough source dictionaries for machine creation alone. To optimally combine AI-based creation services and crowdsourced human creation services, the composition process is modeled as a Markov Decision Process (MDP) to minimize the total cost. We conducted a real experiment to create bilingual dictionaries with a minimum size threshold of 2,000 translation pairs between all combinations of five Indonesian ethnic languages: Indonesian, Malay, Minangkabau, Javanese, and Sundanese. The experimental results show that the proposed planning method achieves a 42% cost reduction compared to an all-investment plan and is reliable: the actual total cost matched the estimated total cost with 97% agreement.

Section 7 concludes this chapter by summarizing the results obtained through the Indonesia Language Sphere project. We also address the prospect of future research about multilingual service platforms.

2 Multilingual Service Platform for Smart Cities

In smart cities, multilingual communication support is one of the most significant applications, especially given the increased cross-border mobility resulting from globalization. To develop multilingual services, including multilingual communication support, we need a multilingual service platform that facilitates the integration of fragmented language resources. Therefore, we developed the Language Grid, a multilingual service platform for supporting intercultural collaboration [11]. This platform enables users to share various language services and combine them to create new language services customized for each user. In 2007, we initiated the operation of an experimental infrastructure to accumulate and share language resources as Web services.

In the 3-year operation of the Language Grid, we encountered difficulty in reaching service providers in other countries due to the barriers posed by geographical separation. This language locality motivated us to launch new service grids in other countries. To address the language resource bias, we designed a federated operation of the Language Grid, in which globally dispersed grid operators run local grids and facilitate service interoperability among them. Furthermore, we extended our grid architecture for interconnectivity between these local grids. This federated approach forms a network of operation centers covering various Asian languages. Operation centers were opened in Bangkok in 2010, Jakarta in 2011, and Urumqi in 2014, and they have connected with our center to share a variety of services in Asian languages [11]. For instance, through our federation with Bangkok, 14 Asian WordNets are now accessible, while Jakarta and Urumqi contribute language services for the Indonesian and Turkic language families, respectively. Currently, the Language Grid has 183 participating groups from 24 countries, collectively sharing 226 language services.

In Europe, the European Language Grid has been under construction since 2019 [30]. The European Language Grid is a scalable cloud platform that provides access to hundreds of commercial and non-commercial language resources for all European languages and aims to be the primary platform and marketplace for language resources in Europe. It harvests all relevant language resource repositories, such as META-SHARE [27] and ELRC-SHARE [15, 28], collects metadata about their resources, and makes them available through the European Language Grid to increase the visibility of language resources. The European Language Grid now provides access to more than 14,000 commercial and non-commercial language resources.

These multilingual service platforms allow users to develop multilingual communication support services in smart cities. A multilingual medical reception support system called \(M^3\) and SmartClassroom, which connects classrooms in Japan and China, were constructed with the Language Grid [18, 37], and a personal assistant named YouTwinDi that supports interaction with European citizens was developed with the European Language Grid [40]. Provisioning a development environment for new language tools that integrate the existing language resources fragmented among countries is one of the main purposes of the multilingual service platforms. On the other hand, in Asia, where more ethnic languages exist within a country than in Europe, the platforms are also required to sustainably create comprehensive language resources in various languages while involving citizens in the creation process. In a multiethnic country, how to support communication between different ethnic groups is a significant issue in local cities where ethnic languages are usually spoken.

Although the Language Grid and the European Language Grid have enhanced language service sharing and expanded language coverage, challenges persist in generating language services in low-resource languages. According to the data from LREMap in 2016, 1,999 out of 5,758 entries related to English (approximately 34%, compared to 2% in 2012). This is followed by French (440 resources), German (403), Spanish (294), Chinese (218), and Japanese (196). In contrast, language resources in Indonesian are limited, with only 13 resources, and Malay has a mere 3 resources [2, 8]. Notably, Indonesian ethnic languages, even Javanese and Sundanese, each with over 30 million speakers, have seen no resources submitted to the top conferences related to language resources. Figure 1 shows the statistics of language resources in LREMap by language. The left vertical axis represents the number of language resources, while the right vertical axis indicates the cumulative percentage of speakers. Speakers of the 11 languages with over 100 resources each account for 54% of the world's population, meaning the remaining speakers are not supported by adequate language resources. Therefore, we need technology to create language resources that is not limited to specific languages. In particular, to preserve and increase the use of Indonesian ethnic languages, we started the Indonesia Language Sphere project in 2015. The purpose of this project is to develop comprehensive sets of bilingual dictionaries among Indonesian ethnic languages, which are closely related languages.

Fig. 1 LREMap: Statistics of language resources for 241 languages (from [19], licensed under CC-BY 3.0). Bars show the number of resources per language; the line shows the cumulative percentage of speakers.

3 Human–Machine Service Composition

3.1 Collaborative Creation Workflow

Manual creation of language resources is essential for developing multilingual services in low-resource languages. To assure data quality, the created language resources need to be subsequently evaluated by other workers. In this manual creation process, reducing the total cost while ensuring the quality of the language resources is a challenge due to the high costs associated with manual creation and subsequent evaluation. Furthermore, augmenting manually created language resources with machine-induced data becomes significant for increasing the size of language resources without proportionally increasing the total cost.

Fig. 2 Human–machine collaborative creation workflow: a human translator translates a list of source words into the target language, a human evaluator checks the accuracy of the translation pairs, and AI services induce a new dictionary from two dictionaries that share a common language.

Therefore, we have constructed a human–machine collaboration workflow that combines a loop of manual creation and evaluation (called human–human loop) with a loop of machine induction and manual evaluation (called human–machine loop), as illustrated in Fig. 2. The human–human loop continues to modify mistranslations until sufficient seed data is obtained. Once it creates enough seed data, the data is utilized to induce a language resource. The induced data is manually evaluated, and any incorrect results are either manually modified or filtered out in the human–machine loop.

In the human–human loop, finding highly reliable workers is challenging because few bilingual speakers are available to create and evaluate translations. Although crowdsourcing, which allows us to request tasks from a variable number of workers on the Internet, is one possible solution, securing many highly reliable workers remains difficult. Therefore, we need a crowdsourced workflow that can create translation pairs at low cost, regardless of the ratio of highly reliable workers. To solve this problem, Sect. 4 proposes a crowdsourced workflow using hyper-questions, a technique designed to elicit more informative responses from workers.

In the human–machine loop, machine induction methods cannot rely on the large amounts of training data that are usually expected, due to the nature of low-resource languages. Therefore, we need to augment language resources by using the domain knowledge that the target languages are closely related and belong to the same language family. Specifically, Sect. 5 presents a pivot-based approach and a neural network approach. The former focuses on cognates originating from the same word in a proto-language, and the latter utilizes the similarity of spelling between the closely related languages to acquire spelling transformation rules.

The total cost of this workflow varies according to which language pairs are crowdsourced and which language pairs are induced by the machine. For example, the cost of manual creation and evaluation depends on the number of highly reliable workers. Low language similarity decreases the accuracy of machine induction methods, which results in low cost-effectiveness due to the evaluation costs of mistranslations. Therefore, Sect. 6 describes a plan optimization method that selects which language pairs are targeted by crowdsourcing or by machine induction methods to minimize the total cost.

3.2 Crowdsourcing System for Language Resource Creation

To manually create and evaluate language resources in the human–human loop, we developed a crowdsourcing system [24]. This system enables a task requester to upload a list of headwords for creating a bilingual dictionary. The requester can assign translation creation and evaluation tasks to workers proficient in both languages for each headword. Each task progresses through eight states: pre-creation assignment, creation assignment, creation in progress, creation completion, pre-evaluation assignment, evaluation assignment, evaluation in progress, and evaluation completion, which are monitored by the task requester. After the creation of the requested translation pairs, the task requester can assign the evaluation task to the other workers. Once all of the evaluation results are collected, the task requester aggregates them to determine the final evaluation result. If incorrect, the task state reverts to pre-creation assignment, enabling re-assignment of the translation creation task.
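
As a rough illustration, the eight task states and the rollback from a rejected evaluation can be modeled as a small state machine. The following Python sketch is hypothetical: the state names paraphrase the description above, and the transition table is our reading of the workflow, not the system's actual implementation.

```python
from enum import Enum, auto

class TaskState(Enum):
    """The eight task states monitored by the task requester."""
    PRE_CREATION_ASSIGNMENT = auto()
    CREATION_ASSIGNMENT = auto()
    CREATION_IN_PROGRESS = auto()
    CREATION_COMPLETION = auto()
    PRE_EVALUATION_ASSIGNMENT = auto()
    EVALUATION_ASSIGNMENT = auto()
    EVALUATION_IN_PROGRESS = auto()
    EVALUATION_COMPLETION = auto()

# Forward transitions, plus the rollback that re-assigns the creation task
# when the aggregated evaluation result is "incorrect".
TRANSITIONS = {
    TaskState.PRE_CREATION_ASSIGNMENT: [TaskState.CREATION_ASSIGNMENT],
    TaskState.CREATION_ASSIGNMENT: [TaskState.CREATION_IN_PROGRESS],
    TaskState.CREATION_IN_PROGRESS: [TaskState.CREATION_COMPLETION],
    TaskState.CREATION_COMPLETION: [TaskState.PRE_EVALUATION_ASSIGNMENT],
    TaskState.PRE_EVALUATION_ASSIGNMENT: [TaskState.EVALUATION_ASSIGNMENT],
    TaskState.EVALUATION_ASSIGNMENT: [TaskState.EVALUATION_IN_PROGRESS],
    TaskState.EVALUATION_IN_PROGRESS: [TaskState.EVALUATION_COMPLETION],
    TaskState.EVALUATION_COMPLETION: [TaskState.PRE_CREATION_ASSIGNMENT],
}
```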

Workers can manage their own assigned creation tasks and evaluation tasks on the system. When the tasks are assigned, they appear in the worker’s management console, separated by task types such as creation and evaluation. As shown in Fig. 3, a headword (iklim (climate)) in a source language (Indonesian) is displayed in the creation task tab, and workers can register its translation (Cuaca (weather)) in a target language (Palembang). When the created translation pairs are accumulated, the task requester or the system generates their evaluation tasks and assigns them to workers different from the creators. As illustrated by Fig. 4, a translation pair (gelas (glass) and Cangkir (cup)) then appears in the evaluation task tab and is evaluated as correct (BENAR) or incorrect (SALAH) by the workers.

Fig. 3 User interface for creation tasks (screenshot with text in Indonesian)

Fig. 4 User interface for evaluation tasks (screenshot with text in Indonesian)

In addition to the individual tasks, the system supports collaborative tasks addressed jointly by several workers. The system also displays the meaning of the headword as a reference. This is particularly useful when creating and evaluating translations between two low-resource languages for which bilingual workers may be scarce. For example, two workers, each understanding a different low-resource language, can communicate the meaning of the target word to each other and collaboratively create and evaluate its translation pair.

4 Reliable Crowdsourced Services for Creating Language Resources

4.1 Introduction

Crowdsourcing is a service for requesting work from a large and open group of people via the Internet, and it can be used to order a large number of tasks that require human labor. Crowdsourced services are especially used to request tasks that are relatively difficult for computers but not for humans. However, in crowdsourcing, where tasks are executed by an unspecified number of workers whose abilities vary, it is difficult to guarantee the quality of the execution results. In particular, in the case of bilingual dictionary creation between low-resource languages [19], the number of people who can speak multiple low-resource languages is limited, and the average ability of workers is low. As a result, the common method of assigning the same task to multiple workers and taking a majority vote is likely to yield wrong answers, and quality control fails.

Therefore, we aim to improve quality in an environment with a small number of highly reliable workers by using an answer aggregation method based on hyper-questions (multiple tasks considered together as one). Since workers with high ability tend to agree on the answers to hyper-questions, the method increases the possibility that high-ability workers form the majority. To this end, we address the following two problems.

 

Selecting highly reliable evaluators:

In the answer aggregation method on hyper-questions, it is assumed that a small number of high-quality workers are involved. Therefore, it is necessary to select highly reliable evaluators from a crowd.

Reducing the number of tasks:

Even if workers can correctly evaluate whether a translation pair is correct, a wrong translation pair requires the translation to be redone, which increases the number of tasks.

 

To address these problems, we dynamically evaluate the reliability of workers based on their work results and select workers who are estimated to be highly skilled. Specifically, we set a parameter "reliability" for each worker and increase or decrease it based on the task results. In addition, we adjust the probability of task assignment based on the reliability of each worker.

4.2 Issues in Crowdsourcing

4.2.1 Quality Control

One of the most important research topics in crowdsourcing is quality control. Since tasks are performed by humans, it is not always possible to obtain correct results. In addition, since tasks are requested from an unspecified number of people, workers with low ability or workers who intentionally perform low-quality work (spammers) may execute the tasks. Therefore, the quality of the results cannot be guaranteed by the result of a single worker alone. Research on quality control takes two main approaches: aggregating work results to improve overall quality, and improving the quality of individual work results.

The former approach attempts to obtain high-quality results by removing errors from the work results. A typical example is assigning the same task to multiple workers and then taking a majority vote. However, while majority voting can lead to the correct answer when the ability of the workers is high, it is difficult to obtain the correct answer when the ability of the workers is low (less than 50% correct in the case of binary-choice tasks) [35]. For such cases, where experts are in the minority, an answer aggregation method using hyper-questions has been proposed [14]. A hyper-question is a set of single questions in which multiple questions are considered together as one. Since experts are more likely than non-experts to agree on the answers to multiple questions, majority voting on hyper-questions is particularly effective when there are few workers with high ability.

The latter approach attempts to improve the task execution results themselves by designing rewards and tasks or by selecting workers before requesting task execution. In particular, extracting workers who are estimated to have high ability in advance and assigning tasks only to them is expected to improve the quality of the work results, because low-ability workers and spammers are eliminated before task execution, and only workers estimated to have high ability actually perform the tasks.

4.2.2 Task Assignment

In task assignment, it is necessary to estimate the abilities of workers in advance in order to identify workers who can be expected to deliver high-quality results. However, this is difficult because the abilities of workers in crowdsourcing vary widely.

Therefore, methods of detecting workers with high ability by using tasks whose correct answers are known in advance (gold tasks) have been used. For example, there are two approaches: one assigns gold tasks in advance and filters workers by evaluating their answers, and the other blends gold tasks into normal tasks to measure workers' abilities and select workers [12]. When a worker is judged to have low ability by these methods, countermeasures can be taken, such as not assigning tasks to the worker afterward, placing restrictions on some tasks, or discarding the worker's output. These methods are considered the most effective way of estimating workers' abilities when the average ability of workers is not high. However, if gold tasks are mixed in with the actual tasks, rewards must be paid for answering the gold tasks, whose answers are already known, which reduces cost-effectiveness. When measuring workers' abilities in advance, gold tasks must be assigned to all workers, which simply reduces work efficiency. Furthermore, since generating gold tasks is known to be very difficult and costly, a method to automatically generate gold tasks from collected data has been proposed [26].

In this chapter, we assume bilingual dictionary creation using crowdsourcing in low-resource languages, where the number of workers who can speak the languages is small and the average ability of workers is not high. We therefore aim to improve the quality of the created bilingual dictionary by combining an answer aggregation method that is effective even for such a crowd of low average ability with a task assignment method based on workers' reliability calculated from each worker's results.

Fig. 5 Workflow for bilingual dictionary creation: a requester issues a creation task to a translator, evaluators execute evaluation tasks, and the requester obtains the pair if the aggregated evaluation result is correct; otherwise the task is re-allocated.

4.3 Crowdsourced Workflow

Considering a workflow consisting of a creation task and multiple evaluation tasks (Fig. 5), we ensure redundancy by performing multiple evaluation tasks for each creation task. In other words, the final evaluation of the translation pair produced by a creation task is determined by a majority vote over the results of the evaluation tasks. If a correct translation pair is produced and the aggregated evaluation judges it correct, a correct translation pair is obtained. If a wrong translation pair is produced and the aggregated evaluation mistakenly judges it correct, a wrong translation pair is obtained. Otherwise, the translation pair is discarded. If no translation pair is obtained, the process is repeated from a creation task until translation pairs for all words are obtained.

We assume that there are two types of tasks assigned to workers: a creation task, which is a free-input task to create a translation from a given word or sentence, and an evaluation task, which is a binary-choice task to evaluate whether the translation created by the previous task is “Correct” or “Wrong.”

4.4 Evaluation Aggregation with Hyper-Questions

The common aggregation methods, such as majority voting, often fail when the majority of workers do not know the correct answers. To emphasize the answers of a few high-quality workers, the aggregation method on hyper-questions was proposed [14]. A hyper-question consists of a subset of original single questions, and an answer to a hyper-question is a set of answers to the questions included in the hyper-question. A set of k original single questions is defined as a k-hyper-question. As the specific answer aggregation method on hyper-questions, we use majority voting on hyper-questions for evaluation tasks.

Given a set of evaluation tasks Q, our evaluation method constructs k-hyper-questions by combining single evaluation tasks in Q. A majority vote is then conducted for each hyper-question, yielding an answer to that hyper-question. The aggregated results of the hyper-questions are decoded into answers to the single questions. Finally, another round of majority voting is carried out for each single question. Consequently, the results of the first round of majority voting on hyper-questions are aggregated to obtain the final answer for every single question.

Figure 6 shows the procedure of majority voting on hyper-questions with five evaluators \(e_1\), \(e_2\), \(e_3\), \(e_4\), and \(e_5\) and four evaluation tasks \(q_1\), \(q_2\), \(q_3\), and \(q_4\), in which the evaluators determine whether each translation pair (Indonesian–Minangkabau) is "P (correct)" or "N (wrong)." In this example, k is set to 3, and "P" is the correct answer for all of the evaluation tasks. In the first step, four 3-hyper-questions are created from the four evaluation tasks; an answer to a hyper-question is the concatenation of the answers to its constituent single evaluation tasks. In the second step, majority voting is conducted for each hyper-question; in this case, the answer "PPP" is chosen for the first three hyper-questions, while the answer for the last one is not determined. In the third step, each majority answer to a hyper-question votes for the single evaluation tasks included in it. Finally, in the fourth step, another round of majority voting aggregates the votes for each single evaluation task to obtain the final answers. Simple majority voting fails on the evaluation task \(q_2\), but majority voting on hyper-questions succeeds. If there is no majority answer in the second step and some single evaluation tasks do not receive final answers, another round of majority voting is taken among the evaluators who voted for majority answers, restricted to the remaining evaluation tasks. By narrowing down the evaluators and reusing their evaluation results, we can reduce the number of evaluation tasks.
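
To make this procedure concrete, the following Python sketch implements majority voting on k-hyper-questions under the assumptions above (binary "P"/"N" answers; hyper-questions with tied votes are skipped). The function and variable names are ours, not from the system described in this chapter.

```python
from collections import Counter
from itertools import combinations

def hyper_majority(answers, k):
    """Aggregate binary answers ("P"/"N") by majority voting on k-hyper-questions.

    answers: evaluator id -> {question id -> answer}
    Returns question id -> final answer, or None if still undecided.
    """
    questions = sorted(next(iter(answers.values())))
    votes = {q: Counter() for q in questions}

    # Step 1: build every k-hyper-question from the single questions.
    for hyper in combinations(questions, k):
        # An evaluator's answer to a hyper-question is the tuple of their
        # answers to its constituent single questions.
        tallies = Counter(tuple(a[q] for q in hyper) for a in answers.values())
        ranked = tallies.most_common()
        # Step 2: majority voting on the hyper-question (skip ties).
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue
        # Step 3: the winning hyper-answer votes for each single question.
        for q, ans in zip(hyper, ranked[0][0]):
            votes[q][ans] += 1

    # Step 4: a final round of majority voting per single question.
    final = {}
    for q, counter in votes.items():
        ranked = counter.most_common()
        undecided = not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1])
        final[q] = None if undecided else ranked[0][0]
    return final
```

For the follow-up round described above, one would filter `answers` down to the evaluators who sided with the majority answers and re-run the vote on the undecided questions only.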

Fig. 6 Example of majority voting on hyper-questions procedure: answers to single questions are encoded into hyper-questions, majority voting on the hyper-questions extracts the experts' agreed answers, and decoding yields the final answers to the single questions.

4.5 Task Assignment Based on Workers’ Reliability

In this research, we aim to improve the quality and reduce the cost of crowdsourcing by identifying workers who are estimated to be highly skilled based on their work results and proactively assigning tasks to them. For this purpose, we propose a method to dynamically evaluate the reliability of workers based on their work results.

4.5.1 Workers’ Reliability

A “reliability” is set for each worker, and the initial value is 0. The reliability is calculated based on the results of creation tasks and evaluation tasks as follows.

  • If the translation pair created by a creation task is evaluated as "correct" by the evaluation tasks, the reliability of the translator is increased by \(1\).

  • If the translation pair created by a creation task is evaluated as "wrong" by the evaluation tasks, the reliability of the translator is decreased by \(1\).

  • If a worker's evaluations of all the created translation pairs in a given task set Q are in the majority of the final evaluations obtained by aggregating the evaluation tasks, the reliability of the evaluator is increased by \(1\).

  • If a worker's evaluations of all the created translation pairs in a given task set Q are in the minority of the final evaluations obtained by aggregating the evaluation tasks, the reliability of the evaluator is decreased by \(1\).

This calculation is performed each time the evaluation of all the created translations in one task set Q is completed.
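
A minimal sketch of these update rules, assuming the aggregated evaluation results are already available (the identifier names are ours):

```python
def update_reliability(reliability, translator, pair_judged_correct, evaluator_agreed):
    """Apply the reliability updates after all pairs in a task set Q are evaluated.

    reliability: worker id -> integer reliability score (initially 0)
    pair_judged_correct: aggregated verdict on the translator's pair
    evaluator_agreed: evaluator id -> True if the evaluator's answers over Q
        were in the majority of the aggregated final evaluations
    """
    # Rules 1-2: the translator gains or loses reliability with the verdict.
    reliability[translator] += 1 if pair_judged_correct else -1
    # Rules 3-4: each evaluator gains or loses reliability depending on
    # whether they sided with the aggregated majority.
    for evaluator, agreed in evaluator_agreed.items():
        reliability[evaluator] += 1 if agreed else -1
```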

4.5.2 Task Assignment

By using the reliability of each worker, we propose two types of task assignment methods:

  • Assigning evaluation tasks using a threshold

  • Task assignment using weighted probabilities.

For the first method, we place restrictions on which workers can be allocated evaluation tasks. For the bilingual evaluation task, we consider a worker whose reliability is 1 or higher to be a trusted worker, and only trusted workers can perform evaluation tasks. This method is expected to reduce the number of errors in evaluation tasks.

For the second method, the probability of task assignment for both creation tasks and evaluation tasks is adjusted based on each worker's weight, which is derived from his/her reliability. When the total number of workers who can perform a task is n, the weight \(w_i\) of the ith worker is calculated as in Eq. (1).

$$\begin{aligned} w_{i}=1+r_{i}-r_{min} \end{aligned}$$
(1)

Here, \(r_i\) denotes the reliability of the ith worker, and \(r_{min}\) is the lowest reliability among all workers who can perform the task. Calculating the weight as in Eq. (1) prevents the weight of the worker with the lowest reliability from becoming 0 (i.e., the probability of being assigned a task becoming 0). As the work progresses, the difference in weights grows as the difference in reliability among the workers becomes larger.

The probability that a task is assigned to a worker, \(p_i\), can be calculated by using weights, as in Eq. (2).

$$\begin{aligned} p_i=\frac{w_i}{\sum _{j=1}^{n} w_j} \end{aligned}$$
(2)

By performing these calculations each time a task is assigned, we make it easier to assign tasks to workers with high reliability (those estimated to be highly capable) and harder to assign tasks to workers with low reliability (those estimated to be less capable), thereby automatically eliminating workers estimated to be less capable. This can be expected to improve accuracy and reduce costs.
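
Equations (1) and (2) translate directly into weighted random sampling. A sketch using Python's standard library (the identifiers are hypothetical):

```python
import random

def assign_task(reliability):
    """Pick a worker with probability p_i proportional to w_i = 1 + r_i - r_min."""
    r_min = min(reliability.values())                        # lowest reliability
    workers = list(reliability)
    weights = [1 + reliability[w] - r_min for w in workers]  # Eq. (1), all >= 1
    # random.choices normalizes the weights, which is exactly Eq. (2).
    return random.choices(workers, weights=weights, k=1)[0]
```

For the first (threshold) method, one would instead restrict the candidate pool to workers whose reliability is at least 1 before sampling.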

4.6 Evaluation

4.6.1 Models

For the evaluation, we modeled crowdsourcing workers and tasks for creating a bilingual dictionary between low-resource languages.

Workers. The higher a worker's ability, the higher the quality of the task execution results. In this chapter, the ability of a worker is defined as the worker's vocabulary in multiple languages and is represented by \(x\ (0\le x\le 1)\). The closer x is to 1, the more vocabulary the worker recognizes and the more likely he/she is to perform the task correctly. Conversely, the closer x is to 0, the less vocabulary the worker recognizes and the more likely the task result is to be incorrect. For simplicity, we assume that the quality of a task execution result is probabilistically determined by the worker's ability. Following previous studies, we represent the ability of a worker using a beta distribution; the probability density function \(f(x|a, v)\) is given by Eq. (3) [7].

$$\begin{aligned} f(x|a, v) \!= \!\textrm{Beta}(\frac{a}{\min (a,1\!-\!a)v}, \frac{1\!-\!a}{\min (a,1\!-\!a)v}) \end{aligned}$$
(3)

\(a \in (0,1)\) is the normalized average of workers' abilities, and \(v \in (0,1)\) is a parameter that determines the variance of workers' abilities. When v is closer to 0, the variance is closer to 0; when v is closer to 1, the variance of the beta distribution with mean a is larger. This worker model is adopted from [7].
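
For example, worker abilities for a simulation can be drawn from this distribution as follows (a sketch assuming NumPy; the parameter names follow Eq. (3)):

```python
import numpy as np

def sample_worker_abilities(a, v, n, seed=0):
    """Draw n worker abilities x in (0, 1) from the beta model of Eq. (3).

    a: normalized average ability, v: variance parameter, both in (0, 1).
    The resulting Beta(alpha, beta) distribution has mean a.
    """
    scale = min(a, 1.0 - a) * v
    alpha = a / scale
    beta = (1.0 - a) / scale
    return np.random.default_rng(seed).beta(alpha, beta, size=n)
```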

Tasks. We assume that the result of a creation task is "Correct" if the worker knows the translation of the given word and "Wrong" otherwise. Whether a correct translation pair is produced thus depends entirely on the ability of the worker (Fig. 7). In contrast, since an evaluation task is a binary-choice task, a worker who knows the correct translation of a given word evaluates the pair correctly, whereas a worker who does not know the translation randomly selects one of the two values "Correct" or "Wrong" (Fig. 8). Therefore, in an evaluation task, no matter how low the worker's ability is, the worker makes a correct evaluation with a probability of at least 50%.

Fig. 7 A creation task model: a worker who knows the translation of the given word creates a correct translation pair (probability x); otherwise a wrong pair is created (probability 1 − x).

Fig. 8 An evaluation task model: a worker who knows the translation of the given word evaluates the pair correctly; otherwise the worker answers randomly.
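
Under these models, one round of a creation task and an evaluation task can be simulated as below (a sketch; the coin flips mirror Figs. 7 and 8):

```python
import random

def creation_task(ability):
    """Fig. 7: a correct pair is created with probability x (the ability)."""
    return random.random() < ability   # True = correct translation pair

def evaluation_task(ability, pair_is_correct):
    """Fig. 8: return the worker's vote on whether the pair is correct."""
    if random.random() < ability:      # the worker knows the translation
        return pair_is_correct         # and therefore votes truthfully
    return random.random() < 0.5       # otherwise a random binary choice
```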

4.6.2 Evaluation Method

The methods, including the proposed method, are evaluated in terms of the accuracy of the produced translation pairs and the work quantity required to obtain all the translation pairs.

 

Proposed Method 1 (Reliable_hyper_reuse):

A model that combines the answer aggregation on hyper-questions and the task assignment based on workers’ reliability. In the case of failure of majority voting using hyper-questions, this model takes another majority voting by reusing the evaluation results from the evaluators who voted majority answers for the successful evaluation tasks.

Proposed Method 2 (Reliable_hyper):

A model that combines the answer aggregation on hyper-questions and the task assignment based on workers’ reliability.

Comparison Method 1 (Random_hyper):

A model that combines the answer aggregation on hyper-questions and the random task assignment for the entire workers.

Comparison Method 2 (Reliable):

A model that combines a simple majority voting in evaluation tasks and the task assignment based on workers’ reliability.

Comparison Method 3 (Random):

A model that combines a simple majority voting in evaluation tasks and the random task assignment for the entire workers.

 

In order to measure the performance of each method described above, we use the following indicators.

  1.

    Accuracy of the produced translation pairs

    The accuracy of the produced translation pairs by each method is calculated as follows:

    $$\begin{aligned} \text {Accuracy} = \frac{\text {Number of translation pairs produced correctly}}{\text {Total number of obtained translation pairs}} \end{aligned}$$
    (4)

    This indicator helps to compare the simple quality of the outputs from each method.

  2.

    Work quantity required to obtain all the translation pairs.

    The work quantity is the total unit time of the creation tasks and evaluation tasks executed until all the translation pairs are obtained. A unit time is estimated from the time taken to perform each task. Since creation tasks are more difficult than evaluation tasks, we define that a creation task takes 3 units and an evaluation task takes 1 unit. This cost model is adopted from [25]. This indicator helps to compare the efficiency and cost of each method.

To evaluate the indicators described above, we conducted simulations of each method. We set the number of workers to 20 and assumed 1,000 target words. The ability of each worker is determined based on the model in Sect. 4.6.1, and we varied the average of workers' abilities between 0.2 and 0.7 with a variance parameter of 0.5. To eliminate bias due to random numbers, we used the average of 100 simulation runs for each method.
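
As an illustration, the Random baseline (random assignment plus simple majority voting) can be simulated by combining the sketches above with the cost model of 3 units per creation task and 1 unit per evaluation task. The parameters below follow the setup in the text, while the number of evaluators per pair is our assumption:

```python
import random

def simulate_random(abilities, n_words=1000, n_eval=5):
    """One simulation run of the Random baseline (Comparison Method 3).

    Reuses creation_task / evaluation_task defined above.
    Returns (accuracy of the obtained pairs, total work quantity in units).
    """
    ids = list(range(len(abilities)))
    correct = obtained = work = 0
    while obtained < n_words:
        translator = random.choice(ids)
        pair_ok = creation_task(abilities[translator])
        work += 3                                    # a creation task: 3 units
        evaluators = random.sample([i for i in ids if i != translator], n_eval)
        votes = [evaluation_task(abilities[i], pair_ok) for i in evaluators]
        work += n_eval                               # evaluation tasks: 1 unit each
        if sum(votes) * 2 > n_eval:                  # aggregated verdict: "correct"
            obtained += 1
            correct += pair_ok                       # may still be a wrong pair
        # otherwise the pair is discarded and the word is redone
    return correct / obtained, work

# Example: abilities = sample_worker_abilities(a=0.5, v=0.5, n=20)
#          acc, work = simulate_random(abilities)
```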

4.6.3 Results

The accuracies of the proposed methods, Reliable_hyper_reuse and Reliable_hyper, were almost the same and the highest, followed by Reliable, Random_hyper, and Random. The difference in accuracy between the proposed methods and Reliable, the second highest, was about 5–10%, as illustrated in Fig. 9.

Fig. 9 Accuracy versus the average ability of workers for the five methods; the proposed methods achieve the highest accuracy and Random the lowest.

Fig. 10 Work quantity versus the average ability of workers for the five methods.

The work quantity tended to be larger for Reliable_hyper_reuse, Reliable_hyper, and Random_hyper, the models using answer aggregation on hyper-questions. However, for Reliable_hyper_reuse, the work quantity was the smallest when the average of the workers' abilities was 0.5 or higher, as shown in Fig. 10, illustrating the cost reduction achieved by reusing the evaluation results of reliable workers.

Both Reliable and Random_hyper were more accurate than Random, indicating that both the task assignment based on workers' reliability and the answer aggregation on hyper-questions are effective. In addition, the accuracy of Reliable was higher than that of Random_hyper, indicating that assigning tasks to workers with high reliability is more effective than improving the quality of answer aggregation. Furthermore, the accuracies of Reliable_hyper_reuse and Reliable_hyper, which combine the two techniques, were particularly high, indicating that the techniques are more effective combined than used individually.

Since the work quantity for Reliable_hyper_reuse, Reliable_hyper, and Random_hyper, which use answer aggregation on hyper-questions, tended to be larger, many redos presumably occurred. This may be because majority voting on hyper-questions makes it more difficult to reach an agreement than simple majority voting, so the evaluation aggregations often fail. However, when the average worker ability was 0.5 or higher, the work quantity for Reliable_hyper_reuse and Reliable_hyper dropped rapidly. This shows that if evaluation tasks can be assigned to high-quality workers from a crowd containing more than a certain number of high-ability workers, majority voting on hyper-questions is more likely to succeed and task redos are less likely to occur. Furthermore, in Reliable_hyper_reuse and Reliable_hyper, creation tasks are also assigned preferentially to the workers with the highest reliability, so few wrong translation pairs are created in the first place. Regarding the number of reliable workers whose abilities exceed 0.7, there were two such workers when the average of workers' abilities was 0.4 and four when it was 0.5. This indicates that two reliable workers are too few to cover both evaluation tasks and creation tasks, so majority voting on hyper-questions does not work well even if those workers perform the creation tasks very well.

5 AI Services for Augmenting Language Resources

5.1 Introduction

Crowdsourced bilingual dictionary creation between low-resource languages is challenging, especially for languages with fewer speakers. This challenge is primarily due to high manual costs and the scarcity of bilingual workers. Numerous studies have explored the semi-automatic or automatic creation of bilingual lexicons, leveraging various language resources such as parallel corpora, comparable corpora, WordNet, and existing bilingual dictionaries. However, these methods often fail when applied to low-resource languages, which typically lack substantial parallel corpora. To address this issue, this section proposes two machine induction methods that utilize small existing bilingual dictionaries as seed data. The first method is a pivot-based approach. It generates a new bilingual dictionary by linking two existing dictionaries through a pivot language. However, this approach must address the inherent ambiguity caused by polysemous words in the pivot language when identifying correct translation pairs between the languages. The second method is a neural network approach that infers spelling transformation rules from the seed data based on the orthographic similarity of cognates.

5.2 Pivot-Based Approach

The pivot-based approach is commonly used in bilingual dictionary induction, especially when the only available language resources are dictionaries. This method constructs a graph, termed a "transgraph," by connecting two bilingual dictionaries via a shared pivot language. To model a transgraph, we utilize a tripartite graph. Figure 11 illustrates an example of a transgraph between languages A and C via pivot language B. Each vertex denotes a word, while each edge represents a translation relation between two vertices. In the basic form of a transgraph, every pivot vertex must be linked to at least one non-pivot vertex and be interconnected through non-pivot vertices. Transgraphs are merged when there exists at least one edge connecting a pivot vertex in one transgraph to a non-pivot vertex in the other. From this graph, reachable word pairs between the two non-pivot languages are extracted as "translation pair candidates," such as the pairs \((w_1^A, w_1^C), (w_1^A, w_2^C), (w_2^A, w_1^C), (w_2^A, w_2^C), (w_3^A, w_1^C)\), and \((w_3^A, w_2^C)\). Subsequently, correct translation pairs are identified from these candidates. Wushouer et al. formalized pivot-based bilingual dictionary induction as an optimization problem [45]. They assumed that translation pairs between closely related languages were one-to-one mappings and cognates (words originating from the same word in a proto-language). Based on this assumption, they solved the constraint optimization problem to induce a Uyghur-Kazakh bilingual dictionary using Chinese as the pivot language. In this research, we aim to develop a generalized framework for constraint-based bilingual dictionary induction by relaxing the existing one-to-one mapping assumption into a many-to-many assumption.

Fig. 11 Example of transgraph: words \(w_1^A\), \(w_2^A\), \(w_3^A\) in language A are connected through the pivot words \(w_1^B\), \(w_2^B\) in language B (with senses \(s_1\), \(s_2\)) to words \(w_1^C\), \(w_2^C\) in language C.
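
Extracting translation pair candidates from a transgraph amounts to collecting all non-pivot word pairs reachable through a shared pivot word. A minimal sketch follows, representing each bilingual dictionary as a set of word pairs (our own data layout, not the chapter's implementation):

```python
def translation_pair_candidates(dict_ab, dict_bc):
    """Collect candidate pairs (w^A, w^C) reachable via some pivot word w^B.

    dict_ab: set of (word_A, pivot) pairs; dict_bc: set of (pivot, word_C) pairs.
    """
    a_words_of = {}                        # pivot -> set of language-A words
    for wa, wb in dict_ab:
        a_words_of.setdefault(wb, set()).add(wa)
    candidates = set()
    for wb, wc in dict_bc:
        for wa in a_words_of.get(wb, ()):  # every A word sharing this pivot
            candidates.add((wa, wc))
    return candidates
```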

5.2.1 Symmetry Assumption

When dictionaries incorporate sense information, as denoted by \(s_1\) and \(s_2\) in Fig. 11, correct translation pairs can be readily derived from a transgraph by identifying cognate pairs, each of which has a complete overlap in senses. For instance, the cognate pair \((w_1^A, w_1^C)\) shares two senses, namely \(s_1\) and \(s_2\), through the pivot word \(w_1^B\), while the cognate pair \((w_2^A, w_2^C)\) shares only the sense \(s_1\) through the pivot words \(w_1^B\) and \(w_2^B\). However, available machine-readable bilingual dictionaries with sense information are limited, especially for low-resource languages. Therefore, we assume that connected words share at least one sense. Furthermore, non-pivot words symmetrically connected through pivot word(s) are presumed to share all their senses and are thus identified as cognates. In Fig. 11, the pairs \((w_1^A, w_1^C)\), \((w_3^A, w_1^C)\), and \((w_2^A, w_2^C)\) are regarded as cognates. We employ this symmetry assumption for extracting cognates between closely related languages because most linguists argue that lexical comparison alone is insufficient for cognate identification [3].
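
Under the same data layout as the previous sketch, the symmetry assumption reduces to checking that a candidate pair is linked to exactly the same set of pivot words from both sides:

```python
def is_symmetric_cognate(wa, wc, dict_ab, dict_bc):
    """Return True if (wa, wc) is symmetrically connected through the pivots.

    E.g., in Fig. 11, (w1_A, w1_C) and (w2_A, w2_C) pass this test,
    while (w1_A, w2_C) does not.
    """
    pivots_of_wa = {wb for w, wb in dict_ab if w == wa}
    pivots_of_wc = {wb for wb, w in dict_bc if w == wc}
    return bool(pivots_of_wa) and pivots_of_wa == pivots_of_wc
```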

5.2.2 N-Cycle Symmetry Assumption

Machine-readable bilingual dictionaries for low-resource languages are often limited in size and quality. Such dictionaries may miss translation relations essential for constructing a symmetrical topology in a transgraph. Figure 12 illustrates an asymmetric transgraph, where the dashed edge \((w_2^B, w_1^C)\) is expected to be a missing translation relation. The pivot-based approach adds these missing edges to a transgraph at some cost to satisfy the symmetry assumption.

The existing one-to-one approach identifies missing edges only once, ensuring the symmetry assumption for the initial translation pair candidates linked by solid edges. In Fig. 13a, five translation pair candidates are extracted, and four missing dashed edges are identified to satisfy the symmetry assumption for all the candidates. Since this compensation for missing edges is limited to the initial translation pairs, we call it the "one-cycle symmetry assumption." To apply this compensation to new translation pair candidates linked by the added edges, we iterate the one-cycle symmetry assumption n times, which we call the "n-cycle symmetry assumption." Figure 13b illustrates the second cycle after Fig. 13a: three more candidates, 6, 7, and 8, are extracted from the previously added solid edges. Users can specify the maximum number of iterations for the experiment.

Fig. 12 Asymmetry transgraph: the dashed edge \((w_2^B, w_1^C)\) represents a missing translation relation.

Fig. 13 N-cycle symmetry assumption extension: a the one-cycle symmetry assumption identifies the four missing dashed edges from the existing solid edges; b the second cycle extracts candidates 6, 7, and 8 from the previously added edges.

5.2.3 Formalization

Constraint optimization problems have been widely applied to natural language processing and web service composition [9, 16]. Wushouer et al. [45] applied Weighted Partial MaxSAT (WPMaxSAT) to bilingual dictionary induction, and following them, we adopt CNF encoding in our formalization [1]. A literal is defined as either a Boolean variable x or its negation \(\lnot x\), and a clause C as a disjunction of literals \(x_1 \vee \ldots \vee x_n\). A weighted clause is represented as a pair (\(C, \omega \)), where the weight \(\omega \) denotes the penalty for violating the clause C. In the case of a hard clause, infinity (\(\infty \)) is assigned as the weight. A propositional formula \(\varphi _c^\omega \) is a conjunction of one or more clauses \(C_1 \wedge \ldots \wedge C_n\). A formula with soft clauses and one with hard clauses are represented as \(\varphi _c^+\) and \(\varphi _c^\infty \), respectively. A WPMaxSAT problem comprises multiple formulae \(\varphi _c^\omega \); its solution provides an optimal assignment to the variables in the clauses, minimizing the total cost of violated soft clauses.

To apply WPMaxSAT to bilingual dictionary induction, we introduce two types of variables for the literals: e and c. e indicates edge existence between a given word pair, while c indicates that a given word pair is a cognate pair. For instance, the edge existence between word \(w_i^A\) in language A and word \(w_j^B\) in language B is denoted by \(e(w_i^A,w_j^B)\), and the cognate pair between words \(w_i^A\) and \(w_j^B\) by \(c(w_i^A, w_j^B)\).

To represent the various word pairs for e and c, we define five sets of word pairs: \(E_E\), \(E_N\), \(D_C\), \(D_{Co}\), and \(D_R\). The first two sets concern the existence of edges: \(E_E\) and \(E_N\) are the sets of word pairs connected by existing edges and by missing edges, respectively. The remaining three sets concern translation pairs between non-pivot languages: \(D_C\) denotes the set of translation pair candidates, \(D_{Co}\) the set of cognate pairs, and \(D_R\) the set of all translation pairs identified by the WPMaxSAT solver.

5.2.4 Heuristics to Find Cognates

We introduce two heuristics into the cognate identification modeled as a WPMaxSAT problem: cognate pair coexistence probability and cognate form similarity.

Cognate Pair Coexistence Probability. In assessing the likelihood that a translation pair candidate \(t(w_i^{A},w_k^{C})\) is a cognate pair \(c(w_i^{A},w_k^{C})\), we calculate the cognate pair coexistence probability, denoted as \(H_{coex}\). This probability is derived by multiplying the two chain rules given in Eqs. (5) and (6), which results in Eq. (7). The marginal probability \(P(w_i^A)\) represents the likelihood that \(w_i^A\) connects to any word in language C. The conditional probability \(P(w_i^A|w_k^C)\) indicates the likelihood that \(w_k^C\) connects to \(w_i^A\) given that \(w_k^C\) connects to some word in language A. The joint probability \(P(w_i^A, w_k^C)\) signifies the likelihood that \(w_i^A\) and \(w_k^C\) are interconnected. \(P(w_i^A)\) and \(P(w_k^C)\) are independent because they come from different bilingual dictionaries; thus \(P(w_k^C, w_i^A) = P(w_i^A)P(w_k^C)\), and Eq. (7) can be converted to Eq. (8). To calculate \(P(w_i^A|w_k^C)\) and \(P(w_k^C|w_i^A)\), we employ the generative probabilistic process commonly used in previous works [5, 20, 33, 43], as given in Eq. (9).

$$\begin{aligned} \small P(w_i^A, w_k^C) = P(w_k^C|w_i^A)P(w_i^A) \end{aligned}$$
(5)
$$\begin{aligned} \small P(w_k^C, w_i^A) = P(w_i^A|w_k^C)P(w_k^C) \end{aligned}$$
(6)
$$\begin{aligned} \small P(w_i^A, w_k^C)P(w_k^C, w_i^A) = P(w_i^A|w_k^C)P(w_k^C|w_i^A)P(w_i^A)P(w_k^C) \end{aligned}$$
(7)
$$\begin{aligned} \small P(w_i^A, w_k^C) = P(w_i^A|w_k^C)P(w_k^C|w_i^A) \end{aligned}$$
(8)
$$\begin{aligned} \small P(w_i^A|w_k^C) = \sum _{j=0}P(w_i^A|w_j^B)P(w_j^B|w_k^C) \end{aligned}$$
(9)
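
A sketch of Eqs. (8) and (9) follows. The chapter only states that a standard generative probabilistic process is used for the conditionals, so this sketch assumes the simplest such process: a pivot word generates each of its linked words uniformly at random.

```python
def coexistence_probability(wa, wc, dict_ab, dict_bc):
    """H_coex of Eq. (8) for the candidate pair (wa, wc).

    Conditionals follow Eq. (9), with uniform edge probabilities (an assumption).
    dict_ab: set of (word_A, pivot); dict_bc: set of (pivot, word_C).
    """
    a_of, b_of_a, c_of, b_of_c = {}, {}, {}, {}
    for w, wb in dict_ab:
        a_of.setdefault(wb, set()).add(w)      # pivot -> language-A words
        b_of_a.setdefault(w, set()).add(wb)    # A word -> pivots
    for wb, w in dict_bc:
        c_of.setdefault(wb, set()).add(w)      # pivot -> language-C words
        b_of_c.setdefault(w, set()).add(wb)    # C word -> pivots

    def conditional(target, source, target_of_pivot, pivots_of_source):
        # Eq. (9): P(target | source) = sum_j P(target | b_j) * P(b_j | source)
        pivots = pivots_of_source.get(source, set())
        total = 0.0
        for wb in pivots:
            linked = target_of_pivot.get(wb, set())
            p_target_given_pivot = 1 / len(linked) if target in linked else 0.0
            total += p_target_given_pivot * (1 / len(pivots))
        return total

    # Eq. (8): joint probability as the product of the two conditionals.
    return (conditional(wa, wc, a_of, b_of_c) *
            conditional(wc, wa, c_of, b_of_a))
```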

Cognate Form Similarity. The symmetry assumption may sometimes fail to identify the correct cognate among the translation pair candidates when a pivot word has multiple in-degrees/out-degrees. To correctly identify cognates, not only the word senses represented by edges but also the word forms are useful. We therefore calculate the cognate form similarity \(H_{formSim}\) of the translation candidate \(t(w_i^A, w_k^C)\) using the Longest Common Subsequence Ratio (LCSR), which ranges from 0 (0% form similarity) to 1 (100% form similarity) [17]. In Eq. (10), \(LCS(w_i^A, w_k^C)\) is the longest common subsequence of \(w_i^A\) and \(w_k^C\), |x| is the length of x, and \(max(|w_i^A|, |w_k^C|)\) returns the longer length.

$$\begin{aligned} LCSR(w_i^A, w_k^C) = \frac{|LCS(w_i^A, w_k^C)|}{max(|w_i^A|, |w_k^C|)} \end{aligned}$$
(10)
$$\begin{aligned} t(w_i^A, w_k^C).H_{formSim} = LCSR(w_i^A, w_k^C) \end{aligned}$$
(11)
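
The LCSR of Eq. (10) is a standard dynamic-programming computation; a sketch:

```python
def lcsr(w1, w2):
    """Longest Common Subsequence Ratio, Eq. (10): |LCS| / max(|w1|, |w2|)."""
    m, n = len(w1), len(w2)
    if max(m, n) == 0:
        return 0.0
    # dp[i][j] = length of the LCS of w1[:i] and w2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if w1[i] == w2[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, n)
```

For example, lcsr("abcde", "abde") returns 4/5 = 0.8, since the longest common subsequence is "abde".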

5.2.5 Constraints to Identify Cognates

All the constraints for the WPMaxSAT are summarized in Table 1.

Table 1 Constraints for cognate extraction

 

Edge Existence.:

In the transgraph, there exists an edge between words that share similar meanings. Edges that currently exist in the transgraph are encoded as TRUE in the CNF formula. Specifically, edges such as \(e(w_i^{A},w_j^{B})\) and \(e(w_j^{B},w_k^{C})\) are represented as hard constraints \(\varphi _1^\infty \).

Edge Non-existence:

In the transgraph, no edge exists between words that do not share similar meanings. The non-existence of an edge is encoded as the negation of the edge-existence literal in the CNF formula. Specifically, \(\lnot e(w_i^{A},w_j^{B})\) and \(\lnot e(w_j^{B},w_k^{C})\) are represented as the soft constraint \(\varphi _2^+\).

Symmetry:

Cognates share all of their senses, resulting in a symmetrical topology via the pivot language in the transgraph. We convert

$$\begin{aligned} c(w_i^{A},w_k^{C}) \rightarrow &\; e(w_i^{A},w_1^{B}) \wedge e(w_i^{A},w_2^{B}) \wedge \ldots \wedge e(w_i^{A},w_n^{B})\\ &\wedge e(w_1^{B},w_k^{C}) \wedge e(w_2^{B},w_k^{C}) \wedge \ldots \wedge e(w_n^{B},w_k^{C}) \end{aligned}$$

into

$$\begin{aligned} &(\lnot c(w_i^{A},w_k^{C}) \vee e(w_i^{A},w_1^{B})) \wedge (\lnot c(w_i^{A},w_k^{C}) \vee e(w_i^{A},w_2^{B})) \wedge \ldots \\ &\wedge (\lnot c(w_i^{A},w_k^{C}) \vee e(w_i^{A},w_n^{B})) \wedge (\lnot c(w_i^{A},w_k^{C}) \vee e(w_1^{B},w_k^{C})) \\ &\wedge (\lnot c(w_i^{A},w_k^{C}) \vee e(w_2^{B},w_k^{C})) \wedge \ldots \wedge (\lnot c(w_i^{A},w_k^{C}) \vee e(w_n^{B},w_k^{C})). \end{aligned}$$

In the transgraph, the symmetry assumption is encoded as the hard constraint \(\varphi _3^\infty \). However, challenges arise with low-resource languages: because their dictionaries are small, many senses are missing, leading to many missing edges in the transgraph. To compensate, we introduce new edges so that cognate pairs share all senses. This is achieved by violating the soft constraint \(\varphi _2^+\) for edge non-existence and incurring a cost based on the user-selected heuristics, namely the cognate pair coexistence probability and the cognate form similarity; in effect, we assume that these edges exist. A higher cognate pair coexistence probability and a greater cognate form similarity increase the likelihood of a pair being cognate, so the cost of introducing a new edge for such a pair is lower. In the CNF formula, these new edges are encoded as FALSE, represented as \(\lnot e(w_i^{A},w_j^{B})\) or \(\lnot e(w_j^{B},w_k^{C})\), and visually depicted as dashed edges in the transgraph. The weights of the new edges, whether from a non-pivot word \(w_i^A\) to a pivot word \(w_j^B\) or from a pivot word \(w_j^B\) to a non-pivot word \(w_k^C\), are defined as \(\omega (w_i^{A},w_j^{B})\) and \(\omega (w_j^{B},w_k^{C})\); both are set to \(t(w_i^{A},w_k^{C}).H_{coex} + t(w_i^{A},w_k^{C}).H_{formSim}\).

Uniqueness:

The uniqueness constraint ensures that only one-to-one cognates sharing all of their pivot words are regarded as correct translation pairs; it limits the cognate of a word in language A to exactly one word in language C. This constraint is encoded as the hard constraint \(\varphi _4^\infty \).

Extracting at Least One Cognate:

Due to the iterative interaction between the framework and the WPMaxSAT solver, the hard constraint \(\varphi _5^\infty \), a disjunction of all \(c(w_i^A,w_k^C)\) variables, ensures that at least one of these variables is evaluated as TRUE. As a result, each iteration identifies the most probable cognate pair and stores it in both \(D_{Co}\) and \(D_{R}\) as a correct translation pair.

Encoding Cognate:

We filter out translation pairs previously selected into \(D_{Co}\) from the list of translation pair candidates. These pairs are encoded as TRUE, represented by \(c(w_i^A,w_k^C)\), and form the hard constraint \(\varphi _6^\infty \); they are also excluded from the disjunction \(\varphi _5^\infty \).
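To make the encoding concrete, the following is a minimal sketch using the RC2 MaxSAT solver from the PySAT library. The toy transgraph, the variable ids, and the soft-clause weights are illustrative assumptions rather than the exact encoding used in this chapter; the uniqueness and cognate-encoding constraints are omitted for brevity.

```python
# Sketch: WPMaxSAT encoding of a tiny transgraph with PySAT's RC2 solver.
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

# Boolean variables: 1..4 are edge literals e(...), 5..6 are cognate literals c(...).
e_a1_b1, e_b1_c1, e_a1_b2, e_b2_c1 = 1, 2, 3, 4
c_a1_c1, c_a1_c2 = 5, 6

wcnf = WCNF()

# phi_1 (hard): edges observed in the input dictionaries exist.
wcnf.append([e_a1_b1])
wcnf.append([e_b1_c1])

# phi_2 (soft): unobserved edges are assumed absent; violating a clause
# (i.e., introducing a new edge) costs a weight derived from the heuristics
# H_coex + H_formSim of the affected pair (the value 3 is a stand-in).
wcnf.append([-e_a1_b2], weight=3)
wcnf.append([-e_b2_c1], weight=3)

# phi_3 (hard): symmetry as CNF, one clause (not c or e) per pivot edge of the pair.
for e in (e_a1_b1, e_b1_c1, e_a1_b2, e_b2_c1):
    wcnf.append([-c_a1_c1, e])

# phi_5 (hard): extract at least one cognate per iteration.
wcnf.append([c_a1_c1, c_a1_c2])

with RC2(wcnf) as solver:
    model = solver.compute()     # optimal assignment minimizing violated soft clauses
    print(model, solver.cost)    # cognates are the positive c-literals in the model
```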


5.2.6 Generalized Framework

We define two main CNF formulae: \(CNF_{cognate}\), as shown in Eq. (12), and \(CNF_{M-M}\), as shown in Eq. (13) [21]. The former aims at identifying unique cognate pairs, while the latter extracts many-to-many translation pairs by omitting the uniqueness constraint \(\varphi _{4}^\infty \).

$$\begin{aligned} CNF_{cognate} = \varphi _{1}^\infty \wedge \varphi _{2}^+ \wedge \varphi _{3}^\infty \wedge \varphi _{4}^\infty \wedge \varphi _{5}^\infty \wedge \varphi _{6}^\infty \end{aligned}$$
(12)
$$\begin{aligned} CNF_{M-M} = \varphi _{1}^\infty \wedge \varphi _{2}^+ \wedge \varphi _{3}^\infty \wedge \varphi _{5}^\infty \wedge \varphi _{6}^\infty \end{aligned}$$
(13)

To construct various constraint-based bilingual dictionary induction methods suited to the available language resources and target languages, we generalize the constraint-based framework based on the above two CNF formulae. This allows users to choose the set of constraints (\(CNF_{cognate}\) or \(CNF_{M-M}\)), the number of iterations for the symmetry assumption, and individual or combined heuristics. The generalized framework is defined in Backus–Naur Form as follows:

\(\langle situatedMethod \rangle {:}{:}= \langle cycle \rangle \)“:” \( \langle method \rangle \)“:” \( \langle heuristic\rangle \)

\(\langle cycle \rangle {:}{:}= \) “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9”

\(\langle method \rangle {:}{:}=\) “C” | “M”

\(\langle heuristic \rangle {:}{:}=\) “H1” | “H2” | “H12”

  • cycle: the number of iterations for the symmetry assumption (cycle \(\ge \) 1).

  • method: C indicating \(CNF_{cognate}\) or M denoting \(CNF_{M-M}\).

  • heuristic: an individual or combined heuristic. H1 indicates the cognate pair coexistence probability, H2 the cognate form similarity, and H12 their combination.

Using this generalized framework, we can express the previous constraint-based methods. The \(CNF_{cognate}\) formula with a 1-cycle symmetry assumption and heuristic H1 is represented as 1:C:H1, identical to the one-to-one approach [44] and \(\Omega _1\) in our prior work [21]. The \(CNF_{M-M}\) formula with a 1-cycle symmetry assumption and heuristic H1 is represented as 1:M:H1, identical to \(\Omega _2\) in our prior work [21].
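As an illustration of the notation, the following is a minimal sketch of a parser for situatedMethod strings defined by the BNF above; the function and dictionary key names are assumptions made for this example.

```python
# Sketch: parsing the situatedMethod notation (e.g., "2:M:H12").
import re

PATTERN = re.compile(r"^([1-9]):(C|M):(H1|H2|H12)$")

def parse_situated_method(spec: str) -> dict:
    m = PATTERN.match(spec)
    if not m:
        raise ValueError(f"not a valid situatedMethod: {spec!r}")
    cycle, method, heuristic = m.groups()
    return {
        "cycle": int(cycle),  # iterations of the symmetry assumption
        "cnf": "CNF_cognate" if method == "C" else "CNF_M-M",
        "heuristics": {"H1": ["coexistence"],
                       "H2": ["form_similarity"],
                       "H12": ["coexistence", "form_similarity"]}[heuristic],
    }

print(parse_situated_method("1:C:H1"))    # the one-to-one baseline
print(parse_situated_method("2:M:H12"))   # best M-M method in Sect. 5.3
```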

5.3 Experiment for Pivot-Based Approach

We conducted experiments using six methods derived from our generalized framework. Three extract unique cognate pairs (1–1) with the combined heuristics and a 1-cycle (1:C:H12), 2-cycle (2:C:H12), or 3-cycle (3:C:H12) symmetry assumption. The remaining three extract many-to-many translation pairs (M-M) with the combined heuristics and a 1-cycle (1:M:H12), 2-cycle (2:M:H12), or 3-cycle (3:M:H12) symmetry assumption. For comparison, we used the two baselines employed in the previous constraint-based methods, 1:C:H1 and 1:M:H1. Furthermore, we compared the six variations with the inverse consultation method (IC) [39] and with translation pairs generated from the Cartesian product of each transgraph (CP).

5.3.1 Experimental Settings

We targeted three Indonesian ethnic languages for evaluating our methods: Minangkabau (min), Riau Mainland Malay (zlm), and Indonesian (ind) as the pivot language (min-ind-zlm). The language similarities between Minangkabau and Indonesian, Indonesian and Riau Mainland Malay, and Minangkabau and Riau Mainland Malay are 69.14%, 87.70%, and 61.66%, respectively, obtained from ASJP [10, 42]. This experiment aims to induce a Minangkabau-Malay bilingual dictionary from two input dictionaries, Minangkabau-Indonesian and Malay-Indonesian. To create the gold standard for evaluating precision and recall, we generated all possible translation pairs using the Cartesian product (CP) of each transgraph, which were then verified by Minangkabau-Malay bilingual crowd workers. Table 2 summarizes the details of the input dictionaries and the gold standard.

Table 2 Details of input dictionaries and gold standard

5.3.2 Experiment Result

In this experiment, all transgraphs achieve full symmetric connectivity by the third cycle, obtaining all possible translation pair candidates. To extract many-to-many translation pairs, a soft-constraint violation threshold filters out all translation pairs whose costs surpass it. Decreasing the threshold yields high precision but low recall, while increasing it yields high recall but low precision. To balance the two, we use their harmonic mean, the F-measure. Table 3 presents the results at the threshold producing the highest F-score. For min-ind-zlm, our best M-M method (2:M:H12) achieves an F-score 3.4% higher than CP and 12.9 times higher than IC. Meanwhile, our best 1–1 method (3:C:H12) achieves precision 1.3% higher than our previous method (1:C:H1).

Table 3 Comparison of thresholds producing the highest F-score

5.4 Neural Network Approach

Given a set of translation pairs as a bilingual dictionary, we can use the pairs to train a model that transforms a source word into a target word, thereby augmenting the size of the dictionary. We therefore introduce a neural network approach to acquire the transformation rules, or patterns, between words in closely related languages. The sequence-to-sequence (seq2seq) model, consisting of an encoder and a decoder, is commonly used to learn a transformation from one language to another; we employ it with a Bi-LSTM encoder and an LSTM decoder. The encoder receives a word in a hub language among closely related languages and produces a context vector, while the decoder takes the vector from the encoder and generates a word in another closely related language. Because the hub language is the most similar to the other closely related languages, its encoder is well suited to transfer learning for word translation tasks in those languages. In this research, we validate two tokenization methods for applying the seq2seq model to word translation tasks: character-based and subword-based tokenization.

5.4.1 Character-Based Sequence to Sequence

Fig. 14 Character-based sequence-to-sequence model: the encoder embeds the Indonesian word and processes it with Bi-LSTM layers; the decoder feeds the Minangkabau word through a one-hot vector and an LSTM layer to the output layer

The first method employs character-based tokenization. Figure 14 shows the seq2seq model, where the encoder reads the input sequence character by character and the decoder likewise produces the output sequence character by character, each character conditioning the next. For example, the encoder for Indonesian can accept 28 types of input tokens, and the decoder for Minangkabau generates 31 types of output tokens, including special tokens such as \(\langle bos\rangle \) and \(\langle eos\rangle \). The token \(\langle bos\rangle \) denotes the beginning of a sequence and triggers production of the translated word, while \(\langle eos\rangle \) denotes the end of a sequence and determines when to stop predicting the next character [38]. In Fig. 14, the encoder receives the word “adalah (is)” character by character. The decoder then takes the token \(\langle bos\rangle \) and the context vector from the encoder and outputs “a.” This “a” is fed back into the decoder, which then outputs “d,” and the process continues until the token \(\langle eos\rangle \) is output.
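The following is a minimal PyTorch sketch of this character-based model: a Bi-LSTM encoder over source characters and an LSTM decoder that emits target characters until \(\langle eos\rangle \). The dimensions, token ids, and greedy decoding loop are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: character-based seq2seq with a Bi-LSTM encoder and an LSTM decoder.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 28, 31, 64, 128   # token counts from the text
BOS, EOS = 1, 2                                    # assumed ids for <bos>/<eos>

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, bidirectional=True, batch_first=True)

    def forward(self, src):                 # src: (batch, src_len) of char ids
        _, (h, c) = self.rnn(self.emb(src))
        # Merge the two directions into a single decoder state.
        return (h[0] + h[1]).unsqueeze(0), (c[0] + c[1]).unsqueeze(0)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # An embedding lookup stands in for the one-hot input shown in Fig. 14.
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tok, state):          # tok: (batch, 1), one decoding step
        o, state = self.rnn(self.emb(tok), state)
        return self.out(o), state

def translate(encoder, decoder, src, max_len=30):
    """Greedy decoding for batch size 1: feed each prediction back in."""
    state = encoder(src)
    tok = torch.full((src.size(0), 1), BOS, dtype=torch.long)
    chars = []
    for _ in range(max_len):
        logits, state = decoder(tok, state)
        tok = logits.argmax(-1)             # id of the next character
        if tok.item() == EOS:
            break
        chars.append(tok.item())
    return chars
```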

5.4.2 Byte-Pair Encoding-Based Sequence to Sequence

Fig. 15 Byte-pair encoding-based sequence-to-sequence model: the same Bi-LSTM encoder and LSTM decoder as in Fig. 14, operating on BPE tokens

The second method employs SentencePiece for subword tokenization. SentencePiece builds a subword vocabulary of a specified size using the byte-pair encoding (BPE) segmentation method, which divides words into chunks of characters [13]. BPE starts with a vocabulary consisting of all symbols found in the set of words, then repeatedly merges the two most frequently co-occurring symbols into a new symbol until the vocabulary reaches the specified size [34]. Subword-based tokenization is expected to work because the phonemes of Indonesian ethnic languages are similar, as the languages are closely related, and similar chunks of characters are assigned to them. To explore the appropriate vocabulary size, i.e., the number of the most frequently co-occurring symbols, we applied the BPE-based seq2seq model with various vocabulary sizes. From this perspective, the character-based seq2seq model can be regarded as a special case of the BPE-based seq2seq model with a vocabulary size of 28, the size of the character inventory. As shown in Fig. 15, the input word “adalah (is)” is tokenized by the BPE method in preprocessing, and each token, “a,” “d,” “a,” “la,” and “h,” is input to the encoder. The decoder likewise chooses a token from the built vocabulary one at a time.
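The following is a minimal sketch of training and applying a BPE model with the SentencePiece library; the file names and sample word are illustrative assumptions.

```python
# Sketch: BPE subword tokenization with SentencePiece.
import sentencepiece as spm

# Train a BPE model on a word list (assumed file: one word per line).
spm.SentencePieceTrainer.train(
    input="indonesian_words.txt",
    model_prefix="ind_bpe",
    vocab_size=100,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="ind_bpe.model")
# Tokenize a word into learned subword chunks,
# e.g. ['▁a', 'd', 'a', 'la', 'h']; '▁' marks the word beginning.
print(sp.encode("adalah", out_type=str))
```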

The vocabulary entries other than single letters obtained by BPE with sizes of 40 and 100 are summarized in Table 4. Overall, the same number of entries (7 for size 40 and 68 for size 100) is acquired for both Indonesian and Minangkabau. The symbol “_” indicates the beginning of a word; for example, the difference between “sa” and “_sa” in Minangkabau is that “sa” can occur anywhere in a word. Table 5 shows the tokenization results of “yang and nan (which),” “pada and pado (on),” “adalah and adolah (is),” “segera and sagiro (quick),” and “dasarnya and dasanyo (basically)” with the learned vocabularies.

Table 4 Vocabularies obtained from BPE Indonesian-Minangkabau
Table 5 Example of tokenization BPE with different vocabulary size Indonesian-Minangkabau

5.5 Experiment for Neural-Based Approach

5.5.1 Experimental Settings

We conducted an experiment to find the optimal tokenization method for applying the seq2seq model to a word translation task. The experiment targeted Indonesian as the source language and Minangkabau as the target language, whose language similarity is \(69.14\%\) based on ASJP. The 10,278 translation pairs were split into 8,221 pairs for training and 2,056 pairs for testing.

5.5.2 Experiment Result

As shown in Table 6, the results demonstrate that character-based tokenization outperforms BPE tokenization for the word translation task. The experiment was iterated seven times with different vocabulary sizes, with minimum and maximum sizes of 33 and 300, respectively. The smaller the BPE vocabulary size, the higher the performance; with the minimal size of 33, the performance is approximately the same as that of character-based tokenization. This suggests that the number of output choices per token (the length of the one-hot vector) affects performance more than the number of tokens per word: fewer choices at each step matter more than fewer steps. For example, for “adolah,” character-based tokenization chooses among 31 tokens over 6 steps, whereas BPE-based tokenization with size 300 chooses among 300 tokens over 3 steps.

Table 6 Comparison experiment results

6 Markov-Based Composite Service for Human–Machine Collaboration

This chapter has proposed a crowdsourced method for language resource creation and machine induction methods for language resource augmentation. However, the accuracy of these methods heavily depends on the quality of the input data and the similarity between the target language pairs. When languages are closely related, securing reliable bilingual workers for crowdsourcing becomes more straightforward, reducing the cost of bilingual dictionary creation. Conversely, inducing a bilingual dictionary from less similar languages can result in decreased accuracy; the resulting mistranslations require corrections and increase the overall cost. Therefore, strategic planning for service composition is necessary to determine the optimal combination of two interdependent services, crowdsourced human services and machine induction services, and to prioritize the language pairs to be targeted.

To this end, we have proposed a plan optimizer to produce a feasible optimal plan for creating multiple bilingual dictionaries. Considering uncertainties inherent in constraint-based induction and crowdsourced creation, this optimizer employs a Markov Decision Process (MDP) to decide the most cost-effective bilingual dictionary creation method for each state [25].

6.1 Formalizing Plan Optimization

A Markov Decision Process is commonly used in the services computing domain, especially for modeling workflow composition and optimization with uncertainty [6, 46]. To deal with the inherent uncertainty in the constraint-based bilingual dictionary induction, we model the plan optimization for creating bilingual dictionaries as a directed acyclic graph with the MDP. To apply the MDP to our plan optimization problem, we need to define a set of states (\(s, s' \in S\)), a set of actions (\(a \in A\)), a transition probability distribution \(T(s, a, s')\) representing the likelihood that the process transitions from state s to state \(s'\) upon taking action a, and a cost function \(C(s, a, s')\) that associates a cost with each state transition.

6.1.1 State

In the case of n target languages, the total number of possible language pairs is \(h = \binom{n}{2}\). Each state contains h bilingual dictionaries; each dictionary between languages x and y, denoted by \(d_{(x,y)}\), can take one of four statuses:

  • n: not existing

  • eu: existing, but the dictionary size is below the user’s requested minimum size

  • pu(z): induced by the pivot action with pivot language z, but the dictionary size is below the minimum size

  • s: existing, and the dictionary size satisfies the minimum size.

A state is defined as a combination of the above statuses for each dictionary. In the initial state, every bilingual dictionary takes status n, eu, or s, while in the final state all dictionaries take status s.
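The following is a minimal sketch of this state representation for n = 3 target languages (h = 3 dictionaries); the language names and the chosen initial statuses are illustrative assumptions.

```python
# Sketch: MDP states as status maps over all language pairs.
from itertools import combinations

languages = ["ind", "min", "zlm"]
initial_state = {pair: "n" for pair in combinations(languages, 2)}
initial_state[("ind", "min")] = "eu"   # existing, but below the minimum size
final_state = {pair: "s" for pair in combinations(languages, 2)}
print(initial_state)
print(final_state)
```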

6.1.2 Action

We have two actions to create or augment a dictionary \(d_{(x,y)}\): the pivot action \(a^p_{(x,z,y)}\), where z is the pivot language, and the crowdsourced creation action \(a^i_{(x,y)}\). Both actions aim to change the status of a bilingual dictionary from n, eu, or pu(z) to s. The set of available actions for each state is determined by the following rules.

  • If a dictionary in a state takes status n or eu, it can be augmented by both pivot action and crowdsourced action.

  • If a dictionary in a state takes status pu(z), it can be augmented by only crowdsourced action.

  • If both input dictionaries needed to create a dictionary \(d_{(x,y)}\) exist, i.e., take status s, eu, or pu, pivot action \(a^p_{(x,z,y)}\) is available.

6.1.3 State Transition Probability

An action to create or augment a dictionary transitions from one state to another by updating the status of the target dictionary. A crowdsourced action deterministically decides the next state, as workers can be instructed to create translation pairs until the dictionary satisfies the user's specified minimum size. In contrast, a pivot action non-deterministically decides the next state because the size of the output dictionary depends on the input dictionaries, resulting in either status s or pu(z). Figure 16 illustrates the state transitions triggered by both actions to augment a dictionary \(d_{(1,2)}\) between languages 1 and 2. The crowdsourced action \(a^i_{(1,2)}\) ensures the subsequent state is \(st'_{sat}\), where the status of \(d_{(1,2)}\) is s and the statuses of the other dictionaries remain unchanged from the previous state st. The pivot action \(a^p_{(1,3,2)}\), whose pivot language is language 3, can lead to two potential subsequent states: \(st'_{sat}\) and \(st'_{unsat}\). If the output dictionary size satisfies the minimum criteria, the next state becomes \(st'_{sat}\); otherwise, it transitions to \(st'_{unsat}\), where the status of \(d_{(1,2)}\) is updated to pu(3) and the other dictionaries remain unchanged.

Fig. 16 Example of state transition: from state st, where \(d_{(1,2)}\), \(d_{(1,3)}\), and \(d_{(2,3)}\) have status eu, the pivot action \(a^p_{(1,3,2)}\) leads to \(st'_{unsat}\) (\(d_{(1,2)}\) becomes pu(3)) with probability 0.26 or to \(st'_{sat}\) (\(d_{(1,2)}\) becomes s) with probability 0.74, while the crowdsourced action \(a^i_{(1,2)}\) leads to \(st'_{sat}\) with probability 1

The state transition probability after a pivot action is obtained by estimating the output dictionary size, which is influenced by the sizes of the two input dictionaries used in the pivot action. In practice, we assume the number of translation pair candidates, \(size(d^{c}_{(x,y)})\), to be double the size of the smaller input dictionary, either \(size(d_{(x,z)})\) or \(size(d_{(y,z)})\). By multiplying the number of translation pair candidates by the precision of the pivot action, we can calculate the number of induced translation pairs, \(size(d^{i}_{(x,y)})\).

$$\begin{aligned} size(d^{c}_{(x,y)}) = 2 \times \min \big \{size(d_{(x,z)}), size(d_{(y,z)})\big \} \end{aligned}$$
(14)
$$\begin{aligned} size(d^{i}_{(x,y)}) = precision(a^p_{(x,z,y)}) \times size(d^{c}_{(x,y)}) \end{aligned}$$
(15)

To satisfy the minimum size, the pivot action requires at least the minimum precision k, defined by the following expression.

$$\begin{aligned} k = \frac{minimumSize - size(d_{(x,y)})}{size(d^{c}_{(x,y)})} \end{aligned}$$
(16)

We introduce a beta distribution, parameterized by the language similarity as \(\alpha \) and the topology polysemy as \(\beta \), to model the precision of the pivot action. Using this model, we can calculate the transition probability that the pivot action moves from the current state s to the state \(s'_{unsat}\), where it fails to reach the minimum size. Given the cumulative distribution function \(F(k; \alpha , \beta )\) of the beta distribution, the transition probability is defined as follows.

$$\begin{aligned} T(s,a^p_{(x,z,y)},s'_{unsat}) = F(k;\alpha ,\beta ) = \int _{0}^{k} f(x;\alpha ,\beta ) dx \end{aligned}$$
(17)

In contrast, the state transition probability from the current state s to the state \(s'_{sat}\) where the pivot action successfully satisfies the minimum size is defined as follows.

$$\begin{aligned} T(s,a^p_{(x,z,y)},s'_{sat}) = 1 - F(k;\alpha ,\beta ) = 1 - \int _{0}^{k} f(x;\alpha ,\beta ) dx \end{aligned}$$
(18)
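The following is a minimal sketch of this transition model, implementing Eqs. (14) and (16)-(18) with SciPy's beta distribution; the concrete dictionary sizes and the parameter values passed in the example are illustrative assumptions.

```python
# Sketch: beta-distribution transition probabilities for a pivot action.
from scipy.stats import beta

def pivot_transition_probs(size_xz, size_yz, size_xy, minimum_size, alpha, b):
    size_c = 2 * min(size_xz, size_yz)              # Eq. (14): candidate count
    k = (minimum_size - size_xy) / size_c           # Eq. (16): required precision
    p_unsat = beta.cdf(k, alpha, b)                 # Eq. (17): P(precision < k)
    return {"sat": 1.0 - p_unsat, "unsat": p_unsat} # Eq. (18) for the sat case

# Example: inducing d_(x,y) from scratch via a pivot language, with
# language similarity as alpha and topology polysemy as beta.
print(pivot_transition_probs(size_xz=1500, size_yz=1800, size_xy=0,
                             minimum_size=2000, alpha=0.69, b=3))
```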

6.1.4 Cost

In an MDP, a reward is received after transitioning from one state to another as the result of an action. In bilingual dictionary creation, we instead pay a cost to manually create and evaluate translation pairs, so we regard the cost as a negative reward; rewards and costs are treated as interchangeable in previous MDP studies [41].

In the crowdsourced action, we instruct workers to manually create and evaluate translation pairs until the dictionary reaches the minimum size. The cost of the crowdsourced action \(a^i_{(x, y)}\) from state s to state \(s'\) is the cost of one translation pair, the sum of creationCost and evaluationCost, multiplied by the required number of translation pairs. Estimating the accuracy of the crowdsourced action at 0.8, the cost of the crowdsourced action is defined as follows.

$$\begin{aligned} C(s,a^i_{(x,y)},s') = \frac{minimumSize - size(d_{(x,y)})}{0.8} \times (creationCost + evaluationCost) \end{aligned}$$
(19)

On the other hand, when we already have the input dictionaries needed to induce a new dictionary with a pivot action, we can create translation pairs without cost, i.e., \(creationCost=0\), but still need to pay the cost of evaluating them. The cost of the pivot action \(a^p_{(x,z,y)}\) from state s to state \(s'\) is the evaluation cost of one translation pair, evaluationCost, multiplied by the number of translation pair candidates.

$$\begin{aligned} C(s,a^p_{(x,z,y)},s') = size(d^c_{(x,y)}) \times evaluationCost \end{aligned}$$
(20)
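The following is a minimal sketch of this cost model, implementing Eqs. (19) and (20); the unit costs in the example follow the T2-relative estimates of Sect. 6.2.1 (creation at three times evaluation), while the dictionary sizes are illustrative assumptions.

```python
# Sketch: transition costs for the two action types (Eqs. 19-20).
HUMAN_ACCURACY = 0.8   # estimated accuracy of crowdsourced workers

def crowdsourced_cost(minimum_size, size_xy, creation_cost, evaluation_cost):
    # Eq. (19): pay creation + evaluation per pair, inflated by worker accuracy.
    return (minimum_size - size_xy) / HUMAN_ACCURACY * (creation_cost + evaluation_cost)

def pivot_cost(size_candidates, evaluation_cost):
    # Eq. (20): induced candidates cost nothing to create but must all be evaluated.
    return size_candidates * evaluation_cost

print(crowdsourced_cost(2000, 0, creation_cost=3, evaluation_cost=1))  # 10000 units
print(pivot_cost(3000, evaluation_cost=1))                             # 3000 units
```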
Table 7 Similarity matrix of the target languages

6.2 Experiment

To evaluate the MDP-based plan optimizer for bilingual dictionary creation, we conducted an experiment under the Indonesia language sphere project [19]. Since our pivot-based bilingual dictionary induction method works better on closely related languages, we targeted Indonesian, Malay, and Minangkabau, whose language similarities are high, as shown in Table 7. We also selected Javanese and Sundanese, considering their speaker populations. We thus targeted five languages, Indonesian (ind), Malay (zlm), Minangkabau (min), Javanese (jav), and Sundanese (sun), and created or augmented 10 dictionaries, one for every combination of the target languages. The user-specified minimum size is 2,000 translation pairs, i.e., \(minimumSize=2{,}000\). We also decided the costs of creating and evaluating translation pairs based on the availability of native speakers.

6.2.1 Modeling Task for Native Speaker

We have two types of tasks for native speakers: a creation task and an evaluation task. Even though Indonesia is a multiethnic country where various ethnic peoples coexist, it is difficult to recruit a bilingual native speaker of two ethnic languages because ethnic languages are not taught in school; only Indonesian, the national language of Indonesia, is commonly used in education. To overcome this limitation, \(s_{(ind,x)}\), a native bilingual speaker of Indonesian and ethnic language x, and \(s_{(ind,y)}\), a native bilingual speaker of Indonesian and ethnic language y, collaboratively create and evaluate translation pairs by communicating the senses in Indonesian. Considering this collaboration, we classify the native speakers' tasks into four: an individual creation task T1(ind, x) and an individual evaluation task T2(ind, x) of a bilingual dictionary \(d_{(ind, x)}\), and a collaborative creation task T3(x, y) and a collaborative evaluation task T4(x, y) of a bilingual dictionary \(d_{(x,y)}\) between ethnic languages x and y.

Based on preliminary experiments, we estimated the creation and evaluation costs per translation pair in units of the time taken for the evaluation task T2(ind, x), the simplest task. The cost of the creation task T1(ind, x) is estimated at three times the cost of its evaluation task T2(ind, x). The costs of the creation task T3(x, y) and the evaluation task T4(x, y) are estimated at eight times and four times the cost of T2(ind, x), respectively, when they require the collaboration of two native speakers; otherwise, they are six times and two times the cost of T2(ind, x).

To ensure the quality of the manually created bilingual dictionary \(d_{(ind,x)}\), the created translation pairs should be evaluated by a different native bilingual speaker \(s_{(ind,x)}\). We pay only for correct translation pairs, to motivate workers to create translation pairs carefully. In this way, we couple a creation task and an evaluation task into two composite tasks: CT1(ind, x), consisting of T1(ind, x) and T2(ind, x) between Indonesian and an ethnic language, and CT2(x, ind, y), consisting of T3(x, ind, y) and T4(x, ind, y) between ethnic languages via Indonesian as the pivot language.

Table 8 Estimated cost of all-crowdsourced action plan

6.2.2 Estimated Plans

To show the effectiveness of our method, we compare the MDP optimal plan with an all-crowdsourced action plan as a baseline. The estimated cost of the baseline, summarized in Table 8, is derived from the total number of translation pairs manually created and evaluated by workers. Taking the accuracy of the crowdsourced action as 0.8 and paying nothing for wrong translation pairs, the total cost is the cost of creating the required number of correct translation pairs plus the cost of evaluating all created translation pairs, including wrong ones. The number of all created translation pairs is the required number divided by the accuracy, 0.8.

On the other hand, we generated the MDP optimal plan by modeling the pivot action precision with prior beta distributions, employing the language similarities in Table 7 as the \(\alpha \) parameter and a topology polysemy, 3 in practice, as the \(\beta \) parameter. The generated optimal plan and its estimated cost are summarized in Table 9; the plan column indicates the task order in the plan. The cost calculation for the crowdsourced actions in this plan, CT1, is the same as in the all-crowdsourced plan, while the cost of the pivot actions is estimated from only the number of induced translation pairs to be evaluated.

Table 9 Estimated cost of the MDP optimal plan

6.2.3 Experiment Result

To validate the MDP optimal plan in Table 9, we conducted a real experiment in Indonesia in collaboration with the Islamic University of Riau and Telkom University. In this experiment, 34 native speakers, consisting of 5 Minangkabau, 8 Malay, 9 Javanese, and 12 Sundanese speakers, participated as crowd workers. The real costs are summarized in Table 10.

Table 10 Real cost of the MDP optimal plan

This result shows that the MDP optimal plan outperformed the all-crowdsourced plan with a 42% cost reduction, and the real cost of the optimal plan is within a 3% margin of the estimated cost in Table 9. Furthermore, our estimates of human accuracy (0.8) and topology polysemy were validated: in the real experiment, the average human accuracy was 0.837 and the average topology polysemy was 2.958.

6.3 Discussion

The current plan optimization algorithm is offline: it generates the optimal policy beforehand from approximate models. As a result, the optimal plan can become sub-optimal after a few actions have been taken. For instance, although all the pivot actions in the MDP optimal plan shown in Table 9 were expected to satisfy the required number of translation pairs, five of the six pivot actions in the real experiment failed to do so, despite pivot action precision higher than the beta distribution-based estimate. This failure is caused by the low accuracy of estimating the number of translation pair candidates and can produce a gap between the estimated and real costs.

One possible solution is to change the offline algorithm into an online one by recursively re-formalizing the planning problem with newly acquired information on the environment, such as the number of translation pair candidates and the number of created correct translation pairs, after executing each action. This would allow the plan optimizer to adapt to a dynamic and uncertain environment.

7 Conclusion

To create a multilingual service platform for smart cities, it is necessary to collect language resources in low-resource languages as well as high-resource languages for language equality. However, existing multilingual service platforms mainly target official languages rather than ethnic languages, which are spoken more in Asia than in Europe, because fewer resources exist for ethnic languages. Multiethnic countries such as Indonesia require a multilingual service platform that supports their ethnic languages. This chapter focused on creating language resources in ethnic languages by combining crowdsourced human services and automatic machine services.

Crowdsourcing is widely adopted to create language resources when there is little data on the Web. By introducing hyper-questions, which aggregate answers from crowd workers, into the crowdsourced workflow, this chapter aimed to improve evaluation accuracy even when less reliable workers form the majority and to preferentially assign creation tasks to highly reliable workers. The proposed workflow has been demonstrated to achieve higher accuracy than existing methods regardless of the ratio of less reliable workers.

Additionally, induction methods are employed to acquire language resources from large amounts of data. To compensate for the lack of data for ethnic languages, this chapter utilized similarities between ethnic languages, such as cognates. Assuming that cognates maintain several common senses and have similar spellings, we filtered mistranslation pairs from candidates with constraint optimization techniques and obtained spelling transformation rules between cognates with neural networks. The proposed methods have been empirically shown to achieve higher recall than existing methods.

Moreover, to optimally combine crowdsourced creation and machine induction of language resources, this chapter modeled resource creation planning as a Markov Decision Process (MDP). The MDP computes the optimal policy that decides which action, manual creation or machine induction, is best to minimize the total cost. The experiments showed that the proposed planning method significantly reduced the total cost compared to entirely manual creation, with the real cost close to the estimate.