Machine Translation, Volume 28, Issue 1, pp 1–17

A conjoint analysis framework for evaluating user preferences in machine translation

  • Katrin Kirchhoff
  • Daniel Capurro
  • Anne M. Turner


Despite much research on machine translation (MT) evaluation, there is surprisingly little work that directly measures users’ intuitive or emotional preferences regarding different types of MT errors. However, the elicitation and modeling of user preferences is an important prerequisite for research on user adaptation and customization of MT engines. In this paper we explore the use of conjoint analysis as a formal quantitative framework to assess users’ relative preferences for different types of translation errors. We apply our approach to the analysis of MT output from translating public health documents from English into Spanish. Our results indicate that word order errors are clearly the most dispreferred error type, followed by word sense, morphological, and function word errors. The conjoint analysis-based model is able to predict user preferences more accurately than a baseline model that chooses the translation with the fewest errors overall. Additionally, we analyze the effect of using a crowd-sourced respondent population versus a sample of domain experts, and observe that the main preference effects are remarkably stable across the two samples.
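The kind of choice-based preference elicitation described in the abstract is commonly modeled with a conditional (multinomial) logit over choice sets, where each candidate translation is described by its counts of the four error types and users pick one alternative per set. The following is a minimal sketch of that idea on synthetic data, not the authors' implementation: the "true" error weights are invented purely for illustration (chosen so that word order errors carry the largest disutility, mirroring the reported ordering), and the fitting routine is plain gradient ascent on the log-likelihood.

```python
import math
import random

# Error types analyzed in the paper; the "true" disutilities below are
# invented for this synthetic illustration (ordering mirrors the reported
# finding that word order errors are most dispreferred).
ERROR_TYPES = ["word_order", "word_sense", "morphology", "function_word"]
TRUE_W = [2.0, 1.2, 0.8, 0.4]

def utility(weights, counts):
    """Linear part-worth utility: each error lowers utility by its weight."""
    return -sum(w * c for w, c in zip(weights, counts))

def choice_probs(weights, alternatives):
    """Conditional-logit choice probabilities for one choice set
    (numerically stabilized softmax over alternative utilities)."""
    us = [utility(weights, a) for a in alternatives]
    m = max(us)
    exps = [math.exp(u - m) for u in us]
    z = sum(exps)
    return [e / z for e in exps]

def simulate(n_sets, rng):
    """Generate choice sets of 3 candidate translations, each described by
    random error counts, and sample a 'user choice' from the true model."""
    data = []
    for _ in range(n_sets):
        alts = [[rng.randint(0, 3) for _ in ERROR_TYPES] for _ in range(3)]
        probs = choice_probs(TRUE_W, alts)
        r, cum, chosen = rng.random(), 0.0, len(probs) - 1
        for i, p in enumerate(probs):
            cum += p
            if r < cum:
                chosen = i
                break
        data.append((alts, chosen))
    return data

def fit(data, steps=500, lr=0.5):
    """Maximum-likelihood estimation of the error weights by gradient ascent.

    Per choice set:  d logL / d w_k = -x[chosen][k] + sum_j p_j * x[j][k]
    """
    w = [0.0] * len(ERROR_TYPES)
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for alts, chosen in data:
            probs = choice_probs(w, alts)
            for k in range(len(w)):
                grad[k] += -alts[chosen][k] + sum(
                    p * a[k] for p, a in zip(probs, alts))
        w = [wk + lr * g / n for wk, g in zip(w, grad)]
    return w

rng = random.Random(0)
w_hat = fit(simulate(500, rng))
print({t: round(wk, 2) for t, wk in zip(ERROR_TYPES, w_hat)})
```

With enough simulated choice sets, the recovered weights reproduce the ordering built into the synthetic data, which is the sense in which conjoint analysis turns pairwise or set-wise choices into quantitative per-error-type disutilities.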


Keywords: Machine translation · Evaluation · User modeling · Preference elicitation



We are grateful to Aurora Salvador Sanchis and Lorena Ruiz Marcos for providing the error annotations and corrections, to Megumu Brownstein for recruiting the domain experts, and to Kate Cole for comments on an earlier draft of this paper. This study was funded by Grant #1R01LM010811-01 from the National Library of Medicine (NLM). Its content is solely the responsibility of the authors and does not necessarily represent the view of the NLM.



Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Katrin Kirchhoff (1)
  • Daniel Capurro (2)
  • Anne M. Turner (3, 4)

  1. Department of Electrical Engineering, University of Washington, Seattle, USA
  2. Department of Internal Medicine, Pontificia Universidad Católica de Chile, Santiago, Chile
  3. Department of Health Services, University of Washington, Seattle, USA
  4. Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, USA
