Bridging Natural Language and Graphical User Interfaces

Artificial Intelligence for Human Computer Interaction: A Modern Approach

Part of the book series: Human–Computer Interaction Series (HCIS)


Abstract

“Language as symbolic action” (https://en.wikipedia.org/wiki/Kenneth_Burke) has a natural connection with the direct-manipulation interaction (e.g., via a GUI or a physical appliance) that is common on modern computing devices such as smartphones. In this chapter, we present our efforts to bridge the gap between natural language and graphical user interfaces, which can potentially enable a broad category of interaction scenarios. Specifically, we develop datasets and deep learning models that ground natural language instructions or commands into executable actions on GUIs, and, in the other direction, generate natural language descriptions of user interface elements so that a user knows how to control them through language. These projects represent research efforts at the intersection of Natural Language Processing (NLP) and HCI, and they produce datasets and open-source code that lay a foundation for future research in the area.
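
To make the grounding task described above concrete, the sketch below maps a single-step command and the widgets on a screen to an executable (action, element) pair. This is a toy lexical-overlap heuristic written only for illustration, not the chapter's seq2act model; the Element class, the ground function, and the action names are hypothetical.

    # A toy, hypothetical sketch of the grounding task: map a one-step command
    # and the widgets on a screen to an executable (action, element) pair.
    # This is NOT the chapter's seq2act model; it is a simple lexical-overlap
    # heuristic meant only to make the task framing concrete.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Element:
        element_id: int
        text: str  # visible text or content description of the widget

    def ground(command: str, screen: List[Element]) -> Tuple[str, int]:
        """Return an (action, element_id) pair for a single-step command."""
        action = "click"
        if any(w in command.lower() for w in ("type", "enter", "input")):
            action = "type"
        words = set(command.lower().split())
        # Pick the element whose text overlaps most with the command words.
        best = max(screen, key=lambda e: len(words & set(e.text.lower().split())))
        return action, best.element_id

    # Example usage (hypothetical screen):
    # ground("turn on airplane mode",
    #        [Element(0, "Wi-Fi"), Element(1, "Airplane mode")])
    # -> ("click", 1)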

Notes

  1. Our data pipeline is available at https://github.com/google-research/google-research/tree/master/seq2act.

  2. https://developer.android.com/reference/android/view/View.html

  3. https://support.google.com/pixelphone.

  4. Our model code is released at https://github.com/google-research/google-research/tree/master/seq2act.

  5. While it is possible to directly use screen visual data for grounding, detecting UI objects from raw pixels is nontrivial. It would be ideal to use both structural and visual data.

  6. We use widgets and elements interchangeably.

  7. Our dataset is released at https://github.com/google-research-datasets/widget-caption.

  8. Our model code is released at https://github.com/google-research/google-research/tree/master/widget-caption.

  9. https://developer.android.com/reference/android/view/View.

  10. https://developer.android.com/guide/topics/ui/accessibility/apps.

  11. https://www.mturk.com.


Acknowledgements

We would like to thank Jiacong He, Yuan Zhang, Jason Baldridge, Song Wang, Justin Cui, Christina Ou, Luheng He, Jingjie Zheng, Hong Li, Zhiwei Guan, Ashwin Kakarla, and Muqthar Mohammad who contributed to these projects.

Author information

Corresponding author

Correspondence to Yang Li.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Li, Y., Zhou, X., Li, G. (2021). Bridging Natural Language and Graphical User Interfaces. In: Li, Y., Hilliges, O. (eds) Artificial Intelligence for Human Computer Interaction: A Modern Approach. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-030-82681-9_14

  • DOI: https://doi.org/10.1007/978-3-030-82681-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82680-2

  • Online ISBN: 978-3-030-82681-9

  • eBook Packages: Computer Science, Computer Science (R0)
