
Demonstration + Natural Language: Multimodal Interfaces for GUI-Based Interactive Task Learning Agents

Artificial Intelligence for Human Computer Interaction: A Modern Approach


We summarize our past five years of work on designing, building, and studying Sugilite, an interactive task learning agent that learns new tasks and their associated concepts interactively from the user’s natural language instructions and demonstrations, leveraging the graphical user interfaces (GUIs) of third-party mobile apps. Through its multi-modal and mixed-initiative approaches to human-AI interaction, Sugilite made important contributions to improving the usability, applicability, generalizability, flexibility, robustness, and shareability of interactive task learning agents. Sugilite also represents a new human-AI interaction paradigm for interactive task learning, in which existing app GUIs serve as a medium through which users communicate their intents to an AI agent, rather than as the interfaces through which users interact with the underlying computing services. In this chapter, we describe the Sugilite system, explain the design and implementation of its key features, and show a prototype in the form of a conversational assistant on Android.



  1. Sugilite is named after a purple gemstone, and stands for: Smartphone Users Generating Intelligent Likeable Interfaces Through Examples.

  2. A demo video is available at

  3.

  4. Sovite is named after a type of rock. It is also an acronym for System for Optimizing Voice Interfaces to Tackle Errors.

  5. Available at:

  6. Available at:

  7. Since the next screen is always within the same app and therefore shares an app description embedding, the prediction task favors having information about the specific app (i.e., the app store description embedding) dominate the embedding.


  1. Adar E, Dontcheva M, Laput G (2014) CommandSpace: modeling the relationships between tasks, descriptions and features. In: Proceedings of the 27th annual ACM symposium on user interface software and technology, UIST ’14, pp 167–176. ACM, New York, NY, USA.

  2. Alharbi K, Yeh T (2015) Collect, decompile, extract, stats, and diff: mining design pattern changes in android apps. In: Proceedings of the 17th international conference on human-computer interaction with mobile devices and services, MobileHCI ’15, pp 515–524. ACM, New York, NY, USA.

  3. Allen J, Chambers N, Ferguson G, Galescu L, Jung H, Swift M, Taysom W (2007) PLOW: a collaborative task learning agent. In: Proceedings of the 22Nd national conference on artificial intelligence - volume 2, AAAI’07, pp 1514–1519. AAAI Press, Vancouver, British Columbia, Canada

    Google Scholar 

  4. Allen JF, Guinn CI, Horvtz E (1999) Mixed-initiative interaction. IEEE Intell Syst Appl 14(5):14–23

    Article  Google Scholar 

  5. Amazon: Alexa Design Guide (2020).

  6. Antila V, Polet J, Lämsä A, Liikka J (2012) RoutineMaker: towards end-user automation of daily routines using smartphones. In: 2012 IEEE international conference on pervasive computing and communications workshops (PERCOM workshops), pp 399–402.

  7. Argall BD, Chernova S, Veloso M, Browning B (2009) A survey of robot learning from demonstration. Robot Auton Syst 57(5):469–483.

  8. Ashktorab Z, Jain M, Liao QV, Weisz JD (2019) Resilient chatbots: repair strategy preferences for conversational breakdowns. In: Proceedings of the 2019 CHI conference on human factors in computing systems, p 254. ACM

    Google Scholar 

  9. Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z (2007) Dbpedia: a nucleus for a web of open data. The semantic web, pp 722–735.

  10. Azaria A, Krishnamurthy J, Mitchell TM (2016) Instructable intelligent personal agent. In: Proceedings of the 30th AAAI conference on artificial intelligence (AAAI), vol 4

    Google Scholar 

  11. Ballard BW, Biermann AW (1979) Programming in natural language “NLC” as a prototype. In: Proceedings of the 1979 annual conference, ACM ’79, pp 228–237. ACM, New York, NY, USA.

  12. Banovic N, Grossman T, Matejka J, Fitzmaurice G (2012) Waken: reverse engineering usage information and interface structure from software videos. In: Proceedings of the 25th annual ACM symposium on user interface software and technology, UIST ’12, pp 83–92. ACM, New York, NY, USA.

  13. Barman S, Chasins S, Bodik R, Gulwani S (2016) Ringer: web automation by demonstration. In: Proceedings of the 2016 ACM SIGPLAN international conference on object-oriented programming, systems, languages, and applications, OOPSLA 2016, pp 748–764. ACM, New York, NY, USA.

  14. Beneteau E, Richards OK, Zhang M, Kientz JA, Yip J, Hiniker A (2019) Communication breakdowns between families and alexa. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19, pp 243:1–243:13. ACM, New York, NY, USA.

  15. Bentley F, Luvogt C, Silverman M, Wirasinghe R, White B, Lottridge D (2018) Understanding the long-term use of smart speaker assistants. Proc ACM Interact Mob Wearable Ubiquitous Technol 2(3).

  16. Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on freebase from question-answer pairs. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 1533–1544

    Google Scholar 

  17. Bergman L, Castelli V, Lau T, Oblinger D (2005) DocWizards: a system for authoring follow-me documentation wizards. In: Proceedings of the 18th annual ACM symposium on user interface software and technology, UIST ’05, pp 191–200. ACM, New York, NY, USA.

  18. Biermann AW (1983) Natural Language Programming. In: Biermann AW, Guiho G (eds) Computer program synthesis methodologies, NATO advanced study institutes series. Springer, Netherlands, pp 335–368

    Google Scholar 

  19. Bigham JP, Lau T, Nichols J (2009) Trailblazer: enabling blind users to blaze trails through the web. In: Proceedings of the 14th international conference on intelligent user interfaces, IUI ’09, pp 177–186. ACM, New York, NY, USA.

  20. Billard A, Calinon S, Dillmann R, Schaal S (2008) Robot programming by demonstration. In: Springer handbook of robotics, pp 1371–1394. Springer.

  21. Bohus D, Rudnicky AI (2005) Sorry, I didn’t catch that!-An investigation of non-understanding errors and recovery strategies. In: 6th SIGdial workshop on discourse and dialogue

    Google Scholar 

  22. Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp 1247–1250. ACM.

  23. Bolt RA (1980) “Put-that-there”: voice and gesture at the graphics interface. In: Proceedings of the 7th annual conference on computer graphics and interactive techniques, SIGGRAPH ’80, pp 262–270. ACM, New York, NY, USA

    Google Scholar 

  24. Bosselut A, Rashkin H, Sap M, Malaviya C, Celikyilmaz A, Choi Y (2019) COMET: commonsense transformers for automatic knowledge graph construction. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 4762–4779. ACL, Florence, Italy.

  25. Brennan SE (1991) Conversation with and through computers. User Model User-Adap Int 1(1):67–86.

  26. Brennan SE (1998) The grounding problem in conversations with and through computers. Social and cognitive approaches to interpersonal communication, pp 201–225

    Google Scholar 

  27. Böhmer M, Hecht B, Schöning J, Krüger A, Bauer G (2011) Falling asleep with angry birds, facebook and kindle: a large scale study on mobile application usage. In: Proceedings of the 13th international conference on human computer interaction with mobile devices and services, MobileHCI ’11, pp 47–56. ACM, New York, NY, USA.

  28. Chai JY, Gao Q, She L, Yang S, Saba-Sadiya S, Xu G (2018) Language to action: towards interactive task learning with physical agents. In: IJCAI, pp 2–9

    Google Scholar 

  29. Chandramouli V, Chakraborty A, Navda V, Guha S, Padmanabhan V, Ramjee R (2015) Insider: towards breaking down mobile app silos. In: TRIOS workshop held in conjunction with the SIGOPS SOSP 2015

    Google Scholar 

  30. Chen F, Xia K, Dhabalia K, Hong JI (2019) Messageontap: a suggestive interface to facilitate messaging-related tasks. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19. ACM, New York, NY, USA.

  31. Chen J, Chen C, Xing Z, Xu X, Zhu L, Li G, Wang J (2020) Unblind your apps: predicting natural-language labels for mobile gui components by deep learning. In: Proceedings of the 42nd international conference on software engineering, ICSE ’20

    Google Scholar 

  32. Chen JH, Weld DS (2008) Recovering from errors during programming by demonstration. In: Proceedings of the 13th international conference on intelligent user interfaces, IUI ’08, pp 159–168. ACM, New York, NY, USA.

  33. Chkroun M, Azaria A (2019) Lia: a virtual assistant that can be taught new commands by speech. Int J Hum–Comput Interact 1–12

    Google Scholar 

  34. Cho J, Rader E (2020) The role of conversational grounding in supporting symbiosis between people and digital assistants. Proc ACM Hum-Comput Interact 4(CSCW1)

    Google Scholar 

  35. Clark HH, Brennan SE (1991) Grounding in communication. In: Perspectives on socially shared cognition, pp 127–149. APA, Washington, DC, US.

  36. Cowan BR, Pantidi N, Coyle D, Morrissey K, Clarke P, Al-Shehri S, Earley D, Bandeira N (2017) “what can i help you with?”: Infrequent users’ experiences of intelligent personal assistants. In: Proceedings of the 19th international conference on human-computer interaction with mobile devices and services, MobileHCI ’17, pp 43:1–43:12. ACM, New York, NY, USA.

  37. Cypher A, Halbert DC (1993) Watch what I do: programming by demonstration. MIT Press

    Google Scholar 

  38. Deka B, Huang Z, Franzen C, Hibschman J, Afergan D, Li Y, Nichols J, Kumar R (2017) Rico: a mobile app dataset for building data-driven design applications. In: Proceedings of the 30th annual ACM symposium on user interface software and technology, UIST ’17, pp 845–854. ACM, New York, NY, USA.

  39. Deka B, Huang Z, Kumar R (2016) ERICA: interaction mining mobile apps. In: Proceedings of the 29th annual symposium on user interface software and technology, UIST ’16, pp 767–776. ACM, New York, NY, USA.

  40. Dixon M, Fogarty J (2010) Prefab: implementing advanced behaviors using pixel-based reverse engineering of interface structure. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’10, pp 1525–1534. ACM, New York, NY, USA.

  41. Dixon M, Leventhal D, Fogarty J (2011) Content and hierarchy in pixel-based methods for reverse engineering interface structure. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’11, pp 969–978. ACM, New York, NY, USA.

  42. Dixon M, Nied A, Fogarty J (2014) Prefab layers and prefab annotations: extensible pixel-based interpretation of graphical interfaces. In: Proceedings of the 27th annual ACM symposium on user interface software and technology, UIST ’14, pp 221–230. ACM, New York, NY, USA.

  43. Fast E, Chen B, Mendelsohn J, Bassen J, Bernstein MS (2018) Iris: a conversational agent for complex tasks. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18, pp 473:1–473:12. ACM, New York, NY, USA.

  44. Gao X, Gong R, Zhao Y, Wang S, Shu T, Zhu SC (2020) Joint mind modeling for explanation generation in complex human-robot collaborative tasks. In: 2020 29th IEEE international conference on robot and human interactive communication (RO-MAN), pp 1119–1126. IEEE

    Google Scholar 

  45. Gluck KA, Laird JE (2019) Interactive task learning: humans, robots, and agents acquiring new tasks through natural interactions, vol 26. MIT Press

    Google Scholar 

  46. Green TR (1989) Cognitive dimensions of notations. People and Computers V pp 443–460.

  47. Grudin J, Jacques R (2019) Chatbots, humbots, and the quest for artificial general intelligence. In: Proceedings of the 2019 CHI conference on human factors in computing systems, pp 1–11

    Google Scholar 

  48. Guo A, Kong J, Rivera M, Xu FF, Bigham JP (2019) StateLens: a reverse engineering solution for making existing dynamic touchscreens accessible. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology (UIST 2019), p 15

    Google Scholar 

  49. Gur I, Yavuz S, Su Y, Yan X (2018) DialSQL: dialogue based structured query generation. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), pp 1339–1349. ACL, Melbourne, Australia.

  50. Hartmann B, Wu L, Collins K, Klemmer SR (2007) Programming by a sample: rapidly creating web applications with d.mix. In: Proceedings of the 20th annual ACM symposium on user interface software and technology, UIST ’07, pp 241–250. ACM, New York, NY, USA.

  51. Horvitz E (1999) Principles of mixed-initiative user interfaces. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’99, pp 159–166. ACM, New York, NY, USA.

  52. Huang F, Canny JF, Nichols J (2019) Swire: sketch-based user interface retrieval. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19, pp 1–10. ACM, New York, NY, USA.

  53. Huang THK, Azaria A, Bigham JP (2016) InstructableCrowd: creating IF-THEN rules via conversations with the crowd, pp 1555–1562. ACM Press.

  54. Hutchins EL, Hollan JD, Norman DA (1986) Direct manipulation interfaces

    Google Scholar 

  55. Iba S, Paredis CJJ, Khosla PK (2005) Interactive multimodal robot programming. Int J Robot Res 24(1):83–104.

  56. IFTTT (2016) IFTTT: connects the apps you love.

  57. Intharah T, Turmukhambetov D, Brostow GJ (2019) Hilc: domain-independent pbd system via computer vision and follow-up questions. ACM Trans Interact Intell Syst 9(2-3):16:1–16:27.

  58. Jain M, Kumar P, Kota R, Patel SN (2018) Evaluating and informing the design of chatbots. In: Proceedings of the 2018 designing interactive systems conference, pp 895–906. ACM

    Google Scholar 

  59. Jiang J, Jeng W, He D (2013) How do users respond to voice input errors?: lexical and phonetic query reformulation in voice search. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, pp 143–152. ACM

    Google Scholar 

  60. Kasturi T, Jin H, Pappu A, Lee S, Harrison B, Murthy R, Stent A (2015) The cohort and speechify libraries for rapid construction of speech enabled applications for android. In: Proceedings of the 16th annual meeting of the special interest group on discourse and dialogue, pp 441–443

    Google Scholar 

  61. Kate RJ, Wong YW, Mooney RJ (2005) Learning to transform natural to formal languages. In: Proceedings of the 20th national conference on artificial intelligence - volume 3, AAAI’05, pp 1062–1068. AAAI Press, Pittsburgh, Pennsylvania.

  62. Kim D, Park S, Ko J, Ko SY, Lee SJ (2019) X-droid: a quick and easy android prototyping framework with a single-app illusion. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology, UIST ’19, pp 95–108. ACM, New York, NY, USA.

  63. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

  64. Kirk J, Mininger A, Laird J (2016) Learning task goals interactively with visual demonstrations. Biol Inspired Cogn Archit 18:1–8

    Google Scholar 

  65. Ko AJ, Abraham R, Beckwith L, Blackwell A, Burnett M, Erwig M, Scaffidi C, Lawrance J, Lieberman H, Myers B, Rosson MB, Rothermel G, Shaw M, Wiedenbeck S (2011) The state of the art in end-user software engineering. ACM Comput Surv 43(3), 21:1–21:44.

  66. Kumar R, Satyanarayan A, Torres C, Lim M, Ahmad S, Klemmer SR, Talton JO (2013) Webzeitgeist: design mining the web. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’13, pp 3083–3092. ACM, New York, NY, USA.

  67. Kurihara K, Goto M, Ogata J, Igarashi T (2006) Speech pen: predictive handwriting based on ambient multimodal recognition. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 851–860. ACM

    Google Scholar 

  68. Labutov I, Srivastava S, Mitchell T (2018) Lia: a natural language programmable personal assistant. In: Proceedings of the 2018 conference on empirical methods in natural language processing: system demonstrations, pp 145–150

    Google Scholar 

  69. Laird JE, Gluck K, Anderson J, Forbus KD, Jenkins OC, Lebiere C, Salvucci D, Scheutz M, Thomaz A, Trafton G, Wray RE, Mohan S, Kirk JR (2017) Interactive task learning. IEEE Intell Syst 32(4):6–21.

    Article  Google Scholar 

  70. Laput GP, Dontcheva M, Wilensky G, Chang W, Agarwala A, Linder J, Adar E (2013) PixelTone: a multimodal interface for image editing. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’13, pp 2185–2194. ACM, New York, NY, USA.

  71. Lau T (2009) Why programming-by-demonstration systems fail: lessons learned for usable AI. AI Mag 30(4):65–67.

  72. Lee C, Kim S, Han D, Yang H, Park YW, Kwon BC, Ko S (2020) Guicomp: a gui design assistant with real-time, multi-faceted feedback. In: Proceedings of the 2020 CHI conference on human factors in computing systems, CHI ’20, pp 1–13. ACM, New York, NY, USA.

  73. Lee HY, Yang W, Jiang L, Le M, Essa I, Gong H, Yang MH (2020) Neural design network: graphic layout generation with constraints. In: European conference on computer vision (ECCV)

    Google Scholar 

  74. Lee TY, Dugan C, Bederson BB (2017) Towards understanding human mistakes of programming by example: an online user study. In: Proceedings of the 22nd international conference on intelligent user interfaces, IUI ’17, pp 257–261. ACM, New York, NY, USA.

  75. Leshed G, Haber EM, Matthews T, Lau T (2008) CoScripter: automating & sharing how-to knowledge in the enterprise. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’08, pp 1719–1728. ACM, New York, NY, USA.

  76. Li F, Jagadish HV (2014) Constructing an interactive natural language interface for relational databases. Proc VLDB Endow 8(1):73–84.

  77. Li H, Wang YP, Yin J, Tan G (2019) Smartshell: automated shell scripts synthesis from natural language. Int J Softw Eng Knowl Eng 29(02):197–220

    Article  Google Scholar 

  78. Li I, Nichols J, Lau T, Drews C, Cypher A (2010) Here’s What I Did: sharing and reusing web activity with ActionShot. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’10, pp 723–732. ACM, New York, NY, USA.

  79. Li J, Yang J, Hertzmann A, Zhang J, Xu T (2019) Layoutgan: synthesizing graphic layouts with vector-wireframe adversarial networks. IEEE Trans Pattern Anal Mach Intell

    Google Scholar 

  80. Li TJJ, Azaria A, Myers BA (2017) SUGILITE: creating multimodal smartphone automation by demonstration. In: Proceedings of the 2017 CHI conference on human factors in computing systems, CHI ’17, pp 6038–6049. ACM, New York, NY, USA.

  81. Li TJJ, Chen J, Canfield B, Myers BA (2020) Privacy-preserving script sharing in gui-based programming-by-demonstration systems. Proc ACM Hum-Comput Interact 4(CSCW1).

  82. Li TJJ, Chen J, Xia H, Mitchell TM, Myers BA (2020) Multi-modal repairs of conversational breakdowns in task-oriented dialogs. In: Proceedings of the 33rd annual ACM symposium on user interface software and technology, UIST 2020. ACM.

  83. Li TJJ, Hecht B (2014) WikiBrain: making computer programs smarter with knowledge from wikipedia

    Google Scholar 

  84. Li TJJ, Labutov I, Li XN, Zhang X, Shi W, Mitchell TM, Myers BA (2018) APPINITE: a multi-modal interface for specifying data descriptions in programming by demonstration using verbal instructions. In: Proceedings of the 2018 IEEE symposium on visual languages and human-centric computing (VL/HCC 2018)

    Google Scholar 

  85. Li TJJ, Labutov I, Myers BA, Azaria A, Rudnicky AI, Mitchell TM (2018) Teaching agents when they fail: end user development in goal-oriented conversational agents. In: Studies in conversational UX design. Springer

    Google Scholar 

  86. Li TJJ, Li Y, Chen F, Myers BA (2017) Programming IoT devices by demonstration using mobile apps. In: Barbosa S, Markopoulos P, Paterno F, Stumpf S, Valtolina S (eds) End-user development. Springer, Cham, pp 3–17

    Chapter  Google Scholar 

  87. Li TJJ, Popowski L, Mitchell TM, Myers BA (2021) Screen2vec: semantic embedding of gui screens and gui components. In: Proceedings of the 2021 CHI conference on human factors in computing systems, CHI ’21. ACM

    Google Scholar 

  88. Li TJJ, Radensky M, Jia J, Singarajah K, Mitchell TM, Myers BA (2019) PUMICE: a multi-modal agent that learns concepts and conditionals from natural language and demonstrations. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology (UIST 2019), UIST 2019. ACM.

  89. Li TJJ, Riva O (2018) KITE: building conversational bots from mobile apps. In: Proceedings of the 16th ACM international conference on mobile systems, applications, and services (MobiSys 2018). ACM

    Google Scholar 

  90. Li Y, He J, Zhou X, Zhang Y, Baldridge J (2020) Mapping natural language instructions to mobile UI action sequences. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8198–8210. ACL, Online.

  91. Li Y, Li G, He L, Zheng J, Li H, Guan Z (2020) Widget captioning: generating natural language description for mobile user interface elements. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 5495–5510. ACL, Online.

  92. Liang P (2016) Learning executable semantic parsers for natural language understanding. Commun ACM 59(9):68–76

    Article  Google Scholar 

  93. Liang P, Jordan MI, Klein D (2013) Learning dependency-based compositional semantics. Comput Linguist 39(2):389–446

    Article  MathSciNet  Google Scholar 

  94. Lieberman H (2001) Your wish is my command: programming by example. Morgan Kaufmann

    Google Scholar 

  95. Lieberman H, Liu H (2006) Feasibility studies for programming in natural language. In: End user development, pp 459–473. Springer

    Google Scholar 

  96. Lieberman H, Maulsby D (1996) Instructible agents: software that just keeps getting better. IBM Syst J 35(3.4):539–556.

  97. Lin J, Wong J, Nichols J, Cypher A, Lau TA (2009) End-user programming of mashups with vegemite. In: Proceedings of the 14th international conference on intelligent user interfaces, IUI ’09, pp 97–106. ACM, New York, NY, USA.

  98. Liu EZ, Guu K, Pasupat P, Shi T, Liang P (2018) Reinforcement learning on web interfaces using workflow-guided exploration. CoRR.

  99. Liu TF, Craft M, Situ J, Yumer E, Mech R, Kumar R (2018) Learning design semantics for mobile apps. In: Proceedings of the 31st annual ACM symposium on user interface software and technology, UIST ’18, pp 569–579. ACM, New York, NY, USA.

  100. LlamaLab: Automate: everyday automation for Android (2016).

  101. Luger E, Sellen A (2016) “like having a really bad pa”: the gulf between user expectation and experience of conversational agents. In: Proceedings of the 2016 CHI conference on human factors in computing systems, CHI ’16, pp 5286–5297. ACM, New York, NY, USA.

  102. Maes P (1994) Agents that reduce work and information overload. Commun ACM 37(7):30–40.

  103. Mankoff J, Abowd GD, Hudson SE (2000) Oops: a toolkit supporting mediation techniques for resolving ambiguity in recognition-based interfaces. Comput Graph 24(6):819–834

    Article  Google Scholar 

  104. Marin R, Sanz PJ, Nebot P, Wirz R (2005) A multimodal interface to control a robot arm via the web: a case study on remote programming. IEEE Trans Ind Electron 52(6):1506–1520.

    Article  Google Scholar 

  105. Maués RDA, Barbosa SDJ (2013) Keep doing what i just did: automating smartphones by demonstration. In: Proceedings of the 15th international conference on human-computer interaction with mobile devices and services, MobileHCI ’13, pp 295–303. ACM, New York, NY, USA.

  106. McDaniel RG, Myers BA (1999) Getting more out of programming-by-demonstration. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’99, pp 442–449. ACM, New York, NY, USA.

  107. McTear M, O’Neill I, Hanna P, Liu X (2005) Handling errors and determining confirmation strategies–an object-based approach. Speech Commun 45(3):249–269. Special Issue on Error Handling in Spoken Dialogue Systems

  108. Menon A, Tamuz O, Gulwani S, Lampson B, Kalai A (2013) A machine learning framework for programming by example, pp 187–195.

  109. Mihalcea R, Liu H, Lieberman H (2006) NLP (Natural Language Processing) for NLP (Natural Language Programming). In: Gelbukh A (ed) Computational linguistics and intelligent text processing. Lecture notes in computer science. Springer, Berlin, Heidelberg, pp 319–330

    Google Scholar 

  110. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs]. ArXiv: 1301.3781

  111. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119.

  112. Mohan S, Laird JE (2014) Learning goal-oriented hierarchical tasks from situated interactive instruction. In: Proceedings of the twenty-eighth AAAI conference on artificial intelligence, AAAI’14, pp 387–394. AAAI Press

    Google Scholar 

  113. Myers B, Malkin R, Bett M, Waibel A, Bostwick B, Miller RC, Yang J, Denecke M, Seemann E, Zhu J et al (2002) Flexi-modal and multi-machine user interfaces. In: Proceedings of the fourth IEEE international conference on multimodal interfaces, pp 343–348. IEEE

    Google Scholar 

  114. Myers BA (1986) Visual programming, programming by example, and program visualization: a taxonomy. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’86, pp 59–66. ACM, New York, NY, USA.

  115. Myers BA, Ko AJ, Scaffidi C, Oney S, Yoon Y, Chang K, Kery MB, Li TJJ (2017) Making end user development more natural. In: New perspectives in end-user development, pp 1–22. Springer, Cham.

  116. Myers BA, McDaniel R (2001) Sometimes you need a little intelligence, sometimes you need a lot. Your wish is my command: programming by example. Morgan Kaufmann Publishers, San Francisco, CA, pp 45–60.

  117. Myers C, Furqan A, Nebolsky J, Caro K, Zhu J (2018) Patterns for how users overcome obstacles in voice user interfaces. In: Proceedings of the 2018 CHI conference on human factors in computing systems, pp 1–7

    Google Scholar 

  118. Norman D (2013) The design of everyday things: revised and expanded edition. Basic Books

    Google Scholar 

  119. Oviatt S (1999) Mutual disambiguation of recognition errors in a multimodel architecture. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 576–583. ACM

    Google Scholar 

  120. Oviatt S (1999) Ten myths of multimodal interaction. Commun ACM 42(11):74–81

  121. Oviatt S, Cohen P (2000) Perceptual user interfaces: multimodal interfaces that process what comes naturally. Commun ACM 43(3):45–53

    Article  Google Scholar 

  122. Pasupat P, Jiang TS, Liu E, Guu K, Liang P (2018) Mapping natural language commands to web elements. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 4970–4976. ACL, Brussels, Belgium.

  123. Pasupat P, Liang P (2015) Compositional semantic parsing on semi-structured tables. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing. ArXiv: 1508.00305

  124. Porcheron M, Fischer JE, Reeves S, Sharples S (2018) Voice interfaces in everyday life. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18. ACM, New York, NY, USA.

  125. Price D, Rilofff E, Zachary J, Harvey B (2000) NaturalJava: a natural language interface for programming in java. In: Proceedings of the 5th international conference on intelligent user interfaces, IUI ’00, pp 207–211. ACM, New York, NY, USA.

  126. Qi S, Jia B, Huang S, Wei P, Zhu SC (2020) A generalized earley parser for human activity parsing and prediction. IEEE Trans Pattern Anal Mach Intell

    Google Scholar 

  127. Ravindranath L, Thiagarajan A, Balakrishnan H, Madden S (2012) Code in the air: simplifying sensing and coordination tasks on smartphones. In: Proceedings of the twelfth workshop on mobile computing systems & applications, HotMobile ’12, pp 4:1–4:6. ACM, New York, NY, USA.

  128. Reimers N, Gurevych I (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing. ACL.

  129. Rodrigues A (2015) Breaking barriers with assistive macros. In: Proceedings of the 17th international ACM SIGACCESS conference on computers & accessibility, ASSETS ’15, pp 351–352. ACM, New York, NY, USA.

  130. Sahami Shirazi A, Henze N, Schmidt A, Goldberg R, Schmidt B, Schmauder H (2013) Insights into layout patterns of mobile user interfaces by an automatic analysis of android apps. In: Proceedings of the 5th ACM SIGCHI symposium on engineering interactive computing systems, EICS ’13, pp 275–284. ACM, New York, NY, USA.

  131. Sap M, Le Bras R, Allaway E, Bhagavatula C, Lourie N, Rashkin H, Roof B, Smith NA, Choi Y (2019) Atomic: an atlas of machine commonsense for if-then reasoning. Proc AAAI Conf Artif Intell 33:3027–3035

    Google Scholar 

  132. Sereshkeh AR, Leung G, Perumal K, Phillips C, Zhang M, Fazly A, Mohomed I (2020) Vasta: a vision and language-assisted smartphone task automation system. In: Proceedings of the 25th international conference on intelligent user interfaces, pp 22–32

    Google Scholar 

  133. She L, Chai J (2017) Interactive learning of grounded verb semantics towards human-robot communication. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: long papers), pp 1634–1644. ACL, Vancouver, Canada.

  134. Shneiderman B (1983) Direct manipulation: a step beyond programming languages. Computer 16(8):57–69.

  135. Shneiderman B, Plaisant C, Cohen M, Jacobs S, Elmqvist N, Diakopoulos N (2016) Designing the user interface: strategies for effective human-computer interaction, 6, edition. Pearson, Boston

  136. Srivastava S, Labutov I, Mitchell T (2017) Joint concept learning and semantic parsing from natural language explanations. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 1527–1536

  137. Su Y, Hassan Awadallah A, Wang M, White RW (2018) Natural language interfaces with fine-grained user interaction: a case study on web APIs. In: The 41st international ACM SIGIR conference on research and development in information retrieval, SIGIR ’18, pp 855–864. ACM, New York, NY, USA.

  138. Suhm B, Myers B, Waibel A (2001) Multimodal error correction for speech user interfaces. ACM Trans Comput-Hum Interact 8(1):60–98.

  139. Swearngin A, Dontcheva M, Li W, Brandt J, Dixon M, Ko AJ (2018) Rewire: interface design assistance from examples. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18, pp 1–12. ACM, New York, NY, USA.

  140. Ur B, McManus E, Pak Yong Ho M, Littman ML (2014) Practical trigger-action programming in the smart home. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’14, pp 803–812. ACM, New York, NY, USA.

  141. Vadas D, Curran JR (2005) Programming with unrestricted natural language. In: Proceedings of the Australasian language technology workshop 2005, pp 191–199

  142. Vrandečić D, Krötzsch M (2014) Wikidata: a free collaborative knowledgebase. Commun ACM 57(10):78–85.

  143. Xu Q, Erman J, Gerber A, Mao Z, Pang J, Venkataraman S (2011) Identifying diverse usage behaviors of smartphone apps. In: Proceedings of the 2011 ACM SIGCOMM conference on internet measurement conference, IMC ’11, pp 329–344. ACM, New York, NY, USA.

  144. Yang JJ, Lam MS, Landay JA (2020) DoThisHere: multimodal interaction to improve cross-application tasks on mobile devices. In: Proceedings of the 33rd annual ACM symposium on user interface software and technology, UIST ’20, pp 35–44. ACM, New York, NY, USA.

  145. Yao Z, Su Y, Sun H, Yih WT (2019) Model-based interactive semantic parsing: a unified framework and a text-to-SQL case study. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 5447–5458. ACL, Hong Kong, China.

  146. Yao Z, Tang Y, Yih WT, Sun H, Su Y (2020) An imitation game for learning semantic parsers from user interaction. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 6883–6902. ACL, Online.

  147. Yeh T, Chang TH, Miller RC (2009) Sikuli: using GUI screenshots for search and automation. In: Proceedings of the 22nd annual ACM symposium on user interface software and technology, UIST ’09, pp 183–192. ACM, New York, NY, USA.

  148. Zhang X, Ross AS, Fogarty J (2018) Robust annotation of mobile application interfaces in methods for accessibility repair and enhancement. In: Proceedings of the 31st annual ACM symposium on user interface software and technology, UIST ’18

  149. Zhang Z, Zhu Y, Zhu SC (2020) Graph-based hierarchical knowledge representation for robot task transfer from virtual to physical world. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS)

  150. Zhao S, Ramos J, Tao J, Jiang Z, Li S, Wu Z, Pan G, Dey AK (2016) Discovering different kinds of smartphone users through their application usage behaviors. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing, UbiComp ’16, pp 498–509. ACM, New York, NY, USA.


This research was supported in part by Verizon through the Yahoo! InMind project, a J.P. Morgan Faculty Research Award, NSF grant IIS-1814472, AFOSR grant FA95501710218, and Google Cloud Research Credits. Any opinions, findings or recommendations expressed here are those of the authors and do not necessarily reflect views of the sponsors. We thank Amos Azaria, Yuanchun Li, Fanglin Chen, Igor Labutov, Xiaohan Nancy Li, Xiaoyi Zhang, Wenze Shi, Wanling Ding, Marissa Radensky, Justin Jia, Kirielle Singarajah, Jingya Chen, Brandon Canfield, Haijun Xia, and Lindsay Popowski for their contributions to this project.

Corresponding author

Correspondence to Toby Jia-Jun Li.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Li, T.JJ., Mitchell, T.M., Myers, B.A. (2021). Demonstration + Natural Language: Multimodal Interfaces for GUI-Based Interactive Task Learning Agents. In: Li, Y., Hilliges, O. (eds) Artificial Intelligence for Human Computer Interaction: A Modern Approach. Human–Computer Interaction Series. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82680-2

  • Online ISBN: 978-3-030-82681-9

  • eBook Packages: Computer Science (R0)
