Demonstration + Natural Language: Multimodal Interfaces for GUI-Based Interactive Task Learning Agents

Artificial Intelligence for Human Computer Interaction: A Modern Approach

Abstract

We summarize our past five years of work on designing, building, and studying Sugilite, an interactive task learning agent that learns new tasks and their associated concepts interactively from the user’s natural language instructions and demonstrations, leveraging the graphical user interfaces (GUIs) of third-party mobile apps. Through its multi-modal and mixed-initiative approaches to human-AI interaction, Sugilite made important contributions to improving the usability, applicability, generalizability, flexibility, robustness, and shareability of interactive task learning agents. Sugilite also represents a new human-AI interaction paradigm for interactive task learning, in which existing app GUIs serve as a medium for users to communicate their intents to an AI agent, rather than only as interfaces through which users interact with the underlying computing services. In this chapter, we describe the Sugilite system, explain the design and implementation of its key features, and show a prototype in the form of a conversational assistant on Android.

Notes

  1. Sugilite is named after a purple gemstone, and stands for: Smartphone Users Generating Intelligent Likeable Interfaces Through Examples.

  2. A demo video is available at https://www.youtube.com/watch?v=tdHEk-GeaqE.

  3. https://github.com/tobyli/Sugilite_development.

  4. Sovite is named after a type of rock. It is also an acronym for System for Optimizing Voice Interfaces to Tackle Errors.

  5. Available at: https://github.com/tobyli/screen2vec.

  6. Available at: http://interactionmining.org/rico.

  7. Since the next screen is always within the same app and therefore shares an app description embedding, the prediction task favors having information about the specific app (i.e., the app store description embedding) dominate the embedding.

References

  1. Adar E, Dontcheva M, Laput G (2014) CommandSpace: modeling the relationships between tasks, descriptions and features. In: Proceedings of the 27th annual ACM symposium on user interface software and technology, UIST ’14, pp 167–176. ACM, New York, NY, USA. https://doi.org/10.1145/2642918.2647395. http://doi.acm.org/10.1145/2642918.2647395

  2. Alharbi K, Yeh T (2015) Collect, decompile, extract, stats, and diff: mining design pattern changes in android apps. In: Proceedings of the 17th international conference on human-computer interaction with mobile devices and services, MobileHCI ’15, pp 515–524. ACM, New York, NY, USA. https://doi.org/10.1145/2785830.2785892. http://doi.acm.org/10.1145/2785830.2785892

  3. Allen J, Chambers N, Ferguson G, Galescu L, Jung H, Swift M, Taysom W (2007) PLOW: a collaborative task learning agent. In: Proceedings of the 22Nd national conference on artificial intelligence - volume 2, AAAI’07, pp 1514–1519. AAAI Press, Vancouver, British Columbia, Canada

    Google Scholar 

  4. Allen JF, Guinn CI, Horvtz E (1999) Mixed-initiative interaction. IEEE Intell Syst Appl 14(5):14–23

    Article  Google Scholar 

  5. Amazon: Alexa Design Guide (2020). https://developer.amazon.com/en-US/docs/alexa/alexa-design/get-started.html

  6. Antila V, Polet J, Lämsä A, Liikka J (2012) RoutineMaker: towards end-user automation of daily routines using smartphones. In: 2012 IEEE international conference on pervasive computing and communications workshops (PERCOM workshops), pp 399–402. https://doi.org/10.1109/PerComW.2012.6197519

  7. Argall BD, Chernova S, Veloso M, Browning B (2009) A survey of robot learning from demonstration. Robot Auton Syst 57(5):469–483. https://doi.org/10.1016/j.robot.2008.10.024.

  8. Ashktorab Z, Jain M, Liao QV, Weisz JD (2019) Resilient chatbots: repair strategy preferences for conversational breakdowns. In: Proceedings of the 2019 CHI conference on human factors in computing systems, p 254. ACM

    Google Scholar 

  9. Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z (2007) Dbpedia: a nucleus for a web of open data. The semantic web, pp 722–735. http://www.springerlink.com/index/rm32474088w54378.pdf

  10. Azaria A, Krishnamurthy J, Mitchell TM (2016) Instructable intelligent personal agent. In: Proceedings of the 30th AAAI conference on artificial intelligence (AAAI), vol 4

    Google Scholar 

  11. Ballard BW, Biermann AW (1979) Programming in natural language “NLC” as a prototype. In: Proceedings of the 1979 annual conference, ACM ’79, pp 228–237. ACM, New York, NY, USA. https://doi.org/10.1145/800177.810072. http://doi.acm.org/10.1145/800177.810072

  12. Banovic N, Grossman T, Matejka J, Fitzmaurice G (2012) Waken: reverse engineering usage information and interface structure from software videos. In: Proceedings of the 25th annual ACM symposium on user interface software and technology, UIST ’12, pp 83–92. ACM, New York, NY, USA. https://doi.org/10.1145/2380116.2380129. http://doi.acm.org/10.1145/2380116.2380129

  13. Barman S, Chasins S, Bodik R, Gulwani S (2016) Ringer: web automation by demonstration. In: Proceedings of the 2016 ACM SIGPLAN international conference on object-oriented programming, systems, languages, and applications, OOPSLA 2016, pp 748–764. ACM, New York, NY, USA. https://doi.org/10.1145/2983990.2984020. http://doi.acm.org/10.1145/2983990.2984020

  14. Beneteau E, Richards OK, Zhang M, Kientz JA, Yip J, Hiniker A (2019) Communication breakdowns between families and alexa. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19, pp 243:1–243:13. ACM, New York, NY, USA. https://doi.org/10.1145/3290605.3300473. http://doi.acm.org/10.1145/3290605.3300473

  15. Bentley F, Luvogt C, Silverman M, Wirasinghe R, White B, Lottridge D (2018) Understanding the long-term use of smart speaker assistants. Proc ACM Interact Mob Wearable Ubiquitous Technol 2(3). https://doi.org/10.1145/3264901

  16. Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on freebase from question-answer pairs. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 1533–1544

    Google Scholar 

  17. Bergman L, Castelli V, Lau T, Oblinger D (2005) DocWizards: a system for authoring follow-me documentation wizards. In: Proceedings of the 18th annual ACM symposium on user interface software and technology, UIST ’05, pp 191–200. ACM, New York, NY, USA. https://doi.org/10.1145/1095034.1095067. http://doi.acm.org/10.1145/1095034.1095067

  18. Biermann AW (1983) Natural Language Programming. In: Biermann AW, Guiho G (eds) Computer program synthesis methodologies, NATO advanced study institutes series. Springer, Netherlands, pp 335–368

    Google Scholar 

  19. Bigham JP, Lau T, Nichols J (2009) Trailblazer: enabling blind users to blaze trails through the web. In: Proceedings of the 14th international conference on intelligent user interfaces, IUI ’09, pp 177–186. ACM, New York, NY, USA. https://doi.org/10.1145/1502650.1502677

  20. Billard A, Calinon S, Dillmann R, Schaal S (2008) Robot programming by demonstration. In: Springer handbook of robotics, pp 1371–1394. Springer. http://link.springer.com/10.1007/978-3-540-30301-5_60

  21. Bohus D, Rudnicky AI (2005) Sorry, I didn’t catch that!-An investigation of non-understanding errors and recovery strategies. In: 6th SIGdial workshop on discourse and dialogue

    Google Scholar 

  22. Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp 1247–1250. ACM. http://dl.acm.org/citation.cfm?id=1376746

  23. Bolt RA (1980) “Put-that-there”: voice and gesture at the graphics interface. In: Proceedings of the 7th annual conference on computer graphics and interactive techniques, SIGGRAPH ’80, pp 262–270. ACM, New York, NY, USA

    Google Scholar 

  24. Bosselut A, Rashkin H, Sap M, Malaviya C, Celikyilmaz A, Choi Y (2019) COMET: commonsense transformers for automatic knowledge graph construction. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 4762–4779. ACL, Florence, Italy. https://doi.org/10.18653/v1/P19-1470. https://www.aclweb.org/anthology/P19-1470

  25. Brennan SE (1991) Conversation with and through computers. User Model User-Adap Int 1(1):67–86. https://doi.org/10.1007/BF00158952

  26. Brennan SE (1998) The grounding problem in conversations with and through computers. Social and cognitive approaches to interpersonal communication, pp 201–225

    Google Scholar 

  27. Böhmer M, Hecht B, Schöning J, Krüger A, Bauer G (2011) Falling asleep with angry birds, facebook and kindle: a large scale study on mobile application usage. In: Proceedings of the 13th international conference on human computer interaction with mobile devices and services, MobileHCI ’11, pp 47–56. ACM, New York, NY, USA. https://doi.org/10.1145/2037373.2037383. http://doi.acm.org/10.1145/2037373.2037383

  28. Chai JY, Gao Q, She L, Yang S, Saba-Sadiya S, Xu G (2018) Language to action: towards interactive task learning with physical agents. In: IJCAI, pp 2–9

    Google Scholar 

  29. Chandramouli V, Chakraborty A, Navda V, Guha S, Padmanabhan V, Ramjee R (2015) Insider: towards breaking down mobile app silos. In: TRIOS workshop held in conjunction with the SIGOPS SOSP 2015

    Google Scholar 

  30. Chen F, Xia K, Dhabalia K, Hong JI (2019) Messageontap: a suggestive interface to facilitate messaging-related tasks. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19. ACM, New York, NY, USA. https://doi.org/10.1145/3290605.3300805

  31. Chen J, Chen C, Xing Z, Xu X, Zhu L, Li G, Wang J (2020) Unblind your apps: predicting natural-language labels for mobile gui components by deep learning. In: Proceedings of the 42nd international conference on software engineering, ICSE ’20

    Google Scholar 

  32. Chen JH, Weld DS (2008) Recovering from errors during programming by demonstration. In: Proceedings of the 13th international conference on intelligent user interfaces, IUI ’08, pp 159–168. ACM, New York, NY, USA. https://doi.org/10.1145/1378773.1378794. http://doi.acm.org/10.1145/1378773.1378794

  33. Chkroun M, Azaria A (2019) Lia: a virtual assistant that can be taught new commands by speech. Int J Hum–Comput Interact 1–12

    Google Scholar 

  34. Cho J, Rader E (2020) The role of conversational grounding in supporting symbiosis between people and digital assistants. Proc ACM Hum-Comput Interact 4(CSCW1)

    Google Scholar 

  35. Clark HH, Brennan SE (1991) Grounding in communication. In: Perspectives on socially shared cognition, pp 127–149. APA, Washington, DC, US. https://doi.org/10.1037/10096-006

  36. Cowan BR, Pantidi N, Coyle D, Morrissey K, Clarke P, Al-Shehri S, Earley D, Bandeira N (2017) “what can i help you with?”: Infrequent users’ experiences of intelligent personal assistants. In: Proceedings of the 19th international conference on human-computer interaction with mobile devices and services, MobileHCI ’17, pp 43:1–43:12. ACM, New York, NY, USA. https://doi.org/10.1145/3098279.3098539. http://doi.acm.org/10.1145/3098279.3098539

  37. Cypher A, Halbert DC (1993) Watch what I do: programming by demonstration. MIT Press

    Google Scholar 

  38. Deka B, Huang Z, Franzen C, Hibschman J, Afergan D, Li Y, Nichols J, Kumar R (2017) Rico: a mobile app dataset for building data-driven design applications. In: Proceedings of the 30th annual ACM symposium on user interface software and technology, UIST ’17, pp 845–854. ACM, New York, NY, USA. https://doi.org/10.1145/3126594.3126651. http://doi.acm.org/10.1145/3126594.3126651

  39. Deka B, Huang Z, Kumar R (2016) ERICA: interaction mining mobile apps. In: Proceedings of the 29th annual symposium on user interface software and technology, UIST ’16, pp 767–776. ACM, New York, NY, USA. https://doi.org/10.1145/2984511.2984581. http://doi.acm.org/10.1145/2984511.2984581

  40. Dixon M, Fogarty J (2010) Prefab: implementing advanced behaviors using pixel-based reverse engineering of interface structure. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’10, pp 1525–1534. ACM, New York, NY, USA. https://doi.org/10.1145/1753326.1753554. http://doi.acm.org/10.1145/1753326.1753554

  41. Dixon M, Leventhal D, Fogarty J (2011) Content and hierarchy in pixel-based methods for reverse engineering interface structure. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’11, pp 969–978. ACM, New York, NY, USA. https://doi.org/10.1145/1978942.1979086. http://doi.acm.org/10.1145/1978942.1979086

  42. Dixon M, Nied A, Fogarty J (2014) Prefab layers and prefab annotations: extensible pixel-based interpretation of graphical interfaces. In: Proceedings of the 27th annual ACM symposium on user interface software and technology, UIST ’14, pp 221–230. ACM, New York, NY, USA. https://doi.org/10.1145/2642918.2647412. http://doi.acm.org/10.1145/2642918.2647412

  43. Fast E, Chen B, Mendelsohn J, Bassen J, Bernstein MS (2018) Iris: a conversational agent for complex tasks. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18, pp 473:1–473:12. ACM, New York, NY, USA. https://doi.org/10.1145/3173574.3174047. http://doi.acm.org/10.1145/3173574.3174047

  44. Gao X, Gong R, Zhao Y, Wang S, Shu T, Zhu SC (2020) Joint mind modeling for explanation generation in complex human-robot collaborative tasks. In: 2020 29th IEEE international conference on robot and human interactive communication (RO-MAN), pp 1119–1126. IEEE

    Google Scholar 

  45. Gluck KA, Laird JE (2019) Interactive task learning: humans, robots, and agents acquiring new tasks through natural interactions, vol 26. MIT Press

    Google Scholar 

  46. Green TR (1989) Cognitive dimensions of notations. People and Computers V pp 443–460. https://books.google.com/books?hl=en&lr=&id=BTxOtt4X920C&oi=fnd&pg=PA443&dq=Cognitive+dimensions+of+notations&ots=OEqg1By_Rj&sig=dpg1zZFRHpBVC_r0--XLyLr6718

  47. Grudin J, Jacques R (2019) Chatbots, humbots, and the quest for artificial general intelligence. In: Proceedings of the 2019 CHI conference on human factors in computing systems, pp 1–11

    Google Scholar 

  48. Guo A, Kong J, Rivera M, Xu FF, Bigham JP (2019) StateLens: a reverse engineering solution for making existing dynamic touchscreens accessible. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology (UIST 2019), p 15

    Google Scholar 

  49. Gur I, Yavuz S, Su Y, Yan X (2018) DialSQL: dialogue based structured query generation. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), pp 1339–1349. ACL, Melbourne, Australia. https://doi.org/10.18653/v1/P18-1124. https://www.aclweb.org/anthology/P18-1124

  50. Hartmann B, Wu L, Collins K, Klemmer SR (2007) Programming by a sample: rapidly creating web applications with d.mix. In: Proceedings of the 20th annual ACM symposium on user interface software and technology, UIST ’07, pp 241–250. ACM, New York, NY, USA. https://doi.org/10.1145/1294211.1294254. http://doi.acm.org/10.1145/1294211.1294254

  51. Horvitz E (1999) Principles of mixed-initiative user interfaces. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’99, pp 159–166. ACM, New York, NY, USA. https://doi.org/10.1145/302979.303030

  52. Huang F, Canny JF, Nichols J (2019) Swire: sketch-based user interface retrieval. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI ’19, pp 1–10. ACM, New York, NY, USA. https://doi.org/10.1145/3290605.3300334

  53. Huang THK, Azaria A, Bigham JP (2016) InstructableCrowd: creating IF-THEN rules via conversations with the crowd, pp 1555–1562. ACM Press. https://doi.org/10.1145/2851581.2892502. http://dl.acm.org/citation.cfm?doid=2851581.2892502

  54. Hutchins EL, Hollan JD, Norman DA (1986) Direct manipulation interfaces

    Google Scholar 

  55. Iba S, Paredis CJJ, Khosla PK (2005) Interactive multimodal robot programming. Int J Robot Res 24(1):83–104. https://doi.org/10.1177/0278364904049250

  56. IFTTT (2016) IFTTT: connects the apps you love. https://ifttt.com/

  57. Intharah T, Turmukhambetov D, Brostow GJ (2019) Hilc: domain-independent pbd system via computer vision and follow-up questions. ACM Trans Interact Intell Syst 9(2-3):16:1–16:27. https://doi.org/10.1145/3234508. http://doi.acm.org/10.1145/3234508

  58. Jain M, Kumar P, Kota R, Patel SN (2018) Evaluating and informing the design of chatbots. In: Proceedings of the 2018 designing interactive systems conference, pp 895–906. ACM

    Google Scholar 

  59. Jiang J, Jeng W, He D (2013) How do users respond to voice input errors?: lexical and phonetic query reformulation in voice search. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, pp 143–152. ACM

    Google Scholar 

  60. Kasturi T, Jin H, Pappu A, Lee S, Harrison B, Murthy R, Stent A (2015) The cohort and speechify libraries for rapid construction of speech enabled applications for android. In: Proceedings of the 16th annual meeting of the special interest group on discourse and dialogue, pp 441–443

    Google Scholar 

  61. Kate RJ, Wong YW, Mooney RJ (2005) Learning to transform natural to formal languages. In: Proceedings of the 20th national conference on artificial intelligence - volume 3, AAAI’05, pp 1062–1068. AAAI Press, Pittsburgh, Pennsylvania. http://dl.acm.org/citation.cfm?id=1619499.1619504

  62. Kim D, Park S, Ko J, Ko SY, Lee SJ (2019) X-droid: a quick and easy android prototyping framework with a single-app illusion. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology, UIST ’19, pp 95–108. ACM, New York, NY, USA. https://doi.org/10.1145/3332165.3347890

  63. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980

  64. Kirk J, Mininger A, Laird J (2016) Learning task goals interactively with visual demonstrations. Biol Inspired Cogn Archit 18:1–8

    Google Scholar 

  65. Ko AJ, Abraham R, Beckwith L, Blackwell A, Burnett M, Erwig M, Scaffidi C, Lawrance J, Lieberman H, Myers B, Rosson MB, Rothermel G, Shaw M, Wiedenbeck S (2011) The state of the art in end-user software engineering. ACM Comput Surv 43(3), 21:1–21:44. https://doi.org/10.1145/1922649.1922658. http://doi.acm.org/10.1145/1922649.1922658

  66. Kumar R, Satyanarayan A, Torres C, Lim M, Ahmad S, Klemmer SR, Talton JO (2013) Webzeitgeist: design mining the web. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’13, pp 3083–3092. ACM, New York, NY, USA. https://doi.org/10.1145/2470654.2466420

  67. Kurihara K, Goto M, Ogata J, Igarashi T (2006) Speech pen: predictive handwriting based on ambient multimodal recognition. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 851–860. ACM

    Google Scholar 

  68. Labutov I, Srivastava S, Mitchell T (2018) Lia: a natural language programmable personal assistant. In: Proceedings of the 2018 conference on empirical methods in natural language processing: system demonstrations, pp 145–150

    Google Scholar 

  69. Laird JE, Gluck K, Anderson J, Forbus KD, Jenkins OC, Lebiere C, Salvucci D, Scheutz M, Thomaz A, Trafton G, Wray RE, Mohan S, Kirk JR (2017) Interactive task learning. IEEE Intell Syst 32(4):6–21. https://doi.org/10.1109/MIS.2017.3121552

    Article  Google Scholar 

  70. Laput GP, Dontcheva M, Wilensky G, Chang W, Agarwala A, Linder J, Adar E (2013) PixelTone: a multimodal interface for image editing. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’13, pp 2185–2194. ACM, New York, NY, USA. https://doi.org/10.1145/2470654.2481301. http://doi.acm.org/10.1145/2470654.2481301

  71. Lau T (2009) Why programming-by-demonstration systems fail: lessons learned for usable AI. AI Mag 30(4):65–67. http://www.aaai.org/ojs/index.php/aimagazine/article/view/2262

  72. Lee C, Kim S, Han D, Yang H, Park YW, Kwon BC, Ko S (2020) Guicomp: a gui design assistant with real-time, multi-faceted feedback. In: Proceedings of the 2020 CHI conference on human factors in computing systems, CHI ’20, pp 1–13. ACM, New York, NY, USA. https://doi.org/10.1145/3313831.3376327

  73. Lee HY, Yang W, Jiang L, Le M, Essa I, Gong H, Yang MH (2020) Neural design network: graphic layout generation with constraints. In: European conference on computer vision (ECCV)

    Google Scholar 

  74. Lee TY, Dugan C, Bederson BB (2017) Towards understanding human mistakes of programming by example: an online user study. In: Proceedings of the 22nd international conference on intelligent user interfaces, IUI ’17, pp 257–261. ACM, New York, NY, USA. https://doi.org/10.1145/3025171.3025203. http://doi.acm.org/10.1145/3025171.3025203

  75. Leshed G, Haber EM, Matthews T, Lau T (2008) CoScripter: automating & sharing how-to knowledge in the enterprise. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’08, pp 1719–1728. ACM, New York, NY, USA. https://doi.org/10.1145/1357054.1357323. http://doi.acm.org/10.1145/1357054.1357323

  76. Li F, Jagadish HV (2014) Constructing an interactive natural language interface for relational databases. Proc VLDB Endow 8(1):73–84. https://doi.org/10.14778/2735461.2735468

  77. Li H, Wang YP, Yin J, Tan G (2019) Smartshell: automated shell scripts synthesis from natural language. Int J Softw Eng Knowl Eng 29(02):197–220

    Article  Google Scholar 

  78. Li I, Nichols J, Lau T, Drews C, Cypher A (2010) Here’s What I Did: sharing and reusing web activity with ActionShot. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’10, pp 723–732. ACM, New York, NY, USA. https://doi.org/10.1145/1753326.1753432. http://doi.acm.org/10.1145/1753326.1753432

  79. Li J, Yang J, Hertzmann A, Zhang J, Xu T (2019) Layoutgan: synthesizing graphic layouts with vector-wireframe adversarial networks. IEEE Trans Pattern Anal Mach Intell

    Google Scholar 

  80. Li TJJ, Azaria A, Myers BA (2017) SUGILITE: creating multimodal smartphone automation by demonstration. In: Proceedings of the 2017 CHI conference on human factors in computing systems, CHI ’17, pp 6038–6049. ACM, New York, NY, USA. https://doi.org/10.1145/3025453.3025483. http://doi.acm.org/10.1145/3025453.3025483

  81. Li TJJ, Chen J, Canfield B, Myers BA (2020) Privacy-preserving script sharing in gui-based programming-by-demonstration systems. Proc ACM Hum-Comput Interact 4(CSCW1). https://doi.org/10.1145/3392869

  82. Li TJJ, Chen J, Xia H, Mitchell TM, Myers BA (2020) Multi-modal repairs of conversational breakdowns in task-oriented dialogs. In: Proceedings of the 33rd annual ACM symposium on user interface software and technology, UIST 2020. ACM. https://doi.org/10.1145/3379337.3415820

  83. Li TJJ, Hecht B (2014) WikiBrain: making computer programs smarter with knowledge from wikipedia

    Google Scholar 

  84. Li TJJ, Labutov I, Li XN, Zhang X, Shi W, Mitchell TM, Myers BA (2018) APPINITE: a multi-modal interface for specifying data descriptions in programming by demonstration using verbal instructions. In: Proceedings of the 2018 IEEE symposium on visual languages and human-centric computing (VL/HCC 2018)

    Google Scholar 

  85. Li TJJ, Labutov I, Myers BA, Azaria A, Rudnicky AI, Mitchell TM (2018) Teaching agents when they fail: end user development in goal-oriented conversational agents. In: Studies in conversational UX design. Springer

    Google Scholar 

  86. Li TJJ, Li Y, Chen F, Myers BA (2017) Programming IoT devices by demonstration using mobile apps. In: Barbosa S, Markopoulos P, Paterno F, Stumpf S, Valtolina S (eds) End-user development. Springer, Cham, pp 3–17

    Chapter  Google Scholar 

  87. Li TJJ, Popowski L, Mitchell TM, Myers BA (2021) Screen2vec: semantic embedding of gui screens and gui components. In: Proceedings of the 2021 CHI conference on human factors in computing systems, CHI ’21. ACM

    Google Scholar 

  88. Li TJJ, Radensky M, Jia J, Singarajah K, Mitchell TM, Myers BA (2019) PUMICE: a multi-modal agent that learns concepts and conditionals from natural language and demonstrations. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology (UIST 2019), UIST 2019. ACM. https://doi.org/10.1145/3332165.3347899

  89. Li TJJ, Riva O (2018) KITE: building conversational bots from mobile apps. In: Proceedings of the 16th ACM international conference on mobile systems, applications, and services (MobiSys 2018). ACM

    Google Scholar 

  90. Li Y, He J, Zhou X, Zhang Y, Baldridge J (2020) Mapping natural language instructions to mobile UI action sequences. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8198–8210. ACL, Online. https://doi.org/10.18653/v1/2020.acl-main.729. https://www.aclweb.org/anthology/2020.acl-main.729

  91. Li Y, Li G, He L, Zheng J, Li H, Guan Z (2020) Widget captioning: generating natural language description for mobile user interface elements. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 5495–5510. ACL, Online. https://doi.org/10.18653/v1/2020.emnlp-main.443. https://www.aclweb.org/anthology/2020.emnlp-main.443

  92. Liang P (2016) Learning executable semantic parsers for natural language understanding. Commun ACM 59(9):68–76

    Article  Google Scholar 

  93. Liang P, Jordan MI, Klein D (2013) Learning dependency-based compositional semantics. Comput Linguist 39(2):389–446

    Article  MathSciNet  Google Scholar 

  94. Lieberman H (2001) Your wish is my command: programming by example. Morgan Kaufmann

    Google Scholar 

  95. Lieberman H, Liu H (2006) Feasibility studies for programming in natural language. In: End user development, pp 459–473. Springer

    Google Scholar 

  96. Lieberman H, Maulsby D (1996) Instructible agents: software that just keeps getting better. IBM Syst J 35(3.4):539–556. https://doi.org/10.1147/sj.353.0539

  97. Lin J, Wong J, Nichols J, Cypher A, Lau TA (2009) End-user programming of mashups with vegemite. In: Proceedings of the 14th international conference on intelligent user interfaces, IUI ’09, pp 97–106. ACM, New York, NY, USA. https://doi.org/10.1145/1502650.1502667. http://doi.acm.org/10.1145/1502650.1502667

  98. Liu EZ, Guu K, Pasupat P, Shi T, Liang P (2018) Reinforcement learning on web interfaces using workflow-guided exploration. CoRR. http://arxiv.org/abs/1802.08802

  99. Liu TF, Craft M, Situ J, Yumer E, Mech R, Kumar R (2018) Learning design semantics for mobile apps. In: Proceedings of the 31st annual ACM symposium on user interface software and technology, UIST ’18, pp 569–579. ACM, New York, NY, USA. https://doi.org/10.1145/3242587.3242650

  100. LlamaLab: Automate: everyday automation for Android (2016). http://llamalab.com/automate/

  101. Luger E, Sellen A (2016) “like having a really bad pa”: the gulf between user expectation and experience of conversational agents. In: Proceedings of the 2016 CHI conference on human factors in computing systems, CHI ’16, pp 5286–5297. ACM, New York, NY, USA. https://doi.org/10.1145/2858036.2858288. http://doi.acm.org/10.1145/2858036.2858288

  102. Maes P (1994) Agents that reduce work and information overload. Commun ACM 37(7):30–40. https://doi.org/10.1145/176789.176792. http://doi.acm.org/10.1145/176789.176792

  103. Mankoff J, Abowd GD, Hudson SE (2000) Oops: a toolkit supporting mediation techniques for resolving ambiguity in recognition-based interfaces. Comput Graph 24(6):819–834

    Article  Google Scholar 

  104. Marin R, Sanz PJ, Nebot P, Wirz R (2005) A multimodal interface to control a robot arm via the web: a case study on remote programming. IEEE Trans Ind Electron 52(6):1506–1520. https://doi.org/10.1109/TIE.2005.858733

    Article  Google Scholar 

  105. Maués RDA, Barbosa SDJ (2013) Keep doing what i just did: automating smartphones by demonstration. In: Proceedings of the 15th international conference on human-computer interaction with mobile devices and services, MobileHCI ’13, pp 295–303. ACM, New York, NY, USA. https://doi.org/10.1145/2493190.2493216. http://doi.acm.org/10.1145/2493190.2493216

  106. McDaniel RG, Myers BA (1999) Getting more out of programming-by-demonstration. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’99, pp 442–449. ACM, New York, NY, USA. https://doi.org/10.1145/302979.303127. http://doi.acm.org/10.1145/302979.303127

  107. McTear M, O’Neill I, Hanna P, Liu X (2005) Handling errors and determining confirmation strategies–an object-based approach. Speech Commun 45(3):249–269. https://doi.org/10.1016/j.specom.2004.11.006. http://www.sciencedirect.com/science/article/pii/S0167639304001426. Special Issue on Error Handling in Spoken Dialogue Systems

  108. Menon A, Tamuz O, Gulwani S, Lampson B, Kalai A (2013) A machine learning framework for programming by example, pp 187–195. http://machinelearning.wustl.edu/mlpapers/papers/ICML2013_menon13

  109. Mihalcea R, Liu H, Lieberman H (2006) NLP (Natural Language Processing) for NLP (Natural Language Programming). In: Gelbukh A (ed) Computational linguistics and intelligent text processing. Lecture notes in computer science. Springer, Berlin, Heidelberg, pp 319–330

    Google Scholar 

  110. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs]. http://arxiv.org/abs/1301.3781. ArXiv: 1301.3781

  111. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality

  112. Mohan S, Laird JE (2014) Learning goal-oriented hierarchical tasks from situated interactive instruction. In: Proceedings of the twenty-eighth AAAI conference on artificial intelligence, AAAI’14, pp 387–394. AAAI Press

    Google Scholar 

  113. Myers B, Malkin R, Bett M, Waibel A, Bostwick B, Miller RC, Yang J, Denecke M, Seemann E, Zhu J et al (2002) Flexi-modal and multi-machine user interfaces. In: Proceedings of the fourth IEEE international conference on multimodal interfaces, pp 343–348. IEEE

  114. Myers BA (1986) Visual programming, programming by example, and program visualization: a taxonomy. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’86, pp 59–66. ACM, New York, NY, USA. https://doi.org/10.1145/22627.22349. http://doi.acm.org/10.1145/22627.22349

  115. Myers BA, Ko AJ, Scaffidi C, Oney S, Yoon Y, Chang K, Kery MB, Li TJJ (2017) Making end user development more natural. In: New perspectives in end-user development, pp 1–22. Springer, Cham. https://doi.org/10.1007/978-3-319-60291-2_1. https://link.springer.com/chapter/10.1007/978-3-319-60291-2_1

  116. Myers BA, McDaniel R (2001) Sometimes you need a little intelligence, sometimes you need a lot. Your wish is my command: programming by example. Morgan Kaufmann Publishers, San Francisco, CA, pp 45–60. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.8085&rep=rep1&type=pdf

  117. Myers C, Furqan A, Nebolsky J, Caro K, Zhu J (2018) Patterns for how users overcome obstacles in voice user interfaces. In: Proceedings of the 2018 CHI conference on human factors in computing systems, pp 1–7

  118. Norman D (2013) The design of everyday things: revised and expanded edition. Basic Books

  119. Oviatt S (1999) Mutual disambiguation of recognition errors in a multimodal architecture. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 576–583. ACM

  120. Oviatt S (1999) Ten myths of multimodal interaction. Commun ACM 42(11):74–81. https://doi.org/10.1145/319382.319398. http://doi.acm.org/10.1145/319382.319398

  121. Oviatt S, Cohen P (2000) Perceptual user interfaces: multimodal interfaces that process what comes naturally. Commun ACM 43(3):45–53

  122. Pasupat P, Jiang TS, Liu E, Guu K, Liang P (2018) Mapping natural language commands to web elements. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 4970–4976. ACL, Brussels, Belgium. https://doi.org/10.18653/v1/D18-1540. https://www.aclweb.org/anthology/D18-1540

  123. Pasupat P, Liang P (2015) Compositional semantic parsing on semi-structured tables. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing. http://arxiv.org/abs/1508.00305. ArXiv: 1508.00305

  124. Porcheron M, Fischer JE, Reeves S, Sharples S (2018) Voice interfaces in everyday life. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18. ACM, New York, NY, USA. https://doi.org/10.1145/3173574.3174214

  125. Price D, Riloff E, Zachary J, Harvey B (2000) NaturalJava: a natural language interface for programming in Java. In: Proceedings of the 5th international conference on intelligent user interfaces, IUI ’00, pp 207–211. ACM, New York, NY, USA. https://doi.org/10.1145/325737.325845. http://doi.acm.org/10.1145/325737.325845

  126. Qi S, Jia B, Huang S, Wei P, Zhu SC (2020) A generalized Earley parser for human activity parsing and prediction. IEEE Trans Pattern Anal Mach Intell

  127. Ravindranath L, Thiagarajan A, Balakrishnan H, Madden S (2012) Code in the air: simplifying sensing and coordination tasks on smartphones. In: Proceedings of the twelfth workshop on mobile computing systems & applications, HotMobile ’12, pp 4:1–4:6. ACM, New York, NY, USA. https://doi.org/10.1145/2162081.2162087. http://doi.acm.org/10.1145/2162081.2162087

  128. Reimers N, Gurevych I (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing. ACL. http://arxiv.org/abs/1908.10084

  129. Rodrigues A (2015) Breaking barriers with assistive macros. In: Proceedings of the 17th international ACM SIGACCESS conference on computers & accessibility, ASSETS ’15, pp 351–352. ACM, New York, NY, USA. https://doi.org/10.1145/2700648.2811322. http://doi.acm.org/10.1145/2700648.2811322

  130. Sahami Shirazi A, Henze N, Schmidt A, Goldberg R, Schmidt B, Schmauder H (2013) Insights into layout patterns of mobile user interfaces by an automatic analysis of android apps. In: Proceedings of the 5th ACM SIGCHI symposium on engineering interactive computing systems, EICS ’13, pp 275–284. ACM, New York, NY, USA. https://doi.org/10.1145/2494603.2480308. http://doi.acm.org/10.1145/2494603.2480308

  131. Sap M, Le Bras R, Allaway E, Bhagavatula C, Lourie N, Rashkin H, Roof B, Smith NA, Choi Y (2019) Atomic: an atlas of machine commonsense for if-then reasoning. Proc AAAI Conf Artif Intell 33:3027–3035

  132. Sereshkeh AR, Leung G, Perumal K, Phillips C, Zhang M, Fazly A, Mohomed I (2020) Vasta: a vision and language-assisted smartphone task automation system. In: Proceedings of the 25th international conference on intelligent user interfaces, pp 22–32

  133. She L, Chai J (2017) Interactive learning of grounded verb semantics towards human-robot communication. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: long papers), pp 1634–1644. ACL, Vancouver, Canada. https://doi.org/10.18653/v1/P17-1150. https://www.aclweb.org/anthology/P17-1150

  134. Shneiderman B (1983) Direct manipulation: a step beyond programming languages. Computer 16(8):57–69. https://doi.org/10.1109/MC.1983.1654471

  135. Shneiderman B, Plaisant C, Cohen M, Jacobs S, Elmqvist N, Diakopoulos N (2016) Designing the user interface: strategies for effective human-computer interaction, 6th edition. Pearson, Boston

  136. Srivastava S, Labutov I, Mitchell T (2017) Joint concept learning and semantic parsing from natural language explanations. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 1527–1536

  137. Su Y, Hassan Awadallah A, Wang M, White RW (2018) Natural language interfaces with fine-grained user interaction: a case study on web apis. In: The 41st international ACM SIGIR conference on research and development in information retrieval, SIGIR ’18, pp 855–864. ACM, New York, NY, USA. https://doi.org/10.1145/3209978.3210013

  138. Suhm B, Myers B, Waibel A (2001) Multimodal error correction for speech user interfaces. ACM Trans Comput-Hum Interact 8(1):60–98. https://doi.org/10.1145/371127.371166. http://doi.acm.org/10.1145/371127.371166

  139. Swearngin A, Dontcheva M, Li W, Brandt J, Dixon M, Ko AJ (2018) Rewire: interface design assistance from examples. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18, pp 1–12. ACM, New York, NY, USA. https://doi.org/10.1145/3173574.3174078

  140. Ur B, McManus E, Pak Yong Ho M, Littman ML (2014) Practical trigger-action programming in the smart home. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’14, pp 803–812. ACM, New York, NY, USA. https://doi.org/10.1145/2556288.2557420. http://doi.acm.org/10.1145/2556288.2557420

  141. Vadas D, Curran JR (2005) Programming with unrestricted natural language. In: Proceedings of the Australasian language technology workshop 2005, pp 191–199

  142. Vrandečić D, Krötzsch M (2014) Wikidata: a free collaborative knowledgebase. Commun ACM 57(10):78–85. http://dl.acm.org/citation.cfm?id=2629489

  143. Xu Q, Erman J, Gerber A, Mao Z, Pang J, Venkataraman S (2011) Identifying diverse usage behaviors of smartphone apps. In: Proceedings of the 2011 ACM SIGCOMM conference on internet measurement conference, IMC ’11, pp 329–344. ACM, New York, NY, USA. https://doi.org/10.1145/2068816.2068847. http://doi.acm.org/10.1145/2068816.2068847

  144. Yang JJ, Lam MS, Landay JA (2020) Dothishere: multimodal interaction to improve cross-application tasks on mobile devices. In: Proceedings of the 33rd annual ACM symposium on user interface software and technology, UIST ’20, pp 35–44. ACM, New York, NY, USA. https://doi.org/10.1145/3379337.3415841

  145. Yao Z, Su Y, Sun H, Yih WT (2019) Model-based interactive semantic parsing: a unified framework and a text-to-SQL case study. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 5447–5458. ACL, Hong Kong, China. https://doi.org/10.18653/v1/D19-1547. https://www.aclweb.org/anthology/D19-1547

  146. Yao Z, Tang Y, Yih WT, Sun H, Su Y (2020) An imitation game for learning semantic parsers from user interaction. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 6883–6902. ACL, Online. https://doi.org/10.18653/v1/2020.emnlp-main.559. https://www.aclweb.org/anthology/2020.emnlp-main.559

  147. Yeh T, Chang TH, Miller RC (2009) Sikuli: using GUI screenshots for search and automation. In: Proceedings of the 22nd annual ACM symposium on user interface software and technology, UIST ’09, pp 183–192. ACM, New York, NY, USA. https://doi.org/10.1145/1622176.1622213. http://doi.acm.org/10.1145/1622176.1622213

  148. Zhang X, Ross AS, Fogarty J (2018) Robust annotation of mobile application interfaces in methods for accessibility repair and enhancement. In: Proceedings of the 31st annual ACM symposium on user interface software and technology, UIST ’18

  149. Zhang Z, Zhu Y, Zhu SC (2020) Graph-based hierarchical knowledge representation for robot task transfer from virtual to physical world. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS)

  150. Zhao S, Ramos J, Tao J, Jiang Z, Li S, Wu Z, Pan G, Dey AK (2016) Discovering different kinds of smartphone users through their application usage behaviors. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing, UbiComp ’16, pp 498–509. ACM, New York, NY, USA. https://doi.org/10.1145/2971648.2971696. http://doi.acm.org/10.1145/2971648.2971696

Acknowledgements

This research was supported in part by Verizon through the Yahoo! InMind project, a J.P. Morgan Faculty Research Award, NSF grant IIS-1814472, AFOSR grant FA95501710218, and Google Cloud Research Credits. Any opinions, findings or recommendations expressed here are those of the authors and do not necessarily reflect views of the sponsors. We thank Amos Azaria, Yuanchun Li, Fanglin Chen, Igor Labutov, Xiaohan Nancy Li, Xiaoyi Zhang, Wenze Shi, Wanling Ding, Marissa Radensky, Justin Jia, Kirielle Singarajah, Jingya Chen, Brandon Canfield, Haijun Xia, and Lindsay Popowski for their contributions to this project.

Author information

Corresponding author

Correspondence to Toby Jia-Jun Li.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Li, T.JJ., Mitchell, T.M., Myers, B.A. (2021). Demonstration + Natural Language: Multimodal Interfaces for GUI-Based Interactive Task Learning Agents. In: Li, Y., Hilliges, O. (eds) Artificial Intelligence for Human Computer Interaction: A Modern Approach. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-030-82681-9_15

  • DOI: https://doi.org/10.1007/978-3-030-82681-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82680-2

  • Online ISBN: 978-3-030-82681-9

  • eBook Packages: Computer Science, Computer Science (R0)
