
An Investigation of Challenges Encountered When Specifying Training Data and Runtime Monitors for Safety Critical ML Applications

  • Conference paper
Requirements Engineering: Foundation for Software Quality (REFSQ 2023)

Abstract

[Context and motivation] The development and operation of critical software that contains machine learning (ML) models requires diligence and established processes. In particular, the training data used during the development of ML models has a major influence on the later behaviour of the system, and runtime monitors are used to provide guarantees for that behaviour. [Question/problem] We see major uncertainty in how to specify training data and runtime monitoring for critical ML models, and thereby the final functionality of the system. In this interview-based study, we investigate the challenges underlying these difficulties. [Principal ideas/results] Based on ten interviews with practitioners who develop ML models for critical applications in the automotive and telecommunication sectors, we identified 17 underlying challenges in 6 challenge groups that relate to the difficulty of specifying training data and runtime monitoring. [Contribution] The article provides a list of the identified underlying challenges related to the difficulties practitioners experience when specifying training data and runtime monitoring for ML models. Furthermore, interconnections between the challenges were found, and based on these connections recommendations are proposed to overcome the root causes of the challenges.
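The runtime monitors mentioned in the abstract can be illustrated with a minimal, hypothetical sketch: a monitor that rejects model outputs whose top-class softmax confidence falls below a threshold, deferring to a fallback. The threshold value, class separation, and `ConfidenceMonitor` name are illustrative assumptions, not a design taken from the paper.

```python
# Illustrative sketch only: a confidence-threshold runtime monitor.
# The paper does not prescribe this design; threshold and inputs are assumed.
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw model scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class ConfidenceMonitor:
    """Flags model outputs whose top-class probability is below a threshold."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.rejections = 0  # count of outputs the monitor refused to pass on

    def check(self, logits):
        """Return (accepted, confidence) for one model output."""
        probs = softmax(logits)
        confidence = max(probs)
        accepted = confidence >= self.threshold
        if not accepted:
            self.rejections += 1
        return accepted, confidence

monitor = ConfidenceMonitor(threshold=0.8)
ok, conf1 = monitor.check([4.0, 0.5, 0.2])    # clearly separated logits
low, conf2 = monitor.check([1.0, 0.9, 0.8])   # ambiguous logits
```

In this sketch the first, well-separated output is accepted while the ambiguous one is rejected; real monitors discussed in the literature (e.g., neuron activation patterns [12], model assertions [28]) use richer signals than raw confidence.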

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 957197.


Notes

  1. Non-governmental organisations, e.g., https://algorithmwatch.org/en/stories/.

  2. We define critical software as software that is safety, privacy, ethically, and/or mission critical, i.e., a failure in the software can cause significant injury or the loss of life, invasion of personal privacy, violation of human rights, and/or significant economic or environmental consequences [31].

  3. The interview guide is available at https://doi.org/10.7910/DVN/WJ8TKY.

  4. The list included functional safety experts, requirement engineers, product owners or function owners, function or model developers, and data engineers.

  5. Very efficient deep learning in the Internet of Things.

References

  1. Abid, A., Farooqi, M., Zou, J.: Persistent anti-muslim bias in large language models. In: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 298–306 (2021)


  2. Ashmore, R., Calinescu, R., Paterson, C.: Assuring the machine learning lifecycle: Desiderata, methods, and challenges. ACM Comput. Surv. 54(5), 1–39 (2021)


  3. Banko, M., Brill, E.: Scaling to very very large corpora for natural language disambiguation. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 26–33 (2001)


  4. Barocas, S., Selbst, A.D.: Big data’s disparate impact. Calif. L. Rev. 104, 671 (2016)


  5. Bayram, F., Ahmed, B.S., Kassler, A.: From concept drift to model degradation: An overview on performance-aware drift detectors. Knowl. Based Syst. 108632 (2022)


  6. Bencomo, N., Guo, J.L., Harrison, R., Heyn, H.M., Menzies, T.: The secret to better ai and better software (is requirements engineering). IEEE Softw. 39(1), 105–110 (2021)


  7. Bencomo, N., Whittle, J., Sawyer, P., Finkelstein, A., Letier, E.: Requirements reflection: requirements as runtime entities. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, vol. 2, pp. 199–202 (2010)


  8. Bernhardt, M., Jones, C., Glocker, B.: Potential sources of dataset bias complicate investigation of underdiagnosis by machine learning algorithms. Nat. Med. 1–2 (2022)


  9. Blodgett, S.L., Barocas, S., Daumé III, H., Wallach, H.M.: Language (technology) is power: A critical survey of “bias” in NLP. In: ACL (2020)


  10. Borg, M., et al.: Safely entering the deep: A review of verification and validation for machine learning and a challenge elicitation in the automotive industry. J. Automotive Softw. Eng. 1(1), 1–19 (2018)


  11. Breck, E., Cai, S., Nielsen, E., Salib, M., Sculley, D.: The ml test score: A rubric for ml production readiness and technical debt reduction. In: 2017 IEEE International Conference on Big Data, pp. 1123–1132. IEEE (2017)


  12. Cheng, C.H., Nührenberg, G., Yasuoka, H.: Runtime monitoring neuron activation patterns. In: 2019 Design, Automation & Test in Europe Conference & Exhibition, pp. 300–303. IEEE (2019)


  13. Creswell, J.W., Creswell, J.D.: Research design: Qualitative, quantitative, and mixed methods approaches. Sage publications (2017)


  14. Creswell, J.W., Poth, C.N.: Qualitative Inquiry and Research Design: Choosing Among Five Approaches, 4th edn. Sage Publishing (2017)


  15. Fabbrizzi, S., Papadopoulos, S., Ntoutsi, E., Kompatsiaris, I.: A survey on bias in visual datasets. arXiv preprint arXiv:2107.07919 (2021)

  16. Fauri, D., Dos Santos, D.R., Costante, E., den Hartog, J., Etalle, S., Tonetta, S.: From system specification to anomaly detection (and back). In: Proceedings of the 2017 Workshop on Cyber-Physical Systems Security and PrivaCy, pp. 13–24 (2017)


  17. Giese, H., et al.: Living with uncertainty in the age of runtime models. In: Bencomo, N., France, R., Cheng, B.H.C., Aßmann, U. (eds.) Models@run.time. LNCS, vol. 8378, pp. 47–100. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08915-7_3


  18. Ginart, T., Zhang, M.J., Zou, J.: Mldemon: Deployment monitoring for machine learning systems. In: International Conference on Artificial Intelligence and Statistics, pp. 3962–3997. PMLR (2022)


  19. Goodman, B., Flaxman, S.: European union regulations on algorithmic decision-making and a “right to explanation”. AI Mag. 38(3), 50–57 (2017)


  20. Gwilliam, M., Hegde, S., Tinubu, L., Hanson, A.: Rethinking common assumptions to mitigate racial bias in face recognition datasets. In: Proceedings of the IEEE CVF, pp. 4123–4132 (2021)


  21. Habibullah, K.M., Horkoff, J.: Non-functional requirements for machine learning: understanding current use and challenges in industry. In: 2021 IEEE 29th RE Conference, pp. 13–23. IEEE (2021)


  22. Heyn, H.-M., Subbiah, P., Linder, J., Knauss, E., Eriksson, O.: Setting AI in context: a case study on defining the context and operational design domain for automated driving. In: Gervasi, V., Vogelsang, A. (eds.) REFSQ 2022. LNCS, vol. 13216, pp. 199–215. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98464-9_16


  23. Horkoff, J.: Non-functional requirements for machine learning: Challenges and new directions. In: 2019 IEEE 27th RE Conference, pp. 386–391. IEEE (2019)


  24. Humbatova, N., Jahangirova, G., Bavota, G., Riccio, V., Stocco, A., Tonella, P.: Taxonomy of real faults in deep learning systems. In: 2020 IEEE/ACM 42nd International Conference on Software Engineering, pp. 1110–1121 (2020)


  25. Ishikawa, F., Yoshioka, N.: How do engineers perceive difficulties in engineering of machine-learning systems?-questionnaire survey. In: 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry, pp. 2–9. IEEE (2019)


  26. Islam, M.J., Nguyen, G., Pan, R., Rajan, H.: A comprehensive study on deep learning bug characteristics. In: 2019 ACM 27th European Software Engineering Conference, pp. 510–520 (2019)


  27. Jaipuria, N., et al.: Deflating dataset bias using synthetic data augmentation. In: Proceedings of the IEEE CVF, pp. 772–773 (2020)


  28. Kang, D., Raghavan, D., Bailis, P., Zaharia, M.: Model assertions for monitoring and improving ml models. Proc. Mach. Learn. Syst. 2, 481–496 (2020)


  29. Karkkainen, K., Joo, J.: Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In: Proceedings of the IEEE CVF, pp. 1548–1558 (2021)


  30. King, N., Horrocks, C., Brooks, J.: Interviews in qualitative research. Sage (2018)


  31. Knight, J.C.: Safety critical systems: challenges and directions. In: 24th International Conference on Software Engineering, pp. 547–550 (2002)


  32. Kreuzberger, D., Kühl, N., Hirschl, S.: Machine learning operations (mlops): Overview, definition, and architecture. arXiv preprint arXiv:2205.02302 (2022)

  33. Liu, A., Tan, Z., Wan, J., Escalera, S., Guo, G., Li, S.Z.: Casia-surf cefa: A benchmark for multi-modal cross-ethnicity face anti-spoofing. In: Proceedings of the IEEE CVF, pp. 1179–1187 (2021)


  34. Liu, H., Eksmo, S., Risberg, J., Hebig, R.: Emerging and changing tasks in the development process for machine learning systems. In: Proceedings of the International Conference on Software and System Processes, pp. 125–134 (2020)


  35. Lwakatare, L.E., Crnkovic, I., Bosch, J.: Devops for ai-challenges in development of ai-enabled applications. In: 2020 International Conference on Software, Telecommunications and Computer Networks, pp. 1–6. IEEE (2020)


  36. Marques, J., Yelisetty, S.: An analysis of software requirements specification characteristics in regulated environments. Int. J. Softw. Eng. Appl. (IJSEA) 10(6), 1–15 (2019)


  37. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv. 54(6), 1–35 (2021)


  38. Miron, M., Tolan, S., Gómez, E., Castillo, C.: Evaluating causes of algorithmic bias in juvenile criminal recidivism. Artif. Intell. Law 29(2), 111–147 (2021)


  39. Rabiser, R., Schmid, K., Eichelberger, H., Vierhauser, M., Guinea, S., Grünbacher, P.: A domain analysis of resource and requirements monitoring: Towards a comprehensive model of the software monitoring domain. Inf. Softw. Technol. 111, 86–109 (2019)


  40. Rahman, Q.M., Sunderhauf, N., Dayoub, F.: Per-frame map prediction for continuous performance monitoring of object detection during deployment. In: Proceedings of the IEEE CVF, pp. 152–160 (2021)


  41. Roh, Y., Lee, K., Whang, S., Suh, C.: Sample selection for fair and robust training. Adv. Neural. Inf. Process. Syst. 34, 815–827 (2021)


  42. Saldaña, J.: The Coding Manual for Qualitative Researchers, 2nd edn. Sage Publishing (2013)


  43. Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., Aroyo, L.M.: “Everyone wants to do the model work, not the data work”: Data cascades in high-stakes ai. In: 2021 Conference on Human Factors in Computing Systems, pp. 1–15 (2021)


  44. Shao, Z., Yang, J., Ren, S.: Increasing trustworthiness of deep neural networks via accuracy monitoring. arXiv preprint arXiv:2007.01472 (2020)

  45. Slack, M.K., Draugalis, J.R., Jr.: Establishing the internal and external validity of experimental studies. Am. J. Health Syst. Pharm. 58(22), 2173–2181 (2001)


  46. Uchôa, V., Aires, K., Veras, R., Paiva, A., Britto, L.: Data augmentation for face recognition with cnn transfer learning. In: 2020 International Conference on Systems, Signals and Image Processing, pp. 143–148. IEEE (2020)


  47. Uricár, M., Hurych, D., Krizek, P., Yogamani, S.: Challenges in designing datasets and validation for autonomous driving. arXiv preprint arXiv:1901.09270 (2019)

  48. Vierhauser, M., Rabiser, R., Grünbacher, P.: Requirements monitoring frameworks: A systematic review. Inf. Softw. Technol. 80, 89–109 (2016)


  49. Vierhauser, M., Rabiser, R., Grünbacher, P., Danner, C., Wallner, S., Zeisel, H.: A flexible framework for runtime monitoring of system-of-systems architectures. In: 2014 IEEE Conference on Software Architecture, pp. 57–66. IEEE (2014)


  50. Vogelsang, A., Borg, M.: Requirements engineering for machine learning: Perspectives from data scientists. In: 2019 IEEE 27th International Requirements Engineering Conference Workshops, pp. 245–251. IEEE (2019)


  51. Wang, A., et al.: Revise: A tool for measuring and mitigating bias in visual datasets. Int. J. Comput. Vis. 1–21 (2022)


  52. Wang, T., Zhao, J., Yatskar, M., Chang, K.W., Ordonez, V.: Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (October 2019)


  53. Wardat, M., Le, W., Rajan, H.: Deeplocalize: Fault localization for deep neural networks. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering, pp. 251–262. IEEE (2021)


  54. Zhang, X., et al.: Towards characterizing adversarial defects of deep learning software from the lens of uncertainty. In: 2020 IEEE/ACM 42nd International Conference on Software Engineering, pp. 739–751 (2020)



Author information


Corresponding author

Correspondence to Hans-Martin Heyn.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Heyn, HM., Knauss, E., Malleswaran, I., Dinakaran, S. (2023). An Investigation of Challenges Encountered When Specifying Training Data and Runtime Monitors for Safety Critical ML Applications. In: Ferrari, A., Penzenstadler, B. (eds) Requirements Engineering: Foundation for Software Quality. REFSQ 2023. Lecture Notes in Computer Science, vol 13975. Springer, Cham. https://doi.org/10.1007/978-3-031-29786-1_14


  • DOI: https://doi.org/10.1007/978-3-031-29786-1_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-29785-4

  • Online ISBN: 978-3-031-29786-1

  • eBook Packages: Computer Science, Computer Science (R0)
