7.1 Introduction

The continued rise of the information economy has meant an increase in the use of data to build and deploy data-driven products. These products are used to extract meaningful insights from raw information, which are then used to address challenges across many different fields. This has coincided with the emergence and development of Data Science as a distinct field of expertise concerned with building data-driven products. Data Science is distinct from Computer Science (the study of the theory and practice of how computers work), and it encompasses many fields. From the perspective of users, data-driven products have brought many new services and conveniences.

In health, for example, there was rapid deployment of data tools to inform the public about the COVID-19 pandemic (Alamo et al., 2020; Shuja et al., 2021), pandemic prediction models (Ray et al., 2020) and estimations of the impact of COVID-19 (Bradshaw et al., 2021). At the same time, some of the tools developed to deal with diagnostics/treatments were not as successful. Examples of such data-driven products are the many tools/algorithms that were developed or deployed to improve radiology scans (Roberts et al., 2021; Wynants et al., 2020). On the one hand, one may be tempted to say such deployments were a complete failure. On the other hand, these challenges highlight some of the shortcomings of data tools and areas for improvement. More importantly, these challenges outline the need to manage data (and its products) so that we take into account the human factors and impacts data may have across all domains. Keeping with the COVID-19 topic, the pandemic also put a spotlight on the lack of basic data infrastructure (Mbow et al., 2020), the lack of data skills and/or the lack of political will in many countries to focus on the improvement of data-driven products. These data-driven products and tools ultimately impact the quality of responses to the pandemic. The aforementioned examples highlight the need for Data Governance that takes a refined view of data.

I look at the Data Scientist (or Data Science Team) as the ones who make most of the decisions on the data tools they develop or create. This simplified view does not encapsulate all the challenges associated with what is currently taking place. It would be better to look at data-driven products through the lens of socio-technical systems. Socio-technical systems are systems which have interactions between humans, machines and the environment (Baxter & Sommerville, 2011). Even within the organization, the Data Science Team or Data Scientist cannot make decisions without a variety of different stakeholders, especially decisions that have an impact on humans and other environmental factors. As such, the Data Scientist should understand the inter-dependencies of organizations and society in order to see where they fit, and governance structures should exist to guide the development of systems with such inter-dependencies.

In this work, I aim to provide a better understanding of the governance/human factors that Data Scientists and organizations should be aware of. To address this challenge, I will answer fundamental research questions for the domain.

Research Question: What are the salient points that Data Scientists should be aware of when it comes to Data Governance within organizations?

Research Sub-Questions:

  • Do the current policies or mechanisms on the African continent provide a coherent view that can be used by Data Scientists to navigate and respond appropriately to the needs of the organization?

  • Can we learn from the ICT4D community to better understand how interventions should take care of more than just deploying a tool?

It is important to contextualize why we need to answer these questions. We are at a time where policy is lagging behind the deployment of data tools (this is discussed in this paper). This means there are gaps and blind spots that both Data Science practitioners and policy makers (in both public and private sectors) have. These blind spots have consequences. Much has been written about data protection policy making, and much has been written about Data Science practice and its limitations. In this work I want to link the two in order to reach a joint understanding that decision making has to be done together.

The rest of the document is organized as follows. First, I look at the field of Data Science and how Data Governance fits into practice. The next step is to look at Data Governance on the African continent. I will set the scene and identify gaps that intersect both areas of Data Science and Data Governance. In the following section, I discuss how ICT4D may have already blazed a path from which we can learn about the interactions of Data Science and Data Governance. The latter sections deal with the different stages of the Data Science process and proposals on how best Data Scientists can navigate human factors such as privacy, bias and security. Lastly, I conclude and summarize the viewpoints and evidence elaborated on in this paper.

7.2 Data Science and Practice

I first look at the practice of Data Science and its connections to Data Governance. As such, I provide an overview of what Data Science is. The definition is still evolving, but it is important for a joint understanding between the reader and the author.

7.2.1 What Is Data Science?

Data Science is a discipline that has arisen due to a number of factors. Data Science itself is a field that uses scientific modelling techniques (typically from a diverse set of scientific disciplines) to extract patterns/information/knowledge from a wide variety of data (Dhar, 2013). The rise of this discipline has been swift for many reasons. Organizations (public and private) have been working to explore the data that they have amassed over time and mine information for patterns and trends that may give them a competitive advantage. There has been an explosion in the number of large internet-based organizations and internet-generated content. Put simply, with more users on the internet, and more content on the internet, the information economy needs better data and data tools to monetize these users (Mandl & Kohane, 2016; Zhang, 2017) (e.g. for advertising) or for services that motivate users to stay within a company’s products (a walled garden) (Best, 2014; McCown & Nelson, 2009; Skorup & Thierer, 2013).

On the side of public organizations, Data Science has meant work to analyse or collect data that improves on services provided by governments, or new ways to understand citizens (sometimes resulting in mass/hyper surveillance). It is very important to understand these factors, especially as they are connected to “value creation in the information age”. Consideration of the political economy of data, whereby incentives for the monetization of data may be at odds with the interests of private citizens, is critical. Issues of concern include the ability of data scientists to shape and influence data governance around private incentives, as well as their ability to collect and utilize information for purposes beyond the intentions of the individual providing data (Nyamwena & Mondliwa, 2020). These factors necessitate that we understand the foundational data infrastructures (physical, virtual, human and otherwise) through the lens of governance, specifically Data Governance. Let us first break down the process of Data Science.

7.2.2 The Data Science Process

To provide the reader with a better understanding of Data Science, I use data analysis cycles to provide an insight into the typical Data Science Process. One can use the CRoss Industry Standard Process for Data Mining (CRISP-DM) as a representation of the process (Wirth & Hipp, 2000). The steps are typically: (a) understand a business problem, (b) understand the data required, (c) collect data, (d) prepare data, (e) perform modelling, (f) evaluate the solution to the problem and (g) adjust understanding and/or deploy (see Fig. 7.1).

Fig. 7.1
A flow diagram which includes: 1. Business understanding, 2. Data understanding, 3. Data preparation, 4. Modeling, 5. Evaluation and 6. Deployment. There is interaction between 1 and 2, and between 3 and 4. There is a connection between 5 and 1.

CRISP-DM flow model. Source: Wikimedia Jensen (2012)

One notes that all of this focuses on solving a business challenge. We can easily extend this to solving any societal/organizational/scientific challenge; it does not need to be a business one. This process is similar to the Epicycles of Analysis (Peng & Matsui, 2015), which split the processes of the problem and the analysis for a solution to the problem. This approach tries to separate the problem formulation from the modelling. Problem formulation requires understanding the correct data to gather or get access to. Ultimately, with all of these, we need to understand the human factors and dimensions that arise in all parts of the cycles. The inter-dependencies are discussed later in the document.
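The CRISP-DM phases and feedback loops described for Fig. 7.1 can be sketched as a small transition table. This is only an illustrative sketch: the names `Phase`, `TRANSITIONS` and `is_valid_path` are hypothetical, and the forward arrows (2 to 3, 4 to 5, 5 to 6) are assumed from the numbered order of the phases rather than stated in the figure description.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELLING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves between phases. The two-way links (1<->2, 3<->4) and the
# loop back from evaluation to business understanding (5->1) follow the
# figure description; the remaining forward moves are an assumption.
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELLING},
    Phase.MODELLING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


# A project that evaluates a model and then goes back to re-frame the problem.
iteration = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELLING,
    Phase.EVALUATION,
    Phase.BUSINESS_UNDERSTANDING,
]
print(is_valid_path(iteration))
```

Writing the cycle down this way makes the point that evaluation can send a project back to the problem statement, which is where many of the human-factor questions are raised.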

The rise of Data Science has also coincided with the rise of Machine Learning and Artificial Intelligence (West & Allen, 2018), and typically it is expected that Data Scientists have an understanding of, and can use, concepts from these fields (Tang & Sae-Lim, 2016). Machine Learning is a field of study concerned with creating tools that learn analytical models from data (Alpaydin, 2020) and is a subset of Artificial Intelligence. Artificial Intelligence is a field of study concerned with creating machines which mimic the intelligence of humans, typically defined as creating an agent that can perceive its environment, and perform actions to maximize some utility or achieve some goal(s) (Russell & Norvig, 1995).

Many Data Science researchers/practitioners are also Artificial Intelligence and/or Machine Learning practitioners/researchers. As such, from here on I will refer to Data Science researchers/practitioners even if I am talking about Artificial Intelligence and/or Machine Learning. Many Data Science researchers or practitioners are comfortable with the above models of understanding data and the subsequent analysis. For this to be successful, society and organizations have an ever-growing need to understand what actually happens during the development and deployment of a system or model in the real world. Governance, in more ways than one, comes into play. The data collection needs consideration of humans and the human dynamic (Bender & Friedman, 2018; Gebru et al., 2018; Jo & Gebru, 2020). The choice of modelling requires consideration of people and their needs (Mitchell et al., 2019), and the deployment further requires consideration of the human dimension in all its guises (Raji, Gebru, et al., 2020; Raji, Smart, et al., 2020). As such, Data Governance can be a useful tool for making the Data Scientist aware of these human factors and the challenges that arise when humans and data [collection, modelling or products] interact (Buolamwini & Gebru, 2018; Hooker, 2021; Ledford, 2019; Mehrabi et al., 2021; Sujan et al., 2019).

7.2.3 Why Do We Need Data Governance?

From the perspective of governments, as part of economic development and growth, they want to embrace “value creation in the information age” (Nyamwena & Mondliwa, 2020). To do so, the collection, use and flow of data has to be governed in order to have oversight over this value creation. In short, Data Governance has to touch every part of the Data Science life cycle as discussed earlier. Data Governance also rises to prominence as a result of historical pushes for the digitization of countries, especially African countries. Governments are concerned that if they do not capitalize on the data opportunity, they will be left behind in yet another wave of economic development. The challenge arises when we look at the ways Data Governance has to be shaped for different countries. Without adequate Data Governance, both public and private sectors are at risk of not realizing the full potential of the information economy. This is a big risk, as products that fall short of the values of a country's citizens may be deployed and ultimately cause harm. Areas where governance may fall short include inadequate privacy protections (Osakwe & Adeniran, 2021), limitations on what data can be used for, regulation of data-driven products that could be harmful (Metcalf & Crawford, 2016), guidelines on data sovereignty (Hummel et al., 2021), and how specific sets of data should be treated as public goods to be shared within or outside a country (Borgesius et al., 2015). Good Data Governance is not only about the data creation stage, but about how governance permeates the full Data Science cycle (Metcalf & Crawford, 2016). Furthermore, good Data Governance requires contextual knowledge of and from decision makers (in both public and private sectors) to understand the Data Science cycle (data, modelling, algorithms, etc.) (Kearns & Roth, n.d.).

It is harder for the gatekeepers to regulate industry if they themselves do not have a foundational understanding of what typically happens within the Data Science cycle. This is an important point to highlight because industries such as finance, for example, have well defined regulators in most countries. These financial regulators regulate the industry to mitigate corruption and harm. Regulatory boards are made up of experts in the field who work to set best practice, limitations and also penalties for breaches of the regulations. The challenge with many of the data-driven products we see nowadays is that many of the decision makers in the process of deploying these tools have little experience with the field itself and see most of what is going on as a black box that takes in data and “magically” produces answers. This highlights the need for basic foundational regulation that asks the right questions when developing data-driven products but also sets the path for a joint understanding of the field, which should be understood by all people (not just experts). In the following section I look at important parts of the Data Science cycle and highlight the human factors and questions that should be asked by Data Scientists and also be understood by decision makers.

7.3 Human Factors and the Data Science Cycle

In order to champion the joint understanding of Data Science and Data Governance, in this section I discuss the human factors in the Data Acquisition, Modelling and Presentation phases of the Data Science cycle.

7.3.1 Data Acquisition

One of the steps that is fraught with tension in the Data Science process is the data acquisition process. This can be a blind spot (Mitchell et al., 2018; Zhang et al., 2018) that can make or break many projects. Imagine using a dataset collected in the 1950s on financial lending by banks. A predictive tool built with such a dataset to assist in lending decisions would be full of gender and racial biases in many countries (Bond & Tait, 1997; Rice, 1996). Put simply, the model would learn to discriminate. This is still a challenge today (Runshan et al., 2021). Even if the data is taken as representative of the population being studied, it may encode societal bias and discrimination. Most times when talking and interacting with decision makers or clients, those without much experience tend to overlook the challenges in the acquisition of data. These challenges are connected with governance issues (Veale & Binns, 2017).

7.3.2 Processes and Procedures

In acquiring data, as part of the Data Science process, one connects the problem being approached with the data that will be needed to solve the problem. At some times, there may be data before the questions are clear, while at other times there is a question to be answered but the data has not been mapped out. In all instances, data has to be moved from where it rests and staged for processing by the Data Science team. This requires identification of the relevant data source, identification of which subset of the information is important and how the transmission will occur. In doing these identification steps we have to look at the human factors.

7.3.3 Human Factors

For each of the following steps of the Data Science process, I focus on three human factors. For data acquisition I focus on: Where does the data come from? Why is/was it being collected? Who is the data about? There are many more factors, but for conciseness and to communicate the message, I will keep to three factors per step of the Data Science cycle.

Where does the data come from? When identifying the source of data, it quickly becomes clear that one has to understand the structures of the organizations, internal or external, that control access to and use of the data. In an ideal case, there is a clear Data Governance structure that also provides information on how a data scientist can request data, how the data should be handled and any sensitive and salient information that the scientist should be aware of (Abraham et al., 2019). There would be questions related to the sensitivity of the data. Was the data collected in an ethical manner? Is the data part of an open data repository? What licensing is the data under, and what are the expectations of use? If the data is from a governmental entity, what are the national expectations on Open Government data? For example, in a municipality, one may expect that aggregated water use data by municipal ward should be open and available (especially as many areas in some countries face water shortages), but there may be some resistance by some officials to making this data available.

It may be that there are not enough human resources to create and keep the data available, the data may normally be available for a fee that adds to revenue, there may be issues of transparency, etc.

Why is/was it being collected? This is an important factor as it establishes prior expectations on what the data that was collected, or is being collected, is used for. If we imagine that we have data about the transaction habits of bus riders in a city, the original use of the data and expectation was to manage the transportation system. If the data will now be used to understand behaviour in order to deliver advertising to bus riders, this new use may not be covered by the original terms of reference. More importantly, bus riders may not agree with the change in the use of their data, and the organization has a responsibility to them to treat their information with care and thought.
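The bus rider example can be made concrete with a minimal purpose check. This is a sketch under stated assumptions: the `DatasetRecord` structure, the `use_is_covered` helper and the purpose labels are hypothetical and do not correspond to any particular governance framework or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata kept alongside a dataset."""
    name: str
    source: str
    collected_for: frozenset  # purposes communicated when the data was collected


def use_is_covered(record, proposed_purpose):
    """Return True only if the proposed use matches an original purpose."""
    return proposed_purpose in record.collected_for


# Illustrative record for the bus transaction data discussed above.
bus_data = DatasetRecord(
    name="bus_transactions",
    source="city transport operator",
    collected_for=frozenset({"transport_management"}),
)

for purpose in ("transport_management", "advertising"):
    status = "covered" if use_is_covered(bus_data, purpose) else "needs review"
    print(f"{purpose}: {status}")
```

A check like this does not decide whether a new use is acceptable; it only flags that the question has to go back to the people and structures responsible for the data.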

Who is the data about? In carrying through the process to build up the data, one has to consider whether it is representative of the population it is serving. Again, focusing on when the data is about people, we need to understand who the data represents and whether this distribution is equitable and fair (Mitchell et al., 2018; Zhang et al., 2018). Further, does this distribution of people actually match those we expect to make decisions about in the end data-driven product? If not, this may be a problem that introduces biased decision making. For example, in the recent decade, much has been highlighted about the bias in facial recognition systems (Raji, Gebru, et al., 2020). Some of this bias comes from the original data that was used to train them (Mitchell et al., 2018; Zhang et al., 2018). Some of this bias comes from the designs of the systems and also how success is measured. I will discuss more on this later in the modelling and the presentation subsections. One can see just from looking at the above that there are important human factors that cannot just be left to the Data Scientist or organization to make decisions about. There need to be foundational expectations on data handling, data storage, security and ethics, as well as regulatory tests on what the data would be used for.
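One simple way to start asking who the data is about is to compare the share of each group in a dataset with its share in the population the product will serve. The sketch below assumes group labels are available and that reference population shares are known; the function name `representation_gap` and the example numbers are hypothetical.

```python
from collections import Counter


def representation_gap(sample_groups, population_shares):
    """Share of each group in the sample minus its share in the population.

    Positive values mean the group is over-represented in the sample,
    negative values mean it is under-represented.
    """
    if not sample_groups:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: round(counts.get(group, 0) / total - share, 3)
        for group, share in population_shares.items()
    }


# Illustrative numbers only: a sample that is 80% group A,
# drawn from a population that is evenly split.
sample = ["A"] * 80 + ["B"] * 20
population = {"A": 0.5, "B": 0.5}
print(representation_gap(sample, population))
```

A gap close to zero does not show that the data is fair, since representative data can still encode societal bias, but a large gap is an early signal that decisions may be made about people the data barely describes.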

7.4 Data Analysis and Modelling

In the Data Analysis and Modelling step, the Data Scientist focuses their energy on using the correct approaches to extract meaningful information from the data. These choices will influence the final result as well as be the foundation on which many will choose to believe the results or not. Even though these may be established computational, statistical or mathematical approaches, we still need to understand how choices impact the end product and people.

7.4.1 Processes and Procedures

The Data Scientist takes the data that has been acquired in the prior step. They then work to clean it, transform it into a form that can be used by downstream modelling tasks and load it into their modelling systems. The Data Scientist will make choices on metrics to be measured or optimized. Ultimately, these metrics are used to decide on success and then to determine whether new data should be sourced, the question should be re-framed or one can move to the next step of the Data Science cycle.

7.4.2 Human Factors

For the data analysis and modelling stages I focus on these factors: How are the modelling choices made? Who has the skills to model? What are the models for the use-case being used?

How are the modelling choices made? For a period, there was a popular retort that people are biased and machines are unbiased. When it was highlighted that machines cannot be unbiased because the data that they use to learn may be biased, the argument shifted to the claim that algorithms cannot be biased, only the data (Birhane & Cummins, 2019). But this still ignores the fact that modelling choices also impact the results of the final models (Jiang et al., 2020). In Machine Learning, we pride ourselves on working to build better and better generalizable, accurate and efficient algorithms, but this does not absolve us from thinking about our modelling choices (Birhane et al., 2021). Work by Hooker et al. (2020) highlighted the biases in compressed models. Further, more and more ML models use transfer learning (building on prior models or datasets), which carries biases forward. This is one of the reasons Data Scientists should work to document their modelling choices (Mitchell et al., 2019). Modelling choices may seem insignificant at the time of decision making, but may lead to big consequences later. A recent example (Birhane et al., 2021) is how models influenced the collection of a massive dataset (in order to fight against bias) that, when looked at under a microscope, turned out not to be as representative as the dataset authors claimed. This highlights the lack of participation and inclusive design choices, which also calls into question who has the modelling skills.

Who has the skills to model? ML/AI/Data Science is a field that is typically skewed in terms of demographics and who ends up building the underlying technologies. One may argue that this does not apply on the African continent when it comes to racial makeup. But that is not a true reflection of the field. For a long period, in major technology companies on the continent, the senior technical roles were skewed male and white (mirroring the challenges that have been criticized about Silicon Valley). Making this worse is the lack of Data Science skills on the continent. Without these skills, we have even less connection between decision makers and those who design models. How many of the decision makers have a data/computational background? Another factor is that the major tech companies that drive most of the internet economy tend to only have business offices on the continent (Birhane, 2020). Their aim is to sell their services (Birhane, 2020), extract data (Coleman, 2018) and handle regulatory issues (if there is regulation (Birhane, 2020; Coleman, 2018)). The offices do not build or shape the core technologies at these companies. As such, if we connect this question to the prior one, we see how modelling choices can become a life-changing decision for those on the receiving end of downstream tasks. Imagine how, in organizations, automated hiring systems were deployed to assist in the hiring process by using AI to screen or monitor candidates. These systems have been shown to be discriminatory (Sánchez-Monedero et al., 2020), but what are the odds that the decision makers and internal Data Science teams had the skills to evaluate their facial recognition systems or text screening services for bias?

What are the models for the use-case being used? Recent work in the ML/AI field has brought a focus on explainable models in the fight against harm and the pursuit of better fairness. Choices of such models arise in every use-case. Let’s take, for example, the increase in surveillance systems and facial recognition systems internationally [ref]. How the models are chosen for such use-cases and evaluated determines the ultimate impact these systems will have on society. Much work has highlighted how biased facial recognition systems (Raji, Smart, et al., 2020) can lead to discriminatory behaviour by law enforcement. This may end up being a life or death situation for someone at the end of these automated systems. A Data Scientist and decision maker need to ask themselves, what is the cost of an error of our model? The answer should then impact how the deployment is done. Further, depending on the societal expectations, there may be regulatory restrictions on making one choice or another.
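The question of what an error costs can be made concrete by weighting false positives and false negatives and reporting the result per group rather than as one overall number. The following is a minimal sketch: the function name, the cost values and the example records are hypothetical, and choosing the real costs is a decision for stakeholders, not only for the Data Scientist.

```python
def error_costs_by_group(records, cost_fp, cost_fn):
    """Average cost of errors per group.

    records: iterable of (group, y_true, y_pred) tuples with 0/1 labels.
    cost_fp / cost_fn: assumed cost of one false positive / false negative.
    """
    totals = {}
    for group, y_true, y_pred in records:
        stats = totals.setdefault(group, {"n": 0, "fp": 0, "fn": 0})
        stats["n"] += 1
        if y_pred == 1 and y_true == 0:
            stats["fp"] += 1  # flagged although the true label is negative
        elif y_pred == 0 and y_true == 1:
            stats["fn"] += 1  # missed although the true label is positive
    return {
        group: (s["fp"] * cost_fp + s["fn"] * cost_fn) / s["n"]
        for group, s in totals.items()
    }


# Illustrative records: (group, true label, predicted label).
records = [
    ("A", 0, 0), ("A", 0, 1), ("A", 1, 1), ("A", 1, 1),
    ("B", 0, 1), ("B", 0, 1), ("B", 1, 0), ("B", 1, 1),
]
# Assume a false positive (e.g. a wrongful match) is ten times as costly.
print(error_costs_by_group(records, cost_fp=10, cost_fn=1))
```

Reporting costs per group makes it visible when a model that looks acceptable on average places most of its errors, and therefore most of its harm, on one part of the population.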

7.5 Presentation and Deployment of Data-Driven Products

The final step in many Data Science projects is presenting results to decision makers and/or the deployment of the data-driven products.

7.5.1 Processes and Procedures

In this step, the Data Scientist would work to present a report on the findings of the modelling in order to answer the original questions. From here, decisions may be made based on these reports. Reports may be visualizations, simulations or data-driven products with metrics that show their efficacy. Decisions will be made on what to show and who the data-driven products will be aimed at. These decisions involve human factors.

7.5.2 Human Factors

For the Presentation and Deployment of data-driven products stages, I focus on these factors: What decisions are being made with the models? What choices are being made in what to be shown? How will the models be kept updated?

What decisions are being made with the models? The ultimate test of the usefulness of a model for the decision maker is when it is deployed for use or presented for decision making. This is a spot in the Data Science life cycle that requires careful understanding of the prior parts of the cycle, or wrong decisions could be made. When looking at the data product or predictions of a model, the user must understand how the model works, how it was built and what limitations it has. The sub-question here could be, how do people interpret the results/predictions from the data product? This requires more than just displaying a result; it also requires working with human computer interaction practitioners to design in a way that is fair, transparent and mitigates bias or discrimination (Holstein et al., 2019; Lee & Singh, 2021).

What choices are being made in what to be shown? As in the statistical domain, we can also lie with data-driven products. The COVID-19 pandemic had many examples where decision makers worked to distort data, distort model predictions and even censor data researchers and practitioners in order to fit with a view that the decision maker held (A hostile environment, 2021; Vigjilenca, 2020; Zhang & Barr, 2021). This may be taken as an extreme public example, but this does happen in many ways. One may be testing for harm at run-time.

How will the models be kept updated? When deploying data-driven products, the internal models have to be kept updated. The world did not stop changing when the model was trained and deployed. As such, the models will start exhibiting drift. This drift may also come from how users respond to what the model itself does. Does the organization or Data Science team have procedures for the maintenance of the models in the data-driven product, and for how to test for drift before the system has high error in its results (predictive, prescriptive, diagnostic, etc.)?
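One way a team might test for drift is to compare the distribution of a model input (or score) at training time with what is seen in production. The sketch below uses the Population Stability Index (PSI), which is my choice of illustration rather than a method prescribed by the works cited above; the function name, bin count and smoothing constant are assumptions.

```python
import math


def population_stability_index(reference, current, n_bins=5, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature.

    Bins are equal-width over the range of the reference sample; values of
    the current sample outside that range fall into the first or last bin.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Illustrative data: production values shifted upwards relative to training.
training_values = list(range(100))
production_values = [v + 30 for v in training_values]
print(population_stability_index(training_values, training_values))
print(population_stability_index(training_values, production_values))
```

Identical samples give a PSI of zero, and larger values indicate a bigger shift; a commonly quoted rule of thumb treats values above roughly 0.25 as a significant change, although any threshold should be agreed for the specific product and reviewed as part of its maintenance procedures.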

In this section I have discussed how Data Science and Data Governance intersect. In the latter part of the section, I chose three stages of the Data Science cycle to analyse for human factors. Through identifying these human factors, we can better understand how Data Governance is an integral part of the full cycle, as decisions made by the scientist will impact users and humans in general. In the next section I discuss Data Governance on the African continent.

7.6 Data Governance and the African Continent

With calls for African countries to jump on to the current advances of data-driven economies, there have been some movements by governments towards strategies and governance policies that cover data. The African Union released “The Digital Transformation Strategy for Africa 2020–2030” (African Union, 2020). This strategy should be understood in the context of the wider and more localized Data Governance and digitization challenges in different African countries. When it comes to privacy, the European General Data Protection Regulation (GDPR) (European Commission, n.d.) has had a wide-ranging effect and impact on the internet economy, as many companies that processed European citizen data had to abide by the rules set out by the EU. Around the African continent, as shown by the research of Davis (2021), there are efforts to strengthen data protection policies, even with only about 52% of African countries having such legislation.

The African Union Convention on Cyber Security and Personal Data Protection (known as the Malabo Convention) (African Union, 2014) was adopted by AU member states in 2014. It sets out to provide protections for cyber infrastructure, protection of personal information, cyber security and the necessary foundations to enable an information economy across the African continent. Even though it was adopted in 2014, only eight countries had ratified the convention by 18 June 2020. The convention touches on many aspects that can form a unified foundation for African countries to benefit from the information economy. Without ratification, we have the reality that organizations and practitioners do not have a unified view on how to deploy data tools, and for some countries the reality is much worse, with very lax or non-existent protections (Davis, 2021).

In South Africa the Protection of Personal Information Act (POPIA) (Government of South Africa, n.d.), which has taken many years to get enacted, has also begun a discussion in the public on data acquisition, protection of personal information and the use of the data for downstream tasks (especially when it is not for the original purpose of data collection). Even so, Data Governance is not only the protection of personal information, but there are also many more human and organizational factors that data interacts with. I hope the preceding section has made it clear that Data Governance should cover more than just the data being used. But, as earlier discussed, there are many human factors that should be taken into consideration in all the stages of the Data Science cycle. To effectively govern the full process, countries have to have a clear understanding of the stages as well as the responsibilities of governments towards the Data Scientists and the responsibilities of the Data Scientists towards the public.

The African continent has made big strides in the ICT sector, in building local skills and in championing local companies. Even so, there is still a dominance of the Big Tech Giants (Microsoft, IBM, Google, Facebook etc.) on the continent, physically or with services that cross borders. Even though we do not have an agreed definition of the data skill gap, the work by Sey and Mudongo (2021) highlights how there is a lack of understanding of the need for AI skills, and that we need efforts to build these skills on the continent that connect the public and private sectors. These insights are important as they place in context how few of the Big Tech firms do any research and development on the continent. AI governance skills are recommended as part of the development of AI skills on the continent (Sey & Mudongo, 2021), echoing the message in this paper on the broader Data Science and Data Governance nexus.

The continent risks being just a source of data (Birhane, 2020) used to build services that citizens then consume, without any local development of these services. This was recently brought into focus by how Facebook has only 13% of its abuse team (which fights abuse on its online platforms) working on non-US content, even though 90% of Facebook users are outside the US (Purnell et al., 2021). This is important because misinformation on Facebook outside the US has an effect on many countries, yet cannot be battled by Facebook itself. Furthermore, governments have to be able to govern the digital space and ensure that citizens get to benefit from digital public goods (Gillwald & van der Spuy, 2019). Another challenge is the use of some data-driven products for surveillance by both governments and the private sector on the continent (Mudongo, 2021). As already highlighted, the systems are less likely to be developed locally and may encode.

7.7 Case Study: Learning from Our Recent Past, Enter ICT4D

Data Science and Artificial Intelligence are hailed as a silver bullet for many problems, and data itself is referred to as the new oil to be exploited by nations and organizations (Hirsch, 2013). But a challenge that organizations and nations should be able to spot is rearing its head again. With the rise of ICT and digitization efforts, many problems were identified for which ICT could be the solution (Curtis, 2019). Combined with development practice, this gave rise to ICT4D, which has been a force for the last two or more decades (Walsham, 2017).

I argue that enough time has now passed to see the shortcomings of treating many problems as requiring ICT as the solution, especially when practitioners come from outside, drop in, deploy and then leave. This is very much akin to what is currently happening in the Data Science world, and it needs to change (Shilton et al., 2021). There may be differences, chief among them that people are familiar with what ICT is and less familiar with what Data Science, Artificial Intelligence or Machine Learning are (Osoba & Welser, 2017). Basically, Data Science researchers and practitioners are seen as magicians: you throw a problem and data at them, and a solution arrives on the other side. We see this in the touting of 4IR strategies for African nations, driven by public institutions that do not have the skills or knowledge to really engage with the subject they are touting as a solution to many of the problems they face (McBride et al., 2018; Moorosi et al., 2017). In ICT4D, a historical debate concerned the efficacy of having researchers and practitioners who were not locals come in with ICT "solutions" to many development issues (Andrade & Urquhart, 2012; Toyama, 2015). Over time this has become an area of study within the field itself. It became very apparent that the development and design of systems should be participatory (Andrade & Urquhart, 2012; Tongia & Subrahmanian, 2006; Toyama, 2015) and should take into account more than just the technical challenge. Learning this lesson took time and many failures. In contrast, within the Data Science and Artificial Intelligence field, a lot of work has been put into understanding fairness, ethics and the longer-term effects of technical interventions. This is a welcome change from the ICT4D history, but we are still lagging in understanding the need for participatory design, as well as for governance that guides the field (Singh & Flyverbom, 2016). We have large international bodies, like the International Telecommunication Union, that many states belong to and that have shaped ICT policies across regions.

In Artificial Intelligence, one can say the debate on fairness and harm has been very much open because of the threat of wide-scale impact on people. But this does not mean that debates solve the problems. In most of the debates and discussions, it is mostly researchers, and not decision and policy makers, who are doing the work to document harm and make recommendations to mitigate it (Whittaker et al., 2018). Policy makers need to come to the table and help shape the debate by providing input from government. We need to draw on lessons from other fields while also recognizing what is unique here: data-driven products were taken up before we even had the time to think about their impact.

7.8 Conclusion

In this paper, I used a survey of the literature on Data Science and Data Governance to bring to the fore the connections within this nexus. Leaving design decisions to the Data Scientist alone ignores the many human factors that data-driven products involve. As such, Data Governance is key to creating and deploying products that add to the developing economies on the continent while mitigating harm. This requires that African countries appreciate the need for governance and have the skills to enable effective policy. The ICT4D case study allows us to learn from a related discipline that has been active for two decades and has faced similar challenges in deploying interventions in the Global South.

Recommendations:

  • There is a need for African governments to work together to practically implement Data Governance policy. The glaring reality that only eight countries (as of this writing) have ratified the African Union Convention on Cyber Security and Personal Data Protection leaves much to be desired.

  • Both public and private industries must engage with data scientists to better understand the areas of concern highlighted in this paper beyond data privacy. Most policy on the continent focuses on privacy protections and some automated decision making, but many other decisions made in the process of developing data tools affect the final outcome.

  • For the data scientist, it must be a reality that policy and the development of data tools go hand in hand. Even if national, regional or continental policies have not caught up, there is a growing movement within our practice that works to develop best practice and to highlight challenges in ethics, fairness and mitigating abuse.