Abstract
Within a short space of time, the debate about Data Governance has fallen behind the realities of data-driven industries and economies. The flow and trade of data is driven by the needs of different stakeholders and by the evolution of the global contexts of many technologies that are seen as local. To the Data Scientist, it may seem like an exciting time of infinite possibility and opportunity to invent the near future. The gap between Data Governance on the African continent and Data Science practice poses a challenge that must be dealt with sooner rather than later. In this chapter I look at the intersection of Data Science practice and Data Governance and analyse some of the recent literature to identify areas of concern and focus. Ultimately, I want to look at how non-technical considerations are core to bridging Data Governance and Data Science practice. I borrow from other disciplines that had a head start on these challenges. Finally, I suggest steps that practitioners can take to reduce the gap between governance and practice.
7.1 Introduction
The continued rise of the information economy has meant an increase in the use of data to build and deploy many data-driven products. These data-driven products are used to extract meaningful insights from raw information, which are then used to address challenges across many different fields. This has coincided with the emergence and development of Data Science as a unique field of expertise in building data-driven products. Data Science is distinct from Computer Science (the study of the theory and practice of how computers work), and it encompasses many fields. From the perspective of users, data-driven products have brought many new services and conveniences.
In health, for example, there was rapid deployment of data tools to inform the public on the COVID-19 pandemic (Alamo et al., 2020; Shuja et al., 2021), pandemic prediction models (Ray et al., 2020) and estimations of the impact of COVID-19 (Bradshaw et al., 2021). At the same time, some of the tools developed for diagnostics and treatment were not as successful. Examples of such data-driven products are the many tools and algorithms that were developed or deployed to improve radiology scans (Roberts et al., 2021; Wynants et al., 2020). On the one hand, one may be tempted to call such deployments a complete failure. On the other hand, these challenges highlight some of the shortcomings of data tools and areas for improvement. More importantly, they outline the need to manage data (and its products) so that we take into account the human factors and the impacts data may have across all domains. Staying with the COVID-19 topic, the pandemic also put a spotlight on the lack of basic data infrastructure (Mbow et al., 2020), the lack of data skills and/or the lack of political will in many countries to focus on the improvement of data-driven products. These products and tools ultimately affect the quality of responses to the pandemic. The aforementioned examples highlight the need for Data Governance that takes a refined view of data.
I look at the Data Scientist (or Data Science Team) as the one who makes most of the decisions on the data tools they develop or create. This simplified view does not encapsulate all the challenges associated with what is currently taking place. It would be better to look at data-driven products through the lens of socio-technical systems: systems which have interactions between humans, machines and the environment (Baxter & Sommerville, 2011). Even within an organization, the Data Science Team or Data Scientist cannot make decisions without a variety of different stakeholders, especially decisions that have an impact on humans and other environmental factors. As such, the Data Scientist should understand the interdependencies of organizations and society to better see where they fit, and governance structures should exist to guide the development of systems with such interdependencies.
In this work, I aim to provide a better understanding of the governance and human factors that Data Scientists and organizations should be aware of. To address this challenge, I will answer fundamental research questions for the domain.
Research Question: What are the salient points that Data Scientists should be aware of when it comes to Data Governance within organizations?
Research Sub-Questions:
-
Do the current policies or mechanisms on the African continent provide a coherent view that can be used by Data Scientists to navigate and respond appropriately to the needs of the organization?
-
Can we learn from the ICT4D community to better understand how interventions should take care of more than just deploying a tool?
It is important to contextualize why we need to answer these questions. We are at a time when policy lags behind the deployment of data tools (as discussed in this paper). This means there are gaps and blind spots for both Data Science practitioners and policy makers (in both the public and private sectors). These blind spots have consequences. Much has been written about data protection policy making, and much about Data Science practice and its limitations. In this work I want to link the two in order to build a joint understanding that decision making has to be done together. The rest of the document is organized as follows. First, I look at the field of Data Science and how Data Governance fits into practice. The next step is to look at Data Governance on the African continent. I will set the scene and identify gaps that intersect both Data Science and Data Governance. In the following section, I discuss how ICT4D may have already blazed a path we can learn from in understanding the interactions of Data Science and Data Governance. The later sections deal with the different stages of the Data Science process and proposals on how best Data Scientists can navigate human factors such as privacy, bias and security. Lastly, I conclude and summarize the viewpoints and evidence elaborated on in this paper.
7.2 Data Science and Practice
I first look at the practice of Data Science and its connections to Data Governance. As such, I provide an overview of what Data Science is: a definition that is still evolving, but one that matters for a joint understanding between the reader and the author.
7.2.1 What Is Data Science?
Data Science is a discipline that has arisen due to a number of factors. Data Science itself is a field that uses scientific modelling techniques (typically from a diverse set of scientific disciplines) to extract patterns/information/knowledge from a wide variety of data (Dhar, 2013). The rise of this discipline has been swift for many reasons. Organizations (public and private) have been working to explore the data they have amassed over time and mine it for patterns and trends that may give them a competitive advantage. There has been an explosion in the number of large internet-based organizations and in internet-generated content. Simply put, with more users and more content on the internet, the information economy needs better data and data tools to monetize these users (Mandl & Kohane, 2016; Zhang, 2017) (e.g. for advertising) or for services that motivate users to stay within a company's products (a walled garden) (Best, 2014; McCown & Nelson, 2009; Skorup & Thierer, 2013).
On the side of public organizations, Data Science has meant work to analyse or collect data that improves the services provided by governments, or new ways to understand citizens (sometimes resulting in mass or hyper surveillance). It is very important to understand these factors, especially as they are connected to “value creation in the information age”. Consideration of the political economy of data, whereby incentives for the monetization of data may be at odds with the interests of private citizens, is critical. Issues of concern include the ability of data scientists to shape and influence data governance around private incentives, as well as their ability to collect and utilize information for purposes beyond the intentions of the individual providing the data (Nyamwena & Mondliwa, 2020). These factors necessitate that we understand the foundational data infrastructures (physical, virtual, human and otherwise) through the lens of governance, specifically Data Governance. Let us first break down the process of Data Science.
7.2.2 The Data Science Process
To provide the reader with a better understanding of Data Science, I use data analysis cycles to provide insight into the typical Data Science process. One can use the CRoss Industry Standard Process for Data Mining (CRISP-DM) as a representation of the process (Wirth & Hipp, 2000). The steps are typically: (a) understand a business problem, (b) understand the data required, (c) collect data, (d) prepare data, (e) perform modelling, (f) evaluate the solution to the problem and (g) adjust understanding and/or deploy (see Fig. 7.1).
One notes that all of this focuses on solving a business challenge. We can easily extend this to solving any societal, organizational or scientific challenge; it does not need to be business. This process is similar to the Epicycles of Analysis (Peng & Matsui, 2015), which splits the process of framing the problem from the analysis that solves it. This separates problem formulation from modelling, and problem formulation requires understanding the correct data to gather or get access to. Ultimately, with all of these, we need to understand the human factors and dimensions that arise in all parts of the cycles. The interdependencies are discussed later in the document.
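The CRISP-DM steps described above can be sketched as a simple iterative loop. The sketch below is illustrative only; the phase names follow steps (a)–(f) in the text, and all function names are hypothetical, not part of any standard library.

```python
# A minimal, illustrative sketch of the CRISP-DM cycle; the phase names
# follow steps (a)-(f) in the text, and step (g) is the branch at the end.

PHASES = [
    "understand business problem",   # (a)
    "understand data required",      # (b)
    "collect data",                  # (c)
    "prepare data",                  # (d)
    "perform modelling",             # (e)
    "evaluate solution",             # (f)
]

def run_cycle(solution_is_acceptable, max_iterations=3):
    """Walk phases (a)-(f); step (g) either deploys or loops back."""
    trace = []
    for iteration in range(max_iterations):
        trace.extend(PHASES)
        if solution_is_acceptable(iteration):
            trace.append("deploy")            # (g) deploy
            return trace
        trace.append("adjust understanding")  # (g) iterate again
    return trace

# Example run: the first evaluation fails, the second passes.
trace = run_cycle(lambda iteration: iteration >= 1)
```

The loop makes explicit that evaluation (f) can send the team back to re-frame the problem rather than straight to deployment, which is where many of the human factors discussed below enter.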
The rise of Data Science has also coincided with the rise of Machine Learning and Artificial Intelligence (West & Allen, 2018), and typically it is expected that Data Scientists have an understanding of, and can use, concepts from these fields (Tang & Sae-Lim, 2016). Machine Learning is a field of study concerned with creating tools that learn analytical models from data (Alpaydin, 2020) and is a subset of Artificial Intelligence. Artificial Intelligence is a field of study concerned with creating machines which mimic the intelligence of humans, typically defined as creating an agent that can perceive its environment, and perform actions to maximize some utility or achieve some goal(s) (Russell & Norvig, 1995).
Many Data Science researchers/practitioners are also Artificial Intelligence and/or Machine Learning practitioners/researchers. As such, from here on I will refer to Data Science researchers/practitioners even when talking about Artificial Intelligence and/or Machine Learning. Many Data Science researchers or practitioners are comfortable with the above models of understanding data and the subsequent analysis. For this to be successful, society and organizations have an ever-growing need to understand what actually happens when a system or model is developed and deployed in the real world. Governance, in more ways than one, comes into play. Data collection needs consideration of humans and the human dynamic (Bender & Friedman, 2018; Gebru et al., 2018; Jo & Gebru, 2020). The choice of modelling requires consideration of people and their needs (Mitchell et al., 2019), and deployment further requires consideration of the human dimension in all its guises (Raji, Gebru, et al., 2020; Raji, Smart, et al., 2020). As such, Data Governance can be a useful tool for the Data Scientist to be aware of these human factors and the challenges that arise when humans and data [collection, modelling or products] interact (Buolamwini & Gebru, 2018; Hooker, 2021; Ledford, 2019; Mehrabi et al., 2021; Sujan et al., 2019).
7.2.3 Why Do We Need Data Governance?
From the perspective of governments, as part of economic development and growth, they want to embrace “value creation in the information age” (Nyamwena & Mondliwa, 2020). To do so, the collection, use and flow of data has to be governed in order to have oversight over this value creation. In short, Data Governance has to touch every part of the Data Science life cycle as discussed earlier. Data Governance also rises to prominence as a result of historical pushes for the digitization of countries, especially African countries. Governments are concerned that if they do not capitalize on the data opportunity, they will be left behind in another wave of economic development. The challenge arises when we look at the ways Data Governance has to be shaped for different countries. Without adequate Data Governance, the public and private sectors risk not realizing the full potential of the information economy. This is a big risk, as products that fall short of the values of a country's citizens may be deployed and ultimately cause harm. Examples of such shortfalls are inadequate privacy protections (Osakwe & Adeniran, 2021), limitations on what data can be used for, regulation of data-driven products that could be harmful (Metcalf & Crawford, 2016), guidelines on data sovereignty (Hummel et al., 2021), and how specific sets of data should be treated as public goods to be shared within or outside a country (Borgesius et al., 2015). Good Data Governance is not only about the data creation stage, but about how governance permeates the full Data Science cycle (Metcalf & Crawford, 2016). Furthermore, good Data Governance requires the contextual knowledge of and from decision makers (in both public and private sectors) to understand the Data Science cycle (data, modelling, algorithms, etc.) (Kearns & Roth, n.d.).
It is harder for gatekeepers to regulate industry if they themselves do not have a foundational understanding of what typically happens within the Data Science cycle. This is an important point to highlight because industries such as finance, for example, have well-defined regulators in most countries. These financial regulators regulate the industry to mitigate corruption and harm. Regulatory boards are made up of experts in the field who then work to set best practice, limitations and penalties for breaches of the regulations. The challenge with many of the data-driven products we see nowadays is that many of the decision makers in the process of deploying these tools have little experience with the field itself and see most of what is going on as a black box that takes in data and “magically” produces answers. This highlights the need for basic foundational regulation that asks the right questions when developing data-driven products, but also sets the path for a joint understanding of the field by all people (not just experts). In the following sections I look at important parts of the Data Science cycle and highlight the human factors and questions that should be asked by Data Scientists and also understood by decision makers.
7.3 Human Factors and the Data Science Cycle
In order to champion the joint understanding of Data Science and Data Governance, in this section I discuss the human factors in the Data Acquisition, Modelling and Presentation phases of the Data Science cycle.
7.3.1 Data Acquisition
One of the steps that is fraught with tension in the Data Science process is data acquisition. This can be a blind spot (Mitchell et al., 2018; Zhang et al., 2018) that can make or break many projects. Imagine using a dataset collected in the 1950s on financial lending by banks. A predictive tool built on such a dataset to assist in lending decisions would be full of the gender and racial biases present in many countries at the time (Bond & Tait, 1997; Rice, 1996). Put simply, the model would learn to discriminate. This is still a challenge today (Runshan et al., 2021). Even if the data is taken as representative of the population being studied, it may encode societal bias and discrimination. When talking and interacting with decision makers or clients, those without much experience tend to overlook the challenges in the acquisition of data. These challenges are connected with governance issues (Veale & Binns, 2017).
7.3.2 Processes and Procedures
In acquiring data, as part of the Data Science process, one connects the problem being approached with the data that will be needed to solve it. Sometimes there is data before the questions are clear, while at other times there is a question to be answered but the data has not been mapped out. In all instances, data has to move from where it rests and be staged for processing by the Data Science team. This requires identifying the relevant data source, identifying which subset of the information is important, and deciding how the transmission will occur. In doing these identification steps we have to look at the human factors.
7.3.3 Human Factors
For each step of the Data Science process, I focus on three human factors. For data acquisition these are: Where does the data come from? Why is/was it being collected? Who is the data about? There are many more factors, but for conciseness I will remain with three factors per step of the Data Science cycle. Where does the data come from? When identifying the source of data, it quickly becomes clear that one has to understand the structures of the organizations, internal or external, that control access to and use of the data. In an ideal case, there is a clear Data Governance structure that provides information on how a data scientist can request data, how the data should be handled, and any sensitive and salient information the scientist should be aware of (Abraham et al., 2019). There would be questions related to the sensitivity of the data. Was the data collected in an ethical manner? Is the data part of an open data repository? What licensing is the data under, and what are the expectations of use? If the data is from a governmental entity, what are the national expectations on Open Government data? For example, in a municipality, one may expect that aggregated water use data by municipal ward should be open and available (especially as many areas in some countries face water shortages), but there may be some resistance by some officials to making this data available.
It may be that there are not enough human resources to create and keep the data available, the data may normally be available for a fee that adds to revenue, or there may be issues of transparency, etc. Why is/was it being collected? This is an important factor as it establishes prior expectations on what the data that was collected, or is being collected, was to be used for. Imagine we have data about the transaction habits of bus riders in a city, where the original use of the data was to manage the transportation system. If the data will now be used to understand behaviour in order to deliver advertising to bus riders, this new use may not be covered by the original terms of reference. More importantly, bus riders may not agree with the change in the use of their data, and the organization has a responsibility to them to treat their information with care and thought.
Who is the data about? In carrying through the process of building up the data, one has to consider whether it is representative of the population it is serving. Again, focusing on when the data is about people, we need to understand who the data represents and whether this distribution is equitable and fair (Mitchell et al., 2018; Zhang et al., 2018). Further, does this distribution of people actually match those we expect the final data-driven product to make decisions about? If not, this may be a problem that introduces biased decision making. For example, in the recent decade, much has been highlighted about the bias in facial recognition systems (Raji, Gebru, et al., 2020). Some of this bias comes from the original data that was used to train them (Mitchell et al., 2018; Zhang et al., 2018). Some of it comes from the designs of the systems and how success is measured (I discuss this further in the modelling and presentation subsections). One can see, just from looking at the above, that there are important human factors that cannot be left to the Data Scientist or organization alone to make decisions about. There need to be foundational expectations on data handling, data storage, security, ethics and regulatory tests on what the data will be used for.
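The "who is the data about?" question can be made operational with a simple representativeness check before any modelling begins. The sketch below is a minimal illustration; the group names, counts, population shares and tolerance are all invented for the example.

```python
# Sketch: compare the group make-up of a dataset against a reference
# population to flag under-representation before modelling. The group
# names, counts, shares and tolerance below are all invented.

def representation_gaps(dataset_counts, population_shares, tolerance=0.5):
    """Flag groups whose dataset share falls below
    tolerance * their share of the reference population."""
    total = sum(dataset_counts.values())
    flagged = {}
    for group, pop_share in population_shares.items():
        data_share = dataset_counts.get(group, 0) / total
        if data_share < tolerance * pop_share:
            flagged[group] = round(data_share, 3)
    return flagged

# Toy face-image dataset versus an invented reference population.
dataset = {"group_a": 800, "group_b": 180, "group_c": 20}
population = {"group_a": 0.5, "group_b": 0.3, "group_c": 0.2}
flagged = representation_gaps(dataset, population)
```

A check like this does not settle whether a distribution is equitable, but it forces the acquisition conversation the text argues for: which groups are missing, and why.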
7.4 Data Analysis and Modelling
In the Data Analysis and Modelling step, the Data Scientist focuses their energy on using the correct approaches to extract meaningful information from the data. These choices will influence the final result as well as be the foundation on which many will choose to believe the results or not. Even though these may be established computational, statistical or mathematical approaches, we still need to understand how choices impact the end product and people.
7.4.1 Processes and Procedures
The Data Scientist takes the data that has been acquired in the prior step. They then work to clean it, transforming it into a form that can be used by downstream modelling tasks, and load it into their modelling systems. The Data Scientist will make choices on the metrics to be measured or optimized. Ultimately, these metrics are used to decide on success, and then to know whether new data should be sourced, whether the question should be re-framed, or whether one can move to the next step of the Data Science cycle.
7.4.2 Human Factors
For the data analysis and modelling stages I focus on these factors: How are the modelling choices made? Who has the skills to model? What models are being used for the use-case? How are the modelling choices made? For a period, there was a popular retort that people are biased and machines are unbiased. When it was highlighted that machines cannot be unbiased because the data they learn from may be biased, the needle moved to the claim that algorithms cannot be biased, only the data (Birhane & Cummins, 2019). But this still ignores the many ways modelling choices also impact the results of the final models (Jiang et al., 2020). In Machine Learning, we pride ourselves on working to build ever more generalizable, accurate and efficient algorithms, but this does not absolve us from thinking about our modelling choices (Birhane et al., 2021). Work by Hooker et al. (2020) highlighted the biases in compressed models. Further, more and more ML models use transfer learning (building on prior models or datasets), which carries biases forward. This is one of the reasons Data Scientists should work to document their modelling choices (Mitchell et al., 2019). A modelling choice may seem insignificant at the time of decision making, but may lead to big consequences later. A recent example (Birhane et al., 2021) is how models influenced the collection of a massive dataset (intended to fight against bias) that, when looked at under a microscope, turned out not to be as representative as the dataset authors claimed. This highlights the lack of participation and inclusive design choices, which also calls into question: who has the modelling skills?
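The documentation of modelling choices discussed above can be sketched in code, in the spirit of the "model cards" idea (Mitchell et al., 2019). The fields and all example values below (model name, metrics, caveats) are hypothetical illustrations, not the authors' official schema.

```python
# Sketch: documenting modelling choices in the spirit of "model cards"
# (Mitchell et al., 2019). The fields and example values here are
# illustrative, not an official schema.

def make_model_card(name, intended_use, training_data, metrics, caveats):
    """Gather key modelling decisions into one reviewable record."""
    return {
        "model": name,
        "intended_use": intended_use,
        "training_data": training_data,
        "evaluation_metrics": metrics,
        "caveats_and_limitations": caveats,
    }

card = make_model_card(
    name="loan-screening-v1",  # hypothetical model
    intended_use="rank applications for human review, not auto-decline",
    training_data="2015-2020 applications; known gaps in rural coverage",
    metrics={"overall_accuracy": 0.91, "recall_smallest_group": 0.74},
    caveats=["trained on historical decisions; may encode past bias"],
)
```

Even a record this small makes the modelling choices visible to decision makers and regulators who would otherwise only see the black box.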
Who has the skills to model? ML/AI/Data Science is a field that is typically skewed in terms of demographics and who ends up building the underlying technologies. One may argue that this does not apply on the African continent when it comes to racial makeup, but that is not a true reflection of the field. For a long period, in major technology companies on the continent, the senior technical roles skewed male and white (mirroring the challenges for which Silicon Valley has been criticized). Making this worse is the lack of Data Science skills on the continent. Without these skills, we have even less connection between decision makers and those who design models. How many decision makers have a data/computational background? Another factor is that the major tech companies that drive most of the internet economy tend to have only business offices on the continent (Birhane, 2020). Their aim is to sell their services (Birhane, 2020), extract data (Coleman, 2018) and handle regulatory issues (if there is regulation (Birhane, 2020; Coleman, 2018)). These offices do not build or shape the core technologies at these companies. As such, if we connect this question to the prior one, we see how modelling choices can become life-changing decisions for those affected by the downstream tasks. Imagine how automated hiring systems were deployed in organizations to assist in the hiring process by using AI to screen or monitor candidates. These systems have been shown to be discriminatory (Sánchez-Monedero et al., 2020), but what are the odds that the decision makers and internal Data Science teams had the skills to evaluate their facial recognition systems or text screening services for bias?
What models are being used for the use-case? Recent work in the ML/AI field has brought focus to explainable models in the fight against harm and the pursuit of better fairness. Such model choices arise in every use-case. Take, for example, the increase in surveillance systems and facial recognition systems internationally [ref]. How the models for such use-cases are chosen and evaluated shapes the ultimate impact these systems will have on society. Much work has highlighted how biased facial recognition systems (Raji, Smart, et al., 2020) can lead to discriminatory behaviour by law enforcement. This may end up being a life or death situation for someone at the end of these automated systems. A Data Scientist and decision maker need to ask themselves: what is the cost of an error of our model? The answer should then shape how the deployment is done. Further, depending on societal expectations, there may be regulatory restrictions on making one choice or another.
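The "cost of an error" question above can be made explicit rather than left implicit in a single accuracy number. The sketch below is illustrative only; the confusion-matrix counts and the costs are invented, and in a policing or medical use-case a false positive and a false negative rarely cost the same.

```python
# Sketch: making "what is the cost of an error?" explicit as a cost
# weighting over confusion-matrix counts. All numbers are invented.

def expected_cost(confusion, costs):
    """Average per-case cost: weight each outcome count by its cost."""
    total = sum(confusion.values())
    return sum(count * costs.get(outcome, 0)
               for outcome, count in confusion.items()) / total

# Toy evaluation of a screening model on 100 cases.
confusion = {"tp": 40, "tn": 45, "fp": 10, "fn": 5}

# The same error counts lead to very different conclusions
# depending on how the errors are costed:
cost_symmetric = expected_cost(confusion, {"fp": 1, "fn": 1})  # 0.15
cost_fn_heavy = expected_cost(confusion, {"fp": 1, "fn": 20})  # 1.1
```

Writing the costs down forces the Data Scientist and the decision maker to agree, before deployment, on which errors society can tolerate.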
7.5 Presentation and Deployment of Data-Driven Products
The final step in many Data Science projects is presenting results to decision makers and/or deploying the data-driven products.
7.5.1 Processes and Procedures
In this step, the Data Scientist works to present a report on the findings of the modelling in order to answer the original questions. From here, decisions may be made based on these reports. Reports may be visualizations, simulations or data-driven products with metrics that show their efficacy. Decisions will be made on what to show and who the data-driven products will be aimed at. These have human factors.
7.5.2 Human Factors
For the presentation and deployment of data-driven products, I focus on these factors: What decisions are being made with the models? What choices are being made in what is shown? How will the models be kept updated? What decisions are being made with the models? The ultimate test of the usefulness of a model for the decision maker is when it is deployed for use or presented for decision making. This is a spot in the Data Science life cycle that requires careful understanding of the prior parts of the cycle, or wrong decisions could be made. When looking at the data product or predictions of a model, the user must understand how the model works, how it was built and what limitations it has. The sub-question here could be: how do people interpret the results/predictions from the data product? This requires more than just displaying a result; it also means working with human-computer interaction practitioners to design in a way that is fair, transparent and mitigates bias or discrimination (Holstein et al., 2019; Lee & Singh, 2021).
What choices are being made in what is shown? As in the statistical domain, we can also lie with data-driven products. The COVID-19 pandemic had many examples where decision makers worked to distort data, distort model predictions and even censor data researchers and practitioners in order to fit a view that the decision maker held (A hostile environment, 2021; Vigjilenca, 2020; Zhang & Barr, 2021). This may be taken as an extreme public example, but distortion happens in many subtler ways; one mitigation is testing for harm at run-time. How will the models be kept updated? When deploying data-driven products, the internal models have to be kept updated. The world did not stop changing when the model was trained and deployed, so the models will start exhibiting drift. This drift may also come from how users respond to what the model itself does. Does the organization or Data Science team have procedures for the maintenance of the models in the data-driven product, and for testing for drift before the system has high error in its results (predictive, prescriptive, diagnostic, etc.)?
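The drift question above can be approached with even very simple monitoring. The sketch below compares a feature's live values against its training-time distribution; it is a minimal illustration with invented data and an invented threshold, and a real deployment would use a proper statistical test (e.g. Kolmogorov-Smirnov) or the population stability index.

```python
# Sketch: a very simple drift check comparing a feature's live values
# against its training-time distribution. The data and the threshold
# are invented for illustration.

def mean_shift_drift(train_values, live_values, threshold=0.25):
    """Flag drift when the live mean moves more than `threshold`
    training standard deviations away from the training mean."""
    n = len(train_values)
    train_mean = sum(train_values) / n
    variance = sum((x - train_mean) ** 2 for x in train_values) / n
    train_std = variance ** 0.5 or 1.0  # guard against zero spread
    live_mean = sum(live_values) / len(live_values)
    return abs(live_mean - train_mean) / train_std > threshold

train = [10, 11, 9, 10, 12, 10, 9, 11]  # feature at training time
stable = [10, 11, 10, 10]               # similar live traffic
shifted = [15, 16, 14, 17]              # the world has moved on

drift_on_stable = mean_shift_drift(train, stable)
drift_on_shifted = mean_shift_drift(train, shifted)
```

Having even a crude check like this in a maintenance procedure answers the governance question in the text: drift is detected before the system's errors reach the people who depend on it.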
In this section I have discussed how Data Science and Data Governance intersect. In the latter part of the section, I chose three stages of the Data Science cycle to analyse for human factors. Through identifying these human factors, we can better understand how Data Governance is an integral part of the full cycle, as decisions made by the scientist will impact users and humans in general. In the next section I discuss Data Governance on the African continent.
7.6 Data Governance and the African Continent
With calls for African countries to jump onto the current advances of data-driven economies, there have been some movements towards strategies and governance policies by governments that cover data. The African Union released “The Digital Transformation Strategy for Africa 2020–2030” (African Union, 2020). This strategy should be understood in the context of the wider and more localized Data Governance and digitization challenges in different African countries. When it comes to privacy, the European General Data Protection Regulation (GDPR) (European Commission, n.d.) has had a wide-ranging effect on the internet economy, as many companies that processed European citizens' data had to abide by the rules set out by the EU. Around the African continent, as shown by Davis (2021), there are efforts to strengthen data protection policies, even though only about 52% of African countries have such legislation.
The African Union Convention on Cyber Security and Personal Data Protection (known as the Malabo Convention) (African Union, 2014) was adopted by AU member states in 2014. It sets out to provide protections for cyber infrastructure, protection of personal information, cyber security and the necessary foundations to enable an information economy across the African continent. Even though adopted in 2014, only eight countries had ratified the convention by 18 June 2020 (Footnote 1). The convention touches on many aspects that can form a unified foundation for African countries to benefit from the information economy. Without ratification, organizations and practitioners do not have a unified view on how to deploy data tools, and for some countries the reality is much worse, with very lax or non-existent protections (Davis, 2021).
In South Africa, the Protection of Personal Information Act (POPIA) (Government of South Africa, n.d.), which took many years to be enacted, has begun a public discussion on data acquisition, the protection of personal information and the use of data for downstream tasks (especially when these are not the original purpose of collection). Even so, Data Governance is not only about the protection of personal information; there are many more human and organizational factors that data interacts with. I hope the preceding sections have made it clear that Data Governance should cover more than just the data being used. As discussed earlier, there are many human factors that should be taken into consideration at all stages of the Data Science cycle. To effectively govern the full process, countries have to have a clear understanding of the stages, as well as the responsibilities of governments towards Data Scientists and the responsibilities of Data Scientists towards the public.
The African continent has made big strides in the ICT sector, building local skills and championing local companies. Even so, the Big Tech giants (Microsoft, IBM, Google, Facebook, etc.) still dominate on the continent, whether physically or through services that cross borders. Even though there is no agreed definition of the data skills gap, the work by Sey and Mudongo (2021) highlights the lack of understanding of the need for AI skills, and argues that efforts to build these skills on the continent must connect the public and private sectors. These insights are important, as they place in context how few of the Big Tech firms conduct any research and development on the continent. AI governance skills are recommended as part of the development of AI skills on the continent (Sey & Mudongo, 2021), echoing the message of this chapter on the broader Data Science and Data Governance nexus.
The continent risks being just a source of data (Birhane, 2020) used to build services that are then consumed by citizens without any local development of those services. This has recently been brought to bear by how Facebook has only 13% of its abuse team (which fights abuse on its online platforms) working on non-US content, even though 90% of Facebook users are outside the US (Purnell et al., 2021). This matters because misinformation on Facebook outside the US affects many countries, yet is not effectively countered by Facebook itself. Furthermore, governments have to be able to govern the digital space and ensure that citizens benefit from digital public goods (Gillwald & van der Spuy, 2019). Another challenge is the use of some data-driven products for surveillance by both governments and the private sector on the continent (Mudongo, 2021). As already highlighted, such systems are less likely to be developed locally and may encode biases and lead to discrimination. This illustrates another governance gap (whether planned or unplanned), as decision makers have to be able to evaluate the risks and harms such systems may pose to the population (Mudongo, 2021).
7.7 Case Study: Learning from Our Recent Past, Enter ICT4D
Data Science and Artificial Intelligence are hailed as a silver bullet for many problems, with data itself referred to as the new oil to be exploited by nations and organizations (Hirsch, 2013). But a challenge that organizations and nations should be able to spot rears its head again. With the rise of ICT and digitization efforts, many problems were identified for which ICT could be the solution (Curtis, 2019). Combined with development practice, ICT4D has been a force for the last two or more decades (Walsham, 2017).
I argue that enough time has now passed to see that the shortcomings of treating many problems as requiring ICT as the solution, especially when practitioners would come from outside, drop in, deploy and then leave, are very much akin to what is happening in the Data Science world currently, and this needs to change (Shilton et al., 2021). There are differences, chief among them that decision makers are familiar with what ICT is but less familiar with what Data Science, Artificial Intelligence or Machine Learning are (Osoba & Welser, 2017). Basically, Data Science researchers and practitioners are seen as magicians: you throw a problem and data at them, and a solution arrives on the other side. We see this with the touting of Fourth Industrial Revolution (4IR) strategies for African nations, driven by public institutions that do not have the skills or knowledge to really engage with the subject they are touting as the solution to many of the problems they face (McBride et al., 2018; Moorosi et al., 2017). In ICT4D, a historical debate concerned the efficacy of having researchers and practitioners who were not locals come in with ICT "solutions" to many development issues (Andrade & Urquhart, 2012; Toyama, 2015). Over time this has become an area of study within the field itself. It became very apparent that the development and design of systems should be participatory (Andrade & Urquhart, 2012; Tongia & Subrahmanian, 2006; Toyama, 2015) and take into account more than just the technical challenge. Learning this lesson took time and many failures. In contrast, within the Data Science and Artificial Intelligence fields, a lot of work has been put into understanding fairness, ethics and the longer-term effects of technical interventions. This is a welcome change from the ICT4D experience, but we still lag in understanding the need for participatory design, as well as for governance that guides the field (Singh & Flyverbom, 2016).
We also have large international bodies, like the International Telecommunication Union, to which many states belong and which have shaped ICT policies across regions.
In Artificial Intelligence, one can say the debate on fairness and harm has been very open, given the threat of wide-scale impact on people. But debates alone do not solve the problems. In most of these debates and discussions, it is mostly researchers, not decision and policy makers, who are doing the work of documenting harm and making recommendations to mitigate it (Whittaker et al., 2018). Policy makers need to come to the table to shape the debate by providing input from government. We need to draw on lessons from other fields while understanding the unique speed at which data-driven products have been taken up, before we have even had time to think about their impact.
7.8 Conclusion
In this chapter, I used a survey of the literature around Data Science and Data Governance to bring to the fore the connections within this nexus. Leaving design decisions to the Data Scientist alone ignores the many human factors that data-driven products touch. As such, Data Governance is key to creating and deploying products that add value to the developing economies on the continent while mitigating harm. This requires that African countries have an appreciation of the needs of governance and the skills to enable effective policy. The case study presented on ICT4D allows us to learn from a related discipline that has been active for two decades and has faced similar challenges in deploying interventions in the Global South.
Recommendations:
-
There is a need for African governments to work together to practically implement Data Governance policy. The glaring reality that only eight countries (as of this writing) have ratified the African Union Convention on Cyber Security and Personal Data Protection leaves much to be desired.
-
Both public and private industries must engage with data scientists to better understand the areas of concern highlighted in this chapter beyond data privacy. Most policy on the continent focuses on privacy protections and some automated decision making, but many other decisions made in the process of developing data tools impact the final outcome.
-
For the data scientist, it must be accepted that policy and the development of data tools go hand in hand. Even if national, regional or continental policies have not caught up, there is a growing movement within our practice to develop best practices and to highlight challenges in ethics, fairness and mitigating abuse.
Notes
- 1.
https://au.int/en/treaties/african-union-convention-cyber-security-and-personal-data-protection
References
A hostile environment. (2021). Brazilian scientists face rising attacks from Bolsonaro’s regime. ScienceMag.
Abraham, R., Schneider, J., & Vom Brocke, J. (2019). Data governance: A conceptual framework, structured review, and research agenda. International Journal of Information Management, 49(2019), 424–438.
African Union. (2014). African Union convention on cyber security and personal data protection. African Union: Addis Ababa, Ethiopia.
African Union. (2020). The digital transformation strategy for Africa (2020–2030). Addis Ababa.
Alamo, T., Reina, D. G., Mammarella, M., & Abella, A. (2020). Covid-19: Open-data resources for monitoring, modeling, and forecasting the epidemic. Electronics, 9(5), 827.
Alpaydin, E. (2020). Introduction to machine learning. MIT Press.
Andrade, A. D., & Urquhart, C. (2012). Unveiling the modernity bias: A critical examination of the politics of ICT4D. Information Technology for Development, 18(4), 281–292.
Baxter, G., & Sommerville, I. (2011). Socio-technical systems: From design methods to systems engineering. Interacting with Computers, 23(1), 4–17.
Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6(2018), 587–604.
Best, M. L. (2014). The internet that Facebook built. Communications of the ACM, 57(12), 21–23.
Birhane, A. (2020). Algorithmic colonization of Africa. SCRIPTed, 17, 389.
Birhane, A., & Cummins, F. (2019). Algorithmic injustices: Towards a relational ethics. arXiv preprint arXiv:1912.07376.
Birhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R., & Bao, M. (2021). The values encoded in machine learning research. arXiv preprint arXiv:2106.15590.
Birhane, A., Uday Prabhu, V., & Kahembwe, E. (2021). Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963.
Bond, P., & Tait, A. (1997). The failure of housing policy in post-apartheid South Africa. In Urban forum (Vol. 8, pp. 19–41). Springer.
Borgesius, F. Z., Gray, J., & van Eechoud, M. (2015). Open data, privacy, and fair information principles: Towards a balancing framework. Berkeley Technology Law Journal, 30(3), 2073–2131.
Bradshaw, D., Dorrington, R. E., Laubscher, R., Moultrie, T. A., & Groenewald, P. (2021). Tracking mortality in near to real time provides essential information about the impact of the COVID-19 pandemic in South Africa in 2020. South African Medical Journal, 111(8), 732–740.
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency (pp. 77–91). PMLR.
Coleman, D. (2018). Digital colonialism: The 21st century scramble for Africa through the extraction and control of user data and the limitations of data protection laws. Michigan Journal of Race and Law, 24, 417.
Curtis, S. (2019). Digital transformation—the silver bullet to public service improvement? Public Money & Management, 39(5), 322–324.
Davis, T. (2021). Data protection in Africa: A look at OGP member progress (August 2021). Technical Report. Alt Advisory.
Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.
European Commission. (n.d.). 2018 reform of EU data protection rules. European Commission. https://ec.europa.eu/commission/sites/betapolitical/files/data-protection-factsheet-changes_en.pdf
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé, H. III, & Crawford, K. (2018). Datasheets for datasets. arXiv preprint arXiv:1803.09010.
Gillwald, A., & van der Spuy, A. (2019). The governance of global digital public goods: Not just a crisis for Africa. GigaNet.
Government of South Africa. (n.d.). Protection of personal information Act 4 of 2013. Government of South Africa. https://www.gov.za/documents/protection-personal-information-act
Hirsch, D. D. (2013). The glass house effect: Big Data, the new oil, and the power of analogy. Maine Law Review, 66, 373.
Holstein, K., Vaughan, J. W., Daumé, H. III, Dudik, M., & Wallach, H. (2019). Improving fairness in machine learning systems: What do industry practitioners need?. In Proceedings of the 2019 CHI conference on human factors in computing systems (pp. 1–16).
Hooker, S. (2021). Moving beyond “algorithmic bias is a data problem”. Patterns, 2(4), 100241.
Hooker, S., Moorosi, N., Clark, G., Bengio, S., & Denton, E. (2020). Characterising bias in compressed models. arXiv preprint arXiv:2010.03058.
Hummel, P., Braun, M., Tretter, M., & Dabrock, P. (2021). Data sovereignty: A review. Big Data & Society, 8(1), 2053951720982012.
Jensen, K. (2012). CRISP-DM process diagram. https://commons.wikimedia.org/wiki/File:CRISP-DM_Process_Diagram.png
Jiang, Z., Zhang, C., Talwar, K., & Mozer, M. C. (2020). Characterizing structural regularities of labeled data in overparameterized models. arXiv preprint arXiv:2002.03206.
Jo, E. S., & Gebru, T. (2020). Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 conference on fairness, accountability, and transparency (pp. 306–316).
Kearns, M., & Roth, A. (n.d.). Ethical algorithm design should guide technology regulation. The Brookings Institution. https://www.brookings.edu/research/ethical-algorithm-design-should-guide-technology-regulation/
Ledford, H. (2019). Millions of black people affected by racial bias in health-care algorithms. Nature, 574(7780), 608–610.
Lee, M. S. A., & Singh, J. (2021). Risk identification questionnaire for detecting unintended bias in the machine learning development lifecycle. In Proceedings of the 2021 AAAI/ACM conference on AI, ethics, and society (pp. 704–714).
Mandl, K. D., & Kohane, I. S. (2016). Time for a patient-driven health information economy? New England Journal of Medicine, 374(3), 205–208.
Mbow, M., Lell, B., Jochems, S. P., Cisse, B., Mboup, S., Dewals, B. G., Jaye, A., Dieye, A., & Yazdanbakhsh, M. (2020). COVID-19 in Africa: Dampening the storm? Science, 369(6504), 624–626.
McBride, V., Venugopal, R., Hoosain, M., Chingozha, T., & Govender, K. (2018). The potential of astronomy for socioeconomic development in Africa. Nature Astronomy, 2(7), 511–514.
McCown, F., & Nelson, M. L. (2009). What happens when Facebook is gone?. In Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries (pp. 251–254).
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6), 1–35.
Metcalf, J., & Crawford, K. (2016). Where are human subjects in big data research? The emerging ethics divide. Big Data & Society, 3(1), 2053951716650211.
Mitchell, S., Potash, E., Barocas, S., D’Amour, A., & Lum, K. (2018). Prediction-based decisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv preprint arXiv:1811.07867.
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency (pp. 220–229).
Moorosi, N., Thinyane, M., & Marivate, V. (2017). A critical and systemic consideration of data for sustainable development in Africa. In International conference on social implications of computers in developing countries (pp. 232–241). Springer.
Mudongo, O. (2021). Africa’s expansion of AI surveillance-regional gaps and key trends.
Nyamwena, J., & Mondliwa, P. (2020). Policy brief 3: Data governance matters: Lessons for South Africa. https://www.competition.org.za/ccred-blog-digital-industrial-policy/2020/7/28/data-governance-matters-lessons-for-south-africa
Osakwe, S., & Adeniran, A. P. (2021). Strengthening data governance in Africa.
Osoba, O. A., & Welser, W., IV. (2017). An intelligence in our image: The risks of bias and errors in artificial intelligence. Rand Corporation.
Peng, R. D., & Matsui, E. (2015). The art of data science. A guide for anyone who works with data. Skybrude Consulting, LLC.
Ponelis, S. R., & Holmner, M. A. (2015). ICT in Africa: Building a better life for all.
Purnell, N., Scheck, J., & Horwitz, J. (2021). Facebook employees flag drug cartels and human traffickers. The company's response is weak, documents show. The Wall Street Journal. https://www.wsj.com/articles/facebook-drug-cartels-human-traffickers-response-is-weak-documents-11631812953
Raji, I. D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., & Denton, E. (2020). Saving face: Investigating the ethical concerns of facial recognition auditing. In Proceedings of the AAAI/ACM conference on AI, ethics, and society (pp. 145–151).
Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., Hutchinson, B., Smith-Loud, J., Theron, D., & Barnes, P. (2020). Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 conference on fairness, accountability, and transparency (pp. 33–44).
Ray, E. L., Wattanachit, N., Niemi, J., Kanji, A. H., House, K., Cramer, E. Y., Bracher, J., Zheng, A., Yamana, T. K., & Xiong, X. et al. (2020). Ensemble forecasts of coronavirus disease 2019 (COVID-19) in the US. MedRXiv.
Rice, W. E. (1996). Race, gender, redlining, and the discriminatory access to loans, credit, and insurance: An historical and empirical analysis of consumers who sued lenders and insurers in federal and state courts, 1950–1995. San Diego Law Review, 33, 583.
Roberts, M., Driggs, D., Thorpe, M., Gilbey, J., Yeung, M., Ursprung, S., Aviles-Rivero, A. I., Etmann, C., McCague, C., Beer, L., et al. (2021). Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, 3(3), 199–217.
Fu, R., Huang, Y., & Singh, P. V. (2021). Crowds, lending, machine, and bias. Information Systems Research, 32(1), 72–92.
Russell, S. J., & Norvig, P. (1995). Artificial intelligence: A modern approach.
Sánchez-Monedero, J., Dencik, L., & Edwards, L. (2020). What does it mean to ‘solve’ the problem of discrimination in hiring? Social, technical and legal perspectives from the UK on automated hiring systems. In Proceedings of the 2020 conference on fairness, accountability, and transparency (pp. 458–468).
Sey, A., & Mudongo, O. (2021). Case studies on AI skills capacity building and AI in workforce development in Africa.
Shilton, K., Finn, M., & DuPont, Q. (2021). Shaping ethical computing cultures. Communications of the ACM, 64(11), 26–29.
Shuja, J., Alanazi, E., Alasmary, W., & Alashaikh, A. (2021). COVID-19 open source data sets: A comprehensive survey. Applied Intelligence, 51(3), 1296–1325.
Singh, J. P., & Flyverbom, M. (2016). Representing participation in ICT4D projects. Telecommunications Policy, 40(7), 692–703.
Skorup, B., & Thierer, A. (2013). Uncreative destruction: The misguided war on vertical integration in the information economy. Federal Communications Law Journal, 65(2), 157.
Sujan, M., Furniss, D., Grundy, K., Grundy, H., Nelson, D., Elliott, M., White, S., Habli, I., & Reynolds, N. (2019). Human factors challenges for the safe use of artificial intelligence in patient care. BMJ Health & Care Informatics, 26, 1.
Tang, R., & Sae-Lim, W. (2016). Data science programs in US higher education: An exploratory content analysis of program description, curriculum structure, and course focus. Education for Information, 32(3), 269–290.
Tongia, R., & Subrahmanian, E. (2006). Information and Communications Technology for Development (ICT4D) – A design challenge?. In 2006 International conference on information and communication technologies and development. IEEE (pp. 243–255).
Toyama, K. (2015). Geek heresy: Rescuing social change from the cult of technology. Public Affairs.
Veale, M., & Binns, R. (2017). Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society, 4(2), 2053951717743530.
Abazi, V. (2020). Truth distancing? Whistleblowing as remedy to censorship during COVID-19. European Journal of Risk Regulation, 11(2), 375–381.
Walsham, G. (2017). ICT4D research: Reflections on history and future agenda. Information Technology for Development, 23(1), 18–41.
West, D., & Allen, J. (2018). How artificial intelligence is transforming the world. Technical Report. Brookings Institute.
Whittaker, M., Crawford, K., Dobbe, R., Fried, G., Kaziunas, E., Mathur, V., West, S. M., Richardson, R., Schultz, J., & Schwartz, O. (2018). AI now report 2018. AI Now Institute at New York University New York.
Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining (Vol. 1). Springer.
Wynants, L., Van Calster, B., Collins, G. S., Riley, R. D., Heinze, G., Schuit, E., Bonten, M. M. J., Dahly, D. L., Damen, J. A., Debray, T. P. A., et al. (2020). Prediction models for diagnosis and prognosis of COVID-19: Systematic review and critical appraisal. BMJ, 369, m1328.
Zhang, Y.-C. (2017). The information economy. In Non-equilibrium social science and policy (pp. 149–158). Springer.
Zhang, J., & Barr, M. (2021). Harmoniously denied: COVID-19 and the latent effects of censorship. Surveillance & Society, 19(3), 389–402.
Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM conference on AI, ethics, and society (pp. 335–340).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Cite this chapter
Marivate, V. (2023). More Than Just a Policy: Day-to-Day Effects of Data Governance on the Data Scientist. In: Ndemo, B., Ndung’u, N., Odhiambo, S., Shimeles, A. (eds) Data Governance and Policy in Africa. Information Technology and Global Governance. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-24498-8_7
Print ISBN: 978-3-031-24497-1
Online ISBN: 978-3-031-24498-8
eBook Packages: Political Science and International Studies (R0)