Digital pathology is the investigation of human tissue samples at the cellular level with the aid of scanners that make digitised images of those samples. The scanners take the place of traditional light microscopes, and the digitised images permit more sophisticated inspection than light microscopes would have afforded in the past. At the moment, light microscopy and digital scanners co-exist as technologies used by pathologists for work on, e.g. biopsied material from particular patients. Computational pathology, by contrast, identifies patterns in large data sets—sets of digitised images derived from the tissue samples of large numbers of patients. Not all of these images will be taken from living people, and many different collections of tissue samples may be the sources of images. Computational pathology might identify patterns distinctive of a particular cancer type for diagnostic or prognostic purposes. One product of computational pathology may be an algorithm that can be used to generate a particular patient’s diagnosis, based on that patient’s whole slide images (WSIs) and the annotations of a digital pathologist (e.g. see Lancellotti et al., 2021). But computational pathology makes sense in its own right as a means of pattern discovery for a cancer type in general and not simply as the interpretation of one patient’s biopsy. Unlike other uses of digitised images in pathology, computational pathology is a big data exercise. It can also be a multi-disciplinary exercise and an exercise that often involves public-private partnerships.

In what follows, we aim to bring both digital pathology in general and computational pathology in particular within the scope of Helen Nissenbaum’s theory of privacy as contextual integrity (Nissenbaum, 2010). Although Nissenbaum presents it as a theory addressing complaints in ‘the name of privacy’ (Nissenbaum, 2010, pp. 1–4, 72), i.e. as a theory addressing complaints about possible misuses of personal data, we suggest that it is something more general: a theory of appropriate information transfer—whether or not the information is personal. [Footnote 1] In short, it is potentially a good way of thinking about morally acceptable information transfer, including the transfer of information in the form of pixels, as in WSIs. Acceptable information transfer does not always involve transfers of data that affect individual personal choices, reputation and so on, so a theory of acceptable information transfer is not necessarily a theory of privacy. But it is a theory that relates information transfers to their purposes, including local institutional purposes, and these are highly relevant to both the non-computational use of digitised images in pathology and computational pathology.

An earlier paper by one of the authors (Sorell et al., 2022) suggested that computational pathology raises a complex set of issues at the intersection of research ethics, medical ethics, general data ethics and business ethics. That paper claimed that these issues could largely be resolved, supporting the conclusion that computational pathology is morally acceptable. The previous paper tackled the issues piecemeal, however, and did not operate with a single general framework that can also apply to non-computational digital pathology. This paper adapts Nissenbaum’s framework to remove those limitations.

The rest of the discussion is organised as follows. In Section 1, the main lines of the theory of contextual integrity are introduced, and reasons are given why it is not properly speaking a theory of privacy but a theory of morally permissible information transfer. Then, the theory is applied to uses of digitised images for (a) patient-by-patient pathological analysis (Section 2) and (b) computational pathology (Sections 3 and 4). Although big data exercises involving personal data, e.g. in computational pathology, are seen by Nissenbaum and colleagues as particular threats to existing data-sharing norms and other social norms (e.g. see Barocas & Nissenbaum, 2014; Nissenbaum, 2010), we claim that patient-by-patient digital pathology is riskier, at least in the forms it has taken during the pandemic. At the end, we consider some risks in computational pathology that are due to the interaction between health institutions, particularly in the public sector, and commercial algorithm developers.

1 Contextual Integrity

The theory of contextual integrity starts from the fact that flows of personal information can proceed outside the control of the person the information is about—the data subject. Information flows that are out of the data subject’s control are not necessarily objectionable, according to Nissenbaum (2010), and, in particular, they are not necessarily in violation of relevant norms. Whether a flow of information is objectionable depends on its compliance with reasonable norms of transmission that are generally invoked in the relevant social context. The relevance of a norm is determined by the kind of information in question, the characteristics of the people and institutions who transmit and receive it and the purposes served by the transmission.

Existing conventions in human rights–respecting jurisdictions acknowledge that some kinds of information, if they were in general circulation, could embarrass or provoke harmful or discriminatory action against the data subject from some receivers of the information. Information about someone’s sexual or reproductive history could fall into this category, as could information about one’s religion or race. But not all personal information is sensitive in this way. To illustrate, information about one’s shoe size or one’s favourite flavour of ice cream might occasion no interest whatsoever and no malice whatsoever from anyone who receives it. So even if it were in circulation without one’s consent or knowledge, that might not rise to the threshold for legitimate complaint. Yet according to at least one influential theory, privacy is precisely a matter of being in control of transmission, and circulation of this information without the data subject’s consent or instigation is a privacy violation (Westin, 1967). [Footnote 2]

The theory of contextual integrity focuses on the fact that new technology can disrupt information flows by adding new actors and purposes into the channel of information distribution. For example, wearable health-monitoring clothing and accessories, such as Fitbits, can transmit large amounts of data on heart rates during exercise and sleep patterns, correlated with times of day and calendar dates (Nissenbaum & Patterson, 2016). In the past, holders of this data might have been doctors only. In our own day, with the development of the Internet of Things (IoT), the receivers of this data could include the manufacturers of the clothing and accessories that carry the data, friends, family, social media followers and medical practitioners. Others—athletic competitors and insurance companies—might want to receive the same data, possibly for purposes that go against the data subject’s interests.

In these cases, one might have objectionable information flows not only relative to the data subject’s interest in restricting access to identifying information, but also relative to the data subject’s possible interests in defeating competitors, and the data subject’s interests in not being targeted for marketing purposes by the health-monitoring manufacturer. Again, there may be societal interests in controlling the aggregation of behavioural data for commercial purposes. The theory of contextual integrity takes into account the personal and institutional purposes being pursued in the background of information flows, power differentials between the various parties to information flows, as well as the values underlying their wider liberal democratic settings. The idea seems to be that the acceptability of these flows depends on the values and purposes relevant to information flows and the weights they should be given, taking into account also the vulnerabilities of personal data subjects. [Footnote 3]

Michael Zimmer (2018), who subscribes to the theory of contextual integrity, breaks down the steps that need to be taken to assess digital innovation:

  1. Describe the new practice in terms of its information flows.

  2. Identify the prevailing context in which the practice takes place at a familiar level of generality, which should be suitably broad such that the impacts of any nested contexts might also be considered.

  3. Identify the information subjects, senders and recipients.

  4. Identify the transmission principles: the conditions under which information ought (or ought not) to be shared between parties. These might be social or regulatory constraints, such as the expectation of reciprocity when friends share news, or the obligation for someone with a duty to report illegal activity (Zimmer, 2018). [Footnote 4]

To illustrate these steps by reference to data transmission from health-monitoring wearables:

  1. Describe what data is collected by health-monitoring wearables and the ostensible or advertised purposes of these wearables.

  2. The context might be a fitness-training, medical or commercial setting, where physiological data generated from a home or gym exercise regimen are possibly received by many social actors at once.

  3. The information subjects might be actively exercising adults who have purchased the wearables. The senders in the first instance might be the same actively exercising adults. Receivers might be the senders’ social media friends and the wearable manufacturer. Re-senders might be social media friends and the manufacturer. The recipients from these re-senders are unclear and will vary.

  4. Transmission principles/norms: the data subject chooses individuals with whom to share information. The choices are registered on an information-sharing app with options to limit audiences for the data. Transmission to the manufacturer may be required by the manufacturer-issued Terms and Conditions for the use of the wearable.
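To make the moving parts of such an assessment explicit, the four elements can be encoded as a simple data structure, as in the minimal sketch below. The encoding, and every name in it, is ours and offered for illustration only; it is not drawn from Zimmer or Nissenbaum.

```python
# A minimal sketch (our own encoding, not Zimmer's or Nissenbaum's) of the
# elements a contextual-integrity assessment keeps track of, populated with
# the wearables example above. All field and variable names are illustrative.
from dataclasses import dataclass, field

@dataclass
class InformationFlow:
    practice: str                      # step 1: the new practice and its flows
    context: str                       # step 2: the prevailing social context
    subjects: list[str]                # step 3: whom the information is about
    senders: list[str]                 #         who transmits it
    recipients: list[str]              #         who receives it
    transmission_principles: list[str] = field(default_factory=list)  # step 4

    def prima_facie_violations(self, entrenched_norms: set[str]) -> list[str]:
        """Flag transmission principles absent from the context's entrenched
        norms. This yields only a prima facie judgement; Zimmer's two
        evaluation steps must follow."""
        return [p for p in self.transmission_principles
                if p not in entrenched_norms]

wearables = InformationFlow(
    practice="health-monitoring wearables streaming exercise data",
    context="fitness-training / medical / commercial setting",
    subjects=["actively exercising adults who bought the wearable"],
    senders=["the same actively exercising adults"],
    recipients=["social media friends", "wearable manufacturer"],
    transmission_principles=[
        "subject chooses audiences via a sharing app",
        "transmission to manufacturer required by Terms and Conditions",
    ],
)
# Suppose only audience choice is an entrenched norm of the context:
print(wearables.prima_facie_violations(
    {"subject chooses audiences via a sharing app"}))
```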

As Zimmer (2018) indicates, the four steps can yield ‘a prima facie judgment to be rendered as to whether the new process significantly violates the entrenched norms of the context’. The violation of entrenched norms is likely to make information transfer morally objectionable. But this is a prima facie judgement only. To arrive at an all-things-considered judgement, two further steps must be taken:

  • Evaluation I: Consider the moral and political factors affected by the new practice. How might there be harms or threats to personal freedom or autonomy? Are there impacts on power structures, fairness, justice or democracy? In some cases, the results might overwhelmingly favour accepting or rejecting the new practice, while in more controversial or difficult cases, further evaluation might be necessary.

  • Evaluation II: How does the new practice directly impinge on values, goals and ends of the particular context? If there are harms or threats to freedom or autonomy, or fairness, justice or democracy, what do these threats mean in relation to this context (Zimmer, 2018)?

It is hard to know how overall assessments of information flows with as many dimensions as these can be provided by anything less than a liberal democratic political philosophy, including a taxonomy and weighting of informational harms. It is true that the question of the effects of an information flow on personal autonomy (Evaluation I) is relevant to privacy as standardly understood. But the follow-up—‘Are there impacts on power structures, fairness, justice or democracy?’—clearly goes beyond privacy and indeed the personal sphere in general. By the same token, it is hard to see the theory of contextual integrity as a theory specifically of privacy. A theory of contextual integrity could presumably extend to the transmission of military information or, differently, to commercially valuable information.

2 Digital Scanners and Information Flows

Digital scanners contribute to at least two kinds of information flows in the practice of modern pathology. On the one hand, they are sometimes used to replace light microscopes. Instead of operating with tissue mounted on slides, pathologists use digital scanners to create digital images—whole slide images (WSIs)—of tissue. These are open to very high-resolution inspection, with possibilities of panning and zooming. They can be annotated for diagnostic or prognostic purposes. WSIs can easily be reproduced and shared on bespoke cloud-based or other computer platforms, whether between pathologists seeking second opinions or for teaching purposes. The results of interpreting WSIs in this way will typically be diagnoses or prognoses for particular patients, which will be communicated to those patients, often by way of a general practitioner. This kind of information flow, whose purpose is the analysis of the tissue of an identifiable patient, is associated with digital pathology.

A second kind of information flow involves the aggregation of pseudonymised WSIs into large data sets for the development of algorithms, e.g. diagnostic and prognostic algorithms. This aggregates the results of digital pathology applied to WSIs collected patient-by-patient and aims at diagnostics and prognostics associated with, e.g. a cancer type, such as breast, lung or prostate. In this case, patterns picked up in expert human annotations of WSIs are identified and used for machine identifications of, e.g. malignancies associated with different kinds of cancers. Alternatively, algorithms are developed through machine learning applied directly to the aggregated WSIs. In both cases, the information flow involves, in addition to all the elements of digital pathology, computer scientists (from universities or private companies), hospitals or health trusts, cloud platform providers, data regulators and others. We can call digital pathology leading to algorithm development and testing ‘computational pathology’.
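For concreteness, the sketch below shows the shape of the supervised route just described: a small classifier trained on labelled patches cut from annotated WSIs. It is a deliberately minimal illustration under our own assumptions, not a description of any particular project’s pipeline; production systems add tiling of multi-gigapixel images, stain normalisation and, often, weakly supervised learning.

```python
# Minimal, illustrative patch-level training step for a WSI-derived data set.
# Real computational pathology pipelines are far more involved; all names
# and shapes here are our own assumptions.
import torch
import torch.nn as nn

class PatchClassifier(nn.Module):
    """Tiny CNN mapping an RGB tissue patch to benign/malignant logits."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))

model = PatchClassifier()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch: eight random 128x128 patches, with labels that in a real
# pipeline would come from pathologists' annotations of the WSIs.
patches = torch.rand(8, 3, 128, 128)
labels = torch.randint(0, 2, (8,))

optimiser.zero_grad()
loss = loss_fn(model(patches), labels)
loss.backward()
optimiser.step()
```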

Computational pathology is a big data exercise conducted after the digitisation of slides, including slides digitised for the analysis of tissue in patient-by-patient digital pathology. Information flows generated by a digital pathology exercise, then, can overlap with those of computational pathology in at least two ways. First, a digital pathologist can annotate a WSI for a clinical purpose, and that annotated WSI can be included in a data set that is used to train or test a diagnostic algorithm. Second, a pre-existing algorithm might be used as an independent check on a diagnosis generated by digital pathologists.

In what follows, we consider a range of information flows that reflect recent uses of digital scanners in pathology. Typically, digital pathology is supposed to speed up workflow in relation to cellular investigations for common disease types. It can also bring to bear scarce specialist knowledge about rare disease types through opportunities for rapid sharing and transmission of images that are not available in the case of glass slides.

According to Bracey (2017), biopsies are taken by an increasing range of health professionals but are passed to a shrinking number of pathologists. The pathologists are located in hospitals which may or may not be specialist centres for an increasing number of pathological subspecialties. Digital pathology reduces the time needed to get opinions from specialists by radically reducing the time taken for transporting slides and waiting for them to be processed. Digitised content can be shared instantaneously. In addition, specialists can confer on online platforms, combining the established advantages of telepathology with those of making WSIs with scanners. What is more, risks of losing or damaging slides are entirely avoided.

Figure 1 summarises the effects of scanners on workflows. Reviews of biopsied material that would otherwise take weeks or months if the material were mounted on a slide can be shortened to minutes or days, since WSIs can be shared instantaneously. Figure 1 also indicates that patient history and test results are easy to send electronically to reviewers as attachments to a specimen file, a great advance on the speed of postal delivery. Figure 1 applies to the work of digital pathologists working in small groups in a local hospital, or in larger, more specialised hubs.

Fig. 1

Effects of scanners on workflows. Adapted from Bracey (2017, p. 94)

Not every clinical benefit afforded by digitisation according to Fig. 1 is necessarily an unmitigated benefit from the standpoint of contextual integrity. After all, patient histories and test results are highly sensitive and are protected by both medical professional norms of confidentiality and norms of data protection. So their transmission in, e.g., unencrypted emails would be highly objectionable, first because emails can be sent to the wrong addresses and second because they are easy to resend or duplicate, not to mention hack. Still, these risks could be met by standard encryption techniques.
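As an indication of how modest the technical burden of such protection can be, the sketch below encrypts a hypothetical attachment with the symmetric Fernet scheme from the widely used Python cryptography package. This is a minimal illustration only: key provisioning, key storage and the transport channel are out of scope, and the file name is invented.

```python
# Minimal sketch of encrypting a sensitive attachment before transmission,
# using Fernet from the 'cryptography' package. Key management is out of
# scope; the file name is hypothetical.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, provisioned and stored securely
cipher = Fernet(key)

with open("patient_history.pdf", "rb") as f:   # hypothetical attachment
    ciphertext = cipher.encrypt(f.read())

# A misdirected or duplicated message now yields only ciphertext; only a
# holder of the key can recover the attachment.
plaintext = cipher.decrypt(ciphertext)
```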

We have seen that digitisation enables quicker transmission of diagnostic information than is allowed by slides and that it promotes prompt sharing of expertise through virtual meeting technology. For example, Microsoft Teams software enables pathologists to consult one another without the delays and expense entailed by physical travel to in-person meetings.

A case study that illustrates the potential clinical benefits of long-distance networked consultations by pathologists, including in areas where expertise is scarce, is reported by Colling and colleagues (2021). They discuss the review of WSIs of potential testicular malignancies by a digital pathology hub in Oxford. This lab was sent WSIs by two spoke laboratories with scanners and viewers in the English cities of Swindon and Milton Keynes. WSIs could be considered by supraregional multidisciplinary teams more local to Swindon and Milton Keynes, as well as in the central Oxford hub. In the Colling study, sharing between centres was tried out at the same time as some pathologists in Oxford used digital scanners for the first time. Feedback was gathered about the perceived advantages and disadvantages of working with digital images, and though opinions were mixed in the small sample of pathologists consulted, in general, digital pathology was well received.

Figure 2 (adapted from Colling et al., 2021) shows the equipment held at the spoke and hub sites and how equipment at the three sites—the Great Western Hospital in Swindon, Milton Keynes Hospital and the Oxford University lab (OUHFT)—exchanged data. The arrangement of equipment at Milton Keynes is not shown in detail. Had it been, it would have appeared at the bottom of Fig. 2, identical to the arrangements at Oxford and Great Western shown on the left and right, respectively, of Fig. 2. Although Oxford (OUHFT) is shown on the left in Fig. 2, it is in fact the hub. Milton Keynes and Great Western are the spokes. The central area of Fig. 2 shows how a slide viewer at any of the three sites exchanges data with the central portal server at Oxford.

Fig. 2

(Adapted from Colling et al., 2021) Equipment and data exchanges between spokes (Great Western, Milton Keynes) and hub (Oxford (OUHFT)) of a testicular cancer pathology network. The central area shows how a slide viewer at any of the three sites exchanges data with the central portal server at Oxford

In general, data flows within the network shown in Fig. 2 served both diagnostic and training purposes. That is, they served the purpose of training pathologists in the use of digital scanners and in diagnostic work by already trained pathologists. During the period of the study, WSIs were passed to Oxford from Milton Keynes and Great Western for diagnostic second opinions, and WSIs sourced in Oxford were used by Oxford pathologists new to digital pathology for practice in interpretation and annotation of digital images. Annotated WSIs were also available for multidisciplinary team (MDT) meetings in Oxford involving radiologists, oncologists and other specialists. When WSIs were looked at during meetings, there were mixed opinions about what they added to discussions (Colling et al., 2021, p. 13). Some MDT participants thought they were helpful, others that they were not. In some cases, during the period of the Colling study, pathologists reported discrepancies between the ease of identifying diagnostically important regions of tissue digitally, and the ease of doing so on glass slides (Colling et al., 2021, p. 16). The digital process was not always thought to be superior.

The study by Colling et al. took place in part during the Covid pandemic, which altered data flows in two significant ways. (a) To prevent infection in labs, pathologists turned to online working, which, fortunately, was highly compatible with the switch in hub and spoke institutions to digital pathology. Although it is not made explicit in Colling et al., this change of working practice may have moved some of the digital pathology work, and the communication of results, to home-based networks outside the firewalls indicated in Fig. 2.

In addition, Colling et al. report that (b) scanning capabilities were exploited to add free text documents to WSIs:

For example, the scanning of referral letters and paper request forms bearing the clinical details for a case was implemented to ensure these were accessible alongside the digital slides on the IMS for remote review by pathologists working from home (Colling et al., 2021, p. 8).

Although probably excusable as byproducts of practising pathology in a pandemic, (a) and (b) involve departures from established and reasonable norms of exchange of medical data, not to mention norms of data security. If homeworking survives beyond the Covid pandemic, there is a risk of a longer-term loosening of these norms.

In general, there is a range of tensions between measures that might be thought to promote contextual integrity, such as de-identification of patient information associated with WSIs in patient-by-patient digital pathology, and, on the other hand, measures that could in principle promote more efficient and accurate pathological analysis. In the next section, we consider whether there are comparable or more serious problems with computational pathology.

3 Data Flows When Human and Algorithmic Diagnoses are Combined

So far, apart from arguable breaches of contextual integrity associated with pathology workflows that were adapted to the demands of the pandemic, data flows associated with digital pathology have proved relatively unproblematic. We come in this section to the contextual integrity of data flows in which a validated algorithm trained on WSIs adds a diagnosis to a human-generated one.

The data flows involved in this sort of case include all of those required for the separate training and testing of the algorithm. Data flows involving such training and testing are characteristic of large-scale computational pathology, as opposed to the personalised digital pathology we have been considering so far. But the two kinds of pathology connect—through the use by individual pathologists of a machine-generated personalised diagnosis based on a model. Very often, a model will be developed by commercially employed data scientists who will depend on access to large data sets of WSIs obtained somehow from patients. The purposes of data-sharing in such cases will be a mix of (a) patient benefit, including speedier, more accurate diagnosis and quicker decisions to treat or not to treat, and (b) the development of a commercially successful diagnostic tool. In what follows, we first consider computational pathology as exemplified by the Innovate UK-funded PathLAKE project. [Footnote 5]

A pathology data lake assembles WSIs made from tissue samples of many research centres and hospitals into a single digital repository suitable for the training of algorithms for diagnosis, prognosis and general biomarker discovery. In the PathLAKE project, the digital repository brings together de-identified samples of various cancer types from many UK centres, all affiliated with the UK National Health Service. The original tissue will have been gathered under a variety of consent regimes. This repository is open to algorithm development by commercial partners who belong to the PathLAKE consortium, as well as by others who can be granted access to the data under certain conditions, including payment conditions for commercial applicants.

Figure 3 is a diagram of some of the information or data flows relevant to PathLAKE.

Fig. 3

Data flows in PathLAKE

Reading Fig. 3 from left to right, we see a transfer of WSIs from a UK National Health Service (NHS) laboratory to a server that collects the many tens of thousands of WSIs that form the PathLAKE data set. The NHS data area is firewalled, and, in the passage from an NHS laboratory to the firewalled PathLAKE server, WSIs are de-identified, that is, stripped of metadata that could connect them to individual NHS patients while stored in the PathLAKE server or subject to analytics there. All WSIs come from NHS centres that were partners in the original PathLAKE consortium, or that have joined its successor, PathLAKE Plus.
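At the level of metadata, the de-identification required at this boundary can be sketched minimally as follows. The field names are hypothetical and ours; real WSI formats also carry identifying information in vendor-specific headers and embedded label images, so a production pipeline must operate on the file formats themselves.

```python
# Minimal, illustrative metadata-level de-identification of a WSI record
# before it crosses the NHS firewall. Field names are hypothetical; real
# pipelines must also handle vendor headers and embedded label images.
IDENTIFYING_FIELDS = {"patient_name", "nhs_number", "date_of_birth",
                      "referring_clinician"}

def deidentify(metadata: dict) -> dict:
    """Return a copy of WSI metadata with patient-identifying fields removed."""
    return {k: v for k, v in metadata.items() if k not in IDENTIFYING_FIELDS}

record = {
    "patient_name": "EXAMPLE, PATIENT",
    "nhs_number": "000 000 0000",
    "scanner_model": "example-scanner",
    "magnification": "40x",
    "tissue_type": "prostate",
}
print(deidentify(record))  # only non-identifying technical fields survive
```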

It will be seen from this much of Fig. 3 that de-identification and privacy in the sense of control of one’s data are embodied in the architecture of PathLAKE, even though it is widely acknowledged that irreversible de-identification is probably not technically realisable. The demand for de-identification is the vestige of a pre-Nissenbaum concept of privacy in the NHS, probably allied to a strong commitment to patient confidentiality. There is a sense in which some of the values of the NHS have not kept pace with big data and the technical challenges to anonymisation. We return to this point below.

The right-hand side of Fig. 3 explains the process by which commercial algorithm developers and other external researchers become eligible to receive some of the data in PathLAKE and to train and test algorithms on the data. Algorithm developers apply to a PathLAKE committee for access to specific numbers of WSIs representing disease types, notably different cancer types. If permission is given, then developers are expected to train and test algorithms in the PathLAKE data zone (i.e. not on their own servers) and collect data on that training and testing without removing WSIs from the data zone. A condition of giving access to commercial developers is some sort of financial transfer negotiated by a PathLAKE committee different from the access committee. Here, the goal is to produce funds for future PathLAKE work based on an assessment of the market value of the algorithm to be developed. The access committee, for its part, makes its decision, independently of finance, on whether to allow researchers to work on a subset (usually a small subset) of the data. A majority of the members of the PathLAKE access committee are lay patient representatives, and before approving an application, they must be satisfied that developers have a credible claim to produce patient benefit through their algorithm.

So much for the information flows in a set-up for computational pathology like PathLAKE, and for the data subjects, senders and receivers involved in the processes shown in Fig. 3. We can imagine other possible receivers of data: scanner manufacturers, for example. A further audience for the data is pathology researchers, as consumers of the literature that reports not only biomarkers of various diseases but also how WSIs of tissue samples representing different diseases are prepared and annotated. Other receivers of data include major suppliers of data storage infrastructure. We can also assume that prognostic algorithms will be of interest to actuaries and insurance companies.

What about the purposes that organise the data flows? Patient benefit, as we have seen, is one. But, where the algorithm developers are involved, profit-making is another, and this purpose is shared with scanner manufacturers and sellers of space on data platforms. Patient benefit and profit-making can conflict, and, in the case of PathLAKE, government regulation and guidance operate to keep them in balance. In other, less regulated jurisdictions in which commercial purposes are liable to be more weighty than in the UK, this balance may be harder to achieve.

Let us enlarge on the difference between PathLAKE, on the one hand, and a set-up for computational pathology in a major private health care institution in a non-welfare-state setting, on the other. An important fact about data flows in PathLAKE with origins in NHS partners is that the transfers are constrained by the purposes and values of the NHS itself. For English NHS institutions, these purposes and values are comprehensively articulated in an NHS Constitution for England (Department of Health & Social Care, 2021a). The Constitution sets out for users of the NHS the range of services they are eligible for, free of charge, at the point of needing those services. Sections of the Constitution lay down general expectations of staff and also of patients. In particular, some of the norms that might be employed to judge the contextual integrity of NHS-involving data flows are readily discoverable in the Constitution, which the NHS asks various publics, including those assessing contextual integrity, to judge it by.

To illustrate, the NHS constitution for England articulates through the medium of a set of principles the different elements of the benefit that is supposed to be delivered. The very first principle requires the NHS not only to make available healthcare free at the point of use to every resident of England who needs it, but also to:

promote equality through the services it provides and to pay particular attention to groups or sections of society where improvements in health and life expectancy are not keeping pace with the rest of the population (Department of Health & Social Care, 2021a).

This principle supports networked diagnostics where regional shortages of pathologists or pathology sub-specialties would otherwise make some groups of patients wait longer than others for cancer diagnosis and treatment. By the same token, the principle might support the development of automated diagnostics through algorithms. To the extent that it contributes to automated diagnostics, data collection and analytics in PathLAKE are also supported by the first principle. Again, the values and purposes in the NHS constitution attest to the integrity of at least some of the data flows in PathLAKE, namely, those that start with WSIs of patient samples and end up with speedy communication of diagnoses to, and treatment of, those same patients.

It may seem that the first principle of the NHS Constitution has less of a bearing on algorithm development, if, as in the PathLAKE process, the requirements of algorithm development are left to developers to articulate in applications to an access committee. But in fact, the first principle directly constrains algorithm development in the following way. It tells against the use of training sets that are not representative of the relevant UK population as a whole, or not representative of the variety, including the ethnic variety, of patients affected by a particular disease type. This means that diseases with a high incidence in an ethnic or minority community must not be diagnosed by algorithms trained wholly on the WSIs of majority communities. Access committees, including PathLAKE’s, are perfectly able to take into account this sort of constraint and to ask applicants in the access process to take it into account as well.

The purposes constraining information flows from NHS institutions do not only come from the NHS Constitution. There is further normative guidance, itself being regularly updated, for data-driven research and AI-assisted research in particular (Department of Health & Social Care, 2021b). This highlights the need for research with patient benefit, processes in which patients are represented in the evaluation of claims of patient benefit, and methods of measuring the extent of benefit and how widely real benefits are enjoyed in the general population. In addition, there are standard data processing norms, norms of interoperability of new products with ones currently used, and explicit standards for capturing various kinds of patient data. Finally, there are guidelines for apportioning a financial return to the NHS for its data assets from companies that are likely to profit from devices or algorithms developed with those assets.

One way of summarising the effect of the NHS Constitution, taken together with norms for data-driven research, is by saying that health data is specially protected in the UK and that some of its norms do not fit computational pathology very exactly. This is because some of the NHS norms are influenced by medical ethics norms which apply to doctor-patient relationships taken patient by patient—to very small-scale data—and not to NHS services in relation to whole populations, or to the whole-population and patient big data that are more familiar from public health than from doctor-patient relations. Another feature of the NHS approach is that it submits claims of patient benefit from both public sector and commercial researchers to patient representatives, who may or may not have the education to understand AI-involving innovation in pathology or other branches of medicine.

As will emerge, there are also particular tensions deriving from the involvement of commercial algorithm developers in the training and testing of algorithms. The NHS norms for data-driven research are partly aimed at commercial organisations and acknowledge that there is an issue over what a fair price for data access should be, but the NHS constitution and the PathLAKE access committee process assume that patient representatives will not be enemies of business or suspicious that businesses will take advantage of the NHS in the development of products. The Nissenbaum framework permits, in theory, a distinction between big business involvement in computational pathology (e.g. the digital scanner manufacturers, who include such household names as Siemens and Philips) and the involvement of very small spinoffs that are developing algorithms.

That completes our discussion of the contextual integrity of information flows associated with PathLAKE-based computational pathology. When computational pathology in PathLAKE produces an algorithm used in a digital pathology exercise for a particular patient, safeguards on data in the data zone of the PathLAKE platform, and protections against improper commercial exploitation in the PathLAKE access process, combine to guard against morally questionable information transfers. Compared to the potentially leaky processes of individualised digital pathology, at least as exemplified by projects discussed earlier, computational pathology looks highly trustworthy. In the next section, however, we show that differences between the normative environment of the UK and other, more loosely regulated jurisdictions can overturn that verdict. What is more, even in the UK, there are some as yet unsettled questions about what access to a data set like PathLAKE by commercial firms might involve. The fact that these questions are unsettled should guard us against complacency about even the safeguards afforded by computational pathology on the model of PathLAKE.

4 A Non-NHS Context

As we saw in the last section, the norms associated with the NHS seem to cocoon information flows involving NHS data from normal kinds of misuse. They anticipate conflicts of interest and power differentials between highly vulnerable data subjects—hospitalised patients suffering from serious disease, ordinary citizen-outpatients, outpatients that belong to minorities suffering from inadequate health care—and commercial firms and other institutions, including, at times, NHS bureaucracy and regulatory bodies. They also anticipate threats to privacy and patient confidentiality. Although we will have some reason later to reconsider or qualify this conclusion, let us contrast the information environment we have just been surveying with another, much more laissez-faire, alternative in the USA.

Figure 4 shows data flows involving a large data set of pathology WSIs at the Memorial Sloan Kettering (MSK) Cancer Center in New York (Schüffler et al., 2021). It is not a far cry from Fig. 3.

Fig. 4

(Adapted from Schüffler et al., 2021) Digital pathology data flows involving ‘data warehouse’ at Memorial Sloan Kettering Cancer Center

On the left-hand side is shown the scanning of samples and the production of WSIs. These are collected together in the HOBBIT ‘data warehouse’, where patient information is redacted and slides are de-identified. Slides are then made available for viewing and annotating via the MSK Viewer. Users of the MSK viewer are indicated on the lower right-hand side of Fig. 4. These include qualified hospital clinicians, technicians and medical students, as well as ‘researchers’. It is unclear from Schüffler et al. whether researchers are employed exclusively by the MSK Cancer Center or whether access to the data warehouse on behalf of external institutions, including for-profit enterprises, is permitted under certain unspecified conditions.

The attention given through HOBBIT to de-identification of WSIs shows that patient privacy protection is one of the constraining purposes of MSK computational pathology. The examples of successful research done within the MSK set-up also suggest that patient benefit is another of the norms constraining information flows. [Footnote 6] To this extent, MSK norms match some of those constraining PathLAKE. PathLAKE, too, de-identifies NHS patient WSIs, and patient benefit is central to its core internal projects and is a condition of commercial access to its data set. On the other hand, PathLAKE is independent of external commercial firms who apply to use its data set.

MSK’s approach has been different. In early 2018, three clinicians in MSK helped found a medical diagnostics firm, Paige.AI, which acquired exclusive intellectual property rights in MSK’s stock of 25 million WSIs. MSK itself received shares, but the three clinicians also had substantial personal financial stakes. These arrangements seem to flout norms in the US outlawing conflicts of interest and also rules governing the operation of not-for-profit health care firms. The arrangements also seemed to some local stakeholders in 2018 to misuse even de-identified patient data (Ornstein & Thomas, 2018). Although the MSK environment in New York was not entirely unregulated, given the recognition there that conflict of interest and tax norms were prima facie violated, the sale of intellectual property rights in a huge set of WSIs to a commercial firm would have been, and remains, unthinkable in the setting of the NHS.

It would be a mistake to conclude, however, that even a data set and platform like PathLAKE’s are entirely insulated from problems with commercial firms. The reason is that there is a tension between the PathLAKE requirement that external firms train and test models in its data zone and the rights of commercial firms to keep secret the code for the proprietary algorithms they would expose on the platforms to do that training and testing. Another problem has to do with the capacity of PathLAKE or comparable repositories to provide staff and computing time to help external firms operate in its data zone. Not every firm will approach the access committee with a ready model, or the expertise to apply one on an unfamiliar platform. Some may want instead to explore the data for dependencies and may need help from custodians of the data lake to do so. This problem would be eliminated if data could be worked on by firms outside the data zone, with the firm controlling on its own servers the use of its model or algorithm and supplying its own computing time and personnel. But in that case, data would pass out of the control of the NHS, and even agreements to delete it after use might not be enforceable.

In the future, this problem may affect UK data sets other than PathLAKE’s because of the growing support in the UK among policymakers for the concept of a Trusted Research Environment (TRE) (Goldacre & Morley, 2022). A Trusted Research Environment (also known in the UK as a Secure (or Safe) Data Environment (SDE)) is in effect a scaling-up of the set-up visualised in Fig. 3. A number of NHS patient data sets, possibly including sets of WSIs, are made available through a single, secure, online portal to vetted researchers who apply successfully for access. Normally, de-identified data sets are used. Analytical tools are provided with the data sets, and these can be employed on the same site. The results of analysis can be safely exported.

The NHS Digital TRE, still under development, provides this sort of set-up for UK researchers (NHS Digital, No date a). The tools do not, as yet, include AI-assisted analytics, and it is unclear whether commercial firms developing algorithms are supposed to have access. Again, at the moment, the available data sets do not appear to include sets of WSIs. [Footnote 7] Before the TRE was established, NHS SecureLab existed. Its purpose was to allow access from a remote Safe Room or Safe Hub to very sensitive and detailed data that is not allowed to be downloaded (see UK Data Service, No date a). This could be a model for future access to WSIs by commercial organisations. TREs using NHS data, including the SecureLab, are subject to a normative framework called ‘The Five Safes’ (UK Data Service, No date b). One of the safes is Safe Settings, which is to do with the platform, its cybersecurity, its access conditions and its conditions for export of data. Almost without exception, the Five Safes prohibit the export of data from a TRE. But this norm of safety may be at odds with some kinds of research, including algorithm development.

TREs do not always lend themselves to the introduction of tools for testing algorithms. There may be uncertainty as to how much computing power and how much technical assistance should be provided to external data scientists applying to use data held on the TRE for AI work. Again, commercial algorithm developers may be reluctant to reveal code that would be visible to the custodians of the spaces on a TRE where testing, or training and testing, would take place. In other words, there are both data security issues and intellectual property issues raised by the use of TREs by commercial data science companies.

To illustrate, NHS SDE guidelines released as late as September 2022 make clear that the recommended norms for allowing external developers to import and test code require the code to be inspected by the custodians of TREs (Department of Health & Social Care, 2022, Guideline 8). At the same time, the Guidelines invoke the Five Safes. Exports of data from the TRE are strongly discouraged if not prohibited outright (Guideline 1): the whole point of a TRE is to keep NHS data under NHS control, and exported NHS data can be shared, not necessarily for patient benefit.

The obstacles to AI-assisted research on TREs do not stop there. The same NHS policy that we are currently discussing insists (Guideline 6) that:

Owners of secure data environments must make sure that the public are properly informed and meaningfully involved in ongoing decisions about who can access their data and how their data is used. For example, by ensuring that relevant technical information is presented in an accessible way (that is, through publishing privacy notices and data protection impact assessments).

In practice, this means that external developers will have to satisfy an access committee with lay members (typically patient representatives in the UK) that using the SDE or TRE for algorithm-testing will produce patient benefit and not merely profit. Although communicating patient benefit is not necessarily difficult, communicating how the AI produces this benefit in a given case can be highly challenging, sometimes because the AI in question is not explainable AI.

5 Summary and Conclusion

We have used Helen Nissenbaum’s ‘contextual integrity’ framework (as adapted by Zimmer) to assess two kinds of use of scanners in digital pathology. In one use, scanners take the place of microscopes in the analysis of tissue samples. WSIs generated by the scanners are used for the classification of one patient’s tissue as normal or abnormal. This is the use of digital scanners in digital pathology. The digital images are readily shared, enabling the quick collection of second opinions, or serving as aids to teaching in pathology. In the second use of digital scanners, large numbers of whole slide images are produced from mounted tissue samples. The WSIs are then aggregated for algorithm training and testing. This process increases the accuracy and speed of diagnosis, prognosis and biomarker discovery in relation to a cancer type. This second use of scanners belongs to computational pathology. Although the results of computational pathology can be applied in digital pathology, and the annotated WSIs of an individual patient can be used in a training or testing set for an algorithm, the two practices can be distinct.

The main question we have been pursuing is whether information flows associated with one kind of pathology are more likely to violate data ethics norms than the other. The information flows associated with digital pathology assessed in this paper definitely permit, and may even encourage, breaches of medical confidentiality and anonymity, as well as simple data loss. To some extent, this risk is the other side of the coin of the ready sharability of single-patient WSIs and the need for multidisciplinary teams to work jointly on diagnosis and treatment.

Information flows associated with computational pathology, at least in NHS computational pathology, show more contextual integrity than digital pathology. But adhering to existing contextual norms does not mean that no ethical problems at all attend computational pathology. [Footnote 8] The business ethics of commercial algorithm development with patient data sets is not straightforward, and some of the ethically motivated safeguards on NHS data set handling may discourage the development of AI-assisted techniques with great patient benefit. TREs are potentially cybersecure and friendly to AI research, including AI research on WSIs, but they may pose special difficulties for commercially conducted data science, including some risks of economically devalued code. [Footnote 9]