Powerful institutions across the globe have recently joined the ranks of those making substantive commitments to “open science.” For example, the European Commission and the NIH National Cancer Institute are supporting large-scale collaborations, such as the Cancer Genome Collaboratory,1 the European Open Science Cloud,2 and the Genomic Data Commons,3 with the aim of making giant stores of genomic and other data readily available for analysis by researchers.4 In the field of neuroscience, the Montreal Neurological Institute is midway through a novel five-year project through which it plans to adopt open science across the full spectrum of its research.5 The commitment is “to make publicly available all positive and negative data by the date of first publication, to open its biobank to registered researchers and, perhaps most significantly, to withdraw its support of patenting on any direct research outputs.”6 The resources and influence of these institutions seem to be tipping the scales, transforming open science from a longstanding aspirational ideal into an existing reality.
Although open science lacks any standard, accepted definition, one widely-cited model proposed by the Austria-based advocacy effort openscienceASAP describes it by reference to six principles: open methodology, open source, open data, open access, open peer review, and open educational resources.7 The overarching principle is “the idea that scientific knowledge of all kinds should be openly shared as early as is practical in the discovery process.” This article adopts this principle as a working definition of open science, with a particular emphasis on open sharing of human data.
As noted above, many of the institutions committed to open science use the word “commons” to describe their initiatives, and the two concepts are closely related. “Medical information commons” refers to “a networked environment in which diverse sources of health, medical, and genomic information on large populations become widely shared resources.”8 Commentators explicitly link the success of information commons and progress in the research and clinical realms to open science-based design principles such as data access and transparent analysis (i.e., sharing of information about methods and other metadata together with medical or health data).9
But what legal, as well as ethical and social, factors will ultimately shape the contours of open science? Should all restrictions be fought, or should some be allowed to persist, and if so, in what form? Given that a commons is not a free-for-all, in that its governing rules shape its outcomes, how might we tailor law and policy to channel open science to fulfill its highest aspirations, such as universalizing practical access to scientific knowledge and its benefits, and avoid potential pitfalls?10 This article primarily concerns research data, although passing reference is also made to the approach to the terms under which academic publications are available, which are subject to similar debates.
We start from the perspective that the ultimate goal of both the open science movement and information commons creation is to increase practical access to scientific knowledge and its benefits across human society, and to ensure that this access is distributed as evenly as possible. The potential pitfalls of open science include exacerbating existing inequalities, by supporting the development of expensive new diagnostics and treatments that are practically available only to the stratum of the population who can afford them, while putting already-disadvantaged individuals and groups at risk of harms, such as discrimination and stigmatization. Inequities in algorithmic decision making based on big data have indeed become a widespread focus of attention and research.11 A related risk is the de facto privatization of personal data, by organizing data in a manner that benefits only those who possess sufficient resources to analyze them usefully, thus transforming public funding of open science into an indirect subsidy to private industry.12
Both of these tendencies relate to data protection as it is evolving. Although data protection frameworks have long included automated decision making and profiling within their scope, it is only with the recent surge of interest in machine learning techniques that a corresponding increase in attention is emerging in delineating what protections, if any, should exist in practice.13 A leading understanding of data protection as a field is that it refers to a “set of legal rules that aims to protect the rights, freedoms, and interests of individuals, whose personal data are collected, stored, processed, disseminated, destroyed, etc. The ultimate objective is to ensure ‘fairness in the processing of data and, to some extent, fairness in the outcomes of such processing.’”14 In this context, recent interest in applying data protection rules to the contemporary big data and machine learning contexts should come as no surprise.
Data Protection Under the GDPR
The European Union's General Data Protection Regulation (GDPR) is now in full effect across the European Economic Area (EEA), encompassing 31 countries. It grants rights to “data subjects” (identified or identifiable natural persons), including the right to access, rectify, and object to the processing of their personal data. It also imposes duties on “data controllers” (natural or legal persons, public authorities, agencies, or other bodies that determine the purposes and means of processing “personal data,” meaning information related to a data subject), for example, placing the burden on them to justify processing the personal data, to justify its transfer outside of the EEA, and to justify processing “special categories of data” (i.e., sensitive data, including data concerning health and genetic data), not only as having a lawful basis but also as falling within at least one of ten special category conditions (such as scientific research). Whereas the GDPR's precursor, the EU Data Protection Directive, had to be adopted into the national law of European Union member states, the Regulation is now directly legally applicable within the EEA (i.e., without the necessity of nation-level implementing legislation), as well as to many organizations located outside the EEA which process the personal data of individuals residing in the European Union. Despite the GDPR's aspirations to further harmonization, however, it still allows for member states to impose more stringent restrictions in certain areas it specifies, notably in the context of data the GDPR deems to be sensitive, such as data concerning health and genetic data.
Researchers' Evolving Response to the GDPR
The arrival of the GDPR set off a wave of unease in many data-intensive industries, including health research, sparking fears that it would effectively curtail their activities.15 Other areas of the GDPR have raised concerns in the research community, such as implementation of the new data subject right to data portability (which includes the right to have their personal data transmitted directly from one controller to another, where technically feasible), the “right to be forgotten” (to have a data controller erase personal data, cease further dissemination, and potentially halt processing by third parties), and restrictions on data transfer to non-EU countries. Issues around consent, however, have preoccupied researchers most, especially the specificity of consent. The GDPR's emphasis on specific consent initially alarmed translational researchers after an early draft version was released in 2016, but language favourable to broad consent when processing personal data for research purposes was ultimately incorporated in Recital 33 in the version enacted.16 Whereas specific consent requires research participants to consent to each specific project they are participating in, and to be informed in advance of the concrete ways in which their data will be used, broad consent allows participants to consent to having their data used in multiple research projects based on a description of a broad type (or types) of research and the governance thereof. For example, participants might consent to the use of their data in any future project aimed at developing treatments for a particular disorder, facilitating sharing of data among investigators in particular fields of research and, potentially, contribution to a relevant data repository. They may also agree to future unspecified research within a field subject to proper oversight.
Although alarm in the health research sector continues to some degree, many in the field have now instead come to view the GDPR as sufficiently attuned to the nature of research, its processes and needs, and as “a well-drafted piece of legislation that raises the standards of data protection globally.”17 Indeed, consent is but one of several alternative bases upon which the final version of the GDPR allows personal data to be lawfully processed. In countries such as the UK, for example, regulators and scholars are actively discouraging reliance on consent to fulfill data protection duties in most cases when personal data processing is necessary for research or clinical purposes.18 They instead suggest that it is generally preferable to rely on another legal basis, specifically that processing is being carried out for research purposes in the public interest (in the case of a public institution), or that such processing is necessary for pursuing legitimate interests (for private sector institutions).19 This approach does not mean that research participants' consent will not be needed: indeed, irrespective of the GDPR, research ethics duties generally require it. The idea is to satisfy GDPR obligations by relying on an alternative legal basis rather than on consent, even though consent must generally still be obtained to satisfy research ethics duties. The impression of the cited writers in the UK, at least for now, is that alternative legal bases, such as where processing is based on the “public interest” or the “legitimate interests” of the entity controlling the personal data, mitigate the possible infringement and impact of the GDPR on open science principles.
Interpreting the GDPR
One of the difficulties in interpreting the GDPR with respect to health research is that its default lens tends to focus on relationships between private sector companies and their customers. For example, the important question of when the GDPR applies to an organization outside of the EU is determined in part based on whether that organization is “offering goods or services, irrespective of whether a payment … is required” to people in the European Union (Article 3(2)(a)). The research context, including the degree of connection that a cohort of research participants would need to have to the European Union to satisfy this condition, is largely ignored by recent guidance on the interpretation of this article of the GDPR published by the European Data Protection Board.20
Because of this overarching lens, guidance from the European Commission on interpreting the GDPR in the health research context is to be welcomed. One recent attempt to raise awareness of ethics and data protection issues in the scientific community, which focuses on the GDPR, however, represents a missed opportunity in that a number of important questions were either ignored altogether or dealt with in too cursory a fashion. The document in question, “Ethics and data protection,” was prepared at the request of and published by the Research and Innovation arm of the Commission, which oversees the Commission's Horizon 2020 research funding program.21 The document's approach to consent is particularly notable:
Whenever you collect personal data directly from research participants, you must seek their informed consent by means of a procedure that meets the minimum standards of the GDPR. This requires consent to be given by a clear affirmative act establishing a freely given, specific, informed and unambiguous indication of the subject's agreement to the processing of their personal data.22
Although this interpretation stays close to the GDPR's definition of consent in its Article 4(11), the approach appears to be at odds with the regulation in multiple other respects. First, as noted earlier, the regulation's Recital 33 makes an exception to the requirement that consent must always be specific insofar as “data subjects should be allowed to give their consent to certain areas of scientific research when in keeping with recognised ethical standards for scientific research.” It is odd for this exception to be entirely ignored here, despite being previously recognized by guidance endorsed by the Commission's European Data Protection Board.23 Second, the GDPR allows personal data to be collected directly from data subjects on a basis other than consent (e.g. this is implicit in Article 13(1)(d)). Third, the distinction between GDPR consent and research ethics consent appears to be blurred. This distinction is important in that even if a basis other than consent is used to justify processing personal data with respect to the GDPR, consent will still generally be sought from research participants in order to comply with research ethics duties that apply independently. But the form of consent required in such circumstances will be defined by existing research ethics rules, not by the dictates of the GDPR.
In sum, although there is uncertainty because these novel elements of the GDPR remain to be tested (perhaps via a challenge by data subjects in real-life proceedings), the text of the GDPR appears to provide a number of additional routes through which research initiatives can satisfy their data protection obligations. The proviso is that they are attentive, as indeed they should be, to the rights and interests of those whose data they hold, and aim to adopt proportionate measures to safeguard them. These measures include ensuring the technical confidentiality of the data and integrity of the security of their systems, ensuring that the data's confidentiality will not be jeopardized in the hands of any third parties to whom it will be disclosed or transferred, and ensuring participants' rights to access and rectify their data. Rather than continuing to wait for the courts and regulators to weigh in, an alternative route to gaining clarity on some of these details would be to develop a data protection code of conduct for the health sector. As of now, BBMRI-ERIC is leading such a proposed initiative.24 Once such a code is approved by EU data protection authorities according to a process set out in the GDPR itself, adherence to it would provide evidence of compliance with the GDPR. In addition, adherence by an organization outside the EU wishing to receive personal data from an entity subject to the GDPR, when combined with binding and enforceable commitments to apply the appropriate safeguards, would satisfy the regulation's restrictions on transfer.
Remaining Tensions Between Open Science and Data Protection
Despite the appearance that the GDPR strikes the proper balance between accommodating scientific research and securing individual rights and dignity, the tension between open science and data protection goes to the very core of the two movements.25 When “open source” or “open data” have been given formal definitions, such as in the GNU General Public License, these generally appear in an absolutist form and require that the information in question be provided in its entirety and be free to use, free to disseminate, and free to adapt with as few restrictions as feasible.
A key way in which the GDPR's restrictions are circumscribed is according to the purpose for which personal data were collected or are otherwise processed, which must be defined. For example, if researchers collect personal data, they must indicate the purpose of collection, such as to conduct a particular study. Data protection regimes have tended to prohibit any use of personal data for any purpose other than the one that was indicated at the time of collection. This contrasts sharply with the driving rationale behind the initiatives that constitute the open science movement, which instead emphasize the benefits that are possible through unforeseeable future uses to which any given information set might be put.26 Open source software, for example, is made available to be reused for purposes that may have been unforeseen or even unforeseeable by its initial creator. This apparently fundamental tension is not entirely new: back in the 1970s and 1980s, a wave of new laws enacted around the world aimed to reconcile the protection of privacy with a new public right of access to government documents.27 As the default position established was that government documents were presumed to be made freely available to the public unless some exception to access applied, this gave rise to a risk of violating the privacy of those whose personal information was contained in them. Through practices and frameworks established over time in the context of specific cases, the tension between these two objectives came to be workably smoothed out, often by redacting personal information, where appropriate.
The GDPR still has little or no judicial interpretation, apart from that portion of the jurisprudence of European courts, especially the European Court of Justice, regarding the previous Data Protection Directive that remains relevant to it. Further, experimentation with large-scale open science is only just beginning. As a body of legal interpretations develops, and as experience with open science increases, a similar process of incremental adjustment should be encouraged. To make the transition as smooth and steady as possible, more interplay is needed between the two, currently siloed, fields of study: advocates for open science should ensure that the courts' coming interpretations of the GDPR carefully weigh possible effects on the information commons, while also seeking to ensure that debates around open science incorporate careful consideration of the rights and concerns of data subjects and lead to steps to recognize and guarantee personal data protection.
As an example of potential interplay between the two fields, the developers of a data protection code of conduct for health research, as discussed earlier, might explicitly incorporate a provision stating that open science or the medical information commons is to be viewed as an important principle guiding the analysis. The general principle might be made concrete within a code of conduct in the specific context of health research, for example, by setting out best practices when establishing a data access committee, when necessary, and ensuring the approach is aimed at realizing the twin goals of data protection and open science. This approach is indeed in harmony with the underlying goal of data protection to promote the free movement of personal data so long as its processing appropriately protects and realizes the hopes of the people to whom it relates.
Conclusion
As open science becomes institutionalized, we are in a key moment in which to establish the rules that will shape it, taking account of legitimate concerns about data protection and data sharing. Additional high-quality social science to shed light on the specific policy calibrations that will maximize open science while also giving data protection, and other extrinsic policy considerations, their due would be invaluable in this respect. Even though the legislative process of the new EU General Data Protection Regulation is over, its substance will continue to evolve rapidly and take shape through the interpretations of courts and data protection authorities in specific cases and through any ancillary national legislation. The research community should watch for opportunities not only to have its voice heard in those processes, but to draw on past insights from access to public information laws to formulate interpretations that balance the promotion of self-determination for individuals with the promotion of data sharing and creation of an efficient open science information commons to support new discoveries.
Acknowledgements
The authors gratefully acknowledge funding from the Tannenbaum Open Science Institute (TOSI).