In August 2022, the Biden White House’s Office of Science and Technology Policy (OSTP) released a memorandum on Ensuring Free, Immediate, and Equitable Access to Federally Funded Research (2022 OSTP Memo).1 Its major aims include requiring that federally supported research, along with data of sufficient quality to validate and replicate the findings, be made available to the public without embargo. The 2022 OSTP Memo represents the most recent step in federal data sharing efforts over the past 20 years, including those specific to genomic data sharing.2 Since the last OSTP Memo on this topic in 2013, all 20 federal departments and agencies covered within OSTP’s scope have implemented data sharing policies. These policies have enabled access to over 2.4 million federally supported publications and an additional 5.7 million articles in the sciences generally.3
Despite these achievements, many of the problems these federal policies set out to fix remain. Among these challenges are: (1) clarifying who should bear the burden of sharing data; (2) translating shared data into scientific advancements; (3) elucidating how federal policies intersect with private interests (e.g., journals, industry co-funders, or commercially generated data); and (4) balancing the autonomy interests of those who contribute data (including patients, research participants, and commercial consumers) with the public beneficence attendant to advancing science.
Due to the need to combine large amounts of data globally to support comprehensive advances across genomic variance, health behaviors, and health outcomes, the governance of genomic data sharing was largely where these types of policies began — and the field of genetics remains on the cutting edge of the debate regarding ongoing challenges. Therefore, while the U.S. government continues to focus on implementation of the 2022 OSTP Memo, and the National Institutes of Health (NIH) is concurrently updating its most recent 2014 genomic data guidance, it is critical to better understand the goals and challenges of those expected to both benefit from and contribute to these shared data resources. To this end, in the spring and summer of 2020, we conducted semi-structured interviews with U.S. academic genetic researchers. We explored perceived benefits and burdens, industry interests, and autonomy considerations related to data sharing and using shared data resources. In this article, we provide a background of the major U.S. federal government data sharing policies over the past twenty years, present the results of our qualitative study, and discuss areas for continued improvement for federal governance and support of research.
Background
1997 Bermuda Principles
U.S. science funding agencies began in the 1980s to think comprehensively about data sharing from funded research. The Human Genome Project (HGP), with the goal of generating the first sequence of the human genome, was launched in 1990. United States participants were funded by the U.S. Department of Energy and the NIH Office for Human Genome Research (later named the National Human Genome Research Institute (NHGRI)). Six years later, 50 members of the HGP gathered to adopt the first major set of principles for the HGP regarding the sharing of genomic data, known as the “Bermuda Principles.”4 These principles mandated that sequencing data should be “freely available and in the public domain” to enable research, development, and the betterment of society.5 NHGRI then expanded the scope of these principles from the HGP to all its funded large-scale researchers, which evolved several times through 2003.6
2003 NIH Policy
In 2003, the NIH adopted a federal data sharing policy across all institutes and centers, called the NIH Data Sharing Policy and Implementation Guidance (2003 NIH Guidance). It required that investigators asking for $500,000 or more in direct costs per year have a “plan for sharing final research data for research purposes, or state why data sharing is not possible.” The sharing had to “occur in a timely fashion” (generally defined as “no later than the acceptance for publication of the main findings from the final dataset”) and contain information necessary to “document, support, and validate” research findings. Such data also had to include relevant information about methods, codes, variables, etc. needed to “prevent misuse, misinterpretation, and confusion.”7
The policy also explicitly recognized that “the investigators who collected the data have a legitimate interest in benefiting from their investment of time and effort.” It specifically allowed investigators to benefit from “first and continuing use but not from prolonged exclusive use” of the data they generated. The 2003 NIH Guidance was also particularly concerned about the generation and analysis of data that had been “co-funded” by private industry. It recognized “the need to protect patentable and other proprietary data,” if those limitations were disclosed in the original grant proposal’s data sharing plan. The NIH also recognized the rights of contributors to privacy protections. However, it also recommended that promises to contributors that their data would not be shared as part of the informed consent or disclosure process “should not be made routinely and without adequate justification.”8
2008 NIH GWAS Policy
In 2008, the NIH created its own genomic data sharing policy common across institutes and centers (i.e., not just limited to NHGRI). The Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (2008 NIH GWAS Policy) created a centralized data repository (the database of Genotypes and Phenotypes (dbGaP)), protected data contributors by ensuring that data sharing did not run contrary to the terms of the informed consent, and set standards for publication and intellectual property rights for all NIH-funded research that included GWAS. The policy required the sharing of protocols, instruments, variables, and supporting documentation, and “strongly encourage[d]” the sharing of curated phenotypic and genomic data within dbGaP. It also granted awardees up to 12 months of publication exclusivity from the shared dataset (others were allowed to analyze the data, but not submit findings to a journal during that time).9 Under this policy, over 2,200 investigators accessed 304 studies and produced over 900 publications.10
2013 OSTP Memo
In 2013, under the Obama Administration, the White House’s OSTP released its own Increasing Access to the Results of Federally Funded Scientific Research Memorandum (2013 OSTP Memo) to set one of the first federal data sharing standards, again increasing the coverage of data sharing requirements to now include many federal departments and agencies who fund research. The goal of the 2013 OSTP Memo was to “maximize the impact and accountability” of federal investment in research to “accelerate scientific breakthroughs and innovation.” It included the 20 federal departments and agencies with over $100 million in annual research and development expenditures in its scope.11
The 2013 OSTP Memo set the same 12-month post-publication embargo period for making all research papers “directly arising from federal funding” publicly available as the NIH had in 2008. It also required the sharing of data “commonly accepted in the scientific community as necessary” to validate the findings described therein. In addition, the 2013 OSTP Memo recognized the importance of balancing the ambitious goals of data sharing with “associated costs and administrative burden.” The memo specifically emphasized an interest in not adversely affecting opportunities for non-federally funded researchers, although it did not offer clear guidance regarding how to do so.12
Much like the 2003 NIH Policy, the 2013 OSTP Memo recognized proprietary interests to avoid “significant negative impact on intellectual property rights, innovation, and U.S. competitiveness.” This time, with the addition of the article sharing requirement, it also recognized the interests of journals as discrete stakeholders. OSTP argued that “publishers provide valuable services, including the coordination of peer review, that are essential for ensuring the high quality and integrity of many scholarly publications.” It therefore required agency plans to have a strategy for “leveraging existing archives…and fostering public-private partnerships with scientific journals” as well as procedures to help prevent the “unauthorized mass redistribution of scholarly publications.” To maximize the impact of federal funding, it specifically encouraged public-private collaboration to maximize interoperability and creative reuse. In addition, the 2013 Memo noted the need for agencies to ensure that “confidentiality and personal privacy” of contributors were protected throughout.13
2014 NIH GDS Policy
The following year, the NIH replaced its 2008 NIH GWAS Policy with the Genomic Data Sharing Policy (2014 GDS Policy). This current policy applies if federal funding supports the “generation” of genomic data. While it did not alter OSTP’s required 12-month embargo for release of federally funded articles, it offered additional details to ensure “broad and responsible sharing” of large-scale genomic data. The 2014 GDS policy requires funded investigators to share genomic data, including the analytic code or tools necessary to interpret it, in an NIH-designated repository by the time of publication of their first related article.14
Public comments to this proposed policy expressed concerns regarding the financial burden that such a detailed level of data sharing would place on investigators, emphasizing the related infrastructure needed for such data sharing and the reallocation of already limited resources away from primary research. In addition, critics pointed out that the timeline for sharing genomic data could limit researchers’ ability to “perform adequate quality control.” The NIH acknowledged the “significant effort to prepare the data for sharing,” but maintained that this burden was “warranted by the significant discoveries made possible through the secondary use of the data.”15
Notably, the 2014 GDS Policy requires investigators to request informed consent for future use and sharing of genomic data derived from cell lines or clinical specimens collected after the effective date. The federal regulations, which set the requirements for human subjects research, Subpart A of which is called the “Common Rule,” do not cover de-identified biospecimens and therefore do not require informed consent for de-identified specimen sharing.16 But the 2014 GDS Policy tightened this standard, arguing that “it is increasingly clear that participants expect to be asked for their permission to use and share their de-identified specimens for research,”17 even if those specimens are de-identified as defined by the HIPAA Privacy Rule (e.g., lacking name or address).18 This sets up a bifurcated system in which these additional protections do not apply to de-identified data, but do apply to the de-identified specimens from which those data are derived in the first place.
This federal justification for requiring informed consent for research with de-identified specimens and cell lines mirrors that which was used in 2015, when the U.S. Department of Health and Human Services released a Notice of Proposed Rulemaking to update the Common Rule.19 It too proposed that de-identified data remain outside the protections of the Common Rule, but that the regulations should be changed to newly cover de-identified specimens; it even cited the same three underlying studies as the 2014 GDS Policy to support this claim.20
That said, many commentators on the Notice of Proposed Rulemaking argued against the proposal to treat all biospecimens as inherently identifiable, citing concerns that it would make specimen research more expensive and less common and would restrict research productivity overall.21 The final revisions to the Common Rule therefore did not adopt this proposal writ large.22 The informed consent requirement for de-identified specimen research remains limited to federally-funded studies that generate genomic data. The only allowable exceptions to the informed consent requirement in the 2014 GDS Policy must be for “compelling scientific reasons.” Funded investigators are to request contributor consent for the “broadest possible sharing”; if such consent is not obtained, investigators are to submit data to controlled-access repositories.23
2020 NIH Policy
A new NIH Policy for Data Management and Sharing (2020 NIH Policy) updates the 2003 NIH Policy in several important ways. These include broadening the scope of covered research, from that which cost $500,000 per year, to “all research, funded or conducted in whole or in part by NIH, that results in the generation of scientific data.” In addition, while it maintains the requirement that data must be shared by the time of the first associated publication, it adds that even data that are not ultimately published must be shared by the end of the award period — whichever comes first. It invokes a standard for sharing “quality” data, which includes both the ability to validate and replicate research findings whether or not those findings are ultimately published.24
It also requires investigators to “maximize” the amount of data that can be shared (e.g., through the informed consent process), but acknowledges the potential “ethical, legal, or technical” factors that might limit such sharing. It encourages investigators to ensure that contributors are informed regarding what will happen with their data to respect their autonomy, and that factors that might impact sharing (e.g., limitations on consent for certain types of research) “travel” with the data to inform future users.25 This policy became effective in January 2023 and includes a commitment to updating the 2014 NIH GDS Policy, as well.26
2022 OSTP Memo
The most recent federal data-sharing memorandum, Ensuring Free, Immediate, and Equitable Access to Federally Funded Research Memorandum (2022 OSTP Memo), was released in August 2022. Its goals include enhancing equity and trust in government-supported science, and it broadens the scope of federal departments and agencies that must develop their own data sharing policy from those with over $100M in R&D funding to those with any funding.27
Perhaps most notably, it responds to what it describes as “years of public feedback” that the 12-month embargo period was “inequitable” in that it limited “immediate access [to published articles] to only those able to pay for it or who have privileged access through libraries or other institutions.” The 2022 OSTP Memo therefore requires that all published articles resulting from federal funding (including funding held by co-authors) be made “freely available and publicly accessible” without journal embargo or delay.28 OSTP also included highly stipulated guidance regarding the kinds of repositories in which investigators should deposit their data.29 These recommend that repositories should provide free and easy access,30 curation and quality assurance,31 common formatting,32 clear provenance,33 and fidelity to consent.34
In an attempt to move away from the “financial means and privileged access,” which, OSTP argued, are currently required to access cutting-edge scientific findings, the 2022 OSTP Memo cites values of “equal opportunity” in allowing “all Americans to benefit from the returns on our research and development investments without delay.” It delegates the National Science and Technology Council’s Subcommittee on Open Science to develop measures to additionally reduce inequities for “individuals from underserved backgrounds and those who are early in their careers,” as well as reduce the burden of data sharing on funded researchers generally.35
While the 2022 OSTP Memo also gives the Subcommittee on Open Science the task of coordinating engagement with stakeholders, “including but not limited to publishers…,” it lacks the deferential language with which the 2013 OSTP Memo described publishers’ value to the research enterprise. It also adds new language regarding transparency surrounding the generation of federally funded scholarship, including “authorship, funding, affiliations, and development status” of the work.36
Present Study
Before the 2022 OSTP Memo was released, we conducted semi-structured interviews with U.S. academic genetic researchers. While these interviews focused on genomic data specifically (i.e., the researchers were sampled via a PubMed publication of an article including genomic data), they discussed both genomic and other related phenotypic data. Previous data sharing policies have focused on data sharing with limited exploration of the related burden on funded researchers, a definition of industry partnership that no longer covers the complex scope of current data sharing partnerships, and somewhat contradictory stances on respecting contributor autonomy (e.g., discouraging participants from opting out of data sharing but also requiring consent for some specimen use). We therefore conducted this study to provide insights into the impact of federal data-sharing policies, with a focus on genomic data, through a qualitative exploration of perceived benefits and burdens of both sharing and using shared data resources, the translation of shared data into improved science, challenges with weighing industry interests, and considerations regarding informed consent under this dynamic governance landscape.
Materials And Methods
Recruitment
We identified prospective interviewees based on a PubMed review of 2017 – 2019 articles with at least one U.S. academic-affiliated corresponding author, which also indicated use of genomic data from at least one of the following types of genomic data stewards (i.e., entities that govern or oversee data resources): (1) a private steward (based on their inclusion in Research and Markets’ rank of direct-to-consumer (DTC) genetic testing companies, i.e., 23andMe, Ambry Genetics, Ancestry.com, Color Genomics, Gene by Gene),37 or (2) an academic, government, or consortia-related steward. We wanted to ensure that half of our sample used private stewards due to our specific interest in querying the under-explored relationship of the impact of private genomic data on research. The other half of our sample used non-private stewards, which ended up representing academic, government, and consortia-controlled data resources. We contacted the authors of approximately half of the identified articles – starting with those published most recently to aid in interviewee recollection and oversampling for female and Latino/Hispanic, African American or Black, or Asian researchers – via an email to the corresponding author (46% response rate). A more detailed description of recruitment is available in a previously published paper from these interviews.38
Interviews and Analyses
We generated a semi-structured interview protocol based on a literature review of different attributes of genomic databases and solicited input from qualitative methods experts and genetic researchers to identify confusing or unclear phrasing prior to recruitment. We asked interviewees questions regarding employment, why they chose a specific data steward(s) to answer their research question (if they had the choice to begin with), contributor protections, data usage agreements, funding, data-sharing, and research outcomes (our interview guide is available as an appendix to a previous publication39).
In our previous analysis, we focused on interviewees’ selection of database(s).40 Here, we focus on researcher perspectives regarding sharing their genomic and related phenotypic data, as well as using data shared by others. While interview questions focused on the database(s) identified in the author’s PubMed publication result, we also asked them to compare this with their experiences using other databases.
We carried out each 30 to 60-minute interview via Zoom or telephone between March and July 2020 (KSB, CK, MK). Our male and female-identifying interviewers were non-Hispanic White and none of the interviewers conduct their own research with genomic data. We provided interviewees with a $100 gift card following completion of the interview. We audio recorded and transcribed the interviews, reviewed the resulting transcripts for accuracy, and cleaned and de-identified them (CK). For the thematic analysis, we employed a method of interpretive description, using grounded theory.41 We characterized themes common across interviews and captured individual variation (KSB, KR, MGT, CK).
Our preliminary codebook was developed based on the structure of the interview guide, and then was iteratively edited after initial review and analysis of transcripts. All analysts concurred that thematic saturation was reached after 23 interviews. We then double-coded all transcripts (KR, MGT, CK) and met as a team to reconcile any discrepancies (KR, MGT, CK, KSB). We read through coded excerpts to identify relevant themes, which were then discussed with the entire team and consolidated into the final thematic analysis. This study was approved by the University of Michigan Institutional Review Board (HUM00175088), and each participant provided informed consent.
Results
Out of the 23 U.S. academic genetic researchers we interviewed, 11 used a private database in their reference article, and 12 used an academic, government, or consortia database. The majority of interviewees were female (n=13), non-Hispanic White researchers (n=14), with an average of 8.5 years at their current institution (see our previous publication for demographic tables42). Nearly all compared different types of databases beyond the one for which they were sampled, leading to a discussion of 70 distinct databases (30 academic data stewards, 13 government, 11 private, 8 NGOs, and 8 via collaborations).
Theme 1: Sharing Data was Seen as A Burden Without Reward
A major challenge discussed in all the federal data sharing policies is who should carry the burden of data sharing, and how to limit its weight. Our interviewees described cleaning, preparing, and depositing data into authorized government repositories as laborious for investigators and their teams. One interviewee believed that this problem was particularly compounded at primary data collection sites:
…those investigators are sort of like ‘we hate actually being one of the funded sites because we make all the phenotypes and all genotypes available immediately, and we’re so busy collecting all the data that we don’t even have time to analyze it.’ … So, [mandated data sharing is] sort of a double-edged sword…
Data sharing requires that investigators either take on the task themselves, “which is a huge undertaking,” or pay others to do it. But another interviewee described the problems of data sharing and cleaning even if the government provided funding for assistance (as it currently does). Data sharing is complex and requires a baseline of expertise, but it lacks attendant academic prestige. Thus, even when paid, the task was considered undesirable:
Keeping our labs motivated, keeping our post docs motivated, keeping them productive is hard enough and then having [to make] them go through some really cumbersome process to make their data available, which involves both bureaucratic work and work organizing and curating the data, which people don’t often see benefit from? So, yeah, I think it’s a lot of things that make [data sharing] challenging.
Not only was data sharing described as lacking academic prestige, but several interviewees also complained about the potential loss of academic opportunities in so doing. For example, one interviewee, discussing the current requirement of sharing project data (in effect since the 2003 NIH Policy), described the general hesitation that, if investigators share their data while still in the process of analysis for subsequent publications or grants, there could be another researcher who would “beat you, quote unquote, to the punch to find that new discovery within your own data.” Among researchers, this phenomenon is commonly referred to as being “scooped.”
In addition to receiving credit for a new discovery, this interviewee was particularly worried about securing additional grant funding if supporting preliminary data were already published by others: “I can be a good citizen, but how do you get a return on investment, right?” Although data sharing delays are supposed to be limited by the current federal requirements, interviewees also described how the lack of enforcement of those requirements compounded these issues as well as researcher uncertainty about the cost-benefit calculation of sharing data. One interviewee noted that current enforcement is “pretty bad in a lot of cases,” potentially unfairly compounding the burden of compliant researchers by enabling free riders who do not adhere to data sharing requirements.
As one interviewee summarized, “everyone needs to make this process easier” to enable investigators to share their data in the first place. The NIH puts “so much back on the researcher to make [data sharing] happen that I think it needs to be a little bit more centralized. Be sure it happens.”
Theme 2: Shared Data Often Lack the Quantity or Quality Necessary to Improve Science
As discussed above, the overarching goal of the federal data sharing policies is to improve science. But a second theme of our interviews was that shared data sometimes lack the quality to validate (required since 2003) and replicate (since 2020) research findings. The recent National Science and Technology Council’s report on Desirable characteristics of data repositories for federally funded research, which came out two years after these interviews, includes the need for repositories in the future to provide “curation and quality assurance” to improve “accuracy and integrity”43 as well as “clear provenance.”44 Demonstrating how far shared data resources will have to go to meet these standards, our interviewees described a landscape of shared data that sometimes lack the quantity and/or quality necessary to meaningfully translate those data into advanced knowledge.
For example, while one interviewee admitted that complaining about the lack of necessary shared (in this case, phenotypic) data “would not make me popular…amongst my peers,” several interviewees reported such challenges. They described a lack of related clinical, supporting, or methodological data necessary to validate or replicate published results. As one interviewee stated, shared data are:
…basically provided in such small scale without the necessary information that’s needed to really do the robust research needed…Like the [National Cancer Institute] has mandated data sharing for clinical trials, but [other investigators] upload publicly available just a fraction of the data you would need [to conduct a meaningful secondary analysis].
Although the requirement to share methods information has been in place since at least 2003, another interviewee discussed the lack of access to the methods by which datasets were generated. This forced them to go back and read related papers to try to understand how shared data were generated “and whether you think that was valid or not.” Conversely, a different interviewee described their challenges in sharing such methodological information — especially when working in a large consortium where each site had different IRB and consent requirements and, therefore, different methodological scope.
Others voiced concerns about the quality of data that were shared. One interviewee stated that they would be worried about using a publicly available dataset where curation “was not rigorously performed and reputable…” As they pointed out, “It’s very easy to put a whole bunch of crap out there…”
Further, demonstrating the circular nature of challenges with sharing data and using data that have been shared, one interviewee observed that concerns about being “scooped” made them “feel like researchers sit on that data for a really long time because they want to get as much as they can from their labs before they share it,” which in turn led to data being dated “by the time it gets released to everyone else.”
Theme 3: Private Interests Can Limit the Amount of Data Funded Investigators Share
“Free and easy access” is another component of the new NSTC’s desirable repository characteristics.45 Our previous analysis indicated that the concept of “easy access” was most closely aligned with the use of private data stewards.46 Thus, while the 2003 NIH Policy focused mostly on private “co-funding,” and more recent policies have discussed industry interests in terms of publishers, our interviews focused on another component of public-private collaboration: funded researchers using privately held genomic data for their work.
About half of our interviewees described their experiences working with private data stewards. We have previously described the benefits interviewees perceived in working with private stewards,47 but interviewees also discussed why private stewards wanted to work with them, as academic researchers, as well. Perceived advantages for private data stewards to collaborate included co-authorship, learning new methods, publicity, and the ability to attract new customers for direct-to-consumer genetic testing products. Altruism was also described as a driving motivation. Data stewards, including private ones, were seen as “happy to see that their data is used for interesting scientific questions.”
One interviewee discussed the potential value for private data stewards of dataset validation when analyses relying on their data are published in reputable journals. For example, one interviewee said that when they published, the private data steward they worked with linked to their article on their website: “So that at least it looks like they are working with name brand institutions when other researchers are looking at the website to see whether they should work with them.” Some private datasets rely on self-reported (as opposed to clinician or researcher-captured) phenotypic information, which has been criticized as potentially lacking validity.48 But, one interviewee argued, publishing with this kind of data carries with it de facto validation of the underlying dataset itself:
…it legitimizes [the private data steward] as a company, makes them look better in their research…they get their genetic insights followed up on and they prove that the way they collect data is valuable, mainly by the self-report, and maybe that helps them build a case for then selling the data to various drug development companies.
In addition, while the 2013 OSTP Memo specifically encouraged public-private collaboration to “maximize interoperability and creative reuse as well as the impact of federal funding,”49 our work told a different story. Despite perceived benefits to private stewards of sharing data with academic researchers, several interviewees also spoke of challenges with intellectual property rights in private data which limited their sharing and utility. They specifically described their experiences with private data stewards who would not let them share proprietary data with either journals or government databases:
…this became a really important roadblock for us in terms of publishing the paper, because basically, the journal said, ‘Your paper’s interesting. We would love to see a revision. But you need to make the data available.’ And [the private data steward] said, ‘Well, we can’t do that.’
This interviewee found this experience particularly frustrating because they believed that this steward had not been candid regarding what the data sharing restrictions would be in advance, and the journal ultimately rejected the paper because they could not deposit the data “in dbGaP or something like that.”
Another interviewee said that it was “totally public” that this same data steward restricts external investigators to only publishing up to 10,000 single nucleotide polymorphism-level results per paper. But they took issue with its reported justification for this policy as protecting the privacy of participants:
We have other ways to protect against re-identifiability of participants that, I think, make those concerns irrelevant. For example, we round the summary statistics that we make publicly available to five decimal places. We don’t give the actual real frequencies in our data, we instead posted 1000 Genomes [Project] allele frequencies. I think those precautions eliminate any concern about re-identifiability…
This interviewee agreed with the previous one that such intellectual property stipulations ultimately limited the usefulness of sharing the results.
Last, an interviewee discussed another source of federal funding linked to privately held genomic data: the Centers for Medicare & Medicaid Services. When patients receive clinical genomic testing, the generation and analysis of those data are often outsourced to private testing companies. This can defray clinical costs for hospitals in that it centralizes and externalizes the expensive process of genomic analysis. But the resultant data are then generally considered the property of the private company that generated them, even if the patient used federally funded insurance to pay for the testing:
…our government is effectively paying for these tests to be done, but yet they have no obligation to deposit that data into publicly available resources for us to use…. I mean, there’s literally hundreds upon hundreds of thousands — if not millions — of patients who might have gotten some form of genomic testing, and that data is completely unavailable to us, even though the government paid, basically, for it.
Theme 4: Tensions Exist Between Broad Data Sharing and Contributor Consent
Our last theme surrounds the tension between data sharing and transparency with, and informed consent from, contributors. Federal data sharing policies encourage investigators to “maximize” data usage through the informed consent process. If there are exceptions to this maximal sharing, annotations for appropriate use are supposed to adhere and travel with data to limit future uses. But two interviewees discussed not actually knowing the institutional review board (IRB) rules for using secondary data to begin with: “…we just kind of had to make up everything as we went along.” Another acknowledged that they “don’t know what current guidelines or practices are actually for research use” but that they “always did wonder in the back of my head” whether patients knew that other researchers had access to identified information. A different interviewee pointed out the need for ongoing education because, even when they had informed consent discussions with their own contributors, “people sometimes would ask me: am I going to clone them? Like sit around cloning random people?!”
Others talked about concerns that the appropriate kinds of informed consent were not secured for banked data or specimens, and/or whether contributors understood what it meant in the first place. One interviewee stated that they “suspect that the people never really consented to giving the data” that were collected decades ago. In fact, only three interviewees knew the informed consent status of the contributors in the genomic data used for their article, and all three knew only in retrospect because the journal required them to disclose it.
Taken as a whole, these IRB and informed consent concerns could critically impact the ability of investigators to effectively use shared data resources, particularly when trying to combine different datasets. They also point to the limits of contributor comprehension even when full informed consent is offered. One interviewee therefore described informed consent status as:
…actually one of the big barriers to accessing data…there are a lot of datasets that we might have used, but the consent was actually more narrow, or precluded us actually considering or using that dataset.
Concerningly, this interviewee even speculated that some limitations on consent were intentionally drawn to avoid some of the burdens of data sharing described above:
There’s a balance between protecting individual study participants and data sharing. I think some scientists may act in bad faith and may tailor consents in ways that their data ends up not being able to be available, even though they can publish papers in journal and publish findings.
They went on to point out that the same might be true for IRB approval because future data sharing is also “all driven locally, right by your local IRB — but it’s also driven a bit by what you asked for as an investigator.”
Another interviewee specifically brought together concerns regarding IRB review and consent with private data stewards. The interviewee described the system of “contract IRBs,” which private companies can hire to review their research proposals, as problematic because contract review might not have the same quality of oversight. As opposed to academic IRBs, contract IRBs face pressure to be “in favor of the company’s wishes.” In terms of informed consent, they were also worried whether contributors to private databases realize “that their data could be sold to drug companies who are developing certain medications and then making money off those medications.”
Discussion
While the federal government continues to iteratively design and implement data sharing policies for funded research, many challenges remain. The goal of improving accessibility and impact is laudable, but our study demonstrates that sharing data is seen as a laborious burden without academic reward, shared data often lack the quantity or quality necessary for translation into improved science, private interests can limit data sharing and usefulness, and tensions remain between contributor autonomy and the advancement of science.
Importantly, the challenges our interviewees described are, by and large, neither novel nor due to rapidly changing technologies. They are the same challenges that the federal government has been grappling with for more than two decades. Our findings, rooted in the context of iterative data sharing policies, underscore important considerations for federal departments and agencies that are crafting data sharing policies in response to the 2022 OSTP Memo.
First, many interviewees bemoaned the time-consuming process of preparing and depositing data into authorized government repositories. Despite federal reassurance to researchers that related costs can be included in grant budgets, our interviewees highlighted the remaining tension that data sharing tasks require a high level of technical aptitude. These tasks are often assigned to post-doctoral fellows and other junior researchers who have the technical ability to do the work but lack the attendant academic prestige or production of academic deliverables that will further their careers. In addition, fellows and junior researchers often transition institutions in short order, so incoming trainees often must learn the process from scratch — taking even more time away from publication-producing research. Critically, our interviewees emphasized that financial cost (which is reimbursable) is not as valuable to them as time (which is not). Moreover, data sharing even presents the possibility of academic vulnerability via the risk of getting “scooped.”
These findings contrast with how the same interviewees, discussed in our previous paper, described using shared data resources to avoid the time-consuming and expensive process of generating their own data.50 The tension is cyclical: researchers can save time and money using previously generated data, but perceive time and money as being wasted when asked to share it themselves. The underlying concern seems to be that of the “free-rider,” researchers who access the benefit of common resources without contributing themselves. If researchers share their data, they want to be reassured that others’ data will also be there for them to use — a problem that several ultimately blamed on a lack of sufficient enforcement of data sharing quality and standards by the government.
Beyond the findings of our study, it is worth noting that the additional burden introduced by the 2022 OSTP Memo includes the immediate release of articles resulting from federal funding (including that held by co-authors). While journals had generally accepted the previous 2013 policy of a 12-month embargo for federally funded research without additional publication charges, the 2022 Memo lacked the deferential language to publishers of its 2013 counterpart. The immediate release of the article and data upon publication will affect journals’ business models more substantially and may lead to expanded publication fees, even if the submitted article was not supported by federal funding.51
While the 2022 OSTP Memo states that funded researchers may include open access fees in their budget proposals,52 questions remain regarding whether this will limit the flexibility of scientific discovery. For example, needing to precisely estimate the number of publications and related study costs years in advance, before even starting the work, will be challenging for prospective budget requests. In addition, substantial publication fees for research with a limited budget can disincentivize researchers from publishing all their findings and negative findings in particular, both of which are critical to informing the field and avoiding publication bias. High open-access fees could also affect collaboration among researchers. One could envision a non-federally funded research team declining the contributions of an author who receives federal funding so as not to put them in the position of having to pay for immediate release. Or researchers might publish the minimum number of articles they feel is necessary with a citation to their federal funding, and others without the funding citation to avoid fees. It will be interesting to see whether this new rule will result in a net gain in the amount of work that cites federal funding.
A second, and related, theme of our interviews is that shared data sometimes lack the quality to validate and replicate research findings. While much federal time and money has been devoted to encouraging and requiring mass data sharing, there is a dearth of empirical validation of the ability of those data to be translated into advanced science.53 As the federal government continues to invest federal time and resources (as well as the time and resources of federally funded researchers), the empirical validation necessary to support an actual cost/benefit analysis of resources is critical.
Our interviewees also described a high motivation of private data stewards to collaborate with academic researchers to validate and publicize their product. We have previously found that the number of academic publications using private genomic data has increased over time. In addition, almost half of publications from 2011-17 using sampled private genomic databases also cited at least some NIH funding for the research.54 Private data stewards can then profit from selling access to these validated databanks to other industry players, as illustrated by the recent $300M agreement between GlaxoSmithKline and 23andMe.55
But the academic-private interplay is not quite so clear-cut. Interviewees described not being able to share privately generated genomic data due to intellectual property concerns and contractual limitations. While this is certainly understandable from a business perspective (genomic data are an asset), the role of federal funding in building this asset remains underexplored.56 The 2014 GDS Policy importantly limits the scope of its applicability to funding used in the “generation” of genomic data, but it is unclear how the broadened scope of the 2022 OSTP Memo will change that standard.
This points to a complex relationship between federal funding and data sharing. As one interviewee pointed out, genomic data that are privately held may have been generated via Centers for Medicare & Medicaid Services clinical funding in the first place. And, as Alexis Walker found in her recent qualitative exploration of employees of private sector genomics, the vast amount of industry IP is actually “developed in academic labs …funded by the taxpayer.” One of her interviewees therefore “found it a bit egregious” that industry is then allowed to take that intellectual property, market it, and sell it back to patients at “massive margins.”57 If federal funding can be used to analyze genomic data that cannot ultimately be shared, and in so doing add value to the data as a business asset, there are potentially large gaps preventing the government from maximizing its investment in such public-private partnerships, a specifically stated goal.
Our interviews also highlighted a tension between the federal push for data sharing and protections for transparency and contributor consent. Our interviewees struggled to convey what kind of informed consent, if any, was provided for the information they used in established databanks. Only interviewees who were required to report it to their journal knew. This finding is consistent with our previous research which found that the type of contributor consent is not disclosed in academic papers using privately held genomic data almost half the time.58 One interviewee voiced the concern that investigators who share data might even weaponize consent requirements to intentionally disallow themselves from sharing data in the future. This would avoid both the perceived burden of data sharing and the risk of being scooped. A lack of information regarding type of consent for shared data generally limits researcher ability to adhere to such standards, as well as government enforcement of their requirements. While the 2022 OSTP Memo lacks discussion of the type of consent necessary in its new language requiring transparency, implementing departments and agencies should consider this key component of disclosure and enforcement.
In addition, the current 2014 GDS Policy, by applying protections to de-identified biospecimens but not de-identified data, de facto assumes that contributors are more concerned about protections for the research use of their specimens versus data.59 This was an argument also made by the 2015 Notice of Proposed Rulemaking for the Common Rule (but was not included in the final 2018 revision due to public response).60 Both pieces even cite the same three articles to support this claim.61 None of the articles, however, actually does so. The first, Kaufman et al., elicited participants’ willingness to participate in a biobank, but did not actually compare participants’ attitudes regarding specimens against those regarding data.62 Vermeulen et al. again only queried (Dutch) patients about consent preferences for specimens,63 and Trinidad et al. was normative and made no such comparative argument.64 In fact, when we recently surveyed a national sample of over 2,000 patients, as opposed to finding that respondents were more likely to want notice regarding the future use of their specimens, we found that respondents were more likely to want notice regarding use of their health information.65 Given the burden this biospecimen exceptionalism additionally poses for researchers, any new GDS policy potentially should re-consider this bifurcated requirement.
It is important to note that the findings reported in this research represent only a snapshot of the experiences of a relatively small sample of genetic researchers sharing and using shared data resources. Further research is necessary to generalize their experiences, such as surveys across a wider population, and an assessment of the relationship (if any) between researcher demographics, professional status, type of work, and experience. These interviews are a valuable step in this process.
Conclusion
Although the federal government has continued to expand upon and improve its data sharing policies over the past 20 years, complex challenges remain. Our findings demonstrate that the burden, translation, industry limitations, and consent structure of data sharing remain issues. Thus, while the U.S. government continues to focus on this important work, it is critical that implementing departments and agencies better understand the goals and challenges of genetic researchers expected to benefit from, and contribute to, these broadened shared data resources.
Acknowledgements
The authors would like to thank the 2022 Health Law, Policy, Bioethics, and Biotechnology Workshop and Prof. I. Glenn Cohen at Harvard Law School, and Prof. Brian J. Zikmund-Fisher for their feedback on a previous draft of this paper. This work was funded in part by the National Human Genome Research Institute (K01HG010496 and T32-HG010030), the National Center for Advancing Translational Sciences (UL1TR002240, R01TR004244), the National Institute of Mental Health (R01MH126937), and the National Cancer Institute (R01CA237118).
Data Availability
De-identified qualitative quotes, organized by theme rather than in full transcript presentation to further protect the identity of the interviewee, are available upon request.
Note
This study was approved by the University of Michigan Institutional Review Board (HUM00175088). The study data were de-identified. This study was performed in accordance with relevant guidelines and regulations, including those set forth in the Declaration of Helsinki. Informed consent was obtained from all subjects, and de-identified data were used for analysis and reporting. The authors declare no conflict of interest.