UK Biobank Data Breach: Confidential Health Records Exposed Online
Confidential health data from the UK Biobank project has been exposed online on numerous occasions, a Guardian investigation has uncovered. This revelation raises serious questions about the safeguarding of patient records by one of the United Kingdom's flagship medical research initiatives.
UK Biobank, which holds the medical records of 500,000 British volunteers, is recognised as one of the world's most comprehensive repositories of health information. It has been instrumental in driving significant breakthroughs in research areas such as cancer, dementia, and diabetes. However, scientists authorised to access Biobank's sensitive data appear to have been careless at times about keeping it secure.
Inadvertent Data Exposure and Privacy Risks
The exposed files, which seem to have been inadvertently posted online by researchers using the data, do not include direct identifiers such as names or addresses. Even so, they may still pose substantial privacy risks. One dataset discovered by the Guardian contained millions of hospital diagnoses and associated dates for over 400,000 participants.
With the consent of a Biobank volunteer, the Guardian was able to pinpoint what appeared to be extensive hospital diagnosis records for that individual. This was achieved using only their month and year of birth, along with details of a major surgery they had undergone. A data expert described the scale and persistence of this issue as "shocking", particularly in an era where artificial intelligence and social media make it easier to cross-reference information online.
UK Biobank has dismissed these concerns, asserting that no identifying data, such as names and addresses, were provided to researchers. In a statement, Prof Sir Rory Collins, the chief executive of UK Biobank, said: "We have never seen any evidence of any UK Biobank participant being re-identified by others."
Background and Data Management Practices
Established in 2003 by the Department of Health and medical research charities, UK Biobank holds genome sequences, scans, blood samples, and lifestyle information from 500,000 volunteers. Recently, the government extended Biobank's access to include volunteers' GP records. Scientists from universities and private companies worldwide apply for access, and until late 2024, they were permitted to download data directly onto their own computer systems.
The problem emerged because journals and funders increasingly require researchers to publish the code used to analyse large datasets. When intending to upload code, some researchers have accidentally published partial or entire Biobank datasets to GitHub, a popular online code-sharing platform. UK Biobank prohibits researchers from sharing data outside their systems and has introduced further training for all researchers to address this.
Legal Actions and Ongoing Concerns
In the past year, data leaks have become a more urgent concern for UK Biobank. Between July and December 2025, it issued 80 legal notices to GitHub, which complied with requests to remove data from the internet. However, much of the data remains accessible. Some files contain only patient IDs or test results for small numbers of participants, while others are more extensive. One dataset found online in January included hospital diagnoses and associated dates for about 413,000 participants, along with their sex and month and year of birth.
A data expert who reviewed the file commented: "It sent shivers down my spine to even open. I deleted the file immediately. It was very detailed and felt like a gross invasion of privacy even to glance at."
Re-identification Tests and Volunteer Perspectives
To assess the risk of re-identification, the Guardian approached several Biobank volunteers. One volunteer, a woman in her 70s, shared her month and year of birth and the month and year she had a hysterectomy. Only one person in the dataset matched these details, and the apparent match was corroborated by five other diagnoses from the records that the volunteer had not initially disclosed.
The volunteer expressed surprise, stating: "Effectively you were rehearsing the main parts of my medical history to me without me having given you any information at all. I didn't expect that." She added that while she was not overly concerned about her own data being exposed and intended to remain a participant, she was worried about whether Biobank had broken its agreement with people. "They said they would hold our data securely ... I just feel as though that has to come into the equation," she said.
Expert Criticisms and Ethical Dilemmas
Privacy experts argue that UK Biobank's approach is at odds with the reality that many people reasonably share some health information online, and in the age of AI, this can be easily identified and cross-referenced. Prof Felix Ritchie, an economist at the University of the West of England, questioned: "Are these people aware that the internet exists? The idea that they can rely on their volunteers never putting any other information out there about themselves is an entirely unreasonable thing to expect."
Dr Luc Rocher, associate professor at the Oxford Internet Institute, noted that removing identifiers often does not guarantee anonymity. Simply knowing a person's birthday and a specific medical event date might be sufficient to pinpoint their record with high confidence. "Once identified, that record could reveal sensitive information such as a psychiatric diagnosis, an HIV test result, or a history of drug abuse," they explained.
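The reasoning behind this point can be illustrated with a small sketch. The records below are entirely made up, and the field names (`birth`, `events`) and helper functions are hypothetical; nothing here reflects the actual format of any Biobank file. The sketch only shows how adding a single dated medical event to a month and year of birth can shrink the set of matching records to one, and how a data holder might count such unique combinations before releasing a file.

```python
from collections import Counter

# Synthetic, invented records for illustration only (no real data).
# "birth" is month and year of birth; "events" are (diagnosis, month) pairs.
records = [
    {"id": "A", "birth": "1950-03",
     "events": {("hysterectomy", "1995-06"), ("cataract surgery", "2015-02")}},
    {"id": "B", "birth": "1950-03",
     "events": {("cataract surgery", "2015-02")}},
    {"id": "C", "birth": "1950-03",
     "events": {("hysterectomy", "1998-11")}},
    {"id": "D", "birth": "1962-07",
     "events": {("hysterectomy", "1995-06")}},
]


def matches(rows, birth, event=None):
    """Return rows with the given birth month; if event is given, also require it."""
    return [
        r for r in rows
        if r["birth"] == birth and (event is None or event in r["events"])
    ]


def combination_counts(rows):
    """Count how many rows share each (birth, event) combination."""
    counts = Counter()
    for r in rows:
        for e in r["events"]:
            counts[(r["birth"], e)] += 1
    return counts


# Birth month alone leaves several candidates in this toy dataset.
by_birth = matches(records, "1950-03")

# Adding one dated event narrows the candidates to a single record.
by_both = matches(records, "1950-03", ("hysterectomy", "1995-06"))

# Combinations held by exactly one record are the ones at risk.
unique_combos = [k for k, n in combination_counts(records).items() if n == 1]

print(len(by_birth), len(by_both), len(unique_combos))
```

In a real file with hundreds of thousands of participants the numbers would differ, but the same count of single-record combinations is one rough way to gauge how much protection the removal of names and addresses actually provides.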
Prof Niels Peek, professor of data science and healthcare improvement at the University of Cambridge, described the scale of the problem as "shocking". He acknowledged that Biobank has taken the issue seriously and "done everything that one can reasonably expect", but added: "The scale and persistence with which this has happened demonstrates that there are huge tensions between the ambition to drive health research with data at scale and the legal and ethical imperative to protect people's privacy."
Experts have also raised doubts about whether Biobank can fully regain control of the data released online. Despite takedown efforts, many files remained available on a code archive website until shortly before publication, highlighting ongoing vulnerabilities in data security protocols.