Generating synthetic data to preserve patient privacy in cancer registry

Short description of the project

Population-based cancer registries have existed in Germany since the 20th century and now cover more than 13 million cancer cases, centralized at the Zentrum für Krebsdaten (ZfKD) at the Robert Koch Institute. The value of this data for legitimate use cases such as scientific research, health policy evaluation and general public information is immense. But centralizing such data also introduces a risk of reidentification for more than 11 million patients as of today. In 2023, the first delivery of data collected across cancer registries will include even more granular events, especially regarding therapy. This new data model increases both the value for legitimate use cases and the risk of reidentification. Moreover, the Federal Cancer Registry Data Act requires broader access to this data. Technical measures that protect against patient reidentification while conserving the value of the data for legitimate use cases are therefore increasingly necessary. Balancing data sharing with privacy protection is a common goal across the health sector. One relevant way to achieve it may be to generate synthetic datasets for external access.

This project aims to develop methods for producing synthetic data that conserve the statistical properties useful for selected use cases while protecting against reidentification. This goal entails developing generative models and evaluation methods, not only for the reidentification risk but also for data quality from a medical-domain perspective. The intended quality of the generated datasets may range from a faithful broad representation of the database, for a dataset made available to the general public, to more statistically precise data on a specific cancer type, for a dataset made available to a research team working on a limited scope.
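The trade-off between data quality and reidentification risk can be illustrated with a minimal Python sketch. It is not the project's method: the generator is a deliberately simple baseline that samples each column independently (not a neural network), the records are hypothetical toy values rather than registry data, and the function names are invented for illustration. The exact-match rate is only a crude proxy for reidentification risk, not a formal privacy guarantee.

```python
import random
from collections import Counter

# Hypothetical toy records (age_group, tumour_site); NOT real registry data.
REAL = [
    ("60-69", "lung"), ("60-69", "colon"), ("70-79", "lung"),
    ("50-59", "breast"), ("70-79", "prostate"), ("60-69", "breast"),
    ("50-59", "colon"), ("70-79", "lung"),
]

def sample_independent_marginals(records, n, rng):
    """Baseline generator: draw each column independently from its empirical marginal."""
    columns = list(zip(*records))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]

def marginal_tv_distance(real, synth, column):
    """Total variation distance between real and synthetic marginals of one column."""
    p = Counter(row[column] for row in real)
    q = Counter(row[column] for row in synth)
    values = set(p) | set(q)
    return 0.5 * sum(abs(p[v] / len(real) - q[v] / len(synth)) for v in values)

def exact_match_rate(real, synth):
    """Share of synthetic rows that replicate a real row exactly (crude risk proxy)."""
    real_set = set(real)
    return sum(row in real_set for row in synth) / len(synth)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
synthetic = sample_independent_marginals(REAL, 1000, rng)

tv_age = marginal_tv_distance(REAL, synthetic, 0)
tv_site = marginal_tv_distance(REAL, synthetic, 1)
match_rate = exact_match_rate(REAL, synthetic)

print(f"TV distance (age group):   {tv_age:.3f}")
print(f"TV distance (tumour site): {tv_site:.3f}")
print(f"Exact-match rate:          {match_rate:.3f}")
```

A low total variation distance means the synthetic marginals stay close to the real ones, while a high exact-match rate signals that many synthetic rows coincide with real records. Because this baseline ignores dependencies between columns, it preserves marginals but not joint structure, which is the kind of gap a learned generative model would be expected to close.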

The project intends to research the use of neural networks as generative models. Such models have been used successfully for data such as images and text, with recent advances in the state of the art. Applications of such models to tabular or relational data have received far less attention in the machine learning literature. Exploring machine learning approaches suited to such data is therefore another expected contribution of the project.

The project addresses the following “Essential Public Health Services” (EPHS):

  • #2: Investigate, diagnose, and address health problems and hazards affecting the population.
    Reidentification risk exists in every large health database. The goal is to mitigate the privacy risk while increasing reuse of the data.
  • #3: Communicate effectively to inform and educate people about health, factors that influence it, and how to improve it.
    A privacy-preserving synthetic dataset is intended to be made publicly available.
  • #4: Strengthen, support, and mobilize communities and partnerships to improve health.
    Increase reuse of the cancer registry data by partners for research, health policy design and evaluation, and innovation.
  • #5: Create, champion, and implement policies, plans, and laws that impact health.
    The Federal Cancer Registry Data Act requires broader access to cancer registry data while protecting privacy.

Date: 05.09.2023