Define your data sharing scope

Before sharing any sequence data, determine whether the dataset contains human genomic variants or pathogen sequences. This distinction dictates the governing framework: data protection law such as the GDPR for human data, or the Nagoya Protocol to the Convention on Biological Diversity and repository terms such as GISAID's for pathogen data. Confusing the two categories leads to compliance failures.

Human genomic variants

If your data involves human subjects, it is personal data under the General Data Protection Regulation (GDPR) in the EU, which treats genetic data as a special category, and under comparable laws elsewhere. Genomic data is uniquely identifying and often cannot be fully anonymized. You must ensure any sharing mechanism respects data subject rights and has a lawful basis, such as explicit consent or public interest.

Pathogen sequences

Pathogen sequence data, such as viral genomes, may fall under access and benefit-sharing rules derived from the Convention on Biological Diversity (CBD), and under the terms of the database it is deposited in, such as GISAID. GISAID operates on a distinct sharing mechanism where access is free for those who agree to the Database Access Agreement and provide proper attribution. Unlike traditional open data, GISAID requires users to register and adhere to specific terms regarding data usage and credit, including when the data informs clinical or epidemiological findings.

Start by cataloging the origin and type of your sequences. If human, consult your Data Protection Officer to assess GDPR obligations. If pathogenic, review the GISAID Database Access Agreement or relevant national biosafety regulations. Clear scoping prevents costly rework later in the sharing process.
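
The cataloging step can be sketched as a small lookup that routes each dataset to its first review action. This is a minimal illustration, not a compliance tool: the catalog entries, the `kind` labels, and the `next_step` helper are hypothetical names introduced here.

```python
# Hypothetical routing table: one review action per dataset kind.
NEXT_STEP = {
    "human": "Consult the Data Protection Officer about GDPR obligations",
    "pathogen": "Review the GISAID Database Access Agreement or national biosafety rules",
}


def next_step(kind: str) -> str:
    """Return the first review action for a dataset kind, or fail loudly."""
    try:
        return NEXT_STEP[kind.strip().lower()]
    except KeyError:
        # An unscoped dataset should block sharing rather than pass silently.
        raise ValueError(f"Unscoped dataset kind: {kind!r}") from None


# Illustrative catalog; the names and origins are made up.
catalog = [
    {"dataset": "cohort_exomes", "origin": "hospital biobank", "kind": "human"},
    {"dataset": "flu_isolates", "origin": "sentinel clinics", "kind": "pathogen"},
]

for entry in catalog:
    print(entry["dataset"], "->", next_step(entry["kind"]))
```

Raising on an unknown kind is a deliberate choice in this sketch: a dataset that has not been scoped should stop the workflow.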

Anonymize human genomic variants

Sharing human genomic data requires stripping identifiers that could link a sequence back to a specific person. This process, known as de-identification, is essential for complying with genetic privacy laws while maintaining scientific utility. Without proper anonymization, shared sequence data remains vulnerable to re-identification attacks, even when names are removed.

The following steps outline the technical workflow for preparing genomic variants for public sharing. Treat them as a baseline and check them against the policies of your target repository, such as GISAID or the International Nucleotide Sequence Database Collaboration (INSDC), each of which sets its own submission and governance rules.

1
Remove direct personal identifiers

Strip all direct identifiers from the metadata associated with the sequence. This includes names, dates of birth, precise geographic coordinates (and city-level locations where the population is small), and hospital IDs. Direct identifiers are the most obvious route to re-identification, so remove them before any further processing.
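
A minimal sketch of this step, assuming the metadata for each sequence is held as a dictionary. The field names in `DIRECT_IDENTIFIERS` are hypothetical and must be adapted to your own metadata schema.

```python
# Hypothetical field names; replace them with the ones your schema uses.
DIRECT_IDENTIFIERS = frozenset(
    {"name", "date_of_birth", "latitude", "longitude", "hospital_id"}
)


def strip_direct_identifiers(record, blocked=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the blocked fields."""
    return {
        key: value
        for key, value in record.items()
        if key.lower() not in blocked
    }


# Illustrative record with made-up values.
record = {
    "sample_id": "S-001",
    "name": "Jane Example",
    "date_of_birth": "1990-02-14",
    "hospital_id": "H-778",
    "country": "CountryA",
}
print(strip_direct_identifiers(record))
```

The function returns a new dictionary, so the original record stays available inside the secure environment.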

2
Generalize geographic data

Replace precise locations with broader regional categories. Instead of sharing a specific address or hospital ward, use state, province, or country-level data. This reduces the risk of triangulating a patient’s identity through location-based inference, which is particularly important in small or isolated communities.
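
One way to sketch this generalization, assuming each location is stored as a list ordered from broadest to most specific level. The `small_regions` parameter is a hypothetical list of regions you consider too small to name, and the place names are placeholders.

```python
def generalize_location(levels, keep=2, small_regions=frozenset()):
    """Keep the `keep` broadest levels; fall back to country for small regions."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    coarse = list(levels[:keep])
    # For small or isolated regions, even the region name may be identifying.
    if len(coarse) > 1 and coarse[1] in small_regions:
        coarse = coarse[:1]
    return coarse


# Placeholder location, ordered country > region > city > ward.
precise = ["CountryA", "RegionB", "CityC", "Ward 4"]
print(generalize_location(precise))
print(generalize_location(precise, small_regions={"RegionB"}))
```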

3
Apply k-anonymity to demographic fields

Group any remaining demographic data (such as age or sex) into bins large enough to prevent unique identification. For example, instead of listing an age of 34, use a range like 30–39. Generalizing values this way is one route to k-anonymity, the property that each record is indistinguishable from at least k-1 other records in the dataset on the chosen fields.
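
The binning and the k-anonymity check can be sketched as follows. The ten-year band width, the field names, and the sample records are assumptions made for illustration.

```python
from collections import Counter


def age_band(age, width=10):
    """Map an exact age to a band label, e.g. 34 -> '30-39'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return all(size >= k for size in groups.values())


# Made-up records for illustration.
raw = [
    {"age": 34, "sex": "F"},
    {"age": 37, "sex": "F"},
    {"age": 31, "sex": "F"},
    {"age": 52, "sex": "M"},
    {"age": 58, "sex": "M"},
]
binned = [{"age_band": age_band(r["age"]), "sex": r["sex"]} for r in raw]

print(is_k_anonymous(binned, ("age_band", "sex"), k=2))  # True
print(is_k_anonymous(binned, ("age_band", "sex"), k=3))  # False
```

Here the smallest group (ages 50-59, male) has two records, so the binned data satisfies k=2 but not k=3. Widening the bands is one way to raise k.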

4
Audit for indirect identifiers

Review the dataset for quasi-identifiers: combinations of data points that, when linked with external sources, can re-identify individuals. Common quasi-identifiers include rare genetic variants combined with demographic data. Measure the risk of re-identification, for example by counting how many records are unique on their quasi-identifier combination, and adjust the data binning or suppression rates accordingly.
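
A simple audit can count how many records stand alone on their quasi-identifier combination and suppress groups below a chosen size. This is a sketch under stated assumptions, not a full risk model: the `carries_rare_variant` flag, the field names, and the threshold are hypothetical.

```python
from collections import Counter


def _key(record, quasi_identifiers):
    return tuple(record[field] for field in quasi_identifiers)


def unique_fraction(records, quasi_identifiers):
    """Share of records that are the only member of their group."""
    if not records:
        return 0.0
    sizes = Counter(_key(r, quasi_identifiers) for r in records)
    singles = sum(1 for size in sizes.values() if size == 1)
    return singles / len(records)


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier group has fewer than k members."""
    sizes = Counter(_key(r, quasi_identifiers) for r in records)
    return [r for r in records if sizes[_key(r, quasi_identifiers)] >= k]


# Made-up records: one participant carries a rare variant.
fields = ("age_band", "sex", "carries_rare_variant")
records = [
    {"age_band": "30-39", "sex": "F", "carries_rare_variant": False},
    {"age_band": "30-39", "sex": "F", "carries_rare_variant": False},
    {"age_band": "30-39", "sex": "F", "carries_rare_variant": False},
    {"age_band": "30-39", "sex": "F", "carries_rare_variant": True},
]

print(unique_fraction(records, fields))                 # 0.25
print(len(suppress_small_groups(records, fields, 2)))   # 3
```

Suppression trades scientific utility for privacy, so it is worth comparing against coarser binning before dropping records.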

5
Validate against repository standards

Before uploading, verify that the anonymized dataset meets the specific requirements of the target repository. GISAID and INSDC have distinct policies regarding data sharing and attribution. Ensure your metadata fields align with their controlled vocabularies to avoid rejection or compliance issues.
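
A pre-upload check against a controlled vocabulary might look like the sketch below. The vocabulary shown is hypothetical and is not the actual GISAID or INSDC term list; load the real terms from the target repository's documentation.

```python
# Hypothetical vocabulary, NOT the real GISAID or INSDC term list.
VOCABULARY = {
    "sex": {"female", "male", "not provided"},
    "age_band": {"0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60+"},
}


def vocabulary_errors(record, vocabulary=VOCABULARY):
    """List every field that is missing or uses a term outside the vocabulary."""
    errors = []
    for field, allowed in vocabulary.items():
        if field not in record:
            errors.append(f"{field}: missing")
        elif record[field] not in allowed:
            errors.append(f"{field}: {record[field]!r} is not an allowed term")
    return errors


print(vocabulary_errors({"sex": "F", "age_band": "30-39"}))
print(vocabulary_errors({"sex": "female", "age_band": "30-39"}))
```

Collecting all errors in one pass, rather than stopping at the first, lets you fix a whole batch before resubmitting.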

By following this sequence, you can share valuable genomic data while protecting participant privacy. Always consult the latest guidelines from the target data repository, as privacy standards evolve rapidly in response to new re-identification techniques.

Use GISAID for pathogen sequences

GISAID (Global Initiative on Sharing All Influenza Data) is a widely used repository for pathogen sequences, originally for influenza and now also for viruses like SARS-CoV-2. Access is governed by the Database Access Agreement (DAA). Unlike open databases that may have fewer attribution requirements, GISAID requires users to credit contributors and adhere to specific data use terms.

To share pathogen sequence data ethically and legally through GISAID, follow this workflow:

1
Register and agree to the DAA

Create an account on GISAID. You must review and accept the Database Access Agreement (DAA). This agreement binds you to the sharing mechanisms and attribution rules that govern all data on the platform. Access is free of charge for individuals who agree to these terms.

2
Upload sequences with metadata

Submit your pathogen sequences along with essential metadata, such as sample collection dates and geographic locations. Accurate metadata is critical for downstream analyses and makes the data useful to the broader scientific community. Check every field before submission, as corrections can be difficult afterwards.
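
Because corrections are difficult after submission, a local pre-submission check can catch obvious problems. The sketch below assumes hypothetical field names and ISO-formatted dates; it does not reproduce GISAID's actual submission schema.

```python
from datetime import date

# Hypothetical required fields, not GISAID's actual schema.
REQUIRED = ("virus_name", "collection_date", "location", "submitting_lab")


def metadata_problems(record, today=None):
    """Return a list of problems found in one submission record."""
    today = today or date.today()
    problems = [f"missing {field}" for field in REQUIRED if not record.get(field)]
    raw = record.get("collection_date")
    if raw:
        try:
            collected = date.fromisoformat(raw)
        except ValueError:
            problems.append("collection_date is not a valid ISO date")
        else:
            if collected > today:
                problems.append("collection_date is in the future")
    return problems


# Made-up record for illustration.
record = {
    "virus_name": "hypothetical/2024/001",
    "collection_date": "2024-05-01",
    "location": "CountryA / RegionB",
    "submitting_lab": "Example Lab",
}
print(metadata_problems(record, today=date(2024, 6, 1)))
```

Passing `today` explicitly keeps the check reproducible in tests; in normal use it defaults to the current date.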

3
Manage data release timing

Check what control the repository gives you over when your data becomes publicly available. Where a hold is supported, you can restrict access to your sequences for a period (e.g., 60 days) to allow for your own analysis or publication before the data is released to the general public. This helps protect your work while still contributing to global surveillance.
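
To plan around a hold period, the release date is simple date arithmetic. The sketch below uses the 60-day example from the text as a default; the function names are hypothetical, and the actual hold options depend on the repository.

```python
from datetime import date, timedelta


def release_date(submitted, hold_days=60):
    """Date on which held data becomes public, given a hold in days."""
    if hold_days < 0:
        raise ValueError("hold_days must be non-negative")
    return submitted + timedelta(days=hold_days)


def is_public(submitted, on, hold_days=60):
    """True if the data is public on the given date."""
    return on >= release_date(submitted, hold_days)


submitted = date(2024, 1, 15)
print(release_date(submitted))                  # 2024-03-15
print(is_public(submitted, date(2024, 3, 1)))   # False
```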

4
Verify attribution settings

Before finalizing your upload, confirm that your institution and author details are correctly linked to the sequences. GISAID’s system automatically tracks contributions to ensure proper citation in any publications or reports that use your data. This step is vital for maintaining your academic credit and compliance with funding agency requirements.

Using GISAID’s structured approach helps balance the need for rapid global data sharing with the rights of data contributors. By adhering to the DAA, you help maintain a trusted ecosystem where scientists can collaborate without legal ambiguity.

Draft a data use agreement

A Data Use Agreement (DUA) sets the legal boundaries on how recipients may use your shared sequence data. Without it, recipients may publish findings before you, use the data for commercial gain, or share it with unauthorized third parties. This document turns a simple data transfer into a controlled, accountable exchange.

You need to define exactly who can access the data, for what purpose, and for how long. The agreement should explicitly forbid re-identification attempts and mandate secure storage. By setting these rules upfront, you protect the privacy of the individuals behind the sequences and maintain your institution’s compliance with regulations like HIPAA or GDPR.

Key clauses to include

Start by defining the Scope of Use. Specify whether the data is for academic research, public health monitoring, or commercial development. Any use outside this scope constitutes a breach. Be specific about permitted analyses to avoid ambiguity later.

Next, outline Data Security Requirements. Recipients must store the data on encrypted, access-controlled servers. They should not share the raw sequences with anyone not listed in the agreement. Include a clause requiring immediate notification if a security breach occurs, so you can mitigate potential privacy risks.

Finally, establish Publication and Attribution Rules. Many disputes arise when recipients publish results without citing the data provider, or without delaying release to allow the original team to publish first. A clear timeline for review and acknowledgment protects your intellectual contribution while ensuring fair credit.
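
The three clauses above can also be tracked in machine-readable form so that access requests are checked against the signed terms. This is a sketch for record keeping, not legal language: the `DataUseAgreement` class, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataUseAgreement:
    recipients: frozenset          # who may access the data
    permitted_purposes: frozenset  # scope of use
    expires: date                  # last day access is permitted
    embargo_until: date            # earliest date recipients may publish

    def allows(self, recipient, purpose, on):
        """True if this recipient may use the data for this purpose today."""
        return (
            recipient in self.recipients
            and purpose in self.permitted_purposes
            and on <= self.expires
        )

    def may_publish(self, on):
        """True once the provider's publication embargo has passed."""
        return on >= self.embargo_until


# Made-up agreement for illustration.
dua = DataUseAgreement(
    recipients=frozenset({"lab-a"}),
    permitted_purposes=frozenset({"academic research"}),
    expires=date(2025, 12, 31),
    embargo_until=date(2025, 6, 30),
)
print(dua.allows("lab-a", "academic research", date(2025, 1, 10)))
print(dua.allows("lab-a", "commercial development", date(2025, 1, 10)))
```

Freezing the dataclass mirrors the idea that signed terms should not change silently; an amendment would be a new record.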

Official templates and resources

Do not start from scratch. Use existing frameworks that are already tested in the scientific community. The Global Initiative on Sharing All Influenza Data (GISAID) provides a robust Database Access Agreement that sets a high standard for sequence data sharing. Their model ensures that users identify themselves and adhere to strict ethical guidelines.

Similarly, the International Nucleotide Sequence Database Collaboration (INSDC), whose members are GenBank, ENA, and DDBJ, publishes policies and submission guidelines. These resources provide a baseline for language that is widely recognized by institutions and journals. Adapting them saves time and reduces the risk of legal oversights.

Verify compliance before upload

Before you submit sequence data to a public repository or share it with a partner, you must confirm that all legal and ethical obligations are satisfied. This final check prevents accidental breaches of patient privacy or institutional agreements.

Review your consent forms to ensure they explicitly cover the intended use of the data. If the data includes human genomic sequences, verify that no protected health information (PHI) remains attached to the metadata. Check that any required data use agreements are signed and on file.

Confirm that the receiving platform’s terms align with your sharing goals. For example, GISAID requires users to agree to its Database Access Agreement to ensure open access while maintaining attribution standards. Similarly, the INSDC guarantees free access but relies on strict submission guidelines to maintain data integrity.

  • Consent forms cover public release
  • PHI stripped from metadata
  • Data use agreements signed
  • Repository terms reviewed

Only proceed with the upload once every item in this checklist is marked complete. This step protects your institution and the subjects whose data you are sharing.
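
The checklist can be enforced as a gate in an upload script. The sketch below uses the four items listed above; the `status` dictionary and the function names are hypothetical.

```python
CHECKLIST = (
    "Consent forms cover public release",
    "PHI stripped from metadata",
    "Data use agreements signed",
    "Repository terms reviewed",
)


def outstanding_items(status):
    """Return checklist items that are not yet marked complete."""
    return [item for item in CHECKLIST if not status.get(item, False)]


def ready_to_upload(status):
    """True only when every checklist item is complete."""
    return not outstanding_items(status)


# Made-up status: one item still open.
status = {item: True for item in CHECKLIST}
status["Data use agreements signed"] = False

print(ready_to_upload(status))
print(outstanding_items(status))
```

Items missing from `status` count as incomplete, so a forgotten entry blocks the upload instead of passing by default.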

Common questions about sequence data

Users often confuse technical sequence generators with global biological data sharing. Below are specific answers regarding Informatica IICS and GISAID access.