Planning for Data Management and Sharing

This article describes a data management and sharing plan and its components. It also highlights relevant policies, resources, and tools for planning data management and sharing activities.

What is a Data Management and Sharing Plan?

A data management and sharing plan, sometimes referred to as a DMSP or DMP, is a document that describes how research data will be collected, managed, securely stored, and made shareable during and at the end of a research project. It is typically two pages in length and required by a growing number of funding agencies as part of the grant application process.

Getting Started

Before you begin writing your data management and sharing plan, we recommend reviewing your funding agency’s data management and sharing policy. The policy will fully describe the funding agency’s expectations for managing, preserving, and sharing research data.

Most research at UNC goes through one of these five funding agencies:

You can find other funding agency public access plans via Science.gov: https://science.gov/Public-Access-Plans-Guidance.html

Once you’ve reviewed the funding agency policy, we recommend gathering all materials from your current grant proposal, even if they are merely drafts. These documents will more than likely contain information relevant to your data management and sharing plan.

Writing the Plan

A data management and sharing plan should be about two pages. Typically, a plan requires the following information:

  • Data types, formats, and estimated size

  • Documentation and metadata standards

  • Roles and responsibilities

  • Security and storage

  • Sharing and preservation

  • Access restrictions, limitations, licensing

This is not an exhaustive list. Some policies may require other information, so be certain to note those requirements.

A useful tool for writing your plan is the DMPTool. It provides templates that include all required components of a funding agency’s data management and sharing policy.

Note: UNC researchers requesting a DMSP Review must draft their DMSP in the DMPTool.

Learn more about the DMPTool

Learn more about the RDMC DMSP Review Service

Data Types, Formats, and Estimated Size

Describe the expected types of data that will be generated during your research project. Include a description of how the data will be generated/collected as well. For instance, will your research produce sequencing, imaging, or experimental data? How will that data be generated?

Also provide an estimate of how much data you anticipate collecting. How many participants will be enrolled, or how many experiments will be conducted? How much data do you expect to collect (provide estimated file sizes)?
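
A back-of-the-envelope calculation can help with this estimate. The sketch below multiplies an expected file count by an assumed average file size for each data type and sums the results. The data types, counts, and per-file sizes are hypothetical placeholders chosen for illustration; substitute figures from your own project.

```python
# Hypothetical inputs: (data type, number of files, average size per file in MB).
# Every value here is a placeholder for illustration, not a measured figure.
EXPECTED_FILES = [
    ("survey exports (.csv)", 24, 0.5),
    ("focus group audio (.mp3)", 12, 7.0),
    ("focus group transcripts (.pdf)", 12, 0.25),
]


def estimate_total_mb(expected_files):
    """Return the estimated total size in MB across all data types."""
    return sum(count * avg_mb for _, count, avg_mb in expected_files)


def format_size(total_mb):
    """Format a size given in MB as MB or GB (using 1 GB = 1000 MB)."""
    if total_mb >= 1000:
        return f"{total_mb / 1000:.1f} GB"
    return f"{total_mb:.1f} MB"


if __name__ == "__main__":
    for name, count, avg_mb in EXPECTED_FILES:
        print(f"{name}: {count} files x {avg_mb} MB = {count * avg_mb:.1f} MB")
    print("Estimated total:", format_size(estimate_total_mb(EXPECTED_FILES)))
```

Rounding the total up in the plan leaves room for derived files such as cleaned datasets and analysis outputs.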

Along with the types and amount of data being generated during your project, it is important to describe the formats in which your research data will be stored and shared. Are you using a community standard file format? Are the file formats stable enough for long-term preservation?

We offer guidance on recommended file formats for a variety of data types such as Text and Qualitative Data. The Data Curation Network also has data primers available with file formats that lend themselves to preservation and sharing.

When describing your data formats, we also recommend listing any software required to access those data. Some things to consider are whether your data can be read into open-source software that is readily available and used by your research community. If not, provide information on the specific proprietary software being used in your research.

Example Language

This project will produce pre- and post-intervention health training assessment survey data and post-intervention training focus group data. Data will be collected from 120 participants, generating 24 assessment survey datasets and 12 focus group audio files and transcripts totaling approximately 100 MB in size. The following data files will be used or produced during the project:

  1. REDCap survey data will be exported to a .csv file and converted to Stata files for analysis.

  2. Focus groups will be recorded as .mp3 files, then transcribed and coded using NVivo. The transcripts will be saved as .pdf files and the NVivo database will be exported as a .qdpx file.

Survey data will be made available in Stata format (requires at least Stata SE 16) and in tabular (.csv) format, which can be opened and manipulated in Excel, R (v3.6 or later), or other commonly used statistics programs. Focus group data will be made available in .pdf format for long-term preservation and can be transferred to commonly used qualitative data analysis (QDA) software programs.

Metadata Standards and Documentation

A metadata standard is adopted by a research community as a means of describing research data to facilitate discovery, re-use, and understanding. Users should be able to understand what they can and cannot do with your data, how the data were collected, who collected the data, and the purpose of the study. Please note that some communities have metadata standards for describing data while others may not have an adopted standard.

The Research Data Alliance maintains a comprehensive Metadata Standards Catalog, which can be browsed by scheme and subject or searched across fields such as funder and data type.

Additionally, when identifying a data repository for sharing your data, check which metadata standards it supports and whether they are appropriate for your data needs.

Documentation further describing your research data, methodological approach, compute environment and statistical software, and any manipulation of your data should also be addressed within your data management plan. These documents should be provided to help users further understand the context and contents of your research outputs. Documentation can include, but is not limited to, codebooks, data dictionaries, READMEs, study protocols, survey instruments, and methodology reports. If your funding agency asks for information regarding documentation, describe the types of documentation you will be sharing along with your research data.
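
As one illustration of a data dictionary, the sketch below builds a starter table from a tabular file: one row per variable, with counts of present and missing values and an example value. It is a minimal sketch rather than a required format. The function name, the sample data, and the choice of missing-value codes (an empty cell or "NA") are assumptions made for this example. Variable descriptions, units, and code meanings would still need to be written by the research team.

```python
import csv
import io


def build_data_dictionary(csv_text, missing_codes=("", "NA")):
    """Summarize each CSV column: name, present count, missing count, example value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    summary = {
        name: {"variable": name, "n_present": 0, "n_missing": 0, "example": ""}
        for name in fieldnames
    }
    for row in reader:
        for name in fieldnames:
            value = (row[name] or "").strip()
            if value in missing_codes:
                summary[name]["n_missing"] += 1
            else:
                summary[name]["n_present"] += 1
                # Keep the first non-missing value as an example for the reader.
                if not summary[name]["example"]:
                    summary[name]["example"] = value
    return list(summary.values())


# Hypothetical sample data for illustration only.
SAMPLE = """participant_id,pre_score,post_score
P001,12,18
P002,NA,20
P003,15,
"""

if __name__ == "__main__":
    for entry in build_data_dictionary(SAMPLE):
        print(entry)
```

A table like this can be exported alongside the dataset and extended by hand with a description column for each variable.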

Learn more about Metadata

Learn more about Documentation - READMEs

Learn more about Documentation - Codebooks

Example Language

To facilitate interpretation of the data, the following materials will be shared and associated with the relevant datasets: study protocols, survey instruments, and codebooks, along with the qualitative schedule of questions, coding schema, and code reports of frequency and density.

To facilitate their efficient use, all our data and materials will be structured and described using the following standards:

Formal standards for pre- and post-intervention training assessment data have not yet been widely adopted. However, our data and other materials will be structured and described according to best practices.

Data will be stored in commonly used and open formats, such as .csv and Stata .dta for survey data and .pdf and .qdpx for focus group data, complying with the REFI-QDA Standard for qualitative data transfer and the FAIR guiding principles. Information needed to make use of these data [e.g., the meaning of variable names, codes, information about missing data, other metadata, etc.] will be recorded in codebooks and coding schema that will be accessible to the research team and subsequently shared alongside final datasets.

Information about our research process, including the details of our analysis pipeline, will be maintained contemporaneously using study protocols and a qualitative coding scheme. This information will be accessible to all members of the research team and will be shared alongside our data.

Roles and Responsibilities

A key component of a data management plan is identifying the person(s) responsible for managing and sharing your data. These roles can belong to one person or be distributed across team members. In some cases, your research project may warrant hiring a data archivist or data steward to complete the curation and data repository deposits at the end of your project.

Make sure to clearly describe who will perform which actions on the expected research data from your project. If you need to hire staff to handle these tasks, be sure to include that within your budget.

Note: many funding agencies are aware of the costs necessary to ensure data management activities are completed; therefore, it is expected that these costs be included in the proposed budget.

Learn more about Roles & Responsibilities

Example Language

The following individuals will be responsible for data collection, management, storage, retention, and dissemination of project data, including updating and revising the Data Management and Sharing Plan when necessary.

  • PI Name, Researcher, Institution/Department, ORCID, email

    • Role: oversight of data management activities to ensure compliance with funding agency requirements

  • Project Manager Name, Researcher, Institution/Department, ORCID, email

    • Role: data management tracking and oversight of task completion by deadlines; share data with research data archivist for data archiving

  • Data Manager Name, Researcher, Institution/Department, ORCID, email

    • Role: active data management responsibilities including data storage, data quality, data access and data restrictions

  • Analyst Name, Researcher, Institution/Department, ORCID, email

    • Role: de-identify data as needed for data sharing and archiving; analyze data

  • Research Data Archivist, Data Archive/Institution, ORCID, email 

    • Role: prepare data and documentation for data sharing; perform curation actions on data as needed such as file normalization, documentation review, metadata creation; archive data and make available according to IRB and federal funding agency requirements

Data Security and Storage

Information on where your data will be stored, who will have access, and how potentially identifying data will be kept secure should be included throughout your data management and sharing plan. This information will tie into data access restrictions and limitations to sharing.

A few things to consider when drafting this information:

  1. Will the data generated during this project include personally identifiable information (PII) or protected health information (PHI)? If so, how will the data be protected during and after collection and analysis?

  2. Should you consult with your department IT or ITS Research Computing for a secure storage solution? How long will you need to keep these data secure? Who will have access to these data? Can they be requested with a data use agreement and IRB approval?

We recommend consulting with external collaborators, experts, and/or university services during the planning phase to ensure you account for costs and time within your proposal and DMSP.

Example Language

Researchers will be required to comply with IRB protocols and to ensure the data are stored on a secure, off-network system with access limited to only approved project members.

Data Sharing and Preservation

Your data management plan should include information on how and where you will share and preserve the research data generated by your project. What repository(ies) will you be using? Will the data be publicly available for download, or will users need to request access? How long will your data be made available within the data repository?

The first step to answering these questions is to identify an appropriate, trustworthy data repository. Review your funding agency’s data sharing guidance to see if they require data to be shared in a designated repository. If not, you will need to locate the most appropriate repository for your data.

Domain-specific Data Repositories

Domain-specific data repositories are built to support and preserve specific data formats from their designated research community. These repositories usually offer features and metadata standards relevant to their domain and have staff and expertise available to answer questions related to preserving commonly used data types. There are many domain-specific data repositories available; however, they do not exist for all data types and/or disciplines.

The NIH Repositories for Sharing Scientific Data lets researchers search by keyword or institute/center. NIH also offers guidance for Selecting a Data Repository, which describes the desirable characteristics a data repository should have to meet the requirements of the NIH data management and sharing policy.

The Registry of Research Data Repositories (re3data) is another useful database for finding domain-specific data repositories. It allows users to search by keyword or browse by content type, subject, or country. Results show whether a repository offers licensing, provides open access to data, and uses persistent identifiers.

Generalist Data Repositories

If a domain-specific data repository doesn’t exist in your field, then a generalist data repository is most likely an appropriate fit for sharing and preserving your research data. A generalist data repository will share and preserve any data regardless of file format, type, or discipline. There are a handful of popular generalist data repositories available to researchers:

UNC Dataverse (for UNC researchers)

Dryad

Figshare

Mendeley Data

Zenodo

Once you have identified an appropriate data repository, include information about when and how the data will be shared within that platform. You will also want to describe how the data will be made findable within the repository and for how long. Does it use persistent identifiers? What is the repository’s commitment to maintaining access to its holdings?

Example Language

Tabular dataset(s) will be deposited in the UNC Dataverse, a generalist data repository managed by the Research Data Management Core (RDMC) at the University of North Carolina at Chapel Hill. UNC Dataverse provides persistent identifiers and robust standardized metadata and is committed to long-term preservation of and access to research data. Data are published under a CC0 license by default with customizable terms of use as needed. Additionally, UNC Dataverse is routinely backed up and preserved on multiple geographically distributed servers and is a member of Data-PASS, a community committed to the sustainability and access of research data.

Qualitative data will be shared in Syracuse University’s Qualitative Data Repository (QDR), through the institutional membership managed by the RDMC. QDR provides the UNC community with dedicated expertise and a customized data repository for preserving and sharing qualitative data sets to ensure long-term access and use.

QDR assigns persistent digital object identifiers (DOIs) and provides annotations, data citations, and multiple export formats for bibliographic citations. Data citations enable QDR to track the re-use of datasets. QDR staff conduct multiple backups and routine file integrity checks and monitor for file-format obsolescence. The published data in QDR will be linked with the survey data published in UNC Dataverse to facilitate discovery.

Data will be made available once analysis is completed or at the time of associated publication, whichever comes first. Data will be stored in UNC Dataverse and the Qualitative Data Repository for at least 10 years after the project performance period, as indicated in their respective preservation policies.

Access Restrictions, Limitations, and Licensing

Funding agencies are aware that not all data generated and used to report the results of a funded project can be fully shared due to ethical, technical, or legal limitations. If you will be collecting data that may be too sensitive, may have legal consequences, or may be too large to share entirely, describe those factors within the DMSP and provide information on how you will make as much data available as possible and under what conditions.

Learn more about Data Access Restrictions

A few questions to ask yourself:

  1. Can I de-identify and clean the data sufficiently to ensure participant privacy is not compromised while also maintaining the utility of these data? If not, can the data be stored securely and requested through a secure transfer protocol and data use agreement?

  2. For big data, are there subsets of the data that I can share that will provide users with enough information for re-use? If not, can a data access process be created that permits users to analyze the data via the institution’s equipment or is there a way to transfer data through a secure protocol?
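
To make the first question more concrete, the sketch below shows one small step in preparing a public-use file: dropping columns that hold direct identifiers from a tabular dataset. It is a minimal illustration under stated assumptions, not a complete de-identification procedure. The column names, the sample rows, and the function name are hypothetical, and removing columns alone does not address indirect identifiers, free-text comments, or small cell sizes, which still require review under your IRB protocol.

```python
import csv
import io

# Hypothetical column names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "date_of_birth"}


def drop_identifier_columns(csv_text, identifiers=DIRECT_IDENTIFIERS):
    """Return CSV text with the listed identifier columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep only columns whose lowercased name is not a listed identifier.
    kept = [col for col in (reader.fieldnames or []) if col.lower() not in identifiers]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # columns not in `kept` are ignored
    return out.getvalue()


# Hypothetical sample rows for illustration only.
RAW = """participant_id,name,email,pre_score,post_score
P001,Ada Example,ada@example.org,12,18
P002,Ben Example,ben@example.org,14,20
"""

if __name__ == "__main__":
    print(drop_identifier_columns(RAW))
```

Keeping the list of removed columns in the project documentation makes it easier to describe the de-identification steps in the DMSP and in any data use agreement.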

Funders ask that researchers make a best effort to share as much data as feasibly possible. If there are questions about your expected limitations for sharing, they will ask for clarification.

Example Language

All participants will consent to the sharing of aggregate and de-identified survey and focus group data. Any potentially identifying variables or focus group comments will be stripped from the public-use data in compliance with IRB protocols and human subjects protections.

Participants will have the option to consent to sharing their identifiable survey data for future research and scholarly use as part of a donation agreement. These identifiable survey data will be made available as a separate, identifiable dataset in UNC Dataverse.

Due to the small sample size and potential for re-identification, the raw data will not be made available for public use. Interested researchers wishing to build upon the raw data and transcripts may submit a data use agreement to request access to these data. Researchers will be required to comply with IRB protocols and to ensure the data are stored on a secure, off-network system with access limited to only project members approved in the data use agreement. Any breach of this agreement is subject to the terms of use stipulated in the data use agreement.

To ensure participant consent for data sharing, IRB paperwork and informed consent documents will include language describing plans for data management and sharing, describing the motivation for sharing, and explaining that personal identifying information will be removed from public-use data. A donation agreement will allow participants to decide whether to share their identifiable survey data for future research and scholarly use.

To protect participant privacy and confidentiality, public-use data will be de-identified using the safe harbor method as detailed by the US Department of Health and Human Services. This method will remove any variables or values within the data that could be used to re-identify a participant.

Example Data Management Plans

There are many examples of data management plans available for review. We recommend reading through a few DMSPs to see what content is included and how information is described.

Example DMS Plans Directory

DMPTool Public Plans

NIH Sample Plans

References

Bohman, L., Hertz, M., & Orlowska, D. (2022). Example DMS plans. Working Group on NIH DMSP Guidance. https://doi.org/10.17605/OSF.IO/UADXR

Horsburgh, J., Koskela, R., Lubas, R., Staples, T., & DataONE. (2011, August 30). Best Practice: Identify and use relevant metadata standards. Data Management Skillbuilding Hub. https://dataoneorg.github.io/Education/bestpractices/identify-and-use

National Institutes of Health. (n.d.). Repositories for Sharing Scientific Data | Data Sharing. Retrieved July 12, 2023, from https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/repositories-for-sharing-scientific-data

National Institutes of Health. (n.d.). Selecting a Data Repository | Data Sharing. Retrieved July 12, 2023, from https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/selecting-a-data-repository

Research Data Alliance Metadata Standards Catalog Working Group. (n.d.). Metadata Standards Catalog. Retrieved July 12, 2023, from https://rdamsc.bath.ac.uk/

 

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

 

RDM Guidance formatting was influenced by The Writing Center, University of North Carolina at Chapel Hill Tips & Tools handouts.