Packaging Your Data for Sharing
This article helps you organize your research data and associated materials into a publishable package that supports transparency, reusability, and compliance with data sharing policies. These data packages can be deposited and published in a data repository for broader dissemination.
Introduction
A growing number of funding agencies, publishers, and science stakeholders are implementing data sharing policies. Many researchers, however, are not aware of techniques and best practices for sharing research data, and most doctoral training programs do not cover data management, information organization, or archival theory.
As we have seen from the early efforts of data sharing, the publication of a lone data set is not enough to support research transparency and re-usability. Future users and replicators need more information about the research design, data structure, data edits, and sometimes computational approaches.
Building a package of materials around your data helps ensure that others can reuse your data and that your research is transparent. Applying a ‘package’ framework to sharing data can help you assemble your materials and produce high-quality packages that achieve integrity, reusability, and compliance with data sharing policies.
What is a Data Package?
We introduce the term ‘data package’ as a framework for assembling, organizing, and documenting your research data and materials into a collection. This collection makes evident the data, process, and materials that you used in your research. In practice, a data package can be as simple as a folder containing your data and any relevant files that someone would need to understand and interpret your data.
Making a FAIR Data Package
FAIR is a set of guiding principles and practices for facilitating data discovery, access, and sensible re-use (Wilkinson, M. D., et al., 2016). The acronym names four desirable, high-level characteristics that data should exhibit:
Findable
Accessible
Interoperable
Re-usable
As you assemble your data package, FAIR can guide your decisions about what to include in your package, how to format your materials, and where to deposit your package. At decision points, you can ask: How can I make these data more Findable, Accessible, Interoperable, and/or Re-usable? What impact will this decision have on the FAIR-ness of the data?
For more information, see the guidance for FAIR Principles.
Collecting Your Files
As mentioned above, a data package is a final research product meant to disseminate the data underlying your results. At a minimum, a data package contains:
Evidence or data underlying your results (e.g., survey data, database, GIS shapefiles, transcripts, social media tweets).
Documentation of your research data process, including steps for data collection, preparation, and any transformations (e.g., methodology brief, instruments, data audit trail, scripts for data edits, qualitative coding/annotation process).
Explanation of your data structure and files (e.g., data dictionary, database entity and relationship (ER) diagrams, qualitative code/annotation definitions).
Description of the files in the data package, explaining file relationships, and capturing other relevant information. This file is usually called a README.
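As a minimal sketch of the last item, the Python snippet below drafts the file listing for a README by walking a package folder. The function name `build_readme`, the `descriptions` mapping, and the README.txt file name are illustrative assumptions rather than a required layout; a real README would also explain file relationships and other context in your own words.

```python
from pathlib import Path


def build_readme(package_dir, descriptions):
    """Return README text that lists every file in the package folder.

    descriptions maps a relative file path to a one-line description
    (a hypothetical convention for this sketch). Files without an entry
    are flagged with a TODO so they are not overlooked.
    """
    root = Path(package_dir)
    lines = [f"README for data package: {root.name}", "", "Files in this package:"]
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if rel == "README.txt":
            continue  # the README does not need to describe itself
        note = descriptions.get(rel, "TODO: describe this file")
        lines.append(f"- {rel} ({path.stat().st_size} bytes): {note}")
    return "\n".join(lines) + "\n"
```

For example, `build_readme("my_package", {"data/survey.csv": "Numeric survey responses"})` returns text you could save as README.txt and then extend by hand; any file still marked TODO is one a future user would not yet be able to interpret.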
Deciding what to include in, or exclude from, the documentation of your research process and data structure can be the most challenging part; however, these materials help a future user determine whether your data are appropriate for a study and interpret your data correctly. A guiding question for you at this stage is:
What would someone outside my research team need to know to use my data correctly?
Below are exemplar data packages for survey research and qualitative interview projects.
Survey Data Package | Qualitative Data Package |
---|---|
Numeric data (.csv) | De-identified transcripts and memos (.txt, Atlas.ti) |
Data dictionary with variable info (.txt) | Qualitative codebook with definitions, examples (.txt) |
Survey instruments (.pdf/a) | Schedule of questions (.pdf/a) |
Informed consent form (.pdf/a) | Informed consent form (.pdf/a) |
Survey methodology brief (.pdf/a) | Data collection and coding process brief (.pdf/a) |
README (.txt) | README (.txt) |
Data license with terms of use | Data Use Agreement for full data with PII/PHI (.pdf/a) |
We have compiled several guidance documents that discuss relevant topics, including Documentation - README, Documentation - Codebook, File Naming Conventions, Recommended Preservation File Formats for Text, Recommended Preservation File Formats for Qualitative Data, and Code Preparation. For additional guidance on specific data types, see the Data Curation Network (2023) primers, which offer recommendations and considerations for multiple data types.
Documenting for Transparency and Reusability
Once you have assembled your materials, it is time to assess transparency and reusability. Some questions to consider are:
Are the steps in your data process, including data collection, preparation, and transformations for analysis, clear and complete?
Did you include information on the data file structure such as file set up, variable information, and if needed instructions for loading the data into a program?
Does your data require special software or applications (e.g., SAS, MaxQDA)?
For categorical or binary variables, is it clear how you set up the variables, including any values/labels in the data (e.g., 1=Strongly Disagree, 2=Somewhat Disagree, …)?
Did you describe any missing or null values (e.g., 999=Not asked, NA=Not applicable) in your data?
If you used data produced by others (i.e., secondary data analysis), did you include a data citation for each data source?
Did you explain any relationships between the files in your data package? For instance, the data is provided as 2 files, one at the individual level and the other at the country level.
We have further guidance on Documentation - README and Documentation - Codebook.
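The questions about value labels and missing values can be partly checked by machine. The sketch below, in Python, summarizes each variable in a CSV file and flags values that have neither a label nor a missing-value definition. The function name `build_data_dictionary`, the variable names `id` and `q1`, the sample rows, and the stray value 7 are hypothetical; the labels and missing codes reuse the examples from the questions above.

```python
import csv
import io


def build_data_dictionary(csv_text, value_labels, missing_codes):
    """Summarize each variable: labels, missing counts, undocumented values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames:
        observed = sorted({row[name] for row in rows})
        labels = value_labels.get(name, {})
        entries.append({
            "variable": name,
            "n_records": len(rows),
            # Count records whose value is one of the declared missing codes
            "n_missing": sum(row[name] in missing_codes for row in rows),
            # For labeled variables only: values with no label and no missing code
            "undocumented": [v for v in observed
                             if labels and v not in labels and v not in missing_codes],
            "labels": labels,
        })
    return entries


# Hypothetical survey extract: q1 uses the labels and codes from the text above,
# plus a stray value (7) that nothing documents.
csv_text = "id,q1\n1,1\n2,2\n3,999\n4,7\n"
value_labels = {"q1": {"1": "Strongly Disagree", "2": "Somewhat Disagree"}}
missing_codes = {"999": "Not asked", "NA": "Not applicable"}

for entry in build_data_dictionary(csv_text, value_labels, missing_codes):
    print(entry["variable"], entry["n_missing"], entry["undocumented"])
```

Values are compared as text exactly as they appear in the file, so a code written as 999.0 would not match 999. An empty "undocumented" list for every labeled variable is a reasonable sign that the codebook covers what is in the data.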
Assessing for Sharing
A final step is to consider the legal, privacy, and ethical issues for sharing this data package with the public. Funders, employers, and science stakeholders do not want you to share data that raise ethical concerns. It is your obligation as the researcher to determine whether it is appropriate to share these data and, if so, in which ways.
As you assess your data for sharing, a few considerations include:
Does my data contain personally identifiable information (PII) and/or personal health information (PHI)?
How likely is someone to be identified in my data? Looking across the variables about a participant, and at combinations of those variables, how likely is it that someone could guess who this participant is?
Does my data fall under copyright or a data license from the original data producers?
Do my informed consent process and forms describe data sharing?
Are there any restrictions on data use (e.g., only for academic research, big data cannot be moved) that a future user needs to be aware of?
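The question about combinations of variables can be approached with a simple group-size count, the idea behind k-anonymity: count how many records share each combination of potentially identifying values and review the combinations held by only a few participants. The Python sketch below is a starting point under stated assumptions, not a full disclosure risk assessment; the function name `small_groups`, the variable names, the sample records, and the thresholds are all hypothetical.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=5):
    """Return value combinations shared by fewer than k records.

    k=5 is an illustrative default, not a recommended standard.
    """
    counts = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical de-identified records with three indirect identifiers
records = [
    {"age_group": "30-39", "region": "North", "occupation": "teacher"},
    {"age_group": "30-39", "region": "North", "occupation": "teacher"},
    {"age_group": "60-69", "region": "South", "occupation": "judge"},
]

risky = small_groups(records, ["age_group", "region", "occupation"], k=2)
print(risky)
```

Here the single 60-69, South, judge record is the only one with its combination of values, so it is flagged for review. Options for flagged combinations include coarsening categories, suppressing a variable, or sharing the data under restricted access.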
The sharing of research data is not binary (open or closed). Data sharing is a gradient with many ways to make data available (e.g., open, restricted access, dark archive, closed). If you want more information on ways to share data, we encourage you to review the guidance on Data Access Restrictions, Data Use Agreements, Sensitive Data, or Terms of Use and Licensing.
If you have any ethical concerns about complying with a data sharing policy, you should consult the program officer, editor, or representative about options that will satisfy their policy.
Conclusion
This guidance has taught you how to prepare a data package that will be transparent and reusable. We encourage you to consider what might be in your data package throughout your research project and store documents that capture your research and/or data approaches, providing fodder for your data package and hopefully saving you time and effort. As you prepare your data package, please consult the RDMC guidance and, if you still have questions, let us know through our Help Center.
References
Data Curation Network. (2023). Data Curation Primers. Data Curation Network GitHub Repository.
Wilkinson, M. D., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. doi:10.1038/sdata.2016.18
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
RDM Guidance formatting was influenced by The Writing Center, University of North Carolina at Chapel Hill Tips & Tools handouts.