Tallinn University of Technology

Data Management Plan

When planning research, carefully consider and document how data will be collected and processed during the project, who has access to the data, who is responsible for them, and what will happen to them after the project closes. To do this, create a data management plan and follow it throughout the project. A dedicated tool such as DMPonline (Digital Curation Centre, UK) is a good way to create one.
 

I WHAT TYPE OF DATA TO COLLECT AND HOW TO DESCRIBE THEM?

- collect the data myself
- (re)use my previously collected data
- use public open data (Estonian Open Government Data Portal)
- (re)use data collected by others (re3data)
- buy the data

  • Keep in mind
    - which version of data you reuse or purchase
    - what if the author of the data uploads a new version
    - store the version used and the vendor documentation on your server
    - check copyrights, licenses, restrictions (access, reuse)
    - check machine readability and interoperability with the planned information system

  • How will the data be collected or created
    - name the existing standard procedures and methods
    - are there any data standards available
    - how to ensure data quality (availability, integrity, confidentiality)
    - how do you handle errors (input errors, problematic values)

  • Data description
    - data types (experiment, observation data, survey data, video files, etc.)
    - how new data integrates with existing data
    - which data deserve long-term preservation
    - if some datasets are subject to copyright or intellectual property rights, show that you have permission to use the data

II HOW TO STORE AND SECURE YOUR DATA

  • Data formats
    - point out and explain the data formats you have chosen
    - use open formats
    - use standard formats
    - use machine-readable formats
    - find out if the format allows automatic metadata insertion
    - check if the repositories support the selected formats

    Recommended data formats: 
    File Formats. Open Data Handbook 
    File Formats. Data Archiving and Networked Services 

  • Estimate the data volume at the end of the project. The volume affects several aspects:
    - preservation
    - access
    - backup
    - data exchange
    - hardware and software
    - technical support
    - expenses

  • Organization of data
    - be systematic and consistent
    - naming files: simple, logical, without abbreviations or with standard abbreviations (countries, languages, units of measurement, methods)
    - abbreviations in one language throughout
    - file organization (options: project name, time, place, collector, material type, format, version)
    - folder structure should be hierarchical, simple, logical, short
    - copying files to multiple locations is not a good practice; store in one location, create shortcuts
    - version control system git
    - cloud-based code repository GitHub
    - metadata (who is responsible for adding metadata)

    Article: Data Organization in Spreadsheets

  • Data documentation
    - Use this guide for data documentation:
    Siiri Fuchs, & Mari Elisa Kuusniemi. (2018, December 4). Making a research project understandable - Guide for data documentation (Version 1.2). Zenodo. DOI: http://doi.org/10.5281/zenodo.1914401

    - include a README text file with the data files; it should contain as much information as possible so that others can understand the data. Create one README file for each database and always name it README.txt or README.md (Markdown), not readme, ABOUT, etc.

    The README.txt file should contain the following information:
    - title of the dataset
    - dataset overview (abstract)
    - file structure and relationships between files
    - methods of data collection
    - software and versions used
    - standards 
    - specific information about data (units of measurement, explanations of abbreviations and codes, etc.)
    - possibilities and limitations of data reuse
    - contact information for the uploader of the dataset

    Guidelines for creating a README file
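
The README checklist above can be turned into a reusable skeleton so that no section is forgotten. The following is a minimal sketch, not a prescribed template: the section headings mirror the list in this guide, while the function name build_readme and the example value are hypothetical.

```python
# Section headings taken from the README checklist in this guide.
README_SECTIONS = [
    "Title of the dataset",
    "Dataset overview (abstract)",
    "File structure and relationships between files",
    "Methods of data collection",
    "Software and versions used",
    "Standards",
    "Specific information about data (units, abbreviations, codes)",
    "Possibilities and limitations of data reuse",
    "Contact information for the uploader",
]


def build_readme(values: dict[str, str]) -> str:
    """Return README text; sections without a value are marked TODO."""
    lines = []
    for section in README_SECTIONS:
        lines.append(section.upper())
        lines.append(values.get(section, "TODO"))
        lines.append("")
    return "\n".join(lines)


# Hypothetical example value, for illustration only.
text = build_readme({"Title of the dataset": "Example survey data"})
```

Writing the returned text to README.txt next to the data files gives every dataset the same structure, and the TODO markers show what still needs to be filled in.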

  • Metadata
    - administrative metadata, project details (ID, funder, rights and licences)
    - technical metadata (hardware and software, instruments, tools, access rights)
    - descriptive metadata (author, title, abstract, subject terms)
    - DataCite Metadata Framework (mandatory, recommended, optional metadata) on DataCite Estonia Consortium webpage
    - metadata standards indicate which fields should be filled: Directory of Metadata standards
    - free online EXIF viewer (shows all hidden metadata of document, audio, video, e-book, spreadsheet and image files)
    - controlled metadata dictionaries and classifications tell you what to write in these fields, using standard terminology. BARTOC (Basel Register of Thesauri, Ontologies & Classifications)
    Examples: 
    Estonian Subject Thesaurus 
    Agrovoc thesaurus  
    Mammal Species of the World  
    JACS education subject classifications 
    GeoNames
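
A quick completeness check can catch missing mandatory metadata before a dataset is deposited. The sketch below assumes the six mandatory properties of the DataCite Metadata Schema 4.x and a simplified flat record; the field spellings, the record layout and the function name missing_mandatory are illustrative assumptions, so verify the list against the schema version your repository uses.

```python
# Mandatory properties as named in DataCite Metadata Schema 4.x
# (assumption: verify against the schema version your repository uses).
MANDATORY = ["identifier", "creators", "titles", "publisher",
             "publicationYear", "resourceType"]


def missing_mandatory(record: dict) -> list[str]:
    """Return the mandatory fields that are absent or empty in `record`."""
    return [field for field in MANDATORY if not record.get(field)]


# Hypothetical, simplified record used only for illustration.
record = {
    "titles": ["Example image dataset"],
    "creators": ["Example, Author"],
    "publicationYear": 2022,
}
missing = missing_mandatory(record)
```

Here missing lists the identifier, publisher and resource type, which would have to be supplied before the record is complete.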

III ARE YOU PERMITTED TO GIVE ACCESS TO YOUR DATA AFTER THE END OF THE PROJECT? WHO ACCESSES THEM, UNDER WHICH CONDITIONS AND FOR HOW LONG?

  • Secure storage, backup, transfer and recovery

    The goal is to maintain data quality:
    - availability and accessibility
    - integrity (correctness, completeness and timeliness)
    - confidentiality (only available to authorized persons or systems, key management, storage of log files)
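
One common way to monitor integrity is to record a checksum for every file and compare it again later, for example after a transfer or a restore. This is a minimal sketch using Python's standard hashlib module; the function names are hypothetical and the approach is only one of several possible ones.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to `folder`) to its checksum."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose checksum differs from the manifest."""
    current = build_manifest(folder)
    return [name for name, checksum in manifest.items()
            if current.get(name) != checksum]
```

Storing the manifest together with the backup makes it possible to verify later that a restored copy is identical to the original.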

    Storage:
    - cloud environments
    - central servers
    - sensitive data servers
    - hard disk drive
    - external hard drive
    - mobile devices

    Backup: creating a copy of the current state of data and/or programs so that, after a security incident, they can be restored to that known state
    - maintaining and backing up the master file
    - 3-2-1 rule (keep 3 copies of your data on 2 different storage media, with 1 copy off-site)
    - who is responsible, especially for mobile devices
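
The 3-2-1 rule above is easy to check mechanically once the planned copies are written down. The sketch below is illustrative only; the Copy class, the function name and the example backup layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    medium: str      # e.g. "central server", "external drive", "cloud"
    offsite: bool    # True if stored away from the primary location


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """Check: at least 3 copies, on at least 2 different media, at least 1 off-site."""
    media = {c.medium for c in copies}
    return (len(copies) >= 3
            and len(media) >= 2
            and any(c.offsite for c in copies))


# Hypothetical backup layout for illustration.
plan = [
    Copy("central server", offsite=False),
    Copy("external drive", offsite=False),
    Copy("cloud", offsite=True),
]
```

Calling satisfies_3_2_1(plan) returns True for this layout; removing the cloud copy makes the check fail because only two copies remain and none is off-site.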

    Carry out a risk analysis: what if ....
    - IT systems are down
    - power outages, water and fire accidents
    - the device is lost or stolen
    - malware is discovered in devices
    - a team member leaves or dies, etc.

    Risk weighing (probability and losses)

    Risk assessment: threats and their likelihood, weaknesses, measures

    Information security standard ISO / IEC 27001
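
Risk weighing, as mentioned above, combines probability and losses. One simple and widely used approach is to multiply the two and rank the risks by the result. The sketch below assumes that approach; all probabilities and losses are made-up placeholders, not recommendations.

```python
# (risk, estimated yearly probability, estimated loss in working days)
# All numbers are made-up placeholders, not recommendations.
risks = [
    ("device lost or stolen", 0.10, 20),
    ("IT systems down", 0.50, 2),
    ("team member leaves", 0.20, 15),
]


def rank_risks(items):
    """Sort risks by expected loss (probability x loss), highest first."""
    scored = [(name, prob * loss) for name, prob, loss in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_risks(risks)
```

The top entries of the ranking indicate where mitigation measures are likely to pay off most.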

  • Access to data, information security
    - management of access rights (same for all, contractual rights, temporary labor rights)
    - storing log files
    - pseudonymization, encryption, key management
    - data exchange, personal data, third countries
    - organizational and physical security: training of a new employee, possible problems with the outgoing workers, internal rules of procedure, fire safety, locking the doors.
    - who is responsible for information security?

  • Long-term preservation

    FAIR Data
    - what data has long-term value? Preserving and sharing it for reuse
    - preparing data for sharing, FAIR data
    - repository selection

How to make data findable (F)
- the data have a permanent identifier DOI. See DataCite Estonia
- metadata is in the DataCite registry
- standard metadata such as Dublin Core, or other standards
- machine-readable metadata
- data and relevant metadata are in separate files but linked
- keywords and subject terms
- version management 

How to make the data accessible (A)
- choose the repository where the data is stored
- which data is open access, i.e. open data
- which data will remain closed and for what reason
- metadata must be open even when the data is not open (exceptions like rare species location)
- technical metadata: required software (version), instrument specifications, software tools

How to make data interoperable with other computer systems (I)
- mainly the task of the repository
- what data and metadata standards, controlled vocabularies and taxonomies are used
- description of data types: if not standard, how interoperability is ensured
- linking to other data, metadata, and specifications
- data exchange standards

How to ensure data reusability (R)
- partly a task of the repository
- is it raw, cleaned or processed data
- embargo period, grounds
- licenses 
- citing: DataCite Citation Formatter
- standard metadata, which (domain) standards are used
- provenance of the data (who, what, when, where, published)
- which software version is used
- how long is the data available for re-use
- data quality assurance (availability, integrity, confidentiality)
- suggestions who might need this data (in README.txt)

  • Data sharing
    - is the data shared in a repository, as supplementary data to an article, or as a separate data article in a data journal
    - in which repository is the data stored
    - who might find this data useful
    - how do you share your data (open data, or you have to ask for data)
    - when do you share (at once, after publication of the article, after embargo period)
    - is the data linked to a publication
    - link to your ORCID account

  • Access restrictions
    - which data is open access, open data
    - which data will remain closed and for what reason
    - any encrypted data
    - authentication, who gives access rights
    - whether you need to create a user account under certain terms

  • Who will be responsible for data management
    - by positions
    - principal investigator (PI): Data Management Policy, DMP, contracts, costs, training
    - researchers: follow and improve DMP, data management, problem solving
    - data manager: training, consulting, information security, backup, hardware and software
    - laboratory assistant, support staff: according to their tasks
    - by workflow
    - who is responsible for data collection, documentation, metadata, data security, etc.

    See also the TU Delft RD Policy

  • Planned costs
    - costs are mainly related to manpower, hardware and software
    - guides, training, lawyer and/or DPO consultation, translation service, APC
    - data collection: purchase of data, transcription of recorded interviews
    - digitization and OCR: hardware and software, manpower
    - software development or software purchase, user licenses
    - hardware: computers, servers, instruments, field work equipment
    - data analysis: hardware and software, outsourced services
    - data storage and backup: predictable data volume, rule 3-2-1
    - long-term storage of data: preparation for sharing (formatting), anonymisation
    - data storage in a repository
    - partner meetings, conferences
    - project data manager
    - consideration: 5% of the project budget

Data Management Plans examples and instructions:

DMP Tuuli Public DMP templates  

Digital Curation UK Example DMPs and guidance    

Digital Curation UK: Data Management Plans

Public Data Management Plans created with the DMPTool - RIO  

Public DMPs: Royal Danish Library / Technical University of Denmark

Public DMPs: DMPTool

Research Data Netherlands: The what, why and how of data management planning
 

Source: Data Management Plan (DMP), University of Tartu Library
 

ASK ABOUT RESEARCH DATA MANAGEMENT, STORAGE, DATA MANAGEMENT PLAN AND REPOSITORY SELECTION

Katrin Bobrov
katrin.bobrov@taltech.ee
620 3551

PUT Data Management Plan


Questions to consider:

  • what is the nature of your research project?

  • what research questions are you addressing?

  • for what purpose are the data being collected or created?

Guidance:
Briefly summarise the type of study (or studies) to help others understand the purposes for which the data are being collected or created.


Questions to consider:

  • are there any existing procedures that you will base your approach on?

  • does your department/group have data management guidelines?

  • does your institution have data protection or security policy that you will follow?

  • does your institution have a Research Data Management (RDM) policy?

  • does your funder have a research data management policy?

  • are there any formal standards that you will adopt?

Guidance:
List any other relevant funder, institutional, departmental or group policies on data management, data sharing and data security. Some of the information you give in the remainder of the DMP will be determined by the content of other policies. If so, point/link to them here.


What data will you collect or create? - questions, guidance


Questions to consider:

  • What type, format and volume of data?

  • Do your chosen formats and software enable sharing and long-term access to the data?

  • Are there any existing data that you can reuse?

Guidance:
Give a brief description of the data, including any existing data or third-party sources that will be used, in each case noting its content, type and coverage. Outline and justify your choice of format and consider the implications of data format and data volumes in terms of storage, backup and access.

Data Volume:

  • Note what volume of data you will create in MB/GB/TB. Indicate the proportions of raw data, processed data, and other secondary outputs (e.g., reports).

  • Consider the implications of data volumes in terms of storage, access and preservation. Do you need to include additional costs?

  • Consider whether the data scale will pose challenges when sharing or transferring data between sites; if so, how will you address these challenges?

Data format:

  • Clearly note what format(s) your data will be in, e.g., plain text (.txt), comma-separated values (.csv), geo-referenced TIFF (.tif, .tfw).

  • Explain why you have chosen specific formats. Decisions may be based on staff expertise, a preference for open formats, the standards accepted by data centres or widespread usage within a given community.

  • Using standardized, interchangeable or open formats ensures the long-term usability of data; these are recommended for sharing and archiving.

  • Clearly outline and justify your choice of format and consider the implications of data format and data volumes in terms of storage, backup and access.

See UK Data Service guidance on recommended formats or DataONE Best Practices for file formats.

See more:
https://dans.knaw.nl/en/about/services/easy/information-about-depositing-data/before-depositing/file-formats
http://opendatahandbook.org/guide/en/appendices/file-formats/


Data descriptions: 

  • Give a summary of the data you will collect or create, noting the content, coverage and information type, e.g. tabular data, survey data, experimental measurements, models, software, audiovisual data, physical samples, etc.

  • Consider how your data could complement and integrate with existing data, or whether there are any current data or methods that you could reuse.

  • Indicate which data are of long-term value and should be shared and/or preserved.

  • If purchasing or reusing existing data, explain how issues such as copyright and IPR have been addressed. You should aim to minimize any restrictions on the reuse (and subsequent sharing) of third-party data.

How will the data be collected or created? - questions, guidance, sample

Questions to consider:

  • What standards or methodologies will you use?

  • How will you structure and name your folders and files?

  • How will you handle versioning?

  • What quality assurance processes will you adopt?

Guidance:
Outline how the data will be collected/created and which community data standards (if any) will be used. Consider how the data will be organised during the project, mentioning for example naming conventions, version control and folder structures. Explain how the consistency and quality of data collection will be controlled and documented. This may include processes such as calibration, repeat samples or measurements, standardised data capture or recording, data entry validation, peer review of data or representation with controlled vocabularies.

SAMPLE 1:
Class observation data, faculty interview data and student survey data will be collected. The data will be collected during the research period (Jan 2022 – Dec 2022). Most of the data will be in text format (notes, paper survey).

Each file will be named with a short description/acronym to reflect its content, followed by the date of creation. To record different versions, we will add a version number in the file name. For example, file name GSC_20200608_v01.xls represents the data acquired on June 8, 2020, the 1st version.

We will create a document to detail file naming conventions and provide a list of explanations of the short descriptions/acronyms used in file names.
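
A naming convention like the one in this sample is easier to keep consistent when file names are generated and checked by a small script. The sketch below assumes the pattern acronym_YYYYMMDD_vNN.extension, inferred from the example GSC_20200608_v01.xls; the function names are hypothetical.

```python
import re
from datetime import date

# Pattern: <acronym>_<YYYYMMDD>_v<NN>.<extension>, as in GSC_20200608_v01.xls
NAME_PATTERN = re.compile(r"^([A-Za-z0-9]+)_(\d{8})_v(\d{2})\.(\w+)$")


def make_name(acronym: str, created: date, version: int, ext: str) -> str:
    """Build a file name that follows the convention."""
    return f"{acronym}_{created:%Y%m%d}_v{version:02d}.{ext}"


def parse_name(name: str) -> dict:
    """Split a conforming file name into its parts; raise ValueError otherwise."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a conforming file name: {name}")
    acronym, stamp, version, ext = match.groups()
    created = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
    return {"acronym": acronym, "created": created,
            "version": int(version), "ext": ext}
```

For example, make_name("GSC", date(2020, 6, 8), 1, "xls") reproduces the file name used in the sample, and parse_name rejects names that do not follow the convention.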

SAMPLE 2:
Experimental lab data will be collected using a microscope. The data generated will be time- and location-stamped image files of natural resources in Some Place. The images will serve as a record of the occurrence of creatures, natural artefacts, and conditions at specific places and times during the period 2021 through 2036. For many of the photos, taxonomic information and metadata will also be available. The occurrence data will be observational and qualitative. Metadata files shall be retained to facilitate reuse.

SAMPLE 3:
Primarily public data covering 2000 to 2015 will be acquired from the XXX Bureau. Some preliminary (non-public) Census data and data from other sources, e.g. the XXX State Statistics and the XXX State Dept of Health, will also be purchased or gathered.

SAMPLE 4:
Primary data will consist of audio files in Estonian and English. Text files will be generated once the audio files are transcribed. Encrypted digital voice recorders (DVRs) will be used to collect both interviews and transcripts. Interview and focus group digital audio files will not be stored on the DVRs; they will only be collected there and then securely transferred to the project's cloud-based virtual research environment space via secure FTP (File Transfer Protocol).

SAMPLE 5:
We estimate that we will be collecting approximately 800 surveys, 20 interviews (approximately 30 min in length each), and 2 focus groups (approximately 90 min in length each). Total magnitude of data, including accounting for versions (raw, master, analytic) is estimated to be under 30GB.
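
An estimate like the one in this sample can be documented as a short calculation so that the assumptions behind it are visible. In the sketch below the counts come from the sample, but the per-survey size and the audio size per minute are assumptions chosen purely for illustration; replace them with values measured in your own project.

```python
# Counts taken from the sample; per-item sizes are assumptions for illustration.
SURVEYS = 800
INTERVIEW_MINUTES = 20 * 30      # 20 interviews, about 30 min each
FOCUS_GROUP_MINUTES = 2 * 90     # 2 focus groups, about 90 min each
VERSIONS = 3                     # raw, master, analytic

MB_PER_SURVEY = 0.05             # assumed size of one survey record
MB_PER_AUDIO_MINUTE = 10         # assumed, roughly uncompressed stereo audio

audio_minutes = INTERVIEW_MINUTES + FOCUS_GROUP_MINUTES
one_version_mb = SURVEYS * MB_PER_SURVEY + audio_minutes * MB_PER_AUDIO_MINUTE
total_gb = one_version_mb * VERSIONS / 1000

print(f"{audio_minutes} audio minutes, about {total_gb:.1f} GB in total")
```

Under these assumed sizes the total stays below the 30 GB stated in the sample; compressed audio would reduce it considerably.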

SAMPLE 6:
Our data will exist in both non-proprietary and proprietary formats. The non-proprietary formats will ensure that anyone can use these data once they are deposited and made openly available.

Surveys will exist in .csv (non-proprietary), MS Excel, & SPSS (both proprietary) formats. For more information regarding SPSS see: SPSS Wikipedia 

Interviews & focus groups data will exist in .mp3 (non-proprietary), MS Word & NVivo (both proprietary) formats. For more information regarding NVivo see: NVivo 

Any survey data deposited for sharing and long-term access will be in .csv format so that anyone can use them without requiring proprietary software. The final de-identified versions of the interviews and focus groups transcripts will be exported into a basic non-proprietary text format for deposit, long-term preservation and access.

SAMPLE 7:
Sensor data, images and possibly 3rd party data (weather and road conditions) will be collected. Data will be saved as Excel spreadsheets and in an SQL database.

SAMPLE 8:
Quantitative data will be collected using a motion capture system. The processed data types will include Matlab files, MS Excel files, codebook texts, and graphical files.

SAMPLE 9:
The data, samples, and materials expected to be produced will consist of laboratory notebooks, raw data files from experiments, experimental analysis data files, simulation data, microscopy images, optical images …. Each of these is described below:

A. Laboratory notebooks: The graduate student and PI will record by hand any observations, procedures, and ideas generated during the course of the research.
B. Experimental raw data files: These files will consist of ASCII text that represents data directly collected from the various electrical instruments used to measure the thermoelectric properties of the superlattice nanowire thermoelectric devices.
C. Experimental analysis data files: These files will consist of spreadsheets and plots of the raw data mentioned in Part B. The data in these files will have been manipulated to yield meaningful and quantitative values for the device efficiency. The analysis will be performed using best practice and acceptable methods for calculating device efficiency.
D. Simulation data: These data will represent the results from commercially available simulation and modeling software to model the quantum confinement.
E. Microscopy images: Images of the proposed silicon nanostructures will be generated by scanning electron microscopy (SEM), transmission electron microscopy (TEM) at high resolution to quantify wire diameter and roughness, and atomic force microscopy (AFM).
F. Optical images: Images of the nanostructured devices will be collected using an optical microscope at various magnification settings.
G. Superlattice nanowire samples: The nanostructured samples will consist of silicon quantum dot superlattice nanowires. The experimenter will use these samples to measure device efficiencies.

SAMPLE 10:
After raw data is recorded, it is then copied to the central storage unit, where it is categorized by equipment, year, and stored in a folder named using the following naming convention: year-month-day-type_of_experiment. After that, in our central database, the experimenter either uploads or links experiments (depending on the size of files) with a sample id given to him/her earlier. This sample id, in turn, links experiments with the origin of the sample.

The database stores not only experimental data but also calibration measurements, and links them with experiments. This speeds up the analysis process and avoids the typos that spreadsheet-type solutions are prone to.
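
The folder naming convention in this sample (year-month-day-type_of_experiment) can be generated by a helper so that every experiment lands in a predictably named folder. The sketch below assumes a layout of equipment, then year, then the dated folder, which is one reading of "categorized by equipment, year"; the function name and the example equipment and experiment names are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def experiment_folder(equipment: str, day: date, experiment_type: str) -> PurePosixPath:
    """Return <equipment>/<year>/<year-month-day-type_of_experiment>."""
    folder_name = f"{day.isoformat()}-{experiment_type.replace(' ', '_')}"
    return PurePosixPath(equipment) / str(day.year) / folder_name


# Hypothetical equipment and experiment names.
path = experiment_folder("confocal", date(2021, 3, 5), "calcium imaging")
```

Using ISO dates (year-month-day) means the folders sort chronologically in any file browser.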

SAMPLE 11:
In our research project, we have diverse sets of data. Data is collected from experiments with cardiomyocytes and describes functional and structural aspects of myocytes. This involves measuring parameters as a function of time, space, or frequency. Examples include light microscopy images of cardiomyocyte structure and function, oxygen concentration measurements describing cellular energetics, and electrophysiological recordings of myocyte function.

For light microscopy, we have two types of data: 1. our own standard from microscopes that we have built, where raw data is stored in HDF5 format together with relevant metadata. HDF5 is a format commonly used to store and organize large amounts of data; 2. Zeiss lsm and czi formats from the commercial microscopes at our disposal.

Oxygen measurement data are recorded in CSV format, and electrophysiological measurements are stored in HDF5 format with relevant metadata. Due to the diversity of the data sets, we have previously developed a software platform for storage, analysis, and sharing that applies the FAIR (findability, accessibility, interoperability, and reusability) principles; it is described in https://doi.org/10.1371/journal.pcbi.1008475.

For raw data storage purposes, we have a central storage unit where all experiments are stored. This unit currently has 30 TB of free space. We have estimated that we require approximately 2.5 TB of storage capacity per year.
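
The figures in this sample allow a quick check of how long the storage will last. This is a minimal sketch under the assumption that yearly growth stays constant and that no other projects consume the free space; the function name is hypothetical.

```python
def years_until_full(free_tb: float, growth_tb_per_year: float) -> float:
    """Years of capacity left, assuming constant yearly growth."""
    if growth_tb_per_year <= 0:
        raise ValueError("growth must be positive")
    return free_tb / growth_tb_per_year


# Figures from the sample: 30 TB free, about 2.5 TB needed per year.
years = years_until_full(30, 2.5)
```

With these figures the free space covers 12 years, so the storage comfortably outlasts a typical project if the growth estimate holds.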


What documentation and metadata will accompany the data? - questions, guidance, sample


Questions to consider:

  • What information is needed for the data to be read and interpreted in the future?

  • How will you capture/create this documentation and metadata?

  • What metadata standards will you use and why?

  • What metadata will be provided to help others identify and discover the data?

Guidance:
Describe the types of documentation that will accompany the data to help secondary users to understand and reuse it. This should at least include basic details that will help people to find the data, including who created or contributed to the data, its title, date of creation and under what conditions it can be accessed. Documentation may also include details on the methodology used, analytical and procedural information, definitions of variables, vocabularies, units of measurement, any assumptions made, and the format and file type of the data. Consider how you will capture this information and where it will be recorded. Wherever possible you should identify and use existing community standards.
 

When you select the "Metadata standards" option:

SAMPLE 1:
The clinical data collected in this project will be documented using CDASH v1.1 standards. The standard is available on the CDISC website.

SAMPLE 2:
Using an electronic lab notebook, we would be generating metadata along with each notebook and postings. The metadata would include Sections, Categories and Keys which would be assigned by collaborators for reuse so as to maintain consistency in the use of terminology. We would also be using the Properties Ontology (ChemAxiomProp) when describing the chemical and materials properties.

SAMPLE 3:
We will be using some core elements from the TEI metadata standards to describe our data. We will also be adding some customised elements in the metadata to provide more details on the rights management.

SAMPLE 4:
We have developed our own software platform for data documentation and metadata storage, recently published in PLOS Computational Biology 16(12): e1008475. The platform has full documentation of its parts. It includes details about the methodology used, analytical and procedural information, definitions of variables, units of measurement, any assumptions made, and the format and file type of the data. When a new analysis method is developed in the future, we will immediately add the corresponding documentation.
 

When you select the "No metadata standards will be used" option:

SAMPLE 1:
I will not be using any metadata or international standard for the data collected and generated for this project. However, I will ensure that each document I create using Microsoft Word, Microsoft Excel and Microsoft PowerPoint has sufficient basic information, such as author's name, title, subject and keywords, in the document properties. In addition, a separate readme file will be prepared to describe the details of each dataset. I will apply the recommendations provided by Cornell University when creating the readme file(s). Key elements could include introductory information about the data as well as methodological, date-specific and sharing/access-related information.

SAMPLE 2:
Metadata about timing and exposure of individual images will be automatically generated by the camera. GPS locations will subsequently be added by post-processing GPS track data based on shared time stamps. Metadata for the image dataset as a whole will be generated by the image management software (iMatch) and will include time ranges, locations, and a taxon list. Those metadata will be translated into Ecological Metadata Language (EML), created using the Morpho software tool, and will include location and taxonomic summaries.

The dataset will be accompanied by a README file which will describe the directory hierarchy and filenaming convention.

Each directory will contain an INFO.txt file describing the experimental protocol used in that experiment. It will also record any deviations from the protocol and other useful contextual information. Microscope images capture and store a range of metadata (field size, magnification, lens phase, zoom, gain, pinhole diameter etc.) with each image. This should allow the data to be understood by other members of our research group and add contextual value to the dataset should it be reused in the future.
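
The GPS post-processing step described in this sample (adding locations to images based on shared time stamps) can be sketched as a nearest-timestamp lookup. The code below is only an illustration of that idea, not the workflow of any particular image management tool; the function name, the track format and the coordinates are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_fix(track, taken_at):
    """Return the GPS fix closest in time to `taken_at`.

    `track` is a non-empty list of (datetime, lat, lon) tuples sorted by time.
    """
    times = [t for t, _, _ in track]
    i = bisect_left(times, taken_at)
    # Only the fixes just before and just after the photo can be closest.
    candidates = track[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda fix: abs(fix[0] - taken_at))


# Hypothetical track points and photo time.
track = [
    (datetime(2021, 6, 1, 10, 0, 0), 58.38, 26.72),
    (datetime(2021, 6, 1, 10, 5, 0), 58.39, 26.73),
]
fix = nearest_fix(track, datetime(2021, 6, 1, 10, 4, 0))
```

This only works if the camera clock and the GPS clock are synchronised, so it is worth recording in the documentation how the clocks were set.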


How will you manage any ethical issues? - questions, guidance, sample


Questions to consider:

  • Have you gained consent for data preservation and sharing?

  • How will you protect the identity of participants if required? e.g. via anonymization

  • How will sensitive data be handled to ensure it is stored and transferred securely?

Guidance:
Ethical issues affect how you store data, who can see/use it and how long it is kept. Managing ethical concerns may include: anonymisation of data; referral to departmental or institutional ethics committees; and formal consent agreements. You should show that you are aware of any issues and have planned accordingly. If you are carrying out research involving human participants, you must also ensure that consent is requested to allow data to be shared and reused.

See UK Data Service guidance on consent for data sharing.

 See more: https://www.etag.ee/en/research-integrity/

SAMPLE 1:
Research will include sensitive data as it will contain human subject identifiable data.

The research will include data from subjects being screened for STDs. The final dataset will include self-reported demographic and behavioural data from interviews and laboratory data from urine specimens. Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers, there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and documentation available only under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate technology; and (3) a commitment to destroying or returning the data after analyses are completed.

SAMPLE 2:
I have sensitive data as it will contain human subject identifiable data.

Access to research records will be limited to primary research team members. Recorded data will have any identifying information removed and will be relabelled with study code numbers. A database which relates study code numbers to consent forms and identifying information will be stored separately on password-protected computers in a secured, locked office. To maintain the privacy of the participants, any report of individual data will only consist of performance measures without any demographic or identifying information.
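
The relabelling described in this sample, where identifying information is replaced by study code numbers and the linking table is kept separately, can be sketched as follows. This is an illustration under simplifying assumptions: a single identifying field called name (hypothetical) and sequential codes. Real projects usually have to remove or generalise further indirect identifiers as well.

```python
def pseudonymize(records, id_field="name"):
    """Replace the identifying field with a study code.

    Returns (coded_records, key_table). The key table links codes to
    identities and must be stored separately from the coded data.
    """
    code_for = {}
    coded = []
    for record in records:
        identity = record[id_field]
        if identity not in code_for:
            code_for[identity] = f"S{len(code_for) + 1:04d}"
        clean = {k: v for k, v in record.items() if k != id_field}
        clean["study_code"] = code_for[identity]
        coded.append(clean)
    key_table = {code: identity for identity, code in code_for.items()}
    return coded, key_table


# Hypothetical records for illustration.
records = [
    {"name": "Participant A", "score": 12},
    {"name": "Participant B", "score": 17},
    {"name": "Participant A", "score": 15},
]
coded, key_table = pseudonymize(records)
```

The coded records can be analysed and shared within the team, while the key table stays on the separately secured computer mentioned in the sample.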

SAMPLE 3:
In our work, we use laboratory mice for experiments. All animal procedures have already been approved by the Estonian National Committee for Ethics in Animal Experimentation. As we don't handle patient-related data, there is no need for anonymization. The data is accessible only to researchers who are involved with the project. However, the data will be available from the project's PI on reasonable request.
 

How will you manage copyright and Intellectual Property Rights (IPR) issues? - questions, guidance, sample

Questions to consider:

  • Who owns the data?

  • How will the data be licensed for reuse?

  • Are there any restrictions on the reuse of third-party data?

  • Will data sharing be postponed/restricted, e.g. to publish or seek patents?

Guidance:
State who will own the copyright and IPR of any data that you will collect or create, along with the licence(s) for its use and reuse. For multi-partner projects, IPR ownership may be worth covering in a consortium agreement. Consider any relevant funder, institutional, departmental or group policies on copyright or IPR. Also consider permissions to reuse third-party data and any restrictions needed on data sharing.

SAMPLE:
Our principle has been an open-source policy; however, the Department of Cybernetics will own the IPR. The software will be developed under GPLv3.

Data will be made available through scientific publications and corresponding data supplements, either as supporting materials of the publication or as a separate entity in a publicly accessible database.

We do not expect to seek patents as a part of this project. Data will be made open with the reuse licenses applied in accordance with ETAg and TalTech guidelines, journal policies, and agreement with the collaboration partners.

For each dataset, the terms will be set specifically taking all the above requirements into account.

See the DCC guide on How to license research data, and EUDAT's data and software licensing wizard


How will the data be stored and backed up during the research? - questions, guidance, sample


Questions to consider:

  • Do you have sufficient storage, or will you need to include charges for additional services?

  • How will the data be backed up?

  • Who will be responsible for backup and recovery?

  • How will the data be recovered in the event of an incident?

Guidance:
State how often the data will be backed up and to which locations. How many copies are being made? Storing data on laptops, computer hard drives or external storage devices alone is very risky. The use of robust, managed storage provided by university IT teams is preferable. Similarly, it is normally better to use automatic backup services provided by IT Services than rely on manual processes. If you choose to use a third-party service, you should ensure that this does not conflict with any funder, institutional, departmental or group policies, for example in terms of the legal jurisdiction in which data are held or the protection of sensitive data.

See UK Data Service Guidance on data storage.

SAMPLE 1:
I will be using the networked storage drive XXX, which provides storage for active data to all research staff and students. It is fully backed up, secure and resilient, and has multi-site storage. It is accessible via VPN (Virtual Private Network) from outside the University.

SAMPLE 2:
The data will be stored locally on a secure password-protected data server. One set of hard drives and one set of tapes will be stored in XXX building. A second set of hard drives and a second set of tapes will be stored in a second XXX building.

SAMPLE 3:
The data (on staff computers and the web server) will be managed according to the standard practices of the college’s IT department and will be password protected. Any restricted, non-public data will be stored on XXX (Restricted Access Data Center).
 

Backup & Versioning Control

SAMPLE 1:
A complete copy of materials will be generated and stored independently on primary and backup sources for both the PI and Co-PI (as data are generated) and with all members of the Expert Panel every 6 months. The project team will adopt the Version Control guidelines provided by the National Institute of Dental and Craniofacial Research to organise the data and ensure that different versions are identifiable and properly controlled and used.

SAMPLE 2:
We will adopt the version control standards recommended by the University of Leicester for the interview transcripts and coding, so that changes the research team has made to the files can be tracked.

SAMPLE 3:
We will be using Mercurial, a free, distributed source control management tool, to manage the data, so that versions of the data are easily identifiable and properly controlled and used.

SAMPLE 4:
All data will be backed up manually on a monthly basis by researcher xxx on a computer hard drive kept at the research team office. The computer will be password protected and only team members will be given the password and the right to access the computer. Incremental back-ups will be performed nightly and full back-ups will be performed monthly. Versions of files that have been revised due to errors or updates will be retained in an archive system. A revision history document will describe the revisions made.

SAMPLE 5:
In our data storage and backup, we follow the 3-2-1 rule. In our laboratory, we have two copies of data: one on an instrument computer and one on a central storage server. Data on the central storage server is backed up automatically every day to a backup server located in a completely different building. Furthermore, all work computers are automatically backed up on a daily basis to the backup server.

The senior research staff is responsible for data backup and recovery. Each member of the senior research staff has their own area of responsibility concerning experimental equipment, which includes data management.
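
As a rough illustration of the 3-2-1 rule mentioned in Sample 5 (at least three copies, on two different types of storage, with one copy kept off-site), the following Python sketch checks a list of copies against the rule. The copy names, media labels and building names are hypothetical placeholders and do not describe any actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of a dataset (all values are illustrative)."""
    name: str
    medium: str    # e.g. "workstation disk", "storage server"
    building: str  # where the copy is physically kept


def satisfies_3_2_1(copies, primary_building):
    """Return True if the copies meet the 3-2-1 rule.

    3: at least three copies in total
    2: on at least two different storage media
    1: at least one copy outside the primary building
    """
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.building != primary_building for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical setup loosely modelled on Sample 5.
copies = [
    DataCopy("instrument computer", "workstation disk", "Lab building"),
    DataCopy("central storage", "storage server", "Lab building"),
    DataCopy("backup server", "backup server", "Other building"),
]
print(satisfies_3_2_1(copies, primary_building="Lab building"))
```

A check like this is only a planning aid; it says nothing about whether the backups can actually be restored, which should be tested separately.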
 

How will you manage access and security? - questions, guidance, sample

Questions to consider:

  • What are the risks to data security, and how will these be managed?

  • How will you control access to keep the data secure?

  • How will you ensure that collaborators can access your data securely?

  • If creating or collecting data in the field, how will you ensure its safe transfer into your main secured systems?

Guidance:
If your data is confidential (e.g. personal data not already in the public domain, confidential information or trade secrets), you should outline any appropriate security measures and note any formal standards that you will comply with e.g. ISO 27001.

SAMPLE:
Our data is collected in the lab and copied to the data storage via a secure, isolated network. Access to data is protected by passwords (databases) or through secure login procedures (application server). Access to the backup is restricted and available only to the PI. Collaborators can access the data through a secure data sharing server, if required by the particular project.
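
One common way to confirm that a file arrived intact after being copied to storage is to compare checksums before and after the transfer. The sketch below is an illustration only, not part of the sample above: the file names are hypothetical and temporary files stand in for the lab computer and the storage server.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file and confirm that source and copy have the same checksum."""
    shutil.copy2(source, destination)
    source_sum = sha256_of(source)
    copy_sum = sha256_of(destination)
    if source_sum != copy_sum:
        raise IOError(f"Checksum mismatch after copying {source}")
    return copy_sum


# Demonstration with temporary files standing in for lab and storage folders.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "measurement.csv"          # hypothetical file name
    source.write_bytes(b"sample,value\nA,1.0\n")
    target = Path(tmp) / "storage_copy.csv"
    checksum = copy_and_verify(source, target)
    print(checksum)
```

Storing the checksum next to the dataset also lets later users verify that the file has not changed.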


Which data are of long-term value and should be retained, shared, and/or preserved? - questions, guidance, sample


Questions to consider:

  • What data must be retained/destroyed for contractual, legal, or regulatory purposes?

  • How will you decide what other data to keep?

  • What are the foreseeable research uses for the data?

  • How long will the data be retained and preserved?

Guidance:
Consider how the data may be reused, e.g. to validate your research findings, conduct new studies, or for teaching. Decide which data to keep and for how long. This could be based on any obligations to retain certain data, the potential reuse value, and what is economically viable to keep. Remember to consider any additional effort required to prepare the data for sharing and preservation, such as changing file formats.

SAMPLE:
We consider that all our collected data has long-term value, as it is collected from animals and would be useful in the future for other research projects. The data could also be used as input to different mathematical models and as part of large-scale AI-assisted discoveries. The data is shared through scientific publications, which are one of the project outcomes, and directly through the standardized output of our database for interested researchers.

The data has no expiration date, so it will be retained for at least ten years from the end of the project. This does not require any major effort because the data collection is set up in such a way that the data is ready for long-term preservation once it has been linked with the database.

See the DCC guide: How to appraise and select research data for curation.


What is the long-term preservation plan for the dataset? - questions, guidance

Questions to consider:

  • Where, e.g. in which repository or archive, will the data be held?

  • What costs, if any, will your selected data repository or archive charge?

  • Have you taken into account time and effort to prepare the data for sharing/preservation?

Guidance:
Consider how datasets that have long-term value will be preserved and curated beyond the lifetime of the grant. Also outline the plans for preparing and documenting data for sharing and archiving. If you do not propose to use an established repository, the data management plan should demonstrate that resources and systems will be in place to enable the data to be curated effectively beyond the lifetime of the grant.

An international list of data repositories is available via re3data, and some universities or publishers provide lists of recommendations, e.g., PLOS ONE recommended repositories.


How will you share the data? - questions, guidance, sample


Questions to consider:

  • How will potential users find out about your data?

  • With whom will you share the data, and under what conditions?

  • Will you share data via a repository, handle requests directly or use another mechanism?

  • When will you make the data available?

  • Will you pursue getting a persistent identifier for your data?

Guidance:
Consider where, how, and to whom data with acknowledged long-term value should be made available. The methods used to share data will be dependent on a number of factors such as the type, size, complexity and sensitivity of data. If possible, mention earlier examples to show a track record of effective data sharing. Consider how people might acknowledge the reuse of your data.

SAMPLE 1:
Our collected data is mainly disseminated through presentations at scientific conferences and meetings, scientific publications, and our existing collaborations. For example, our laboratory is a member of the COST action MitoEAGLE, which consists of research teams from 32 countries.

The data will be made available in parts as soon as the research articles related to the data have been published.

SAMPLE 2:
Datasets will be published together with the corresponding publications. For dataset publishing, we plan to use internationally established repositories, such as TalTechData.

Data sharing in the form of original datasets and metadata as a separate database is already possible and has a minimal cost. However, converting datasets such as ours to RDF enriched with standard ontologies is a separately financed project that is in progress in the laboratory. This will allow us to publish the datasets in a form suitable for machine learning and data mining.
 

Are any restrictions on data sharing required? - questions, guidance, sample

Questions to consider:

  • What action will you take to overcome or minimize restrictions?

  • For how long do you need exclusive use of the data and why?

  • Will a data-sharing agreement (or equivalent) be required?

Guidance:
Outline any expected difficulties in sharing data with acknowledged long-term value, along with causes and possible measures to overcome these. Restrictions may be due to confidentiality, lack of consent agreements or IPR, for example. Consider whether a non-disclosure agreement would give sufficient protection for confidential data.

SAMPLE:
Datasets from this work which underpin a publication will be deposited in the XXX institutional data repository and made public at the time of publication. Data in the repository will be stored in accordance with funder and University data policies. Files deposited in the XXX repository will be given a Digital Object Identifier (DOI) and associated metadata. The DOI issued to datasets in the repository can be included as part of a data citation in publications, allowing the datasets underpinning a publication to be identified and accessed. Metadata about datasets held in XXX will be publicly searchable and discoverable and will indicate how and on what terms the dataset can be accessed.
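
A data citation built around the DOI might be assembled as in the following Python sketch. The layout follows the general pattern recommended by DataCite (creator, publication year, title, publisher, identifier); the author name, title, year and DOI below are invented placeholders, and the repository is left as XXX.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Build a citation string: Creator (Year). Title. Publisher. DOI link."""
    creator_text = "; ".join(creators)
    return f"{creator_text} ({year}). {title}. {publisher}. https://doi.org/{doi}"


citation = format_data_citation(
    creators=["Tamm, M."],                       # hypothetical author
    year=2024,                                   # hypothetical year
    title="Example respiration measurements",    # hypothetical title
    publisher="XXX repository",
    doi="10.1234/example-dataset",               # placeholder, not a real DOI
)
print(citation)
```

Many repositories generate such a citation automatically, in which case the repository's own wording should be preferred.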


Who will be responsible for data management? - questions, guidance, sample


Questions to consider:

  • Who is responsible for implementing the DMP and ensuring it is reviewed and revised?

  • Who will be responsible for each data management activity?

  • How will responsibilities be split across partner sites in collaborative research projects?

  • Will data ownership and responsibilities for RDM be part of any consortium agreement or contract agreed between partners?

Guidance:
Outline the roles and responsibilities for all activities e.g. data capture, metadata production, data quality, storage and backup, data archiving & data sharing. Consider who will be responsible for ensuring relevant policies will be respected. Individuals should be named where possible.

SAMPLE:
The project PI is responsible for enforcing the implementation of the DMP. However, the overall data collection, including metadata entry, has been shared among senior research staff. Each of them has their own area of responsibility related to the type of experiments they are in charge of and will oversee whether the DMP is implemented correctly.

Furthermore, our research group will be entirely responsible for enforcing the DMP in any collaborative research project.
 

What resources will you require to deliver your plan? - questions, guidance

Questions to consider:

  • Is additional specialist expertise (or training for existing staff) required?

  • Do you require hardware or software which is additional or exceptional to existing institutional provision?

  • Will data repositories apply charges?

Guidance:
Carefully consider any resources needed to deliver the plan, e.g. software, hardware, technical expertise, etc. Where dedicated resources are needed, these should be outlined and justified.

Horizon Europe Data Management Plan


1. Data summary - questions, guidance, sample

What is the purpose of the data collection/generation and its relation to the objectives of the project?

What types and formats of data will the project generate/collect?

Will you re-use any existing data and how?

What is the origin of the data?

What is the expected size of the data?

To whom might it be useful ('data utility')?

Guidance 
The type[s] of data that will be used in the project is[are] [insert the types of data that will be used such as experimental, observational, images, text]. The estimated size of the data is [insert data size]. The project will [collect/re-use existing/collect and re-use existing] data. The origins of the data will be [insert where data will be collected from and/or the origins of the re-used dataset].

SAMPLE:

Origin of data:   

  • Image files will be recorded from a confocal microscope.

  • RNA sequencing data will be generated from normal and tumor tissues from patients.

  • Patient data will be acquired from the XXX Register.

  • Survey responses will be acquired using the REDCap survey software.

  • Measurements of markers of liver and renal function will be collected in the SMART‐TRIAL system.

  • Respondent data will be acquired in clinical interviews.

  • Existing bioinformatics data will be used for new analyses.    

Data format:   

  • Biomarker data will be saved in .csv format.

  • PCR data will be saved in .csv format.

  • Questionnaire data will be saved in SAS format.  

  • Data on prescribing practices before and after the pilot trial will be managed in SAS (file format: .sas7bdat) and analyzed in STATA (file format: .dta).

  • Interview responses will be saved in NVivo .nvp format.

  • Survey responses will be exported from REDCap to .csv format.

  • Register data will be received in spreadsheet format and will be converted to .tsv format before analysis.

  • Sequencing data will be in .fastq format.

  • Flow cytometry data will be saved in .fcs format.

  • Confocal images will be saved in .jpeg format.

  • Proteome raw data will be saved in .raw files.

  • Raw methylation data will be in .idat format.

  • Raw genetic variation data will be in .vcf format.
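
For the register data mentioned in the list above, conversion to .tsv could look like the following Python sketch. It assumes that the spreadsheet has first been exported as semicolon-delimited text; the delimiter, column names and sample rows are assumptions made purely for illustration.

```python
import csv
import io


def delimited_to_tsv(source_text, source_delimiter=";"):
    """Convert delimited text to tab-separated text, keeping all fields."""
    reader = csv.reader(io.StringIO(source_text), delimiter=source_delimiter)
    output = io.StringIO()
    writer = csv.writer(output, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return output.getvalue()


# Hypothetical export with made-up columns and values.
exported = "id;diagnosis;year\n001;A10;2019\n002;B20;2020\n"
tsv_text = delimited_to_tsv(exported)
print(tsv_text)
```

Using the csv module rather than a plain string replacement keeps quoted fields that contain the delimiter intact, and leading zeros in identifiers are preserved because every value stays text.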


2. FAIR DATA

2.1 Making data findable, including provisions for metadata - questions, guidance, sample

Questions:

Will data be identified by a persistent identifier?

Will rich metadata be provided to allow discovery? What metadata will be created? What disciplinary or general standards will be followed? In case metadata standards do not exist in your discipline, please outline what type of metadata will be created and how.

Will search keywords be provided in the metadata to optimize the possibility for discovery and then potential re-use?

Will metadata be offered in such a way that it can be harvested and indexed?


Guidance
This section of the DMP should present the measures to ensure the data’s:

Findability – Including any identifiers, keywords, metadata standards and other practices that will optimize the potential of finding and re-using the data.
Accessibility – First, details on the repository in which the data will be deposited should be given. Second, the access to the data itself, including open access, access protocols and restrictions aspects. Third, issues relating to metadata accessibility and availability should be described. In the case of certain data or metadata that will not be shared – proper justification should be provided.
Interoperability – The vocabularies, standards, formats or methodologies that will be used to enable data exchange, re-use and interoperability.
Reusability – This sub-section should provide information on the expected documentation (e.g., explaining methodology, codebooks, variables),


SAMPLE:

  • Data will be described by rich metadata using standard or specified terminologies:

    • Documentation will include a standardized folder structure, codebooks (metadata about the data), logbooks (metadata about data processing), analysis plans, input and output files from databases and statistical software

    • All files will be named according to the date of acquisition and experimental condition and put into folders. A README file will be generated, explaining the experimental conditions, tissue and cell types.

    • Survey responses will be curated into the Psych‐DS format.

    • Working files will be clearly labelled with a version suffix, e.g. v2.

    • The following metadata will be provided (as an Excel file) for each experiment: Experiment number, Condition, Date, Creator, Description, Format.

    • Metabolomics data will be documented in accordance with community standards defined by the Metabolomics Standards Initiative

  • We plan to make our datasets findable by uploading rich metadata to a searchable resource (a data repository) and having a persistent identifier assigned to the data by the repository. Data will be deposited at a repository/database (please provide name) immediately and without embargo.

  • Data will be made available upon publication as a supplement to the publication.

  • Metadata will be deposited at TalTechData and be freely searchable. There will be links to the underlying data.
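
The naming and metadata conventions listed in this sample can be sketched in Python as follows. The exact file-name pattern (date, condition, version suffix) and all example values are assumptions chosen for illustration; the metadata fields are the ones named in the sample, written here as CSV rather than Excel to keep the sketch free of external dependencies.

```python
import csv
import io
from datetime import date

METADATA_FIELDS = [
    "Experiment number", "Condition", "Date", "Creator", "Description", "Format",
]


def build_file_name(acquired, condition, version, extension):
    """Name a file by acquisition date, condition and version suffix."""
    safe_condition = condition.strip().lower().replace(" ", "-")
    return f"{acquired.isoformat()}_{safe_condition}_v{version}.{extension}"


def metadata_to_csv(records):
    """Serialise metadata records to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=METADATA_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical experiment used only to show the output format.
file_name = build_file_name(date(2024, 3, 5), "Control group", 2, "csv")
record = {
    "Experiment number": "E001",
    "Condition": "Control group",
    "Date": "2024-03-05",
    "Creator": "Researcher A",
    "Description": "Baseline measurement",
    "Format": "csv",
}
print(file_name)
print(metadata_to_csv([record]))
```

Putting the date first in ISO format (YYYY-MM-DD) means that files sort chronologically in an ordinary folder listing.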

2.2 Making data openly accessible  - questions, sample

Repository:
Will the data be deposited in a trusted repository?

Have you explored appropriate arrangements with the identified repository where your data will be deposited?

Does the repository ensure that the data is assigned an identifier? Will the repository resolve the identifier to a digital object?

Data:
Will all data be made openly available? If certain datasets cannot be shared (or need to be shared under restricted access conditions), explain why, clearly separating legal and contractual reasons from intentional restrictions. Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if opening their data goes against their legitimate interests or other constraints as per the Grant Agreement.

If an embargo is applied to give time to publish or seek protection of the intellectual property (e.g. patents), specify why and how long this will apply, bearing in mind that research data should be made available as soon as possible.

Will the data be accessible through a free and standardized access protocol?

If there are restrictions on use, how will access be provided to the data, both during and after the end of the project?

How will the identity of the person accessing the data be ascertained?

Is there a need for a data access committee (e.g. to evaluate/approve access requests to personal/sensitive data)?

Metadata:
Will metadata be made openly available and licenced under a public domain dedication CC0, as per the Grant Agreement? If not, please clarify why. Will metadata contain information to enable the user to access the data?

How long will the data remain available and findable? Will metadata be guaranteed to remain available after data is no longer available?

Will documentation or reference about any software needed to access or read the data be included? Will it be possible to include the relevant software (e.g. in open source code)?

SAMPLE:
Making data accessible 

Data and metadata will be retrievable by their unique and persistent identifier assigned by the TalTechData repository.

Datasets that do not contain personal information will be:

  • made available upon publication as a supplement to the publication.

  • deposited at a repository/database (please provide name) immediately and without embargo.

Datasets containing personal information will be:

  • made available upon request after ensuring compliance with relevant legislation and guidelines. Metadata will be published openly in a data repository.

Analysis scripts and other developed code will be uploaded to TalTechData.
 

2.3 Making data interoperable - questions, sample

Are the data produced in the project interoperable, that is allowing data exchange and re-use between researchers, institutions, organisations, countries, etc. (i.e. adhering to standards for formats, as much as possible compliant with available (open) software applications, and in particular facilitating re-combinations with different datasets from different origins)?

What data and metadata vocabularies, standards or methodologies will you follow to make your data interoperable?

Will you be using standard vocabularies for all data types present in your data set, to allow inter-disciplinary interoperability?

In case it is unavoidable that you use uncommon or generate project-specific ontologies or vocabularies, will you provide mappings to more commonly used ontologies?

Will your data include qualified references to other data (e.g. other data from your project, or datasets from previous research)?

SAMPLE:
Making data interoperable 

We plan to make our datasets interoperable by using controlled vocabularies, keywords or ontologies where possible and by using file formats that are as open and widely used as possible.


2.4 Increase data re-use (through clarifying licences) - questions, sample

How will you provide documentation needed to validate data analysis and facilitate data re-use (e.g. readme files with information on methodology, codebooks, data cleaning, analyses, variable definitions, units of measurement, etc.)?

Will your data be made freely available in the public domain to permit the widest re-use possible? Will your data be licensed using standard reuse licenses, in line with the obligations set out in the Grant Agreement?

Will the data produced in the project be useable by third parties, in particular after the end of the project?

Will the provenance of the data be thoroughly documented using the appropriate standards?

Describe all relevant data quality assurance processes.

SAMPLE:
Increase data re‐use 

We plan to make our datasets reusable by assuring high data quality, by providing all documentation needed to support data interpretation and reuse and by clearly licensing the data via the repository so that others know what kinds of reuse are permitted.

Tools needed: 

  • The data can be read by any software compatible with .jpeg files

  • The data can be read by any software compatible with .csv files

  • A software licence for SPSS will be required to read the data file which has been analysed.

  • Code necessary to process and interpret the data will be deposited on TalTechData.

  • Data Transfer/Processing agreements will be signed prior to any data sharing.

  • Data will be deposited at a repository/database (please provide name) immediately and without embargo, using a license (please specify license type, e.g CC‐BY).

Data quality: 

  • Data will be quality‐checked at collection/generation by validation against controls or publicly available databases.

  • RNA seq data will be quality controlled in terms of sequence quality, sequencing depth, reads duplication rates (clonal reads), alignment quality, nucleotide composition bias, PCR bias, GC bias, rRNA and mitochondria contamination, coverage uniformity. Only high‐quality data will be included in the subsequent analysis.

  • The register holder assures data quality in terms of completeness and correctness of registration.

  • The transcribed interview material will be coded independently by two researchers.

  • Images will be inspected for artifacts and the results will be recorded in a spreadsheet file.

  • Mass spectrometry results will be quality‐checked for contamination and mass accuracy.

  • Register data will be quality controlled according to a procedure established in our group (REF).

  • Data will be checked at the point of entry in REDCap or SMART‐TRIAL for double entries, completeness, missing data and unreasonable values.  

  • To assure data quality, the study will be conducted according to the COREQ guidelines for reporting qualitative research.
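
Entry checks of the kind listed above (double entries, missing data, unreasonable values) can be sketched in Python as follows. This is not how REDCap or SMART‐TRIAL implement validation; the field names, example records and plausible range are hypothetical.

```python
def check_records(records, required_fields, valid_ranges):
    """Return a list of human-readable problems found in the records."""
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        record_id = record.get("id")
        # Double entries: the same identifier appears more than once.
        if record_id in seen_ids:
            problems.append(f"row {index}: duplicate id {record_id}")
        seen_ids.add(record_id)
        # Completeness: required fields must not be empty.
        for field in required_fields:
            if record.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
        # Unreasonable values: numeric fields must fall inside a given range.
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value not in (None, "") and not low <= value <= high:
                problems.append(f"row {index}: {field}={value} outside {low}-{high}")
    return problems


# Made-up records: row 1 lacks an age, row 2 repeats an id and has an extreme value.
records = [
    {"id": "P01", "age": 34, "alt_u_per_l": 28.0},
    {"id": "P02", "age": None, "alt_u_per_l": 31.5},
    {"id": "P01", "age": 51, "alt_u_per_l": 4000.0},
]
issues = check_records(
    records,
    required_fields=["id", "age"],
    valid_ranges={"alt_u_per_l": (0.0, 1000.0)},  # illustrative limits only
)
for issue in issues:
    print(issue)
```

Returning a list of problems, instead of stopping at the first one, makes it easier to record all findings in a quality log.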


3. Other research outputs 

In addition to the management of data, beneficiaries should also consider and plan for the management of other research outputs that may be generated or re-used throughout their projects. Such outputs can be either digital (e.g. software, workflows, protocols, models, etc.) or physical (e.g. new materials, antibodies, reagents, samples, etc.).

Beneficiaries should consider which of the questions pertaining to FAIR data above, can apply to the management of other research outputs, and should strive to provide sufficient detail on how their research outputs will be managed and shared, or made available for re-use, in line with the FAIR principles.

Further to the FAIR principles, DMPs should also address research outputs other than data, and should carefully consider aspects related to the allocation of resources, data security and ethical aspects.

Guidance
The management of other research outputs that are generated/re-used in the project (e.g., software, models, new materials) should be discussed and, when relevant, their compliance to the FAIR principles should be detailed.


4. Allocation of resources - questions, guidance, sample
What are the costs for making data FAIR in your project?

How will these be covered?

Note that costs related to open access to research data are eligible as part of the grant (if compliant with the Grant Agreement conditions). Who will be responsible for data management in your project?

Are the resources for long term preservation discussed (costs and potential value, who decides and how what data will be kept and for how long)?

Guidance
This section should include a discussion on the resources such as costs associated with compliance to the FAIR principles or who will be responsible for data management.

SAMPLE:
Allocation of resources 

  • Data management is performed by the PI / a research assistant / a postdoc / a dedicated data manager.

  • Salary of X EUR for a data manager in the group is required.

  • Access to the departmental server is required. It is expected to cost X EUR.


5. Data security - questions, guidance, sample

What provisions are in place for data security (including data recovery as well as secure storage and transfer of sensitive data)?

Is the data safely stored in certified repositories for long term preservation and curation?

Guidance
Aspects that should be referred to in this section include provisions ensuring data security, including its storage and recovery.


SAMPLE:

Access to the documentation stored in XXX servers is restricted to group members.

  • Data saved in XXX servers is backed up.  

  • Access to data saved in XXX servers requires user authentication with password.

  • Access to servers is permitted only when on TalTech premises or by VPN.

  • In OneDrive, it is possible to recover changed/deleted datasets.

  • We only work with pseudonymized data, with the key stored in a safety cabinet located at XXX (please specify location), to which only XXX have access (please specify the people that have access to it).

  • It has been judged that controlled access is not required for these data since the data do not contain personal information.


6. Ethical aspects - questions, guidance, sample

Are there any ethical or legal issues that can have an impact on data sharing?

These can also be discussed in the context of the ethics review. If relevant, include references to ethics deliverables and ethics chapter in the Description of the Action (DoA). Is informed consent for data sharing and long-term preservation included in questionnaires dealing with personal data?

Guidance
Any ethical or legal issues that can have an impact on data sharing should be presented. Additionally, when the research uses personal data, aspects such as informed consent or long-term preservation should be referred to.


SAMPLE:

There are no personal data, nor any other grounds for confidentiality.

  • Sensitive personal data will be handled according to GDPR. 

  • IP rights will be managed in accordance with the contract drawn up with our industrial partner organization (specify).

  • Survey and clinical data will be anonymized, i.e. every possibility of tracing the data back to the study participant will be removed. The data is anonymized when the code key is destroyed and it is no longer possible to connect a person to the data.

  • Data will be pseudonymized and a key will be kept separately from the data.

  • Patient data is pseudonymized by the clinical collaborator and the code is not accessible to researchers in our research group. The material will arrive at our research group already coded, and the original code will be kept by the collaborators.

  • Ethical approvals/amendments and informed consent forms for the project are registered in the diary.

  • Consent has been acquired from human participants to process/share data.

  • Data Transfer/Processing agreements will be signed prior to any data sharing.

  • Results will only be presented on aggregated level without any possibility of backward identification.
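
Pseudonymization with a separately kept key, as described in the sample above, could be sketched in Python like this. The code format, field names and example records are hypothetical; in practice the key table would be written to a separate, access-restricted location rather than held in the same program as the coded data.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace identifiers with random study codes.

    Returns (coded_records, key), where key maps study code -> identifier.
    The key must be stored separately from the coded records.
    """
    key = {}
    code_for = {}
    coded_records = []
    for record in records:
        identifier = record[id_field]
        if identifier not in code_for:
            code = f"S{secrets.token_hex(4)}"
            while code in key:  # avoid the unlikely case of a repeated code
                code = f"S{secrets.token_hex(4)}"
            code_for[identifier] = code
            key[code] = identifier
        # Copy every field except the identifying one, then add the code.
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["study_code"] = code_for[identifier]
        coded_records.append(coded)
    return coded_records, key


# Invented participants; the same person appears twice and keeps one code.
participants = [
    {"name": "Participant One", "score": 12},
    {"name": "Participant Two", "score": 15},
    {"name": "Participant One", "score": 14},
]
coded, key = pseudonymize(participants)
print(coded)  # contains study codes and scores, but no names
```

The codes are random rather than derived from the names, so they cannot be recomputed without the key. As the sample notes, the data only becomes anonymized once that key has been destroyed.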


7. Other  - questions, guidance, sample

Do you, or will you, make use of other national/funder/sectorial/departmental procedures for data management? If yes, which ones (please list and briefly describe them)?

List any other relevant funder, institutional, departmental or group policies on data management, data sharing and data security. Some of the information you give in the remainder of the DMP will be determined by the content of other policies. If so, point/link to them here.

Guidance
If other procedures or practices of data management are relevant to the project they should be presented in this section.


SAMPLE:
List any other relevant funder, institutional, departmental or group policies on data management, data sharing and data security: European Research Council guidelines for Open Access; European Commission Data Guidelines; Horizon 2020/Europe Guidelines on FAIR Data Management; Directive (EU) 2019/1024 on Open Data and the re-use of public sector information (2019); European Commission proposal for a regulation on European data governance (Data Governance Act) (25.11.2020); EC’s Digital Strategy (2020); Regulation (EU) 2016/679 EU General Data Protection Regulation (the GDPR) (2018); Copyright Directive (2019); Open Science Expert Group of the Estonian Research Council. Open Science in Estonia: Open Science Expert Group of the Estonian Research Council Principles and Recommendations for Developing National Policy, p 6 (2016); FORCE-11 The FAIR Data Principles; TalTech FAIR Data guidelines for making data findable; Data documentation, organisation and metadata recommendations of TalTech.