LibGuides: Research Data Management: Data Organization

Data Organization

Consistent file names and a simple folder hierarchy make files easier to locate. Set up conventions for your project, document them for all team members, and be consistent!

Keep file names short and descriptive, and agree on consistent conventions with your team
Keep folder hierarchies shallow, no more than 4 levels deep
Limit each folder to around 10 files
Keep track of versions through either date and time or a numbering system (v01, v01-01, v02-01, v03-01, etc.)
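
The depth and folder-size guidelines above can be checked automatically. The following is a minimal Python sketch, not a definitive tool: the function name check_layout and the two limits are assumptions taken from the guidelines (4 levels, about 10 files), and you should adjust them to your own project conventions.

```python
import os
from pathlib import Path

MAX_DEPTH = 4   # levels below the project root, per the guideline above
MAX_FILES = 10  # approximate per-folder limit, per the guideline above

def check_layout(root, max_depth=MAX_DEPTH, max_files=MAX_FILES):
    """Return human-readable warnings for folders that break the guidelines."""
    root = Path(root)
    warnings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = Path(dirpath).relative_to(root)
        depth = len(rel.parts)  # 0 for the project root itself
        if depth > max_depth:
            warnings.append(f"{rel}: {depth} levels deep (limit {max_depth})")
        if len(filenames) > max_files:
            warnings.append(f"{rel}: {len(filenames)} files (limit {max_files})")
    return warnings
```

Calling check_layout on a project folder returns an empty list when every folder is within both limits.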

Recommendations:

Use standard dates in YYYY-MM-DD format (2022-07-23)
Use a short identifier (e.g., project name or grant number)
Include a summary of the content (e.g., Questionnaire or GrantProposal) in the file name
Use _ as delimiters. Avoid special characters such as: & , * % # ( ) ! @ $ { } [ ] ? < > -
Keep track of document versions either sequentially or with a unique date and time stamp
Make folder hierarchies as simple as possible

Example: Files with a naming convention

20230601_NSFProject_DesignDocument_Sandra_v2-01.docx
20230609_NSFProject_MasterData_Monica_v1-00.xlsx
20230705_NSFProject_LabTest1_Data_Lee_v3-03.xlsx
20230821_NSFProject_LabTest1_Documentation_Lee_v3-03.xlsx
20230912_NSFProject_LabTest2_Data_Lee_v1-01.xlsx
20240120_NSFProject_ProjectMeetingNotes_Ninfa_v1-00.docx
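
A naming convention is easier to follow when names can be built and checked by a script. The sketch below assumes the pattern suggested by the example names above (date, project, content summary, person, version, extension). The pattern, NAME_PATTERN, and both function names are illustrative assumptions, not part of any standard.

```python
import re
from datetime import date, datetime

# Assumed pattern, inferred from the example names above:
# YYYYMMDD_Project_Content_Person_vMAJOR-MINOR.ext
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{8})_(?P<project>[A-Za-z0-9]+)_(?P<content>[A-Za-z0-9_]+)"
    r"_(?P<author>[A-Za-z]+)_v(?P<major>\d+)-(?P<minor>\d{2})"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)

def parse_file_name(name):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # Reject impossible dates such as 20230230.
        datetime.strptime(parts["date"], "%Y%m%d")
    except ValueError:
        return None
    return parts

def build_file_name(day, project, content, author, major, minor, ext):
    """Assemble a file name in the assumed convention from its parts."""
    return f"{day:%Y%m%d}_{project}_{content}_{author}_v{major}-{minor:02d}.{ext}"

parts = parse_file_name("20230705_NSFProject_LabTest1_Data_Lee_v3-03.xlsx")
print(parts)
print(build_file_name(date(2023, 6, 1), "NSFProject", "DesignDocument", "Sandra", 2, 1, "docx"))
```

Names that do not match, such as "notes final (2).docx", return None and can be flagged for renaming.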

Creative Commons CC-BY: Adapted from Dalhousie University Libraries and the University of British Columbia's "Organize"

Without description, data is hard to understand and use. Make your data FAIR (findable, accessible, interoperable, reusable) by describing it with metadata (data about data). Metadata describes and documents the research data that you have collected; examples of its descriptive elements are listed below. Metadata makes your data sets searchable in an archive or repository, easy to locate from a citation, and easy to understand for people who might want to use your data. Use metadata to record details about a study, such as:

its context
the dates of data collections
data collection methods, etc.

Below are some ISO-suggested minimal metadata elements to use when documenting your data:

Title
Creator (Principal Investigators)
Date Created (also versions)
Format (and software required)
Subject
Unique Identifier
Description of the specific data resource
Coverage of the data (spatial or temporal)
Publishing Organization
Type of Resource
Rights
Funding or Grant
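
The elements above can be kept in a simple machine-readable record stored next to the data. This is a rough Python sketch: the lowercase element names are stand-ins for the list above rather than an official schema, and every value in the example record is a made-up placeholder.

```python
import json

# Lowercase stand-ins for the elements listed above (an assumption,
# not an official schema).
MINIMAL_ELEMENTS = [
    "title", "creator", "date_created", "format", "subject", "identifier",
    "description", "coverage", "publisher", "type", "rights", "funding",
]

def missing_elements(record):
    """List the minimal elements that are absent or empty in a metadata record."""
    return [name for name in MINIMAL_ELEMENTS if not record.get(name)]

# Every value below is a made-up placeholder.
record = {
    "title": "Lab Test 1 Measurements",
    "creator": ["Lee (Principal Investigator)"],
    "date_created": "2023-07-05",
    "format": "XLSX (spreadsheet software required)",
    "subject": ["laboratory testing"],
    "identifier": "example-id-0001",
    "description": "Raw measurements from the first lab test.",
    "coverage": "2023-06 to 2023-07",
    "publisher": "Example University",
    "type": "Dataset",
    "rights": "CC BY 4.0",
    "funding": "",  # left empty on purpose so the check reports it
}

print(missing_elements(record))
print(json.dumps(record, indent=2))
```

Running the check before deposit is a quick way to catch elements that were left blank.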

Discipline Specific Metadata

Sciences

Darwin Core: This metadata schema is for describing biological specimens, including their occurrence in nature as documented by observations, samples, and related information. Based on Dublin Core, this schema is used in natural history specimen collections and species observation databases.
Ecological Metadata Language (EML): This metadata schema is for ecological data. EML is implemented as a series of XML document types that can be used in a modular and extensible manner to document ecological data.
NASA's Standards: NASA has a variety of data format and metadata standards, as well as "heritage" standards that were in use by NASA Earth Science Data Systems (ESDS) prior to the start of the legacy ESDS Standards Process Group (SPG).

Geospatial

FGDC (Federal Geographic Data Committee): This schema is for geospatial data.
FGDC Metadata Tools: FGDC metadata creation and editing tools.

Social Sciences

DDI (Data Documentation Initiative) Alliance: This international metadata schema is for social, economic, and behavioral sciences. Expressed in XML, this metadata schema supports the entire research data life cycle.
OLAC (Open Language Archives Community) Metadata: This metadata set was developed by the Open Language Archives Community for the Open Archives Initiative. It is based on Dublin Core.

Humanities

TEI (Text Encoding Initiative): This is a standard for the representation of texts in digital form.
VRA Core (Visual Resources Association): The VRA Core is a data standard for the description of works of visual culture as well as the images that document them.

Controlled Vocabulary

In addition to selecting a metadata standard or schema, whenever possible you should use a controlled vocabulary. A controlled vocabulary provides a consistent way to describe data. Examples of controlled vocabularies include subject headings, thesauri, ontologies, and taxonomies. Using a controlled vocabulary will improve your data's findability and will make your data more shareable with researchers in the same discipline.
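
As a small illustration of how a controlled vocabulary makes descriptions consistent, the Python sketch below maps free-text keywords to one preferred term each. The PREFERRED_TERMS table is a hypothetical mini-vocabulary made up for this example; in practice you would draw terms from a published vocabulary used in your discipline.

```python
# Hypothetical mini-vocabulary: variant spellings mapped to one preferred term.
PREFERRED_TERMS = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
    "mi": "Myocardial Infarction",
    "high blood pressure": "Hypertension",
    "hypertension": "Hypertension",
}

def normalize_keywords(keywords):
    """Split keywords into preferred terms and terms not found in the vocabulary."""
    accepted, unknown = [], []
    for keyword in keywords:
        term = PREFERRED_TERMS.get(keyword.strip().lower())
        if term is None:
            unknown.append(keyword)      # needs review by a person
        elif term not in accepted:
            accepted.append(term)        # variants collapse to one term
    return accepted, unknown

print(normalize_keywords(["Heart attack", "MI", "hypertension", "stress"]))
```

Keywords that come back as unknown are candidates for review rather than errors.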

Metadata is sometimes captured through deposit in data repositories, but you can also prepare data dictionaries, codebooks, and README files to further describe and contextualize your work. README files are plain text documents that sit at the top level of project folders and describe the purpose of the project, contact details, and organization of files. Including a README with your work helps future users understand the data and the terms used to describe it.

README files should include:

Title
Principal Investigator(s)
Dates/Locations of data collection
Keywords
Language
Funding
Descriptions of every folder, file, format, data collection method, instruments, etc.
Definitions
People involved
Recommended citation
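
One way to make sure no section is forgotten is to generate the README skeleton from the list above. The Python sketch below is only an illustration: README_SECTIONS and render_readme are hypothetical names, and the long "Descriptions" item is shortened to a single heading.

```python
# Section headings taken from the list above.
README_SECTIONS = [
    "Title",
    "Principal Investigator(s)",
    "Dates/Locations of data collection",
    "Keywords",
    "Language",
    "Funding",
    "Descriptions of folders, files, formats, methods, and instruments",
    "Definitions",
    "People involved",
    "Recommended citation",
]

def render_readme(values):
    """Build plain-text README content; unfilled sections are marked TODO."""
    lines = [f"{section}: {values.get(section, 'TODO')}" for section in README_SECTIONS]
    return "\n".join(lines) + "\n"

# Placeholder values for illustration only.
text = render_readme({"Title": "Lab Test 1 Measurements", "Language": "English"})
print(text)
```

The returned text can be saved as README.txt at the top level of the project folder and filled in as the project progresses.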

README File Template

Guide to Writing "README" Style Metadata: A comprehensive guide and template from Cornell University.

Sensitive data is defined as information that is protected against unwarranted disclosure. Access to sensitive data should be safeguarded. Protection of sensitive data may be required for legal or ethical reasons, for issues pertaining to personal privacy, or for proprietary considerations. Examples of sensitive information include, but are not limited to:

Research data that is personally identifiable or proprietary
Public safety information
Financial donor information
Information concerning select agents
System access passwords
Information security records
Information file encryption keys

Techniques for Managing and Sharing Data

De-identification

This is the process of removing direct and indirect identifiers from a dataset while retaining enough information for the data to be usable by future researchers. In de-identification, a key is generated that explains the steps taken to de-identify the data and that could be used to reverse the process and reassociate the data with individuals.
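
A minimal Python sketch of the idea, under stated assumptions: it handles only one direct identifier (a hypothetical "name" field), replaces it with a random code, and returns the key that links codes back to individuals. Indirect identifiers, secure storage of the key, and any institutional requirements are outside this sketch.

```python
import secrets

def deidentify(records, id_field="name"):
    """Replace a direct identifier with a random code; return data and the key."""
    key = {}    # code -> original identifier; store separately and securely
    codes = {}  # original identifier -> code, so repeat subjects share a code
    cleaned = []
    for record in records:
        original = record[id_field]
        if original not in codes:
            code = "P-" + secrets.token_hex(4)
            while code in key:  # regenerate on the unlikely chance of a collision
                code = "P-" + secrets.token_hex(4)
            codes[original] = code
            key[code] = original
        row = dict(record)  # copy so the input records are not modified
        row[id_field] = codes[original]
        cleaned.append(row)
    return cleaned, key

def reidentify(cleaned, key, id_field="name"):
    """Reverse the process using the key."""
    return [{**row, id_field: key[row[id_field]]} for row in cleaned]
```

Only the cleaned records would be shared; whoever holds the key can call reidentify to restore the original identifiers.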

Anonymizing

Anonymization is similar to de-identification in the types of information masked in the original data set. However, the process is irreversible: no key is generated, and there is no way in the future to reconnect an individual subject with the data they supplied for the project.
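
The contrast with a keyed approach can be sketched in a few lines of Python. The field names ("name", "email", "age") and the 10-year age bands are assumptions chosen for illustration, and dropping or coarsening fields this way does not by itself guarantee that individuals cannot be re-identified.

```python
DIRECT_IDENTIFIERS = {"name", "email"}  # hypothetical field names

def anonymize(records):
    """Drop direct identifiers and coarsen age into 10-year bands; no key is kept."""
    result = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        if "age" in row:
            low = (row["age"] // 10) * 10
            row["age"] = f"{low}-{low + 9}"  # e.g. 34 becomes "30-39"
        result.append(row)
    return result

print(anonymize([{"name": "Ana", "email": "ana@example.org", "age": 34, "score": 7}]))
```

Because nothing maps the output rows back to the input, the step cannot be undone once the original file is destroyed.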

Licensing Agreements

Licensing allows access to data with little or no redaction other than removal of direct identifiers (names and addresses). Researchers seeking access sign an agreement to abide by rules that ensure continued subject confidentiality. This approach relies on the researcher to honor the agreement, which can be its weakness. (NCBI, “Protecting Privacy…”, section 3)

Remote Execution Systems

Confidential data are stored on a computer maintained by the data disseminator (who may or may not be the principal researcher), and secondary researchers submit their queries to that system. If the query results are not confidential, they are returned to the secondary researcher without individual-level data. The types of analysis allowed are limited in this model to help maintain confidentiality. These restrictions, and the return of only aggregate data, can make the data difficult to use for secondary research. (NCBI, “Protecting Privacy…”, section 3)
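
To make the model concrete, the following Python sketch shows a query function that returns only aggregates and withholds results for small groups. It is an assumption-laden illustration: the function name, the record fields, and the minimum group size of 5 are all hypothetical, and real systems apply their own disclosure rules.

```python
from statistics import mean

MIN_GROUP_SIZE = 5  # hypothetical disclosure threshold

def aggregate_query(records, group_by, value_field, min_group_size=MIN_GROUP_SIZE):
    """Return per-group counts and means; small groups are suppressed (None)."""
    groups = {}
    for record in records:
        groups.setdefault(record[group_by], []).append(record[value_field])
    results = {}
    for group, values in groups.items():
        if len(values) >= min_group_size:
            results[group] = {"n": len(values), "mean": mean(values)}
        else:
            results[group] = None  # too few subjects to release safely
    return results
```

The secondary researcher never sees individual rows, which is also why some analyses are not possible under this model.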

Data Quality Check Up

Data Quality Management Checking Guide
A guide from Purdue University Libraries with a checklist for checking data credibility, timeliness, information completeness, data accuracy, data consistency, and data deduplication.
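
Two of the checks named above, completeness and deduplication, are simple to automate. The Python sketch below is not taken from the Purdue guide; quality_report and its parameters are hypothetical, and it assumes the data has been loaded as a list of dictionaries, one per row.

```python
def quality_report(rows, required_fields, key_field):
    """Count rows with missing required values and rows that repeat a key."""
    incomplete = 0
    duplicates = 0
    seen = set()
    for row in rows:
        # A required field counts as missing when it is absent, None, or empty.
        if any(row.get(field) in (None, "") for field in required_fields):
            incomplete += 1
        key = row.get(key_field)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "rows": len(rows),
        "incomplete_rows": incomplete,
        "duplicate_keys": duplicates,
    }

# Made-up rows for illustration.
sample = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": ""},
    {"id": 1, "value": "b"},
]
print(quality_report(sample, required_fields=["id", "value"], key_field="id"))
```

Credibility, timeliness, accuracy, and consistency usually need judgment about the source and the subject matter, so they are left to the checklist itself.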