Thematic guides: Research data: Organizing, naming, documenting, no 404 errors: easy to Find data

Make your data easy to Find

A unique, persistent identifier;

Clearly named, organized and documented files and folders.

Easy to find data = persistent and durable identifiers

Risk: 404 errors. Other researchers may get a 404 error if you don't apply a DOI to your data!

Remedy: unique, durable identifiers like the DOI (Digital Object Identifier), a kind of ISBN or ISSN for the online resource. The DOI makes it easier for search engines to find your data. It assigns a permanent URL to it, enabling you to retrieve a precise and complete citation of the resource without having to re-enter everything.

How do you apply a DOI to your data? Simply upload your data to the Sciences Po data repository, data.sciencespo. We'll be happy to help you, train you, do it for you... Citing the data you use is just as important as citing your publications.
The metadata associated with the DOI include:

- Mandatory metadata: creator, title, publisher/repository, publication date. These elements are requested by journals when you cite data;
- Recommended metadata: resource type, subject, language, format, version, resource description, distribution/reuse license, creation date, last consulted date, etc.
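As a minimal sketch of why the mandatory elements matter, the snippet below assembles a data citation from them. The class name, the citation layout and the DOI `10.1234/EXAMPLE` are hypothetical, chosen for illustration only; journals and repositories may prescribe a different layout.

```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    """The four mandatory elements, plus the DOI itself."""
    creator: str
    title: str
    publisher: str          # publisher or repository
    publication_year: int
    doi: str


def format_citation(meta: DatasetMetadata) -> str:
    # Assumed layout: Creator (Year). Title. Publisher. DOI URL
    return (
        f"{meta.creator} ({meta.publication_year}). {meta.title}. "
        f"{meta.publisher}. https://doi.org/{meta.doi}"
    )


# Hypothetical record: none of these values describe a real dataset.
example = DatasetMetadata(
    creator="Doe, Jane",
    title="Interviews Bamako",
    publisher="data.sciencespo",
    publication_year=2020,
    doi="10.1234/EXAMPLE",
)
print(format_citation(example))
```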

DOIs are provided on a subscription basis by specialized agencies: CrossRef, DataCite (consortium of information science and library services founded in London in 2009, managed in France by Inist-CNRS).

A DOI is built according to the ISO 26324 standard: a prefix obtained through the DOI agency (here, Inist-CNRS) + a suffix set by the institution (Sciences Po, in our case). A declaring organization has as many DOI prefixes as it has contracts with agencies providing DOIs.

Example: https://doi.org/10.21410/7E4/EATFBW
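The prefix/suffix structure can be illustrated with a short sketch that splits a DOI at its first slash. The function name `split_doi` is hypothetical, and the check is deliberately loose (it only verifies that the prefix starts with "10." and that a suffix is present).

```python
def split_doi(doi_or_url: str) -> tuple[str, str]:
    """Return (prefix, suffix) for a DOI given bare or as a doi.org URL."""
    doi = doi_or_url.strip()
    for resolver in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(resolver):
            doi = doi[len(resolver):]
            break
    # The prefix ends at the first slash; the suffix may itself contain slashes.
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi_or_url!r}")
    return prefix, suffix


prefix, suffix = split_doi("https://doi.org/10.21410/7E4/EATFBW")
print(prefix)  # 10.21410
print(suffix)  # 7E4/EATFBW
```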

Your data is unique!

Some videos to learn more about persistent identifiers

[Video] Data Management: Data Citation, University of Wisconsin Data Services
[Video] Persistent identifiers and data citation explained, Research Data Netherlands

Easy to find data = clearly named files

Risk: getting bogged down in files with nebulous names such as "JD-1" or "JD-2" (the initials of a colleague or interviewee).

Remedy: file naming rules, so you know what's inside without spending three hours reading the content, and so you can tell apart interviews with different people and organizations.

Sciences Po recommendation: name your file in 3 mandatory parts separated by underscores: prefix [data name]_root [project acronym]_suffix [date in ISO 8601 format and version, or date of last update].

Ex: interviewsBamako_ProjetX_20200202_Vdef

The use of underscores between terms makes it easier to read files on operating systems other than the producer's, for example.

Clarification of the version reference at the end of the file name: VP.Vdef or V1, V2... The version reference may be useful if the researcher intends to re-interview people or if the interviewees' discourse evolves over time.
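The three-part recommendation can be sketched as a small helper. The function and parameter names are hypothetical; the date is written in the compact YYYYMMDD form used in the example above.

```python
from datetime import date


def build_filename(data_name: str, project: str, when: date, version: str) -> str:
    """Join the three parts with underscores: prefix_root_suffix."""
    suffix = f"{when:%Y%m%d}_{version}"  # date (YYYYMMDD) followed by the version
    return "_".join([data_name, project, suffix])


print(build_filename("interviewsBamako", "ProjetX", date(2020, 2, 2), "Vdef"))
# interviewsBamako_ProjetX_20200202_Vdef
```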

► More generally, some rules for naming files so that they are "human and machine readable"

Use naming rules that are shared and understood by everyone in your team.

How to name?

- Be brief (25 characters max) and reflect the content; avoid overly generic terms ("draft", "test"...).
- No empty words; use commonly understood abbreviations.
- Ideally, don't repeat terms between file names and folder names (although in reality, it's not that simple...).
- No spaces and no special characters (accented letters, symbols, etc.): the file name must be machine-readable.
- Use capital letters to separate words.
- Write numbers with at least 2 digits, depending on the number of files involved (01, 002, 0003, etc.).
- Put the most important element first to facilitate document search.
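Several of these rules are mechanical enough to check automatically. The sketch below is one possible checker, not an official tool: the function name, the set of allowed characters (ASCII letters, digits, "_" and "-") and the list of generic terms are assumptions to adapt to your team's conventions.

```python
import re

MAX_STEM_LENGTH = 25                      # "25 characters max", extension excluded
GENERIC_TERMS = {"draft", "test"}         # the two examples given in the rules
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")   # assumption: ASCII letters, digits, "_", "-"


def check_name(filename: str) -> list[str]:
    """Return the list of rule violations; an empty list means the name passes."""
    stem, _, _extension = filename.rpartition(".")
    stem = stem or filename  # no extension present
    problems = []
    if len(stem) > MAX_STEM_LENGTH:
        problems.append("longer than 25 characters")
    if " " in stem:
        problems.append("contains spaces")
    # Spaces are reported separately, so ignore them for the character check.
    if not ALLOWED.fullmatch(stem.replace(" ", "")):
        problems.append("contains special or accented characters")
    if stem.lower() in GENERIC_TERMS:
        problems.append("too generic")
    return problems


print(check_name("Politis_budget_2019.xls"))    # []
print(check_name("Politis DMP & éthique.pdf"))
```

A check like this cannot judge whether a name "reflects the content"; that part remains a human decision.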

Examples (recommended name | name to avoid):

1. Short but meaningful

The file name should contain enough information to be understood outside the storage space. Avoid redundant and unnecessary information.

DMP.pdf | data management plan.pdf

RapportActivitePolitis2018.pdf | rapport d’activité du projet politis 2018.pdf

2. Never use spaces

Distinguish between the different elements of a file name by using capital letters and/or underscores "_".

PolitisDMP.pdf | DMP du projet Politis.pdf

Politis_budget_2019.xls | Politis budget 19.xls

3. No special characters

Never use special characters or letters with accents: à é ù % , { } ! @ $ € & * ( ).

Politis_Budget_Prev.pdf | Politis(Budget_Prévisionnel).pdf

Politis_DMP_ethique.pdf | Politis DMP & éthique.pdf
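Existing names can be brought in line with rules 2 and 3 programmatically. The following is a minimal sketch using only the Python standard library; the function name `sanitize` and the choice to replace every run of disallowed characters with a single underscore are assumptions.

```python
import re
import unicodedata


def sanitize(name: str) -> str:
    """Strip accents, then replace every run of other characters with "_"."""
    stem, dot, extension = name.rpartition(".")
    if not dot:  # no extension present
        stem, extension = name, ""
    # Decompose accented letters (é -> e + combining accent), then drop non-ASCII.
    decomposed = unicodedata.normalize("NFKD", stem)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Anything that is not a letter, digit or underscore becomes "_".
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", ascii_only).strip("_")
    return f"{cleaned}.{extension}" if extension else cleaned


print(sanitize("Politis DMP & éthique.pdf"))  # Politis_DMP_ethique.pdf
```

Automatic cleaning only removes forbidden characters; shortening a name such as "Prévisionnel" to "Prev" still has to be done by hand.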

4. Ordering information

For documents of the same subject, always use the same order of information.

Politis_Budget_2019.pdf and Politis_Rapport_2019.pdf | Politis_Budget_2019.pdf and Rapport_2019_Politis.pdf

EntretienSyndicat.wav and EntretienONG.wav | EntretienSyndicat.wav and ONGEntretien.wav

5. Numbers: balanced structure

Always use the same number of characters. Pattern: 1-9, 01-99, 001-999, etc.

Politis_entretien01.wav [...] Politis_entretien26.wav | Politis_entretien1.wav [...] Politis_entretien26.wav

Budget2015.wav [...] Budget2019.wav | Budget2015.wav [...] Budget19.wav
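Zero-padding is easy to automate. The sketch below (hypothetical function name) pads every number to the width of the largest one, with a minimum of two digits, so that alphabetical order matches numerical order.

```python
def numbered_names(stem: str, count: int, extension: str) -> list[str]:
    """Name files stem01..stemNN, padded to the width of the largest number."""
    width = max(2, len(str(count)))  # at least 2 digits
    return [f"{stem}{i:0{width}d}.{extension}" for i in range(1, count + 1)]


names = numbered_names("Politis_entretien", 26, "wav")
print(names[0], names[-1])  # Politis_entretien01.wav Politis_entretien26.wav

# Padded names sort alphabetically in numerical order.
print(sorted(names) == names)  # True
```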

6. Date format

Use only numbers. Patterns: YYYY, YYYYMM, YYYYMMDD.

Politis_CR_20180910 | CR 10 septembre 18

Politis_budget20182019 | budget 2018-19
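A minimal sketch of producing these date stamps from a date object rather than typing them by hand; the function name and the `precision` parameter are hypothetical.

```python
from datetime import date


def date_stamp(d: date, precision: str = "day") -> str:
    """Format a date as YYYY, YYYYMM or YYYYMMDD (digits only)."""
    formats = {"year": "%Y", "month": "%Y%m", "day": "%Y%m%d"}
    return d.strftime(formats[precision])


meeting = date(2018, 9, 10)
print("Politis_CR_" + date_stamp(meeting))  # Politis_CR_20180910
print(date_stamp(meeting, "month"))         # 201809
```

Writing the year first means that files sorted by name are also sorted chronologically.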

7. Version

Use the letter "V" followed by a number to indicate the version. If it is a draft, mark it "Vdraft".

rapportV01 | rapport version 1

Use naming rules that are shared and understood by everyone in your team.

Keep control of your data - don't let software automatically name your files! Systematic quality control!

► Tools

AntRenamer
Free program that lets you easily rename large numbers of files and folders at once, according to criteria you define. It lets you rename by modifying and/or adding character strings, by enumeration or from within a file. It also extracts certain metadata from the files concerned. It supports Unicode names.
Bulk Rename Utility
Tool for batch renaming of files and folders, including images and sound files. For Windows.
Renamer
The same, for Mac.
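For readers who prefer a script to a graphical tool, batch renaming can be sketched in a few lines of Python. This is an illustration under simple assumptions, not a substitute for the tools above: the function names are hypothetical, and the only rule applied is replacing spaces with underscores. The plan is computed first so it can be reviewed before anything is renamed.

```python
from pathlib import Path


def plan_renames(folder: Path) -> list[tuple[Path, Path]]:
    """List (old, new) pairs for files whose names contain spaces; rename nothing."""
    plan = []
    for path in sorted(folder.iterdir()):
        new_name = path.name.replace(" ", "_")
        if path.is_file() and new_name != path.name:
            plan.append((path, path.with_name(new_name)))
    return plan


def apply_renames(plan: list[tuple[Path, Path]]) -> None:
    """Carry out a reviewed plan, refusing to overwrite existing files."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(f"refusing to overwrite {new}")
        old.rename(new)
```

A typical use is to print the result of `plan_renames(Path("my_folder"))`, check it by eye, and only then call `apply_renames` on it.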

Easy to find data = organized files

Risk: not understanding files left behind by a colleague, e.g. a foreign post-doc who has left for other climes.

Remedy: folder tree rules - so that others can understand your work.

The dataset must be understandable to someone who didn't create it (researcher, citizen if need be). The dataset must also be usable if the data producer has left.

In concrete terms: is a file tree structure planned (by source, by theme, by research objective, by organization, by function, use of pseudonyms, etc.)?
A clear tree structure is generally established when files are deposited in a data repository. For certain large-scale projects (e.g. Elipss), the file tree structure can be set up before the project begins and evolve afterwards.

[Video] Research Data Management : Organise, Massey University

► Template 1

► Template 2

A simple structural template for organizing all project data, publications and administrative documents.
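A folder tree of this kind can be created in one go at the start of a project. The sketch below is not the template referred to above: the folder names are hypothetical placeholders covering data, documentation, publications and administrative documents, to be replaced with your own structure.

```python
from pathlib import Path

# Hypothetical layout; adapt the folder names to your own project.
SKELETON = [
    "01_admin",
    "02_data/raw",
    "02_data/processed",
    "03_documentation",
    "04_publications",
]


def create_skeleton(root: Path) -> list[Path]:
    """Create the folders under root; existing folders are left untouched."""
    created = []
    for relative in SKELETON:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Numbering the top-level folders keeps them in a fixed order in any file browser.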

► Tools

Zotero
Bibliographic management software for organizing and sorting your files using collections, tags and saved searches. Wonderful training courses organized by the library.
Archifiltre
Audit, sort, detect duplicates and visualize your server trees. It can be used to clean up obsolete files and limit storage costs and information loss. Useful for reorganizing trees, clean-up days or archiving operations.

Easy to find data = documented files

Your first step is often to find data, to reuse other people's data.

But to be findable, data must be documented and referenced by search engines. Documenting all data processing (harvesting, cleaning, merging, coding, etc.) enables you to trace the various stages of your work, making it easier for search engines to understand your data.

Some search engines specialize in research data, such as BASE, an academic search engine managed by the library of the University of Bielefeld (Germany), which indexes over 8.5 million datasets from more than 8,800 sources and supports multilingual search. Some of the indexed sites are supplied by the researchers themselves, and repositories may be incompletely or imperfectly referenced. It is an early form of a single entry point for finding datasets.

► ElasticSearch is a search engine that integrates, harmonizes and interconnects datasets and publications from different sources.

Questions to ask yourself

How were the data compiled? Do they come from archives you've scoured, interviews from a survey you've conducted, a financial database you've purchased, or the web?
How have you processed your data? How did you go from raw data to refined data?
What variables are used? How are they structured in your files? Are they dates, numbers or text?
Who was involved in the various stages of data entry, processing, coordination, etc.?
What was the data collection protocol? What measures have been taken to reduce risks, detect errors and ensure the scientific validity of the data? Have quality assurance (before the start of the project) and quality control (during and after the project) been planned?

A codebook is a technical description of the data collected for a specific purpose to feed one or more datasets. It describes the organization of the data (folders/files) and the meaning of the variables. Codebooks often include a description of the study (who, why, how), the sampling method (universe, criteria, response rate), information on the files (number of observations, record length, number of records per observation), the structure of the data within the file (hierarchical, multiple cards...), the meaning of the variables, the format, and instructions for using and interpreting the data, with the text of questions and answers in appendices.
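A first draft of the variable list in a codebook can be generated from the data file itself. The sketch below reads CSV text and guesses, for each column, whether it holds numbers, dates or text. The function names, the assumed date format (YYYY-MM-DD) and the sample data are hypothetical; the meaning of each variable still has to be written by the researcher.

```python
import csv
import io
from datetime import datetime


def guess_type(values: list[str]) -> str:
    """Classify a column as "number", "date" (YYYY-MM-DD) or "text"."""
    def all_match(parser) -> bool:
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if not values:
        return "text"
    if all_match(float):
        return "number"
    if all_match(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


def draft_codebook(csv_text: str) -> list[dict]:
    """One entry per variable: name, guessed type, count of non-empty values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    codebook = []
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows if row[name]]
        codebook.append({"variable": name, "type": guess_type(values), "n": len(values)})
    return codebook


# Hypothetical sample data, for illustration only.
sample = "id,interview_date,organization\n1,2020-02-02,ONG\n2,2020-02-03,Syndicat\n"
for entry in draft_codebook(sample):
    print(entry)
```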

[Video] Tips on Documentation, John MacInnes, Professor of Sociology, University of Edinburgh (MANTRA)

Cleaned data

OpenRefine cleans, prepares and enriches your CSV or other data. Analyze data columns and correct errors across a whole set in one go: date formats, multiple representations of the same data, duplicate records, redundant data, unnecessary spaces...

► If you use your own data cleansing software, don't forget to document it. Attach scripts to the data repository to facilitate replication and verification of your data.
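One way to make a home-made cleaning script self-documenting is to have it return a log of what it did alongside the cleaned data. The sketch below is an illustration only: the function name and the two cleaning steps (trimming unnecessary spaces, dropping exact duplicate records) are assumptions, and it expects every value to be a string.

```python
def clean_rows(rows: list[dict]) -> tuple[list[dict], list[str]]:
    """Trim whitespace and drop exact duplicate records; return (rows, log)."""
    log = []

    # Step 1: remove leading and trailing spaces from every value.
    trimmed = [{k: v.strip() for k, v in row.items()} for row in rows]
    changed = sum(1 for before, after in zip(rows, trimmed) if before != after)
    log.append(f"trimmed whitespace in {changed} of {len(rows)} records")

    # Step 2: keep only the first occurrence of each identical record.
    seen, unique = set(), []
    for row in trimmed:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    log.append(f"removed {len(trimmed) - len(unique)} duplicate records")
    return unique, log


cleaned, log = clean_rows([{"org": " ONG "}, {"org": "ONG"}, {"org": "Syndicat"}])
print(cleaned)
print(log)
```

The log can be saved next to the script in the data repository so that others can verify each processing step.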

Annotated data

Dicto is an application for annotating, analyzing and publishing video and audio media: interviews, media analyses, oral communications (conferences, seminars, speeches).
