Sciences Po | Library

Thematic guides

Research data


Manage data

Data is much broader than the numbers in an Excel file. Research data management grew out of several movements that support research in France and abroad: Open Access (free access to scientific and technical information), the administration of scientific evidence to validate results through replication, the reuse of data sets, their enhancement and, more broadly, the safeguarding of scientific heritage. Policies (e.g. the French National Plan for Open Science, July 4, 2018), initiatives (e.g. GO FAIR) and networks (e.g. the Research Data Alliance) set the pace for research. The impact can be immediate, such as the deposit of open-access datasets required when submitting an article (American Journal of Political Science) or the drafting of a data management plan for Horizon Europe and ANR projects, or longer term. Developing management strategies to document, preserve, enhance and safeguard your data will allow the community to better understand your work and its results, but above all it will save you time and increase your visibility.

  • How has the data been constituted? Archives, interviews, surveys, financial databases, web-based data?
  • How have you processed your data? How did you go from raw to refined data?
  • What variables did you use? How are they structured in your files? By date, number, text?
  • Who participated in the various stages of work on the data: entry, processing, coordination...? 

Recording your answers to these questions from the outset of your research will enable you to reuse and share your data easily.

Some first insights regarding research data management:



Differences between big, open and research data

What is big data, anyway?

Big data covers everything from Amazon's recommendation systems to the study of social networks. 

These very voluminous data sets are difficult to handle with traditional database management tools. New tools are being used for data management, processing, analysis, visualization and storage:

  • Cloud computing: Google Drive, for example
  • Distributed computing: the processing of information or a program distributed over several microprocessors and, more generally, several CPUs
  • High-speed supercomputers

Examples of projects concerned:

  • High-throughput genome sequencing
  • Virtual astronomy observatories
  • Physics and energy simulation
  • Medical imaging
  • Biotic or abiotic environmental data
  • Economic data: e-commerce (Amazon's recommendation system, for example); decision-making systems (databases as decision support)
  • Social data: social networks, digital libraries, etc.


What is open data, anyway?

Open data aims to share public data in order to provide new services that are useful to all, and to ensure that the State is accountable to its citizens.

Open data is a movement and philosophy in favor of free access to information, publications and data. Its aim is to encourage the re-use of public data, i.e. data collected or produced by a public service as part of its mission using public funds (French law of 1978, circular of 1994). Public information is a common good, financed by the taxpayer and therefore collective; its dissemination is in the public and general interest. The stakes involved are therefore manifold:

  • Democratic issues: making public policies more transparent, disseminating essential public data (major legal texts, administrative information for the public), offering new services to improve citizens' daily lives (e.g. real-time information on peak times in public transport).
  • Political issues: enhancing the quality of administrative work, promoting a more efficient State, demonstrating accountability (the obligation of public authorities to report on their performance). Cf. Obama administration's Memorandum on Transparency and Openness; Open Government Directive
  • Economic issues: 
    • economic viability: non-exclusive use, positive externalities (an actor is favored by the action of third parties without having to act), conditions of pure and perfect competition (free and immediate information from all economic agents on all other agents and exchanged goods), use value…
    • economic growth: promoting job creation; start-ups that use public data to offer new services to citizens (e.g. transport).

► How usable is my data? A little e-learning module just for you!

► Tim O'Reilly suggests the creation of an open innovation platform that enables every citizen to contribute to solving collective problems by bringing up information and expertise disseminated within society.


What is research data, anyway?

 

 

 

 

 

 

 

 

 

[Image: infographic by Romain Couturier/Cyril Heude]

OECD (Organisation for Economic Co-operation and Development) definition:

  • Texts, sounds, images, figures collected and produced by the researcher with a view to writing an article or a book
  • Main research sources needed to validate results.

E.g.: Photographs, satellite images, diagrams, drawings, meteorological records, sound recordings, computer code, data hidden in the code (or in a separate layer)... But laboratory notebooks, preliminary analyses or samples do not fall into this category.

 

AAF (Association des archivistes français) definition:

  • All materials received and produced by research teams
  • Whether or not linked to publications
  • Raw data, organized and transformed so that it can be interpreted by someone unfamiliar with the project
  • Include a larger group: research archives (laboratory administrative archives, for example)
  • So it's essential to work hand in hand with archivists!

In the humanities and social sciences (SHS), research data includes quantitative data, which define trends and can be quantified, verified and made intelligible by statistical tools. It also includes qualitative data, which characterize, but do not measure, the properties of a fact or phenomenon.

Different types of research data

  









 

 

[Image: infographic by Romain Couturier/Cyril Heude]

 

 

This includes observation data (field recordings), which are often unique and irreplaceable, and therefore worth keeping/sharing for future research. Examples of useful tools for creating questionnaires: Qualtrics, Survalyzer, ModaLisa, LimeSurvey... 

Experimental data and compiled data are often reproducible, but at a dissuasive cost. Simulation models (economic, etc.) are often more useful than the simulation data they generate. Canonical data are organized, validated and widely used, such as INSEE data.

Details on data types:

  • Simulation: economic models, diagrams based on an economic reality, for example. The idea is to derive profitability from a particular share configuration and apply the model to a specific market. A model that works in one market is tested in another market. The data is less valuable than the model. Challenge: see how a model that applies to one reality can work in another.
  • Experimentation: in the laboratory, these data are obtained using specific equipment; they are reproducible, but at a dissuasive cost, and the process is time-consuming. Tip: make these data available to everyone, so that colleagues don't have to pay again to obtain the same result and others can save time.
  • Derived or compiled: compilations of raw data, results of text mining and data mining. The aim: to extract the essence of a subject. This can involve the creation of new data sets by combining data from multiple or immense sources, such as the analysis of a million Twitter accounts; a human alone would spend a lifetime on this and never finish. The trends identified by a data mining algorithm remain only a starting point for the indispensable human analysis. The process is expensive and time-consuming.
  • Reference or canonical: annotated, peer-reviewed datasets made available as reference data: statistical collections, INSEE data, genomics data.

Data management issues

[Video] Data Sharing and Management, NYU Health Sciences Library, Creative Commons licence


Scientific issues

  • Cumulative science: avoid duplication of effort (redoing what has already been done), save time, increase the lasting impact of your research, and facilitate cooperation and information sharing between partners in a collaborative project through robust and adapted storage systems;
  • Reproducibility: enable a different team to reproduce research results thanks to well-structured data;
  • Scientific proof: combat fraud and demonstrate the anteriority of your research;
  • Discovery: making data available enables the exploration of themes not considered by the original researchers. "Others than us will be able to make marvellous creations from it," says Tim Berners-Lee about sharing public data for citizen education (open data).


Financial stakes 

  • Research funders have an obligation to improve the return on investment of funded research. The widest possible dissemination is required. Eventually, data management activities will be taken into account in the appointment and promotion of researchers.

The adage: "as open as possible, as closed as necessary"

 

 

 

 

 

 

 


[Image: infographic by Cyril Heude/Romain Couturier]

 

Opening your data is not always a must; there are cases where it is legally obligatory to close it: 

  • industrial secrecy (the new revolutionary flavor): secrecy of processes, secrecy of economic and financial information, secrecy of commercial or industrial strategies: all sensitive elements that have a particular impact on the competitive environment of the establishment and its partners. They may not be disclosed to anyone other than the parties concerned, unless the information covered by this secrecy has been removed.
  • defense secrecy (interviewees risk their lives to help your research): incompatible with national security issues: interviewing al-Qaeda or the Bureau des Légendes at the DGSE.
  • the project does not generate data: yes, yes, it does occur!
  • Disseminating data could compromise the project's objective
  • protection of personal data under the GDPR (RGPD).
    • What is personal data? Any information relating to a natural person who can be identified, directly or indirectly, is personal data, regardless of whether this information is confidential or public. For such data to no longer be considered personal, it must be rendered anonymous in such a way as to make it impossible to identify the person concerned. If it is possible to identify someone by cross-referencing several pieces of information (age, gender, town, degree, etc.) or by using various technical means, the data is still considered personal.
    • What is sensitive data? The CNIL offers a definition to help you determine whether your project involves sensitive personal data.

Please note: these cases explain why access to data in repositories may be subject to restrictions or embargoes, depending on the nature of the data produced. These exceptions do not exempt you from drawing up a data management plan.

Techniques exist for dealing with these risks without overburdening the research: a consent form, data anonymization, a declaration of processing recorded in the register of the DPO (Data Protection Officer), or even an impact analysis for the most sensitive data may be required.

The content of interviews in sociology or anthropology, for example, may only be disseminated with the interviewees' authorization. One important document is the free and informed consent form signed by participants. It should establish that: participants may withdraw from the study at any time, without justification; the panel has been given time to ask questions; the interviewer has left contact details so that interviewees can go back on what they said; only the data necessary for the project has been collected and processed (principle of data minimization); participation is voluntary and participants have the right not to answer certain questions; and the study's stakes and the conditions for managing, sharing and archiving project data have been understood and accepted. Provide a version of the form in the interviewee's language.

Anonymising: understanding and practising

 

 

Anonymization must be sufficient to protect the confidentiality of personal data while ensuring the dissemination of information for research purposes.

Anonymizing consists of removing direct personal identifiers (name, address, social security number, etc.) or indirect ones (profession, ethnicity, etc.). A dedicated service exists at Sciences Po.

Examples of software to modify or delete personal and sensitive data: Gimp (images), Metadata anonymisation toolkit, ExifTool (all formats).
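As an illustration, stripping direct identifiers from tabular data can also be scripted. Here is a minimal Python sketch; the column names are hypothetical examples, and real anonymization must additionally address indirect identifiers and re-identification by cross-referencing:

```python
import csv
import io

# Example direct-identifier columns -- adapt this set to your own dataset.
DIRECT_IDENTIFIERS = {"name", "address", "social_security_number"}

def anonymize_rows(rows):
    """Return copies of dict rows with direct-identifier columns removed."""
    return [
        {k: v for k, v in row.items() if k.lower() not in DIRECT_IDENTIFIERS}
        for row in rows
    ]

def anonymize_csv(text):
    """Read CSV text, drop identifier columns, and return the cleaned CSV text."""
    rows = anonymize_rows(list(csv.DictReader(io.StringIO(text))))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Remember that removing direct identifiers alone is often not enough: as noted above, cross-referencing indirect attributes (age, town, degree, etc.) can still re-identify a person.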

[Image: infographic by Cyril Heude/Romain Couturier]

 

 

[Video] Anonymisation: theory and practice (part 1 of 3), Mark Elliot (NCRM), Creative Commons licence

Pseudonymising: understanding and practising

Pseudonymization is less reliable: the person concerned can still be identified by cross-checking. Pseudonymized data therefore remains personal data.

Advice: encrypt confidential data and store the encryption key in a different location from the data. Software: 7-Zip.
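To see the difference in practice, here is a minimal Python sketch of pseudonymization: a keyed hash replaces each identifier with a stable code, and the key, like the encryption key above, should be stored separately from the data. The function and key names are illustrative:

```python
import hashlib
import hmac

def pseudonymize(identifier, secret_key):
    """Replace an identifier with a stable pseudonym derived from a secret key.

    The same identifier always maps to the same pseudonym (so records stay
    linkable), but the mapping cannot be reversed without the key.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

# Example: keep this key in a different location from the data files.
key = b"store-me-somewhere-else"
```

Because the records remain linkable to a person by whoever holds the key, data processed this way is still personal data under the GDPR.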

For all these documents, templates, tools and people who can help you exist at Sciences Po. They are listed in the Sciences Po Data Guide, where you will find all the information.

Advice: share at least the data that supports the research articles published within the framework of the project, so that your readers can deepen their understanding of your analyses.

Different levels of processing

Metaphor 1: the pyramid

From the base to the top, we can observe different levels of data processing: raw data, collections of reference data (statistics), processed, selected, documented data, supporting data for publications. However, it can also be said that data is never truly raw: it always has a format, an author, a context and a signifying force induced by its own publication.

Metaphor 2: the funnel

The funnel metaphor is used to illustrate the discrepancy in quantity between the data produced, processed and retained: the data retained and quoted in the article are only a fraction of the processed data, which are themselves only a fraction of the data produced.

Focus on your field photographs

If you're a sociologist or historian who takes photographs in the field, here are a few recommendations for shooting with a camera or phone. Objective: better management of your images.


► Before shooting

Settings to be made once in the parameters of your camera or phone:

  • Format: prefer JPEG format without compression or with the lowest compression. 
  • Resolution: choose the highest resolution, e.g. 10 MP (megapixels) or 4096 x 2304
  • Date and time: Check that this information is up to date, especially if there is a time zone change and your device does not update this information automatically.
  • Author: Enter your initials or preferably your name in the author field, if possible.


► After shooting

Organize
Copy files to your computer or institutional Google Drive. Organize files in a tree structure (e.g. by location, subject, date).
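A date-based tree structure can be built automatically. A minimal Python sketch using each file's modification time; the folder layout and file extensions are assumptions to adapt:

```python
import datetime
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif"}  # adapt to your camera

def organize_by_date(source_dir, dest_dir):
    """Move image files into dest_dir/YYYY/MM folders by modification time."""
    for path in Path(source_dir).iterdir():
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            taken = datetime.date.fromtimestamp(path.stat().st_mtime)
            target = Path(dest_dir) / f"{taken.year:04d}" / f"{taken.month:02d}"
            target.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target / path.name))
```

Note that a file's modification time may differ from the shooting date; image management software that reads the embedded metadata is more reliable for dating.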

Rename files
See "Easy-to-find data" page.

Document

Would you like to assign keywords, captions and other attributes to your files? Edit the metadata: it's best to use image management software to edit metadata. This will enable you to make batch modifications, such as associating a keyword with a series of images. Without software, you can also edit your metadata directly from the settings. You can also use a table-type tracking file to note shooting location, context (street rendezvous, regional archives, etc.) or even the names of people or contacts.
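The table-type tracking file mentioned above can also be kept up to date with a short script. A minimal Python sketch writing one CSV row per photograph; the column names are illustrative:

```python
import csv
import os

FIELDS = ["filename", "location", "context", "people"]  # illustrative columns

def log_photo(tracking_file, filename, location, context, people=""):
    """Append one row describing a photograph to a CSV tracking file."""
    write_header = not os.path.exists(tracking_file)
    with open(tracking_file, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"filename": filename, "location": location,
                         "context": context, "people": people})
```

Calling `log_photo` after each import session keeps the shooting context next to the files, in a format any spreadsheet tool can open.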


Tools

Last updated: Apr 29, 2025 3:05 PM