Sciences Po | Library


Make your data Accessible

  Stored on 2 reliable media, 1 of which is remote, during the project;

  In a trustworthy dissemination repository;

  Archived in part at the end of the project;

  Focus on data mining and visualisation.

Store and share: where, when, how

Accessible data = stored on 2 reliable media, 1 of which is remote, during the project

 

Risk: not being able to access information because the physical medium has been destroyed; losing your belongings (for example, a USB stick containing 10 years' worth of work) in a move, fire, burglary, data theft, etc.

The Dropbox case: be wary of relying on it for backups, because it is a commercial tool. What happens if the company that produces it goes bankrupt? Do not store personal or sensitive data in Dropbox: there is a risk that the owners of the storage solution will access and reuse the data.

Remedy: 3 2 1: the golden rule of data storage!

The 3 2 1 rule consists of keeping 3 identical copies on 2 different media (USB stick, external hard disk, institutional servers suitable for sensitive data, cloud storage), 1 of which is kept in a different location (outside the office). The CASD server (Centre d'Accès Sécurisé aux Données) gives paid access to massive confidential public data. Identification is in two stages, using fingerprints and biometrics, and storage is on a computer not connected to the Internet.
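The 3 2 1 rule can be checked mechanically. The sketch below is only an illustration: the `Copy` record, the medium labels and the `check_321` helper are hypothetical names, and comparing SHA-256 checksums is one possible way to confirm that the copies are identical.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Copy:
    path: Path      # where this copy of the file lives
    medium: str     # hypothetical label, e.g. "laptop", "external-disk"
    offsite: bool   # True if the copy is kept outside the office


def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_321(copies: list[Copy]) -> list[str]:
    """Return the list of problems found; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 different media")
    if not any(c.offsite for c in copies):
        problems.append("no copy in a different location")
    if len({sha256(c.path) for c in copies}) > 1:
        problems.append("copies are not identical")
    return problems
```

Running such a check after each backup helps detect a copy that has silently diverged from the others.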

Retention period: unregulated.

Advice: IT Departments

Keep in mind the question of where data is hosted, as cloud computing leads to the internationalization of information carriers, which can lead to conflicts between legal systems (personal data protection, copyright, liability). Remember to consult the terms and conditions of the tools you use. If you're working with American researchers, keep in mind that their personal data legislation differs greatly from the GDPR. Data collected for a French laboratory should not be stored on American servers.

Choose strong passwords with a high level of protection (long string of characters, capital letters, numbers, special characters). Never use them twice; don't use your birthday, initials, street name, hobbies or passions. Choose an up-to-date operating system and software, and encrypted spaces and data.
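As an illustration of the password advice above, here is a minimal sketch using Python's standard `secrets` module. The function name `generate_password` and the default length of 20 characters are assumptions made for the example, not a Sciences Po requirement.

```python
import secrets
import string

# Letters, digits and special characters, as recommended above.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length: int = 20) -> str:
    """Draw a random password that contains all four character classes."""
    if length < 4:
        raise ValueError("length must be at least 4")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # Redraw until lowercase, uppercase, digit and special are all present.
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in string.punctuation for c in candidate)):
            return candidate
```

A password manager does the same job and also stores the result, which makes it easier to use a different password for every service.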

Favor the use of robust storage systems (with stable performance), with automatic backup, like those used by Sciences Po. Take care to maintain your hardware (updates, etc.).

Choose a good antivirus, since infected programs are easily downloaded and can cause your PC to break down.

How to store and secure your data during your research

Some good practices to keep in mind: strong passwords, an up-to-date operating system and software, encrypted spaces and data.

Storage solutions compared: advantages, risks / precautions, recommendation

Local storage (computer, laptop)
  • Advantages: easy to manage and to protect from unauthorised access.
  • Risks / Precautions: not sufficient if data are stored on only one device (=> backup is needed). Laptops can be stolen.
  • Recommendation: hard-drive encryption is mandatory. Back up your computer.

External hard-drive
  • Advantages: useful to exchange data without transmitting them over the Internet.
  • Risks / Precautions: easily lost, stolen and damaged.
  • Recommendation: use preferably for temporary storage.

Network shared drive on research center server / Network Attached Storage
  • Advantages: automated backups. Centrally stored. High storage capacity.
  • Risks / Precautions: cannot be accessed by external people.
  • Recommendation: for long-term storage.

Google Drive - Cloud storage provided by Sciences Po
  • Advantages: can be accessed by external people (if they have a Google email address). Automatic backup.
  • Risks / Precautions: storage in the E.U. not guaranteed => conflict with GDPR. Not suitable for all research projects. Control access when sharing.
  • Recommendation: encrypt personal data before uploading them to the cloud (compliance with GDPR).

Other cloud storage managed by a university or CNRS
  • Advantages: secure in case of E.U. storage.
  • Risks / Precautions: size may be limited.
  • Recommendation: may be secure and appropriate.

Cloud storage without any agreement (e.g. Dropbox)
  • Advantages: widespread use. Not depending on an email provider.
  • Risks / Precautions: free services by commercial providers may claim rights to use content you manage and share them for their own purposes.
  • Recommendation: risky. Not recommended for sensitive data.

► Google Drive for Education

Your Google Sciences Po account is not your private Google account. As part of the Google Apps for Education business agreement, you benefit from special terms and conditions of use. Google Drive Sciences Po is better than any private solution (private Google Drive, Dropbox, Orange Box, Facebook, etc.). More information: Google Apps [restricted access].

 

More and more establishments are offering collaborative sharing and storage services such as SaaS (Software as a Service), cloud computing or desktop virtualization. Examples: Renater, Open Science Framework, cumulus.

 

► Huma-Num

 For more information, visit the dedicated page.  


 

► My CoRe - CNRS

Member of a research unit? Use MyCoRe services, a space for "individual storage and backup, mobility, and secure sharing." Store up to 100MB, synchronize your computers, and share your files. Several types of accounts are offered: individual, service (for teams, projects), and guest. A guest (not listed in Janus) can only upload files < 10MB to the CNRS Cloud. This limit is very low for audio files. If MyCoRe is not sufficient, CNRS allows the use of Cryptomator software, which enables the creation of a vault on the cloud (and therefore on Google Drive). The software is free and can be installed on Windows and Mac.

Tip for audio files: Create a shared folder on Google Drive, accessible to all project members. This folder can be encrypted at creation and accessed only by password using Cryptomator.

 
► ODS - CNRS

All the info about ODS.


► Other storage tools with automatic backup: SyncBack, Cobian, Macrium Reflect, Open Science Framework.

► FileSender

  • Authentication via the Education-Research identity federation
  • Fast file submission to one or more correspondents
  • Viewing uploaded files
  • Invitation of correspondents to upload files to their personal file repository space.
► 7-zip

Do you use or share confidential or personal data? Use 7-zip! This tool, recommended by Sciences Po, lets you encrypt a document and meets the need for secure transfer to third parties. See the procedure [restricted access]. Encryption relies on a password: choose one you will still be able to recall years later, since the data cannot be decrypted without it. For any question, ask sos@sciencespo.fr.

► Other tools

Accessible data = in a trustworthy dissemination repository

 

Risk: not being able to access the supporting data for an article because you're not part of the right research community, having your data reused without being credited, having your analyses questioned...

Data.sciencespo is an institutional repository based on Harvard's Dataverse software. You can choose between access open to all, access on request from research teams, or restricted access. Each dataset receives a persistent DOI/URL.
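Because the repository runs on Dataverse, public datasets can in principle be looked up through the Dataverse Search API (`/api/search`). The sketch below rests on assumptions: the base address `https://data.sciencespo.fr` is assumed rather than taken from this guide, and the helper names are hypothetical. Restricted datasets will not necessarily appear in the results.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://data.sciencespo.fr"  # assumed address of the repository


def build_search_url(query: str, per_page: int = 10,
                     base_url: str = BASE_URL) -> str:
    """Build a Dataverse Search API URL restricted to datasets."""
    params = urlencode({"q": query, "type": "dataset", "per_page": per_page})
    return f"{base_url}/api/search?{params}"


def search_datasets(query: str, per_page: int = 10) -> list[tuple[str, str]]:
    """Return (title, persistent identifier) pairs. Needs network access."""
    with urlopen(build_search_url(query, per_page), timeout=30) as response:
        payload = json.load(response)
    return [(item["name"], item.get("global_id", ""))
            for item in payload["data"]["items"]]
```

Calling `search_datasets("election")` would list matching dataset titles with their DOI, which is the identifier to cite in a publication.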

Re3Data is a data repository directory. You are not obliged to use the Sciences Po repository. 

Other repositories:

  • Zenodo: this multidisciplinary repository is funded by the European Commission. Problem: it is heavily used by predatory publishers, as there is no moderation. You can choose an existing disciplinary community or create your own.
  • Nakala: CNRS repository
  • Specialized, disciplinary repositories: data sharing is an integral part of research practices in certain disciplines (e.g. astronomy, genetics, environment, PANGAEA in geology). According to the CNRS, they contained 2% of data in 2014.
  • Private repositories belonging to the Research and Development sections of companies: these correspond to personal initiatives to make data available to the scientific community, for example in the event of a pandemic.

Some journals require the following data to be deposited in a repository:

  1. the data on which the article is based
  2. the data on which an article's conclusions are based. 

Whatever the repository, a good one:

  1. Has a clear business model, visible to all, widely shared: the "dataverse" solution developed by Harvard is used for data.sciencespo, also adopted by Lorraine, Paris 8…
  2. Offers guarantees in terms of indexing and discoverability (so that you can find the dataset you want) and content preservation.

That's what data.sciencespo is here for, and we're here to help!

 

How to deposit your data after your project in data.sciencespo, our repository


► Why deposit? You ensure that your data are preserved, reused, and shared, thanks to their documentation and associated metadata.

► Depositing = sharing? The choice is yours between open or restricted access. You can choose the persons with whom you wish to share your data (collaborators, reviewers, etc.).  

► Which data? All the data you have produced, or a selection of the data you have collected.

► How to download and deposit your data? Contact us! See also some guidance.

Be careful with formats! Which file formats are best for tabulated data, text, sound, images, video, etc.?
This question should be asked from the very outset of your research so that you can reuse, share, and preserve your data.
Follow the recommendations of the UK Data Service:
 Recommended formats

What about code and software?

Accessible data = archived in part at the end of the project

Risk: obsolescence within about 5 years if you do nothing.

Don't confuse storage during the project with long-term archiving solutions for 10, 20 or 30 years after the project. 

Long-term archiving involves copying media, migrating formats across multiple generations of media and technologies, storing multiple copies using the most widely used, free and open technologies, diversifying media, and storing media in different machine rooms. These operations require data selection: you need to decide which datasets will be archived because they are unique, and which will be destroyed at the end of the project because they are easily reproducible. Archived data must be of recognized scientific value to the scientific community from which they originate.

Selection criteria: potential for scientific re-use, evidential value, historical value, etc.

Retention periods are regulated. Find out more in the CNIL practice guide.

OAIS model (Open Archival Information System): in English.

More info:
Digital Curation Centre (DCC): Whyte, A. & Wilson, A. (2010). How to appraise and select research data for curation. DCC How-to Guides. Edinburgh: Digital Curation Centre.

Accessible data = TDM: Text and Data Mining

Data mining means being a cyber-archaeologist, extracting information relevant to a subject from the wealth of information available, and making connections and choices that make this wealth of information intelligible to humans.

What is TDM? It's an automated analysis of digital information, involving the extraction of knowledge through a learning or statistical algorithm based on criteria of novelty, occurrences and similarity.
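To make the "occurrences and similarity" criteria concrete, here is a minimal sketch using only the Python standard library. It counts word occurrences and compares two texts with cosine similarity. This is one common technique chosen for illustration, not the method of any particular TDM infrastructure, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Count occurrences of each word (letters only, case-insensitive)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors, from 0.0 to 1.0."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)
```

Two abstracts that share most of their vocabulary score close to 1.0, while texts with no word in common score 0.0. Real pipelines usually add steps such as stop-word removal and term weighting.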

Why TDM? The growing volume of scientific literature and research data is leading to a massification of information. A number of digital tools make it easier to consult, exploit and cross-reference information than would be possible manually, and thus facilitate the acquisition of new knowledge, the discovery of new trends and cross-disciplinary research.

TDM: the legal framework? Article 38 of the Law for a Digital Republic aims to make up for the absence of a clear legal framework. It introduces an exception to copyright and database producer's rights, limited to scientific texts and data (unlike American Fair Use). This right to TDM has been present since 2014 in the guidelines of the H2020 research funding program. It is reflected in the existence of French (Istex, 21 million resources structured, enriched and divided into major disciplines) and European (OpenMinTed) TDM infrastructures. Some research projects use Istex corpora and mining tools as study sources, for example the Terre-Istex project, combining geographic information systems (GIS) and geology, or Unitex-Castys, in linguistics.

Among the software used, the aptly named Grobid (Generation of Bibliographic Data) extracts and analyzes content such as bibliographic information, and offers a statistical analysis of recurring terms.

A few vocabulary points. The Data Lake stores data without pre-processing and without any preconceived ideas as to its nature or subsequent use. It corresponds to a set of unstructured data from multiple sources (hence the image of the lake), so massive that it is impossible to process or analyze by the human mind or conventional information tools. For information, structured data are included in a relational database, with tables and columns; semi-structured data use formats such as .csv, .xml or .json; documents, .pdf files and e-mails are unstructured data. The data lake can be the starting point for dematerialized collaborative approaches between researchers and promote open science. The data swamp corresponds to a less organized and less clean state of data than the data lake, often for data that is inaccessible or of little value.
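The three categories above can be turned into a quick inventory of a project folder. The sketch below follows the definitions given in the text. The extensions chosen for relational databases (`.sqlite`, `.db`) and for documents and e-mails (`.docx`, `.txt`, `.eml`) are assumptions added for illustration.

```python
from pathlib import Path

# Relational database files (extensions are an assumption for this example).
STRUCTURED = {".sqlite", ".db"}
# Formats named in the text as semi-structured.
SEMI_STRUCTURED = {".csv", ".xml", ".json"}
# .pdf comes from the text; the other extensions are assumed examples.
UNSTRUCTURED = {".pdf", ".docx", ".txt", ".eml"}


def classify(filename: str) -> str:
    """Classify a file by its extension, ignoring case."""
    suffix = Path(filename).suffix.lower()
    if suffix in STRUCTURED:
        return "structured"
    if suffix in SEMI_STRUCTURED:
        return "semi-structured"
    if suffix in UNSTRUCTURED:
        return "unstructured"
    return "unknown"


def inventory(filenames: list[str]) -> dict[str, list[str]]:
    """Group file names by category, keeping their original order."""
    groups: dict[str, list[str]] = {}
    for name in filenames:
        groups.setdefault(classify(name), []).append(name)
    return groups
```

An extension is only a hint about the content, so the result should be read as a first pass rather than a definitive classification.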

EXAMPLE: 1,000,000 is the number of Twitter accounts analyzed as part of the Participate project (Sciences Po/Harvard), which aims to understand why low-income individuals finance election campaigns in the USA, Great Britain and France!

Accessible data = visualisation

 

Data visualization means being a cyber-painter, telling a story, making colors dance so that the essence of information mining becomes clearer at a glance. But beware of magnificent shows that make no sense at all.

What is data visualization? This graphic representation of information and data, maps and infographics, is intended to be a powerful narrative with a precise goal, for example:

  • To help you make decisions;
  • To highlight possible relationships between different data and derive statistical information from them
  • To show what makes data homogeneous. Ideally, it enables you to focus on the relevant data alone.

The most common types of graphics: diagrams (circular, concrete), tables, dashboards, flow charts, infographics, mind maps: these forms can be used to analyze the characteristics of populations of different sizes on the same indicator.

More specific graphics: area charts, polar charts, bar charts, bullet charts, whisker charts, bubble clouds, cartograms, circular views, point distribution charts, Gantt charts, line charts, histograms, matrices, networks, radial trees, scatter plots, word clouds.

All combinable!

Graphs are used, for example, to pinpoint the results of opinion surveys.

Examples:

  • GarganText: a tool for terminological visualization of a textual corpus, it produces interactive maps that evolve as you work on them. It can be used to build a thematic map of words to feed a "state of the art" type article, without missing a key theme in a given issue. The larger the dot, the more central the term is in the network of relationships between terms. The tool has recently been used to create an interactive map of coronavirus research and its links with other diseases, to analyze thousands of articles, to highlight the themes addressed and their organization, and to identify the most representative terms in the corpus, in this case "effective vaccines".
  • According to its creators, the tool could also be used to map the political programs of election candidates.

To find out more, click here.
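The idea that a larger dot marks a more central term can be illustrated with a simple co-occurrence network. The sketch below is not GarganText's actual algorithm: it assumes that two terms are linked when they appear in the same document, and it uses the number of linked terms (degree) as a rough centrality score. The function name and the single-word, lowercase term list are assumptions of the example.

```python
import re
from collections import defaultdict
from itertools import combinations


def cooccurrence_degree(documents: list[str], terms: list[str]) -> dict[str, int]:
    """For each term, count how many other terms it shares a document with."""
    neighbours: dict[str, set[str]] = defaultdict(set)
    for text in documents:
        words = set(re.findall(r"[^\W\d_]+", text.lower()))
        present = [term for term in terms if term in words]
        # Link every pair of listed terms found in the same document.
        for first, second in combinations(present, 2):
            neighbours[first].add(second)
            neighbours[second].add(first)
    return {term: len(neighbours[term]) for term in terms}
```

In a map drawn from these scores, the term with the highest degree would get the largest dot.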

Tip: beware of "flashy" visualizations that miss the point and don't meet the need. Visualization can influence the effectiveness and credibility of the message.

Tools from the médialab

The médialab helps social science and humanities researchers make the most of the mass of data made available by digital technology. It has three main missions that are highly integrated: methodology, analysis, theory. The team develops a large number of software programs that make it possible to organize, automate and visualize research on natively digital or digitized data. Here they are: medialab.sciencespo.fr/tools/

Last updated: Apr 29, 2025 3:05 PM