Stored on 2 reliable media, 1 of which is remote, during the project;
In a trustworthy dissemination repository;
Archived in part at the end of the project;
Focus on data mining and visualisation.
Risk: not being able to access information because the physical medium has been destroyed; losing your belongings (for instance, a USB stick containing 10 years' worth of work) in a move, fire, burglary, data theft, etc.
The Dropbox case: be wary of relying on it as a backup procedure, because it is a commercial tool. What happens if the company that produces it goes bankrupt? Do not store personal or sensitive data in Dropbox: there is a risk that the owners of the storage solution will retrieve the data.
Remedy: 3-2-1, the golden rule of data storage!
The 3-2-1 rule consists of keeping 3 identical copies on 2 different media (USB stick, external hard disk, institutional servers suitable for sensitive data, cloud), 1 of which is in a different location (outside the office). The CASD server (Centre d'Accès Sécurisé aux Données) gives paid access to massive confidential public data. Identification is in two stages, using fingerprints and biometrics, and storage is on a computer not connected to the Internet.
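The 3-2-1 rule can be checked mechanically. The sketch below is only an illustration: the `Copy` record, the medium labels, and the `check_321` helper are hypothetical names, and SHA-256 checksums are one possible way to test that the copies are identical.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Copy:
    path: Path      # where this copy of the file lives
    medium: str     # hypothetical label, e.g. "usb", "institutional_server", "cloud"
    offsite: bool   # True if the copy is kept outside the office


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_321(copies: list[Copy]) -> list[str]:
    """Return the list of problems found; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 different media")
    if not any(c.offsite for c in copies):
        problems.append("no copy in a different location")
    if len({sha256_of(c.path) for c in copies}) > 1:
        problems.append("copies are not identical")
    return problems
```

Running such a check on a schedule, rather than once, is what catches a copy that has silently drifted or been corrupted.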
Retention period: unregulated.
Advice: IT Departments
Keep in mind the question of where data is hosted, as cloud computing leads to the internationalization of information carriers, which can lead to conflicts between legal systems (personal data protection, copyright, liability). Remember to consult the terms and conditions of the tools you use. If you're working with American researchers, keep in mind that their personal data legislation differs greatly from the GDPR. Data collected for a French laboratory should not be stored on American servers.
Choose strong passwords with a high level of protection (long string of characters, capital letters, numbers, special characters). Never use them twice; don't use your birthday, initials, street name, hobbies or passions. Choose an up-to-date operating system and software, and encrypted spaces and data.
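As an illustration of the password advice above, here is a minimal sketch using Python's standard `secrets` module. The `generate_password` name and the default length of 20 are assumptions made for the example, not an institutional requirement; a password manager achieves the same result.

```python
import secrets
import string

# Letters (both cases), digits and special characters.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length: int = 20) -> str:
    """Draw random passwords until one mixes all four character classes."""
    if length < 4:
        raise ValueError("length must be at least 4 to fit every character class")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in string.punctuation for c in candidate)):
            return candidate
```

Generating a fresh password per service is what makes "never use them twice" practical.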
Favor the use of robust storage systems (with stable performance), with automatic backup, like those used by Sciences Po. Take care to maintain your hardware (updates, etc.).
Be careful to choose a good anti-virus, as PC breakdowns make it easier to download infected programs.
And some good practices to have in mind: strong password choices, up-to-date operating system and software, encrypted spaces and data.
Storage solution | Advantages | Risks / Precautions | Recommendation |
---|---|---|---|
Local storage (computer, laptop) | Easy to manage and to protect from unauthorised access | Not sufficient if data are stored on only one device (=> backup is needed). Laptops can be stolen. Hard-drive encryption is mandatory | Back up your computer |
External hard drive | Useful to exchange data without transmitting them over the Internet | Easily lost, stolen and damaged | Use preferably for temporary storage |
Network shared drive on research center server / Network Attached Storage | Automated backups. Centrally stored. High storage capacity | Cannot be accessed by external people | For long-term storage |
Google Drive - Cloud storage provided by Sciences Po | Can be accessed by external people (if they have a Google email address). Automatic backup | Storage in E.U. not guaranteed => conflict with GDPR. Not suitable for all research projects. Control access when sharing | Encrypt personal data before uploading them to the cloud (compliance with GDPR) |
Other cloud storage managed by a university or CNRS | Secure in case of E.U. storage | Size may be limited | May be secure and appropriate |
Cloud storage without any agreement (e.g. Dropbox) | Widespread use. Not depending on an email provider | Free services by commercial providers may claim rights to use content you manage and share them for their own purposes | Risky. Not recommended for sensitive data |
► Google Drive for Education
Your Sciences Po Google account is not your private Google account. As part of the Google Apps for Education business agreement, you benefit from special terms and conditions of use. Google Drive Sciences Po is better than any other private solution (private Google Drive, Dropbox, Orange Box, Facebook, etc.). More information: Google Apps [restricted access].
More and more institutions are offering collaborative sharing and storage services such as SaaS (Software as a Service), cloud computing or desktop virtualization. Examples: Renater, Open Science Framework, cumulus.
► Huma-Num
For more information, visit the dedicated page.
Member of a research unit? Use MyCoRe services, a space for "individual storage and backup, mobility, and secure sharing." Store up to 100 MB, synchronize your computers, and share your files. Several types of accounts are offered: individual, service (for teams, projects), and guest. A guest (not listed in Janus) can only upload files smaller than 10 MB to the CNRS Cloud, which is very small for audio files. If MyCoRe is not sufficient, CNRS allows the use of Cryptomator software, which enables the creation of a vault on the cloud (and therefore on Google Drive). The software is free and can be installed on Windows and Mac.
Tip for audio files: Create a shared folder on Google Drive, accessible to all project members. This folder can be encrypted at creation and accessed only by password using Cryptomator.
All the info about ODS.
► Other storage tools with automatic backup: SyncBack, Cobian, Macrium Reflect, Open Science Framework.
► FileSender
Do you use or share confidential or personal data? Use 7-zip! This tool, recommended by Sciences Po, allows you to encrypt a document and meets the need for secure transfer to third parties. The procedure [restricted access]. Encrypting your data requires a password that you must still be able to recall years later. For any questions, contact sos@sciencespo.fr.
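As a rough illustration of encrypting files with 7-zip from a script, the sketch below builds a command line for the 7-Zip console program. It is not the Sciences Po procedure: the helper names are hypothetical, and it assumes the executable is installed and available as `7z` (on some systems it is named `7za` or `7zz`). Passing `-p` without a value makes 7-Zip ask for the password interactively, so the password never appears on the command line.

```python
import subprocess
from pathlib import Path


def build_7z_command(archive: Path, files: list[Path]) -> list[str]:
    """Build a 7z call that creates an encrypted .7z archive.

    "a" adds files to an archive, "-p" with no value makes 7z prompt for
    the password, and "-mhe=on" also encrypts the file names (7z format).
    """
    return ["7z", "a", "-p", "-mhe=on", str(archive), *map(str, files)]


def encrypt(archive: Path, files: list[Path]) -> None:
    """Run 7z; raises CalledProcessError if 7z reports a failure."""
    subprocess.run(build_7z_command(archive, files), check=True)
```

For example, `encrypt(Path("interviews.7z"), [Path("interviews.csv")])` would prompt for a password and write the encrypted archive next to the source file.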
Risk: not being able to access the supporting data for an article because you're not part of the right research community, having your data reallocated without even crediting you, having your analyses questioned...
Data.sciencespo is an institutional repository based on Harvard's Dataverse solution. You can choose between access open to all, access on request from research teams, or restricted access. Each deposit receives a persistent DOI/URL.
Re3Data is a data repository directory. You are not obliged to use the Sciences Po repository.
Other repositories:
Some journals require the following data to be deposited in a repository:
Whatever the repository, a good one:
That's what Data.sciencespo is here for, and we're here to help!
► Why deposit? You ensure that your data are preserved, reused, and shared, thanks to their documentation and associated metadata.
► Depositing = sharing? The choice is yours between open or restricted access. You can choose the persons with whom you wish to share your data (collaborators, reviewers, etc.).
► Which data? All the data you have produced, a selection of the data you have collected.
► How to download and deposit your data? Contact us! And some guidance.
Be careful of the formats! Which file formats are best for tabulated data, text, sound, images, video, etc.?
This question should be posed from the very outset of your research in order to be able to reuse, share, and preserve your data.
Follow the recommendations of the UK Data Service: Recommended formats
Risk: 5 years of programmed obsolescence if you do nothing.
Don't confuse storage during the project with long-term archiving solutions for 10, 20 or 30 years after the project.
Long-term archiving involves copying media, migrating formats across multiple generations of media and technologies, storing multiple copies using the most widely used, free and open technologies, diversifying media, and storing media in different machine rooms. These operations require data selection: you need to decide which data sets will be archived because they are unique, and which will be destroyed at the end of the project because they are easily reproducible. Data must be of recognized scientific value to the scientific community from which they originate.
Selection criteria: potential for scientific re-use, evidential value, historical value, etc.
Retention periods are regulated. Find out more in the CNIL practice guide.
Data mining means being a cyber-archaeologist, extracting information relevant to a subject from the wealth of information available, and making connections and choices that make this wealth of information intelligible to humans.
What is TDM (text and data mining)? It's an automated analysis of digital information, involving the extraction of knowledge through a learning or statistical algorithm based on criteria of novelty, occurrences and similarity.
Why TDM? The growing volume of scientific literature and research data is leading to a massification of information. A number of digital tools make it easier to consult, exploit and cross-reference information than would be possible manually, and thus facilitate the acquisition of new knowledge, the discovery of new trends and cross-disciplinary research.
TDM: the legal framework? Article 38 of the Law for a Digital Republic aims to make up for the absence of a clear legal framework. It introduces an exception to copyright and database producer's rights, limited to scientific texts and data (unlike American Fair Use). This right to TDM has been present since 2014 in the guidelines of the H2020 research funding program. It is reflected in the existence of French (Istex, 21 million resources structured, enriched and divided into major disciplines) and European (OpenMinTeD) TDM infrastructures. Some research projects use Istex corpora and mining tools as study sources, for example the Terre-Istex project, combining geographic information systems (GIS) and geology, or Unitex-Castys, in linguistics.
Among the software used, the aptly named Grobid (Generation of Bibliographic Data) extracts and analyzes content such as bibliographic information, and offers a statistical analysis of recurring terms.
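To give a feel for what a statistical analysis of recurring terms involves, here is a minimal sketch in plain Python. It does not use Grobid or reproduce its method; the `recurring_terms` helper and the tiny stop-word list are assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def recurring_terms(text: str, top: int = 5) -> list[tuple[str, int]]:
    """Return the most frequent terms (stop words excluded) with their counts."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top)


sample = "Data mining extracts data. Mining of open data is text and data mining."
print(recurring_terms(sample, top=2))  # [('data', 4), ('mining', 3)]
```

Occurrence counts like these are only the first of the criteria TDM relies on; novelty and similarity call for comparing documents with one another.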
A few vocabulary points. The Data Lake stores data without pre-processing and without any preconceived ideas as to its nature or subsequent use. It corresponds to a set of unstructured data from multiple sources (hence the image of the lake), so massive that it is impossible to process or analyze by the human mind or conventional information tools. For information, structured data are included in a relational database, with tables and columns; semi-structured data use formats such as .csv, .xml or .json; documents, .pdf files and e-mails are unstructured data. The data lake can be the starting point for dematerialized collaborative approaches between researchers and promote open science. The data swamp corresponds to a less organized and less clean state of data than the data lake, often for data that is inaccessible or of little value.
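The distinction between structured, semi-structured and unstructured data can be sketched as a first-pass inventory of a folder by file extension. The mapping follows the paragraph above; the `.sqlite`, `.db` and `.eml` entries and the helper names are assumptions added for the example, and an extension is of course only a hint about what a file contains.

```python
from pathlib import Path

# Extension-to-category map. ".sqlite" and ".db" (relational databases) and
# ".eml" (e-mails) are illustrative assumptions, not an exhaustive list.
CATEGORIES = {
    ".sqlite": "structured",
    ".db": "structured",
    ".csv": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".pdf": "unstructured",
    ".eml": "unstructured",
}


def classify(filename: str) -> str:
    """Return the category for a file name, or "unknown" if not mapped."""
    return CATEGORIES.get(Path(filename).suffix.lower(), "unknown")


def inventory(filenames: list[str]) -> dict[str, list[str]]:
    """Group file names by category, keeping their original order."""
    groups: dict[str, list[str]] = {}
    for name in filenames:
        groups.setdefault(classify(name), []).append(name)
    return groups
```

An inventory of this kind is one simple way to notice when a data lake is drifting toward a data swamp: a growing "unknown" group is a sign that files are being stored without any record of what they are.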
EXAMPLE: 1,000,000 is the number of Twitter accounts analyzed as part of the Participate project (Sciences Po/Harvard), which aims to understand why low-income individuals finance election campaigns in the USA, Great Britain and France!
Data visualization means being a cyber-painter, telling a story, making colors dance so that the essence of information mining becomes clearer at a glance. But beware of magnificent shows that make no sense at all.
What is data visualization? This graphic representation of information and data, maps and infographics, is intended to be a powerful narrative with a precise goal, for example:
The most common types of graphics: diagrams (circular, concrete), tables, dashboards, flow charts, infographics, mind maps: these forms can be used to analyze the characteristics of populations of different sizes on the same indicator.
More specific graphics: area charts, polar charts, bar charts, bullet charts, whisker charts, bubble clouds, cartograms, circular views, point distribution charts, Gantt charts, line charts, histograms, matrices, networks, radial trees, scatter plots, word clouds.
All combinable!
Graphs are used, for example, to pinpoint the results of opinion surveys.
Examples:
To find out more, click here.
Tip: beware of "flashy" visualizations that miss the point and don't meet the need. Visualization can influence the effectiveness and credibility of the message.
The médialab helps social science and humanities researchers make the most of the mass of data made available by digital technology. It has three main missions that are highly integrated: methodology, analysis, theory. The team develops a large number of software programs that make it possible to organize, automate and visualize research on natively digital or digitized data. Here they are: medialab.sciencespo.fr/tools/