Research data: a strategic subject

News from committee

Data is a strategic issue and was the subject of a report commissioned by the French Prime Minister. ‘Pour une politique publique de la donnée’ (For a public policy on data) was written by the MP Eric Bothorel and published in December 2020. It strongly emphasises the importance of scientific data as a vector of knowledge. It notably states that “if [the] culture of sharing [data] between research teams had been better established, the management and handling of Covid-19 would certainly have been more effective and reactive during the crisis.”

The European Commission report ‘Cost-benefit analysis for FAIR research data – Cost of not having FAIR research data’, published in 2019, estimates that the mismanagement of research data has cost France €3 billion due to wasted time, non-optimised storage costs, licensing fees, duplicated research and a lack of cross-fertilisation.

A national research data platform will therefore be set up to respond in part to this strategic challenge. This is a commitment set out in the first National Plan for Open Science and was reaffirmed by Frédérique Vidal in December 2020 with the announcement of the creation of “a data repository to store orphan data, known as the long tail, whose weight in bytes is low but whose scientific weight can turn out to be major. When we think of research into rare diseases or the work of paleoanthropologists, clearly there is no such thing as small or negligible data.”

Research data management – a critical situation

The current state of research data management is fairly critical. Consultations with researchers during preparatory studies for this project revealed the difficulties they are facing. These include storing data on individual media (hard disk, USB key, etc.), losing data when a researcher leaves a laboratory, the lack of a sustainable solution for storing and opening data collected during a research project, and the impossibility of reproducing scientific research results because the data and code are inaccessible, poorly documented or not reusable. An INSERM research director speaks of a “chasm of unused data”. The proposal presented today stems from political will and also aims to respond to the daily difficulties faced by researchers.

Research data: an issue of major economic importance

The economic stakes linked to research data are very high. 24 million data items have a DOI assigned by DataCite, and the major actors among their producers (those with more than 1 million DOIs assigned) are very large research organisations like CERN or ETH Zurich and scientific publishing giants like Figshare, which belongs to the Springer Nature group, or Sage Publishing. France, by contrast, has only registered 225,000 DOIs. Amazon has created an ‘Open data sponsorship programme’ which provides free access to a number of services when open datasets are deposited. Major players such as Google (with Dataset Search) and Elsevier (with Mendeley Data) have launched data search engines. The Elsevier group no longer presents itself as a publisher but as a “data analysis” company which is now present throughout the full research cycle.

A proliferation of data repositories in France and around the world but still no solution for many scientific communities

In the framework of the preparatory work for the new roadmap for research infrastructures, the questionnaire sent to the infrastructures concerned showed that only 33% of them said they had a data policy (which did not always cover the entire life cycle) and only 28% deposited data in a repository. Paradoxically, however, such repositories are proliferating, with over 3,600 worldwide and at least 110 in France, including half a dozen existing institutional repositories and several others being set up. This proliferation greatly reduces the visibility and discoverability of data for researchers and acts as an obstacle to the cross-fertilisation of data between disciplines. The technological cost of setting up a repository for an institution is estimated at €1 million over a 4-year period, whereas such costs could very often be shared.

Open data citation advantage – The impact of adding data to articles in terms of citations of the articles concerned

Contrary to some researchers’ long-standing fear that their data may be “plundered”, those who share the data associated with their articles actually benefit significantly in terms of visibility as measured by citations. Half a dozen studies show a strong increase in citations when data is associated with a paper. This is therefore a real opportunity for French research to strengthen its international visibility, particularly as France’s ranking in the production of articles is currently slipping.

The impact of adding associated data to articles in terms of citations of the articles concerned. Source: Colavizza, Giovanni, Iain Hrynaszkiewicz, Isla Staden, Kirstie Whitaker and Barbara McGillivray. 2020. ‘The Citation Advantage of Linking Publications to Research Data’. PLOS ONE 15 (4): e0230416.

Researchers are invited to deposit their research data in data repositories. Certain thematic or disciplinary communities have developed best practices for structuring data in accordance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles, in order to preserve it, open it or share it with restricted access if the form of the data or the framework in which it was obtained calls for this.

Researchers should give priority to national and international thematic and disciplinary data repositories which comply with best practices for the dissemination of their data. Unfortunately, too many scientific fields do not yet have a suitable solution for depositing data.

The obligation to deposit the data associated with articles has led communities without access to trusted repositories to deposit their data in private publishers’ repositories or in unmoderated generic repositories.

A generic national data service: a commitment set out in the National Plan for Open Science

As early as 2018, the Minister of Higher Education, Research and Innovation announced the development of a generic data hosting and dissemination service among the measures set out in the National Plan for Open Science.

To determine which national generic data repository service would best respond to researchers’ requirements, a group of specialists from different disciplinary backgrounds carried out three studies within the framework of the Committee for Open Science.

The replies of the research communities interviewed for these studies reveal that their most important requirement is for support in the preparation and description of data.

Get to know Recherche Data Gouv, which is currently being developed