The Joy of Studying

Platform Service Trends

CERN Announces New Open Data Policy in Support of Open Science

yoosue 2020. 12. 13. 17:00

CERN Announces New Open Data Policy in Support of Open Science (2020.12.11)

www.hpcwire.com/off-the-wire/cern-announces-new-open-data-policy-in-support-of-open-science/

 


 

GENEVA, Dec. 11, 2020 -- The four main LHC collaborations (ALICE, ATLAS, CMS and LHCb) have unanimously endorsed a new open data policy for scientific experiments at the Large Hadron Collider (LHC), which was presented to the CERN Council today. The policy commits to publicly releasing so-called level 3 scientific data, the type required to make scientific studies, collected by the LHC experiments. Data will start to be released approximately five years after collection, and the aim is for the full dataset to be publicly available by the close of the experiment concerned. The policy addresses the growing movement of open science, which aims to make scientific research more reproducible, accessible, and collaborative.

The level 3 data released can contribute to scientific research in particle physics, as well as research in the field of scientific computing, for example to improve reconstruction or analysis methods based on machine learning techniques, an approach that requires rich data sets for training and validation.

“The open data policy reflects CERN’s commitment to open science, which was already asserted in the CERN Convention over 60 years ago,” said Eckhard Elsen, CERN Director for Research and Computing. “The policy sets out the concrete steps towards its implementation at CERN, which will make data available to the extended scientific community as well as the general public.”

Scientific data are considered to have different levels of complexity. Level 3 data are of the type used as input to most physics studies and will be released alongside the software and documentation needed to use the data. Their release will allow high-quality analysis by diverse groups: non-CERN scientists, scientists in other fields, educational and outreach initiatives, and the general public.

The policy also covers the release of level 1 and level 2 datasets, of which samples are already available. Level 1 corresponds to the supporting information of results published in scientific articles, and level 2 corresponds to dedicated scientific datasets designed for educational and outreach purposes.

Data storage solutions at the CERN data centre (Image: CERN)

In practice, scientific datasets will be released through the CERN Open Data Portal, which already hosts a comprehensive set of data related to the LHC and other experiments. Data will be available using FAIR standards, a set of data guidelines that ensure the data are findable, accessible, interoperable, and re-usable.
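As a rough illustration of what the four FAIR properties mean for a single dataset, here is a minimal Python sketch. The `DatasetRecord` fields and the one-field-per-property mapping are my own simplifying assumptions, not the CERN Open Data Portal's actual metadata schema, and the identifier and URL are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical, simplified metadata record (not the portal's real schema)."""
    title: str
    persistent_id: Optional[str] = None  # e.g. a DOI -> findable
    access_url: Optional[str] = None     # where the files can be fetched -> accessible
    file_format: Optional[str] = None    # documented format -> interoperable
    license: Optional[str] = None        # explicit reuse terms -> re-usable


def fair_gaps(record: DatasetRecord) -> list[str]:
    """Return the FAIR aspects for which the record lacks the field checked here."""
    checks = {
        "findable": record.persistent_id,
        "accessible": record.access_url,
        "interoperable": record.file_format,
        "re-usable": record.license,
    }
    return [aspect for aspect, value in checks.items() if not value]


# Made-up record: the identifier and URL are placeholders, not real ones.
example = DatasetRecord(
    title="Example reconstructed dataset (made up)",
    persistent_id="10.0000/example",
    access_url="https://example.org/data",
    license="CC0",
)
print(fair_gaps(example))  # the file format is missing in this example
```

A real FAIR assessment covers far more than four fields; the point is only that each letter corresponds to something checkable in the metadata.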

“The policy provides a progressive framework for the openness and preservation of experimental data,” said Jamie Boyd, convener of the working group that formulated the policy. This strategy complements CERN’s existing Open Access policy, which mandates that all CERN research results are published in open access. It is also aligned with the recent European Strategy for Particle Physics Update announced in June 2020. The new policy could be used as a blueprint for other experiments at CERN and in other scientific organisations.

CERN previously pioneered Open Access to scientific literature with the SCOAP3 consortium, a global partnership of libraries, funding agencies and research institutions from 46 countries and intergovernmental organisations, which is now the largest open access initiative in the world. In addition, CERN collaborates with many organisations, such as the European Commission and UNESCO, on its efforts to promote open science practices beyond particle physics.

Further information:


CERN Open Data Policy for the LHC Experiments November, 2020

 

The CERN Open Data Policy reflects values that have been enshrined in the CERN Convention for more than sixty years and that were reaffirmed in the European Strategy for Particle Physics (2020), and aims to empower the LHC experiments to adopt a consistent approach towards the openness and preservation of experimental data. Making data available responsibly (applying FAIR standards), at different levels of abstraction and at different points in time, allows the maximum realisation of their scientific potential and the fulfillment of the collective moral and fiduciary responsibility to member states and the broader global scientific community. CERN understands that in order to optimise reuse opportunities, immediate and continued resources are needed. The level of support that CERN and the experiments will be able to provide to external users will depend on available resources.

This policy relates to the data collected by the LHC experiments for the main physics programme of the LHC: high-energy proton–proton and heavy-ion collision data. The foreseen use cases of the Open Data include reinterpretation and reanalysis of physics results, education and outreach, data analysis for technical and algorithmic developments and physics research.

The Open Data will be released through the CERN Open Data Portal, which will be supported by CERN for the lifetime of the data. The data will be tailored to the different uses, and will be made available in formats defined by each experiment that afford a range of opportunities for long-term use, reuse and preservation. In general, four levels of complexity of HEP data have been identified by the Data Preservation and Long Term Analysis in High Energy Physics (DPHEP) Study Group, which serve varying audiences and imply a diversity of openness solutions and practices.

 

Published Results (Level 1) Policy: Peer-reviewed publications represent the primary scientific output from the experiments. In compliance with the CERN Open Access Policy, all such publications are available with Open Access, and so are available to the public. To maximise the scientific value of their publications, the experiments will make public additional information and data at the time of publication, stored in collaboration with portals such as HEPData, with selection routines stored in specialised tools. The data made available may include simplified or full binned likelihoods, as well as unbinned likelihoods based on datasets of event-level observables extracted by the analyses. Reinterpretation of published results is also made possible through analysis preservation and direct collaboration with external researchers.

 

Outreach and Education (Level 2) Policy: For the purposes of education and outreach, dedicated subsets of data are used, selected and formatted to provide rich samples to maximise their educational impact, and to facilitate the easy use of the data. These data are released with a schedule and scope determined by each experiment. The data are provided in simplified, portable and self-contained formats suitable for educational and public understanding purposes; but are not intended nor adequate for the publication of scientific results. Lightweight environments to allow the easy exploration of these data may also be provided. CERN experiments will make data of such high level of abstraction available, accessible through the CERN Open Data Portal.

 

Reconstructed Data (Level 3) Policy: The LHC experiments will release calibrated reconstructed data with the level of detail useful for algorithmic, performance and physics studies. The release of these data will be accompanied by provenance metadata, and by a concurrent release of appropriate simulated data samples, software, reproducible example analysis workflows, and documentation. Virtual computing environments that are compatible with the data and software will be made available. The information provided will be sufficient to allow high-quality analysis of the data including, where practical, application of the main correction factors and corresponding systematic uncertainties related to calibrations, detector reconstruction and identification. A limited level of support for users of the Level 3 Open Data will be provided on a best-effort basis by the collaborations. Public data releases will occur periodically, following an appropriate latency period to allow thorough understanding of the data, the reconstruction and calibrations, as well as to allow time for the scientific exploitation of the data by the collaboration. The size of the released datasets will be commensurate with the total amount of data collected of similar type, with the aim to commence data releases within five years of the conclusion of the run period. Data may be withheld by an experiment if there are active analyses ongoing. Full datasets will be made available at the close of the collaboration. The data will be released from the CERN Open Data Portal under the Creative Commons CC0 waiver, and will be identified with persistent data identifiers, and the data must be cited through these identifiers. Similarly, appropriate acknowledgements of the experiment(s) should be included in publications released using such data, and the publications made clearly distinguishable from those released by the collaboration. 
Any scientific claims in such publications are the responsibility of their authors and not of the experiments. It is expected that scientific results released using Open Data follow best scientific practices. The experiments may impose rules related to the use of the data by members of their respective collaborations. External authors should be aware that they will not have access to the vast amount of tacit knowledge built up within the LHC collaborations over the decades of design, construction and operation of the experimental apparatus. To allow external scientists to fully benefit from all the data, knowledge and tools, the collaborations may offer appropriate association programmes.
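The Level 3 timing rule (releases aim to commence within five years of the conclusion of the run period) comes down to simple date arithmetic. The sketch below is only an illustration: the function name and the run-end date are made up, and the policy describes an aim, not a guarantee, since data may be withheld while analyses are ongoing.

```python
from datetime import date

LATENCY_YEARS = 5  # "within five years of the conclusion of the run period"


def latest_first_release(run_end: date, years: int = LATENCY_YEARS) -> date:
    """Date by which the first Level 3 release would be expected to have started."""
    try:
        return run_end.replace(year=run_end.year + years)
    except ValueError:
        # run_end is 29 February and the target year is not a leap year
        return run_end.replace(year=run_end.year + years, day=28)


# Hypothetical run period ending on 2030-12-31 (a made-up date).
print(latest_first_release(date(2030, 12, 31)))
```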

 

Raw Data (Level 4) Policy: It is not practically possible to make the full raw data-set from the LHC experiments usable in a meaningful way outside the collaborations. This is due to the complexity of the data, metadata and software, the required knowledge of the detector itself and the methods of reconstruction, the extensive computing resources necessary and the access issues for the enormous volume of data stored in archival media. It should be noted that, for these reasons, general direct access to the raw data is not even available to individuals within the collaboration, and that instead the production of reconstructed data (i.e. Level-3 data) is performed centrally. Access to representative subsets of raw data—useful for example for studies in the machine learning domain and beyond—can be released together with Level-3 formats, at the discretion of each experiment.
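For quick reference, the four levels described above can be condensed into a small lookup table. This is a personal summary sketch: the enum names and the one-line descriptions are my paraphrases of the policy text, not official wording.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """The four DPHEP complexity levels as named in the policy headings."""
    PUBLISHED_RESULTS = 1
    OUTREACH_AND_EDUCATION = 2
    RECONSTRUCTED_DATA = 3
    RAW_DATA = 4


# Paraphrased release terms per level (my summary, not official wording).
RELEASE_SUMMARY = {
    DataLevel.PUBLISHED_RESULTS:
        "additional information and data made public at the time of publication",
    DataLevel.OUTREACH_AND_EDUCATION:
        "dedicated subsets, with schedule and scope set by each experiment",
    DataLevel.RECONSTRUCTED_DATA:
        "periodic releases, aiming to start within five years of the end of the run period",
    DataLevel.RAW_DATA:
        "not released in full; representative subsets at each experiment's discretion",
}


def describe(level: int) -> str:
    """Return a one-line summary for a level number from 1 to 4."""
    lvl = DataLevel(level)
    name = lvl.name.replace("_", " ").title()
    return f"Level {lvl.value} ({name}): {RELEASE_SUMMARY[lvl]}"


for lvl in DataLevel:
    print(describe(lvl))
```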

 

(The original has footnotes, so check the source document.)

cds.cern.ch/record/2745133/files/CERN-OPEN-2020-013.pdf

 

 

(-> Are there no ethics or privacy issues with open data? The policy does not seem to address them.)