The rapid development of large-scale scientific facilities (synchrotron, neutron, laser and X-ray Free Electron Laser [XFEL] facilities, among others) has massively increased the speed at which experiments can be performed, while new methods and techniques have greatly increased the amount of raw data collected during each experiment [1]. This has created enormous new opportunities, but also tremendous challenges for the national facilities and their users. Typically, users collect data during their limited, assigned beamtime and then spend many months analysing them. With the huge increase in data volume, this is no longer possible. As a result, only a small fraction of these multidisciplinary and scientifically complex Big Data are fully analysed and, ultimately, used in scientific publications. Within a few years, as the Big Data from large scientific facilities outgrow any conventional analysis approach based purely on human resources, users could find themselves unable to produce meaningful science from their large-scale experiments. The problem is even more evident for XFELs, where tens of petabytes are produced and must be analysed every year. This is unfortunate because beamtime at large scientific facilities is an expensive resource in terms of both money and time. Furthermore, the lack of an appropriate data analysis approach limits the realisation of experiments that generate a large amount of data in a very short period of time. Moreover, the current lack of automated data analysis pipelines prevents the fine-tuning of experiments during a beamtime, further reducing how efficiently the beamtime is used. This effect, commonly known as the “Big Data deluge”, affects large scientific facilities worldwide in several different ways, including fast data collection and available local storage, curation of the data, as well as data movement and deposition in a database.
We are now witnessing the dawn of Artificial Intelligence (AI), Machine Learning (ML) and Robotic Automation within the field of large scientific facilities, which is deeply changing how petabytes of interdisciplinary datasets are intelligently processed, managed, analysed and visualised. The evolution of large scientific facilities into Superfacilities enables multimodal user science that can confront the Big Data challenges, which is crucial for the entire scientific community. This seminar will introduce the Big Data Science Center (BDSC) at the Shanghai Synchrotron Radiation Facility (SSRF), the first scientific Superfacility in China and one of the first worldwide [2], and will detail its most recent developments. The BDSC aims to dramatically accelerate and automate the multidisciplinary research of all users at the large national scientific facilities, effectively increasing the rate of their scientific discoveries and of the resulting technological advancements, with a clear societal impact. This Big Data Science Platform therefore targets the research that national and international universities, academies, research institutes and industries are pursuing at SSRF, where massive scientific computing support is required to enable the most complete knowledge transfer from scientific research to industrial development, while elastically interfacing these users with China's top national supercomputers.
References
[1] C. Wang, U. Steiner and A. Sepe (corresponding author), Synchrotron Big Data Science. Small 14, 1802291 (2018)
[2] C. Wang, F. Yu, Y. Liu, X. Li, J. Chen, J. Thiyagalingam and A. Sepe (corresponding author), Deploying the Big Data Science Center at the Shanghai Synchrotron Radiation Facility: the first superfacility platform in China. IOP Publishing, Machine Learning: Science and Technology 2, 035003 (2021)
Dr. Jitae Park
Dr. Dominic Hayward