BioExcel-2 Deliverable 2.4 – Development of a framework for the combination of HPC and HPDA operations

Authors: Hospital, Adam; Bayarri, Genis; Gelpi, Josep Lluis; Andrio, Pau; Lezzi, Daniele; Badia, Rosa M; Ejarque, Jorge; Alvarez, Javier; Sola, Salvi; Barissi, Sandro; Battistini, Federica; Gallego, Diego; Sala, Alba; Westermaier, Yvonne
Subject: Machine Learning; HPC; Data Analytics; HPDA; Workflows
Online Access: Open Access (OpenAIRE)

Abstract

This deliverable presents the approaches taken by the BioExcel CoE to design and develop a framework for combining High Performance Computing (HPC) and High Performance Data Analytics (HPDA) operations, together with the initial workflows that will benefit from such a framework. The BioExcel HPC-HPDA framework combines two state-of-the-art techniques: Machine Learning (ML) algorithms to automatically learn from generated data, and Big Data approaches to store, combine, analyse, and retrieve large amounts of information. Focusing on the ML field, a new category for the BioExcel Building Blocks (BioBB) library has been implemented: biobb_ml. This module contains a set of building blocks wrapping popular ML libraries such as Scikit-learn and TensorFlow. The building blocks in the biobb_ml module are divided into six categories, presented in section 2 of this document: Classification, Regression, Clustering, Neural Networks, Dimensionality Reduction and Utils. The module offers an easy way to train, test and predict with common ML methods such as Random Forest, K-Means or Neural Networks. Including these building blocks in the BioBB library expands the collection of wrapped tools and makes it possible to build complex, reproducible workflows that include ML models. In parallel with the biobb_ml module, a new ML library offering large-scale data analytics on HPC infrastructures has been developed: dislib. Based on distributed arrays (ds-arrays), dislib divides the input data into blocks that are processed in parallel and then performs a reduction to yield the results (map/reduce). This reduces the execution time while also supporting larger input data sets. Thanks to the BioExcel contribution, the library has been extended with two methods widely used in the computational biomolecular simulation field: Principal Component Analysis and Daura's clustering.
The library is now able to read different MD trajectory formats as input. Preliminary tests presented in section 3 show encouraging results, with the possibility of applying these methods to extremely large trajectory files. dislib functionalities will be integrated into the biobb_ml module in new releases. Workflows using ML methods for large-scale scientific projects are presented to showcase how these technologies are applied in our field. Two very different endeavors, one focused on predicting Protein-DNA binding affinities in transcription factors and one predicting RNA interaction energies that reproduce Quantum Mechanics (QM) results, are briefly presented in section 4. Finally, an HPDA Big Data approach, with a NoSQL database storing trajectories of COVID-19 related simulations, is described in section 5. The database and its connected server provide an interactive, graphical interface to quickly browse key structural and flexibility features of the most important proteins involved in SARS-CoV-2 host invasion. The server is an example of a large collaborative effort, and showcases the importance of Big Data and data analytics in the biomolecular simulation field.
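
The block-wise map/reduce pattern mentioned in the abstract can be illustrated with a small sketch. This is not the dislib API and not the deliverable's implementation: it uses only the Python standard library, and every name in it (split_into_blocks, map_block, reduce_partials, blockwise_covariance) is hypothetical. It assumes the data is a list of equal-length numeric rows, computes partial statistics per block (map), merges them by addition (reduce), and derives the mean and covariance matrix that a Principal Component Analysis would start from.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_blocks(rows, block_size):
    """Split a list of samples (rows) into consecutive row blocks."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def map_block(block):
    """Map step: partial statistics for a single block.

    Returns (n, sums, cross), where sums[j] is the sum of feature j and
    cross[j][k] is the sum of x_j * x_k over the rows of the block.
    """
    dim = len(block[0])
    sums = [0.0] * dim
    cross = [[0.0] * dim for _ in range(dim)]
    for row in block:
        for j in range(dim):
            sums[j] += row[j]
            for k in range(dim):
                cross[j][k] += row[j] * row[k]
    return len(block), sums, cross


def reduce_partials(a, b):
    """Reduce step: merge two partial results by element-wise addition."""
    n = a[0] + b[0]
    sums = [x + y for x, y in zip(a[1], b[1])]
    cross = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, sums, cross


def blockwise_covariance(rows, block_size=2, workers=4):
    """Mean vector and sample covariance matrix computed block by block."""
    blocks = split_into_blocks(rows, block_size)
    # Threads only illustrate the structure here; a distributed runtime
    # would schedule each block as a task on a different node.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    n, sums, cross = reduce(reduce_partials, partials)
    mean = [s / n for s in sums]
    dim = len(mean)
    cov = [
        [(cross[j][k] - n * mean[j] * mean[k]) / (n - 1) for k in range(dim)]
        for j in range(dim)
    ]
    return mean, cov


if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0]]
    mean, cov = blockwise_covariance(data, block_size=2)
    print(mean)
    print(cov)
```

Because each block only contributes sums, no block needs to see the full data set, which is what lets this kind of scheme scale to inputs larger than a single node's memory. An eigendecomposition of the resulting covariance matrix would then give the principal components.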
