
Input Data Preparation

The preparation script serves as a comprehensive tool for preparing training and validation datasets. Users can customize the data preparation process through the configuration file, specifying paths to data files, the training mode (classification or regression), and other relevant parameters. The script then extracts the particle datasets, applies normalization to ensure consistent weights, and prepares TensorFlow datasets for training. It offers options for reshuffling and batching, as well as the ability to artificially increase statistics by duplicating the dataset. The resulting datasets are saved in TensorFlow format, and the script can optionally generate validation plots for feature engineering studies. With a user-friendly command-line interface, the script streamlines the otherwise intricate process of preparing the data for the training step.
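
The sketch below illustrates the kind of pipeline described above; it is not the project's actual script. The file paths, the selected feature names (pT, eta), the label field, and the n_copies duplication parameter are illustrative assumptions, and it is assumed for simplicity that particle_dataset and event_dataset each hold one aligned entry per event.

```python
# Minimal sketch of the dataset-preparation flow: extract features from HDF5,
# normalize, build a tf.data.Dataset, optionally duplicate, shuffle, batch,
# and save in TensorFlow format. Names and structure are assumptions.
import h5py
import numpy as np
import tensorflow as tf

def prepare_dataset(h5_path, out_dir, batch_size=256, n_copies=1, shuffle=True):
    # Extract the compound datasets from the HDF5 file.
    with h5py.File(h5_path, "r") as f:
        particles = f["particle_dataset"][:]   # structured array of features
        events = f["event_dataset"][:]         # structured array with a "label" field (assumed)

    # Stack the selected features and normalize them (zero mean, unit variance).
    features = np.stack([particles["pT"], particles["eta"]], axis=-1).astype("float32")
    features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    labels = events["label"].astype("int32")

    ds = tf.data.Dataset.from_tensor_slices((features, labels))

    # Optionally duplicate the dataset to artificially increase statistics.
    if n_copies > 1:
        ds = ds.repeat(n_copies)

    if shuffle:
        ds = ds.shuffle(buffer_size=10_000)
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

    # Persist in TensorFlow's native dataset format
    # (tf.data.Dataset.save in recent releases; older ones use tf.data.experimental.save).
    ds.save(out_dir)
    return ds
```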

Example of input feature distributions:

[Figures: input feature distributions for pT and eta]

To make the best use of Final State Transformer, input data should be provided in the HDF5 file format with a specific compound dataset structure. The data are organized into a particle_dataset and an event_dataset, each of which must contain a field for every attribute listed in the configuration file in use. The HDF5 files can be created with tools or libraries such as h5py (Python) or the HDF5 libraries available for languages like C or Java. An example of the particle dataset structure is shown below.

Example of particle dataset structure:

[Figure: example of particle dataset structure]
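
A compound dataset of this kind can be written with h5py as in the following sketch. The field names (pT, eta, phi, mass, label, weight), the output file name, and the toy values are illustrative assumptions; the actual fields must match the attributes listed in your configuration file.

```python
# Sketch: create an HDF5 file with compound datasets named "particle_dataset"
# and "event_dataset". Field names and contents are illustrative assumptions.
import h5py
import numpy as np

n_events = 1000

# Compound (structured) dtype: one named field per input feature.
particle_dtype = np.dtype([("pT", "f4"), ("eta", "f4"), ("phi", "f4"), ("mass", "f4")])
event_dtype = np.dtype([("label", "i4"), ("weight", "f4")])

particles = np.zeros(n_events, dtype=particle_dtype)
particles["pT"] = np.random.exponential(50.0, n_events)     # toy values
particles["eta"] = np.random.uniform(-2.5, 2.5, n_events)   # toy values

events = np.zeros(n_events, dtype=event_dtype)
events["label"] = np.random.randint(0, 2, n_events)
events["weight"] = 1.0

with h5py.File("input_data.h5", "w") as f:
    f.create_dataset("particle_dataset", data=particles)
    f.create_dataset("event_dataset", data=events)
```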