Validation of the efficiency model¶

Tested on the LHCb Analysis Facility environment from landerlini/lhcbaf:v0p8¶

This notebook is part of the pipeline to model the reconstruction efficiency of charged particles in LHCb. In particular, it requires:

  • the preprocessing step, defined in the Preprocessing.ipynb notebook
  • the training step, defined in the Efficiency.ipynb notebook.

Libraries and environment¶

As for the validation of the acceptance model, this notebook relies on GPUs to speed up the selection, binning and histogramming operations needed for the validation. As in the acceptance validation, we rely on cupy for the numerical computations on GPU, on dask to stream data from disk to CPU RAM, and on cudf to stream data from CPU RAM to GPU RAM. Since cudf and tensorflow have conflicting GPU dependencies, in this notebook we evaluate the trained model on CPU, without any significant loss in performance.

Once again, with the current amount of data, using the GPU is not strictly needed, but we plan to extend the training dataset in the near future and the GPU may then provide some benefit (and if not, it would be very interesting to understand why).

Please refer to the discussion in the Acceptance-validation notebook for further details.


Load data¶

The data needed to measure the performance of the trained algorithm is stored during the preprocessing step and never loaded in the training notebook, to avoid data leaks. Hence, it can be considered completely independent of the training dataset.

As in other notebooks, we are using our custom FeatherReader implementation, which loads chunks of data from disk and streams them either as a TensorFlow dataset or, as in this case, as a dask DataFrame.
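
FeatherReader is a project-specific helper; its actual implementation lives in the repository, but a minimal sketch of the underlying idea, with hypothetical names and paths, could look like this:

    import glob

    import dask.dataframe as dd
    import pandas as pd
    from dask import delayed

    def feather_chunks_to_dask(pattern):
        """Illustrative stand-in for FeatherReader: lazily expose a set of
        Feather chunks on disk as a single dask DataFrame, one file per
        partition, read only when the graph is computed."""
        files = sorted(glob.glob(pattern))
        parts = [delayed(pd.read_feather)(path) for path in files]
        # Read one chunk eagerly to derive the column/dtype metadata dask needs
        meta = pd.read_feather(files[0]).iloc[:0]
        return dd.from_delayed(parts, meta=meta)

    # Hypothetical usage:
    # ddf = feather_chunks_to_dask("/path/to/validation_chunks/*.feather")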

Load the preprocessing step¶

The preprocessing step was defined in the preprocessing notebook and was stored in the same folder as the trained model. The serialization relies on pickle.
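
As a sketch of what that amounts to (the file name below is an assumption; only the folder, shared with the trained model, is stated above):

    import os
    import pickle

    MODEL_DIR = "/workarea/local/private/cache/models/efficiency"

    # "preprocessing.pkl" is a hypothetical file name
    with open(os.path.join(MODEL_DIR, "preprocessing.pkl"), "rb") as input_file:
        preprocessing = pickle.load(input_file)

    # The deserialized object is expected to provide the forward and inverse
    # transforms used below, e.g. preprocessing.transform(X) and
    # preprocessing.inverse_transform(X) for a scikit-learn-style pipeline.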

Load the model¶

The model was stored in the training notebook with the standard Keras APIs. We print its summary to standard output to document the architecture of the model we are validating.
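
The loading step itself is standard Keras; a minimal sketch (the path is the one reported in the output below):

    import tensorflow as tf

    MODEL_DIR = "/workarea/local/private/cache/models/efficiency"

    print(f"Loading model from {MODEL_DIR}")
    model = tf.keras.models.load_model(MODEL_DIR)

    # Document the architecture of the model under validation
    model.summary()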

Loading model from /workarea/local/private/cache/models/efficiency
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_1 (InputLayer)           [(None, 12)]         0           []                               
                                                                                                  
 dense (Dense)                  (None, 128)          1664        ['input_1[0][0]']                
                                                                                                  
 dense_1 (Dense)                (None, 128)          16512       ['dense[0][0]']                  
                                                                                                  
 add (Add)                      (None, 128)          0           ['dense[0][0]',                  
                                                                  'dense_1[0][0]']                
                                                                                                  
 dense_2 (Dense)                (None, 128)          16512       ['add[0][0]']                    
                                                                                                  
 add_1 (Add)                    (None, 128)          0           ['add[0][0]',                    
                                                                  'dense_2[0][0]']                
                                                                                                  
 dense_3 (Dense)                (None, 128)          16512       ['add_1[0][0]']                  
                                                                                                  
 add_2 (Add)                    (None, 128)          0           ['add_1[0][0]',                  
                                                                  'dense_3[0][0]']                
                                                                                                  
 dense_4 (Dense)                (None, 128)          16512       ['add_2[0][0]']                  
                                                                                                  
 add_3 (Add)                    (None, 128)          0           ['add_2[0][0]',                  
                                                                  'dense_4[0][0]']                
                                                                                                  
 dense_5 (Dense)                (None, 128)          16512       ['add_3[0][0]']                  
                                                                                                  
 add_4 (Add)                    (None, 128)          0           ['add_3[0][0]',                  
                                                                  'dense_5[0][0]']                
                                                                                                  
 dense_6 (Dense)                (None, 4)            516         ['add_4[0][0]']                  
                                                                                                  
 softmax (Softmax)              (None, 4)            0           ['dense_6[0][0]']                
                                                                                                  
==================================================================================================
Total params: 84,740
Trainable params: 84,740
Non-trainable params: 0
__________________________________________________________________________________________________

Transform data and evaluate the model in a pipeline¶

The following code block defines a pipeline that:

  • reads a chunk of data from disk using FeatherReader;
  • processes the chunk on CPU, applying the inverse preprocessing transform and evaluating the trained neural network model;
  • loads the chunk into GPU memory for further processing (see the sketch after this list).
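
A schematic sketch of these three steps, assuming pandas-level chunks, the model and preprocessing objects loaded above, and a scikit-learn-style inverse transform (names and details are illustrative, not the notebook's actual implementation):

    import cudf
    import numpy as np
    import pandas as pd

    FEATURES = [
        "mc_x", "mc_y", "mc_z", "mc_log10_p", "mc_tx", "mc_ty",
        "mc_eta", "mc_phi", "mc_is_e", "mc_is_mu", "mc_is_h", "mc_charge",
    ]
    PREDICTIONS = [
        "predicted_not_recoed", "predicted_recoed_as_long",
        "predicted_recoed_as_upstream", "predicted_recoed_as_downstream",
    ]

    def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
        """CPU step: attach model predictions and recover physical variables."""
        chunk = chunk.copy()
        # Evaluate the trained network on the (preprocessed) features
        probs = model.predict(chunk[FEATURES].to_numpy(), verbose=0)
        # Undo the preprocessing so the features regain their physical meaning
        chunk[FEATURES] = preprocessing.inverse_transform(chunk[FEATURES].to_numpy())
        for i, name in enumerate(PREDICTIONS):
            chunk[name] = probs[:, i].astype(np.float32)
        return chunk

    # GPU step: move the processed chunk to GPU memory for binning/histogramming
    # gdf = cudf.from_pandas(process_chunk(chunk))
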
Dask DataFrame Structure (npartitions=3, Dask name: from_pandas, 3 graph layers):

  float64 columns: mc_x, mc_y, mc_z, mc_log10_p, mc_tx, mc_ty, mc_eta, mc_phi,
                   mc_is_e, mc_is_mu, mc_is_h, mc_charge, not_recoed,
                   recoed_as_long, recoed_as_upstream, recoed_as_downstream
  float32 columns: predicted_not_recoed, predicted_recoed_as_long,
                   predicted_recoed_as_upstream, predicted_recoed_as_downstream

Validation of the averaged efficiency¶

As a first step in the validation, we compare the overall reconstruction efficiency for each combination of particle type and track class.

As a reminder, the particles we are considering are:

  • Hadrons,
  • Muons,
  • Electrons.

While the track classes are:

  • Long tracks,
  • Upstream tracks,
  • Downstream tracks.

Please refer to the Preprocessing notebook for additional discussion on this choice.

For each of the nine combinations of particle type and track class, we compare the overall efficiency, averaged over the whole sample, as obtained from the detailed simulation and from the model. If the training was successful, we should see excellent agreement between the two.
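
In terms of the columns listed above, the comparison amounts to something like the following sketch, written here with pandas-style calls for readability (the notebook performs the equivalent computation on GPU):

    # df is the dataframe produced by the pipeline above
    PARTICLES = {"hadrons": "mc_is_h", "muons": "mc_is_mu", "electrons": "mc_is_e"}
    TRACK_CLASSES = ("long", "upstream", "downstream")

    for particle, flag in PARTICLES.items():
        mask = df[flag] > 0.5  # one-hot particle-type flags
        for track in TRACK_CLASSES:
            eff_sim = df.loc[mask, f"recoed_as_{track}"].mean()
            eff_model = df.loc[mask, f"predicted_recoed_as_{track}"].mean()
            print(f"{particle:10s} {track:10s} sim: {eff_sim:.4f}  model: {eff_model:.4f}")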

Validation in kinematic bins¶

Then we split the sample into kinematic bins to evaluate the quality of the agreement between the model and the test data sample. To perform this analysis we may need some additional variables, computed by combining the features used for training. In the following code block we complete the dataset by adding these variables to the dataframe.
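
For instance, the momentum can be recovered from mc_log10_p and, combined with the pseudorapidity, gives the transverse momentum; the exact set of derived variables used in the notebook may differ, so the sketch below (assuming a pandas or dask dataframe df) is only illustrative:

    import numpy as np

    # Recover the momentum (in the units used upstream) from its base-10 logarithm
    df["mc_p"] = 10.0 ** df["mc_log10_p"]

    # Transverse momentum from momentum and pseudorapidity: pT = p / cosh(eta)
    df["mc_pt"] = df["mc_p"] / np.cosh(df["mc_eta"])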

Validation plots¶

The following stacked histograms represent the probability for a generated particle in acceptance to be reconstructed as a long, upstream or downstream track. The different track types are stacked and compared to the histogram representing the whole, unselected sample.

As usual, the other histograms are obtained either by:

  • selecting particles based on the reconstructed track class as obtained from the detailed simulation; or
  • weighting the input particles according to the predictions of the trained model.

Different particle types are shown in different rows, while different momentum bins are shown in different columns.
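
A minimal sketch of one such panel, using plain numpy and matplotlib on a pandas dataframe df (the binning, styling and row/column arrangement of the actual plots are omitted and purely illustrative here):

    import matplotlib.pyplot as plt
    import numpy as np

    track_classes = ("long", "upstream", "downstream")
    bins = np.linspace(2.0, 5.0, 26)      # illustrative pseudorapidity binning
    eta = df["mc_eta"].to_numpy()

    # Reference: the whole, unselected sample of generated particles in acceptance
    plt.hist(eta, bins=bins, histtype="step", color="black", label="all particles")

    # Detailed simulation: stack the particles actually reconstructed in each class
    sim_masks = [df[f"recoed_as_{t}"].to_numpy() > 0.5 for t in track_classes]
    plt.hist([eta[m] for m in sim_masks], bins=bins, stacked=True, alpha=0.5,
             label=[f"{t} (simulation)" for t in track_classes])

    # Model: stack the same variable, weighted by the predicted probabilities
    weights = [df[f"predicted_recoed_as_{t}"].to_numpy() for t in track_classes]
    plt.hist([eta] * 3, bins=bins, weights=weights, stacked=True, histtype="step",
             label=[f"{t} (model)" for t in track_classes])

    plt.xlabel("pseudorapidity")
    plt.ylabel("particles")
    plt.legend()
    plt.show()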

Histograms binned in pseudorapidity¶

Histograms binned in the $z$ coordinate¶

Conclusion¶

This notebook describes the validation of the efficiency model, comparing the predicted probability for a particle to be reconstructed as either a long, upstream or downstream track with the outcome of the detailed simulation.

The comparison is performed using an independent dataset, never seen during the training phase but statistically equivalent to the one used for optimization.

The preprocessing step is inverted and applied to the dataset in order to access the variables as they were before the transformation, recovering their physical meaning.

The comparison of the efficiencies is performed both on the sample average and in kinematic bins, showing a decent level of agreement between the model and the detailed simulation.