LHCb Analysis Facility
from landerlini/lhcbaf:v0p8

This notebook is part of the pipeline to model the acceptance of charged particles in LHCb. In particular, it requires:

- the trained acceptance model, stored by the training notebook with the standard Keras APIs;
- the preprocessing step, defined in the preprocessing notebook and stored in the same folder as the trained model;
- the validation dataset, stored during the preprocessing step and never loaded during training.
As for the validation of the acceptance model, this notebook relies on GPUs to speed up the selection, binning and histogramming operations needed for the validation. We rely on cupy for the numerical computations on GPU, on dask to stream data from disk to CPU RAM, and on cudf to stream data from CPU RAM to GPU RAM. Since cudf and the GPU build of tensorflow have conflicting dependencies, in this notebook we evaluate the trained model on CPU, without any significant loss in performance.

Once again, with the current amount of data, using the GPU is not strictly needed, but we plan to extend the training dataset in the near future and may then see some benefit (and if not, it would be very interesting to understand why). Please refer to the discussion in the Acceptance-validation notebook for further details.
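As a rough illustration of this data flow, the sketch below streams one chunk at a time from disk to CPU RAM and histograms a column on the GPU. It is only a sketch: the input path, the binning, and the use of plain dask in place of the custom FeatherReader are assumptions.

```python
# Minimal sketch of the disk -> CPU RAM -> GPU RAM data flow (paths and binning
# are assumptions; plain dask is used here as a stand-in for FeatherReader).
import cupy as cp               # numerical computations on GPU
import cudf                     # CPU RAM -> GPU RAM
import dask.dataframe as dd     # disk -> CPU RAM, one chunk at a time

ddf = dd.read_parquet("validation-data/*.parquet")   # hypothetical location

def gpu_histogram(partition, column, bins):
    """Move one pandas partition to the GPU and histogram a column there."""
    gdf = cudf.from_pandas(partition)                # CPU RAM -> GPU RAM
    values = gdf[column].values                      # cupy array backing the cudf column
    counts, _ = cp.histogram(values, bins=bins)
    return counts

bins = cp.linspace(2.0, 6.0, 41)                     # hypothetical binning in log10(p)
total = None
for i in range(ddf.npartitions):
    chunk = ddf.get_partition(i).compute()           # stream one chunk into CPU RAM
    counts = gpu_histogram(chunk, "mc_log10_p", bins)
    total = counts if total is None else total + counts
```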
The data needed to measure the performance of the trained algorithm is stored during the preprocessing step and is never loaded in the training notebook, to avoid data leakage. Hence, it can be considered completely independent of the training dataset.
As in other notebooks, we are using our custom implementation of FeatherReader, able to load chunks of data from disk and stream them either as a TensorFlow dataset or, as in this case, as a dask dataframe.
The preprocessing step was defined in the preprocessing notebook and was stored in the same folder as the trained model.
The serialization relies on pickle.
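The snippet below sketches how these two inputs could be loaded; the file locations, the import path, and the FeatherReader interface shown here are hypothetical, not the actual API.

```python
# Sketch of loading the notebook inputs; file names, import path and the
# FeatherReader interface are assumptions.
import pickle

from lhcbaf.feather_reader import FeatherReader                # hypothetical import path

with open("models/acceptance/preprocessing.pkl", "rb") as f:   # hypothetical path
    preprocessing = pickle.load(f)                              # scikit-learn-like transformer

reader = FeatherReader("preprocessed/validation")               # hypothetical constructor
ddf = reader.as_dask_dataframe()                                # stream chunks as a dask dataframe
```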
The model was stored in the training notebook with the standard keras APIs. We print a summary of the model to standard output to document the details of the model we are validating.
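A minimal sketch of this step, assuming a hypothetical location for the saved model:

```python
# Load the trained model with the standard Keras API and print its summary
# (the model directory is an assumption).
import tensorflow as tf

model = tf.keras.models.load_model("models/acceptance/efficiency_model")   # hypothetical path
model.summary()
```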
Model: "model"
__________________________________________________________________________________________________
 Layer (type)              Output Shape      Param #     Connected to
==================================================================================================
 input_1 (InputLayer)      [(None, 12)]      0           []
 dense (Dense)             (None, 128)       1664        ['input_1[0][0]']
 dense_1 (Dense)           (None, 128)       16512       ['dense[0][0]']
 add (Add)                 (None, 128)       0           ['dense[0][0]', 'dense_1[0][0]']
 dense_2 (Dense)           (None, 128)       16512       ['add[0][0]']
 add_1 (Add)               (None, 128)       0           ['add[0][0]', 'dense_2[0][0]']
 dense_3 (Dense)           (None, 128)       16512       ['add_1[0][0]']
 add_2 (Add)               (None, 128)       0           ['add_1[0][0]', 'dense_3[0][0]']
 dense_4 (Dense)           (None, 128)       16512       ['add_2[0][0]']
 add_3 (Add)               (None, 128)       0           ['add_2[0][0]', 'dense_4[0][0]']
 dense_5 (Dense)           (None, 128)       16512       ['add_3[0][0]']
 add_4 (Add)               (None, 128)       0           ['add_3[0][0]', 'dense_5[0][0]']
 dense_6 (Dense)           (None, 4)         516         ['add_4[0][0]']
 softmax (Softmax)         (None, 4)         0           ['dense_6[0][0]']
==================================================================================================
Total params: 84,740
Trainable params: 84,740
Non-trainable params: 0
__________________________________________________________________________________________________
The following code block defines a pipeline for streaming the validation data with FeatherReader, applying the trained model on CPU, and collecting the predicted probabilities next to the corresponding truth-level labels.

INFO:tensorflow:Assets written to: ram://538f1ff5-4642-472f-accc-afa6672a2621/assets
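A sketch of what such a pipeline could look like is shown below. It assumes that the twelve mc_* columns listed in the preview are the training features, that `preprocessing` and `model` are the objects loaded above, and that `ddf` is the dask dataframe returned by the reader; it is not the notebook's actual code.

```python
# Minimal sketch of the prediction pipeline: evaluate the model on CPU for each
# dask partition and append its four softmax outputs. Feature list is an assumption.
import numpy as np

FEATURES = ["mc_x", "mc_y", "mc_z", "mc_log10_p", "mc_tx", "mc_ty",
            "mc_eta", "mc_phi", "mc_is_e", "mc_is_mu", "mc_is_h", "mc_charge"]
PREDICTED = ["predicted_not_recoed", "predicted_recoed_as_long",
             "predicted_recoed_as_upstream", "predicted_recoed_as_downstream"]

def add_predictions(df):
    """Evaluate the model on one partition and append the predicted probabilities."""
    inputs = preprocessing.transform(df[FEATURES])      # hypothetical transformer interface
    probs = model.predict(inputs, verbose=0)            # (n, 4) softmax probabilities
    return df.assign(**{name: probs[:, i].astype(np.float32)
                        for i, name in enumerate(PREDICTED)})

meta = dict(ddf.dtypes)
meta.update({name: np.float32 for name in PREDICTED})
ddf = ddf.map_partitions(add_predictions, meta=meta)
ddf
```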
Dask DataFrame preview (npartitions=3):

- float64 columns: mc_x, mc_y, mc_z, mc_log10_p, mc_tx, mc_ty, mc_eta, mc_phi, mc_is_e, mc_is_mu, mc_is_h, mc_charge, not_recoed, recoed_as_long, recoed_as_upstream, recoed_as_downstream
- float32 columns (model predictions): predicted_not_recoed, predicted_recoed_as_long, predicted_recoed_as_upstream, predicted_recoed_as_downstream
As a first step in the validation, we compare the overall reconstruction efficiency for each combination of particle type and track class.
As a reminder, the particles we are considering are:

- electrons
- muons
- hadrons

While the track classes are:

- long
- upstream
- downstream
Please refer to the Preprocessing notebook for additional discussion on this choice.
For each of the nine combinations of particle type and track class, we compare the overall efficiency, averaged over the whole sample, as obtained from the detailed simulation and from the model. If the training was successful, we should see excellent agreement between the two.
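A sketch of this comparison, assuming the materialized validation sample fits in memory and using the column names of the preview above:

```python
# Sketch of the averaged-efficiency comparison for the nine combinations of
# particle type and track class (column names follow the dataframe preview).
df = ddf.compute()          # materialize the validation sample (assumed to fit in memory)

particles = {"electrons": "mc_is_e", "muons": "mc_is_mu", "hadrons": "mc_is_h"}
track_classes = ["long", "upstream", "downstream"]

for particle, flag in particles.items():
    sample = df[df[flag] == 1]
    for track in track_classes:
        eff_sim = sample[f"recoed_as_{track}"].mean()             # detailed simulation
        eff_mod = sample[f"predicted_recoed_as_{track}"].mean()   # model prediction
        print(f"{particle:9s} {track:10s}   sim: {eff_sim:.4f}   model: {eff_mod:.4f}")
```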
Then we split the sample into kinematic bins to evaluate the quality of the agreement between the model and the test dataset. This analysis requires some additional variables, computed by combining the features used for training. In the following code block we complete the dataset by adding these variables to the dataframe.
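As an example, the sketch below adds the momentum and the transverse momentum; the exact set of derived variables and the momentum units are assumptions.

```python
# Sketch of the derived kinematic variables; the variable set and the momentum
# units are assumptions (here log10(p) is taken to be in MeV/c).
import numpy as np

df["mc_p"] = 10 ** df["mc_log10_p"] / 1000.0                   # momentum in GeV/c
slope = np.sqrt(df["mc_tx"] ** 2 + df["mc_ty"] ** 2)           # tan(theta) from the track slopes
df["mc_pt"] = df["mc_p"] * slope / np.sqrt(1.0 + slope ** 2)   # transverse momentum
```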
The following stacked histograms represent the probability for a generated particle in acceptance to be reconstructed as a long, upstream or downstream track. The contributions of the different track types are stacked on top of each other, up to the histogram representing the whole, unselected sample.
As usual, the histograms are obtained either from the detailed simulation, using the reconstruction labels, or from the model, weighting each entry with the predicted probabilities.
Different particle types are represented in different rows, while different momentum bins are represented in different columns.
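The sketch below reproduces the logic of one such panel for a single particle type, comparing the detailed simulation (filled bars) with the model (step lines); the binning and the choice of panel are assumptions, and this is not the notebook's actual plotting code.

```python
# Sketch of one stacked-histogram panel: reconstruction probability versus
# pseudorapidity for hadrons, detailed simulation (bars) vs model (steps).
import matplotlib.pyplot as plt
import numpy as np

bins = np.linspace(1.5, 5.5, 21)                        # hypothetical eta binning
centers, widths = 0.5 * (bins[1:] + bins[:-1]), np.diff(bins)
sample = df[df["mc_is_h"] == 1]                         # one particle type (momentum cut omitted)
total = np.maximum(np.histogram(sample["mc_eta"], bins=bins)[0], 1)

bottom_sim = np.zeros(len(centers))
bottom_mod = np.zeros(len(centers))
for track in ["long", "upstream", "downstream"]:
    sim = np.histogram(sample["mc_eta"], bins=bins,
                       weights=sample[f"recoed_as_{track}"])[0] / total
    mod = np.histogram(sample["mc_eta"], bins=bins,
                       weights=sample[f"predicted_recoed_as_{track}"])[0] / total
    plt.bar(centers, sim, width=widths, bottom=bottom_sim, alpha=0.5, label=f"sim: {track}")
    plt.stairs(bottom_mod + mod, bins, baseline=None, label=f"model: {track}")
    bottom_sim += sim
    bottom_mod += mod

plt.xlabel("pseudorapidity")
plt.ylabel("reconstruction probability")
plt.legend()
plt.show()
```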
This notebook describes the validation of the efficiency model, comparing the probability predicted by the model for a particle to be reconstructed as a long, downstream or upstream track with the outcome of the detailed simulation. The comparison is performed using an independent dataset, never seen during the training phase but statistically equivalent to the one used for optimization.
The preprocessing step is inverted and applied to the dataset in order to access the variables as they were before the transformation, recovering their physical meaning.
The efficiencies are compared both averaged over the whole sample and in kinematic bins, showing a decent level of agreement between the model and the detailed simulation.