LHCb Analysis Facility
from landerlini/lhcbaf:v0p8

This notebook is part of the pipeline to model the acceptance of charged particles in LHCb. In particular, it requires:
In this notebook (and other validation notebooks) we will use GPUs to process the data and build the histograms. In most cases the time is dominated by the graphics functions in matplotlib, so the benefit from using GPUs is marginal, but this may change in the future when processing larger datasets.
To process data with the GPU we are using cupy, which implements a numpy-compatible library of numerical functions accelerated by a GPU, and cudf, which is interfaced with dask, enabling data to be streamed to the GPU while being processed with cupy. cudf is strictly needed only when the GPU memory is insufficient to hold the whole dataset to be processed, as it automates the process of splitting the dataset into batches, loading each batch to the GPU, applying some processing with cupy, retrieving and storing the output to free the GPU memory, and then continuing with the next batch. With the current volume of data this is not necessary, but again, we plan to increase the training datasets soon.
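As an illustration of what cudf and dask automate, the following minimal sketch shows the manual version of this batching loop with plain cupy. The array, batch size and computation are placeholders, not the actual notebook code.

```python
import numpy as np
import cupy as cp

def process_in_batches(host_array, batch_size=1_000_000):
    """Process a large host array on the GPU one batch at a time (illustrative)."""
    outputs = []
    for start in range(0, len(host_array), batch_size):
        batch = cp.asarray(host_array[start:start + batch_size])  # copy the batch to GPU memory
        result = cp.log10(batch)                                   # any cupy computation
        outputs.append(cp.asnumpy(result))                         # copy the result back to the host
        del batch, result
        cp.get_default_memory_pool().free_all_blocks()             # release the GPU memory
    return np.concatenate(outputs)
```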
Unfortunately, cudf and tensorflow have inconsistent dependencies that, at the time of writing, make it impossible to have both libraries running on the GPU in the same environment. In the validation notebooks, where most of the computing power is needed to split, sort and organize the data into histograms, we will evaluate the model on the CPU.
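For instance, one way to keep TensorFlow off the GPU (leaving it to cupy and cudf) is to hide the GPU from it before any model is built; the notebook may equally well rely on the CUDA_VISIBLE_DEVICES environment variable.

```python
import tensorflow as tf

# Hide the GPU from TensorFlow: the model will be evaluated on the CPU,
# leaving the GPU memory available to cupy/cudf.
tf.config.set_visible_devices([], 'GPU')
```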
We are now loading data in Apache Feather format using our custom FeatherReader. In this case we will read the Test dataset as a dask dataframe.
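FeatherReader is our custom helper, so its exact interface is not reproduced here; the sketch below only illustrates the equivalent operation with plain dask and pandas, using hypothetical file names.

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Hypothetical file list: the actual paths are handled by FeatherReader.
test_files = ["test_0.feather", "test_1.feather", "test_2.feather"]

# One delayed pandas.read_feather call per file, i.e. one dask partition per file.
partitions = [dask.delayed(pd.read_feather)(path) for path in test_files]
test_df = dd.from_delayed(partitions)
```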
Note that the Test dataset was obtained from the overall dataset in the preprocessing step and was never loaded in the training notebook, so it can be considered completely independent of both the training and validation datasets used to define the weights and architecture of the neural network model.
To produce plots of physically meaningful variables, we need to invert the preprocessing step and apply it to the data stored on disk.
The preprocessing step was defined in the Preprocessing.ipynb notebook and stored in the same folder as the neural network model using pickle. We simply reload it from there.
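Concretely, reloading it amounts to something like the following sketch; the directory and file name are illustrative, not necessarily those used in the pipeline.

```python
import os
import pickle

MODEL_DIR = "models/acceptance"   # illustrative path to the model folder

# The preprocessing step was pickled next to the neural network model.
with open(os.path.join(MODEL_DIR, "preprocessing.pkl"), "rb") as f:
    preprocessing = pickle.load(f)
```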
The model is loaded using the keras APIs, and summarized below for completeness.
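Loading and summarizing the model is a one-liner with the keras APIs (MODEL_DIR is the illustrative path from the previous sketch).

```python
import tensorflow as tf

# Reload the trained model from the same directory as the preprocessing step
model = tf.keras.models.load_model(MODEL_DIR)
model.summary()
```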
Model: "model" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_1 (InputLayer) [(None, 12)] 0 [] dense (Dense) (None, 128) 1664 ['input_1[0][0]'] dense_1 (Dense) (None, 128) 16512 ['dense[0][0]'] add (Add) (None, 128) 0 ['dense[0][0]', 'dense_1[0][0]'] dense_2 (Dense) (None, 128) 16512 ['add[0][0]'] add_1 (Add) (None, 128) 0 ['add[0][0]', 'dense_2[0][0]'] dense_3 (Dense) (None, 128) 16512 ['add_1[0][0]'] add_2 (Add) (None, 128) 0 ['add_1[0][0]', 'dense_3[0][0]'] dense_4 (Dense) (None, 128) 16512 ['add_2[0][0]'] add_3 (Add) (None, 128) 0 ['add_2[0][0]', 'dense_4[0][0]'] dense_5 (Dense) (None, 128) 16512 ['add_3[0][0]'] add_4 (Add) (None, 128) 0 ['add_3[0][0]', 'dense_5[0][0]'] dense_6 (Dense) (None, 128) 16512 ['add_4[0][0]'] add_5 (Add) (None, 128) 0 ['add_4[0][0]', 'dense_6[0][0]'] dense_7 (Dense) (None, 128) 16512 ['add_5[0][0]'] add_6 (Add) (None, 128) 0 ['add_5[0][0]', 'dense_7[0][0]'] dense_8 (Dense) (None, 128) 16512 ['add_6[0][0]'] add_7 (Add) (None, 128) 0 ['add_6[0][0]', 'dense_8[0][0]'] dense_9 (Dense) (None, 128) 16512 ['add_7[0][0]'] add_8 (Add) (None, 128) 0 ['add_7[0][0]', 'dense_9[0][0]'] dense_10 (Dense) (None, 128) 16512 ['add_8[0][0]'] add_9 (Add) (None, 128) 0 ['add_8[0][0]', 'dense_10[0][0]'] dense_11 (Dense) (None, 1) 129 ['add_9[0][0]'] ================================================================================================== Total params: 166,913 Trainable params: 166,913 Non-trainable params: 0 __________________________________________________________________________________________________
In the next block we define the pipeline for loading the data in chunks, applying the inverted preprocessing step to each chunk, obtaining the neural network response for the entries of that chunk, and finally uploading the chunk to GPU memory.
Loading data in chunks is a task performed by the FeatherReader object. Then we apply a custom function, my_processed_batch, to each chunk. Such a function is obtained as a specialization of a more general function named process_batch, by specifying the list of feature and label names and the preprocessing step.
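The actual implementation of process_batch lives in the notebook code; the following is only a hedged sketch of its structure, assuming a scikit-learn-like transformer for the preprocessing step and hypothetical FEATURES and LABELS name lists defined during preprocessing.

```python
from functools import partial
import cudf

def process_batch(batch, features, labels, preprocessing, model):
    """Hypothetical sketch of the per-chunk processing."""
    X = batch[features].values
    # Neural network response, evaluated on the CPU (see the note on tensorflow above)
    prediction = model.predict(X, verbose=0).flatten()
    # Invert the preprocessing to recover physically meaningful variables
    physical = preprocessing.inverse_transform(X)
    out = batch.copy()
    out[features] = physical
    out["predicted_acceptance"] = prediction
    # Upload the processed chunk to GPU memory as a cudf DataFrame
    return cudf.from_pandas(out)

# Specialize the generic function for this dataset
my_processed_batch = partial(process_batch,
                             features=FEATURES, labels=LABELS,   # hypothetical name lists
                             preprocessing=preprocessing, model=model)
```

Applying my_processed_batch to every chunk produced by the FeatherReader yields the lazy dataframe previewed below, with the features restored to physical units and an additional predicted_acceptance column.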
| | mc_x | mc_y | mc_z | mc_log10_p | mc_tx | mc_ty | mc_eta | mc_phi | mc_is_e | mc_is_mu | mc_is_h | mc_charge | acceptance | predicted_acceptance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=3 | | | | | | | | | | | | | | |
| | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float32 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
To ensure everything was correct in the loading and evaluation of the model, we repeat the comparison of the distributions of labels and predictions. The distributions are qualitatively comparable.
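A minimal sketch of such a comparison, assuming the processed dask dataframe from the previous step is named df (a hypothetical name), is the following.

```python
import matplotlib.pyplot as plt
import numpy as np

# Pull labels and predictions back to the host to draw the comparison
acceptance = df["acceptance"].compute().to_numpy()
predicted = df["predicted_acceptance"].compute().to_numpy()

bins = np.linspace(0.0, 1.0, 51)
plt.hist(acceptance, bins=bins, histtype="step", label="acceptance (detailed simulation)")
plt.hist(predicted, bins=bins, histtype="step", label="predicted acceptance (model)")
plt.xlabel("acceptance")
plt.ylabel("entries")
plt.legend()
plt.show()
```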
To build the histograms in kinematic bins we may need to add some variables. Using dask and cudf, the computation of these new variables is lazy: it is delayed until the result is actually read. At that point, the chunks are iteratively loaded from disk and the whole sequence of operations is performed on each chunk.
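For instance, adding a couple of derived columns looks like the sketch below; the variable names and units are illustrative, and df is the hypothetical name of the processed dataframe introduced above.

```python
# Lazily define additional variables: nothing is computed at this point
df["mc_p"] = 10.0 ** df["mc_log10_p"]                      # momentum back from its log10
df["mc_r"] = (df["mc_x"] ** 2 + df["mc_y"] ** 2) ** 0.5    # illustrative radial coordinate

# Only when a concrete result is requested are the chunks read from disk
# and the whole chain of operations executed on each of them:
momentum = df["mc_p"].compute()
```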
Here we report the comparison of the distribution of events:
In this notebook we produced some plots showing the agreement between the acceptance obtained from detailed simulation and the modeled acceptance probability.
Additional comparisons and plots might be added in the future.