Preprocessing steps for Acceptance and Efficiency models¶

Tested on the LHCb Analysis Facility environment, Docker image landerlini/lhcbaf:v0p8¶

This is the first notebook of the processing chain. Here, we obtain the data from the remote storage and convert them from ROOT format to pandas DataFrames. Then we apply a QuantileTransformer from scikit-learn so that the Artificial Neural Networks are presented with Gaussian-distributed data. Finally, we store the preprocessing steps to remote storage.

This is the notebook where the input features and the labels are defined.

Technologies and libraries¶

On top of the standard Python ecosystem we are using:

  • uproot to convert data from ROOT TTrees to pandas DataFrames
  • dask DataFrame to enable processing datasets larger than the available RAM. Dask takes care of moving data between disk and RAM and of converting from the ROOT to the pandas format on demand.
  • Arrow Feather data format to cache the training dataset in local storage
    • Note that the custom wrappers for Dask and TensorFlow defined in feather_io.py are needed

Loading data¶

Input data are obtained from the environment variable $INPUT_FILES. For debugging purposes, a default value is provided.

Found 10 files from /workarea/cloud-storage/anderlinil/LamarrBenderTrain/j100/*.root
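
A minimal sketch of how the file list might be resolved is shown below; the environment variable name and the glob pattern in the log come from the text above, while everything else (using the pattern as a debug default, the variable names) is an assumption.

```python
import os
from glob import glob

# Resolve the list of input ROOT files from the $INPUT_FILES glob pattern;
# the fallback pattern is a hypothetical debug default.
input_pattern = os.environ.get(
    "INPUT_FILES",
    "/workarea/cloud-storage/anderlinil/LamarrBenderTrain/j100/*.root",
)
input_files = glob(input_pattern)
print(f"Found {len(input_files)} files from {input_pattern}")
```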

Conversion from ROOT TTree to Pandas DataFrame¶

Training data is obtained from a Bender job running on simulated data. The ROOT files obtained from Bender are structured in multiple TDirectories for the training of different models. Here we are considering only the Tracking part, corresponding to the TDirectory named "TrackingTuple".

Two TTrees are defined:

  • sim with the generator-level particles, possibly matched to reconstructed tracks
  • reco reconstructed tracks with additional information on the reconstructed track parameters and their uncertainties.
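
As a minimal illustration of the conversion, the two trees of a single Bender file could be loaded into pandas with uproot as follows; the notebook itself goes through dask DataFrames and the feather_io.py wrappers, and the file path below is a placeholder.

```python
import uproot

# Open one Bender output file and read both TTrees of the "TrackingTuple"
# TDirectory into pandas DataFrames (the file path is a placeholder).
with uproot.open("/path/to/bender_output.root") as root_file:
    sim_df = root_file["TrackingTuple/sim"].arrays(library="pd")
    reco_df = root_file["TrackingTuple/reco"].arrays(library="pd")

print(sim_df.columns.tolist())
```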

sim: Generator-level particles¶

The complete list of variables defined for all generator-level particles is provided below.

  • mc_key
  • evtNumber
  • runNumber
  • acceptance
  • reconstructible
  • recobleCat
  • mcID
  • mcIDmother
  • mc_charge
  • fromSignal
  • reconstructed
  • type
  • mc_vertexType
  • mc_primaryVertexType
  • mc_motherVertexType
  • mc_x
  • mc_y
  • mc_z
  • mc_r
  • mc_tx
  • mc_ty
  • mc_px
  • mc_py
  • mc_pz
  • mc_p
  • mc_pt
  • mc_eta
  • mc_phi
  • mc_mass
  • x_ClosestToBeam
  • y_ClosestToBeam
  • z_ClosestToBeam
  • tx_ClosestToBeam
  • ty_ClosestToBeam
  • px_ClosestToBeam
  • py_ClosestToBeam
  • pz_ClosestToBeam
  • p_ClosestToBeam
  • pt_ClosestToBeam
  • eta_ClosestToBeam
  • phi_ClosestToBeam
  • mass_ClosestToBeam

Acceptance model¶

Definition of the variables¶

Some useful features are missing; in particular we define:

  • $\log_{10} p$ of the particle, which eases the job of the Quantile Transformer
  • $t_x$ and $t_y$, the slopes of the track at the origin vertex
  • is_e, is_mu and is_h, boolean flags stating whether a particle is an electron, a muon or a hadron (pion, kaon, or proton); a possible definition is sketched after this list.
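
A minimal pandas sketch of these derived columns is given below; it assumes that mcID holds the PDG identifier and mc_p the momentum, and the DataFrame name df is hypothetical.

```python
import numpy as np

# Derived features for the acceptance model (sketch).
df["mc_log10_p"] = np.log10(df["mc_p"])           # log10 of the momentum
abs_pdg = df["mcID"].abs()                        # PDG code, sign-independent
df["mc_is_e"] = abs_pdg == 11                     # electrons
df["mc_is_mu"] = abs_pdg == 13                    # muons
df["mc_is_h"] = abs_pdg.isin([211, 321, 2212])    # pions, kaons, protons
```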

The complete list of variables is provided below

Features

  • mc_x
  • mc_y
  • mc_z
  • mc_log10_p
  • mc_tx
  • mc_ty
  • mc_eta
  • mc_phi
  • mc_is_e
  • mc_is_mu
  • mc_is_h
  • mc_charge

Labels

  • acceptance

Preliminary plots and harmful outliers¶

Before preprocessing the data we ensure that the distributions are well behaved. We should pay particular attention to outliers falling in regions completely disjoint from the core of the distribution.

Tails are tractable with the QuantileTransformer, while error values assigned by the reconstruction may lead to very inconsistent training of the neural networks and should be treated explicitly.
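
If such error values were present, a minimal way of dropping them before preprocessing could look like the sketch below; the sentinel value and the column list are purely hypothetical.

```python
# Drop rows where any continuous feature equals a hypothetical error value.
ERROR_VALUE = -9999.0
continuous_features = ["mc_x", "mc_y", "mc_z", "mc_log10_p", "mc_tx", "mc_ty"]
df = df[(df[continuous_features] != ERROR_VALUE).all(axis=1)]
```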

For the acceptance this does not seem to be an issue.

Labels¶

A loose acceptance cut is already applied when generating the nTuple; still, the fraction of particles in acceptance is small.

Preprocessing: definition and training of the ColumnTransformer¶

We define a preprocessing step applying different transformations to different variables: we plan to use the QuantileTransformer on the continuous features, while boolean flags such as is_mu or is_h should simply be left untouched.

To achieve this we use the ColumnTransformer defined in scikit-learn.

We apply the QuantileTransformer to the continuous features, while we simply pass through the boolean flags.

Finally we store the trained preprocessing step in the same destination as the model that we are going to train in the next notebook.
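
A minimal sketch of such a preprocessing step is given below; the column grouping follows the feature list above, while the DataFrame name df and the QuantileTransformer settings are assumptions.

```python
import pickle
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

continuous_features = ["mc_x", "mc_y", "mc_z", "mc_log10_p",
                       "mc_tx", "mc_ty", "mc_eta", "mc_phi"]
flag_features = ["mc_is_e", "mc_is_mu", "mc_is_h", "mc_charge"]

# The QuantileTransformer maps the continuous features to a normal distribution,
# while the boolean flags (and the charge) are passed through unchanged.
preprocessing = ColumnTransformer([
    ("quantile", QuantileTransformer(output_distribution="normal"), continuous_features),
    ("flags", "passthrough", flag_features),
])
preprocessing.fit(df[continuous_features + flag_features])

# Store the trained step next to the model that will be trained in the next notebook.
with open("/workarea/local/private/cache/models/acceptance/tX.pkl", "wb") as handle:
    pickle.dump(preprocessing, handle)
```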

Preprocessing step stored in:
/workarea/local/private/cache/models/acceptance/tX.pkl

Train, test and validation¶

We split the sample in three subsets (a possible implementation is sketched after this list):

  • train (50% of the total), used for training
  • test (40% of the total), used to assess the quality of the trained model by comparing the distributions obtained from the model with those of the real data
  • validation (10% of the total), used to monitor the training procedure on an independent dataset and assess overtraining
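
With a dask DataFrame, the split could be performed as in the sketch below; the DataFrame name ddf and the random seed are assumptions.

```python
# Split the (preprocessed) dask DataFrame with the fractions listed above.
train_df, test_df, validation_df = ddf.random_split(
    [0.5, 0.4, 0.1], random_state=42
)
```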

Cache the preprocessed datasets to local storage¶

Finally, we are ready to process the entire dataset, applying the preprocessing step and storing the relevant features and labels in local storage.
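
The notebook relies on the custom wrappers in feather_io.py for this step; as a rough sketch of the underlying idea, one partition of the training set could be cached to Feather with plain pandas as below (the cache path, the column lists and the .compute() call are illustrative assumptions; the real wrappers avoid materialising the full dataset in memory).

```python
import os

cache_dir = "/workarea/local/private/cache/data/acceptance-train"  # hypothetical location
os.makedirs(cache_dir, exist_ok=True)

# Keep only the training features and labels and write them in Arrow Feather format.
columns = continuous_features + flag_features + ["acceptance"]
train_df[columns].compute().reset_index(drop=True).to_feather(
    os.path.join(cache_dir, "part-000.feather")
)
```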

Processing acceptance-train
Processing acceptance-test
Processing acceptance-validation
Train 2645683
Test 2118122
Validation 530089

Reconstruction efficiency¶

The same steps followed to model the acceptance are now repeated for the reconstruction efficiency of charged particles. The efficiency model has a significant difference with respect to the acceptance: while a particle is either in acceptance or not, a particle in acceptance can be reconstructed in different ways, or track classes:

  • long tracks (type: 3) are tracks obtained combining stubs in the vertex locator, in the upstream tracker and in the downstream tracking stations
  • upstream tracks (type: 4) are tracks obtained combining stubs in the vertex locator and in the upstream tracker, but without a matching stub in the downstream tracking stations
  • downstream tracks (type: 5) are tracks obtained combining stubs in the upstream and downstream trackers, but without a matching stub in the vertex locator.

Upstream tracks are usually associated with low-momentum particles that were swept out by the magnetic field before reaching the downstream tracker. The resolution on the momentum is rather poor, but the residual magnetic field between the vertex locator and the upstream tracker is sufficient to perform a decent momentum measurement.

Downstream tracks are mainly due to particles decaying outside the Vertex Locator. Typical examples are the decay products from

  • $K^0_S \to \pi^+ \pi^-$
  • $\Lambda^0 \to p \pi^-$

Downstream tracks are crucial for a number of studies involving hyperons.

We define the set of labels accordingly, while for simplicity the same features as for the acceptance are used; a sketch of the label definition follows the lists below.

Features

  • mc_x
  • mc_y
  • mc_z
  • mc_log10_p
  • mc_tx
  • mc_ty
  • mc_eta
  • mc_phi
  • mc_is_e
  • mc_is_mu
  • mc_is_h
  • mc_charge

Labels

  • not_recoed
  • recoed_as_long
  • recoed_as_upstream
  • recoed_as_downstream
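
A possible way of deriving these labels from the reconstructed and type branches is sketched below; it assumes that reconstructed is a boolean flag and that type follows the convention quoted above (3: long, 4: upstream, 5: downstream), with df again a hypothetical DataFrame name.

```python
# Efficiency labels (sketch): mutually exclusive flags built from the track class.
recoed = df["reconstructed"].astype(bool)
df["recoed_as_long"] = recoed & (df["type"] == 3)
df["recoed_as_upstream"] = recoed & (df["type"] == 4)
df["recoed_as_downstream"] = recoed & (df["type"] == 5)
df["not_recoed"] = ~(
    df["recoed_as_long"] | df["recoed_as_upstream"] | df["recoed_as_downstream"]
)
```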

Cache preprocessed data to local storage¶

As done for the preprocessed dataset used to train the acceptance model, we store in a local folder the datasets ready to be processed with TensorFlow.

Processing efficiency-train
Processing efficiency-test
Processing efficiency-validation
Train 2644496
Test 2119795
Validation 529603

Finally we need to publish the preprocessing step to the same target directory that will host the trained model.

Preprocessing step stored in:
/workarea/local/private/cache/models/efficiency/tX.pkl