Preprocessing steps for Resolution and Covariance models¶

Tested on the LHCb Analysis Facility environment, using the Docker image landerli/lhcbaf:v0p8¶

This notebook loads, prepares and caches the data for the training of the resolution and covariance models. Both the resolution and the covariance model rely on GANs and, compared to the acceptance and efficiency models, benefit from larger statistics. Nonetheless, the preprocessing procedure is similar: data are loaded from the remote storage and converted from ROOT to pandas DataFrames, then a QuantileTransformer is applied.

In addition, we will introduce a custom DecorrTransformer that decorrelates features by diagonalizing their covariance matrix. This step is particularly important for modeling the covariance because of the strong correlation arising between different elements of the covariance matrix. While strong correlation is usually manageable by neural networks, here the correlated features are part of the reference dataset that the generator should learn to reproduce by minimizing some measure of the distance between distributions. In the original formulation of GANs, which we are using here with minor tweaks, the measure of the distance between the generated and reference distributions saturates when the two distributions are disjoint. Strong correlation between variables makes it harder to avoid disjoint distributions and makes the training outcome less predictable.

Technologies and libraries¶

On top of the standard Python ecosystem we are using:

  • uproot to convert data from ROOT TTrees to pandas DataFrames
  • dask DataFrame to enable processing datasets larger than the available RAM (Out-of-memory computation).
  • the Apache Arrow Feather data format to cache the training dataset in local storage
    • refer to feather_io.py for a custom wrapper to Feather

Loading data¶

Input data are obtained from the environment variable $INPUT_FILES. For debugging purposes, a default value is provided, as sketched below.
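A minimal sketch of this lookup, assuming the path appearing in the output below is used as the debugging default:

```python
import os
from glob import glob

# Read the glob pattern from the environment, falling back to a
# debugging default (assumed to be the path shown in the output below)
input_pattern = os.environ.get(
    "INPUT_FILES", "/workarea/local/private/cache/data/j100/*.root"
)
input_files = glob(input_pattern)
print(f"Found {len(input_files)} files from {input_pattern}")
```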

Found 10 files from /workarea/local/private/cache/data/j100/*.root

Conversion from ROOT TTree to Pandas DataFrame¶

Training data is obtained from a Bender job running on simulated data. The ROOT files obtained from Bender are structured in multiple TDirectories for the training of different models. Here we are considering only the Tracking part, corresponding to the TDirectory named "TrackingTuple".

Two TTrees are defined:

  • sim with the generator-level particles, possibly matched to reconstructed tracks
  • reco reconstructed tracks with additional information on the reconstructed track parameters and their uncertainties.
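A sketch of the conversion for a single file with uproot, assuming the TDirectory/TTree layout described above (the notebook combines this with dask for out-of-memory processing):

```python
import uproot

# Read both TTrees of the "TrackingTuple" TDirectory into pandas DataFrames
with uproot.open(input_files[0]) as root_file:
    sim_df = root_file["TrackingTuple/sim"].arrays(library="pd")
    reco_df = root_file["TrackingTuple/reco"].arrays(library="pd")
```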

reco: reconstructed particles¶

The complete list of variables defined for particles passing the reconstruction step is shown below.

mc_key                   true_mass_ClosestToBeam  cov_BegRich1_0_1         cov_AtT_1_3              cov_EndRich2_3_4        
evtNumber                x_ClosestToBeam          cov_BegRich1_1_1         cov_AtT_2_3              cov_EndRich2_4_4        
runNumber                y_ClosestToBeam          cov_BegRich1_0_2         cov_AtT_3_3             
PID                      z_ClosestToBeam          cov_BegRich1_1_2         cov_AtT_0_4             
clone                    qop_ClosestToBeam        cov_BegRich1_2_2         cov_AtT_1_4             
purity                   cov_ClosestToBeam_0_0    cov_BegRich1_0_3         cov_AtT_2_4             
type                     cov_ClosestToBeam_0_1    cov_BegRich1_1_3         cov_AtT_3_4             
history                  cov_ClosestToBeam_1_1    cov_BegRich1_2_3         cov_AtT_4_4             
fitHistory               cov_ClosestToBeam_0_2    cov_BegRich1_3_3         px_BegRich2             
ghostProb                cov_ClosestToBeam_1_2    cov_BegRich1_0_4         py_BegRich2             
likelihood               cov_ClosestToBeam_2_2    cov_BegRich1_1_4         pz_BegRich2             
fitStatus                cov_ClosestToBeam_0_3    cov_BegRich1_2_4         x_BegRich2              
chi2PerDoF               cov_ClosestToBeam_1_3    cov_BegRich1_3_4         y_BegRich2              
nDoF                     cov_ClosestToBeam_2_3    cov_BegRich1_4_4         z_BegRich2              
hasT                     cov_ClosestToBeam_3_3    px_EndRich1              tx_BegRich2             
hasVelo                  cov_ClosestToBeam_0_4    py_EndRich1              ty_BegRich2             
hasTT                    cov_ClosestToBeam_1_4    pz_EndRich1              qop_BegRich2            
mc_tx                    cov_ClosestToBeam_2_4    x_EndRich1               phi_BegRich2            
mc_ty                    cov_ClosestToBeam_3_4    y_EndRich1               eta_BegRich2            
mc_px                    cov_ClosestToBeam_4_4    z_EndRich1               cov_BegRich2_0_0        
mc_py                    px_EndVelo               tx_EndRich1              cov_BegRich2_0_1        
mc_pz                    py_EndVelo               ty_EndRich1              cov_BegRich2_1_1        
mc_p                     pz_EndVelo               qop_EndRich1             cov_BegRich2_0_2        
mc_pt                    x_EndVelo                phi_EndRich1             cov_BegRich2_1_2        
mc_eta                   y_EndVelo                eta_EndRich1             cov_BegRich2_2_2        
mc_phi                   z_EndVelo                cov_EndRich1_0_0         cov_BegRich2_0_3        
mc_mass                  tx_EndVelo               cov_EndRich1_0_1         cov_BegRich2_1_3        
reco_tx                  ty_EndVelo               cov_EndRich1_1_1         cov_BegRich2_2_3        
reco_ty                  qop_EndVelo              cov_EndRich1_0_2         cov_BegRich2_3_3        
reco_px                  phi_EndVelo              cov_EndRich1_1_2         cov_BegRich2_0_4        
reco_py                  eta_EndVelo              cov_EndRich1_2_2         cov_BegRich2_1_4        
reco_pz                  cov_EndVelo_0_0          cov_EndRich1_0_3         cov_BegRich2_2_4        
reco_p                   cov_EndVelo_0_1          cov_EndRich1_1_3         cov_BegRich2_3_4        
reco_pt                  cov_EndVelo_1_1          cov_EndRich1_2_3         cov_BegRich2_4_4        
reco_eta                 cov_EndVelo_0_2          cov_EndRich1_3_3         px_EndRich2             
reco_phi                 cov_EndVelo_1_2          cov_EndRich1_0_4         py_EndRich2             
true_x_ClosestToBeam     cov_EndVelo_2_2          cov_EndRich1_1_4         pz_EndRich2             
true_y_ClosestToBeam     cov_EndVelo_0_3          cov_EndRich1_2_4         x_EndRich2              
true_z_ClosestToBeam     cov_EndVelo_1_3          cov_EndRich1_3_4         y_EndRich2              
tx_ClosestToBeam         cov_EndVelo_2_3          cov_EndRich1_4_4         z_EndRich2              
ty_ClosestToBeam         cov_EndVelo_3_3          px_AtT                   tx_EndRich2             
px_ClosestToBeam         cov_EndVelo_0_4          py_AtT                   ty_EndRich2             
py_ClosestToBeam         cov_EndVelo_1_4          pz_AtT                   qop_EndRich2            
pz_ClosestToBeam         cov_EndVelo_2_4          x_AtT                    phi_EndRich2            
p_ClosestToBeam          cov_EndVelo_3_4          y_AtT                    eta_EndRich2            
pt_ClosestToBeam         cov_EndVelo_4_4          z_AtT                    cov_EndRich2_0_0        
eta_ClosestToBeam        px_BegRich1              tx_AtT                   cov_EndRich2_0_1        
phi_ClosestToBeam        py_BegRich1              ty_AtT                   cov_EndRich2_1_1        
mass_ClosestToBeam       pz_BegRich1              qop_AtT                  cov_EndRich2_0_2        
true_tx_ClosestToBeam    x_BegRich1               phi_AtT                  cov_EndRich2_1_2        
true_ty_ClosestToBeam    y_BegRich1               eta_AtT                  cov_EndRich2_2_2        
true_px_ClosestToBeam    z_BegRich1               cov_AtT_0_0              cov_EndRich2_0_3        
true_py_ClosestToBeam    tx_BegRich1              cov_AtT_0_1              cov_EndRich2_1_3        
true_pz_ClosestToBeam    ty_BegRich1              cov_AtT_1_1              cov_EndRich2_2_3        
true_p_ClosestToBeam     qop_BegRich1             cov_AtT_0_2              cov_EndRich2_3_3        
true_pt_ClosestToBeam    phi_BegRich1             cov_AtT_1_2              cov_EndRich2_0_4        
true_eta_ClosestToBeam   eta_BegRich1             cov_AtT_2_2              cov_EndRich2_1_4        
true_phi_ClosestToBeam   cov_BegRich1_0_0         cov_AtT_0_3              cov_EndRich2_2_4        

Resolution model¶

For consistency with the acceptance and efficiency models, we are defining:

  • mc_log10_p as $\log_{10}(p)$, which eases the job of the QuantileTransformer;
  • mc_is_e, mc_is_mu and mc_is_h, boolean flags defining whether a particle is an electron, a muon or a hadron (pion, kaon or proton);
  • nDoF_f, obtained by adding a random number, uniformly distributed in the range $[0, 1]$, to the number of degrees of freedom of the track fit.

In addition we are defining:

  • the boolean flags long, upstream and downstream, which one-hot encode the track type
  • dx, dy, dz, dtx, dty and dp, representing the differences between the reconstructed and generator-level values of the position ($x$, $y$ and $z$), the slopes ($t_x$ and $t_y$) and the momentum ($p$), evaluated at the ClosestToBeam track state.
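A minimal sketch of these definitions, assuming a pandas DataFrame df with the reco branches listed above; the PDG-code mapping is an assumption, and the derivation of the track-type flags from the type branch is omitted:

```python
import numpy as np

# Log-momentum eases the job of the QuantileTransformer
df["mc_log10_p"] = np.log10(df["mc_p"])

# Particle-species flags from the PDG identifier (assumed encoding)
abs_pid = df["PID"].abs()
df["mc_is_e"] = (abs_pid == 11).astype(float)
df["mc_is_mu"] = (abs_pid == 13).astype(float)
df["mc_is_h"] = abs_pid.isin([211, 321, 2212]).astype(float)

# Smear the integer number of degrees of freedom into a continuous variable
df["nDoF_f"] = df["nDoF"] + np.random.uniform(0.0, 1.0, len(df))

# Differences between reconstructed and generator-level quantities,
# evaluated at the ClosestToBeam track state
for var in ("x", "y", "z", "tx", "ty", "p"):
    df[f"d{var}"] = df[f"{var}_ClosestToBeam"] - df[f"true_{var}_ClosestToBeam"]
```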

Features¶

The resolution model takes as inputs the generator-level position, slope and momentum of the particle at the point where its trajectory minimizes the distance from the $z$ axis (ClosestToBeam), three boolean flags representing the nature of the particle (electron, muon or hadron) and three boolean flags representing the track category as defined by the efficiency model.

The resolution model should provide as output the differences between the reconstructed values and the input features, plus track-quality metrics such as the number of degrees of freedom and the $\chi^2$ of the track fit, and the ghost probability.

Clearly, in the training sample these quantities are obtained from the reconstruction algorithms performing the track fit. In Lamarr, these parameters will be generated randomly, aiming to reproduce the distributions observed in the training dataset.

Input features (real)

  • true_x_ClosestToBeam
  • true_y_ClosestToBeam
  • true_z_ClosestToBeam
  • true_tx_ClosestToBeam
  • true_ty_ClosestToBeam
  • mc_log10_p

Input features (boolean)

  • mc_is_e
  • mc_is_mu
  • mc_is_h
  • long
  • upstream
  • downstream

Output features (real)

  • dx
  • dy
  • dz
  • dtx
  • dty
  • dp
  • chi2PerDoF
  • nDoF_f
  • ghostProb

Some statistics¶

We report below some basic statistics on the dataset defined to train the resolution model. These plots might highlight pathological contributions to the distributions, such as error values resulting in large spikes or outliers inconsistent with the core of the distribution.

Some rows¶

Only a fraction of the whole dataset is used for this study, and hence for defining the preprocessing step.

The number of rows used in this notebook is 300000
true_x_ClosestToBeam true_y_ClosestToBeam true_z_ClosestToBeam true_tx_ClosestToBeam true_ty_ClosestToBeam mc_log10_p mc_is_e mc_is_mu mc_is_h long ... downstream dx dy dz dtx dty dp chi2PerDoF nDoF_f ghostProb
0 0.573766 -0.504192 5.343706 0.136734 0.155601 3.794627 0.0 0.0 1.0 1.0 ... 0.0 -0.002566 0.021092 -0.008306 -0.000227 -0.000460 -5.286621 1.015763 30.303075 0.006457
2 0.727226 0.200873 10.719674 -0.032421 0.117373 3.357954 0.0 0.0 1.0 1.0 ... 0.0 0.085674 -0.099873 -0.268574 -0.001063 0.001247 -9.628418 0.866766 38.164848 0.004073
3 0.321377 -0.521502 -11.688612 0.027271 0.016806 4.028106 0.0 0.0 1.0 1.0 ... 0.0 0.069323 -0.065798 1.786812 -0.000060 0.000454 50.350586 0.910281 33.065956 0.003543
4 0.218335 0.280382 16.488998 -0.067375 0.052465 3.671158 0.0 0.0 1.0 1.0 ... 0.0 0.010765 -0.115082 -0.718598 0.000362 0.001003 3.974121 0.745428 38.067185 0.002595
5 0.111747 -0.403851 21.429649 -0.051381 -0.014217 4.089846 0.0 0.0 1.0 1.0 ... 0.0 -0.042447 0.019251 1.217850 -0.000184 -0.000240 17.000000 1.012854 40.430504 0.023526
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
44509 0.741029 0.163753 -51.839146 0.012476 -0.056459 3.917769 0.0 0.0 1.0 1.0 ... 0.0 1.085071 0.362747 1.880047 -0.002045 -0.001622 -171.697754 1.225178 29.162469 0.020100
44510 0.830669 -0.094383 -42.360165 0.003479 0.030621 3.838578 0.0 0.0 1.0 1.0 ... 0.0 0.172731 0.014883 5.235565 -0.000398 0.000706 0.297852 0.843934 31.736690 0.004302
44511 0.742185 -0.369345 -16.887154 0.033525 0.067367 3.943779 0.0 0.0 1.0 0.0 ... 0.0 -0.006785 0.003745 0.253954 0.000102 0.000122 3192.511719 1.319449 13.165907 0.059842
44515 0.519034 -0.492174 -6.532609 -0.049847 -0.052567 3.996593 0.0 0.0 1.0 1.0 ... 0.0 0.043666 0.015674 1.521009 -0.000918 -0.000590 -9.393555 1.156296 31.286268 0.004501
44516 0.295246 -0.479987 0.392626 -0.041027 -0.025236 3.948789 0.0 0.0 1.0 1.0 ... 0.0 -0.194846 0.070787 1.313074 0.000878 -0.000803 51.608398 1.182083 39.769771 0.011902

300000 rows × 21 columns

Track class representation per particle type¶

Resolution function per particle type¶

Track-quality metrics per particle type¶


Preprocessing (resolution model)¶

We are using the ColumnTransformer class from scikit-learn to apply different preprocessing to the real and boolean features. Boolean features are passed through as they are, while a QuantileTransformer (once again from scikit-learn) is used to transform the real quantities into Gaussian-distributed features.

In contrast to the acceptance and efficiency models, for GAN-based models the preprocessing of the output features is also critical. Since all of the output features are real-valued, we transform them all with a QuantileTransformer.

The preprocessing steps must be stored in the same folder as the trained model for further validation and deployment.
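A sketch of the two preprocessing objects, assuming the feature lists above; the names tX and tY mirror the pickle files in the output below:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

real_features = [
    "true_x_ClosestToBeam", "true_y_ClosestToBeam", "true_z_ClosestToBeam",
    "true_tx_ClosestToBeam", "true_ty_ClosestToBeam", "mc_log10_p",
]
bool_features = ["mc_is_e", "mc_is_mu", "mc_is_h", "long", "upstream", "downstream"]
output_features = [
    "dx", "dy", "dz", "dtx", "dty", "dp", "chi2PerDoF", "nDoF_f", "ghostProb",
]

# Input preprocessing: Gaussianize the real features, pass the boolean flags through
tX = ColumnTransformer([
    ("quantile", QuantileTransformer(output_distribution="normal"), real_features),
    ("flags", "passthrough", bool_features),
])

# Output preprocessing: all features are real, a single QuantileTransformer suffices
tY = QuantileTransformer(output_distribution="normal")

tX.fit(df[real_features + bool_features])
tY.fit(df[output_features])
```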

Preprocessing step stored in:
/workarea/local/private/cache/models/resolution/tX.pkl
Preprocessing step stored in:
/workarea/local/private/cache/models/resolution/tY.pkl

Train, test and validation¶

The dataset is split into:

  • train (50%) used for training the model
  • test (40%) used for measuring the performance of the model
  • validation (10%) used in combination with the train dataset to check for overtraining effects

Split data is stored on disk in chunks of 100 MB (before compression).
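A minimal sketch of such a split based on a per-row uniform random number; the actual implementation, which relies on dask and the feather_io wrapper, may differ:

```python
import numpy as np

# Assign each row to train (50%), test (40%) or validation (10%)
rnd = np.random.uniform(size=len(df))
train_df = df[rnd < 0.5]
test_df = df[(rnd >= 0.5) & (rnd < 0.9)]
validation_df = df[rnd >= 0.9]
```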

Processing resolution-train
Processing resolution-test
Processing resolution-validation
Train 352218
Test 282879
Validation 69997

Covariance model¶

The covariance model will take as an input the same generator-level features as the resolution model, plus the track-quality features (possibly predicted by the resolution GAN). In this way, the correlation between the track quality and the uncertainty on the track parameters is preserved.

We aim to obtain from the model a description of the covariance matrix. However, to ease the training task we introduce the following (reversible) simplifications:

  • as the covariance matrix is symmetric, we learn only the diagonal and lower-triangular parts;
  • as the diagonal elements are positive, we learn their logarithm to prevent the GAN from producing negative predictions;
  • to reduce the correlation between the diagonal elements and the off-diagonal elements on the same row/column, we model the correlation instead of the covariance.

Note. We have not validated that the last tweak, which originates from intuition, is actually needed. In principle, the decorrelation step might be sufficient to remove these correlations. However, this transform is not linear, and applying it is harmless: it is lossless and the overhead for its inversion is negligible. Still, we might review this step in the future.
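A minimal sketch of these simplifications for the 5×5 ClosestToBeam covariance matrix, using the cov_ClosestToBeam_i_j naming of the reco TTree:

```python
import numpy as np

# Diagonal elements are positive: learn their logarithm
for i in range(5):
    df[f"log_cov_ClosestToBeam_{i}_{i}"] = np.log(df[f"cov_ClosestToBeam_{i}_{i}"])

# Off-diagonal elements: convert covariance to correlation
for j in range(5):
    for i in range(j):
        df[f"corr_ClosestToBeam_{i}_{j}"] = df[f"cov_ClosestToBeam_{i}_{j}"] / np.sqrt(
            df[f"cov_ClosestToBeam_{i}_{i}"] * df[f"cov_ClosestToBeam_{j}_{j}"]
        )
```

Both steps are trivially invertible: exponentiating restores the diagonal, and multiplying each correlation by the square roots of the corresponding diagonal elements restores the covariance.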

The complete list of input and output features follow:

Input features (generator-level)

  • true_x_ClosestToBeam
  • true_y_ClosestToBeam
  • true_z_ClosestToBeam
  • true_tx_ClosestToBeam
  • true_ty_ClosestToBeam
  • mc_log10_p

Input features (from resolution GAN)

  • chi2PerDoF
  • nDoF_f
  • ghostProb

Input features (boolean flags)

  • mc_is_e
  • mc_is_mu
  • mc_is_h
  • long
  • upstream
  • downstream

Output features (diagonal)

  • log_cov_ClosestToBeam_0_0
  • log_cov_ClosestToBeam_1_1
  • log_cov_ClosestToBeam_2_2
  • log_cov_ClosestToBeam_3_3
  • log_cov_ClosestToBeam_4_4

Output features (correlations)

  • corr_ClosestToBeam_0_1
  • corr_ClosestToBeam_0_2
  • corr_ClosestToBeam_1_2
  • corr_ClosestToBeam_0_3
  • corr_ClosestToBeam_1_3
  • corr_ClosestToBeam_2_3
  • corr_ClosestToBeam_0_4
  • corr_ClosestToBeam_1_4
  • corr_ClosestToBeam_2_4
  • corr_ClosestToBeam_3_4

Some distributions of the features¶

As for the resolution model, we plot some distributions of the input and output features before applying the scikit-learn preprocessing steps. As mentioned above, the purpose of these plots is to identify error values resulting in large spikes, or unphysical outliers in the distributions.

We will use a 1M-row fraction of the complete dataset to fill these histograms and to train the preprocessing steps.

Preprocessing (covariance model)¶

The preprocessing of the input features is very similar to that adopted for the resolution model. In particular, we use a scikit-learn ColumnTransformer combining the application of a QuantileTransformer to the real features with a passthrough of the boolean flags.

For the output features, the preprocessing step is the sequential combination (pipeline) of three algorithms, sketched after this list:

  • a StandardScaler to correct for the different orders of magnitude of the different rows (and columns) of the covariance matrix, due to the unfortunate choice of units in LHCb;
  • a custom DecorrTransformer, implemented as part of the preprocessing_utils module, with:
    • a "fitting" step composed of the following steps:
      • computing the covariance matrix of the inputs (in this case, the covariance matrix of the covariance-matrix elements, which may sound confusing: the first covariance matrix describes the correlation between different instances of the track covariance matrix obtained by fitting the hits into track objects);
      • computing the eigenvectors of this covariance matrix;
    • a "transformation" step obtained by multiplying the inputs by the matrix built from the eigenvectors;
  • a QuantileTransformer mapping the decorrelated features to Gaussian-distributed ones.
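A sketch of this pipeline; the DecorrTransformer import assumes the preprocessing_utils module mentioned above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler

from preprocessing_utils import DecorrTransformer

# Output preprocessing for the covariance model
tY = Pipeline([
    ("scaler", StandardScaler()),
    ("decorr", DecorrTransformer()),
    ("quantile", QuantileTransformer(output_distribution="normal")),
])
```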

The documentation of the DecorrTransformer is copy-pasted below for reference.

Help on class DecorrTransformer in module preprocessing_utils:

class DecorrTransformer(builtins.object)
 |  A minimalistic decorrelation transform based on the eigenvectors of the covariance matrix.
 |  
 |  This simple transformer removes the linear correlation from the columns of a 
 |  dataset by computing the eigenvectors of their covariance matrix.
 |  A matrix is built by stacking the eigenvectors.
 |  Applying the matrix to the input features, a linear application rotating 
 |  and normalizing the inputs is obtained. 
 |  
 |  The matrix of the eigenvectors is orthogonal by construction, hence the 
 |  inverse of the transform is simply obtained by multiplying the transposed matrix
 |  by the transformed input.
 |  
 |  Methods defined here:
 |  
 |  fit(self, X, y=None)
 |      Computes the covariance matrix of the inputs and its eigenvectors.
 |  
 |  inverse_transform(self, dX)
 |      Applies the inverted transformation
 |  
 |  transform(self, X)
 |      Applies the direct transformation
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
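For concreteness, a minimal re-implementation sketch consistent with the documentation above (the actual class in preprocessing_utils may differ in details):

```python
import numpy as np

class DecorrTransformer:
    """Decorrelation based on the eigenvectors of the covariance matrix (sketch)."""

    def fit(self, X, y=None):
        # Covariance matrix of the columns and its eigenvectors
        covariance = np.cov(np.asarray(X), rowvar=False)
        _, self._eigenvectors = np.linalg.eigh(covariance)
        return self

    def transform(self, X):
        # Rotate the inputs onto the eigenvector basis
        return np.asarray(X) @ self._eigenvectors

    def inverse_transform(self, dX):
        # The eigenvector matrix is orthogonal: its transpose inverts the rotation
        return np.asarray(dX) @ self._eigenvectors.T
```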

Distributions of the transformed variables¶

The plot below represents the outcome of the preprocessing, showing the distributions of the preprocessed quantities and indicating the boundaries enclosing 99.98% of the dataset.

Storing the preprocessing steps¶

The preprocessing steps are stored with pickle in the same folder where we will store the trained models.
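A sketch of this step, assuming the fitted transformers tX and tY and the model folder appearing in the output below:

```python
import os
import pickle

# Persist the fitted transformers next to the (future) trained models
model_dir = "/workarea/local/private/cache/models/covariance"
os.makedirs(model_dir, exist_ok=True)
for name, transformer in [("tX", tX), ("tY", tY)]:
    file_name = os.path.join(model_dir, f"{name}.pkl")
    with open(file_name, "wb") as pkl_file:
        pickle.dump(transformer, pkl_file)
    print(f"Preprocessing step stored in:\n{file_name}")
```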

Preprocessing step stored in:
/workarea/local/private/cache/models/covariance/tX.pkl
Preprocessing step stored in:
/workarea/local/private/cache/models/covariance/tY.pkl

Consistently with the resolution model, the preprocessed dataset is split into:

  • train (50%) used for training the model
  • test (40%) used for measuring the performance of the model
  • validation (10%) used in combination with the train dataset to check for overtraining effects

Split data is stored on disk in chunks of 100 MB (before compression).

Processing covariance-train
Processing covariance-test
Processing covariance-validation
Train 352610
Test 282281
Validation 70203

Summary and conclusion¶

In this notebook, we described the preprocessing steps for the training of the resolution and covariance models. Both models rely on Generative Adversarial Networks, for which we need to preprocess both the input and the output features.

In summary,

  • Resolution model
    • Preprocessing of the input:
      • Passthrough for the boolean variables
      • QuantileTransformer for real features
    • Preprocessing of the output:
      • QuantileTransformer for real features
  • Covariance model
    • Preprocessing of the input:
      • Passthrough for the boolean variables
      • QuantileTransformer for real features
    • Preprocessing of the output:
      • Pipeline of the following transformations:
        • StandardScaler
        • Custom DecorrTransformer
        • QuantileTransformer

In addition, please note that the output of the covariance model is not the covariance matrix itself, but:

  • the $\log$-values of the diagonal elements of the covariance matrix
  • the values of the off-diagonal elements of the correlation matrix

All the QuantileTransformers employed in this notebook use a normal distribution as their output.