LHCb Analysis Facility
Docker image: landerli/lhcbaf:v0p8
This notebook loads, prepares and caches the data for the training of the resolution and covariance models.
Both resolution and covariance rely on GANs and benefit from larger statistics compared to the acceptance and efficiency models.
Nonetheless, the preprocessing procedure is similar: data are loaded from the remote storage and converted from ROOT to pandas DataFrames, then a QuantileTransformer is applied.
In addition, we introduce a custom DecorrTransformer that decorrelates features by diagonalizing their covariance matrix. This step is particularly important for modeling the covariance because of the strong correlations arising between different elements of the covariance matrix. While strong correlation is usually manageable by neural networks, here the correlated features are part of the reference dataset that the generator should learn to reproduce by minimizing some measure of the distance between distributions.
In the original formulation of GANs, which we are using here with minor tweaks, the measure of the distance between the generated and reference distributions saturates when the two distributions are disjoint. Strong correlation between variables makes it more difficult to avoid disjoint distributions and makes the training outcome less predictable.
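To make the saturation argument concrete, recall the standard result for the original GAN objective (a reminder, not part of the notebook code): for a fixed generator, the value function

$$ V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{ref}}}\bigl[\log D(x)\bigr] + \mathbb{E}_{\hat{x} \sim p_{\mathrm{gen}}}\bigl[\log\bigl(1 - D(\hat{x})\bigr)\bigr] $$

evaluated at the optimal discriminator reduces to

$$ \max_D V(D, G) = 2\,\mathrm{JSD}\bigl(p_{\mathrm{ref}} \,\|\, p_{\mathrm{gen}}\bigr) - \log 4 . $$

The Jensen-Shannon divergence is bounded by $\log 2$ and reaches that bound as soon as the two distributions have disjoint supports: the loss then plateaus and provides little gradient to the generator, which is why strongly correlated features, more prone to produce disjoint distributions, make the training harder.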
On top of the standard Python ecosystem we are using:
- uproot, to convert data from ROOT TTrees to pandas DataFrames;
- the Apache Arrow Feather data format, to cache the training dataset in local storage;
- feather_io.py, a custom wrapper to Feather.

Input data are obtained from the environment variable $INPUT_FILES. For debugging purposes, a default value is provided.
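A minimal sketch of how the file list might be resolved (the fallback glob pattern is taken from the output below; the exact logic in the notebook may differ):

```python
import os
from glob import glob

# Fall back to a local cache pattern when $INPUT_FILES is not defined (debugging).
default_pattern = "/workarea/local/private/cache/data/j100/*.root"
input_pattern = os.environ.get("INPUT_FILES", default_pattern)

input_files = glob(input_pattern)
print(f"Found {len(input_files)} files from {input_pattern}")
```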
Found 10 files from /workarea/local/private/cache/data/j100/*.root
ROOT TTree to pandas DataFrame
Training data is obtained from a Bender job running on simulated data.
The ROOT files obtained from Bender are structured in multiple TDirectories for the training of different models.
Here we consider only the tracking part, corresponding to the TDirectory named "TrackingTuple".
Two TTrees are defined:
- sim, with the generator-level particles, possibly matched to reconstructed tracks;
- reco, with the reconstructed tracks and additional information on the reconstructed track parameters and their uncertainties.
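As an illustration, a minimal sketch of how one of these trees might be loaded with uproot (reusing the input_files list resolved above):

```python
import uproot

# Open one of the Bender output files and convert the "reco" TTree of the
# "TrackingTuple" directory into a pandas DataFrame (all branches).
with uproot.open(input_files[0]) as root_file:
    reco_df = root_file["TrackingTuple/reco"].arrays(library="pd")

print(len(reco_df), "reconstructed tracks,", len(reco_df.columns), "branches")
```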
reco: reconstructed particles

The complete list of variables defined for particles passing the reconstruction step is shown below.
mc_key true_mass_ClosestToBeam cov_BegRich1_0_1 cov_AtT_1_3 cov_EndRich2_3_4 evtNumber x_ClosestToBeam cov_BegRich1_1_1 cov_AtT_2_3 cov_EndRich2_4_4 runNumber y_ClosestToBeam cov_BegRich1_0_2 cov_AtT_3_3 PID z_ClosestToBeam cov_BegRich1_1_2 cov_AtT_0_4 clone qop_ClosestToBeam cov_BegRich1_2_2 cov_AtT_1_4 purity cov_ClosestToBeam_0_0 cov_BegRich1_0_3 cov_AtT_2_4 type cov_ClosestToBeam_0_1 cov_BegRich1_1_3 cov_AtT_3_4 history cov_ClosestToBeam_1_1 cov_BegRich1_2_3 cov_AtT_4_4 fitHistory cov_ClosestToBeam_0_2 cov_BegRich1_3_3 px_BegRich2 ghostProb cov_ClosestToBeam_1_2 cov_BegRich1_0_4 py_BegRich2 likelihood cov_ClosestToBeam_2_2 cov_BegRich1_1_4 pz_BegRich2 fitStatus cov_ClosestToBeam_0_3 cov_BegRich1_2_4 x_BegRich2 chi2PerDoF cov_ClosestToBeam_1_3 cov_BegRich1_3_4 y_BegRich2 nDoF cov_ClosestToBeam_2_3 cov_BegRich1_4_4 z_BegRich2 hasT cov_ClosestToBeam_3_3 px_EndRich1 tx_BegRich2 hasVelo cov_ClosestToBeam_0_4 py_EndRich1 ty_BegRich2 hasTT cov_ClosestToBeam_1_4 pz_EndRich1 qop_BegRich2 mc_tx cov_ClosestToBeam_2_4 x_EndRich1 phi_BegRich2 mc_ty cov_ClosestToBeam_3_4 y_EndRich1 eta_BegRich2 mc_px cov_ClosestToBeam_4_4 z_EndRich1 cov_BegRich2_0_0 mc_py px_EndVelo tx_EndRich1 cov_BegRich2_0_1 mc_pz py_EndVelo ty_EndRich1 cov_BegRich2_1_1 mc_p pz_EndVelo qop_EndRich1 cov_BegRich2_0_2 mc_pt x_EndVelo phi_EndRich1 cov_BegRich2_1_2 mc_eta y_EndVelo eta_EndRich1 cov_BegRich2_2_2 mc_phi z_EndVelo cov_EndRich1_0_0 cov_BegRich2_0_3 mc_mass tx_EndVelo cov_EndRich1_0_1 cov_BegRich2_1_3 reco_tx ty_EndVelo cov_EndRich1_1_1 cov_BegRich2_2_3 reco_ty qop_EndVelo cov_EndRich1_0_2 cov_BegRich2_3_3 reco_px phi_EndVelo cov_EndRich1_1_2 cov_BegRich2_0_4 reco_py eta_EndVelo cov_EndRich1_2_2 cov_BegRich2_1_4 reco_pz cov_EndVelo_0_0 cov_EndRich1_0_3 cov_BegRich2_2_4 reco_p cov_EndVelo_0_1 cov_EndRich1_1_3 cov_BegRich2_3_4 reco_pt cov_EndVelo_1_1 cov_EndRich1_2_3 cov_BegRich2_4_4 reco_eta cov_EndVelo_0_2 cov_EndRich1_3_3 px_EndRich2 reco_phi cov_EndVelo_1_2 cov_EndRich1_0_4 py_EndRich2 true_x_ClosestToBeam cov_EndVelo_2_2 cov_EndRich1_1_4 pz_EndRich2 true_y_ClosestToBeam cov_EndVelo_0_3 cov_EndRich1_2_4 x_EndRich2 true_z_ClosestToBeam cov_EndVelo_1_3 cov_EndRich1_3_4 y_EndRich2 tx_ClosestToBeam cov_EndVelo_2_3 cov_EndRich1_4_4 z_EndRich2 ty_ClosestToBeam cov_EndVelo_3_3 px_AtT tx_EndRich2 px_ClosestToBeam cov_EndVelo_0_4 py_AtT ty_EndRich2 py_ClosestToBeam cov_EndVelo_1_4 pz_AtT qop_EndRich2 pz_ClosestToBeam cov_EndVelo_2_4 x_AtT phi_EndRich2 p_ClosestToBeam cov_EndVelo_3_4 y_AtT eta_EndRich2 pt_ClosestToBeam cov_EndVelo_4_4 z_AtT cov_EndRich2_0_0 eta_ClosestToBeam px_BegRich1 tx_AtT cov_EndRich2_0_1 phi_ClosestToBeam py_BegRich1 ty_AtT cov_EndRich2_1_1 mass_ClosestToBeam pz_BegRich1 qop_AtT cov_EndRich2_0_2 true_tx_ClosestToBeam x_BegRich1 phi_AtT cov_EndRich2_1_2 true_ty_ClosestToBeam y_BegRich1 eta_AtT cov_EndRich2_2_2 true_px_ClosestToBeam z_BegRich1 cov_AtT_0_0 cov_EndRich2_0_3 true_py_ClosestToBeam tx_BegRich1 cov_AtT_0_1 cov_EndRich2_1_3 true_pz_ClosestToBeam ty_BegRich1 cov_AtT_1_1 cov_EndRich2_2_3 true_p_ClosestToBeam qop_BegRich1 cov_AtT_0_2 cov_EndRich2_3_3 true_pt_ClosestToBeam phi_BegRich1 cov_AtT_1_2 cov_EndRich2_0_4 true_eta_ClosestToBeam eta_BegRich1 cov_AtT_2_2 cov_EndRich2_1_4 true_phi_ClosestToBeam cov_BegRich1_0_0 cov_AtT_0_3 cov_EndRich2_2_4
For consistency with the acceptance and efficiency models, we define:
- mc_log10_p, as $\log_{10}(p)$, which eases the task of the QuantileTransformer;
- is_e, is_mu and is_h, boolean values defining whether a particle is an electron, a muon or a hadron (pion, kaon or proton);
- nDoF_f, obtained by adding a random number, uniformly distributed in the range $[0, 1]$, to the number of degrees of freedom of the track fit.

In addition we define:
- long, upstream and downstream, which encode the track category in one-hot-encoded classes;
- dx, dy, dz, dtx, dty and dp, representing the differences between the reconstructed and generator-level position ($x$, $y$ and $z$), slope ($t_x$ and $t_y$) and momentum ($p$), as evaluated at the ClosestToBeam track state.

The resolution model takes as inputs the generator-level position, slope and momentum of the particle at the point where its trajectory minimizes the distance from the $z$ axis (ClosestToBeam), three boolean flags representing the nature of the particle (electron, muon or hadron) and three boolean flags representing the track category as defined by the efficiency model.
The resolution model should provide as output the difference between the reconstructed values and the input features, plus track-quality metrics such as the number of degrees of freedom and the $\chi^2$ of the track fit, and the ghost probability.
Clearly, in the training sample these quantities are obtained with the reconstruction algorithms performing the track fit. In Lamarr, these parameters will be generated randomly, aiming to reproduce the distributions observed in the training dataset.
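As an illustration, a minimal pandas sketch of how the derived columns described above might be built from the reco DataFrame. The PDG codes, the mc_ prefix on the particle flags and the sign convention of the differences are assumptions; the one-hot track categories are omitted because the encoding of the type branch is not shown here:

```python
import numpy as np

def add_derived_columns(df, rng=np.random.default_rng(0)):
    """Sketch of the derived columns for the resolution training set."""
    abs_pid = df["PID"].abs()
    df["mc_log10_p"] = np.log10(df["mc_p"])
    df["mc_is_e"] = (abs_pid == 11).astype(float)                  # electrons
    df["mc_is_mu"] = (abs_pid == 13).astype(float)                 # muons
    df["mc_is_h"] = abs_pid.isin([211, 321, 2212]).astype(float)   # pions, kaons, protons
    # Smear the integer number of degrees of freedom into a continuous variable
    df["nDoF_f"] = df["nDoF"] + rng.uniform(0.0, 1.0, len(df))
    # Reconstructed-minus-true differences at the ClosestToBeam state
    for var in ("x", "y", "z", "tx", "ty", "p"):
        df[f"d{var}"] = df[f"{var}_ClosestToBeam"] - df[f"true_{var}_ClosestToBeam"]
    return df

reco_df = add_derived_columns(reco_df)
```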
Input features (real)
true_x_ClosestToBeam
true_y_ClosestToBeam
true_z_ClosestToBeam
true_tx_ClosestToBeam
true_ty_ClosestToBeam
mc_log10_p
Input features (boolean)
mc_is_e
mc_is_mu
mc_is_h
long
upstream
downstream
Output features (real)
dx
dy
dz
dtx
dty
dp
chi2PerDoF
nDoF_f
ghostProb
We report below some basic statistics on the dataset defined to train the resolution model. These plots might highlight pathological contributions to the distributions, such as error values resulting in large spikes or outliers inconsistent with the core of the distribution.
Only a fraction of the whole dataset is used for this study and then for defining the preprocessing step.
true_x_ClosestToBeam | true_y_ClosestToBeam | true_z_ClosestToBeam | true_tx_ClosestToBeam | true_ty_ClosestToBeam | mc_log10_p | mc_is_e | mc_is_mu | mc_is_h | long | ... | downstream | dx | dy | dz | dtx | dty | dp | chi2PerDoF | nDoF_f | ghostProb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.573766 | -0.504192 | 5.343706 | 0.136734 | 0.155601 | 3.794627 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | -0.002566 | 0.021092 | -0.008306 | -0.000227 | -0.000460 | -5.286621 | 1.015763 | 30.303075 | 0.006457 |
2 | 0.727226 | 0.200873 | 10.719674 | -0.032421 | 0.117373 | 3.357954 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.085674 | -0.099873 | -0.268574 | -0.001063 | 0.001247 | -9.628418 | 0.866766 | 38.164848 | 0.004073 |
3 | 0.321377 | -0.521502 | -11.688612 | 0.027271 | 0.016806 | 4.028106 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.069323 | -0.065798 | 1.786812 | -0.000060 | 0.000454 | 50.350586 | 0.910281 | 33.065956 | 0.003543 |
4 | 0.218335 | 0.280382 | 16.488998 | -0.067375 | 0.052465 | 3.671158 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.010765 | -0.115082 | -0.718598 | 0.000362 | 0.001003 | 3.974121 | 0.745428 | 38.067185 | 0.002595 |
5 | 0.111747 | -0.403851 | 21.429649 | -0.051381 | -0.014217 | 4.089846 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | -0.042447 | 0.019251 | 1.217850 | -0.000184 | -0.000240 | 17.000000 | 1.012854 | 40.430504 | 0.023526 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
44509 | 0.741029 | 0.163753 | -51.839146 | 0.012476 | -0.056459 | 3.917769 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 1.085071 | 0.362747 | 1.880047 | -0.002045 | -0.001622 | -171.697754 | 1.225178 | 29.162469 | 0.020100 |
44510 | 0.830669 | -0.094383 | -42.360165 | 0.003479 | 0.030621 | 3.838578 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.172731 | 0.014883 | 5.235565 | -0.000398 | 0.000706 | 0.297852 | 0.843934 | 31.736690 | 0.004302 |
44511 | 0.742185 | -0.369345 | -16.887154 | 0.033525 | 0.067367 | 3.943779 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | -0.006785 | 0.003745 | 0.253954 | 0.000102 | 0.000122 | 3192.511719 | 1.319449 | 13.165907 | 0.059842 |
44515 | 0.519034 | -0.492174 | -6.532609 | -0.049847 | -0.052567 | 3.996593 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.043666 | 0.015674 | 1.521009 | -0.000918 | -0.000590 | -9.393555 | 1.156296 | 31.286268 | 0.004501 |
44516 | 0.295246 | -0.479987 | 0.392626 | -0.041027 | -0.025236 | 3.948789 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | -0.194846 | 0.070787 | 1.313074 | 0.000878 | -0.000803 | 51.608398 | 1.182083 | 39.769771 | 0.011902 |
300000 rows × 21 columns
We are using the ColumnTransformer class from scikit-learn to apply different preprocessing steps to the real and boolean features.
Boolean features are passed through as they are, while a QuantileTransformer (once again from scikit-learn) is used to transform the real quantities into Gaussian-distributed features.
Unlike the acceptance and efficiency models, for GAN-based models the preprocessing of the output features is also critical.
Since all of the output features are real, we transform them with a QuantileTransformer as well.
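A minimal sketch of what these two preprocessing objects might look like (the exact configuration, e.g. the number of quantiles, is an assumption; the names tX and tY match the pickle files stored below):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

real_inputs = [
    "true_x_ClosestToBeam", "true_y_ClosestToBeam", "true_z_ClosestToBeam",
    "true_tx_ClosestToBeam", "true_ty_ClosestToBeam", "mc_log10_p",
]
bool_inputs = ["mc_is_e", "mc_is_mu", "mc_is_h", "long", "upstream", "downstream"]
outputs = ["dx", "dy", "dz", "dtx", "dty", "dp", "chi2PerDoF", "nDoF_f", "ghostProb"]

# Input preprocessing: Gaussianize the real features, pass the booleans through.
tX = ColumnTransformer([
    ("real", QuantileTransformer(output_distribution="normal"), real_inputs),
    ("bool", "passthrough", bool_inputs),
])

# Output preprocessing: all output features are real, a single transformer suffices.
tY = QuantileTransformer(output_distribution="normal")

# Assuming all listed columns (including the track-category one-hots) are present.
X = tX.fit_transform(reco_df[real_inputs + bool_inputs])
Y = tY.fit_transform(reco_df[outputs])
```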
The preprocessing steps must be stored in the same folder as the trained model for further validation and deployment.
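For instance, a sketch of how the fitted transformers might be persisted with pickle (the directory is taken from the output below):

```python
import os
import pickle

model_dir = "/workarea/local/private/cache/models/resolution"
os.makedirs(model_dir, exist_ok=True)

# Store the fitted preprocessing steps next to the (future) trained model.
for name, transformer in [("tX", tX), ("tY", tY)]:
    path = os.path.join(model_dir, f"{name}.pkl")
    with open(path, "wb") as pkl_file:
        pickle.dump(transformer, pkl_file)
    print(path)
```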
/workarea/local/private/cache/models/resolution/tX.pkl
/workarea/local/private/cache/models/resolution/tY.pkl
The dataset is split into training, test and validation samples.
Split data is stored on disk in chunks of 100 MB (before compression).
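A hedged sketch of the split-and-cache step (the actual notebook uses the custom feather_io.py wrapper and ~100 MB chunks; the split fractions and the single-file-per-sample layout below are simplifications):

```python
import numpy as np
import pyarrow.feather as feather

rng = np.random.default_rng(42)
assignment = rng.choice(["train", "test", "validation"], size=len(reco_df), p=[0.5, 0.4, 0.1])

for sample in ("train", "test", "validation"):
    print(f"Processing resolution-{sample}")
    part = reco_df[assignment == sample]
    # The real notebook writes ~100 MB chunks through feather_io.py;
    # here we simply write one Feather file per sample.
    feather.write_feather(part, f"resolution-{sample}.feather")
```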
Processing resolution-train Processing resolution-test Processing resolution-validation
0 | |
---|---|
Train | 352218 |
Test | 282879 |
Validation | 69997 |
The covariance model will take as inputs the same generator-level features as the resolution model, plus the track-quality features (possibly predicted by the resolution GAN). In this way, the correlation between the track quality and the uncertainty on the track parameters is preserved.
We aim to obtain from the model a description of the covariance matrix. However, to ease the training task, we introduce the following (reversible) simplifications:
Note. We have not validated that the last tweak, which originates from intuition, is actually needed. In principle, the decorrelation step might be sufficient to remove these correlations. However, this transform is not linear and applying it is harmless: it is lossless and the overhead for inversion is negligible. Still, we might review this step in the future.
The complete list of input and output features follows:
Input features (generator-level)
true_x_ClosestToBeam
true_y_ClosestToBeam
true_z_ClosestToBeam
true_tx_ClosestToBeam
true_ty_ClosestToBeam
mc_log10_p
Input features (from resolution GAN)
chi2PerDoF
nDoF_f
ghostProb
Input features (boolean flags)
mc_is_e
mc_is_mu
mc_is_h
long
upstream
downstream
Output features (diagonal)
log_cov_ClosestToBeam_0_0
log_cov_ClosestToBeam_1_1
log_cov_ClosestToBeam_2_2
log_cov_ClosestToBeam_3_3
log_cov_ClosestToBeam_4_4
Output features (correlations)
corr_ClosestToBeam_0_1
corr_ClosestToBeam_0_2
corr_ClosestToBeam_1_2
corr_ClosestToBeam_0_3
corr_ClosestToBeam_1_3
corr_ClosestToBeam_2_3
corr_ClosestToBeam_0_4
corr_ClosestToBeam_1_4
corr_ClosestToBeam_2_4
corr_ClosestToBeam_3_4
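A minimal sketch of how these output features might be derived from the covariance branches, assuming cov_ClosestToBeam_i_j stores the (i, j) element of the 5×5 covariance matrix of the ClosestToBeam state:

```python
import numpy as np

def add_covariance_features(df):
    """Sketch: log-diagonal and correlation coefficients of the 5x5 covariance."""
    # Logarithm of the diagonal elements (variances)
    for i in range(5):
        df[f"log_cov_ClosestToBeam_{i}_{i}"] = np.log(df[f"cov_ClosestToBeam_{i}_{i}"])
    # Correlation coefficients replacing the off-diagonal covariances
    for i in range(5):
        for j in range(i + 1, 5):
            df[f"corr_ClosestToBeam_{i}_{j}"] = df[f"cov_ClosestToBeam_{i}_{j}"] / np.sqrt(
                df[f"cov_ClosestToBeam_{i}_{i}"] * df[f"cov_ClosestToBeam_{j}_{j}"]
            )
    return df
```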
As for the resolution model, we plot some distributions of the input and output features before applying the scikit-learn preprocessing steps. As mentioned above, the purpose of these plots is to identify error values resulting in large spikes, or unphysical outliers in the distributions.
We will use a 1M-row fraction of the complete dataset to fill these histograms and to train the preprocessing steps.
The preprocessing of the input features is very similar to the one adopted for the resolution model.
In particular, we are using a scikit-learn ColumnTransformer combining the application of a QuantileTransformer to the real features with a passthrough of the boolean flags.
For the output features, the preprocessing step is actually the sequential combination (pipeline) of three algorithms:
- a StandardScaler, to correct for the different orders of magnitude of the different rows (and columns) of the covariance matrix, due to the unfortunate choice of units in LHCb;
- a DecorrTransformer, implemented as part of the preprocessing_utils module.

The documentation of the DecorrTransformer is copy-pasted below for reference.
Help on class DecorrTransformer in module preprocessing_utils:

class DecorrTransformer(builtins.object)
 |  A minmalistic decorrelation transform based on the eigen vectors of covariance matrix.
 |
 |  This simple transformers removes the linear correlation from the columns of a
 |  dataset by computing the eigen vectors of their covariance matrix.
 |  A matrix is built by stacking the eigen vectors.
 |  Applying the matrix to the input features, a linear application rotating
 |  and normalizing the inputs is obtained.
 |
 |  The matrix of the eigen vector is orthogonal by construction, hence the
 |  inverse of the transform is simply obtained by multiplying the trasposed matrix
 |  to the transformed input.
 |
 |  Methods defined here:
 |
 |  fit(self, X, y=None)
 |      Computes the covariance matrix of the inputs and its eigen vectors.
 |
 |  inverse_transform(self, dX)
 |      Applies the inverted tranformation
 |
 |  transform(self, X)
 |      Applies the direct transformation
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
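Based on the documentation above, a minimal sketch of how such a transformer could be implemented (an illustration only, not the actual code of preprocessing_utils):

```python
import numpy as np

class DecorrTransformer:
    """Sketch of a decorrelation transform based on covariance eigenvectors."""

    def fit(self, X, y=None):
        # Eigen-decomposition of the covariance matrix of the columns of X
        cov = np.cov(np.asarray(X), rowvar=False)
        _, eigenvectors = np.linalg.eigh(cov)
        # The rotation matrix is built by stacking the (orthonormal) eigenvectors
        self.rotation_ = eigenvectors
        return self

    def transform(self, X):
        # Project onto the eigenvector basis, removing linear correlations
        return np.asarray(X) @ self.rotation_

    def inverse_transform(self, dX):
        # The rotation is orthogonal, so its inverse is its transpose
        return np.asarray(dX) @ self.rotation_.T
```

With fit, transform and inverse_transform defined, such an object can be chained with the StandardScaler (and any further step) in a scikit-learn Pipeline.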
The plot below represents the outcome of the preprocessing, showing the distributions of the preprocessed quantities and indicating the boundaries containing 99.98% of the dataset.
The preprocessing steps are stored with pickle in the same folder where we will store the trained models.
/workarea/local/private/cache/models/covariance/tX.pkl
/workarea/local/private/cache/models/covariance/tY.pkl
Consistently with the resolution model, the preprocessed dataset is split into training, test and validation samples.
Split data is stored on disk in chunks of 100 MB (before compression).
Processing covariance-train Processing covariance-test Processing covariance-validation
0 | |
---|---|
Train | 352610 |
Test | 282281 |
Validation | 70203 |
In this notebook, we described the preprocessing steps for the training of the resolution and covariance models. Both models will rely on Generative Adversarial Networks, for which we need to preprocess both the input and the output features.
In summary,
In addition, please note that the output of the covariance model is not the covariance matrix itself, but:
- the logarithm of its diagonal elements;
- the correlation coefficients associated with its off-diagonal elements.

All the QuantileTransformers employed in this notebook use a normal distribution as output.
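For completeness, a sketch of how the covariance matrix can be reconstructed from these outputs by inverting the (lossless) parametrization described above:

```python
import numpy as np

def build_covariance(log_diag, corr):
    """Rebuild a 5x5 covariance matrix from log-variances and correlations.

    log_diag: sequence of 5 log-variances;
    corr: mapping {(i, j): rho_ij} for i < j.
    """
    variances = np.exp(np.asarray(log_diag))
    sigma = np.sqrt(variances)
    cov = np.diag(variances)
    for (i, j), rho in corr.items():
        cov[i, j] = cov[j, i] = rho * sigma[i] * sigma[j]
    return cov
```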