LHCb Analysis Facility
Docker image: landerlini/lhcbaf:v0p8
This is the first notebook of the processing chain. Here, we obtain the data from the remote storage and convert them from ROOT format to pandas DataFrames. Then we apply a QuantileTransformer from scikit-learn so that the artificial neural networks are presented with Gaussian-distributed data. Finally, we store the preprocessing steps to remote storage.
This is the notebook where the input features and the labels are defined.
On top of the standard Python ecosystem we are using:
- uproot, to convert data from ROOT TTrees to pandas DataFrames;
- dask DataFrame, to enable processing datasets larger than the available RAM: dask takes care of flushing the data from disk to RAM and of converting from ROOT to the pandas data format on demand;
- the Arrow Feather data format, together with the helper functions in feather_io.py, which are needed to cache the training dataset in local storage.
Input data are obtained from the environment variable $INPUT_FILES. For debugging purposes a default value is provided.
Found 10 files from /workarea/cloud-storage/anderlinil/LamarrBenderTrain/j100/*.root
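As a minimal sketch, the file lookup might be implemented as below; the environment variable name comes from the text above, while the default glob pattern is taken from the reported output.

```python
import os
from glob import glob

# Resolve the list of input ROOT files from $INPUT_FILES, falling back to a
# debugging default (the pattern reported in the output above).
pattern = os.environ.get(
    "INPUT_FILES",
    "/workarea/cloud-storage/anderlinil/LamarrBenderTrain/j100/*.root",
)
input_files = glob(pattern)
print(f"Found {len(input_files)} files from {pattern}")
```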
ROOT TTree to Pandas DataFrame

Training data is obtained from a Bender job running on simulated data. The ROOT files obtained from Bender are structured in multiple TDirectories for the training of different models. Here we are considering only the Tracking part, corresponding to the TDirectory named "TrackingTuple".
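As an illustration only, a single file could be read with the uproot 4 API as sketched below; the actual notebook relies on dask for out-of-core processing, so this is not its literal code.

```python
import uproot

# Read the two TTrees of the "TrackingTuple" TDirectory from one Bender
# output file into pandas DataFrames (single-file illustration).
with uproot.open(input_files[0]) as root_file:
    sim_df = root_file["TrackingTuple/sim"].arrays(library="pd")
    reco_df = root_file["TrackingTuple/reco"].arrays(library="pd")

print(sim_df.columns.tolist())
```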
Two TTrees are defined:
- sim, with the generator-level particles, possibly matched to reconstructed tracks;
- reco, with the reconstructed tracks and additional information on the reconstructed track parameters and their uncertainties.

sim: Generator-level particles

The complete list of variables defined for all generator-level particles is provided below.
mc_key
evtNumber
runNumber
acceptance
reconstructible
recobleCat
mcID
mcIDmother
mc_charge
fromSignal
reconstructed
type
mc_vertexType
mc_primaryVertexType
mc_motherVertexType
mc_x
mc_y
mc_z
mc_r
mc_tx
mc_ty
mc_px
mc_py
mc_pz
mc_p
mc_pt
mc_eta
mc_phi
mc_mass
x_ClosestToBeam
y_ClosestToBeam
z_ClosestToBeam
tx_ClosestToBeam
ty_ClosestToBeam
px_ClosestToBeam
py_ClosestToBeam
pz_ClosestToBeam
p_ClosestToBeam
pt_ClosestToBeam
eta_ClosestToBeam
phi_ClosestToBeam
mass_ClosestToBeam
Some useful features are missing; in particular, we define is_e, is_mu and is_h as boolean flags indicating whether a particle is an electron, a muon or a hadron (pion, kaon or proton); a sketch of how these flags can be derived is shown after the list. The complete list of features is provided below.
mc_x
mc_y
mc_z
mc_log10_p
mc_tx
mc_ty
mc_eta
mc_phi
mc_is_e
mc_is_mu
mc_is_h
mc_charge
acceptance
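A rough sketch of the flag derivation, continuing from the single-file DataFrame read above, is shown below; the assumption that mcID holds the signed PDG code, as well as the derivation of mc_log10_p, are illustrative and not taken from the notebook itself.

```python
import numpy as np

# Derive the boolean particle-type flags from the PDG code stored in mcID
# (assumption: mcID is the signed PDG particle ID).
abs_id = sim_df["mcID"].abs()
sim_df["mc_is_e"]  = (abs_id == 11)                     # electrons
sim_df["mc_is_mu"] = (abs_id == 13)                     # muons
sim_df["mc_is_h"]  = abs_id.isin([211, 321, 2212])      # pions, kaons, protons

# mc_log10_p is derived from the generator-level momentum (assumed definition)
sim_df["mc_log10_p"] = np.log10(sim_df["mc_p"])
```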
Before preprocessing the data we ensure that the distributions are well behaved. We should pay particular attention to outliers falling in regions completely disjoint from the core of the distribution.
Tails are tractable with the QuantileTransformer, while error values used by the reconstruction may lead to very inconsistent training of the neural networks and should be treated explicitly.
For the acceptance this does not seem to be an issue.
A loose acceptance cut is already applied at the time of generating the nTuple; still, the fraction of particles in acceptance is small.
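A quick way to perform such a check is to histogram a few features, as in the sketch below; the selection of features to inspect is an arbitrary example.

```python
import matplotlib.pyplot as plt

# Quick look at some feature distributions before fitting the transformer;
# a logarithmic y-axis makes isolated outliers easier to spot.
features_to_check = ["mc_x", "mc_y", "mc_z", "mc_log10_p"]
fig, axes = plt.subplots(1, len(features_to_check), figsize=(16, 3))
for ax, feature in zip(axes, features_to_check):
    ax.hist(sim_df[feature].dropna(), bins=100)
    ax.set_xlabel(feature)
    ax.set_yscale("log")
plt.show()
```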
ColumnTransformer

We define a preprocessing step applying different transformations to different variables. Indeed, we plan to use the QuantileTransformer on continuous features, while we should simply ignore the boolean flags such as is_mu or is_h.
To achieve this we use the ColumnTransformer defined in scikit-learn: we apply the QuantileTransformer to the continuous features, while we simply passthrough the boolean flags.
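A minimal sketch of such a preprocessing step is given below; the name tX mirrors the file name used later, and the exact split of columns between the two branches is an assumption based on the feature list shown above.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

# Continuous features are mapped to a normal distribution, while the boolean
# flags (and the charge) are passed through unchanged.
continuous_features = ["mc_x", "mc_y", "mc_z", "mc_log10_p",
                       "mc_tx", "mc_ty", "mc_eta", "mc_phi"]
boolean_features = ["mc_is_e", "mc_is_mu", "mc_is_h", "mc_charge"]

tX = ColumnTransformer([
    ("quantile", QuantileTransformer(output_distribution="normal"), continuous_features),
    ("flags", "passthrough", boolean_features),
])

# Fit the transformer on the available sample (here the sim DataFrame from above)
tX.fit(sim_df[continuous_features + boolean_features])
```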
Finally we store the trained preprocessing step in the same destination as the model that we are going to train in the next notebook.
/workarea/local/private/cache/models/acceptance/tX.pkl
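A possible way to persist it is sketched below; the use of pickle is an assumption based on the .pkl extension of the path above.

```python
import pickle

# Persist the fitted preprocessing step next to the model that will be
# trained in the next notebook (path taken from above).
with open("/workarea/local/private/cache/models/acceptance/tX.pkl", "wb") as f:
    pickle.dump(tX, f)
```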
We split the sample into three subsets: train, test and validation.
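One possible splitting, sketched under the assumption that the full sample is held in a dask DataFrame, is shown below; the fractions are inferred from the entry counts reported in the table further down.

```python
import dask.dataframe as dd

# Illustrative only: build a dask DataFrame from the single-file sample read
# above; in the real workflow the full dataset would be loaded lazily instead.
ddf = dd.from_pandas(sim_df, npartitions=8)

# Split into train / test / validation subsets.
train_df, test_df, validation_df = ddf.random_split([0.5, 0.4, 0.1], random_state=42)
```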
Finally, we are ready to process the entire dataset, applying the preprocessing step and storing the relevant features and labels in local storage (a sketch of the processing loop is shown after the table below).
Processing acceptance-train
Processing acceptance-test
Processing acceptance-validation

| Subset | Entries |
|---|---|
| Train | 2645683 |
| Test | 2118122 |
| Validation | 530089 |
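The processing loop producing the output above could look roughly as follows; the cache directory and file layout are illustrative, since the notebook relies on the helpers in feather_io.py for the Feather I/O.

```python
import os
import pandas as pd

# Apply the fitted ColumnTransformer to each subset and cache features and
# labels as local Feather files.
cache_dir = "/tmp/preprocessed"                           # illustrative path
os.makedirs(cache_dir, exist_ok=True)
all_features = continuous_features + boolean_features

for name, subset in [("train", train_df), ("test", test_df), ("validation", validation_df)]:
    print(f"Processing acceptance-{name}")
    chunk = subset.compute()                              # materialize this subset
    out = pd.DataFrame(tX.transform(chunk[all_features]), columns=all_features)
    out["acceptance"] = chunk["acceptance"].to_numpy()    # label column
    out.reset_index(drop=True).to_feather(f"{cache_dir}/acceptance-{name}.feather")
```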
The same steps followed for modelling the acceptance are now followed for the model of the reconstruction efficiency of charged particles. The efficiency model has a significant difference with respect to the acceptance: while a particle is either in acceptance or not, a particle in acceptance can be reconstructed in different ways, corresponding to different track classes:
- Long tracks traverse the whole tracking system, from the Vertex Locator to the tracking stations downstream of the magnet, and provide the best momentum resolution.
- Upstream tracks are usually connected to low-momentum particles that were bent out of the acceptance by the magnetic field before reaching the downstream tracker. The resolution on the momentum is rather poor, but the residual magnetic field between the Vertex Locator and the upstream tracker is sufficient to perform a decent momentum measurement.
- Downstream tracks are mainly due to particles decaying outside the Vertex Locator; typical examples are the decay products of long-lived particles such as K0S mesons and Λ baryons. Downstream tracks are crucial for a number of studies involving hyperons.

We define the set of labels accordingly, while for simplicity the same features as for the acceptance are used; a sketch of a possible label derivation is shown after the list below.
mc_x
mc_y
mc_z
mc_log10_p
mc_tx
mc_ty
mc_eta
mc_phi
mc_is_e
mc_is_mu
mc_is_h
mc_charge
not_recoed
recoed_as_long
recoed_as_upstream
recoed_as_downstream
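A possible derivation of these labels is sketched below; the encoding of the type column (LHCb Track::Types numbering) is an assumption, not taken from the notebook.

```python
# ASSUMPTION: `reconstructed` is a boolean flag and `type` follows the LHCb
# Track::Types convention (3 = Long, 4 = Upstream, 5 = Downstream).
recoed = sim_df["reconstructed"].astype(bool)
sim_df["not_recoed"] = ~recoed
sim_df["recoed_as_long"] = recoed & (sim_df["type"] == 3)
sim_df["recoed_as_upstream"] = recoed & (sim_df["type"] == 4)
sim_df["recoed_as_downstream"] = recoed & (sim_df["type"] == 5)
```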
As done for the preprocessed dataset used to train the acceptance model, we store in a local folder the datasets ready to be processed with TensorFlow.
Processing efficiency-train
Processing efficiency-test
Processing efficiency-validation

| Subset | Entries |
|---|---|
| Train | 2644496 |
| Test | 2119795 |
| Validation | 529603 |
Finally we need to publish the preprocessing step to the same target directory that will host the trained model.
/workarea/local/private/cache/models/efficiency/tX.pkl