[Deployment] 3.5 Building a Reproductible Machine Learning Pipeline
in Development on API
How to Build a Reproductible Machine Learning Pipeline.
Building a Reproductible Machine Learning Pipeline
What is reproducibility in Machine Learning
: Reproducibly is the ability to duplicate a machine learning model exactly, such that given the same raw data as input, both models return the same output
Machine Learning Pipleline: Overview
: Gathering Data Sources -> Data Analysis -> Data Pre-processing -> Variable selection -> Machine Learning Model Building -> Model Deployment
Reproducibility during data gathering
1) Data can be the most difficult challange to ensure reproducibility
Problems occur if the training dataset can’t be reproduced at a later stage
For example, the databases are constanly updated and overwritten, therefore values present at a certain time point differ from values later on.
Order of data during data loading is random, for example when retrieving the rows with SQL
2) How to ensure reproducibility?
Save a sanpshot of traning data(either the actual data, or a reference to a storage location such as AWS S3)
Design data sources with accurate timestamps, so that a view of the data at any point in time can be retrieved
Reproducibility during feature creation
1) Lack of reproducibility may arise from
Replacing missing data with random extracted values
Removing labels based on percentages of observations
Calculating statistical values like the mean to use for missing value replacement
More complex equations to extract featurn, e.g., aggregating over time
2) How to ensure reproducibility?
Code on how a feature is generated should be tracked under version control and published with auto-incremented or timestamp hashed versions
Many of the parameters extracted for feature engineering depend on the data used for training -> ensure data is reproducible
If replacing by extracting random sample always set a seed
Reproducibility during model building
1) How to ensure reproducibility
Record the order of the features
Record applied feature transformation, e.g., standatdisation
Record hyperparameters
For models that require an element of randomness to be trained (decision trees, neural networks, gradient descents), always set a seed
If the final model is a stack of models, record the structure of the ensemble.
Reproducibility during model deploment - Software environment and implementation
1) How to ensure reproducibility
For full reproducibility, the software versions should match exactly - applications should list all third party library dependencies and their versions
Use a container and track its sepcifications, sush as image version (which will include important information such as operating system version)
Research, develop and deploy utilising the same langauge, e.g., python
Prior to building the model, understand how the model will be integrated in production - how the model will be consumed-, so you can make sure the way it was designed can be fully integrated
- Examples of partial deployment include, some data not being available at the time consuming the mode live
- Filters in place do not allow a certain cohort of data to be seen by the model