[Deployment] 3.5 Building a Reproductible Machine Learning Pipeline


How to Build a Reproductible Machine Learning Pipeline.


Building a Reproductible Machine Learning Pipeline

What is reproducibility in Machine Learning

: Reproducibly is the ability to duplicate a machine learning model exactly, such that given the same raw data as input, both models return the same output

Machine Learning Pipleline: Overview

: Gathering Data Sources -> Data Analysis -> Data Pre-processing -> Variable selection -> Machine Learning Model Building -> Model Deployment

Reproducibility during data gathering

1) Data can be the most difficult challange to ensure reproducibility

  • Problems occur if the training dataset can’t be reproduced at a later stage

  • For example, the databases are constanly updated and overwritten, therefore values present at a certain time point differ from values later on.

  • Order of data during data loading is random, for example when retrieving the rows with SQL

2) How to ensure reproducibility?

  • Save a sanpshot of traning data(either the actual data, or a reference to a storage location such as AWS S3)

  • Design data sources with accurate timestamps, so that a view of the data at any point in time can be retrieved

Reproducibility during feature creation

1) Lack of reproducibility may arise from

  • Replacing missing data with random extracted values

  • Removing labels based on percentages of observations

  • Calculating statistical values like the mean to use for missing value replacement

  • More complex equations to extract featurn, e.g., aggregating over time

2) How to ensure reproducibility?

  • Code on how a feature is generated should be tracked under version control and published with auto-incremented or timestamp hashed versions

  • Many of the parameters extracted for feature engineering depend on the data used for training -> ensure data is reproducible

  • If replacing by extracting random sample always set a seed

Reproducibility during model building

1) How to ensure reproducibility

  • Record the order of the features

  • Record applied feature transformation, e.g., standatdisation

  • Record hyperparameters

  • For models that require an element of randomness to be trained (decision trees, neural networks, gradient descents), always set a seed

  • If the final model is a stack of models, record the structure of the ensemble.

Reproducibility during model deploment - Software environment and implementation

1) How to ensure reproducibility

  • For full reproducibility, the software versions should match exactly - applications should list all third party library dependencies and their versions

  • Use a container and track its sepcifications, sush as image version (which will include important information such as operating system version)

  • Research, develop and deploy utilising the same langauge, e.g., python

  • Prior to building the model, understand how the model will be integrated in production - how the model will be consumed-, so you can make sure the way it was designed can be fully integrated

    • Examples of partial deployment include, some data not being available at the time consuming the mode live
    • Filters in place do not allow a certain cohort of data to be seen by the model





© 2018. by statssy

Powered by statssy