A typical ML process for training a model is as follows:
- Start with data preprocessing
- Train the model
- Evaluate the model
- Iterate the above steps
These steps are often executed manually by running scripts and notebooks and applying human judgment to decide whether the model has trained successfully. When a model is trained for production, however, the process is more likely to be automated with an ML model training pipeline. This way, the model training can be reviewed, reproduced, tuned, and debugged at any point. When the pipeline is triggered, its components run in sequential order, and only if they all succeed is a micro-service that serves the trained model created or updated.
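In script form, the manual version of these steps often looks something like the following minimal sketch, assuming a scikit-learn workflow and a hypothetical CSV dataset with a "label" column:

```python
# Minimal sketch of the manual preprocess/train/evaluate loop.
# "training_data.csv" and the "label" column are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data preprocessing
df = pd.read_csv("training_data.csv")
X = df.drop(columns=["label"])
y = df["label"]
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2)

# Train the model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate the model; a human looks at this number and decides
# whether to iterate with different data or hyperparameters.
print("accuracy:", accuracy_score(y_eval, model.predict(X_eval)))
```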
Most of the components in the above pipeline can be reused and have open-source implementations. Data pre-processing, model training, and validation components of many kinds are hosted on AI Hub. You can go very far into production ML training and serving without having to write custom components.
Let’s go through the basics before using any components. All of the components are applications packaged in Docker containers. For example, to train an ML model, we need the Docker registry URI that points to the trainer container, and we pass it arguments such as the training data location, the learning rate, and the output location. When we run that trainer image, the application reads the runtime arguments and produces a trained model. There are many solutions for running Docker containers in the cloud; AI Platform Training is a specialized service for running ML training containers.
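A trainer entry point inside such a container might look like this minimal sketch; the flag names are illustrative, not a required convention:

```python
# Hypothetical entry point baked into the trainer image. The container
# reads its runtime arguments and writes a trained model to the output
# location; the flag names here are illustrative only.
import argparse


def main():
    parser = argparse.ArgumentParser(description="Trainer container entry point")
    parser.add_argument("--training-data", required=True,
                        help="Path or GCS URI of the training data")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--output-dir", required=True,
                        help="Where to write the trained model")
    args = parser.parse_args()

    # ... load args.training_data, fit a model with args.learning_rate,
    # and save the result under args.output_dir ...


if __name__ == "__main__":
    main()
```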
AI Platform Training stands out among those solutions because distributed training is easy to set up and TPUs are available. To serve inference from your model, you need a serving REST endpoint. That is easy if you let AI Platform Serving host the model: it exposes a REST API that returns predictions and scales automatically with demand, which eliminates the need to set up and maintain complex serving infrastructure. AI Platform also simplifies updating prediction models, since each deployed model is addressed by a model name and a version. Note that the process of training and deployment is the same for TensorFlow components as well.

Orchestration is the final stage of pipeline development, and you should be able to execute every component separately before you put them in a pipeline.
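For example, a deployed model can be exercised on its own with a single online prediction request. Here is a minimal sketch using the Google API Python client; the project, model, and version names, as well as the example instance, are placeholders:

```python
# Minimal sketch of calling an AI Platform online prediction endpoint.
# PROJECT, MODEL, and VERSION are placeholders; credentials are expected
# to come from the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).
from googleapiclient import discovery

service = discovery.build("ml", "v1")
name = "projects/{}/models/{}/versions/{}".format("PROJECT", "MODEL", "VERSION")

response = service.projects().predict(
    name=name,
    body={"instances": [[5.1, 3.5, 1.4, 0.2]]},  # example feature vector
).execute()

if "error" in response:
    raise RuntimeError(response["error"])
print(response["predictions"])
```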
So how do we chain our Docker ML containers?
Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. To create a pipeline, we describe the DAG of components with the Kubeflow Pipelines (KFP) Python SDK. KFP pipelines are composed of components, and components are Docker containers plus some metadata: their inputs, their outputs, and the Docker registry URI of the image to run. That metadata is typically stored in a YAML file.
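To make the SDK concrete, here is a minimal sketch of a one-step pipeline. The component.yaml URL and the parameter names are placeholders; in practice they must match a real component spec:

```python
# Minimal KFP (v1 SDK) pipeline sketch. The component URL and the
# parameter names passed to the component are hypothetical.
import kfp
from kfp import components, dsl

# Load a reusable component from its component.yaml (placeholder URL).
train_op = components.load_component_from_url(
    "https://example.com/trainer/component.yaml"
)


@dsl.pipeline(name="train-and-deploy", description="Minimal training DAG")
def training_pipeline(training_data: str, learning_rate: float = 0.01):
    # Argument names must match the inputs declared in component.yaml;
    # the ones used here are illustrative.
    train_task = train_op(
        training_data=training_data,
        learning_rate=learning_rate,
    )
    # Evaluation and deployment steps would consume train_task.outputs
    # here, which is what turns the pipeline into a DAG.


if __name__ == "__main__":
    # Compile to a workflow spec that Kubeflow Pipelines can execute.
    kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```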
ML Container + component.yaml = Component
Many ML containers are released with such a metadata file and published under the “Kubeflow pipeline” category. Some ML containers don’t have a component.yaml file; they are just Docker containers that parse command-line arguments.
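For containers like these, one option is to wrap them directly as a pipeline step. A minimal sketch with the KFP v1 SDK’s dsl.ContainerOp; the image name, flags, and paths below are placeholders:

```python
# Wrapping a plain Docker container (no component.yaml) as a pipeline step
# with the KFP v1 SDK. Image name, flags, and bucket path are placeholders
# for whatever the container's entry point actually parses.
from kfp import dsl


def trainer_step(training_data: str, learning_rate: float):
    return dsl.ContainerOp(
        name="train-model",
        image="gcr.io/my-project/trainer:latest",
        arguments=[
            "--training-data", training_data,
            "--learning-rate", learning_rate,
            "--output-dir", "gs://my-bucket/models/",
        ],
    )
```

ContainerOp belongs to the v1 SDK; later SDK versions favor describing even these containers with a component spec, but the idea is the same: the pipeline step is just the container image plus the arguments it parses.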