Machine Learning Development Lifecycle
The machine learning development lifecycle often follows these steps:
1. Data Collection
In this step, we fetch data from various sources. Common examples include a data lake, a data catalog, or streaming sources (like Kafka or Kinesis). It's important to note that data governance plays a crucial role here: a data catalog that handles access management and governance is vital for an ML lifecycle that works with sensitive PII (personally identifiable information).
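As a concrete illustration, here is a minimal sketch of pulling raw objects from an S3-based data lake with boto3; the bucket and prefix names are placeholders, not part of any real setup.

```python
# Minimal sketch: listing and fetching raw records from an S3-based data lake.
# Bucket and prefix names are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

def list_raw_objects(bucket="my-data-lake", prefix="raw/events/"):
    """Return the keys of raw objects stored under a data-lake prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def fetch_object(bucket, key):
    """Download a single raw object as bytes for downstream preparation."""
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```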
2. Data Integration and Preparation
This step involves several sub-steps:
- Filtering out bad data: Not all data is useful or accurate. We need to filter out irrelevant or incorrect data (a short sketch follows this list).
- Data labeling: Raw data (images, text files, videos, etc.) is categorized appropriately so that an ML model can learn from it. Tools like Amazon SageMaker Ground Truth can accomplish this.
- Data lineage: Lineage becomes increasingly important at this step so that we can track the origin of data that has been transformed by services such as AWS Glue (ETL). It can be tracked with SageMaker ML Lineage Tracking or the AWS Glue Data Catalog.
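To make the "filtering out bad data" sub-step concrete, here is an illustrative sketch using pandas; the column names and validity rules are assumptions borrowed from the music-recommendation example later in this post.

```python
# Illustrative sketch of filtering out bad data with pandas.
# Column names (user_id, song_id, rating) are hypothetical.
import pandas as pd

def clean_listening_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that are missing identifiers or have out-of-range ratings."""
    df = df.dropna(subset=["user_id", "song_id"])        # remove incomplete records
    df = df[(df["rating"] >= 1) & (df["rating"] <= 5)]   # keep only valid ratings
    return df.drop_duplicates()                          # remove exact duplicates
```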
3. Data Visualization and Analysis
This step encompasses a few key sub-steps: feature engineering, model training, and model evaluation.
3.1. Feature Engineering
A "feature" refers to the "input" that ML models use during training and inference to make predictions. For example, an ML application that recommends a music playlist can have song ratings, previously listened songs, or song listening time as ML model features. On AWS, you can use ETL services like AWS Glue or EMR for feature extraction and transformation. Features can then be stored, updated, fetched, or shared in the SageMaker Feature Store.
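For illustration, writing a single record into a feature group with the boto3 Feature Store runtime client could look like the sketch below; the feature group name and feature names are hypothetical and follow the music example above.

```python
# Hedged sketch: writing one record to a SageMaker Feature Store feature group.
# The feature group "song-features" and its feature names are assumptions.
import time
import boto3

featurestore = boto3.client("sagemaker-featurestore-runtime")

def put_song_features(song_id: str, avg_rating: float, listen_seconds: int):
    featurestore.put_record(
        FeatureGroupName="song-features",   # hypothetical feature group
        Record=[
            {"FeatureName": "song_id", "ValueAsString": song_id},
            {"FeatureName": "avg_rating", "ValueAsString": str(avg_rating)},
            {"FeatureName": "listen_seconds", "ValueAsString": str(listen_seconds)},
            {"FeatureName": "event_time", "ValueAsString": str(time.time())},
        ],
    )
```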
3.2. Model Training
During the training step:
- ML algorithms are selected.
- Training data/input features are used to train an ML model.
- Model parameters and hyperparameters are also used to optimize the training (see the sketch after this list).
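A minimal sketch of these three points with the SageMaker Python SDK is shown below; the training image URI, IAM role, S3 paths, and hyperparameter values are placeholders you would substitute with your own.

```python
# Minimal sketch of launching a SageMaker training job.
# Image URI, role ARN, S3 paths, and hyperparameters are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-image-uri>",        # the selected ML algorithm/container
    role="<sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"max_depth": 6, "eta": 0.2},  # example hyperparameters
    output_path="s3://my-bucket/models/",          # where model artifacts land
    sagemaker_session=session,
)

# Training data/input features are passed in as S3 channels.
estimator.fit({"train": "s3://my-bucket/features/train/"})
```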
3.3. Model Evaluation
The evaluation step consists of validation and tuning. This is where the performance and accuracy of the ML model are measured to see whether the business goals will be achieved. Typically, data scientists will:
- Evaluate the effectiveness of multiple models and different versions of each model (see the sketch after this list).
- Fine-tune the models by repeating earlier steps (e.g., data preparation, feature engineering, training).
- Keep track of the experiments and different training iterations and evaluation results via SageMaker.
- Perform data bias analysis to avoid negative real-world impact.
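As an illustration of comparing candidate models, the sketch below scores several model versions on a held-out validation set with scikit-learn metrics; the model objects and validation data are assumed to exist already.

```python
# Illustrative evaluation sketch: compare candidate model versions on held-out data.
from sklearn.metrics import accuracy_score, f1_score

def evaluate(models: dict, X_val, y_val) -> dict:
    """Return a {model_name: metrics} report to compare candidate versions."""
    report = {}
    for name, model in models.items():
        preds = model.predict(X_val)
        report[name] = {
            "accuracy": accuracy_score(y_val, preds),
            "f1": f1_score(y_val, preds, average="weighted"),
        }
    return report
```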
ML in Practice with Cloud Services
In this example, data scientists could run their ML experiments on AWS using the many managed services available for ML.
First, they can obtain data from data sources using ETL services like EMR or Glue. They can create and train their models using Jupyter notebooks via SageMaker Studio. Model artifacts can then be stored in an object store such as Amazon S3. When a model is ready for production use, data scientists can set up an Amazon SageMaker endpoint to host the model and generate predictions. The endpoint can then be fronted by an AWS Lambda function invoked through Amazon API Gateway, so that users can request predictions over the wire.
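A hedged sketch of the Lambda function sitting between API Gateway and the SageMaker endpoint might look like the following; the endpoint name and request payload shape are assumptions.

```python
# Hedged sketch: Lambda handler that forwards an API Gateway request
# to a SageMaker endpoint and returns the prediction.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    response = runtime.invoke_endpoint(
        EndpointName="playlist-recommender",   # hypothetical endpoint name
        ContentType="application/json",
        Body=event["body"],                    # assumes a JSON request body
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```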
Automation in Machine Learning
The machine learning workflow above is done manually, which won't scale well in the long run: running these experiments by hand is time-consuming, costly, and error-prone.
In more detail:
- Manual Intervention: Any changes to the ML model require manual intervention by the data scientist, usually in the form of re-running cells in the SageMaker Jupyter Notebook.
- Versioning and Automation: SageMaker Jupyter Notebooks are hard to version and difficult to automate.
- Auto-scaling: SageMaker endpoints might not auto-scale with the number of incoming requests.
- Feedback Loop: Without automation, there is no feedback loop on the model's performance. If the model's quality deteriorates, it will only be discovered through user complaints.
To address this, we will start introducing MLOps. MLOps is the set of operational practices to automate and standardize ML model building, training, deployment, monitoring, management, and governance.
Enhancements in MLOps
Containerization
Running Jupyter Notebooks on SageMaker has several pain points. One of them is that your ML environment can easily drift over time. For example, some experiments might just run better with a newer version of numpy or tensorflow, or new APIs might be required for training your model. A good enhancement here is to containerize our environments by using Docker to create images. These images can be stored in a managed service like Amazon ECR (Elastic Container Registry).
Remote Code Repositories
Code can and should be stored in a Git repo, rather than residing strictly in the Jupyter Notebook. You could use something like GitHub or AWS CodeCommit for pulling and pushing code.
Workflow Pipelining
Ideally, we want to run our Jupyter Notebooks in sequence from Data Fetching -> Data Preparation -> Training -> Validation. We will also want to store the artifacts of our model somewhere, like Amazon S3. Doing all of these steps manually every time becomes tedious, fast. These steps can all be automated into a workflow. You can achieve this with AWS Step Functions, Apache Airflow (or MWAA for the managed AWS equivalent), or SageMaker Pipelines.
In addition, the pipeline can be triggered based on numerous conditions. For example, a code push to a repository, a cron job trigger using EventBridge, or just manual triggering.
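A minimal SageMaker Pipelines sketch of such a workflow is shown below. The `preprocessor` and `estimator` objects are assumed to be configured elsewhere (for example, as in the training sketch earlier), and the pipeline name, script name, and S3 paths are placeholders.

```python
# Hedged sketch: chaining data preparation and training with SageMaker Pipelines.
# `preprocessor` (e.g. an SKLearnProcessor) and `estimator` are assumed to exist.
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

prepare_step = ProcessingStep(
    name="PrepareData",
    processor=preprocessor,        # assumed to be defined elsewhere
    code="prepare.py",             # hypothetical data preparation script
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,           # reuses the estimator from the training sketch
    inputs={"train": "s3://my-bucket/features/train/"},
    depends_on=["PrepareData"],    # run training only after preparation finishes
)

pipeline = Pipeline(name="ml-lifecycle-pipeline", steps=[prepare_step, train_step])
pipeline.upsert(role_arn="<sagemaker-execution-role-arn>")  # create or update the pipeline
execution = pipeline.start()  # manual trigger; EventBridge or a code push can also start it
```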
Model Registry
ML models can be stored in S3; however, S3 alone won't help you manage them. For example, versioning for ML models is very important to have. A model registry service such as SageMaker Model Registry can manage this for you and track other kinds of metadata, such as hyperparameters.
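For illustration, registering a trained model version in the Model Registry with boto3 could look roughly like this; the model package group, inference image, and artifact location are placeholders.

```python
# Hedged sketch: register a new model version in SageMaker Model Registry.
# Group name, image URI, and model artifact path are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_model_package(
    ModelPackageGroupName="playlist-recommender",   # versioned group of models
    ModelPackageDescription="Candidate model from the latest pipeline run",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [{
            "Image": "<inference-image-uri>",
            "ModelDataUrl": "s3://my-bucket/models/model.tar.gz",
        }],
        "SupportedContentTypes": ["application/json"],
        "SupportedResponseMIMETypes": ["application/json"],
    },
)
```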
Automated Deployment of Models
SageMaker endpoints can be updated with new models. To deploy new models automatically (a sketch follows this list):
- Create an AWS Lambda function that is triggered by an ML model approval (i.e., a model approval in SageMaker Model Registry).
- Fetch the newly approved model within the Lambda function.
- Update the SageMaker endpoint with the new model.
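A hedged sketch of such a deployment Lambda is shown below; the model, endpoint-config, and endpoint names are illustrative, and the shape of the approval event payload is an assumption.

```python
# Hedged sketch: Lambda that deploys a newly approved model to a SageMaker endpoint.
# All resource names are placeholders; the event shape is assumed.
import boto3

sm = boto3.client("sagemaker")

def handler(event, context):
    model_package_arn = event["detail"]["ModelPackageArn"]  # assumed EventBridge payload shape

    # 1. Create a SageMaker Model from the approved model package.
    sm.create_model(
        ModelName="playlist-recommender-v2",
        ExecutionRoleArn="<sagemaker-execution-role-arn>",
        Containers=[{"ModelPackageName": model_package_arn}],
    )

    # 2. Create an endpoint config that points at the new model.
    sm.create_endpoint_config(
        EndpointConfigName="playlist-recommender-v2-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": "playlist-recommender-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }],
    )

    # 3. Swap the live endpoint over to the new config.
    sm.update_endpoint(
        EndpointName="playlist-recommender",
        EndpointConfigName="playlist-recommender-v2-config",
    )
```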
To meet TPS requirements, SageMaker endpoints can also be auto-scaled based on traffic. These endpoints are finally fronted by an AWS Lambda function triggered by API Gateway requests.
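For example, request-based auto-scaling can be enabled on an endpoint variant through Application Auto Scaling; in the sketch below, the endpoint and variant names, capacity limits, and target value are placeholders.

```python
# Hedged sketch: auto-scale a SageMaker endpoint variant based on invocation volume.
# Endpoint/variant names, capacities, and target value are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/playlist-recommender/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # average invocations per instance per minute (illustrative)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```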
For safe, partial deployments, one could set up a canary deployment: CloudWatch alarms track errors from the new model while the rollout percentage is gradually increased until all endpoints run the new version.
Deployment to Different Stages
Models that are approved for public usage can be deployed to separate staging and production accounts. Each account handles the auto-scaling of the new ML model via its SageMaker endpoint and exposes it to external users through APIs served by API Gateway.
This reduces the overall responsibility of each service account, making metrics and costs easier to distinguish. Each service account can also be tuned specifically for the services it maintains; on AWS, for example, you might want to raise specific service quota limits on one account versus another. Finally, this enforces good security by gating each account's permissions with IAM, so that a data scientist doesn't accidentally modify production artifacts.
ML Inference CI/CD
The CI/CD for deploying the ML model to different accounts can be handled through a framework like Jenkins or a managed pipeline service such as AWS CodePipeline. In the case of AWS CodePipeline, the entire flow from SageMaker endpoints to API Gateway can also be expressed as a CloudFormation stack, which groups AWS resources together for deployment across multiple accounts. Finally, the CI/CD can also reuse the same containerized build environments described earlier; with AWS, this can be achieved by using AWS CodeBuild to pull images from Amazon ECR, and these environments can then be promoted to each deployment stage via AWS CodePipeline.
Monitoring ML Models in Production
Once new models are deployed in production, they can also be replicated into a separate object store (e.g., S3) for baselining purposes.
New models created during the training process can then be compared to the baseline models. A model monitoring service like SageMaker Model Monitor can compare the two datasets and generate a report. If the report detects concerning drift, it could emit an event through EventBridge, which could then trigger the retraining pipeline.
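A hedged sketch of closing this feedback loop is shown below: a Lambda function, wired up as the EventBridge target for the drift event, starts the retraining pipeline. The pipeline name is a placeholder that mirrors the earlier pipeline sketch.

```python
# Hedged sketch: EventBridge-triggered Lambda that restarts the retraining pipeline.
# The pipeline name is a placeholder.
import boto3

sm = boto3.client("sagemaker")

def handler(event, context):
    """Triggered by a Model Monitor drift event routed through EventBridge."""
    sm.start_pipeline_execution(
        PipelineName="ml-lifecycle-pipeline",
        PipelineExecutionDisplayName="drift-triggered-retrain",
    )
```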
ML Feature Store
For data scientists, feature engineering is typically done before training and has a major impact on the model's performance. For that reason, having access to the input features within the cloud (for example, via the SageMaker Feature Store) makes sense when running Jupyter notebooks in development.
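For illustration, fetching the latest stored features for a record from the online store during notebook development could look like this; the feature group and record identifier are hypothetical and mirror the earlier write example.

```python
# Hedged sketch: read the latest features for one record from the online Feature Store.
# Feature group name and record identifier are hypothetical.
import boto3

featurestore = boto3.client("sagemaker-featurestore-runtime")

record = featurestore.get_record(
    FeatureGroupName="song-features",
    RecordIdentifierValueAsString="song-123",
)
features = {f["FeatureName"]: f["ValueAsString"] for f in record.get("Record", [])}
```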