User Guide

This is a comprehensive guide to deploying ML projects to k8s using Bodywork. It assumes that you understand the key concepts that Bodywork is built upon and that you have worked through the Quickstart Tutorials.

Deployment Project Structure

Bodywork-compatible ML projects need to be structured in a specific way. All the files necessary for defining a stage must be contained within a directory dedicated to that stage. The directory name defines the name of the stage. This enables the Bodywork workflow-controller to identify the stages and run them in the desired order. Consider the following example directory structure,

root/
 |-- prepare-data/
     |-- prepare_data.py
     |-- requirements.txt
     |-- config.ini
 |-- train-svm/
     |-- train_svm.py
     |-- requirements.txt
     |-- config.ini
 |-- train-random-forest/
     |-- train_random_forest.py
     |-- requirements.txt
     |-- config.ini
 |-- choose-model/
     |-- choose_model.py
     |-- requirements.txt
     |-- config.ini
 |-- model-scoring-service/
     |-- model_scoring_app.py
     |-- requirements.txt
     |-- config.ini
 |-- bodywork.ini

Here we have five directories, each named after the ML task that it contains. There is also a single workflow configuration file, bodywork.ini. Each stage directory must contain the following files:

*.py
An executable Python module that contains all the code required for the stage. For example, prepare_data.py should be capable of performing all data preparation steps when executed from the command line using python prepare_data.py.
requirements.txt
For listing the 3rd party Python packages required by the executable Python module. This must follow the format required by Pip - see the example after this list.
config.ini
Containing stage configuration that will be discussed in more detail below.
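
For example, a minimal requirements.txt for the prepare-data stage might look as follows - the packages shown here are illustrative assumptions, so pin whatever your stage actually imports,

# illustrative only - pin the packages that your stage actually imports
boto3==1.16.15
pandas==1.1.4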

Running Tasks in Remote Python Environments

Bodywork projects must be packaged as Git repositories (e.g. hosted on GitHub), that will be cloned by Bodywork when executing workflows. When the Bodywork workflow-controller executes a stage, it starts a new Python-enabled container in your k8s cluster and instructs it to pull the required directory from your project's Git repository. It then installs any 3rd party Python package requirements, before running the executable Python module.
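
Conceptually, each stage container performs the equivalent of the following sketch - this is an illustration of the mechanism only, not Bodywork's actual implementation, and the function and argument names are assumptions,

import subprocess


def run_stage(repo_url: str, branch: str, stage_name: str, executable_script: str) -> None:
    # clone the project's Git repository at the requested branch
    subprocess.run(
        ['git', 'clone', '--branch', branch, '--single-branch', repo_url, 'bodywork_project'],
        check=True
    )
    # install the stage's 3rd party Python package requirements
    subprocess.run(
        ['pip', 'install', '-r', f'bodywork_project/{stage_name}/requirements.txt'],
        check=True
    )
    # run the stage's executable Python module
    subprocess.run(
        ['python', executable_script],
        cwd=f'bodywork_project/{stage_name}',
        check=True
    )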

Configuring Workflows

All configuration for a workflow is contained within the bodywork.ini file, which must exist in the root directory of your project's Git repository. An example bodywork.ini file for the project structure in the example above could be,

[default]
PROJECT_NAME="my-classification-project"
DOCKER_IMAGE="bodyworkml/bodywork-core:latest"

[workflow]
DAG=prepare-data >> train-svm, train-random-forest >> choose-model >> model-scoring-service

[logging]
LOG_LEVEL="INFO"

Each configuration parameter is used as follows:

PROJECT_NAME
This will be used to identify all k8s resources deployed for this project.
DOCKER_IMAGE
The container image to use for remote execution of Bodywork workflows and stages. This should be set to bodyworkml/bodywork-core:latest, which will be pulled from DockerHub.
DAG
A description of the workflow structure - the stages to include in each step of the workflow - this will be discussed in more detail below.
LOG_LEVEL
Must be one of: DEBUG, INFO, WARNING, ERROR or CRITICAL. Manages the types of log message to stream to the workflow-controller's standard output stream (stdout).

Defining Workflow DAGs

The DAG string is used to control the execution of stages by assigning them to different steps of the workflow. Steps are separated using the >> operator and commas are used to delimit multiple stages within a single step (if this is required). Steps are executed from left to right. In the example above,

DAG=prepare-data >> train-svm, train-random-forest >> choose-model >> model-scoring-service

The workflow will be interpreted as follows:

  • step 1: run prepare-data; then,
  • step 2: run train-svm and train-random-forest in separate containers, in parallel; then,
  • step 3: run choose-model; and finally,
  • step 4: run model-scoring-service.
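
To make these semantics concrete, the following sketch (illustrative only - not Bodywork's internal implementation) shows how a DAG string of this form maps to an ordered list of steps,

dag = 'prepare-data >> train-svm, train-random-forest >> choose-model >> model-scoring-service'

steps = [
    [stage.strip() for stage in step.split(',')]
    for step in dag.split('>>')
]

print(steps)
# [['prepare-data'], ['train-svm', 'train-random-forest'],
#  ['choose-model'], ['model-scoring-service']]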

Configuring Stages

The behavior of each stage is controlled by the configuration parameters in the config.ini file. For the model-scoring-service stage in our example project this could be,

[default]
STAGE_TYPE="service"
EXECUTABLE_SCRIPT="model_scoring_app.py"
CPU_REQUEST=0.25
MEMORY_REQUEST_MB=100

[service]
MAX_STARTUP_TIME_SECONDS=30
REPLICAS=1
PORT=5000

[secrets]
USERNAME="my-classification-product-cloud-storage-credentials"
PASSWORD="my-classification-product-cloud-storage-credentials"

The [default] section is common to all types of stage and the [secrets] section is optional. The remaining section must be one of [batch] or [service].

Each [default] configuration parameter is to be used as follows:

STAGE_TYPE
One of batch or service. If batch is selected, then the executable script will be run as a discrete job (with a start and an end), and will be managed as a k8s job. If service is selected, then the executable script will be run as part of a k8s deployment and will expose a k8s cluster-ip service to enable access over HTTP, within the cluster.
EXECUTABLE_SCRIPT
The name of the executable Python module to run, which must exist within the stage's directory. Executable means that executing python model_scoring_app.py from the CLI would cause the module (or script) to run.
CPU_REQUEST / MEMORY_REQUEST_MB
The compute resources to request from the cluster in order to run the stage - e.g. CPU_REQUEST=0.25 requests a quarter of a CPU core and MEMORY_REQUEST_MB=100 requests 100 MB of memory. For more information on these resource units, refer to the official Kubernetes documentation.

Batch Stages

An example [batch] configuration for the prepare-data stage could be as follows,

[batch]
MAX_COMPLETION_TIME_SECONDS=30
RETRIES=2

Where:

MAX_COMPLETION_TIME_SECONDS
Time to wait for the given task to complete, before retrying or raising a workflow execution error.
RETRIES
Number of times to retry executing a failed stage, before raising a workflow execution error.
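
A batch stage's executable module is an ordinary Python script with a start and an end. A minimal sketch for prepare_data.py could look as follows - the data source URL, preparation logic and output path are all assumptions for the purpose of illustration,

import pandas as pd

DATA_URL = 'https://my-bucket.s3.amazonaws.com/raw/data.csv'  # hypothetical source


def main() -> None:
    # download raw data, apply preparation steps and persist the result
    data = pd.read_csv(DATA_URL)
    prepared_data = data.dropna()
    prepared_data.to_csv('prepared_data.csv', index=False)


if __name__ == '__main__':
    main()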

Service Deployment Stages

An example [service] configuration for the model-scoring-service stage could be as follows,

[service]
MAX_STARTUP_TIME_SECONDS=30
REPLICAS=1
PORT=5000

Where:

MAX_STARTUP_TIME_SECONDS
Time to wait for the service to become 'ready', without any errors having occurred. If the service reaches this time limit without raising errors, it will be marked as 'successful'. If a service deployment stage fails to start successfully, the deployment will be automatically rolled back to the previous version.
REPLICAS
Number of independent containers running the service started by the stage's Python executable module - model_scoring_app.py. The service endpoint will automatically route requests to each replica at random.
PORT
The port to expose on the container - e.g. Flask-based services usually send and receive HTTP requests on port 5000.
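
For example, a minimal sketch of model_scoring_app.py using Flask could look as follows - the scoring logic is a placeholder (in practice you would load and apply a trained model),

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route('/', methods=['POST'])
def score():
    # placeholder scoring logic - apply a trained model in practice
    features = request.get_json()
    prediction = features['x'] * features['y']
    return jsonify({'prediction': prediction})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)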

Injecting Secrets

Credentials will be required whenever you wish to pull data or persist models to cloud storage, access private APIs, etc. We provide a secure mechanism for dynamically injecting credentials as environment variables within the container running a stage.

The first step in this process is to store your project's secret credentials securely within its namespace - see Managing Secrets below for instructions on how to achieve this using Bodywork.

The second step is to configure the use of this secret with the [secrets] section of the stage's config.ini file. For example,

[secrets]
USERNAME="my-classification-product-cloud-storage-credentials"
PASSWORD="my-classification-product-cloud-storage-credentials"

Will instruct Bodywork to look for values assigned to the keys USERNAME and PASSWORD within the k8s secret named my-classification-product-cloud-storage-credentials. Bodywork will then assign these secrets to environment variables within the container, called USERNAME and PASSWORD, respectively. These can then be accessed from within the stage's executable Python module - for example,

import os


if __name__ == '__main__':
    # credentials are injected by Bodywork as environment variables
    username = os.environ['USERNAME']
    password = os.environ['PASSWORD']

Configuring Namespaces

Each Bodywork project should operate within its own namespace in your k8s cluster. To set up a Bodywork-compatible namespace, issue the following command from the CLI,

$ bodywork setup-namespace my-classification-product

Which will yield the following output,

creating namespace=my-classification-product
creating service-account=bodywork-workflow-controller in namespace=my-classification-product
creating cluster-role-binding=bodywork-workflow-controller--my-classification-product
creating service-account=bodywork-jobs-and-deployments in namespace=my-classification-product

We can see that in addition to creating the namespace, two service-accounts will also be created. This will grant containers in my-classification-product the appropriate authorisation to run workflows, batch jobs and deployments within the newly created namespace. Additionally, a binding to a cluster-role is also created. This will enable containers in the new namespace to list all available namespaces on the cluster. The cluster-role will be created if it does not yet exist.

Managing Secrets

Credentials will be required whenever you wish to pull data or persist models to cloud storage, or access private APIs from within a stage. We provide a secure mechanism for dynamically injecting secret credentials as environment variables into the container running a stage. Before a stage can be configured to inject a secret into its host container, the secret has to be placed within the k8s namespace that the workflow will be deployed to. This can be achieved from the command line - for example,

$ bodywork secret create \
    --namespace=my-classification-product \
    --name=my-classification-product-cloud-storage-credentials \
    --data USERNAME=bodywork PASSWORD=bodywork123!

Will store USERNAME and PASSWORD within a k8s secret resource called my-classification-product-cloud-storage-credentials, in the my-classification-product namespace. To inject USERNAME and PASSWORD as environment variables within a stage, see Injecting Secrets above.

Working with Private Git Repositories using SSH

When working with remote Git repositories that are private, Bodywork will attempt to access them via SSH. For example, to set up SSH access for use with GitHub, see GitHub's documentation on connecting with SSH. This process will result in the creation of a private and public key-pair to use for authenticating with GitHub. The private key must be stored as a k8s secret in the project's namespace, using the following naming convention for the secret name and secret data key,

$ bodywork secret create \
    --namespace=my-classification-product \
    --name=ssh-github-private-key \
    --data BODYWORK_GITHUB_SSH_PRIVATE_KEY=paste_your_private_key_here

When executing a workflow defined in a private Git repository, make sure to use the SSH protocol when specifying the git-repo-url - e.g. use,

git@github.com:my-github-username/my-classification-product.git

As opposed to,

https://github.com/my-github-username/my-classification-product

Testing Workflows Locally

Workflows can be triggered locally from the command line, with the workflow-controller logs streamed to your terminal. In this mode of operation, the workflow-controller runs on your local machine, but it still orchestrates containers on k8s remotely. It will still clone your project from the specified branch of the Bodywork project's Git repository, and delete the clone when finished.

For the example project used throughout this user guide, the CLI command for triggering the workflow locally, using the master branch of the remote Git repository, would be as follows,

$ bodywork workflow \
    --namespace=my-classification-product \
    https://github.com/my-github-username/my-classification-product \
    master

It is also possible to specify a branch from a local Git repository. A local version of the above example - this time using the dev branch - could be as follows,

$ bodywork workflow \
    --namespace=my-classification-product \
    file:///absolute/path/to/my-classification-product \
    dev

Testing Service Deployments

Service deployments are accessible via HTTP from within the cluster - they are not exposed to the public internet. To test a service from your local machine, you will need to start a local proxy server to enable access to your cluster. This can be achieved by issuing the following command,

$ kubectl proxy

Then in a new shell, you can use the curl tool to test the service. For example, issuing,

$ curl http://localhost:8001/api/v1/namespaces/my-classification-product/services/my-classification-product--model-scoring-service/proxy \
    --request POST \
    --header "Content-Type: application/json" \
    --data '{"x": 5.1, "y": 3.5}'

Should return a response whose contents depend on how you've defined your service in the executable Python module - e.g. in the model_scoring_app.py file found within the model-scoring-service stage's directory.
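
The same test can be scripted with the requests package - a minimal sketch, assuming kubectl proxy is running on its default port 8001,

import requests

url = (
    'http://localhost:8001/api/v1/namespaces/my-classification-product'
    '/services/my-classification-product--model-scoring-service/proxy'
)
response = requests.post(url, json={'x': 5.1, 'y': 3.5})
print(response.json())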

Enabling access to services from requests originating outside the cluster has been explicitly excluded from Bodywork's scope. There exist multiple patterns for achieving this - e.g. via load balancers or ingress controllers - and the choice will depend on your project's specific requirements. Please refer to the official Kubernetes documentation to learn more.

Deleting Service Deployments

Once you have finished testing, you may want to delete any service deployments that have been created. To list all active service deployments within a namespace, issue the command,

$ bodywork service display \
    --namespace=my-classification-product

Which should yield output similar to,

SERVICE_URL                                                       EXPOSED   AVAILABLE_REPLICAS       UNAVAILABLE_REPLICAS
http://my-classification-product--model-scoring-service:5000      true      1                        0

To delete the service deployment use,

$ bodywork service delete \
    --namespace=my-classification-product \
    --name=my-classification-product--model-scoring-service

Workflow-Controller Logs

All logs should start in the same way,

2020-11-24 20:04:12,648 - INFO - workflow.run_workflow - attempting to run workflow for project=https://github.com/my-github-username/my-classification-product on branch=master in kubernetes namespace=my-classification-product
git version 2.24.3 (Apple Git-128)
Cloning into 'bodywork_project'...
remote: Enumerating objects: 92, done.
remote: Counting objects: 100% (92/92), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 92 (delta 49), reused 70 (delta 27), pack-reused 0
Receiving objects: 100% (92/92), 20.51 KiB | 1.58 MiB/s, done.
Resolving deltas: 100% (49/49), done.
2020-11-24 20:04:15,579 - INFO - workflow.run_workflow - attempting to execute DAG step=['prepare-data']
2020-11-24 20:04:15,580 - INFO - workflow.run_workflow - creating job=my-classification-product--prepare-data in namespace=my-classification-product
...

After a stage completes, you will notice that the logs from within the container are streamed into the workflow-controller logs. For example,

----------------------------------------------------------------------------------------------------
---- pod logs for my-classification-product--prepare-data
----------------------------------------------------------------------------------------------------
2020-11-24 20:04:18,917 - INFO - stage.run_stage - attempting to run stage=prepare-data from master branch of repo at https://github.com/my-github-username/my-classification-product
git version 2.20.1
Cloning into 'bodywork_project'...
Collecting boto3==1.16.15
  Downloading boto3-1.16.15-py2.py3-none-any.whl (129 kB)
...

The aim of this log structure is to provide a useful way of debugging workflows out-of-the-box, without forcing you to integrate a complete logging solution - e.g. one based on Elasticsearch. It is not a replacement for such a solution; it is intended as a temporary measure to get your ML projects operational as quickly as possible.

Scheduling Workflows

If your workflows are executing successfully, then you can schedule the workflow-controller to operate remotely on the cluster as a k8s cronjob. For example, issuing the following command from the CLI,

$ bodywork cronjob create \
    --namespace=my-classification-product \
    --name=my-classification-product \
    --schedule="0,15,30,45 * * * *" \
    --git-repo-url=https://github.com/my-github-username/my-classification-product \
    --git-repo-branch=master \
    --retries=2

Would schedule our example project to run every 15 minutes. The cronjob's execution history can be retrieved from the cluster using,

$ bodywork cronjob history \
    --namespace=my-classification-product \
    --name=my-classification-product

Which will yield output along the lines of,

JOB_NAME                                START_TIME                    COMPLETION_TIME               ACTIVE      SUCCEEDED       FAILED
my-classification-product-1605214260    2020-11-12 20:51:04+00:00     2020-11-12 20:52:34+00:00     0           1               0

Accessing Historic Logs

The logs for each job executed by the cronjob are contained within the remote workflow-controller. The logs for a single workflow execution attempt can be retrieved by issuing the bodywork cronjob logs command on the CLI - for example,

$ bodywork cronjob logs \
    --namespace=my-classification-product \
    --name=my-classification-product-1605214260

Would stream logs directly to your terminal, from the workflow execution attempt labelled my-classification-product-1605214260. This output stream could also be redirected to a local file by using a shell redirection command such as,

$ bodywork cronjob logs ... > log.txt

To overwrite the existing contents of log.txt, or,

$ bodywork cronjob logs ... >> log.txt

To append to the existing contents of log.txt.