diff --git a/website/docs/advanced/_category_.json b/website/docs/advanced/_category_.json new file mode 100644 index 000000000..99c08a85c --- /dev/null +++ b/website/docs/advanced/_category_.json @@ -0,0 +1,7 @@ +{ + "label": "Advanced Guides", + "position": 8, + "link": { + "type": "generated-index" + } +} \ No newline at end of file diff --git a/website/docs/advanced/docker/_category_.json b/website/docs/advanced/docker/_category_.json new file mode 100644 index 000000000..e88fd355f --- /dev/null +++ b/website/docs/advanced/docker/_category_.json @@ -0,0 +1,7 @@ +{ + "label": "Docker Images", + "position": 7, + "link": { + "type": "generated-index" + } +} diff --git a/website/docs/advanced/docker/deploy/automated.md b/website/docs/advanced/docker/deploy/automated.md new file mode 100644 index 000000000..07ba1f3f9 --- /dev/null +++ b/website/docs/advanced/docker/deploy/automated.md @@ -0,0 +1,204 @@ +--- +title: Automated Deployment +description: Build and Publish Images +sidebar_position: 2 +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +In the GATK-SV pipeline, the Docker images undergo automated +processes for building, testing, and publishing as part of +the CI/CD workflow. These automated procedures guarantee +that all images are consistently and reproducibly built +within a standardized Linux VM environment +(specifically, GitHub Actions). +This ensures uniformity across all GATK-SV Docker images +and keeps them synchronized with the latest code-base. + + +The automated CI/CD pipeline also includes continuous +testing and regression identification during pull requests. +This proactive approach allows for the detection and +resolution of any issues related to image changes or content +before merging the pull request. +Consequently, it ensures the reliability and consistency +of the Docker images, simplifies the review process, +and maintains the high quality of the pipeline. 
+ + +Additionally, the automated CI/CD workflow ensures that +the Docker images are correctly mirrored on multiple +container registries, specifically Azure Container Registry (ACR) +and Google Cloud Container Registry (GCR). +This redundancy guarantees availability and accessibility +of the images across different platforms. + + +The latest Docker images are listed in the files below, +with detailed automated deployment descriptions in the following sections. + + + + + ```shell + gatk_sv_codebase/inputs/values/dockers_azure.json + ``` + + + + + ```shell + gatk_sv_codebase/inputs/values/dockers.json + ``` + + + + + +## Workflow Layout + +The CI/CD workflow for building, testing, and publishing GATK-SV Docker images +is defined in [`sv_pipeline_docker.yml`](https://github.com/broadinstitute/gatk-sv/blob/main/.github/workflows/sv_pipeline_docker.yml). +The [`build_docker.py`](https://github.com/broadinstitute/gatk-sv/blob/main/scripts/docker/build_docker.py) +script is used for building and publishing the images. +When a pull request is issued against the repository, the images are built, +and upon merging the pull request, they are published to ACR and GCR. + + + +The workflow consists of three +[_jobs_](https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#jobs) +discussed in the following sections. + + +### Determine Build Args {#args} +This job is responsible for determining the arguments to be used by the +`build_docker.py` script, specifically: + +- **Determining commit SHAs**: + Considering the size and number of GATK-SV Docker images, + the workflow focuses on building and publishing only the + Docker images that are affected by the changes introduced + in a pull request (PR). + You may refer to [this page](/docs/advanced/docker/deploy/incremental) + for details on the incremental build strategy. + This job determines the commit SHAs of `HEAD` and `BASE` + commits.
+ +- **Compose image tag**: + GATK-SV Docker images are tagged with a consistent template + to simplify referencing and usage in the pipeline. + The tag composition step follows the following template. + + ``` + [DATE]-[RELEASE_TAG]-[HEAD_SHA_8] + ``` + where `[DATE]` represents the `YYYY-MM-DD` format extracted + from the timestamp of the last commit on the branch associated + with the pull request. `RELEASE_TAG` is extracted from the + latest [pre-]release on GitHub. + Additionally, `HEAD_SHA_8` denotes the first eight characters + of the `HEAD` commit SHA. The following is an example tag generated + in this step. + + ``` + 2023-05-24-v0.27.3-beta-1796b665 + ``` + + +### Testing Docker Image Build {#build} + +The `Test Images Build` job is triggered when a commit is pushed to +the pull request branch. It is responsible for +building the Docker images identified by the +[`Determine Build Args`](#args) +job. If the Docker image building process fails, +this job will also fail. The Docker images created +by this job are not published to GCR or ACR and +are discarded once the job is successfully completed. +This job primarily serves for testing purposes during +the review process, ensuring that the affected images +can be successfully built and that the changes introduced +in the pull request do not disrupt the Docker build process. + + +### Publishing Docker Images {#publish} + +The `Publish` job is triggered when a pull request +is merged or a commit is pushed to the `main` branch. +Similar to the [`Test Images Build`](#build) job, +it builds Docker images; however, in addition, +this job also pushes the built images to the GCR and ACR, +and updates the list of published images. Specifically, +this job runs the following steps. + + +- **Login to ACR**: + To authorize access to the Azure Container Registry (ACR), + this job logs in to Docker by assuming an Azure service principal. 
+ The credentials required for the login are defined as + [encrypted environment secrets](https://docs.github.com/en/actions/security-guides/encrypted-secrets). + +- **Login to GCR**: + Similar to ACR, to authorize access to GCR, + this job assumes a Google Cloud Platform service account. + The secrets related to the service account are defined as + [encrypted environment secrets](https://docs.github.com/en/actions/security-guides/encrypted-secrets). + +- **Build and publish to ACR and GCR**: + Similar to the [build job](#build), this job builds Docker images + based on the list of changed files specified using the + `HEAD` and `BASE` commit SHA. Additionally, it pushes the + built images to both ACR and GCR. It's important to note + that the job doesn't rebuild separate images for each registry. + Instead, it labels a single image for both ACR and GCR, + resulting in an identical image with the same tag and Docker + image hash being pushed to both registries. + This job will fail if the build or push process encounters any issues. + +- **Update the list of published images**: + GATK-SV maintains two JSON files that store the latest Docker + images built and pushed to ACR and GCR. + These files require updates whenever a new image is successfully + built and published. The `build_docker` script handles the + update of the JSON files by adding the latest built and + published Docker images for ACR and GCR. + + However, it's important to note that the updated JSON + files reside in the GitHub Actions virtual machines, + and they are discarded once the GitHub Actions job is + completed successfully. To preserve these changes, + we need to commit them to the `main` branch from within the + GitHub Actions VM as part of the CI/CD process. + To achieve this, we utilize a dedicated _bot_ account. + The steps necessary to perform this are explained + in the following. 
+ + - **Login to git using the bot's Personal Access Token (PAT)**: + This step is necessary to enable the _bot_ account to + commit the modified JSON files to the `main` branch + and to authorize the _bot_ to push the changes from + the GitHub Actions VM to the `main` branch using its credentials. + + - **Commit changes and push to the `main` branch**: + This step configures the Git installation in the + GitHub Actions VMs using the _bot_'s credentials. + It commits the modified JSON files, which contain + the latest built and pushed images. The commit message + references the Git commit that triggered the [publish](#publish) job, + providing improved tracking of changes in the Git history. + Finally, it pushes the commit to the main branch. + It's worth noting that Git is intelligent enough + to recognize that this push is made from a GitHub + Actions environment, preventing it from triggering + another publish job. This avoids the issue of + infinite triggers of the publish job. + diff --git a/website/docs/advanced/docker/deploy/incremental.md b/website/docs/advanced/docker/deploy/incremental.md new file mode 100644 index 000000000..46ed7b3cb --- /dev/null +++ b/website/docs/advanced/docker/deploy/incremental.md @@ -0,0 +1,81 @@ +--- +title: Incremental Publishing +description: Incremental Publishing Strategy +sidebar_position: 4 +--- + + +The hierarchical and modular organization of GATK-SV Docker +images offers a significant advantage: when updating the codebase, +not every Docker image is affected, minimizing the impact of changes. +This means that not all Docker images need to be rebuilt and +published with each pipeline modification. The +[`build_docker`](https://github.com/broadinstitute/gatk-sv/blob/main/scripts/docker/build_docker.py) +script efficiently tracks these changes and determines which +Docker images are impacted. Consequently, only the affected Docker +images are built, saving both storage space and build time. 
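The selection idea can be sketched in a few lines of Python. This is only an illustration of the general approach (match changed files against each image's sources, then propagate to images built on top of an affected image), not the actual logic of `build_docker.py`. The image names, file patterns, and the `images_to_rebuild` helper are hypothetical and chosen for the example.

```python
from fnmatch import fnmatch

# Hypothetical mapping: file patterns each image depends on (not the real GATK-SV mapping).
IMAGE_SOURCES = {
    "base-image": ["dockerfiles/base-image/*"],
    "derived-image": ["dockerfiles/derived-image/*", "src/derived/*"],
    "standalone-image": ["dockerfiles/standalone-image/*", "src/standalone/*"],
}

# Hypothetical hierarchy: the image each image is built upon (None if it has no tracked base).
BASE_IMAGE = {
    "base-image": None,
    "derived-image": "base-image",
    "standalone-image": None,
}


def images_to_rebuild(changed_files):
    """Return the set of images affected by the given changed file paths."""
    # Step 1: images whose own source files changed (directly impacted).
    impacted = {
        image
        for image, patterns in IMAGE_SOURCES.items()
        if any(fnmatch(path, pattern) for path in changed_files for pattern in patterns)
    }
    # Step 2: repeatedly add images whose base image is already impacted,
    # until no new image is added (indirectly impacted).
    grew = True
    while grew:
        grew = False
        for image, base in BASE_IMAGE.items():
            if base in impacted and image not in impacted:
                impacted.add(image)
                grew = True
    return impacted
```

With these assumed definitions, a change to `dockerfiles/base-image/Dockerfile` selects both `base-image` and `derived-image`, while a change to an unrelated file such as `README.md` selects nothing.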
+ + +This incremental and selective building and publishing +strategy is particularly beneficial considering the size and +build time of Docker images. By building and publishing +only the necessary images, we can save on storage space and +reduce the overall build time. +This page provides a detailed explanation of +this incremental and selective approach. + + +## Determining Modified Files + +The incremental build strategy relies on the determination +of modified files to identify which Docker images require rebuilding. +Using `git` history, the `build_docker` script automatically +infers the list of changed files. + + +To achieve this, the script compares two +[`git` commit SHAs](https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/about-commits): + +- `BASE_SHA`: the reference commit representing the base branch + (e.g., `broadinstitute/gatk-sv:main`), and +- `HEAD_SHA`: the target commit representing the latest commit on the feature branch. + + +By analyzing the changes between these commits, +the script identifies the impacted files and proceeds to +build the corresponding Docker images. + +During manual runs, the user provides the commit SHAs, +while in automated builds as part of CI/CD, +the commit SHAs are determined automatically. + +In CI/CD, the commit SHAs are determined as in the following example. + +``` + X---Y---Z feature branch + / \ +A---B---C---D---E main branch +``` + +In this example, `BASE_SHA=B`, `HEAD_SHA=Z`, and `E` is the merge commit. + + +## Identifying Images Requiring Rebuilding from Changed Files + +The `build_docker` script identifies the list of Docker images +that need to be rebuilt based on two factors. +Firstly, directly impacted images are determined by examining the +list of files each image depends on. If any of these files have +been changed, the corresponding image requires rebuilding.
+Secondly, indirectly impacted images are determined based on +the hierarchical dependency between images. If an image is +built upon another image, and the base image is being rebuilt, +then the dependent image also needs to be rebuilt. This two-step +process ensures that all the affected images are correctly +identified for rebuilding. + + +A comprehensive mapping of files to their corresponding +Docker images, specifying which images need to be +rebuilt when their associated files are updated is given in +[this section](https://github.com/broadinstitute/gatk-sv/blob/e86d59962146ae1770c535a97c2774d825026957/scripts/docker/build_docker.py#L170-L245). diff --git a/website/docs/advanced/docker/deploy/index.md b/website/docs/advanced/docker/deploy/index.md new file mode 100644 index 000000000..33ab138b1 --- /dev/null +++ b/website/docs/advanced/docker/deploy/index.md @@ -0,0 +1,30 @@ +--- +title: Deploying Docker Images +description: Docker Concepts and Execution Overview +sidebar_position: 2 +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +:::info +This section offers a comprehensive explanation of the process of +building, testing, and publishing Docker images. For details +regarding the images and their hierarchy, please refer to +[this page](/docs/advanced/docker/images). +::: + + +GATK-SV Docker image _deployment_ involves the essential steps of +_building_, _testing_, and _publishing_ to Docker container registries. +There are two deployment options available: fully automated and manual. +With the fully automated approach, GATK-SV Docker images are built +and published to Google Container Registry (GCR) and +Azure Container Registry (ACR) through continuous integration and +continuous delivery (CI/CD) after merging a pull request. +However, if you are working on extending or improving the +GATK-SV Docker images, you may need to build the images locally +for testing or store them on an alternative container registry. 
+This section provides comprehensive insights into the automatic +build process and a detailed guide on locally building the images +for development purposes. diff --git a/website/docs/advanced/docker/deploy/manual.md b/website/docs/advanced/docker/deploy/manual.md new file mode 100644 index 000000000..9cf2c899c --- /dev/null +++ b/website/docs/advanced/docker/deploy/manual.md @@ -0,0 +1,14 @@ +--- +title: Manual Deployment +description: Build and Publish Images +sidebar_position: 3 +--- + +If you are contributing to the GATK-SV codebase, specifically focusing on +enhancing tools, configuring dependencies in Dockerfiles, or modifying GATK-SV scripts +within the Docker images, it is important to build and test the Docker images locally. +This ensures that the images are successfully built and function as intended. +Additionally, if you wish to host the images in your own container registry, +you will need to follow these steps. +To simplify the build process, we have developed a Python script +that automates the image building, and publishing to your container registry. diff --git a/website/docs/advanced/docker/images.md b/website/docs/advanced/docker/images.md new file mode 100644 index 000000000..9b19424db --- /dev/null +++ b/website/docs/advanced/docker/images.md @@ -0,0 +1,99 @@ +--- +title: Docker Images Hierarchy +description: Docker Image Dependencies +sidebar_position: 1 +--- + +import useBaseUrl from '@docusaurus/useBaseUrl'; +import ThemedImage from '@theme/ThemedImage'; + +:::info +This page provides a detailed explanation of Docker +images and their hierarchy. For information on the process +of building these images, please refer to [this section](/docs/advanced/docker/deploy). +::: + + +The tools, scripts, dependencies, and configurations utilized by the +GATK-SV pipeline, written in WDL, are organized into separate Docker +containers. 
This modular approach ensures that each container +contains only the necessary tools for its specific task, +resulting in smaller image sizes. This design choice simplifies +the definition of Dockerfiles and facilitates easier maintenance. +Moreover, the smaller image sizes contribute to reduced disk +usage and lower workflow execution costs. + + +The figure below illustrates the relationships between the GATK-SV Docker images. + + + + +The image depicts the hierarchical relationship among GATK-SV +Docker images. Arrows indicate the flow from a base image +to a derived image. The base image, located at the arrow's +starting point, shares its content which is then expanded +upon and modified in the derived image. In simple terms, +the derived image inherits the same tools and configuration +as the base image, while incorporating additional settings and tools. + + +The list of the Docker images and their latest builds +are available in [`dockers.json`](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers.json) +and [`dockers_azure.json`](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers_azure.json) +for images hosted on Google Container Registry (GCR) and Azure Container Registry (ACR), respectively. + + +## Advantages of Dividing Images by Functionality + +The GATK-SV pipeline utilizes Docker images to encapsulate the necessary tools, +dependencies, and configurations. Instead of having a single monolithic image, +the pipeline is organized into multiple smaller images, each focusing on a specific task. +This approach offers several benefits. + + +By splitting the tools into separate Docker images, we achieve a modular +and focused structure. Each image contains the tools required for a specific +task within the GATK-SV pipeline. This enables users and developers to easily +work with individual images, as they can identify the specific tools needed +for their particular analysis. 
+ + +Moreover, using smaller, task-specific Docker images offers the advantage +of reduced sizes, which is particularly beneficial in cloud environments. +These smaller images require less storage space when stored in container +registries like Google Cloud Container Registry (GCR) or Azure Container Registry (ACR). +Additionally, when creating virtual machines for workflow task execution, +the transfer of these smaller images is more efficient. + + +Separate Docker images enhance maintenance and extensibility +in the GATK-SV pipeline. Maintainers can easily modify or update +specific tools or configurations within a single image without +impacting others. This granularity improves maintainability +and enables seamless expansion of the pipeline by adding or +replacing tools as required. + + +Additionally, the Docker image hierarchy offers advantages in terms of +consistency and efficiency. One image can be built upon another, +leveraging existing setups and tools. This promotes code reuse and +reduces duplication, resulting in consistent configurations across +different stages of the pipeline. It also simplifies the management +of common dependencies, as changes or updates can be applied at the +appropriate level, cascading down to the dependent images. + + +In summary, by splitting the tools into smaller, task-specific images, +the pipeline becomes more modular and manageable. +This approach optimizes storage, execution, maintenance, +and extensibility in cloud environments. +Leveraging Docker's image hierarchy further enhances consistency, +code reuse, and dependency management, ensuring efficient and +scalable execution of the pipeline. 
diff --git a/website/docs/advanced/docker/index.md b/website/docs/advanced/docker/index.md new file mode 100644 index 000000000..bd4868f07 --- /dev/null +++ b/website/docs/advanced/docker/index.md @@ -0,0 +1,66 @@ +--- +title: Overview +description: Docker Concepts and Execution Overview +sidebar_position: 0 +--- + +import useBaseUrl from '@docusaurus/useBaseUrl'; +import ThemedImage from '@theme/ThemedImage'; + +To make the analysis process scalable, reproducible, and cost-efficient, +GATK-SV is designed as a cloud-native pipeline, +meaning it runs on virtual machines (VMs) hosted in the cloud. +These VMs are pre-configured with all the necessary tools, scripts, and settings +required to run the GATK-SV analysis reliably. + + +To ensure that the analysis can be easily replicated and shared, +GATK-SV utilizes Docker technology. +Docker allows the tools and scripts, including all their dependencies and configurations, +to be packaged into a self-contained unit called a container. +This container can be deployed and run on different VMs in the cloud, +making the analysis process consistent and reproducible across multiple experiments or collaborations. + + +Docker containers are built from Docker images, +which serve as the blueprints or templates for creating containers. +Dockerfiles are used to define the contents and behavior of a Docker image. +A Dockerfile is a text file that contains a series of instructions, +specifying the base image, adding dependencies, configuring settings, +and executing commands necessary to build the desired software environment within the container. + + +The following figure is a high-level illustration depicting the relationship +between Dockerfiles, Docker images, Docker containers, and Cloud VMs. + + + + + +The GATK-SV Docker setup is organized as follows: + + - **Dockerfile**: + These files define the instructions for building the necessary tools and + configurations required for the GATK-SV pipeline. 
+ + - **Docker Images**: Docker images are automatically built based on each Dockerfile. + These images are stored in both Azure Container Registry (ACR) and + Google Cloud Container Registry (GCR). The images serve as self-contained + packages that encapsulate all the tools needed for the GATK-SV pipeline. + + - **Docker Containers**: Cromwell, a workflow execution system, creates GATK-SV + Docker containers on virtual machines within the Google Cloud Platform (GCP). + These containers are instantiated based on the Docker images obtained + from GCR. The GATK-SV data analysis tasks are then executed within + these containers, providing a consistent and isolated environment. + +In summary, the GATK-SV Docker setup involves multiple Dockerfiles defining +the build instructions, resulting in Docker images that are stored in ACR and GCR. +These images are used to create Docker containers on GCP virtual machines through Cromwell, +where the GATK-SV data analysis takes place. diff --git a/website/docs/gs/docker.md b/website/docs/gs/docker.md new file mode 100644 index 000000000..bf62de78b --- /dev/null +++ b/website/docs/gs/docker.md @@ -0,0 +1,41 @@ +--- +title: Docker Images +description: GATK-SV Docker Images +sidebar_position: 3 +slug: ./dockers +--- + + +To make the analysis process scalable, reproducible, and cost-efficient, +GATK-SV is designed as a cloud-native pipeline, +meaning it runs on virtual machines (VMs) in the cloud, +which are pre-configured with all the necessary tools, scripts, +and settings for reliable analysis. To easily replicate and share +the analysis, GATK-SV uses Docker technology. Docker packages the tools, +scripts, and their requirements into self-contained units called containers. +These containers can be deployed on different VMs in the cloud, +ensuring consistent and reproducible analysis for various experiments +and collaborations. + +The latest Docker image builds can be found in the following files. 
+ + + +- [`dockers.json`](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers.json). + The list of images hosted on Google Container Registry (GCR). + You may use the Docker images listed in this file if you are running + the pipeline on Google Cloud Platform (GCP). + +- [`dockers_azure.json`](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers_azure.json). + The list of images hosted on Azure Container Registry (ACR). + You may use the Docker images listed in this file if you are + running the pipeline on Azure. + + +:::tip For developers and power users + +You may refer to [this section](/docs/advanced/docker/) for a detailed +description of the Docker images, including their design principles, +as well as guides on building and deploying them. +::: + \ No newline at end of file diff --git a/website/static/img/docker_hierarchy.png b/website/static/img/docker_hierarchy.png new file mode 100644 index 000000000..5ef7f98e2 --- /dev/null +++ b/website/static/img/docker_hierarchy.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:46143ad234a9932e6d7e9e3690a527309c7ac01e72e76920575e2b6c466469e3 +size 838559 diff --git a/website/static/img/docker_infra_diagram.png b/website/static/img/docker_infra_diagram.png new file mode 100644 index 000000000..905709103 --- /dev/null +++ b/website/static/img/docker_infra_diagram.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:917ccbfe2fc97a5d8adffc52ecdf77abd666b3273b28ab248bd23b117ef76ca6 +size 1126378