Merge branch 'master' into EDU-329-Update-Secrets-docs-and-AWS-policies
llewellyn-sl authored Oct 2, 2024
2 parents 90dc356 + d6fd2f9 commit b232e0c
Showing 105 changed files with 2,235 additions and 1,187 deletions.
2 changes: 1 addition & 1 deletion docusaurus.config.js
@@ -98,7 +98,7 @@ export default async function createConfigAsync() {
],
rehypePlugins: [(await require("rehype-katex")).default],
editUrl: ({ docPath }) => {
return `https://github.com/MultiQC/MultiQC/blob/main/docs/markdown/${docPath.replace('multiqc_docs/multiqc_repo/docs', '')}`
},
sidebarPath: "./multiqc_docs/sidebar.js",
},
74 changes: 44 additions & 30 deletions fusion_docs/faq.mdx
@@ -2,53 +2,67 @@
title: FAQ
---

# Frequently Asked Questions
### Which cloud object stores does Fusion support?

Fusion supports AWS S3, Azure Blob, and Google Cloud Storage. Fusion can also be used with local storage solutions that support the AWS S3 API.
### How does Fusion work?

Fusion implements a FUSE driver that mounts the cloud storage bucket in the job execution context as a POSIX file system. This allows the job script to read and write data files in cloud object storage as if they were local files.
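As an illustration of what the mounted file system enables, a Nextflow process can consume an object-storage path as if it were a local file, with no explicit download or upload step. This is a sketch only; the bucket and file names below are hypothetical:

```groovy
// Hypothetical example: with Fusion enabled, the S3 object is
// presented through the mounted POSIX file system, so the job
// script can use ordinary shell tools on it directly.
process countLines {
    input:
    path sample          // staged from an s3:// path via Fusion

    output:
    path 'count.txt'

    script:
    """
    wc -l < ${sample} > count.txt
    """
}

workflow {
    countLines(Channel.fromPath('s3://my-bucket/data/sample.fastq'))
}
```

The same applies to outputs: files written to the mounted work directory are transferred to object storage in the background.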
### Why is Fusion faster than other FUSE drivers?

Fusion is not a general purpose file system. It has been designed to optimize the data transfer of bioinformatics pipelines by taking advantage of the Nextflow data model.
### Why do I need Wave containers to use Fusion?

Fusion is designed to work at the job execution level. This means it must run in a containerized job execution context.

Downloading and installing Fusion manually would require you to rebuild all the containers used by your data pipeline to include the Fusion client each time a new version of the client is released. You would also need to maintain a custom mirror of existing container collections, such as [BioContainers](https://biocontainers.pro/).

Wave enables you to add the Fusion client to your pipeline containers at deploy time, without the need to rebuild them or maintain a separate container image collection.
### Can Fusion mount more than one bucket in the job's file system?

Yes. Any access to cloud object storage is automatically detected by Fusion and the corresponding buckets are mounted on demand.
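For example, a single run can keep its work directory in one bucket while reading inputs from another; Fusion mounts each bucket the first time it is accessed. A minimal configuration sketch, with hypothetical bucket names:

```groovy
// nextflow.config sketch — bucket names are placeholders.
fusion.enabled = true
wave.enabled   = true

// Work directory lives in one bucket...
workDir = 's3://bucket-a/work'

// ...while pipeline inputs come from a different bucket.
params.input = 's3://bucket-b/data/reads.fastq'
```

No per-bucket mount configuration is needed; detection happens automatically at access time.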
### Can Fusion mount buckets of different vendors in the same execution?

No. Fusion can mount multiple buckets per execution, but all from the same vendor, such as AWS S3 or Google Cloud Storage.
### I tried Fusion, but I didn't notice any performance improvement. Why?

If you didn't notice any performance improvement with Fusion, the bottleneck may lie in other factors, such as network latency or memory limitations. Fusion's caching strategy relies heavily on NVMe SSD or similar storage technology, so ensure your computing nodes use the recommended storage. Check the Platform compute environment page for optimal instance and storage configurations:
- [AWS Batch](https://docs.seqera.io/platform/latest/compute-envs/aws-batch)
- [Azure Batch](https://docs.seqera.io/platform/latest/compute-envs/azure-batch)
- [Google Cloud Batch](https://docs.seqera.io/platform/latest/compute-envs/google-cloud-batch)
- [Amazon EKS](https://docs.seqera.io/platform/latest/compute-envs/eks)
- [Google GKE](https://docs.seqera.io/platform/latest/compute-envs/gke)

### Can I pin a specific Fusion version to use with Nextflow?

Yes. Add the Fusion version's config URL using the `containerConfigUrl` option in the `fusion` block of your Nextflow configuration (replace `v2.4.2` with the version of your choice):
```groovy
fusion {
enabled = true
containerConfigUrl = 'https://fusionfs.seqera.io/releases/v2.4.2-amd64.json'
}
```

:::note
For ARM CPU architectures, use `https://fusionfs.seqera.io/releases/v2.4.2-arm64.json`.
:::

### Can I use Fusion with Minio?

Yes. [Minio](https://min.io/) implements an S3-compatible API, therefore it can be used instead of AWS S3. See [Local execution with Minio](https://www.nextflow.io/docs/latest/fusion.html#local-execution-with-minio) for more information.
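As a rough configuration sketch, Fusion can be pointed at a Minio endpoint through Nextflow's AWS client settings. The endpoint URL, credentials, and the `exportStorageCredentials` option shown here are assumptions drawn from typical Nextflow setups; defer to the linked guide for the authoritative configuration:

```groovy
// nextflow.config sketch for Fusion with a local Minio endpoint.
// Endpoint and credential values are placeholders.
fusion.enabled = true
fusion.exportStorageCredentials = true   // pass credentials to the Fusion client
wave.enabled = true

aws {
    accessKey = '<MINIO_ACCESS_KEY>'
    secretKey = '<MINIO_SECRET_KEY>'
    client {
        endpoint = 'http://localhost:9000'  // Minio S3-compatible API
        s3PathStyleAccess = true            // Minio uses path-style URLs
    }
}
```

With this in place, `s3://` paths resolve against the Minio server instead of AWS S3.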

### Can I download Fusion?

No. Fusion can only be used directly in supported [Seqera Platform compute environments](https://docs.seqera.io/platform/latest/compute-envs/overview), or by enabling [Wave containers](https://docs.seqera.io/wave) in your Nextflow configuration.
56 changes: 56 additions & 0 deletions fusion_docs/get-started.mdx
@@ -0,0 +1,56 @@
---
title: Get started
description: "Use the Fusion v2 file system in Seqera Platform and Nextflow"
date: "23 Aug 2024"
tags: [fusion, storage, compute, file system, posix, client]
---

Use Fusion directly in Seqera Platform compute environments, or add Fusion to your Nextflow pipeline configuration.

### Seqera Platform

Use Fusion directly in the following Seqera Platform compute environments:
- [AWS Batch](https://docs.seqera.io/platform/latest/compute-envs/aws-batch)
- [Azure Batch](https://docs.seqera.io/platform/latest/compute-envs/azure-batch)
- [Google Cloud Batch](https://docs.seqera.io/platform/latest/compute-envs/google-cloud-batch)
- [Amazon EKS](https://docs.seqera.io/platform/latest/compute-envs/eks)
- [Google GKE](https://docs.seqera.io/platform/latest/compute-envs/gke)

See the Platform compute environment page for your cloud provider for Fusion configuration instructions and optimal compute and storage recommendations.

### Nextflow

:::note
Fusion requires Nextflow `22.10.0` or later.
:::

Fusion integrates with Nextflow directly and does not require any installation or change in pipeline code. It only requires the use of a container runtime or a container computing service such as Kubernetes, AWS Batch, or Google Cloud Batch.

#### Nextflow installation

If you already have Nextflow installed, update to the latest version using this command:

```bash
nextflow -self-update
```

Otherwise, install Nextflow with this command:

```bash
curl -s https://get.nextflow.io | bash
```

#### Fusion configuration

To enable Fusion in your Nextflow pipeline, add the following snippet to your `nextflow.config` file:

```groovy
fusion.enabled = true
wave.enabled = true
tower.accessToken = '<your Platform access token>' //optional
```

:::tip
A Platform access token is not mandatory; however, it is required for access to private repositories and grants higher service rate limits than anonymous use.
:::
9 changes: 6 additions & 3 deletions fusion_docs/guide.mdx
@@ -1,5 +1,8 @@
---
title: User guide
description: "Overview of the Fusion v2 file system"
date: "23 Aug 2024"
tags: [fusion, storage, compute, file system, posix, client]
---

# User guide
@@ -23,21 +26,21 @@ Fusion smoothly integrates with Nextflow and does not require any installation o

### Nextflow installation

If you have already installed Nextflow, update to the latest version using this command:

```bash
nextflow -self-update
```

If you don't have Nextflow already installed, install it with the command below:

```bash
curl -s https://get.nextflow.io | bash
```

### Fusion configuration

To enable Fusion in your Nextflow pipeline, add the following snippet to your `nextflow.config` file:

```groovy
fusion.enabled = true
36 changes: 0 additions & 36 deletions fusion_docs/guide/aws-batch-s3.mdx

This file was deleted.

50 changes: 50 additions & 0 deletions fusion_docs/guide/aws-batch.mdx
@@ -0,0 +1,50 @@
---
title: AWS Batch
description: "Use Fusion with AWS Batch and S3 storage"
date: "23 Aug 2024"
tags: [fusion, storage, compute, aws batch, s3]
---

Fusion simplifies and improves the efficiency of Nextflow pipelines in [AWS Batch](https://aws.amazon.com/batch/) in several ways:

- No need to use the AWS CLI tool for copying data to and from S3 storage.
- No need to create a custom AMI or create custom containers to include the AWS CLI tool.
- Fusion uses an efficient data transfer and caching algorithm that provides much faster throughput compared to AWS CLI and does not require a local copy of data files.
- By replacing the AWS CLI with a native API client, the transfer is much more robust at scale.

### Platform AWS Batch compute environments

Seqera Platform supports Fusion in Batch Forge and manual AWS Batch compute environments.

See [AWS Batch](https://docs.seqera.io/platform/latest/compute-envs/aws-batch) for compute and storage recommendations and instructions to enable Fusion.

### Nextflow CLI

:::tip
Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in
parallel to and from the object storage into the container-local temporary directory (`/tmp`). To achieve optimal performance, set up an SSD volume as the temporary directory.

Several AWS EC2 instance types include one or more NVMe SSD volumes. These volumes must be formatted to be used. See [SSD instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html) for details. Seqera Platform automatically formats and configures NVMe instance storage with the “Fast instance storage” option when you create an AWS Batch compute environment.
:::

1. Add the following to your `nextflow.config` file:

```groovy
process.executor = 'awsbatch'
process.queue = '<YOUR AWS BATCH QUEUE>'
process.scratch = false
process.containerOptions = '-v /path/to/ssd:/tmp' // Required for SSD volumes
aws.region = '<YOUR AWS REGION>'
fusion.enabled = true
wave.enabled = true
```

Replace `<YOUR AWS BATCH QUEUE>` and `<YOUR AWS REGION>` with your AWS Batch queue and region.

2. Run the pipeline with the usual run command:

```bash
nextflow run <YOUR PIPELINE SCRIPT> -w s3://<YOUR-BUCKET>/work
```

Replace `<YOUR PIPELINE SCRIPT>` with your pipeline Git repository URI and `<YOUR-BUCKET>` with your S3 bucket.
