[SIP-149] Proposal for Kubernetes Operator for Apache Superset #31408

Open
villebro opened this issue Dec 11, 2024 · 11 comments
Labels
design:proposal Design proposals sip Superset Improvement Proposal


@villebro
Member

villebro commented Dec 11, 2024

[SIP-149] Proposal for Kubernetes Operator for Apache Superset

Motivation

Apache Superset's Helm chart [1] [2] is widely used and receives regular contributions, reflecting the popularity of Kubernetes-based deployments within the community. However, the chart's reliance on static templates and duplicated code, together with Helm's lack of a built-in testing framework and limited support for advanced lifecycle management, makes it opaque and error-prone to maintain, and poses a significant downtime risk for large-scale deployments that rely on it.

This proposal introduces a Kubernetes Operator [3] (hereafter referred to as "the Operator"), offering a Kubernetes-native approach to managing Superset deployments. The Operator will provide similar configuration options to the Helm chart, while addressing its limitations and introducing features like better testing, observability and automation. This proposal aligns with the approach taken by other Apache projects, such as Apache Flink [4] [5] and Apache Druid [6] [7], whose communities have embraced operators to manage their deployments more effectively.

Proposed Change

The Operator will introduce a Custom Resource Definition (CRD) [8] for managing Superset deployments declaratively. Key features include:

  1. Helm-Aligned Configuration: A configuration model similar to the Helm chart, exposing commonly needed configuration options.
  2. Enhanced Observability: Built-in support for metrics collection, making it easier to monitor key operator-related metrics (reconciliation successes and failures, durations, etc.).
  3. Improved Lifecycle Management: Laying the groundwork for advanced features like staged upgrades, rollbacks, and downgrades, which are currently not possible using the Helm chart.
  4. Enhanced Testing: The Operator will leverage the Operator SDK [9] testing framework, making it easier to validate bug fixes and improvements while ensuring greater reliability and maintainability over time.

The Operator would be placed in a separate repo under the Apache GitHub org, preferably /apache/superset-kubernetes-operator. This would make it easier to maintain dedicated CI workflows, and would also decrease traffic on the main repo by having its own set of Releases, PRs and Issues.

New or Changed Public Interfaces

  1. Kubernetes CRD: A Superset CRD for declarative configuration. This structure will be similar to the values.yaml in the current Helm chart.
  2. Operator Image: Docker image for the Operator, built using the Go-based Operator SDK.
  3. Deployment Artifacts: YAML manifests for deploying the Operator, with optional OLM support.
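
To illustrate what a spec that is "similar to the values.yaml" could mean in practice, here is a minimal sketch of defaulting and validation for a custom resource spec. It is written in Python purely for brevity (the Operator itself would be Go-based), and every name in it (`SupersetSpec`, `ComponentSpec`, `version`, `web`, `worker`, `replicas`) is hypothetical rather than taken from the proposal or the Helm chart.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ComponentSpec:
    """Hypothetical per-component settings (web server, worker, ...)."""

    replicas: int = 1  # assumed default, for illustration only

    def validate(self, name: str) -> list[str]:
        errors = []
        if self.replicas < 0:
            errors.append(f"{name}.replicas must be >= 0, got {self.replicas}")
        return errors


@dataclass
class SupersetSpec:
    """Hypothetical spec section of a Superset custom resource."""

    version: str = ""
    web: ComponentSpec = field(default_factory=ComponentSpec)
    worker: ComponentSpec = field(default_factory=ComponentSpec)

    @classmethod
    def from_dict(cls, raw: dict) -> SupersetSpec:
        # Keys missing from the user's manifest fall back to the defaults above.
        return cls(
            version=raw.get("version", ""),
            web=ComponentSpec(**raw.get("web", {})),
            worker=ComponentSpec(**raw.get("worker", {})),
        )

    def validate(self) -> list[str]:
        # Collect all problems instead of failing on the first one.
        errors = []
        if not self.version:
            errors.append("version is required")
        errors += self.web.validate("web")
        errors += self.worker.validate("worker")
        return errors
```

In a real operator, this kind of defaulting and validation would more likely live in the CRD's schema or in admission webhooks than in hand-written code; the sketch only shows the intended behaviour.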

Figure 1. A Superset deployment based on the current Helm Chart, where Helm renders manifests based on the values.yaml file and Helm chart, and applies them to the target namespace.

Figure 2. Diagram depicting the proposed operator-based flow, where the Operator is deployed in its own namespace and continuously reconciles the desired state declared in the custom Superset resources. The CRD ensures that the Superset manifests are valid and applies defaults as needed.
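
The reconciliation idea in Figure 2 can be sketched as a comparison between desired and observed state. This is a simplified illustration in Python, not the Operator's actual logic: the function name `plan_reconcile` and the resource names are hypothetical, and plain dictionaries stand in for objects that a real operator would read from and write to the Kubernetes API.

```python
def plan_reconcile(desired: dict, observed: dict) -> dict:
    """Decide which resources to create, update, or delete.

    Both arguments map a resource name to its spec. The result lists
    the names that need each kind of action, sorted for stable output.
    """
    create = sorted(name for name in desired if name not in observed)
    update = sorted(
        name
        for name in desired
        if name in observed and desired[name] != observed[name]
    )
    delete = sorted(name for name in observed if name not in desired)
    return {"create": create, "update": update, "delete": delete}


# Hypothetical example: the worker is missing, the web replica count has
# drifted, and a resource that is no longer declared still exists.
desired = {"superset-web": {"replicas": 2}, "superset-worker": {"replicas": 3}}
observed = {"superset-web": {"replicas": 1}, "superset-old": {"replicas": 1}}
print(plan_reconcile(desired, observed))
```

Running this plan repeatedly until all three lists are empty is the essence of continuous reconciliation. A production operator would additionally restrict deletions to resources it owns.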

Changes to SIP and Release Process

To ensure breaking changes to Superset are handled by the Operator, the following changes would need to be made to existing processes:

  1. SIP process: A section for required changes to the Operator would be added to the SIP template. Most changes don't impact the infrastructure deployment process, but some do, such as the addition or removal of workers or of critical components like the Celery scheduler. These changes would need to be made to the development version of the Operator before they are made generally available.
  2. Release process: If a breaking change requires changes to the Operator, the official Operator images would be updated as follows:
    • Previous release: Checks for upper bounds for the Superset version would be added to the Operator to report a warning/error if an unsupported version is chosen. This would be reported both in the custom resource's status and via a metric.
    • Next release: Similar lower bound checks would be added to the Operator if the new version is incapable of supporting a deprecated/removed feature on an old Superset version.
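
The upper and lower bound checks described above could look roughly like the following. This is a hedged sketch in Python rather than Operator code: the function `version_condition`, the condition type `VersionSupported`, and the reason strings are all hypothetical. Only the general shape of the returned dictionary (type, status, reason, message) follows the usual Kubernetes status condition convention.

```python
from __future__ import annotations


def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a plain numeric version such as '4.1' or '4.1.1'.

    Missing components are treated as 0. Pre-release suffixes such as
    'rc1' are not handled in this sketch.
    """
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def version_condition(
    version: str,
    lower: str | None = None,  # first supported version (inclusive)
    upper: str | None = None,  # first unsupported version (exclusive)
) -> dict:
    """Build a status condition for the chosen Superset version."""
    v = parse_version(version)
    if upper is not None and v >= parse_version(upper):
        return {
            "type": "VersionSupported",
            "status": "False",
            "reason": "VersionTooNew",
            "message": f"Superset {version} requires a newer Operator release",
        }
    if lower is not None and v < parse_version(lower):
        return {
            "type": "VersionSupported",
            "status": "False",
            "reason": "VersionTooOld",
            "message": f"Superset {version} is older than the minimum {lower}",
        }
    return {
        "type": "VersionSupported",
        "status": "True",
        "reason": "VersionInRange",
        "message": f"Superset {version} is supported",
    }
```

The same result could drive a metric, for example a gauge set to 1 when `status` is `"True"` and 0 otherwise.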

To keep the releases of Superset and the Operator aligned, we would ensure that all currently supported Superset versions are backed by an Operator release. As we're officially maintaining "the latest minor of the last two majors" [10], the Operator would also support these. At the time of writing that would mean 4.1 and 3.1. Note that the Operator version would not track the official Superset version, as breaking changes that require changes to the Operator are fairly uncommon.
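
As a worked example of the "latest minor of the last two majors" rule, the small Python sketch below derives the supported set from a list of release lines. The helper name `supported_minors` and the input list are illustrative assumptions, not part of any Superset tooling.

```python
from __future__ import annotations


def supported_minors(released: list[str]) -> list[str]:
    """Return the latest minor of each of the two most recent majors."""
    latest_minor: dict[int, int] = {}
    for version in released:
        major, minor = (int(p) for p in version.split(".")[:2])
        latest_minor[major] = max(latest_minor.get(major, -1), minor)
    newest_majors = sorted(latest_minor, reverse=True)[:2]
    return [f"{major}.{latest_minor[major]}" for major in newest_majors]


# Illustrative input: release lines up to 4.1.
print(supported_minors(["2.1", "3.0", "3.1", "4.0", "4.1"]))  # ['4.1', '3.1']
```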

New dependencies

The Operator will rely on the Go-based Operator SDK [11] for its implementation and testing framework. Beyond this, it will share the same core dependencies as the existing Helm chart, such as Kubernetes APIs and configurations, but without requiring Helm as a dependency.

Migration Plan and Compatibility

Migrating from the Helm chart to the Operator will be straightforward, as the Operator’s CRD will closely align with the structure of the current values.yaml used in the Helm chart. Additionally, the resources created by the Operator will closely mimic those generated by the Helm chart, ensuring consistency and familiarity. Administrators already familiar with managing Superset via Helm will find the transition intuitive.

Benefits

  1. Kubernetes-Native Management: A clean CRD and continuous reconciliation provide a more natural Kubernetes experience.
  2. Dynamic Lifecycle Features: The Operator lays the foundation for advanced features like staged upgrades and automated recovery. These are difficult to achieve using the current Helm-based approach.
  3. Enhanced Observability: Prometheus-compatible metrics make it easy to monitor the Operator and Superset deployments.
  4. Improved Testing: The Operator SDK enables comprehensive testing, with both full integration tests and lightweight unit tests, improving reliability.
  5. Helm Independence: Users can deploy Superset without relying on Helm.

Proposed Operator Scope and Deprecation of Helm Chart

We propose deprecating the Helm chart once the Operator is deemed stable to avoid the burden of maintaining both. The Operator will also exclude reconciliation support for PostgreSQL and Redis. Users can continue using Helm for these services or adopt dedicated operators [12] [13], ensuring a more focused approach for managing Superset.

Rejected Alternatives

  1. Enhancing the Helm Chart: Helm is limited in its ability to support advanced lifecycle features, testing, dynamic reconciliation, and observability.
  2. Standalone Scripts: Scripts lack maintainability and alignment with Kubernetes-native workflows.
  3. Existing operators: No open-source operators provide a clean CRD or are aligned with Superset’s Helm chart configurations.
@villebro villebro added the sip Superset Improvement Proposal label Dec 11, 2024
@dosubot dosubot bot added the design:proposal Design proposals label Dec 11, 2024
@villebro villebro changed the title from "[SIP] Proposal for Kubernetes Operator for Apache Superset" to "[SIP-149] Proposal for Kubernetes Operator for Apache Superset" Dec 12, 2024
@michael-s-molina
Member

Thank you for the proposal @villebro. Do we plan to officially support this operator for official releases? If yes, could you enhance the SIP explaining how the Release Process would be affected?

@villebro
Member Author

@michael-s-molina thanks for the feedback. Version support has not been a major issue in the current Helm chart, as it's mostly decoupled from the Superset release process. However, you're right that major changes, like the introduction/removal of new worker types, would definitely cause a breaking change in the operator, too. I will add a section to cover this.

@michael-s-molina
Member

I will add a section to cover this.

Thanks. Please consider any necessary changes to RELEASING/README.md.

@villebro
Member Author

villebro commented Dec 13, 2024

I will add a section to cover this.

Thanks. Please consider any necessary changes to RELEASING/README.md.

@michael-s-molina I think it's actually mostly relevant for the SIP process, rather than the release process. Any major breaking changes or new advanced features that affect how Superset is deployed may affect how the Docker image is built, our Docker Compose flows, and ultimately the Kubernetes deployment model. A few examples:

  • Global Async Queries using WebSockets: there is an unofficial oneacrefund/superset-websocket image that requires an extra deployment on Kubernetes, which is currently supported by Helm. In retrospect, [SIP-43] should have addressed how this would be supported in all the currently existing deployment models.
  • Addition/removal of a critical component: Assuming we were to replace Celery Beat with another scheduler, that would need to be considered during the SIP review, as it would likely require changing what the scheduler Deployment looks like.

Therefore, major changes should be handled as follows:

  1. Infra-related breaking changes will need to be raised during the SIP process to ensure they're considered before the vote.
  2. Support for accepted SIPs will need to be added to the dev version of the Operator, so that a clear warning/error can be emitted if the chosen Superset version is unsupported. Note that this will be easier to support in the Operator than in Helm, as Helm doesn't easily support this type of logic.
  3. Once the new version that introduces the breaking change is released, the affected versions of the Operator should be patched with logic to check whether they are compatible with the newly introduced version.

@mistercrunch
Member

Do these typically live in a monorepo or in their own repo?

@villebro
Member Author

villebro commented Dec 13, 2024

@mistercrunch I would place this in a separate repo, similar to what Flink is doing: https://github.com/apache/flink-kubernetes-operator (I would suggest following this pattern: apache/superset-kubernetes-operator). Then we wouldn't have to burden the main repo's CI, and could let both repos evolve in their own directions as needed.

Edit: I added a note about this in the proposal.

@mistercrunch
Member

mistercrunch commented Dec 13, 2024

Probably fine to use the https://github.com/apache-superset/ org for this. That way you get admin rights, and we don't have to treat this tool/repo as an ASF-sanctioned thing that comes with all the ASF-related constraints & guarantees.

@mistercrunch
Member

In some ways this would also mean that we don't really need a SIP or the SIP process.

@villebro
Member Author

Some pros/cons that come to mind:

  • Since this is directly tied to Apache Superset, it seems logical to have it reside under the ASF umbrella to give the strong quality guarantees that the ASF process provides. Some orgs may not be able to use the code/assets unless they're governed by the ASF.
  • There's definitely extra overhead for setting this up under the ASF. However, since Flink has been able to get it working, I'm sure we can make it work, too.
  • Druid has decided to go the way of a non-ASF repo (https://github.com/datainfrahq/druid-operator), so that's apparently ok, too.

I would personally vote to keep this under the ASF GitHub org, but I'm not super opinionated, so I can probably be convinced the other way, too.

@mistercrunch
Member

Makes sense, though from my understanding the ASF and its participants can't really officially stamp things like a Docker image, since it includes all sorts of other binaries that we can't/shouldn't certify for legal reasons. The only binaries that are official are the tarballs. As long as it's a "recipe" and not a meal it's fine: a Dockerfile is fair game, but the Docker image itself, with a bunch of other binaries in it, is something we can't officially certify or distribute. Guessing the k8s Operator would be mostly a recipe, which would be fine.

@Synarcs

Synarcs commented Dec 26, 2024

@villebro, I believe this proposal also moves away from Helm finalizers and adds custom finalizers and ownerReferences for each deployable Superset manifest, giving complete lifecycle management of the state declared in the CRD.
In addition, what are the plans for gateway networking? I assume the Operator would be agnostic to the underlying ingress routing policy in the cluster, but will it also support deploying Ingress resources for clusters running the older Ingress API, or resources for clusters running the Kubernetes Gateway API (Istio, Contour, Traefik, etc.)?

Projects
Status: [DISCUSS] thread opened