Add the `prometheus-longterm-metrics` and `thanos` optional components #461

mishaschwartz · 2024-06-07T20:12:20Z

Overview

The prometheus-longterm-metrics component collects longterm monitoring metrics from the original prometheus instance (the one created by the components/monitoring component).

Longterm metrics are any prometheus rule that have the label group: longterm-metrics or in other words are selectable using prometheus's '{group="longterm-metrics"}' query filter. To see which longterm metric rules are added by default see the optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template file.

To configure this component:

update the PROMETHEUS_LONGTERM_RETENTION_TIME variable to set how long the data will be kept by prometheus
update the PROMETHEUS_LONGTERM_STORE_INTERVAL variable to set how often the longterm metrics rules will be calculated. For example, setting it to 10h will calculate these metrics every 10 hours.

Enabling the prometheus-longterm-metrics component creates the additional endpoint /prometheus-longterm-metrics.

The thanos component enables better storage of longterm metrics collected by the optional-components/prometheus-longterm-metrics component. Data will be collected from the prometheus-longterm-metrics and stored in an S3 object store indefinitely.

When enabling this component, please change the default values for the MINIO_ROOT_USER and MINIO_ROOT_PASSWORD by updating the env.local file. These set the login credentials for the root user that runs the minio object store.

Enabling the thanos component creates the additional endpoints:

/thanos-query: a prometheus-like query interface to inspect the data stored by thanos
/thanos-minio: a minio web console to inspect the data stored by minio.

This also includes an update to the prometheus version from v2.19.0 to the current latest v2.52.0. This is to
required to support the interaction between prometheus and thanos.

Changes

Non-breaking changes

New component version: prometheus:v2.52.0

Related Issue / Discussion

Resolves 💡 [Feature] monitoring: Add a second Prometheus to scrap the first Prometheus to keep stats for long term at a lower granularity #277
Add some initial metrics as described in 💡 [Feature] Add recording rules to Prometheus configuration to store hourly/daily metrics #447 but we should really add more (either to this PR or in a future PR) by adding more rules to the birdhouse/optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template file.

Additional Information

I tested upgrading the prometheus version and there were no issues (no loss of data, no changed APIs etc.)
Note that the thanos set up is pretty minimal but probably good enough for our purposes. We can always add more of the thanos features/components in the future if needed.

CI Operations

birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false

huard

I can't comment on the specifics, but this is really great news !

Question: this assumes a MinIO server exists. Long has set one up on PAVICS, but I'm wondering if a MinIO config already officially exists on birdhouse-deploy?

mishaschwartz · 2024-06-13T16:41:25Z

this assumes a MinIO server exists.

If you enable the thanos optional component, the a minio server will be created in a docker container. It is not dependent on an externally deployed/created minio server. But if you have one already you could customize this to use that one instead.

I'm wondering if a MinIO config already officially exists on birdhouse-deploy?

Not at the moment, you likely have a custom component that sets one up for PAVICS. Do you know if this is true @tlvu?

huard · 2024-06-13T17:22:32Z

Huh, that's convenient ! Ok thanks !

tlvu · 2024-06-13T17:45:24Z

I'm wondering if a MinIO config already officially exists on birdhouse-deploy?

Not at the moment, you likely have a custom component that sets one up for PAVICS. Do you know if this is true @tlvu?

@mishaschwartz Yes I deployed one, just for prototyping so did not hook it up "officially" to the PAVICS stack because it's not behind the proxy and Magpie.

@huard like Misha, I'd rather have the MinIO for the monitoring and MinIO for other usage be separate. It much simplifies the management (custom config, version requirements, upgrade, ...). I think Weaver or Magpie was hooking into the default Posgres at some point, then it also rolls it's own Postgres container because it requires a different version of Postgres.

@mishaschwartz Can this new component be enabled standalone? Meaning on another physical host with only itself and the proxy, as describe in this comment #277 (comment)?

If not yet standalone, it's okay we can do it in another PR.

huard · 2024-06-13T18:02:46Z

Good point, makes sense.

tlvu

Quick review just by reading the code, not tried to deploy yet.

CHANGES.md

tlvu · 2024-06-18T14:44:13Z

birdhouse/optional-components/README.rst

+
+This enables better storage of longterm metrics collected by the ``optional-components/prometheus-longterm-metrics``
+component. Data will be collected from the ``prometheus-longterm-metrics`` and stored in an S3 object store
+indefinitely.


Indefinitely ! Do we actually want this? Can we set an expiry after like 10 years?

Grafana will be able to display data from Thanos go to back to 10 years? With this kind of extreme long term stats, what is the UI to visualize it?

You can choose to change this if you wish. Thanos suggests keeping data indefinitely by default. If you do not need to keep data forever, I suggest just using the prometheus-longterm-monitoring component without thanos and setting the PROMETHEUS_LONGTERM_RETENTION_TIME to whatever you'd like.

Oh I see, is there a switch to disable Thanos?

Same question, in the case we want to use Thanos, how to visualize the data stored on Thanos? I assume if Thanos is enabled, the retention duration on the Prometheus side will be very short to avoid doubling the storage so without data being stored in Prometheus, how to visualize that data stored on Thanos.

Just a question. If another component is required, we can do it in a follow up Pr.

Oh I see, is there a switch to disable Thanos?

To answer my own question, Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component? The Prometheus-long-term component can function standalone of Thanos?

Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component

Yes that's right. prometheus-longterm-metrics collects and stores specific metrics that we want to keep for longer from prometheus. If you want to also enable thanos, then thanos will store those same metrics in a much more compact/efficient way so that you can store more data over a longer time period.

I'm happy with forever storage for long-term-metrics. The point is to keep an archive of key metrics. If those are daily or hourly, archiving a few dozen metrics won't be a problem.

Let's see how much space that will take in practice and we will adjust. And eventually we need a way to visualize those older metrics. Otherwise, what's the point to keep them forever if we can not visualize?

tlvu · 2024-06-18T14:52:06Z

birdhouse/optional-components/prometheus-longterm-metrics/default.env

+"
+
+COMPONENT_DEPENDENCIES="
+  ./components/monitoring


Oh this dependency will make the new Prometheus long term not able to run standalone.

No it cannot run standalone, it collects metrics from the services that are enabled in the monitoring component.

Say I already have a bunch of PAVICS servers running with monitoring enabled and I just want to point this new Prometheus to aggregate all the data?

Can be in a follow up PR.

The goal of this PR is to create a method for saving existing prometheus data over the long term for a single birdhouse/PAVICS deployment. If you want something that will collect prometheus data for multiple servers I would recommend creating a new repository to host that code.

Since the architecture here is pluggable, we do not need a separate repo and duplicate the work. To deploy the 2nd Prometheus only, on a separate machine, I see we only enable the proxy and the prometheus-longterm-metrics and optionally thanos components on the new machine and that's it.

But agree we can make this "standalone" support in a separate PR.

tlvu · 2024-06-18T14:54:07Z

birdhouse/optional-components/prometheus-longterm-metrics/prometheus.yml

+
+scrape_configs:
+  - job_name: 'federate'
+    scrape_interval: 15s


This one is a long term but it still scrape every 15 sec? Or I might misunderstood this config.

It collects and stores metrics that are already collected by the other prometheus service. Depending on how often the other prometheus instance is calculating those metrics, you may not need such a small scrape_interval. What interval would you suggest?

I made it configurable

I do not fully understand federation yet but I had the impression the 2nd Prometheus would query the max/min/average value from the 1st Prometheus at a much longer interval. Ex: the 1st Prometheus scrap every minute. The 2nd Prometheus would scrap every hour, taking the max/min/average over the hour of the 1st Prometheus?

You don't need multiple prometheus instances for that. You just need to create recording rules that calculate the metrics that you want to store for a longer period and then use the second prometheus instance to grab those metrics for long-term storage.

The second prometheus instance is necessary only because the data retention time is configured for each instance, so if you want to keep data longer you need a separate prometheus instance.

If you're storing data for a long time with thanos instead, you still need a second prometheus instance so that you can select which data thanos will store. thanos stores all data from a prometheus instance, you cannot pick and choose, so we get around this problem by only collecting the data we are interested in the "longterm" prometheus instance and only using thanos to store data from that specific instance.

The best way to think about it is:

prometheus: collects all metrics we are interested in and stores them for a short period of time

prometheus-longterm-metrics: selects some metrics collected by prometheus that we are interested in storing for a long period of time.

thanos: selects all metrics from prometheus-longterm-metrics and stores them efficiently over an even longer period of time

I see clearer now, thanks !

Just to confirm, the current setup prometheus-longterm-metrics is collecting everything from prometheus so we are not yet "selecting some metrics" yet, right?

It's alright we can perform the selection in a subsequent PR.

Just to confirm, the current setup prometheus-longterm-metrics is collecting everything from prometheus so we are not yet "selecting some metrics" yet, right?

Opps sorry, just read your other reply now and so to answer my own remark: prometheus-longterm-metrics is collecting only metrics from group: longterm-metrics from prometheus.

birdhouse/optional-components/thanos/default.env

birdhouse/optional-components/thanos/docker-compose-extra.yml

mishaschwartz · 2024-06-18T15:25:42Z

@tlvu

Can this new component be enabled standalone? Meaning on another physical host with only itself and the proxy, as describe in this comment #277 (comment)?

Not with the current way that we deploy without some serious additional coordination to support the networking and permissions. If we used something like kubernetes or docker swarm to deploy this would be easier to deploy on multiple hosts but we're not there yet.

tlvu · 2024-06-18T19:17:10Z

additional coordination to support the networking and permissions.

Right. Let's say we expose the 1st Prometheus port to the PAVICS docker host and configure the firewall on the PAVICS docker host to allow connection from the 2nd Prometheus long term only, would that would that work?

tlvu · 2024-06-18T19:34:00Z

birdhouse/optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules

+      rules:
+        # percentage of the time, over the last hour, that all CPUs were working
+        # 1 means all CPUs were working all the time, 0 means they were all idle all the time
+        - record: instance:cpu_load:avg_rate1h


So the 2nd Prometheus-longterm will scrap only these 2 new metrics or it will also scrap all the existing ones?

By default, only these two. I suggest we add more rules to this default list in the future though.

If you have other metrics that you want it to scrape for long term storage, you just have to give that rule the label group: longterm-metrics.

tlvu · 2024-06-18T19:46:23Z

birdhouse/optional-components/prometheus-longterm-metrics/prometheus.yml.template

+
+    static_configs:
+      - targets:
+        - 'prometheus:9090'


If you make this target list a template expansion variable, then we can easily add more hosts and override this default local Prometheus and points to other hosts, ex:

- existing_pavics_1:9090 - existing_pavics_2:9090 - ...

This assume I open up the Prometheus port on existing_pavics_1 and existing_pavics_2 and I handle manually the inclusion of prometheus.rules on on existing_pavics_1 and existing_pavics_2.

Then you can remove the dependency on ./components/monitoring.

Just add a note in the README to enable ./components/monitoring together with this ./optional-components/prometheus-longterm-metrics component only if the admin wants both Prometheus to be on the same machine.

I think this is for a separate discussion and out of the scope of this PR.

We can discuss the best way to handle multiple targets later on but at the very least we need to consider the security implications of recommending users open up additional ports.

I really think that any code that handles monitoring multiple prometheus endpoints should go in a separate repo.

mishaschwartz · 2024-06-18T20:12:13Z

Right. Let's say we expose the 1st Prometheus port to the PAVICS docker host and configure the firewall on the PAVICS docker host to allow connection from the 2nd Prometheus long term only, would that would that work?

Yup, that would work. I think that code can go in a different repo though.

tlvu · 2024-06-19T16:58:34Z

Right. Let's say we expose the 1st Prometheus port to the PAVICS docker host and configure the firewall on the PAVICS docker host to allow connection from the 2nd Prometheus long term only, would that would that work?

Yup, that would work. I think that code can go in a different repo though.

Yup, I never see this code in this repo. Ouranos have our own override repo so we can do this in our own override. Because only our repo will actually know what are all the "other" hostnames !

The only change needed for this repo so to turn the targets section in birdhouse/optional-components/prometheus-longterm-metrics/prometheus.yml.template to a template variable so our override repo can override with a real list of "other" hosts.

tlvu

This looks pretty good for now. Any extra improvements can be done in subsequent PRs.

mishaschwartz · 2024-06-25T20:33:21Z

@tlvu I've made a few small changes to make it easier to deploy this on a separate server. Check out 2eab8b7 for the relevant changes.

tlvu

Thanks ! This should indeed make the longterm prometheus standalone if need be.

fmigneault

Hi. Sorry for the delay of the review.
Still going through my plethora of emails following my long leave.

Nice feature. Will rely on the configurations to have been tested on a local instance since I cannot deep dive into it yet.

I think the README documentation with the existing monitoring component would benefit from a cross-reference of these new optional components details. A user reading about the monitoring component might not have the reflex to look for corresponding optional components definitions, and might get confused by the mixed variables/config/rule references. Also, they might not be able to figure out that thanos is relevant for prometheus and vice-versa.

In the same line of thought, it is not clear yet to me what are all the impacts of the various components and how they interact between each other. There are implicit dependencies between thanos, prometheus, prometheus-longterm-metrics, as well as grafana, alertmanager, cadvisor, others(?) through components/monitoring and all their derived dockers/volumes, but it is really not obvious unless one has personally integrated them. Can we add some guidance in the docs to help disambiguate all of them?

CHANGES.md

fmigneault · 2024-07-08T20:47:03Z

CHANGES.md

+  * ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
+  * ``/thanos-minio``: a minio web console to inspect the data stored by minio.


Should those be configurable?

Which bits would be useful to configure? The endpoints, paths, images, other?

I agree that we could always add more configuration options, I'm just wondering which are a priority for you?

The endpoints would be the priority, to allow serving them from some other location, though still a low priority relative to the feature as a whole. Worst case, redirects can be defined in the nginx configuration, so don't block the PR just for this.

CHANGES.md

birdhouse/optional-components/prometheus-longterm-metrics/default.env

fmigneault · 2024-07-08T21:12:38Z

birdhouse/optional-components/prometheus-longterm-metrics/default.env

+# Note that this component does not depend explicitly on the `components/monitoring` component so that this can
+# theoretically be deployed on a different machine than the `prometheus` service. This is currently untested.


Curious how that would work out?
It seems to depend on prometheus:9090 and extends the volumes of prometheus service.
I think it is more confusing to omit the dependency.

It's actually a completely separate service with no shared volumes. It does depend on prometheus:9090 by default but that default can be updated by setting the PROMETHEUS_LONGTERM_TARGETS variable.

PROMETHEUS_LONGTERM_TARGETS can point to any prometheus endpoints at all.

See the discussion about this here: #461 (comment)

What I meant is that the rules are mounted under prometheus service here:
https://github.com/bird-house/birdhouse-deploy/blob/longterm-monitoring/birdhouse/optional-components/prometheus-longterm-rules/config/monitoring/docker-compose-extra.yml

So, whether PROMETHEUS_LONGTERM_TARGETS refers to the same instance or a custom remote one, the local instance needs to have the long-term rules defined. Therefore, doesn't that mean that prometheus service (i.e.: using components/monitoring) becomes mandatory regardless, since it is needed to mount the rules?

Yes, prometheus must be running on the same host as the rest of the birdhouse stack.

That first prometheus instance must have at least one rule that has the longterm-metrics group label in order for a second prometheus instance (either running on the same machine or elsewhere) to know which metrics it should monitor and store from the first prometheus instance.

I've taken a second look at the directory structure, and starting to see why components/monitoring is not a dependency. Is it only because components/monitoring combines cardvisor, grafana, alertmanager, prometheus, etc. all together?

Therefore, if someone wants "only" longterm-metrics, they need to define their own prometheus service with a definition similar to this?

birdhouse-deploy/birdhouse/components/monitoring/docker-compose-extra.yml

Lines 32 to 49 in 3f75d49

prometheus:

image: ${PROMETHEUS_IMAGE}

container_name: prometheus

volumes:

- ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro

- ./components/monitoring/prometheus.rules:/etc/prometheus/prometheus.rules:ro

- prometheus_persistence:/prometheus:rw

command:

# restore original CMD from image

- --config.file=/etc/prometheus/prometheus.yml

- --storage.tsdb.path=/prometheus

- --web.console.libraries=/usr/share/prometheus/console_libraries

- --web.console.templates=/usr/share/prometheus/consoles

# https://prometheus.io/docs/prometheus/latest/storage/

- --storage.tsdb.retention.time=90d

# wrong default was http://container-hash:9090/

- --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus/

restart: always

I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why):

optional-components/prometheus-longterm-metrics

optional-components/prometheus-longterm-rules

From a user's perspective, this all seems really convoluted.
Isn't there are way to simplify this hierarchy? For example, what if there was a distinct optional-components/prometheus definition, and components/monitoring depends on it. Then, optional-components/prometheus-longterm-metrics could also depend only on optional-components/prometheus?

@fmigneault

Therefore, if someone wants "only" longterm-metrics, they need to define their own prometheus service with a definition similar to this?

If someone only wants longterm metrics then they can have a single prometheus service that sets --storage.tsdb.retention.time to something longer than 90 days.

However, if you want some metrics to be stored for longer than others, you need a second prometheus instance. This is because the --storage.tsdb.retention.time value is set for an entire prometheus instance (see discussion here: #277 (comment) and here: #461 (comment)).

I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why)

These are separate in order to accommodate @tlvu's request to be able to deploy the prometheus-longterm-metrics component on a different server. prometheus-longterm-rules are just suggested default rules to be added to the prometheus server. As described in the README file:

This adds some default longterm metrics rules to the `prometheus` component

These rules are added to the prometheus component on the same machine that is running the rest of the stack because that is where all the relevant metrics are collected by the rest of the monitoring components (cadvisor, nodeexporter). These rules are not necessary, they are recommended defaults but you can customize them however you like. It is easier to customize and add your own if the rules are separate. Users can choose to include the defaults, or not.

Agree with Misha's explanations. I just want to add that each organization needs are different so we should not force a certain "configuration".

Specifically to this PR, we should not force the 2 Prometheus to be on the same machine. So optional-components/prometheus-longterm-metrics should not depend on components/monitoring. We should also not be forced to use optional-components/prometheus-longterm-rules because each org will probably have different metrics they deemed useful to keep for longterm.

During Misha's leave, a sysadmin from PCIC actually had a question about how to pull existing metrics from our PAVICS Prometheus into his own centralized Prometheus for a centralized view of all his servers in one place. I point him to this PR for inspiration.

My bad. I was not clear in my explanation. What I intended to say was:
"if someone wants only prometheus monitoring with both long-term and short term metrics, they need to define their own prometheus service" [...], because they cannot activate prometheus on its own (i.e.: they must use the full package components/monitoring that includes all other services).

Having short/long-term separate, and having them configurable on different servers is perfectly fine by me. I'm all for that flexibility, and never mentioned otherwise.

It just seems from the configuration files that, in order for long-term metrics to be sent to the remote server, the rules must be mounted into the short-term local prometheus (as shown below), and since that prometheus must exist, then it should be a dependency. I understand long-term-prometheus NOT being a dependency to toggle it independently of the short-term one. This is good. Here I am referring to https://github.com/bird-house/birdhouse-deploy/tree/3f75d496b932ab4d90857206003d0225cd20c435/birdhouse/optional-components/prometheus-longterm-rules having component/monitoring dependency, NOT https://github.com/bird-house/birdhouse-deploy/tree/3f75d496b932ab4d90857206003d0225cd20c435/birdhouse/optional-components/prometheus-longterm-metrics. In other words, if one adds optional-components/prometheus-longterm-rules ONLY without also thinking to add components/monitoring, docker-compose will complain that service: prometheus is not defined.

birdhouse-deploy/birdhouse/optional-components/prometheus-longterm-rules/config/monitoring/docker-compose-extra.yml

Lines 3 to 6 in 3f75d49

services:

prometheus:

volumes:

- ./optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules:/etc/prometheus/prometheus-longterm-metrics.rules:ro

Ok I understand your concern now.

In other words, if one adds optional-components/prometheus-longterm-rules ONLY without also thinking to add components/monitoring, docker-compose will complain that service: prometheus is not defined.

Docker compose won't actually complain in this case. Additional settings defined under the config/ folder in components are only applied if the component that it references is also enabled. So in this case, if you add prometheus-longterm-rules as a component but not monitoring, then nothing will happen (since the additional docker-compose settings are in prometheus-longterm-rules/config/monitoring.

If we do make the prometheus-longterm-rules component dependent on the monitoring components then if someone adds the prometheus-longterm-rules but not the monitoring rule to the BIRDHOUSE_EXTRA_CONF_DIRS configuration variable, then monitoring will be added in automatically. This is also a surprising behaviour that users may not expect.

My opinion is that we should leave it as is, so that users must specify both components if they want both components. I'm open to other opinions on this subject though.

Thanks for validating how this resolves.
I'm fine with either approach if the README indicates clearly what is supposed to happen and what to do if the one or the other situation is desired.

fmigneault · 2024-09-18T23:26:58Z

birdhouse/optional-components/README.rst

+Prometheus Long-term Metrics
+----------------------------
+
+This is a second prometheus instance that collects longterm monitoring metrics from the original prometheus instance
+(the one created by the ``components/monitoring`` component).
+
+Longterm metrics are any prometheus rule that have the label ``group: longterm-metrics`` or in other words are
+selectable using prometheus' ``'{group="longterm-metrics"}'`` query filter. To add some default longterm metrics rules
+also enable the ``prometheus-longterm-rules`` component.
+
+You may also choose to create your own set of rules instead of, or as well as, the default ones. See how to 
+:ref:`add additional rules here <monitoring-customize-the-component>`.
+
+To configure this component:
+
+    * update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus
+
+Enabling this component creates the additional endpoint ``/prometheus-longterm-metrics``.
+
+.. _prometheus-longterm-rules
+
+Prometheus Long-term Rules
+--------------------------
+
+This adds some default longterm metrics rules to the `prometheus` component for use by the `prometheus-longterm-metrics`
+component. These rules all have the label ``group: longterm-metrics``.
+
+To see which rules are added, check out the
+`optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules` file.


Re-reading this following other comments in the PR, I am left only more confused...

Putting ourselves in the shoes of someone having absolutely no idea about the implementation details of each component and how they interact with each other, what must the user do to achieve some results (aka: the "To enable this optional-component:" of each other section)?

ie:

If user wants SHORT only, what must be enabled (guessing only components/monitoring?)

If user wants SHORT + LONG all locally, what must be enabled (components/monitoring + optional-components/prometheus-longterm-metrics ?) (+ optional-components/prometheus-longterm-rules depending on other dependency comment)

If user wants SHORT local and LONG remote (components/monitoring + optional-components/prometheus-longterm-rules + IMPORTANT PROMETHEUS_LONGTERM_TARGETS override ?)

Specifically regarding the "Prometheus Long-term Rules" section.
It is mentioned that they are applied to prometheus. (this is fine, I understand why to apply the group tags)

However, for a user without the details of "why", it is very confusing! I can see a user think they "should logically" be applied to prometheus-longterm-rules service given their names. A much better explanation must be given.

Basically, something must indicate that service prometheus is the source that feeds the others instances using optional-components/prometheus-longterm-rules. Then, using optional-components/prometheus-longterm-metrics, they can decide that this receiving instance is another local one (ie: prometheus-longterm-metrics service), or dispatch it remotely by overriding PROMETHEUS_LONGTERM_TARGETS accoridngly.

Ok... just to clarify what you're asking for here ...

You would like specific instructions about how to set up each of these three cases:

If user wants SHORT only ...

If user wants SHORT + LONG all locally ...

If user wants SHORT local and LONG remote ...

And you would like an additional description of how the various monitoring components interact?
Is there anything else?

That's it. Just mitigate user expectations and ensure there is a common understanding of the components/services.

mishaschwartz added 2 commits June 7, 2024 14:38

add longterm monitoring components

d981ef3

add documentation

06dc997

mishaschwartz requested a review from huard June 7, 2024 20:12

mishaschwartz requested review from tlvu and fmigneault as code owners June 7, 2024 20:12

github-actions bot added component/magpie Related to https://github.com/Ouranosinc/Magpie documentation Improvements or additions to documentation labels Jun 7, 2024

huard approved these changes Jun 13, 2024

View reviewed changes

tlvu reviewed Jun 18, 2024

View reviewed changes

tlvu self-requested a review June 18, 2024 15:07

mishaschwartz added 4 commits June 18, 2024 11:30

create default rules

0d7178e

review updates

42f687d

fix missing default var

a921527

bug fixes

0307996

tlvu reviewed Jun 18, 2024

View reviewed changes

tlvu self-requested a review June 18, 2024 19:46

tlvu approved these changes Jun 19, 2024

View reviewed changes

make longterm monitoring deployable on a different server

2eab8b7

mishaschwartz requested a review from tlvu June 25, 2024 20:33

tlvu approved these changes Jun 25, 2024

View reviewed changes

fmigneault reviewed Jul 8, 2024

View reviewed changes

mishaschwartz added 2 commits September 16, 2024 10:51

Merge branch 'master' into longterm-monitoring

76e60f7

documentation updates and remove unused references to variables

3f75d49

fmigneault reviewed Sep 18, 2024

View reviewed changes

update documentation [ci skip]

79a531d

mishaschwartz requested a review from fmigneault October 2, 2024 13:05

		* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
		* ``/thanos-minio``: a minio web console to inspect the data stored by minio.

		# Note that this component does not depend explicitly on the `components/monitoring` component so that this can
		# theoretically be deployed on a different machine than the `prometheus` service. This is currently untested.

	prometheus:
	image: ${PROMETHEUS_IMAGE}
	container_name: prometheus
	volumes:
	- ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
	- ./components/monitoring/prometheus.rules:/etc/prometheus/prometheus.rules:ro
	- prometheus_persistence:/prometheus:rw
	command:
	# restore original CMD from image
	- --config.file=/etc/prometheus/prometheus.yml
	- --storage.tsdb.path=/prometheus
	- --web.console.libraries=/usr/share/prometheus/console_libraries
	- --web.console.templates=/usr/share/prometheus/consoles
	# https://prometheus.io/docs/prometheus/latest/storage/
	- --storage.tsdb.retention.time=90d
	# wrong default was http://container-hash:9090/
	- --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus/
	restart: always

	services:
	prometheus:
	volumes:
	- ./optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules:/etc/prometheus/prometheus-longterm-metrics.rules:ro

Add the prometheus-longterm-metrics and thanos optional components #461

Are you sure you want to change the base?

Add the prometheus-longterm-metrics and thanos optional components #461

Conversation

mishaschwartz commented Jun 7, 2024

Overview

Changes

Related Issue / Discussion

Additional Information

CI Operations

huard left a comment

Choose a reason for hiding this comment

mishaschwartz commented Jun 13, 2024 • edited Loading

huard commented Jun 13, 2024

tlvu commented Jun 13, 2024

huard commented Jun 13, 2024

tlvu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mishaschwartz Jun 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mishaschwartz commented Jun 18, 2024 • edited Loading

tlvu commented Jun 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlvu Jun 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mishaschwartz commented Jun 18, 2024

tlvu commented Jun 19, 2024

tlvu left a comment

Choose a reason for hiding this comment

mishaschwartz commented Jun 25, 2024

tlvu left a comment

Choose a reason for hiding this comment

fmigneault left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mishaschwartz Sep 18, 2024 • edited Loading

Choose a reason for hiding this comment

tlvu Sep 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Add the `prometheus-longterm-metrics` and `thanos` optional components #461

Add the `prometheus-longterm-metrics` and `thanos` optional components #461

mishaschwartz commented Jun 13, 2024 •

edited

Loading

mishaschwartz Jun 18, 2024 •

edited

Loading

mishaschwartz commented Jun 18, 2024 •

edited

Loading

tlvu Jun 18, 2024 •

edited

Loading

mishaschwartz Sep 18, 2024 •

edited

Loading

tlvu Sep 18, 2024 •

edited

Loading