Querying ClickHouse database for fun and profit
ClickHouse is an open-source column-oriented relational database that the PyTorch Dev Infra team uses to store open source data from the PyTorch organization, including GitHub events, test stats, benchmark results, and more. The database is hosted on ClickHouse Cloud, which is also where you can log in and start querying the data.
Skip this part if you already have access to the PyTorch Dev Infra ClickHouse cluster at https://console.clickhouse.cloud/
For metamates, go to https://console.clickhouse.cloud/ and log in with your Meta email. The portal uses SSO, so you just need to follow the steps in your browser to request access. We grant read-only access by default.
Note that propagating the permission takes some time, from half an hour to an hour, so feel free to grab a coffee while you wait.
The list of all databases and tables on ClickHouse is at https://github.com/pytorch/test-infra/wiki/Available-databases-on-ClickHouse. If you are looking for more, take a look at https://github.com/pytorch/test-infra/wiki/How-to-add-a-new-custom-table-on-ClickHouse and reach out to us (POCs: @clee2000 and @huydhn) to chat about your new use cases.
After logging in, you should see our Dev Infra ClickHouse service running; select it to continue.
The SQL console and the two databases `default` and `benchmark` are probably where you want to go first. After selecting a database, you can start writing SQL queries against it.
With read-only permission, you will only be able to run `SELECT` queries. However, if you want to experiment with write queries such as `CREATE TABLE` or `INSERT INTO`, you can switch to the `fortesting` database, which grants write permission to everyone.
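For example, a quick write experiment in `fortesting` might look like the sketch below. The table name `my_scratch` and its columns are hypothetical, chosen just to illustrate the round trip:

```sql
-- A hypothetical scratch table; everyone has write access in fortesting
create table fortesting.my_scratch (
    id UInt32,
    note String
) engine = MergeTree
order by id;

insert into fortesting.my_scratch values (1, 'hello'), (2, 'world');

select * from fortesting.my_scratch;

-- Clean up after yourself when you are done experimenting
drop table fortesting.my_scratch;
```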
Here is an example query to see how long you wait for signals on your PR:
```sql
select
    -- The name of the workflow
    w.name,
    -- ClickHouse has many built-in functions; this is one of its DateTime functions. See
    -- https://clickhouse.com/docs/en/sql-reference/functions/date-time-functions
    timeDiff(w.created_at, w.updated_at) as workflow_duration_in_second
from
    -- The table that includes all the information about a GitHub workflow run. See
    -- https://docs.github.com/en/webhooks/webhook-events-and-payloads#workflow_run
    workflow_run w
where
    not empty(w.pull_requests)
    and w.status = 'completed'
    -- Here is an example of the powerful ClickHouse query parameters, where the
    -- pull request number is an input parameter to the query. See
    -- https://clickhouse.com/docs/en/sql-reference/syntax#defining-and-using-query-parameters
    and tupleElement(w.pull_requests[1], 'number') = {prNumber: Int64}
```
When you are happy with the query, you can:
- Provide a pull request number as the input to run it.
- Export the results to CSV.
- Save the query and share it with other users.
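If you prefer the command line, parameterized queries can also be run with `clickhouse-client`, which passes query parameters via `--param_<name>` flags. This is a sketch: the host and user are placeholders for your own ClickHouse Cloud credentials, and `12345` stands in for a real pull request number:

```shell
clickhouse-client \
  --host <your-service>.clickhouse.cloud --secure \
  --user <your-user> --password \
  --param_prNumber=12345 \
  --query "select w.name, timeDiff(w.created_at, w.updated_at) as workflow_duration_in_second from default.workflow_run w where not empty(w.pull_requests) and w.status = 'completed' and tupleElement(w.pull_requests[1], 'number') = {prNumber: Int64}"
```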
Here is another example that queries the `benchmark` database:

```sql
select
    -- The schema is defined at https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/oss_ci_benchmark_v3/query.sql
    o.head_sha,
    o.benchmark.extra_info,
    o.model.name,
    o.metric.name,
    -- Compute the average value if the benchmark is run multiple times on the same commit
    floor(arrayAvg(o.metric.benchmark_values), 2) as value
from
    -- The benchmark database where all the benchmark results are kept
    oss_ci_benchmark_v3 o
where
    -- Provide the metric name as a query parameter, for example compile_time_instruction_count
    o.metric.name = {metric: String}
    -- Provide the model name, or leave it empty to query all of them
    and (o.model.name = {model: String} or {model: String} = '')
    -- The starting time, in YYYY-MM-DDTHH:mm:ss.ddd format
    and o.timestamp >= toUnixTimestamp({startTime: DateTime64(3)})
    -- The stopping time, in the same YYYY-MM-DDTHH:mm:ss.ddd format
    and o.timestamp < toUnixTimestamp({stopTime: DateTime64(3)})
    -- Only query results from pytorch/pytorch main commits
    and o.head_branch = 'refs/heads/main'
    and o.repo = 'pytorch/pytorch'
order by
    -- Newest results first
    o.timestamp desc
```
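The `startTime` and `stopTime` parameters above are `DateTime64(3)` values in `YYYY-MM-DDTHH:mm:ss.ddd` format. If you are generating these strings programmatically, a small Python helper like this sketch can produce them (the function name `ch_datetime64` is our own, not part of any ClickHouse client library):

```python
from datetime import datetime, timedelta


def ch_datetime64(dt: datetime) -> str:
    # DateTime64(3) expects millisecond precision: YYYY-MM-DDTHH:mm:ss.ddd
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}"


# For example, query the last 7 days of results ending at a fixed stop time
stop = datetime(2024, 5, 1, 12, 0, 0)
start = stop - timedelta(days=7)
print(ch_datetime64(start))  # 2024-04-24T12:00:00.000
print(ch_datetime64(stop))   # 2024-05-01T12:00:00.000
```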