
Querying ClickHouse database for fun and profit


What is ClickHouse?

ClickHouse is an open-source, column-oriented relational database that the PyTorch Dev Infra team uses to store all open-source data from the PyTorch organization, including GitHub events, test stats, benchmark results, and more. The database is hosted on ClickHouse Cloud, which is also where you can log in and start querying the data.

Prerequisites

First time login

Skip this part if you already have access to the PyTorch Dev Infra ClickHouse cluster at https://console.clickhouse.cloud/

For metamates, go to https://console.clickhouse.cloud/ and log in with your Meta email. The portal uses SSO, so you just need to follow the steps in your browser to request access. We grant read-only access by default.

Note that propagating the permission takes some time, from half an hour to an hour, so you can go grab a coffee if you like.

Skim through the data we have

The list of all databases and tables on CH is at https://github.com/pytorch/test-infra/wiki/Available-databases-on-ClickHouse. If you are looking for something that is not there yet, take a look at https://github.com/pytorch/test-infra/wiki/How-to-add-a-new-custom-table-on-ClickHouse and reach out to us (POCs: @clee2000 and @huydhn) to chat about your new use case.

Get familiar with CH Cloud UX

After logging in, you should see our Dev Infra CH service running; open it. (Screenshot: the ClickHouse Cloud service list)

The SQL console and the two databases default and benchmark are probably where you want to go first. After selecting a database, you can start writing SQL queries there. (Screenshot: the SQL console with the default and benchmark databases)
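
If you want to skim what is available without leaving the console, a few introspection statements help. This is a minimal sketch; benchmark and oss_ci_benchmark_v3 are just the examples used later on this page:

-- List the databases you can see
show databases;

-- List the tables in the benchmark database
show tables from benchmark;

-- Inspect the schema of one table
describe table benchmark.oss_ci_benchmark_v3;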

With read-only permission, you will only be able to run SELECT queries there. However, if you want to experiment with write queries such as CREATE TABLE or INSERT INTO, you can switch to the fortesting database, which grants write permission to everyone.
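
Here is a minimal sketch of such an experiment against fortesting. The table name scratch_example and its columns are placeholders, not an existing table:

-- Create a throwaway table (scratch_example is a placeholder name)
create table fortesting.scratch_example (
    id UInt64,
    note String
) engine = MergeTree()
order by id;

-- Insert a couple of rows
insert into fortesting.scratch_example values (1, 'hello'), (2, 'world');

-- Read them back
select * from fortesting.scratch_example;

-- Clean up after yourself
drop table fortesting.scratch_example;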

Write your first query

Here is an example query to see how long you wait for signals on your PR:

select
    -- The name of the workflow
    w.name,
    -- ClickHouse has many built-in functions; this is one of its DateTime functions
    -- https://clickhouse.com/docs/en/sql-reference/functions/date-time-functions
    timeDiff(w.created_at, w.updated_at) as workflow_duration_in_seconds
from
    -- The table that includes all the information about a GitHub workflow run. See
    -- https://docs.github.com/en/webhooks/webhook-events-and-payloads#workflow_run
    workflow_run w
where
    not empty(w.pull_requests)
    and w.status = 'completed'
    -- Here is an example of ClickHouse query parameters: the pull request number
    -- is an input parameter to the query. See
    -- https://clickhouse.com/docs/en/sql-reference/syntax#defining-and-using-query-parameters
    and tupleElement(w.pull_requests[1], 'number') = {prNumber: Int64}

When you are happy with the query, you can:

  • Provide a pull request number as the input and run it (see the sketch after this list for setting the parameter outside the Cloud console).
  • Export the results to CSV.
  • Save the query and share it with other users.
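
Outside the Cloud console, for example in a clickhouse-client session, the same query parameter can be set with SET. A minimal sketch, where 12345 is just a placeholder PR number:

-- Set the {prNumber: Int64} parameter for the current session (12345 is a placeholder)
set param_prNumber = 12345;

-- Quick check that the parameter is substituted as expected, then re-run the query above
select {prNumber: Int64} as pr_number;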

More example queries

Query the benchmark results from a period of time

select
    -- The schema is defined at https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/oss_ci_benchmark_v3/query.sql
    o.head_sha,
    o.benchmark.extra_info,
    o.model.name,
    o.metric.name,
    -- Compute the average value if the benchmark is run multiple times on the same commit
    floor(arrayAvg(o.metric.benchmark_values), 2) as value
from
    -- The benchmark database where all the benchmark results are kept
    oss_ci_benchmark_v3 o
where
    -- Provide the metric name as a query parameter, for example compile_time_instruction_count
    o.metric.name = {metric: String}
    -- Provide the model name, or leave this empty to query all of them
    and (o.model.name = {model: String} or {model: String} = '')
    -- The starting time, in YYYY-MM-DDTHH:mm:ss.ddd format
    and o.timestamp >= toUnixTimestamp({startTime: DateTime64(3)})
    -- The stopping time, in the same YYYY-MM-DDTHH:mm:ss.ddd format
    and o.timestamp < toUnixTimestamp({stopTime: DateTime64(3)})
    -- Only query results from pytorch/pytorch main commits
    and o.head_branch = 'refs/heads/main'
    and o.repo = 'pytorch/pytorch'
order by
    -- Newest results first
    o.timestamp desc
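
Assuming the same schema, here is a sketch of a follow-up aggregation over the same table that averages the metric per model across the whole time window instead of listing every commit:

select
    o.model.name,
    -- Average across all commits in the window (benchmark_values is averaged per row first)
    round(avg(arrayAvg(o.metric.benchmark_values)), 2) as avg_value
from
    oss_ci_benchmark_v3 o
where
    o.metric.name = {metric: String}
    and o.timestamp >= toUnixTimestamp({startTime: DateTime64(3)})
    and o.timestamp < toUnixTimestamp({stopTime: DateTime64(3)})
    and o.head_branch = 'refs/heads/main'
    and o.repo = 'pytorch/pytorch'
group by
    o.model.name
order by
    avg_value desc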