
feat(influx_tools): Add export to parquet files #25297

Open

srebhan wants to merge 14 commits into master-1.x from v1-bulk-exporter-parquet
Conversation

@srebhan (Member) commented Sep 9, 2024

Closes #
Supersedes #25253


  • I've read the contributing section of the project README.
  • Signed CLA (if not already signed).

This PR adds a command to export data into per-shard Parquet files. To do so, the command iterates over the shards, creates a cumulative schema over the series of a measurement (i.e., a superset of tags and fields), and exports the data to one Parquet file per measurement and shard.

To test the tool, run

go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config influxdb.conf -database telegraf
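
As an illustration of the cumulative-schema step described above, the sketch below merges the tag and field keys seen across a measurement's series into a single superset of columns. It is a minimal, hypothetical example (the series type and helper names are made up for illustration), not the PR's actual code.

// Hypothetical sketch: build a cumulative schema for one measurement as the
// union of all tag keys and field keys seen across its series in a shard.
package main

import (
	"fmt"
	"sort"
)

// series is a simplified stand-in for a measurement's series metadata.
type series struct {
	tags   map[string]string // tag key -> value
	fields map[string]string // field key -> InfluxDB type ("float", "integer", ...)
}

// cumulativeSchema returns the sorted superset of tag and field columns.
func cumulativeSchema(all []series) (tagKeys, fieldKeys []string) {
	tagSet := map[string]struct{}{}
	fieldSet := map[string]struct{}{}
	for _, s := range all {
		for k := range s.tags {
			tagSet[k] = struct{}{}
		}
		for k := range s.fields {
			fieldSet[k] = struct{}{}
		}
	}
	for k := range tagSet {
		tagKeys = append(tagKeys, k)
	}
	for k := range fieldSet {
		fieldKeys = append(fieldKeys, k)
	}
	sort.Strings(tagKeys)
	sort.Strings(fieldKeys)
	return tagKeys, fieldKeys
}

func main() {
	tags, fields := cumulativeSchema([]series{
		{tags: map[string]string{"host": "a"}, fields: map[string]string{"usage_user": "float"}},
		{tags: map[string]string{"host": "b", "region": "eu"}, fields: map[string]string{"usage_idle": "float"}},
	})
	fmt.Println(tags, fields) // [host region] [usage_idle usage_user]
}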

@srebhan force-pushed the v1-bulk-exporter-parquet branch 2 times, most recently from 6869ba3 to bd44db9 on September 9, 2024 14:12
Review comments (outdated, resolved) on .circleci/config.yml, cmd/influx_tools/main.go, cmd/influx_tools/parquet/batcher.go, cmd/influx_tools/parquet/command.go, and cmd/influx_tools/parquet/exporter.go.
@davidby-influx (Contributor) left a comment

I did a quick review, but I'm not familiar with Arrow and certainly missed some things. I can do a more thorough review if we pair up to walk through the algorithm once.

Review comments (outdated, resolved) on cmd/influx_tools/parquet/schema.go.
@srebhan (Member, Author) commented Sep 18, 2024

@davidby-influx thanks for the thorough review! I tried to address all issues and commented on the three unresolved ones. I'll schedule a meeting to walk through the code. Thanks again!

@davidby-influx (Contributor) left a comment

LGTM

Review comment (resolved) on cmd/influx_tools/parquet/cursors.go.
@alespour (Contributor) commented Oct 2, 2024

I'm not sure what to make of this: I have a v1 database with several measurements (cpu, disk, etc.), each with ~8M rows:

> select count(usage_user) from cpu

name: cpu
time count
---- -----
0    8631360

The same query returns a different number of rows against the exported Parquet "db":

alespour@master-node:/bigdata/x$ duckdb -column -s "select count(usage_user) from 'all/cpu-*.parquet'"

count(usage_user)
-----------------
28771200 

Log attached.
cpu-export.log

@alespour (Contributor) commented Oct 2, 2024

  • tested measurement without tags - OK
  • tested single & all measurements export - OK, except for the row-count discrepancy above

Tested with a database simulating one month of monitoring data from a small data center (9 measurements such as cpu, disk, etc., with 10 tags). Database size on disk: 4.1 GB, 5 shards.

Exported Parquet size on disk: 11 GB. The export took 1h6m on a somewhat dated laptop (8-core Core i7 CPU, 16 GB RAM, SSD). Memory usage during the export was stable (RSS peak ~2 GB).

InfluxDB measurement structure example:

> show tag keys from cpu
name: cpu
tagKey
------
arch
datacenter
hostname
os
rack
region
service
service_environment
service_version
team

> show field keys from cpu
name: cpu
fieldKey         fieldType
--------         ---------
usage_guest      float
usage_guest_nice float
usage_idle       float
usage_iowait     float
usage_irq        float
usage_nice       float
usage_softirq    float
usage_steal      float
usage_system     float
usage_user       float

Parquet:

alespour@master-node:/bigdata/x$ duckdb -column -s "describe select * from 'all/cpu-*.parquet'"

column_name          column_type  null  key  default  extra
-------------------  -----------  ----  ---  -------  -----
time                 TIMESTAMP    YES                      
arch                 VARCHAR      YES                      
datacenter           VARCHAR      YES                      
hostname             VARCHAR      YES                      
os                   VARCHAR      YES                      
rack                 VARCHAR      YES                      
region               VARCHAR      YES                      
service              VARCHAR      YES                      
service_environment  VARCHAR      YES                      
service_version      VARCHAR      YES                      
team                 VARCHAR      YES                      
usage_guest          DOUBLE       YES                      
usage_guest_nice     DOUBLE       YES                      
usage_idle           DOUBLE       YES                      
usage_iowait         DOUBLE       YES                      
usage_irq            DOUBLE       YES                      
usage_nice           DOUBLE       YES                      
usage_softirq        DOUBLE       YES                      
usage_steal          DOUBLE       YES                      
usage_system         DOUBLE       YES                      
usage_user           DOUBLE       YES                      

Measurement without tags:

alespour@master-node:/bigdata/x$ duckdb -column -s "select * from 'notags/*.parquet'"
time                        lat    lon  
--------------------------  -----  -----
2024-10-02 13:03:55.643371  49.95  14.47
2024-10-02 13:04:04.423014  49.91  14.49
2024-10-02 13:04:12.726653  49.94  14.53

@alespour (Contributor) commented Oct 2, 2024

I will repeat the test to verify the row-count (mis)match.

@alespour (Contributor) commented Oct 2, 2024

My apologies, it was a mistake on my side. Row count matches.

InfluxDB:

> select count(usage_user) from cpu
name: cpu
time count
---- -----
0    28771200

Parquet:

alespour@master-node:/bigdata/x$ duckdb -column -s "select count(usage_user) from 'cpu/*.parquet'"
count(usage_user)
-----------------
28771200  

@srebhan mentioned this pull request on Oct 2, 2024
@alespour (Contributor) commented Oct 3, 2024

  • tested other types - OK
Creating the following schemata for 1 measurement(s):
  Measurement "types" with 0 tag(s) and  5 field(s):
    Column	Kind		Datatype
    ------	----		--------
    time	timestamp	timestamp (nanosecond)
    label	field		string
    lat		field		float
    lon		field		float
    match	field		boolean
    scale	field		integer
alespour@master-node:/bigdata/x$ sudo duckdb -column -s "describe from 'types/*.parquet'"
column_name  column_type  null  key  default  extra
-----------  -----------  ----  ---  -------  -----
time         TIMESTAMP    YES                      
label        VARCHAR      YES                      
lat          DOUBLE       YES                      
lon          DOUBLE       YES                      
match        BOOLEAN      YES                      
scale        BIGINT       YES
alespour@master-node:/bigdata/x$ sudo duckdb -column -s "select * from 'types/*.parquet' limit 1"
time                        label  lat    lon    match  scale
--------------------------  -----  -----  -----  -----  -----
2024-10-03 07:58:33.419431  a1     49.94  14.53  true   4    
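
The output above suggests a straightforward mapping from InfluxDB field types to Parquet column types (float → DOUBLE, integer → BIGINT, boolean → BOOLEAN, string → VARCHAR, time → nanosecond TIMESTAMP). Below is a minimal, hypothetical sketch of such a mapping using the Apache Arrow Go module; the module version and the helper function are assumptions for illustration, not the PR's actual code.

// Hypothetical sketch of the type mapping the test output suggests
// (InfluxDB field type -> Arrow type, which DuckDB reports as shown above).
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/arrow" // module version is an assumption
)

// arrowType maps an InfluxDB field type name to an Arrow data type:
// float -> DOUBLE, integer -> BIGINT, boolean -> BOOLEAN, string -> VARCHAR.
func arrowType(influxType string) arrow.DataType {
	switch influxType {
	case "float":
		return arrow.PrimitiveTypes.Float64
	case "integer":
		return arrow.PrimitiveTypes.Int64
	case "boolean":
		return arrow.FixedWidthTypes.Boolean
	default: // string fields and tags
		return arrow.BinaryTypes.String
	}
}

func main() {
	fields := []arrow.Field{
		// time is exported as a nanosecond-resolution timestamp column.
		{Name: "time", Type: arrow.FixedWidthTypes.Timestamp_ns, Nullable: true},
		{Name: "label", Type: arrowType("string"), Nullable: true},
		{Name: "lat", Type: arrowType("float"), Nullable: true},
		{Name: "match", Type: arrowType("boolean"), Nullable: true},
		{Name: "scale", Type: arrowType("integer"), Nullable: true},
	}
	fmt.Println(arrow.NewSchema(fields, nil))
}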

@alespour (Contributor) commented Oct 3, 2024

It's GTG by me 👍
