PARQUET-1381: Support merging of rowgroups during file rewrite #1121
base: master
Conversation
Taking forward a PR that had remained inactive. Original PR: #775
I simply did an initial review.
@@ -72,6 +72,18 @@ public class RewriteCommand extends BaseCommand {
        required = false)
    String codec;

    @Parameter(
        names = {"-m", "--merge-rowgroups"},
        description = "<true/false>",
Could you please add a brief description?
Done
    @Parameter(
        names = {"-s", "--max-rowgroup-size"},
        description = "<max size of the merged rowgroups>",
It would be good to say in the description that it is used together with --merge-rowgroups=true.
Done
    builder.enableRowGroupMerge();
    builder.maxRowGroupSize(maxRowGroupSize);
What about using a single function, like builder.mergeRowGroups(maxRowGroupSize)?
I have made changes as per the comment. I'm fine either way.
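The reviewer's suggestion can be sketched as a builder whose single method both enables the merge and sets the size, so the two options cannot drift apart. This is a minimal illustrative sketch, not the actual parquet-java RewriteOptions API; all names here are assumptions.

```java
// Illustrative sketch of the suggested consolidated builder method.
// MergeOptions and its fields are hypothetical stand-ins, not real parquet-java classes.
public class MergeOptions {
    private final boolean mergeRowGroups;
    private final long maxRowGroupSize;

    private MergeOptions(boolean merge, long maxSize) {
        this.mergeRowGroups = merge;
        this.maxRowGroupSize = maxSize;
    }

    public boolean isMergeEnabled() { return mergeRowGroups; }
    public long getMaxRowGroupSize() { return maxRowGroupSize; }

    public static class Builder {
        private boolean mergeRowGroups = false;
        private long maxRowGroupSize = 128L * 1024 * 1024; // assumed default of 128 MB

        // Single entry point: enabling the merge and setting the target size
        // together, instead of two separate builder calls.
        public Builder mergeRowGroups(long maxRowGroupSize) {
            this.mergeRowGroups = true;
            this.maxRowGroupSize = maxRowGroupSize;
            return this;
        }

        public MergeOptions build() {
            return new MergeOptions(mergeRowGroups, maxRowGroupSize);
        }
    }
}
```

With this shape, a caller who never invokes mergeRowGroups gets merging disabled by construction, which is the invariant the single-function suggestion buys.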
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.*;
Please do not use star imports.
Corrected
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;

public class RowGroupMerger {
- public class RowGroupMerger {
+ class RowGroupMerger {

It would be good not to make it public for now.
Probably you need to relocate it into the rewrite package.
Done
      initNextReader();
    }
    while (reader != null);
    new RowGroupMerger(schema, newCodecName, v2EncodingHint).merge(readers, maxRowGroupSize, writer);
I didn't review it in depth. Does it handle encryption or masking properties internally?
Yes. Underneath, it uses the same instance of ParquetFileWriter which handles these operations.
This is a nice feature @MaheshGPai. I've been wondering about similar features too; thanks for your work.
By the way, do you have any performance numbers comparing this with rewriting by query engines such as Spark/Hive?
    List<ParquetFileReader> readers = new ArrayList<>();
    do {
      readers.add(reader);
      initNextReader();
Looks like v2EncodingHint only checks the first parquet file. Should all the files be checked?
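The check the reviewer asks for could be sketched like this: derive the hint from every input file rather than only the first one. This is an illustrative sketch; the boolean-per-file input is a stand-in for whatever per-file page-version metadata the rewriter actually inspects.

```java
import java.util.List;

// Sketch: decide the v2 encoding hint across ALL input files.
// "fileUsesV2Pages" is a hypothetical stand-in for inspecting each file's metadata.
public class EncodingHint {
    public static boolean v2EncodingHint(List<Boolean> fileUsesV2Pages) {
        // Only hint v2 encodings when every input file uses v2 pages;
        // a single v1 file forces the conservative v1 path.
        return !fileUsesV2Pages.isEmpty()
            && fileUsesV2Pages.stream().allMatch(Boolean::booleanValue);
    }
}
```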
    DictionaryPage dictPage = columnReader.readDictionaryPage();
    Dictionary decodedDictionary = null;
    if (dictPage != null) {
      decodedDictionary = dictPage.getEncoding().initDictionary(column.getColumnDesc(), dictPage);
    }
If I understand the page-encoding process correctly: Parquet tries to use dictionary encoding by default. If the dictionary grows too big, whether in size or in the number of distinct values, the encoding falls back to plain encoding. The check and fallback logic happen when the first page is emitted.
So when we are merging multiple column chunks from different row groups, if the first column chunk is dictionary encoded and the others are not (because they fell back to plain encoding), we should deliberately disable dictionary encoding for that column to avoid introducing overhead.
The current logic doesn't handle that; it will use dictionary encoding whenever the column chunk in the first row group to be merged uses dictionary encoding.
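The rule the reviewer describes can be sketched as: keep dictionary encoding for the merged column only if every source chunk is dictionary-encoded. This is a hypothetical helper, not code from the PR; the boolean-per-chunk input stands in for inspecting each ColumnChunkMetaData's encodings.

```java
import java.util.List;

// Sketch of the reviewer's suggested rule for merged column chunks.
// "chunkIsDictionaryEncoded" is an illustrative stand-in for real chunk metadata.
public class DictionaryDecision {
    public static boolean useDictionaryForMergedColumn(List<Boolean> chunkIsDictionaryEncoded) {
        // If any source chunk already fell back to plain encoding, the merged
        // dictionary would almost certainly overflow too, so disable it up front
        // instead of paying for a doomed dictionary-build attempt.
        return !chunkIsDictionaryEncoded.isEmpty()
            && chunkIsDictionaryEncoded.stream().allMatch(b -> b);
    }
}
```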
    if (mergedBlock == null && estimator.estimate(blockMeta) > maxRowGroupSize) {
      // save it directly without re-encoding it
      saveBlockTo(ReadOnlyMergedBlock.of(blockMeta, group, schema, compressor), writer);
I checked the related code; it seems that startColumn and endColumn don't maintain the bloom filter.
It might be hard to maintain bloom filters when merging multiple row groups, but it should be possible and easy to maintain the bloom filter for a single row group. See ParquetWriter#L337 for the related code.
I agree, so it might be good to integrate this with ParquetRewriter if one row group does not need to be merged.
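The split being discussed can be sketched as a per-block planning decision: blocks copied verbatim keep their bloom filters and column indexes intact, while re-encoded merged blocks would need the filters rebuilt or dropped. All names below are illustrative, not the PR's actual API.

```java
// Sketch of the copy-vs-merge decision implied by the code above.
// BlockPlan and its names are hypothetical stand-ins.
public class BlockPlan {
    public enum Action { COPY_AS_IS, MERGE_REENCODE }

    public static Action planBlock(long estimatedBlockSize, long maxRowGroupSize) {
        // A block already above the target size is copied directly, which is
        // the path where existing bloom filters could be preserved untouched.
        return estimatedBlockSize > maxRowGroupSize ? Action.COPY_AS_IS : Action.MERGE_REENCODE;
    }
}
```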
Thanks for the quick update!
I know this PR comes from another PR that was created long before ParquetRewriter was implemented. However, my main concern is that the current implementation of RowGroupMerger diverges from ParquetRewriter, which makes it difficult to maintain in the future. For example, RowGroupMerger does not seem to support column masking (nullifying column values) when RewriterOptions requests it. And it has a duplicate implementation (i.e. ReadOnlyMergedBlock) for the case where a row group does not need to be merged, which ParquetRewriter already supports. Could you consider consolidating these implementations? Otherwise it will not be easy to add more features to the rewriter.
    @Override
    public DataPage visit(DataPageV1 pageV1) {

      return new DataPageV1(compress(pageV1.getBytes(), compressor), pageV1.getValueCount(),
Why does DataPageV1 need to be compressed again here while DataPageV2 does not (line 384 below)?
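One likely explanation, hedged against the Parquet format spec rather than this PR's code: a v1 data page body is compressed as a single buffer (levels included), while a v2 page keeps its repetition and definition levels uncompressed and compresses only the values section. A minimal sketch of that structural difference, with an identity "compressor" as a stand-in:

```java
// Sketch of the v1-vs-v2 page body layout difference.
// compress() is an identity stand-in for a real codec; all names are illustrative.
public class PageCompression {
    static byte[] compress(byte[] in) {
        return java.util.Arrays.copyOf(in, in.length); // stand-in, no real compression
    }

    // v1: the whole page body, levels and values together, goes through the codec.
    public static byte[] compressV1Body(byte[] levelsAndData) {
        return compress(levelsAndData);
    }

    // v2: rep/def levels stay uncompressed; only the values section is compressed.
    public static byte[] compressV2Body(byte[] repLevels, byte[] defLevels, byte[] data) {
        byte[] compressedData = compress(data);
        byte[] out = new byte[repLevels.length + defLevels.length + compressedData.length];
        System.arraycopy(repLevels, 0, out, 0, repLevels.length);
        System.arraycopy(defLevels, 0, out, repLevels.length, defLevels.length);
        System.arraycopy(compressedData, 0, out, repLevels.length + defLevels.length, compressedData.length);
        return out;
    }
}
```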
      newValuesWriter.reset();

      long firstRowIndex = pageV1.getFirstRowIndex().orElse(-1L);
We cannot simply copy firstRowIndex if pages are not from the 1st row group in this MutableMergedBlock.
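The fix the reviewer implies could look like the following sketch: when a page comes from a later source row group, its firstRowIndex must be shifted by the rows already contributed by preceding groups in the merged block. The helper and its names are hypothetical, not the PR's code.

```java
// Sketch of adjusting a page's firstRowIndex within a merged block.
// Names are illustrative stand-ins for the PR's actual state.
public class RowIndexShift {
    public static long adjustedFirstRowIndex(long pageFirstRowIndex, long rowsAlreadyInMergedBlock) {
        if (pageFirstRowIndex < 0) {
            return -1L; // an unknown index stays unknown
        }
        // Shift by the rows the preceding source row groups contributed.
        return rowsAlreadyInMergedBlock + pageFirstRowIndex;
    }
}
```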
This is a great initiative. Do you still plan to address the feedback @MaheshGPai?
@shangxinli I do plan to work on it, but I have not had time to get to it.
Hi @MaheshGPai, thanks for the contribution. If you don't have time to work on this, I can continue with it.
@ConeyLiu Please feel free to continue. I'll not be able to look at this for another week or so.
OK, I will dig into it.
Make sure you have checked all steps below.
- Jira
- Tests
- Commits
- Documentation