-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Push the runtime filter from HashJoin down to SeqScan or AM. #724
base: main
Are you sure you want to change the base?
Conversation
|
||
/* append new runtime filters to target node */ | ||
SeqScanState *sss = castNode(SeqScanState, attr_filter->target); | ||
sss->filters = list_concat(sss->filters, scankeys); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we merge filter here on the same attno ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Combining Bloom filters will result in a higher False Positive Rate (FPR) compared to using each of the individual Bloom filters separately, so it is not recommended;
- There is the same problem to combine range filters like combining Bloom filters;
There is only one Bloom filter and one range filter on the same attribute in many cases;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create table t1(a int, b int) with(parallel_workers=2);
create table rt1(a int, b int) with(parallel_workers=2);
create table rt2(a int, b int);
create table rt3(a int, b int);
insert into t1 select i, i from generate_series(1, 100000) i;
insert into t1 select i, i+1 from generate_series(1, 10) i;
insert into rt1 select i, i+1 from generate_series(1, 10) i;
insert into rt2 select i, i+1 from generate_series(1, 10000) i;
insert into rt3 select i, i+1 from generate_series(1, 10) i;
analyze t1;
analyze rt1;
analyze rt2;
analyze rt3;
explain analyze select * from rt1 join t1 on rt1.a = t1.b join rt3 on rt3.a = t1.b;
postgres=# explain select * from rt1 join t1 on rt1.a = t1.b join rt3 on rt3.a = t1.b;
QUERY PLAN
--------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=2.45..428.51 rows=17 width=24)
-> Hash Join (cost=2.45..428.29 rows=6 width=24)
Hash Cond: (t1.b = rt1.a)
-> Hash Join (cost=1.23..427.00 rows=6 width=16)
Hash Cond: (t1.b = rt3.a)
-> Seq Scan on t1 (cost=0.00..342.37 rows=33337 width=8)
-> Hash (cost=1.10..1.10 rows=10 width=8)
-> Seq Scan on rt3 (cost=0.00..1.10 rows=10 width=8)
-> Hash (cost=1.10..1.10 rows=10 width=8)
-> Seq Scan on rt1 (cost=0.00..1.10 rows=10 width=8)
Optimizer: Postgres query optimizer
(11 rows)
you can try this case, will got two range filters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
continue; | ||
|
||
val = slot_getattr(slot, sk->sk_attno, &isnull); | ||
if (isnull) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CREATE TABLE distinct_1(a int);
CREATE TABLE distinct_2(a int);
INSERT INTO distinct_1 VALUES(1),(2),(NULL);
INSERT INTO distinct_2 VALUES(1),(NULL);
SELECT * FROM distinct_1, distinct_2 WHERE distinct_1.a IS NOT DISTINCT FROM distinct_2.a;
test got wrong result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
src/backend/executor/nodeSeqscan.c
Outdated
return slot; | ||
|
||
if (node->filter_in_seqscan && node->filters && | ||
!PassByBloomFilter(node, slot)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tpcds 1TB, bloom filter will lose efficacy or create failed due to large rows ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when creating the Bloom filter, the system evaluates the estimated number of rows that this hash join will process and the amount of available memory during the execution plan generation. It determines whether using a Bloom filter for filtering data would be effective based on this evaluation. If it is assessed that the Bloom filter would not sufficiently enhance performance, then the Bloom filter will not be created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It determines whether using a Bloom filter for filtering data would be effective based on this evaluation
That makes sense, but where is related code, I just didn't see them in this pr.
Does it compares the number of rows between the output of hashtable and data in the probe table? If the rows of the hashtable are far less than that of the probe table , then use the runtime filter?
src/backend/executor/nodeHashjoin.c
Outdated
{ | ||
match = false; | ||
|
||
if (!IsA(lfirst(lc), Var)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it support other expression, whose one arg is the column attr, and the other is a const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that expressions like t1.c1 = 5
should be pushed down by the optimizer to operators such as SeqScan for early processing. Therefore, this feature does not handle expressions of the form t1.c1 = 5
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't make it clear. I don't mean the predication on the var. like the below sql
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT t1.c3 FROM t1, t2 WHERE t1.c2 = (t2.c2 + 10);
QUERY PLAN
-------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (actual rows=0 loops=1)
-> Hash Join (actual rows=0 loops=1)
Hash Cond: (t1.c2 = (t2.c2 + 10))
Extra Text: (seg2) Hash chain length 8.0 avg, 8 max, using 4 of 524288 buckets.
-> Seq Scan on t1 (actual rows=128 loops=1)
-> Hash (actual rows=32 loops=1)
Buckets: 524288 Batches: 1 Memory Usage: 4098kB
-> Seq Scan on t2 (actual rows=32 loops=1)
Optimizer: Postgres query optimizer
(9 rows)
As t2.c2 + 10
is not a Var
but a T_OpExpr
, the runtime filter cannot handle it.
Could we just iterate the expression tree and check if it only contains var and const ?
Looks interesting. And I have some questions to discuss.
|
src/backend/executor/nodeHashjoin.c
Outdated
* result (hash filter) | ||
* seqscan on t1, t1 is replicated table | ||
*/ | ||
if (!IsA(child, HashJoinState) && !IsA(child, ResultState)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hash Join (cost=0.00..4019.55 rows=37 width=9) (actual time=3203.012..9927.435 rows=1399 loops=1)
Hash Cond: (web_sales_1_prt_2.ws_item_sk = item.i_item_sk)
Join Filter: (web_sales_1_prt_2.ws_ext_discount_amt > ((1.3 * avg(web_sales_1_prt_2_1.ws_ext_discount_amt))))
Rows Removed by Join Filter: 4763
Extra Text: (seg2) Hash chain length 1.0 avg, 1 max, using 198 of 2097152 buckets.
-> Append (cost=0.00..676.44 rows=2399189 width=13) (actual time=16.899..5572.473 rows=3090021 loops=1)
-> Seq Scan on web_sales_1_prt_2 (cost=0.00..676.44 rows=2399189 width=13) (actual time=16.895..1138.267 rows=662
149 loops=1)
-> Seq Scan on web_sales_1_prt_3 (cost=0.00..676.44 rows=2399189 width=13) (actual time=8.947..1102.409 rows=6621
36 loops=1)
-> Seq Scan on web_sales_1_prt_4 (cost=0.00..676.44 rows=2399189 width=13) (actual time=8.822..1100.839 rows=6621
48 loops=1)
-> Seq Scan on web_sales_1_prt_5 (cost=0.00..676.44 rows=2399189 width=13) (actual time=11.391..1083.785 rows=662
179 loops=1)
-> Seq Scan on web_sales_1_prt_6 (cost=0.00..676.44 rows=2399189 width=13) (actual time=13.030..649.141 rows=4414
09 loops=1)
-> Seq Scan on web_sales_1_prt_7 (cost=0.00..676.44 rows=2399189 width=13) (never executed)
-> Seq Scan on web_sales_1_prt_others (cost=0.00..676.44 rows=2399189 width=13) (actual time=1.213..3.203 rows=17
88 loops=1)
-> Hash (cost=2432.09..2432.09 rows=109 width=12) (actual time=3177.768..3177.770 rows=198 loops=1)
Buckets: 2097152 Batches: 1 Memory Usage: 16392kB
-> Broadcast Motion 3:3 (slice3; segments: 3) (cost=
need to consider partitioned table .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
support partitioned table in c701092
src/backend/executor/nodeHash.c
Outdated
attr_filter->min = LLONG_MAX; | ||
attr_filter->max = LLONG_MIN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LLONG_MAX, LLONG_MIN are platform-spec value, i.e. the bound value for unsigned long long
, which may not be exactly the same width as Datum. For safety, static assert could be considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see StaticAssertDecl(SIZEOF_DATUM == 8, "sizeof datum is not 8");
in postgres.h, so it's better to use INT64_MAX/INT64_MIN here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use LONG_MAX, LONG_MIN instead ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix with 99eabb2
src/backend/executor/nodeHashjoin.c
Outdated
/* | ||
* Only applicatable for inner, right and semi join, | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give a little more explain about why these join types are supported and others are not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add more message to explain why just inner, right and semi join are allowed with runtime filter.
fix it in 98dac6d
src/backend/executor/nodeHashjoin.c
Outdated
if (!IsA(expr, OpExpr) && !IsA(expr, FuncExpr)) | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 2 lines duplicate with the following if-elseif-else code, could be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix with 99eabb2
src/backend/executor/nodeHashjoin.c
Outdated
break; | ||
|
||
var = lfirst(lc); | ||
if (var->varno == INNER_VAR) | ||
*rattno = var->varattno; | ||
else if (var->varno == OUTER_VAR) | ||
*lattno = var->varattno; | ||
else | ||
break; | ||
|
||
match = true; | ||
} | ||
|
||
return match; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The match flag gets the code hard(several modifications) to read. The break statement could be replaced by return false;
. If the foreach loop ends, all conditions match, so returns true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the more intuitive way to refactor the code, like below
/* check the first arg */
...
/* check the second arg */
...
return true;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix with 99eabb2
src/backend/executor/nodeSeqscan.c
Outdated
if (TupIsNull(slot)) | ||
return slot; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it be true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that slot is never NULL here, so Assert(!TupIsNull(slot));
is better or remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix with 99eabb2
/* | ||
* SK_EMPYT means the end of the array of the ScanKey | ||
*/ | ||
sk[*num].sk_flags = SK_EMPYT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to check the boundary of the ScanKey
array in rescan? In normal rescan, the number of ScanKey
s is the same as begin_scan. If the number of ScanKey
s is larger in rescan than that in begin_scan, the boundary value might be invalid and dangerous to access.
+----------+ AttrFilter +------+ ScanKey +------------+ | HashJoin | ------------> | Hash | ---------> | SeqScan/AM | +----------+ +------+ +------------+ If "gp_enable_runtime_filter_pushdown" is on, three steps will be run: Step 1. In ExecInitHashJoin(), try to find the mapper between the var in hashclauses and the var in SeqScan. If found we will save the mapper in AttrFilter and push them to Hash node; Step 2. We will create the range/bloom filters in AttrFilter during building hash table, and these filters will be converted to the list of ScanKey and pushed down to Seqscan when the building finishes; Step 3. If AM support SCAN_SUPPORT_RUNTIME_FILTER, these ScanKeys will be pushed down to the AM module further, otherwise will be used to filter slot in Seqscan;
980e2de
to
99eabb2
Compare
There are codes changed in MultiExecParallelHash, please add some parallel tests with runtime filter. |
got it. |
|
Thanks, I'll reproduce the issue and fix it. |
Thanks for your detailed explanation.
Make sense. When doing hashjoin, index scan or index only scan are often not used on probe node.
Exactly, and if there is any lock used to solve the problem may even lead bad performance. |
runtime_filter has been pushed down to t3 table seqscan, but 'explain analyze' doesn't print them out.
|
7160067
to
76a003a
Compare
76a003a
to
98dac6d
Compare
Thanks for your test case. Based on these, I rewrote code to ensure that debug info are always displayed even when the number of filtered rows is zero. And add the test case into gp_runtime_filter.sql too. |
Thanks for your test case. I fix it in 98dac6d And add the test case into gp_runtime_filter.sql too. |
Hi @zhangyue-hashdata Asking mostly out of curiosity, I see here are quite a few reviewers here already :) |
Basically, you're correct. Because our goal is to filter out as much data as possible right at the point of data generation. However, this will lead to very complex evaluations, so we only made a simple estimation based on rows and work memory when creating the Bloom filter. |
fix it in 7ab040a |
src/backend/executor/nodeSeqscan.c
Outdated
if (table_scan_getnextslot(scandesc, direction, slot)) | ||
while (table_scan_getnextslot(scandesc, direction, slot)) | ||
{ | ||
if (node->filter_in_seqscan && node->filters && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!node->filter_in_seqscan || !node->filters)
{
if (table_scan_getnextslot(scandesc, direction, slot))
return slot;
}
else
{
while (table_scan_getnextslot(scandesc, direction, slot))
{
.....
}
}
this make origin path more efficient and readable ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea! I fix in bcf93e6
from tpcds 10s details table, there are some bad cases. |
+----------+ AttrFilter +------+ ScanKey +------------+
| HashJoin | ------------> | Hash | ---------> | SeqScan/AM |
+----------+ +------+ +------------+
If "gp_enable_runtime_filter_pushdown" is on, three steps will be run:
Step 1. In ExecInitHashJoin(), try to find the mapper between the var in
hashclauses and the var in SeqScan. If found we will save the mapper in
AttrFilter and push them to Hash node;
Step 2. We will create the range/bloom filters in AttrFilter during building
hash table, and these filters will be converted to the list of ScanKey
and pushed down to Seqscan when the building finishes;
Step 3. If AM support SCAN_SUPPORT_RUNTIME_FILTER, these ScanKeys will be pushed
down to the AM module further, otherwise will be used to filter slot in
Seqscan;
perf:
CPU E5-2680 v2 10 cores, memory 32GB, 3 segments
tpcds 10s details
Fixes #ISSUE_Number
What does this PR do?
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
[skip ci]
to your PR title. Only use when necessary!