
[Bug] Core Dump in mirror_replay Test Suite During Execution #782

Open
1 of 2 tasks
edespino opened this issue Dec 16, 2024 · 9 comments
Labels
type: Bug (Something isn't working)

Comments

@edespino
Contributor

Apache Cloudberry version

main branch

What happened

The mirror_replay test suite is consistently generating a core dump during execution. This test is part of the greenplum_schedule running under the ic-good-opt-off (make -C src/test/regress installcheck-good) test matrix configuration. From the core dump's stack trace, the issue occurs specifically during append-only segment file handling in the startup process.

Environment

Project: Apache Cloudberry
Test Suite: mirror_replay
Schedule: greenplum_schedule
Test Matrix Config: ic-good-opt-off
Build Type: Debug build with the following configuration:

--enable-debug
--enable-profiling
--enable-cassert
--enable-debug-extensions
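For reference, the flags above translate to a configure invocation along these lines (a sketch; any environment-specific options such as --prefix are omitted here):

```shell
./configure \
  --enable-debug \
  --enable-profiling \
  --enable-cassert \
  --enable-debug-extensions
```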

Stack Trace
The core dump stack trace indicates the crash occurs during append-only segment file handling:

Thread 1 (Thread 0x7f9cf7a5ed00 (LWP 8442)):
#0  0x00007f9cf8f11a6c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f9cf8ec4686 in raise () from /lib64/libc.so.6
#2  0x00007f9cf8eae833 in abort () from /lib64/libc.so.6
#3  0x00007f9cf9ca28bf in errfinish (filename=<optimized out>, filename@entry=0x7f9cfa27ef7a "xlogutils.c", lineno=lineno@entry=103, funcname=funcname@entry=0x7f9cfa27f060 <__func__.5> "log_invalid_page") at elog.c:819
#4  0x00007f9cf97272d6 in log_invalid_page (present=false, blkno=1, forkno=MAIN_FORKNUM, node=...) at xlogutils.c:103
#5  XLogAOSegmentFile (rnode=..., segmentFileNum=1) at xlogutils.c:567
#6  0x00007f9cf9d590a6 in ao_truncate_replay (record=<optimized out>, record=<optimized out>) at cdbappendonlyxlog.c:177
#7  0x00007f9cf971b7e5 in StartupXLOG () at xlog.c:7824
#8  0x00007f9cf9a6d124 in StartupProcessMain () at startup.c:267
#9  0x00007f9cf9767e52 in AuxiliaryProcessMain (argc=<optimized out>, argc@entry=2, argv=<optimized out>, argv@entry=0x7ffd0b1cc490) at bootstrap.c:483
#10 0x00007f9cf9a6cbd4 in StartChildProcess (type=StartupProcess) at postmaster.c:6139
#11 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x137aa30) at postmaster.c:1668
#12 0x000000000040282f in main (argc=7, argv=0x137aa30) at main/main.c:270
$1 = {si_signo = 6, si_errno = 0, si_code = -6, _sifields = {_pad = {8442, 1000, 0 <repeats 26 times>}, _kill = {si_pid = 8442, si_uid = 1000}, _timer = {si_tid = 8442, si_overrun = 1000, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 8442, si_uid = 1000, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 8442, si_uid = 1000, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x3e8000020fa, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 4294967304442, si_fd = 0}, _sigsys = {_call_addr = 0x3e8000020fa, _syscall = 0, _arch = 0}}}

Impact

  • Blocks successful execution of mirror_replay test suite
  • May indicate potential issues with append-only segment file handling during mirror synchronization

What you think should happen instead

Analysis

  1. The crash occurs in the startup process during XLOG replay
  2. Specifically fails in log_invalid_page() function in xlogutils.c
  3. The context suggests this is related to append-only segment file handling during mirror replay
  4. The immediate cause appears to be an invalid page access during AO segment file processing

How to reproduce

Ensure your system can generate core files, then run the following dev test command:

make -C src/test/regress installcheck-good

The issue reproduces consistently without additional steps.
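To make sure core files are actually written, something along these lines is usually needed first (the coredumpctl notes are system-specific assumptions for systemd-based distributions such as Rocky Linux 9):

```shell
# Remove the core-size limit for this shell and its children
ulimit -c unlimited

# On systemd-based systems, cores typically route through systemd-coredump;
# they can then be listed and extracted with, e.g.:
#   coredumpctl list
#   coredumpctl dump <PID> -o core.postgres
ulimit -c   # prints "unlimited" if the limit was lifted
```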

Operating System

Rocky Linux 9 (should be platform independent)

Anything else

Additional Context
The error occurs during the append-only truncate replay operation (ao_truncate_replay), suggesting potential issues with either:

  • Invalid segment file state during replay
  • Corruption in the XLOG records
  • Incorrect handling of append-only segment files during mirror synchronization

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

@edespino edespino added the type: Bug Something isn't working label Dec 16, 2024
@edespino
Contributor Author

FYI: Non-debug builds produce the following:

Thread 1 (Thread 0x7eff22bfad00 (LWP 95861)):
#0  0x00007eff240ada6c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007eff24060686 in raise () from /lib64/libc.so.6
#2  0x00007eff2404a833 in abort () from /lib64/libc.so.6
#3  0x00007eff24d18c46 in errfinish () from /usr/local/cloudberry-db-99.0.0/lib/libpostgres.so
#4  0x00007eff2484f836 in XLogAOSegmentFile () from /usr/local/cloudberry-db-99.0.0/lib/libpostgres.so
#5  0x00007eff24db0e96 in ao_truncate_replay.isra () from /usr/local/cloudberry-db-99.0.0/lib/libpostgres.so
#6  0x00007eff248452ff in StartupXLOG () from /usr/local/cloudberry-db-99.0.0/lib/libpostgres.so
#7  0x00007eff24b29fb5 in StartupProcessMain () from /usr/local/cloudberry-db-99.0.0/lib/libpostgres.so
#8  0x00007eff24882642 in AuxiliaryProcessMain () from /usr/local/cloudberry-db-99.0.0/lib/libpostgres.so
#9  0x00007eff24b24db5 in StartChildProcess () from /usr/local/cloudberry-db-99.0.0/lib/libpostgres.so
#10 0x00007eff24b298cf in PostmasterMain () from /usr/local/cloudberry-db-99.0.0/lib/libpostgres.so
#11 0x00000000004027db in main ()
$1 = {si_signo = 6, si_errno = 0, si_code = -6, _sifields = {_pad = {95861, 1000, 0 <repeats 26 times>}, _kill = {si_pid = 95861, si_uid = 1000}, _timer = {si_tid = 95861, si_overrun = 1000, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 95861, si_uid = 1000, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 95861, si_uid = 1000, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x3e800017675, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 4294967391861, si_fd = 0}, _sigsys = {_call_addr = 0x3e800017675, _syscall = 0, _arch = 0}}}

@yjhjstz
Member

yjhjstz commented Dec 16, 2024

CI seems to have lost the isolation2 suite (make -C src/test/isolation2 installcheck).

@edespino
Contributor Author

@yjhjstz & @avamingli I hope to bring it and others online soon. I am able to run two of the isolation2 tests and did notice there are failures (output differences). They can be seen here.

https://github.com/edespino/cloudberry/actions/runs/12364538041

edespino added a commit to edespino/cloudberry that referenced this issue Dec 17, 2024
This test is currently causing core dumps when run as part of the
greenplum_schedule. To prevent this from blocking other testing while
we investigate the root cause:

- Created new fixme_schedule containing only mirror_replay
- Removed mirror_replay from greenplum_schedule
- Added installcheck-fixme make target to run problematic tests in
  isolation

Issue: apache#782
@avamingli
Contributor

@yjhjstz & @avamingli I hope to bring it and others online soon. I am able to run two of the isolation2 tests and did notice there are failures (output differences). They can be seen here.

https://github.com/edespino/cloudberry/actions/runs/12364538041

Hi, at a glance, that's a case we should fix. Please feel free to create the PR bringing isolation2 back if that was the only case that failed; I will help you fix the diffs there. (On vacation today; I should be back tomorrow.)

@yjhnupt

yjhnupt commented Dec 18, 2024

diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /__w/cloudberry/cloudberry/src/test/isolation2/expected/parallel_retrieve_cursor/explain.out /__w/cloudberry/cloudberry/src/test/isolation2/results/parallel_retrieve_cursor/explain.out
--- /__w/cloudberry/cloudberry/src/test/isolation2/expected/parallel_retrieve_cursor/explain.out	2024-12-16 17:38:39.620082360 -0800
+++ /__w/cloudberry/cloudberry/src/test/isolation2/results/parallel_retrieve_cursor/explain.out	2024-12-16 17:38:39.628082370 -0800
@@ -113,40 +113,40 @@
 QUERY PLAN
 ___________
  Seq Scan on pg_catalog.pg_class
-   Output: oid, relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, reltoastrelid, relhasindex, relisshared, relpersistence, relkind, relnatts, relchecks, relhasrules, relhastriggers, relhassubclass, relrowsecurity, relforcerowsecurity, relispopulated, relreplident, relispartition, relisivm, relrewrite, relfrozenxid, relminmxid, relacl, reloptions, relpartbound
-GP_IGNORE:(3 rows)
+   Output: oid, relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, reltoastrelid, relhasindex, relisshared, relpersistence, relkind, relnatts, relchecks, relhasrules, relhastriggers, relhassubclass, relrowsecurity, relforcerowsecurity, relispopulated, relreplident, relispartition, relisivm, relisdynamic, relrewrite, relfrozenxid, relminmxid, relacl, reloptions, relpartbound
+GP_IGNORE:(4 rows)

Please help add the relisdynamic field to the expected output to fix the test. @avamingli

edespino added a commit to edespino/cloudberry that referenced this issue Dec 18, 2024
This test is currently causing core dumps when run as part of the
greenplum_schedule. To prevent this from blocking other testing while
we investigate the root cause:

- Created new fixme_schedule containing only mirror_replay
- Removed mirror_replay from greenplum_schedule
- Added installcheck-fixme make target to run problematic tests in
  isolation

Issue: apache#782
edespino added a commit that referenced this issue Dec 18, 2024
* Enhance Build Pipeline with Debug and Core Analysis Support

Adds comprehensive debug build support and automated core dump analysis to
the Cloudberry build pipeline. Key features:

- Debug build capability with preserved symbols and debug-specific RPMs
- Automated core dump detection and analysis during test execution
- Core file correlation with test failures
- Enhanced test result reporting with core dump status
- Improved artifact management for debug builds

The changes enable better debugging of test failures and provide more
detailed information about process crashes during testing.

* test: Move mirror_replay test to separate schedule due to core dumps

This test is currently causing core dumps when run as part of the
greenplum_schedule. To prevent this from blocking other testing while
we investigate the root cause:

- Created new fixme_schedule containing only mirror_replay
- Removed mirror_replay from greenplum_schedule
- Added installcheck-fixme make target to run problematic tests in
  isolation

Issue: #782

* test: Mark mirror_replay cores as warnings

When enable_check_core is disabled, the test should proceed with a
warning rather than failing. Modified the core file check and summary
to mark mirror_replay with a warning status in these cases.

This complements the previous isolation of this test into
fixme_schedule, allowing testing to proceed while we investigate the
underlying core dump issue.
@yjhjstz
Member

yjhjstz commented Dec 25, 2024

@edespino can you help bring the make installcheck-cbdb-parallel test back?

@edespino
Contributor Author

@edespino can you help bring the make installcheck-cbdb-parallel test back?

Yes, I will.

@edespino
Contributor Author

@edespino can you help bring the make installcheck-cbdb-parallel test back?

@yjhjstz If you could help with an approval for #819 it would be appreciated.

@edespino
Contributor Author

@yjhjstz FYI: installcheck-cbdb-parallel is now live: https://github.com/apache/cloudberry/actions/runs/12502691175


4 participants