[Bug] Query Causes Segmentation Fault #806

antoniopetrole · 2024-12-20T22:11:20Z

Apache Cloudberry version

Cloudberry 1.6.0 (this is a pre-Apache release)

What happened

We had a number of segments go down both on 12/16 (multiple waves of segments going down) as well as 12/18 due to a problem query that was run. The log messages show this query along with a segmentation fault error in each segment log that crashed. I've obfuscated the table and column names for the sake of privacy in the below error message as well as the provided DDL and problem query, but everything else is exactly the same as it was when it was executed

Segmentation fault","Failed process was running: SELECT col1
           , col17
           , LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14)
           , col14
           , col14 - LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14) AS t
           , col14 - LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14) < '5 mins' AS v
      FROM myschema.mytable
      WHERE col14 >= '2024-11-01'
        AND col14 < '2024-12-01'
        AND col17 = 'FOOBAR’;",,,,,,0,,"postmaster.c",4315,
2024-12-16 22:08:13.008648 UTC,,,p3534838,th-2126882688,,,,0,,,seg171,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",4023,
2024-12-16 22:08:13.136640 UTC,,,p3534856,th-2126882688,,,,0,,,seg171,,,,,"WARNING","01000","ic-proxy-server: received signal 3",,,,,,,0,,"ic_proxy_main.c",474,
2024-12-16 22:08:13.182293 UTC,,,p3534838,th-2126882688,,,,0,,,seg171,,,,,"LOG","00000","background worker ""ic proxy process"" (PID 3534856) exited with exit code 1",,,,,,,0,,"postmaster.c",4293,

As far as I can tell it doesn't seem like this query was ever successful, however it's hard to tell from the logs since they're pretty verbose. However I did confirm with an engineer that this query was first executed on 12/16 (the day of our segments going down the first time with this query), there's no entry of it being executed at all on 12/17, and then it was executed again on 12/18 causing the same error and crash. This is a very large table with a lot of partitions and just from taking a quick sampling each partition is around 100GB+ in size, however I can get exact sizes if that's helpful for analysis.

I've attached the core dump analysis files for all 31 segments that went down during this time (this is directly related to issue #803) provided by @edespino in his comment here #803 (comment). I've also provided the DDL to create the table that I captured using gpbackup and then obfuscated as well as the query itself in files that are attached. Happy to provide any more detail upon request. I've tested this in my local Cloudberry docker environment but haven't been able to recreate the segfault. Mind you though that this test didn't include any data in the table so I imagine that would influence what steps the query execution plan actually takes.

segfault_artifacts.zip

What you think should happen instead

This query shouldn't cause a segmentation fault and should just return the results when it's finished

How to reproduce

I've tested this locally and haven't been able to reproduce the segfault

Operating System

Rocky Linux 8.10 (Green Obsidian)

Anything else

No response

Are you willing to submit PR?

Yes, I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct.

The text was updated successfully, but these errors were encountered:

antoniopetrole · 2024-12-20T22:11:46Z

Tagging @foreyes per request

edespino · 2024-12-21T01:24:59Z

CloudBerry Database Interconnect Crash Analysis

Overview

A widespread interconnect crash occurred across a CloudBerry Database 1.6.0 cluster affecting multiple segments. This analysis examines the crash patterns across the affected segments and hosts.

Would it be possible to share a couple of representative core files from this crash? The analysis shows a consistent pattern:

All crashes are SIGABRT in the interconnect layer (rxThreadFunc)
The crash propagated across 19 hosts in about 5.2 seconds
47 unique segments were affected, with some segments crashing multiple times

Having access to two core files would help with:

Identifying the specific assertion that's failing
Understanding the state of the interconnect at crash time
Examining relevant memory structures

Core Analysis Script Output and Source

Analysis Output

Starting core file analysis...

Core File Analysis Summary
==================================================

Hosts Affected: 19
10.9.1.115: 1 core files
  19:22:29 - seg19 (SIGABRT)
10.9.1.14: 5 core files
  19:22:28 - seg178 (SIGABRT)
  19:22:29 - seg177 (SIGABRT)
  19:22:30 - seg183 (SIGABRT)
  19:22:31 - seg181 (SIGABRT)
  19:22:32 - seg180 (SIGABRT)
10.9.1.140: 3 core files
  19:22:29 - seg70 (SIGABRT)
  19:22:31 - seg69 (SIGABRT)
  19:22:33 - seg71 (SIGABRT)
10.9.1.161: 2 core files
  19:22:28 - seg138 (SIGABRT)
  19:22:29 - seg143 (SIGABRT)
10.9.1.166: 5 core files
  19:22:28 - seg171 (SIGABRT)
  19:22:30 - seg174 (SIGABRT)
  19:22:31 - seg175 (SIGABRT)
  19:22:32 - seg173 (SIGABRT)
  19:22:33 - seg171 (SIGABRT)
10.9.1.168: 4 core files
  19:22:28 - seg143 (SIGABRT)
  19:22:29 - seg147 (SIGABRT)
  19:22:30 - seg146 (SIGABRT)
  19:22:31 - seg144 (SIGABRT)
10.9.1.175: 4 core files
  19:22:29 - seg131 (SIGABRT)
  19:22:30 - seg129 (SIGABRT)
  19:22:31 - seg128 (SIGABRT)
  19:22:32 - seg130 (SIGABRT)
10.9.1.176: 4 core files
  19:22:29 - seg44 (SIGABRT)
  19:22:30 - seg47 (SIGABRT)
  19:22:31 - seg45 (SIGABRT)
  19:22:32 - seg46 (SIGABRT)
10.9.1.185: 3 core files
  19:22:28 - seg175 (SIGABRT)
  19:22:29 - seg179 (SIGABRT)
  19:22:31 - seg177 (SIGABRT)
10.9.1.210: 2 core files
  19:22:28 - seg45 (SIGABRT)
  19:22:30 - seg51 (SIGABRT)
10.9.1.222: 5 core files
  19:22:29 - seg63 (SIGABRT)
  19:22:30 - seg62 (SIGABRT)
  19:22:31 - seg57 (SIGABRT)
  19:22:32 - seg61 (SIGABRT)
  19:22:33 - seg60 (SIGABRT)
10.9.1.232: 5 core files
  19:22:29 - seg51 (SIGABRT)
  19:22:30 - seg54 (SIGABRT)
  19:22:31 - seg55 (SIGABRT)
  19:22:32 - seg53 (SIGABRT)
  19:22:33 - seg51 (SIGABRT)
10.9.1.238: 3 core files
  19:22:28 - seg106 (SIGABRT)
  19:22:29 - seg107 (SIGABRT)
  19:22:30 - seg105 (SIGABRT)
10.9.1.240: 2 core files
  19:22:29 - seg180 (SIGABRT)
  19:22:30 - seg187 (SIGABRT)
10.9.1.242: 2 core files
  19:22:28 - seg15 (SIGABRT)
  19:22:30 - seg12 (SIGABRT)
10.9.1.246: 1 core files
  19:22:30 - seg108 (SIGABRT)
10.9.1.45: 2 core files
  19:22:29 - seg54 (SIGABRT)
  19:22:30 - seg57 (SIGABRT)
10.9.1.61: 1 core files
  19:22:29 - seg138 (SIGABRT)
10.9.1.83: 4 core files
  19:22:29 - seg163 (SIGABRT)
  19:22:30 - seg162 (SIGABRT)
  19:22:31 - seg161 (SIGABRT)
  19:22:32 - seg160 (SIGABRT)

Stack Trace Patterns
------------------------------
Pattern occurred 58 times:
  1. raise
  2. poll
  3. rxThreadFunc
  4. start_thread
  5. clone

Signal Distribution
------------------------------
SIGABRT: 58

Timing Analysis
------------------------------
First core: 19:22:28.073766
Last core:  19:22:33.310271
Duration:   5.237 seconds

Segment Distribution
------------------------------
seg105: 1 cores
seg106: 1 cores
seg107: 1 cores
seg108: 1 cores
seg12: 1 cores
seg128: 1 cores
seg129: 1 cores
seg130: 1 cores
seg131: 1 cores
seg138: 2 cores
seg143: 2 cores
seg144: 1 cores
seg146: 1 cores
seg147: 1 cores
seg15: 1 cores
seg160: 1 cores
seg161: 1 cores
seg162: 1 cores
seg163: 1 cores
seg171: 2 cores
seg173: 1 cores
seg174: 1 cores
seg175: 2 cores
seg177: 2 cores
seg178: 1 cores
seg179: 1 cores
seg180: 2 cores
seg181: 1 cores
seg183: 1 cores
seg187: 1 cores
seg19: 1 cores
seg44: 1 cores
seg45: 2 cores
seg46: 1 cores
seg47: 1 cores
seg51: 3 cores
seg53: 1 cores
seg54: 2 cores
seg55: 1 cores
seg57: 2 cores
seg60: 1 cores
seg61: 1 cores
seg62: 1 cores
seg63: 1 cores
seg69: 1 cores
seg70: 1 cores
seg71: 1 cores

Analysis Script

#!/usr/bin/env python3

import os
import json
import datetime
from collections import defaultdict
from pathlib import Path

def load_core_analysis(json_file):
    """Load and parse a core analysis JSON file."""
    with open(json_file) as f:
        return json.load(f)

def extract_process_info(file_output):
    """Extract process info from core file output string."""
    info = {}
    try:
        parts = file_output.split("from 'postgres: ")[1].split("'")[0].strip().split()
        info['pid'] = parts[0].strip(',')
        conn_details = ' '.join(parts[1:])
        info['details'] = conn_details
        for part in conn_details.split():
            if 'seg' in part:
                info['segment'] = part
            elif 'slice' in part:
                info['slice'] = part
            elif 'cmd' in part:
                info['cmd'] = part
    except:
        info['details'] = 'Parse failed'
    return info

def analyze_stack_trace(stack_trace):
    """Analyze stack trace for key patterns."""
    if not stack_trace:
        return {}
    top_functions = [frame.get('function', 'unknown') for frame in stack_trace[:5]]
    return {
        'top_functions': top_functions,
        'crash_function': top_functions[0] if top_functions else 'unknown'
    }

def analyze_cores(base_dir):
    results = {
        'cores_by_host': defaultdict(list),
        'stack_patterns': defaultdict(int),
        'segments': defaultdict(list),
        'timing': [],
        'signals': defaultdict(int)
    }
    
    for host_dir in Path(base_dir).iterdir():
        if not host_dir.is_dir():
            continue
        host = host_dir.name
        for json_file in host_dir.glob("*.json"):
            analysis = load_core_analysis(json_file)
            proc_info = extract_process_info(analysis['file_info']['file_output'])
            timestamp = datetime.datetime.fromisoformat(analysis['timestamp'])
            
            results['timing'].append({
                'host': host,
                'timestamp': timestamp,
                'segment': proc_info.get('segment', 'unknown'),
                'file': json_file.name
            })
            
            stack_info = analyze_stack_trace(analysis.get('stack_trace', []))
            stack_key = tuple(stack_info.get('top_functions', ['unknown']))
            results['stack_patterns'][stack_key] += 1
            
            signal_info = analysis.get('signal_info', {})
            signal_name = signal_info.get('signal_name', 'Unknown')
            results['signals'][signal_name] += 1
            
            results['cores_by_host'][host].append({
                'file': json_file.name,
                'timestamp': timestamp,
                'process_info': proc_info,
                'signal': signal_name,
                'crash_function': stack_info.get('crash_function', 'unknown')
            })
            
            if 'segment' in proc_info:
                results['segments'][proc_info['segment']].append({
                    'host': host,
                    'timestamp': timestamp,
                    'file': json_file.name
                })
    
    return results

def print_analysis(results):
    print("\nCore File Analysis Summary")
    print("=" * 50)
    
    print("\nHosts Affected:", len(results['cores_by_host']))
    for host, cores in sorted(results['cores_by_host'].items()):
        print(f"\n{host}: {len(cores)} core files")
        for core in sorted(cores, key=lambda x: x['timestamp']):
            print(f"  {core['timestamp'].strftime('%H:%M:%S')} - {core['process_info'].get('segment', 'unknown')} "
                  f"({core['signal']})")
    
    print("\nStack Trace Patterns")
    print("-" * 30)
    for stack, count in sorted(results['stack_patterns'].items(), key=lambda x: x[1], reverse=True):
        print(f"\nPattern occurred {count} times:")
        for i, func in enumerate(stack, 1):
            print(f"  {i}. {func}")
    
    print("\nSignal Distribution")
    print("-" * 30)
    for signal, count in sorted(results['signals'].items(), key=lambda x: x[1], reverse=True):
        print(f"{signal}: {count}")
    
    timestamps = [t['timestamp'] for t in results['timing']]
    if timestamps:
        print("\nTiming Analysis")
        print("-" * 30)
        min_time = min(timestamps)
        max_time = max(timestamps)
        duration = max_time - min_time
        print(f"First core: {min_time.strftime('%H:%M:%S.%f')}")
        print(f"Last core:  {max_time.strftime('%H:%M:%S.%f')}")
        print(f"Duration:   {duration.total_seconds():.3f} seconds")
    
    print("\nSegment Distribution")
    print("-" * 30)
    for segment, occurrences in sorted(results['segments'].items()):
        print(f"{segment}: {len(occurrences)} cores")

if __name__ == "__main__":
    print("Starting core file analysis...")
    results = analyze_cores('.')
    print_analysis(results)

yjhjstz · 2024-12-24T00:43:18Z

can you provide core and postgres binary ? @antoniopetrole

And table ddl of myschema.mytable ?

antoniopetrole · 2024-12-26T16:28:00Z

@yjhjstz Wow I can't believe I forgot to upload the ddl, I've attached it to this comment. Also I shared the coredump with @edespino directly, I don't want to share it publicly since it can contain sensitive data.
segfault_table.zip

antoniopetrole added the type: Bug Something isn't working label Dec 20, 2024

antoniopetrole changed the title ~~[Bug] Query Causes Segmentation Fault Twice~~ [Bug] Query Causes Segmentation Fault Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Query Causes Segmentation Fault #806

[Bug] Query Causes Segmentation Fault #806

antoniopetrole commented Dec 20, 2024

antoniopetrole commented Dec 20, 2024

edespino commented Dec 21, 2024

yjhjstz commented Dec 24, 2024 •

edited

Loading

antoniopetrole commented Dec 26, 2024

[Bug] Query Causes Segmentation Fault #806

[Bug] Query Causes Segmentation Fault #806

Comments

antoniopetrole commented Dec 20, 2024

Apache Cloudberry version

What happened

What you think should happen instead

How to reproduce

Operating System

Anything else

Are you willing to submit PR?

Code of Conduct

antoniopetrole commented Dec 20, 2024

edespino commented Dec 21, 2024

CloudBerry Database Interconnect Crash Analysis

Overview

Core Analysis Script Output and Source

Analysis Output

Analysis Script

yjhjstz commented Dec 24, 2024 • edited Loading

antoniopetrole commented Dec 26, 2024

yjhjstz commented Dec 24, 2024 •

edited

Loading