
[Bug] Query Causes Segmentation Fault #806

Open
antoniopetrole opened this issue Dec 20, 2024 · 4 comments
Labels
type: Bug (Something isn't working)

Comments

@antoniopetrole
Member

Apache Cloudberry version

Cloudberry 1.6.0 (this is a pre-Apache release)

What happened

We had a number of segments go down on both 12/16 (multiple waves of segments going down) and 12/18 due to a problem query. The log of each segment that crashed shows this query along with a segmentation fault error. I've obfuscated the table and column names for privacy in the error message below, as well as in the provided DDL and problem query, but everything else is exactly as it was when the query was executed.

Segmentation fault","Failed process was running: SELECT col1
           , col17
           , LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14)
           , col14
           , col14 - LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14) AS t
           , col14 - LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14) < '5 mins' AS v
      FROM myschema.mytable
      WHERE col14 >= '2024-11-01'
        AND col14 < '2024-12-01'
        AND col17 = 'FOOBAR';",,,,,,0,,"postmaster.c",4315,
2024-12-16 22:08:13.008648 UTC,,,p3534838,th-2126882688,,,,0,,,seg171,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",4023,
2024-12-16 22:08:13.136640 UTC,,,p3534856,th-2126882688,,,,0,,,seg171,,,,,"WARNING","01000","ic-proxy-server: received signal 3",,,,,,,0,,"ic_proxy_main.c",474,
2024-12-16 22:08:13.182293 UTC,,,p3534838,th-2126882688,,,,0,,,seg171,,,,,"LOG","00000","background worker ""ic proxy process"" (PID 3534856) exited with exit code 1",,,,,,,0,,"postmaster.c",4293,

As far as I can tell, this query never completed successfully, though it's hard to be certain from the logs since they're quite verbose. I did confirm with an engineer that this query was first executed on 12/16 (the first day our segments went down), that there's no record of it being executed on 12/17, and that it was executed again on 12/18, causing the same error and crash. This is a very large table with a lot of partitions; from a quick sampling each partition is around 100 GB+ in size, but I can get exact sizes if that's helpful for analysis.
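
For reference, per-partition sizes could be pulled with something along the lines of the query below. This is only a sketch: it assumes the table uses declarative partitioning so that pg_partition_tree() is available.

-- Sketch only: list leaf partitions of the (obfuscated) table by on-disk size.
SELECT relid AS partition
     , pg_size_pretty(pg_total_relation_size(relid)) AS total_size
  FROM pg_partition_tree('myschema.mytable')
 WHERE isleaf
 ORDER BY pg_total_relation_size(relid) DESC;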

I've attached the core dump analysis files for all 31 segments that went down during this time (this is directly related to issue #803), as provided by @edespino in his comment on #803 (comment). I've also provided the DDL to create the table, which I captured using gpbackup and then obfuscated, along with the query itself, in the attached files. Happy to provide more detail upon request. I've tested this in my local Cloudberry Docker environment but haven't been able to recreate the segfault; keep in mind that this test didn't include any data in the table, which I imagine would influence which steps the query execution plan actually takes.

segfault_artifacts.zip

What you think should happen instead

This query shouldn't cause a segmentation fault; it should just return its results when it finishes.

How to reproduce

I've tested this locally and haven't been able to reproduce the segfault.
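
For anyone attempting to reproduce this, the sketch below sets up a comparable table and runs the same query shape as the crash log. The table definition is a hypothetical stand-in for the obfuscated DDL: the column types, distribution key, and classic Greenplum-style partition clause are assumptions, not the real schema.

-- Hypothetical stand-in for the obfuscated table; adjust to match the real DDL.
CREATE TABLE myschema.mytable (
    col1  text,
    col14 timestamp,
    col17 text
)
DISTRIBUTED BY (col1)
PARTITION BY RANGE (col14)
    (START (date '2024-01-01') INCLUSIVE
     END   (date '2025-01-01') EXCLUSIVE
     EVERY (INTERVAL '1 month'));

-- Synthetic data so the segments actually do work during the window aggregation.
INSERT INTO myschema.mytable
SELECT 'key' || (g % 1000)
     , timestamp '2024-11-01' + (g || ' seconds')::interval
     , 'FOOBAR'
  FROM generate_series(1, 1000000) g;

-- Same query shape as the one captured in the crash log.
SELECT col1
     , col17
     , LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14)
     , col14
     , col14 - LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14) AS t
     , col14 - LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14) < '5 mins' AS v
  FROM myschema.mytable
 WHERE col14 >= '2024-11-01'
   AND col14 < '2024-12-01'
   AND col17 = 'FOOBAR';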

Operating System

Rocky Linux 8.10 (Green Obsidian)

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

@antoniopetrole added the type: Bug label on Dec 20, 2024
@antoniopetrole
Member Author

Tagging @foreyes per request

@antoniopetrole changed the title from "[Bug] Query Causes Segmentation Fault Twice" to "[Bug] Query Causes Segmentation Fault" on Dec 20, 2024
@edespino
Contributor

Cloudberry Database Interconnect Crash Analysis

Overview

A widespread interconnect crash occurred across a Cloudberry Database 1.6.0 cluster, affecting multiple segments. This analysis examines the crash patterns across the affected segments and hosts.

Would it be possible to share a couple of representative core files from this crash? The analysis shows a consistent pattern:

  1. All crashes are SIGABRT in the interconnect layer (rxThreadFunc)
  2. The crash propagated across 19 hosts in about 5.2 seconds
  3. 47 unique segments were affected, with some segments crashing multiple times

Having access to two core files would help with:

  • Identifying the specific assertion that's failing
  • Understanding the state of the interconnect at crash time
  • Examining relevant memory structures

Core Analysis Script Output and Source

Analysis Output

Starting core file analysis...

Core File Analysis Summary
==================================================

Hosts Affected: 19
10.9.1.115: 1 core files
  19:22:29 - seg19 (SIGABRT)
10.9.1.14: 5 core files
  19:22:28 - seg178 (SIGABRT)
  19:22:29 - seg177 (SIGABRT)
  19:22:30 - seg183 (SIGABRT)
  19:22:31 - seg181 (SIGABRT)
  19:22:32 - seg180 (SIGABRT)
10.9.1.140: 3 core files
  19:22:29 - seg70 (SIGABRT)
  19:22:31 - seg69 (SIGABRT)
  19:22:33 - seg71 (SIGABRT)
10.9.1.161: 2 core files
  19:22:28 - seg138 (SIGABRT)
  19:22:29 - seg143 (SIGABRT)
10.9.1.166: 5 core files
  19:22:28 - seg171 (SIGABRT)
  19:22:30 - seg174 (SIGABRT)
  19:22:31 - seg175 (SIGABRT)
  19:22:32 - seg173 (SIGABRT)
  19:22:33 - seg171 (SIGABRT)
10.9.1.168: 4 core files
  19:22:28 - seg143 (SIGABRT)
  19:22:29 - seg147 (SIGABRT)
  19:22:30 - seg146 (SIGABRT)
  19:22:31 - seg144 (SIGABRT)
10.9.1.175: 4 core files
  19:22:29 - seg131 (SIGABRT)
  19:22:30 - seg129 (SIGABRT)
  19:22:31 - seg128 (SIGABRT)
  19:22:32 - seg130 (SIGABRT)
10.9.1.176: 4 core files
  19:22:29 - seg44 (SIGABRT)
  19:22:30 - seg47 (SIGABRT)
  19:22:31 - seg45 (SIGABRT)
  19:22:32 - seg46 (SIGABRT)
10.9.1.185: 3 core files
  19:22:28 - seg175 (SIGABRT)
  19:22:29 - seg179 (SIGABRT)
  19:22:31 - seg177 (SIGABRT)
10.9.1.210: 2 core files
  19:22:28 - seg45 (SIGABRT)
  19:22:30 - seg51 (SIGABRT)
10.9.1.222: 5 core files
  19:22:29 - seg63 (SIGABRT)
  19:22:30 - seg62 (SIGABRT)
  19:22:31 - seg57 (SIGABRT)
  19:22:32 - seg61 (SIGABRT)
  19:22:33 - seg60 (SIGABRT)
10.9.1.232: 5 core files
  19:22:29 - seg51 (SIGABRT)
  19:22:30 - seg54 (SIGABRT)
  19:22:31 - seg55 (SIGABRT)
  19:22:32 - seg53 (SIGABRT)
  19:22:33 - seg51 (SIGABRT)
10.9.1.238: 3 core files
  19:22:28 - seg106 (SIGABRT)
  19:22:29 - seg107 (SIGABRT)
  19:22:30 - seg105 (SIGABRT)
10.9.1.240: 2 core files
  19:22:29 - seg180 (SIGABRT)
  19:22:30 - seg187 (SIGABRT)
10.9.1.242: 2 core files
  19:22:28 - seg15 (SIGABRT)
  19:22:30 - seg12 (SIGABRT)
10.9.1.246: 1 core files
  19:22:30 - seg108 (SIGABRT)
10.9.1.45: 2 core files
  19:22:29 - seg54 (SIGABRT)
  19:22:30 - seg57 (SIGABRT)
10.9.1.61: 1 core files
  19:22:29 - seg138 (SIGABRT)
10.9.1.83: 4 core files
  19:22:29 - seg163 (SIGABRT)
  19:22:30 - seg162 (SIGABRT)
  19:22:31 - seg161 (SIGABRT)
  19:22:32 - seg160 (SIGABRT)

Stack Trace Patterns
------------------------------
Pattern occurred 58 times:
  1. raise
  2. poll
  3. rxThreadFunc
  4. start_thread
  5. clone

Signal Distribution
------------------------------
SIGABRT: 58

Timing Analysis
------------------------------
First core: 19:22:28.073766
Last core:  19:22:33.310271
Duration:   5.237 seconds

Segment Distribution
------------------------------
seg105: 1 cores
seg106: 1 cores
seg107: 1 cores
seg108: 1 cores
seg12: 1 cores
seg128: 1 cores
seg129: 1 cores
seg130: 1 cores
seg131: 1 cores
seg138: 2 cores
seg143: 2 cores
seg144: 1 cores
seg146: 1 cores
seg147: 1 cores
seg15: 1 cores
seg160: 1 cores
seg161: 1 cores
seg162: 1 cores
seg163: 1 cores
seg171: 2 cores
seg173: 1 cores
seg174: 1 cores
seg175: 2 cores
seg177: 2 cores
seg178: 1 cores
seg179: 1 cores
seg180: 2 cores
seg181: 1 cores
seg183: 1 cores
seg187: 1 cores
seg19: 1 cores
seg44: 1 cores
seg45: 2 cores
seg46: 1 cores
seg47: 1 cores
seg51: 3 cores
seg53: 1 cores
seg54: 2 cores
seg55: 1 cores
seg57: 2 cores
seg60: 1 cores
seg61: 1 cores
seg62: 1 cores
seg63: 1 cores
seg69: 1 cores
seg70: 1 cores
seg71: 1 cores

Analysis Script

#!/usr/bin/env python3

import json
import datetime
from collections import defaultdict
from pathlib import Path

def load_core_analysis(json_file):
    """Load and parse a core analysis JSON file."""
    with open(json_file) as f:
        return json.load(f)

def extract_process_info(file_output):
    """Extract process info from core file output string."""
    info = {}
    try:
        parts = file_output.split("from 'postgres: ")[1].split("'")[0].strip().split()
        info['pid'] = parts[0].strip(',')
        conn_details = ' '.join(parts[1:])
        info['details'] = conn_details
        for part in conn_details.split():
            if 'seg' in part:
                info['segment'] = part
            elif 'slice' in part:
                info['slice'] = part
            elif 'cmd' in part:
                info['cmd'] = part
    except Exception:  # file_output didn't match the expected "postgres: ..." format
        info['details'] = 'Parse failed'
    return info

def analyze_stack_trace(stack_trace):
    """Analyze stack trace for key patterns."""
    if not stack_trace:
        return {}
    top_functions = [frame.get('function', 'unknown') for frame in stack_trace[:5]]
    return {
        'top_functions': top_functions,
        'crash_function': top_functions[0] if top_functions else 'unknown'
    }

def analyze_cores(base_dir):
    """Walk per-host directories of core-analysis JSON files and aggregate
    timing, stack-trace, signal, and per-segment crash information."""
    results = {
        'cores_by_host': defaultdict(list),
        'stack_patterns': defaultdict(int),
        'segments': defaultdict(list),
        'timing': [],
        'signals': defaultdict(int)
    }
    
    for host_dir in Path(base_dir).iterdir():
        if not host_dir.is_dir():
            continue
        host = host_dir.name
        for json_file in host_dir.glob("*.json"):
            analysis = load_core_analysis(json_file)
            proc_info = extract_process_info(analysis['file_info']['file_output'])
            timestamp = datetime.datetime.fromisoformat(analysis['timestamp'])
            
            results['timing'].append({
                'host': host,
                'timestamp': timestamp,
                'segment': proc_info.get('segment', 'unknown'),
                'file': json_file.name
            })
            
            stack_info = analyze_stack_trace(analysis.get('stack_trace', []))
            stack_key = tuple(stack_info.get('top_functions', ['unknown']))
            results['stack_patterns'][stack_key] += 1
            
            signal_info = analysis.get('signal_info', {})
            signal_name = signal_info.get('signal_name', 'Unknown')
            results['signals'][signal_name] += 1
            
            results['cores_by_host'][host].append({
                'file': json_file.name,
                'timestamp': timestamp,
                'process_info': proc_info,
                'signal': signal_name,
                'crash_function': stack_info.get('crash_function', 'unknown')
            })
            
            if 'segment' in proc_info:
                results['segments'][proc_info['segment']].append({
                    'host': host,
                    'timestamp': timestamp,
                    'file': json_file.name
                })
    
    return results

def print_analysis(results):
    """Print a human-readable summary of the aggregated core file analysis."""
    print("\nCore File Analysis Summary")
    print("=" * 50)
    
    print("\nHosts Affected:", len(results['cores_by_host']))
    for host, cores in sorted(results['cores_by_host'].items()):
        print(f"\n{host}: {len(cores)} core files")
        for core in sorted(cores, key=lambda x: x['timestamp']):
            print(f"  {core['timestamp'].strftime('%H:%M:%S')} - {core['process_info'].get('segment', 'unknown')} "
                  f"({core['signal']})")
    
    print("\nStack Trace Patterns")
    print("-" * 30)
    for stack, count in sorted(results['stack_patterns'].items(), key=lambda x: x[1], reverse=True):
        print(f"\nPattern occurred {count} times:")
        for i, func in enumerate(stack, 1):
            print(f"  {i}. {func}")
    
    print("\nSignal Distribution")
    print("-" * 30)
    for signal, count in sorted(results['signals'].items(), key=lambda x: x[1], reverse=True):
        print(f"{signal}: {count}")
    
    timestamps = [t['timestamp'] for t in results['timing']]
    if timestamps:
        print("\nTiming Analysis")
        print("-" * 30)
        min_time = min(timestamps)
        max_time = max(timestamps)
        duration = max_time - min_time
        print(f"First core: {min_time.strftime('%H:%M:%S.%f')}")
        print(f"Last core:  {max_time.strftime('%H:%M:%S.%f')}")
        print(f"Duration:   {duration.total_seconds():.3f} seconds")
    
    print("\nSegment Distribution")
    print("-" * 30)
    for segment, occurrences in sorted(results['segments'].items()):
        print(f"{segment}: {len(occurrences)} cores")

if __name__ == "__main__":
    print("Starting core file analysis...")
    results = analyze_cores('.')
    print_analysis(results)

@yjhjstz
Member

yjhjstz commented Dec 24, 2024

Can you provide the core file and postgres binary? @antoniopetrole

And the table DDL of myschema.mytable?

@antoniopetrole
Member Author

@yjhjstz Wow, I can't believe I forgot to upload the DDL; I've attached it to this comment. Also, I shared the core dump with @edespino directly rather than posting it here, since it can contain sensitive data.
segfault_table.zip
