Crashing under Gentoo 6.10.6 #16502

Open
mauricev opened this issue Sep 3, 2024 · 12 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@mauricev

mauricev commented Sep 3, 2024

I have the most recent zfs-9999 (which I assume is a recent GitHub snapshot of ZFS) installed on Gentoo x64 with gentoo-sources kernel 6.10.6. I have one mirrored zpool. I am seeing occasional kernel panics.

Screenshot 2024-08-30 at 2 16 48 PM

Screenshot 2024-09-03 at 7 10 35 PM

The pool passes scrub with no errors.

I'm not sure of any other way to document the panics. I have another, similar, slightly older system running 6.10.5, and it has yet to crash.

@mauricev mauricev added the Type: Defect Incorrect behavior (e.g. crash, hang) label Sep 3, 2024
@robn
Member

robn commented Sep 4, 2024

@mauricev can you please update this with the information requested in the issue template? We need that info to begin triage.

@mauricev
Author

mauricev commented Sep 4, 2024

Distribution Name | Gentoo
Distribution Version | emerged in the last week
Kernel Version | 6.10.6-gentoo-x86_64
Architecture | x64
OpenZFS Version | zfs-2.2.99-529_g23a489a41, zfs-kmod-2.2.99-687_gb3b749161
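
For anyone gathering the same fields on their own system, the following is a minimal sketch rather than an official OpenZFS tool. `collect_report` is a hypothetical helper name. It assumes `/etc/os-release` is present and, when the `zfs` command is installed, uses `zfs version` to print the userland and kernel-module versions.

```shell
#!/bin/sh
# Sketch: print the fields the issue template asks for.
# collect_report is a hypothetical helper name, not part of OpenZFS.
collect_report() {
    distro="unknown"
    version="unknown"
    if [ -r /etc/os-release ]; then
        # os-release is a shell-compatible KEY=value file; source it in a
        # subshell so its variables do not leak into the caller.
        distro="$(. /etc/os-release && printf '%s' "${NAME:-unknown}")"
        version="$(. /etc/os-release && printf '%s' "${VERSION_ID:-unknown}")"
    fi
    printf 'Distribution Name | %s\n' "$distro"
    printf 'Distribution Version | %s\n' "$version"
    printf 'Kernel Version | %s\n' "$(uname -r)"
    printf 'Architecture | %s\n' "$(uname -m)"
    if command -v zfs >/dev/null 2>&1; then
        # Join the userland and kmod version lines onto a single line.
        printf 'OpenZFS Version | %s\n' "$(zfs version 2>/dev/null | paste -sd' ' -)"
    else
        printf 'OpenZFS Version | zfs command not found\n'
    fi
}

collect_report
```

The output can be pasted directly into the issue in the same `Field | Value` layout used above.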

@robn
Member

robn commented Sep 4, 2024

Are you doing something in particular at the time this happens? I see docker and overlay filesystems are in play; does this coincide with a particular docker activity? You say "occasional", so it would be nice to line this up with a particular action.

Did anything change on this system recently? You mention another system on 6.10.5, and a recent upgrade. Was this system running that older kernel before? Did this happen then? If you're able to downgrade your kernel, I'd be interested to see if anything changes there.

Are you able to try with a release OpenZFS (say 2.2.5)?

@mauricev
Author

mauricev commented Sep 4, 2024

This is a new installation replacing an older system. The new system has a new, third Docker container, and building the image for it triggers this crash about 50% of the time. However, it happened again today while I was not building a Docker image. There is another nearly identical server running 6.10.5, but without Docker, and it has not crashed so far. Could a low-memory condition trigger ZFS to crash this way? I think I would have to revert the kernel version to install zfs 2.2.5.
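
One way to gather evidence for or against the low-memory theory is to log available memory and ARC size while the build runs. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the Linux `/proc/meminfo` file and the OpenZFS kstat file `/proc/spl/kstat/zfs/arcstats` are readable on the system. It records values only and does not by itself show whether low memory causes the panic.

```shell
#!/bin/sh
# Hypothetical helpers. Each takes an optional file path so it can also be
# pointed at a saved copy of the file.

# Print MemAvailable in kB from a meminfo-style file.
mem_available_kb() {
    awk '$1 == "MemAvailable:" { print $2 }' "${1:-/proc/meminfo}"
}

# Print the ARC size in bytes from an arcstats-style file
# (rows have the form: name type data).
arc_size_bytes() {
    awk '$1 == "size" { print $3 }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Print one timestamped line every N seconds (default 5) until interrupted.
log_memory() {
    interval="${1:-5}"
    while :; do
        printf '%s MemAvailable=%s kB ARC=%s bytes\n' \
            "$(date '+%F %T')" "$(mem_available_kb)" "$(arc_size_bytes)"
        sleep "$interval"
    done
}

# Example (the log path is illustrative):
#   log_memory 2 >> /var/tmp/mem.log &
```

Because a panic can lose the last unflushed lines of a local file, sending the output to another host or a serial console is likely to be more reliable.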

@snajpa
Contributor

snajpa commented Sep 4, 2024

Can you please share the build command/Dockerfile? 50% luck is good enough, I'll give it a try (need to make sure 6.10 runs well, we'd like to move onto it at vpsFree)

@mauricev
Author

mauricev commented Sep 4, 2024

# Use Python 3.12-slim as the base image
FROM python:3.12-slim

# Set the working directory in the container
WORKDIR /app

# Install dependencies for OpenCV, libgthread, and CA certificates
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libglib2.0-0 \
    cron \
    procps \
    curl \
    ca-certificates \
    && apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Copy the current directory contents into the container at /app
COPY . /app

# Upgrade pip and setuptools to ensure compatibility with SSL/TLS
RUN pip install --upgrade pip setuptools

# Install Python dependencies using pip and the requirements.txt file
RUN pip install --no-cache-dir --trusted-host pypi.org --trusted-host files.pythonhosted.org --no-cache -r requirements.txt

# Create the uploadedImages directory if it doesn't exist
RUN mkdir -p /app/static/uploadedImages

# Expose the port that your Flask app will run on internally
EXPOSE 5000

# Define environment variable for Flask
ENV FLASK_APP=main.py

# Copy the start script into the container
COPY start.sh /app/start.sh

# Set the entry point to the start script
ENTRYPOINT ["/app/start.sh"]

@snajpa
Contributor

snajpa commented Sep 6, 2024

I had no luck: I left it running in a loop overnight with no crash :( Can you describe how the pool and datasets are set up? What properties are set, etc.?
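
For others who want to try the same kind of repeated build, here is a rough sketch of such a loop. It is not the script used above. `run_until_failure` is a hypothetical helper, and the image tag in the example is illustrative. Note that a kernel panic will not produce an exit status, so the loop mainly serves to keep the workload running and to show which iteration was in progress.

```shell
#!/bin/sh
# run_until_failure is a hypothetical helper: run a command up to N times,
# stopping at the first non-zero exit status.
run_until_failure() {
    max="$1"
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "iteration $i of $max"
        if ! "$@"; then
            echo "command failed on iteration $i" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $max iterations without failure"
}

# Example (assumes docker is installed and the Dockerfile is in the
# current directory; the tag name is illustrative):
#   run_until_failure 50 docker build --no-cache -t zfs-repro .
```

Passing `--no-cache` to `docker build` makes each iteration rebuild every layer, so the `pip install` steps actually run each time instead of being served from the build cache.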

@mauricev
Author

mauricev commented Sep 7, 2024

  pool: spool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:38 with 0 errors on Fri Aug 30 12:14:01 2024
config:

	NAME        STATE     READ WRITE CKSUM
	spool       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    vdb     ONLINE       0     0     0
	    vdc     ONLINE       0     0     0

errors: No known data errors
NAME               USED  AVAIL  REFER  MOUNTPOINT
spool             7.13G  7.89G    96K  /spool
spool/docker      1.91G  7.89G  1.91G  /spool/docker
spool/kevin       93.4M  7.89G  93.4M  /var/www/einsteinmedneuroscience/kevin
spool/mysql-else   382M  7.89G   374M  /spool/mysql-else
spool/mysql-wp    1.97G  7.89G  1.61G  /spool/mysql-wp
spool/odes         269M  7.89G   269M  /spool/odes
spool/wp          2.50G  7.89G  2.49G  /var/www/localhost/htdocs/neuroscience
NAME   PROPERTY                       VALUE                          SOURCE
spool  size                           15.5G                          -
spool  capacity                       45%                            -
spool  altroot                        -                              default
spool  health                         ONLINE                         -
spool  guid                           7238931578768070567            -
spool  version                        -                              default
spool  bootfs                         -                              default
spool  delegation                     on                             default
spool  autoreplace                    off                            default
spool  cachefile                      -                              default
spool  failmode                       wait                           default
spool  listsnapshots                  off                            default
spool  autoexpand                     off                            default
spool  dedupratio                     1.00x                          -
spool  free                           8.37G                          -
spool  allocated                      7.13G                          -
spool  readonly                       off                            -
spool  ashift                         12                             local
spool  comment                        -                              default
spool  expandsize                     -                              -
spool  freeing                        0                              -
spool  fragmentation                  16%                            -
spool  leaked                         0                              -
spool  multihost                      off                            default
spool  checkpoint                     -                              -
spool  load_guid                      7771301092460470127            -
spool  autotrim                       off                            default
spool  compatibility                  off                            default
spool  bcloneused                     0                              -
spool  bclonesaved                    0                              -
spool  bcloneratio                    1.00x                          -
spool  feature@async_destroy          enabled                        local
spool  feature@empty_bpobj            active                         local
spool  feature@lz4_compress           active                         local
spool  feature@multi_vdev_crash_dump  enabled                        local
spool  feature@spacemap_histogram     active                         local
spool  feature@enabled_txg            active                         local
spool  feature@hole_birth             active                         local
spool  feature@extensible_dataset     active                         local
spool  feature@embedded_data          active                         local
spool  feature@bookmarks              enabled                        local
spool  feature@filesystem_limits      enabled                        local
spool  feature@large_blocks           enabled                        local
spool  feature@large_dnode            enabled                        local
spool  feature@sha512                 enabled                        local
spool  feature@skein                  enabled                        local
spool  feature@edonr                  enabled                        local
spool  feature@userobj_accounting     active                         local
spool  feature@encryption             enabled                        local
spool  feature@project_quota          active                         local
spool  feature@device_removal         enabled                        local
spool  feature@obsolete_counts        enabled                        local
spool  feature@zpool_checkpoint       enabled                        local
spool  feature@spacemap_v2            active                         local
spool  feature@allocation_classes     enabled                        local
spool  feature@resilver_defer         enabled                        local
spool  feature@bookmark_v2            enabled                        local
spool  feature@redaction_bookmarks    enabled                        local
spool  feature@redacted_datasets      enabled                        local
spool  feature@bookmark_written       enabled                        local
spool  feature@log_spacemap           active                         local
spool  feature@livelist               enabled                        local
spool  feature@device_rebuild         enabled                        local
spool  feature@zstd_compress          enabled                        local
spool  feature@draid                  enabled                        local
spool  feature@zilsaxattr             active                         local
spool  feature@head_errlog            active                         local
spool  feature@blake3                 enabled                        local
spool  feature@block_cloning          enabled                        local
spool  feature@vdev_zaps_v2           active                         local
spool  feature@redaction_list_spill   enabled                        local
spool  feature@raidz_expansion        enabled                        local

I ran the docker build twice and it crashed again once. When I run the same command on the staging server it never crashes, but that server has only btrfs disks.

@snajpa
Contributor

snajpa commented Sep 7, 2024

Maybe it's something in requirements.txt? start.sh was also missing; I worked around both by touching empty files, but perhaps something is happening with the Python stuff. Could you please supply those too?

@mauricev
Author

mauricev commented Sep 7, 2024

start.sh

#!/bin/bash

# Function to handle SIGTERM
trap 'kill -TERM $PID' TERM INT

# Start cron in the background
/usr/sbin/cron &

# Start gunicorn
gunicorn -w 4 -b 0.0.0.0:5000 main:app &

# Capture PID of gunicorn
PID=$!

# Wait for gunicorn process
wait $PID

requirements.txt

numpy
pillow
opencv-python-headless
pandas
ultralytics
flask
gunicorn
torch==2.4.0

Come to think of it, the crash always occurs during the processing of requirements.

@snajpa
Contributor

snajpa commented Sep 8, 2024

I left it looping for 8 hours, nothing :(

@snajpa
Contributor

snajpa commented Sep 8, 2024

Uh, I enabled block cloning in the hope of reproducing this, since it could perhaps be related, only to see endless txg syncs with no data written whatsoever. OK, got it: I'm keeping that feature off and recreating my dev pool now :-D


3 participants