ParallelRunner hangs on Linux Server #4176

Open
Dekermanjian opened this issue Sep 18, 2024 · 8 comments
Labels
Community Issue/PR opened by the open-source community

Comments

@Dekermanjian

Dekermanjian commented Sep 18, 2024

Description

I have a pipeline that I would like to run using the ParallelRunner. When I run this pipeline on my local Windows machine, it works just fine. However, when I run the exact same pipeline on a Linux server (Rocky Linux), it hangs at the loading datasets stage.

  • Kedro version used (pip show kedro or kedro -V): 0.19.8
  • Python version used (python -V): 3.11.8
  • Operating system and version: Rocky Linux version 8.10
@noklam
Contributor

noklam commented Sep 18, 2024

Can you provide some more context? If possible, please share a simplified version of the repository so that we can try to reproduce this locally.

merelcht added the "Community" label (Issue/PR opened by the open-source community) on Sep 18, 2024
@Dekermanjian
Author

Dekermanjian commented Sep 18, 2024

@noklam Yeah, of course. Let me try to put together something simple that will hang on the server and then I'll share the repo with you.

@Dekermanjian
Author

Dekermanjian commented Sep 18, 2024

@noklam Okay, I figured out why it is not working. I just don't understand why it doesn't work on Linux but it does on Windows. Here is a simple example: https://github.com/Dekermanjian/test-parallel-runner

It fails on the Linux server because I am loading a parquet file in my settings.py file. When I load that file in the simple example, the ParallelRunner hangs at the loading datasets stage. If you comment that line out (line 6), it works. You can generate the data by running the notebook I created.

Sorry, let me add the command to run: kedro run --runner=ParallelRunner -p data_processing

@Dekermanjian
Author

Hey @noklam, were you able to reproduce the issue? I am wondering if this is just happening on my end based on how the Linux server I am using is set up.

@ankatiyar
Contributor

Hey @Dekermanjian, I was trying to reproduce this, but I lack the input data. Would you be able to commit it to your test project repo or share it with us (assuming that it is sanitised and shareable)?

@Dekermanjian
Author

@ankatiyar, yes, I can do that now. I am sorry, I forgot to adjust the .gitignore before pushing. Okay, I have now pushed the data to the repo.

A couple of things I noticed while testing further: if I read the parquet file with pandas in settings.py, it does not hang. I can also read it with pandas and then convert it to a polars data frame, and it still does not hang. So the problem seems to be reading the file with polars. This is most likely a polars problem and not a kedro problem, but it would be good for someone else to reproduce it so that others who hit the same problem can refer to this issue for a quick fix.
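
For anyone looking for that quick fix, the workaround described above can be sketched roughly as follows. This is only an illustration: the function name `load_filter_frame` and its `path` argument are hypothetical and do not come from the test repository, and it assumes pandas, polars and pyarrow are installed.

```python
from pathlib import Path


def load_filter_frame(path):
    """Read a parquet file with pandas, then convert the result to polars.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, not code from the project discussed in this issue.
    """
    # Third-party imports live inside the function so that this module can
    # still be imported on machines where they are not installed.
    import pandas as pd
    import polars as pl

    # pandas performs the file read (it needs pyarrow or fastparquet for parquet).
    pandas_frame = pd.read_parquet(Path(path))
    # polars only converts data that is already in memory (this uses pyarrow).
    return pl.from_pandas(pandas_frame)
```

In settings.py this would stand in for the direct polars read, so that polars never touches the file before the ParallelRunner starts its worker processes.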

@ankatiyar
Contributor

@Dekermanjian thanks for the quick response. I have been able to reproduce it on Gitpod, which runs Linux, but it runs just fine locally on my Mac M1. It also works on Linux when I use pandas instead of polars. Just curious, what do you need to load the dataset in settings.py for?
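
One possible explanation for the Linux-only behaviour, not confirmed anywhere in this thread: on Python 3.11 the default multiprocessing start method is fork on Linux and spawn on Windows and macOS. A forked worker inherits the parent's memory but not its running threads, so a library that has already used a thread pool in the parent (as reading a file with polars in settings.py may do) can leave the child waiting on a lock forever. The sketch below only reports which start method a machine uses; the function name `describe_multiprocessing` is made up for this example.

```python
import multiprocessing as mp
import platform


def describe_multiprocessing() -> dict:
    """Report the platform and how new worker processes are started."""
    return {
        # For example "Linux", "Darwin" (macOS) or "Windows".
        "platform": platform.system(),
        # The start method used when no explicit context is requested.
        "default_start_method": mp.get_start_method(),
        # Every start method this platform supports.
        "available_start_methods": mp.get_all_start_methods(),
    }


if __name__ == "__main__":
    for key, value in describe_multiprocessing().items():
        print(f"{key}: {value}")
```

Running this on both the machine that hangs and the one that works would show whether they differ in start method, which is a cheap first check before digging into polars itself.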

@Dekermanjian
Author

Okay, perfect! That is also what I am experiencing. In my actual project, I create a dynamic pipeline that runs a model on patient level data every hour (one pipeline per patient). Some patients don't have any new data between hours so I read in a file in settings.py that is generated in a previous kedro pipeline to filter out any patients that have not had any new data within the new hour.

Thank you for taking the time to reproduce this issue, @ankatiyar.
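
The hourly filtering described above could look roughly like this in plain Python. It is a minimal sketch under stated assumptions: the function name, the `last_updated` mapping and the patient IDs are all invented for illustration, and the real project reads this information from a parquet file produced by an earlier pipeline.

```python
from datetime import datetime, timedelta


def patients_with_new_data(
    last_updated: dict[str, datetime],
    now: datetime,
    window: timedelta = timedelta(hours=1),
) -> list[str]:
    """Return the IDs whose most recent record falls inside the last `window`."""
    cutoff = now - window
    return sorted(pid for pid, ts in last_updated.items() if ts >= cutoff)


# Illustrative values only; these are not real patients or timestamps.
now = datetime(2024, 9, 18, 12, 0)
last_updated = {
    "patient_a": datetime(2024, 9, 18, 11, 30),  # new data within the hour
    "patient_b": datetime(2024, 9, 18, 9, 15),   # no new data, so skipped
}
active = patients_with_new_data(last_updated, now)
print(active)  # ['patient_a']
```

Only the IDs returned here would then get a per-patient pipeline, which is why the lookup has to happen before the pipelines are registered.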


4 participants