Improving package setup to pyproject.toml #803

Open
IFFranciscoME opened this issue Oct 31, 2024 · 1 comment


IFFranciscoME commented Oct 31, 2024

Description

Improve the packaging approach by moving to a pyproject.toml-based setup.

Context

A Python project is commonly used in one of two ways: directly, mostly by cloning the repository, or in a compact form, mostly through packaging and/or containerization. In either case, for the same version of the software, the same functionality should be available on any given system, regardless of how it was installed.

Problem

When the user takes the package/container route to install the software, dependency issues arise that in some cases are not straightforward to solve (more are yet to be mapped). A non-exhaustive list of these problems:

  • Python version and sub-version compatibility.
  • PYTHONPATH values may not be updated programmatically.
  • The order of dependency installs might break inter-dependencies.
  • Dependencies for downloading data might conflict with those for running workloads.
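
As a rough illustration of the first item, a script could check the interpreter version before doing anything else. This is only a sketch: the supported range below is a hypothetical placeholder, since this issue does not state which Python versions the project supports. In a pyproject.toml setup the same constraint would normally be declared through the `requires-python` field.

```python
import sys

# Hypothetical supported range; the real bounds are not stated in this issue.
MIN_PYTHON = (3, 8)
MAX_PYTHON_EXCLUSIVE = (3, 12)


def python_version_supported(version=None,
                             minimum=MIN_PYTHON,
                             maximum_exclusive=MAX_PYTHON_EXCLUSIVE):
    """Return True if (major, minor) lies in [minimum, maximum_exclusive)."""
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return minimum <= major_minor < maximum_exclusive
```

Calling `python_version_supported()` with no arguments checks the running interpreter, so an entry point could fail early with a clear message instead of breaking later during dependency resolution.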

Elements for the solution (Draft)

In general, this might be a good opportunity to update the packaging from a setup.py-oriented approach to a pyproject.toml approach.

PEP 518 – Specifying Minimum Build System Requirements for Python Projects

This PEP specifies how Python software packages should specify what build dependencies they have in order to execute their chosen build system. As part of this specification, a new configuration file is introduced for software packages to use to specify their build dependencies (with the expectation that the same configuration file will be used for future configuration details).
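
To make this concrete, here is a minimal sketch of a `[build-system]` table and how a tool could read it. The choice of setuptools as the backend and the `setuptools>=61` bound are assumptions for illustration, not a decision made in this issue. The `requires` key comes from PEP 518, and `build-backend` was added later by PEP 517.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumed backport on older interpreters

# Illustrative pyproject.toml content; backend and version bound are assumptions.
PYPROJECT_TEXT = """
[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"
"""


def build_requirements(pyproject_text):
    """Return (build requirements, build backend) declared in pyproject.toml text."""
    data = tomllib.loads(pyproject_text)
    build_system = data.get("build-system", {})
    return build_system.get("requires", []), build_system.get("build-backend")
```

A front end such as pip installs the `requires` entries into an isolated build environment before invoking the backend, which is why build-time dependencies no longer have to be present in the user's environment.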

Externally Managed Environments

This specification allows a Python installation to indicate to Python-specific tools such as pip that they should neither install nor remove packages in the interpreter's default installation environment.
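
A setup script could detect this situation up front and ask the user to create a virtual environment. The sketch below is an assumption about how such a check might look in this project. It relies on the `EXTERNALLY-MANAGED` marker file that the specification places in the interpreter's stdlib directory, and on the usual `sys.prefix` versus `sys.base_prefix` comparison to detect a virtual environment.

```python
import sys
import sysconfig
from pathlib import Path


def in_virtual_environment():
    """True when running inside a venv (prefix differs from the base prefix)."""
    return sys.prefix != sys.base_prefix


def externally_managed_marker():
    """Path where the EXTERNALLY-MANAGED marker would be located."""
    return Path(sysconfig.get_path("stdlib")) / "EXTERNALLY-MANAGED"


def is_externally_managed():
    """True when installing into this environment would likely be refused by pip."""
    return not in_virtual_environment() and externally_managed_marker().is_file()
```

If `is_externally_managed()` returns True, the installer could stop and print instructions for creating a virtual environment rather than failing midway through the dependency installs.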

Some other details might be useful:

  • Every package should now be strictly necessary (that is, really hard to avoid depending on).
  • Use exact versions for every single package.
  • Specify an explicit installation order, without using cached installed versions.
  • Make sure the latest version of pip is installed.
  • Use installation flags for Python packages.
  • Compile/store the package's wheel (offline file) for each particular architecture.
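
Several of these points can be combined in a small helper that builds the pip commands without running them. This is a sketch under assumptions: the package names are hypothetical placeholders, the pin check is a simplified regular expression rather than a full requirement parser, and `--no-cache-dir` is just one example of an installation flag that could be used.

```python
import re
import sys

# Simplified check for "name==version" (optionally "name[extra]==version").
_EXACT_PIN = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[A-Za-z0-9._,-]+\])?==[A-Za-z0-9.!+_-]+$"
)


def is_exact_pin(requirement):
    """True if the requirement string pins one exact version with '=='."""
    return bool(_EXACT_PIN.match(requirement.strip()))


def pip_install_commands(requirements):
    """Build pip commands: upgrade pip first, then one install per requirement.

    The commands keep the order of `requirements` and skip pip's cache.
    """
    unpinned = [r for r in requirements if not is_exact_pin(r)]
    if unpinned:
        raise ValueError(f"requirements without an exact version: {unpinned}")
    upgrade_pip = [sys.executable, "-m", "pip", "install", "--upgrade", "pip"]
    installs = [
        [sys.executable, "-m", "pip", "install", "--no-cache-dir", requirement]
        for requirement in requirements
    ]
    return [upgrade_pip] + installs
```

Each returned command could then be executed with `subprocess.run(command, check=True)`, which stops at the first failing install instead of leaving a half-installed environment.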

IFFranciscoME (Author) commented

OK, @priyakasimbeg, here is my proposal of more actionable items to start the first phase of the refactoring. It actually targets the dataset installation/downloading process, which comes before the overall project.

Problems:

  • Dependencies across datasets might diverge over time.
  • Dependencies across datasets and the submission script, or any other script for that matter, might diverge over time.
  • Pre-processing for one dataset might differ from that of another.
  • Changes and updates to the understanding of each dataset might happen at a different pace.

Improvement opportunities:

  1. Move from a monolithic dataset config to a per-dataset config.
  2. Move from a dataset_setup.py config to pyproject.toml logic.
  3. Extend/update with the following:
    1. Keep the ~/data/ and ~/temp/data local folder creation.
    2. Define a pyproject.toml file for all datasets.
    3. Within pyproject.toml specify a dependency list for each dataset.
    4. Create a sub-folder per dataset:
      1. Create/relocate/expand downloading, pre-processing, and EDA scripts.
      2. A README.md for each dataset with some of the following:
        1. Official Name, Creator, License.
        2. Data and File Structure.
        3. Exact Full Size (decompressed, plus worst-case scenario).
        4. Other extra details.
  4. Environment and execution complements:
    1. (Good to have) Instructions to avoid terminal locking (a new tab, locked max-resources, tmux).
    2. (Good to have) Progress indication in the terminal.
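
One possible way to express items 3.2 and 3.3 is to give each dataset its own group under `[project.optional-dependencies]`. The sketch below is hypothetical throughout: the project name, the dataset names (`dataset-a`, `dataset-b`), and the pinned packages are placeholders, not the actual datasets or dependencies of this repository.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumed backport on older interpreters

# Hypothetical pyproject.toml for the datasets; every name here is a placeholder.
PYPROJECT_TEXT = """
[project]
name = "example-datasets"
version = "0.1.0"
requires-python = ">=3.8"
dependencies = ["example-common==1.0.0"]

[project.optional-dependencies]
dataset-a = ["example-downloader==2.1.0"]
dataset-b = ["example-audio-tools==0.9.3", "example-downloader==2.1.0"]
"""


def dataset_requirements(pyproject_text, dataset):
    """Return shared dependencies followed by those specific to one dataset."""
    project = tomllib.loads(pyproject_text)["project"]
    extras = project.get("optional-dependencies", {})
    if dataset not in extras:
        raise KeyError(f"no dependency group for dataset {dataset!r}")
    return list(project.get("dependencies", [])) + list(extras[dataset])
```

With a layout like this, `pip install ".[dataset-a]"` would install only the shared dependencies plus those of one dataset, so the requirements of one dataset cannot conflict with another's or with the workload dependencies.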
