Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LintFilesRequest gets confused with a file named enums.py #21369

Open
wallace11 opened this issue Sep 2, 2024 · 17 comments
Open

LintFilesRequest gets confused with a file named enums.py #21369

wallace11 opened this issue Sep 2, 2024 · 17 comments
Assignees
Labels

Comments

@wallace11
Copy link

wallace11 commented Sep 2, 2024

Describe the bug
If a file has the name enums.py anywhere in the repository, the lint goal for any linter that uses LintFilesRequest will run twice.

I was able to reproduce this in a clean environment with the following tree:

├── pants.toml
├── regex-lint.yaml
└── src
    ├── BUILD
    ├── enums.py
    └── file.py

Using regex-lint as an example (since it uses LintFilesRequest):

pants.toml

[GLOBAL]
pants_version = "2.21.0"
backend_packages = [
    "pants.backend.python",
    "pants.backend.project_info"
]

[regex-lint]
config = "@regex-lint.yaml"

regex-lint.yaml

required_matches:
  python_file:
    - hello

path_patterns:
  - name: python_file
    pattern: \.py$

content_patterns:
  - name: hello
    pattern: ^hello

BUILD

python_sources()

enums.py & file.py are empty.

Output of pants lint :::

12:55:47.66 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/enums.py: Didn't match: hello

0 files matched all required patterns.
1 files failed to match at least one required pattern.


12:55:47.66 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/file.py: Didn't match: hello

0 files matched all required patterns.
1 files failed to match at least one required pattern.

✕ regex-lint failed.

Pants version
2.21.0

OS
Linux

Additional info
At some point I was able to also get a file with the name rules.py in a custom plugin to cause this behavior but I'm unable to reliably reproduce this. It might be related to the order at which the files are collected or something along these lines but I haven't looked further.

Edit:

  1. Fixed a typo
  2. Fixed output of lint to reflect the stripped-down regex-lint.yaml (I previously used the version from my production environment)
@wallace11 wallace11 added the bug label Sep 2, 2024
@benjyw benjyw self-assigned this Sep 3, 2024
@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 3, 2024

Holy smokes, that is weird and bad. Thanks for the report and the repro.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 5, 2024

So this isn't quite so bad: this is in fact to do with partitioning. If you look at the output you can see that it's running on src/enums.py separately from src/file.py, but if you rename enums.py then it runs them both together. But it is not doing the same work twice.

Of course why the names should affect the partitioning is a matter still to be looked into...

@krishnan-chandra
Copy link
Contributor

I tracked this down and the reason this happens is that the partitioning is based on a hash of the file name:

https://github.com/pantsbuild/pants/blob/main/src/python/pants/core/goals/lint.py#L433
https://github.com/pantsbuild/pants/blob/main/src/python/pants/util/collections.py#L102-L152

I converted the configuration in the OP into a reproducer shell script:

#!/bin/bash

set -e

cd "$(mktemp -d)"
echo "Working directory: $(pwd)"

# Create pants.toml
cat >pants.toml <<EOF
[GLOBAL]
pants_version = "2.21.0"
backend_packages = [
    "pants.backend.python",
    "pants.backend.project_info"
]

[regex-lint]
config = "@regex-lint.yaml"
EOF

# Create regex-lint.yaml
cat >regex-lint.yaml <<EOF
required_matches:
  python_file:
    - hello
path_patterns:
  - name: python_file
    pattern: \.py$
content_patterns:
  - name: hello
    pattern: ^hello
EOF

# Create project structure
mkdir -p src
cat >src/BUILD <<EOF
python_sources()
EOF

# Create empty Python files
touch src/enums.py
touch src/file.py

# Run pants lint
echo "Running 'pants lint ::'"
# This is to run Pants with some custom log messages that I added
PANTS_SOURCE=~/Projects/pants pants lint ::

echo "Reproducer completed."

If I run the reproducer script using the project configuration in the OP, I get the following output:

./reproducer.sh

Working directory: /var/folders/0w/zhpm9fs12zzcmt3m5799xb9c0000gn/T/tmp.CmgNoR7YDn
Running 'pants lint ::'
Pantsd has been turned off via Env.

23:02:18.89 [INFO] Emitting new batch of size 1
23:02:18.89 [INFO] Items in batch: ['src/enums.py']
23:02:18.89 [INFO] Emitting final batch of size 1
23:02:18.89 [INFO] Items in batch: ['src/file.py']
23:02:18.89 [INFO] Number of batches per request type: {'RegexLintRequest': 2}
23:02:18.89 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/file.py: Didn't match: hello

0 files matched all required patterns.
1 files failed to match at least one required pattern.


23:02:18.89 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/enums.py: Didn't match: hello

0 files matched all required patterns.
1 files failed to match at least one required pattern.

However, if I change the name of src/enums.py to src/new_enums.py, then I get:

./reproducer.sh
Working directory: /var/folders/0w/zhpm9fs12zzcmt3m5799xb9c0000gn/T/tmp.pPg07nnA5u
Running 'pants lint ::'
Pantsd has been turned off via Env.
23:04:13.82 [INFO] Emitting final batch of size 2
23:04:13.82 [INFO] Items in batch: ['src/file.py', 'src/new_enums.py']
23:04:13.82 [INFO] Number of batches per request type: {'RegexLintRequest': 1}
23:04:13.82 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/file.py: Didn't match: hello
X src/new_enums.py: Didn't match: hello

0 files matched all required patterns.
2 files failed to match at least one required pattern.


✕ regex-lint failed.

I don't think there is anything special about the name enums.py, except that it happens to have hash characteristics that shunt it into a separate partition. I suspect there may be a better way to partition files (based on source roots, maybe?) that is more consistent and doesn't rely on the hash of the file name falling into a specific range.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 8, 2024

Good detection @krishnan-chandra! Still unclear to me why we partition at all in this case.

Note that partitioning this way ensures balanced partitions, which has its own advantages. Also, for anyone poking around the code, I think this is to do with "batching" (which is to prevent passing too many files to a single process) rather than "partitioning" (which has to do with files that cannot be in the same process because they require different settings in the underlying tool).

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 8, 2024

So this is partly a UX issue, we are not making it clear to the user that batching is happening, so it looks like the same work is being done multiple times.

But also, why are we batching at all in the case of two files...

@krishnan-chandra
Copy link
Contributor

The batching code always runs, no matter how many files there are: https://github.com/pantsbuild/pants/blob/main/src/python/pants/core/goals/lint.py#L430-L448

I suppose we could set some threshold / lower bound (maybe 50-100 files?) below which we don't bother with the batching code at all.

@wallace11
Copy link
Author

If a single partition is used, I'd expect batching to be skipped.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 9, 2024

How does the batching code determine the number of batches? I would have assumed it chooses the smallest number of batches so that each batch will have a max of N files for some N. But I guess not?

@krishnan-chandra
Copy link
Contributor

It doesn't seem that the batching code necessarily creates a specific number of batches - instead, when iterating over the list of items to batch there is a boundary condition for starting a new batch. The only guarantee made is that the batch size will not be larger than the provided size_max.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 11, 2024

Hrm, so this is probably fine for large numbers of input files, but leads to kinda silly behavior on small numbers of input files...

In any case, important to emphasize that work is not being duplicated. It's just that work is being spread over more processes than is necessary.

@krishnan-chandra
Copy link
Contributor

Yeah. I think there is no bug here, so we can close this issue if no objections @wallace11. While we could potentially optimize the batching behavior a bit further, I'm not sure that necessarily needs to be done here.

@wallace11
Copy link
Author

@krishnan-chandra
The plugin I was working on while discovering this bug requires getting the entire file tree in one go, so while it might appear as a cosmetic bug (still a bug, though...), it's got some real functional implications.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 15, 2024

Ah, you are free not to use batching in your own plugin. You can request all the files directly (without using LintFilesRequest, which does make assumptions about batching).

@wallace11
Copy link
Author

Ah, you are free not to use batching in your own plugin. You can request all the files directly (without using LintFilesRequest, which does make assumptions about batching).

I believe that your suggestion is not possible.
Linting rule doesn't work if the request is not LintFilesRequest.Batch.
The bare minimum partition rule that I could get working is:

@rule
async def partition_inputs(
    request: MyLintFilesRequest.PartitionRequest,
    subsystem: MyLintSubsystem,
) -> Partitions[str, Any]:
    if subsystem.skip:
        return Partitions()
    return Partitions.single_partition(sorted(request.files))

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 19, 2024

My point is that LintFilesRequest is just a convenience for linters, and one of the main conveniences it provides is batching. But since you don't want that at all, you can just use one of the underlying constructs.

For example, if you want python files you can do:

await Get(
        PythonSourceFiles, PythonSourceFilesRequest(..list of targets.., include_files=True)
    )

Or if you just want files in general, regardless of any BUILD file targets, you can use Globs directly.

A big question is: can your linter run on a subset of the repo (based on the cli arguments) or must it always run on every file?

@wallace11
Copy link
Author

I now understand what you mean.
I guess it's indeed possible to ditch LintFilesRequest and use something else, however as you already assumed, this would make it more difficult to use targets.

The plugin I developed is a "health check" that makes sure that certain files that have to be somewhere are indeed there. It can check the entire repository or just a subset of it.
Right now, my plugin is working just fine using LintFilesRequest and it can work on any target provided to it because I receive that "for free".
Depending on where the cut-off happens, it might interfere with the plugin's behavior (because it's checking nearby files).

All in all, I don't think this is a critical bug, but I personally wouldn't dismiss it as "working as intended" because there's clearly a quirk in the system here.

Hopefully you'd consider an improvement to this mechanism in the future :)

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 25, 2024

Hmm, I see. So I think your use case is not really a linter. The LintFilesRequest mechanism envisioned cases where all the information you need to lint a file is in that file (and maybe its transitive dependencies). This seems like something else.

Can you elaborate by example? That is, what are the inputs of your health checker and how do they map to outputs? I have a feeling that doing this with Globs might be more correct for more than just this reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants