Option to select shortest item in the duplicate list for process.dedupe #76

Open
wants to merge 2 commits into
base: master

@smtodd commented Sep 1, 2024

When thefuzz.process.dedupe deduplicates a list and assigns a single name to each entity, it currently always selects the longest item in each group of duplicates. This PR lets the caller choose between the longest and the shortest name. It makes the following changes:

  1. Adds an optional parameter "len_selector" to process.dedupe. Possible values are "longest" and "shortest". "longest" is the default and preserves the current behavior; "shortest" selects the shortest item in each group of duplicates. This is useful when the shortest item carries the most general information about the entity and therefore covers the others, for example a department name in a university.
  2. Defines the behavior for both the "longest" and "shortest" values.
  3. Documents the parameter in the docstring.
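
To illustrate the selection rule described above, here is a minimal standalone sketch. It does not reproduce the PR's diff: the helper name `pick_representative`, the tie-breaking by string comparison, and the `ValueError` on unknown values are all assumptions made for the example. Only the parameter name "len_selector" and its two values come from the description.

```python
def pick_representative(duplicates, len_selector="longest"):
    """Hypothetical helper: pick one name from a group of fuzzy duplicates.

    Mirrors the "longest" / "shortest" choice described in this PR, but is
    not the PR's actual implementation.
    """
    # Sort key: length first, then the string itself to make ties deterministic
    # (the tie-breaking rule is an assumption for this sketch).
    key = lambda s: (len(s), s)
    if len_selector == "longest":
        return max(duplicates, key=key)
    if len_selector == "shortest":
        return min(duplicates, key=key)
    # Rejecting unknown values is also an assumption, not stated in the PR.
    raise ValueError('len_selector must be "longest" or "shortest"')


# A group of names that a fuzzy matcher might treat as the same entity.
group = ["Physics", "Physics Department", "Department of Physics and Astronomy"]

longest = pick_representative(group)               # default behavior
shortest = pick_representative(group, "shortest")  # proposed alternative
```

Here `longest` is "Department of Physics and Astronomy" and `shortest` is "Physics", the more general name. With the PR applied, the corresponding call would presumably take the form process.dedupe(names, len_selector="shortest").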
