Inconsistent behavior of proximity search (~N) in Whoosh based on word type #48

Open
bbernicker opened this issue Sep 25, 2023 · 0 comments


Description:
The proximity search (~N) in Whoosh behaves inconsistently depending on which words appear in the indexed document. Single-letter terms and common filler words appear to be ignored when the distance between the query terms is measured, whereas semantically meaningful words are counted.

Expected Behavior:
A search for "hello world"~2 should match strings where "hello" and "world" are separated by up to two terms, regardless of the nature or semantic value of the intervening terms.

Actual Behavior:
The behavior of the proximity search appears inconsistent:

  1. It matches strings like "hello X Y Z A B C D world" and "hello to a the but and for this world" even though there are many terms between "hello" and "world".
  2. It does not match strings like "hello add more words to illustrate the problem world", correctly following the ~2 constraint.
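
The pattern above is consistent with the suspicion in the description that short tokens and filler words are dropped before the distance is measured. The following is a minimal pure-Python sketch of that hypothesis, not Whoosh code: the stop-word list, the minimum token length, and both function names are assumptions made up for illustration.

```python
# Hypothetical model of the suspected behavior; none of this is Whoosh's API.
# ASSUMED_STOP_WORDS and ASSUMED_MIN_SIZE are illustrative guesses.
ASSUMED_STOP_WORDS = frozenset({"to", "a", "the", "but", "and", "for", "this"})
ASSUMED_MIN_SIZE = 2


def surviving_tokens(text):
    """Lowercase, split on whitespace, drop short tokens and stop words."""
    return [
        token
        for token in text.lower().split()
        if len(token) >= ASSUMED_MIN_SIZE and token not in ASSUMED_STOP_WORDS
    ]


def gap_after_filtering(text, first, second):
    """Count the tokens left between `first` and `second` after filtering."""
    tokens = surviving_tokens(text)
    return tokens.index(second) - tokens.index(first) - 1


cases = {
    "Case 1": "hello X Y Z A B C D world",
    "Case 2": "hello to a the but and for this world",
    "Case 3": "hello add more words to illustrate the problem world",
}

for name, content in cases.items():
    gap = gap_after_filtering(content, "hello", "world")
    print(f"{name}: {gap} intervening tokens, within 2: {gap <= 2}")
```

Under these assumptions, the first two strings leave no tokens between "hello" and "world", while the third leaves five, which mirrors the match and non-match results reported here. Whether Whoosh actually computes positions this way is not confirmed by this sketch.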

Minimal Working Example (MWE):

from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
from whoosh.filedb.filestore import RamStorage

schema = Schema(content=TEXT(stored=True))
storage = RamStorage()

def create_new_index():
    return storage.create_index(schema)

def add_to_index(idx, content):
    writer = idx.writer()
    writer.add_document(content=content)
    writer.commit()

def matches_whoosh(query, indexed_opinion):
    with indexed_opinion.searcher() as searcher:
        parsed_query = QueryParser("content", indexed_opinion.schema).parse(query)
        results = searcher.search(parsed_query)
        return len(results) > 0

# Add test cases and print results
test_cases = {
    "Case 1": "hello X Y Z A B C D world",
    "Case 2": "hello to a the but and for this world",
    "Case 3": "hello add more words to illustrate the problem world"
}

query = '"hello world"~2'
for case_name, content in test_cases.items():
    idx = create_new_index()
    add_to_index(idx, content)
    print(f"{case_name}: {matches_whoosh(query, idx)}")

Environment:

  • Whoosh version: 2.7.4
  • Python version: 3.10.0
  • Operating System: macOS 13.5.1 with Apple M1 Pro
cclauss pushed a commit to cclauss/whoosh-1 that referenced this issue Feb 9, 2024
# Description

This resolves the code coverage reporting, so the actual source files
will also have coverage reported. I configured my own fork with a token
and you can view the results:
https://app.codecov.io/github/stumpylog/whoosh-reloaded

The main fix is to install the package in editable mode using `pip install -e .`.
Otherwise, coverage did not pick up those files as being relevant.

The other small fix was to run the tests only once.

Closes: mchaput#48

# Checklist:

- [x] I have performed a self-review of my own code
- [ ] I have commented my code in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation