Inconsistent behavior of proximity search (`~N`) in Whoosh based on word type #48

bbernicker · 2023-09-25T21:18:56Z

Description:
The proximity search (`~N`) in Whoosh behaves inconsistently depending on the words in the indexed document. Single letters and common filler words appear to be ignored when the distance between terms is counted, whereas other words are counted.

Expected Behavior:
A search for "hello world"~2 should match strings where "hello" and "world" are separated by up to two terms, regardless of the nature or semantic value of the intervening terms.
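To make the expected semantics concrete, here is a minimal pure-Python sketch of the check described above. It is only a model of what this report expects, not Whoosh's implementation: the helper name `within_proximity` and the whitespace tokenization are assumptions made for illustration, and Whoosh's own definition of slop may count positions rather than intervening terms, so the exact threshold could differ by one.

```python
def within_proximity(text, first, second, max_between):
    """Return True if `second` follows `first` with at most
    `max_between` terms in between (every term counts)."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    # A gap of `max_between` intervening terms is a position
    # difference of `max_between + 1`.
    return any(
        0 < s - f <= max_between + 1
        for f in first_positions
        for s in second_positions
    )


for sample in (
    "hello big wide world",
    "hello X Y Z A B C D world",
    "hello to a the but and for this world",
):
    print(sample, "->", within_proximity(sample, "hello", "world", 2))
```

Under this model every intervening term counts, so only the first sample falls within the limit.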

Actual Behavior:
The behavior of the proximity search appears inconsistent:

- It matches strings like "hello X Y Z A B C D world" and "hello to a the but and for this world", even though there are many terms between "hello" and "world".
- It does not match strings like "hello add more words to illustrate the problem world", which correctly follows the `~2` constraint.

Minimal Working Example (MWE):

from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
from whoosh.filedb.filestore import RamStorage

schema = Schema(content=TEXT(stored=True))
storage = RamStorage()

def create_new_index():
    return storage.create_index(schema)

def add_to_index(idx, content):
    writer = idx.writer()
    writer.add_document(content=content)
    writer.commit()

def matches_whoosh(query, indexed_opinion):
    with indexed_opinion.searcher() as searcher:
        parsed_query = QueryParser("content", indexed_opinion.schema).parse(query)
        results = searcher.search(parsed_query)
        return len(results) > 0

# Add test cases and print results
test_cases = {
    "Case 1": "hello X Y Z A B C D world",
    "Case 2": "hello to a the but and for this world",
    "Case 3": "hello add more words to illustrate the problem world"
}

query = '"hello world"~2'
for case_name, content in test_cases.items():
    idx = create_new_index()
    add_to_index(idx, content)
    print(f"{case_name}: {matches_whoosh(query, idx)}")

Environment:

Whoosh version: 2.7.4
Python version: 3.10.0
Operating System: macOS 13.5.1 with Apple M1 Pro

# Description

This resolves the code coverage reporting, so the actual source files will also have coverage reported. I configured my own fork with a token and you can view the results: https://app.codecov.io/github/stumpylog/whoosh-reloaded

The main fix is to install in editable mode using `pip install -e .`. Otherwise, coverage did not pick up those files as being relevant. The other small fix was to run the tests only once.

Closes: mchaput#48

# Checklist:

- [x] I have performed a self-review of my own code
- [ ] I have commented my code in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
