Fragmenter doesn't properly filter HTML tags #28

Open
chang-zhao opened this issue Jun 11, 2022 · 1 comment

I search HTML documents and often get HTML tags, or fragments of them, in the highlighter results, like:

Абсолютная истина.<br />Не-противостояние и растворение напряжений</h1...Абсолютная истина</h4...В этом мире нет

Most often, these tags or tag pieces are:

    </p
    <br />
    </h1, </h2 and so on
    </em></strong>

I switched to using SentenceFragmenter (which is also more suitable for my needs):

from whoosh import highlight

# `results` is the Results object returned by searcher.search(...)
results.fragmenter = highlight.SentenceFragmenter(
    maxchars=240,
    sentencechars='</>.!?',
    charlimit=None,
)

I expected this to filter all of that out, but it doesn't. I even tried escaping those characters:

sentencechars='\<\/\>.!?'

Nope. It seems I'll have to resort to an additional search-and-replace pass.
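
For reference, a minimal sketch of what such a search-and-replace pass might look like, using only the standard library `re` module. The helper name `clean_fragment` and both patterns are hypothetical, not part of Whoosh. It assumes the fragments are joined with `...` and that no formatter markup needs to survive, since every tag is removed, including any highlight tags the formatter adds. Running the same cleanup on the stored text before highlighting may be the safer variant.

```python
import re

# Complete tags such as <br />, </em> or </strong> (hypothetical pattern).
TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s+[^<>]*)?/?>")

# Tags cut off before the closing '>', e.g. "</h1" right before the "..."
# separator or at the very end of the string (hypothetical pattern).
PARTIAL_TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?=\.\.\.|$)")


def clean_fragment(text):
    """Remove whole and truncated HTML tags from highlighter output."""
    text = TAG_RE.sub(" ", text)          # whole tags become a space
    text = PARTIAL_TAG_RE.sub("", text)   # truncated tags are dropped
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


sample = (
    "Абсолютная истина.<br />Не-противостояние и растворение "
    "напряжений</h1...Абсолютная истина</h4...В этом мире нет"
)
print(clean_fragment(sample))
```

This only handles the tag shapes listed above. Text that legitimately contains `<` followed directly by a letter would need a stricter pattern.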

ZeroCool940711 added a commit to cclauss/whoosh-1 that referenced this issue Feb 1, 2024
use idiomatic python version check