Fragmenter doesn't properly filter HTML tags #28

chang-zhao · 2022-06-11T07:48:22Z

I search HTML documents and often get HTML tags, or pieces of them, in the highlighter results, for example:

Абсолютная истина.<br />Не-противостояние и растворение напряжений</h1...Абсолютная истина</h4...В этом мире нет

Most often, these tags or tag pieces are:

    </p
    <br />
    </h1, </h2 and so on
    </em></strong>

I switched to using SentenceFragmenter (which is also more suitable for my needs):

    results.fragmenter = highlight.SentenceFragmenter(
        maxchars=240,
        sentencechars='</>.!?',
        charlimit=None,
    )

so it should filter all of that out, but it doesn't. I also tried escaping those characters:

    sentencechars='\<\/\>.!?'

That didn't help either. It seems I will have to resort to an additional search-and-replace pass.
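
One possible reading of this behaviour, offered as an assumption rather than a confirmed diagnosis: `sentencechars` tells `SentenceFragmenter` which characters end a sentence, so it controls where fragments are cut and does not remove anything from the text. If that is right, the markup has to be stripped before the text reaches the fragmenter. Below is a minimal sketch using only the standard library; `strip_html` and `_TextExtractor` are hypothetical names and not part of Whoosh.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all markup (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Put a space where a tag was so adjacent words do not merge.
        # Note: this also splits words that contain inline tags.
        self._chunks.append(" ")

    def handle_endtag(self, tag):
        self._chunks.append(" ")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_html(markup):
    """Return the plain text of an HTML string, with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any buffered trailing text
    return parser.text()
```

The cleaned string could then be stored in the index instead of the raw HTML, or passed at highlight time, for example `hit.highlights("content", text=strip_html(raw_html))`, where `"content"` and `raw_html` are placeholder names for the field and the original document.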


chang-zhao · 2022-06-11T13:24:54Z

Here's how I clean it:
https://gist.github.com/chang-zhao/2a18dcab0b40e3011decefb65c91b4ca
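
For readers who cannot open the gist, here is a rough sketch of what such a cleanup pass might look like. It is not the gist's code; `clean_fragment` and the pattern are assumptions made for illustration. The pattern removes complete tags such as `<br />` and `</em></strong>`, plus tags cut off before the closing `>` such as `</p` and `</h1`.

```python
import re

# Matches a complete tag (<br />, </em>) or a tag that was cut off
# before its closing ">" (</h1, </p). Hypothetical pattern, not from Whoosh.
_TAG_OR_PIECE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:(?:[\s/][^<>]*)?>)?")


def clean_fragment(fragment):
    """Remove HTML tags and tag pieces from a highlighted fragment."""
    without_tags = _TAG_OR_PIECE.sub(" ", fragment)
    # Collapse the whitespace left behind by removed tags.
    return " ".join(without_tags.split())
```

Two caveats apply. The pattern also removes any tags the highlight formatter itself adds around matched terms, so it suits plain-text output or needs an exception for the formatter's tag. It also treats text such as `x<y` as the start of a tag, which may be wrong for documents containing such expressions.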

ZeroCool940711 added a commit to cclauss/whoosh-1 that referenced this issue Feb 1, 2024

Merge pull request mchaput#28 from jap/version-check

f2ebc71

use idiomatic python version check
