Improving search with Levenshtein distance or similar algorithms #5325
Comments
Hi @KuramaSyu, I would be happy to consider changes if they can fit into the current general structure & scope of how search has been implemented, where added complexity is minimal and where existing efficiency strategies (like use of database indexes for normal search terms) can remain. Even if a solution fits the above, we'd still need to evaluate the end result. For anything too complex (in terms of added technology/dependency requirements or logical implementation) I'd view it as I do LLMs, where I'd rather look to provide interfaces for external options instead of supporting it ourselves.
@ssddanbrown could you tell me where I would need to look in the repo? I found the search dir, but I have no idea what the starting point there is, or where the SQL statements are built. Since MySQL is used, I searched for options there. MySQL has a function called SOUNDEX, which estimates whether two strings sound similar, and the other option is manually adding a MySQL function for Levenshtein. I would also need to check whether Levenshtein is fast enough, and whether it is workable at all when comparing single words to full titles, which are of course not similar. Another option would be using PostgreSQL, since it has a similarity function built in which works quite well. But I guess this is not possible.
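For reference, a minimal sketch of the Levenshtein distance mentioned above, written in plain Python rather than as a MySQL function. The function name and the row-by-row layout are illustrative choices, not anything taken from BookStack. It also shows why comparing a single word against a whole title gives a large distance, which suggests comparing word against word instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# A swapped pair of letters counts as two substitutions here.
print(levenshtein("linxu", "linux"))
# A word compared against a full title is dominated by the length difference.
print(levenshtein("linux", "linux server setup guide"))
```

Plain Levenshtein has no transposition operation, so a typo like "linxu" costs two edits; a variant that counts adjacent swaps as one edit would be more forgiving for this kind of typo, at the cost of a little more logic.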
Indexing is done here: https://github.com/BookStackApp/BookStack/blob/development/app/Search/SearchIndex.php
Logic Summary
During indexing, content for an entity (book/chapter/page) is split into words, with words reduced down to a score per entity per word, with frequency and location (titles and headings are boosted, for example) impacting that score. This is all stored in a
When a search is performed, we split out normal search terms, then query against
The incoming (search query) terms also have their own score adjustments made based on their frequency, to bump the score of less common words. This is to act like some level of cheap runtime tf–idf.
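The summary above can be sketched roughly as follows. This is only an illustration of the described idea, not BookStack's implementation: the boost values, the function names, the in-memory dictionary standing in for the database, and the exact rarity formula are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical weights; the real boost values are not given in this thread.
TITLE_BOOST = 3.0
BODY_BOOST = 1.0


def tokenize(text: str) -> list:
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_entity(title: str, body: str) -> dict:
    """Reduce an entity to one score per word, boosting title words."""
    scores = defaultdict(float)
    for word, count in Counter(tokenize(title)).items():
        scores[word] += count * TITLE_BOOST
    for word, count in Counter(tokenize(body)).items():
        scores[word] += count * BODY_BOOST
    return dict(scores)


def search(index: dict, query: str) -> list:
    """Rank entities by summed term scores, weighting rarer terms higher.

    `index` maps an entity id to the per-word scores from index_entity().
    """
    totals = defaultdict(float)
    for term in tokenize(query):
        matches = {eid: words[term] for eid, words in index.items() if term in words}
        if not matches:
            continue  # exact matching only: a typo simply finds nothing
        rarity = len(index) / len(matches)  # fewer matching entities, bigger bump
        for eid, score in matches.items():
            totals[eid] += score * rarity
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


index = {
    "page:1": index_entity("Linux setup", "Install linux packages"),
    "page:2": index_entity("Windows setup", "Install packages"),
}
print(search(index, "linux setup"))
print(search(index, "linxu"))
```

Because lookup is by exact word, the second query returns nothing, which is the behaviour the issue describes. Any fuzzy matching would have to fit around this exact-term lookup without giving up the indexed access it relies on.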
Describe the feature you'd like
When searching for something, I sometimes don't find what I need because I made a typo or named the title slightly differently. An example would be searching for settings.json when the title is settings json, or the other way around.
Another example would be searching for Linxu instead of Linux, where I currently get no results.
I would be happy to contribute this, but first I want to ask whether it is even wanted.
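The two examples above are different problems, and a small sketch can make that concrete. The first is a punctuation mismatch that normalization alone could resolve; the second is a typo that needs approximate matching. The sketch below uses Python's standard `difflib` module purely for illustration; the helper names, the `known_words` vocabulary, and the 0.75 cutoff are assumptions, not part of BookStack.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse any run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def suggest(term: str, known_words: list, cutoff: float = 0.75) -> list:
    """Return up to three known words that closely resemble the given term."""
    return difflib.get_close_matches(normalize(term), known_words, n=3, cutoff=cutoff)


# Punctuation differences disappear once both sides are normalized.
print(normalize("settings.json") == normalize("settings json"))

# A typo is mapped to the closest word in a (hypothetical) vocabulary
# of already indexed terms.
known_words = ["linux", "windows", "settings", "json"]
print(suggest("Linxu", known_words))
```

Resolving a typo to a known word first, and then running the normal exact search with that word, would keep the fuzzy step separate from the indexed lookup.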
Describe the benefits this would bring to existing BookStack users
Better search feature, hence better overall experience, since search is in my opinion one of the most important features.
Can the goal of this request already be achieved via other means?
Yes, there is an issue that proposes an AI-based implementation, but that would be far more expensive than using the Levenshtein distance algorithm, even if that's not the main purpose of issue #5318.
Have you searched for an existing open/closed issue?
How long have you been using BookStack?
1 to 5 years
Additional context
No response