
How to support search in other languages, such as Chinese? #201

Open
sinianzhiren opened this issue Feb 1, 2023 · 6 comments

Comments

@sinianzhiren

Excuse me, how can I support search in other languages, such as Chinese? Thank you.

@lucaong
Owner

lucaong commented Feb 1, 2023

Hello @sinianzhiren ,
in principle, MiniSearch should work with any language. In practice, it might sometimes be necessary to tweak some options, but the defaults are a good starting point.

Unfortunately I do not know enough about the Chinese language to guide you here, but other users have successfully used MiniSearch for Chinese (look for example at this comment or at this issue).

Did you encounter a specific problem supporting Chinese or another language? If so, please describe it, and I would be happy to help if I can.
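
For reference, the two MiniSearch options most relevant for language support are tokenize (how text is split into terms) and processTerm (how each term is normalized). A minimal sketch of where they plug in, with placeholder logic and an assumed "text" field:

import MiniSearch from "minisearch";

const miniSearch = new MiniSearch({
  fields: ["text"],
  // How documents (and, by default, queries) are split into terms.
  // The default splits on spaces and punctuation, which does not suit
  // unsegmented CJK text.
  tokenize: (text) => text.split(/\s+/),
  // How each term is normalized before indexing; the default lowercases.
  processTerm: (term) => term.toLowerCase(),
});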

@SSShooter

Excuse me, how can I support search in other languages, such as Chinese? Thank you.

You should perform Chinese word segmentation with a library like nodejieba before indexing documents, as in the sketch below.
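
A minimal sketch of this approach, assuming nodejieba's cut API (nodejieba is a native Node.js module, so this works server-side only):

import nodejieba from "nodejieba";
import MiniSearch from "minisearch";

const miniSearch = new MiniSearch({
  fields: ["text"],
  // Segment Chinese text into words before indexing; by default the same
  // tokenizer is also applied to search queries.
  tokenize: (text) => nodejieba.cut(text),
});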

@acnebs

acnebs commented Jul 4, 2023

If you don't care about supporting Firefox, Intl.Segmenter is great.

@mantou132

mantou132 commented Jan 14, 2024

Firefox Nightly now supports Intl.Segmenter.

Here is an example that uses Intl.Segmenter to support CJK:

import MiniSearch from "minisearch";

// Feature-detect Intl.Segmenter; if it is unavailable, terms pass through unchanged.
const segmenter =
  Intl.Segmenter && new Intl.Segmenter("zh", { granularity: "word" });

const miniSearch = new MiniSearch({
  fields: ["text"],
  // processTerm may return an array, so each term is expanded into the
  // word segments produced by the segmenter.
  processTerm: (term) => {
    if (!segmenter) return term;
    const tokens = [];
    for (const seg of segmenter.segment(term)) {
      tokens.push(seg.segment);
    }
    return tokens;
  },
});

const documents = [
  // "Add the required attribute to the field and validate the form on submit"
  { id: 1, text: "为字段添加 required 属性,并在提交时进行表单验证" },
  {
    id: 2,
    text: "By default, the same processing is applied to search queries. In order to apply a different processing to search queries, supply a processTerm search option:",
  },
];

miniSearch.addAll(documents);
console.log("===");
console.log(miniSearch.search("添加")); // finds document 1

Here is an online example: https://duoyun-ui.gemjs.org/zh/ (the search front end uses @docsearch/js).

@ThomasChan

I also ran into problems searching Chinese. For example, when searching for "预置", word segmentation splits it into "预" and "置", so the content cannot be found. My project uses VitePress, which uses MiniSearch indirectly, and I finally got search working with this configuration:

import { defineConfig } from 'vitepress';
...

// Feature-detect Intl.Segmenter; when it is unavailable, terms pass through unchanged.
const segmenter =
  typeof Intl.Segmenter === 'function'
    ? new Intl.Segmenter('zh', { granularity: 'word' })
    : undefined;

// Lowercase a term and split it into word segments.
const segmentTerm = (term: string) => {
  const lowered = term.toLowerCase();
  if (!segmenter) return [lowered];
  return Array.from(segmenter.segment(lowered), (seg) => seg.segment);
};

export default defineConfig({
  ...
  themeConfig: {
    search: {
      options: {
        miniSearch: {
          options: {
            // Segment document text at indexing time.
            tokenize: (term) => segmentTerm(term),
          },
          searchOptions: {
            combineWith: 'AND', // important for searching Chinese
            // Segment query terms the same way at search time.
            processTerm: (term) => segmentTerm(term),
          },
        },
      },
    },
  },
  ...
});

Thanks to @mantou132

leverglowh added a commit to leverglowh/tweets-archive that referenced this issue Nov 21, 2024
* Fix minisearch Chinese search

per lucaong/minisearch#201 (comment)

* Mention feature in readme
@onyxblade

onyxblade commented Nov 30, 2024

I found that Intl.Segmenter is not reliable enough for search. For example, '懵逼了' in a document is broken down into ['懵', '逼了'], while the search term '懵逼' is broken down into ['懵', '逼']. This hurts searchability.
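
The mismatch is easy to reproduce (the exact output depends on the ICU data shipped with the JavaScript runtime, so treat these results as illustrative):

const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const words = (s) => [...segmenter.segment(s)].map((seg) => seg.segment);

console.log(words("懵逼了")); // may yield ['懵', '逼了']
console.log(words("懵逼"));   // may yield ['懵', '逼'], sharing no token with the document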

I use bigrams for Chinese search instead. Since MiniSearch supports a per-term fuzzy option, I can disable fuzzy search for Chinese terms (letting them match via bigrams) while keeping fuzzy search enabled for English terms. This has worked well for my use case so far. (In the snippet below, fields: ['text'] stands in for your actual document fields.)

import MiniSearch from "minisearch"
import { bigram } from "n-gram"

const SPACE_OR_PUNCTUATION = /[\n\r\p{Z}\p{P}]+/u
// From https://github.com/vinta/pangu.js/blob/master/src/shared/core.js
const CJK_RANGE = '\u2e80-\u2eff\u2f00-\u2fdf\u3040-\u309f\u30a0-\u30fa\u30fc-\u30ff\u3100-\u312f\u3200-\u32ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff'
const CJK_NCJK = new RegExp(`([${CJK_RANGE}])([^${CJK_RANGE}])`, 'g')
const NCJK_CJK = new RegExp(`([^${CJK_RANGE}])([${CJK_RANGE}])`, 'g')
const CJK_WORD = new RegExp(`^[${CJK_RANGE}]+$`)

function isCJKTerm(term: string) {
  return CJK_WORD.test(term)
}

// Add a space between CJK and non-CJK characters.
// '中文Latin中文' => '中文 Latin 中文'
function addSpaceBetweenCJKandNonCJK(text: string) {
  return text.replace(CJK_NCJK, '$1 $2').replace(NCJK_CJK, '$1 $2')
}

const miniSearch = new MiniSearch({
  fields: ['text'], // adjust to your document fields
  tokenize(text) {
    const tokens: string[] = []

    // Add a space between CJK and non-CJK runs, then split on spaces and punctuation.
    const segments = addSpaceBetweenCJKandNonCJK(text).split(SPACE_OR_PUNCTUATION)

    segments.forEach(segment => {
      // Skip empty segments produced by leading or trailing separators.
      if (!segment) return
      if (isCJKTerm(segment)) {
        // Conversion between Traditional Chinese and Simplified Chinese can happen here.
        // A simple character table can be found at:
        // https://github.com/tongwentang/tongwen-dict/blob/main/src/charater/t2s-char.json

        // Each single character is added. '樣例詞組' => ['樣', '例', '詞', '組']
        Array.from(segment).forEach(char => tokens.push(char))
        // Each bigram is added. '樣例詞組' => ['樣例', '例詞', '詞組']
        bigram(segment).forEach(token => tokens.push(token))
      } else {
        // For non-CJK terms, add the segment as-is.
        tokens.push(segment)
      }
    })
    return tokens
  },
  searchOptions: {
    combineWith: 'AND',
    fuzzy(term) {
      // Disable fuzzy search for CJK terms; otherwise use a fuzzy factor.
      return isCJKTerm(term) ? false : 0.35
    },
    maxFuzzy: 4
  }
})
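
A hypothetical usage of this index (the 'text' field and the sample document are illustrative, not part of the original snippet):

miniSearch.addAll([{ id: 1, text: "樣例詞組 and Latin text" }])

// The query is tokenized with the same function, so '例詞' becomes
// ['例', '詞', '例詞'] and matches document 1 through its bigrams.
console.log(miniSearch.search("例詞"))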
