[Tokenizer] Support reading Tiktoken tokenizer.model. #9215

lvdongyi · 2024-09-28T12:19:35Z

PR types

New features

PR changes

APIs

Description

Support reading Tiktoken tokenizer.model.
Split PretrainedTokenizerBase.from_pretrained into two separate methods: from_pretrained and _from_pretrained.
Prefer not to use FastTokenizer even if it is available. (To load a TokenizerFast through AutoTokenizer, explicitly set use_fast=True.)
Use LazyMapping so that keys and values are loaded only when they are accessed.
Do not allow multiple Tokenizer classes in a single tokenizer.py.

e.g.
- The previous paddlenlp.transformers.albert.tokenizer was split into paddlenlp.transformers.albert.tokenizer, paddlenlp.transformers.albert_chinese.tokenizer, and paddlenlp.transformers.albert_english.tokenizer.
- The previous paddlenlp.transformers.mbart.tokenizer was split into paddlenlp.transformers.mbart.tokenizer and paddlenlp.transformers.mbart50.tokenizer.
Modify tests/transformers/test_modeling_common.py to support LlamaTokenizerFast
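
For context on the headline feature: a Tiktoken-style tokenizer.model is commonly a plain-text file with one `<base64 token> <rank>` pair per line. The sketch below reads such a file under that assumption. `load_tiktoken_ranks` is a hypothetical helper written for illustration, not the loader added in this PR, and the example file is generated on the fly.

```python
import base64
import os
import tempfile


def load_tiktoken_ranks(path):
    """Parse lines of the form '<base64 token> <rank>' into {bytes: int}."""
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Build a tiny example file so the sketch runs without any download.
_example_tokens = [b"a", b"b", b"ab"]
with tempfile.NamedTemporaryFile("wb", suffix=".model", delete=False) as tmp:
    for rank, token in enumerate(_example_tokens):
        tmp.write(base64.b64encode(token) + b" " + str(rank).encode() + b"\n")

example_ranks = load_tiktoken_ranks(tmp.name)
os.remove(tmp.name)
```

Keys are kept as raw bytes because a BPE token need not be valid UTF-8 on its own.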

TOKENIZER_MAPPING_NAMES, MODEL_NAMES_MAPPING, CONFIG_MAPPING_NAMES should be reviewed carefully
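
To illustrate the lazy-loading idea behind these name tables, here is a minimal sketch of a mapping that resolves a value only the first time its key is read. `LazyNameMapping` and `fake_loader` are hypothetical names for illustration; this is not the LazyMapping implementation from the PR.

```python
from collections.abc import Mapping


class LazyNameMapping(Mapping):
    """Read-only mapping that calls a loader the first time each key is read."""

    def __init__(self, names, loader):
        self._names = dict(names)  # key -> name to resolve later
        self._loader = loader      # hypothetical callable: name -> value
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            # Unknown keys raise KeyError here, as with a plain dict.
            self._cache[key] = self._loader(self._names[key])
        return self._cache[key]

    def __contains__(self, key):
        # Membership tests must not trigger a load.
        return key in self._names

    def __iter__(self):
        return iter(self._names)

    def __len__(self):
        return len(self._names)


loaded = []


def fake_loader(name):
    loaded.append(name)  # record every real load
    return name.upper()


mapping = LazyNameMapping(
    {"llama": "LlamaTokenizer", "bert": "BertTokenizer"}, fake_loader
)
has_bert = "bert" in mapping   # no load happens
first = mapping["llama"]       # triggers one load
second = mapping["llama"]      # served from the cache
```

Overriding `__contains__` matters: the default `Mapping.__contains__` calls `__getitem__`, which would load the value as a side effect of a membership test.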

paddle-bot · 2024-09-28T12:19:39Z

Thanks for your contribution!

codecov · 2024-09-28T12:52:54Z

Codecov Report

Attention: Patch coverage is 77.03281% with 161 lines in your changes missing coverage. Please review.

Project coverage is 53.04%. Comparing base (5ad7a9c) to head (f0f4113).
Report is 8 commits behind head on develop.

❗ Current head f0f4113 differs from pull request most recent head e84a062

Please upload reports for the commit e84a062 to get more accurate results.

Files with missing lines	Patch %	Lines
paddlenlp/transformers/auto/factory.py	46.06%	48 Missing ⚠️
paddlenlp/utils/import_utils.py	54.92%	32 Missing ⚠️
paddlenlp/transformers/auto/tokenizer.py	77.10%	19 Missing ⚠️
paddlenlp/transformers/mbart50/tokenizer.py	86.29%	17 Missing ⚠️
paddlenlp/transformers/albert_english/tokenizer.py	85.98%	15 Missing ⚠️
paddlenlp/transformers/auto/configuration.py	80.00%	13 Missing ⚠️
paddlenlp/transformers/convert_slow_tokenizer.py	80.59%	13 Missing ⚠️
paddlenlp/transformers/tokenizer_utils_base.py	90.62%	3 Missing ⚠️
paddlenlp/transformers/llama/tokenizer.py	94.11%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9215      +/-   ##
===========================================
+ Coverage    52.99%   53.04%   +0.04%     
===========================================
  Files          671      667       -4     
  Lines       109835   106886    -2949     
===========================================
- Hits         58212    56699    -1513     
+ Misses       51623    50187    -1436

DrownFish19 · 2024-10-11T12:50:25Z

paddlenlp/transformers/auto/tokenizer.py

@@ -176,7 +324,7 @@ def _get_tokenizer_class_from_config(cls, pretrained_model_name_or_path, config_
            return tokenizer_class

    @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
+    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):


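Be careful when renaming this parameter; check whether the other from_pretrained methods use the same parameter name.

None of the other Tokenizers override from_pretrained, so this should have no effect.

The concern here is that code calling auto.from_pretrained() and code calling Qwen2XXX.from_pretrained() may end up being written differently. I suggest keeping them consistent.

I have changed this back to model_args for now.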
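
On the rename of the variadic parameter in this hunk: in Python, a `*args`-style parameter cannot be bound by keyword, so renaming it does not change how existing call sites behave. The remaining concern is consistency of the signatures across classes. A small sketch with two hypothetical stand-in classes (`Before` and `After` are not PaddleNLP classes):

```python
import inspect


class Before:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        return pretrained_model_name_or_path, model_args, kwargs


class After:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
        return pretrained_model_name_or_path, inputs, kwargs


# Positional and keyword calls behave identically under both names.
call_before = Before.from_pretrained("some-model", "extra", use_fast=True)
call_after = After.from_pretrained("some-model", "extra", use_fast=True)

# A keyword named like the variadic parameter never binds to it;
# it lands in **kwargs in both versions.
kw_before = Before.from_pretrained("some-model", model_args=("x",))
kw_after = After.from_pretrained("some-model", model_args=("x",))

# Parameter kinds as seen by callers (cls is already bound).
kinds = [
    p.kind.name
    for p in inspect.signature(After.from_pretrained).parameters.values()
]
```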

DrownFish19 · 2024-10-12T02:36:42Z

paddlenlp/utils/download/__init__.py

@@ -272,7 +272,7 @@ def resolve_file_path(
            f"'{log_endpoint}' for available revisions."
        )
    except EntryNotFoundError:
-        return None
+        raise EnvironmentError(f"Does not appear one of the {filenames} in {repo_id}.")


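Shouldn't this error type be EntryNotFoundError?

This part was already like this before my change.

It was probably written incorrectly back then; this error can be fixed.

If we raise EntryNotFoundError, there is no need to catch EntryNotFoundError with except in the first place. Presumably there was a reason for doing it this way (I think).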
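
The two behaviours in this hunk can be compared side by side. The sketch below is self-contained and uses assumptions throughout: `EntryNotFoundError` is a local stand-in for the download library's exception, and `fetch`, `resolve_or_none`, and `resolve_or_raise` are hypothetical helpers, not the real `resolve_file_path`.

```python
class EntryNotFoundError(Exception):
    """Local stand-in for the download library's 'file missing' error."""


def fetch(repo_id, filename, available):
    # Hypothetical fetcher: raises when the file is not in the repo listing.
    if filename not in available:
        raise EntryNotFoundError(f"{filename} not found in {repo_id}")
    return f"{repo_id}/{filename}"


def resolve_or_none(repo_id, filename, available):
    """Removed line in the hunk: swallow the error and return None."""
    try:
        return fetch(repo_id, filename, available)
    except EntryNotFoundError:
        return None


def resolve_or_raise(repo_id, filename, available):
    """Added line in the hunk: convert the error into an EnvironmentError."""
    try:
        return fetch(repo_id, filename, available)
    except EntryNotFoundError as err:
        # Chaining with 'from' keeps the root cause in the traceback.
        raise EnvironmentError(f"Could not find {filename} in {repo_id}.") from err


missing = resolve_or_none("org/model", "tokenizer.model", available=set())
try:
    resolve_or_raise("org/model", "tokenizer.model", available=set())
    cause = None
except EnvironmentError as exc:
    cause = exc.__cause__
```

Returning None pushes the check onto every caller, while raising fails at the point of the problem. Note that `EnvironmentError` is an alias of `OSError` in Python 3, so callers catching `OSError` will also see it.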

DrownFish19

LGTM

ZHUI · 2024-10-17T11:32:42Z

paddlenlp/transformers/albert_chinese/tokenizer.py

+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software


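This is at the model level, isn't it? Why split it into separate folders?

AutoTokenizer currently picks the Tokenizer by matching on the name of the model directory. For example, albert, albert_chinese, and albert_english previously all lived under the albert directory, but name-based matching (the TOKENIZER_MAPPING_NAMES table) allows only one Tokenizer and one TokenizerFast per name. Without the split, albert_chinese and albert_english could not be loaded through AutoTokenizer, because the three need different Tokenizer classes at load time.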
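
The constraint described in this reply can be sketched with a small lookup table. Everything here is illustrative: `TOKENIZER_NAMES` and `tokenizer_class_name` are hypothetical, and the entries are assumed values rather than a copy of the real TOKENIZER_MAPPING_NAMES.

```python
# Hypothetical table: exactly one (slow, fast) pair per model directory name.
TOKENIZER_NAMES = {
    "albert": ("AlbertTokenizer", None),
    "albert_chinese": ("AlbertChineseTokenizer", None),
    "albert_english": ("AlbertEnglishTokenizer", None),
    "llama": ("LlamaTokenizer", "LlamaTokenizerFast"),
}


def tokenizer_class_name(model_dir, use_fast=False):
    """Pick a class name from the directory name; slow is the default."""
    slow, fast = TOKENIZER_NAMES[model_dir]
    if use_fast and fast is not None:
        return fast
    return slow


default_llama = tokenizer_class_name("llama")
fast_llama = tokenizer_class_name("llama", use_fast=True)
chinese = tokenizer_class_name("albert_chinese")
```

Because each key holds a single pair, three tokenizers sharing one directory name would compete for one slot, which is why each variant gets its own directory and therefore its own key.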

lvdongyi added 8 commits September 24, 2024 14:10

add support of tiktoken tokenizer, refactor some code

0fd7240

Merge branch 'PaddlePaddle:develop' into dev-refactor-pretrained

d1ee434

add support of tiktoken tokenizer, refactor some code

9004ac9

clean code & add blobfile to requirements.txt

d004c33

Don't allow multiple Class in a

0b61d11

update docstring, add a RuntimeError when AutoTokenizer failed to loa…

aad6750

…d from pretrained, update method to get attr from a module

update albert_english/__init__.py and mbart/__init__.py

04dff4d

fix typo, rm redundent notations

6475a83

ZHUI requested a review from DrownFish19 September 30, 2024 03:41

DrownFish19 added the contributor label Oct 11, 2024

paddle-bot bot assigned wawltor Oct 11, 2024

DrownFish19 assigned DrownFish19 and unassigned wawltor Oct 11, 2024

DrownFish19 reviewed Oct 11, 2024

lvdongyi added 5 commits October 11, 2024 14:55

some changes...

dea3ad4

AutoTokenizer will not load TokenzierFast by default

f5ae794

Add test for external config

ce684a1

revert unnecrssary changes

75368d5

Update test_modeling_common.py

469ffbf

DrownFish19 reviewed Oct 12, 2024

lvdongyi added 3 commits October 12, 2024 02:59

fix

ee33fba

Merge branch 'PaddlePaddle:develop' into dev-20240927-support-tiktoken

92e4e0e

Merge branch 'PaddlePaddle:develop' into dev-20240927-support-tiktoken

f0f4113

DrownFish19 previously approved these changes Oct 15, 2024

ZHUI reviewed Oct 17, 2024

rm redundent print

353fb41

lvdongyi dismissed DrownFish19’s stale review via 353fb41 October 17, 2024 12:40

lvdongyi added 3 commits October 17, 2024 15:44

revert some changes

d279d8d

fix problem in TOKENIZER_MAPPING_NAMES

e367332

try fix a strange bug

e84a062
