Can long text be split into short texts? #655

CoinCheung · 2024-07-12T08:26:26Z

❓ The question

I generate training samples with dolma, and I found that some of the texts are very long, up to 8k tokens, while my max_seq_len is only 2k. In this case, will the OLMo dataset split the 8k sample into 4 parts (each 2k tokens long), or will only the first 2k tokens be kept and the remaining tokens dropped?
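To make the two possibilities concrete, here is a minimal sketch of each behavior on a list of token IDs. The helper names `split_into_chunks` and `truncate` are hypothetical and exist only for illustration; this is not the training code's actual implementation, and it does not answer which behavior is used.

```python
from typing import List


def split_into_chunks(token_ids: List[int], max_seq_len: int) -> List[List[int]]:
    """Possibility A: keep every token, cut into consecutive pieces of at most max_seq_len."""
    return [token_ids[i:i + max_seq_len] for i in range(0, len(token_ids), max_seq_len)]


def truncate(token_ids: List[int], max_seq_len: int) -> List[int]:
    """Possibility B: keep only the first max_seq_len tokens and drop the rest."""
    return token_ids[:max_seq_len]


# Stand-in for one tokenized 8k-token document (assumed values, not real data).
sample = list(range(8192))

chunks = split_into_chunks(sample, 2048)  # four training sequences, nothing lost
kept = truncate(sample, 2048)             # one training sequence, 6144 tokens lost
```

Under possibility A the long document contributes four sequences to training; under possibility B it contributes one and the remaining three quarters of the text are never seen.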


CoinCheung added the type/question label on Jul 12, 2024

