❓ The question
I generate training samples with dolma, and I found that some of the texts are very long, up to 8k tokens, but my max_seq_len is only 2k. In this case, will the OLMo dataset split the 8k sample into 4 parts (each 2k tokens long), or will only the first 2k tokens be kept and the rest dropped?
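To make the two behaviors I am asking about concrete, here is a minimal sketch in plain Python. It is not OLMo's or dolma's actual code: the function names `split_into_chunks` and `truncate` are hypothetical, and the token IDs are dummy integers standing in for a tokenized document.

```python
from typing import List


def split_into_chunks(tokens: List[int], max_seq_len: int) -> List[List[int]]:
    """Option A (hypothetical): cut the document into consecutive pieces
    of at most max_seq_len tokens, so no token is discarded."""
    return [tokens[i:i + max_seq_len] for i in range(0, len(tokens), max_seq_len)]


def truncate(tokens: List[int], max_seq_len: int) -> List[List[int]]:
    """Option B (hypothetical): keep only the first max_seq_len tokens
    and drop everything after them."""
    return [tokens[:max_seq_len]]


# Dummy 8k-token document; real token IDs would come from a tokenizer.
doc = list(range(8192))
max_seq_len = 2048

chunks = split_into_chunks(doc, max_seq_len)
kept = truncate(doc, max_seq_len)

print(len(chunks), [len(c) for c in chunks])
print(len(kept), len(kept[0]))
```

With option A the 8k document yields four 2k training instances; with option B it yields one, and the remaining 6k tokens are never seen in training.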