Datamixer (for SFT) #187

natolambert · 2024-07-10T21:58:54Z

Makes it so you can mix HF and local datasets by proportion of dataset or count of samples, configs like what we had:

dataset_name: allenai/tulu-v2-sft-mixture

Or fractional mixing:

dataset_mixer:
 allenai/tulu-v2-sft-mixture: 0.5
 HuggingFaceH4/no_robots: 0.8

Or count mixing:

dataset_mixer:
  allenai/tulu-v2-sft-mixture: 100000
  HuggingFaceH4/no_robots: 5000

Including local files:

dataset_mixer:
 allenai/tulu-v2-sft-mixture: 0.5
 HuggingFaceH4/no_robots: 0.8
 data/processed/tulu_v2/tulu_v2_filtered_data.jsonl: 0.1

* add yaml files for safety data * update max_seq_length * update max_seq_length

jacob-morrison

Overall LGTM, three questions/requests in addition to my handful of comments --

Are we outputting a single jsonl file with the mix created by the data mixer yet? Imo this is important for consistency
Is there an easy way to use this to create a mix, output a file, and not use it for training? Would be nice to be able to do that to create mixes for EasyLM/TPUs
Have we tested to make sure we get identical mixes if a seed is set?

configs/train_configs/sft/default.yaml

configs/train_configs/sft/olmo_7b_17_remix_sft.yaml

…to datamix

* add yaml files for safety data * update max_seq_length * update max_seq_length * update data paths --------- Co-authored-by: Nathan Lambert <[email protected]>

vwxyzjn

Overall LGTM

vwxyzjn · 2024-07-24T18:17:13Z

configs/train_configs/sft/olmo_7b_17_remix_sft.yaml

+  /net/nfs.cirrascale/mosaic/oe-safety-datasets/wildchat_lmsys_sexual/gpt4_lmsys_wildchat_dedup_50ksampled.jsonl: 16888
+  allenai/tulu-v2-sft-mixture: 326154


Maybe it's helpful to specify the percentage instead of the absolute value?

Percentage also works!

yizhongw

Lgtm in general. I would love to test it for training a real model, but maybe later.

yizhongw · 2024-07-24T18:32:19Z

open_instruct/utils.py

+                and "response" in dataset.column_names
+                and "messages" not in dataset.column_names
+            ):
+                dataset = dataset.map(query_response_to_messages, num_proc=10)


Probably raise an unsupported error if none is matched?

Most of the time it'll error if it doesn't match.

open_instruct/test_utils.py

natolambert and others added 15 commits July 10, 2024 21:23

up

e3bce19

init mixer

bb0ade5

init

d822d0e

add early and clearer check

4747a36

up

5b4be82

tests

8b01160

style

6a185fb

up

125391a

install from requirements

d9fbaa1

up

05cbdf5

force packaging install

f24eefa

up

d43adf1

fiddle

3630fb2

fiddle

3aef6b5

Add safety data to data mixer yaml files (#197)

78f3d47

* add yaml files for safety data * update max_seq_length * update max_seq_length

jacob-morrison requested changes Jul 22, 2024

View reviewed changes

configs/train_configs/sft/default.yaml Show resolved Hide resolved

configs/train_configs/sft/default.yaml Show resolved Hide resolved

configs/train_configs/sft/olmo_7b_17_remix_sft.yaml Show resolved Hide resolved

natolambert and others added 6 commits July 22, 2024 20:54

fixes

0cfeea4

Merge branch 'datamix' of https://github.com/allenai/open-instruct in…

7bff817

…to datamix

style

95f3b30

fixes

3d8c916

runs

c164735

Datamix safety: Tulu2Mix data added (#203)

ab30c68

* add yaml files for safety data * update max_seq_length * update max_seq_length * update data paths --------- Co-authored-by: Nathan Lambert <[email protected]>

vwxyzjn reviewed Jul 24, 2024

View reviewed changes

natolambert merged commit e9ff44b into main Jul 24, 2024
3 checks passed

yizhongw reviewed Jul 24, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Datamixer (for SFT) #187

Datamixer (for SFT) #187

natolambert commented Jul 10, 2024 •

edited

Loading

jacob-morrison left a comment

vwxyzjn left a comment

vwxyzjn Jul 24, 2024

natolambert Jul 24, 2024

yizhongw left a comment

yizhongw Jul 24, 2024

natolambert Jul 24, 2024

		/net/nfs.cirrascale/mosaic/oe-safety-datasets/wildchat_lmsys_sexual/gpt4_lmsys_wildchat_dedup_50ksampled.jsonl: 16888
		allenai/tulu-v2-sft-mixture: 326154

Datamixer (for SFT) #187

Datamixer (for SFT) #187

Conversation

natolambert commented Jul 10, 2024 • edited Loading

jacob-morrison left a comment

Choose a reason for hiding this comment

vwxyzjn left a comment

Choose a reason for hiding this comment

vwxyzjn Jul 24, 2024

Choose a reason for hiding this comment

natolambert Jul 24, 2024

Choose a reason for hiding this comment

yizhongw left a comment

Choose a reason for hiding this comment

yizhongw Jul 24, 2024

Choose a reason for hiding this comment

natolambert Jul 24, 2024

Choose a reason for hiding this comment

natolambert commented Jul 10, 2024 •

edited

Loading