Datamixer (for SFT) #187

Merged
merged 21 commits into from
Jul 24, 2024

Collaborator

@natolambert natolambert commented Jul 10, 2024

Makes it possible to mix HF and local datasets, either by proportion of each dataset or by count of samples. Configs like what we had:

dataset_name: allenai/tulu-v2-sft-mixture

Or fractional mixing:

dataset_mixer:
  allenai/tulu-v2-sft-mixture: 0.5
  HuggingFaceH4/no_robots: 0.8

Or count mixing:

dataset_mixer:
  allenai/tulu-v2-sft-mixture: 100000
  HuggingFaceH4/no_robots: 5000

Including local files:

dataset_mixer:
  allenai/tulu-v2-sft-mixture: 0.5
  HuggingFaceH4/no_robots: 0.8
  data/processed/tulu_v2/tulu_v2_filtered_data.jsonl: 0.1
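
As a rough illustration of how a single `dataset_mixer` value could be turned into a row count, here is a minimal sketch. It assumes that values greater than 1 are absolute sample counts and values from 0 to 1 are fractions of the source dataset. `resolve_sample_count` is a hypothetical helper written for this example, not the function added in this PR.

```python
def resolve_sample_count(dataset_size: int, frac_or_count: float) -> int:
    """Turn one dataset_mixer value into a number of rows to sample.

    Assumption (not confirmed by the PR): values > 1 are absolute counts,
    values in [0, 1] are fractions of the source dataset.
    """
    if frac_or_count < 0:
        raise ValueError("dataset_mixer values must be non-negative")
    if frac_or_count > 1:
        count = int(frac_or_count)
        if count > dataset_size:
            raise ValueError(
                f"requested {count} samples but the dataset has {dataset_size} rows"
            )
        return count
    # Fractional case: round down to a whole number of rows.
    return int(frac_or_count * dataset_size)


if __name__ == "__main__":
    # Made-up dataset sizes, used only to exercise both branches.
    print(resolve_sample_count(1000, 0.5))
    print(resolve_sample_count(9500, 5000))
```

Under this convention a value of exactly 1 means "the whole dataset", so a request for a single sample cannot be expressed; a real implementation may resolve that ambiguity differently.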

Contributor

@jacob-morrison jacob-morrison left a comment

Overall LGTM, three questions/requests in addition to my handful of comments --

  1. Are we outputting a single jsonl file with the mix created by the data mixer yet? Imo this is important for consistency
  2. Is there an easy way to use this to create a mix, output a file, and not use it for training? Would be nice to be able to do that to create mixes for EasyLM/TPUs
  3. Have we tested to make sure we get identical mixes if a seed is set?
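
One way the three requests above could fit together, as a minimal standard-library sketch: sample each source with a seeded generator, write the mix to a single JSONL file, and check that the same seed gives the same mix. The names `build_mix` and `write_jsonl` are hypothetical and the rows are plain dicts; this is not the code in this PR.

```python
import json
import random
import tempfile
from pathlib import Path


def build_mix(sources: dict, counts: dict, seed: int) -> list:
    """Sample counts[name] rows from each source, then shuffle the result.

    Sources are visited in sorted order so the output does not depend on
    dict insertion order, only on the data, the counts, and the seed.
    """
    mixed = []
    for name in sorted(sources):
        rng = random.Random(seed)
        mixed.extend(rng.sample(sources[name], counts[name]))
    random.Random(seed).shuffle(mixed)
    return mixed


def write_jsonl(rows: list, path) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


if __name__ == "__main__":
    # Toy data standing in for real datasets.
    sources = {
        "a": [{"id": i} for i in range(20)],
        "b": [{"id": 100 + i} for i in range(10)],
    }
    counts = {"a": 5, "b": 3}
    first = build_mix(sources, counts, seed=42)
    second = build_mix(sources, counts, seed=42)
    assert first == second  # same seed, same mix
    with tempfile.TemporaryDirectory() as tmp:
        write_jsonl(first, Path(tmp) / "mix.jsonl")
```

Writing the file is independent of training here, so the same functions could produce a mix for another training stack without launching a run.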

natolambert and others added 6 commits July 22, 2024 20:54
* add yaml files for safety data

* update max_seq_length

* update max_seq_length

* update data paths

---------

Co-authored-by: Nathan Lambert
Collaborator

@vwxyzjn vwxyzjn left a comment

Overall LGTM

Comment on lines +11 to +12
/net/nfs.cirrascale/mosaic/oe-safety-datasets/wildchat_lmsys_sexual/gpt4_lmsys_wildchat_dedup_50ksampled.jsonl: 16888
allenai/tulu-v2-sft-mixture: 326154
Collaborator

Maybe it's helpful to specify the percentage instead of the absolute value?

Collaborator Author

Percentage also works!

@natolambert natolambert merged commit e9ff44b into main Jul 24, 2024
3 checks passed
Contributor

@yizhongw yizhongw left a comment

Lgtm in general. I would love to test it for training a real model, but maybe later.

    and "response" in dataset.column_names
    and "messages" not in dataset.column_names
):
    dataset = dataset.map(query_response_to_messages, num_proc=10)
Contributor

Probably raise an unsupported error if none is matched?

Collaborator Author

Most of the time it'll error if it doesn't match.
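
For reference, an explicit check along the lines suggested above could look like the sketch below. `classify_columns` and the `convertible_sets` parameter are hypothetical, and the `("query", "response")` pair in the usage example is an assumption inferred from the helper name `query_response_to_messages`, since the full condition is not shown in the diff excerpt.

```python
def classify_columns(column_names, convertible_sets):
    """Decide how to handle a dataset based on its column names.

    Returns "messages" if the dataset is already in messages format,
    "convert" if it has every column of a known convertible format,
    and raises ValueError otherwise instead of failing later in training.
    """
    cols = set(column_names)
    if "messages" in cols:
        return "messages"
    for required in convertible_sets:
        if set(required) <= cols:
            return "convert"
    raise ValueError(f"Unsupported dataset columns: {sorted(cols)}")


if __name__ == "__main__":
    # Assumed column pair, for illustration only.
    known = [("query", "response")]
    print(classify_columns(["messages"], known))
    print(classify_columns(["query", "response"], known))
```

Failing at load time gives a clearer message than an error raised later when a missing column is first accessed.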
