Data custom collator #260

Merged
merged 20 commits into from
Jul 31, 2024

Conversation

Ssukriti
Collaborator

@Ssukriti Ssukriti commented Jul 23, 2024

Description of the change

Adds support for JSON datasets containing input/output fields. The input tokens are masked and the output tokens are left unmasked, and the two are concatenated into a single sequence.
This avoids DataCollatorForCompletionOnlyLM, which requires a response template. The response template is very error prone because it depends on the template actually being present in the text.
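
As a rough illustration of the masking described above, here is a minimal sketch. It is not the code from this PR: `toy_tokenize` and `build_masked_example` are hypothetical names, and a whitespace tokenizer stands in for a real one. The only outside fact it relies on is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so labels set to -100 do not contribute to the loss.

```python
from typing import Dict, List

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical whitespace tokenizer standing in for a real tokenizer."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_masked_example(
    input_text: str, output_text: str, vocab: Dict[str, int], eos_token_id: int
) -> Dict[str, List[int]]:
    """Concatenate input and output into one sequence and mask the input."""
    input_ids = toy_tokenize(input_text, vocab)
    # The EOS token is appended to the output so the model learns to stop.
    output_ids = toy_tokenize(output_text, vocab) + [eos_token_id]
    return {
        "input_ids": input_ids + output_ids,
        "attention_mask": [1] * (len(input_ids) + len(output_ids)),
        # Input positions are masked; only output positions carry a loss.
        "labels": [IGNORE_INDEX] * len(input_ids) + output_ids,
    }


vocab = {"<eos>": 0}
example = build_masked_example("Translate: hello", "bonjour", vocab, eos_token_id=0)
print(example["input_ids"])  # [1, 2, 3, 0]
print(example["labels"])     # [-100, -100, 3, 0]
```

Because the mask is derived from the token counts of the two fields, no response template has to be located inside the text.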

Lightly refactored https://github.com/foundation-model-stack/fms-hf-tuning/pull/166/files so that it can be extended to support pretokenized datasets that are already masked.

Related issue number

How to verify the PR

  1. Train using JSON input/output fields and leave dataset_text_field and response_template blank
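
For anyone trying this out, a small dataset in the new format could be produced along these lines. This is only a sketch: the key names `input` and `output` are assumed from the PR description, the records are made-up examples, and `write_jsonl` / `read_jsonl` are hypothetical helpers rather than functions from this repository.

```python
import json
import os
import tempfile

# Made-up records; the "input" and "output" key names are an assumption.
records = [
    {"input": "Tweet text: the service was great\nLabel:", "output": "no complaint"},
    {"input": "Tweet text: my order never arrived\nLabel:", "output": "complaint"},
]


def write_jsonl(rows, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


dataset_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(records, dataset_path)
loaded = read_jsonl(dataset_path)
print(len(loaded), sorted(loaded[0].keys()))
```

The resulting file would then be used as the training data, with dataset_text_field and response_template left blank as described in the step above.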

Was the PR tested

  1. Quality test below
  2. Unit tests
  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Ssukriti and others added 8 commits July 22, 2024 21:25
Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
@Ssukriti Ssukriti marked this pull request as ready for review July 25, 2024 22:22
@Ssukriti
Collaborator Author

Some quality results using both collators can be found here: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/621#issuecomment-86629029

@Ssukriti
Collaborator Author

@alex-jw-brooks the requested review changes are addressed, and I also added the EOS token.

Attached some quality results above.

Remaining TODOs:

  1. Update README
  2. I think the JSON file format is broken (JSONL works); to be checked
  3. Run more quality tests, if time permits, to have a further record

Ssukriti added 2 commits July 31, 2024 16:40
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
@Ssukriti
Collaborator Author

The PR is ready to be reviewed @alex-jw-brooks @anhuong. I have documented only the JSONL format for the new input/output fields and will extend it to JSON in a following PR.

Also, current functionality is retained and we are just adding support for a new data format. I think it is safe to merge, as one quality test looked good. We can continue to do quality testing before announcing the release.

@Ssukriti Ssukriti merged commit 3439a68 into main Jul 31, 2024
8 of 9 checks passed
@fabianlim
Collaborator

fabianlim commented Aug 1, 2024

@Ssukriti I think this PR is not accurately named. Shouldn't the changes be described as allowing custom processing by means of passing in the Jinja template? Collation usually refers to the step where the examples come out of the sampler and are formed into a single tensor to be passed out of the dataloader.
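
To make the distinction concrete, collation in the sense described above could look roughly like the following sketch. It is not code from this repository: `pad_collate` is a hypothetical function, and plain Python lists are used in place of tensors so the example has no dependencies. In a real dataloader the padded lists would be converted to tensors.

```python
from typing import Dict, List


def pad_collate(
    examples: List[Dict[str, List[int]]],
    pad_token_id: int,
    label_pad_id: int = -100,
) -> Dict[str, List[List[int]]]:
    """Pad already-processed examples to a common length and stack them."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for ex in examples:
        length = len(ex["input_ids"])
        pad = max_len - length
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * pad)
        # Padding positions get attention 0 and an ignored label.
        batch["attention_mask"].append([1] * length + [0] * pad)
        batch["labels"].append(ex["labels"] + [label_pad_id] * pad)
    return batch


# Examples as they might come out of the sampler, already tokenized and masked.
examples = [
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
    {"input_ids": [8, 9], "labels": [-100, 9]},
]
batch = pad_collate(examples, pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
```

Under this reading, formatting and masking each record happens before collation, and the collator only batches what it is given.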

anhuong pushed a commit to anhuong/fms-hf-tuning that referenced this pull request Aug 5, 2024
* refactor code to preprocess datasets

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* fix formatting

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* allow input/output in validate args

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* format input/output JSON and mask

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* function to return suitable collator

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* add tests for SFT Trainer input/output format

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* remove unused functions

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* add eos token to input/output format

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* fix tests

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* improve docstrings

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* keeping JSON keys constant

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* support for input/output format

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* formatting fixes

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* update README formats

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* formatting README

Signed-off-by: Sukriti-Sharma4 <[email protected]>

---------

Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>
anhuong added a commit that referenced this pull request Aug 14, 2024
* Set default value of target_modules to be None in LoraConfig

Signed-off-by: Will Johnson <[email protected]>

* Removal of transformers logger and addition of python logger

Signed-off-by: Abhishek <[email protected]>

* FMT and lint check: Removal of transformers logger and addition of python logger

Signed-off-by: Abhishek <[email protected]>

* fix: remove lm_head for granite with llama arch models (#258)

* initial code for deleting lm_head

Signed-off-by: Anh-Uong <[email protected]>

* fix logic for copying checkpoint

Signed-off-by: Anh-Uong <[email protected]>

* fix check that embed_tokens and lm_head weights are the same

Signed-off-by: Anh-Uong <[email protected]>

* fix warning assertion

Signed-off-by: Anh-Uong <[email protected]>

* fix lm_head check, remove test

Signed-off-by: Anh-Uong <[email protected]>

* small fixes from code review

Signed-off-by: Anh-Uong <[email protected]>

* fmt

Signed-off-by: Anh-Uong <[email protected]>

---------

Signed-off-by: Anh-Uong <[email protected]>
Co-authored-by: Anh-Uong <[email protected]>
Signed-off-by: Abhishek <[email protected]>

* Add config_utils tests

Signed-off-by: Angel Luu <[email protected]>

* Fix fmt

Signed-off-by: Angel Luu <[email protected]>

* Separate tests out and use docstrings

Signed-off-by: Angel Luu <[email protected]>

* Update more field/value checks from HF defaults

Signed-off-by: Angel Luu <[email protected]>

* Fix: Addition of env var TRANSFORMERS_VERBOSITY check

Signed-off-by: Abhishek <[email protected]>

* FMT Fix: Addition of env var TRANSFORMERS_VERBOSITY check

Signed-off-by: Abhishek <[email protected]>

* Add test for tokenizer in lora config (should be ignored)

Signed-off-by: Angel Luu <[email protected]>

* Adding logging support to accelerate launch

Signed-off-by: Abhishek <[email protected]>

* FMT_FIX: Adding logging support to accelerate launch

Signed-off-by: Abhishek <[email protected]>

* bug: On save event added to callback (#256)

* feat: On save event added to callback

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Removed additional bracket

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Removed additional bracket

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Format issues resolved

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: rebase with upstream and add new line

Signed-off-by: Mehant Kammakomati <[email protected]>

---------

Signed-off-by: Padmanabha V Seshadri <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
Co-authored-by: Mehant Kammakomati <[email protected]>

* feat: All metric handling changes (#263)

* feat: All metric handling changes

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Format issues

Signed-off-by: Padmanabha V Seshadri <[email protected]>

---------

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* feat: Configuration to set logging level for trigger log (#241)

* feat: Added the triggered login in the operation

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Formatting issues

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Added default config

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Moved the variable to right scope

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Checked added to validate config log level

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* fix: Removed some unwanted log file

Signed-off-by: Padmanabha V Seshadri <[email protected]>

---------

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* limit peft deps until investigate (#274)

Signed-off-by: Anh-Uong <[email protected]>

* Data custom collator (#260)

* refactor code to preprocess datasets

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* fix formatting

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* allow input/output in validate args

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* format input/output JSON and mask

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* function to return suitable collator

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* add tests for SFT Trainer input/output format

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* remove unused functions

Co-authored-by: Alex-Brooks <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>

* add eos token to input/output format

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* fix tests

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* improve docstrings

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* keeping JSON keys constant

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* support for input/output format

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* formatting fixes

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* update README formats

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* formatting README

Signed-off-by: Sukriti-Sharma4 <[email protected]>

---------

Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>

* Revert "limit peft deps until investigate (#274)" (#275)

This reverts commit f57ff63.

Signed-off-by: Anh-Uong <[email protected]>

* feat: per process state metric (#239)

Signed-off-by: Harikrishnan Balagopal <[email protected]>

* Modify test to pass with target_modules: None

Signed-off-by: Will Johnson <[email protected]>

* Logging changes and unit tests added

Signed-off-by: Abhishek <[email protected]>

* feat: Add a dockerfile argument to enable aimstack (#261)

* Add a dockerfile argument at the end of final layer to enable aimstack.
Currently guarded by a dockerfile argument.

Signed-off-by: Dushyant Behl <[email protected]>

* Set the default value of ENABLE_AIM to false

Signed-off-by: Dushyant Behl <[email protected]>

---------

Signed-off-by: Dushyant Behl <[email protected]>

* Solved conflict with main

Signed-off-by: Abhishek <[email protected]>

* FMT:Fix Solved conflict with main

Signed-off-by: Abhishek <[email protected]>

* enabling tests for prompt tuning

Signed-off-by: Abhishek <[email protected]>

* feat: Support pretokenized (#272)

* feat: support pretokenized datasets

Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: rebase with upstream and review commits

Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: rebase with upstream and review commits

Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: rebase with upstream and review commits

Signed-off-by: Mehant Kammakomati <[email protected]>

* consolidate collator code

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* add ValueErrors for incorrect args

Signed-off-by: Sukriti-Sharma4 <[email protected]>

* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <[email protected]>

* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <[email protected]>

* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <[email protected]>

* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <[email protected]>

---------

Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>

* Update packaging requirement from <24,>=23.2 to >=23.2,<25 (#212)

Updates the requirements on [packaging](https://github.com/pypa/packaging) to permit the latest version.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](pypa/packaging@23.2...24.1)

---
updated-dependencies:
- dependency-name: packaging
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Anh Uong <[email protected]>

* enabling tests for prompt tuning (#278)

Signed-off-by: Abhishek <[email protected]>
Co-authored-by: Anh Uong <[email protected]>

* fix: do not add special tokens for custom tokenizer (#279)

Signed-off-by: Mehant Kammakomati <[email protected]>

* PR changes for changing logger

Signed-off-by: Abhishek <[email protected]>

* fix: bug where the logger was not being used properly (#286)

Signed-off-by: Hari <[email protected]>

* Unit Tests changes

Signed-off-by: Abhishek <[email protected]>

* Add functionality to free disk space from Github Actions (#287)

* Add functionality to free disk space from Github Actions

Signed-off-by: Will Johnson <[email protected]>

* Add functionality to free disk space from Github Actions, relocate from build-and-publish.yaml to image.yaml

Signed-off-by: Will Johnson <[email protected]>

* Move freeing space step to before building image

Signed-off-by: Will Johnson <[email protected]>

---------

Signed-off-by: Will Johnson <[email protected]>

* commented os.environ[LOG_LEVEL] in accelerate.py for testing

Signed-off-by: Abhishek <[email protected]>

* PR changes

Signed-off-by: Abhishek <[email protected]>

* FIX:FMT

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* Add unit test to verify target_modules defaults correctly (#281)

* Add unit test to verify target_modules defaults correctly

Signed-off-by: Will Johnson <[email protected]>

* Add sft_trainer.main test to ensure target modules properly default for LoRA when set to None from CLI

Signed-off-by: Will Johnson <[email protected]>

* fmt

Signed-off-by: Will Johnson <[email protected]>

* Use model_args instead of importing, fix nits

Signed-off-by: Will Johnson <[email protected]>

* Add test to ensure target_modules defaults to None in job config

Signed-off-by: Will Johnson <[email protected]>

* Add additional check, fix nits

Signed-off-by: Will Johnson <[email protected]>

---------

Signed-off-by: Will Johnson <[email protected]>

* docs: Add documentation on experiment tracking. (#257)

Signed-off-by: Dushyant Behl <[email protected]>

* Ensure additional metadata to trackers don't throw error in happy case. (#290)

Signed-off-by: Dushyant Behl <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* fix multiple runid creation bug with accelerate. (#268)

Signed-off-by: Dushyant Behl <[email protected]>

* feat: logging control operation (#264)

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* Metrics file epoch indexing from 0

Signed-off-by: Abhishek <[email protected]>

* Revert last commit

Signed-off-by: Abhishek <[email protected]>

* fix run evaluation to get base model path (#273)

Signed-off-by: Anh-Uong <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* feat: Added additional events such as on_step_begin, on_optimizer_step, on_substep_end (#293)

Signed-off-by: Padmanabha V Seshadri <[email protected]>

* Always update setuptools to latest (#288)

Signed-off-by: James Busche <[email protected]>
Co-authored-by: Anh Uong <[email protected]>

* Rename all fixtures with correct .jsonl extension (#295)

Signed-off-by: Will Johnson <[email protected]>
Co-authored-by: Anh Uong <[email protected]>

* feat: add save_model_dir flag where final checkpoint saved (#291)

* add save_model_dir flag for final checkpoint

Signed-off-by: Anh-Uong <[email protected]>

* remove output_dir logic, add save method

Signed-off-by: Anh-Uong <[email protected]>

* update accelerate_launch, remove save tokenizer

Signed-off-by: Anh-Uong <[email protected]>

* fix: put back creation of .complete file

Signed-off-by: Anh-Uong <[email protected]>

* fix failing tests and add new ones

Signed-off-by: Anh-Uong <[email protected]>

* tests: add sft_trainer test to train and save

- small refactor of tests

Signed-off-by: Anh-Uong <[email protected]>

* add docs on saving checkpoints and fix help msg

Signed-off-by: Anh-Uong <[email protected]>

* update example and note best checkpoint

Signed-off-by: Anh-Uong <[email protected]>

* changes based on PR review

Signed-off-by: Anh-Uong <[email protected]>

* add logging to save, fix error out properly

Signed-off-by: Anh-Uong <[email protected]>

---------

Signed-off-by: Anh-Uong <[email protected]>

---------

Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Abhishek <[email protected]>
Signed-off-by: Anh-Uong <[email protected]>
Signed-off-by: Angel Luu <[email protected]>
Signed-off-by: Padmanabha V Seshadri <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Harikrishnan Balagopal <[email protected]>
Signed-off-by: Dushyant Behl <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Hari <[email protected]>
Signed-off-by: James Busche <[email protected]>
Co-authored-by: Abhishek <[email protected]>
Co-authored-by: Sukriti Sharma <[email protected]>
Co-authored-by: Anh-Uong <[email protected]>
Co-authored-by: Abhishek Maurya <[email protected]>
Co-authored-by: Angel Luu <[email protected]>
Co-authored-by: Angel Luu <[email protected]>
Co-authored-by: Padmanabha V Seshadri <[email protected]>
Co-authored-by: Mehant Kammakomati <[email protected]>
Co-authored-by: Alex-Brooks <[email protected]>
Co-authored-by: Hari <[email protected]>
Co-authored-by: Dushyant Behl <[email protected]>
Co-authored-by: Sukriti-Sharma4 <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: James Busche <[email protected]>