Models: Add Phi-3.1-mini-128k-instruct #2790

Open · wants to merge 4 commits into main
Conversation


@ThiloteE ThiloteE commented Aug 3, 2024

Resolves #2668

Describe your changes

Adds model support for Phi-3.1-mini-128k-instruct (https://huggingface.co/GPT4All-Community/Phi-3-mini-128k-instruct)

Description of Model

As of this writing, the model shows strong benchmark results for its parameter size. It claims to support a context window of up to 128K tokens.

  • The model was trained/finetuned on English
  • License: MIT

Note: The name "Phi-3.1" does not appear anywhere in the original repository on Hugging Face. However, Microsoft uploaded a new version in July without changing the model name, and publishers of model quants have largely followed the ggml naming convention and renamed their quants from 3.0 to 3.1. That convention would also call for adding the parameter count to the model name, but we refrain from doing so to stay as consistent as possible with the original repository and other quant publishers.

Personal Impression:

For 3.8 billion parameters, the model produces reasonable output. It converses well and follows instructions. I held a conversation of roughly 24k characters, and even at that context length it still answered "what is 2x2?" correctly, although responses understandably degrade somewhat at that size. In general, the model tends to be verbose. I have seen refusals when it was tasked with certain things, and it seems to be finetuned with a particular alignment. Its long context and quality of responses make it a good model, if you can bear its alignment or your use case falls within the model's originally intended uses. It will mainly appeal to English-speaking users.

Critique:

  • This model architecture is not supported by the Vulkan backend (but since the model is so small, CPU inference is acceptable).
  • This model does not use Grouped Query Attention (GQA), which means other models that do support GQA may need less RAM/VRAM for the same number of tokens in the context window. It has been claimed that llama-3-8b (which supports GQA) needs less RAM beyond a certain point (~8k context); see the rough comparison after this list.
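
To make the GQA point concrete, here is a rough back-of-the-envelope KV-cache comparison at fp16. The layer/head counts are taken from the published model configs (and may differ for context-extended variants or specific quants), and the formula ignores per-backend overhead, so treat the numbers as an estimate only:

```cpp
// Rough KV-cache size estimate: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
// The model parameters below are assumptions taken from the public configs, not measured values.
#include <cstdio>

static double kv_cache_gib(int n_layers, int n_kv_heads, int head_dim, int n_ctx,
                           int bytes_per_elem = 2 /* fp16 */) {
    double bytes = 2.0 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    const int n_ctx = 8192;
    // Phi-3-mini: 32 layers, 32 KV heads (no GQA), head_dim 96  -> ~3.0 GiB
    std::printf("Phi-3-mini @ 8k ctx: ~%.1f GiB\n", kv_cache_gib(32, 32, 96, n_ctx));
    // Llama-3-8B: 32 layers,  8 KV heads (GQA),   head_dim 128  -> ~1.0 GiB
    std::printf("Llama-3-8B @ 8k ctx: ~%.1f GiB\n", kv_cache_gib(32, 8, 128, n_ctx));
    return 0;
}
```

So even though Llama-3-8B has more than twice the parameters, its KV cache at 8k context is roughly a third of Phi-3-mini's, which is why its total memory use can fall below Phi-3's once the conversation gets long enough.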

Motivation for this pull-request

  • Other quants uploaded to Hugging Face that are accessible via GPT4All's search feature have tokenizer EOS issues.
  • The model is small and fits into 3 GB of VRAM or 4 GB of RAM respectively (I set 8 GB of RAM as the minimum, since the operating system and other apps also need some).
  • The model claims long context and it delivers (although with high RAM usage in longer conversations).
  • AFAIK, apart from the Qwen1.5 and Qwen2 model families, this is the only general-purpose model family below 4B parameters that delivers such a large context window and is also compatible with GPT4All.
  • For its size, it ranks highly on the Hugging Face Open LLM Leaderboard.
  • Made by Microsoft, the model has name recognition.
  • Users were asking for this model.

Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] I have added thorough documentation for my code.
- [x] I have tagged PR with relevant project labels. I acknowledge that a PR without labels may be dismissed.
- [ ] If this PR addresses a bug, I have provided both a screenshot/video of the original bug and the working solution.


ThiloteE commented Aug 3, 2024

Relevant upstream issues and PRs (these should not block merging this PR):

Initial issue calling for support of this model architecture in llama.cpp:

ggerganov/llama.cpp#6849

PRs in llama.cpp:

Initial model support: ggerganov/llama.cpp#7225
Fix phi3 conversion: ggerganov/llama.cpp#8262

Discussion about Longrope:

  • Lack of Longrope support in llama.cpp

Tokenizer issues


ThiloteE commented Aug 3, 2024

Benchmarks:

[benchmark results screenshot]

ThiloteE changed the title from "Models: Add Phi-3-mini-128k-instruct" to "Models: Add Phi-3.1-mini-128k-instruct" on Aug 4, 2024

ThiloteE commented Aug 4, 2024

To support this model in the Compute/Vulkan backend, at the very least it would need to be whitelisted here: https://github.com/nomic-ai/llama.cpp/blob/add387854ea73d83770a62282089dea666fa266f/src/llama.cpp#L7771

I have not tested what happens if that is done.
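
For illustration only, "whitelisting" in a GPU backend usually boils down to adding the architecture to a supported-set check along these lines. The enum values, function name, and the exact set of architectures below are assumptions for the sketch, not code copied from the linked line:

```cpp
#include <cstdio>

// Illustrative sketch only: the real enum and check live inside llama.cpp.
enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_FALCON, LLM_ARCH_PHI3, LLM_ARCH_UNKNOWN };

static bool backend_supports_arch(llm_arch arch) {
    switch (arch) {
        case LLM_ARCH_LLAMA:
        case LLM_ARCH_FALCON:
        case LLM_ARCH_PHI3:   // what "whitelisting" Phi-3 would roughly amount to
            return true;
        default:
            return false;     // unsupported architectures fall back to CPU inference
    }
}

int main() {
    std::printf("Phi-3 on GPU backend: %s\n", backend_supports_arch(LLM_ARCH_PHI3) ? "yes" : "no");
    return 0;
}
```

Even with the architecture whitelisted, whether Phi-3 then actually runs correctly on that backend would still need testing, as noted above.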

@ThiloteE ThiloteE marked this pull request as ready for review August 4, 2024 11:21
Commit: Had forgotten to include "GGUF" in the model repository name on Hugging Face.

Commit: Change name of model to GPT4All standard.
@cosmic-snow

I've checked the following fields in the entry: url, md5sum, filesize, quant and type -- they're all fine.

Without seeing it in the GUI, the description field's HTML seems fine, too.

I did some small tests with the listed system prompt and prompt template and it doesn't look like there are any obvious problems with those. 👍


3Simplex commented Aug 5, 2024

@ThiloteE, do you still have the instructions for the changes needed to directly test the new models.json? Would you post them here for us? I remember we had to uncomment something; I would have to look into the code to figure out what.


ThiloteE commented Aug 5, 2024

Replace `//#define USE_LOCAL_MODELSJSON` with `#define USE_LOCAL_MODELSJSON`.
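
For context, here is a minimal sketch of how such a compile-time switch is typically wired up so a locally edited models.json can be tested before the hosted list is updated. The helper function, local path, and URL below are illustrative placeholders, not necessarily what gpt4all-chat actually uses:

```cpp
#include <QDebug>
#include <QDir>
#include <QUrl>

// Uncommented for local testing; normally this line stays commented out in the source.
#define USE_LOCAL_MODELSJSON

// Hypothetical helper: decides where the model list is read from.
static QUrl modelsJsonUrl() {
#ifdef USE_LOCAL_MODELSJSON
    // Read models.json from the local checkout so edits can be tested directly (path is a placeholder).
    return QUrl::fromLocalFile(QDir::homePath() + "/gpt4all/gpt4all-chat/metadata/models.json");
#else
    // Normal case: fetch the officially published model list (URL is a placeholder).
    return QUrl(QStringLiteral("https://gpt4all.io/models/models3.json"));
#endif
}

int main() {
    qDebug() << "model list source:" << modelsJsonUrl();
    return 0;
}
```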

Successfully merging this pull request may close these issues: [Feature] Upgrade llama.cpp to support Phi-3-mini-128k-instruct and IBM Granite