
Gguf dump start data offset via --data-offset and some extra refactor #8054

Merged
merged 5 commits into from
Jun 25, 2024

Conversation

mofosyne
Collaborator

This PR investigates gg's suggested approach of hashing via a piped sha256sum #8048 (comment)

With the C gguf-hash program PR I got:

$ ~/gitextern/llama.cpp/build/bin/llama-gguf-hash --sha1 phi-2.Q6_K.gguf
sha1    32ea6e22a0c63beef6ce2ba15471689b8144b39c  phi-2.Q6_K.gguf

Now with this PR that adds --data-offset and --data-alignment I got:

$ ~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf
1806176

$ ~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-alignment phi-2.Q6_K.gguf
32

# GG's initial suggestion (Very very very slow)
$ dd bs=1 skip=1806176 if=phi-2.Q6_K.gguf | pv | sha1sum
... takes forever...

# Manually adjusted skip and bs parameter to speed up dd by taking advantage of alignment
$ dd bs=32 skip=56443 if=phi-2.Q6_K.gguf | pv | sha1sum
71351680+0 records in
71351680+0 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 91.1693 s, 25.0 MB/s
32ea6e22a0c63beef6ce2ba15471689b8144b39c  -
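The manually adjusted invocation above works because the data offset is an exact multiple of the alignment, so dd can skip whole alignment-sized blocks. A minimal sketch of that arithmetic, using the offset/alignment values from this example (the helper name is made up for illustration):

```python
def dd_params(data_offset: int, alignment: int):
    """Pick dd's bs/skip so that exactly data_offset bytes are skipped."""
    assert data_offset % alignment == 0, "offset must be a multiple of the alignment"
    return alignment, data_offset // alignment

bs, skip = dd_params(1806176, 32)
print(bs, skip)  # 32 56443
```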

# Faster by skipping the entire block sized to data offset (works because of alignment?)
$ dd bs=1806176 skip=1 if=phi-2.Q6_K.gguf | sha1sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 4.32419 s, 528 MB/s
32ea6e22a0c63beef6ce2ba15471689b8144b39c  -

So it appears that this approach, while not very fast, is still somewhat workable?

In my opinion, this approach assumes that all tensors are aligned, that there is no padding between tensors in the data area, and that the GGUF file format will not in the future add non-tensor data at the end of the file. My original C approach appears to be much faster, and more robust to future GGUF format development.
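For comparison, seeking past the header in-process avoids the dd pipe entirely. This is only a sketch of the idea, not the actual llama-gguf-hash code:

```python
import hashlib

def hash_from_offset(path: str, offset: int, algo: str = "sha1") -> str:
    """Hash a file's contents starting at a byte offset."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        f.seek(offset)  # skip the GGUF header + metadata in one step
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. hash_from_offset("phi-2.Q6_K.gguf", 1806176) should match the
# `dd ... | sha1sum` output above
```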

Anyway, we may still want to merge this in, since these two dump switches are useful on their own.

@mofosyne mofosyne marked this pull request as draft June 21, 2024 11:26
@github-actions github-actions bot added the python python script changes label Jun 21, 2024
@mofosyne
Collaborator Author

Timing

$:~/Documents/LLMmodel/gguf$ time dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf | sha1sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 4.32916 s, 527 MB/s
32ea6e22a0c63beef6ce2ba15471689b8144b39c  -

real	0m7.200s
user	0m6.797s
sys	0m1.326s

$:~/Documents/LLMmodel/gguf$ time dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf | sha256sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 9.95004 s, 229 MB/s
8b5eea25e2946b05e345dc0e1dea191968bd2ebc6a15cb321085391dc89d9692  -

real	0m13.016s
user	0m12.744s
sys	0m1.509s

@mofosyne mofosyne marked this pull request as ready for review June 21, 2024 12:40
@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Jun 21, 2024
Collaborator

@Galunid Galunid left a comment


I think it looks mostly alright, except for a lot of "Tensor Info" comments that don't really serve any purpose, for example:

# Tensor Info Fields
offs, tensors_fields = self._build_tensors_info_fields(offs, tensor_count)

@mofosyne
Collaborator Author

@Galunid added because that section was just hard for me to mentally process. But if it's not really an issue then I can remove them anyway. Another alternative may be to add spacing in place of the comments, because perhaps it's the lack of separation between semantic blocks?

Owner

@ggerganov ggerganov left a comment


Some minor naming suggestions - feel free to ignore

@mofosyne mofosyne force-pushed the gguf-dump-start-data-offset branch from 86040f8 to beb8023 Compare June 24, 2024 08:37
@mofosyne
Collaborator Author

Applied GG's suggestions and also adjusted the tensor info comments to be a little less redundant.

Then manually checked that both new commands work.

$ ~/llama.cpp/gguf-py/scripts/gguf-dump.py test.gguf --data-offset
1216
$ ~/llama.cpp/gguf-py/scripts/gguf-dump.py test.gguf --data-alignment
32

Will merge as soon as CI clears

@mofosyne mofosyne added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Jun 24, 2024
@mofosyne mofosyne force-pushed the gguf-dump-start-data-offset branch from beb8023 to eb1c225 Compare June 24, 2024 10:45
@compilade
Collaborator

compilade commented Jun 24, 2024

It seems tail -c +$((offset + 1)) some-model.gguf can also be used instead of dd bs=${offset} skip=1 if=some-model.gguf

From the tail man page:

       -c, --bytes=[+]NUM
              output the last NUM bytes; or use -c +NUM to output starting with byte NUM of each file

Adding one to the offset is necessary though, in this case, because tail starts counting bytes from 1.
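The 1-based indexing can be sanity-checked in a few lines, with a toy buffer standing in for a real .gguf file:

```python
# `dd bs=$off skip=1` drops the first $off bytes; `tail -c +N` starts at
# byte N counted from 1, so N = off + 1 selects the same suffix.
data = b"HEADER" + b"TENSORDATA"
off = 6
dd_like = data[off:]              # what dd emits after skipping one off-sized block
tail_like = data[(off + 1) - 1:]  # tail -c +$((off + 1)), converted to 0-based
assert dd_like == tail_like == b"TENSORDATA"
```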

Collaborator

@Galunid Galunid left a comment


@Galunid added because that section was just hard for me to mentally process. But if it's not really an issue then I can remove them anyway. Another alternative may be to add spacing in place of the comments, because perhaps it's the lack of separation between semantic blocks?

I totally agree about new lines to mark different "sections" of the code. I think comments in general should provide more insight into the code, and I feel some of the comments in this PR don't do that, especially the ones in _get_tensor_info_field. I think a better approach would be to change the variable names ;)

Maybe I'm a bit too pedantic though, so feel free to disagree and merge.

@mofosyne
Collaborator Author

mofosyne commented Jun 25, 2024

I feel your point. I think the philosophy I have with comments is that while you shouldn't describe every action, you should at least describe the intent as succinctly as possible. Hence the commenting style.

So in this case it's a bit of a clash of philosophies, but I think I've been as reasonable as possible in minimizing description, instead using the comments as 'headlines' so people scanning downwards can find the intent of the following lines as quickly as possible (at the cost of a few more lines).

Appreciate your feedback regardless, as it's good to have pushback against verbosity. @Galunid

@mofosyne mofosyne merged commit c8ad359 into ggerganov:master Jun 25, 2024
17 checks passed
@mofosyne mofosyne deleted the gguf-dump-start-data-offset branch June 25, 2024 12:03
@mofosyne
Collaborator Author

@compilade sounds nifty. Not too sure where I'll put that tip though, but hopefully it'll be obvious enough, with people discovering these new options and seeing other devs use the head/tail commands in the codebase along the way.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
…ggerganov#8054)

* gguf-dump: add --data-offset

* gguf-dump: add tensor data offset table

* gguf-dump: refactor GGUFReader for clarity

* gguf-dump: add --data-alignment

* gguf-dump.py: Rename variables and adjust comments

start_data_offset --> data_offset

_build_tensors_info_fields --> _build_tensor_info
MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024