
Gguf dump start data offset via --data-offset and some extra refactor #8054

Merged
merged 5 commits into from
Jun 25, 2024

Conversation

mofosyne
Collaborator

This PR investigates gg's suggested approach of hashing via a piped sha256sum #8048 (comment)

With the C gguf-hash program PR I got:

$ ~/gitextern/llama.cpp/build/bin/llama-gguf-hash --sha1 phi-2.Q6_K.gguf
sha1    32ea6e22a0c63beef6ce2ba15471689b8144b39c  phi-2.Q6_K.gguf

Now with this PR that adds --data-offset and --data-alignment I got:

$ ~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf
1806176

$ ~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-alignment phi-2.Q6_K.gguf
32

# GG's initial suggestion (Very very very slow)
$ dd bs=1 skip=1806176 if=phi-2.Q6_K.gguf | pv | sha1sum
... takes forever...

# Manually adjusted skip and bs parameter to speed up dd by taking advantage of alignment
$ dd bs=32 skip=56443 if=phi-2.Q6_K.gguf | pv | sha1sum
71351680+0 records in
71351680+0 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 91.1693 s, 25.0 MB/s
32ea6e22a0c63beef6ce2ba15471689b8144b39c  -
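The manually adjusted invocation above works because the data offset is an exact multiple of the alignment, so dd can skip whole alignment-sized blocks. A minimal sketch of that arithmetic, using the offset/alignment values from this example (the helper name is made up for illustration):

```python
def dd_params(data_offset: int, alignment: int):
    """Pick dd's bs/skip so that exactly data_offset bytes are skipped."""
    assert data_offset % alignment == 0, "offset must be a multiple of the alignment"
    return alignment, data_offset // alignment

bs, skip = dd_params(1806176, 32)
print(bs, skip)  # 32 56443
```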

# Faster by skipping the entire block sized to data offset (works because of alignment?)
$ dd bs=1806176 skip=1 if=phi-2.Q6_K.gguf | sha1sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 4.32419 s, 528 MB/s
32ea6e22a0c63beef6ce2ba15471689b8144b39c  -

So it appears that this approach, while not very fast, is still somewhat workable?

In my opinion, this approach assumes that all tensors are aligned, that there is no padding between tensors in the data area, and that the GGUF file format will not in the future add non-tensor data at the end of the file. My original C approach appears to be much faster, and more robust to future GGUF format development.
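For comparison, seeking past the header in-process avoids the dd pipe entirely. This is only a sketch of the idea, not the actual llama-gguf-hash code:

```python
import hashlib

def hash_from_offset(path: str, offset: int, algo: str = "sha1") -> str:
    """Hash a file's contents starting at a byte offset."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        f.seek(offset)  # skip the GGUF header + metadata in one step
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. hash_from_offset("phi-2.Q6_K.gguf", 1806176) should match the
# `dd ... | sha1sum` output above
```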

Anyway, we may still want to merge this in, since these two dump switches are useful on their own.

@mofosyne mofosyne marked this pull request as draft June 21, 2024 11:26
@github-actions github-actions bot added the python python script changes label Jun 21, 2024
@mofosyne
Collaborator Author

Timing

$:~/Documents/LLMmodel/gguf$ time dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf | sha1sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 4.32916 s, 527 MB/s
32ea6e22a0c63beef6ce2ba15471689b8144b39c  -

real	0m7.200s
user	0m6.797s
sys	0m1.326s

$:~/Documents/LLMmodel/gguf$ time dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf | sha256sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 9.95004 s, 229 MB/s
8b5eea25e2946b05e345dc0e1dea191968bd2ebc6a15cb321085391dc89d9692  -

real	0m13.016s
user	0m12.744s
sys	0m1.509s

@mofosyne mofosyne marked this pull request as ready for review June 21, 2024 12:40
@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Jun 21, 2024
Collaborator

@Galunid Galunid left a comment


I think it looks mostly alright, except for a lot of "Tensor Info" comments that don't really serve any purpose, for example:

# Tensor Info Fields
offs, tensors_fields = self._build_tensors_info_fields(offs, tensor_count)

@mofosyne
Collaborator Author

@Galunid added because that section was just hard for me to mentally process. But if it's not really an issue then I can remove them anyway. Another alternative may be to add spacing in place of the comments, because perhaps it's the lack of separation between semantic blocks?

Owner

@ggerganov ggerganov left a comment


Some minor naming suggestions - feel free to ignore

@mofosyne mofosyne force-pushed the gguf-dump-start-data-offset branch from 86040f8 to beb8023 Compare June 24, 2024 08:37
@mofosyne
Collaborator Author

Applied GG's suggestions and also adjusted the tensor info comments to be a little less redundant.

Then manually checked that both new commands work.

$ ~/llama.cpp/gguf-py/scripts/gguf-dump.py test.gguf --data-offset
1216
$ ~/llama.cpp/gguf-py/scripts/gguf-dump.py test.gguf --data-alignment
32

Will merge as soon as CI clears

@mofosyne mofosyne added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Jun 24, 2024
@mofosyne mofosyne force-pushed the gguf-dump-start-data-offset branch from beb8023 to eb1c225 Compare June 24, 2024 10:45
@compilade
Collaborator

compilade commented Jun 24, 2024

It seems tail -c +$((offset + 1)) some-model.gguf can also be used instead of dd bs=${offset} skip=1 if=some-model.gguf

From the tail man page:

       -c, --bytes=[+]NUM
              output the last NUM bytes; or use -c +NUM to output starting with byte NUM of each file

Adding one to the offset is necessary though, in this case, because tail starts counting bytes from 1.
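The 1-based indexing can be sanity-checked in a few lines, with a toy buffer standing in for a real .gguf file:

```python
# `dd bs=$off skip=1` drops the first $off bytes; `tail -c +N` starts at
# byte N counted from 1, so N = off + 1 selects the same suffix.
data = b"HEADER" + b"TENSORDATA"
off = 6
dd_like = data[off:]              # what dd emits after skipping one off-sized block
tail_like = data[(off + 1) - 1:]  # tail -c +$((off + 1)), converted to 0-based
assert dd_like == tail_like == b"TENSORDATA"
```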

Collaborator

@Galunid Galunid left a comment


@Galunid added because that section was just hard for me to mentally process. But if it's not really an issue then I can remove them anyway. Another alternative may be to add spacing in place of the comments, because perhaps it's the lack of separation between semantic blocks?

I totally agree about new lines to mark different "sections" of the code. I think comments in general should provide more insight into the code, and I feel some of the comments in this PR don't do that, especially the ones in _get_tensor_info_field. I think a better approach would be to change the variable names ;)

Maybe I'm a bit too pedantic though, so feel free to disagree and merge.

@mofosyne
Collaborator Author

mofosyne commented Jun 25, 2024

I feel your point. I think the philosophy I have with comments is that while you shouldn't describe every action, you should at least describe the intent as succinctly as possible. Hence the commenting style.

So in this case it's a bit of a clash of philosophies, but I think I've been as reasonable as possible in minimizing description, instead using the comments as 'headlines' so people scanning downwards can find the intent of the following lines as quickly as possible (at the cost of a few more lines).

Appreciate your feedback regardless, as it's good to have pushback against verbosity. @Galunid

@mofosyne mofosyne merged commit c8ad359 into ggerganov:master Jun 25, 2024
17 checks passed
@mofosyne mofosyne deleted the gguf-dump-start-data-offset branch June 25, 2024 12:03
@mofosyne
Collaborator Author

@compilade sounds nifty. Not too sure where I'll put that tip though, but hopefully it'll be obvious enough, with people discovering these new options and seeing other devs use the head/tail commands in the codebase along the way.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
…ggerganov#8054)

* gguf-dump: add --data-offset

* gguf-dump: add tensor data offset table

* gguf-dump: refactor GGUFReader for clarity

* gguf-dump: add --data-alignment

* gguf-dump.py: Rename variables and adjust comments

start_data_offset --> data_offset

_build_tensors_info_fields --> _build_tensor_info
MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024