cmd/geth: implement data import and export #22931
Conversation
A few other thoughts:
- Would be good to have a progress report every 8 seconds or so.
- What happens if the user presses ctrl-c? Does it exit gracefully or just croak? (a rough sketch covering this and the progress report follows below)
- Since these are long-running operations, it might be neat if the user can press ctrl-c, exit in an orderly way, and then restart it later from where it left off. Perhaps using startKey or something.
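For illustration, a minimal sketch (not the PR's actual code) of how the progress report and graceful ctrl-c handling could be wired up; the eight-second interval, the package name and the work callback are assumptions:

    package export

    import (
    	"log"
    	"os"
    	"os/signal"
    	"time"
    )

    // exportLoop is a hypothetical driver: it reports progress every 8 seconds
    // and stops cleanly when the user presses ctrl-c.
    func exportLoop(work func() (done bool)) {
    	interrupt := make(chan os.Signal, 1)
    	signal.Notify(interrupt, os.Interrupt)

    	var (
    		count  int
    		logged = time.Now()
    	)
    	for {
    		select {
    		case <-interrupt:
    			log.Printf("Interrupted, wrote %d entries so far", count)
    			return
    		default:
    		}
    		if work() {
    			log.Printf("Done, wrote %d entries", count)
    			return
    		}
    		count++
    		if time.Since(logged) > 8*time.Second {
    			log.Printf("Export in progress, %d entries so far", count)
    			logged = time.Now()
    		}
    	}
    }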
cmd/utils/cmd.go (outdated)

    }
    defer fh.Close()

    var reader io.Reader = fh
Wrapping it in a buffered reader might be a good idea
^ still a good idea :)
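A minimal sketch of the buffered-reader suggestion, assuming fh is the plain *os.File shown in the snippet above (the helper name and package are made up):

    package export

    import (
    	"bufio"
    	"io"
    	"os"
    )

    // openDump opens the dump file and wraps it in a buffered reader, so the
    // many small reads during decoding don't each hit the OS directly.
    // The caller is responsible for closing the returned file.
    func openDump(path string) (io.Reader, *os.File, error) {
    	fh, err := os.Open(path)
    	if err != nil {
    		return nil, nil, err
    	}
    	var reader io.Reader = bufio.NewReader(fh)
    	return reader, fh, nil
    }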
Triage discussion: this could be 'generalized' so we have a generic importer which imports key/value pairs. And then we could have specific exporters which export preimages, or snap data, or whatever it may be.
@holiman I will not add the
cmd/geth/dbcmd.go (outdated)

    // The prefix used to identify the snapshot data type,
    // 0: account snapshot
    // 1: storage snapshot
If we do it this way, we're limiting imports to only the ones that are predefined. If we instead encode it as [key, value], [key, value], [key, value], then the importer doesn't need to be datatype-aware, and we can use the same import function regardless of whether we're importing snapshot data or e.g. preimage data or trie data.
@rjl493456442 any thoughts on making it so that we only ever need one single generic importer -- even if we decide to have custom exporters for different data types? I think that would be pretty good, because then an older node could import data generated by a more recent one which is able to export the things it needs.
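To make the idea concrete, a rough sketch of a datatype-agnostic importer over a stream of [key, value] items, using go-ethereum's rlp and ethdb packages; the exportEntry type, the function name and the package are invented for illustration:

    package export

    import (
    	"io"

    	"github.com/ethereum/go-ethereum/ethdb"
    	"github.com/ethereum/go-ethereum/rlp"
    )

    // exportEntry is a hypothetical, datatype-agnostic element of the dump:
    // just a key and a value, regardless of whether it holds snapshot,
    // preimage or trie data.
    type exportEntry struct {
    	Key, Val []byte
    }

    // importStream reads consecutive [key, value] items from r and writes them
    // into the database, without knowing what the data means.
    func importStream(db ethdb.KeyValueWriter, r io.Reader) error {
    	stream := rlp.NewStream(r, 0)
    	for {
    		var e exportEntry
    		if err := stream.Decode(&e); err == io.EOF {
    			return nil
    		} else if err != nil {
    			return err
    		}
    		if err := db.Put(e.Key, e.Val); err != nil {
    			return err
    		}
    	}
    }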
@holiman yes yes sure. I can check how to implement this general exporter/importer.
Note: I'm talking about a general importer, but specialized exporter(s).
Force-pushed from a5ab283 to b656582.
My suggestions: holiman@cf1260d
@holiman Can you move your code into this PR? It looks good to me.
Ok, done. Now it's nearly fully generic. So for example, we could write an exporter that exports a particular trie root. Or we could write one that spits out all metadata, to more easily analyze what has gone wrong, e.g. if someone has inconsistencies between pivot block / latest header / latest block / latest snapshot etc.
Another thing to consider: adding a metadata field as the first element. It could contain version info, the exported 'type' and date of export.
Added a header now + test cases. The header contains timestamp, version and a "kind" string.
Now also with a magic fingerprint, so arbitrary RLP lists don't import just because they happen to look like the header, RLP-wise.
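For illustration only, a sketch of what such a header with a magic fingerprint could look like; the field names, magic bytes and version constant here are assumptions, not the exact values used in the PR:

    package export

    import (
    	"bytes"
    	"errors"
    	"io"
    	"time"

    	"github.com/ethereum/go-ethereum/rlp"
    )

    // exportMagic is a made-up fingerprint written before the header, so an
    // arbitrary RLP list is not mistaken for a dump file.
    var exportMagic = []byte("hypothetical-dump")

    // dumpHeader records when the dump was written, the format version and
    // what kind of data follows (e.g. "preimage" or "snapshot").
    type dumpHeader struct {
    	Timestamp uint64
    	Version   uint64
    	Kind      string
    }

    func writeHeader(w io.Writer, kind string) error {
    	if _, err := w.Write(exportMagic); err != nil {
    		return err
    	}
    	return rlp.Encode(w, &dumpHeader{
    		Timestamp: uint64(time.Now().Unix()),
    		Version:   1,
    		Kind:      kind,
    	})
    }

    func readHeader(r io.Reader) (*dumpHeader, error) {
    	buf := make([]byte, len(exportMagic))
    	if _, err := io.ReadFull(r, buf); err != nil {
    		return nil, err
    	}
    	if !bytes.Equal(buf, exportMagic) {
    		return nil, errors.New("not a dump file: bad magic")
    	}
    	var h dumpHeader
    	if err := rlp.NewStream(r, 0).Decode(&h); err != nil {
    		return nil, err
    	}
    	return &h, nil
    }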
Now that we have versioning, maybe we don't have to add support for deletions right now. We can add that in the future, if we decide we need it.
LGTM (but then I wrote some of the code, so someone else maybe should thumb it up too)
We need to extend the format to handle deletions as well. If we import snapshot data, we need to delete the metadata, so it's regenerated based on the newest block and not interpreted as some old, already present but overwritten snapshot.
Currently, the schema is an RLP stream: a magic fingerprint, a header (timestamp, version, kind), followed by the exported key/value pairs.
Possible schemas for deletions, if element X is to be deleted:

Direct approach
Instead of encoding key/value pairs, encode triplets:
Pro: simple, no special cases.

Magic key
Instead of doing key/value pairs, we'd have a special key meaning "delete", and the value would then be the key to delete.
Pro:
Cons:

Any other ideas?
What do you think about something like this:
Yes, that's also something I've considered. That scheme makes it so that all deletions are in a particular order, at the end. Which I guess is fine, as long as the full import is performed. One might want to produce a dump which first of all deletes some marker, then starts filling data. That way, if the import is aborted and geth started, there is no corruption. Whereas if we force deletions to go last, we lose that ability. Also, it means that if there for some reason are a lot of deletions (though I can't think of why that would be), then the exporter would have to hang on to those and not flush them until after it's done with the additions.
What about adding a new area called metadata? I think exporting a batch of deletion markers and then importing these markers into another db for deletion doesn't sound realistic. We can add one more area which contains customized key-value pairs (deletions can be included here). The export file would then contain the header, the metadata area and the data area. Just like the header, the metadata can also be a struct:

    type entry struct {
    	Key, Val []byte
    }

    type Metadata []entry

We can put all deletion markers there with the value as nil. Also, the metadata area is always handled before the data area.
@rjl493456442 so essentially, your scheme would be:

Where the deletion markers, from an RLP perspective, are not just appended (like the kv pairs), but an actual RLP list?
I think it was a good choice to switch over to iterators
cmd/utils/cmd.go (outdated)

    	start  = time.Now()
    	logged = time.Now()
    )
    for key, val, next := iter.Next(); next; key, val, next = iter.Next() {
Both for this iterator and the next: the final bool value being returned isn't really safe to rely on. What you have implemented is not "are there more elements?", but rather "is there a chance that there are more elements?". Or, "go to next, was it ok?"

So it makes more sense to name it ok:

    for key, val, ok := iter.Next(); ok; key, val, ok = iter.Next() {
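As a small illustration of the "go to next, was it ok?" semantics (the interface here is invented; the PR's actual iterator type may look different):

    package export

    import "fmt"

    // kvIter is a hypothetical iterator in the style discussed above: Next
    // advances and reports whether that step produced an element; it does not
    // promise that further elements exist afterwards.
    type kvIter interface {
    	Next() (key, val []byte, ok bool)
    }

    func consume(it kvIter) {
    	// "ok" reads as "did advancing succeed?", not "are there more elements
    	// after this one?".
    	for key, val, ok := it.Next(); ok; key, val, ok = it.Next() {
    		fmt.Printf("%x => %x\n", key, val)
    	}
    }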
Rebased, and fixed so the format is
@holiman LGTM!
This PR offers two more database subcommands for exporting and importing data. Two exporters are implemented: for preimage and snapshot data, respectively. The import command is generic; it can take any data export and import it into leveldb. The data format has a 'magic' for disambiguation, and a version field for future compatibility.
What is "key" in this struct? Is it linked to the account's address? (a hash of the account's address?)
This PR offers two more database subcommands for exporting and importing snapshot data.
These two commands can be useful in this scenario: an archive node has just upgraded and starts to use snapshots. It is very expensive to regenerate the snapshot, especially for an archive node with a huge database. So node operators can snap-sync a fresh new geth node (a few hours) and import all the snapshot data into the archive node. The archive node can pick it up and do the repair work (a few hours). Compared with the endless snapshot generation, this manual work is much faster.
These two commands can also be used in a more general way for importing/exporting arbitrary chain data.