Delta log files on HDFS does not have checksum #865

Closed
zijie0 opened this issue Dec 14, 2021 · 2 comments
Labels
question Questions on how to use Delta Lake

Comments

@zijie0

zijie0 commented Dec 14, 2021

We are running Spark + Delta on the CDH platform, and our Delta tables are stored on HDFS. We also use the delta-rs project to read Delta tables in some Python projects. Currently delta-rs does not support HDFS storage (delta-io/delta-rs#300), so we use MinIO over HDFS as a workaround.

When we try to read Delta table files through MinIO, it returns an error like the following:

API: GetObject(bucket=some_bucket, object=some_delta_table/_delta_log/00000000000000000000.json)
Time: 12:20:03 UTC 12/13/2021
DeploymentID: 9af18baf-b9bd-45c7-b570-2d1a6e0d6f60
RequestID: 16C04FCC5A225C8D
RemoteHost: xx.xx.xx.xx
Host: yy.yy.yy.yy:9011
UserAgent: Botocore/1.21.21 Python/3.8.10 Windows/10
Error: unsupported checksum type: 0 (*errors.errorString)
       3: cmd/gateway/hdfs/gateway-hdfs.go:295:hdfs.hdfsToObjectErr()
       2: cmd/gateway/hdfs/gateway-hdfs.go:636:hdfs.(*hdfsObjects).getObject()
       1: cmd/gateway/hdfs/gateway-hdfs.go:597:hdfs.(*hdfsObjects).GetObjectNInfo.func1()

We also checked the Delta log files with the hadoop fs -checksum command; it returns NONE instead of a valid checksum.
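For anyone who wants to run the same check over many files, here is a minimal Python sketch that parses one line of hadoop fs -checksum output. It assumes each line is tab-separated as path, algorithm, hex value, with the literal NONE in the algorithm column when no checksum exists; that layout is an assumption to verify against your Hadoop version. The names ChecksumLine and parse_checksum_line are hypothetical helpers, not part of any library.

```python
from typing import NamedTuple, Optional


class ChecksumLine(NamedTuple):
    path: str
    algorithm: Optional[str]      # None when the file has no checksum
    checksum_hex: Optional[str]   # None when the file has no checksum


def parse_checksum_line(line: str) -> ChecksumLine:
    """Parse one output line, assuming 'path<TAB>algorithm<TAB>hex'."""
    fields = line.rstrip("\n").split("\t")
    path = fields[0]
    algorithm = fields[1] if len(fields) > 1 else ""
    value = fields[2] if len(fields) > 2 else ""
    # Treat an empty or NONE algorithm column as "no checksum available".
    if algorithm in ("", "NONE"):
        return ChecksumLine(path, None, None)
    return ChecksumLine(path, algorithm, value or None)
```

Feeding it the captured output of the command (for example via subprocess.run) and filtering on algorithm is None gives the list of files a checksum-requiring reader would reject.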

After investigating the code in the Delta project, we found that the checksum is explicitly disabled when log files are created:

tempPath, EnumSet.of(CREATE), CreateOpts.checksumParam(ChecksumOpt.createDisabled()))

What is the reason for this, and is there a recommended workaround? Thanks in advance.

@zsxwing
Member

zsxwing commented Dec 15, 2021

Is it possible to disable the checksum check for MinIO over HDFS? In any case, the checksum file is not guaranteed to be created (e.g., a JSON file is created and then the system crashes), so you may still hit the same issue.

AFAIK, the HDFS checksum doesn't work very well with file overwriting. E.g., say we have a file pair (A, A.crc). If we overwrite file A, we may end up with A overwritten but A.crc not updated because of a system crash. Then the table will be broken.

Unless HDFS can provide an API to update the file content and its CRC atomically, we won't be able to enable checksum for Delta.
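The failure mode described above can be illustrated with a toy model on a local filesystem. This is only a sketch: it uses a plain sidecar file named A.crc and zlib.crc32, and it simulates the crash with a flag. It is not how HDFS or Delta implement checksums, and the function names are hypothetical.

```python
import os
import tempfile
import zlib


def crc_of(data: bytes) -> str:
    """CRC32 of the data as 8 hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


def overwrite(path: str, data: bytes, crash_before_crc: bool = False) -> None:
    """Overwrite the data file, then its sidecar checksum, as two separate writes."""
    with open(path, "wb") as f:
        f.write(data)
    if crash_before_crc:
        return  # simulated crash: the sidecar still describes the old content
    with open(path + ".crc", "w") as f:
        f.write(crc_of(data))


def verify(path: str) -> bool:
    """Return True if the sidecar checksum matches the current file content."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".crc") as f:
        stored = f.read()
    return stored == crc_of(data)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "A")
        overwrite(path, b"version 1")
        ok_before = verify(path)   # both writes completed
        overwrite(path, b"version 2", crash_before_crc=True)
        ok_after = verify(path)    # data is new, checksum is stale
        return ok_before, ok_after
```

In this model, a reader that enforces the checksum rejects A after the interrupted overwrite even though the new content was written in full, which is why the two updates would need to be a single atomic operation.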

@dennyglee dennyglee added the question Questions on how to use Delta Lake label Dec 15, 2021
@zijie0
Author

zijie0 commented Dec 16, 2021

@zsxwing Thanks! I will open another issue in the MinIO project to see if they have a solution for this.

@zijie0 zijie0 closed this as completed Dec 24, 2021