
Added parchive v2 #69

Closed
wants to merge 1 commit into from

Conversation

KOLANICH (Contributor)

No description provided.

'"PAR 2.0\0CommASCI"': ascii_comment
'"PAR 2.0\0CommUni"': unicode_comment
'"PAR 2.0\0FileSlic"': input_file_slice
'"PAR 2.0\0RFSC\0\0\0\0"': recovery_file_slice_checksum

Member

There seems to be some internal structure here; wouldn't it be better to split the header.type into multiple fields?
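
For illustration only, here is a C sketch (hypothetical names, not part of the .ksy under review) of the internal structure visible in the quoted cases: each type value is the common 8-byte "PAR 2.0\0" signature followed by an 8-byte subtype such as "CommASCI", "FileSlic" or "RFSC\0\0\0\0".

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout mirroring the split suggested above:
   a fixed signature followed by a per-packet subtype. */
struct par2_packet_type {
    uint8_t signature[8]; /* always "PAR 2.0\0" */
    uint8_t subtype[8];   /* e.g. "CommASCI", "FileSlic", "RFSC\0\0\0\0" */
};

/* Returns 1 if the type field carries the expected signature. */
int has_par2_signature(const struct par2_packet_type *t) {
    return memcmp(t->signature, "PAR 2.0\0", 8) == 0;
}
```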

Contributor Author

Thanks, done.

Comment on lines +15 to +34
doc: |
This document describes a file format for storing redundant data for a set of files.
In operation, a user will select a set of files from which the redundant data is to be made. These are known as input files and the set of them is known as the recovery set. The user will provide these to a program which generates file(s) that match the specification in this document. The program is known as a PAR 2.0 Client, or client for short, and the generated files are known as PAR 2.0 files or PAR files. If the files in the recovery set ever get damaged (e.g. when they are transmitted or stored on a faulty disk), the client can read the damaged input files, read the (possibly damaged) PAR files, and regenerate the original input files. Of course, not all damage can be repaired, but much of it can.
A user can also name some input files that are not to be recovered if damaged. These input files are known as the non-recovery set. This feature is in the spec to keep the same functionality as PAR 1.0.
The redundant data in the PAR files is computed using Reed-Solomon codes. These codes can take a set of equal-sized blocks of data and produce a number of same-sized recovery blocks. Then, given a subset of the original data blocks and some recovery blocks, it is possible to reproduce the original data blocks. Reed-Solomon codes can do this recovery as long as the number of missing data blocks does not outnumber the recovery blocks. The design of the Reed-Solomon codes in this spec is based on James S. Plank's tech report at U. of Tennessee entitled "A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems". The tech report contains an error, so the design is changed slightly to fix the problem. PAR 2.0 uses a 16-bit Reed-Solomon code and can support 32768 blocks.
The equal-sized blocks for the Reed-Solomon codes come from slices of the input files in the recovery set. The slices are consecutive equal-sized chunks of each file. If a file does not fill out the chunk, i.e. it ends mid-slice, the rest of the slice is treated as if it is padded with zero bytes.
The PAR 2.0 file itself is made of packets - self-contained parts with their own checksum. This design prevents damage to one part of the file from making the whole file unusable.
Packets have a type and each type of packet serves a different purpose. One describes a file. Another contains the checksums of the slices in a file. Another states which files are in the recovery set and which files are in the non-recovery set. And yet another contains a recovery slice - the recovery data produced by the Reed-Solomon code.
A PAR 2.0 file is only *required* to contain 1 specific packet - the packet that identifies the type of client that created the file. This way, if clients are creating files that don't match the specification in some way, they can be tracked down.
The packets for a recovery set and non-recovery set can be broken into multiple files. Files can contain duplicate packets - in fact, this is recommended for vital packets, such as the ones that describe the input files and the one that states which files are in the recovery set. Packets can appear in any order in a file.

That is the official spec. To make sure clients work similarly, the following client conventions should be followed.
PAR 2.0 files should always end in ".par2". For example, "file.par2". If a file contains recovery slices, the ".par2" should be preceded by ".volXX-YY" where XX to YY is the range of exponents for the recovery slices. For example, "file.vol20-29.par2". More than 2 digits should be used if necessary. Any exponents that contain fewer digits than the largest exponent should be preceded by zeros so that all filenames have the same length. For example, "file.vol075-149.par2". Exponents should start at 0 and go upwards.
If multiple PAR files are generated, they may either have a constant number of slices per file (e.g. 20, 20, 20, ...) or an exponentially increasing number of slices (e.g. 1, 2, 4, 8, ...). Note that storing 1023 slices takes 52 files if each has 20 slices, but only 10 files with the exponential pattern.
When generating multiple PAR files, it is expected that one file be generated without any slices and containing all main, file description, and input file checksum packets. The other files should also include the main, file description and input file checksum packets. This repeats data that cannot be recovered.

*NOTE: If the files are to be transmitted over usenet, it might be best to place the main, file description and input file checksum packets at the end, so that the equal-sized recovery slice packets are at the beginning. That way it may be possible to put a single recovery slice in each usenet message.*
If just a single PAR file is generated, it is expected that the main, file description, and input file checksum packets are repeated multiple times and scattered throughout the file. (Once again, repeating data that cannot be recovered.)
It is recommended that users are warned when they create PAR files with names that are incompatible with Windows, Mac, or Linux systems. That is, file or directory names that are more than 255 characters long, start with a period (.) or a dash (-), or contain one of these characters: < > : " ' ` ? * & | [ ] \ ; or newline (\n).
It is *strongly* recommended that clients query a user before writing to a file whose File Description packet contains an absolute pathname. For Windows, that means one starting with "C:\" or "//" for example. For UNIX, that means one starting with "/" or "//". For Mac, that means one starting with ":". This is to prevent PAR files of unknown origin from cracking a system by overwriting system files.
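
As a concrete reading of the naming convention above, here is a minimal C sketch (the helper name and its parameters are illustrative, not from the spec) that produces ".volXX-YY.par2" names with every exponent zero-padded to the width of the largest exponent used in the set:

```c
#include <stdio.h>

/* Hypothetical helper: par2_volume_name(..., "file", 75, 149, 149)
   writes "file.vol075-149.par2" into out. */
void par2_volume_name(char *out, size_t out_size, const char *base,
                      unsigned first_exp, unsigned last_exp,
                      unsigned max_exp_in_set) {
    /* pad every exponent to the digit count of the largest exponent used */
    int width = snprintf(NULL, 0, "%u", max_exp_in_set);
    snprintf(out, out_size, "%s.vol%0*u-%0*u.par2",
             base, width, first_exp, width, last_exp);
}

int main(void) {
    char name[64];
    par2_volume_name(name, sizeof name, "file", 20, 29, 29);
    printf("%s\n", name); /* file.vol20-29.par2 */
    par2_volume_name(name, sizeof name, "file", 75, 149, 149);
    printf("%s\n", name); /* file.vol075-149.par2 */
    return 0;
}
```
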
Member

As I feared, this renders terribly in the format gallery because the text is not sensibly divided into paragraphs. Here's a screenshot of the generated page:

[screenshot: ks-formats-parchive-v2]

The format gallery generator uses CommonMark formatting for the top-level doc key, so you should separate the paragraphs with a blank line.

Contributor Author

Separated. Thanks.

The recovery slice packet contains one slice of recovery data. The recovery data is generated using a 16-bit Galois Field (GF) with generator 0x0001100B.
The algorithm for computing recovery slices is based on James S. Plank's tech report at U. of Tennessee entitled "A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems". The input slices are ordered and assigned 16-bit constants. Recovery slices are assigned 16-bit exponents. Each 2-byte word of the recovery slice is the sum of the contributions from each input slice. The contribution of each input slice is the 2-byte word of the input slice multiplied by the input slice's constant raised to the recovery slice's exponent. All these computations (additions, multiplications, powers) are done using the 16-bit Galois Field operations.
To generate the recovery data, the slices of the input files are assigned constants. This is based on the order the File IDs appear in the main packet and then by the order the slices appear in the file. So the first slice of the first file in the main packet is assigned the first constant. The second slice of the first file is assigned the second constant. And so on. If the last slice of the first file has the Nth constant, the first slice of the second file is assigned the (N+1)th. And so on.
Here, the PAR 2.0 spec diverges from Plank's paper. In Plank, the first constant is 1, the second 2, the third 3, etc. This is a bad approach because some constants have an order less than 65535. (That is, there exist constants N where N raised to a power less than 65535 is equal to 1 in the Galois Field.) These constants can prevent recovery matrices from being invertible and can, therefore, stop recovery. This spec does not use those constants. So, the first constant is the first power of two that has order 65535. The second constant is the next power of two that has order 65535. And so on. A power of two has order 65535 if the exponent is not equal to 0 modulo 3, 5, 17, or 257. In C code, that would be (n%3 != 0 && n%5 != 0 && n%17 != 0 && n%257 != 0). Note: it is the exponent being tested, not the constant itself. There are 32768 valid constants.
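
To make the arithmetic above concrete, here is a minimal C sketch (function names are illustrative, not from the spec) of multiplication in the 16-bit Galois Field with generator 0x0001100B, exponentiation by repeated squaring, and enumeration of the valid input slice constants:

```c
#include <stdint.h>
#include <stdio.h>

/* Multiply two elements of GF(2^16) with generator polynomial 0x1100B. */
uint16_t gf16_mul(uint16_t a, uint16_t b) {
    uint32_t acc = 0, x = a;
    while (b) {
        if (b & 1)
            acc ^= x;          /* addition in GF(2^16) is XOR */
        x <<= 1;
        if (x & 0x10000)
            x ^= 0x1100B;      /* reduce modulo the generator polynomial */
        b >>= 1;
    }
    return (uint16_t)acc;
}

/* Raise a field element to a power by repeated squaring. */
uint16_t gf16_pow(uint16_t base, uint32_t exp) {
    uint16_t result = 1;
    while (exp) {
        if (exp & 1)
            result = gf16_mul(result, base);
        base = gf16_mul(base, base);
        exp >>= 1;
    }
    return result;
}

/* Fill `out` with the first `count` input slice constants: powers of two
   whose exponent n satisfies n%3, n%5, n%17 and n%257 all non-zero. */
void input_constants(uint16_t *out, int count) {
    int found = 0;
    for (uint32_t n = 1; found < count; n++) {
        if (n % 3 != 0 && n % 5 != 0 && n % 17 != 0 && n % 257 != 0)
            out[found++] = gf16_pow(2, n);
    }
}

int main(void) {
    /* Each 2-byte recovery word is then the XOR over all input slices of
       gf16_mul(input_word, gf16_pow(constant, recovery_exponent)). */
    uint16_t c[5];
    input_constants(c, 5);
    for (int i = 0; i < 5; i++)
        printf("constant %d = 0x%04X\n", i + 1, (unsigned)c[i]);
    return 0;
}
```
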
Member

Please, for the love of you-know-who, though I know you're not a fan of it, I ask you to wrap the lines at 80 characters, because this 871-character line really makes reading the spec a nightmare:

[screenshot: ks-formats-parchive-v2-long-lines]

It's recommended in the KSY Style Guide and it's there for a reason:

Lines should be wrapped to be 80 columns long. If it doesn’t fit into single line after wrapping, (...).

Not wrapping the lines might work in your workflow and environment, but it's highly impractical for most people. (I think I'd need a mouse with horizontal scrolling in order not to go crazy over it.)

And again, please separate all the paragraphs with a blank line.

Contributor Author

[screenshot: npp]

@KOLANICH mentioned this pull request Aug 21, 2020
@KOLANICH closed this Aug 21, 2020