-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Git Protocol filter (v5) #9
Conversation
set_packet_header() converts an integer to a 4 byte hex string. Make this function locally available so that other pkt-line functions can use it. Signed-off-by: Lars Schneider <[email protected]>
The packet_trace() call is not ideal in format_packet() as we would print a trace when a packet is formatted and (potentially) when the packet is actually send. This was no problem up until now because format_packet() was only used by one function. Fix it by moving the trace call into the function that actually sends the packet. Signed-off-by: Lars Schneider <[email protected]>
f9465d1
to
8fc20c1
Compare
buf[2] = hex(size >> 4); | ||
buf[3] = hex(size); | ||
#undef hex | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peff Junio told me twice that he thinks this function is a bad idea:
http://public-inbox.org/git/xmqqd1lp8v2o.fsf%40gitster.mtv.corp.google.com/
However, I need it as a helper function for packet_write_gently()
below. Do you think it would be OK to send it in v5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think he's mostly concerned that you are adding a new and complicated interface to the pktline code. But I think the last round of discussion showed that we need something besides format_packet
, because it is not binary-clean.
So that something is going to want to share code with format_packet
, and this is one way of doing it (and probably the simplest).
Another would be something like:
8fc20c1
to
ae81ad5
Compare
------------------------ | ||
packet: git< 0000 | ||
packet: git< result=reject-all\n | ||
------------------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnareb would that work for your "refuse to filter further" idea?
Another response, which I think should be standarized, or at
least described in the documentation, is filter driver refusing
to filter further (e.g. git-LFS and network is down), to be not
restarted by Git.http://public-inbox.org/git/607c07fe-5b6f-fd67-13e1-705020c267ee%40gmail.com/
ae81ad5
to
49d97a1
Compare
8e2261a
to
fc71489
Compare
>test4-empty.r && | ||
|
||
check_filter \ | ||
git add . \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dscho !
This patch is a part of a series that I work on and publish on the mailing list.Build and test runs clean on OS X and Linux. I just got around to test it on Windows and I get this error
Unlink of file 'C:/git-sdk-64/usr/src/git/t/trash directory.t0021-conversion/repo/.git/index.lock' failed. Should I try again? (y/n) warning: unable to unlink C:/git-sdk-64/usr/src/git/t/trash directory.t0021-conversion/repo/.git/index.lock: Permission denied
fatal: Unable to write new index file
This seems to happen with every Git operation that uses my new filter protocol. I know it is a big patch... but do you spot by any chance something that would cause problems on Windows?
Thanks,
Lars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little bit more context: My patch stats a filter process that runs until the Git parent process dies... this seems to be the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2c: I've seen similar issues before because Windows is particularly picky about code releasing file locks by explicitly closing file handles before exiting, while OS X / Linux tends to free them quickly when the process exits. I'm a a bit new to this code though so I can't suggest where exactly that might be atm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found something! Looks like @bwijen and @dscho fixed that issue already! 🎉
git-for-windows#755
This fix is not yet in Git upstream... that's why I missed it.
Only one remaining error on Windows... 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for the record, here the remaining error:
- I have a branch with files that require filtering (amongst them
test2.r
) - I switch to a branch that contains no files at all.
This switch is not successful and I get this error:
warning: unable to unlink test2.r: Invalid argument
Switched to branch 'empty'
error: The following untracked working tree files would be overwritten by checkout:
test2.r
Please move or remove them before you can switch branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long silence... busy with CI...
Is it possible that your filter somehow still has a handle open on test2.r
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries! I think I fixed it! Could you review the fix on the mailing list?
http://public-inbox.org/git/[email protected]/
Here is the commit on GitHub:
5e5c0aa
9b04200
to
511a40f
Compare
format_packet() dies if the caller wants to format a packet larger than LARGE_PACKET_MAX. Certain callers might prefer an error response instead. Add a parameter `gentle` to define if the function should signal an error with the return value (gentle=1) or die (gentle=0). Signed-off-by: Lars Schneider <[email protected]>
packet_write() has two shortcomings. First, it uses format_packet() which lets the caller only send string data via "%s". That means it cannot be used for arbitrary data that may contain NULs. Second, it will always die on error. Add packet_write_gently() which writes arbitrary data and returns `0` for success and `-1` for an error. Signed-off-by: Lars Schneider <[email protected]>
packet_write() would die in case of a write error even though for some callers an error would be acceptable. Add packet_write_gently_fmt() which writes a formatted pkt-line and returns `0` for success and `-1` for an error. Signed-off-by: Lars Schneider <[email protected]>
packet_flush() would die in case of a write error even though for some callers an error would be acceptable. Add packet_flush_gently() which writes a pkt-line flush packet and returns `0` for success and `-1` for failure. Signed-off-by: Lars Schneider <[email protected]>
packet_write_stream_with_flush_from_fd() and packet_write_stream_with_flush_from_buf() write a stream of packets. All content packets use the maximal packet size except for the last one. After the last content packet a `flush` control packet is written. packet_read_till_flush() reads arbitary sized packets until it detects a `flush` packet. Signed-off-by: Lars Schneider <[email protected]>
packet_write() should be called packet_write_fmt() as the string parameter can be formatted. Suggested-by: Junio C Hamano <[email protected]> Signed-off-by: Lars Schneider <[email protected]>
According to LARGE_PACKET_MAX in pkt-line.h the maximal length of a pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and therefore the pkt-line data component must not exceed 65516 bytes. Signed-off-by: Lars Schneider <[email protected]>
Git filter driver commands with spaces (e.g. `filter.sh foo`) are hard to read in error messages. Quote them to improve the readability. Signed-off-by: Lars Schneider <[email protected]>
Use `test_config` to set the config, check that files are empty with `test_must_be_empty`, compare files with `test_cmp`, and remove spaces after ">" and "<". Signed-off-by: Lars Schneider <[email protected]>
Generate more interesting large test files with pseudo random characters in between and reuse these test files in multiple tests. Run tests formerly marked as EXPENSIVE every time but with a smaller data set. Signed-off-by: Lars Schneider <[email protected]>
apply_filter() returns a boolean that tells the caller if it "did convert or did not convert". The variable `ret` was used throughout the function to track errors wheras `1` denoted success and `0` failure. This is unusual for the Git source where `0` denotes success. Rename the variable and flip its value to make the function easier readable for Git developers. Signed-off-by: Lars Schneider <[email protected]>
Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time. In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382 This patch adds the `filter.<driver>.process` string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output described below. Git starts the filter when it encounters the first file that needs to be cleaned or smudged. After the filter started Git sends a welcome message, a list of supported protocol version numbers, and a flush packet. Git expects to read the welcome message and one protocol version number from the previously sent list. Afterwards Git sends a list of supported capabilities and a flush packet. Git expects to read a list of desired capabilities, which must be a subset of the supported capabilities list, and a flush packet as response: ------------------------ packet: git> git-filter-client packet: git> version=2 packet: git> version=42 packet: git> 0000 packet: git< git-filter-server packet: git< version=2 packet: git> clean=true packet: git> smudge=true packet: git> not-yet-invented=true packet: git> 0000 packet: git< clean=true packet: git< smudge=true packet: git< 0000 ------------------------ Supported filter capabilities in version 2 are "clean" and "smudge". Afterwards Git sends a list of "key=value" pairs terminated with a flush packet. The list will contain at least the filter command (based on the supported capabilities) and the pathname of the file to filter relative to the repository root. Right after these packets Git sends the content split in zero or more pkt-line packets and a flush packet to terminate content. ------------------------ packet: git> command=smudge\n packet: git> pathname=path/testfile.dat\n packet: git> 0000 packet: git> CONTENT packet: git> 0000 ------------------------ The filter is expected to respond with a list of "key=value" pairs terminated with a flush packet. If the filter does not experience problems then the list must contain a "success" status. Right after these packets the filter is expected to send the content in zero or more pkt-line packets and a flush packet at the end. Finally, a second list of "key=value" pairs terminated with a flush packet is expected. The filter can change the status in the second list. ------------------------ packet: git< status=success\n packet: git< 0000 packet: git< SMUDGED_CONTENT packet: git< 0000 packet: git< 0000 # empty list! ------------------------ If the result content is empty then the filter is expected to respond with a success status and an empty list. ------------------------ packet: git< status=success\n packet: git< 0000 packet: git< 0000 # empty content! packet: git< 0000 # empty list! ------------------------ In case the filter cannot or does not want to process the content, it is expected to respond with an "error" status. Depending on the `filter.<driver>.required` flag Git will interpret that as error but it will not stop or restart the filter process. ------------------------ packet: git< status=error\n packet: git< 0000 ------------------------ In case the filter cannot or does not want to process the content as well as any future content for the lifetime of the Git process, it is expected to respond with an "error-all" status. Depending on the `filter.<driver>.required` flag Git will interpret that as error but it will not stop or restart the filter process. ------------------------ packet: git< status=error-all\n packet: git< 0000 ------------------------ If the filter experiences an error during processing, then it can send the status "error". Depending on the `filter.<driver>.required` flag Git will interpret that as error but it will not stop or restart the filter process. ------------------------ packet: git< status=success\n packet: git< 0000 packet: git< HALF_WRITTEN_ERRONEOUS_CONTENT packet: git< 0000 packet: git< status=error\n packet: git< 0000 ------------------------ If the filter dies during the communication or does not adhere to the protocol then Git will stop the filter process and restart it with the next file that needs to be processed. After the filter has processed a blob it is expected to wait for the next "key=value" list containing a command. When the Git process terminates, it will send a kill signal to the filter in that stage. If a `filter.<driver>.clean` or `filter.<driver>.smudge` command is configured then these commands always take precedence over a configured `filter.<driver>.process` command. Helped-by: Martin-Louis Bright <[email protected]> Reviewed-by: Jakub Narebski <[email protected]> Signed-off-by: Lars Schneider <[email protected]>
Consider the case of a file that requires filtering and is present in branch A but not in branch B. If A is the current HEAD and we checkout B then the following happens: 1. ce_compare_data() opens the file 2. index_fd() detects that the file requires to run a clean filter and calls index_stream_convert_blob() 4. index_stream_convert_blob() calls convert_to_git_filter_fd() 5. convert_to_git_filter_fd() calls apply_filter() which creates a new long running filter process (in case it is the first file of this kind to be filtered) 6. The new filter process inherits all file handles. This is the default on Linux/OSX and is explicitly defined in the `CreateProcessW` call in `mingw.c` on Windows. 7. ce_compare_data() closes the file 8. Git unlinks the file as it is not present in B The unlink operation does not work on Windows because the filter process has still an open handle to the file. Apparently that is no problem on Linux/OSX. Probably because "[...] the two file descriptors share open file status flags" (see fork(2)). Fix this problem by opening files in read-cache with the `O_CLOEXEC` flag to ensure that the file descriptor does not remain open in a newly spawned process. `O_CLOEXEX` is defined as `O_NOINHERIT` on Windows. A similar fix for temporary file handles was applied on Git for Windows already: git-for-windows@667b8b5 Signed-off-by: Lars Schneider <[email protected]>
511a40f
to
5e5c0aa
Compare
Notable changes since v4
(plus detect wrongly configured version 1 filters)
Patches
please look closely: the patch does not add a new interface to
pktline. It is a private functions used 03.
packet_write_fmt() as suggested by Junio.
(This patch is not required for the the series and could be dropped)
the discussion here:
git.exe parent process locks index file for child git.exe process git-for-windows/git#770 (comment)
(This patch is not required in Git core and could be dropped)
Jakub
mis configured as process-filter
Junio
Peff
Eric