Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Git Protocol filter (v5) #9

Closed
wants to merge 15 commits into from
Closed

Conversation

larsxschneider
Copy link
Owner

@larsxschneider larsxschneider commented Aug 5, 2016

Notable changes since v4

  • drop the shutdown capability
  • add error-all response to signal that the filter does not want to filter anymore
  • extend init sequence to negotiate version number and capabilities
    (plus detect wrongly configured version 1 filters)
  • improve status response format according to Peff's suggestions
  • fix Git for Window support

Patches

  • 01-09: Make pktline ready to be used for filters.
  • 01 - @junio: I know you told me twice that you don't like this patch but
    please look closely: the patch does not add a new interface to
    pktline. It is a private functions used 03.
  • 08 - This patch changes the pktline interface as it renames packet_write() to
    packet_write_fmt() as suggested by Junio.
    (This patch is not required for the the series and could be dropped)
  • 10-13: Prepare convert code.
  • 14: Main contribution.
  • 15: Proposed fix to make the long running filters work on Windows. See
    the discussion here:
    git.exe parent process locks index file for child git.exe process git-for-windows/git#770 (comment)
    (This patch is not required in Git core and could be dropped)

Jakub

Junio

Peff

Eric

set_packet_header() converts an integer to a 4 byte hex string. Make
this function locally available so that other pkt-line functions can
use it.

Signed-off-by: Lars Schneider <[email protected]>
The packet_trace() call is not ideal in format_packet() as we would print
a trace when a packet is formatted and (potentially) when the packet is
actually send. This was no problem up until now because format_packet()
was only used by one function. Fix it by moving the trace call into the
function that actually sends the packet.

Signed-off-by: Lars Schneider <[email protected]>
buf[2] = hex(size >> 4);
buf[3] = hex(size);
#undef hex
}
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peff Junio told me twice that he thinks this function is a bad idea:
http://public-inbox.org/git/xmqqd1lp8v2o.fsf%40gitster.mtv.corp.google.com/

However, I need it as a helper function for packet_write_gently() below. Do you think it would be OK to send it in v5?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think he's mostly concerned that you are adding a new and complicated interface to the pktline code. But I think the last round of discussion showed that we need something besides format_packet, because it is not binary-clean.

So that something is going to want to share code with format_packet, and this is one way of doing it (and probably the simplest).

Another would be something like:

------------------------
packet: git< 0000
packet: git< result=reject-all\n
------------------------
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnareb would that work for your "refuse to filter further" idea?

Another response, which I think should be standarized, or at
least described in the documentation, is filter driver refusing
to filter further (e.g. git-LFS and network is down), to be not
restarted by Git.

http://public-inbox.org/git/607c07fe-5b6f-fd67-13e1-705020c267ee%40gmail.com/

@larsxschneider larsxschneider changed the title Protocol filter v5 Git Protocol filter (v5) Aug 6, 2016
@larsxschneider larsxschneider force-pushed the protocol-filter/v5 branch 6 times, most recently from 8e2261a to fc71489 Compare August 8, 2016 12:49
>test4-empty.r &&

check_filter \
git add . \
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dscho !

This patch is a part of a series that I work on and publish on the mailing list.Build and test runs clean on OS X and Linux. I just got around to test it on Windows and I get this error

Unlink of file 'C:/git-sdk-64/usr/src/git/t/trash directory.t0021-conversion/repo/.git/index.lock' failed. Should I try again? (y/n) warning: unable to unlink C:/git-sdk-64/usr/src/git/t/trash directory.t0021-conversion/repo/.git/index.lock: Permission denied
fatal: Unable to write new index file

This seems to happen with every Git operation that uses my new filter protocol. I know it is a big patch... but do you spot by any chance something that would cause problems on Windows?

Thanks,
Lars

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit more context: My patch stats a filter process that runs until the Git parent process dies... this seems to be the problem.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2c: I've seen similar issues before because Windows is particularly picky about code releasing file locks by explicitly closing file handles before exiting, while OS X / Linux tends to free them quickly when the process exits. I'm a a bit new to this code though so I can't suggest where exactly that might be atm.

Copy link
Owner Author

@larsxschneider larsxschneider Aug 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found something! Looks like @bwijen and @dscho fixed that issue already! 🎉
git-for-windows#755
This fix is not yet in Git upstream... that's why I missed it.
Only one remaining error on Windows... 😄

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for the record, here the remaining error:

  1. I have a branch with files that require filtering (amongst them test2.r)
  2. I switch to a branch that contains no files at all.

This switch is not successful and I get this error:

warning: unable to unlink test2.r: Invalid argument
Switched to branch 'empty'
error: The following untracked working tree files would be overwritten by checkout:
        test2.r
Please move or remove them before you can switch branches.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long silence... busy with CI...

Is it possible that your filter somehow still has a handle open on test2.r?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries! I think I fixed it! Could you review the fix on the mailing list?
http://public-inbox.org/git/[email protected]/

Here is the commit on GitHub:
5e5c0aa

@larsxschneider larsxschneider force-pushed the protocol-filter/v5 branch 5 times, most recently from 9b04200 to 511a40f Compare August 10, 2016 12:26
format_packet() dies if the caller wants to format a packet larger than
LARGE_PACKET_MAX. Certain callers might prefer an error response instead.

Add a parameter `gentle` to define if the function should signal an error
with the return value (gentle=1) or die (gentle=0).

Signed-off-by: Lars Schneider <[email protected]>
packet_write() has two shortcomings. First, it uses format_packet() which
lets the caller only send string data via "%s". That means it cannot be
used for arbitrary data that may contain NULs. Second, it will always
die on error.

Add packet_write_gently() which writes arbitrary data and returns `0` for
success and `-1` for an error.

Signed-off-by: Lars Schneider <[email protected]>
packet_write() would die in case of a write error even though for some callers
an error would be acceptable. Add packet_write_gently_fmt() which writes a
formatted pkt-line and returns `0` for success and `-1` for an error.

Signed-off-by: Lars Schneider <[email protected]>
packet_flush() would die in case of a write error even though for some callers
an error would be acceptable. Add packet_flush_gently() which writes a pkt-line
flush packet and returns `0` for success and `-1` for failure.

Signed-off-by: Lars Schneider <[email protected]>
packet_write_stream_with_flush_from_fd() and
packet_write_stream_with_flush_from_buf() write a stream of packets. All
content packets use the maximal packet size except for the last one.
After the last content packet a `flush` control packet is written.

packet_read_till_flush() reads arbitary sized packets until it detects
a `flush` packet.

Signed-off-by: Lars Schneider <[email protected]>
packet_write() should be called packet_write_fmt() as the string
parameter can be formatted.

Suggested-by: Junio C Hamano <[email protected]>
Signed-off-by: Lars Schneider <[email protected]>
According to LARGE_PACKET_MAX in pkt-line.h the maximal length of a
pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and
therefore the pkt-line data component must not exceed 65516 bytes.

Signed-off-by: Lars Schneider <[email protected]>
Git filter driver commands with spaces (e.g. `filter.sh foo`) are hard to
read in error messages. Quote them to improve the readability.

Signed-off-by: Lars Schneider <[email protected]>
Use `test_config` to set the config, check that files are empty with
`test_must_be_empty`, compare files with `test_cmp`, and remove spaces
after ">" and "<".

Signed-off-by: Lars Schneider <[email protected]>
Generate more interesting large test files with pseudo random characters
in between and reuse these test files in multiple tests. Run tests formerly
marked as EXPENSIVE every time but with a smaller data set.

Signed-off-by: Lars Schneider <[email protected]>
apply_filter() returns a boolean that tells the caller if it
"did convert or did not convert". The variable `ret` was used throughout
the function to track errors wheras `1` denoted success and `0` failure.
This is unusual for the Git source where `0` denotes success.

Rename the variable and flip its value to make the function easier
readable for Git developers.

Signed-off-by: Lars Schneider <[email protected]>
Git's clean/smudge mechanism invokes an external filter process for every
single blob that is affected by a filter. If Git filters a lot of blobs
then the startup time of the external filter processes can become a
significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge filter
written in golang to filter 12,000 files. This process took 364s with the
existing filter mechanism and 5s with the new mechanism. See details here:
git-lfs/git-lfs#1382

This patch adds the `filter.<driver>.process` string option which, if used,
keeps the external filter process running and processes all blobs with
the packet format (pkt-line) based protocol over standard input and standard
output described below.

Git starts the filter when it encounters the first file
that needs to be cleaned or smudged. After the filter started
Git sends a welcome message, a list of supported protocol
version numbers, and a flush packet. Git expects to read the
welcome message and one protocol version number from the
previously sent list. Afterwards Git sends a list of supported
capabilities and a flush packet. Git expects to read a list of
desired capabilities, which must be a subset of the supported
capabilities list, and a flush packet as response:
------------------------
packet:          git> git-filter-client
packet:          git> version=2
packet:          git> version=42
packet:          git> 0000
packet:          git< git-filter-server
packet:          git< version=2
packet:          git> clean=true
packet:          git> smudge=true
packet:          git> not-yet-invented=true
packet:          git> 0000
packet:          git< clean=true
packet:          git< smudge=true
packet:          git< 0000
------------------------
Supported filter capabilities in version 2 are "clean" and
"smudge".

Afterwards Git sends a list of "key=value" pairs terminated with
a flush packet. The list will contain at least the filter command
(based on the supported capabilities) and the pathname of the file
to filter relative to the repository root. Right after these packets
Git sends the content split in zero or more pkt-line packets and a
flush packet to terminate content.
------------------------
packet:          git> command=smudge\n
packet:          git> pathname=path/testfile.dat\n
packet:          git> 0000
packet:          git> CONTENT
packet:          git> 0000
------------------------

The filter is expected to respond with a list of "key=value" pairs
terminated with a flush packet. If the filter does not experience
problems then the list must contain a "success" status. Right after
these packets the filter is expected to send the content in zero
or more pkt-line packets and a flush packet at the end. Finally, a
second list of "key=value" pairs terminated with a flush packet
is expected. The filter can change the status in the second list.
------------------------
packet:          git< status=success\n
packet:          git< 0000
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< 0000  # empty list!
------------------------

If the result content is empty then the filter is expected to respond
with a success status and an empty list.
------------------------
packet:          git< status=success\n
packet:          git< 0000
packet:          git< 0000  # empty content!
packet:          git< 0000  # empty list!
------------------------

In case the filter cannot or does not want to process the content,
it is expected to respond with an "error" status. Depending on the
`filter.<driver>.required` flag Git will interpret that as error
but it will not stop or restart the filter process.
------------------------
packet:          git< status=error\n
packet:          git< 0000
------------------------

In case the filter cannot or does not want to process the content
as well as any future content for the lifetime of the Git process,
it is expected to respond with an "error-all" status. Depending on
the `filter.<driver>.required` flag Git will interpret that as error
but it will not stop or restart the filter process.
------------------------
packet:          git< status=error-all\n
packet:          git< 0000
------------------------

If the filter experiences an error during processing, then it can
send the status "error". Depending on the `filter.<driver>.required`
flag Git will interpret that as error but it will not stop or restart
the filter process.
------------------------
packet:          git< status=success\n
packet:          git< 0000
packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
packet:          git< 0000
packet:          git< status=error\n
packet:          git< 0000
------------------------

If the filter dies during the communication or does not adhere to
the protocol then Git will stop the filter process and restart it
with the next file that needs to be processed.

After the filter has processed a blob it is expected to wait for
the next "key=value" list containing a command. When the Git process
terminates, it will send a kill signal to the filter in that stage.

If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
is configured then these commands always take precedence over
a configured `filter.<driver>.process` command.

Helped-by: Martin-Louis Bright <[email protected]>
Reviewed-by: Jakub Narebski <[email protected]>
Signed-off-by: Lars Schneider <[email protected]>
Consider the case of a file that requires filtering and is present in branch A
but not in branch B. If A is the current HEAD and we checkout B then the
following happens:

1. ce_compare_data() opens the file
2.   index_fd() detects that the file requires to run a clean filter and
     calls index_stream_convert_blob()
4.     index_stream_convert_blob() calls convert_to_git_filter_fd()
5.       convert_to_git_filter_fd() calls apply_filter() which creates a new
         long running filter process (in case it is the first file of this kind
         to be filtered)
6.       The new filter process inherits all file handles. This is the default
         on Linux/OSX and is explicitly defined in the `CreateProcessW` call
         in `mingw.c` on Windows.
7. ce_compare_data() closes the file
8. Git unlinks the file as it is not present in B

The unlink operation does not work on Windows because the filter process has
still an open handle to the file. Apparently that is no problem on Linux/OSX.
Probably because "[...] the two file descriptors share open file status flags"
(see fork(2)).

Fix this problem by opening files in read-cache with the `O_CLOEXEC` flag to
ensure that the file descriptor does not remain open in a newly spawned process.
`O_CLOEXEX` is defined as `O_NOINHERIT` on Windows. A similar fix for temporary
file handles was applied on Git for Windows already:
git-for-windows@667b8b5

Signed-off-by: Lars Schneider <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants