Diff for *.doc files did not work as expected #355

Closed
asdqwezx opened this issue Sep 4, 2015 · 9 comments

asdqwezx commented Sep 4, 2015

Git for Windows 1.9.5 has a nice feature: it compares the text content of *.doc files when a diff is calculated.
Git 2.5.x treats *.doc files like ordinary binary files.

Member

rimrul commented Sep 4, 2015

This behaviour (and similar ones for pdf, rtf and docx) seems to be caused by the commits that edited /etc/gitattributes. This file and its supporting scripts apparently did not make it into the 2.x versions. Could you test this after the following steps:
Add these lines to your gitattributes file:

*.doc   diff=astextplain
*.DOC   diff=astextplain
*.pdf   diff=astextplain
*.PDF   diff=astextplain

Download the old 'astextplain' script to your Git bin directory.

Add these lines to your gitconfig:

[diff "astextplain"]
    textconv = astextplain

I can't test this currently since I don't have Git for Windows 2.5 installed.
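
For readers who want to see how the three pieces above fit together, here is a minimal sketch of what a textconv helper of this kind might look like. It is not the original astextplain script. The function names pick_converter and to_text are made up for illustration, and the converter invocations (the docx2txt one in particular) are assumptions about tools that would have to be on the PATH. Git runs the textconv program with the path of the file to convert as its only argument and diffs whatever the program prints to stdout.

```shell
#!/bin/sh
# Hypothetical textconv helper; NOT the original astextplain script.

# Map a file name to the converter mentioned in this thread.
# The name is lower-cased first so that *.DOC and *.doc behave the same.
pick_converter () {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    *.docx) echo docx2txt ;;
    *.doc)  echo antiword ;;
    *.pdf)  echo pdftotext ;;
    *)      echo cat ;;
    esac
}

# Convert one file to plain text on stdout.
# Assumption: each converter is installed and accepts these arguments.
to_text () {
    case "$(pick_converter "$1")" in
    antiword)  antiword "$1" ;;
    docx2txt)  docx2txt "$1" - ;;            # assumed to write to stdout
    pdftotext) pdftotext -layout "$1" - ;;   # "-" sends the text to stdout
    *)         cat "$1" ;;
    esac
}

# Git passes exactly one argument: the path of the file to convert.
if [ "$#" -eq 1 ]; then
    to_text "$1"
fi
```

Saved as an executable file named astextplain somewhere on the PATH, a script like this would be picked up by the `textconv = astextplain` setting shown above.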

Akkuzin commented Sep 4, 2015

Thanks for clarifying!
Your method works.
But simply adding the scripts is not enough: some underlying utilities are also needed (antiword, pdftotext, docx2txt), and installing these tools manually is very inconvenient!
Plain text comparison should work out of the box; it is kind of a dealbreaker for working with document archives.
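
To make the dependency problem concrete, a small check along the following lines could report which of the converters named above are absent from the PATH. This is only a sketch; the helper name missing_tools is hypothetical, and the tool names are taken from this comment as written (the actual executable names may differ between installations).

```shell
#!/bin/sh
# Print the name of every listed tool that cannot be found on PATH.
missing_tools () {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Report the converters this thread relies on that are not installed.
missing=$(missing_tools antiword docx2txt pdftotext)
if [ -n "$missing" ]; then
    echo "missing converters:" $missing >&2
fi
```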

Member

rimrul commented Sep 4, 2015

It was not meant as a final solution, but as a check whether it still works. Documents are basically binary files, so conversion to text is not exactly an out-of-the-box feature. Git's main purpose is source code versioning, which means it is optimized for plain text. Since both scripts (astextplain and docx2txt) plus antiword add up to roughly 250 kB here, I don't think @dscho would have much of a problem if we added them back in. I'll take a look at making an installer and comparing the sizes when I find time for that. I'll notify you about any pull requests.
EDIT: I can't find pdftotext in github.com/msys/* so maybe this feature will stay missing.

Member

dscho commented Sep 5, 2015

> I can't find pdftotext in github.com/msys/* so maybe this feature will stay missing.

You will find it in the poppler package. You can install it via pacman -S mingw-w64-x86_64-poppler in a 64-bit Git for Windows SDK, but it will take a while and download ~5MB.

Just for fun, I contributed a package definition for Xpdf which also provides a pdftotext.exe, but this still needs the libstdc++ DLL, so I am not sure just how much of a size penalty we would incur.

Member

rimrul commented Sep 10, 2015

My Git 1.9.4 and 2.5.1 currently display the same message when I'm not inside a repository:
[Screenshot: mingw32__ 2015-09-10 22 30 06]
[Screenshot: mingw64__ 2015-09-10 22 31 24]
I will have another look at this inside a repository on Monday. The gitattributes and gitconfig files install fine, though. git config -l returns diff.astextplain.textconv=astextplain as intended. The doc/docx converters add roughly 94 kiB to the installer, but I haven't included pdftotext yet since I don't want to add the whole package.
[Screenshot: sizes]

EDIT: Just noticed that my Git 2.5.1 searches ~/.gitattributes instead of /etc/gitattributes. I should probably look at regular Linux behaviour for this.

Member

rimrul commented Sep 11, 2015

Next Update: Git 2.x searches /$(prefix)/etc/gitattributes instead of /etc/gitattributes. I guess Git 1.x was just built without the prefix. I apparently also forgot to include the antiword mapping files. I'm getting closer.
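
A quick way to check whether the attributes file is being picked up at all is to ask Git directly. The sketch below wraps git check-attr; the helper name diff_driver_for is hypothetical. If it prints astextplain for a .doc path, the `*.doc diff=astextplain` line was found in one of the attribute files Git consults.

```shell
#!/bin/sh
# Print the diff driver Git assigns to a path.
# `git check-attr diff -- FILE` prints a line of the form "FILE: diff: VALUE";
# the sed call keeps only VALUE.
diff_driver_for () {
    git check-attr diff -- "$1" | sed 's/^.*: diff: //'
}
```

Inside a repository, `diff_driver_for a.doc` can then be compared with the output of `git config --get diff.astextplain.textconv`, which should name the script that the driver runs.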

Member

rimrul commented Sep 12, 2015

I'm still having slight differences in the doc conversion, but I got it running. I've also gotten docx2txt running as intended. I should probably use the unzip package instead of adding unzip to the git-extra package. I guess I should also introduce a separate package for antiword and its 30-ish mapping files.

This is my current Git 2.5.1 result for doc to text conversion:

$ git diff --cached
diff --git a/a.doc b/a.doc
new file mode 100644
index 0000000..f345d74
--- /dev/null
+++ b/a.doc
@@ -0,0 +1,3 @@
+^M
+a.doc^M
+^M

This is the intended result that Git 1.9.4 produces:

$ git diff --cached
WARNING: terminal is not fully functional
diff --git a/a.doc b/a.doc
new file mode 100644
index 0000000..f345d74
--- /dev/null
+++ b/a.doc
@@ -0,0 +1,3 @@
+
+a.doc
+

I would assume it's a conversion issue between CRLF and LF, but I don't know why it would be converted in Git 1.9.4 and not in Git 2.5.1. Any ideas, @dscho? Maybe we could fix this by running the doc file through dos2unix before feeding it to antiword? No results with PDF conversion yet. The current installer for this issue is 171 kiB bigger than the 2.5.1 installer I've built. That size might be wrong, since I think I didn't rebuild git-extra and instead added the required files manually to the fitting locations.

TODO:

  • figure out if antiword needs POSIX emulation
  • resolve the CRLF issue
  • introduce antiword package
  • use unzip and antiword packages
  • add pdf support
  • open pull request
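
As a sketch of one possible approach to the CRLF item in the list above: instead of touching the binary .doc input, the carriage returns could be removed from the converter's text output. The helper name strip_cr is hypothetical, and whether this matches what Git 1.9.4 did is not established in this thread.

```shell
#!/bin/sh
# Delete every carriage return (CR) from stdin, turning CRLF into LF.
# Note: this also drops CRs that are not part of a line ending.
strip_cr () {
    tr -d '\r'
}
```

In a textconv script this would be used as a filter, for example `antiword a.doc | strip_cr`, so the `^M` markers no longer show up in the diff.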

Member

dscho commented Sep 12, 2015

> I should probably use the unzip package instead of adding unzip to the git-extra package

I would add it here: https://github.com/git-for-windows/build-extra/blob/b1ada75d730cd6b5e74ac202841b46e28a5399f0/make-file-list.sh#L89 (and here: https://github.com/git-for-windows/build-extra/blob/b1ada75d730cd6b5e74ac202841b46e28a5399f0/make-file-list.sh#L73).

In any case, would you have some code to show? If you do, please open a Pull Request (with the prefix "DO NOT MERGE YET:").

rimrul added a commit to rimrul/build-extra that referenced this issue Sep 13, 2015
Converting Word files to text before diffing them allows an easier comparison between changed files. This
reintroduces some functionality of Git for Windows 1.7.x+.  It was requested by a user to reintroduce this feature in
git-for-windows/git#355 .

Signed-off-by: Matthias Aßhauer <[email protected]>
Member

rimrul commented Sep 14, 2015

For those who follow this thread, but have not had a look at git-for-windows/build-extra#75: I've opened a pull request for the first version of these changes, but they aren't ready to be merged yet. @dscho and I have created packages for antiword and docx2txt and created the pull requests msys2/MSYS2-packages#345 and Alexpux/MINGW-packages#781. Both pull requests have been merged and I'm currently working on the package integration, the CRLF issue and PDF support.

rimrul added a commit to rimrul/build-extra that referenced this issue Sep 15, 2015
Converting Word files to text before diffing them allows an easier comparison between changed files. This
reintroduces some functionality of Git for Windows 1.7.x+.  It was requested by a user to reintroduce this feature in
git-for-windows/git#355 .

Signed-off-by: Matthias Aßhauer <[email protected]>
rimrul added a commit to rimrul/build-extra that referenced this issue Sep 16, 2015
Converting Word files to text before diffing them allows an easier comparison between changed files. This
reintroduces some functionality of Git for Windows 1.x.

This fixes git-for-windows/git#355 .

Signed-off-by: Matthias Aßhauer <[email protected]>
rimrul added a commit to rimrul/build-extra that referenced this issue Sep 16, 2015
Converting PDF and Word files to text before diffing them allows an easier comparison between changed
files. This reintroduces some functionality of Git for Windows 1.x.

This fixes git-for-windows/git#355 .

Signed-off-by: Matthias Aßhauer <[email protected]>
rimrul added a commit to rimrul/build-extra that referenced this issue Sep 17, 2015
Converting PDF and Word files to text before diffing them allows an easier comparison between changed
files. This reintroduces some functionality of Git for Windows 1.x.
Including only unzip.exe instead of the entire unzip package makes the installer increase by only 61 kiB
instead of 84 kiB, hence we opted for the former. pdftotext exists in the xpdf package (adds 2860 kiB) and
in the poppler package (adds 13250 kiB); we opted to include the xpdf pdftotext.exe and its dependency
libstdc++-6.dll, which add 550 kiB to the installer, instead of the poppler pdftotext.exe and its 23 additional DLLs,
which would increase the size of the installer by 2032 kiB.
In total this commit increases the size of the installer by 2220 kiB.

This fixes git-for-windows/git#355 .

Signed-off-by: Matthias Aßhauer <[email protected]>