Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert pdf, doc and docx files to text by default #75

Closed
wants to merge 1 commit into from

Conversation

rimrul
Copy link
Member

@rimrul rimrul commented Sep 13, 2015

Converting PDF and Word files to text before diffing them allows an easier comparison between changed files. This reintroduces some functionality of Git for Windows 1.x. It was requested by user @asdqwezx to reintroduce this feature in #355. This pull request is not yet ready to be merged, there are still slight Issues with the doc conversion and the pdf conversion is yet to be reintroduced. All "new" files are copied out of the old msysgit repositories, and some have been slightly altered to work with Git 2.x.

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

Should we upgrade to the current docx2txt v1.4 while we are changing this?

@dscho
Copy link
Member

dscho commented Sep 13, 2015

It is a very good start!

The antiword program should be packaged as a package, though. Adding it as a binary without source code is inappropriate.

You will find excellent documentation on Git for Windows' package management on the wiki, and there are pretty good examples in both MSYS2-packages' and MINGW-packages' commit history how a new package can be added, e.g. mingw-w64-connect.

I have no idea whether antiword can be compiled in default mode (i.e. whether it is a MinGW program), or whether it has to be compiled in an MSys shell (i.e. whether it is an MSys2 program); This would be the indicator whether the package needs to go into MSYS2-packages or MINGW-packages.

After you have the package, you do not need to list every single of antiword's files in make-file-list.sh but instead simply list the package.

Listing single files (and adding the information about the corresponding package's version) is something I would do only for packages that ship way too many things, where we want to cherry-pick only a few files.

I have no idea where the docx2txt comes from, but I suspect that we would want to add a new package for that as well, of course using the current version.

@dscho dscho changed the title [DO NOT MERGE] Convert doc and docx files to text by default [DO NOT MERGE YET] Convert doc and docx files to text by default Sep 13, 2015
@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

Yes, I'll test how I can get antiword compiled from source, the author mentions djgpp or cygwin as ways to build it, but hasn't tested it himself. docx2txt seems to have its source (it's a perl script) uploaded to sourceforge by the original author. Our versions of antiword and docx2txt are both from the old msysgit repository where antiword was just the same inappropriate binary without source.

@dscho
Copy link
Member

dscho commented Sep 13, 2015

Yes, I'll test how I can get antiword compiled from source, the author mentions djgpp or cygwin as ways to build it, but hasn't tested it himself.

It is typically fairly easy to get it to build in MSys2. Just try it, download the source, extract it, run ./configure (or make or whatever is the recommended way to build it).

With that information, the PKGBUILD should be really easy to write (or even better: just take the PKGBUILD from the mingw-w64-connect example I linked earlier, go through it line by line and adjust what looks like it needs to be adjusted).

Our versions of antiword and docx2txt are both from the old msysgit repository where antiword was just the same inappropriate binary without source.

Yeah, my mistake. I should not have accepted that... 😇

@dscho
Copy link
Member

dscho commented Sep 13, 2015

docx2txt seems to have its source (it's a perl script) uploaded to sourceforge by the original author.

Okay, let's package that just the same like Markdown (which is also a Perl script): https://github.com/Alexpux/MSYS2-packages/blob/master/markdown/PKGBUILD

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

I created a kind of working docx2txt package. When built withmakepkg-mingw -s, it installs files to
/usr/src/MINGW-packages/mingw-w64-docx2txt/pkg/mingw-w64-x86_64-docx2txt/mingw64/bin and
/usr/src/MINGW-packages/mingw-w64-docx2txt/pkg/mingw-w64-x86_64-docx2txt/mingw64/etc that I intended to install to /mingw64/bin and /mingw64/etc. Is this normal makepkg-mingw -s-behaviour or should I remove the $pkgdir from the call to make? Can I use this package to make an installer already or do I need to create a pull request and wait for it to be merged before testing the new package in the installer? Where should I open the pull request? On the GfW repo or on the base repo?

@dscho
Copy link
Member

dscho commented Sep 13, 2015

I created a kind of working docx2txt package.

Could you fork MINGW-packages and push your topic branch?

@dscho
Copy link
Member

dscho commented Sep 13, 2015

When built withmakepkg-mingw -s, it installs files to
/usr/src/MINGW-packages/mingw-w64-docx2txt/pkg/mingw-w64-x86_64-docx2txt/mingw64/bin and
/usr/src/MINGW-packages/mingw-w64-docx2txt/pkg/mingw-w64-x86_64-docx2txt/mingw64/etc that I intended to install to /mingw64/bin and /mingw64/etc. Is this normal makepkg-mingw -s-behaviour or should I remove the $pkgdir from the call to make?

The idea is that makepkg-mingw "installs" it into the pkg/ subdirectory, indeed, and then builds the mingw-w64-<arch>-doc2txt-<version>.pkg.tar.xz package from that subdirectory. That .pkg.tar.xz file should then be installed for real, via pacman -U <file>.

The only suspicious thing there is the extra /mingw-w64-x86_64-docx2txt/ level before /mingw64/bin; that should not be the case.

Can I use this package to make an installer already or do I need to create a pull request and wait for it to be merged before testing the new package in the installer? Where should I open the pull request? On the GfW repo or on the base repo?

You could already make the installer, but I think that I would try to focus on building the package first.

BTW since it is a Perl script, i.e. it requires perl (which is an MSys2 package), I would put it into the MSYS2-packages repository and call it docx2txt rather than mingw-w64-docx2txt, but those are really only details that are easily remedied once you have a working PKGBUILD.

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

Could you fork MINGW-packages and push your topic branch?

Done.

BTW since it is a Perl script, i.e. it requires perl (which is an MSys2 package), I would put it into the MSYS2-packages repository and call it docx2txt rather than mingw-w64-docx2txt, but those are really only details that are easily remedied once you have a working PKGBUILD.

Yes we should do that, sounds like that shouldn't be too hard.

@dscho
Copy link
Member

dscho commented Sep 13, 2015

Could you fork MINGW-packages and push your topic branch?

Done.

Perfect, thanks. I adjusted it so it is a proper MSys2 package (since it depends on an MSys2 package, it is appropriate):

msys2/MSYS2-packages@master...dscho:docx2txt

Please note that you need to open an MSys2 shell (i.e. launch C:\git-sdk-64\msys2_shell.bat) before building, and you need to change directory to /usr/src/MSYS2-packages/docx2txt/ (after initializing /usr/src/MSYS2-packages/ and fetching my docx2txt branch), and the build is started using makepkg -s (i.e. not makepkg-mingw -s).

The resulting package looks pretty good to me, but I did not test anything.

The only thing I noticed is that there are now two scripts: docx2txt.pl and docx2txt.sh... Maybe rename the latter to docx2txt?

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

The only thing I noticed is that there are now two scripts: docx2txt.pl and docx2txt.sh... Maybe rename the latter to docx2txt?

The .sh file is just a wrapper around the .pl file that always outputs a textfile. Since diff needs stdout output for its textconv we should keep the perl script. If you want to open a pull request to the base repo I would keep the files with their endings, since they are the output of the makefile, which is the recommended way to install docx2txt. I would just attend .pl to the call in astextplain.

The resulting package looks pretty good to me, but I did not test anything.

Mine (still a MINGW-Package, but that shouldn't be a big diffference here) doesn't contain a mingw-w64-x86_64-docx2txt/ folder inside the tar.xz file so that shouldn't cause any problems. I'll test the installation with pacman.

docx2txt seems to have its source (it's a perl script) uploaded to sourceforge by the original author.

Okay, let's package that just the same like Markdown

the only significant difference I see in the Markdown PKGBUILD is a pod2man call. Since docx2txt contains no POD we won't find anything to convert to manpages.

@dscho
Copy link
Member

dscho commented Sep 13, 2015

Oh, and I just noticed a suspicious line:

Setting systemConfigDir to [...<wrong path>...]

where the points to inside the pkg/ directory...

The only thing I noticed is that there are now two scripts: docx2txt.pl and docx2txt.sh... Maybe rename the latter to docx2txt?

The .sh file is just a wrapper around the .pl file that always outputs a textfile. Since diff needs stdout output for its textconv we should keep the perl script.

Makes sense, in particular since the .sh file seems to require the Perl script anyway.

If you want to open a pull request to the base repo I would keep the files with their endings, since they are the output of the makefile, which is the recommended way to install docx2txt. I would just attend .pl to the call in astextplain.

Hmm. I still would like to rename it because it would make it easier to use (if a random user installs the docx2txt package and sees that there are both a .pl and a .sh file, they might be confused -- as I was -- which one to use).

The resulting package looks pretty good to me, but I did not test anything.

Mine (still a MINGW-Package, but that shouldn't be a big diffference here) doesn't contain a mingw-w64-x86_64-docx2txt/ folder inside the tar.xz file so that shouldn't cause any problems. I'll test the installation with pacman.

Thank you.

However, I guess that the CONFIGDIR needs to be set to /etc/ in order for the test to be truly successful: it will probably work for you because you have that pkg/ directory, but if you install the Pacman package on some other machine, it will probably fail.

docx2txt seems to have its source (it's a perl script) uploaded to sourceforge by the original author.

Okay, let's package that just the same like Markdown

the only significant difference I see in the Markdown PKGBUILD is a pod2man call. Since docx2txt contains no POD we won't find anything to convert to manpages.

Yeah, I agree. The only documentation is the usage, and there is no need to convert that into a man page.

@dscho
Copy link
Member

dscho commented Sep 13, 2015

I just figured out that leaving off the CONFIGDIR setting should fix it ;-) Force-pushed.

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

However, I guess that the CONFIGDIR needs to be set to /etc/ in order for the test to be truly successful: it will probably work for you because you have that pkg/ directory, but if you install the Pacman package on some other machine, it will probably fail.

Your prediction was right, it configured the CONFIGDIR wrong.

If you want to open a pull request to the base repo I would keep the files with their endings, since they are the output of the makefile, which is the recommended way to install docx2txt. I would just attend .pl to the call in astextplain.

Hmm. I still would like to rename it because it would make it easier to use (if a random user installs the docx2txt package and sees that there are both a .pl and a .sh file, they might be confused -- as I was -- which one to use).

Since we need neither the shell script nor the config file (just tested the installed v1.4 with $systemConfigDir = "/";) how about we drop the make call and just copy the perl script without the extension?

Edit:

I just figured out that leaving off the CONFIGDIR setting should fix it ;-) Force-pushed

you where a little faster...

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

Investigated the antiword build. The following commands where needed:

rm Makefile
mv Makefile.Linux Makefile
make install

antiword built into $HOME/bin and intalled its mapping files into $HOME/.antiword.
$home/bin contains antiword.exe and kantiword. kantiword is some script to make drag and drop work in KDE. I hope there are no plans to support running KDE (or gnome or LXDE etc.) from Git bash. $HOME/.antiword contains the mappingfiles I pulled from the msysgit repo plus a Default file that seems to map MS Word fonts to RiscOS fonts, an Example file that contains a slightly different mapping to RiscOS fonts, a fontnames.russian file that maps cyrillic Word fonts to corresponding PostScript fonts and two files UnicodeXX that are variations of the corresponding 8859-XX.txt files for RiscOS.

Edit: Ok, we should probably run

rm Makefile
mv Makefile.Linux Makefile
make all
install Resources/* $pkgdir/$mingwdir/share/antiword
install antiword.exe $pkgdir/$mingwdir/bin

@dscho
Copy link
Member

dscho commented Sep 13, 2015

Since we need neither the shell script nor the config file (just tested the installed v1.4 with $systemConfigDir = "/";) how about we drop the make call and just copy the perl script without the extension?

In the interest of making a package that is as useful as possible, I would rather install the whole thing; That would make MSys2 benefit most from your work, not just Git for Windows.

Investigated the antiword build. The following commands where needed:

rm Makefile
mv Makefile.Linux Makefile
make install

Good catch!

You could also call

make -f Makefile.Linux install

instead.

I hope there are no plans to support running KDE (or gnome or LXDE etc.) from Git bash.

LOL... no, there isn't ;-)

Good work!

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

In the interest of making a package that is as useful as possible, I would rather install the whole thing; That would make MSys2 benefit most from your work, not just Git for Windows.

That's what I originally intended to achieve. But removing the extension of the perl script would break the shell script. We could regex search and replace inside the shell script using perl or sed, though.
/etc should usually work as the $systemConfigDir.

You could also call
make -f Makefile.Linux install
instead.

So we should end up with something like

build()
{
    cd $srcdir
    make -f Makefile.Linux antiword
}
package()
{
    install $srcdir/Resources/* $pkgdir/$mingwdir/share/antiword
    install $srcdir/antiword.exe $pkgdir/$mingwdir/bin
}

@dscho
Copy link
Member

dscho commented Sep 13, 2015

In the interest of making a package that is as useful as possible, I would rather install the whole thing; That would make MSys2 benefit most from your work, not just Git for Windows.

That's what I originally intended to achieve. But removing the extension of the perl script would break the shell script.

Oh sorry! I was unclear. Since the .sh script is clearly the one that is intended to be the end-user script (and the Perl script being the work-horse that you can also call from elsewhere), I was only talking about the .sh extension to be stripped.

I did that and it looks pretty good, a simple test did exactly what I hoped it would do: OneDrive's "Getting started" document (which is a .docx) got translated as I wished.

Are you okay with these changes? msys2/MSYS2-packages@4fd937c If so, I would squash those changes in and submit the build recipe upstream.

So we should end up with something like

build()
{
cd $srcdir
make -f Makefile.Linux antiword
}
prepare()
{
install $srcdir/Resources/* $pkgdir/$mingwdir/share/antiword
install $srcdir/antiword.exe $pkgdir/$mingwdir/bin
}

Yes. You probably also need to pass -D to the install executable so that the directories are created.

Oh, and I just saw that in other PKGBUILD scripts, the common invocation is:

install -D -m755 <something>.exe "${pkgdir}"${MINGW_PREFIX}/bin/<something>.exe

i.e. the MINGW_PREFIX environment variable is already available, and I guess upstream likes curly brackets 😀

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

Are you okay with these changes? msys2/MSYS2-packages@4fd937c If so, I would squash those changes in and submit the build recipe upstream.

Seems fine to me. I'll write a PKGBUILD for antiword and generate an upstream pull request at Alexpux/MINGW-packages

@dscho
Copy link
Member

dscho commented Sep 13, 2015

Cool!

@dscho dscho closed this Sep 13, 2015
@dscho dscho reopened this Sep 13, 2015
@dscho
Copy link
Member

dscho commented Sep 13, 2015

Whoopsie! Sorry, hit the wrong button 😊

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

Ok, antiword is a mess. On http://www.winfield.demon.nl/linux/ it lists antiword-0.37.tar.gz as 310K. The version my Firefox downloads is reproducably 1.7M big. the version makepkg-mingw downloads is 310K. 7Zip opens both as valid files. It opens the big version in one step and the small one in two (a gz archive containing a tar archive)... This slight difference causes a checksum mismatch -.-

@dscho
Copy link
Member

dscho commented Sep 13, 2015

The version my Firefox downloads is reproducably 1.7M big.

Firefox probably "helpfully" uncompresses it.

@dscho
Copy link
Member

dscho commented Sep 13, 2015

The docx2txt Pull Request was merged: msys2/MSYS2-packages#345

@rimrul
Copy link
Member Author

rimrul commented Sep 13, 2015

I've opened Alexpux/MINGW-packages#781 for Antiword. We'll wait for that to be merged, and then I'll introduce those packages here. Enough done for today.

@dscho
Copy link
Member

dscho commented Sep 13, 2015

Enough done for today.

You did great! Thank you! (Alexpux/MINGW-packages#781 was merged, too!)

@dscho
Copy link
Member

dscho commented Sep 14, 2015

And the mingw-w64-antiword package now hit MSys2's Pacman repository (but not the docx2txt package...)!

@rimrul
Copy link
Member Author

rimrul commented Sep 14, 2015

I've removed the unzip, docx2txt and antiword files from git-extra/PKGBUILD and make-file-list.sh. Should I add the unzip, docx2txt and antiword packages to line 30 (pacman -S --needed...), line 33(pacman_list mingw-w64-$ARCH-git...) or line 73 (pacman -Q filesystem) of make-file-list.sh? Would you prefer a commit on top of e5a8144, an ammended commit based on e5a8144 or all changes rebased into a single commit on top of d930be0? I'd say rebasing onto d930be0 would produce more precise results when meassuring the size difference.

@dscho
Copy link
Member

dscho commented Sep 15, 2015

I've removed the unzip, docx2txt and antiword files from git-extra/PKGBUILD and make-file-list.sh.

Good!

Should I add the unzip, docx2txt and antiword packages to line 30 (pacman -S --needed...), line 33(pacman_list mingw-w64-$ARCH-git...) or line 73 (pacman -Q filesystem) of make-file-list.sh?

They need to be on line 30 for sure because the Git for Windows SDK 1.0.0 did not install them, and contributors who were early adopters should not suffer for their early help.

Whether you add them to line 33 or 73 depends on the size it adds. For example, if the unzip package contains much more than just the unzip.exe it makes sense to add just the file to line 91 and list the unzip package on line 73. If it does not really matter, installer size-wise, it is better to add the unzip package to line 33 and be done with it.

Would you prefer a commit on top of e5a8144, an ammended commit based on e5a8144 or all changes rebased into a single commit on top of d930be0?

Yeah, a single commit (i.e. an amended e5a8144) rebased on top of current master would make most sense.

Thank you!

@dscho
Copy link
Member

dscho commented Sep 15, 2015

BTW I made a couple of mistakes in the docx2txt package definition which @Alexpux has fixed gracefully, and he said that he would upload the packages today...

@dscho
Copy link
Member

dscho commented Sep 15, 2015

And the docx2txt packages were uploaded! Yeah!

@rimrul rimrul force-pushed the issue-355 branch 3 times, most recently from 0b70fcd to c32af21 Compare September 15, 2015 21:34
@rimrul
Copy link
Member Author

rimrul commented Sep 15, 2015

Just pushed two three updates: one update and two bugfixes for said update, the third one (c32af21) should be allright. Maybe I'll take a look at pdftotext tomorrow morning.

@rimrul rimrul force-pushed the issue-355 branch 3 times, most recently from 5a89ee6 to c6508e0 Compare September 16, 2015 19:13
@rimrul rimrul changed the title [DO NOT MERGE YET] Convert doc and docx files to text by default [DO NOT MERGE YET] Convert pdf, doc and docx files to text by default Sep 16, 2015
@rimrul
Copy link
Member Author

rimrul commented Sep 16, 2015

I've done the git rebase --whitespace=fix HEAD^ and added the pdf function, but I have yet to check for form feed issues, rebase onto master and check the size differences. We are definitly getting close to a good result.

@dscho
Copy link
Member

dscho commented Sep 16, 2015

Very good! I am looking forward to merging this!

@rimrul
Copy link
Member Author

rimrul commented Sep 17, 2015

It seems, that removing unzip.exe increases the installer size by 215 bytes. I have no idea, how that works.

@rimrul
Copy link
Member Author

rimrul commented Sep 17, 2015

It seems, that removing unzip.exe increases the installer size by 215 bytes. I have no idea, how that works.

Logical error here. Removing unzip from the rebased c6508e0 installer produces different results than adding it to the master installer...

@dscho
Copy link
Member

dscho commented Sep 17, 2015

:-) Compression algorithms are strange, anyway.

Converting PDF and  Word files to text before diffing them allows an easier comparison between changed
files. This reintroduces some functionality of Git for Windows 1.x.
Including only unzip.exe instead of the entire unzip package makes the installer increase only by 61 kiB
instead of 84 kiB, hence the we opted for the former. pdftotext exists in the xpdf package (adds 2860 kiB) and
the poppler package (adds 13250 kiB), we opted to include the xpdf pdftotext.exe and its dependency
libstdc++-6.dll that add 550 kiB to the installer instead of the poppler pdftotext.exe and its 23 additional dlls,
that would increase the size of the installer by 2032 kiB.
In total this commit increases the size of the installer by 2220 kiB.

This fixes git-for-windows/git#355 .

Signed-off-by: Matthias Aßhauer <[email protected]>
@rimrul
Copy link
Member Author

rimrul commented Sep 17, 2015

Ok, I think I'm ready. I've solved all remaining problems, switched to xpdf and included the size differences into the commit message.

@rimrul rimrul changed the title [DO NOT MERGE YET] Convert pdf, doc and docx files to text by default Convert pdf, doc and docx files to text by default Sep 17, 2015
dscho added a commit that referenced this pull request Sep 17, 2015
Convert pdf, doc and docx files to text by default

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho
Copy link
Member

dscho commented Sep 17, 2015

Thank you! In my tests, the size difference was much more favorable, though: just 701kB. I adjusted the commit message (and I also adjusted the make-file-list.sh file because the portable Git cannot be built when there are duplicate entries: 7-Zip does not like that at all).

Thank you for your contribution! The bulk of the work was yours.

@dscho dscho closed this Sep 17, 2015
@dscho
Copy link
Member

dscho commented Sep 18, 2015

@rimrul rimrul deleted the issue-355 branch December 5, 2018 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants