Convert pdf, doc and docx files to text by default #75
Should we upgrade to the current docx2txt v1.4 while we are changing this? |
It is a very good start! The You will find excellent documentation on Git for Windows' package management on the wiki, and there are pretty good examples in both I have no idea whether After you have the package, you do not need to list every single of Listing single files (and adding the information about the corresponding package's version) is something I would do only for packages that ship way too many things, where we want to cherry-pick only a few files. I have no idea where the |
Yes, I'll test how I can get |
It is typically fairly easy to get it to build in MSys2. Just try it, download the source, extract it, run With that information, the
Yeah, my mistake. I should not have accepted that... 😇 |
Okay, let's package that the same way as Markdown (which is also a Perl script): https://github.com/Alexpux/MSYS2-packages/blob/master/markdown/PKGBUILD |
I created a kind of working docx2txt package. When built with |
Could you fork MINGW-packages and push your topic branch? |
The idea is that The only suspicious thing there is the extra
You could already make the installer, but I think that I would try to focus on building the package first. BTW since it is a Perl script, i.e. it requires |
Yes we should do that, sounds like that shouldn't be too hard. |
Perfect, thanks. I adjusted it so it is a proper MSys2 package (since it depends on an MSys2 package, it is appropriate): msys2/MSYS2-packages@master...dscho:docx2txt Please note that you need to open an MSys2 shell (i.e. launch The resulting package looks pretty good to me, but I did not test anything. The only thing I noticed is that there are now two scripts: |
The .sh file is just a wrapper around the .pl file that always outputs a text file. Since diff needs output on stdout for its textconv, we should keep the Perl script. If you want to open a pull request to the base repo I would keep the files with their endings, since they are the output of the makefile, which is the recommended way to install
Mine (still a MINGW-Package, but that shouldn't be a big difference here) doesn't contain a
the only significant difference I see in the Markdown |
Oh, and I just noticed a suspicious line:
where the points to inside the
Makes sense, in particular since the
Hmm. I still would like to rename it because it would make it easier to use (if a random user installs the
Thank you. However, I guess that the
Yeah, I agree. The only documentation is the usage, and there is no need to convert that into a man page. |
I just figured out that leaving off the |
Your prediction was right, it configured the CONFIGDIR wrong.
Since we need neither the shell script nor the config file (just tested the installed v1.4 with Edit:
you were a little faster... |
Investigated the antiword build. The following commands were needed:
antiword built into Edit: Ok, we should probably run
In the interest of making a package that is as useful as possible, I would rather install the whole thing; that would make MSys2 benefit most from your work, not just Git for Windows.
Good catch! You could also call make -f Makefile.Linux install instead.
LOL... no, there isn't ;-) Good work! |
That's what I originally intended to achieve. But removing the extension of the perl script would break the shell script. We could regex search and replace inside the shell script using perl or sed, though.
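The search-and-replace idea mentioned here could look roughly like the sketch below. It assumes the wrapper refers to the Perl script by the literal name `docx2txt.pl`; the function name `strip_pl_extension` and the wrapper path passed to it are hypothetical, and the real wrapper may reference the script differently.

```shell
#!/bin/sh
# Sketch only: rewrite references to "docx2txt.pl" inside a wrapper script so
# that they point at an extension-less "docx2txt" instead.
# The function name and the wrapper path are hypothetical placeholders.
strip_pl_extension() {
    wrapper=$1
    # The dot is escaped so that only the literal ".pl" suffix is matched.
    # -i (GNU sed, as shipped with MSys2) edits the file in place.
    sed -i 's/docx2txt\.pl/docx2txt/g' "$wrapper"
}
```

A call such as `strip_pl_extension path/to/wrapper.sh` would then be run after the Perl script itself has been renamed.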
So we should end up with something like
Oh sorry! I was unclear. Since the I did that and it looks pretty good, a simple test did exactly what I hoped it would do: OneDrive's "Getting started" document (which is a Are you okay with these changes? msys2/MSYS2-packages@4fd937c If so, I would squash those changes in and submit the build recipe upstream.
Yes. You probably also need to pass Oh, and I just saw that in other install -D -m755 <something>.exe "${pkgdir}"${MINGW_PREFIX}/bin/<something>.exe i.e. the |
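As a rough illustration of the `install -D -m755` pattern quoted above, here is a minimal sketch. In a real build recipe, `pkgdir` and `MINGW_PREFIX` are provided by the packaging tools; the helper name `install_binary` and the file it is given are assumptions made for this example only.

```shell
#!/bin/sh
# Sketch only: copy one executable into the package staging directory.
# pkgdir and MINGW_PREFIX must already be set by the caller (the packaging
# tools normally do this); install_binary is a hypothetical helper name.
install_binary() {
    src=$1
    name=$(basename "$src")
    # -D creates missing parent directories; -m755 marks the file executable.
    install -D -m755 "$src" "${pkgdir}${MINGW_PREFIX}/bin/${name}"
}
```

Writing the target as one quoted string, `"${pkgdir}${MINGW_PREFIX}/bin/${name}"`, keeps the path intact even if a component contains spaces.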
Seems fine to me. I'll write a |
Cool! |
Whoopsie! Sorry, hit the wrong button 😊 |
Ok, antiword is a mess. On http://www.winfield.demon.nl/linux/ it lists |
Firefox probably "helpfully" uncompresses it. |
The |
I've opened Alexpux/MINGW-packages#781 for Antiword. We'll wait for that to be merged, and then I'll introduce those packages here. Enough done for today. |
You did great! Thank you! (Alexpux/MINGW-packages#781 was merged, too!) |
And the |
I've removed the |
Good!
They need to be on line 30 for sure because the Git for Windows SDK 1.0.0 did not install them, and contributors who were early adopters should not suffer for their early help. Whether you add them to line 33 or 73 depends on the size it adds. For example, if the
Yeah, a single commit (i.e. an amended e5a8144) rebased on top of current Thank you! |
BTW I made a couple of mistakes in the |
And the |
0b70fcd to c32af21
Just pushed |
5a89ee6 to c6508e0
I've done the |
Very good! I am looking forward to merging this! |
It seems that removing |
Logical error here. Removing unzip from the rebased c6508e0 installer produces different results than adding it to the master installer... |
:-) Compression algorithms are strange, anyway. |
Converting PDF and Word files to text before diffing them allows an easier comparison between changed files. This reintroduces some functionality of Git for Windows 1.x. Including only unzip.exe instead of the entire unzip package makes the installer increase only by 61 kiB instead of 84 kiB, hence we opted for the former. pdftotext exists in the xpdf package (adds 2860 kiB) and the poppler package (adds 13250 kiB). We opted to include the xpdf pdftotext.exe and its dependency libstdc++-6.dll, which add 550 kiB to the installer, instead of the poppler pdftotext.exe and its 23 additional dlls, which would increase the size of the installer by 2032 kiB. In total this commit increases the size of the installer by 2220 kiB. This fixes git-for-windows/git#355. Signed-off-by: Matthias Aßhauer <[email protected]>
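Size comparisons like the ones in this commit message can be reproduced by building the installer twice and comparing the two files. The sketch below is one possible way to do that and is not taken from the thread; the function name `size_delta_kib` is hypothetical, and the two paths are whatever installer files the caller wants to compare.

```shell
#!/bin/sh
# Sketch only: print how much larger the second file is than the first,
# in KiB, with the result truncated toward zero.
# size_delta_kib is a hypothetical helper name, not part of any tool.
size_delta_kib() {
    before=$(wc -c < "$1") || return 1
    after=$(wc -c < "$2") || return 1
    echo $(( (after - before) / 1024 ))
}
```

With the figures quoted in the commit message, choosing the xpdf pdftotext.exe over the poppler one saves 2032 - 550 = 1482 kiB, and shipping only unzip.exe saves 84 - 61 = 23 kiB.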
Ok, I think I'm ready. I've solved all remaining problems, switched to xpdf and included the size differences in the commit message. |
Convert pdf, doc and docx files to text by default Signed-off-by: Johannes Schindelin <[email protected]>
Thank you! In my tests, the size difference was much more favorable, though: just 701kB. I adjusted the commit message (and I also adjusted the Thank you for your contribution! The bulk of the work was yours. |
Converting PDF and Word files to text before diffing them allows an easier comparison between changed files. This reintroduces some functionality of Git for Windows 1.x. It was requested by user @asdqwezx to reintroduce this feature in #355. This pull request is not yet ready to be merged: there are still slight issues with the doc conversion, and the pdf conversion is yet to be reintroduced. All "new" files are copied out of the old msysgit repositories, and some have been slightly altered to work with Git 2.x.