Copy text selection from PDF adds extra line breaks where they don't belong #7833

Closed
lightman76 opened this issue Nov 21, 2016 · 8 comments · Fixed by #13424

@lightman76

Link to PDF file (or attach file here):
http://research.microsoft.com/en-us/um/people/hiballan/pubs/ccs08-staledns.pdf

Configuration:

  • Web browser and its version: Chrome 54.0.2840.98 (64-bit) and Firefox 49.0.2
  • Operating system and its version: Mac OSX Sierra
  • PDF.js version: 1.6.210
  • Is an extension: no

Steps to reproduce the problem:

  1. Open the referenced PDF.
  2. Try copying text from the article (I've just tried on the first page).
  3. Paste somewhere else, and spaces are missing between most words.

What is the expected behavior? (add screenshot)
Copy the text from the PDF with spaces between words. E.g., the first line in the paper should copy as:
This paper considers DoS attacks on DNS wherein attackers flood

What went wrong? (add screenshot)
This is what is actually copied:
This
paper
consider
s DoS
attac
ks
on
DNS
wher
ein
attac
kers flood

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension): Reproduced this using the github hosted version.

I have a pull request to fix this problem in the works.
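
The fragmented output above is consistent with each positioned run of text on the page being emitted as its own line. One way a text layer could avoid that, sketched here under assumptions and not taken from the pull request mentioned above, is to group runs by baseline and insert a space only where the horizontal gap is wide enough. The `Fragment` type, the function name, and both thresholds are hypothetical; this is not PDF.js's actual data model or algorithm.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A positioned run of text. All fields are hypothetical, for illustration."""
    text: str
    x: float       # left edge of the run
    y: float       # baseline position
    width: float   # horizontal extent of the run
    height: float  # approximate font size


def fragments_to_text(fragments, line_tol=0.5, space_ratio=0.15):
    """Join fragments into lines using geometry, not one line per fragment.

    line_tol: fraction of the font height by which two baselines may differ
        and still count as the same visual line (assumed threshold).
    space_ratio: horizontal gap, as a fraction of the font height, above
        which a space is inserted between neighbours (assumed threshold).
    """
    lines = []     # finished lines
    current = ""   # text of the line being built
    prev = None
    for frag in fragments:  # assumed to arrive in reading order
        if prev is None:
            current = frag.text
        elif abs(frag.y - prev.y) <= line_tol * prev.height:
            # Same visual line: add a space only if there is a visible gap.
            gap = frag.x - (prev.x + prev.width)
            if gap > space_ratio * prev.height:
                current += " "
            current += frag.text
        else:
            # Baseline moved: this really is a new line.
            lines.append(current)
            current = frag.text
        prev = frag
    if prev is not None:
        lines.append(current)
    return "\n".join(lines)
```

With fragments such as `consider` and `s DoS` placed edge to edge on the same baseline, this yields `considers DoS` on one line instead of two. Real documents would also need handling for multiple columns, rotated text, and superscripts, which this sketch ignores.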

@aaronshaf

Changing the absolutely positioned divs to spans would help text selection.

@goodwingibbins

Just chiming in that this is not an issue with the Chrome PDF viewer but is one within Firefox. It is proving difficult for at least a few users over here!

@naturalspringwater

naturalspringwater commented Jan 26, 2019

Hello,

I read scientific literature mainly in the form of PDFs and copy-paste quotes from PDFs opened in Firefox, and I find my work greatly and surprisingly negatively affected by this issue. It's impossible to copy-paste anything without both line breaks and dashes everywhere. It's not the exception; it's virtually every scientific PDF, all the time, everywhere.

Example, a simple CTRL+C then CTRL+V into a plain text editor:

therefore, under these basal physiological conditions, 
glutamate primarily functions as an excitatory transmitter, mediat
-
ing just synaptic excitation. However, under other conditions, such 
as high-frequency stimulation of presynaptic inputs and/or uncon
-
trolled  seizure  activity  and  spastic  hypertonia,  repetitive  firing  of 
glutamatergic neurons leads to excessive accumulation of presynap
-
tically released glutamate to surpass the uptake ability of glutamate 
transporters, thereby allowing it to transiently escaping uptake and 
spill-over  to  the  adjacent  glycinergic  synapses
22–24
.  When  such  a 
heterosynaptic glutamate spill-over increases glutamate concentra
-
tions at nearby glycinergic synapses to the level of a few 
micromolar, 
only  then  will  glutamate  allosterically  potentiate  GlyR  function, 
thereby strengthening glycine-mediated synaptic inhibition at these 
synapses. The enhanced inhibition may, in turn, rapidly and efficiently 
counteract glutamate-mediated excitation to ensure tight control of 
neuronal excitability under most physiological conditions. Vice versa, 
a similar heterosynaptic spill-over feedback mechanism can also take 
place via glycine potentiation of NMDARs when glycinergic neurons 
are  firing  extensively.  Compared  with  other  levels  of  homeostastic 
regulation, such as neuronal networks
25
 or synaptic homeostasis
26,27
, 
this glutamate-glycine reciprocal receptor cross-talk mode is clearly 
much  faster.  Such  a  rapid  mode  of  homeostatic  regulation  may  be 
particularly important for some neuronal functions that require very 
tight ti

from: https://www.researchgate.net/profile/Yu_Tian_Wang/publication/46220764_Allosteric_potentiation_of_glycine_receptor_chloride_currents_by_glutamate/links/02e7e5238ea94b8cb0000000.pdf

You can see the following issues, which intertwine but can be separated:

  1. Often random line breaks in the middle of words (the absolute worst). This especially occurs in PDFs that were scanned, and does not seem to happen much or at all in PDFs that were saved electronically.
  2. Dashes inserted everywhere ("-"), sometimes even when there were already line breaks (whether or not 1 is also happening). This appears to be a FF-PDF-only "feature", but it's extremely counter-productive; instead of doing that, it should simply wrap on word breaks instead of breaking words. This affects all PDFs I read.
  3. There is no visible way to prevent inserting line breaks in paragraphs altogether, meaning I have to manually un-wrap all the text so that it can once again be read in a word-wrapped plain text editor - this is in the very best case! This affects all PDFs I read.
  4. (I forgot, remembered after scrolling past OP again): Just like OP described, I often see line breaks inappropriately between words, splitting the sentence across 5-10 lines. This happens mainly in scanned PDFs, rarely or not in electronically saved ones.

The only arguably sane behavior for most purposes is 3; however, even 3 desperately requires at least an option to not insert line breaks into paragraphs, so that you can copy text without artificial formatting.

What it should do, or at least offer as a configurable option, is this: if you copy-paste 2 consecutive paragraphs from a PDF, it should produce a total of 3 line breaks maximum: 1 or 2 line breaks after the first paragraph, and possibly another after the second, although I'm not sure about that one.
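
Until the viewer offers such an option, pasted text can be cleaned up after the fact. The following is a minimal sketch in Python, assuming that paragraphs in the pasted text are separated by at least one blank line and that a line containing only "-" marks a word that was split at a line end, as in the sample above. The function name `unwrap_pasted_text` is hypothetical, and this is a user-side workaround, not how PDF.js behaves or should be fixed.

```python
import re


def unwrap_pasted_text(text):
    """Remove line breaks inside paragraphs while keeping paragraph breaks."""
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        out = ""
        glue = False  # True right after a lone "-" line: join without a space
        for raw in para.split("\n"):
            line = raw.strip()
            if not line:
                continue
            if line == "-":
                # Assumed to be a split-word marker; the hyphen is dropped.
                glue = True
                continue
            if out and not glue:
                out += " "
            out += line
            glue = False
        # Collapse the runs of spaces that justified text leaves behind.
        cleaned.append(re.sub(r"\s+", " ", out))
    # Exactly one blank line (two line breaks) between paragraphs.
    return "\n\n".join(cleaned)
```

Note the limits: a genuinely hyphenated compound that happened to break at a line end would lose its hyphen, and superscript citation numbers that were pasted on their own lines are merely joined with spaces, not reattached to the preceding word.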

Note that Chromium only suffers from issue 3 for the most part, but due to my setup and plugins I find myself relying on Firefox, so this is quite a major setback and I physically waste time due to this.

All versions of Firefox I've tried recently are affected by this (60-64), and I use it exclusively under Linux.

Thank you

@timvandermeij
Contributor

Pull request #7834 is a first attempt to resolve this, however it was never completely finished. It can be used as a baseline for a follow-up patch.

@burtonator

Would love to see this fixed. It's really impacting our users (https://getpolarized.io/), who edit a lot of scientific PDFs.

@shikhaidsil

Facing the same issue. Mozilla shows extra newlines, while Chrome removes even the ones that are visible. Is there any solution, please?

@themoonisacheese

Can confirm that this is an issue with the generation of the pdf and not pdf.js. We have the same problem regardless of the viewer used.

@calixteman
Contributor

Patch #13257 mostly fixes the issue, but there is still a problem at the bottom of page 2 (note 1).
