PhpOffice\PhpSpreadsheet\Reader\Html::load not working for html files smaller than 2048 bytes. #194

Closed
1 of 3 tasks
victortodoran opened this issue Aug 2, 2017 · 5 comments

Comments

@victortodoran

This is:

What is the expected behavior?

The bug is in PhpOffice\PhpSpreadsheet\Reader\Html::canRead($filename).
It should return true when given the path of a valid HTML file.

What is the current behavior?

If the file size is smaller than 2048 bytes (the value of self::TEST_SAMPLE_SIZE), the function
always returns false, because inside Html::readEnding() the call
fread($this->fileHandle, $blockSize) reads nothing.

I suspect this happens because Html::readBegining() has already read up to EOF and
fseek($this->fileHandle, $size - $blockSize); does not behave as expected when $size is smaller than $blockSize.

Because of this issue I'm unable to load any HTML file smaller than 2048 bytes.
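
The suspected mechanism can be reproduced outside the library. The following is a minimal sketch, not the library's code: SAMPLE_SIZE is a hypothetical stand-in for self::TEST_SAMPLE_SIZE, and the two reads only imitate what the reader is described as doing above.

```php
<?php

// Hypothetical stand-in for self::TEST_SAMPLE_SIZE.
const SAMPLE_SIZE = 2048;

// Write a small HTML snippet (well under 2048 bytes) to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, '<table><tr><td>1</td></tr></table>');

$handle = fopen($path, 'rb');

// Imitate reading the beginning: the whole file is consumed,
// so the file pointer now sits at EOF.
$beginning = fread($handle, SAMPLE_SIZE);

// Imitate reading the ending: for a small file the offset is negative,
// the seek fails, and the pointer stays at EOF.
$size = filesize($path);
$seekResult = fseek($handle, $size - SAMPLE_SIZE);
$ending = fread($handle, SAMPLE_SIZE);

fclose($handle);
unlink($path);

var_dump($seekResult); // fseek() reports failure for the negative offset
var_dump($ending);     // nothing was read for the "ending" sample
```

If the ending sample comes back empty, any check that looks for a closing ">" in it cannot succeed, which would match the behaviour described here.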

What are the steps to reproduce?

Please provide a Minimal, Complete, and Verifiable example of code that exhibits the issue without relying on an external Excel file or a web server:

<?php

require __DIR__ . '/vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\Reader\Html;

$reader = new Html();
$spreadsheet = $reader->load("test.html"); // where test.html filesize < 2048


Which versions of PhpSpreadsheet and PHP are affected?

Happened in:
PhpSpreadsheet latest (installed with composer require)
PHP 7.0.20
@PowerKiKi
Member

Do you have HTML files with actual valid data that are smaller than 2048 bytes? If so, what sizes are they?

Would you have any suggestions on how to solve this?

@victortodoran
Author

It's been a while, so bear with me.

"Do you have HTML files with actual valid data that are smaller than 2048 bytes?"

The error occurred in the following context:
I was parsing large HTML files table by table.
I was feeding the service one table at a time.
So we are not talking about whole files but about input fragments.

"If so, what sizes are they?"
The size is dynamic. In my opinion, it should not be a factor.

"Would you have any suggestions on how to solve this?"

I'm 99% sure that all the validation did was check whether the file starts with "<" and ends with ">".
We can both agree that this does not really qualify as validation of the HTML format.
I cannot figure out what value that validation adds or what problem it solves.
In my opinion, you can drop the validation function entirely.

How I remember fixing the problem:

I ended up extending the provided class.
I dropped the provided validation entirely.
I did a validation based on try/catch instead.
After playing around with the input, I noticed that the provided service threw exceptions for certain inputs.
I cannot guarantee that the suggested fix works, as I no longer have access to the code itself, and unfortunately I cannot afford the hours to investigate this now.
Next time I raise an issue, I'll leave a possible fix as well.
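
The try/catch approach described above could look roughly like the following. This is only a sketch of the idea, not the code that was actually used: canLoad() and the stand-in $loader are hypothetical names introduced here for illustration.

```php
<?php

/**
 * Decide whether an input is loadable by simply attempting the load.
 * $loader is a hypothetical callable standing in for the reader's load method.
 */
function canLoad(callable $loader, string $filename): bool
{
    try {
        $loader($filename);

        return true;
    } catch (\Throwable $e) {
        // Any exception or error raised while loading marks the input as unreadable.
        return false;
    }
}

// Stand-in loader for demonstration: rejects missing or empty files.
$loader = function (string $filename): string {
    $contents = @file_get_contents($filename);
    if ($contents === false || trim($contents) === '') {
        throw new \RuntimeException('Nothing to load from ' . $filename);
    }

    return $contents;
};

$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, '<table><tr><td>1</td></tr></table>');

$readable = canLoad($loader, $path);
$missing = canLoad($loader, $path . '.does-not-exist');

unlink($path);
```

With the real reader, the callable would presumably be [$reader, 'load']. The trade-off is that the input is parsed once just to validate it.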

Thank you.

@kifni41
Contributor

kifni41 commented Oct 19, 2017

Hi,
I am also experiencing a similar problem.

I'm using it to convert HTML files containing a table to an Xlsx file.

I got an "Invalid HTML file." exception. I checked the HTML file with several validators, and all of them say it is valid.

At first I didn't suspect it was related to the size, but after debugging I noticed the constant TEST_SAMPLE_SIZE in Reader/Html.php

`
private function readEnding()
{
    $meta = stream_get_meta_data($this->fileHandle);
    $filename = $meta['uri'];

    $size = filesize($filename);
    $blockSize = self::TEST_SAMPLE_SIZE;

    fseek($this->fileHandle, $size - $blockSize);

    return fread($this->fileHandle, $blockSize);
}
`

I'm not sure why TEST_SAMPLE_SIZE needs to be 2048; maybe there is a reason for that.

But in my case, since the HTML I read is generated from data, it can be as small as 1 KB.
So I'm thinking maybe 512 would be a better value.

But maybe this would affect reader performance; I haven't thought about the performance effect.
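
An alternative to lowering the constant is to clamp the block size to the file size, so the seek offset never goes negative. A minimal sketch under that assumption follows; readEndingClamped() is a hypothetical standalone helper, not the library's method and not necessarily what any patch does.

```php
<?php

/**
 * Read up to $blockSize bytes from the end of a file.
 * Hypothetical helper for illustration only.
 */
function readEndingClamped(string $path, int $blockSize = 2048): string
{
    // filesize() results are cached, so refresh them for this path first.
    clearstatcache(true, $path);
    $size = filesize($path);
    if ($size === false || $size === 0 || $blockSize < 1) {
        return '';
    }

    // Never ask for more bytes than the file holds, so the offset stays >= 0.
    $blockSize = min($blockSize, $size);

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return '';
    }

    fseek($handle, $size - $blockSize);
    $data = fread($handle, $blockSize);
    fclose($handle);

    return $data === false ? '' : $data;
}

// Usage: a file much smaller than 2048 bytes still yields its ending.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, '<table><tr><td>1</td></tr></table>');
$ending = readEndingClamped($path);
unlink($path);
```

With clamping, the sample size only caps how much is read, so files of any size are handled the same way.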

Regards,

@alpha-and-omega

I fixed this a long time ago in my local fork.
Patch:
https://gist.github.com/alpha-and-omega/c5e92fcc9f551bd6d312ce7dfbeae4a2#file-0001-fixes-fseek-bug-patch

@PowerKiKi
Member

Thank you @alpha-and-omega for your patch. I was able to complete it with unit tests and merge it.

Dfred pushed a commit to Dfred/PhpSpreadsheet that referenced this issue Nov 20, 2018