ContentExtractor: Fix encoding in JSON-LD
When a page has an article body specified in JSON-LD, the HTML fragment is passed to `DOMDocument::loadHTML`.
`DOMDocument` parses an HTML document as ISO-8859-1 unless the document contains an XML encoding declaration or an HTML meta tag that sets the character set.

https://stackoverflow.com/a/39148511/160386

JSON-LD typically does not contain a `meta[charset]` tag, so we need to add it ourselves.

At this point the input should already be in UTF-8, as we convert all inputs to that encoding everywhere except in `Graby::cleanupHtml`. There, we clarify the requirement in the doc comment.
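
The effect described above can be reproduced outside Graby with a minimal sketch. The fragment and the `firstParagraphText` helper are hypothetical and exist only for illustration; the only parts taken from this commit are the `DOMDocument::loadHTML` call and the `<meta charset="utf-8">` prefix. On libxml2 builds where the default described above applies, the first comparison is expected to fail because the UTF-8 bytes are decoded as a single-byte encoding, while the second should succeed.

```php
<?php

/**
 * Parse an HTML fragment and return the text of its first <p> element.
 * Illustrative helper, not part of Graby.
 */
function firstParagraphText(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors(false);

    $paragraph = $doc->getElementsByTagName('p')->item(0);

    return null === $paragraph ? '' : $paragraph->textContent;
}

// Hypothetical stand-in for a JSON-LD articleBody, not taken from the fixture.
// "\xC3\xB3" is the UTF-8 byte sequence for "ó", so the text is "automóvil".
$expected = "autom\xC3\xB3vil";
$fragment = '<p>' . $expected . '</p>';

// No charset declaration: the parser has to fall back to its default encoding.
$withoutCharset = firstParagraphText($fragment);

// Charset declared up front, as done in this commit.
$withCharset = firstParagraphText('<meta charset="utf-8">' . $fragment);

// Report whether each result round-trips to the original UTF-8 string.
var_dump($expected === $withoutCharset);
var_dump($expected === $withCharset);
```

Writing the non-ASCII character as byte escapes keeps the sketch independent of the encoding the script file itself is saved in.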
jtojnar committed Feb 23, 2025
1 parent 38976e3 commit c07119e
Showing 4 changed files with 2,002 additions and 3 deletions.
4 changes: 2 additions & 2 deletions src/Extractor/ContentExtractor.php
@@ -1208,7 +1208,7 @@ private function extractMultipleEntityFromPattern(string $entity, string $patter
* - OpenGraph
* - JSON-LD.
*
- * @param string $html Html from the page
+ * @param string $html UTF-8-encoded HTML fragment of the page
*/
private function extractDefinedInformation(string $html): void
{
@@ -1219,7 +1219,7 @@ private function extractDefinedInformation(string $html): void
libxml_use_internal_errors(true);

$doc = new \DOMDocument();
- $doc->loadHTML($html);
+ $doc->loadHTML('<meta charset="utf-8">' . $html);

libxml_use_internal_errors(false);

2 changes: 1 addition & 1 deletion src/Graby.php
@@ -189,7 +189,7 @@ public function toggleImgNoReferrer(bool $toggle = false): void
/**
* Cleanup HTML from a DOMElement or a string.
*
- * @param string|\DOMElement $contentBlock
+ * @param string|\DOMElement $contentBlock a DOM element or UTF-8-encoded HTML fragment
*/
public function cleanupHtml($contentBlock, UriInterface $url): string
{
23 changes: 23 additions & 0 deletions tests/GrabyFunctionalTest.php
@@ -267,4 +267,27 @@ public function testWithEmptyReplaceString(): void
$this->assertNotNull($res->getSummary());
$this->assertSame(200, $res->getEffectiveResponse()->getResponse()->getStatusCode());
}

// https://github.com/j0k3r/graby/issues/359
public function testExtractDefinedInformation(): void
{
$httpMockClient = new HttpMockClient();
$httpMockClient->addResponse(new Response(200, ['Content-Type' => ['text/html; charset=UTF-8'], 'Transfer-Encoding' => ['chunked'], 'Connection' => ['keep-alive'], 'Date' => ['Sun, 23 Feb 2025 23:19:48 GMT'], 'countrycode' => ['CZ'], 'Accept-Ranges' => ['bytes'], 'X-Frame-Options' => ['SAMEORIGIN'], 'Cache-Control' => ['no-cache, private'], 'Surrogate-Control' => ['content="ESI/1.0"'], 'Vary' => ['Accept-Encoding'], 'Strict-Transport-Security' => ['max-age=31536000; includeSubDomains; preload'], 'x-clientip' => ['89.177.205.85'], 'X-Cache' => ['Miss from cloudfront'], 'Via' => ['1.1 4614c36172b2854b1e1e94af37435c8e.cloudfront.net (CloudFront)'], 'X-Amz-Cf-Pop' => ['PRG50-C1'], 'X-Amz-Cf-Id' => ['SX6OzC1Nese_say0csdmfmPKp4ez2utzI3ePYATt_MZuJri_HJ1mEA==']], (string) file_get_contents(__DIR__ . '/fixtures/content/https___www.xataka.com_movilidad_coches-vendidos-2023-2024-espana.html')));
$graby = new Graby([
'debug' => true,
'extractor' => [
'config_builder' => [
'site_config' => [__DIR__ . '/fixtures/site_config'],
],
],
], $httpMockClient);
$res = $graby->fetchContent('https://www.xataka.com/movilidad/coches-vendidos-2023-2024-espana');

$this->assertNotNull($res->getSummary());
$this->assertStringContainsString(
'automóvil',
$res->getHtml(),
'JSON-LD processing must use UTF-8'
);
}
}