Chars decoding error (gb2312, UTF-8) in Readability. #2435

PrinOrange · 2025-01-03T16:02:05Z

Describe the bug

Some websites serve their original text in GB2312 rather than UTF-8. When these articles are opened in Readability, the text is decoded incorrectly and appears garbled.

For example, the article above comes from this page: http://www.pacilution.com/ShowArticle.asp?ArticleID=14866

As can be seen, the page's text is encoded as GB2312.

Possible solution: when Readability reads the HTML of the original page, first convert it to a single encoding with iconv, then parse it.
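
To make the idea concrete, here is a minimal sketch of "normalize the encoding, then parse". It is not the project's implementation: sniffCharset and decodeHtml are hypothetical helper names, and it uses the TextDecoder built into Node.js rather than iconv, which assumes a Node build with full ICU so that labels such as gb2312 are recognized.

```javascript
// Sketch only: sniffCharset and decodeHtml are hypothetical helpers,
// not part of Readability or of this project.

function sniffCharset(bytes, contentType = '') {
  // 1. Prefer the charset declared in the Content-Type header, if any.
  const fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType);
  if (fromHeader) return fromHeader[1].toLowerCase();

  // 2. Otherwise scan the first 1024 bytes for <meta charset=...> or
  //    <meta http-equiv="Content-Type" content="...; charset=...">.
  //    latin1 maps every byte to one character, so this never fails.
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  const fromMeta = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head);
  if (fromMeta) return fromMeta[1].toLowerCase();

  // 3. Nothing declared: assume UTF-8.
  return 'utf-8';
}

function decodeHtml(bytes, contentType = '') {
  const charset = sniffCharset(bytes, contentType);
  try {
    return new TextDecoder(charset).decode(bytes);
  } catch {
    // Unknown charset label: decode as UTF-8 instead of failing.
    return new TextDecoder('utf-8').decode(bytes);
  }
}
```

The resulting string can then be handed to the HTML parser, so everything downstream only ever sees one encoding.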

Feed Info

https://rsshub.terminels.com/pacilution/latest?code=fde34cc7707b88c3938c652bb2c018db

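(This is a route I wrote myself on a self-hosted RSSHub instance. The Pull Request for this route is still under review and has not yet been merged into the official RSSHub.)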

Reproduction Video

No response

Environment

No response

Validations

Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
Check that this is a concrete bug. For Q&A, please open a GitHub Discussion instead.
This issue is valid

Contributions

I am willing to submit a PR to fix this issue
I am willing to submit a PR with failing tests (actually just go ahead and do it, thanks!)


linear · 2025-01-03T16:02:09Z

FOL-1393 Chars decoding error in Readability.

PrinOrange · 2025-01-03T16:22:46Z

In Node.js, you can use the chardet library to detect the encoding format of text.

Install the chardet library:

npm install chardet

Use chardet to detect the encoding:
The following code detects the encoding of the garbled text:

const chardet = require('chardet');

// Assume your garbled text is like this
const text = Buffer.from([0xE6, 0x96, 0x87, 0xE6, 0x9C, 0xAC]); // Sample data

const detectedEncoding = chardet.detect(text);
console.log(`Detected encoding: ${detectedEncoding}`);

In this example, replace text with the bytes of your garbled text. You can build the buffer with Buffer.from or read the file content directly.
Note that chardet is not always accurate, since some encodings look alike on short or ambiguous input, but it usually gives a reasonable guess.
Result:
If chardet detects an encoding, you can use it to decode the text. For example, if UTF-8 is detected, calling toString('utf-8') on the buffer decodes it correctly. Note that Buffer's toString only supports Node's built-in encodings (such as utf-8, utf16le, and latin1), so a result like GB2312 needs a separate decoder such as iconv or, in Node builds with full ICU, TextDecoder.

const correctEncoding = 'utf-8'; // Use the detected encoding here
const decodedText = text.toString(correctEncoding);
console.log(`Decoded text: ${decodedText}`);

However, chardet relies on heuristics, so in some edge cases the result may be wrong and should be confirmed against other signals, such as the charset declared by the page.
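
One way to cross-check a guess without any extra dependency is to try a strict UTF-8 decode first and only fall back to another charset when that fails. The sketch below illustrates this; decodeWithFallback is a hypothetical helper name, the gbk default is an assumption chosen for this issue, and it again assumes a Node build with full ICU.

```javascript
// Sketch only: decodeWithFallback is a hypothetical helper.
// fallbackCharset must be a label that TextDecoder knows,
// otherwise the constructor throws a RangeError.
function decodeWithFallback(bytes, fallbackCharset = 'gbk') {
  try {
    // With fatal: true, decode() throws a TypeError on invalid UTF-8
    // instead of silently inserting U+FFFD replacement characters.
    const text = new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return { encoding: 'utf-8', text };
  } catch {
    const text = new TextDecoder(fallbackCharset).decode(bytes);
    return { encoding: fallbackCharset, text };
  }
}

// The sample bytes from the snippet above are valid UTF-8.
const utf8Sample = Buffer.from([0xE6, 0x96, 0x87, 0xE6, 0x9C, 0xAC]);
console.log(decodeWithFallback(utf8Sample).encoding);

// These bytes are not valid UTF-8, so the fallback charset is used.
const gbkSample = Buffer.from([0xD6, 0xD0, 0xCE, 0xC4]);
console.log(decodeWithFallback(gbkSample).encoding);
```

This check is still a heuristic: a short GBK byte sequence can happen to be valid UTF-8, so it works best combined with the declared charset or a detector such as chardet.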


PrinOrange · 2025-01-04T14:50:59Z

Related PR: #2449
Another related issue: #290 (comment)


PrinOrange changed the title ~~Chars decoding error in Readability.~~ Chars decoding error (gb2312, UTF-8) in Readability. Jan 3, 2025

PrinOrange mentioned this issue Jan 4, 2025

fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in readability (issue #2435) #2449

Merged

5 tasks

hyoban pushed a commit that referenced this issue Jan 6, 2025

fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in rea…

1de1cd9

…dability (issue #2435) (#2449)

hyoban closed this as completed in #2449 Jan 6, 2025
