Chars decoding error (gb2312, UTF-8) in Readability. #2435

Closed
4 of 5 tasks
PrinOrange opened this issue Jan 3, 2025 · 3 comments · Fixed by #2449

Comments

@PrinOrange
Contributor

PrinOrange commented Jan 3, 2025

Describe the bug

The original text of some websites is encoded in GB2312 rather than UTF-8.
When Readability renders these articles, the bytes are decoded with the wrong charset and the text comes out garbled.

Image

For example, the original page of the article shown above is http://www.pacilution.com/ShowArticle.asp?ArticleID=14866

Image

As the screenshot shows, the page declares its charset as gb2312.

Possible solution: when Readability reads the HTML of the original page, it could first convert the bytes to a single encoding (for example with iconv) and only then parse the result.
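A minimal sketch of that idea, using only Node.js built-ins instead of iconv. The helpers `sniffCharset` and `decodeHtml` are hypothetical names made up for illustration, not part of Readability, and the sketch assumes a Node.js build with full ICU so that `TextDecoder` accepts the `gb2312` label:

```javascript
// Hypothetical helpers for illustration only; not Readability's actual code.

// Find the charset a page declares, checking the HTTP header first.
function sniffCharset(bytes, contentTypeHeader = "") {
  // 1. Prefer the charset parameter of the Content-Type header, if any.
  const fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentTypeHeader);
  if (fromHeader) return fromHeader[1].toLowerCase();

  // 2. Otherwise scan the first 1024 bytes for a <meta ... charset=...> tag.
  //    latin1 maps every byte to one character, so ASCII markup stays readable.
  const head = Buffer.from(bytes).subarray(0, 1024).toString("latin1");
  const fromMeta = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(head);
  if (fromMeta) return fromMeta[1].toLowerCase();

  // 3. Nothing declared: assume UTF-8.
  return "utf-8";
}

// Decode raw HTML bytes into a JavaScript string before parsing.
function decodeHtml(bytes, contentTypeHeader = "") {
  const charset = sniffCharset(bytes, contentTypeHeader);
  try {
    return new TextDecoder(charset).decode(bytes);
  } catch {
    // TextDecoder throws a RangeError for labels it does not know.
    return new TextDecoder("utf-8").decode(bytes);
  }
}
```

The decoded string can then be handed to the HTML parser, so every later step works on ordinary JavaScript strings regardless of the source encoding.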

Feed Info

https://rsshub.terminels.com/pacilution/latest?code=fde34cc7707b88c3938c652bb2c018db

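(This is a route I wrote myself on a self-hosted RSSHub instance. The pull request for this route is still under review and has not yet been included in the official RSSHub.)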

Reproduction Video

No response

Environment

No response

Validations

  • Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • Check that this is a concrete bug. For Q&A, please open a GitHub Discussion instead.
  • This issue is valid

Contributions

  • I am willing to submit a PR to fix this issue
  • I am willing to submit a PR with failing tests (actually just go ahead and do it, thanks!)

linear bot commented Jan 3, 2025

@PrinOrange PrinOrange changed the title Chars decoding error in Readability. Chars decoding error (gb2312, UTF-8) in Readability. Jan 3, 2025
@PrinOrange
Contributor Author

In Node.js, you can use the chardet library to detect the encoding format of text.

Install the chardet library:

npm install chardet

Use chardet to detect the encoding:
The following code detects the encoding of the bytes behind your garbled text:

const chardet = require('chardet');

// Raw bytes whose encoding is unknown (sample data: the UTF-8 bytes of "文本")
const text = Buffer.from([0xE6, 0x96, 0x87, 0xE6, 0x9C, 0xAC]);

const detectedEncoding = chardet.detect(text);
console.log(`Detected encoding: ${detectedEncoding}`);

In this example, replace text with the raw bytes of the garbled content. You can build the byte array with Buffer.from or read the file content directly.
Note that chardet relies on heuristics and is not always accurate, because some encodings look alike on short inputs, but it usually gives a reasonable guess.
Result:
If chardet detects an encoding, you can use it to decode the text. For example, if UTF-8 is detected, calling toString('utf-8') on the buffer decodes it correctly. Buffer's toString only supports a small set of encodings (UTF-8, UTF-16LE, Latin-1 and a few others), so a result such as GB2312 needs a different decoder, for example TextDecoder or iconv-lite.

const correctEncoding = 'utf-8'; // Use the detected encoding here (it must be one that Buffer supports)
const decodedText = text.toString(correctEncoding);
console.log(`Decoded text: ${decodedText}`);

In edge cases the detection result may still be wrong, so it may need to be combined with another check, such as the charset the page itself declares, to confirm the encoding.
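One way to cross-check a guessed encoding is to try a strict UTF-8 decode first and fall back only when the bytes are not valid UTF-8. The sketch below uses the built-in `TextDecoder`; the function name `decodeWithFallback` and its default of `gb2312` are assumptions for illustration, and it assumes a Node.js build with full ICU so that the `gb2312` label is accepted:

```javascript
// Hypothetical helper for illustration only.
// Tries strict UTF-8 first, then falls back to another charset
// (for example one reported by a detector or declared by the page).
function decodeWithFallback(bytes, fallbackCharset = "gb2312") {
  try {
    // fatal: true makes decode() throw a TypeError on invalid UTF-8
    // instead of silently inserting U+FFFD replacement characters.
    const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return { text, charset: "utf-8" };
  } catch {
    const text = new TextDecoder(fallbackCharset).decode(bytes);
    return { text, charset: fallbackCharset };
  }
}

// The sample bytes from the comment above are valid UTF-8.
const utf8Sample = Buffer.from([0xe6, 0x96, 0x87, 0xe6, 0x9c, 0xac]);
// GB2312 bytes are usually not valid UTF-8, so they take the fallback path.
const gbSample = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

console.log(decodeWithFallback(utf8Sample));
console.log(decodeWithFallback(gbSample));
```

Strict UTF-8 validation rarely accepts text in a legacy multi-byte encoding by accident, which makes it a useful first check before trusting a heuristic detector.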


@PrinOrange
Contributor Author

PrinOrange commented Jan 4, 2025

Related PR: #2449
Another related issue: #290 (comment)
