Chars decoding error (gb2312, UTF-8) in Readability. #2435
Comments
In Node.js, you can use the chardet library to detect the encoding of a piece of text.
Install the chardet library:
npm install chardet
Use chardet to detect the encoding:
const chardet = require('chardet');
// Assume these are the raw bytes of your garbled text
const text = Buffer.from([0xE6, 0x96, 0x87, 0xE6, 0x9C, 0xAC]); // Sample data
const detectedEncoding = chardet.detect(text);
console.log(`Detected encoding: ${detectedEncoding}`);
In this example, replace text with the bytes of the garbled text. You can build them with Buffer.from or read the file content directly to get the byte array.
Then decode the bytes with the detected encoding:
const correctEncoding = 'utf-8'; // Use the detected encoding here
const decodedText = text.toString(correctEncoding);
console.log(`Decoded text: ${decodedText}`);
Note that Buffer's toString only supports Node's built-in encodings (utf-8, latin1, and a few others), so bytes detected as gb2312 need a converter such as iconv-lite instead.
However, chardet uses heuristics to detect encodings, so in some edge cases the result may be inaccurate and should be combined with other checks to confirm the encoding.
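One way to cross-check a heuristic guess is a strict UTF-8 validation pass. The snippet below is a minimal sketch, not part of chardet or Readability: `isValidUtf8` is a hypothetical helper name, and it relies only on the `TextDecoder` global that ships with Node.js.

```javascript
// Hypothetical helper: returns true only if the bytes form well-formed UTF-8.
// With { fatal: true }, TextDecoder throws on malformed input instead of
// silently substituting U+FFFD replacement characters.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// The sample bytes from the comment above are well-formed UTF-8.
const utf8Sample = Buffer.from([0xE6, 0x96, 0x87, 0xE6, 0x9C, 0xAC]);
// GB2312-encoded Chinese text usually fails strict UTF-8 validation.
const gbSample = Buffer.from([0xD6, 0xD0, 0xCE, 0xC4]);

console.log(isValidUtf8(utf8Sample), isValidUtf8(gbSample));
```

If the bytes fail this check, the page is not UTF-8, and a detector result such as GB2312 or GBK becomes more plausible.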
Related PR: #2449
Describe the bug
The original text of some websites is not UTF-8 encoded but gb2312. When Readability reads these articles, decoding errors occur.
For example, the original page of the article above is http://www.pacilution.com/ShowArticle.asp?ArticleID=14866
We can see that the charset of this page is gb2312.
Possible solution: when Readability reads the HTML of the original page, first convert it to a single uniform encoding with iconv, and then parse it.
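The proposed "convert first, then parse" step could look roughly like the sketch below. It is an illustration under assumptions, not Readability's actual code: `sniffCharset` and `decodeHtml` are hypothetical names, and it swaps iconv for Node's built-in `TextDecoder` to stay dependency-free. That swap assumes a Node.js build with full ICU (the default for official releases), where the label `gb2312` is accepted as an alias of GBK.

```javascript
// Hypothetical helper: choose a charset from the Content-Type header,
// falling back to a <meta> declaration near the top of the document.
function sniffCharset(bytes, contentType = '') {
  const fromHeader = /charset=["']?([\w-]+)/i.exec(contentType);
  if (fromHeader) return fromHeader[1].toLowerCase();
  // latin1 maps every byte to one character, so ASCII markup stays readable
  // regardless of the real encoding of the page.
  const head = Buffer.from(bytes.subarray(0, 1024)).toString('latin1');
  const fromMeta = /<meta[^>]+charset=["']?([\w-]+)/i.exec(head);
  return fromMeta ? fromMeta[1].toLowerCase() : 'utf-8';
}

// Hypothetical helper: decode raw response bytes into a JavaScript string
// before the HTML is handed to the parser.
function decodeHtml(bytes, contentType = '') {
  const charset = sniffCharset(bytes, contentType);
  return new TextDecoder(charset).decode(bytes);
}

// Usage: a tiny GB2312 page whose body contains the bytes D6 D0 CE C4.
const page = Buffer.concat([
  Buffer.from('<meta charset="gb2312"><p>', 'latin1'),
  Buffer.from([0xD6, 0xD0, 0xCE, 0xC4]),
  Buffer.from('</p>', 'latin1'),
]);
console.log(decodeHtml(page));
```

Decoding before parsing keeps the parser itself encoding-agnostic. A real implementation would also need to handle unknown charset labels, which make `TextDecoder` throw a `RangeError`.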
Feed Info
https://rsshub.terminels.com/pacilution/latest?code=fde34cc7707b88c3938c652bb2c018db
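(This is a route I wrote myself on a self-hosted RSSHub instance. The Pull Request for this route is still under review and has not yet been merged into the official RSSHub.)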
Reproduction Video
No response
Environment
No response
Validations
Contributions