Regression Bug: Cheerio 0.10.4 -> 0.10.5 Worse in 0.10.7 #167
Comments
Any thoughts on the issue? Thanks in advance.
Um, it would be nice if you could strip your example down to the essentials; it's a bit overwhelming. Also, it would be nice if you could test this on 0.10.6, as that helps to pin this down.
I have what I think may be the same problem. Here is a small demo.
var html = '<div><a>A</a></div><div><a>B</a></div>';
var $ = require('cheerio').load(html);
var firstDiv = $('div').first();
console.log($('a', firstDiv).text());
With cheerio 0.10.4, this prints "A". With 0.10.5 and later, it prints "AB".
@ssmout has a very basic example, but here is mine, trimmed down even further.
google-cleaned.html:
<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage">
<body marginheight="0" topmargin="0" bgcolor="#FFFFFF" marginwidth="0">
<table border="0" cellpadding="0" cellspacing="0" id="mn" style="position:relative">
<tr>
<td valign="top">
<div id="center_col">
<div id="res">
<div id="search">
<div id="ires">
<ol>
<li class="g">
<h3 class="r"><a href=
"/url?q=http://www.microsoft.com/&sa=U&ei=hlMyUdvfEuyDyAG77oBY&ved=0CCQQFjAA&usg=AFQjCNFEx3qGWnPgXHzsueeYnZMZah21aA">
<b>Microsoft</b> Home Page | Devices and
Services</a></h3>
</li>
<li class="g">
<h3 class="r"><a href=
"/url?q=http://en.wikipedia.org/wiki/Microsoft&sa=U&ei=hlMyUdvfEuyDyAG77oBY&ved=0CDkQFjAJ&usg=AFQjCNFBuLjqmEIvZT7UGV1GyjoBYhjxAA">
<b>Microsoft</b> - Wikipedia, the free
encyclopedia</a></h3>
</li>
<li class="g">
<h3 class="r"><a href=
"/url?q=http://www.microsoftstore.com/&sa=U&ei=hlMyUdvfEuyDyAG77oBY&ved=0CD0QFjAK&usg=AFQjCNFNkCILwb6dEQsHKr00KhrD5g4-mA">
<b>Microsoft</b> Store Online -
Welcome</a></h3>
</li>
</ol>
</div>
</div>
</div>
</div>
</td>
</tr>
</table>
</body>
</html>
test.js:
var fs = require('fs')
, cheerio = require('cheerio')
, qs = require('querystring')
var linkSel = 'h3.r a'
, itemSel = 'li.g'
var html = fs.readFileSync('./google-cleaned.html', 'utf8');
var $ = cheerio.load(html);
var links = [];
$(itemSel).each(function(i, elem) {
var linkElem = $(elem).children(linkSel).first()
, item = {title: $(linkElem).text(), link: null}
, qsObj = qs.parse($(linkElem).attr('href'));
if (qsObj['/url?q']) {
item.link = qsObj['/url?q']
}
links.push(item);
})
console.dir(links)
produces:
Cheerio v0.10.4: (Expected Result)
[ { title: '\n Microsoft Home Page | Devices and\n Services',
link: 'http://www.microsoft.com/' },
{ title: '\n Microsoft - Wikipedia, the free\n encyclopedia',
link: 'http://en.wikipedia.org/wiki/Microsoft' },
{ title: '\n Microsoft Store Online -\n Welcome',
link: 'http://www.microsoftstore.com/' } ]
Cheerio v0.10.5:
[ { title: '\n Microsoft Home Page | Devices and\n Services',
link: null },
{ title: '\n Microsoft - Wikipedia, the free\n encyclopedia',
link: null },
{ title: '\n Microsoft Store Online -\n Welcome',
link: null } ]
Cheerio v0.10.6:
[ { title: '\n Microsoft Home Page | Devices and\n Services',
link: null },
{ title: '\n Microsoft - Wikipedia, the free\n encyclopedia',
link: null },
{ title: '\n Microsoft Store Online -\n Welcome',
link: null } ]
Cheerio v0.10.7:
[ { title: '', link: null },
{ title: '', link: null },
{ title: '', link: null } ]
Hopefully this helps some?
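A side note on the link extraction in test.js: qs.parse is handed the whole href, so the path and the question mark end up inside the first key, which is why the script looks up '/url?q'. The sketch below contrasts that with parsing the href as a URL first. It assumes a Node.js version that ships the global URL class (much newer than the 0.8.16 used in this report), and the base 'http://www.google.com' is a placeholder needed only because the href is relative.

```javascript
// Sketch: two ways to pull the target link out of a Google redirect href.
const qs = require('querystring');

// Shortened copy of the first href in google-cleaned.html.
const href = '/url?q=http://www.microsoft.com/&sa=U&ved=0CCQQFjAA';

// Approach used in test.js: parse the raw href as a query string.
// Everything before the first "=" becomes the key, path included.
const legacy = qs.parse(href);
console.log(legacy['/url?q']); // 'http://www.microsoft.com/'

// Alternative: resolve the href against a placeholder base, then read "q".
// The URL parser separates path and query, so the key is plain "q".
const parsed = new URL(href, 'http://www.google.com');
const link = parsed.searchParams.get('q');
console.log(link); // 'http://www.microsoft.com/'
```

Either approach yields the same link here; the second simply does not depend on the exact path that precedes the query string.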
Any thoughts? Anything that I can help with?
Thanks @jprichardson for your patience. I think the issue is in your code, specifically:
var linkElem = $(elem).children(linkSel).first()
I've tested it and it appears to be working with:
var linkElem = $(elem).find(linkSel)
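For readers wondering why the swap matters: in jQuery, whose API cheerio follows, .children(selector) only considers direct children, while .find(selector) searches all descendants. In google-cleaned.html each a sits inside an h3, so it is a grandchild of li.g rather than a child. The toy model below illustrates that difference without cheerio. The plain-object tree and the helper names directChildren and findDescendants are hypothetical and are not cheerio's implementation.

```javascript
// Toy DOM: each node has a tag name and a list of child nodes.
// Mirrors one result item from google-cleaned.html: li > h3 > a.
const li = {
  tag: 'li',
  children: [
    { tag: 'h3', children: [{ tag: 'a', children: [] }] },
  ],
};

// Like .children(): inspects direct children only.
function directChildren(node, tag) {
  return node.children.filter((child) => child.tag === tag);
}

// Like .find(): walks every descendant, depth-first.
function findDescendants(node, tag) {
  const matches = [];
  for (const child of node.children) {
    if (child.tag === tag) matches.push(child);
    matches.push(...findDescendants(child, tag));
  }
  return matches;
}

console.log(directChildren(li, 'a').length);  // 0: the <a> is a grandchild
console.log(findDescendants(li, 'a').length); // 1
```

Under these semantics a direct-children lookup for the link comes back empty, which is consistent with the empty title and null link reported for 0.10.7.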
Thanks @matthewmueller! I can confirm that with your suggestion it works on Cheerio 0.10.4-0.10.7. Thanks again for taking the time to review this and to help me correct my mistake.
This is in reference to: jprichardson/node-google#3
I'm running Mac OS X 10.8.2 and Node.js 0.8.16.
I have distilled a code sample that reproduces the bug, using trimmed-down HTML from a Google search as input.
If you install cheerio@0.10.4, it works as expected. That is, you see data in most of the title, link, and description fields.
cheerio-bug.js: (distilled snippet from node-google)
Input...
google-cleaned.html:
If you run this with cheerio@0.10.5, it partially works: you'll see title and href and link are null.
With cheerio@0.10.7, all fields are null.