Regression Bug: Cheerio 0.10.4 -> 0.10.5 Worse in 0.10.7 #167

Closed
jprichardson opened this issue Mar 2, 2013 · 7 comments

@jprichardson

This is in reference to: jprichardson/node-google#3

I'm running Mac OS X 10.8.2 and Node.js 0.8.16.

I have distilled a code sample that reproduces the bug, along with a trimmed-down copy of a Google search results page as input.

If you install cheerio@0.10.4, it works as expected. That is, you see data in most of the title, link, and description fields.

cheerio-bug.js: (distilled snippet from node-google)

var fs = require('fs')
  , cheerio = require('cheerio')
  , qs = require('querystring')

var linkSel = 'h3.r a'
  , itemSel = 'li.g'

var html = fs.readFileSync('./google-cleaned.html', 'utf8');

var $ = cheerio.load(html);
var links = [];

$(itemSel).each(function(i, elem) {
  var linkElem = $(elem).children(linkSel).first()
    , item = {title: $(linkElem).text(), link: null, href: null}
    , qsObj = qs.parse($(linkElem).attr('href'));

  if (qsObj['/url?q']) {
    item.link = qsObj['/url?q']
    item.href = item.link
  }

  links.push(item);
})

console.dir(links)
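A note on the `qsObj['/url?q']` lookup in the snippet above: it relies on how Node's `querystring.parse()` treats a full href rather than a bare query string. The sketch below shows that behavior on a shortened copy of the first href in the input. It assumes the `&amp;` entities have already been decoded to `&`, and the `query` / `targetFromQuery` variant is a hypothetical alternative, not something node-google does.

```javascript
var qs = require('querystring')

// Shortened copy of the first result's href from google-cleaned.html.
// Assumption: entities are already decoded (&amp; -> &).
var href = '/url?q=http://www.microsoft.com/&sa=U&ei=hlMyUdvfEuyDyAG77oBY'

// querystring.parse() expects only the part after '?'. Given the whole
// href, it splits on '&' and then on the first '=' of each pair, so the
// first key ends up being the literal string '/url?q'.
var parsed = qs.parse(href)
var target = parsed['/url?q']

// Hypothetical alternative: drop everything up to '?' first, then read 'q'.
var query = href.slice(href.indexOf('?') + 1)
var targetFromQuery = qs.parse(query).q

console.log(target)
console.log(targetFromQuery)
```

Both lookups yield the same target URL. The second form simply avoids depending on the path being part of the key.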

Input...

google-cleaned.html:

<!DOCTYPE html>

<html itemscope="itemscope" itemtype="http://schema.org/WebPage">

<body marginheight="0" topmargin="0" bgcolor="#FFFFFF" marginwidth=
"0">


  <table border="0" cellpadding="0" cellspacing="0" id="mn" style="position:relative">
    <tr>
      <td valign="top">
        <div id="center_col">
          <div id="res">
            <div id="topstuff"></div>

            <div id="search">
              <div id="ires">
                <ol>
                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://www.microsoft.com/&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CCQQFjAA&amp;usg=AFQjCNFEx3qGWnPgXHzsueeYnZMZah21aA">
                    <b>Microsoft</b> Home Page | Devices and
                    Services</a></h3>

                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "http://maps.google.com/maps?hl=en&amp;num=10&amp;um=1&amp;ie=UTF-8&amp;q=Microsoft&amp;fb=1&amp;gl=us&amp;hq=Microsoft&amp;sa=X&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CDAQtQM">
                    <div>
                      Local business results for <b>Microsoft</b>
                      near
                    </div></a></h3>


                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://en.wikipedia.org/wiki/Microsoft&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CDkQFjAJ&amp;usg=AFQjCNFBuLjqmEIvZT7UGV1GyjoBYhjxAA">
                    <b>Microsoft</b> - Wikipedia, the free
                    encyclopedia</a></h3>

                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://www.microsoftstore.com/&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CD0QFjAK&amp;usg=AFQjCNFNkCILwb6dEQsHKr00KhrD5g4-mA">
                    <b>Microsoft</b> Store Online -
                    Welcome</a></h3>

                  </li>

                  <li class="g">
                    <div>
                      <h3 class="r"><a href=
                      "/search?q=Microsoft&amp;hl=en&amp;sa=N&amp;ie=UTF-8&amp;prmd=ivnsuzm&amp;source=univ&amp;tbm=nws&amp;tbo=u&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CEAQqAI">
                      News for <b>Microsoft</b></a></h3>


                    </div>
                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://www.google.com/finance%3Fcid%3D358464&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CEsQFjAO&amp;usg=AFQjCNGjv6csOaDQzXvudcLnlLBiJnwafg">
                    <b>Microsoft</b> Corporation: NASDAQ:MSFT
                    quotes &amp; news - Google <b>...</b></a></h3>

                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://www.xbox.com/&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CE4QFjAP&amp;usg=AFQjCNGROBu1L8dX57GonwNESw5bzqDtRg">
                    Xbox 360 - Official Site - Xbox.com</a></h3>


                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=https://twitter.com/Microsoft&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CFIQFjAQ&amp;usg=AFQjCNH2Y55AzQVQYIwWznXqiloJjr-VXA">
                    <b>Microsoft</b> (<b>Microsoft</b>) on
                    Twitter</a></h3>

                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://www.forbes.com/companies/microsoft/&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CFUQFjAR&amp;usg=AFQjCNHcladReVyJXtWiY01LnTybPQ8puQ">
                    <b>Microsoft</b> on the Forbes List</a></h3>


                  </li>
                </ol>
              </div>
            </div>
          </div>

        </div>
      </td>

      <td valign="top"></td>
    </tr>
  </table>
</body>
</html>

If you run this with cheerio@0.10.5, it partially works: title is populated, but href and link are null.

With cheerio@0.10.7, all fields are empty or null.

@jprichardson
Author

Any thoughts on the issue? Thanks in advance.

@fb55
Member

fb55 commented Mar 6, 2013

Um, it would be nice if you could strip your example down to the essentials; it's a bit overwhelming. Also, it would be nice if you could test this on 0.10.6, as that helps to pin this down.

@ssmout

ssmout commented Mar 12, 2013

I have what I think may be the same problem. Here is a small demo.

var html = '<div><a>A</a></div><div><a>B</a></div>';
var $ = require('cheerio').load(html);
var firstDiv = $('div').first();
console.log($('a', firstDiv).text());

With cheerio 0.10.4, this prints "A". With 0.10.5 and later, it prints "AB".

@jprichardson
Author

@ssmout has a very basic example.

Here is mine, trimmed down even further:

google-cleaned.html:

<!DOCTYPE html>

<html itemscope="itemscope" itemtype="http://schema.org/WebPage">
<body marginheight="0" topmargin="0" bgcolor="#FFFFFF" marginwidth="0">
  <table border="0" cellpadding="0" cellspacing="0" id="mn" style="position:relative">
    <tr>
      <td valign="top">
        <div id="center_col">
          <div id="res">
            <div id="search">
              <div id="ires">
                <ol>
                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://www.microsoft.com/&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CCQQFjAA&amp;usg=AFQjCNFEx3qGWnPgXHzsueeYnZMZah21aA">
                    <b>Microsoft</b> Home Page | Devices and
                    Services</a></h3>
                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://en.wikipedia.org/wiki/Microsoft&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CDkQFjAJ&amp;usg=AFQjCNFBuLjqmEIvZT7UGV1GyjoBYhjxAA">
                    <b>Microsoft</b> - Wikipedia, the free
                    encyclopedia</a></h3>
                  </li>

                  <li class="g">
                    <h3 class="r"><a href=
                    "/url?q=http://www.microsoftstore.com/&amp;sa=U&amp;ei=hlMyUdvfEuyDyAG77oBY&amp;ved=0CD0QFjAK&amp;usg=AFQjCNFNkCILwb6dEQsHKr00KhrD5g4-mA">
                    <b>Microsoft</b> Store Online -
                    Welcome</a></h3>
                  </li>
                </ol>
              </div>
            </div>
          </div>
        </div>
      </td>
    </tr>
  </table>
</body>
</html>

test.js:

var fs = require('fs')
  , cheerio = require('cheerio')
  , qs = require('querystring')

var linkSel = 'h3.r a'
  , itemSel = 'li.g'

var html = fs.readFileSync('./google-cleaned.html', 'utf8');

var $ = cheerio.load(html);
var links = [];

$(itemSel).each(function(i, elem) {
  var linkElem = $(elem).children(linkSel).first()
    , item = {title: $(linkElem).text(), link: null}
    , qsObj = qs.parse($(linkElem).attr('href'));

  if (qsObj['/url?q']) {
    item.link = qsObj['/url?q']
  }

  links.push(item);
})

console.dir(links)

Running node test.js produces:

Cheerio v0.10.4: (Expected Result)

[ { title: '\n                    Microsoft Home Page | Devices and\n                    Services',
    link: 'http://www.microsoft.com/' },
  { title: '\n                    Microsoft - Wikipedia, the free\n                    encyclopedia',
    link: 'http://en.wikipedia.org/wiki/Microsoft' },
  { title: '\n                    Microsoft Store Online -\n                    Welcome',
    link: 'http://www.microsoftstore.com/' } ]

Cheerio v0.10.5:

[ { title: '\n                    Microsoft Home Page | Devices and\n                    Services',
    link: null },
  { title: '\n                    Microsoft - Wikipedia, the free\n                    encyclopedia',
    link: null },
  { title: '\n                    Microsoft Store Online -\n                    Welcome',
    link: null } ]

Cheerio v0.10.6:

[ { title: '\n                    Microsoft Home Page | Devices and\n                    Services',
    link: null },
  { title: '\n                    Microsoft - Wikipedia, the free\n                    encyclopedia',
    link: null },
  { title: '\n                    Microsoft Store Online -\n                    Welcome',
    link: null } ]

Cheerio v0.10.7:

[ { title: '', link: null },
  { title: '', link: null },
  { title: '', link: null } ]

Hopefully this helps.

@jprichardson
Author

Any thoughts? Anything that I can help with?

@matthewmueller
Member

Thanks @jprichardson for your patience.

I think the issue is in your code, specifically:

var linkElem = $(elem).children(linkSel).first()

children([sel]) only filters on one level of DOM. jQuery docs say it best:

The .children() method differs from .find() in that .children() only travels a single level down the DOM tree while .find() can traverse down multiple levels to select descendant elements (grandchildren, etc.) as well

I've tested it and it appears to be working with:

var linkElem = $(elem).find(linkSel)
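To make the one-level versus many-levels distinction concrete, here is a minimal sketch in plain JavaScript. It uses a hypothetical `{ name, kids }` tree and hand-written `children` and `find` helpers that match on tag name only. It is an illustration of the idea, not cheerio's or jQuery's implementation.

```javascript
// Hypothetical minimal tree node: { name, kids }. Not cheerio's data model.
function node(name, kids) {
  return { name: name, kids: kids || [] }
}

// Like .children(): look only at the direct children of parent.
function children(parent, name) {
  return parent.kids.filter(function (kid) { return kid.name === name })
}

// Like .find(): walk every descendant of parent, depth first.
function find(parent, name) {
  var matches = []
  parent.kids.forEach(function (kid) {
    if (kid.name === name) matches.push(kid)
    matches = matches.concat(find(kid, name))
  })
  return matches
}

// Mirrors the nesting in the issue's markup: li > h3 > a
var li = node('li', [node('h3', [node('a')])])

console.log(children(li, 'a').length)  // 0: 'a' is a grandchild of li
console.log(find(li, 'a').length)      // 1: found two levels down
console.log(children(li, 'h3').length) // 1: 'h3' is a direct child
```

In the reported markup the anchor sits under an h3 (and in one result under a div as well), so only a descendant search reaches it from the li.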

@jprichardson
Author

Thanks @matthewmueller! I can confirm that with your suggestion it works on Cheerio 0.10.4-0.10.7. Thanks again for taking the time to review this and to help me correct my mistake.
