rated 0 times [  81] [ 7]  / answers: 1 / hits: 35762  / 11 Years ago, wed, may 15, 2013, 12:00:00

I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried different things with PhantomJS but am not able to gather links beyond the first page. Let's say the element at the bottom which loads content has the class .has-more-items. It is available until the final content is loaded while scrolling and then becomes unavailable in the DOM (display:none). Here are the things I have tried:




  • Setting viewportSize to a large height right after var page = require('webpage').create();




page.viewportSize = { width: 1600, height: 10000 };





  • Using page.scrollPosition = { top: 10000, left: 0 } inside page.open, which has no effect, like this:




page.open('http://example.com/?q=houston', function(status) {
    if (status === 'success') {
        page.scrollPosition = { top: 10000, left: 0 };
    }
});




  • Also tried putting it inside the page.evaluate function, but that gives:




Reference error: Can't find variable page





  • Tried using jQuery and JS code inside page.evaluate and page.open, but to no avail:




$('html, body').animate({ scrollTop: $(document).height() }, 10,
    function() {
        //console.log('check for execution');
    });




both as is and inside document.ready. Similarly for plain JS code:



window.scrollBy(0,10000)


both as is and inside window.onload.
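
The "Can't find variable page" error in the third attempt is consistent with how page.evaluate is documented to work: the function runs sandboxed inside the web page, and only JSON-serializable arguments and return values cross the boundary. The sketch below is a rough stand-in for that behavior in plain JavaScript, not PhantomJS's actual implementation; evaluateLikeSandbox and demo are hypothetical names introduced only for this illustration.

```javascript
// Rough stand-in for page.evaluate(): the function is rebuilt from its source
// text, so it loses access to variables from the scope where it was written.
// (Hypothetical helper for illustration; not part of PhantomJS.)
function evaluateLikeSandbox(fn) {
  var args = Array.prototype.slice.call(arguments, 1);
  var rebuilt = new Function(
    'return (' + fn.toString() + ').apply(null, arguments);'
  );
  // Arguments are copied through JSON, as only plain data can cross over.
  return rebuilt.apply(null, JSON.parse(JSON.stringify(args)));
}

function demo() {
  // Stands in for the outer PhantomJS `page` object.
  var page = { scrollPosition: { top: 0, left: 0 } };
  try {
    // `page` is an outer variable, so the rebuilt function cannot see it.
    evaluateLikeSandbox(function () { return page.scrollPosition; });
    return 'no error';
  } catch (e) {
    return e.name;
  }
}

// Passing plain values as extra arguments works instead.
var shifted = evaluateLikeSandbox(function (step) { return step + 1; }, 1000);
console.log(demo(), shifted);
```

In a real script this suggests setting page.scrollPosition outside page.evaluate, and using only window and document inside it.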



I have been really stuck on this for 2 days now and am not able to find a way. Any help or hint would be appreciated.



Update



I have found a helpful piece of code at https://groups.google.com/forum/?fromgroups=#!topic/phantomjs/8LrWRW8ZrA0



var hitRockBottom = false;
while (!hitRockBottom) {
    // Scroll the page (not sure if this is the best way to do so...)
    page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };

    // Check if we've hit the bottom
    hitRockBottom = page.evaluate(function() {
        return document.querySelector('.has-more-items') === null;
    });
}


Where .has-more-items is the class of the element I want to access. It is at the bottom of the page initially; as we scroll down, it moves further down until all data is loaded and then becomes unavailable.



However, when I tested it, it was clear that it runs into an infinite loop without scrolling down (I rendered pictures to check). I have also tried replacing page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 }; with each of the lines below (one at a time):



window.document.body.scrollTop = '1000';
location.href = ".has-more-items";
page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
document.location.href = ".has-more-items";


But nothing seems to work.
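
One likely contributor to the loop never scrolling: page.scrollPosition is an object with top and left properties, so page.scrollPosition + 1000 is string concatenation rather than arithmetic. The plain JavaScript sketch below shows the difference; nextScrollPosition is a hypothetical helper name, and current stands in for the value of page.scrollPosition.

```javascript
// Build the next scroll position from the current one by adding to `top`.
// (Hypothetical helper for illustration.)
function nextScrollPosition(current, step) {
  return { top: current.top + step, left: current.left };
}

// Stand-in for the value of page.scrollPosition at the start.
var current = { top: 0, left: 0 };

// Adding a number to the whole object coerces it to a string.
var broken = current + 1000;

// Adding to the `top` property gives a usable position object.
var fixed = nextScrollPosition(current, 1000);

console.log(broken);
console.log(JSON.stringify(fixed));
```

In a PhantomJS script the corresponding assignment would be page.scrollPosition = nextScrollPosition(page.scrollPosition, 1000);.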



 Answers

Found a way to do it and tried to adapt it to your situation. I didn't test the best way of finding the bottom of the page because I had a different context, but check the solution below. The key point is that you have to wait a little for the page to load, and since JavaScript works asynchronously, you have to use setInterval or setTimeout to achieve this.


page.open('http://example.com/?q=houston', function () {

    // Check for the bottom div and scroll down from time to time
    window.setInterval(function() {
        // Check if there is a div with class="has-more-items"
        // (not sure if there's a better way of doing this)
        var count = page.content.match(/class="has-more-items"/g);

        if (count === null) { // Didn't find
            page.evaluate(function() {
                // Scroll to the bottom of page
                window.document.body.scrollTop = document.body.scrollHeight;
            });
        }
        else { // Found
            // Do what you want
            ...
            phantom.exit();
        }
    }, 500); // Number of milliseconds to wait between scrolls

});
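
A variant of the same polling idea, as a minimal sketch rather than a tested solution. It assumes, as the question describes, that .has-more-items stays visible while more content remains, so polling stops once the element is gone or hidden. The names scrollAndCheck, collectLinks, pollOnce, state and the maxTicks safety cap are hypothetical, introduced for this sketch. The PhantomJS part is guarded so the helper functions can also be loaded on their own.

```javascript
// Runs inside the web page (via page.evaluate): scroll to the bottom, then
// report whether the marker element still exists and is visible.
function scrollAndCheck(selector) {
  window.scrollTo(0, document.body.scrollHeight);
  var marker = document.querySelector(selector);
  return marker !== null && marker.offsetHeight > 0;
}

// Runs inside the web page: collect the href of every anchor element.
function collectLinks() {
  return Array.prototype.map.call(document.querySelectorAll('a'), function (a) {
    return a.href;
  });
}

// Runs in the outer PhantomJS context: one polling step.
// Returns true when polling should stop (marker gone/hidden, or cap reached).
function pollOnce(page, state) {
  state.ticks += 1;
  var hasMore = page.evaluate(scrollAndCheck, state.selector);
  return !hasMore || state.ticks >= state.maxTicks;
}

// Only wire things up when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var state = { selector: '.has-more-items', ticks: 0, maxTicks: 100 };

  page.open('http://example.com/?q=houston', function (status) {
    if (status !== 'success') {
      phantom.exit(1);
      return;
    }
    // Poll every 500 ms so the page gets time to load between scrolls.
    var timer = setInterval(function () {
      if (pollOnce(page, state)) {
        clearInterval(timer);
        console.log(page.evaluate(collectLinks).join('\n'));
        phantom.exit();
      }
    }, 500);
  });
}
```

The cap on the number of ticks is there so the script still exits if the marker never disappears; the interval and the cap are guesses and would need tuning for a real site.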

[#78224] Tuesday, May 14, 2013
dylondaytond
