Monday, May 13, 2024
Asked Tuesday, October 28, 2014

I'm trying to scrape a page that contains a reCAPTCHA widget using PhantomJS, but the page I get back has no captcha image.



If I add an iframe element to the page, the image shows up. The weirdest part is that the image only appears when the iframe has very specific content.



Here is the HTML I used to test (the standard code from the reCAPTCHA docs, plus the iframe element):



<form action="" method="post">
  <script type="text/javascript" src="http://www.google.com/recaptcha/api/challenge?k=6LfUUtMSAAAAAOBuPTWtMAnAu3l9AS-iHZb6iFpp&amp;error="></script>
  <noscript>
    <iframe src="http://www.google.com/recaptcha/api/noscript?k=6LfUUtMSAAAAAOBuPTWtMAnAu3l9AS-iHZb6iFpp&amp;error=" height="300" width="500" frameborder="0"></iframe>
    <br>
    <textarea name="recaptcha_challenge_field" rows="3" cols="40"></textarea>
    <input type="hidden" name="recaptcha_response_field" value="manual_challenge">
  </noscript>
</form>
<iframe src="frame.html"></iframe>


The iframe points to frame.html, which contains exactly this:



<a><img src='http://c'></a>


If you change the content of frame.html even slightly, you will probably not get the captcha image.



This is the PhantomJS script I used:



var url = 'http://127.0.0.1/php_api/recaptcha.html';
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0';
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // Read the captcha image URL from inside the page context.
        var p = page.evaluate(function () {
            return document.getElementById('recaptcha_challenge_image').src;
        });
        console.log(p);
    }
    phantom.exit();
});


This is my first time using PhantomJS. Is there something I'm missing?



 Answers

This has nothing to do with the additional iframe on your page. It is a timing issue: when the page.open callback is called, the reCAPTCHA script has not finished running yet, so it has not created the reCAPTCHA table or loaded the captcha image.



You can either wait a fixed amount of time with setTimeout, or use a waitFor helper to poll until the image is present. The fixed-delay version looks like this:



page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
        phantom.exit(); // exit here too, otherwise the script hangs on failure
    } else {
        // Give the reCAPTCHA script 5 seconds to build the widget.
        setTimeout(function () {
            var p = page.evaluate(function () {
                return document.getElementById('recaptcha_challenge_image').src;
            });
            console.log(p);
            phantom.exit();
        }, 5000);
    }
});


Don't forget that phantom.exit has to be called inside the timeout callback; if it stays directly in the page.open callback, the script exits before the timer fires.
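
For the polling approach, here is a minimal sketch of what a waitFor helper could look like. PhantomJS does not ship a built-in waitFor, so the function below, its signature (testFn, done, timeoutMs, intervalMs), and the 10 second / 250 ms values are my own assumptions for illustration, not an official API. The URL and element id are taken from the question.

```javascript
// Hypothetical helper (not part of PhantomJS): poll testFn every intervalMs
// until it returns a truthy value or timeoutMs has elapsed.
// Calls done(true) on success and done(false) on timeout.
function waitFor(testFn, done, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (testFn()) {
            clearInterval(timer);
            done(true);
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            done(false);
        }
    }, intervalMs);
}

// This part only runs under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var url = 'http://127.0.0.1/php_api/recaptcha.html';
    var page = require('webpage').create();
    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // True once the captcha image element exists and has a src.
            return page.evaluate(function () {
                var img = document.getElementById('recaptcha_challenge_image');
                return !!(img && img.src);
            });
        }, function (found) {
            if (found) {
                console.log(page.evaluate(function () {
                    return document.getElementById('recaptcha_challenge_image').src;
                }));
            } else {
                console.log('Timed out waiting for the captcha image');
            }
            phantom.exit(found ? 0 : 1);
        }, 10000, 250);
    });
}
```

Compared with a fixed setTimeout, polling returns as soon as the image is there and reports a timeout explicitly instead of throwing on a missing element.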


Answered Monday, October 27, 2014