Asked Tuesday, April 2, 2019 · answers: 1 · hits: 12631

I am building a scraping app with Puppeteer. It works well most of the time, but it occasionally fails with an error and I don't know why.

I have to capture about 90,000 items.

The error seems to come from failing to read elements by class name, but when I set the headless option to false and check, the elements with those classes are there.

It runs fine at first, and then the error appears at random.

My guess is that the page sometimes fails to finish loading on the website itself and stays stuck on the loading bar.

I have tried both networkidle0 and networkidle2 as the waitUntil value. If my guess is correct, I don't know how to detect that situation.



[Full Code]



'use strict';

const puppeteer = require('puppeteer'); // Load the Puppeteer module
(async () => {
    const browser = await puppeteer.launch({ // Use the installed Chrome and its existing user profile instead of Puppeteer's bundled browser (to skip authentication)
        executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
        userDataDir: 'C:\\User\\AppData\\Local\\Google\\Chrome\\User Data', // Change this to your own Chrome user data directory
        headless: true
    });
    const page = await browser.newPage(); // Open a new page
    await page.setViewport({ // Viewport: width is typically at most 1920; for height, use the largest pixel height you expect
        width: 800,
        height: 6000
    });
    page.on('dialog', async dialog => { // Band shows a dialog for deleted posts; this handler gets rid of it
        console.log(dialog.message());
        await dialog.dismiss(); // Close the dialog
        postNumber++; // After the dialog closes, a deleted post returns to the previous URL, so move on to the next post number
        await page.goto(`https://band.us/band/58075840/post/${postNumber}`, {
            waitUntil: 'networkidle0'
        });
    });
    let postNumber = 14565; // First post number (start here)
    while (postNumber <= 90000) { // Last post number (end here)
        await page.goto(`https://band.us/band/58075840/post/${postNumber}`, {
            waitUntil: 'networkidle0' // Continue once the page has fully loaded
        });

        let by = await page.evaluate(() => document.getElementsByClassName('text')[0].innerText); // Parse the post author text
        let date = await page.evaluate(() => document.getElementsByClassName('time')[0].innerText); // Parse the post date text
        let element = await page.$('.boardList'); // Element that wraps the whole post and its comments
        await element.screenshot({ // Take a screenshot of that element
            path: `./image/${postNumber}-${by}-${date.replace(':', '_')}.png` // Output location and file name; replace is needed because Windows does not allow ':' in file names
        });
        console.log(`${postNumber}-${by}-${date.replace(':', '_')}.png`); // Log the file name
        postNumber++; // Increment postNumber after a successful capture
    }
    await browser.close(); // Close the browser
})();


[ERROR Message]




(node:16880) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'innerText' of undefined
    at __puppeteer_evaluation_script__:1:50
    at ExecutionContext.evaluateHandle (C:\Users\Downloads\Projects\Bander-Statistics\node_modules\puppeteer\lib\ExecutionContext.js:121:13)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  -- ASYNC --
    at ExecutionContext.<anonymous> (C:\Users\Downloads\Projects\Bander-Statistics\node_modules\puppeteer\lib\helper.js:108:27)
    at ExecutionContext.evaluate (C:\Users\Downloads\Projects\Bander-Statistics\node_modules\puppeteer\lib\ExecutionContext.js:48:31)
    at ExecutionContext.<anonymous> (C:\Users\Downloads\Projects\Bander-Statistics\node_modules\puppeteer\lib\helper.js:109:23)
    at DOMWorld.evaluate (C:\Users\Downloads\Projects\Bander-Statistics\node_modules\puppeteer\lib\DOMWorld.js:105:20)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  -- ASYNC --
    at Frame.<anonymous> (C:\Users\Downloads\Projects\Bander-Statistics\node_modules\puppeteer\lib\helper.js:108:27)
    at Page.evaluate (C:\Users\Downloads\Projects\Bander-Statistics\node_modules\puppeteer\lib\Page.js:809:43)
    at Page.<anonymous> (C:\Users\Downloads\Projects\Bander-Statistics\node_modules\puppeteer\lib\helper.js:109:23)
    at C:\Users\Downloads\Projects\Bander-Statistics\band.js:29:29
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:16880) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:16880) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.


 Answers

Error



The error happens in this line according to your error message:



let by = await page.evaluate(() => document.getElementsByClassName('text')[0].innerText); // Parse the post author text


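You are making roughly 75,000 requests to one website, so it may well have protective measures that block or throttle your bot. Alternatively, the post you are trying to crawl might simply not exist. In either case, document.getElementsByClassName('text')[0] is undefined on that page, and reading innerText from it throws the TypeError you see.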
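
A missing element can also be detected explicitly instead of through a thrown TypeError. The following is a minimal sketch, not a definitive implementation: readPostMeta is a hypothetical helper name, the 10-second default timeout is an arbitrary assumption, and the '.text' and '.time' selectors are taken from the question's class names. It relies on page.waitForSelector rejecting when the element does not appear in time.

```javascript
// Hypothetical helper (not part of Puppeteer): returns { by, date },
// or null if the expected elements never show up on the page.
// `page` is a Puppeteer Page; timeoutMs is an assumed default.
async function readPostMeta(page, timeoutMs = 10000) {
    try {
        // waitForSelector rejects if the selector does not appear within the timeout.
        await page.waitForSelector('.text', { timeout: timeoutMs });
        await page.waitForSelector('.time', { timeout: timeoutMs });
    } catch (err) {
        // Any failure here is treated as "not loaded": the page may have stalled,
        // the post may be missing, or a block page may have been served.
        return null;
    }
    // $eval runs the callback on the first element matching the selector.
    const by = await page.$eval('.text', el => el.innerText);
    const date = await page.$eval('.time', el => el.innerText);
    return { by, date };
}
```

Inside the loop, a null result could then be logged and skipped rather than crashing the whole run.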



Fix



To fix your problem, you could change your evaluate function like this. This will return undefined (instead of throwing an error) if the elements do not exist. It also improves your code by only using one page.evaluate call.



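let [by, date] = await page.evaluate(() => {
    const textNode = document.getElementsByClassName('text')[0];
    const timeNode = document.getElementsByClassName('time')[0];
    return [
        textNode && textNode.innerText,
        timeNode && timeNode.innerText,
    ];
});
if (!by || !date) {
    // by or date is undefined (or empty), so the expected elements were not found
    console.log(`Not working for ID: ${postNumber}`);
    // Capture the whole page: `element` is not defined yet at this point, and .boardList may be missing too
    await page.screenshot({ path: `error-${postNumber}.png` });
    postNumber++; // skip this post
    continue; // assumes this code runs inside the while loop from the question
}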


This takes a screenshot of each page where the error happens. You may see that the website has changed (perhaps it is showing you a captcha) or that the post you are trying to crawl simply does not exist.
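
If the screenshots show a page stuck on its loading bar rather than a captcha or a missing post, retrying the navigation may be enough. Here is a rough sketch of a generic retry wrapper. withRetries is a hypothetical helper, and the attempt count and delay are arbitrary assumptions you would want to tune.

```javascript
// Hypothetical helper: run an async task up to `attempts` times,
// pausing `delayMs` milliseconds between failed tries.
async function withRetries(task, attempts = 3, delayMs = 5000) {
    let lastError;
    for (let i = 1; i <= attempts; i++) {
        try {
            return await task(i); // resolve with the first successful result
        } catch (err) {
            lastError = err;
            if (i < attempts) {
                await new Promise(resolve => setTimeout(resolve, delayMs));
            }
        }
    }
    throw lastError; // every attempt failed
}

// Possible usage inside the question's loop (the timeout value is an assumption):
// await withRetries(() => page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 }));
```

Pausing between attempts also slows the crawl down slightly, which may help if the site is throttling you.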



If the screenshot is not helping, you could also use page.content() to save the HTML in an error case and have a look at it.
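
Saving the HTML next to the screenshot could look roughly like this. It is a sketch under assumptions: saveDebugArtifacts is a hypothetical helper, and the ./debug directory and file names are made up for illustration. Only page.content(), page.screenshot() and Node's fs module are relied on.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: store the current HTML and a full-page screenshot
// for a post that could not be parsed. `dir` is an assumed output folder.
async function saveDebugArtifacts(page, postNumber, dir = './debug') {
    await fs.promises.mkdir(dir, { recursive: true });
    const htmlPath = path.join(dir, `error-${postNumber}.html`);
    const pngPath = path.join(dir, `error-${postNumber}.png`);
    const html = await page.content(); // full HTML of the page as a string
    await fs.promises.writeFile(htmlPath, html, 'utf8');
    await page.screenshot({ path: pngPath, fullPage: true });
    return { htmlPath, pngPath };
}
```

Opening the saved HTML in an editor should show whether the expected classes are absent or whether a different page was served altogether.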


[#8189] Saturday, March 30, 2019
theodore
