Thursday, September 24, 2020

I am scraping a number of pages with Puppeteer. The content is not marked up with distinguishing classes or ids, and it appears in a different order from page to page, so I need to select elements by their inner text. A simplified HTML sample is below:


<table>
<tr>
<th>Product name</th>
<td>Shakeweight</td>
</tr>
<tr>
<th>Product category</th>
<td>Exercise equipment</td>
</tr>
<tr>
<th>Manufacturer name</th>
<td>The Shakeweight Company</td>
</tr>
<tr>
<th>Manufacturer address</th>
<td>
<table>
<tr><td>123 Fake Street</td></tr>
<tr><td>Springfield, MO</td></tr>
</table>
</td>
</tr>
</table>


In this example, I need to scrape the manufacturer name and the manufacturer address. I assume I need to select the appropriate tr based on the inner text of its th, then scrape the td in that same tr. Note that the row order is not always the same and the real table has many more rows than this simplified example, so I can't just select the 3rd and 4th td.


I tried to select an element by its inner text using XPath, as below, but it does not seem to work:


var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)

This wouldn't even be the data I need (that would be the td associated with this th), but I figured it would at least be step 1. I'd really appreciate input on how to select by inner text, or how to select the td associated with this th.



 Answers

This is really an XPath question rather than a Puppeteer one, so this question might also help, since you need to find the <td> that follows the <th> you've found: XPath:: Get following Sibling


That said, your XPath does work for me. In Chrome DevTools, on a page containing the HTML from your question, run this line to query the document:


$x('//th[text()="Manufacturer name"]')

NOTE: $x() is a helper function that only exists in the Chrome DevTools console, though Puppeteer has a similar page.$x method.
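One possible reason the document.evaluate call in the question seemed not to work: with XPathResult.ANY_TYPE, a node-set comes back as an iterator, so the return value is an XPathResult object rather than an element, and you have to call iterateNext() on it. A minimal sketch of an alternative, assuming you only want the first match, is to request FIRST_ORDERED_NODE_TYPE and read singleNodeValue. The helper name firstNodeByXPath is hypothetical, not part of any API.

```javascript
// Numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE, spelled out so the
// helper can also be loaded outside a browser.
const FIRST_ORDERED_NODE_TYPE = 9;

// Hypothetical helper: returns the first node matching `expr` in `doc`,
// or null when nothing matches.
function firstNodeByXPath(doc, expr) {
  const result = doc.evaluate(expr, doc, null, FIRST_ORDERED_NODE_TYPE, null);
  return result.singleNodeValue;
}

// Usage in a browser console (where `document` exists):
//   const td = firstNodeByXPath(
//     document,
//     '//th[text()="Manufacturer name"]/following-sibling::td'
//   );
//   console.log(td ? td.innerText : 'no match');
```

Unlike $x(), this works in any page script, so the same expression can be tried outside DevTools.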


That expression should return an array with one element: the <th> whose text matches the query. To get the <td> next to it:


$x('//th[text()="Manufacturer name"]/following-sibling::td')

And to get its inner text:


$x('//th[text()="Manufacturer name"]/following-sibling::td')[0].innerText
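Because text()="..." is an exact match, it can miss a header that carries stray whitespace, and a label containing quotes needs care since XPath 1.0 has no escape character inside string literals. The sketch below shows one way to build the expression more defensively, using normalize-space(.) and a quoting helper. Both function names (xpathLiteral, cellXPathForLabel) are hypothetical, and the trailing [1] is an assumption that only the first td after the th is wanted.

```javascript
// Hypothetical helper: turn a JS string into a valid XPath 1.0 string literal.
// When the string contains both quote styles, fall back to concat().
function xpathLiteral(s) {
  if (!s.includes("'")) return `'${s}'`;
  if (!s.includes('"')) return `"${s}"`;
  const parts = s.split("'").map((part) => `'${part}'`);
  return `concat(${parts.join(`, "'", `)})`;
}

// Hypothetical helper: XPath for the first <td> sibling of the <th> whose
// whitespace-normalized text equals `label`.
function cellXPathForLabel(label) {
  return `//th[normalize-space(.)=${xpathLiteral(label)}]/following-sibling::td[1]`;
}

console.log(cellXPathForLabel('Manufacturer name'));
// //th[normalize-space(.)='Manufacturer name']/following-sibling::td[1]
```

The resulting string can be passed to $x() in DevTools in place of the hand-written expression. For the address row, the innerText of the matched <td> should include the text of the nested table as well.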

Once you can follow that pattern, you should be able to use the same strategy to get the data you want in Puppeteer, along these lines:


const puppeteer = require('puppeteer');

const main = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://127.0.0.1:8080/'); // <-- EDIT THIS

  // page.$x resolves to an array of element handles matching the XPath
  const mfg = await page.$x('//th[text()="Manufacturer name"]/following-sibling::td');
  // Read the innerText property of the first match and convert it to a plain string
  const prop = await mfg[0].getProperty('innerText');
  const text = await prop.jsonValue();
  console.log(text);

  await browser.close();
}

main();
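Since the question asks for both the manufacturer name and the address, the same lookup can be wrapped in a loop over labels. The sketch below is one way to do that, not the only one. It assumes a newer Puppeteer release: as far as I know, page.$x was removed in v22 in favor of the xpath/ selector prefix on page.$$, so on older versions you would call page.$x(expr) as above instead. The function name readCells is hypothetical, and labels are assumed not to contain double quotes.

```javascript
// Hypothetical helper: for each label, find the <td> that follows the <th>
// with that text and return its trimmed innerText (null when not found).
// `page` is a Puppeteer Page; labels must not contain double quotes.
async function readCells(page, labels) {
  const out = {};
  for (const label of labels) {
    const expr = `//th[normalize-space(.)="${label}"]/following-sibling::td[1]`;
    // Assumption: the "xpath/" prefix is available (newer Puppeteer versions).
    const handles = await page.$$(`xpath/${expr}`);
    out[label] = handles.length > 0
      ? await handles[0].evaluate((el) => el.innerText.trim())
      : null;
  }
  return out;
}

// Usage inside an async function, after page.goto(...):
//   const data = await readCells(page, ['Manufacturer name', 'Manufacturer address']);
//   console.log(data);
```

Returning null for a missing label avoids the TypeError you would otherwise get from indexing into an empty result array on pages where a row is absent.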

Sunday, September 20, 2020
ira