Question

Puppeteer - how to select an element based on its inner text?

answers: 1 / hits: 6096 / 4 years ago, Thu, September 24, 2020, 12:00:00

I am scraping a bunch of pages with Puppeteer. The content is not differentiated with classes, ids, etc., and it is presented in a different order between pages. As such, I will need to select elements based on their inner text. I have included a simplified HTML sample below:

<table>
<tr>
    <th>Product name</th>
    <td>Shakeweight</td>
</tr>
<tr>
    <th>Product category</th>
    <td>Exercise equipment</td>
</tr>
<tr>
    <th>Manufacturer name</th>
    <td>The Shakeweight Company</td>
</tr>
<tr>
    <th>Manufacturer address</th>
    <td>
        <table>
            <tr><td>123 Fake Street</td></tr>
            <tr><td>Springfield, MO</td></tr>
        </table>
    </td>
</tr>
</table>

In this example, I would need to scrape the manufacturer name and manufacturer address. So I suppose I would need to select the appropriate tr based upon the inner text of the nested th and scrape the associated td within that same tr. Note that the order of the rows of this table is not always the same and the table contains many more rows than this simplified example, so I can't just select the 3rd and 4th td.

I have tried to select an element based on its inner text using XPath, as below, but it does not seem to be working:

var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)

This wouldn't even be the data I would need (it would be the td associated with this th), but I figured this would be step 1 at least. If someone could provide input on the strategy to select by inner text, or to select the td associated with this th, I'd really appreciate it.

Answers

ira

answered 4 Years ago braidenv · Accepted Answer

This is really an XPath question rather than a Puppeteer-specific one, so this question might also help, since you will need to find the <td> that comes after the <th> you've found: XPath:: Get following Sibling

But your XPath does work for me. In Chrome DevTools, on the page with the HTML from your question, run this line to query the document:

$x('//th[text()="Manufacturer name"]')

NOTE: $x() is a console helper that only works in browser DevTools (such as Chrome's), though Puppeteer has a similar page.$x method.

That expression should return an array with one element: the <th> whose text matches the query. To get the <td> next to it:

$x('//th[text()="Manufacturer name"]/following-sibling::td')

And to get its inner text:

$x('//th[text()="Manufacturer name"]/following-sibling::td')[0].innerText
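Outside DevTools, where $x() is not available, the same lookup can be done with document.evaluate, the call used in the question. One likely reason that attempt looked broken: document.evaluate returns an XPathResult rather than an element, and with XPathResult.ANY_TYPE a node-set query comes back as an iterator that has to be read with iterateNext(). Asking for the first matching node directly is simpler. Here is a minimal sketch, assuming it runs in the page itself; cellTextFor is a hypothetical helper name, and it assumes the label contains no double quotes.

```javascript
// Numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE, spelled out so the
// helper does not depend on the browser-only XPathResult global.
const FIRST_ORDERED_NODE_TYPE = 9;

// Hypothetical helper: text of the <td> that follows the <th> with this label.
// `doc` is the page's `document`; assumes `label` has no double quotes.
function cellTextFor(doc, label) {
  const xpath = `//th[text()="${label}"]/following-sibling::td`;
  const result = doc.evaluate(xpath, doc, null, FIRST_ORDERED_NODE_TYPE, null);
  const cell = result.singleNodeValue; // null when nothing matches
  return cell ? cell.innerText.trim() : null;
}

// In a browser console on the sample page:
// cellTextFor(document, 'Manufacturer name')
```

Passing the document in as an argument keeps the helper usable both in a page script and inside a callback that Puppeteer runs in the page.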

Once you can follow that pattern, you should be able to use a similar strategy to get the data you want in Puppeteer, along these lines:

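const puppeteer = require('puppeteer');

const main = async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('http://127.0.0.1:8080/');  // <-- EDIT THIS

    // Find the <td> that follows the <th> labelled "Manufacturer name".
    const mfg = await page.$x('//th[text()="Manufacturer name"]/following-sibling::td');
    if (mfg.length === 0) {
      throw new Error('No "Manufacturer name" row found');
    }

    // Read the cell's innerText out of the browser context.
    const prop = await mfg[0].getProperty('innerText');
    const text = await prop.jsonValue();
    console.log(text);
  } finally {
    // Close the browser even if scraping throws.
    await browser.close();
  }
};

main().catch(console.error);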
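Since the question needs both the manufacturer name and the manufacturer address, and the row order varies between pages, the lookup in the script above can be wrapped in a small helper that takes a list of labels. The following is only a sketch: cellXPath and scrapeFields are hypothetical names, and it assumes a Puppeteer version that still provides page.$x (newer releases have dropped it). It also swaps text() for normalize-space() so stray whitespace around a header does not break the match.

```javascript
// Build an XPath for the <td> beside the <th> whose text equals `label`.
// normalize-space() trims and collapses whitespace in the header text.
// Assumption: labels contain no double quotes (XPath 1.0 has no escape syntax).
function cellXPath(label) {
  if (label.includes('"')) {
    throw new Error(`Label must not contain a double quote: ${label}`);
  }
  return `//th[normalize-space()="${label}"]/following-sibling::td`;
}

// Hypothetical helper: look up each label and return { label: text-or-null }.
// `page` is a Puppeteer Page that still provides page.$x().
async function scrapeFields(page, labels) {
  const fields = {};
  for (const label of labels) {
    const [cell] = await page.$x(cellXPath(label));
    fields[label] = cell
      ? (await (await cell.getProperty('innerText')).jsonValue()).trim()
      : null; // no row with this label on the page
  }
  return fields;
}
```

Inside main, a call such as `await scrapeFields(page, ['Manufacturer name', 'Manufacturer address'])` would then return one object keyed by label. The address cell holds a nested table, so its innerText should contain the text of both nested rows.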