Question

Selecting an html node's text content with htmlparser2 in Node.js

Asked Mon, May 27, 2019 · 1 answer

I want to parse some HTML with the htmlparser2 module for Node.js. My task is to find a specific element by its ID and extract its text content.

I have read the documentation (which is quite limited) and I know how to set up my parser with the onopentag function, but it only gives access to the tag name and its attributes (I cannot see the text). The ontext function extracts all text nodes from the given HTML string, but ignores all markup.

So here's my code.

const htmlparser = require("htmlparser2");

const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';

const parser = new htmlparser.Parser({
  onopentag: function(name, attribs){
    if (attribs.id === "heading1"){
      console.log(/*how to extract text so I can get "Some heading" here*/);
    }
  },

  ontext: function(text){
    console.log(text); // logs "Some heading", then "Foobar"
  }
});

parser.parseComplete(file);

I expect the output of the function call to be 'Some heading'. I believe there is some obvious solution, but somehow it escapes me.

Thank you.

Answers


answered 5 Years ago marisela · Accepted Answer

You can do it like this using the library you asked about:

const htmlparser = require('htmlparser2');
const domUtils = require('domutils');

const file = '<h1 id=heading1>Some heading</h1><p>Foobar</p>';

var handler = new htmlparser.DomHandler(function(error, dom) {
  if (error) {
    console.log('Parsing had an error');
    return;
  } else {
    const item = domUtils.findOne(element => {
      const matches = element.attribs.id === 'heading1';
      return matches;
    }, dom);

    if (item) {
      console.log(item.children[0].data);
    }
  }
});

var parser = new htmlparser.Parser(handler);
parser.write(file);
parser.end();

The output you will get is Some heading. In my opinion, though, you will find it easier to use a library that is built for querying. You don't have to, of course, but note how much simpler the code is in this related question: How do I get an element name in cheerio with node.js

Cheerio, or a querySelector-style API such as https://www.npmjs.com/package/node-html-parser if you prefer native-style query selectors, is much leaner.

You can compare the htmlparser2 code above with something leaner, such as node-html-parser, which supports querying directly:

const { parse } = require('node-html-parser');

const file = '<h1 id=heading1>Some heading</h1><p>Foobar</p>';
const root = parse(file);
const text = root.querySelector('#heading1').text;
console.log(text);