parse5

[![Build Status](https://api.travis-ci.org/inikulin/parse5.svg)](https://travis-ci.org/inikulin/parse5) [![npm](https://img.shields.io/npm/v/parse5.svg)](https://www.npmjs.com/package/parse5)

*A fast, production-ready, WHATWG HTML5 specification-compliant HTML parsing/serialization toolset for Node and io.js.*

I needed a fast, production-ready HTML parser that parses HTML the way a modern browser's parser does. Existing solutions were either too slow or too inaccurate in their output. That is how parse5 was born.

**Included tools:**
* [Parser](#class-parser) - HTML to DOM-tree parser.
* [SimpleApiParser](#class-simpleapiparser) - [SAX](http://en.wikipedia.org/wiki/Simple_API_for_XML)-style parser for HTML.
* [Serializer](#class-serializer) - DOM-tree to HTML code serializer.

## Install
```
$ npm install parse5
```

## Usage
```js
var Parser = require('parse5').Parser;

//Instantiate the parser
var parser = new Parser();

//Then feed it an HTML document
var document = parser.parse('Hi there!');

//Now let's parse an HTML snippet
var fragment = parser.parseFragment('<title>Parse5 is &#102;&#117;&#99;&#107;ing awesome!</title><h1>42</h1>');
```

## Is it fast?
Check out [this benchmark](https://github.com/inikulin/node-html-parser-bench).

```
Starting benchmark. Fasten your seatbelts...
html5 (https://github.com/aredridel/html5) x 0.18 ops/sec ±5.92% (5 runs sampled)
htmlparser (https://github.com/tautologistics/node-htmlparser/) x 3.83 ops/sec ±42.43% (14 runs sampled)
htmlparser2 (https://github.com/fb55/htmlparser2) x 4.05 ops/sec ±39.27% (15 runs sampled)
parse5 (https://github.com/inikulin/parse5) x 3.04 ops/sec ±51.81% (13 runs sampled)
Fastest is htmlparser2 (https://github.com/fb55/htmlparser2),parse5 (https://github.com/inikulin/parse5)
```

So parse5 is as fast as the simple, specification-incompatible parsers and ~15 times(!) faster than the specification-compliant parser currently available for Node.

## API reference

### Enum: TreeAdapters
Provides built-in tree adapters, which can be passed as an optional argument to the `Parser` and `Serializer` constructors.

#### • TreeAdapters.default
The default tree format for parse5.

#### • TreeAdapters.htmlparser2
The popular [htmlparser2](https://github.com/fb55/htmlparser2) tree format (used, e.g., in [cheerio](https://github.com/MatthewMueller/cheerio) and [jsdom](https://github.com/tmpvar/jsdom)).

---------------------------------------

### Class: Parser
Provides HTML parsing functionality.

#### • Parser.ctor([treeAdapter, options])
Creates a new reusable instance of the `Parser`. The optional `treeAdapter` argument specifies the format of the resulting tree. If `treeAdapter` is not specified, the `default` tree adapter is used.

The `options` object modifies the parsing algorithm:

##### options.decodeHtmlEntities
Decode HTML entities like `&amp;`, `&nbsp;`, etc. Default: `true`. **Warning:** disabling this option may produce output that does not conform to the HTML5 specification.

##### options.locationInfo
Enables source code location information for the nodes. Default: `false`.
When enabled, each node (except the root node) has a `__location` property, which contains the `start` and `end` indices of the node in the source code. If an element was implicitly created by the parser, its `__location` property will be `null`. If the node is not an empty element, `__location` has two additional properties, `startTag` and `endTag`, which contain location information for the individual tags in the same fashion as the `__location` property.

*Example:*
```js
var parse5 = require('parse5');

//Instantiate a new parser with the default tree adapter
var parser1 = new parse5.Parser();

//Instantiate a new parser with the htmlparser2 tree adapter
var parser2 = new parse5.Parser(parse5.TreeAdapters.htmlparser2);
```

#### • Parser.parse(html)
Parses the specified `html` string. Returns the `document` node.

*Example:*
```js
var document = parser.parse('Hi there!');
```

#### • Parser.parseFragment(htmlFragment, [contextElement])
Parses the given `htmlFragment`. Returns the `documentFragment` node. The optional `contextElement` argument specifies the context in which the given `htmlFragment` will be parsed (think of it as setting the `contextElement.innerHTML` property). If `contextElement` argument is not specified then `