
SGML Python parsers benchmark

SGML is an ISO standard for defining markup languages. Two of its derivatives are HTML and XML (XHTML is itself an application of XML).

The standard library provides a lot of modules to parse HTML and XML. In my benchmark today I decided to consider only the ones oriented to HTML (and SGML since it’s a kind of superset) cheating a bit and letting them handle XML also. I know it’s not really fair but I’m a free man so let’s go on.

To do the benchmark I obviously needed some data so I wrote a script to extract the HTML URLs (homepages) and XML URLs (feeds) from the OPML exported by my feed reader.

The script is simple and straightforward: it loads the OPML file with cElementTree, scans the outlines with XPath-like syntax and, for each outline, downloads the related URLs, storing the content in two separate directories.
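A minimal sketch of what such a script might look like, written for a current Python 3 rather than the 2.5.1 used here. It assumes the OPML export stores the homepage and feed addresses in the conventional `htmlUrl` and `xmlUrl` attributes; the function names, the sample document and the output file naming are all hypothetical and not taken from the original script.

```python
import os
import urllib.request
import xml.etree.ElementTree as ET  # stands in for cElementTree, which modern Python no longer ships

# Hypothetical sample; a real export from a feed reader would be much longer.
SAMPLE_OPML = """<opml version="1.1">
  <body>
    <outline text="Folder">
      <outline text="Example" htmlUrl="http://example.com/"
               xmlUrl="http://example.com/feed.xml"/>
    </outline>
  </body>
</opml>"""


def extract_urls(opml_text):
    """Return (html_urls, xml_urls) collected from every outline element."""
    root = ET.fromstring(opml_text)
    html_urls, xml_urls = [], []
    # ".//outline" is the XPath-like syntax: match <outline> at any depth.
    for outline in root.findall(".//outline"):
        html_url = outline.get("htmlUrl")  # assumed attribute name
        xml_url = outline.get("xmlUrl")    # assumed attribute name
        if html_url:
            html_urls.append(html_url)
        if xml_url:
            xml_urls.append(xml_url)
    return html_urls, xml_urls


def download_all(urls, directory):
    """Download each URL into `directory`, skipping the ones that fail."""
    os.makedirs(directory, exist_ok=True)
    for index, url in enumerate(urls):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                content = response.read()
        except OSError as exc:  # URLError is a subclass of OSError
            print("skipping %s: %s" % (url, exc))
            continue
        with open(os.path.join(directory, "%04d.dat" % index), "wb") as f:
            f.write(content)


html_urls, xml_urls = extract_urls(SAMPLE_OPML)
print(html_urls, xml_urls)
```

Calling `download_all(html_urls, "html")` and `download_all(xml_urls, "xml")` would then fill the two directories; the calls are left out here so the sketch runs without network access.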

Let me say that cElementTree is fast, really really fast!

Now I have 454 files full of content: 227 HTML pages, 227 XML feeds. Roughly 22 megabytes of content to feed to the parsers. Let’s presume a lot of that content isn’t even valid according to the specific standards (HTML, XML, Atom, RSS). Anyway, I don’t have to validate the content so I just don’t care.

The modules considered for this benchmark are: HTMLParser, sgmllib, BeautifulSoup and sgmlop. I ignored the plethora of XML-only libraries because there are plenty of benchmarks out there, and I ignored htmllib because it sits on top of sgmllib so there’s no point in benchmarking it too.

This bench does not do anything with the content, just feeds it to the parser. Any kind of error is collected in a list to let me know how many errors the parsers encounter.

The data is preloaded in memory, so the time needed to open the files and read their content is not measured. Every parser has been fed with HTML and XML content separately.
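The measurement loop described above can be sketched roughly like this. It is not the original sgmlbench.py: it targets Python 3, where sgmllib no longer exists, so `html.parser.HTMLParser` stands in for the parsers under test. The `bench` helper, the choice of a fresh parser per document and the two tiny sample documents are assumptions made for illustration.

```python
import time
from html.parser import HTMLParser


def bench(parser_factory, documents):
    """Feed each preloaded document to a fresh parser.

    Returns (elapsed_seconds, errors). Only parsing is timed because the
    documents are already in memory when the clock starts.
    """
    errors = []
    start = time.perf_counter()
    for document in documents:
        parser = parser_factory()
        try:
            parser.feed(document)
            parser.close()
        except Exception as exc:  # collect the error and keep going
            errors.append(exc)
    elapsed = time.perf_counter() - start
    return elapsed, errors


# Hypothetical stand-ins for the preloaded HTML pages and XML feeds.
documents = [
    "<html><body><p>hello</p></body></html>",
    "<feed><entry><title>x</title></entry></feed>",
]

elapsed, errors = bench(HTMLParser, documents)
print("HTMLParser - time: %s, errors: %d" % (elapsed, len(errors)))
```

Running `bench` once on the HTML documents and once on the XML ones, then adding the two times, would give the per-parser totals.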

The machine used is a Core Duo 2GHz MacBook with 2 GB of RAM. Python’s version is 2.5.1.

These are my results:

rhymes@groove ~% python sgmlbench.py
HTMLParser, only HTML - time: 5.1607260704, errors: 37
HTMLParser, only XML - time: 3.56549191475, errors: 98
Total: 8.726218

sgmllib.SGMLParser, only HTML - time: 7.36616611481, errors: 1
sgmllib.SGMLParser, only XML - time: 4.22875499725, errors: 0
Total: 11.594921

BeautifulSoup, only HTML - time: 23.7593009472, errors: 5
BeautifulSoup, only XML - time: 10.1111578941, errors: 0
Total: 33.870459

sgmlop.SGMLParser, only HTML - time: 0.473984956741, errors: 0
sgmlop.SGMLParser, only XML - time: 0.443637132645, errors: 0
Total: 0.917622

Draw any conclusion you might like.
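If you want a starting point, here is a small sketch that normalizes the four totals reported above against the fastest one. The numbers are copied from the output; the dictionary and variable names are just illustrative.

```python
# Totals (seconds) copied from the benchmark output above.
totals = {
    "HTMLParser": 8.726218,
    "sgmllib.SGMLParser": 11.594921,
    "BeautifulSoup": 33.870459,
    "sgmlop.SGMLParser": 0.917622,
}

fastest = min(totals.values())

# How many times slower each parser is than the fastest one.
ratios = {name: total / fastest for name, total in totals.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print("%-20s %6.1fx" % (name, ratio))
```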

You can find the archive containing the whole data including the scripts online.

Update [2007/08/30]:

Karl Dubost pointed html5lib out in the comments so I decided to add it to the benchmark. These are the numbers obtained with the SVN version:

html5lib.HTMLParser, only HTML - time: 209.537359953, errors: 6019
html5lib.HTMLParser, only XML - time: 247.377570152, errors: 15566
Total: 456.914930

I also tried the different tree builders, but the numbers don’t change significantly: they are only used to build the resulting DOM and don’t affect the parsing process.

Since it seems tremendously slow I profiled the code with cProfile and realized that the tokenization process seems to be the slowest part:

         482665 function calls (479420 primitive calls) in 2.282 CPU seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.282    2.282 {execfile}
        1    0.004    0.004    2.282    2.282 html5lib_profile.py:1(<module>)
        1    0.000    0.000    2.188    2.188 html5lib_profile.py:3(html5lib_parse)
        1    0.000    0.000    2.187    2.187 html5parser.py:126(parse)
        1    0.041    0.041    2.187    2.187 html5parser.py:72(_parse)
     6832    0.062    0.000    1.877    0.000 tokenizer.py:88(iter)
     8216    1.158    0.000    1.230    0.000 inputstream.py:244(charsUntil)
     6763    0.060    0.000    0.689    0.000 tokenizer.py:298(dataState)
     6064    0.026    0.000    0.377    0.000 tokenizer.py:454(tagNameState)
     2059    0.013    0.000    0.332    0.000 tokenizer.py:585(attributeValueDoubleQuotedState)
    26613    0.083    0.000    0.239    0.000 inputstream.py:205(char)
    35513    0.155    0.000    0.155    0.000 {method 'pop' of 'list' objects}
     2076    0.015    0.000    0.152    0.000 tokenizer.py:494(attributeNameState)
      163    0.048    0.000    0.120    0.001 tokenizer.py:191(consumeEntity)
[...]
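A report ordered by cumulative time like this one can be produced with cProfile and pstats. The sketch below is not the html5lib_profile.py script used here: it profiles the standard library `html.parser.HTMLParser` on a made-up document so that it runs without html5lib installed, and the `parse` helper is hypothetical.

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser


def parse(document):
    """Hypothetical function under test: feed one document to a parser."""
    parser = HTMLParser()
    parser.feed(document)
    parser.close()


# Made-up input, repeated so the profile has something to measure.
document = "<p class='x'>hello &amp; goodbye</p>" * 1000

profiler = cProfile.Profile()
profiler.enable()
parse(document)
profiler.disable()

# Sort by cumulative time and keep only the ten most expensive entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

Sorting by cumulative time puts the callers at the top, so a function with a high tottime further down the list (like charsUntil above) marks where the time is really spent.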

I think it’s just a great idea to have a reference implementation in Python of an HTML5 parser, there’s plenty of time to make it fast.

4 Comments

  1. You missed one candidate. The HTML 5 Editor’s Draft defines a parsing algorithm for HTML tag soup, with a well-defined error recovery mechanism, to create a DOM. Anne van Kesteren and a few other people built an HTML 5 parser in Python.

    http://www.w3.org/html/wg/html5/ http://code.google.com/p/html5lib/

    Thursday, August 30, 2007 at 12:27 pm | Permalink
  2. jgraham wrote:

    Which version of html5lib did you use? The version in SVN should be somewhat faster than the quite outdated 0.9 release, but not enough to make up the difference compared to BeautifulSoup. I have a few ideas for making html5lib faster, but it’s basically limited by the need to process the input character by character, so I don’t think it will ever be truly fast without a significant rearchitecture or a port of the tokenizer stage to C.

    Thursday, August 30, 2007 at 11:56 pm | Permalink
  3. Lawrence wrote:

    I used the SVN version, sorry I forgot to mention that.

    I guess you’re right, maybe you can optimize the Python code a bit but I guess the real improvement is to port the tokenizer to C or maybe pyrex.

    Friday, August 31, 2007 at 1:09 am | Permalink
  4. ludo wrote:

    I told you sgmlop kicks ass. :)

    Sunday, September 2, 2007 at 11:21 pm | Permalink

2 Trackbacks/Pingbacks

  1. Just another WordPress weblog on Sunday, August 26, 2007 at 9:46 pm

    [...] Lawrence Oluyede’s Blog (Lawrence Oluyede): SGML Python parsers benchmark [...]

  2. A song for the lovers » html5lib is getting faster on Saturday, September 22, 2007 at 5:08 pm

    [...] ran the benchmark again with the 1014 revision of html5lib and I noticed a major speedup (although miles far from the [...]
