SGML Python parsers benchmark

SGML is an ISO standard for defining markup languages. Two of its derivatives are HTML and XML (XHTML is itself an application of XML).

The standard library provides a lot of modules to parse HTML and XML. In my benchmark today I decided to consider only the ones oriented to HTML (and SGML, since it’s a kind of superset), cheating a bit and letting them handle XML too. I know it’s not really fair but I’m a free man so let’s go on.

To do the benchmark I obviously needed some data so I wrote a script to extract the HTML URLs (homepages) and XML URLs (feeds) from the OPML exported by my feed reader.

The script is simple and straightforward: it loads the OPML file with cElementTree, scans the outlines with XPath-like syntax and, for each outline, downloads the related URLs, storing the content in two separate directories.
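The extraction step could look roughly like the sketch below. It is not the original script: it assumes a modern Python, where cElementTree lives on as `xml.etree.ElementTree`, it assumes the OPML stores the links in the usual `htmlUrl` and `xmlUrl` attributes, and it uses a made-up two-entry OPML sample. The download step is left out.

```python
import xml.etree.ElementTree as ET  # successor of cElementTree in the standard library

# Hypothetical OPML sample; a real export from a feed reader is much longer.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="Folder">
      <outline text="Example blog" type="rss"
               htmlUrl="http://example.com/"
               xmlUrl="http://example.com/feed.xml"/>
    </outline>
    <outline text="Another blog" type="rss"
             htmlUrl="http://example.org/"
             xmlUrl="http://example.org/atom.xml"/>
  </body>
</opml>
"""


def extract_urls(opml_text):
    """Return (html_urls, xml_urls) found on the outline elements."""
    root = ET.fromstring(opml_text)
    html_urls, xml_urls = [], []
    # ".//outline" is the XPath-like syntax: every outline at any depth.
    for outline in root.findall(".//outline"):
        html_url = outline.get("htmlUrl")
        xml_url = outline.get("xmlUrl")
        if html_url:
            html_urls.append(html_url)
        if xml_url:
            xml_urls.append(xml_url)
    return html_urls, xml_urls


html_urls, xml_urls = extract_urls(SAMPLE_OPML)
print(html_urls)
print(xml_urls)
```

Outlines that only group other outlines (like the folder in the sample) carry no URL attributes and are simply skipped.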

Let me say that cElementTree is fast, really really fast!

Now I have 454 files full of content: 227 HTML pages and 227 XML feeds. Roughly 22 megabytes of content to feed to the parsers. Let’s presume a lot of that content isn’t even valid according to the specific standards (HTML, XML, Atom, RSS). Anyway, I don’t have to validate the content so I just don’t care.

The modules considered for this benchmark are: HTMLParser, sgmllib, BeautifulSoup and sgmlop. I ignored the whole plethora of XML-only libraries because there are plenty of benchmarks out there, and I ignored htmllib because it sits on top of sgmllib so there’s no point in benchmarking it too.

The benchmark does not do anything with the content; it just feeds it to the parser. Any kind of error is collected in a list to let me know how many errors each parser encounters.

The data is preloaded in memory, so the time needed to open the files and read their content is not measured. Every parser has been fed the HTML and XML content separately.
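The measuring loop described above can be sketched as follows. This is only an illustration of the approach, not the code from sgmlbench.py: it assumes a modern Python (where the HTMLParser module is `html.parser`), it assumes a fresh parser per document, and the `bench` helper and the tiny in-memory documents are hypothetical.

```python
import time
from html.parser import HTMLParser  # the "HTMLParser" module of Python 2


def bench(make_parser, documents):
    """Feed every preloaded document to a fresh parser.

    Returns (elapsed_seconds, errors), where errors holds one
    (index, exception) pair per document that made the parser raise.
    """
    errors = []
    start = time.perf_counter()
    for index, content in enumerate(documents):
        parser = make_parser()
        try:
            parser.feed(content)
            parser.close()
        except Exception as exc:  # any parser failure counts as one error
            errors.append((index, exc))
    elapsed = time.perf_counter() - start
    return elapsed, errors


# Hypothetical in-memory data standing in for the preloaded files.
html_docs = ["<html><body><p>Hello</p></body></html>"] * 100
xml_docs = ["<rss><channel><title>Feed</title></channel></rss>"] * 100

total = 0.0
for label, docs in (("only HTML", html_docs), ("only XML", xml_docs)):
    elapsed, errors = bench(HTMLParser, docs)
    total += elapsed
    print("HTMLParser, %s - time: %s, errors: %d" % (label, elapsed, len(errors)))
print("Total: %f" % total)
```

Because the documents are already strings in memory when the timer starts, only the parsing itself is measured.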

The machine used is a Core Duo 2GHz MacBook with 2 GB of RAM. Python’s version is 2.5.1.

These are my results:

rhymes@groove ~% python sgmlbench.py
HTMLParser, only HTML - time: 5.1607260704, errors: 37
HTMLParser, only XML - time: 3.56549191475, errors: 98
Total: 8.726218

sgmllib.SGMLParser, only HTML - time: 7.36616611481, errors: 1
sgmllib.SGMLParser, only XML - time: 4.22875499725, errors: 0
Total: 11.594921

BeautifulSoup, only HTML - time: 23.7593009472, errors: 5
BeautifulSoup, only XML - time: 10.1111578941, errors: 0
Total: 33.870459

sgmlop.SGMLParser, only HTML - time: 0.473984956741, errors: 0
sgmlop.SGMLParser, only XML - time: 0.443637132645, errors: 0
Total: 0.917622

Draw any conclusion you might like.
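If you want one way to read the totals, you can divide each of them by the fastest one. The snippet below does just that with the four totals reported above; taking sgmlop as the baseline is my own choice for the illustration.

```python
# Totals in seconds, copied from the results above.
totals = {
    "HTMLParser": 8.726218,
    "sgmllib.SGMLParser": 11.594921,
    "BeautifulSoup": 33.870459,
    "sgmlop.SGMLParser": 0.917622,
}

# Express every total as a multiple of the sgmlop total.
baseline = totals["sgmlop.SGMLParser"]
ratios = {name: seconds / baseline for name, seconds in totals.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print("%-20s %6.1fx" % (name, ratio))
```

On these numbers HTMLParser takes roughly 9.5 times as long as sgmlop, sgmllib roughly 12.6 times and BeautifulSoup roughly 36.9 times.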

You can find the archive containing the whole data including the scripts online.

Update [2007/08/30]:

Karl Dubost pointed html5lib out in the comments so I decided to add it to the benchmark. These are the numbers obtained with the SVN version:

html5lib.HTMLParser, only HTML - time: 209.537359953, errors: 6019
html5lib.HTMLParser, only XML - time: 247.377570152, errors: 15566
Total: 456.914930

I also tried the different tree builders but the numbers don’t change significantly: they’re only used to build the resulting DOM and don’t affect the parsing process.

Since it seems tremendously slow I profiled the code with cProfile and realized that the tokenization process seems to be the slowest part:

         482665 function calls (479420 primitive calls) in 2.282 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.282    2.282 {execfile}
        1    0.004    0.004    2.282    2.282 html5lib_profile.py:1(<module>)
        1    0.000    0.000    2.188    2.188 html5lib_profile.py:3(html5lib_parse)
        1    0.000    0.000    2.187    2.187 html5parser.py:126(parse)
        1    0.041    0.041    2.187    2.187 html5parser.py:72(_parse)
     6832    0.062    0.000    1.877    0.000 tokenizer.py:88(__iter__)
     8216    1.158    0.000    1.230    0.000 inputstream.py:244(charsUntil)
     6763    0.060    0.000    0.689    0.000 tokenizer.py:298(dataState)
     6064    0.026    0.000    0.377    0.000 tokenizer.py:454(tagNameState)
     2059    0.013    0.000    0.332    0.000 tokenizer.py:585(attributeValueDoubleQuotedState)
    26613    0.083    0.000    0.239    0.000 inputstream.py:205(char)
    35513    0.155    0.000    0.155    0.000 {method 'pop' of 'list' objects}
     2076    0.015    0.000    0.152    0.000 tokenizer.py:494(attributeNameState)
      163    0.048    0.000    0.120    0.001 tokenizer.py:191(consumeEntity)
[...]
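A listing like the one above can be produced with cProfile and pstats. The sketch below is not html5lib_profile.py: it assumes a modern Python and profiles the standard library's `html.parser` as a stand-in, so it runs without html5lib installed, and both the `parse` helper and the sample document are hypothetical.

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser  # stand-in; the post profiled html5lib


def parse(content):
    """Hypothetical helper: run one document through the parser."""
    parser = HTMLParser()
    parser.feed(content)
    parser.close()


# Hypothetical sample input; the post used real downloaded content.
document = "<html><body>" + "<p class='x'>text &amp; more</p>" * 2000 + "</body></html>"

profiler = cProfile.Profile()
profiler.enable()
parse(document)
profiler.disable()

# Sort by cumulative time, as in the listing above, and show the top entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the entry points first, while the tottime column shows where the time is actually spent; in the listing above that is `charsUntil`.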

I think it’s just a great idea to have a reference implementation of an HTML5 parser in Python; there’s plenty of time to make it fast.

    Just another WordPress weblog said,

    August 26, 2007 @ 9:46 pm

    [...] Lawrence Oluyede’s Blog (Lawrence Oluyede): SGML Python parsers benchmark [...]

    karl dubost, W3C said,

    August 30, 2007 @ 12:27 pm

    You have missed one in your candidates. The HTML 5 Editor’s draft defines a parsing algorithm for HTML tag soup with a well-defined error recovery mechanism to create a DOM. Anne Van Kesteren and a few other people built an HTML 5 parser in Python.

    http://www.w3.org/html/wg/html5/ http://code.google.com/p/html5lib/

    jgraham said,

    August 30, 2007 @ 11:56 pm

    Which version of html5lib did you use? The version in SVN should be somewhat faster than the quite outdated 0.9 release, but not enough to make up the difference compared to BeautifulSoup. I have a few ideas for making html5lib faster but it’s basically limited by the need to process the input character so I don’t think it will ever be truly fast without significant rearchitecture or a port of the tokenizer stage to C.

    Lawrence said,

    August 31, 2007 @ 1:09 am

    I used the SVN version, sorry I forgot to mention that.

    I guess you’re right: maybe you can optimize the Python code a bit, but I guess the real improvement is to port the tokenizer to C or maybe Pyrex.

    ludo said,

    September 2, 2007 @ 11:21 pm

    I told you sgmlop kicks ass. :)

    A song for the lovers » html5lib is getting faster said,

    September 22, 2007 @ 5:08 pm

    [...] ran the benchmark again with the 1014 revision of html5lib and I noticed a major speedup (although miles far from the [...]
