SGML is an ISO standard for defining markup languages. Two of its derivatives are HTML and XML (XHTML is itself an application of XML).
Python’s standard library provides several modules to parse HTML and XML. For today’s benchmark I decided to consider only the ones oriented to HTML (and SGML, since it’s a kind of superset), cheating a bit and letting them handle XML as well. I know it’s not really fair but I’m a free man, so let’s go on.
To do the benchmark I obviously needed some data so I wrote a script to extract the HTML URLs (homepages) and XML URLs (feeds) from the OPML exported by my feed reader.
The script is very simple and straightforward: it loads the OPML file with cElementTree, scans the outlines with XPath-like syntax and, for each outline, downloads the related URLs, storing the content in two separate directories.
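The extraction step could look roughly like the sketch below. It is not the original script: it uses `xml.etree.ElementTree` (the module that took over from cElementTree in later Python versions), it assumes the feed reader stores the homepage and feed addresses in the usual `htmlUrl` and `xmlUrl` attributes, and the sample OPML and the `extract_urls` name are made up for illustration. The download step is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the OPML exported by a feed reader.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="Example blog" type="rss"
             htmlUrl="http://example.com/"
             xmlUrl="http://example.com/feed.xml"/>
    <outline text="Folder">
      <outline text="Another blog" type="rss"
               htmlUrl="http://example.org/"
               xmlUrl="http://example.org/atom.xml"/>
    </outline>
  </body>
</opml>
"""


def extract_urls(opml_text):
    """Return (html_urls, xml_urls) collected from the outline elements."""
    root = ET.fromstring(opml_text)
    html_urls, xml_urls = [], []
    # ".//outline" is the XPath-like part: every outline at any depth.
    for outline in root.findall(".//outline"):
        html_url = outline.get("htmlUrl")  # assumed attribute name
        xml_url = outline.get("xmlUrl")    # assumed attribute name
        if html_url:
            html_urls.append(html_url)
        if xml_url:
            xml_urls.append(xml_url)
    return html_urls, xml_urls


html_urls, xml_urls = extract_urls(SAMPLE_OPML)
print(html_urls)
print(xml_urls)
```

Folder outlines without URLs are simply skipped, so nested subscriptions end up in the same flat lists.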
Let me say that cElementTree is fast, really really fast!
Now I have 454 files full of content: 227 HTML pages and 227 XML feeds. Roughly 22 megabytes of content to feed to the parsers. Let’s presume a lot of that content isn’t even valid according to the specific standards (HTML, XML, Atom, RSS). Anyway, I don’t have to validate the content so I just don’t care.
The modules considered for this benchmark are: HTMLParser, sgmllib, BeautifulSoup and sgmlop. I ignored the whole plethora of XML-only libraries because there are plenty of benchmarks out there, and I ignored htmllib because it sits on top of sgmllib, so there’s no point in benchmarking it too.
This bench does not do anything with the content, just feeds it to the parser. Any kind of error is collected in a list to let me know how many errors the parsers encounter.
The data is preloaded in memory, so the time spent opening the files and reading their content is not measured. Every parser has been fed with HTML and XML content separately.
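The measuring loop described above can be sketched like this. It is only an approximation of the idea, written for Python 3, where the old HTMLParser module lives in `html.parser` and sgmllib is no longer available. The `bench` function, the sample documents and the choice of creating a fresh parser per document are my assumptions, not the original script.

```python
import time
from html.parser import HTMLParser  # Python 3 home of the old HTMLParser module


def bench(parser_factory, documents):
    """Feed every preloaded document to a fresh parser.

    Returns (elapsed_seconds, errors), where errors is the list of
    exceptions raised while parsing.
    """
    errors = []
    start = time.perf_counter()
    for doc in documents:
        parser = parser_factory()
        try:
            parser.feed(doc)
            parser.close()
        except Exception as exc:  # any kind of error is just collected
            errors.append(exc)
    elapsed = time.perf_counter() - start
    return elapsed, errors


# Hypothetical in-memory data; the real benchmark preloads the downloaded files.
html_docs = ["<html><body><p>hello <b>world</b></p></body></html>"] * 100
xml_docs = ["<feed><entry><title>hi</title></entry></feed>"] * 100

html_time, html_errors = bench(HTMLParser, html_docs)
xml_time, xml_errors = bench(HTMLParser, xml_docs)
print("HTMLParser, only HTML - time: %s, errors: %d" % (html_time, len(html_errors)))
print("HTMLParser, only XML - time: %s, errors: %d" % (xml_time, len(xml_errors)))
print("Total: %f" % (html_time + xml_time))
```

Only the parsing sits between the two timer reads, which matches the idea of keeping file access out of the measurement.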
The machine used is a Core Duo 2GHz MacBook with 2 GB of RAM. Python’s version is 2.5.1.
These are my results:
rhymes@groove ~% python sgmlbench.py
HTMLParser, only HTML - time: 5.1607260704, errors: 37
HTMLParser, only XML - time: 3.56549191475, errors: 98
Total: 8.726218
sgmllib.SGMLParser, only HTML - time: 7.36616611481, errors: 1
sgmllib.SGMLParser, only XML - time: 4.22875499725, errors: 0
Total: 11.594921
BeautifulSoup, only HTML - time: 23.7593009472, errors: 5
BeautifulSoup, only XML - time: 10.1111578941, errors: 0
Total: 33.870459
sgmlop.SGMLParser, only HTML - time: 0.473984956741, errors: 0
sgmlop.SGMLParser, only XML - time: 0.443637132645, errors: 0
Total: 0.917622
Draw any conclusion you might like.
You can find the archive containing the whole data including the scripts online.
Update [2007/08/30]:
Karl Dubost pointed html5lib out in the comments so I decided to add it to the benchmark. These are the numbers obtained with the SVN version:
html5lib.HTMLParser, only HTML - time: 209.537359953, errors: 6019
html5lib.HTMLParser, only XML - time: 247.377570152, errors: 15566
Total: 456.914930
I also tried the different tree builders but the numbers don’t change significantly: they’re just used to build the resulting DOM and don’t affect the parsing process.
Since it seems tremendously slow I profiled the code with cProfile and realized that the tokenization process seems to be the slowest part:
482665 function calls (479420 primitive calls) in 2.282 CPU seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 2.282 2.282 {execfile}
1 0.004 0.004 2.282 2.282 html5lib_profile.py:1()
1 0.000 0.000 2.188 2.188 html5lib_profile.py:3(html5lib_parse)
1 0.000 0.000 2.187 2.187 html5parser.py:126(parse)
1 0.041 0.041 2.187 2.187 html5parser.py:72(_parse)
6832 0.062 0.000 1.877 0.000 tokenizer.py:88(__iter__)
8216 1.158 0.000 1.230 0.000 inputstream.py:244(charsUntil)
6763 0.060 0.000 0.689 0.000 tokenizer.py:298(dataState)
6064 0.026 0.000 0.377 0.000 tokenizer.py:454(tagNameState)
2059 0.013 0.000 0.332 0.000 tokenizer.py:585(attributeValueDoubleQuotedState)
26613 0.083 0.000 0.239 0.000 inputstream.py:205(char)
35513 0.155 0.000 0.155 0.000 {method 'pop' of 'list' objects}
2076 0.015 0.000 0.152 0.000 tokenizer.py:494(attributeNameState)
163 0.048 0.000 0.120 0.001 tokenizer.py:191(consumeEntity)
[...]
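A listing like the one above can be produced along these lines. This is a minimal sketch, not the html5lib_profile.py script from the listing: to stay self-contained it profiles the standard library’s `html.parser.HTMLParser` instead of html5lib, and `parse_document` and the sample data are hypothetical.

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser  # stand-in parser, not html5lib


def parse_document(data):
    """Hypothetical function under profile: feed one document to a parser."""
    parser = HTMLParser()
    parser.feed(data)
    parser.close()


# Made-up document, large enough to show up in the profile.
data = "<html><body>" + '<p class="x">text &amp; more</p>' * 1000 + "</body></html>"

profiler = cProfile.Profile()
profiler.enable()
parse_document(data)
profiler.disable()

# Sort by cumulative time, as in the listing above, and keep the top 10 rows.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the entry points first; the tottime column is the one to look at to find where the time is actually spent.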
I think it’s just a great idea to have a reference implementation of an HTML5 parser in Python; there’s plenty of time to make it fast.

