I ran the benchmark again with revision 1014 of html5lib and noticed a major speedup in parsing (although it is still miles behind the other libraries).
The benchmark itself ran in roughly 262 seconds instead of the previous roughly 457. That is about 57% of the previous running time, so roughly 43% less time, or about 1.74 times as fast.
These are the numbers from 30 August:
html5lib.HTMLParser, only HTML - time: 209.537359953, errors: 6019
html5lib.HTMLParser, only XML - time: 247.377570152, errors: 15566
Total: 456.914930
These come from today:
html5lib.HTMLParser, only HTML - time: 97.3409409523, errors: 6019
html5lib.HTMLParser, only XML - time: 164.554941893, errors: 15566
Total: 261.895883
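
For anyone who wants to check the comparison, here is a small sketch that works out the ratios from the two totals. The variable names are mine and only for illustration; the two numbers are the totals reported in the runs.

```python
# Totals in seconds, copied from the two benchmark runs.
old_total = 456.914930  # run from 30 August
new_total = 261.895883  # run from today

ratio = new_total / old_total    # fraction of the old running time still needed
time_saved = 1 - ratio           # fraction of the running time that went away
speedup = old_total / new_total  # how many times as fast the new run is

print(f"new run takes {ratio:.1%} of the old time")
print(f"running time reduced by {time_saved:.1%}")
print(f"speedup factor: {speedup:.2f}x")
```

Note that "takes 57% of the time" and "57% faster" are different statements, which is why the sketch prints the ratio, the saving and the speedup factor separately.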
This is the profiler output:
582972 function calls (579727 primitive calls) in 2.196 CPU seconds
Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.196    2.196 {execfile}
        1    0.049    0.049    2.196    2.196 html5lib_profile.py:1(module)
        1    0.002    0.002    1.584    1.584 html5lib_profile.py:3(html5lib_parse)
        1    0.000    0.000    1.580    1.580 html5parser.py:130(parse)
        1    0.053    0.053    1.580    1.580 html5parser.py:76(_parse)
     6832    0.071    0.000    1.173    0.000 tokenizer.py:88(__iter__)
     8216    0.572    0.000    0.686    0.000 inputstream.py:249(charsUntil)
        1    0.038    0.038    0.512    0.512 __init__.py:13(module)
     6763    0.061    0.000    0.477    0.000 tokenizer.py:298(dataState)
        1    0.029    0.029    0.439    0.439 html5parser.py:7(module)
        1    0.088    0.088    0.312    0.312 treebuilders/simpletree.py:1(module)
     2059    0.013    0.000    0.294    0.000 tokenizer.py:585(attributeValueDoubleQuotedState)
        1    0.039    0.039    0.196    0.196 saxutils.py:4(module)
        1    0.090    0.090    0.156    0.156 urllib.py:23(module)
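
A listing like the one above, ordered by cumulative time, can be produced with the standard cProfile and pstats modules. The sketch below is not the html5lib_profile.py script used for these numbers. The helper profile_call and the workload stand_in_parse are hypothetical names I made up, and the workload is a trivial placeholder so that the example runs without html5lib installed.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, limit=15):
    """Run func(*args) under cProfile and return a text report
    sorted by cumulative time, limited to `limit` rows."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


def stand_in_parse(markup):
    # Hypothetical placeholder workload, NOT a real parser.
    # With html5lib installed, one could instead profile something like
    # html5lib.HTMLParser().parse(markup).
    return [chunk.split(">") for chunk in markup.split("<")]


report = profile_call(stand_in_parse, "<p>hello</p>" * 1000)
print(report)
```

Sorting by cumulative time puts the outermost calls first, which makes it easy to follow the time downwards from the top-level script into the functions that actually consume it.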

