Archive for August, 2007

SGML Python parsers benchmark

SGML is an ISO standard for defining markup languages. Two of its derivatives are HTML and XML (XHTML is itself an application of XML).

Python’s standard library provides a lot of modules to parse HTML and XML. In today’s benchmark I decided to consider only the parsers oriented to HTML (and SGML, since it’s a kind of superset), cheating a bit by letting them handle XML as well. I know it’s not really fair, but I’m a free man, so let’s go on.

To do the benchmark I obviously needed some data so I wrote a script to extract the HTML URLs (homepages) and XML URLs (feeds) from the OPML exported by my feed reader.

The script is very simple and straightforward: it loads the OPML file in cElementTree, scans the outlines with XPath-like syntax and, for each outline, downloads the related URLs, storing the content in two separate directories.
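The extraction step described above can be sketched roughly as follows. This is not the original script: it uses Python 3’s `xml.etree.ElementTree` (the successor of cElementTree), the sample OPML and the `extract_urls` helper are made up for illustration, and it assumes the feed reader stores the URLs in the conventional `htmlUrl` and `xmlUrl` attributes. The download step is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML export, inlined so the example is self-contained.
SAMPLE_OPML = """<?xml version="1.0"?>
<opml version="1.1">
  <body>
    <outline text="Blogs">
      <outline text="Example" htmlUrl="http://example.com/" xmlUrl="http://example.com/feed"/>
      <outline text="Other" htmlUrl="http://example.org/" xmlUrl="http://example.org/atom.xml"/>
    </outline>
  </body>
</opml>"""


def extract_urls(opml_text):
    """Return (html_urls, xml_urls) found in the outlines of an OPML document."""
    root = ET.fromstring(opml_text)
    html_urls, xml_urls = [], []
    # ".//outline" is the XPath-like expression: every outline at any depth.
    for outline in root.findall(".//outline"):
        html_url = outline.get("htmlUrl")
        xml_url = outline.get("xmlUrl")
        # Folder outlines carry neither attribute and are skipped.
        if html_url:
            html_urls.append(html_url)
        if xml_url:
            xml_urls.append(xml_url)
    return html_urls, xml_urls


html_urls, xml_urls = extract_urls(SAMPLE_OPML)
print(html_urls)
print(xml_urls)
```

Each list could then be downloaded into its own directory, one for homepages and one for feeds.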

Let me say that cElementTree is fast, really really fast!

Now I have 454 files full of content: 227 HTML pages and 227 XML feeds, roughly 22 megabytes of content to feed to the parsers. Let’s presume a lot of that content isn’t even valid according to the specific standards (HTML, XML, Atom, RSS). Anyway, I don’t have to validate the content, so I just don’t care.

The modules considered for this benchmark are: HTMLParser, sgmllib, BeautifulSoup and sgmlop. I ignored the whole plethora of XML-only libraries because there are plenty of benchmarks out there, and I ignored htmllib because it sits on top of sgmllib, so there’s no point in benchmarking it too.

This bench does not do anything with the content, just feeds it to the parser. Any kind of error is collected in a list to let me know how many errors the parsers encounter.

The data is preloaded in memory, so the time needed to open the files and read their content is not measured. Every parser has been fed the HTML and the XML content separately.
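A minimal sketch of a harness along these lines, assuming Python 3 and the standard library parser only (the original sgmlbench.py is not reproduced here, and the `bench` function and the sample documents are hypothetical):

```python
import time
from html.parser import HTMLParser  # named HTMLParser.HTMLParser in Python 2


def bench(parser_factory, documents):
    """Feed every preloaded document to a fresh parser.

    Returns (elapsed_seconds, errors); errors collects the exceptions
    raised while parsing, so they can be counted afterwards.
    """
    errors = []
    start = time.perf_counter()
    for doc in documents:
        parser = parser_factory()
        try:
            parser.feed(doc)
            parser.close()
        except Exception as exc:
            errors.append(exc)
    elapsed = time.perf_counter() - start
    return elapsed, errors


# The documents are already in memory, so no file I/O is timed.
documents = ["<html><body><p>hello</p></body></html>", "<p>unclosed <b>tags"]
elapsed, errors = bench(HTMLParser, documents)
print("HTMLParser, only HTML - time: %s, errors: %d" % (elapsed, len(errors)))
```

Any other parser exposing `feed()` and `close()` can be passed in as `parser_factory`, which keeps the timing loop identical across the parsers being compared.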

The machine used is a Core Duo 2GHz MacBook with 2 GB of RAM. Python’s version is 2.5.1.

These are my results:

rhymes@groove ~% python sgmlbench.py
HTMLParser, only HTML - time: 5.1607260704, errors: 37
HTMLParser, only XML - time: 3.56549191475, errors: 98
Total: 8.726218

sgmllib.SGMLParser, only HTML - time: 7.36616611481, errors: 1
sgmllib.SGMLParser, only XML - time: 4.22875499725, errors: 0
Total: 11.594921

BeautifulSoup, only HTML - time: 23.7593009472, errors: 5
BeautifulSoup, only XML - time: 10.1111578941, errors: 0
Total: 33.870459

sgmlop.SGMLParser, only HTML - time: 0.473984956741, errors: 0
sgmlop.SGMLParser, only XML - time: 0.443637132645, errors: 0
Total: 0.917622

Draw whatever conclusions you like.

You can find the archive containing the whole data including the scripts online.

Update [2007/08/30]:

Karl Dubost pointed html5lib out in the comments so I decided to add it to the benchmark. These are the numbers obtained with the SVN version:

html5lib.HTMLParser, only HTML - time: 209.537359953, errors: 6019
html5lib.HTMLParser, only XML - time: 247.377570152, errors: 15566
Total: 456.914930

I also tried the different tree builders, but the numbers don’t change significantly: they’re only used to build the resulting DOM and don’t affect the parsing process.

Since it seems tremendously slow, I profiled the code with cProfile and realized that the tokenization process seems to be the slowest part:

         482665 function calls (479420 primitive calls) in 2.282 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.282    2.282 {execfile}
        1    0.004    0.004    2.282    2.282 html5lib_profile.py:1(<module>)
        1    0.000    0.000    2.188    2.188 html5lib_profile.py:3(html5lib_parse)
        1    0.000    0.000    2.187    2.187 html5parser.py:126(parse)
        1    0.041    0.041    2.187    2.187 html5parser.py:72(_parse)
     6832    0.062    0.000    1.877    0.000 tokenizer.py:88(__iter__)
     8216    1.158    0.000    1.230    0.000 inputstream.py:244(charsUntil)
     6763    0.060    0.000    0.689    0.000 tokenizer.py:298(dataState)
     6064    0.026    0.000    0.377    0.000 tokenizer.py:454(tagNameState)
     2059    0.013    0.000    0.332    0.000 tokenizer.py:585(attributeValueDoubleQuotedState)
    26613    0.083    0.000    0.239    0.000 inputstream.py:205(char)
    35513    0.155    0.000    0.155    0.000 {method 'pop' of 'list' objects}
     2076    0.015    0.000    0.152    0.000 tokenizer.py:494(attributeNameState)
      163    0.048    0.000    0.120    0.001 tokenizer.py:191(consumeEntity)
[…]
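A listing like the one above, sorted by cumulative time, can be produced with cProfile and pstats. The sketch below is only an illustration: it profiles the standard library `HTMLParser` on a synthetic document instead of html5lib, and the `parse` function and the document are made up, so the numbers will not match the ones in this post.

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser


def parse(document):
    """Parse one document and discard the result."""
    parser = HTMLParser()
    parser.feed(document)
    parser.close()


# Synthetic input; a real run would load one of the downloaded pages.
document = "<html><body>" + "<p class='x'>text &amp; more</p>" * 1000 + "</body></html>"

profiler = cProfile.Profile()
profiler.enable()
parse(document)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 10 rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```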

I think it’s a great idea to have a reference implementation of an HTML5 parser in Python, and there’s plenty of time to make it fast.

Updates from Python SVN, Part 13

  • New codecs for UTF-32, UTF-32-LE and UTF-32-BE are in place.

  • EUC-KR codec now handles the cheot-ga-keut composed make-up hangul syllables.

  • BeOS is no longer supported (remember that AtheOS, Win9x and WinME will also no longer be supported in Python 2.6; see PEP 11 for the details).

  • uuid creation is now threadsafe.

Updates from Python SVN, Part 12

Python 2.6 documentation will be in the reST format. I, personally, am really happy that LaTeX doc is gone.

See, for example, the documentation of the sys module.

flickyou

I began writing this library months ago because I needed it in a project. At the time there were a bunch of libraries, all quite useless for us. One didn’t really work, another didn’t support authentication and another was incomplete, so I started this library, which in the end we didn’t use because Flickr ceased to be a requirement for the project.

The name is flickyou because I think Flickr has one of the worst APIs out there and their maniacal obsession with control is a little insane, so you can easily guess what lies behind the name :-)

After “throwing” a bunch of code in the wild I soon realized that some kind of design was emerging and although I still think it’s not perfect I tried to separate as much as possible the various aspects of the library. Let’s see them:

Main class: FlickrClient

This is the “entry point” of the whole library. Instantiate it and start calling the Flickr API and use the response.

This class requires only the two keys Flickr issues to its users.

By default the library uses the JSON response format so you’ll need simplejson as its only requirement.

If you take a look at its implementation you’ll notice that the actual machinery is done in a separate “backend” class.

The trick to activate that machinery is in the __getattr__ method: it first looks for a real implementation of the method (so you can easily extend the backend), then if nothing comes out it issues a request to the Flickr server directly.

This way I didn’t have to code an implementation of each and every method the API supports.
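The `__getattr__` trick can be illustrated with a small sketch. It is not flickyou’s actual code: `Client`, `EchoBackend` and the generic `call()` method are hypothetical names, and the backend only echoes the request instead of contacting Flickr.

```python
class EchoBackend:
    """Hypothetical backend: one hand-written method plus a generic call()."""

    def getFrob(self):
        return "frob-from-backend"

    def call(self, method_name, **params):
        # A real backend would issue the HTTP request here.
        return {"method": method_name, "params": params}


class Client:
    def __init__(self, backend):
        self._backend = backend

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        # 1. Prefer a real implementation provided by the backend.
        implemented = getattr(self._backend, name, None)
        if implemented is not None:
            return implemented

        # 2. Otherwise build a function that forwards the call generically.
        def remote_call(**params):
            return self._backend.call(name, **params)

        return remote_call


client = Client(EchoBackend())
print(client.getFrob())                      # handled by the backend method
print(client.photos_search(tags="python"))   # forwarded through call()
```

Any method name the backend does not define becomes a generic request, which is why no per-method wrapper has to be written by hand.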

Abstract backend: BaseBackend

This class is the specification for the machinery. Its documentation explains how it has to be extended and provides a partial implementation of the overall process (like the photo upload support which works outside the Flickr API).

Unless you want to extend the library with other request/response formats, or with something I can’t foresee, you don’t need to care about this class.

Default backend: JSONBackend

The Flickr API lets you choose among several response formats: JSON, XML (REST), XML (XML-RPC), XML (SOAP) and a custom PHP serialized format.

You can also call the API through HTTP (REST), XML-RPC and SOAP.

Not every combination makes sense: you can’t issue a request with XML-RPC and ask for JSON in response. OK, in theory you can, but in practice xmlrpclib will obviously complain, so, as I said, not every combination makes sense.

I decided to go with HTTP for the requests and JSON for the response format as the default.

JSONBackend is a very simple class: it creates the API signature, issues the HTTP POST request, reads the response, parses it with simplejson and checks for errors.

It provides an actual implementation for the three fundamental methods: checkToken, getFrob, getToken.

JSONBackend also stores the token in a cache on the file system.
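The signing and error-checking steps of JSONBackend might look roughly like this. It is a sketch, not the library’s code: it follows Flickr’s documented signing scheme (MD5 of the shared secret followed by the name/value pairs sorted by name), uses the standard `json` module in place of simplejson, and the `sign`, `parse_response` and `FlickrError` names, along with the `KEY` and `SECRET` values, are placeholders. No request is actually sent.

```python
import hashlib
import json


def sign(params, secret):
    """MD5 of the secret followed by every name and value, sorted by name."""
    payload = secret + "".join(key + str(params[key]) for key in sorted(params))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


class FlickrError(Exception):
    """Raised when the response reports a failure."""


def parse_response(body):
    """Decode a JSON response body and raise if its status is not 'ok'."""
    data = json.loads(body)
    if data.get("stat") != "ok":
        raise FlickrError("%s: %s" % (data.get("code"), data.get("message")))
    return data


# Placeholder credentials; Flickr issues the real key and secret.
params = {
    "method": "flickr.auth.getFrob",
    "api_key": "KEY",
    "format": "json",
    "nojsoncallback": "1",
}
params["api_sig"] = sign(params, "SECRET")
print(params["api_sig"])

print(parse_response('{"stat": "ok", "frob": {"_content": "abc"}}'))
```

The signed `params` dictionary is what would be sent as the body of the HTTP POST request.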

That’s basically all about the rationale of the library.

Where to find flickyou

Updates from Python SVN, Part 11