HTML Stripper

I’m writing a scraper for an ill-formed HTML page (it uses the beautiful BeautifulSoup tool). At the end of the process I need to strip all the HTML away from a string. So I remembered the gorgeous HTML Sanitizer written by Mark Pilgrim for his feedparser.

Two minutes of customization later, I had htmlstripper.py.

It strips away all the HTML (you can customize it to keep some tags or attributes if you want…)
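To give an idea of the approach, here is a minimal sketch of a stripper with an optional whitelist. It is not the htmlstripper.py from this post: it is written against the standard library's `html.parser` instead of the feedparser sanitizer, and the names `HTMLStripper`, `strip_html`, `keep_tags` and `keep_attrs` are made up for the example.

```python
from html import escape
from html.parser import HTMLParser


class HTMLStripper(HTMLParser):
    """Collect text, dropping every tag that is not listed in keep_tags."""

    # Tags whose content is code or styling, not text (assumed list).
    SKIP_CONTENT = ("script", "style")

    def __init__(self, keep_tags=(), keep_attrs=()):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)    # hypothetical whitelist of tags
        self.keep_attrs = set(keep_attrs)  # hypothetical whitelist of attributes
        self.pieces = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_CONTENT:
            self._skip += 1
        elif tag in self.keep_tags:
            # Re-emit the tag with only the whitelisted attributes.
            kept = "".join(
                ' %s="%s"' % (name, escape(value or "", quote=True))
                for name, value in attrs
                if name in self.keep_attrs
            )
            self.pieces.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in self.SKIP_CONTENT:
            if self._skip:
                self._skip -= 1
        elif tag in self.keep_tags:
            self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        if self._skip:
            return
        # When some tags are kept the result is still HTML, so re-escape text.
        self.pieces.append(escape(data, quote=False) if self.keep_tags else data)

    def get_text(self):
        return "".join(self.pieces)


def strip_html(html, keep_tags=(), keep_attrs=()):
    parser = HTMLStripper(keep_tags, keep_attrs)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(strip_html("<p>Hello <b>world</b> &amp; all</p>"))  # Hello world & all
```

Passing `keep_tags=["a"], keep_attrs=["href"]` would keep links but drop everything else, including other attributes on the link itself.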

    Alex said,

    February 15, 2006 @ 5:41 am

    All really very nice.

    I've just started with P and I'm having a blast (it's only a quarter to five in the morning :) )!

    Ahh, to have more time to do what you like best!

    A.

    Lawrence said,

    February 15, 2006 @ 11:28 am

    Come and have a look around it.comp.lang.python whenever you like :)

    Laura said,

    June 1, 2006 @ 10:23 pm

    The htmlstripper script that you posted is wonderful! In fact, it’s precisely what I’ve been looking for…I wonder if I may use it on my own website (cybermenology.com)? Its primary use would be to strip formatting HTML from newsposts, preparatory to making RSS feed entries out of them.

    Thanks! -Laura

    Lawrence said,

    June 1, 2006 @ 10:26 pm

    Laura: yes, for sure. It's open source, after all :-)

    Laura said,

    June 2, 2006 @ 2:30 pm

    Thank you! :D
