I’m writing a scraper for a malformed HTML page (the scraper uses the beautiful BeautifulSoup tool). At the end of the process I need to strip all the HTML away from a string, so I remembered the gorgeous HTML Sanitizer written by Mark Pilgrim for his feedparser.
Two minutes of customization later, I have htmlstripper.py.
It strips away all the HTML (you can customize it to keep some tags or attributes if you want…).
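
To give an idea of the approach, here is a minimal sketch of a tag stripper. It is not the actual htmlstripper.py and it is not built on feedparser's sanitizer: it assumes Python 3 and uses the standard library's `html.parser` instead. The names `HTMLStripper`, `strip_html` and `keep_tags` are made up for this illustration.

```python
from html.parser import HTMLParser


class HTMLStripper(HTMLParser):
    """Collect text content and drop every tag not listed in keep_tags.

    Illustrative sketch only; the class name and the keep_tags
    parameter are hypothetical, not part of any library.
    """

    def __init__(self, keep_tags=()):
        # convert_charrefs=True turns entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Kept tags are re-emitted bare; all attributes are discarded.
        if tag in self.keep_tags:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text between tags is always kept.
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_html(markup, keep_tags=()):
    """Return markup with all tags removed except those in keep_tags."""
    stripper = HTMLStripper(keep_tags)
    stripper.feed(markup)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


print(strip_html("<p>Hello <b>world</b>!</p>"))
print(strip_html("<p>Hello <b>world</b>!</p>", keep_tags=("b",)))
```

Note that this sketch also keeps the text inside `<script>` and `<style>` elements, so a real stripper would probably want to skip those.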

