I wrote a simple script (for the Italian guys) to scrape the content of an Italian TV guide website, all in fewer than 40 lines.
To me, 40 lines means Python and its tools are wonderful!
The script is very simple:
- Builds the URI from today's date with the datetime module.
- Retrieves the page with the urllib module.
- Feeds the page to the BeautifulSoup HTML parser.
- Fetches the HTML of the channels' programs with the parser’s fetch() method.
- Fetches the names of the channels with fetch() and a one-line list comprehension combined with string methods.
- Loops through the channels and:
  - Fetches the channel's content, stripping away the remaining HTML with htmlstripper.py.
  - Couples the on-air times with the names of the programs with the zip() function.
  - Adds the name of the channel.
  - Appends the channel to the guide.
That’s it. Stunning!
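Two of the steps above, building the URI and cleaning up the channel names, can be tried on their own without touching the network. This is only a minimal sketch: the helper names `build_url` and `channel_name`, the fixed date, and the sample image path `img/raiuno.jpg` are assumptions made up for illustration, not taken from the site.

```python
from datetime import date

BASE_URL = "http://tv.lospettacolo.it/seratatv.asp?data="


def build_url(day):
    """Return the guide URL for a given date, formatted as day/month/year."""
    return "%s%s" % (BASE_URL, day.strftime("%d/%m/%Y"))


def channel_name(src):
    """Derive a channel name from a logo path (hypothetical example path)."""
    return src.replace("img/", "").replace(".jpg", "").title()


# An arbitrary fixed date keeps the result reproducible.
print(build_url(date(2007, 3, 5)))
print(channel_name("img/raiuno.jpg"))
```

Passing the date in as an argument, instead of reading it once at import time, makes the function easy to check for any day.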
Here’s the complete script:
[code lang="python"]
#!/usr/bin/env python
from datetime import date
import urllib

from BeautifulSoup import BeautifulSoup
import htmlstripper

today = date.today().strftime("%d/%m/%Y")


def parse():
    # construct the URL properly
    BASE_URL = "http://tv.lospettacolo.it/seratatv.asp?data="
    url = "%s%s" % (BASE_URL, today)
    # retrieve the page
    request = urllib.urlopen(url)
    body = request.read()
    request.close()
    # feed the html to the parser
    soup = BeautifulSoup(body)
    # get the HTML for the 9 channels of the page
    channels = soup.fetch("td", {"bgcolor": "#F2F2F2"})
    # get the name of each channel
    names = soup.fetch("img", {"border": "0"})
    names = names[2:]  # throw away the first two unneeded images
    names = [name["src"].replace("img/", "").replace(".jpg", "").title() for name in names]
    # loop through the channels and build the data of the tvguide
    tvguide = []
    for index, channel in enumerate(channels):
        # the <b> tags alternate between on-air time and program name
        channel = channel.fetch("b")
        channel = [htmlstripper.stripHTML(str(prg), "utf-8") for prg in channel]
        hours = [hr for i, hr in enumerate(channel) if i % 2 == 0]
        prgs = [pg for i, pg in enumerate(channel) if i % 2 != 0]
        channel = zip(hours, prgs)
        channel.insert(0, names[index])
        tvguide.append(channel)
    return tvguide


if __name__ == "__main__":
    print parse()
[/code]
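The pairing step inside the loop can also be sketched in isolation. The version below is an alternative, not the script's own code: it uses extended slices instead of the two enumerate() filters, and wraps zip() in list() so that it also works on Python 3, where zip() returns an iterator that has no insert() method. The function name `pair_programs` and the sample data are hypothetical.

```python
def pair_programs(name, cells):
    """Build one guide entry: the channel name, then (time, title) pairs.

    `cells` is assumed to alternate on-air time and program title,
    as the <b> tags do in the script.
    """
    hours = cells[0::2]   # items at even positions: on-air times
    titles = cells[1::2]  # items at odd positions: program names
    return [name] + list(zip(hours, titles))


# Made-up sample data, only to show the shape of the result.
sample = ["20:30", "Film", "22:45", "News"]
print(pair_programs("Raiuno", sample))
```

Each entry keeps the same shape the script produces: the channel name first, followed by one tuple per program.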
I write Python code every day, but I’m still impressed by its power!

