Beautiful Soup

Python's standard HTML parsers, in the HTMLParser and htmllib modules, are flexible: you supply your own handlers for start tags, end tags and data elements. But they have limitations. For example, if I wanted to preserve the input formatting of the HTML and change just a few tags, it would be hard to do.
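To make the handler style concrete, here is a minimal sketch. The class name LinkCollector and the sample HTML are made up for illustration, and the import is written to work on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Hypothetical handler that records the href of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


parser = LinkCollector()
parser.feed('<p>See <a href="/world/">World</a> and '
            '<a href="http://example.com/">this</a>.</p>')
parser.close()
print(parser.links)
```

The parser only reports events as it reads the input. It builds no tree, so writing the document back out with a few tags changed means re-emitting every tag and every piece of text yourself.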

I found a better solution.

Beautiful Soup looks like a good HTML parser, going by my initial impressions. I tried it on several public sites such as cnn.com and news.com, and it parsed each one into a tree whose elements were very easy to access.

It provides simple functions that search the entire tree and return references to the matching elements.

Let's say you want to get all the hyperlink (a) tags. The code is as simple as this:


from BeautifulSoup import BeautifulSoup
import urllib2

# Fetch the page and parse it into a tree
data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())

# findAll returns every matching tag in the document
resultset = soup.findAll("a")
for link in resultset:
    print link

Now say you want to make all the links absolute instead of relative. A simple function that takes the result set does the trick:


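from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2

def relativetoabsolute(resultset, tag, url):
    # Rewrite the given attribute of each tag as an absolute URL
    for element in resultset:
        try:
            link = element[tag]
        except KeyError:
            # This tag does not have the attribute, so skip it
            continue
        if not link.lower().startswith("http"):
            element[tag] = urljoin(url, link)

data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())

# The tags are modified in place, so the tree itself is updated
resultset = soup.findAll("a")
relativetoabsolute(resultset, 'href', 'http://www.cnn.com')
print soup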

The output HTML now has all absolute URLs, and the input formatting is preserved. The cool thing is that you can prettify the output with just:


print soup.prettify()
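
The link rewriting above depends on how urljoin resolves a link against a base URL. Here is a small sketch of that behavior. The base URL and paths are illustrative values only, and the import is written to work on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

base = "http://www.cnn.com/world/index.html"  # illustrative base URL

# A path starting with "/" replaces the whole path of the base
root_relative = urljoin(base, "/us/index.html")

# A bare relative path is resolved against the directory of the base
page_relative = urljoin(base, "europe/story.html")

# An absolute URL is returned unchanged
already_absolute = urljoin(base, "http://example.com/page")

print(root_relative)     # http://www.cnn.com/us/index.html
print(page_relative)     # http://www.cnn.com/world/europe/story.html
print(already_absolute)  # http://example.com/page
```

Since urljoin passes absolute URLs through untouched, the startswith("http") check in relativetoabsolute is a shortcut rather than a requirement. Note that the check would also skip a relative path that merely happens to begin with "http".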