Beautiful Soup
The HTMLParser classes in Python's HTMLParser and htmllib modules are very flexible: you supply your own implementations for handling start tags, end tags and data elements. But they have limitations. For example, if I wanted to preserve the input formatting of the HTML but change just a few tags, it would be hard to do.
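To make the callback style concrete, here is a minimal sketch of a handler-based parser. It assumes Python 3, where the module is named html.parser; the LinkCollector class and the sample HTML are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name; in Python 2 the module was HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical handler that records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


parser = LinkCollector()
parser.feed('<p>See <a href="/world">World</a> and <a href="http://example.com/">this</a>.</p>')
print(parser.links)
```

The callbacks only see a stream of events. To write the document back out with a few tags changed, you would have to rebuild every tag and text node yourself, which is where the formatting tends to get lost.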
I found a better solution.
Beautiful Soup is a good HTML parser, going by my initial impressions. I tried many public sites, such as cnn.com and news.com, and was able to parse each of them into a tree and access the elements very easily.
It provides very easy functions that search the entire tree and return references to the matching elements.
Let's say you want to get all hyperlink (a) tags. The code would be as simple as this:
from BeautifulSoup import BeautifulSoup
import urllib2

# Fetch the page and parse it into a tree
data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())

# findAll returns every matching tag in the document
resultset = soup.findAll("a")
for link in resultset:
    print link
Now say you want to make all the links absolute instead of relative. A simple function that takes the result set would do the trick:
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2

def relativetoabsolute(resultset, tag, url):
    # Rewrite the given attribute of each tag as an absolute URL
    for element in resultset:
        try:
            link = str(element[tag])
        except (KeyError, UnicodeError):
            # Skip tags without the attribute or with a non-ASCII value
            continue
        if not link.lower().startswith("http"):
            element[tag] = urljoin(url, link)

data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())
resultset = soup.findAll("a")
relativetoabsolute(resultset, 'href', 'http://www.cnn.com')
print soup
The output HTML would have all absolute URLs while preserving the input formatting. The cool thing is that you could prettify the output with just:
print soup.prettify()
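The link rewriting above depends on how urljoin resolves a link against a base URL. The following is a small sketch of that behavior, assuming Python 3, where urljoin lives in urllib.parse rather than urlparse; the base URL and the sample links are made up for illustration.

```python
from urllib.parse import urljoin  # Python 3 location; Python 2 had it in urlparse

# Hypothetical URL of the page the links were found on
base = "http://www.cnn.com/world/index.html"

# A link starting with "/" is resolved against the site root
root_relative = urljoin(base, "/sport/")

# A bare file name is resolved against the directory of the base page
page_relative = urljoin(base, "europe.html")

# A link that is already absolute comes back unchanged
already_absolute = urljoin(base, "http://example.com/a")

print(root_relative)
print(page_relative)
print(already_absolute)
```

Since urljoin leaves absolute links alone, the startswith("http") check in the function is arguably a shortcut rather than a requirement. It also shows why passing the real page URL as the base matters when a page sits in a subdirectory.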