In an earlier post, I demonstrated how to extract links or anchor tags from HTML documents in Python. The intent here is similar, but this time we are going to pull out email addresses (and definitely not send them spam, please don’t). Python is awesome with regular expressions, which is one of the prime reasons it is so popular for crawling and scraping jobs.
Regular expression for emails
This is definitely not the definitive pattern for email addresses, and there may be cases it does not cover. However, it has served my needs well quite a few times, and I have yet to come across an email address that it fails to match. If you find one, I would be glad to hear about it.
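To see what the pattern does and does not pick up, here is a minimal sketch that runs it over a small string instead of a live page. It assumes the pattern is the one used in the full code in this post, written with its backslash escapes. The sample text and the `extract_emails` helper are made up for illustration.

```python
import re

# The pattern from this post, assuming these backslash escapes:
# local part, "@", then a domain name followed by a dot and one more label.
EMAIL_PATTERN = re.compile(r"[\w\-\.+]+@\w[\w\-]+\.+[\w\-]+")

# Hypothetical sample markup, invented for this example.
sample_html = """
<p>Contact <a href="mailto:jane.doe@example.com">Jane</a> or
write to support+help@my-site.org. Not an address: foo@bar</p>
"""


def extract_emails(text):
    """Return the unique matches in text, in order of first appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


found = extract_emails(sample_html)
print(found)
```

On this sample the two real addresses are returned and `foo@bar` is skipped because it has no dot in the domain. One case this sketch does not fully cover is a domain with more than two labels: `user@mail.example.co.uk` comes back as `user@mail.example`, since the pattern stops after the first dot-separated label.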
Here’s the full working code for extracting email addresses from an HTML document.
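import re
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

# Download the page and decode the raw bytes into text
htmlFile = urlopen("http://www.somewhereontheinternet.com")
html = htmlFile.read().decode("utf-8", errors="replace")
# Local part, "@", then a domain that contains at least one dot
regexp_email = r'''([\w\-\.+]+@\w[\w\-]+\.+[\w\-]+)'''
pattern = re.compile(regexp_email)
emailAddresses = re.findall(pattern, html)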
#print all matches