HTML web scraping using Python

The Social Science Research Institute (SSRI) at Duke offers a series of great workshops on research methodology and data analysis. Today I went to a session on web scraping using Python.

Python borrows syntax from both Matlab and R, and it handles large data sets (>1 GB) well. Python is also fast for web scraping and good at parsing strings and searching text. An easier route is to use a site's API, but that only works when the underlying website exposes structured data; otherwise you have to scrape the HTML directly.
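To see why an API is the easier route when one exists: the response is usually structured JSON that can be parsed directly, with no HTML or regex work. A minimal sketch using a made-up JSON payload (the field names here are hypothetical, not from any real API):

```python
import json

# A made-up API response: structured JSON instead of raw HTML
api_response = '{"listings": [{"price": 350000, "address": "123 Main St"}]}'

parsed = json.loads(api_response)  # parse the JSON into Python objects
prices = [listing["price"] for listing in parsed["listings"]]  # pull out one field
print(prices)
```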
There are three main steps to scraping data from a website:
1. Setup: import relevant packages and set directory.
2. Scraping:
1) Download the website you want to scrape
import re
import urllib.request  # setup: packages used throughout

def fetch_backend(url, to_print='no'):
    website_raw = urllib.request.urlopen(url)  # open the page
    website = website_raw.read().decode('utf-8')  # download the raw HTML as a string
    website_raw.close()  # make sure to run this, otherwise the connection stays open
    if to_print == 'yes':
        print(website[1:200])  # examine the first few lines, but by no means all of it
    return website

url_trulia = ""  # using Trulia as an example
page1 = fetch_backend(url_trulia, 'yes')
2) Extract relevant information
def clean_up(website):
    website = re.sub("[\"\']", "", website)  # delete the quotation marks
    website = re.sub('\\n', '', website)  # delete newline characters
    return website

page1 = clean_up(str(page1))  # only scrape information on the 1st page
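To see what those two substitutions do, here they are applied to a small made-up HTML fragment:

```python
import re

sample = '<div class="price">\n$350,000\n</div>'  # a made-up fragment
sample = re.sub("[\"\']", "", sample)  # delete the quotation marks
sample = re.sub('\\n', '', sample)  # delete newline characters
print(sample)  # -> <div class=price>$350,000</div>
```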
3) Retrieval of fields
3. Extract the data
fields = ["aaa", "bbb", "cccc"]  # insert the names of the variables

def fill_dict(list_of_regs, dictionary, fields, source):  # fill a dictionary with regex matches
    for index in range(0, len(fields)):
        dictionary[fields[index]] = re.findall(list_of_regs[index], source)
    return dictionary

data = fill_dict([price_reg, size_reg, address_reg, latitude_reg, longitude_reg], data_dict, fields, page1)  # run the function

data.keys() #print the keys of dictionary
data[“price”] #print the prices

#check to see that each entry in the dictionary is of the same length:
[len(data[x]) for x in data.keys()]
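The whole retrieval step can be exercised end to end on a toy source string; the regex and field name below are hypothetical stand-ins for price_reg and the real field list:

```python
import re

source = 'price:350000 ... price:425000'  # made-up cleaned page text
price_reg = r'price:(\d+)'  # hypothetical stand-in for the real price_reg

fields = ["price"]
data = {}
for index in range(0, len(fields)):
    data[fields[index]] = re.findall([price_reg][index], source)

print(data["price"])  # the captured groups, one per listing
print([len(data[x]) for x in data.keys()])  # every field should have the same length
```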

Programming is fun, and Python is powerful. I can see myself applying these techniques to collect data from a wider range of sources more effectively. Many thanks to GB for teaching the workshop.