HTML web scraping using Python

The Social Science Research Institute (SSRI) at Duke offers a series of great workshops on research methodology and data analysis. Today I went to a session on web scraping using Python.

Python's syntax borrows a lot from Matlab and R, so it will feel familiar to users of either. It handles large data sets (>1 GB) well, and it is fast at web scraping, parsing strings, and searching text. Pulling data through an API is easier than scraping HTML, but it is not an option when the underlying website does not expose structured XML.
There are three main steps to scraping data from a website:
1. Setup: import relevant packages and set directory.
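For this session the setup boils down to a few lines (the working directory below is a placeholder path, not the one used in the workshop):

import os
import re
import urllib.request

os.chdir("/path/to/your/project")  # set the working directory (placeholder)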
2. Scraping:
1) Download the website you want to scrape
def fetch_backend(url, to_print='no'):
    website_raw = urllib.request.urlopen(url)
    website = website_raw.read()
    website_raw.close()  # close the connection when done; otherwise it stays open
    if to_print == 'yes':
        print(website[1:200])  # examine the first few lines, but by no means all of it
    return website

url_trulia = "http://www.trulia.com/for_sale/27705_zip"  # using Trulia as an example
page1 = fetch_backend(url_trulia, 'yes')
2) Extract relevant information
def clean_up(website):
    website = re.sub("[\"\']", "", website)  # delete the quotation marks
    website = website.lower()
    website = re.sub(r'\\n', '', website)  # delete the literal \n sequences left over by str()
    return website

page1 = clean_up(str(page1))  # only scrape information on the 1st page
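To see what clean_up does, here is its effect on a made-up snippet (not actual Trulia markup):

sample = str(b'<div class="price">\n$350,000</div>')
print(clean_up(sample))  # prints: b<div class=price>$350,000</div>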
3) Retrieve the fields
data_dict = {}  # empty dictionary to hold one list of matches per field
3. Extract the data
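The regular expressions themselves depend on the page's HTML and are not recorded in my notes; you write them by inspecting the page source. As a purely hypothetical illustration of the style:

price_reg = r'\$([\d,]+)'  # hypothetical pattern: capture a dollar amount such as $350,000
latitude_reg = r'latitude:([-\d.]+)'  # hypothetical pattern: capture a latitude value
# size_reg, address_reg, and longitude_reg would be written the same way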
fields = ["price", "size", "address", "latitude", "longitude"]  # names of the variables, in the same order as the regular expressions

def fill_dict(list_of_regs, dictionary, fields, source):  # fill the dictionary with one list of matches per field
    for index in range(0, len(fields)):
        dictionary[fields[index]] = re.findall(list_of_regs[index], source)
    return dictionary

data = fill_dict([price_reg, size_reg, address_reg, latitude_reg, longitude_reg], data_dict, fields, page1)  # run the function

data.keys()  # print the keys of the dictionary
data["price"]  # print the prices
data["latitude"]

#check to see that each entry in the dictionary is of the same length:
[len(data[x]) for x in data.keys()]
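
If the lengths all agree, the parallel lists can be combined into one record per listing. A minimal sketch, assuming the field names defined above:

rows = list(zip(*(data[f] for f in fields)))  # one tuple per listing, fields in order
rows[0]  # the first listing's (price, size, address, latitude, longitude)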

Programming is fun, and Python is powerful. I can see myself applying these techniques to collect data from a wider range of sources more effectively. Many thanks to GB for teaching the workshop.

Matlab learning log (1)

Programming is fun and rewarding. It is similar to writing in that you reach clarity and elegance through constant revision. As a beginner in Matlab, I'm starting a learning log to record my thoughts along the way. This time, my thoughts come from homework problems for my Demand Estimation class and a dynamic programming problem in my RA work.

1. Be clear about the steps you need to take before you write down any code. If you have a model to guide your analysis, make your code as consistent with the model as possible. Sometimes a tree structure or flow chart can help you think more clearly. Once you start writing the code, it is very easy to get lost in the details (e.g. vector dimensions).

2. In a loop, have a clear idea of the relationship between variables and when and where each variable needs to be defined. For example, empty vectors/matrices to store estimates should be defined before the estimates are produced, as in the sketch after point 3.

3. Use vectors where possible to make calculations more efficient. For someone like me who is spoiled by straightforward "programming" in Stata, this is something I need to learn.
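
To make points 2 and 3 concrete, here is a toy Matlab sketch (the computation is made up; the structure is the point):

n = 1000;
estimates = zeros(n, 1);       % point 2: define the storage vector before the loop
for i = 1:n
    estimates(i) = sqrt(i);    % element-by-element version
end
estimates_vec = sqrt((1:n)');  % point 3: the vectorized version in a single line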

Learning a new programming language is like getting to know a new friend. Over time you learn about her strengths and weaknesses, and how she can complement you to make your work more productive. Starting next week I will be taking a course on entry games taught by Professor Allan Collard-Wexler. Looking forward to learning more about programming and IO theory in the next seven weeks!