Python borrows syntax from both Matlab and R and can emulate much of what they do. It handles large data sets (>1 GB) well, is fast for web scraping, and is good at parsing strings and searching text. An easier route than scraping is to use a website's API, but that cannot be done when the underlying website does not expose structured XML paths.
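As a quick illustration of the string-searching point, here is a minimal sketch using the standard-library re module (the HTML snippet and the price pattern are invented for the example, not taken from any real site):

```python
import re

# Hypothetical snippet of listing HTML (not real markup from any site)
html = '<span class="price">$350,000</span><span class="price">$425,500</span>'

# Pull out every dollar amount with a single regular expression
prices = re.findall(r'\$[\d,]+', html)
print(prices)  # ['$350,000', '$425,500']
```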
There are four main steps to scraping data from a website:
1. Setup: import the relevant packages and set the working directory.
import urllib.request, re  # urllib for downloading pages, re for the regular expressions below
2. Download the website you want to scrape.
def fetch_backend(url, to_print='no'):
    website_raw = urllib.request.urlopen(url)  # open a connection to the page
    website = website_raw.read().decode('utf-8')
    website_raw.close()  # make sure to run this, otherwise connections are left open
    if to_print == 'yes':
        print(website[1:200])  # examine the first few characters but, by all means, not all of it
    return website

url_trulia = "http://www.trulia.com/for_sale/27705_zip"  # using Trulia as an example
page1 = fetch_backend(url_trulia)  # download the 1st results page
3. Clean up the downloaded text and keep the relevant information.
def clean_up(website):
    website = re.sub("[\"\']", "", website)  # delete the quotation marks
    website = re.sub('\\n', '', website)     # delete the newline characters
    return website

page1 = clean_up(str(page1))  # only scrape information on the 1st page
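Applied to a toy string, the two substitutions behave like this (a self-contained sketch; the sample markup is invented):

```python
import re

def clean_up(website):
    # same two substitutions as the clean-up step above
    website = re.sub("[\"\']", "", website)  # strip quotation marks
    website = re.sub('\\n', '', website)     # strip newline characters
    return website

sample = '<div class="card">\n  "3 bed, 2 bath"\n</div>'
print(clean_up(sample))  # <div class=card>  3 bed, 2 bath</div>
```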
4. Extract the data fields.
fields = ["aaa", "bbb", "cccc"]  # insert the names of the variables
data_dict = {}  # empty dictionary to hold one list of values per field

def fill_dict(list_of_regs, dictionary, fields, source):  # create a dictionary
    for index in range(0, len(fields)):
        dictionary[fields[index]] = re.findall(list_of_regs[index], source)  # one regex per field
    return dictionary

data = fill_dict([price_reg, size_reg, address_reg, latitude_reg, longitude_reg], data_dict, fields, page1)  # run the function
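The actual patterns behind price_reg, size_reg, and the rest depend on the site's markup and are not shown in these notes. Here is a self-contained sketch of the same fill_dict pattern with made-up regexes and a made-up source string:

```python
import re

def fill_dict(list_of_regs, dictionary, fields, source):
    # same pattern as fill_dict above: one findall per field
    for index in range(0, len(fields)):
        dictionary[fields[index]] = re.findall(list_of_regs[index], source)
    return dictionary

# Toy stand-ins for price_reg etc. -- the real patterns depend on the site's HTML
source = 'price:$300,000;size:1500 sqft;price:$450,000;size:2000 sqft'
price_reg = r'price:(\$[\d,]+)'
size_reg = r'size:(\d+) sqft'

data = fill_dict([price_reg, size_reg], {}, ['price', 'size'], source)
print(data)  # {'price': ['$300,000', '$450,000'], 'size': ['1500', '2000']}
```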
data.keys() #print the keys of dictionary
data["price"] #print the prices
#check to see that each entry in the dictionary is of the same length:
[len(data[x]) for x in data.keys()]
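The point of that check: if one regex missed an entry on the page, its list comes up short and the rows no longer line up. A toy illustration (the dictionary contents are made up):

```python
data = {'price': ['$300,000', '$450,000'],
        'size': ['1500', '2000'],
        'address': ['12 Main St']}  # one address failed to match

lengths = [len(data[x]) for x in data.keys()]
print(lengths)  # [2, 2, 1] -- the short list flags a scraping problem
```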
Programming is fun, and Python is powerful. I can see myself applying these techniques to collect data from a wider range of sources more effectively. Many thanks to GB for teaching the workshop.