What I learned from teaching PhD-level “tool set” classes

This summer I had the pleasure of teaching two PhD-level modules: "Introduction to R" and "Programming and Project Management." Both classes are designed to equip rising second-year PhD students with the programming skills needed for independent research.

I also participated in the Teaching in Triangles program as part of my effort toward earning a Certificate of College Teaching. The program consists of pairwise peer observations with other PhD students who are teaching summer classes, and each of us receives extensive feedback on the content and style of our teaching. Here are my two main lessons:

  • It is challenging to strike a balance between pure lecturing and student discussion, especially in software classes. I designed the classes around short in-class assignments that let students check their understanding right after a new concept or procedure is introduced. But when I explain the recommended solutions, I tend to lecture on and on without leaving much time for students to ask questions.
  • Students are more engaged when they feel they can contribute to the class. In the last class, when I talked about programming and project management in collaborative projects, I asked students to brainstorm good practices in this setting and emphasized that they would need to share their insights with the rest of the class. My students were super engaged and raised good points that complemented my lecture.

My experience as an instructor also made me realize how much effort goes into course preparation. Bravo to the good teachers I have encountered over the years! Hopefully I will become a better teacher over time.


HTML web scraping using Python

The Social Science Research Institute (SSRI) at Duke offers a series of great workshops on research methodology and data analysis. Today I went to a session on web scraping using Python.

Python borrows little syntax from Matlab and R, but it aims to do what both of them can. It handles large data sets (>1 GB) well, is fast at web scraping, and is good at parsing strings and searching text. An easier method than scraping raw HTML is to use a website's API, but that option is unavailable when the underlying website does not expose structured XML (or JSON) paths.
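As a quick illustration of the API route (my own sketch, not from the workshop; the endpoint below is hypothetical), structured data arrives ready to use, with no HTML parsing:

import json
import urllib.request

url = "http://api.example.com/listings?zip=27705"  # hypothetical endpoint, for illustration only
with urllib.request.urlopen(url) as response:
    listings = json.load(response)  # the server returns structured JSON
print(listings)  # fields arrive already parsed; no regular expressions needed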
There are three main steps to scraping data from a website:
1. Setup: import relevant packages and set the working directory.
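A minimal setup for the snippets below (my reconstruction; my notes mixed Python 2's urllib2 with Python 3's urllib.request, so everything here is standardized on Python 3):

import os
import re
import urllib.request

os.chdir("/path/to/your/project")  # placeholder path; set your own working directory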
2. Scraping:
1) Download the website you want to scrape.

def fetch_backend(url, to_print="no"):
    website_raw = urllib.request.urlopen(url)  # fetch the page (urllib2.urlopen in Python 2)
    website = website_raw.read()
    website_raw.close()  # close the connection once the content has been read
    if to_print == "yes":
        print(website[:200])  # examine the first few bytes, but by all means not all of it
    return website

url_trulia = "http://www.trulia.com/for_sale/27705_zip"  # using Trulia as an example
page1 = fetch_backend(url_trulia, "yes")
2) Extract relevant information by cleaning up the raw HTML.

def clean_up(website):
    website = re.sub(r"[\\\"']", "", website)  # delete backslashes and quotation marks
    website = website.lower()                  # normalize the text to lower case
    website = re.sub(r"\n", "", website)       # remove newline characters
    return website

page1 = clean_up(page1.decode("utf-8", errors="ignore"))  # decode the raw bytes; only page 1 is scraped here
3. Extraction: retrieve the fields of interest into a dictionary.

data_dict = {}
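My notes do not preserve the workshop's actual regular expressions, so the patterns below are hypothetical stand-ins, included only so the snippet runs end to end; real Trulia markup would need patterns matched to its HTML:

price_reg = r"price:(\d+)"                 # hypothetical pattern, not the workshop's
size_reg = r"sqft:(\d+)"                   # hypothetical
address_reg = r"address:([\w .]+)"         # hypothetical
latitude_reg = r"latitude:(-?\d+\.\d+)"    # hypothetical
longitude_reg = r"longitude:(-?\d+\.\d+)"  # hypothetical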
fields = ["price", "size", "address", "latitude", "longitude"]  # names of the variables

def fill_dict(list_of_regs, dictionary, fields, source):  # fill the dictionary, one regex per field
    for index in range(0, len(fields)):
        dictionary[fields[index]] = re.findall(list_of_regs[index], source)
    return dictionary

data = fill_dict([price_reg, size_reg, address_reg, latitude_reg, longitude_reg],
                 data_dict, fields, page1)  # run the function

print(data.keys())      # print the keys of the dictionary
print(data["price"])    # print the prices
print(data["latitude"])

# check that each entry in the dictionary has the same length:
print([len(data[x]) for x in data.keys()])

Programming is fun, and Python is powerful. I can see myself applying these techniques to collect data from a wider range of sources more effectively. Many thanks to GB for teaching the workshop.

Matlab learning log (1)

Programming is fun and rewarding. It is similar to writing in that you reach clarity and elegance through constant revision. As a beginner in Matlab, I'm starting a learning log to record my thoughts along the way. This time, my thoughts come from homework problems for my Demand Estimation class and a dynamic programming problem in my RA work.

1. Be clear about the steps you need to take before you write any code. If a model guides your analysis, keep your code as consistent with the model as possible. Sometimes a tree structure or flow chart helps you think more clearly. Once you start writing code, it is very easy to get lost in the details (e.g., vector dimensions).

2. In a loop, have a clear idea of the relationships between variables and of when and where each variable needs to be defined. For example, empty vectors/matrices that will store estimates should be defined (e.g., with zeros(n, 1)) before the loop that produces the estimates.

3. Use vectorized operations where possible to make calculations more efficient (e.g., computing X*beta in one step rather than looping over observations). For someone like me who is spoiled by straightforward "programming" in Stata, this is something I need to learn.

Learning a new programming language is like getting to know a new friend. Over time you learn about her strengths and weaknesses, and how she can complement you to make your work more productive. Starting next week I will be taking a course on entry games taught by Professor Allan Collard-Wexler. Looking forward to learning more about programming and IO theory in the next seven weeks!