HTML web scraping using Python

The Social Science Research Institute (SSRI) at Duke offers a series of great workshops on research methodology and data analysis. Today I went to a session on web scraping using Python.

Python borrows a lot of syntax from Matlab and R and attempts to emulate both of them. It is good for large data sets (>1GB). Python is also fast for web scraping and is good at parsing strings and searching text. An easier alternative to web scraping is to use an API, but this cannot be applied when the underlying website does not have structured XML paths.
There are three main steps to scraping data from a website:
1. Setup: import relevant packages and set directory.
2. Scraping:
1) Download the website you want to scrape
import re  # packages from the setup step
import urllib.request

def fetch_backend(url, to_print='no'):
    # open the page, read the raw HTML, and close the connection when done
    with urllib.request.urlopen(url) as website_raw:
        website = website_raw.read()
    if to_print == 'yes':
        print(website[:200])  # examine the first 200 bytes, but by all means not all of it
    return website

url_trulia = "http://www.trulia.com/for_sale/27705_zip"  # using Trulia as an example
page1 = fetch_backend(url_trulia, 'yes')
2) Extract relevant information
def clean_up(website):
    website = re.sub("[\"']", "", website)  # delete quotation marks
    website = website.lower()
    website = re.sub("\n", "", website)  # delete line breaks
    return website

# convert the downloaded bytes to text before cleaning; only scrape information on the 1st page
page1 = clean_up(page1.decode("utf-8", errors="replace"))
3) Retrieval of fields
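data_dict = {}
3. Extract the data
fields = ["price", "size", "address", "latitude", "longitude"]  # names of the variables, in the same order as the regular expressions
def fill_dict(list_of_regs, dictionary, fields, source):  # fill a dictionary with the matches for each field
    for field, reg in zip(fields, list_of_regs):
        dictionary[field] = re.findall(reg, source)
    return dictionary

# price_reg, size_reg, address_reg, latitude_reg and longitude_reg are the regular expressions for each field; they must be defined before this line
data = fill_dict([price_reg, size_reg, address_reg, latitude_reg, longitude_reg], data_dict, fields, page1)  # run the function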

print(data.keys())  # print the keys of the dictionary
print(data["price"])  # print the prices
print(data["latitude"])

#check to see that each entry in the dictionary is of the same length:
[len(data[x]) for x in data.keys()]
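
To make the extraction step concrete, here is a minimal sketch of fill_dict running on a made-up snippet. The sample markup and the two regular expressions are my own assumptions for illustration; they are not Trulia's real page structure or the patterns used in the workshop.

```python
import re

# Hypothetical snippet standing in for a cleaned-up page (not real Trulia markup).
sample_page = (
    "<div class=listing><span class=price>$250,000</span>"
    "<span class=size>1,800 sqft</span></div>"
    "<div class=listing><span class=price>$310,500</span>"
    "<span class=size>2,150 sqft</span></div>"
)

# Hypothetical patterns: each has one capture group, so re.findall
# returns only the captured value for every listing on the page.
price_reg = r"class=price>\$([\d,]+)<"
size_reg = r"class=size>([\d,]+) sqft<"


def fill_dict(list_of_regs, dictionary, fields, source):
    # Store, for each field name, the list of all matches of its pattern.
    for field, reg in zip(fields, list_of_regs):
        dictionary[field] = re.findall(reg, source)
    return dictionary


data = fill_dict([price_reg, size_reg], {}, ["price", "size"], sample_page)
print(data["price"])  # ['250,000', '310,500']
print(data["size"])   # ['1,800', '2,150']
print([len(data[x]) for x in data.keys()])  # [2, 2]
```

If the lengths in the last line differ across fields, at least one pattern is probably missing some listings or matching something it should not.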

Programming is fun, and Python is powerful. I can see myself applying these techniques to collect data from a wider range of sources more effectively. Many thanks to GB for teaching the workshop.

Philip Oreopoulos on behavioral economics for education

Professor Philip Oreopoulos from the University of Toronto is our ERID visitor this week. Today he gave a talk on applications of behavioral economics in the economics of education.

Individual rationality is one of the fundamental assumptions of traditional economic theory. This assumption allows economists to model decision-making as an optimization problem with a well-defined objective under a set of constraints.

But when individuals don’t think rationally, many modeling frameworks need to be revised. Professor Oreopoulos specifically mentioned the dual process theory in psychology and how it accounts for present bias. Interested readers can refer to this book by Daniel Kahneman.

A few signs of System 1 show up in the behavior of students while they learn. They tend to focus too much on the present, sometimes sacrificing long-term benefits in favor of temporary pleasure. They rely too much on routine. They focus too much on negative identities. It is worth noting that social identity can exacerbate present bias by perpetuating irresponsible lifestyles through social networks.

The speaker provided some evidence that education could change preferences. Studying requires effort, but the payoffs are only realized gradually over the life course. Therefore, to get an education, students need to overcome their tendency to “enjoy the present” without considering the long-term consequences of bad time management.

Behavioral economics is increasingly used to analyze education policies. One important question is how much the availability of college admissions information helps disadvantaged students, depending on whether they actively seek it out. Individual aspiration and family resources play vital roles in determining the overall impact on college choices of making this information more available and transparent.

One of my favorite pieces of research on this topic was done by Caroline Hoxby and Sarah Turner. They mailed packages containing personalized information on college admissions statistics and scholarship availability to disadvantaged students, thereby increasing their awareness of the affordability of selective colleges. As a result, the number of applications to more selective colleges increased, and so did matriculation at those colleges. This suggests that simple, inexpensive information interventions can make a big difference in the decision-making process of disadvantaged individuals.

Two iPhone apps for meditation

If you are looking for peace in a busy life, or simply want to free up some space in your mind, the following apps might be useful.

– Headspace: guided 10-minute meditations, pre-recorded by a young British man. It has a ten-day trial period, but charges a fee afterwards.

– Insight Timer: provides free, pre-recorded guided meditations by famous meditation practitioners (Thich Nhat Hanh, for example). You can also share your life experiences and thoughts on meditation on a discussion forum.

2015: Learn, Explore, Create

2014 has been a great year for me. I experienced a lot of uncertainty and anxiety but have also become much more mature. For 2015, here are a few of my keywords:

* Learn *

Learn about economics, in terms of both theory and empirical methods. As a first-year PhD student, learning is my priority. I have come to realize that without a deep understanding of the current literature, ideas are often either too shallow or too outdated. I should also develop my own perspective to see the questions and modeling techniques in different fields under a unified framework, which will allow me to have a bigger repository of ideas and research tools.

Learn about how to communicate ideas, in writing and in person. Writing is like carving a statue out of a bare stone. There are general rules to be followed, but good writing requires a tremendous amount of practice and through this practice, a solid grasp of the reader’s mind.

* Explore *

A fair amount of exploration is needed before one settles on a particular idea (or set of ideas) for a dissertation (in the short run) and future research (in the long run). I need not only to be more aware of the resources around me, but also to develop the ability to abstract and synthesize relevant information for my own use.

Apart from academics, I hope to explore more about the area where I’m living and to make more friends. This, I believe, will come naturally.

* Create *

The ultimate goal of research is to advance the boundary of human knowledge. Creativity is a key ingredient to good research. I definitely need to improve in this aspect. Hope to come back with more insights in a year.

One important difference between a PhD student and an undergrad/masters student is that work is no longer distinguishable from leisure. Everything seems to be related to economics in some way. While I know economics is very important to me, I am also trying not to be buried in the ivory tower and to communicate with people from other backgrounds. We have a lot to learn from each other.

Finally, a few thoughts on love and distance. If I can give one piece of advice to people in long distance relationships, I would say: don’t see distance as your enemy. View it as an opportunity for you to become more independent individuals. You can achieve personal growth while maintaining the emotional bond with your significant other. Strive to become a better person so that the next time you meet, you are able to deliver more positive energy to each other.

HAPPY NEW YEAR!

Using BibTeX to manage references in LaTeX

BibTeX is a useful reference management tool for academic writers. Wikibooks provides a detailed description of how to use BibTeX. For most references you only need to search for the article on Google Scholar and select “import into BibTeX” to get the entry.

If you are using WinEdt like me, follow this procedure (I imagine the procedure should be similar for other editors) to link BibTeX with LaTeX and generate the bibliography.

After creating your .bib file and adding the relevant LaTeX commands to your .tex file, hit the following key combinations while you’re in the .tex file:

  1. CTL-SHIFT-L: runs LaTeX2e. Since LaTeX is designed to work with BibTeX, and you have used the citation commands in your .tex file, it generates an .aux file which contains all of the citations used in the document.
  2. CTL-SHIFT-B: runs BibTeX. This command searches for the “.aux” file, searches your BibTeX file for the relevant citations, and creates a .bbl file containing all information for the works cited in your .tex file. This is a crucial step; without it the citations will appear as question marks and the bibliography won’t be generated.
  3. CTL-SHIFT-L: runs LaTeX2e to let LaTeX create the bibliography inside the document with the bibliography (.bbl) file.
  4. CTL-SHIFT-L: runs LaTeX2e again to make sure all of the references match up.

After every change in your .bib file, you have to repeat steps 2-4.
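
For those who prefer the command line to WinEdt shortcuts, the same four-step sequence can be sketched in Python. This is only a sketch under my own assumptions: it supposes that pdflatex and bibtex are installed and on the PATH, and the helper names build_commands and build, as well as the file stem "paper", are hypothetical.

```python
import subprocess


def build_commands(stem, latex="pdflatex", bibtex="bibtex"):
    """Return the four commands, in order, for a full LaTeX + BibTeX build."""
    run_latex = [latex, "-interaction=nonstopmode", stem + ".tex"]
    run_bibtex = [bibtex, stem]  # BibTeX reads stem.aux and writes stem.bbl
    # LaTeX (writes .aux), BibTeX (writes .bbl), then LaTeX twice more
    # to pull in the bibliography and resolve the references.
    return [run_latex, run_bibtex, run_latex, run_latex]


def build(stem):
    # Run each step in turn and stop at the first one that fails.
    for command in build_commands(stem):
        subprocess.run(command, check=True)


# Example (assumes paper.tex and its .bib file are in the current directory):
# build("paper")
```

After a change to the .bib file only the last three commands are strictly needed, but rerunning all four is harmless.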

Winter Reading Notes on Industrial Development (1)

This winter break I am reading some classic articles about trade and development, fields that I am likely to specialize in. If you are interested, the materials come from this industrial development course taught by Professor Eric Verhoogen at Columbia. This is the first in a series of posts documenting my thoughts on these readings.

In The Fall and Rise of Development Economics, Paul Krugman writes about how the “high development theory” perished because of its inability to express its key ideas in structured economic models. There is a lot to be learned just from Krugman’s writing.

Mookherjee (1999) argues that contractual constraints are the major impediments to firm performance in developing countries. In developing countries, moral hazard problems can hardly be solved through formal channels because of financial constraints and weaker contract enforcement institutions. He proposes three alternative solutions:

1. Designing contracts that reward abstinence from dysfunctional behavior (when performance is contractible);

2. Joint ownership by agents with conflicting interests (when performance is non-contractible);

3. Relying on reputational constraints, when neither 1 nor 2 works.

Clearly, weaker institutions and greater information barriers will shift people from external monitoring (formal contracts) to internal monitoring (reputation and networks).

Another insightful point made in the article is that wealthier entrepreneurs have a better chance of securing investment in risky projects, because they have higher stakes and more collateral.

Tybout (2000) outlines several key observations about manufacturing firms in developing countries: the proliferation of very small plants (under 5 workers), the large market shares of big plants, and high turnover rates of plants and jobs. However, linking these phenomena to productivity levels and growth requires clearer measures of efficiency and productivity, better data tracking firms over sufficiently long periods, and better models of firm behavior under macroeconomic uncertainty.

References:

Krugman, P. 1994. “The fall and rise of development economics.” Rethinking the development experience: Essays provoked by the work of Albert O. Hirschman: 39-58.

Mookherjee, D. 1999. “Contractual constraints on firm performance in developing countries.” Working Paper. Boston University, Institute for Economic Development.

Tybout, J. R. 2000. “Manufacturing firms in developing countries: How well do they do, and why?” Journal of Economic Literature 38(1): 11-44.