This is my project for a Bayesian Statistics class.

**1. Background**

China has experienced remarkable growth and a widening regional gap in recent decades. Inequality in the distribution of wealth can translate into inequality in citizens' overall well-being. In this paper, I use a Bayesian probit model to test whether the elderly from underdeveloped and rural areas are systematically less healthy than their peers. I incorporate a latent variable to capture the underlying difference in elderly health status by province and hukou status (hukou is the Chinese household registration system, which records whether a person is registered as rural or urban). My results suggest that elderly residents of Gansu province are significantly less healthy than those of Zhejiang, but the difference between rural and urban elderly is not significant.

**2. Data and Method**

**2.1. Data**

I use data from the China Health and Retirement Longitudinal Study (CHARLS) pilot sample, downloaded from the China Center for Economic Research at Peking University (http://charls.ccer.edu.cn/charls/). The study interviews people aged 45 and older living in two provinces: Gansu and Zhejiang. Gansu is a landlocked province in northwest China with GDP per capita below 2000 US dollars in 2011, while the coastal province of Zhejiang had GDP per capita approaching 6500 US dollars in the same year.

There are 1620 observations in total. Among the respondents, 45.3% are from Gansu and 54.7% are from Zhejiang. The majority of the interviewees, 82.3%, hold rural hukou. The respondents' ages range from 45 to 87, with a median of 57. For each individual, demographic data are collected, including age, gender, smoking habit, hukou status (rural or urban), and province. Individuals are also asked about their health in childhood, and they can answer “poor”, “fair”, “good”, “very good”, or “excellent”. Information about disability and diseases is recorded, but too many values are missing, so I instead use 12 indicators of health conditions that measure difficulty with 12 daily activities. A complete list is attached in the appendix. Each of these variables is coded as 1 if the individual has difficulty with the corresponding daily activity, and 0 otherwise.

**2.2. Method**

Initially I attempted to follow Chib and Greenberg (1998) and fit a multivariate probit model in order to make full use of the multiple responses. However, choosing an appropriate correlation matrix and sampling from a multi-dimensional truncated normal distribution were non-trivial given the limited time for this project, so I constructed a single indicator for each individual, y_{i}, which indicates whether the individual has difficulty with any of the 12 activities.

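y_{i}=\begin{cases}1, & \text{if individual } i \text{ has difficulty with any of the 12 activities}\\ 0, & \text{otherwise}\end{cases}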
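As a minimal sketch of how this single indicator can be built, assume each respondent's 12 activity indicators are stored as a list of 0/1 values. The function name and the toy rows below are hypothetical and are not taken from CHARLS.

```python
def any_difficulty(indicators):
    """Return 1 if at least one of the 0/1 activity indicators equals 1, else 0."""
    return int(any(indicators))


# Toy respondents (hypothetical data): each row holds 12 activity indicators.
respondents = [
    [0] * 12,                 # no difficulty with any activity
    [0] * 5 + [1] + [0] * 6,  # difficulty with exactly one activity
    [1] * 12,                 # difficulty with every activity
]
y = [any_difficulty(row) for row in respondents]
```

Collapsing the 12 responses this way discards how many activities are difficult, which is the price paid for avoiding the multivariate model.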

I use a probit model.

Pr(y_{i}=1)=\Phi(x_{i}^{T}\beta)

where \Phi is the standard normal cumulative distribution function and x_{i} is a vector of individual characteristics.

x_{i}^{T}=(1, age, male, HealthYoungPoor, HealthYoungFair, HealthYoungGood, HealthYoungVeryGood, smoke, RuralHukou, Gansu, RuralGansu).

The variable Gansu is coded as 1 if the respondent lives in Gansu and 0 if the person lives in Zhejiang. Similarly, RuralHukou indicates whether the respondent holds a rural hukou. RuralGansu is an interaction term coded as 1 if a person lives in Gansu and holds a rural hukou, and 0 otherwise. I break down the levels of childhood health into separate dummies, each equal to 1 if the corresponding statement is true (e.g. HealthYoungPoor=1 if health in childhood was poor) and 0 otherwise. I drop the dummy for “excellent health in childhood” to avoid perfect multicollinearity with the intercept.
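The coding above can be sketched as a small function that builds one covariate vector in the order listed for x_{i}. This is only an illustration: the function name, argument names, and example respondents are hypothetical.

```python
# "excellent" is the omitted reference level, so it gets no dummy.
CHILDHOOD_LEVELS = ["poor", "fair", "good", "very good"]


def design_row(age, male, childhood_health, smoke, rural_hukou, gansu):
    """Build x_i in the order used in the text; all flags are 0/1."""
    health_dummies = [int(childhood_health == level) for level in CHILDHOOD_LEVELS]
    rural_gansu = rural_hukou * gansu  # interaction: 1 only for rural hukou in Gansu
    return [1, age, male, *health_dummies, smoke, rural_hukou, gansu, rural_gansu]


# Hypothetical respondents, not CHARLS records.
row = design_row(age=60, male=1, childhood_health="fair",
                 smoke=0, rural_hukou=1, gansu=1)
baseline = design_row(age=60, male=0, childhood_health="excellent",
                      smoke=0, rural_hukou=1, gansu=0)
```

A respondent with excellent childhood health has all four health dummies equal to 0, so the coefficients on the dummies are read as differences from that reference group.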

For computational convenience, I use a data augmentation scheme. Let y_{i}=I(z_{i}>0)

where z_{i} \sim Normal(x_{i}^{T}\beta, 1) is a latent variable. z_{i} can be interpreted as individual i's disability level, with a higher score indicating poorer health and thus a higher probability that he or she has difficulty with those daily activities.
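To see that the augmented model reproduces the probit model, write z_{i}=x_{i}^{T}\beta+\epsilon_{i} with \epsilon_{i} standard normal (the symbol \epsilon_{i} is introduced here only for this check):

```latex
\begin{aligned}
\Pr(y_{i}=1) &= \Pr(z_{i}>0) = \Pr(\epsilon_{i} > -x_{i}^{T}\beta) \\
             &= 1-\Phi(-x_{i}^{T}\beta) = \Phi(x_{i}^{T}\beta),
\end{aligned}
```

where the last step uses the symmetry of the standard normal distribution.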

I follow Hoff (2009) and choose a multivariate normal g-prior for \beta. I set the prior mean to 0 because I assume the covariates are equally likely to have a positive or a negative impact on health. I set g=n so that the prior is only weakly informative about \beta, carrying roughly as much information as a single observation.

Therefore the prior distribution of \beta is: \beta \sim MultivariateNormal(0, n(X^{T}X)^{-1})

The full conditionals for \beta and z are both available in closed form.

\beta | - \sim Normal(\beta^{*}, \Sigma^{*})

where \Sigma^{*}=\frac{n}{n+1}(X^{T}X)^{-1} and \beta^{*}=\Sigma^{*}X^{T}z

z_{i} | y_{i}=1, - \sim Normal_{(0,+\infty)}(x_{i}^{T}\beta, 1)

z_{i} | y_{i}=0, - \sim Normal_{(-\infty,0)}(x_{i}^{T}\beta, 1)
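The full conditional for \beta can be checked by combining the normal likelihood of z (unit variance) with the g-prior, whose precision is X^{T}X/n. A short derivation sketch:

```latex
\begin{aligned}
p(\beta \mid z)
  &\propto \exp\!\left\{-\tfrac{1}{2}(z-X\beta)^{T}(z-X\beta)\right\}
           \exp\!\left\{-\tfrac{1}{2}\,\beta^{T}\tfrac{X^{T}X}{n}\,\beta\right\} \\
  &\propto \exp\!\left\{-\tfrac{1}{2}\left[\beta^{T}\tfrac{n+1}{n}X^{T}X\,\beta
           - 2\beta^{T}X^{T}z\right]\right\}.
\end{aligned}
```

Matching this to a normal kernel gives posterior precision \frac{n+1}{n}X^{T}X, hence covariance \frac{n}{n+1}(X^{T}X)^{-1}, and posterior mean equal to that covariance times X^{T}z. The prior mean of 0 contributes nothing to the linear term.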

I use Gibbs sampling, alternating between the two full conditionals. I run the sampler for 11000 iterations and discard the first 1000 as burn-in.
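The sampler can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the project: it assumes NumPy is available, draws the truncated normals by inverting the CDF (one of several possible methods), and checks itself on simulated toy data rather than CHARLS. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for its CDF and inverse CDF


def draw_below_zero(mean, rng):
    """Draw from Normal(mean, 1) truncated to (-inf, 0) by inverting the CDF."""
    upper = STD_NORMAL.cdf(-mean)          # Pr(mean + e < 0) for e ~ N(0, 1)
    u = max(rng.random() * upper, 1e-300)  # uniform on [0, upper), nudged off 0
    return mean + STD_NORMAL.inv_cdf(u)


def gibbs_probit(X, y, n_iter=11000, burn_in=1000, seed=0):
    """Data-augmentation Gibbs sampler with the prior beta ~ N(0, n (X'X)^{-1})."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    cov = n / (n + 1) * np.linalg.inv(X.T @ X)  # Sigma*
    proj = cov @ X.T                            # maps z to beta* = Sigma* X'z
    chol = np.linalg.cholesky(cov)              # used to draw N(0, Sigma*) noise
    beta = np.zeros(p)
    draws = np.empty((n_iter - burn_in, p))
    for it in range(n_iter):
        # Update z | beta, y: positive side if y = 1 (by symmetry), negative if y = 0.
        means = X @ beta
        z = np.array([
            -draw_below_zero(-m, rng) if yi == 1 else draw_below_zero(m, rng)
            for m, yi in zip(means, y)
        ])
        # Update beta | z from Normal(beta*, Sigma*).
        beta = proj @ z + chol @ rng.standard_normal(p)
        if it >= burn_in:
            draws[it - burn_in] = beta
    return draws


# Toy check on simulated data (hypothetical, not CHARLS).
sim = np.random.default_rng(1)
X = np.column_stack([np.ones(200), sim.standard_normal(200)])
true_beta = np.array([-0.5, 1.0])
y = (X @ true_beta + sim.standard_normal(200) > 0).astype(int)
draws = gibbs_probit(X, y, n_iter=1200, burn_in=200)
posterior_mean = draws.mean(axis=0)
```

Because \Sigma^{*} does not depend on z or \beta, it is computed once outside the loop; only the conditional mean changes between iterations.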

**2.3. Inference Strategy**

I obtain the posterior distribution of each \beta from the 10000 post burn-in Gibbs samples. The point estimate of each coefficient, denoted \hat{\beta}, is the mean of its post burn-in samples. I also calculate 95% credible intervals using the 0.025 and 0.975 quantiles of the post burn-in samples of each coefficient. If the 95% credible interval of \beta_{j} lies entirely above zero, I call \hat{\beta}_{j} significantly positive; if it lies entirely below zero, I call it significantly negative; if the interval contains zero, the coefficient is not significantly different from zero.
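A minimal sketch of this inference step, assuming the post burn-in draws of one coefficient are held in a plain list. The function names are hypothetical, and the quantiles use simple linear interpolation, which is one of several common conventions. An interval that straddles zero is labelled not significant, in line with how the results are reported.

```python
def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from empirical quantiles of the draws."""
    ordered = sorted(samples)

    def quantile(q):
        pos = q * (len(ordered) - 1)        # fractional index into the sorted draws
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    tail = (1 - level) / 2
    return quantile(tail), quantile(1 - tail)


def classify(samples, level=0.95):
    """Label a coefficient by where its credible interval sits relative to zero."""
    lower, upper = credible_interval(samples, level)
    if lower > 0:
        return "positive"
    if upper < 0:
        return "negative"
    return "not significant"


# Deterministic toy "draws" (hypothetical) to show the three outcomes.
above = [i / 100 for i in range(101)]            # 0.00 to 1.00
straddling = [i / 100 - 0.5 for i in range(101)]  # -0.50 to 0.50
below = [-(i + 1) / 100 for i in range(101)]      # -0.01 to -1.01
labels = [classify(above), classify(straddling), classify(below)]
```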

I then use the Bayesian estimates for in-sample fit and out-of-sample prediction. I compare the performance of the Bayesian model with frequentist logistic and probit regressions using the Mean Squared Error (MSE) for in-sample fit and the Mean Squared Prediction Error (MSPE) for out-of-sample prediction.
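The exact definitions are not spelled out here, so the sketch below rests on an assumption: the error is the average squared gap between the observed 0/1 outcome and the fitted probit probability, computed on the estimation sample for MSE and on held-out observations for MSPE. The function names and toy numbers are hypothetical.

```python
from statistics import NormalDist


def predicted_probs(X, beta_hat):
    """Fitted probit probabilities Phi(x_i' beta_hat) for each row of X."""
    phi = NormalDist().cdf
    return [phi(sum(x * b for x, b in zip(row, beta_hat))) for row in X]


def mean_squared_error(y, probs):
    """Average squared difference between 0/1 outcomes and fitted probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, probs)) / len(y)


# Toy example (hypothetical numbers): a zero linear predictor gives probability 0.5.
X_toy = [[1, 0.0], [1, 0.0]]
beta_hat = [0.0, 1.0]
y_toy = [1, 0]
mse = mean_squared_error(y_toy, predicted_probs(X_toy, beta_hat))
```

The same function yields the MSPE when it is applied to observations that were held out of estimation.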

**3. Results**

The trace plot for \beta_{RuralGansu} is shown below. Although there is some autocorrelation, mixing is good in general.

Among all the covariates, only the coefficient for Gansu is significantly positive, with a posterior mean around 0.8. This suggests that in my sample the residents of Gansu are significantly less healthy than those of Zhejiang. Neither \hat{\beta}_{RuralHukou} nor \hat{\beta}_{RuralGansu} is significantly different from 0, suggesting a lack of evidence for differential health status between urban and rural residents. Health in childhood is not significantly correlated with general health status as measured by the 12 indicators, but this may be because the childhood health variables are self-reported and likely to be inaccurate.

I also plotted the posterior distribution of the z_{i}’s (below) to investigate the landscape of “disability index” among different regions.

The z_{i}'s are drawn using the updated \beta from each iteration. Each individual's posterior \hat{z}_{i} is the mean of his or her post burn-in z_{i} draws. The distributions for rural and urban residents are not very different, which echoes the lack of significance of \beta_{RuralHukou}. However, because the urban population is underrepresented in my sample, this result may not reflect the underlying pattern in the population. The residents of Zhejiang have much lower disability scores a posteriori, which shows a pronounced inter-provincial difference in health.

When I compare the Bayesian approach with frequentist logistic and probit regressions, the Bayesian method yields a higher Mean Squared Error (MSE) but a lower Mean Squared Prediction Error (MSPE); that is, it fits the estimation sample slightly worse but predicts better out of sample.

**4. Conclusion**

Using a Bayesian approach to analyze data from CHARLS, I found that people living in Gansu province are in worse health than residents of Zhejiang. The health difference between rural and urban residents is not significant. Policy makers should be aware of the unequal consequences of development in these two regions in particular and in the whole nation in general. Future research can incorporate community or region fixed effects and use more direct measures of health outcomes to obtain more robust results.

**References:**

Chib, S. and Greenberg, E. 1998. “Analysis of multivariate probit models,” Biometrika 85(2): 347-361.

Hoff, P. 2009. A First Course in Bayesian Statistical Methods, 2nd edition. Springer.