13 B6. Censoring, Truncation & Sample Selection
13.1 About
Topics covered:
- Censored data and the Tobit model (MLE, McDonald-Moffitt decomposition)
- Truncated sample regression and truncated normal MLE
- Sample selection bias and the Heckman two-step estimator
- Full Information MLE for sample selection (FIML)
- Identification via exclusion restrictions
13.2 Lecture Notes
13.3 Overview
This section covers three related data problems:
- Censoring — all observations are retained, but the outcome is only observed above/below a threshold.
- Truncation — observations outside a range are entirely excluded from the sample.
- Sample selection — whether an observation appears in the sample depends on endogenous choices.
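In all three cases, OLS applied to the available data is generally biased and inconsistent for \(\beta\). Under sample selection this requires that the selection and outcome errors be correlated.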
13.4 Censoring and the Tobit Model
Setup
Let \(y_i^*\) be the latent (unobserved) outcome. We observe: \[y_i = \max(0,\, y_i^*) = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0 \end{cases}\]
This is censoring from below at zero (standard Tobit). The latent variable follows: \[y_i^* = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \mid x_i \sim \mathcal{N}(0,\sigma^2)\]
Applications: hours worked (many zeros for non-employed), expenditure on durable goods, loan amounts.
Why OLS fails
OLS on the censored outcome \(y_i\) is biased because \(E[y_i \mid x_i] \neq x_i'\beta\): \[E[y_i \mid x_i] = \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\,x_i'\beta + \sigma\,\phi\!\left(\frac{x_i'\beta}{\sigma}\right)\]
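where \(\phi\) and \(\Phi\) are the standard normal PDF and CDF. This conditional mean is nonlinear in \(x_i\) and flattens where censoring is likely, so a linear OLS fit ignores the pile-up of mass at zero and typically attenuates the slope estimates toward zero.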
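The attenuation can be illustrated with a small simulation. This is a minimal sketch, assuming NumPy and SciPy are available; the parameter values (`beta0`, `beta1`, `sigma`) and the single normal regressor are hypothetical choices made only for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000
beta0, beta1, sigma = -0.5, 1.0, 1.0  # hypothetical true parameters

x = rng.normal(size=n)
xb = beta0 + beta1 * x
y_star = xb + sigma * rng.normal(size=n)  # latent outcome
y = np.maximum(0.0, y_star)               # censored from below at zero

# Closed-form censored mean E[y | x] from the formula above, at each x_i
censored_mean = norm.cdf(xb / sigma) * xb + sigma * norm.pdf(xb / sigma)

# OLS of the censored outcome on a constant and x
X = np.column_stack([np.ones(n), x])
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
share_uncensored = (y > 0).mean()

print(f"sample mean of y:          {y.mean():.3f}")
print(f"average of formula E[y|x]: {censored_mean.mean():.3f}")
print(f"OLS slope on censored y:   {ols_coef[1]:.3f} (true beta1 = {beta1})")
print(f"share of uncensored obs:   {share_uncensored:.3f}")
```

In this particular design (one normally distributed regressor) the OLS slope comes out close to \(\beta_1\) times the uncensored share; that approximation should not be expected to hold for arbitrary regressor distributions.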
Tobit MLE
The log-likelihood has two parts: a probability mass for the censored observations (\(y_i = 0\)) and a density for the uncensored ones (\(y_i > 0\)):
\[\ell(\beta,\sigma) = \sum_{y_i=0}\ln\left[1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\right] + \sum_{y_i>0}\ln\left[\frac{1}{\sigma}\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right]\]
Tobit MLE is consistent and asymptotically normal when the latent errors are normal and homoskedastic. Unlike OLS in the linear model, it is generally inconsistent if either assumption fails.
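The Tobit log-likelihood can be coded and maximized numerically. The following is a sketch under stated assumptions, not a reference implementation: the function name `tobit_negloglik`, the simulated data, and the choice to optimize over \(\ln\sigma\) (to keep \(\sigma > 0\)) are all illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, X, y):
    """Negative Tobit log-likelihood; params = (beta, log sigma)."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    censored = y <= 0
    # Censored part: ln[1 - Phi(x'b / sigma)] = ln Phi(-x'b / sigma)
    ll_cens = norm.logcdf(-xb[censored] / sigma)
    # Uncensored part: ln[(1/sigma) * phi((y - x'b) / sigma)]
    ll_obs = norm.logpdf((y[~censored] - xb[~censored]) / sigma) - log_sigma
    return -(ll_cens.sum() + ll_obs.sum())

# Simulated data with hypothetical true parameters
rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([-0.5, 1.0]), 1.0
y = np.maximum(0.0, X @ beta_true + sigma_true * rng.normal(size=n))

# Start from OLS coefficients and sigma = 1
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
start = np.append(ols_coef, 0.0)
res = minimize(tobit_negloglik, start, args=(X, y), method="BFGS")
beta_hat, sigma_hat = res.x[:-1], np.exp(res.x[-1])

print("OLS coefficients:  ", ols_coef)
print("Tobit coefficients:", beta_hat)
print("Tobit sigma:       ", sigma_hat)
```

Writing the censored term as `norm.logcdf(-xb / sigma)` avoids computing \(\ln(1-\Phi)\) directly, which loses precision when \(\Phi\) is close to one.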
Marginal effects
The overall marginal effect of \(x_j\) on \(E[y_i \mid x_i]\) (unconditional) is: \[\frac{\partial E[y_i\mid x_i]}{\partial x_j} = \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\beta_j\]
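The McDonald and Moffitt decomposition splits this total effect into two parts: the change in the conditional mean \(E[y_i \mid x_i, y_i > 0]\), weighted by the probability of being uncensored, and the change in that probability, weighted by the conditional mean.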
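A numerical check of the decomposition, using only the Python standard library. The function name `mcdonald_moffitt` and the evaluation point are hypothetical. With \(c = x_i'\beta/\sigma\) and \(\lambda = \phi(c)/\Phi(c)\), the sketch uses \(E[y_i \mid x_i, y_i>0] = x_i'\beta + \sigma\lambda\), \(\partial E[y_i \mid x_i, y_i>0]/\partial x_j = \beta_j[1 - \lambda(c + \lambda)]\), and \(\partial P(y_i>0 \mid x_i)/\partial x_j = \phi(c)\beta_j/\sigma\).

```python
from statistics import NormalDist

std_normal = NormalDist()

def mcdonald_moffitt(xb, sigma, beta_j):
    """Split the Tobit marginal effect on E[y|x] into two components."""
    c = xb / sigma
    Phi, phi = std_normal.cdf(c), std_normal.pdf(c)
    lam = phi / Phi                                   # inverse Mills ratio
    cond_mean = xb + sigma * lam                      # E[y | x, y > 0]
    d_cond_mean = beta_j * (1.0 - lam * (c + lam))    # dE[y | x, y>0] / dx_j
    d_prob = phi * beta_j / sigma                     # dP(y > 0 | x) / dx_j
    return {
        "total": Phi * beta_j,                  # dE[y | x] / dx_j
        "intensive": Phi * d_cond_mean,         # change among the uncensored
        "extensive": cond_mean * d_prob,        # change in P(uncensored)
        "conditional_effect": d_cond_mean,
    }

parts = mcdonald_moffitt(xb=0.5, sigma=1.0, beta_j=1.0)
for name, value in parts.items():
    print(f"{name:>18}: {value:.4f}")
```

The `intensive` and `extensive` entries sum to `total`, which is the \(\Phi(x_i'\beta/\sigma)\beta_j\) expression above.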
13.5 Truncation
Truncation occurs when observations with \(y_i^* \leq 0\) are entirely absent from the sample (not just coded as zero).
Truncated Normal Regression
Conditioning on \(y_i^* > 0\): \[E[y_i^*\mid x_i, y_i^*>0] = x_i'\beta + \sigma\,\frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)} = x_i'\beta + \sigma\lambda\!\left(\frac{x_i'\beta}{\sigma}\right)\]
where \(\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)\) is the inverse Mills ratio (IMR).
OLS on the truncated sample omits the IMR term and is therefore biased. MLE for the truncated normal model directly maximizes the truncated likelihood.
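For truncation from below at zero, each retained observation contributes the normal density divided by the probability of being retained, so the log-likelihood is \(\sum_i \{\ln[\tfrac{1}{\sigma}\phi((y_i - x_i'\beta)/\sigma)] - \ln\Phi(x_i'\beta/\sigma)\}\). A minimal sketch follows, assuming NumPy and SciPy; the function name `truncated_negloglik` and the simulated design are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def truncated_negloglik(params, X, y):
    """Negative log-likelihood when only y* > 0 is sampled."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    # Normal density, renormalized by P(y* > 0 | x) = Phi(x'b / sigma)
    ll = (norm.logpdf((y - xb) / sigma) - log_sigma
          - norm.logcdf(xb / sigma))
    return -ll.sum()

# Simulate the latent model, then drop every observation with y* <= 0
rng = np.random.default_rng(2)
n_latent = 20_000
x = rng.normal(size=n_latent)
beta_true, sigma_true = np.array([0.5, 1.0]), 1.0  # hypothetical values
y_star = beta_true[0] + beta_true[1] * x + sigma_true * rng.normal(size=n_latent)

keep = y_star > 0
X = np.column_stack([np.ones(keep.sum()), x[keep]])
y = y_star[keep]

# OLS on the truncated sample (biased), used here as starting values
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
start = np.append(ols_coef, 0.0)
res = minimize(truncated_negloglik, start, args=(X, y), method="BFGS")
beta_hat, sigma_hat = res.x[:-1], np.exp(res.x[-1])

print("OLS on truncated sample:", ols_coef)
print("Truncated normal MLE:   ", beta_hat, "sigma:", sigma_hat)
```

Unlike the Tobit case, the discarded observations contribute nothing to the likelihood, because neither their outcomes nor their regressors are observed.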
13.6 Sample Selection: The Heckman Model
The selection problem
Sample selection bias arises when whether \(y_i\) is observed depends on a decision that is correlated with the unobservables of the outcome. The two-equation system is:
Selection equation: \[d_i^* = z_i'\gamma + v_i, \quad d_i = \mathbf{1}[d_i^*>0]\]
Outcome equation (observed only when \(d_i=1\)): \[y_i = x_i'\beta + u_i\]
If \((v_i, u_i) \sim \mathcal{N}\!\left(\mathbf{0},\begin{pmatrix}\sigma_v^2 & \rho\sigma_v\sigma_u \\ \rho\sigma_v\sigma_u & \sigma_u^2\end{pmatrix}\right)\), then:
\[E[y_i \mid x_i, d_i=1] = x_i'\beta + \rho\sigma_u\,\lambda(z_i'\gamma/\sigma_v)\]
OLS on the selected sample is biased whenever \(\rho \neq 0\).
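The selection-corrected conditional mean follows from two standard properties of the normal distribution. The short derivation below uses the notation of this section and assumes, as is implicit in the setup, that \((v_i, u_i)\) is independent of the regressors. First, the linear projection of \(u_i\) on \(v_i\) and the mean of a normal variable truncated from below are:

\[E[u_i \mid v_i] = \frac{\rho\sigma_u}{\sigma_v}\,v_i, \qquad E[v_i \mid v_i > -z_i'\gamma] = \sigma_v\,\lambda\!\left(\frac{z_i'\gamma}{\sigma_v}\right)\]

Since \(d_i = 1\) is the event \(v_i > -z_i'\gamma\), combining the two gives:

\[E[y_i \mid x_i, d_i=1] = x_i'\beta + \frac{\rho\sigma_u}{\sigma_v}\,E[v_i \mid v_i > -z_i'\gamma] = x_i'\beta + \rho\sigma_u\,\lambda\!\left(\frac{z_i'\gamma}{\sigma_v}\right)\]

The \(\sigma_v\) terms cancel, so the bias term depends on the selection equation only through the scaled index \(z_i'\gamma/\sigma_v\).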
Heckman two-step estimator
Heckman (1979) proposes a two-step procedure:
Step 1: Estimate a Probit for \(d_i\) on \(z_i\) → get \(\hat\gamma\). A Probit identifies \(\gamma\) only up to scale, so normalize \(\sigma_v = 1\). Compute the estimated IMR: \[\hat\lambda_i = \frac{\phi(z_i'\hat\gamma)}{\Phi(z_i'\hat\gamma)}\]
Step 2: Run OLS of \(y_i\) on \((x_i, \hat\lambda_i)\) using only the selected subsample (\(d_i=1\)): \[y_i = x_i'\beta + \delta\,\hat\lambda_i + \text{error}_i\]
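The coefficient \(\delta = \rho\sigma_u\) captures selection: since \(\sigma_u > 0\), a t-test of \(H_0: \delta = 0\) based on \(\hat\delta\) is a test of \(H_0: \rho=0\). When \(\delta \neq 0\), conventional OLS standard errors in Step 2 are invalid, because the error is heteroskedastic and \(\hat\lambda_i\) is a generated regressor. Use Heckman's corrected covariance matrix or bootstrap both steps together.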
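The two steps can be sketched as follows, assuming NumPy and SciPy and the normalization \(\sigma_v = 1\). The helper names (`probit_negloglik`, `heckman_two_step`), the excluded variable `w`, and all parameter values are hypothetical. The sketch returns point estimates only and does not compute the corrected standard errors.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_negloglik(gamma, Z, d):
    """Negative Probit log-likelihood, using q = 2d - 1 in {-1, +1}."""
    q = 2 * d - 1
    return -norm.logcdf(q * (Z @ gamma)).sum()

def heckman_two_step(y, X, Z, d):
    # Step 1: Probit of d on Z, then the estimated inverse Mills ratio
    res = minimize(probit_negloglik, np.zeros(Z.shape[1]),
                   args=(Z, d), method="BFGS")
    gamma_hat = res.x
    index = Z @ gamma_hat
    imr = np.exp(norm.logpdf(index) - norm.logcdf(index))
    # Step 2: OLS of y on (X, IMR) using the selected subsample only
    sel = d == 1
    W = np.column_stack([X[sel], imr[sel]])
    coef, *_ = np.linalg.lstsq(W, y[sel], rcond=None)
    return coef[:-1], coef[-1], gamma_hat

# Simulated data with one exclusion restriction (w enters selection only)
rng = np.random.default_rng(3)
n = 20_000
x, w = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), x, w])
beta_true = np.array([1.0, 2.0])
gamma_true = np.array([0.2, 0.5, 1.0])
rho, sigma_u = 0.6, 1.5

v = rng.normal(size=n)                                   # sigma_v = 1
u = sigma_u * (rho * v + np.sqrt(1 - rho**2) * rng.normal(size=n))
d = (Z @ gamma_true + v > 0).astype(int)
y = np.where(d == 1, X @ beta_true + u, np.nan)          # y missing if d = 0

beta_hat, delta_hat, gamma_hat = heckman_two_step(y, X, Z, d)
naive_coef, *_ = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)

print("naive OLS on selected sample:", naive_coef)
print("two-step beta:               ", beta_hat)
print("two-step delta:              ", delta_hat, "(rho * sigma_u =", rho * sigma_u, ")")
```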
Identification
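The model is formally identified even when the selection and outcome equations share all regressors (\(z_i = x_i\)), through the nonlinearity of \(\lambda\) alone. However, \(\lambda\) is close to linear over much of its range, so \(\hat\lambda_i\) is then nearly collinear with \(x_i\) and the estimates are imprecise and fragile. Exclusion restrictions, meaning variables in \(z_i\) that enter the selection equation but not the outcome equation, greatly improve precision and robustness.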
MLE alternative
Full Information MLE (FIML) jointly estimates \((\beta, \gamma, \sigma_u, \rho)\) from the bivariate normal likelihood, with \(\sigma_v\) normalized to 1. It is more efficient than the two-step estimator when bivariate normality holds, but it is computationally heavier and more sensitive to distributional misspecification.
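With \(\sigma_v = 1\), an unselected observation contributes \(\ln\Phi(-z_i'\gamma)\) and a selected one contributes \(\ln[\tfrac{1}{\sigma_u}\phi(e_i)] + \ln\Phi\big((z_i'\gamma + \rho e_i)/\sqrt{1-\rho^2}\big)\), where \(e_i = (y_i - x_i'\beta)/\sigma_u\). A sketch of the joint estimation follows, assuming NumPy and SciPy. The function name `fiml_negloglik`, the reparameterization (\(\ln\sigma_u\) and \(\operatorname{atanh}\rho\)), the starting values, and the simulated design are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fiml_negloglik(theta, y, X, Z, d):
    """Negative log-likelihood; theta = (beta, gamma, log sigma_u, atanh rho)."""
    k, m = X.shape[1], Z.shape[1]
    beta, gamma = theta[:k], theta[k:k + m]
    sigma_u = np.exp(theta[k + m])
    rho = np.clip(np.tanh(theta[k + m + 1]), -0.999999, 0.999999)
    sel = d == 1
    zg = Z @ gamma
    e = (y[sel] - X[sel] @ beta) / sigma_u     # standardized outcome residual
    ll_unselected = norm.logcdf(-zg[~sel])     # ln P(d = 0 | z)
    ll_selected = (norm.logpdf(e) - np.log(sigma_u)
                   + norm.logcdf((zg[sel] + rho * e) / np.sqrt(1.0 - rho**2)))
    return -(ll_unselected.sum() + ll_selected.sum())

# Simulated data with hypothetical parameters and one excluded variable w
rng = np.random.default_rng(4)
n = 10_000
x, w = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), x, w])
beta_true, gamma_true = np.array([1.0, 2.0]), np.array([0.2, 0.5, 1.0])
rho_true, sigma_u_true = 0.6, 1.5

v = rng.normal(size=n)
u = sigma_u_true * (rho_true * v + np.sqrt(1 - rho_true**2) * rng.normal(size=n))
d = (Z @ gamma_true + v > 0).astype(int)
y = np.where(d == 1, X @ beta_true + u, np.nan)

# Starting values: OLS on the selected sample, gamma = 0, rho = 0
sel = d == 1
ols_coef, *_ = np.linalg.lstsq(X[sel], y[sel], rcond=None)
resid_sd = np.std(y[sel] - X[sel] @ ols_coef)
start = np.concatenate([ols_coef, np.zeros(Z.shape[1]), [np.log(resid_sd), 0.0]])

res = minimize(fiml_negloglik, start, args=(y, X, Z, d), method="BFGS")
k, m = X.shape[1], Z.shape[1]
beta_hat, gamma_hat = res.x[:k], res.x[k:k + m]
sigma_u_hat, rho_hat = np.exp(res.x[k + m]), np.tanh(res.x[k + m + 1])

print("beta:", beta_hat, "gamma:", gamma_hat)
print("sigma_u:", sigma_u_hat, "rho:", rho_hat)
```

Optimizing over \(\ln\sigma_u\) and \(\operatorname{atanh}\rho\) keeps both parameters inside their admissible ranges without constrained optimization.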
13.7 Summary
| Data problem | Estimator | Key assumption |
|---|---|---|
| Censored data | Tobit MLE | Normality, homoskedasticity |
| Truncated sample | Truncated Normal MLE | Normality |
| Sample selection (\(\rho\neq 0\)) | Heckman two-step or FIML | Bivariate normality |
| Sample selection (\(\rho=0\)) | OLS on selected sample | No selection bias |
13.8 References
Cameron and Trivedi (2005), chapters 16, 24. Davidson and MacKinnon (2004), chapter 15.