13  B6. Censoring, Truncation & Sample Selection

13.1 About

Topics covered:

  • Censored data and the Tobit model (MLE, McDonald-Moffitt decomposition)
  • Truncated sample regression and truncated normal MLE
  • Sample selection bias and the Heckman two-step estimator
  • Full Information MLE for sample selection (FIML)
  • Identification via exclusion restrictions

13.2 Lecture Notes

 


13.3 Overview

This section covers three related data problems:

  1. Censoring — all observations are retained, but the outcome is only observed above/below a threshold.
  2. Truncation — observations outside a range are entirely excluded from the sample.
  3. Sample selection — whether an observation appears in the sample depends on endogenous choices.

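Left uncorrected, all three generally make OLS biased and inconsistent. The main exception is sample selection that is uncorrelated with the outcome error, in which case OLS on the selected sample remains valid.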
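
To make the three sampling schemes concrete, here is a minimal simulation sketch using only the Python standard library. The data-generating process (intercept 1, slope 1, error correlation 0.8, selection rule \(d_i = \mathbf{1}[x_i + v_i > 0]\)) is an illustrative assumption, not part of the notes; the helper `ols_slope` is likewise hypothetical.

```python
import random

def ols_slope(xs, ys):
    """Slope coefficient from a simple regression of ys on xs (with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

rng = random.Random(0)
n, beta0, beta1, rho = 20_000, 1.0, 1.0, 0.8   # illustrative values (assumed)

x, y_star, d = [], [], []
for _ in range(n):
    xi = rng.gauss(0, 1)
    v = rng.gauss(0, 1)                                   # selection error
    u = rho * v + (1 - rho**2) ** 0.5 * rng.gauss(0, 1)   # outcome error, corr(u, v) = rho
    x.append(xi)
    y_star.append(beta0 + beta1 * xi + u)                 # latent outcome
    d.append(xi + v > 0)                                  # assumed selection rule

# Benchmark: OLS on the full latent data (infeasible in practice)
slope_full = ols_slope(x, y_star)

# 1. Censoring: every observation kept, outcome recorded as max(0, y*)
slope_censored = ols_slope(x, [max(0.0, yi) for yi in y_star])

# 2. Truncation: observations with y* <= 0 are dropped entirely
trunc = [(xi, yi) for xi, yi in zip(x, y_star) if yi > 0]
slope_truncated = ols_slope([t[0] for t in trunc], [t[1] for t in trunc])

# 3. Sample selection: y observed only when d = 1
sel = [(xi, yi) for xi, yi, di in zip(x, y_star, d) if di]
slope_selected = ols_slope([s[0] for s in sel], [s[1] for s in sel])

print(f"full: {slope_full:.3f}  censored: {slope_censored:.3f}  "
      f"truncated: {slope_truncated:.3f}  selected: {slope_selected:.3f}")
```

In this particular design all three naive OLS slopes fall below the true value of 1, while the infeasible full-sample regression recovers it.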


13.4 Censoring and the Tobit Model

Setup

Let \(y_i^*\) be the latent (unobserved) outcome. We observe: \[y_i = \max(0,\, y_i^*) = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0 \end{cases}\]

This is censoring from below at zero (standard Tobit). The latent variable follows: \[y_i^* = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \mid x_i \sim \mathcal{N}(0,\sigma^2)\]

Applications: hours worked (many zeros for non-employed), expenditure on durable goods, loan amounts.

Why OLS fails

OLS on the censored outcome \(y_i\) is biased because \(E[y_i \mid x_i] \neq x_i'\beta\): \[E[y_i \mid x_i] = \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\,x_i'\beta + \sigma\,\phi\!\left(\frac{x_i'\beta}{\sigma}\right)\]

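where \(\phi\) and \(\Phi\) are the standard normal PDF and CDF. Because this conditional mean is nonlinear in \(x_i'\beta\), a linear regression of \(y_i\) on \(x_i\) is misspecified: OLS ignores the pile-up of mass at zero and typically attenuates the slope estimates toward zero, underestimating \(|\beta|\).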
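
The censored-mean formula can be checked numerically. The sketch below compares it with a Monte Carlo average for one value of \(x_i'\beta\); the values of `xb` and `sigma` and the function name `censored_mean` are illustrative assumptions.

```python
import random
from statistics import NormalDist

std = NormalDist()  # standard normal: std.pdf is phi, std.cdf is Phi

def censored_mean(xb, sigma):
    """E[y | x] for y = max(0, y*), where y* ~ N(xb, sigma^2)."""
    c = xb / sigma
    return std.cdf(c) * xb + sigma * std.pdf(c)

rng = random.Random(1)
xb, sigma, n = 0.5, 2.0, 200_000   # illustrative values (assumed)

# Monte Carlo average of the censored outcome at a fixed value of x'beta
simulated = sum(max(0.0, rng.gauss(xb, sigma)) for _ in range(n)) / n
formula = censored_mean(xb, sigma)

print(f"simulated: {simulated:.4f}  formula: {formula:.4f}  x'beta: {xb:.4f}")
```

The two numbers should agree up to simulation noise, and both exceed \(x_i'\beta\) here, which is the gap a linear regression of \(y_i\) on \(x_i\) cannot capture.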

Tobit MLE

The likelihood has two parts: a mass at \(y_i = 0\) and a density for \(y_i > 0\):

\[\ell(\beta,\sigma) = \sum_{y_i=0}\ln\left[1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\right] + \sum_{y_i>0}\ln\left[\frac{1}{\sigma}\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right]\]

Tobit MLE is consistent and asymptotically normal under the normality and homoskedasticity assumptions. If either assumption fails, the estimator is in general inconsistent.
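
A minimal sketch of Tobit MLE on simulated data, assuming NumPy and SciPy are available. The log-likelihood follows the expression above term by term; the log-parameterization of \(\sigma\), the starting values, and the simulated design are choices made for this illustration only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, X, y):
    """Negative Tobit log-likelihood; params = (beta, log sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])   # exp keeps sigma > 0
    xb = X @ beta
    cens = y <= 0
    # Censored part: ln[1 - Phi(x'b / sigma)] = ln Phi(-x'b / sigma)
    ll_cens = norm.logcdf(-xb[cens] / sigma)
    # Uncensored part: ln[(1 / sigma) * phi((y - x'b) / sigma)]
    ll_unc = norm.logpdf((y[~cens] - xb[~cens]) / sigma) - np.log(sigma)
    return -(ll_cens.sum() + ll_unc.sum())

# Simulated data (illustrative values, assumed)
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.5
y = np.maximum(0.0, X @ beta_true + sigma_true * rng.normal(size=n))

# OLS on the censored outcome, used for comparison and as a starting point
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

start = np.append(beta_ols, 0.0)                    # last entry is log(sigma) = 0
res = minimize(tobit_negloglik, start, args=(X, y), method="BFGS")
beta_hat, sigma_hat = res.x[:-1], np.exp(res.x[-1])

print("OLS slope:  ", round(float(beta_ols[1]), 3))
print("Tobit slope:", round(float(beta_hat[1]), 3))
print("Tobit sigma:", round(float(sigma_hat), 3))
```

In this design the Tobit estimates should land close to the true values, while the OLS slope on the censored outcome is visibly attenuated.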

Marginal effects

The overall marginal effect of \(x_j\) on \(E[y_i \mid x_i]\) (unconditional) is: \[\frac{\partial E[y_i\mid x_i]}{\partial x_j} = \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\beta_j\]

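The McDonald-Moffitt decomposition splits this effect into a change in the probability of being uncensored and a change in the conditional mean for \(y_i > 0\); the latter is the conditional marginal effect.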
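
For reference, the pieces can be written out under the Tobit assumptions above. Let \(c_i = x_i'\beta/\sigma\) and \(\lambda(c) = \phi(c)/\Phi(c)\) (notation introduced here for compactness). The conditional effect for uncensored observations is: \[\frac{\partial E[y_i\mid x_i, y_i>0]}{\partial x_j} = \beta_j\left[1 - \lambda(c_i)\big(c_i + \lambda(c_i)\big)\right]\]

and the unconditional effect decomposes as: \[\frac{\partial E[y_i\mid x_i]}{\partial x_j} = \Phi(c_i)\,\frac{\partial E[y_i\mid x_i, y_i>0]}{\partial x_j} + E[y_i\mid x_i, y_i>0]\,\frac{\partial \Phi(c_i)}{\partial x_j}\]

with \(E[y_i\mid x_i, y_i>0] = x_i'\beta + \sigma\lambda(c_i)\) and \(\partial\Phi(c_i)/\partial x_j = (\beta_j/\sigma)\,\phi(c_i)\). Substituting these in, the two terms sum to \(\Phi(c_i)\beta_j\), which matches the unconditional marginal effect stated above.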


13.5 Truncation

Truncation occurs when observations with \(y_i^* \leq 0\) are entirely absent from the sample (not just coded as zero).

Truncated Normal Regression

Conditioning on \(y_i^* > 0\): \[E[y_i^*\mid x_i, y_i^*>0] = x_i'\beta + \sigma\,\frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)} = x_i'\beta + \sigma\lambda\!\left(\frac{x_i'\beta}{\sigma}\right)\]

where \(\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)\) is the inverse Mills ratio (IMR).

OLS on the truncated sample omits the IMR term and is therefore biased. MLE for the truncated normal model directly maximizes the truncated likelihood.
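
The truncated likelihood can be sketched as follows, under the same normal latent model. Each retained observation contributes the normal density of \(y_i\) divided by the probability \(\Phi(x_i'\beta/\sigma)\) of falling in the sample: \[\ell(\beta,\sigma) = \sum_{i}\left\{\ln\left[\frac{1}{\sigma}\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right] - \ln\Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\right\}\]

where the sum runs over the truncated sample only. Unlike the Tobit log-likelihood, there is no separate term for observations at the threshold, because those observations are never seen.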


13.6 Sample Selection: The Heckman Model

The selection problem

Sample selection bias arises when whether \(y_i\) is observed depends on a decision that is itself correlated with the outcome. The model has two equations:

Selection equation: \[d_i^* = z_i'\gamma + v_i, \quad d_i = \mathbf{1}[d_i^*>0]\]

Outcome equation (observed only when \(d_i=1\)): \[y_i = x_i'\beta + u_i\]

If \((v_i, u_i) \sim \mathcal{N}\!\left(\mathbf{0},\begin{pmatrix}\sigma_v^2 & \rho\sigma_v\sigma_u \\ \rho\sigma_v\sigma_u & \sigma_u^2\end{pmatrix}\right)\), then:

\[E[y_i \mid x_i, d_i=1] = x_i'\beta + \rho\sigma_u\,\lambda(z_i'\gamma/\sigma_v)\]

OLS on the selected sample is biased whenever \(\rho \neq 0\).

Heckman two-step estimator

Heckman (1979) proposes a two-step procedure:

Step 1: Estimate a Probit for \(d_i\) on \(z_i\) → get \(\hat\gamma\). Because a Probit identifies \(\gamma\) only up to scale, normalize \(\sigma_v = 1\). Compute the estimated IMR: \[\hat\lambda_i = \frac{\phi(z_i'\hat\gamma)}{\Phi(z_i'\hat\gamma)}\]

Step 2: Run OLS of \(y_i\) on \((x_i, \hat\lambda_i)\) using only the selected subsample (\(d_i=1\)): \[y_i = x_i'\beta + \delta\,\hat\lambda_i + \text{error}_i\]

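The coefficient \(\hat\delta\) estimates \(\rho\sigma_u\), so a t-test on \(\delta\) is a test for selection (\(H_0: \rho=0\)). When \(\rho \neq 0\), the Step 2 errors are heteroskedastic and \(\hat\lambda_i\) is a generated regressor, so conventional OLS standard errors are invalid. Use Heckman's corrected variance formula or bootstrap both steps together.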
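
The two steps can be sketched on simulated data, assuming NumPy and SciPy are available. The Probit is fitted by direct likelihood maximization rather than with a dedicated econometrics package, and the design (coefficients, \(\rho = 0.6\), \(\sigma_u = 2\), and the excluded variable `w`) is an illustrative assumption. The sketch reports point estimates only and does not compute corrected standard errors.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_negloglik(gamma, Z, d):
    """Negative Probit log-likelihood for a 0/1 outcome d."""
    q = 2 * d - 1                                   # +1 if selected, -1 otherwise
    return -norm.logcdf(q * (Z @ gamma)).sum()

# Simulated data (illustrative values, assumed)
rng = np.random.default_rng(0)
n, rho, sigma_u = 10_000, 0.6, 2.0
x = rng.normal(size=n)
w = rng.normal(size=n)                              # excluded variable: selection only
v = rng.normal(size=n)                              # selection error, sigma_v = 1
u = sigma_u * (rho * v + np.sqrt(1 - rho**2) * rng.normal(size=n))

Z = np.column_stack([np.ones(n), x, w])             # selection regressors
X = np.column_stack([np.ones(n), x])                # outcome regressors
d = (Z @ np.array([0.3, 0.5, 1.0]) + v > 0).astype(float)
y = X @ np.array([1.0, 2.0]) + u                    # treated as observed only if d == 1
sel = d == 1

# Step 1: Probit of d on z, then the estimated inverse Mills ratio
gamma_hat = minimize(probit_negloglik, np.zeros(Z.shape[1]),
                     args=(Z, d), method="BFGS").x
index = Z @ gamma_hat
imr = norm.pdf(index) / norm.cdf(index)

# Step 2: OLS of y on (x, IMR) using the selected subsample only
X_aug = np.column_stack([X[sel], imr[sel]])
coef = np.linalg.lstsq(X_aug, y[sel], rcond=None)[0]
beta_hat, delta_hat = coef[:-1], coef[-1]

# Naive OLS on the selected subsample, for comparison
beta_naive = np.linalg.lstsq(X[sel], y[sel], rcond=None)[0]

print("naive OLS beta:", np.round(beta_naive, 3))
print("two-step beta: ", np.round(beta_hat, 3))
print("delta_hat:     ", round(float(delta_hat), 3), "(true rho * sigma_u = 1.2)")
```

In this design the naive intercept is pushed upward by positive selection, while the two-step estimates should sit near the true \(\beta = (1, 2)\).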

Identification

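Formally, the model is identified even when the selection and outcome equations share all regressors (\(z_i = x_i\)), but only through the nonlinearity of \(\lambda\) (functional form identification). Because \(\lambda\) is close to linear over much of its range, \(\hat\lambda_i\) is then nearly collinear with \(x_i\) and the estimates are fragile. Exclusion restrictions, that is, variables in \(z_i\) that enter the selection equation but not the outcome equation, greatly improve precision and robustness.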

MLE alternative

Full Information MLE (FIML) jointly estimates \((\beta, \gamma, \sigma_u, \rho)\) from the bivariate normal likelihood. It is more efficient than the two-step estimator when bivariate normality holds, but it is computationally heavier and sensitive to distributional misspecification.
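
As a sketch, with \(\sigma_v\) normalized to 1 (an assumption added here, as is usual for a Probit selection equation), the log-likelihood being maximized can be written as: \[\ell(\beta,\gamma,\sigma_u,\rho) = \sum_{d_i=0}\ln\Phi(-z_i'\gamma) + \sum_{d_i=1}\left\{\ln\left[\frac{1}{\sigma_u}\phi\!\left(\frac{y_i - x_i'\beta}{\sigma_u}\right)\right] + \ln\Phi\!\left(\frac{z_i'\gamma + \rho\,(y_i - x_i'\beta)/\sigma_u}{\sqrt{1-\rho^2}}\right)\right\}\]

The first sum is the probability of not being selected. For selected observations, the contribution is the marginal density of \(y_i\) times the probability of selection conditional on \(u_i = y_i - x_i'\beta\), which follows from \(v_i \mid u_i \sim \mathcal{N}(\rho u_i/\sigma_u,\, 1-\rho^2)\).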


13.7 Summary

| Data problem | Estimator | Key assumption |
|---|---|---|
| Censored data | Tobit MLE | Normality, homoskedasticity |
| Truncated sample | Truncated Normal MLE | Normality |
| Sample selection (\(\rho\neq 0\)) | Heckman two-step or FIML | Bivariate normality |
| Sample selection (\(\rho=0\)) | OLS on selected sample | No selection bias |

13.8 References

Cameron and Trivedi (2005), chapters 16, 24. Davidson and MacKinnon (2004), chapter 15.