10 B3. Binary Choice Models

10.1 About

Topics covered:

Linear Probability Model (LPM): OLS on binary outcomes and its limitations
Probit and Logit: specification and MLE estimation
Random utility interpretation (McFadden)
Marginal effects: MEM and AME
Goodness of fit and hypothesis tests

10.2 Lecture Notes

Download lecture notes (PDF)

10.3 Overview

Binary choice models explain a binary outcome \(y_i \in \{0,1\}\) — e.g., whether an individual is employed, whether a firm exports, whether a household owns a house — as a function of covariates \(x_i\).

We model the conditional probability: \[P(y_i = 1 \mid x_i) = F(x_i'\beta)\] where \(F(\cdot)\) is a cumulative distribution function (CDF).

10.4 The Linear Probability Model (LPM)

The simplest approach is OLS applied to the binary outcome: \[y_i = x_i'\beta + \varepsilon_i, \quad E[y_i \mid x_i] = x_i'\beta = P(y_i=1\mid x_i)\]

Advantages: Simple to estimate; coefficients are directly interpretable as marginal effects.

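Problems:
- Predicted probabilities \(\hat{p}_i = x_i'\hat\beta\) can fall outside \([0,1]\).
- Errors \(\varepsilon_i = y_i - x_i'\beta\) are necessarily heteroskedastic: \(\operatorname{Var}(\varepsilon_i \mid x_i) = p_i(1-p_i)\) with \(p_i = x_i'\beta\) (the Bernoulli variance).
- OLS is still consistent (assuming correct specification) but inefficient, and valid inference requires heteroskedasticity-robust standard errors.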

Despite these limitations, the LPM is widely used as a simple approximation when the focus is on average marginal effects near the center of the distribution.
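The first limitation is easy to see in a simulation. The following is a minimal sketch, assuming a hypothetical logit data-generating process with made-up coefficients (0.5 and 2.0); it fits the LPM by OLS, measures the share of fitted probabilities outside \([0,1]\), and computes heteroskedasticity-robust (HC0) standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)

# Hypothetical data-generating process (an assumption for illustration only)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.uniform(size=n) < p_true).astype(float)

# LPM: OLS of the binary outcome on a constant and x
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
p_hat = X @ beta_hat
resid = y - p_hat

# Share of fitted "probabilities" below 0 or above 1
share_outside = np.mean((p_hat < 0) | (p_hat > 1))

# HC0 sandwich covariance: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("LPM coefficients:", beta_hat)
print("Robust standard errors:", se_robust)
print("Share of fitted values outside [0, 1]:", share_outside)
```

The slope coefficient is read directly as the change in the probability per unit change in \(x\), which is the main appeal of the LPM.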

10.5 Probit and Logit Models

To guarantee probabilities in \((0,1)\), we replace the identity link with a proper CDF:

\[P(y_i = 1 \mid x_i) = F(x_i'\beta)\]

Model	\(F(\cdot)\)	Distribution
Probit	\(\Phi(z) = \int_{-\infty}^z \phi(t)\,dt\)	Standard Normal
Logit	\(\Lambda(z) = \frac{e^z}{1+e^z}\)	Logistic
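As a small illustration, the two CDFs in the table can be evaluated with the Python standard library alone. The function names `probit_cdf` and `logit_cdf` are chosen here for illustration; the sketch shows that any real-valued index is mapped into \((0,1)\).

```python
import math

def probit_cdf(z):
    # Standard normal CDF written in terms of the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z):
    # Logistic CDF: e^z / (1 + e^z), written in an equivalent form
    return 1.0 / (1.0 + math.exp(-z))

index_values = [-3.0, -1.0, 0.0, 1.0, 3.0]
for z in index_values:
    print(f"z={z:+.1f}  Phi(z)={probit_cdf(z):.4f}  Lambda(z)={logit_cdf(z):.4f}")
```

Both functions are symmetric around zero, where they equal 0.5; the logistic distribution has heavier tails than the standard normal.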

Both models are estimated by MLE. The log-likelihood for \(n\) independent observations is:

\[\ell(\beta) = \sum_{i=1}^n \left[ y_i \ln F(x_i'\beta) + (1-y_i)\ln(1-F(x_i'\beta)) \right]\]

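The first-order conditions (score equations) have no closed-form solution and must be solved numerically (e.g., Newton-Raphson or BFGS). For both Probit and Logit the log-likelihood is globally concave in \(\beta\), so a maximum, when it exists, is unique. The MLE \(\hat\beta\) is consistent, asymptotically normal, and asymptotically efficient under correct specification.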
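For the Logit, the score and Hessian have simple forms: \(\sum_i (y_i - \Lambda(x_i'\beta))\,x_i\) and \(-\sum_i \Lambda_i(1-\Lambda_i)\,x_i x_i'\). The sketch below implements Newton-Raphson with these expressions on simulated data. The function name `logit_mle`, the sample size, and the true coefficients are assumptions made for illustration, not part of the course material.

```python
import numpy as np

def logit_mle(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the logit log-likelihood (illustrative sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (y - p)                       # gradient of the log-likelihood
        hessian = -(X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hessian, score)
        beta = beta - step                          # beta_new = beta - H^{-1} score
        if np.max(np.abs(step)) < tol:
            break
    # Asymptotic covariance: inverse of the information matrix at beta_hat
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    cov = np.linalg.inv((X * (p * (1 - p))[:, None]).T @ X)
    return beta, cov

# Simulated data with hypothetical true coefficients
rng = np.random.default_rng(0)
n = 5000
beta_true = np.array([-0.5, 1.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat, cov = logit_mle(X, y)
print("beta_hat:", beta_hat)
print("standard errors:", np.sqrt(np.diag(cov)))
```

Starting from \(\beta = 0\) is a common default for the Logit; a Probit version would replace the score and Hessian with their Probit counterparts.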

10.6 Random Utility Interpretation

Binary choice models have a structural foundation in random utility theory (McFadden, 1974). Consider an agent choosing between two alternatives \(j=0,1\). The utility for alternative \(j\) is: \[U_{ij} = x_i'\beta_j + \varepsilon_{ij}\]

The agent chooses \(j=1\) if \(U_{i1} > U_{i0}\), that is: \[y_i = 1 \iff x_i'(\beta_1-\beta_0) > \varepsilon_{i0} - \varepsilon_{i1}\]

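Letting \(\beta = \beta_1-\beta_0\) and \(\eta_i = \varepsilon_{i0}-\varepsilon_{i1}\), the choice probability is \(P(y_i = 1 \mid x_i) = P(\eta_i < x_i'\beta) = F(x_i'\beta)\), where \(F\) is the CDF of \(\eta_i\):

If \(\eta_i \sim \mathcal{N}(0,1)\) → Probit
If \(\eta_i \sim \text{Logistic}(0,1)\) → Logit

Only the difference \(\beta_1-\beta_0\) is identified, and the scale of \(\eta_i\) must be normalized, because multiplying both utilities by a positive constant leaves the choice unchanged.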
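The link between utility maximization and the Logit can be checked by simulation. The sketch below assumes that \(\varepsilon_{i0}\) and \(\varepsilon_{i1}\) are i.i.d. type I extreme value (Gumbel), in which case their difference is standard logistic. The utility coefficients are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Hypothetical utility parameters for alternatives 0 and 1
beta0 = np.array([0.2, -0.3])
beta1 = np.array([0.7, 0.5])

# Assumption: i.i.d. Gumbel shocks, so eps0 - eps1 is standard logistic
eps0 = rng.gumbel(size=n)
eps1 = rng.gumbel(size=n)

U0 = X @ beta0 + eps0
U1 = X @ beta1 + eps1
y = (U1 > U0).astype(float)          # the agent picks the higher utility

# Logit prediction based only on the difference beta1 - beta0
beta = beta1 - beta0
p_logit = 1.0 / (1.0 + np.exp(-X @ beta))

gap = abs(y.mean() - p_logit.mean())
print("Simulated choice share:", y.mean())
print("Average logit probability:", p_logit.mean())
```

The simulated share of agents choosing alternative 1 should be close to the average Logit probability, even though the simulation never uses the logistic CDF to generate choices.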

10.7 Marginal Effects

Because \(F(\cdot)\) is nonlinear, \(\beta_j\) is not itself the marginal effect of \(x_j\). For a continuous regressor \(x_j\), the marginal effect is:

\[\frac{\partial P(y=1\mid x)}{\partial x_j} = f(x'\beta)\,\beta_j\]

where \(f = F'\) is the PDF of the chosen distribution. Since \(f(x'\beta)\) varies with \(x\), there are several conventions:

Concept	Formula
ME at the mean (MEM)	\(f(\bar{x}'\hat\beta)\,\hat\beta_j\)
Average marginal effect (AME)	\(\frac{1}{n}\sum_i f(x_i'\hat\beta)\,\hat\beta_j\)
ME at representative values	evaluated at chosen \(x^*\)

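The AME is typically preferred in practice because it averages the effect over the observed distribution of \(x\), rather than evaluating it at a single point (\(\bar{x}\)) that may not correspond to any actual observation.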

Sign vs. magnitude

In Probit/Logit, \(\hat\beta_j\) determines only the sign of the marginal effect (since \(f>0\) everywhere). The magnitude requires evaluating \(f(\cdot)\) at specific values of \(x\).
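The MEM and AME formulas translate directly into code. The following is a minimal sketch for a Logit, where \(f(z) = \Lambda(z)(1-\Lambda(z))\). The coefficient vector `beta_hat` is a hypothetical stand-in for estimates from a fitted model, and the regressors are simulated. A finite-difference check confirms that the AME formula matches the average change in predicted probability.

```python
import numpy as np

def logistic_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_pdf(z):
    p = logistic_cdf(z)
    return p * (1.0 - p)

rng = np.random.default_rng(2)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])

# Hypothetical estimates (assumed, not from real data)
beta_hat = np.array([-0.4, 0.8, -1.2])
j = 1  # column index of the regressor of interest

# Marginal effect at the mean: f(xbar' beta) * beta_j
mem = logistic_pdf(X.mean(axis=0) @ beta_hat) * beta_hat[j]

# Average marginal effect: mean of f(x_i' beta) * beta_j
ame = np.mean(logistic_pdf(X @ beta_hat)) * beta_hat[j]

# Finite-difference check of the AME
h = 1e-6
X_up = X.copy()
X_up[:, j] += h
ame_fd = np.mean((logistic_cdf(X_up @ beta_hat) - logistic_cdf(X @ beta_hat)) / h)

print("MEM:", mem)
print("AME:", ame)
print("AME (finite difference):", ame_fd)
```

For a Probit, the same code applies after replacing `logistic_cdf` and `logistic_pdf` with the standard normal CDF and PDF.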

10.8 Goodness of Fit

The usual OLS \(R^2\) is not a meaningful measure of fit for binary models estimated by MLE. Common alternatives:

Pseudo-\(R^2\) (McFadden): \(1 - \ell(\hat\beta)/\ell(\hat\beta_0)\), where \(\ell(\hat\beta_0)\) is the log-likelihood of the intercept-only model.
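Percent correctly classified: fraction of observations where \(\hat{y}_i = y_i\), with \(\hat{y}_i = 1\) if \(\hat{p}_i > 0.5\) and \(\hat{y}_i = 0\) otherwise.
AUC (area under the ROC curve): overall discrimination ability, i.e., the probability that a randomly chosen observation with \(y=1\) receives a higher predicted probability than a randomly chosen observation with \(y=0\).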
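These three measures can be computed from the outcomes and the fitted probabilities alone. The sketch below uses a tiny hand-made example; the vector `p_hat` is a hypothetical stand-in for fitted probabilities (it does not come from an actual MLE fit, so the pseudo-\(R^2\) value is purely illustrative). The AUC is computed by comparing all pairs of positive and negative observations, with ties counted as one half.

```python
import numpy as np

def log_lik(y, p):
    # Bernoulli log-likelihood
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def mcfadden_r2(y, p_hat):
    # The intercept-only model predicts the sample mean for everyone
    ll_null = log_lik(y, np.full_like(p_hat, y.mean()))
    return 1.0 - log_lik(y, p_hat) / ll_null

def pct_correct(y, p_hat, threshold=0.5):
    return np.mean((p_hat > threshold) == (y == 1))

def auc(y, p_hat):
    pos = p_hat[y == 1]
    neg = p_hat[y == 0]
    diff = pos[:, None] - neg[None, :]      # all positive/negative pairs
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Hypothetical outcomes and fitted probabilities
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
p_hat = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

r2 = mcfadden_r2(y, p_hat)
pc = pct_correct(y, p_hat)
area = auc(y, p_hat)
print("McFadden pseudo-R2:", r2)
print("Percent correctly classified:", pc)
print("AUC:", area)
```

The pairwise AUC computation is quadratic in the sample size, which is fine for illustration; rank-based formulas are the usual choice for large samples.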

10.9 Hypothesis Testing

All standard MLE tests apply (see B1. MLE Theory):

Likelihood Ratio (LR) test: \(LR = 2[\ell(\hat\beta) - \ell(\hat\beta_{\text{restricted}})] \xrightarrow{d} \chi^2_q\) under the null hypothesis, where \(q\) is the number of restrictions
Wald test: based on asymptotic normality of \(\hat\beta\)
Score (LM) test: based on the score evaluated at the restricted estimate
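As an illustration, the LR and Wald tests of a single exclusion restriction (\(q=1\)) can be sketched for a Logit on simulated data. The helper `fit_logit`, the sample size, and the true coefficients are assumptions made for this example. With one restriction, the \(\chi^2_1\) p-value can be written with the complementary error function, \(P(\chi^2_1 > s) = \operatorname{erfc}(\sqrt{s/2})\), so no statistics library is needed.

```python
import math
import numpy as np

def fit_logit(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson logit fit; returns estimates, log-likelihood, covariance."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = (X * (p * (1 - p))[:, None]).T @ X   # information matrix
        step = np.linalg.solve(info, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    cov = np.linalg.inv((X * (p * (1 - p))[:, None]).T @ X)
    return beta, loglik, cov

# Simulated data; the true coefficients are hypothetical
rng = np.random.default_rng(3)
n = 2000
X_full = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([-0.3, 1.0, 0.8])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X_full @ beta_true))).astype(float)

# H0: the coefficient on the last regressor is zero (q = 1)
beta_u, ll_u, cov_u = fit_logit(X_full, y)
beta_r, ll_r, _ = fit_logit(X_full[:, :2], y)

lr = 2.0 * (ll_u - ll_r)
wald = beta_u[2] ** 2 / cov_u[2, 2]

# Chi-square(1) upper-tail probabilities
p_lr = math.erfc(math.sqrt(lr / 2.0))
p_wald = math.erfc(math.sqrt(wald / 2.0))
print("LR statistic:", lr, "p-value:", p_lr)
print("Wald statistic:", wald, "p-value:", p_wald)
```

Because the excluded regressor has a nonzero coefficient in the simulated data, both tests should reject the null. The two statistics are asymptotically equivalent under the null but generally differ in finite samples.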

10.10 References

Cameron and Trivedi (2005), chapters 14–15. Davidson and MacKinnon (2004), chapter 11.