Generate Data with Covariates and Binary Endogenous Regressor

Simulates data from a grouped-instrument design that includes discrete covariates and a binary endogenous regressor generated by a threshold-crossing model. The design features essential heterogeneity: the random treatment coefficient is correlated with the first-stage selection error.

GenData_cov(
  S = 5,
  p1 = 7/8,
  Het = 3,
  sigeps = 0.5,
  sigev = 0.1,
  beta = 0,
  beta0 = 0,
  omega = 0.1,
  K = 20,
  c = 5
)

Arguments

S: Numeric. Concentration parameter $\mu^2$, scaling the instrument strength.
p1: Numeric (0 to 1). Probability parameter controlling the correlation between the first-stage error $v$ and the treatment heterogeneity $\xi$.
Het: Numeric. Heterogeneity parameter, scaling the magnitude of $\xi$.
sigeps: Numeric. Standard deviation of the structural error $\varepsilon$.
sigev: Numeric. Coefficient governing the correlation between $\varepsilon$ and $v$.
beta: Numeric. Average Treatment Effect (ATE).
beta0: Numeric. Null hypothesis value for $\beta$.
omega: Numeric. Scaling parameter for the covariate effects $\gamma$.
K: Integer. Number of instrument groups minus one.
c: Integer. Number of observations per group.

Value

A data frame containing:

group: Overall observation index.
groupZ: Instrument group identifier (the effective Z).
groupW: Covariate stratum identifier.
pi: Instrument effect $\pi$.
gammad: Covariate effect $\gamma$.
v: Latent first-stage variable (Uniform).
xi: Random treatment effect heterogeneity.
X: Binary endogenous regressor (-1 or 1).
Y: Outcome variable.
e: Residual under the null.
MX, Me, MY: The variables X, e, and Y premultiplied by the annihilator matrix $M = I - P$, i.e. residualized with respect to the projection $P$.
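As a rough illustration of what premultiplying by $M = I - P$ does, the sketch below assumes that $P$ projects onto covariate-stratum dummies. That choice of $P$, and the toy vectors `x` and `groupW`, are assumptions made for illustration; this page does not state which projection the function uses. Under that assumption, applying $M$ is the same as subtracting the stratum mean.

```r
# Toy data: 6 observations in 2 strata (illustrative values only)
groupW <- rep(1:2, each = 3)
x <- c(1, -1, 1, -1, -1, 1)

# Assumed projection: P projects onto the stratum dummies W
W <- model.matrix(~ 0 + factor(groupW))
P <- W %*% solve(crossprod(W), t(W))
M <- diag(length(x)) - P

# Residualized variable via the annihilator
Mx <- drop(M %*% x)

# Equivalent computation: subtract the within-stratum mean
Mx_alt <- x - ave(x, groupW)
```

Both `Mx` and `Mx_alt` sum to zero within each stratum, which is a quick check that the residualization worked.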

Details

The Data Generating Process (DGP) is structured as follows:

Design Matrix:

Covariates (W): Defined by pairs of instrument groups (e.g., strata).
Instruments (Z): Nested within covariates (e.g., judges within years).
Weights: The UJIVE weighting matrix $G$ is computed as $U(P_{[Z,W]}) - U(P_W)$.
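The design described above can be sketched in base R. This is a minimal sketch, not the function's own code: the group sizes are hypothetical, and $U(\cdot)$ is assumed to be the usual UJIVE leave-one-out transformation, $U(P) = (I - D_P)^{-1}(P - D_P)$ with $D_P$ the diagonal of $P$, since this page does not define it.

```r
# Hypothetical small design: 4 instrument groups of 3 observations each,
# paired into 2 covariate strata (sizes are illustrative only)
n_groups <- 4
n_per_group <- 3
groupZ <- rep(seq_len(n_groups), each = n_per_group)
groupW <- ceiling(groupZ / 2)  # groups 1-2 form stratum 1, groups 3-4 stratum 2

# Dummy matrices. Because instrument groups are nested within strata,
# the group dummies already span the column space of [Z, W].
W  <- model.matrix(~ 0 + factor(groupW))
ZW <- model.matrix(~ 0 + factor(groupZ))

# Orthogonal projection onto the column space of A
proj <- function(A) A %*% solve(crossprod(A), t(A))

# Assumed definition of U(.): zero out the diagonal of P and
# rescale row i by 1 / (1 - P_ii)
U <- function(P) {
  d <- diag(P)
  (P - diag(d)) / (1 - d)
}

G <- U(proj(ZW)) - U(proj(W))
```

Under these assumptions `G` has a zero diagonal and rows that sum to zero, so each observation is weighted only against other observations: positively against those in its own instrument group and negatively against those in the other group of the same stratum.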

Structural Equations:

$$X_i = 2 \cdot \mathbb{I}(v_i < \pi_{g(i)} + \gamma_{d(i)}) - 1$$

$$Y_i = X_i (\beta + \xi_i) + \gamma_{d(i)} + \varepsilon_i$$

Heterogeneity: The regressor $X$ is binary, taking the values $-1$ and $1$. The random slope $\xi_i$ is drawn conditionally on the first-stage latent variable $v_i$, which induces correlation between selection into treatment and treatment gains (essential heterogeneity).
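The two structural equations and the dependence of $\xi_i$ on $v_i$ can be sketched as follows. The rule used to draw `xi` given `v`, the names `het_scale` and `p_align`, and all numeric values are hypothetical and chosen only to show the mechanism; they are not the function's actual conditional distribution. For simplicity the structural error is drawn independently of `v` here.

```r
set.seed(1)
n <- 2000

# Hypothetical first-stage inputs; pi + gamma is kept inside (0, 1)
beta    <- 0
pi_g    <- rep(c(0.3, 0.6), length.out = n)  # instrument effect by group
gamma_d <- rep(0.1, n)                       # covariate effect
v       <- runif(n)                          # latent first-stage variable

# Threshold-crossing first stage: X is -1 or 1
X <- 2 * as.numeric(v < pi_g + gamma_d) - 1

# Hypothetical rule (an assumption, not the package's rule):
# xi has magnitude het_scale, and its sign matches the sign of (0.5 - v)
# with probability p_align, so xi depends on v
het_scale <- 3
p_align   <- 7 / 8
same_sign <- ifelse(rbinom(n, 1, p_align) == 1, 1, -1)
xi <- het_scale * same_sign * ifelse(v < 0.5, 1, -1)

# Outcome equation; eps is independent of v in this sketch
eps <- rnorm(n, sd = 0.5)
Y <- X * (beta + xi) + gamma_d + eps
```

Inspecting `cor(xi, v)` and `cor(xi, X)` in this sketch shows that units with low `v`, which are more likely to select `X = 1`, also tend to have larger `xi`. That is the sense in which selection and treatment gains are linked.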