# Gaussian Processes with Structured Latent Confounders

`GPSLC.jl`

is a Julia package for semi-parametric causal effect estimation with structured latent confounding. It provides interfaces for performing causal inference over the latent variables and Gaussian process parameters to produce accurate causal effect estimates.

The original GP-SLC paper can be found here: http://proceedings.mlr.press/v119/witty20a/witty20a.pdf.

# Inference

`GPSLC.gpslc`

— Function```
gpslc(filename * ".csv")
gpslc(filename * ".csv"; hyperparams=hyperparams, priorparams=priorparams))
gpslc(DataFrame(X1=...,X2=...,T=...,Y=...,obj=...))
gpslc(DataFrame(X1=...,X2=...,T=...,Y=...,obj=...); hyperparams=hyperparams, priorparams=priorparams)
```

Run posterior inference on the input data.

Datatypes of DataFrame or CSV must follow these standards:

`T`

(Boolean/Float64)`Y`

(Float64)`X1...XN`

(Float64...Float64)`obj`

(Any)

Optional parameters

`hyperparams::`

`HyperParameters`

=`getHyperParameters`

`()`

: Hyper parameters primarily define the high level amount of inference to perform.`priorparams::`

`PriorParameters`

=`getPriorParameters`

`()`

: Prior parameters define the high level priors to draw from when constructing kernel functions and latent confounder structure.

Returns a `GPSLCObject`

which stores the hyperparameters, prior parameters, data, and posterior samples.

The primary struct that we provide interfaces for is the `GPSLCObject`

, which most of the high-level functions like the treatment effect functions take as input in addition to their other arguments.

# Treatment Effects

## Individual Treatment Effect (ITE)

A contribution of the original GP-SLC paper is to produce accurate individual treatment effect estimates, conditioned on the observed data and using the inferred values of the latent confounders as determined by the provided structure.

`GPSLC.sampleITE`

— Function```
sampleITE(g, doT)
sampleITE(g, doT; samplesPerPosterior=10)
```

Estimate Individual Treatment Effect with GPSLC model

Params:

`g::`

`GPSLCObject`

: Contains data and hyperparameters`doT`

: The requested intervention (e.g. set all treatments to 1.0)`samplesPerPosterior`

: How many ITE samples to draw per posterior sample in`g`

.

Returns:

`ITEsamples`

: `n x m`

matrix where `n`

is the number of individuals, and `m`

is the number of samples.

## Sample Average Treatment Effect (SATE)

Another popular and useful treatment effect estimate is SATE, which averages individual treatment effects over the individuals in the sample.

`GPSLC.sampleSATE`

— Function```
sampleSATE(g, doT)
sampleSATE(g, doT; samplesPerPosterior=10)
```

Estimate Sample Average Treatment Effect with GPSLC model

Using `sampleITE`

, samples can be drawn for the sample average treatment effect

Params:

`g::`

`GPSLCObject`

: Contains data and hyperparameters`doT`

: The requested intervention (e.g. set all treatments to 1.0)`samplesPerPosterior`

: How many samples to draw per posterior sample in`g`

.

Returns:

`SATEsamples`

: `n x m`

matrix where `n`

is the number of individuals, and `m`

is the number of samples.

## Counterfactual Effects

It can be helpful to calculate treatment effect estimates for the whole domain of treatment values in the data, or some subset, as in the example below. For this we can use `predictCounterfactualEffects`

, which also tracks the values of the `doT`

intervention values for comparison.

`GPSLC.predictCounterfactualEffects`

— Function```
predictCounterfactualOutcomes(g, nSamplesPerMixture)
predictCounterfactualOutcomes(g, nSamplesPerMixture; fidelity=100)
predictCounterfactualOutcomes(g, nSamplesPerMixture; fidelity=100, minDoT=0, maxDoT=5)
```

Params

`g::`

`GPSLCObject`

: The`GPSLCObject`

that inference has already been computed for.`nSamplesPerMixture::Int64`

: The number of outcome samples to

draw from each set of inferred posterior parameters.

`fidelity::Int64`

: How many intervention values to use to cover the domain of treatment values. Higher means more samples.`minDoT::Float64=min(g.T...)`

: The lowest interventional treatment to use.Defaults to the data`g.T`

's lowest treatment value.`maxDoT::Float64=max(g.T...)`

: The highest interventional treatment to use. Defaults to the data`g.T`

's highest treatment value.

`julia> ite, doT = predictCounterfactualEffects(g, 30; fidelity=100)`

Returns

`ite::Matrix{Float64}`

: An array of size`[d, n, numPosteriorSamples * nSamplesPerMixture]`

where d is the number of interventional values defined by`fidelity`

and the range of treatments in`g.T`

-`doTrange::Vector{Float64}`

: The list values of doT used, in order that matches the rows of`ite`

.

## Summarizing

It can be helpful to summarize the inferred individual treatment effects and sample average treatment effects into mean and credible intervals. A use case for this is demonstrated in the examples section.

`GPSLC.summarizeEstimates`

— Function```
summarizeEstimates(samples)
summarizeEstimates(samples; savetofile="ite_samples.csv")
```

Summarize Predicted Estimates (Counterfactual Outcomes or Individual Treatment Effects)

Create dataframe of mean, lower and upper quantiles of the samples from `sampleITE`

or `predictCounterfactualEffects`

.

Params:

`samples`

: The`n x m`

array of samples from sampleSATE or sampleITE`savetofile::String`

: Optionally save the resultant DataFrame as CSV to the filename passed`credible_interval::Float64`

: A real in [0,1] where 0.90 is the default for a 90% credible interval

Returns:

`df`

: Dataframe of Individual, Mean, LowerBound, and UpperBound values for the credible intervals around the sample.

# Examples

The examples below illustrate use cases for

- setting hyperparameters,
- performing inference,
- saving inference results,
- predicting counterfactual effects,
- calculating sample average treatment effect (SATE),
- computing credible intervals for SATE,
- and plotting those intervals relative to the original data

## New England Energy Consumption

This example creates an example plot of the NEEC treatment vs outcome data. Plots the original and the intervened data together. The example below is similar to Figure 3 in the original GP-SLC paper.

```
import Random # hide
Random.seed!(1234) # hide
using GPSLC # hide
using Plots # hide
using Statistics # hide
# set hyperparameters
hyperparams = getHyperParameters()
hyperparams.nOuter = 25
hyperparams.nU = 2
hyperparams.nMHInner = 3
hyperparams.nESInner = 5
# run inference
dataFile = "../example_data/NEEC_sampled.csv"
g = gpslc(dataFile; hyperparams=hyperparams)
saveGPSLCObject(g, "exampleGPSLCObject")
# collect counterfactual outcomes
maIdx = vec(g.obj .== "MA")
nSamples = 100
ite, doT = predictCounterfactualEffects(g, nSamples)
maITE = ite[:, maIdx, :]
# get credible interval on counterfactual outcomes
sate = mean(maITE, dims=2)[:, 1, :]
interval = summarizeEstimates(sate)
meanOutcome = mean(g.Y[maIdx])
lowerSATE = interval[!, "LowerBound"]
meanSATE = interval[!, "Mean"]
upperSATE = interval[!, "UpperBound"]
# plot outcomes and credible interval
treatmentScale = 100
outcomeScale = 10
# observed data
plot(legend=:outertopright, size=(750, 400), margin=0.5Plots.cm, dpi=600)
obsT = g.T[maIdx] .* treatmentScale
obsY = g.Y[maIdx] .* outcomeScale
scatter!(obsT, obsY, label="MA obs", markershape=:circle)
# counterfactual
Tcf = doT .* treatmentScale
Ycf = (meanOutcome .+ meanSATE) .* outcomeScale
upper = (upperSATE .- meanSATE) .* outcomeScale
lower = (meanSATE .- lowerSATE) .* outcomeScale
plot!(Tcf, Ycf, label="MA cf", color=:green,
ribbon=(lower, upper))
xlabel!("Temperature °F")
ylabel!("Energy Consumption (GWh)")
title!("Energy Consumption for Massachusetts")
```

Above we can see the Gaussian Process using individual treatment effect estimates to predict the energy consumption (outcome) from the temperature (treatment) for Massachusetts. The shaded region is a 90% credible interval from the samples taken by `predictCounterfactualEffects`

and processed by `summarizeEstimates`

which calculates the credible intervals by computing the 5*th* and 95*th* percentiles of the samples.

The data used in this example can be found here.

# Types

## External Types

External types are those relevant for using `GPSLC.jl`

in a high-level way, where inference and prediction are automatically performed.

### HyperParameters

`GPSLC.HyperParameters`

— Type`HyperParameters`

Define the high-level attributes of the inference procedure. More information on each of the attributes can be found in `getHyperParameters`

.

The default values for `HyperParameters`

are provided by

`GPSLC.getHyperParameters`

— Function`getHyperParameters()`

Returns default values for hyperparameters

`nU = 1`

: Number of latent confounding variables assumed to be influencing all the instances that belong to one object. Inference will be performed over these values.`nOuter = 20`

: Number of posterior samples to draw.`nMHInner = 5`

: Number of internal Metropolis-Hastings updates to make per posterior sample.`nESInner = 5`

: Number of elliptical-slice sampling updates to make per posterior for latent confounders and binary treatment.`nBurnIn = 5`

: Number of posterior samples to discard when making predictions and estimates.`stepSize = 1`

: How frequently to use posterior samples (1 being every one after burnIn, higher being every`stepSize`

*th*).`predictionCovarianceNoise=1e-10`

: Predicting with Gaussian processes requires use of covariance matrices that are Symmetric Positive Definite, and this covariance noise on the diagonal ensures these operations can be performed in a stable and consistent way.

### PriorParameters

`GPSLC.PriorParameters`

— Type`PriorParameters`

A dictionary of shapes and scales for various Inverse Gamma distributions used as priors for kernel parameters and other parameters. More information on each of the attributes can be found in `getPriorParameters`

.

The default values for `PriorParameters`

are provided by

`GPSLC.getPriorParameters`

— Function`getPriorParameters()`

These are standard values for scale and shape of Inverse Gamma priors over kernel parameters, confounder structure covariance noise, and confounder Gaussian prior covariance.

`uNoiseShape::Float64=4.0`

: shape of the InvGamma prior over the noise of U`uNoiseScale::Float64=4.0`

: scale of the InvGamma prior over the noise of U`xNoiseShape::Float64=4.0`

: shape of the InvGamma prior over the noise of X`xNoiseScale::Float64=4.0`

: scale of the InvGamma prior over the noise of X`tNoiseShape::Float64=4.0`

: shape of the InvGamma prior over the noise of T`tNoiseScale::Float64=4.0`

: scale of the InvGamma prior over the noise of T`yNoiseShape::Float64=4.0`

: shape of the InvGamma prior over the noise of Y`yNoiseScale::Float64=4.0`

: scale of the InvGamma prior over the noise of Y`xScaleShape::Float64=4.0`

: shape of the InvGamma prior over kernel scale of X`xScaleScale::Float64=4.0`

: scale of the InvGamma prior over kernel scale of X`tScaleShape::Float64=4.0`

: shape of the InvGamma prior over kernel scale of T`tScaleScale::Float64=4.0`

: scale of the InvGamma prior over kernel scale of T`yScaleShape::Float64=4.0`

: shape of the InvGamma prior over kernel scale of Y`yScaleScale::Float64=4.0`

: scale of the InvGamma prior over kernel scale of Y`uxLSShape::Float64=4.0`

: shape of the InvGamma prior over kernel lengthscale of U and X`uxLSScale::Float64=4.0`

: scale of the InvGamma prior over kernel lengthscale of U and X`utLSShape::Float64=4.0`

: shape of the InvGamma prior over kernel lengthscale of U and T`utLSScale::Float64=4.0`

: scale of the InvGamma prior over kernel lengthscale of U and T`xtLSShape::Float64=4.0`

: shape of the InvGamma prior over kernel lengthscale of X and T`xtLSScale::Float64=4.0`

: scale of the InvGamma prior over kernel lengthscale of X and T`uyLSShape::Float64=4.0`

: shape of the InvGamma prior over kernel lengthscale of U and Y`uyLSScale::Float64=4.0`

: scale of the InvGamma prior over kernel lengthscale of U and Y`xyLSShape::Float64=4.0`

: shape of the InvGamma prior over kernel lengthscale of X and Y`xyLSScale::Float64=4.0`

: scale of the InvGamma prior over kernel lengthscale of X and Y`tyLSShape::Float64=4.0`

: shape of the InvGamma prior over kernel lengthscale of T and Y`tyLSScale::Float64=4.0`

: scale of the InvGamma prior over kernel lengthscale of T and Y`sigmaUNoise::Float64=1.0e-13`

: noise added to matrix to make covariance stable and invertible`sigmaUCov::Float64=1.0`

: assumed covariance over structured confounders`drift::Float64=0.5`

: as in the paper, Metropolis Hastings Gaussian Drift

### GPSLCObject

The `GPSLCObject`

is the high-level Julia struct that most of the externally facing interfaces rely on to perform their operations. Since it contains inference samples, the hyperparameters, and the observed data, it is at the center of all post-inference interfaces that manipulate the posterior samples according to the observed data.

This also means that the `GPSLCObject`

contains the result of a large portion of compute time, as well as contains all the relevant data for a given workflow. For this reason, all the fields and functions that utilize it are externally available, and described below, to provide users with a simple way to extend the functionality of `GPSLC.jl`

and estimate other quantities of interest.

`GPSLC.GPSLCObject`

— Type`GPSLCObject`

This is the struct in GPSLC.jl that contains the data, hyperparamters, prior parameters, and posterior samples. It provides the primary interfaces to abstract the internals of GPSLC away from the higher-order functions like `sampleITE`

, `sampleSATE`

, and `predictCounterfactualEffects`

.

Returned by `gpslc`

#### Retrieving meta-values from a `GPSLCObject`

`GPSLC.getN`

— Function`getN(g)`

Number of individuals in dataset.

`GPSLC.getNX`

— Function`getNX(g)`

Number of covariates (and observed confounders) in dataset.

`GPSLC.getNU`

— Function`getNU(g)`

Number of latent confounders to perform inference over (hyperparameter).

`GPSLC.getNumPosteriorSamples`

— Function`getNumPosteriorSamples(g)`

Number of posterior samples that will be used based on hyperparameters.

`(total posterior samples - burn in) / step size = nBurnIn:stepSize:nOuter`

#### Saving and loading `GPSLCObject`

s

`GPSLCObject`

s contain all the posterior samples, which can be intensive to calculate and can be reused for various estimations, we provide a pair of interfaces to save and load the `GPSLCObjects`

from the filesystem.

`GPSLC.saveGPSLCObject`

— Function```
saveGPSLCObject(g, filename)
saveGPSLCObject(g, "path/to/filename")
saveGPSLCObject(g, "path/to/filename.gpslc")
```

This function will save the `GPSLCObject`

`g`

to the file `<filename>.gpslc`

. This `GPSLCObject`

, including the posterior samples contained within it can be retrieved with the `loadGPSLCObject`

function.

Note: The extension `.gpslc`

is optional and will be added if it is not included.

`GPSLC.loadGPSLCObject`

— Function```
loadGPSLCObject(filename)
loadGPSLCObject("path/to/filename")
loadGPSLCObject("path/to/filename.gpslc")
```

This function will load and return the `GPSLCObject`

contained in `<filename>.gpslc`

.

Note: the extension `.gpslc`

is optional and will be added if it is not included.

## Internal Types

These types are are used in internal inference procedures, so if users need to fine tune or modify the inference procedure, or access the model directly, these will be relevant.

### Confounders

`GPSLC.Confounders`

— Type`Confounders (U)`

Latent confounders that GPSLC performs inference over.

Either 1D or 2D matrices of `Float64`

values.

### Covariates

`GPSLC.Covariates`

— TypeCovariates (X) Observed confounders and covariates.

`Matrix{Float64}`

is the only valid structure for covariates

### Treatment

`GPSLC.Treatment`

— Type`Treatment (T)`

Is made up of `BinaryTreatment`

which is an alias for `Vector{Bool}`

and `ContinuousTreatment`

which is an alias for `Vector{Float64}`

. These types support other vector types to afford compatibility with internal libraries.

### Outcome

`GPSLC.Outcome`

— Type`Outcome (Y)`

The outcome for the series of Gaussian Process predictions is a `Vector{Float64}`

. Currently only continuous values are supported as outcomes for input data.