Calvin and Hobbes, Bill Watterson, April 19, 1988, (source).1
Tetrad includes a huge variety of tools for causal inference. It has been under development since the early 90s. The algorithms in Tetrad were designed by many people, but the vast majority of the implementation was done by Joe Ramsey.
When people say 'causal inference', they mean lots of different things. Here are some of the things you might want to do with Tetrad:

- Learn a causal graph from data.
- Estimate the size of the causal effects in a graph you have learned or hypothesized.
- Generate simulated data from a known causal model.
- Evaluate how well a search algorithm performs when the true causal model is known.
All of these tasks can be called 'causal inference'.
Despite their differences, these tasks share many components. For example, if you're learning a graph or evaluating a search algorithm, you need a search function. Tetrad is modular: it lets you mix and match components to do many different kinds of causal inference. This modularity makes Tetrad powerful, but difficult to understand without first understanding the basic components.
To understand what is possible with Tetrad, let's talk about what it contains.
Tetrad is written in Java, an object-oriented programming language. Tetrad uses the following kinds of objects:2
Causal inference is a scientific discovery problem, so random variables are the basic objects. Variables are identified with "nodes" or "vertices" in causal graphs.3
In other graph software, you first create a graph, then populate it with nodes; if the graph disappears the nodes do too. By contrast, in Tetrad the nodes are basic objects. You can build multiple graphs over the same set of nodes. This represents the scientific problem: we start out knowing what the variables are, and we learn the causal relationships among them.
Variables can be discrete- or continuous-valued in Tetrad. This distinction matters for search algorithms.
How they're made: You create new variables when you load your data into Tetrad, create a random graph, or create a new graph by hand (with no input).
Schematic Example: our set of variables might be {Sunscreen, Temperature, Ice-cream}.
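If you are working with the Tetrad library directly rather than the GUI, variables are plain Java objects that exist independently of any graph. Below is a minimal sketch assuming the edu.cmu.tetrad class names ContinuousVariable and DiscreteVariable; exact packages and constructors may differ between Tetrad versions, so treat it as illustrative.

```java
import java.util.List;

import edu.cmu.tetrad.data.ContinuousVariable;
import edu.cmu.tetrad.data.DiscreteVariable;
import edu.cmu.tetrad.graph.Node;

public class VariablesSketch {
    public static void main(String[] args) {
        // Continuous variables are identified by name and take real values.
        Node sunscreen = new ContinuousVariable("Sunscreen");
        Node temperature = new ContinuousVariable("Temperature");
        Node iceCream = new ContinuousVariable("Ice-cream");

        // A discrete variable is identified by a name plus a list of possible values.
        // ("Season" is not part of the running example; it is here only to show the discrete case.)
        Node season = new DiscreteVariable("Season", List.of("spring", "summer", "autumn", "winter"));

        // The same node objects can later be shared by a dataset and by several graphs.
        System.out.println(List.of(sunscreen, temperature, iceCream, season));
    }
}
```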
Datasets in Tetrad include two parts: a set of variables V, and either a set of observations of all of those variables in the form of a data table X, or else a covariance matrix over the variables Σ.
How they're made: You create a dataset when you load your data into Tetrad, or generate data from an instantiated model.
Schematic Example: our dataset might look something like this table of observations:
Variables: {Sunscreen, Temperature, Ice-cream}
Data:
| Person/Date | Sunscreen | Temperature | Ice-cream |
|---|---|---|---|
| Hemank, June 12 | 0 ml | 32 °C | 150 g |
| Mahdi, June 12 | 15 ml | 32 °C | 120 g |
| Benedict, June 14 | 30 ml | 36 °C | 200 g |
| ... | ... | ... | ... |

Or this correlation matrix:
Variables: {Sunscreen, Temperature, Ice-cream}
Data:
| | Sunscreen | Temperature | Ice-cream |
|---|---|---|---|
| Sunscreen | 1 | 0.3 | 0.12 |
| Temperature | 0.3 | 1 | 0.4 |
| Ice-cream | 0.12 | 0.4 | 1 |
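To make the relationship between the two forms concrete, here is a small self-contained Java sketch (plain Java for illustration, not Tetrad code) that computes a correlation matrix from a data table; the rows are invented for illustration.

```java
public class CorrelationSketch {
    // Each row is one observation of (Sunscreen, Temperature, Ice-cream); values invented.
    static final double[][] DATA = {
            {0, 32, 150},
            {15, 32, 120},
            {30, 36, 200},
            {5, 28, 90},
    };

    public static void main(String[] args) {
        int p = DATA[0].length;
        double[][] corr = new double[p][p];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                corr[i][j] = covariance(i, j) / Math.sqrt(covariance(i, i) * covariance(j, j));
            }
        }
        for (double[] row : corr) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }

    // Sample covariance between columns i and j.
    static double covariance(int i, int j) {
        int n = DATA.length;
        double meanI = 0, meanJ = 0;
        for (double[] row : DATA) { meanI += row[i] / n; meanJ += row[j] / n; }
        double sum = 0;
        for (double[] row : DATA) { sum += (row[i] - meanI) * (row[j] - meanJ); }
        return sum / (n - 1);
    }
}
```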
GUI example:
In this example we'll load a 'mixed' data set, i.e. a data set that contains both discrete and continuous variables.
To create a dataset object in Tetrad, do the following:
1. Place a Data box on the workspace; double-click to open it.
2. Click "File" and then "Load" in the drop-down menu; the data loader window will appear.
3. Choose a file to load.
4. Make sure the loading options are set according to your file's properties and click "Validate".
5. Click "Load" if there are no errors.
6. The loaded data will appear in the data loader window.
A graph G is a set of nodes, V, and a set of edges, E. Each edge has four pieces of information: a pair of nodes and a pair of endpoints, in order. For example, the edge (A, B, -, >) represents the edge A → B, whereas the edge (C, B, >, >) represents the edge C ↔ B. This makes Tetrad's graph representation very flexible: it can represent undirected edges, bidirected edges, unusual endpoint types, etc. The edge A → B can be interpreted as "A has a direct causal effect on B"; the other kinds of edges are explained elsewhere in the manual.
How they're made: There are three ways to create graphs in Tetrad: by hand, using a random graph generator, or using a search algorithm.
Schematic Example: If our causal graph looks like this: Sunscreen ← Temperature → Ice-cream, it would be represented in Tetrad like so:
Variables: {Sunscreen, Temperature, Ice-cream}
Edges: {(Sunscreen, Temperature, >, -), (Temperature, Ice-cream, -, >)}
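Here is a self-contained Java sketch of this four-part edge representation. It is a conceptual illustration (using a Java record for brevity), not Tetrad's actual graph classes; Tetrad's own endpoint types also include circles, which appear in PAGs, for example.

```java
import java.util.List;
import java.util.Set;

// A conceptual sketch of the edge representation described above:
// two nodes plus an endpoint at each node, in order.
public class EdgeSketch {
    enum Endpoint { TAIL, ARROW }   // "-" and ">" in the text; Tetrad also has other endpoint marks.

    record Edge(String node1, String node2, Endpoint end1, Endpoint end2) {}

    public static void main(String[] args) {
        // Sunscreen <- Temperature -> Ice-cream from the schematic example:
        Edge e1 = new Edge("Sunscreen", "Temperature", Endpoint.ARROW, Endpoint.TAIL);
        Edge e2 = new Edge("Temperature", "Ice-cream", Endpoint.TAIL, Endpoint.ARROW);

        List<String> nodes = List.of("Sunscreen", "Temperature", "Ice-cream");
        Set<Edge> edges = Set.of(e1, e2);
        System.out.println("Graph over " + nodes + " with edges " + edges);
    }
}
```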
Why the word "search"? You can think about the discovery problem like this. We start with a set of variables; out of all the graphs you can make with those variables, we are searching for the one graph that describes the true causal relationships between those variables.
How many graphs are we looking through?
Number of variables | Number of Directed Acyclic Graphs |
---|---|
1 | 1 |
2 | 3 |
3 | 25 |
4 | 543 |
5 | 29281 |
6 | 3781503 |
... | ... |
20 | roughly \(10^{72}\), far too many to ever enumerate |
This is why we need an algorithm to search, rather than inspecting all the graphs by hand. Search algorithms use various tricks to find the answer quickly, without inspecting every single graph.
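The counts in the table can be reproduced with Robinson's recurrence for the number of labeled DAGs, \(a(n) = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a(n-k)\) with \(a(0) = 1\). Here is a short self-contained Java sketch (not part of Tetrad) that prints the counts.

```java
import java.math.BigInteger;

// Counts labeled DAGs on n nodes using Robinson's recurrence:
//   a(n) = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k(n-k)) * a(n-k),  a(0) = 1.
public class CountDags {
    public static void main(String[] args) {
        BigInteger[] a = new BigInteger[21];
        a[0] = BigInteger.ONE;
        for (int n = 1; n <= 20; n++) {
            BigInteger sum = BigInteger.ZERO;
            for (int k = 1; k <= n; k++) {
                BigInteger term = binomial(n, k)
                        .multiply(BigInteger.TWO.pow(k * (n - k)))
                        .multiply(a[n - k]);
                sum = (k % 2 == 1) ? sum.add(term) : sum.subtract(term);
            }
            a[n] = sum;
            System.out.println(n + " variables: " + a[n] + " DAGs");
        }
    }

    // C(n, k) computed incrementally; the running product is always an integer.
    static BigInteger binomial(int n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            result = result.multiply(BigInteger.valueOf(n - k + i)).divide(BigInteger.valueOf(i));
        }
        return result;
    }
}
```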
How they're made: A search algorithm is a function: it takes input and produces output. The inputs are:

- A dataset4
- Background knowledge (optional; see the Knowledge section below)
- Settings for the algorithm's parameters (for example, an alpha level)
The output is a graph, or a set of graphs that are equally compatible with the data (a.k.a. an "equivalence class" of graphs). The type of graph you get depends on the type of algorithm you use.
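For readers using the Tetrad library directly, here is a rough sketch of what calling a search looks like in code, using the PC algorithm with a Fisher Z independence test. The class names come from the edu.cmu.tetrad packages, but package locations and constructor signatures have shifted between Tetrad versions, so treat this as an illustrative sketch rather than an exact recipe.

```java
import edu.cmu.tetrad.data.DataSet;
import edu.cmu.tetrad.graph.Graph;
import edu.cmu.tetrad.search.IndTestFisherZ;
import edu.cmu.tetrad.search.Pc;

public class SearchSketch {
    public static Graph runPc(DataSet data) {
        // Input 1: the dataset, wrapped in a conditional-independence test.
        // Input 2: a parameter, here the alpha level for that test.
        IndTestFisherZ test = new IndTestFisherZ(data, 0.05);

        // The PC algorithm; background knowledge could also be attached to the search.
        Pc pc = new Pc(test);

        // Output: a graph (for PC, a CPDAG representing an equivalence class of DAGs).
        return pc.search();
    }
}
```

The DataSet argument is whatever dataset object you loaded earlier; the returned Graph plays the same role as the result shown in the Search box in the GUI.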
GUI example:
1. Put a Search box in the workspace, and add an arrow from the Data box to the Search box.
2. (a) Choose an output type (here a PAG). (b) Choose an algorithm (here GFCI). (c) Choose parameters (here alpha = 0.05, and "one-edge faithfulness" = "no"). (d) Click "Search".
3. Your results will pop up. If you wish, you can drag the variables into a nicer layout. Then click "Done".
As mentioned in the Search Algorithms section, we can use background knowledge as an input to search. Tetrad represents knowledge as a set of variables, a list of forbidden edges5 and a list of required edges.
How they're made: You might think of knowledge as being independent of everything else – that's what makes it "background" knowledge! However, Tetrad won't let you create a knowledge object without giving it input: a dataset or search algorithm that tells it the names of your variables. Only then can you list the forbidden and required edges. It is as if Tetrad is asking, "knowledge about what?"
Schematic Example: Say we know that neither ice-cream nor sunscreen can influence the temperature. We would represent this as a pair of forbidden edges. In Tetrad the knowledge would be represented like so:
Variables: {Sunscreen, Temperature, Ice-cream}
Forbidden Edges: {(Sunscreen, Temperature, -, >), (Ice-cream, Temperature, -, >)}
Required Edges: {}
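In code, the same background knowledge can be built up programmatically. The sketch below assumes a Knowledge class in the edu.cmu.tetrad.data package with setForbidden and tier methods; the class has been named Knowledge or Knowledge2 in different Tetrad versions, so check the API of the version you are using.

```java
import edu.cmu.tetrad.data.Knowledge;

public class KnowledgeSketch {
    public static Knowledge sunKnowledge() {
        Knowledge knowledge = new Knowledge();

        // Neither sunscreen use nor ice-cream consumption can influence the temperature.
        knowledge.setForbidden("Sunscreen", "Temperature");
        knowledge.setForbidden("Ice-cream", "Temperature");

        // Tiers (see footnote 5) forbid many edges at once: nothing in a later tier
        // may cause anything in an earlier tier.
        knowledge.addToTier(0, "Temperature");
        knowledge.addToTier(1, "Sunscreen");
        knowledge.addToTier(1, "Ice-cream");

        return knowledge;
    }
}
```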
Causal graphs only give us qualitative information: which variables causally influence which others. But they don't tell us quantitatively how big the causal effects are. They put constraints on the probability distribution over variables in the graph, but they don't fully specify the probability distribution. For that, we need models.
Causal models add information to the graph: they specify a probability distribution, and the distributions you'd get if you intervened on some of the variables.
We need models for several distinct tasks:

- Estimating the size of causal effects
- Predicting what would happen if we intervened on some of the variables
- Generating simulated data from a known causal structure
Tetrad has two confusing distinctions between types of model object. Here they are in one table:
| | Bayes model | Structural Equation Model (SEM) |
|---|---|---|
| Parametric Model | Graph (DAG) where the nodes are discrete variables, each with a set of possible values | Graph (DAG) where the nodes are continuous variables (means and variances initialized but not assigned values), plus a set of linear parameters (coefficients initialized but not assigned values) |
| Instantiated Model | Probabilities assigned to the possible values of each variable, conditional on its parents in the graph | Values assigned to all parameters of the linear structural equation model (means, variances, and edge coefficients) |
Tetrad distinguishes between parametric models and instantiated models. The parametric model just initializes the object: it's where you decide what kind of model you're going to use (Bayes or SEM parameterization). The instantiated model then assigns values to the model parameters.
"Bayes model" just means the model fits discrete-valued data. It has no special relationship to Bayesian inference6. Tetrad uses the term "Bayes model" only because DAGs for discrete data have been called "Bayes nets" (again, not because they have a special relationship to Bayesian inference).
A Bayes Parametric Model (Bayes PM) object includes a graph, and a set of possible values for every variable in that graph. The graph must be a DAG.
How Bayes PMs are made: You can start with a DAG and a dataset. Tetrad will automatically pull the lists of possible values from the actual values in your data. If you want to generate synthetic data, you can start with just a DAG. Tetrad will ask for a set of possible values for each variable (the default is {0,1}).
A Bayes Instantiated Model (Bayes IM) object includes everything that's in a Bayes PM, plus a set of conditional probability tables – one for each node, conditional on its parents in the DAG.
How Bayes IMs are made: You can start with a Bayes PM and a dataset, in which case Tetrad will estimate the conditional probabilities from your data. If you want to generate synthetic data, you can start with a Bayes PM and specify the conditional probabilities (either by choosing them randomly, or inputting specific values by hand).
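To make the idea of a conditional probability table concrete, here is a self-contained Java sketch (not Tetrad's Bayes PM/IM classes) of the tables a Bayes IM would hold for the fragment Temperature → Sunscreen of the example graph, treating both variables as discrete; the category names and probabilities are invented purely for illustration.

```java
import java.util.Arrays;
import java.util.Map;

// A conceptual sketch of what a Bayes IM adds to a Bayes PM: one conditional
// probability table per node, conditional on its parents in the DAG.
public class CptSketch {
    public static void main(String[] args) {
        // The Bayes PM part: each discrete variable has a list of possible values.
        String[] temperatureValues = {"cool", "hot"};
        String[] sunscreenValues = {"none", "some"};
        System.out.println("Temperature categories: " + Arrays.toString(temperatureValues));
        System.out.println("Sunscreen categories: " + Arrays.toString(sunscreenValues));

        // The Bayes IM part. Temperature has no parents, so its table is a marginal distribution.
        double[] temperatureTable = {0.6, 0.4};              // P(cool), P(hot)

        // P(Sunscreen | Temperature): one row per parent value; each row sums to 1.
        Map<String, double[]> sunscreenCpt = Map.of(
                "cool", new double[]{0.9, 0.1},              // P(none | cool), P(some | cool)
                "hot",  new double[]{0.3, 0.7});             // P(none | hot),  P(some | hot)

        System.out.println("P(Temperature = hot) = " + temperatureTable[1]);
        System.out.println("P(Sunscreen = some | Temperature = hot) = " + sunscreenCpt.get("hot")[1]);
    }
}
```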
"Structural Equation Models" or SEMs are used to fit continuous-valued data, under some assumption about the relationships between the variables. In Tetrad you may fit either "standard" (i.e. linear Gaussian) or "generalized" SEMs.
A linear model means the relationships between the variables can be described with linear equations. For example, if we have the graph X → Y ← Z, we could describe this as the standard SEM parametric model:
\(X = \varepsilon_1 \\ Z = \varepsilon_2 \\ Y = \alpha X + \beta Z + \varepsilon_3\)
Where the errors \(\varepsilon_1, \varepsilon_2, \varepsilon_3\) are independent random variables with Gaussian distributions.
A SEM Parametric Model (SEM PM) includes the graph, plus a list of all parameters needed to specify the probability distribution. In this example, the parameters are \(\alpha, \beta, \mu_{\varepsilon_1}, \mu_{\varepsilon_2}, \mu_{\varepsilon_3}, \sigma_{\varepsilon_1}, \sigma_{\varepsilon_2}\), and \(\sigma_{\varepsilon_3}\). However, in the SEM PM object the values of those parameters are undetermined; the values are specified in the instantiated model (see below).
If you choose a generalized SEM PM, you have the freedom to specify non-linear relationships between parent and child variables, and a non-Gaussian distribution for each node. For example, you might say one variable is related to its parent by a quadratic equation: \(Y = \alpha X + \beta X^2 + \varepsilon\). You could specify that the error term \(\varepsilon\) had, say, a Uniform(0,1) distribution.
Note: Although generalized SEM PMs give you more freedom than standard SEM PMs, they require you to make more decisions. You must specify the parametric form of the distribution. If you don't, Tetrad cannot learn the model from data. Right now there are no nonparametric model fitting methods in Tetrad.
How SEM PMs are made: All you need is a DAG. If you choose a standard SEM PM, Tetrad can generate the list of parameters from the DAG structure alone. If you choose a generalized SEM PM, you must also specify your parametric model.
The SEM Instantiated Model (SEM IM) assigns values to all those parameters – in this example, to \(\alpha\) and \(\beta\), and to the means and variances of the errors \(\boldsymbol{\varepsilon}\).
How SEM IMs are made: You can start with a SEM PM and a dataset, in which case Tetrad will estimate the model parameters from your data. If you want to generate synthetic data, you can start with a SEM PM and specify the parameter values (either by choosing them randomly, or inputting specific values by hand).
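As a concrete illustration of what "generating synthetic data from a SEM IM" amounts to, here is a self-contained Java sketch (not Tetrad code) that simulates from the linear model X → Y ← Z above. The parameter values (alpha = 1.2, beta = -0.7, unit error variances) are invented for illustration; in Tetrad they would either be estimated from data or chosen when you build the SEM IM.

```java
import java.util.Random;

// Simulate a few observations from X = e1, Z = e2, Y = alpha*X + beta*Z + e3,
// with independent standard Gaussian errors e1, e2, e3.
public class SemSimulationSketch {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double alpha = 1.2;
        double beta = -0.7;

        // Each row is one simulated observation of (X, Z, Y).
        for (int i = 0; i < 5; i++) {
            double x = rng.nextGaussian();                        // X = e1
            double z = rng.nextGaussian();                        // Z = e2
            double y = alpha * x + beta * z + rng.nextGaussian(); // Y = alpha*X + beta*Z + e3
            System.out.printf("%8.3f %8.3f %8.3f%n", x, z, y);
        }
    }
}
```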
There are five other modules that I won't talk about here. See these other sections of the manual for more information:
Say you start with data, and you want to learn a causal model and estimate the size of the causal effects. Your workflow or "pipeline" would look like the following schema.
But take note: This schema describes what's happening inside the Tetrad library. In the graphical interface, some steps may be combined. For example, in the current7 version of the Tetrad GUI, steps 4, 5 and 6 are grouped into a single box.
In text form:
You should also perform some sanity checks along the way:
After running the search algorithm: does the output graph look plausible, based on your background knowledge about the causal system? What changes if you use different search settings?
After estimating the model parameters: do the parameters look plausible? What changes if you choose a different graph from the equivalence class?
Tetrad is a modular, object-oriented program for causal inference. "Causal inference" includes a variety of tasks; Tetrad objects can be combined in various ways to accomplish many of those tasks. This tutorial describes some of the most important objects in Tetrad. It is meant to be schematic yet independent of Tetrad's graphical user interface (which may change in the future). I have included an example of one pipeline – one way of combining Tetrad objects to achieve a particular aim – but that is only the beginning of what is possible with Tetrad.
This tutorial is an introduction to the Tetrad software. For an introduction to causal inference in general, and guidance on interpreting your results, see the companion tutorial.
This comic is under copyright, held by Universal Uclick. We believe our use of the material is covered under Fair Use for three reasons: (1) The purpose of the use is education, not profit. (2) The portion of the work used is tiny relative to the whole corpus of Calvin and Hobbes comics (one panel of one strip). (3) The use of this panel will have no effect on the market value of Calvin and Hobbes. However, should Universal Uclick disagree with our judgment and ask us to remove the comic from this documentation, we will comply. ↩
For brevity, this is a simplified version of Tetrad's ontology, emphasizing the objects that you see in the GUI, and their dependencies. If you want to learn what's really under the hood you can look at the Tetrad library source code in the Git repository. ↩
In the guts of Tetrad there are differences between node objects and variables, and what you're using depends on whether you load data first or define a graph and generate data from it. These details should not matter to the user. ↩
You may instead use some kind of 'oracle', which gives the algorithm the information that it would normally estimate from the dataset (e.g. conditional independence facts). This is useful if you're trying to figure out how the algorithms perform when given perfect information. ↩
We can also use tiers to forbid many edges at once. This is often useful, for example, if you have time-ordered measurements, and you want to prevent any edges going back in time. For more information look at the module on Knowledge. ↩
Of course you can learn a Bayes model using Bayesian updating. However, you can also learn a Structural Equation Model using Bayesian updating. ↩
Current as of 10/21/2016. ↩