Examples of Dependence
Throughout this post I stick to the so-called bivariate case, considering the dependence between just two variables \(X\) and \(Y\), often denoted in vector form \((X, Y)\). We typically have some sample of observations \((x_i, y_i)\), \(i = 1, \dots, n\), for this pair of variables and want to examine this sample to see if we can draw any solid conclusions concerning the relationship between \(X\) and \(Y\).
To provide a motivating example I had a look through some of the data sets which come pre-loaded into \(\texttt{R}\). In doing so I stumbled upon the \(\texttt{swiss}\) data set, which contains some socio-economic indicators for 47 districts in Switzerland, recorded in 1888. Two of these indicators are
- % of males working in agriculture
- % in education beyond primary school
Below is a scatter plot of the data with respect to these two variables.
There seems to be some form of negative dependence, with districts having a higher % in agriculture often having a low % in education. Similarly, when the % in agriculture is lower there is a tendency for the districts to have a higher % in education, though in the latter case this seems less clear-cut, with some districts still having a low % in education.
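One natural way to quantify this sort of monotone dependence is a rank-based measure such as Spearman's rho. The \(\texttt{swiss}\) data ships with \(\texttt{R}\), so as a rough sketch here I simulate a synthetic stand-in with a similar negative pattern (all numbers below are made up for illustration, not taken from the real data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 47  # same number of districts as the swiss data

# Synthetic stand-in: % of males in agriculture, and a noisy
# negatively-sloped % in education (purely illustrative numbers)
agriculture = rng.uniform(1, 90, size=n)
education = 40 - 0.35 * agriculture + rng.normal(0, 5, size=n)

def ranks(v):
    # Rank of each value: its position when the sample is sorted (1-based)
    return np.argsort(np.argsort(v)) + 1

# Spearman's rho is just the Pearson correlation of the ranks
rho = np.corrcoef(ranks(agriculture), ranks(education))[0, 1]
print(f"Spearman rho = {rho:.2f}")  # strongly negative
```

Because it only looks at ranks, this measure picks up any monotone relationship, not just a linear one, which is a theme that recurs below.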
As an alternative example I have also simulated some data to illustrate the dependence between energy consumption and temperature discussed in my previous post. Each observation represents a daily measurement of energy consumption and temperature. Here we see that energy consumption is higher both when temperatures are low and high, with it decreasing between these two extremes. The reasoning here is that when temperatures are low many households will be using energy to power heating, whilst when they are high they will likewise be using it to power air conditioning. What we have here is non-linear dependence between the two variables.
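This U-shaped pattern is a nice illustration of why linear correlation alone can be misleading. A minimal sketch, with simulated data and an assumed "comfort point" of 15°C below which heating kicks in and above which cooling does:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 365
temp = rng.uniform(-5, 35, size=n)  # daily temperature (degrees C)
comfort = 15                        # assumed no-heating/no-cooling point

# Energy use grows as temperature moves away from the comfort point,
# in either direction, plus some day-to-day noise
energy = 20 + 1.5 * np.abs(temp - comfort) + rng.normal(0, 2, size=n)

# The linear (Pearson) correlation nearly vanishes on the U-shape...
linear = np.corrcoef(temp, energy)[0, 1]
# ...but conditioning on distance from the comfort point reveals
# the strong dependence that was there all along
distance = np.corrcoef(np.abs(temp - comfort), energy)[0, 1]
print(f"corr(temp, energy) = {linear:.2f}")
print(f"corr(|temp - 15|, energy) = {distance:.2f}")
```

The first correlation is close to zero even though energy consumption is clearly driven by temperature; the dependence is simply not linear.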
Dissecting The Dependence
In the two examples above the dependence observed is pretty clear. Often, however, there may be dependence present between variables which is not completely obvious at first glance. Take for example the data plotted below (left). This is another data set found in \(\texttt{R}\), though this time it comes within the package \(\texttt{fitdistrplus}\) and has the name \(\texttt{danishmulti}\). The data set was collected at Copenhagen Reinsurance and comprises 2167 fire losses over the period 1980 to 1990. Here I have again produced a scatter plot of the observations, comparing the amount of building and contents coverage for each claim.
We can see that most of the values are below 5, with a few extreme values in one or both of the variables. However, it is quite difficult to see any dependence structure here, at least not as clearly as in the previous examples. We can, however, transform the data to try to uncover the underlying dependence, as in the above plot (right). This is done by replacing the observed values \(x_i\) and \(y_i\) with their ranks, defined as
\(\text{Rank}(x_i) = \#\{\text{observations} \leq x_i\}.\)
I further divided these values by the number of observations, which scales everything to lie in the unit square.
If these two variables were independent then the transformed plot should appear to be a sample from a uniform distribution on the unit square, which in this case it appears not to be! There is a dense collection of observations in the upper right corner, implying that when the claim coverage for contents is high, so is the coverage for the building. There is also the presence of a boundary in the lower left, which can be interpreted as saying that when we get down to the smaller claims we rarely see both types of coverage being small; usually at least one is of a decent size.
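The rank transformation above is easy to sketch in code. The real \(\texttt{danishmulti}\) data lives in the \(\texttt{R}\) package \(\texttt{fitdistrplus}\), so here I use simulated heavy-tailed stand-ins; the helper name `pseudo_obs` is my own, not from any library:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
# Simulated heavy-tailed "losses" as a stand-in for the Danish fire data
building = rng.pareto(2.0, size=n) + 1
contents = building + rng.pareto(2.0, size=n)

def pseudo_obs(v):
    # Rank(x_i) = number of observations <= x_i, then divide by n
    # so every point lands in the unit square
    return (np.argsort(np.argsort(v)) + 1) / len(v)

u, w = pseudo_obs(building), pseudo_obs(contents)

# By construction each transformed margin is exactly uniform on
# the grid 1/n, 2/n, ..., 1 -- only the joint pattern carries
# information about dependence
print(np.allclose(np.sort(u), np.arange(1, n + 1) / n))  # True
```

This is why the transformed plot is so useful: any structure left after the transformation is pure dependence, since the margins have been flattened out.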
The Important Theorem
If we view the above transformed dataset as a sample from some distribution, then it is this distribution which is known as the copula. This is not strictly true, but the formalities aren't worth worrying about here. Essentially the (bivariate) copula is a probability distribution on the unit square which fully describes the dependence structure of the two variables \(X\) and \(Y\).
There is a key theoretical result which justifies this whole perspective, known as Sklar's Theorem. This says that for any pair of variables \(X\) and \(Y\) the joint distribution can be decomposed into two ingredients: the marginal distributions and the copula. If both variables are continuous, as in the examples seen here, then this copula is unique. This copula then contains all the information about the dependence structure.
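Sklar's decomposition can be run in the constructive direction too: pick a copula, pick any margins you like, and glue them together. A minimal sketch using one concrete choice, the Gaussian copula (the parameter `r` and both margins below are arbitrary choices for illustration):

```python
import math
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
r = 0.8  # copula parameter: correlation of the underlying normals

# Step 1: sample from a Gaussian copula by pushing correlated
# normals through the standard normal CDF
z1 = rng.standard_normal(n)
z2 = r * z1 + math.sqrt(1 - r**2) * rng.standard_normal(n)
phi = np.vectorize(lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2))))
u, v = phi(z1), phi(z2)  # (u, v) lives on the unit square

# Step 2: attach whatever margins we like via inverse CDFs
x = -np.log(1 - u)               # Exponential(1) margin
y = np.tan(math.pi * (v - 0.5))  # Cauchy margin: wildly different shape

# The margins changed completely, but both transforms are strictly
# increasing, so the ranks -- and hence the dependence -- are untouched
def spearman(a, b):
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

print(f"{spearman(u, v):.3f} vs {spearman(x, y):.3f}")  # identical
```

This is Sklar's Theorem in miniature: \((x, y)\) and \((u, v)\) have entirely different marginal distributions but exactly the same copula, and any rank-based summary of the dependence agrees between them.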