Correlation
Correlation and Dependence [6]
In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price.
Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling. However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causation).
Correlation Coefficient [2]
A correlation coefficient is a numerical measure of some type of correlation, i.e. a statistical relationship between two or more variables.
Pearson Correlation Coefficient [5]
In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/), also referred to as Pearson's r or the Pearson product-moment correlation coefficient (PPMCC), is a measure of the linear correlation between two variables $X$ and $Y$. It has a value between $+1$ and $-1$, where $+1$ is total positive linear correlation, $0$ is no linear correlation, and $-1$ is total negative linear correlation. It is widely used in the sciences. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.
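As a minimal illustration, Pearson's r can be computed in R with the built-in cor() function; the two vectors below are made-up data used only for this sketch:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson is the default method of cor()
cor(x, y, method = "pearson")   # 0.8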
Time-Series Correlation [4]
The first step is to find a way of measuring how similar two time series are. There are countless ways of doing this, depending on the underlying assumptions about your data. The one most commonly used for these applications is called correlation. The correlation between two functions (or time series) is a measure of how similarly they behave. It can be expressed as:

$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y} $$
with $\sigma_x$ and $\mu_x$ being the standard deviation and the mean of $x$, respectively:

$$ \mu_x = \frac{1}{N} \sum_{t=1}^{N} x_t, \qquad \sigma_x = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left(x_t - \mu_x\right)^2} $$
The mean is simply the average of the whole time series. The standard deviation, instead, indicates how much the points of the series tend to distance themselves from the mean. This quantity is closely related to the variance, defined as:

$$ \sigma_x^2 = \frac{1}{N} \sum_{t=1}^{N} \left(x_t - \mu_x\right)^2 $$
When the variance is zero, all the points in the series are equal to the mean. A high variance indicates that the points are widely scattered around it.
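As a hedged sketch (the vector below is made-up), the variance defined above can be computed directly in R; note that R's built-in var() divides by N - 1 rather than N:

x <- c(1, 2, 3, 4, 5)

# Variance as defined above: average of the squared deviations (divide by N)
mean((x - mean(x))^2)   # 2

# R's var() is the sample variance (divide by N - 1)
var(x)                  # 2.5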
The term $\mathrm{cov}(x, y)$ represents the covariance between $x$ and $y$, which generalises the concept of variance to two time series instead of one. The covariance provides a measure of how much two time series change together. It does not necessarily account for how similar they are, but for how similarly they behave. More specifically, it captures whether the two time series increase and decrease at the same time.
The covariance is calculated as follows:

$$ \mathrm{cov}(x, y) = \frac{1}{N} \sum_{t=1}^{N} \left(x_t - \mu_x\right)\left(y_t - \mu_y\right) $$
and it is easy to see that indeed $\mathrm{cov}(x, x) = \sigma_x^2$.
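This identity is easy to check in R on made-up data: the built-in cov() uses the same N - 1 convention as var(), so the covariance of a series with itself equals its variance:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

cov(x, y)   # how much x and y change together
cov(x, x)   # identical to var(x)
var(x)      # 2.5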
Looking back at the definition of correlation, it is now easy to understand what $r_{xy}$ is trying to capture. It is a measure of how similarly $x$ and $y$ behave, normalised by their standard deviations to obtain a value between $-1$ and $+1$. When both time series tend to increase (or decrease) over time in a similar fashion, they will be positively correlated. Conversely, if one goes up while the other goes down, they will be negatively correlated.
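The definition can be verified directly in R; this sketch reuses the made-up vectors from above, computes the correlation from the covariance and the standard deviations, and compares it with the built-in cor():

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Correlation from the definition: covariance normalised by the standard deviations
cov(x, y) / (sd(x) * sd(y))   # 0.8

# Built-in Pearson correlation gives the same value
cor(x, y)                     # 0.8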
Autocorrelation Function [4]
The idea behind autocorrelation is to calculate the correlation coefficient of a time series with a time-shifted copy of itself. If the data has a periodicity, the correlation coefficient will be higher whenever the shift lines up with that period, so the two copies resonate with each other.
The first step is to define an operator that shifts a time series in time, causing a delay of $k$ samples. This is known as the lag operator:

$$ L^k x_t = x_{t-k} $$
The autocorrelation of a time series $x$ with lag $k$ is then defined as:

$$ r_k = \frac{\sum_{t=1}^{N-k} \left(x_t - \mu_x\right)\left(x_{t+k} - \mu_x\right)}{\sum_{t=1}^{N} \left(x_t - \mu_x\right)^2} $$
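As a sketch (the helper name acf_manual is made up for illustration), the definition above can be implemented directly in R and compared with the built-in acf() function:

# Autocorrelation at lag k, following the definition above
acf_manual <- function(x, k) {
  n  <- length(x)
  mu <- mean(x)
  sum((x[1:(n - k)] - mu) * (x[(1 + k):n] - mu)) / sum((x - mu)^2)
}

x <- c(1, 2, 3, 4, 5)
acf_manual(x, 1)        # 0.4
acf_manual(x, 2)        # -0.1

acf(x, plot = FALSE)    # the built-in function returns the same values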
The Correlogram [4]
Autocorrelation is a relatively robust technique that does not come with strong assumptions about how the data was created. Whereas the previous post used synthetic sales data, this time we can confidently use real analytics data. The resulting plot of the autocorrelation function is known as a correlogram.
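Since the original analytics data is not reproduced here, the sketch below generates a synthetic periodic series and plots its correlogram with acf(); the peaks at multiples of the period show how autocorrelation reveals periodicity:

set.seed(42)
t <- 1:200
x <- sin(2 * pi * t / 20) + rnorm(200, sd = 0.3)   # period of 20 samples plus noise

# acf() computes the autocorrelation at increasing lags and draws the correlogram
acf(x, lag.max = 60, main = "Correlogram of a synthetic periodic series")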
Example
| data |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
$$ \mu = \frac{1 + 2 + 3 + 4 + 5}{5} = 3 $$
Calculate the Variance - Steps
- Calculate the mean of the sample.
- Subtract the mean from each data point.
- Square each result.
- Find the sum of the squared values.
- Divide by $N$, where $N$ is the number of data points.
| data $x_t$ | $\mu$ | $x_t - \mu$ | $(x_t - \mu)^2$ |
|---|---|---|---|
| 1 | 3 | -2 | 4 |
| 2 | 3 | -1 | 1 |
| 3 | 3 | 0 | 0 |
| 4 | 3 | 1 | 1 |
| 5 | 3 | 2 | 4 |
|  |  | sum | 10 |
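These steps can be checked with a few one-liners in R (shown here as a sketch; the final lines contrast dividing by N with R's var(), which divides by N - 1):

data <- c(1, 2, 3, 4, 5)

(data - mean(data))^2                        # 4 1 0 1 4
sum((data - mean(data))^2)                   # 10
sum((data - mean(data))^2) / length(data)    # 2   (divide by N)
var(data)                                    # 2.5 (divide by N - 1)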
Calculate the Autocorrelation
Numerator (lag $k = 1$):

$$ \sum_{t=1}^{N-1} \left(x_t - \mu\right)\left(x_{t+1} - \mu\right) = (-2)(-1) + (-1)(0) + (0)(1) + (1)(2) = 4 $$

Denominator:

$$ \sum_{t=1}^{N} \left(x_t - \mu\right)^2 = 4 + 1 + 0 + 1 + 4 = 10 $$

so the autocorrelation at lag 1 is $r_1 = 4 / 10 = 0.4$.
Verify in R
Code
data <- c(1,2,3,4,5)                      # the example series
summary(data)                             # quick summary statistics of the data
acfTs <- acf(data, na.action = na.pass)   # compute (and plot) the autocorrelation function
acfTs                                     # print the autocorrelation values by lag
Output
> data <- c(1,2,3,4,5)
> summary(data)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 2 3 3 4 5
> acfTs <- acf(data, na.action = na.pass)
> acfTs
Autocorrelations of series ‘data’, by lag
0 1 2 3 4
1.0 0.4 -0.1 -0.4 -0.4
>