Online Covariance

Given the following set of two-dimensional inputs:

{(x1,y1),(x2,y2),,(xn1,yn1),(xn,yn)}

Let n be the number of two-dimensional inputs, X represent the x dimension, Y represent the y dimension, Covn(X,Y) be the biased sample covariance of the x and y dimensions for the first n two-dimensional inputs, Covn1(X,Y) be the biased sample covariance of the x and y dimensions for the first n1 two-dimensional inputs, xn be the x value of the n-th two-dimensional input, x¯n be the sample mean of the x values for the first n two-dimensional inputs, yn be the y value of the n-th two-dimensional input, and y¯n1 be the sample mean of the y values for the first n1 two-dimensional inputs. Then, the recurrence equation for the biased sample covariance (a.k.a. online covariance) is:

Covn(X,Y)=Covn1(X,Y)Covn1(X,Y)(xnx¯n)(yny¯n1)n

Note: The recurrence equation above also applies when computing the online covariance matrix:

Σn[j,k]=Σn1[j,k]Σn1[j,k](xn[j]x[j]¯n)(xn[k]x[k]¯n1)n.

However, we will restrict ourselves to the online covariance computation of two-dimensional input in this post and explore the online covariance matrix computation of m-dimensional input in a later post.

 

Proof:

The definition of the biased sample covariance of the x and y dimensions for the first n two-dimensional inputs is defined as:

Covn(X,Y)=i=1n(xix¯n)(yiy¯n)n.

If we expand this definition, we have:

Covn(X,Y)=i=1n1(xix¯n)(yiy¯n)+(xnx¯n)(yny¯n)n.

Since the recurrence equations for the sample mean of the x and y values are:

x¯n=x¯n1x¯n1xnn and y¯n=y¯n1y¯n1ynn,

then we have:

Covn(X,Y)=i=1n1(xi(x¯n1x¯n1xnn))(yi(y¯n1y¯n1ynn))+(xnx¯n)(yny¯n)nCovn(X,Y)=i=1n1(xix¯n1+x¯n1xnn)(yiy¯n1+y¯n1ynn)+(xnx¯n)(yny¯n)nCovn(X,Y)=i=1n1(xiyixiy¯n1+xi(y¯n1ynn)x¯n1yi+x¯n1y¯n1x¯n1(y¯n1ynn)+(x¯n1xnn)yi(x¯n1xnn)y¯n1+(x¯n1xnn)(y¯n1ynn))+(xnx¯n)(yny¯n)nCovn(X,Y)=(i=1n1(xiyixiy¯n1x¯n1yi+x¯n1y¯n1)+i=1n1(xi(y¯n1ynn)x¯n1(y¯n1ynn))+i=1n1((x¯n1xnn)yi(x¯n1xnn)y¯n1+(x¯n1xnn)(y¯n1ynn))+(xnx¯n)(yny¯n))nCovn(X,Y)=(i=1n1(xix¯n1)(yiy¯n1)+(y¯n1ynn)i=1n1(xix¯n1)+(x¯n1xnn)i=1n1(yiy¯n1+(y¯n1ynn))+(xnx¯n)(yny¯n))n

Since the biased sample covariance of the x and y dimensions for the first n1 two-dimensional inputs is defined as:

Covn1(X,Y)=i=1n1(xix¯n1)(yiy¯n1)n1,

then we also have:

i=1n1(xix¯n1)(yiy¯n1)=(n1)Covn1(X,Y).

With this, we have:

Covn(X,Y)=((n1)Covn1(X,Y)+(y¯n1ynn)i=1n1(xix¯n1)+(x¯n1xnn)i=1n1(yiy¯n1+(y¯n1ynn))+(xnx¯n)(yny¯n))nCovn(X,Y)=((n1)Covn1(X,Y)+(y¯n1ynn)(i=1n1(xi)+i=1n1(x¯n1))+(x¯n1xnn)(i=1n1(yi)+i=1n1(y¯n1)+i=1n1(y¯n1ynn))+(xnx¯n)(yny¯n))nCovn(X,Y)=((n1)Covn1(X,Y)+(y¯n1ynn)(i=1n1(xi)x¯n1i=1n1(1))+(x¯n1xnn)(i=1n1(yi)y¯n1i=1n1(1)+(y¯n1ynn)i=1n1(1))+(xnx¯n)(yny¯n))nCovn(X,Y)=((n1)Covn1(X,Y)+(y¯n1ynn)(i=1n1(xi)x¯n1(n1))+(x¯n1xnn)(i=1n1(yi)y¯n1(n1)+(y¯n1ynn)(n1))+(xnx¯n)(yny¯n))n

Since the sample mean for the first n1 x and y values are defined as:

x¯n1=i=1n1xin1 and y¯n1=i=1n1yin1,

then we also have:

i=1n1xi=x¯n1(n1) and i=1n1yi=y¯n1(n1).

With that, we have:

Covn(X,Y)=((n1)Covn1(X,Y)+(y¯n1ynn)(x¯n1(n1)x¯n1(n1))+(x¯n1xnn)(y¯n1(n1)y¯n1(n1)+(y¯n1ynn)(n1))+(xnx¯n)(yny¯n))nCovn(X,Y)=((n1)Covn1(X,Y)+(y¯n1ynn)(x¯n1(n1)x¯n1(n1))+(x¯n1xnn)(y¯n1(n1)y¯n1(n1)+(y¯n1ynn)(n1))+(xnx¯n)(yny¯n))nCovn(X,Y)=((n1)Covn1(X,Y)+(x¯n1xnn)(y¯n1ynn)(n1)+(xnx¯n)(yny¯n))nCovn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+xnynxny¯nx¯nyn+x¯ny¯n)n

Since the recurrence equation for the sample mean of the y values is:

y¯n=y¯n1y¯n1ynn,

then we have:

Covn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+xnynxn(y¯n1y¯n1ynn)x¯nyn+x¯n(y¯n1y¯n1ynn))nCovn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+xnynxny¯n1+xn(y¯n1ynn)x¯nyn+x¯ny¯n1x¯n(y¯n1ynn))nCovn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+xnynxny¯n1x¯nyn+x¯ny¯n1+(xnx¯n)(y¯n1ynn))nCovn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+(xnx¯n)(yny¯n1)+(xnx¯n)(y¯n1ynn))n

Since the recurrence equation for the sample mean of the x values is:

x¯n=x¯n1x¯n1xnnx¯n=nx¯n1n+x¯n1+xnnx¯n=(n1)x¯n1+xnn,

then we have:

Covn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+(xnx¯n)(yny¯n1)+(xn((n1)x¯n1+xnn))(y¯n1ynn))nCovn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+(xnx¯n)(yny¯n1)+(nxnn+(n1)x¯n1xnn)(y¯n1ynn))nCovn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+(xnx¯n)(yny¯n1)+((n1)x¯n1+(n1)xnn)(y¯n1ynn))nCovn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+(xnx¯n)(yny¯n1)(n1)(x¯n1xnn)(y¯n1ynn))nCovn(X,Y)=((n1)Covn1(X,Y)+(n1)(x¯n1xnn)(y¯n1ynn)+(xnx¯n)(yny¯n1)(n1)(x¯n1xnn)(y¯n1ynn))nCovn(X,Y)=(n1)Covn1(X,Y)+(xnx¯n)(yny¯n1)nCovn(X,Y)=nCovn1(X,Y)Covn1(X,Y)+(xnx¯n)(yny¯n1)nCovn(X,Y)=nCovn1(X,Y)n+(Covn1(X,Y)(xnx¯n)(yny¯n1))nCovn(X,Y)=Covn1(X,Y)Covn1(X,Y)(xnx¯n)(yny¯n1)n

Therefore, the recurrence equation for the biased sample covariance (a.k.a. online covariance) is:

Covn(X,Y)=Covn1(X,Y)Covn1(X,Y)(xnx¯n)(yny¯n1)n

 

Note: We can manipulate this recurrence equation such as that we also have:

Covn(X,Y)=Covn1(X,Y)Covn1(X,Y)(xnx¯n1)(yny¯n)n,

Covn(X,Y)=Covn1(X,Y)Covn1(X,Y)(n1n)(xnx¯n1)(yny¯n1)n,

and

Covn(X,Y)=(n1)(Covn1(X,Y)+(xnx¯n1)(yny¯n1)n)n.

Reference:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance


Example of C++ code that computes the online covariance:

// Filename: main.cpp
#include <iostream>
#include <iomanip>
 
int main() {
 
    double x;
    double y;
    double n = 0;
    double mean_x = 0;  // mean of the x values
    double mean_y = 0;  // mean of the y values
    double cov = 0;     // covariance of the x and y values
    double prev_mean_x; // previous mean of the x values
    double prev_mean_y; // previous mean of the y values
    double prev_cov;    // previous covariance of the x and y values
 
    if ( std::cin >> x && std::cin >> y ) {
        ++n;
        mean_x = x;
        mean_y = y;
        cov = 0;
        while ( std::cin >> x && std::cin >> y ) {
            prev_mean_x = mean_x;
            prev_mean_y = mean_y;
            prev_cov = cov;
            ++n;
            mean_x = prev_mean_x - ( prev_mean_x - x ) / n;
            mean_y = prev_mean_y - ( prev_mean_y - y ) / n;
            cov = prev_cov - ( prev_cov - ( x - mean_x ) * ( y - prev_mean_y ) ) / n;
        }
    }
 
    std::cout << "n:      " << n << '\n';
    std::cout << "mean_x: " << std::setprecision( 17 ) << mean_x << '\n';
    std::cout << "mean_y: " << std::setprecision( 17 ) << mean_y << '\n';
    std::cout << "cov:    " << std::setprecision( 17 ) << cov << '\n';
 
}

Example of data.txt:

-281.189       612.083
974.663        -24.0965
25.8526        401.539
.              .
.              .
.              .

Command Line:

g++ -o main.exe main.cpp -std=c++11 -march=native -O3 -Wall -Wextra -Werror -static
./main.exe < data.txt

 

Note: Mathematica’s Covariance[] function computes the unbiased sample covariance matrix, not the biased sample covariance matrix; therefore, the biased sample covariance matrix is computed in Mathematica as:

( ( Length[ list ] - 1 ) / Length[ list ] ) * Covariance[ list ]

 

Online Covariance
online_covariance.pdf
online_covariance.docx

Tagged on:

3 thoughts on “Online Covariance

  1. Joshua Burkholder Post author

    This is also derived from the following:

  2. Lavinius

    Hello,

    I just tried your online covariance formula and it is very precise. Thank you for it, and for the mathematical demonstration. Do you have anything published in which you included it (just the formula), or should I just reference the website?

Leave a Reply

Your email address will not be published. Required fields are marked *