Seemingly Unrelated Regression Model Using a Gibbs Sampler




 

Purpose

The purpose of fitting the seemingly unrelated regression (SUR) model with a Gibbs sampler is to obtain a model with a better fit than the corresponding model estimated under the frequentist school.

 

Example

 

Let’s suppose we have 3 sets of data from which to create 3 models.

 

Here is the first data set:

 y      X1
 2.1     1
 4.5     2
 6.8     3
 9.4     4
10.8     5

 

And the second:

 y      X2     X3
 4.8     1     0.25
 9.6     2     0.42
13.7     3     0.84
19.6     4     1.32
21.4     5     1.64

 

And the third:

 y      X4
 0.5    10.2
 1.1    20.5
 1.4    30.4
 2.3    41.6
 3.4    50.2
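To make the set-up concrete, here is a minimal sketch in Python (assuming numpy is available; the helper name ols and the variable names are mine, not part of the original write-up) of entering the three data sets and fitting each model separately by least squares:

```python
import numpy as np

# the three data sets exactly as listed above
y1 = np.array([2.1, 4.5, 6.8, 9.4, 10.8]);   x1 = np.array([1., 2., 3., 4., 5.])
y2 = np.array([4.8, 9.6, 13.7, 19.6, 21.4]); x2 = np.array([1., 2., 3., 4., 5.])
x3 = np.array([0.25, 0.42, 0.84, 1.32, 1.64])
y3 = np.array([0.5, 1.1, 1.4, 2.3, 3.4]);    x4 = np.array([10.2, 20.5, 30.4, 41.6, 50.2])

def ols(y, *cols):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b1 = ols(y1, x1)      # intercept and X1 slope for model #1
b2 = ols(y2, x2, x3)  # intercept, X2 slope and X3 slope for model #2
b3 = ols(y3, x4)      # intercept and X4 slope for model #3
```

The coefficient vectors b1, b2 and b3 should match the least-squares coefficients in the tables below.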

 


 

Using least squares, the summary statistics for each of the coefficients are:

 

Model #1

 

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept       0.03           0.3728        0.0805   0.9409     -1.1564      1.2164
X1              2.23           0.1124       19.8402   0.0003      1.8723      2.5877

 

Model #2

 

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept      1.1418           2.1708       0.5260   0.6514     -8.1986     10.4822
X2             3.8261           2.9894       1.2799   0.3290     -9.0363     16.6886
X3             1.3420           8.0463       0.1668   0.8829    -33.2786     35.9626

 

Model #3

 

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     -0.3674           0.3177      -1.1562   0.3313     -1.3786      0.6438
X4             0.0689           0.0094       7.3223   0.0053      0.0390      0.0989

 

If all the variables are combined into one model, this is the result:

 

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept      0.0492           0.4615       0.1066   0.9172     -0.9791      1.0775
x1             2.2248           0.1557      14.2931   0.0000      1.8779      2.5716
x2             5.1036           0.9861       5.1753   0.0004      2.9064      7.3009
x3            -1.8103           2.9568      -0.6122   0.5540     -8.3984      4.7778
x4             0.0577           0.0153       3.7725   0.0036      0.0236      0.0918

 

The goal is to estimate the coefficients of all three models simultaneously. Arnold Zellner introduced a method for doing this called the seemingly unrelated regression (SUR) model.

 

As explained in the technical details, the Gibbs sampler was run for 50,000 iterations.

 

If one has two models, the one with the lower sum of squared errors has the better fit.

 

In regard to the sum of the squared errors, these are the results:

  SUR      Model #1   Model #2   Model #3   Frequentist
4.0339      0.379      3.3770     0.2719      4.6119

 

Each of the individual models has a lower sum of squared errors than the SUR model, which in turn has a lower sum of squared errors than the frequentist model. It should be noted that when the Gibbs sampler is run for 100,000 iterations, the sum of squared errors only goes down to 4.0331, which suggests that running the extra 50,000 iterations is not worthwhile.

 

Technical details

 

In the SUR model, this is how the data would be set up:

 

  y      x1   x2   x3     x4
 2.1     1    0    0       0
 4.5     2    0    0       0
 6.8     3    0    0       0
 9.4     4    0    0       0
10.8     5    0    0       0
 4.8     0    1    0.25    0
 9.6     0    2    0.42    0
13.7     0    3    0.84    0
19.6     0    4    1.32    0
21.4     0    5    1.64    0
 0.5     0    0    0      10.2
 1.1     0    0    0      20.5
 1.4     0    0    0      30.4
 2.3     0    0    0      41.6
 3.4     0    0    0      50.2

 


 

Of course, each model still needs a column of 1s for its intercept, which leads to the following matrix, which I call the X new matrix:

  y      x10   x11   x20   x21   x22    x30    x31
 2.1      1     1     0     0    0       0      0
 4.5      1     2     0     0    0       0      0
 6.8      1     3     0     0    0       0      0
 9.4      1     4     0     0    0       0      0
10.8      1     5     0     0    0       0      0
 4.8      0     0     1     1    0.25    0      0
 9.6      0     0     1     2    0.42    0      0
13.7      0     0     1     3    0.84    0      0
19.6      0     0     1     4    1.32    0      0
21.4      0     0     1     5    1.64    0      0
 0.5      0     0     0     0    0       1     10.2
 1.1      0     0     0     0    0       1     20.5
 1.4      0     0     0     0    0       1     30.4
 2.3      0     0     0     0    0       1     41.6
 3.4      0     0     0     0    0       1     50.2
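Here is a sketch of how this stacked design matrix could be built (Python, assuming numpy and scipy are available; block_diag simply places the three per-model blocks on the diagonal, and the variable names are mine):

```python
import numpy as np
from scipy.linalg import block_diag  # assumes scipy is available

ones = np.ones(5)
X1 = np.column_stack([ones, [1, 2, 3, 4, 5]])                                # model #1: intercept, X1
X2 = np.column_stack([ones, [1, 2, 3, 4, 5], [0.25, 0.42, 0.84, 1.32, 1.64]])# model #2: intercept, X2, X3
X3 = np.column_stack([ones, [10.2, 20.5, 30.4, 41.6, 50.2]])                 # model #3: intercept, X4

X_new = block_diag(X1, X2, X3)   # the 15 x 7 "X new" matrix
y = np.array([2.1, 4.5, 6.8, 9.4, 10.8,
              4.8, 9.6, 13.7, 19.6, 21.4,
              0.5, 1.1, 1.4, 2.3, 3.4])   # the stacked response vector
```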

 

In the frequentist school, the beta vector β and the covariance matrix of the error terms Σ are fixed quantities. However, in the Bayesian school, β follows a normal distribution and Σ follows an inverse Wishart distribution.

 

This is the basic way the Gibbs sampler works for the SUR model: Start off with least squares as in the frequentist school. From that, you create the E matrix of the error terms. Since there are 3 models and each model has 5 rows, the E matrix is a 5x3 matrix. For the above models, this would be the E matrix:

-0.16    -0.5040    0.1645
 0.01     0.2414    0.0546
 0.08    -0.0492   -0.3276
 0.45     1.3795   -0.1994
-0.38    -1.0767    0.3079

 

From that, you create the S matrix which is E’E. For the above models, this would be the S matrix:

 0.3790    1.1090   -0.2587
 1.1090    3.3771   -0.6602
-0.2587   -0.6602    0.2719

 

It should be noted that the diagonal elements are equivalent to the sum of squared errors (residuals) from the regression ANOVA table for each model.
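These two steps, the per-model residuals and S = E'E, can be sketched as follows (Python with numpy; the helper name resid is mine):

```python
import numpy as np

y1 = np.array([2.1, 4.5, 6.8, 9.4, 10.8]);   x1 = np.array([1., 2., 3., 4., 5.])
y2 = np.array([4.8, 9.6, 13.7, 19.6, 21.4]); x2 = np.array([1., 2., 3., 4., 5.])
x3 = np.array([0.25, 0.42, 0.84, 1.32, 1.64])
y3 = np.array([0.5, 1.1, 1.4, 2.3, 3.4]);    x4 = np.array([10.2, 20.5, 30.4, 41.6, 50.2])

def resid(y, *cols):
    """Residuals from a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ b

E = np.column_stack([resid(y1, x1), resid(y2, x2, x3), resid(y3, x4)])  # 5 x 3 residual matrix
S = E.T @ E   # diagonal entries equal each model's residual sum of squares
```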

 

To get the sigma matrix, we divide the elements by the number of rows in each model. Since there are 5 rows, this is the sigma matrix for the above data:

 0.0758    0.2218   -0.0517
 0.2218    0.6754   -0.1320
-0.0517   -0.1320    0.0544

 

As you can see from the above matrix, the error terms are correlated. However, we would like uncorrelated errors. To achieve that, we first compute the lower Cholesky root of the sigma matrix, which we call the L matrix. This is the result:

 

 0.275318   0          0
 0.805645   0.162314   0
-0.18795    0.119367   0.069388

 

We then take the inverse of the L matrix:

  3.6322    0.0000    0.0000
-18.0282    6.1609    0.0000
 40.8517  -10.5985   14.4117
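These three quantities, the sigma matrix, its lower Cholesky root L and the inverse of L, can be sketched as follows (numpy; the S values are the ones shown above):

```python
import numpy as np

S = np.array([[ 0.3790,  1.1090, -0.2587],
              [ 1.1090,  3.3771, -0.6602],
              [-0.2587, -0.6602,  0.2719]])

Sigma = S / 5                     # divide by the number of rows per model
L = np.linalg.cholesky(Sigma)     # lower Cholesky root of the sigma matrix
L_inv = np.linalg.inv(L)          # used below to decorrelate the errors
```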

 

The dimensions of the X new matrix are 15x7. However, the dimensions of the L inverse matrix are 3x3. To create a 15x15 matrix, we compute the Kronecker product of the L inverse matrix and the identity matrix of dimensions 5x5, since each model has 5 rows. This is what the result looks like for the first 5 rows and columns:

3.6322   0        0        0        0
0        3.6322   0        0        0
0        0        3.6322   0        0
0        0        0        3.6322   0
0        0        0        0        3.6322

 


 

We then create the X̃ ("X tilde") matrix, which is the product of the above Kronecker product and the X new matrix:

 14.4117   14.4117     0          0          0          0          0
  3.6322    7.2644     0          0          0          0          0
  3.6322   10.8966     0          0          0          0          0
  3.6322   14.5288     0          0          0          0          0
  3.6322   18.161      0          0          0          0          0
-18.0282  -18.0282     6.1609     6.1609     1.540225   0          0
-18.0282  -36.0564     6.1609    12.3218     2.587578   0          0
-18.0282  -54.0846     6.1609    18.4827     5.175156   0          0
-18.0282  -72.1128     6.1609    24.6436     8.132388   0          0
-18.0282  -90.141      6.1609    30.8045    10.10388    0          0
 40.8517   40.8517   -10.5985   -10.5985    -2.64963   14.4117   146.9993
 40.8517   81.7034   -10.5985   -21.197     -4.45137   14.4117   295.4399
 40.8517  122.5551   -10.5985   -31.7955    -8.90274   14.4117   438.1157
 40.8517  163.4068   -10.5985   -42.394    -13.99      14.4117   599.5267
 40.8517  204.2585   -10.5985   -52.9925   -17.3815    14.4117   723.4673

 

 

We can also create the ỹ ("y tilde") vector, which is the product of the Kronecker product and the y vector. This is what it looks like for this data:

 30.26457
 16.3449
 24.69896
 34.14268
 39.22776
 -8.2869
-21.9823
-38.1874
-48.7114
-62.8613
 42.12162
 97.93992
152.7685
209.4223
263.3902
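Here is a sketch of the whole transformation step, the Kronecker product followed by the X̃ matrix and ỹ vector (numpy and scipy; X_new and y are rebuilt so the snippet stands on its own, and the names are mine):

```python
import numpy as np
from scipy.linalg import block_diag  # assumes scipy is available

ones = np.ones(5)
X_new = block_diag(
    np.column_stack([ones, [1, 2, 3, 4, 5]]),
    np.column_stack([ones, [1, 2, 3, 4, 5], [0.25, 0.42, 0.84, 1.32, 1.64]]),
    np.column_stack([ones, [10.2, 20.5, 30.4, 41.6, 50.2]]))
y = np.array([2.1, 4.5, 6.8, 9.4, 10.8,
              4.8, 9.6, 13.7, 19.6, 21.4,
              0.5, 1.1, 1.4, 2.3, 3.4])
L_inv = np.array([[  3.6322,   0.0000,  0.0000],
                  [-18.0282,   6.1609,  0.0000],
                  [ 40.8517, -10.5985, 14.4117]])

W = np.kron(L_inv, np.eye(5))   # 15 x 15 Kronecker product with the 5 x 5 identity
X_tilde = W @ X_new             # 15 x 7 transformed design matrix
y_tilde = W @ y                 # length-15 transformed response vector
```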

 

This is the point at which the Bayesian approach departs from that of the frequentist school.

 

We start with the prior distributions of β and Σ. To do that, we introduce some new terms.

 

The β̄ ("beta bar") vector is the average of previous beta vectors. If there are no previous beta vectors, it is a vector of 0s.

 

The A matrix is known as the precision matrix, and there is a large literature on how to choose it. In a PDF file put together by Peter E. Rossi (one of the authors of Bayesian Statistics and Marketing), the A matrix is formed by multiplying the identity matrix by a scalar of 0.05. As a twist on this, I used a scalar equal to the number of coefficients times 0.01. Since there are 4 X variables and 3 intercepts, for a total of 7 coefficients, the scalar for the A matrix is 0.07.

 

The prior distribution of β is normal with mean β̄ and covariance matrix A^-1.

The V matrix is constructed by taking the sigma matrix from above and multiplying it by the degrees of freedom v = n − k, where k is the number of coefficients. Since n = 15 and k = 7, we have v = 8 degrees of freedom. For this data set, this is the V matrix:

 0.5306    1.5526   -0.3619
 1.5526    4.7278   -0.924
-0.3619   -0.924     0.3808

 

The prior distribution of Σ follows an inverse Wishart distribution with parameters V and v degrees of freedom.
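A sketch of this prior set-up (numpy; the choice V = v × sigma simply follows the prose above, and the variable names are mine):

```python
import numpy as np

n, k = 15, 7                # total rows and total coefficients
v = n - k                   # prior degrees of freedom
beta_bar = np.zeros(k)      # prior mean of beta (no previous beta vectors)
A = 0.07 * np.eye(k)        # precision matrix: 7 coefficients x 0.01
Sigma = np.array([[ 0.0758,  0.2218, -0.0517],
                  [ 0.2218,  0.6754, -0.1320],
                  [-0.0517, -0.1320,  0.0544]])
V = v * Sigma               # scale matrix for the inverse Wishart prior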

 

We now combine the prior distributions and the data to create the posterior distributions.

 

In the frequentist school, B = (X'X)^-1(X'Y). However, in the Bayesian school, β̃ = (X̃'X̃ + A)^-1(X̃'ỹ + Aβ̄).

 

Once β̃ is constructed, the posterior distribution of β is normal with mean β̃ and covariance matrix (X̃'X̃ + A)^-1.

 

If we assume that the β̄ vector is a vector of 0s, then β̃ = (X̃'X̃ + A)^-1(X̃'ỹ).
For this data set, this is:

-0.1569
 2.274553
 0.383083
 4.328316
 0.331196
-0.24892
 0.066227
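A sketch of the posterior mean and covariance calculation (numpy; the function name posterior_beta is mine, and X_tilde, y_tilde and A are assumed to have been built as in the earlier sketches):

```python
import numpy as np

def posterior_beta(X_tilde, y_tilde, A):
    """Posterior mean and covariance of beta when beta_bar is a vector of 0s."""
    precision = X_tilde.T @ X_tilde + A        # posterior precision of beta
    cov = np.linalg.inv(precision)             # posterior covariance matrix
    beta_tilde = cov @ (X_tilde.T @ y_tilde)   # posterior mean
    return beta_tilde, cov
```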

 

In order to generate β, we need a vector of Z values that follow a normal distribution with mean 0 and standard deviation 1. Since there are 7 coefficients, we need a vector of 7 Z values. Suppose the computer generates this vector:

 0.7839
 2.5079
 1.8482
-1.2640
-0.2313
-0.8057
 0.2340

 

We also need the lower Cholesky root of (X̃'X̃ + A)^-1. For this data, the current value of (X̃'X̃ + A)^-1 is:

 0.01062   -0.00452    0.031115  -0.01332    0.00025   -0.00732    0.000305
-0.00452    0.003009  -0.01327    0.008877  -0.00019    0.003126  -0.0002
 0.031115  -0.01327    0.130568  -0.06583    0.052223  -0.00036    0.000331
-0.01332    0.008877  -0.06583    0.062417  -0.09222    0.003747  -0.00042
 0.00025   -0.00019    0.052223  -0.09222    0.249939  -0.00112    4.36E-05
-0.00732    0.003126  -0.00036    0.003747  -0.00112    0.026158  -0.00078
 0.000305  -0.0002     0.000331  -0.00042    4.36E-05  -0.00078    3.22E-05

 

The lower root is:

 0.103054   0          0          0          0          0          0
-0.04386    0.032934   0          0          0          0          0
 0.301925  -0.00082    0.198514   0          0          0          0
-0.12922    0.097434  -0.13465    0.134514   0          0          0
 0.002426  -0.00264    0.259367  -0.42169    0.069514   0          0
-0.07106    0.000278   0.10628    0.065777  -0.01119    0.073216   0
 0.002959  -0.00221   -0.00284   -0.00151    0.00186   -0.00196    0.000919

 

Since this is a multivariate normal distribution, we multiply the lower root by the Z vector and add β̃. This is the result:

 0.7839
 2.5079
 1.8482
-1.264
-0.2313
-0.8057
 0.234
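A sketch of this multivariate normal draw (numpy; draw_beta is my name for it, and beta_tilde and cov come from the earlier posterior_beta sketch):

```python
import numpy as np

def draw_beta(beta_tilde, cov, rng=None):
    """Draw beta from N(beta_tilde, cov) via the lower Cholesky root of cov."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal(len(beta_tilde))   # the Z vector of N(0, 1) draws
    lower = np.linalg.cholesky(cov)            # lower Cholesky root of the covariance
    return beta_tilde + lower @ z
```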

 

When we multiply the X new matrix by β and calculate the error terms, we can create the E matrix:

-1.1918    4.273625   -1.0811
-1.2997   10.37695    -2.8913
-1.5076   15.83809    -4.9079
-1.4155   23.11312    -6.6287
-2.5234   26.25113    -7.5411

 

After that, we once again create the S matrix which is E’E:

  13.75365   -141.416     40.85756
-141.416     1600.128   -463.527
  40.85756   -463.527    134.4237

 

The posterior distribution of Σ follows an inverse Wishart distribution with parameters S + V and degrees of freedom v + n, where n is the number of rows in each model. For this data set, this is the sum of S and V:

  14.28425   -139.863     40.49566
-139.863     1604.856   -464.451
  40.49566   -464.451    134.8045

 

The degrees of freedom are 8 + 5 = 13.

 

To generate Σ, we need the T matrix. Since there are 3 models, the dimensions of the T matrix are 3 x 3. The T matrix is lower triangular: the elements above the diagonal are 0, the elements below the diagonal are standard normal draws, and the elements on the diagonal are chi-square draws. For example, the (1, 1) element is a chi-square variable with 13 degrees of freedom.

 

Suppose the following T matrix is generated:

 7.201962    0           0
-0.1135     15.26426     0
 0.567206    0.759346    8.226725

 

We also need U which is the upper Cholesky root of the S + V matrix:

3.779451  -37.0062    10.71469
0          15.34273   -4.42824
0           0          0.624956

 

We then generate C = T’U:

51.86826    -2.54992    8.665058
 0         232.9976    17.83779
 0           0         67.679

 

Finally, Σ = (C^-1)(C^-1)':

 0.000378119    3.76071E-06   -3.72939E-05
 3.76071E-06    1.96999E-05   -1.6714E-05
-3.72939E-05   -1.6714E-05     0.000218319
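A sketch of this Σ draw (numpy; the function name is mine). The text only pins down the (1, 1) element of T as a chi-square variable with 13 degrees of freedom; the assumption here that the degrees of freedom drop by one for each subsequent diagonal element is illustrative rather than the author's exact recipe:

```python
import numpy as np

def draw_sigma(S_plus_V, dof, rng=None):
    """Draw sigma via the T / U / C construction described in the text."""
    rng = rng or np.random.default_rng()
    m = S_plus_V.shape[0]                  # number of models (3 here)
    T = np.zeros((m, m))
    for i in range(m):
        T[i, i] = rng.chisquare(dof - i)   # chi-square diagonal (assumed df pattern)
        T[i, :i] = rng.standard_normal(i)  # standard normals below the diagonal
    U = np.linalg.cholesky(S_plus_V).T     # upper Cholesky root of S + V
    C_inv = np.linalg.inv(T.T @ U)         # C = T'U
    return C_inv @ C_inv.T                 # sigma = (C^-1)(C^-1)'
```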

 

We then use this Σ to generate the next β, and so on. After running 50,000 iterations of the Gibbs sampler, the averages of the generated β and Σ values are calculated; averaging is needed because of the random Z and T values that are generated in each iteration. This is a result for β:

Coefficient   Beta (Gibbs)   SE (Gibbs)   Beta (LS)   SE (LS)
X10              0.1073        2.4016       0.03       0.3728
X11              2.2077        0.2182       2.23       0.1124
X20              1.1355        0.2255       1.1418     2.1708
X21              3.8157        0.0751       3.8261     2.9894
X22              1.3808        0.4142       1.3420     8.0463
X30             -0.3544        0.2041      -0.3674     0.3177
X31              0.0686        0.0002       0.0689     0.0094

 

For comparison, the coefficients and standard errors from the individual least-squares models are included. It should be noted that the standard errors for models #2 and #3 under the Gibbs sampler are considerably smaller than under least squares.

 

After 50,000 iterations, this is a result for Σ:

 2.187177   -0.62216    0.501253
-0.62216     0.188484  -0.13562
 0.501253   -0.13562    0.183619

 

The aforementioned standard errors are calculated by computing

 

In this equation, n represents the number of rows in each model, which, in this case, is 5.
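Putting the pieces together, here is a compact sketch of the full Gibbs loop as described above (numpy; the function names, the chi-square pattern on the diagonal of T, and the use of a zero β̄ vector are assumptions carried over from the earlier sketches, not the author's exact code):

```python
import numpy as np

def draw_sigma(S_plus_V, dof, rng):
    # lower-triangular T: chi-square draws on the diagonal (assumed to lose one
    # degree of freedom per row), standard normal draws below the diagonal
    m = S_plus_V.shape[0]
    T = np.zeros((m, m))
    for i in range(m):
        T[i, i] = rng.chisquare(dof - i)
        T[i, :i] = rng.standard_normal(i)
    U = np.linalg.cholesky(S_plus_V).T        # upper Cholesky root of S + V
    C_inv = np.linalg.inv(T.T @ U)            # C = T'U
    return C_inv @ C_inv.T                    # sigma = (C^-1)(C^-1)'

def gibbs_sur(X_new, y, n_models, rows, A, V, dof, n_iter=50_000, seed=0):
    rng = np.random.default_rng(seed)
    k = X_new.shape[1]
    beta = np.linalg.lstsq(X_new, y, rcond=None)[0]    # start from least squares
    E = (y - X_new @ beta).reshape(n_models, rows).T   # residuals, one column per model
    Sigma = E.T @ E / rows                             # initial sigma matrix
    draws = []
    for _ in range(n_iter):
        # transform the data so that the errors are uncorrelated
        W = np.kron(np.linalg.inv(np.linalg.cholesky(Sigma)), np.eye(rows))
        X_t, y_t = W @ X_new, W @ y
        cov = np.linalg.inv(X_t.T @ X_t + A)           # posterior covariance of beta
        beta_tilde = cov @ (X_t.T @ y_t)               # posterior mean (beta_bar = 0)
        beta = beta_tilde + np.linalg.cholesky(cov) @ rng.standard_normal(k)
        # update sigma from the new residuals via the inverse Wishart draw
        E = (y - X_new @ beta).reshape(n_models, rows).T
        Sigma = draw_sigma(E.T @ E + V, dof, rng)
        draws.append(beta)
    return np.mean(draws, axis=0)                      # average of the generated betas
```

Calling gibbs_sur with the stacked design matrix and response from the earlier sketches, the A and V matrices from the prior set-up, and dof = 13 should produce averaged coefficients in the same spirit as the table above.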

 

References:

Rossi, P. E., Allenby, G. M., and McCulloch, R. Bayesian Statistics and Marketing. Chichester: John Wiley & Sons, 2005.

Zellner, A. An Introduction to Bayesian Inference in Econometrics. New York: John Wiley & Sons, 1970.