Purpose
The purpose of the seemingly
unrelated regression (SUR) model using Gibbs sampler is to create a model that
has a better fit than the corresponding model under the Frequentist school.
Example
Let’s suppose we
have 3 sets of data to create 3 models.
Here is the first
data set:
y 
2.1 
4.5 
6.8 
9.4 
10.8 
X1 
1 
2 
3 
4 
5 
And the second:
y 
4.8 
9.6 
13.7 
19.6 
21.4 
X2 
1 
2 
3 
4 
5 
X3 
0.25 
0.42 
0.84 
1.32 
1.64 
And the third:
y 
0.5 
1.1 
1.4 
2.3 
3.4 
X4 
10.2 
20.5 
30.4 
41.6 
50.2 
Using least squares, the summary statistics
for each of the coefficients are:
Model #1

Coefficients 
Standard Error 
t Stat 
Pvalue 
Lower 95% 
Upper 95% 
Intercept 
0.03 
0.3728 
0.0805 
0.9409 
1.1564 
1.2164 
X1 
2.23 
0.1124 
19.8402 
0.0003 
1.8723 
2.5877 
Model #2

Coefficients 
Standard Error 
t Stat 
Pvalue 
Lower 95% 
Upper 95% 
Intercept 
1.1418 
2.1708 
0.5260 
0.6514 
8.1986 
10.4822 
X2 
3.8261 
2.9894 
1.2799 
0.3290 
9.0363 
16.6886 
X3 
1.3420 
8.0463 
0.1668 
0.8829 
33.2786 
35.9626 
Model #3

Coefficients 
Standard Error 
t Stat 
Pvalue 
Lower 95% 
Upper 95% 
Intercept 
0.3674 
0.3177 
1.1562 
0.3313 
1.3786 
0.6438 
X4 
0.0689 
0.0094 
7.3223 
0.0053 
0.0390 
0.0989 
If all the variables are combined into one
model, this is the result:

Coefficients 
Standard Error 
t Stat 
Pvalue 
Lower 95% 
Upper 95% 
Intercept 
0.0492 
0.4615 
0.1066 
0.9172 
0.9791 
1.0775 
x1 
2.2248 
0.1557 
14.2931 
0.0000 
1.8779 
2.5716 
x2 
5.1036 
0.9861 
5.1753 
0.0004 
2.9064 
7.3009 
x3 
1.8103 
2.9568 
0.6122 
0.5540 
8.3984 
4.7778 
x4 
0.0577 
0.0153 
3.7725 
0.0036 
0.0236 
0.0918 
The goal is to create models that solve for the
coefficients simultaneously. Arnold Zellner introduced a concept called the
seemingly unrelated regression (SUR) model.
As explained in the technical details, the
Gibbs sampler was run for 50,000 iterations.
If one has two models, the one that has the
lower sum of squared errors is the model with the better fit.
In regard to the sum of the squared errors, these are the results:
SUR 
Model
#1 
Model
#2 
Model
#3 
Frequentist 
4.0339 
0.379 
3.3770 
0.2719 
4.6119 
The sum of squared errors
is less for each of the individual models than for the SUR model which, in
turn, is less than that of the Frequentist model. It should be noted that when
the Gibbs sampler is run for 100,000 iterations, the sum of squared errors goes
down to 4.0331 which would seem to suggest that running the extra 50,000
iterations is not worth the while.
Technical details
In the SUR model, this is how the data would
be set up:
y1 
x1 
x2 
x3 
x4 
2.1 
1 
0 
0 
0 
4.5 
2 
0 
0 
0 
6.8 
3 
0 
0 
0 
9.4 
4 
0 
0 
0 
10.8 
5 
0 
0 
0 
4.8 
0 
1 
0.25 
0 
9.6 
0 
2 
0.42 
0 
13.7 
0 
3 
0.84 
0 
19.6 
0 
4 
1.32 
0 
21.4 
0 
5 
1.64 
0 
0.5 
0 
0 
0 
10.2 
1.1 
0 
0 
0 
20.5 
1.4 
0 
0 
0 
30.4 
2.3 
0 
0 
0 
41.6 
3.4 
0 
0 
0 
50.2 
Of course, for each
model, you would still need a vector of 1s for the intercept, which leads to
this matrix which I call the X new matrix:
y1 
x10 
x11 
x20 
x21 
x22 
x30 
x31 
2.1 
1 
1 
0 
0 
0 
0 
0 
4.5 
1 
2 
0 
0 
0 
0 
0 
6.8 
1 
3 
0 
0 
0 
0 
0 
9.4 
1 
4 
0 
0 
0 
0 
0 
10.8 
1 
5 
0 
0 
0 
0 
0 
4.8 
0 
0 
1 
1 
0.25 
0 
0 
9.6 
0 
0 
1 
2 
0.42 
0 
0 
13.7 
0 
0 
1 
3 
0.84 
0 
0 
19.6 
0 
0 
1 
4 
1.32 
0 
0 
21.4 
0 
0 
1 
5 
1.64 
0 
0 
0.5 
0 
0 
0 
0 
0 
1 
10.2 
1.1 
0 
0 
0 
0 
0 
1 
20.5 
1.4 
0 
0 
0 
0 
0 
1 
30.4 
2.3 
0 
0 
0 
0 
0 
1 
41.6 
3.4 
0 
0 
0 
0 
0 
1 
50.2 
In the frequentist school, the beta vector
β and the covariance matrix of the error terms ∑ are fixed
quantities. However, in the Bayesian school, β follows a normal
distribution and ∑ follows an inverse Wishart distribution.
This is the basic
way the Gibbs sampler works for the SUR model: Start off with least squares as
in the frequentist school. From that, you create the E matrix of the error
terms. Since there are 3 models and each model has 5 rows, the E matrix is a
5x3 matrix. For the above models, this would be the E matrix:
0.16 
0.5040 
0.1645 
0.01 
0.2414 
0.0546 
0.08 
0.0492 
0.3276 
0.45 
1.3795 
0.1994 
0.38 
1.0767 
0.3079 
From that, you create
the S matrix which is E’E. For the above models, this would be the S matrix:
0.3790 
1.1090 
0.2587 
1.1090 
3.3771 
0.6602 
0.2587 
0.6602 
0.2719 
It should be noted
that the diagonal elements are equivalent to the sum of squared errors
(residuals) from the regression ANOVA table for each model.
To get the sigma
matrix, we divide the elements by the number of rows in each model. Since there
are 5 rows, this is the sigma matrix for the above data:
0.0758 
0.2218 
0.0517 
0.2218 
0.6754 
0.1320 
0.0517 
0.1320 
0.0544 
As you notice from the above matrix, the error
terms are correlated. However, we
would like uncorrelated errors. The way we do
that is by first computing the lower Cholesky root of the sigma matrix which we
call the L matrix. This would be the result:
0.275318 
0 
0 
0.805645 
0.162314 
0 
0.18795 
0.119367 
0.069388 
We then take the
inverse of the L matrix:
3.6322 
0.0000 
0.0000 
18.0282 
6.1609 
0.0000 
40.8517 
10.5985 
14.4117 
The dimensions of
the X new matrix are 15x7. However, the dimensions of L^{ }inverse matrix
are 3x3. To create a 15x15 matrix, we compute the Kronecker product of the L
inverse matrix and the identity matrix of dimensions 5x5, since each model has
5 rows. This is what the result looks like for the first 5 rows and columns:
3.6322 
0 
0 
0 
0 
0 
3.6322 
0 
0 
0 
0 
0 
3.6322 
0 
0 
0 
0 
0 
3.6322 
0 
0 
0 
0 
0 
3.6322 
We then create the matrix
which is the product of the above Kronecker
product and the X new matrix.
14.4117 
14.4117 
0 
0 
0 
0 
0 
3.6322 
7.2644 
0 
0 
0 
0 
0 
3.6322 
10.8966 
0 
0 
0 
0 
0 
3.6322 
14.5288 
0 
0 
0 
0 
0 
3.6322 
18.161 
0 
0 
0 
0 
0 
18.0282 
18.0282 
6.1609 
6.1609 
1.540225 
0 
0 
18.0282 
36.0564 
6.1609 
12.3218 
2.587578 
0 
0 
18.0282 
54.0846 
6.1609 
18.4827 
5.175156 
0 
0 
18.0282 
72.1128 
6.1609 
24.6436 
8.132388 
0 
0 
18.0282 
90.141 
6.1609 
30.8045 
10.10388 
0 
0 
40.8517 
40.8517 
10.5985 
10.5985 
2.64963 
14.4117 
146.9993 
40.8517 
81.7034 
10.5985 
21.197 
4.45137 
14.4117 
295.4399 
40.8517 
122.5551 
10.5985 
31.7955 
8.90274 
14.4117 
438.1157 
40.8517 
163.4068 
10.5985 
42.394 
13.99 
14.4117 
599.5267 
40.8517 
204.2585 
10.5985 
52.9925 
17.3815 
14.4117 
723.4673 
We can also create the vector which
is the product of the Kronecker product and the y vector. This is what it looks
like for this data:
30.26457 
16.3449 
24.69896 
34.14268 
39.22776 
8.2869 
21.9823 
38.1874 
48.7114 
62.8613 
42.12162 
97.93992 
152.7685 
209.4223 
263.3902 
This is the point
at which the Bayesian approach departs from that of the frequentist school.
We start with the
prior distributions of β and ∑. To do that, we introduce some new
terms.
The vector is
the average of previous beta vectors. If there are no previous beta vectors, it
is a vector of 0s.
The A matrix is
known as the precision matrix. To say there is a bit of literature on this
matrix is a tad of an understatement. However, in a PDF file put together by
Peter E. Rossi (one of the authors of Bayesian Statistics and Marketing), I
found a formulation of it by multiplying the identity matrix by a scalar of
0.05. As a twist on this, I tried a scalar of the product of the number of
coefficients and 0.01. Since there are 4 X variables and 3 intercepts for a
total of 7 coefficients, the scalar for the A matrix would be 0.07.
The prior
distribution of β follows a normal distribution with the mean equal to and the
covariance matrix equal to A^{1}.
The V matrix is
constructed by taking the sigma matrix constructed previously and multiplying
it by the degrees of freedom v = n – k in which k represents the number of
coefficients. Since n = 15 and k = 7, we have v = 8 degrees of freedom. For
this data set, this is the V matrix:
0.5306 
1.5526 
0.3619 
1.5526 
4.7278 
0.924 
0.3619 
0.924 
0.3808 
The prior
distribution of ∑ follows an inverse Wishart distribution with parameters
V and v degrees of freedom.
We now combine the
prior distributions and the data to create the posterior distributions.
In the
frequentist school, B = (X’X)^{1}(X’Y). However, in the Bayesian
school,
= (X’X
+ A)^{1}(X’Y + A).
Once is
constructed, the posterior distribution of β follows a normal distribution
with the mean of and
covariance matrix of .
If we assume that
the vector
is a vector of 0s, then .
For this data
set, this is:
0.1569 
2.274553 
0.383083 
4.328316 
0.331196 
0.24892 
0.066227 
In order to generate β, we need a vector
of Z values which follows a normal distribution with mean of 0 and standard
deviation of 1. Since there are 7 coefficients, we need a vector of 7 Z values.
Suppose, the computer generates this
vector:
0.7839 
2.5079 
1.8482 
1.2640 
0.2313 
0.8057 
0.2340 
We also need the
lower Cholesky root of For
this data, the current value of this matrix is:
0.01062 
0.00452 
0.031115 
0.01332 
0.00025 
0.00732 
0.000305 
0.00452 
0.003009 
0.01327 
0.008877 
0.00019 
0.003126 
0.0002 
0.031115 
0.01327 
0.130568 
0.06583 
0.052223 
0.00036 
0.000331 
0.01332 
0.008877 
0.06583 
0.062417 
0.09222 
0.003747 
0.00042 
0.00025 
0.00019 
0.052223 
0.09222 
0.249939 
0.00112 
4.36E05 
0.00732 
0.003126 
0.00036 
0.003747 
0.00112 
0.026158 
0.00078 
0.000305 
0.0002 
0.000331 
0.00042 
4.36E05 
0.00078 
3.22E05 
The lower root
is:
0.103054 
0 
0 
0 
0 
0 
0 
0.04386 
0.032934 
0 
0 
0 
0 
0 
0.301925 
0.00082 
0.198514 
0 
0 
0 
0 
0.12922 
0.097434 
0.13465 
0.134514 
0 
0 
0 
0.002426 
0.00264 
0.259367 
0.42169 
0.069514 
0 
0 
0.07106 
0.000278 
0.10628 
0.065777 
0.01119 
0.073216 
0 
0.002959 
0.00221 
0.00284 
0.00151 
0.00186 
0.00196 
0.000919 
Since this is a multivariate
normal distribution, we multiply the lower matrix with the Z vector and add . This is the result:
0.7839 
2.5079 
1.8482 
1.264 
0.2313 
0.8057 
0.234 
When we multiply
X and β and calculate the error terms, we can create the E matrix:
1.1918 
4.273625 
1.0811 
1.2997 
10.37695 
2.8913 
1.5076 
15.83809 
4.9079 
1.4155 
23.11312 
6.6287 
2.5234 
26.25113 
7.5411 
After that, we once again create the S matrix
which is E’E:
13.75365 
141.416 
40.85756 
141.416 
1600.128 
463.527 
40.85756 
463.527 
134.4237 
The posterior
distribution of ∑ follows an inverse Wishart distribution with parameters
S + V and degrees of freedom v + n in which n is the number of rows in each
model. For this data set, this is the sum of S and V:
14.28425 
139.863 
40.49566 
139.863 
1604.856 
464.451 
40.49566 
464.451 
134.8045 
The degrees of
freedom are 8 + 5 = 13.
To generate ∑,
we need the T matrix. Since there are 3 models, the dimensions of the T matrix
are 3 x 3. For this data set, this is the basic setup of the T matrix:
For example, is a
chisquare variable with 13 degrees of freedom.
Suppose, the
following T matrix is generated:
7.201962 
0 
0 
0.1135 
15.26426 
0 
0.567206 
0.759346 
8.226725 
We also need U which is the upper Cholesky
root of the S + V matrix:
3.779451 
37.0062 
10.71469 
0 
15.34273 
4.42824 
0 
0 
0.624956 
We then generate
C = T’U:
51.86826 
2.54992 
8.665058 
0 
232.9976 
17.83779 
0 
0 
67.679 
Finally, ∑
= (C^{1})(C^{1})’ :
0.000378119 
3.76071E06 
3.72939E05 
3.76071E06 
1.96999E05 
1.6714E05 
3.72939E05 
1.6714E05 
0.000218319 
We then use this ∑ to generate
the next β and so on. After running 50,000 iterations of the Gibbs
sampler, the averages of the generated β and ∑ are calculated. This is due to the random Z and values that are generated in each
iteration. This is a result
for β:
Coefficient 
Beta
(Gibbs) 
SE
(Gibbs) 
Beta
(LS) 
SE
(LS) 
X10 
0.1073 
2.4016 
0.03 
0.3728 
X11 
2.2077 
0.2182 
2.23 
0.1124 
X20 
1.1355 
0.2255 
1.1418 
2.1708 
X21 
3.8157 
0.0751 
3.8261 
2.9894 
X22 
1.3808 
0.4142 
1.3420 
8.0463 
X30 
0.3544 
0.2041 
0.3674 
0.3177 
X31 
0.0686 
0.0002 
0.0689 
0.0094 
As a comparison, the coefficients and standard errors from the individual
least square models are included. It should be noted that the standard errors
for models #2 and #3 under the Gibbs sampler are quite less than under least
squares.
After 50,000 iterations, this is a result for ∑:
2.187177 
0.62216 
0.501253 
0.62216 
0.188484 
0.13562 
0.501253 
0.13562 
0.183619 
The aforementioned standard
errors are calculated by computing
In this equation, n represents the number of rows in each model, which, in
this case, is 5.
Reference:
Rossi, P.E.,
Allenby, G.M., and McCullogh, R. Bayesian
Statistics and Marketing. Chichester: John Wiley & Sons, 2005.
Zellner,
A. An Introduction to Bayesian Inference
in Econometrics. New York: John Wiley & Sons, 1970.