Economics of SharePoint Governance Part 14: The Existence and Strength of the Relationship (Multicollinearity, Autocorrelation, and Heteroskedasticity)
For our purposes, the evaluation criteria developed by the inference analysis will be divided into two groups. The first, usually described as “analysis of variance,” yields four evaluation criteria that may serve as bases for inferences about the existence, strength, and validity of a regression model:
(a) The correlation coefficient, r or R, measures the degree of association or covariation between the dependent variable and an independent variable in the regression model. One such simple correlation coefficient may be computed between the dependent variable and each independent variable in the model, and in a multivariate model between each pair of independent variables. Computerized statistical systems usually produce a matrix of such simple correlation coefficients so that the analyst may ascertain the correlations between the dependent and each of the independent variables, as well as among the independent variables (as illustrated on pages following). If the model contains more than a single independent variable, a so-called multiple correlation coefficient, R, may also be computed to assess the overall association between the dependent variable and all of the included independent variables taken together. The range of r is from -1 to +1, with positive values implying direct relationships and negative values indicating inverse relationships. Values of r near the extremes of this range imply a near-perfect direct or inverse relationship between the two variables, depending upon sign. Values of r in the neighborhood of zero (positive or negative), however, imply no statistically identifiable relationship between the variables.
(b) The coefficient of determination, r2, is interpreted as the proportion of the variation in the dependent variable data that can be statistically explained by data for the independent variable for which the r2 is computed. In a multivariate regression model, a coefficient of multiple determination, R2, may be computed; if there is only one independent variable in the model, the computed R2 will be equal to the only simple r2. Since r2 is computed as the squared value of the correlation coefficient, r2 is unsigned, and falls within the range of zero to +1. Although computed values of r and r2 contain essentially the same information (except for differences in sign), and each implies a value of the other, many analysts prefer to focus attention on r2 because of its determination interpretation.
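The arithmetic behind r and r2 can be illustrated directly from paired observations. A minimal sketch in Python; the two data series below are hypothetical, chosen only to show the computation:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]              # hypothetical independent variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical dependent variable

r = pearson_r(x, y)
r_squared = r ** 2   # proportion of variation in y explained by x
print(round(r, 4), round(r_squared, 4))
```

A perfectly linear direct relationship yields r = +1, a perfectly inverse one r = -1, and squaring discards the sign, which is why r2 is unsigned.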
The interpretation of any computed r2 statistic is subjective, and hence open to dispute. For example, how high (toward unity) does the r2 have to be in order for the analyst to infer the existence and strength of a relationship? How low (toward zero) can an r2 statistic be before the analyst may draw the inference that no statistically identifiable relationship exists between the dependent and an independent variable? Analysts in the natural sciences often expect r2 values in excess of 0.9 (or even higher) to indicate the existence of a usable relationship.
Because of the degree of randomness, capriciousness, and ignorance that may characterize human decision making and behavior in the aggregate, a social scientist may defensibly judge an r2 that is in excess of 0.7 (or perhaps even somewhat lower) to be indicative of a statistically meaningful relationship. But most analysts are skeptical of the existence of a statistically meaningful relationship if the r2 between the dependent and independent variables is below 0.3 (in statistical jargon, the null hypothesis, i.e., that there is no relationship, is supported). For our purposes, the r2 values of 0.3 and 0.7 will be taken as evaluation benchmarks: values of r2 in excess of 0.7 are sufficient to reject the null hypothesis; values below 0.3 support the null hypothesis. But the reader should be aware that both of these values are rather arbitrarily selected and are subject to challenge.
Assuming that these values will serve satisfactorily as evaluation criteria, what of the r2 range between 0.3 and 0.7? This constitutes a statistical “no-man’s land” wherein no strong inferences can be drawn about either the existence or the non-existence of a statistically meaningful relationship between two variables. The interpretation of r2 values below 0.3 or in excess of 0.7 may be further refined with reference to the statistical significance of the regression model and particular variables within it.
(c) The computed F-statistic may be used as a basis for drawing an inference about the statistical significance of a regression model. Once the F-value is computed, the analyst must consult an F-distribution table (which may be found in any statistical source book or college-level statistics text). For the particular regression model under consideration, its computed F-value may be compared with the F-distribution table value in the column corresponding to the number of independent variables in the model (the degrees of freedom in the numerator), and on the row corresponding to the number of degrees of freedom (DF) of the regression model. The DF of a regression model is the number of observations less the number of variables (dependent and independent) in the model. For a regression analysis including a dependent variable and two independent variables conducted with 60 observations, the DF is 57. If the regression model's computed F-value exceeds the F-distribution value read from the appropriate column and row in the table, the analyst may infer that the model is statistically significant at the level indicated in the heading of the table for the F-distribution (usually .05, i.e., only 5 chances in 100 that the model is spurious).
Suppose that the computed F-value for a regression model is 4.73, with 60 degrees of freedom. An F-distribution table reveals that the F-value required for significance at the .05 level is 4.00; the F-value required for the .01 significance level is 7.08. These findings support the inference that the regression model is statistically significant at the .05 level, but not at the .01 level.
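The table-lookup comparison in this example can be sketched as follows. The two critical values are the F-table entries quoted above (1 numerator DF, 60 denominator DF); the logic is simply "computed F must exceed the table value":

```python
# Worked example from the text: computed F = 4.73 with 60 denominator DF.
f_computed = 4.73
f_crit_05 = 4.00  # F-table critical value at the .05 level (1, 60 DF)
f_crit_01 = 7.08  # F-table critical value at the .01 level (1, 60 DF)

# The model is significant at a given level only if its computed F
# exceeds the corresponding table value.
significant_at_05 = f_computed > f_crit_05
significant_at_01 = f_computed > f_crit_01
print(significant_at_05, significant_at_01)  # True False
```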
For many less-consequential forecasting purposes, most analysts probably would be willing to accept (though with hesitancy) a regression model with r2 of 0.7 and statistical significance at the .05 level. If truly consequential decisions are to be based upon the regression model forecasts, the analyst may not be willing to use any model for which r2 is less than some very high value (like 0.9 or 0.95), with statistical significance below some very low level (like 0.01 or 0.001).
(d) The standard error of the estimate (SEE) may be used to specify various confidence intervals for forecasts made with the regression model. Realistically speaking, the likelihood that the actual value at some forecast-target date will fall precisely on the value estimated in a regression model is nearly zero. In other words, the forecasted value is a "point estimate" that the analyst hopes will be close to the actual value when it finally occurs. The computed SEE specifies a numeric range on either side of the point estimate within which there is an approximate 68 percent chance that the actual value will fall. Two other confidence intervals are also conventionally prescribed. There is an approximate 95 percent probability that the actual value will lie within a range of two standard errors of the forecasted point estimate, and better than a 99 percent probability that the actual value will lie within three SEEs of the forecasted value. For example, suppose that a regression model forecasts a point estimate of 732 for the target date, with an SEE of 27. The 68 percent confidence interval may thus be computed as the range from 705 to 759 (i.e., 732 +/- 27); the 95 percent confidence interval is from 678 to 786; and the 99 percent confidence interval is from 651 to 813. It should be apparent that the higher the required confidence in the forecasts made with a regression model, the wider will be the range within which the actual value likely will fall. As a general rule, the SEE will be smaller the higher the r2 and the lower (i.e., better) the significance level of the regression model. Other things remaining the same, a regression model with a smaller SEE is preferable to one with a larger SEE. In any case, the analyst would be better off, in reporting the results of a regression model forecast, to specify confidence intervals rather than the single-valued point estimate that will almost certainly not occur.
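The interval arithmetic from the 732/27 example can be captured in a small helper. Under normally distributed errors, the 1-, 2-, and 3-SEE multipliers correspond to roughly 68, 95, and over 99 percent coverage:

```python
def see_intervals(point_estimate, see):
    """Confidence intervals at 1, 2, and 3 standard errors of the estimate.
    Under normal errors these cover roughly 68, 95, and 99+ percent."""
    return {k: (point_estimate - k * see, point_estimate + k * see)
            for k in (1, 2, 3)}

intervals = see_intervals(732, 27)  # worked example from the text
print(intervals[1])  # (705, 759)
print(intervals[2])  # (678, 786)
print(intervals[3])  # (651, 813)
```

Note how the interval widens as the required confidence rises: certainty is purchased at the price of precision.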
Certain inference statistics are also computed for purposes of assessing the statistical significance of the regression coefficient (b, the estimated slope parameter) of each included independent variable.
(a) The standard error of the regression coefficient, SEC, is computed for each of the slope parameters estimated by the regression procedure. Unless the entire universe of values (all that have existed or can exist) is available for all variables included in a regression model, the regression analysis can do no more than construct an estimate of the true slope parameter value from the sample of data currently available. By its very nature, time series regression analysis could never encompass the entire span of time from the "beginning" through all eternity. Data for various finite time spans will thus yield differing estimates of the true parameter of relationship between any two variables. The hope of the analyst, and one of the premises upon which regression analysis is erected, is that all such estimated parameter values will exhibit a central tendency to converge upon the true value of relationship, and that any single estimated regression coefficient will not be very far from the true value.
All such regression coefficient estimates are presumed to constitute a normally distributed population for which a standard deviation (a measure of average dispersion about the mean) may be computed. This particular standard deviation is called the standard error of the regression coefficient. It may be used to specify the 68, 95, and 99 percent confidence intervals within which the true coefficient value is likely to lie. As a general rule, the smaller the value of the SEC relative to its regression coefficient, the more reliable is the estimate of the regression coefficient.
(b) The t-value may be computed for each regression coefficient. In generalized inferential analysis, Student's t-test may be used to test for the significance of the difference between two sample means. Applied to regression analysis, the t-value may be used to test for the significance of the difference between the estimated regression coefficient and the mean of all such regression coefficients that could be estimated. Since the latter is unknowable, the t-value is usually computed for the difference between the estimated regression coefficient and zero. As such, it can only be used to ascertain the likelihood that the estimated regression coefficient is non-zero.
Once the t-value for a regression coefficient is computed, the analyst may consult a Student's t-distribution table on the appropriate DF row to see where the computed t-value would lie. The t-table value just below the computed t-value identifies the column in the t-distribution table that specifies the significance level of the test. Suppose that the absolute value (unsigned) of the computed t-value for an estimated slope parameter is 1.73, with 60 degrees of freedom. A t-distribution table would show 1.73 lying between the values 1.671 and 2.000 on the 60 degree-of-freedom row. The heading of the column containing the value 1.671 is the 0.1 significance level, implying that there is only one chance in ten that a coefficient of this magnitude would be estimated if the true coefficient were zero.
As a general rule, the lower the significance level of a regression coefficient, the more reliable are the forecasts that can be made using the model containing the independent variable for which the regression coefficient was estimated. For especially consequential decision making, the analyst may not be willing to retain any term in a regression forecasting model that is statistically significant above the 0.01 level. Since the t-value may be computed as the ratio of the estimated regression coefficient to its computed SEC, a rule of thumb may be prescribed that permits the analyst to avoid reference to a t-distribution table. If the absolute value of an estimated regression coefficient exceeds its computed SEC, the analyst may infer that the regression coefficient is statistically significant at the 0.33 level or lower. If the regression coefficient is more than twice the magnitude of its SEC, this implies a 0.05 significance level for the coefficient. Likewise, if b exceeds its SEC by a factor of 3, the implied significance level is below 0.01.
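The rule of thumb above can be expressed as a small helper. The thresholds and levels are those stated in the text; the example coefficients and standard errors are hypothetical:

```python
def implied_significance(b, sec):
    """Rule-of-thumb significance level from the ratio |b| / SEC,
    avoiding a t-distribution table lookup."""
    ratio = abs(b) / sec
    if ratio > 3:
        return 0.01   # b exceeds its SEC by a factor of more than 3
    if ratio > 2:
        return 0.05   # b more than twice the magnitude of its SEC
    if ratio > 1:
        return 0.33   # b merely exceeds its SEC
    return None       # not significant even at the 0.33 level

# Hypothetical coefficients and standard errors:
print(implied_significance(2.4, 0.7))   # 0.01
print(implied_significance(-1.5, 0.7))  # 0.05
print(implied_significance(0.5, 0.7))   # None
```

The sign of b is irrelevant to significance, which is why the ratio uses the absolute value.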
There are several problems that may emerge in the multiple regression model context. All are consequences of violating one or another of the assumptions or premises that underlie the multiple regression environment. Adjustments may be made to the data or the analysis to deal with some of these problems; in other cases, the analyst should simply be aware of the likely effects.
The most fundamental of the multiple regression assumptions is that the independent variables are truly independent of one another. Multicollinearity may be identified by the presence of non-trivial correlation between pairs of the independent variables included within the model. Multicollinearity may be detected by examining the correlation matrix for all of the variables contained in the model.
Multicollinearity is almost certain to be present in any autoregressive or polynomial regression model of order higher than 1st. Because the successive terms in a kth-order autoregressive model use essentially the same data as the first term, except shifted by some number of rows, the assumption of independence among the “independent” variables is clearly violated. Likewise, because the successive terms in a kth-order polynomial model employ the same data as the first term, except as raised to successively higher powers, the assumption of independence again is clearly violated.
Multicollinearity may also be present among the different independent variables included in the model, even if they are not autoregressive with the dependent variable, and even if they are each only first order. If two independent variables are linearly similar, i.e., highly correlated with each other, it is as if the same variable were included twice in the model, thereby contributing its explanatory power twice and thus amounting to so much "deck stacking." The usual effect of the presence of non-trivial multicollinearity is to inflate the standard errors of the coefficients of the collinear independent variables, rendering their computed t-values too low and implying worse (i.e., higher) significance levels than are warranted.
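Scanning the correlation matrix for collinear pairs of independent variables, as described above, can be sketched as follows. The variable names, data, and the 0.8 flagging threshold are all hypothetical; the threshold an analyst actually uses is a judgment call:

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def collinear_pairs(independents, threshold=0.8):
    """Flag pairs of independent variables whose correlation exceeds
    the (hypothetical) threshold in absolute value."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(independents.items(), 2):
        r = pearson_r(s1, s2)
        if abs(r) > threshold:
            flagged.append((n1, n2, round(r, 3)))
    return flagged

# x2 is nearly a scaled copy of x1, so that pair should be flagged;
# x3 is unrelated to both.
data = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.0, 4.1, 5.9, 8.2, 10.0, 12.1],
    "x3": [5, 1, 4, 2, 6, 3],
}
print(collinear_pairs(data))
```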
Some statisticians prefer to remedy the presence of any non-trivial multicollinearity by removal of one or the other of the two collinear variables from the model. Others suggest that if there are good conceptual reasons for including both independent variables, they should both be retained in the model unless the multicollinearity is extreme (i.e., the correlation between the collinear independent variables approaches 1.00 in absolute value), or unless the analyst is particularly concerned about the statistical significance of either of the collinear independent variables. In this latter case, if the independent variables are time series, the analyst might try differencing the collinear independent variables and respecifying the model with the differenced series in place of the raw data series to see if significant information is contained in either of the collinear independent variables that is not also contained in the other.
Another of the premises underlying multiple regression modeling is that the forecast errors constitute an independent random variable, i.e., a random noise series. If there is a discernible pattern in the forecast error series, then autocorrelation is present in the dependent variable series. Autocorrelation may be detected by computing autocorrelation coefficients to some specified order of autocorrelation. Alternately, the analyst may construct a sequence plot of the forecast error series. The statistical software system may facilitate this procedure by allowing the user to have the forecast error series written to the next available empty column in the data matrix so that the sequence plot of that column may be constructed. If the forecast error series exhibits rapid alternation of points above and below its mean, then the object series is negatively autocorrelated. Positive autocorrelation is present if the error series exhibits "runs" of points above the mean alternating with runs of points below the mean in a cyclical (or seasonal) fashion. The expected number of runs if the series is truly random noise may be estimated for comparison with the actual (by count) number of runs exhibited by the series. If the actual number of runs is smaller than the expected number, then positive autocorrelation almost surely is present in the dependent variable series; if it is larger than expected, negative autocorrelation is indicated.
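The runs comparison just described can be sketched with a Wald-Wolfowitz-style runs count. The error series below is fabricated to illustrate the positive-autocorrelation case (long runs on each side of the mean):

```python
def runs_analysis(errors):
    """Actual vs. expected number of runs above/below the series mean
    (Wald-Wolfowitz style). Fewer runs than expected suggests positive
    autocorrelation; more runs than expected suggests negative."""
    mean = sum(errors) / len(errors)
    signs = [e > mean for e in errors]
    # A new run starts at each sign change.
    actual = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n1 = sum(signs)              # points above the mean
    n2 = len(signs) - n1         # points at or below the mean
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    return actual, expected

# Hypothetical forecast errors: a long run above the mean, then below.
errs = [3, 2, 4, 3, 2, -3, -2, -4, -3, -2]
actual, expected = runs_analysis(errs)
print(actual, expected)  # 2 6.0
```

With only 2 runs against an expected 6, positive autocorrelation in the underlying series would be the natural inference.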
The effect of the presence of autocorrelation within the dependent variable series is to render the r, F, and t statistics unreliable. In particular, the presence of autocorrelation will likely result in understated standard errors of the regression coefficients, thus causing overstatement of the t-values and implying better (i.e., lower) significance levels for the estimated regression coefficients than are warranted. Although the estimated regression coefficients themselves remain unbiased (i.e., they do not systematically deviate from the true values), autocorrelation results in computed confidence intervals that are narrower than they should be.
Some degree of autocorrelation is likely present in every economic or business time series, and the analyst should probably ignore it unless it is extreme. As noted earlier, one or more autoregressive terms may constitute or be included in the regression model as the primary explanatory independent variables. If the analyst discovers the presence of non-trivial autocorrelation in a regression model that was specified without autoregressive terms, he might consider respecifying it to include one or more such terms as the means of handling the autocorrelation. The approach in this case is to use the autocorrelated information rather than purge it from the model.
The problem of heteroskedasticity occurs if there is a systematic pattern between the forecast error series and any of the independent variable series. Homoskedasticity is the absence of such a pattern. Whether the model exhibits heteroskedasticity may be discerned by plotting the forecast errors against the data for each of the independent variables in scatter diagrams. If the scatter of the plotted points exhibits any discernible pattern, then heteroskedasticity is present within the model.
If the regression model is non-trivially heteroskedastic, the mean squared error and the standard error of the estimate will be specific to the particular data set; another data set may yield inference statistics that diverge widely from those computed from the first. Likewise, the inference statistics associated with the particular independent variables (SEC, t, and significance level) will also be specific to the data set. I shall leave the matter of heteroskedasticity with a warning to the analyst of the likely consequences for his model, i.e., that its usefulness for forecasting may be strictly limited to the range of data included in the object series.
This series consists of many parts that I am adapting from pieces of an academic research paper, so bear with me if it gets too esoteric. Or read the other governance articles available within the SharePoint Security category on the main site (available through the parent menu).