Models constitute the primary vehicles for conducting economic analysis. Economic theorists, attempting to push the frontiers of economic knowledge, employ models to discern the economic nature of the world. Models constitute the nearly universally employed instructional vehicles of the economics classroom. Government as well as private-sector decision makers use models to forecast important environmental conditions. They also use specially tailored models for policy simulation purposes, i.e., to examine “what if” circumstances before actually having to make real-world decisions.
Models may be represented in any of several possible formats. The most rudimentary form is that of a verbal assertion of a relationship between a dependent and one or more independent variables. An example of such a verbal assertion is the law of demand, i.e., that people will tend to buy items in larger quantities at lower prices, and vice-versa. Mathematical functional notation may be employed to represent the model, as, for example, y = f(x1, x2, …), where y is understood to be a dependent variable, the behavior of which is determined in some sense by one or more independent variables, x1, x2, etc.
The specific structure of the model cannot be shown in functional notation format. The model can be made specific by putting it into an equation format, such as the so-called slope-intercept form of a linear relationship, y = a + bx. Here, a is the y-axis intercept, and b is the slope of the line representing the relationship between x and y. Other equation formats are also possible, as will be discussed below. From the equation format, it is but a short step to a computer language statement that represents the model. Indeed, most economic models designed for forecasting and policy simulation purposes eventually find their way into computerized implementations.
The process of specifying a model can be described in several identifiable steps. The first step is to model relationships between the dependent variable and one or more independent variables. The modeling process, following the procedures outlined previously, involves selection of independent variables thought most likely to affect the behavior of the dependent variable, and guesses at the nature of the relationships (i.e., whether direct or inverse, linear or curved, and if curved relationships, the direction of curvature). Such guesses lead to expectations of signs (+ or -) of the coefficients of the variables, positive for direct relationships, negative for inverse relationships. Prospective independent variables not expressly included in a model are treated as if they are constant (even if in actuality they are not).
The next step is to acquire data for all of the variables expressly included in the model. The data may be captured cross sectionally (i.e., at a point in time but across a number of subjects) or as a time series (i.e., for a single subject, but over a period of time). The process of data capture can be expected to be costly in terms of time, effort, and various explicit expenses.
The third step in the model specification process is to estimate the values of the parameters (or constants) of the hypothesized relationships. In the slope-intercept equation model format, the values of the constants a and b must be estimated. Although these values may be estimated by informal means (use of the imagination, “eye-balling”), the remainder of this post will be devoted to elaboration of the formal statistical procedures that have been adopted nearly universally by economic analysts for the specification process. After the model has been specified, the final step in the process is to attempt to verify the model, or ascertain its adequacy to the end for which it was constructed. The tools of statistical inference are applied to this objective, but the ultimate test of a model is its ability to explain, predict, simulate, or instruct.
Regression analysis is the most widely used formalized method of estimating demand functions (or any other type of functional relationship) in economics and business administration settings. This post introduces the modeling capabilities of regression analysis; a fuller elaboration of its capabilities and its extensions into the realm of economic forecasting will be provided in subsequent posts.
The simplest form of regression model relates two variables, of which one is described as dependent, and the other as independent. The format for this simplest regression model may be described in functional-notation form as
(1) y = f(x)
where y is the dependent variable, and x is the independent variable. The linear form of the simple regression model could be expressed in slope-intercept form as
(2) y = a + bx
where a is the intercept parameter, and b is the slope parameter. What statistical regression analysis will accomplish, given an adequate amount of data for x and y, is to compute values for the parameters a and b according to a certain mathematical principle. The data may have been gleaned from historical sources, or generated with experiments or sample survey techniques.
The simple linear regression analysis may be extended to multiple regression with the addition of terms for other independent variables, or to higher-order relationships with the addition of terms for values of the single independent variable raised (or transformed by self-multiplication) to successively higher powers. The general form of the linear multiple regression model is
(3) y = a + b1x1 + b2x2 + … + bnxn.
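As a rough illustration of how the parameters of equation (3) might be computed, here is a minimal Python sketch that solves the least-squares "normal equations" directly. The function names (solve_linear_system, fit_multiple) and the data are my own inventions for illustration; a statistical package would normally do this work.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple(rows, ys):
    """Least-squares estimates [a, b1, ..., bn] for y = a + b1*x1 + ... + bn*xn.

    rows holds one [x1, ..., xn] list per observation.
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from y = 2 + 3*x1 - 1*x2 (no noise).
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
ys = [2.0 + 3.0 * x1 - 1.0 * x2 for x1, x2 in rows]

a, b1, b2 = fit_multiple(rows, ys)
print(a, b1, b2)
```

Because the illustrative data contain no noise, the estimates should come back very close to the values used to generate them.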
Assuming for the moment a simple regression model as in equation (2), the data for variable y may be given graphic representation by plotting them in the vertical dimension relative to the data for the independent variable, x, in the horizontal dimension on a set of standard coordinate axes. A “scatter diagram” of the data for two series can be used. This scatter diagram can exhibit a great deal of variation about an upward-sloping path from left to right as the independent variable changes on the horizontal axis. It is possible to draw a smooth line, free-hand style, through the data plot. The smooth line may be straight or curvilinear.
Regression analysis functions to accomplish mathematically what the analyst may have drawn as a free-hand curve of average relationship using an “eye-ball” fitting technique. Although there are several possible regression principles, the most commonly used approach is to “fit” the average relationship curve to the plotted points so as to minimize the sum of the squared vertical deviations of the points from the curve. Assuming this so-called “least-squares” technique is properly applied, no other curve of the same functional form can be “fitted” to the plotted points with a smaller sum of squared errors. The regression analysis produces estimates of the parameters (intercept, slope) of the best-fit average relationship curve.
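To make the least-squares idea concrete, the following Python sketch computes the intercept and slope of y = a + bx from the standard closed-form formulas and then checks that nudging the fitted line only increases the sum of squared errors. The data and the function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # The least-squares line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical deviations of the points from the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical price/quantity observations (not real data).
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
quantities = [9.8, 8.1, 6.2, 3.9, 2.0]

a, b = fit_line(prices, quantities)
best_sse = sum_squared_errors(prices, quantities, a, b)

# Any other straight line should do no better on this criterion.
nudged_sse = sum_squared_errors(prices, quantities, a + 0.1, b - 0.05)
print(a, b, best_sse, nudged_sse)
```

The negative slope in this invented example is what the law of demand would lead one to expect.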
Depending upon the amount of variation in the dependent variable, the resulting regression equation may be able to estimate, with some error, values of existing observations within the data set, and to predict values of hypothetical observations not included in the data set. This latter possibility constitutes the potential of regression analysis to serve as a modeling technique. Given a regression equation such as that developed from a “least-squares” regression procedure on a set of data, unknown values of the dependent variable, y, may be estimated by inserting a known value of the independent variable, x, into the regression equation and solving for the dependent variable value. The error involved in estimating or predicting such values constitutes a potentially serious problem that will be discussed subsequently.
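The distinction between estimating existing observations and predicting new ones can be sketched in a few lines of Python. The parameter values below are hypothetical, assumed to have come from some earlier least-squares fit, and the observations are invented.

```python
# Hypothetical parameter estimates for y = a + b*x (assumed, not from real data).
a, b = 11.94, -1.98


def predict(x):
    """Insert a known x into the regression equation and solve for y."""
    return a + b * x


# In-sample: compare estimates with observed values to see the error.
observed = {1.0: 9.8, 3.0: 6.2, 5.0: 2.0}
residuals = {x: y - predict(x) for x, y in observed.items()}

# Out-of-sample: x = 6.0 is not in the data set, so no error can be computed.
new_forecast = predict(6.0)

print(residuals)
print(new_forecast)
```

The residuals give a sense of how far off the equation is for known points; for the new point there is only the prediction itself, which is why the error question matters.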
Some dependent variables can be adequately modeled with reference to a single independent variable, but other dependent variables can be modeled only inadequately in this manner. There are two possibilities for these variables: either they are characterized so extensively by random noise that they cannot be modeled, or there are one or more other phenomena that govern or influence the behavior of the variables. If comparable data for these other phenomena can be acquired, then conventional simple and multiple regression procedures may be invoked to model the relationships. The conventional formats for the simple and multiple regression models are given by equations (2) and (3) above.
Once a multiple regression model has been specified and the parameter values estimated, the analyst may discern the predictive ability of each of the included independent variables by examining the inference statistics for each of them as discussed later in this post. Any independent variable that in the judgment of the analyst does not make a satisfactory contribution to the explanation of the behavior of the dependent variable data may then be deleted from the model when the model is respecified. Many statistical software packages include options for stepwise deletion of inadequately contributing independent variables from the model according to some criterion specified by the programmer or the analyst. In the stepwise regression procedure, the full model including all variables selected by the analyst is first estimated, and then the model is automatically respecified in subsequent steps, omitting one variable in each step, until only one independent variable remains in the model. The analyst may then inspect the sequence of model specifications, looking for a significant drop in the overall level of explanation of the behavior of the dependent variable. Once this loss is identified, the model specified prior to the deletion of the independent variable resulting in the significant loss is the optimal model.
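The stepwise deletion procedure can be sketched in Python roughly as follows. This is only one possible reading: I assume the deletion criterion is "drop the variable whose removal costs the least R-squared", which is my choice for illustration, and the function names and data are invented. Real packages offer other criteria.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def r_squared(columns, ys):
    """Fit y on an intercept plus the given data columns; return R-squared."""
    n = len(ys)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(ys) / n
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


def backward_steps(named_columns, ys):
    """Return [(variable names, R-squared)] for the full model and each deletion step."""
    current = dict(named_columns)
    steps = [(sorted(current), r_squared(list(current.values()), ys))]
    while len(current) > 1:
        # Assumed criterion: delete the variable whose removal leaves the highest R-squared.
        best_name, best_r2 = None, float("-inf")
        for name in current:
            rest = [col for other, col in current.items() if other != name]
            r2 = r_squared(rest, ys)
            if r2 > best_r2:
                best_name, best_r2 = name, r2
        del current[best_name]
        steps.append((sorted(current), best_r2))
    return steps


# Invented data: y depends on x1 and x2; x3 is an irrelevant variable.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [3.0, 7.0, 1.0, 8.0, 2.0, 6.0, 4.0, 5.0]
x3 = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
ys = [2.0 + 3.0 * p - 2.0 * q + e for p, q, e in zip(x1, x2, noise)]

steps = backward_steps({"x1": x1, "x2": x2, "x3": x3}, ys)
for names, r2 in steps:
    print(names, round(r2, 4))
```

Reading down the printed sequence, the analyst would look for the step at which R-squared falls sharply and keep the model from the step just before it.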
The multiple regression equation, linear in its equation (3) format, can be made to fit curvilinear data paths by converting it to a polynomial format. The polynomial equation includes one independent variable raised to successively higher powers. For example, a quadratic polynomial equation includes linear and second-order (or squared) terms in the format:
(4) y = a + b1x + b2x^2.
Data for the second independent-variable term are generated by performing a squared transformation on the data for the first independent variable. The general format for a k-th order polynomial model is
(5) y = a + b1x + b2x^2 + b3x^3 + … + bkx^k,
where data for all terms beyond the linear term are generated by subsequent transformations. Some software packages can automatically generate a specified k-th order polynomial regression model computationally (i.e., without having to go through the data transformation phase). The analyst should consider a polynomial form of relationship when the scatter diagram exhibits a curved path that is not apparently amenable to linear modeling. As a general criterion, the analyst should specify a polynomial equation of order k that is equal to the number of directions of curvature apparent in the scatter diagram, plus 1. For example, if the scatter diagram exhibits one direction of curvature, then a k=2, or second-order regression model should be specified. If the scatter diagram exhibits two directions of curvature, a k=3 or third-order (cubic) model of form
(6) y = a + b1x + b2x^2 + b3x^3
should be specified. If the analyst is certain that the relationship is not linear but is in doubt about the appropriate order of relationship, the analyst could specify second-, third-, and perhaps fourth-order models, then choose the one with the smallest mean squared error (discussed below).
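The idea of trying several polynomial orders and comparing their mean squared errors can be sketched in Python as below. I am assuming here that mean squared error means the sum of squared errors divided by the degrees of freedom (observations minus estimated parameters), which is a common convention but an assumption on my part; the data and function names are invented.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_polynomial(xs, ys, order):
    """Return [a, b1, ..., bk] for y = a + b1*x + ... + bk*x^k."""
    # The "data transformation phase": powers of x become extra columns.
    design = [[x ** p for p in range(order + 1)] for x in xs]
    size = order + 1
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


def mean_squared_error(xs, ys, coefs):
    """SSE divided by degrees of freedom (assumed definition of MSE)."""
    fitted = [sum(c * x ** p for p, c in enumerate(coefs)) for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return sse / (len(xs) - len(coefs))


# Invented data with one direction of curvature plus small disturbances.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
ys = [1.0 + 2.0 * x - 0.5 * x * x + e for x, e in zip(xs, noise)]

mse_by_order = {
    k: mean_squared_error(xs, ys, fit_polynomial(xs, ys, k)) for k in (1, 2, 3)
}
best_order = min(mse_by_order, key=mse_by_order.get)
print(mse_by_order, best_order)
```

Dividing by degrees of freedom rather than by the number of observations penalizes the extra terms a little, so a higher-order model is not automatically favored.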
Finally, I should note that the multiple regression format can accommodate a mixture of all of the formats described to this point, and can further incorporate moving averages and components of a decomposed time series, as described in previous posts. For example, suppose the analyst finds that variable x1 is a significant predictor of the behavior of the object series, but that the explanation needs to be supplemented by two other independent variables, x2 and x3, the first in a linear relationship and the second in a second-order relationship. Such a regression model might have the following appearance:
y = a + b1x1 + b2x2 + b3x3 + b4(x3)^2.
All of the models that we have illustrated to this point have been additive in the sense that the effects of all of the separate independent variables are simply added together to compose the total effect on the dependent variable. Each independent variable is assumed to have its impact on the dependent variable independently of any other independent variable. There may be reason to believe that two or more independent variables have joint or interactive impact on the dependent variable. An example might be found in utility analysis where the total utility realized from consumption of a particular good, e.g., beer, can be expected to depend not only on the quantity of beer consumed, but also the quantity of pretzels available for consumption along with the beer. In cases like this, a multiplicative functional relationship such as that in equation (7) may be more appropriate:
(7) y = a · x1^b · x2^c
Here, the total effect on y is the product of x1 raised to the power b and x2 raised to the power c, multiplied by a scale factor a. Equation (7) can be converted to a form that can be handled by (additive) multiple linear regressions by taking the logarithms of all terms in the equation:
(8) log y = log a + b log x1 + c log x2.
Then, in order to estimate the function, it is necessary to convert the data for variables y, x1, and x2 to logarithmic format. Most computerized statistical systems can accomplish this by way of performing a logarithmic transformation on data for each variable. It should be noted, however, that such a regression equation will predict values for the log of variable y. The analyst must then take the anti-log (or exponential) of log y to find the predicted value of y.
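Here is a minimal Python sketch of the logarithmic transformation, the linear fit on the logged data, and the anti-log step. I use natural logarithms (any base would do as long as it is used consistently), and the data are generated from made-up parameter values so the result can be checked; the function names are my own.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_log_log(x1s, x2s, ys):
    """Estimate (a, b, c) of y = a * x1^b * x2^c via the log-linear form."""
    # Logarithmic transformation of every variable (all values must be positive).
    design = [[1.0, math.log(x1), math.log(x2)] for x1, x2 in zip(x1s, x2s)]
    log_ys = [math.log(y) for y in ys]
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    xty = [sum(d[i] * ly for d, ly in zip(design, log_ys)) for i in range(3)]
    log_a, b, c = solve_linear_system(xtx, xty)
    return math.exp(log_a), b, c  # anti-log recovers the scale factor a


# Invented data generated exactly from y = 2 * x1^0.5 * x2^1.5.
x1s = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0]
x2s = [2.0, 1.0, 4.0, 3.0, 5.0, 6.0]
ys = [2.0 * x1 ** 0.5 * x2 ** 1.5 for x1, x2 in zip(x1s, x2s)]

a, b, c = fit_log_log(x1s, x2s, ys)

# The fitted equation predicts log y; take the exponential to get y itself.
log_y_hat = math.log(a) + b * math.log(3.0) + c * math.log(2.0)
y_hat = math.exp(log_y_hat)
print(a, b, c, y_hat)
```

Note that the transformation only works for strictly positive data, since the logarithm of zero or a negative number is undefined.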
Regression analysis purports to provide answers to a very specific question: “What is the nature of the relationship between the dependent variable and an independent variable?” The question is answered by estimating values of the parameters in a best-fit equation. But regression analysis begs two other very important questions: “Is there a significant relationship between the selected variables?” and, if so, “How strong (or close, or reliable) is the relationship?” If there is no significant relationship, or even if the existing relationship is only trivial, a regression analysis will dumbly estimate the values of the parameters. It is therefore necessary to delve into the significance of the estimated relationships. The existence and significance of a hypothesized relationship should perhaps be brought into question even before the regression analysis is conducted.
The statistical complement to regression analysis that attempts to address the questions begged by it is correlation analysis, a special application of the tools of statistical inference. Although the user of statistics is typically concerned with the results of both the correlation and the regression analyses, it is possible to conduct either analysis without making computations for the other.
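As a small illustration that correlation can be computed without running a regression, the following Python sketch calculates the Pearson correlation coefficient r, and its square, directly from the data. Choosing Pearson's r as the measure is my assumption about what is meant here; the data are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical price/quantity observations (not real data).
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
quantities = [9.8, 8.1, 6.2, 3.9, 2.0]

r = pearson_r(prices, quantities)
r_squared = r ** 2  # share of the variation in y associated with x

print(r, r_squared)
```

The sign of r indicates whether the relationship is direct or inverse, and a value of r squared near 1 suggests a close relationship, although judging significance still calls for the inference tools.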