volume 55 article #5

 A Neural Networks Primer for Agricultural Economists

Terry L. Kastens, Allen M. Featherstone, and Arlo W. Biere

The authors are USDA National Needs graduate Fellow, associate professor, and professor, respectively, Department of Agricultural Economics, Kansas State University. Helpful comments of B. Wade Brorsen and two anonymous reviewers are gratefully acknowledged.

Abstract

This article is designed to allow an agricultural economist to expediently reach an introductory understanding of neural networks. A general backpropagation-estimated feedforward artificial neural network is developed, using matrix algebra notation. A hedonic price model for used combines in the Great Plains is estimated with an ordinary least squares regression model as well as with a neural network. Partial derivatives as well as out-of-sample forecasting ability are compared across the two models. The neural net appears superior to the regression model where dependent variable data are sparse. Where successive dependent variable intervals are more uniformly represented by data, the linear regression model appears superior. The neural network appears better able to distinguish the effects of collinear variables.

Key words: backpropagation, combines, farm machinery pricing, forecasting, neural networks.

Article

Neural networks, as an empirical tool, come to agricultural economics with graphical and notational descriptions that are characteristic of the disciplines which initiated the technique's development. However, neither brain drawings from biology (with the accompanying dendrites, axons, and synaptic junctures), nor signal function notation from electrical engineering, nor lines of C-programming code from computer science, is the most expedient way to introduce neural networks to agricultural economists. Matrix algebra provides a more comfortable framework for agricultural economists.

The objective of this study is to compare a neural network with ordinary least squares regression in an analysis of used combine pricing in the Great Plains. The network is a backpropagation-estimated feedforward neural network presented in matrix algebra. The matrix presentation provides a way that may exempt (given the ever-increasing speed of computers) the empirical economist from learning a lower-level programming language. Additionally, matrix algebra provides a parsimonious way of tracking the many numbers inherent within a neural network estimation. Partial derivatives and price predictions are used to compare the neural network with the ordinary least squares regression linear model of used combine prices.

Neural Networks and Agricultural Economics

Neural networks are being used in agricultural research at an increasing rate. A recent examination of the National Agricultural Library's database, Agricola, showed these numbers of documents associated with the keywords "neural network(s)": 1984-91, 32 documents; 1992, 40 documents.1 None of the pre-1992 entries were in journals commonly associated with agricultural economics; they were from agricultural engineering, agronomy, or food science. More recently, in agricultural economics related areas, neural networks have been applied in arenas such as crop yield prediction (Uhrig, Engel, and Baker), price prediction (Claussen and Uhrig; Kohzadi, Boyd, Kermanshahi, and Kaastra), futures trading (Hamm, Brorsen, and Sharda), credit scoring (Miller), and production function estimation (Joerding, Li, and Young).

1Similar results were found in databases more closely tied to business and economics, where the combined Social Sciences Index and Business Periodicals Index (H.W. Wilson Company) display: 1983-90, 55 documents; 1991 to mid-1994, 123 documents.

Neural networks permeated few applied fields as recently as 10 years ago due to the huge computer resources required to process neural networks. Agricultural economics is relatively late among the agriculturally related fields in using neural networks. This may not be surprising when we consider that agricultural economists tend to use only tools of mathematical optimization and statistics, tools whose theoretical bases are clear. In addition, the statistics discipline, which traditionally has provided theoretical groundwork for agricultural economists, suggests that neural networks offer little new in problem-solving techniques (Cheng and Titterington, and seven associated comment articles, 1994). The primary authors and comment authors see little theoretical basis underlying much neural network construction, but concede that the strength of neural networks is that they provide algorithms which are intuitively appealing and that are easy to implement, given today's computing power. One conclusion of the authors, especially applicable to applied agricultural economists, is that research using neural networks should compare a neural network's results with those of more conventional statistical tools.

The foregoing is not to suggest that an acceptable (in the eyes of the agricultural economist) theoretical basis underlying neural networks is not being pursued. One must only examine a recent issue of the IEEE Transactions on Neural Networks journal to get a feel for the magnitude of this search. But until the statistical framework behind neural networks becomes more unified and comprehensive, neural networks should be considered as primarily empirical tools, with judgment regarding a model's quality deferring principally to answering the usual applied economist's questions: Is the modeling procedure appropriate, or is it merely an exercise in data mining? Do the results make sense economically? Thus, our focus is more on the how, as opposed to the why, of neural networks.

Neural Nets

Neural networks (sometimes called "artificial" to distinguish them from true biological neural networks) are a broad class of mathematical functional forms and/or associated estimation algorithms. Our discussion is confined to the widely-used but more narrowly-defined class called feedforward backpropagation neural networks. This class falls within a reduced subset of neural nets called supervised learning algorithms. By "supervised," it is meant that the model "knows" the targeted outputs. In this sense, ordinary least squares (OLS) regression is a supervised process. In contrast, unsupervised learning involves predicting outputs (actually classifying inputs) without using any actual output data in model estimation (for an example, see Kohonen).

The word "feedforward" relates to functional form—specifically, the associated multilayered recursive structure, where information can be thought of as flowing forward, from one layer (or equation or part of an equation) to the next. "Backpropagation" indicates a particular estimation algorithm (actually a gradient descent variant) which has come to be associated with neural nets. Thus, just as in conventional nonlinear econometric estimation, the functional form may be distinguished from the particular estimation algorithm employed to find the model's parameters using numerical data. In theory, the feedforward functional form could be estimated with various least squares nonlinear optimization techniques, well-established or novel.2 But, because backpropagation has been used to estimate the feedforward functional form for a number of years, that is the estimation method used in this research. Where practical, we will distinguish the functional form, feedforward neural networks (FNN), from the backpropagation (BP) estimation algorithm.

2A particularly promising algorithm may be the Levenberg-Marquardt algorithm (More; Hagan and Menhaj). For especially difficult-to-solve estimations, genetic algorithms may be promising as well (Dorsey and Mayer).

The Feedforward Neural Network Functional Form

The FNN mathematical model is highly nonlinear, and a universal function approximator (Hornik, Stinchcombe, and White). An FNN can be written as a system of equations or as a single equation. Consider a simple two-input, two-output, one-hidden-layer (with three nodes) model in FNN notation:

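$$
\begin{aligned}
Y_{i1} = f^1\big[\, & f^0(X_{i1}W_{1,11} + X_{i2}W_{1,21} + w_{1,1})\,W_{2,11} \\
 & + f^0(X_{i1}W_{1,12} + X_{i2}W_{1,22} + w_{1,2})\,W_{2,21} \\
 & + f^0(X_{i1}W_{1,13} + X_{i2}W_{1,23} + w_{1,3})\,W_{2,31} \\
 & + w_{2,1}\,\big] + \varepsilon_{i1},
\end{aligned}
\tag{1}
$$

$$
\begin{aligned}
Y_{i2} = f^1\big[\, & f^0(X_{i1}W_{1,11} + X_{i2}W_{1,21} + w_{1,1})\,W_{2,12} \\
 & + f^0(X_{i1}W_{1,12} + X_{i2}W_{1,22} + w_{1,2})\,W_{2,22} \\
 & + f^0(X_{i1}W_{1,13} + X_{i2}W_{1,23} + w_{1,3})\,W_{2,32} \\
 & + w_{2,2}\,\big] + \varepsilon_{i2},
\end{aligned}
\tag{2}
$$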

where $Y_{i1}$ and $Y_{i2}$ are dependent variables, $X_{i1}$ and $X_{i2}$ are independent variables, $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are random errors, and $i$ is the usual observational marker. The number of parenthesized $f^0(\cdot)$ terms in each equation determines the number of hidden nodes. The functions $f^0$ and $f^1$ (superscripts notational, not exponential) denote preselected functions which map their respective arguments into a bounded space. These functions are typically nonlinear, making the overall model nonlinear. The parameters to be estimated are the $W_1$'s and $w_1$'s, which tie the system together (cross-equation restrictions), as well as the $W_2$'s and $w_2$'s, which are equation specific. The $w_1$'s and $w_2$'s are like the intercepts of a standard econometric model, except that we must think of columns of 1's placed within stages of the model as well as within the input data. The BP algorithm seeks to minimize the system's sum of squared errors (over $n$ observations): $\sum_{i=1}^{n} \varepsilon_{i1}^2 + \sum_{i=1}^{n} \varepsilon_{i2}^2$.

With f defined as operating on each element of a matrix, the model specified in equations (1) and (2) can be considered a special case of the general FNN model, expressed in terms of matrices, as

$$
Y = f^L\Big(\cdots f^2\big(f^1\big(f^0(XW_1 + i_n w_1^T)\,W_2 + i_n w_2^T\big)\,W_3 + i_n w_3^T\big) \cdots W_{L+1} + i_n w_{L+1}^T\Big) + E. \tag{3}
$$

If $h_0$ is the number of input variables, $X$ is an $n \times h_0$ matrix of input data. Similarly, $Y$ is an $n \times h_f$ matrix of output data, where $h_f$ is the number of output variables. $W_1$ is an $h_0 \times h_1$ matrix of parameters (or weights) to be estimated, with $h_1$ being chosen by the user. The $i_n$ is an $n \times 1$ vector of 1's, and $w_1$ is an $h_1 \times 1$ vector of parameters to be estimated. The $T$ denotes matrix transpose. In general, the lower-case $w$ vectors are needed to accommodate the "intercept" (bias node) at each stage of the model. Successively, $W_2$ and $w_2$ are parameter matrices dimensioned $h_1 \times h_2$ and $h_2 \times 1$, respectively. Again, $h_2$ is chosen by the user, while the row dimension $h_1$ is determined by the size of the matrix preceding it. In FNN jargon, $L$ is the number of user-determined hidden layers within the model (included to allow for increased model complexity, especially discontinuities). If $L$ is 2, the last parameter matrix and vector to be estimated are $W_3$ and $w_3$. In this case, the dimensions of $W_3$ and $w_3$ are determined by the size of the $W_2$ matrix and the $Y$ matrix. Thus, $W_3$ and $w_3$ are dimensioned $h_2 \times h_f$ and $h_f \times 1$, respectively. $E$ is an $n \times h_f$ matrix of errors.

Alternatively, the general model specified in equation (3) could be displayed in a recursive format. In this case, we will omit the error term, and consider that the model has been estimated. The "hats" on the $W$ and $w$ notations have been suppressed, but the hat is retained on $\hat{Y}$ to make it clear that this matrix is of predicted values. This recursive model (with the $P$'s as stagewise predictions, making clear the feedforward nature of an FNN) is specified as

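$$
\begin{aligned}
P_0 &= X \\
P_1 &= f^0(P_0 W_1 + i_n w_1^T) \\
P_2 &= f^1(P_1 W_2 + i_n w_2^T) \\
 &\;\;\vdots \\
P_{L+1} &= f^L(P_L W_{L+1} + i_n w_{L+1}^T) \\
\hat{Y} &= P_{L+1} \\
E &= Y - \hat{Y}.
\end{aligned}
\tag{4}
$$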

For the specification depicted in equations (1) and (2), $L = 1$, $h_0 = 2$ (number of $X$-variables), $h_1 = 3$ [the number of parenthetically enclosed terms in equation (1) or (2)], while $h_f = h_2 = 2$ (number of $Y$-variables). In FNN terms, $h_1$ is referred to as the number of nodes (alternatively, neurons or neurodes) in the first hidden layer. This layer is made up of the $W_1$ matrix, the $w_1$ vector, the associated transfer function $f^0$, and the $P_1$ matrix of the layer's predictions (which becomes the input to the next, and in this case, final layer). There is only one hidden layer in this model, and thus $L = 1$. The last matrix and vector to be estimated (for this model, $W_2$ and $w_2$), along with the associated transfer function, $f^L = f^1$, as well as this layer's predictions, $P_2$, make up what is referred to as the output layer. Recall that the number of nodes ($h_f = h_2 = 2$) in this layer is determined solely by the preselected number of output variables within the model. The $P_0 = X$ matrix itself makes up the input layer. The predictions of this input layer (actually the input data themselves) become the input to the first hidden layer. This model is thus referred to as an FNN model with two inputs (nodes), two outputs (nodes), and one hidden layer comprising three nodes. The input data, output data, and $i_n$ matrices for this model are depicted as

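$$
X = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \\ \vdots & \vdots \\ X_{n1} & X_{n2} \end{bmatrix}, \qquad
Y = \begin{bmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \\ \vdots & \vdots \\ Y_{n1} & Y_{n2} \end{bmatrix}, \qquad
i_n = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}. \tag{5}
$$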

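The corresponding parameter matrices may be shown as

$$
W_1 = \begin{bmatrix} W_{1,11} & W_{1,12} & W_{1,13} \\ W_{1,21} & W_{1,22} & W_{1,23} \end{bmatrix}, \quad
w_1 = \begin{bmatrix} w_{1,1} \\ w_{1,2} \\ w_{1,3} \end{bmatrix}, \quad
W_2 = \begin{bmatrix} W_{2,11} & W_{2,12} \\ W_{2,21} & W_{2,22} \\ W_{2,31} & W_{2,32} \end{bmatrix}, \quad
w_2 = \begin{bmatrix} w_{2,1} \\ w_{2,2} \end{bmatrix}. \tag{6}
$$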
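The matrix form of the two-input, two-output, three-hidden-node model can be checked against the element-by-element form of equation (1) with a few lines of code. The following is a minimal sketch only: the function names `sigmoid` and `fnn_forward`, the sample size, and the randomly drawn data and weights are illustrative assumptions, and the sigmoid is assumed for both transfer functions.

```python
import numpy as np

def sigmoid(z):
    """Logistic transfer function, applied element by element."""
    return 1.0 / (1.0 + np.exp(-z))

def fnn_forward(X, W1, w1, W2, w2, f0=sigmoid, f1=sigmoid):
    """Recursive (feedforward) evaluation of a one-hidden-layer FNN."""
    n = X.shape[0]
    i_n = np.ones((n, 1))              # n x 1 vector of 1's
    P0 = X                             # input layer
    P1 = f0(P0 @ W1 + i_n @ w1.T)      # hidden layer predictions, n x h1
    P2 = f1(P1 @ W2 + i_n @ w2.T)      # output layer predictions, n x h2
    return P1, P2

# Made-up data and weights for the 2-3-2 network (illustration only).
rng = np.random.default_rng(0)
n, h0, h1, h2 = 5, 2, 3, 2
X = rng.uniform(0.0, 1.0, size=(n, h0))
W1 = rng.uniform(-0.6, 0.6, size=(h0, h1))
w1 = rng.uniform(-0.6, 0.6, size=(h1, 1))
W2 = rng.uniform(-0.6, 0.6, size=(h1, h2))
w2 = rng.uniform(-0.6, 0.6, size=(h2, 1))

P1, P2 = fnn_forward(X, W1, w1, W2, w2)

# Equation (1), without the error term, written out for the first observation.
i = 0
hidden = [sigmoid(X[i, 0] * W1[0, k] + X[i, 1] * W1[1, k] + w1[k, 0])
          for k in range(h1)]
y_i1 = sigmoid(sum(hidden[k] * W2[k, 0] for k in range(h1)) + w2[0, 0])
```

Here `y_i1` and `P2[0, 0]` are the same prediction computed two ways, which is the sense in which equations (1) and (2) are a special case of the matrix model.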

FNN Mechanics

Assuming that the input and output data for an economic problem of interest have been chosen, the user must next make several choices about the structure of the FNN model. These choices aid solution, but often have limited theoretical basis.

(1) Transfer Functions

The FNN procedure routinely uses one or more transfer (or transformational, threshold, signal, squashing) functions that map unconstrained numerical data into a pre-specified bounded space. These functions are represented by $f^0, f^1, \ldots, f^L$ in the models described in equation (3). Often, in an FNN model, only one transfer functional form is used, so that $f^0 = f^1 = \cdots = f^L$.

In general, the transfer functions are chosen to be monotone and nondecreasing with values that are either binary (valued over the interval [0, 1]) or bipolar (valued over the interval [-1, 1]). Two transfer functions which are commonly used are the sigmoid (logistic) and the hyperbolic tangent. The sigmoid function, $f(x) = 1/(1 + e^{-x})$, with the corresponding first derivative, $f'(x) = f(x)(1 - f(x))$, maps to [0, 1] space. The hyperbolic tangent function, $f(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$, with the first derivative, $f'(x) = 1 - (f(x))^2$, maps to [-1, 1] space. These two functions are typically chosen because of the relative simplicity of the first derivative, which becomes important in the estimation algorithm process.
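The convenience of these derivatives is that each can be computed from the function value alone. A minimal sketch follows, comparing the analytic derivatives with a central finite difference; the function names, the step size `h`, and the evaluation points are illustrative assumptions.

```python
import math

def sigmoid(x):
    """Sigmoid (logistic) transfer function."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """First derivative of the sigmoid, written in terms of f(x)."""
    fx = sigmoid(x)
    return fx * (1.0 - fx)

def tanh_prime(x):
    """First derivative of the hyperbolic tangent, written in terms of f(x)."""
    return 1.0 - math.tanh(x) ** 2

def central_difference(f, x, h=1e-6):
    """Numerical first derivative, used only as a check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Largest gap between analytic and numerical derivatives over a few points.
points = (-2.0, 0.0, 1.5)
max_gap_sigmoid = max(abs(sigmoid_prime(x) - central_difference(sigmoid, x))
                      for x in points)
max_gap_tanh = max(abs(tanh_prime(x) - central_difference(math.tanh, x))
                   for x in points)
```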

(2) Number of Hidden Layers and Nodes per Layer

Both the number of hidden layers [L in equations (3) and (4)], as well as the number of nodes (column dimensions h1 ... hL ) within each hidden layer, must be selected. Salchenberger, Cinar, and Lash suggest that the number of nodes within a hidden layer (in a single-hidden-layer model) follow from a mathematical existence theorem developed in 1963 by Kolmogorov. The theorem states that for any continuous mapping function, there exists a three-layer neural network (comprised of n inputs, m outputs, and 2n + 1 hidden units) that implements the function exactly. Some recent network model theoretical work which appears promising is the development of a possible Network Information Criterion (Murata and Yoshizawa) which seeks to generalize the well-accepted linear estimation criterion, Akaike's Information Criterion (AIC).

Much of recent theoretical work regarding FNN model construction, however, is perhaps more appropriately labeled "quasi-theoretical." That is, although techniques may be established to reduce the number of modeling parameter choices required of the user, ad hoc criteria still remain, which presumably must be empirically determined (e.g., Weymaere and Martens). Other work has taken a decidedly more empirical approach. The more empirical approach to FNN model selection focuses on dividing a data set among training, testing, and/or validation. Training and testing data sets are akin to the traditional estimation and out-of-sample prediction data sets routinely used in econometric forecasting. That is, data from dependent and explanatory variables are used to estimate a model's parameters. The estimated parameters are then used, along with explanatory variable data in the testing set, to predict the dependent variable values associated with the testing data set. The functional form (among the several considered) associated with the most accurate predictions over the testing set (according to some forecast accuracy criterion) becomes the functional form of choice. The parameters of that functional form are then estimated using the dependent and explanatory variable data of the combined training and testing data set. Finally, the newly estimated parameters are used, along with the explanatory data in the validation set, to predict the dependent variable values associated with the validation set. The validation set predictions are then compared with others, such as predictions from OLS models whose parameters were also estimated using the combined training and testing data.

To reduce the ad hoc determination of training, testing, and validation data set divisions, different selections from the complete data set may alternatively be considered as the relevant validation data set. Ultimately, before the process is complete, each observation in the complete data set will be considered as a member of a validation data set. For example, Gorr, Nagin, and Szczypula describe such a cross-validation or grid-search approach to determine model structure. In their procedure, the complete data set is divided into two or more mutually independent and exhaustive validation samples. The complements to the validation sets are divided further into training and testing subsamples. Alternatively specified FNNs are trained and tested (forecasting accuracy) over a particular training and testing subsample. The FNN that is the best testing set predictor is then compared (after parameters are reestimated over the combined training and testing subsample) with alternative modeling procedures (such as OLS) in predictive accuracy over the corresponding validation sample.
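The bookkeeping behind such a cross-validation design can be sketched as follows. This is not the procedure of Gorr, Nagin, and Szczypula themselves; it is a hedged illustration in which the function name, the number of validation samples (`n_folds`), the share of the complement assigned to testing (`test_share`), and the random seed are all assumptions.

```python
import random

def validation_partitions(n_obs, n_folds=3, test_share=0.25, seed=0):
    """Split observation indices so each one falls in exactly one validation sample.

    For each validation sample, the complement is divided further into
    training and testing subsamples.
    """
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    # Mutually exclusive and exhaustive validation samples.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    partitions = []
    for f, validation in enumerate(folds):
        rest = [i for g, fold in enumerate(folds) if g != f for i in fold]
        n_test = int(round(test_share * len(rest)))
        partitions.append({
            "train": rest[n_test:],       # used to estimate candidate FNNs
            "test": rest[:n_test],        # used to choose among candidate FNNs
            "validation": validation,     # used to compare the chosen FNN with, e.g., OLS
        })
    return partitions

partitions = validation_partitions(30)
```

With 30 observations and the assumed defaults, each partition holds 15 training, 5 testing, and 10 validation observations, and every observation appears in a validation sample exactly once.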

Theoretical work involving choice of the number of hidden layers in an FNN is perhaps more scant than that involving choice of the number of nodes within a layer. Cybenko (reported in Gorr et al.) contends that at most, two hidden layers suffice for even discontinuous functions. In practice, FNNs often contain only one hidden layer, and rarely over two or three.

In short, without new developments, the applied neural network researcher may have to rely on various "rules of thumb" which have evolved empirically. Regarding the number of hidden nodes within a hidden layer, Eberhart and Dobbins suggest taking the square root of the sum of the number of dependent and explanatory variables, and then "adding a few." However, this approach does not get at the usual "degrees-of-freedom" issue. NeuralWare, Inc., a commercial software manufacturer, at least implicitly addresses this issue by suggesting in its tutorial manual that the number of hidden nodes in a single-hidden-layer model be established by dividing the number of observations in the training set (the estimation data set) by the product of five times the sum of the number of input and the number of output variables. They further advise: "The number of hidden units in a back-propagation network that models behavioral data must typically be very small compared to the number of training cases, otherwise the network will learn spurious relationships and will not generalize successfully" (NeuralWare, Inc., p. UN-20). Finally, the popular press routinely displays novel methods for structuring an FNN, developed both empirically and quasi-theoretically. For example, Kasuba reports a fuzzy logic procedure which selects the number of hidden nodes. Murray accomplishes the same task with genetic algorithms which, after starting randomly, evolve by combining the successful characteristics of random neural nets.
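The two numerical rules of thumb above reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, "adding a few" is arbitrarily taken to mean two, the rounding conventions are assumptions, and the example problem size (8 inputs, 1 output, 450 training observations) is made up.

```python
import math

def hidden_nodes_sqrt_rule(n_inputs, n_outputs, a_few=2):
    """Square root of the number of variables, plus "a few" (assumed to be 2)."""
    return math.ceil(math.sqrt(n_inputs + n_outputs)) + a_few

def hidden_nodes_training_size_rule(n_train, n_inputs, n_outputs):
    """Training observations divided by five times (inputs + outputs)."""
    return max(1, n_train // (5 * (n_inputs + n_outputs)))

# Made-up problem size for illustration.
nodes_sqrt = hidden_nodes_sqrt_rule(n_inputs=8, n_outputs=1)
nodes_size = hidden_nodes_training_size_rule(n_train=450, n_inputs=8, n_outputs=1)
```

Only the second rule responds to the number of training observations, which is the degrees-of-freedom point made above.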

Backpropagation Mechanics

The BP procedure is, in essence, gradient descent. Model prediction errors, along with associated first derivatives, are used to iteratively seek an acceptable estimation of the underlying model's parameters (weights). Prediction errors at one layer are used to calculate expected errors of the immediately preceding layer, which are used to calculate expected errors of the next preceding layer, and so on (hence backpropagation). If the weights are modified after each input/output combination (pattern, or a single data observation) is passed through the network, the procedure is called on-line training. If the weights are only updated after either all, or a subset, of the data are passed through the network, the procedure is called batch training. In this sense, OLS and nonlinear least squares are batch training.

(1) Learning Coefficient and Momentum Factor

The pure gradient descent procedure is sometimes modified with the inclusion of two modeling parameters to improve the estimation process. A learning coefficient, η, is included to prevent excessive weight changes during learning (estimation) which would stall learning by causing the derivatives to approach zero too rapidly. This is referred to as network saturation. The learning coefficient remains constant at some value between 0 and 1 over one epoch (typically, one pass of all data through the estimation algorithm). According to Kosko, η should decrease with iteration k. It should decrease slowly, but not too slowly. Consequently, he suggested that the η associated with the kth iteration should follow the two properties detailed in equation (7):

$$
\sum_{k=1}^{\infty} \eta_k = \infty, \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty. \tag{7}
$$

A learning coefficient that meets both of these two criteria is 1/k, where k is the number of the current iteration (epoch).3 In practice, the learning coefficient is often held constant at some value for a certain number of iterations (like 1,000) and then dropped incrementally over the ensuing blocks of iterations.

3This particular condition applies theoretically to on-line training and not to batch training. But, the basic idea is applicable to batch training as well.
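Both the 1/k rule and the blockwise practice described above are easy to express as functions of the iteration number. In the sketch below the function names, the starting value `eta0`, the block length, and the reduction factor are illustrative assumptions. The partial sums indicate why 1/k decreases "slowly, but not too slowly": the sum of the coefficients keeps growing with the number of iterations, while the sum of their squares levels off below π²/6.

```python
import math

def eta_harmonic(k):
    """Learning coefficient 1/k for iteration k = 1, 2, ..."""
    return 1.0 / k

def eta_block(k, eta0=0.5, block=1000, factor=0.5):
    """Hold eta constant within each block of iterations, then scale it down."""
    return eta0 * factor ** ((k - 1) // block)

# Partial sums over the first 10,000 iterations.
sum_eta = sum(eta_harmonic(k) for k in range(1, 10001))
sum_eta_squared = sum(eta_harmonic(k) ** 2 for k in range(1, 10001))
bound = math.pi ** 2 / 6.0
```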

Also included in the BP process may be a momentum factor, α, which, like η, remains constant (between 0 and 1) over one epoch. This factor is used in an effort to preclude the local minima trap called "hemstitching," where a potential solution oscillates from "valley" sidewall to sidewall, rather than taking the more direct route down the "valley." The criteria for choice of the momentum factor, α, lack even the broad-based guideline afforded the choice of the learning coefficient by equation (7). One possibility is to simply set it at a value, like 0.9—as Rumelhart and McClelland, pioneers in the parallel distributed processing arena, have often done. Alternatively, and like some commercial software programs, it may be handled in the same way as η. That is, it may be scaled down incrementally after preselected blocks of iterations.

(2) Stopping Point of the Iterative Estimation Procedure

Essentially, network training is stopped when weights quit changing. Two proportional weight change criteria that could be relevant here are the prospective parameter change (PPC) and the retrospective parameter change (RPC) convergence criteria discussed in the SAS/ETS User's Guide (SAS Institute, Inc., p. 565). Alternatively, since data are routinely scaled for the BP process, weight changes may be examined practically in levels. Another convergence criterion may be to look at the relative change in an associated objective function (such as sum of squared in-sample prediction errors). The SAS OBJECT criterion (p. 565) is an example of this approach. While the above criteria provide useful stopping conditions, the BP algorithm provides no assurance of a global minimum, as is typically the case with nonlinear estimation algorithms in general.4

4Within the BP algorithm, local/global minima-enhancing suggestions abound: (1) dynamically lowering learning and momentum coefficients, (2) layer- or node-specific learning and momentum coefficients, (3) periodically jogging the weights by some small value, (4) changing the slope of the transfer function, (5) alternative error functions, and (6) simply restarting the algorithm with alternative starting weights.

Neural network software may explicitly allow for the possibility that neither a global nor a local minimum may actually be desired (at least in-sample). That is, if a network is overtrained, it will not generalize well (predict reasonably out-of-sample). In this sense, overtraining is to be distinguished from overfitting, where an FNN with excessive hidden nodes is fit using a conventional convergence criterion. The overtraining problem is dealt with by periodically stopping the training, testing out-of-sample prediction, and stopping estimation when out-of-sample prediction quality deteriorates (for an example, see Gorr et al.). This suggests that one way to find a network that generalizes well is to prevent the model, given the functional form, from fitting the data "too" well. The properties of such an estimation process could be heavily dependent on initial starting conditions, and hence may be econometrically unpalatable. In spite of its possible unpalatability, this approach does make an applied economic researcher acutely aware of the tradeoff between in-sample and out-of-sample fit, or the importance of model validation via out-of-sample prediction (not always considered in conventional econometric estimation). Alternatively, the process itself may merely degenerate to one more exercise in data mining.
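The stop-and-test rule for avoiding overtraining can be sketched as a function of the out-of-sample errors recorded at each periodic check. The function name and the `patience` argument (the number of checks without improvement that are tolerated before stopping) are illustrative assumptions, and the error values are made up.

```python
def early_stopping_index(test_errors, patience=1):
    """Return the index of the check whose weights would be kept.

    Training stops once the out-of-sample error has failed to improve
    for `patience` consecutive checks.
    """
    best_index = 0
    best_error = float("inf")
    for idx, err in enumerate(test_errors):
        if err < best_error:
            best_error, best_index = err, idx
        elif idx - best_index >= patience:
            break
    return best_index

# Made-up out-of-sample errors recorded at successive checks.
errors = [0.90, 0.60, 0.45, 0.47, 0.52, 0.40]
keep_impatient = early_stopping_index(errors, patience=1)
keep_patient = early_stopping_index(errors, patience=3)
```

The two calls select different stopping points from the same error path, which illustrates how sensitive such a rule can be to ad hoc settings.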

(3) Scaling the Data

Computers often have difficulty computing the exponential transformation within the typical transfer function, when it is evaluated at very large numbers. Also, the asymptotic nature of the transfer function typically means that the effective input range may be quite narrow.5 Consequently, the BP procedure often is expedited by scaling the data. Typical scalings, usually on a variable-by-variable linear basis, are into the continuous intervals [0, 1] or [-1, 1], according to whether the sigmoid or the hyperbolic tangent transfer function is used, respectively. Special consideration should be given to whether or not the modeler would know real-world maximums and minimums in the testing data set. If they are not known at time of model construction (especially relevant in time-series forecasting), then all data should be scaled according to the maximums and minimums within the training data.

5For example, in the case of the sigmoid transfer function, f(x) = 1/(1 + e^(-x)), a 10% change in x implies a 2.6% change in f(x) when x = 1, and only a 0.003% change in f(x) when x = 10.

The [0, 1] and [-1, 1] output scalings, along with the bounding nature of the associated transfer functions, would prevent predicting testing set outputs outside the range of training set outputs. At least three ways around this dilemma emerge: (1) arbitrary maximums and minimums, thought to encompass most possible real-world data, could be used; (2) the [0, 1] or [-1, 1] output mappings could be reduced arbitrarily to, say [0.15, 0.85] or [-0.85, 0.85]; or (3) output data scaling could be precluded entirely by using a linear transfer function in the output layer. In the latter case, the modeler would have to decide whether or not the remaining hidden layer(s) is enough to adequately capture the underlying model complexity.

Backpropagation Estimation Algorithm

In the BP estimation algorithm illustration which follows, we consider a batch-trained model, in the form of equation (4), with $L = 1$ (one hidden layer) and $f^0 = f^1 =$ sigmoid function. Thus, sig($X$), where $X$ is a matrix, causes each element of the $X$ matrix to be transformed according to the function $1/(1 + e^{-x})$. As before, $h_0$, $h_1$, and $h_2$ indicate the number of input variables, nodes in the single hidden layer, and output variables, respectively.

Scaled data are distinguished from real-world data by an appended lower-case s. For example, $X$ is an $n \times h_0$ matrix of real-world input data ($n$ is the number of observations), but $Y_s$ is an $n \times h_2$ matrix of scaled output data. To explicitly consider scaling and descaling (needed later in the derivative and elasticity calculations), the two scalars denoting the minimum and maximum endpoints of the "scaled into" range for the input data are $r_x$ and $R_x$, respectively. The corresponding output data scalars are $r_y$ and $R_y$. Thus, input data are scaled into the interval $[r_x, R_x]$, and output data into $[r_y, R_y]$. Input maxima and minima are denoted by the $1 \times h_0$ row vectors $M_x$ and $m_x$, respectively. That is, a single element of $M_x$ is the maximum value of the corresponding column of input data (maximum value of an input variable). Corresponding $1 \times h_2$ row vectors for the output data are $M_y$ and $m_y$. $X(i)$ denotes the $i$th row of the matrix $X$, and $i_k$ is a $k \times 1$ vector of 1's. Element-by-element multiplication and division of two matrices are designated as #* and #/, respectively. For example, if $A = [a_{ij}]$ and $B = [b_{ij}]$, then A #* B $= [a_{ij} b_{ij}]$, for all $i$ and $j$. We note that if the same matrix appears on both sides of an equation, the left-hand side becomes the updated (new) version based upon the designated transformation of the old system.

6Because the purpose of this research is expository, we consider only this randomization range. More assurance of lower local minima could be acquired by testing many alternatives here for each net estimated, and subsequently choosing the range which yields the lowest in-sample sum of squared errors.

Step 1. Scale the data.

$$
X_s(i) = \big((R_x - r_x)\,X(i) + r_x M_x - R_x m_x\big) \mathbin{\#/} (M_x - m_x), \quad i = 1, \ldots, n, \tag{8}
$$

$$
Y_s(i) = \big((R_y - r_y)\,Y(i) + r_y M_y - R_y m_y\big) \mathbin{\#/} (M_y - m_y), \quad i = 1, \ldots, n. \tag{9}
$$
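The scaling in Step 1, and the corresponding descaling needed later, might be coded as below. This is a sketch under stated assumptions: the function names `scale` and `descale` and the small data matrix are made up, and no guard is included for a column whose maximum equals its minimum (which would cause division by zero).

```python
import numpy as np

def scale(A, r, R, m=None, M=None):
    """Scale each column of A linearly into [r, R], as in equations (8) and (9).

    If the minima m and maxima M are not supplied, they are taken from A.
    """
    m = A.min(axis=0) if m is None else m
    M = A.max(axis=0) if M is None else M
    As = ((R - r) * A + r * M - R * m) / (M - m)
    return As, m, M

def descale(As, r, R, m, M):
    """Invert the scaling, returning real-world values."""
    return ((M - m) * As + R * m - r * M) / (R - r)

# Made-up training inputs (three observations, two variables).
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [15.0, 250.0]])
Xs, mx, Mx = scale(X, r=0.0, R=1.0)

# A testing observation scaled with the training minima and maxima.
X_new = np.array([[25.0, 300.0]])
X_new_s, _, _ = scale(X_new, 0.0, 1.0, m=mx, M=Mx)
```

In this example the first element of `X_new_s` falls outside [0, 1], because the testing value exceeds the training maximum; this is the situation discussed under Scaling the Data.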

Step 2. Initialize the model.

Starting values of $W_1$, $w_1$, $W_2$, and $w_2$ are randomly drawn from a uniform [-0.6, 0.6] distribution.6 $V_1$, $v_1$, $V_2$, and $v_2$ are same-dimensioned counterparts, but initialized at 0. The learning coefficient, η, and the momentum factor, α, are also initialized. Passing the data to the network, we have:

P0 = Xs, (10)

P2 = Ys. (11)

Step 3. Begin iterating . . .

$$
P_1 = \mathrm{sig}(P_0 W_1 + i_n w_1^T). \tag{12}
$$

Step 4. Arrive at the model's predictions.

$$
P_2 = \mathrm{sig}(P_1 W_2 + i_n w_2^T). \tag{13}
$$

Step 5. Calculate the adjustment weights.7

$$
D_2 = P_2 \mathbin{\#*} (i_n i_{h_2}^T - P_2) \mathbin{\#*} (Y_s - P_2), \tag{14}
$$

$$
D_1 = P_1 \mathbin{\#*} (i_n i_{h_1}^T - P_1) \mathbin{\#*} (D_2 W_2^T). \tag{15}
$$

7Notice the f(x)*[1 - f(x)] form of the first two terms of equations (14) and (15). This is the first derivative of the sigmoid transfer function. The right-hand term in equation (14) is the error associated with the last layer. In equation (15), the last term is the expected error of the hidden layer, given the error of equation (14). Equations (14) and (15) follow from differentiating a squared output error function in this batch-trained model, ½E² [E as in equation (4), or Es as in (24)], with respect to each weight to be estimated (the W and w notations). Hence, although the network is clearly a feedforward network in training [see equations (12) and (13)], errors are backpropagated in the weight adjustment process.

Step 6. Update the weight matrices.8

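$$
\begin{aligned}
w_2 &= w_2 + \eta D_2^T i_n + \alpha v_2, &(16)\\
v_2 &= \eta D_2^T i_n + \alpha v_2, &(17)\\
W_2 &= W_2 + \eta P_1^T D_2 + \alpha V_2, &(18)\\
V_2 &= \eta P_1^T D_2 + \alpha V_2, &(19)\\
w_1 &= w_1 + \eta D_1^T i_n + \alpha v_1, &(20)\\
v_1 &= \eta D_1^T i_n + \alpha v_1, &(21)\\
W_1 &= W_1 + \eta P_0^T D_1 + \alpha V_1, &(22)\\
V_1 &= \eta P_0^T D_1 + \alpha V_1. &(23)
\end{aligned}
$$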

8The last term in equations (16), (18), (20), and (22) accommodates the momentum factor, making part of the weight adjustment process a function of the adjustment in the iteration before.

Step 7. Calculate and report the errors.

$$
E_s = Y_s - P_2, \tag{24}
$$

$$
SSE_s = \big((E_s \mathbin{\#*} E_s)\, i_{h_2}\big)^T i_n, \tag{25}
$$

 

 

 

Step 8. Adjust η and α if desired.

Step 9. Continue iterating, i.e., go back to equation (12).9

9Before continuing, this may be a good place to note how a second hidden layer might be added. First, between (13) and (14) two lines would need to be inserted, pertaining to P3 and D3, respectively. The right term in (14), Ys - P2, would move to the D3 line, becoming Ys - P3. Otherwise, the subscripts in the two inserted lines would follow the pattern already set forth. Also, four more lines would have to be inserted between (15) and (16), pertaining to w3, v3, W3, and V3, respectively.

Step 10. Stop the model based upon preselected criterion.

Step 11. Calculate the related partial effects (derivatives and elasticities: outputs with respect to inputs) evaluated at each data point.10

For i = 1, ..., n, calculate the following:

(11a) the 1 × h1 transfer function derivative vector at the hidden layer,

$$G_1 = P_1(i) \mathbin{\#*} (\iota_1 - P_1(i)); \tag{27}$$

(11b) the 1 × h2 transfer function derivative vector at the output layer,

$$G_2 = P_2(i) \mathbin{\#*} (\iota_2 - P_2(i)); \tag{28}$$

(11c) the still-scaled h0 × h2 matrix of partial effects (the ordinary ∂y/∂x matrix),

(29)

(11d) the corresponding descaled h0 × h2 matrix of partial effects,

(30)

(11e) the network-predicted descaled 1 × h2 output vector,

$$\hat{Y}(i) = (R_y - r_y)^{-1} \big( (M_y - m_y) \mathbin{\#*} P_2(i) + R_y m_y - r_y M_y \big); \tag{31}$$

(11f) the descaled h0 × h2 elasticity matrix corresponding to the descaled partial effects,11

$$ELAS_i = X(i)^{T} \iota_2 \mathbin{\#*} J_i \mathbin{\#/} (\iota_{h_0} \hat{Y}(i)). \tag{32}$$

10For continuity with the preceding, the development here displays the partial effects evaluated over the training data. In out-of-sample prediction, it may be more appropriate to evaluate the derivatives over the out-of-sample data. Since the model structure and parameters are now set, this is a straightforward modification, where P0 is the appropriately-scaled, out-of-sample explanatory data matrix. Similarly, the descaled P2 would be the actual predicted output values.

11The elasticity derivation at equation (32) is not needed for the empirical model we examine, but is included for expository reasons. The i subscripts are added at (30) and (32) to indicate that these matrices may be stored, as they would be in the case of bootstrapped standard error calculations.
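
The partial effects of Step 11 follow from the chain rule applied to the two sigmoid layers. The NumPy sketch below is one way to compute them for a single observation. It assumes a forward pass of the form P1 = f(x W1 + w1) and P2 = f(P1 W2 + w2), min-max scaling of inputs into [0, 1] and of outputs into [0.15, 0.85], and hypothetical function names; it is not a transcription of the article's own matrix expressions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def partial_effects(x, W1, w1, W2, w2):
    """Scaled partial effects at one observation.

    x: (h0,) scaled inputs; W1: (h0, h1); w1: (h1,); W2: (h1, h2); w2: (h2,).
    Returns J with J[j, k] = dP2[k] / dx[j], and the scaled output P2.
    """
    P1 = sigmoid(x @ W1 + w1)
    P2 = sigmoid(P1 @ W2 + w2)
    G1 = P1 * (1.0 - P1)                 # eq. (27)
    G2 = P2 * (1.0 - P2)                 # eq. (28)
    J = ((W1 * G1) @ W2) * G2            # chain rule, shape (h0, h2)
    return J, P2

def descale_effects(J, x_span, y_span, in_width=1.0, out_width=0.7):
    """Convert scaled effects to original units under min-max scaling.

    x_span, y_span: raw max minus min of each input and each output.
    in_width, out_width: widths of the scaling intervals.
    """
    return J * (in_width / x_span)[:, None] * (y_span / out_width)[None, :]

def elasticities(J_descaled, x_raw, y_hat):
    """Elasticity matrix: (x_j / y_k) times the descaled partial effect."""
    return J_descaled * x_raw[:, None] / y_hat[None, :]

# Made-up weights: h0 = 3 inputs, h1 = 2 hidden nodes, h2 = 1 output.
rng = np.random.default_rng(0)
h0, h1, h2 = 3, 2, 1
W1, w1 = rng.normal(size=(h0, h1)), rng.normal(size=h1)
W2, w2 = rng.normal(size=(h1, h2)), rng.normal(size=h2)
x = rng.uniform(size=h0)
J, P2 = partial_effects(x, W1, w1, W2, w2)
```

A quick way to check such code is to compare each row of J with a finite-difference derivative obtained by nudging one input at a time.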

Step 12. Compute standard errors for partial effects and elasticities using bootstrapping.

(12a) Save the descaled predictions and descaled residuals:

$$\tilde{Y} = \hat{Y}, \tag{33}$$

$$E = Y - \tilde{Y}. \tag{34}$$

(12b) Make a new Y-matrix:

$$Y(i) = \tilde{Y}(i) + \text{random row of } E, \quad \text{for } i = 1, \ldots, n. \tag{35}$$

(12c) Perform Steps 1 through 11—using equations (9) through (32) only.

(12d) Repeat Steps 12b and 12c five hundred times, collecting each Ji matrix of equation (30) and each ELASi matrix of equation (32) for each of the 500 runs.

(12e) For estimates of partial effects and associated standard errors, compute the mean and standard deviation for each element of the collected Ji and ELASi notations across the 500 runs. For example, where jk represents a row/column matrix position, the mean of the series made up of each jkth element of the 500 collected Ji matrices is the partial effect of the kth output with respect to the jth input evaluated at the ith observation of the data set. The standard deviation of the same series denotes the corresponding standard error.12

12All matrices do not necessarily have to be stored, as the mean and standard deviation can be merely updated with each run. Also, the partial effects and elasticities could alternatively be evaluated only at the data means rather than at each data point, reducing the memory storage requirements even further. However, these modifications only marginally reduce the vast amount of computational time required by the bootstrapping process. Therefore, we did not actually do the bootstrapping computations in this research. The description is included here because rapidly increasing computational speed will soon make such an exercise practical.
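
Step 12 is a residual bootstrap. The sketch below shows the resampling logic only, assuming user-supplied `fit`, `predict`, and `effects` callables (hypothetical names). An ordinary least squares fit stands in for the neural network so that the example runs quickly, and the standard deviation across runs serves as the standard error, as in Step 12e.

```python
import numpy as np

def residual_bootstrap(X, Y, fit, predict, effects, runs=500, seed=0):
    """Mean and standard deviation of `effects` across bootstrap runs."""
    rng = np.random.default_rng(seed)
    model = fit(X, Y)
    Y_tilde = predict(model, X)                 # eq. (33): saved predictions
    E = Y - Y_tilde                             # eq. (34): saved residuals
    draws = []
    for _ in range(runs):
        rows = rng.integers(0, len(E), size=len(E))
        Y_new = Y_tilde + E[rows]               # eq. (35): new Y-matrix
        draws.append(effects(fit(X, Y_new)))    # Step 12c: re-estimate
    draws = np.asarray(draws)
    return draws.mean(axis=0), draws.std(axis=0, ddof=1)

# Stand-in estimator (an assumption for illustration): OLS, whose
# "partial effects" are simply the coefficient vector.
def fit(X, Y):
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def predict(beta, X):
    return X @ beta

def effects(beta):
    return beta

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 50)
X = np.column_stack([np.ones(50), x])
Y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 50)
mean_effect, std_error = residual_bootstrap(X, Y, fit, predict, effects)
```

Replacing the three stand-in functions with a network trainer, a forward pass, and the partial-effect calculation gives the procedure of Steps 12a through 12e, at a much higher computational cost.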

An Empirical Example: Used Combine Pricing in the Great Plains Region

Characteristics of used combines are determinants of sale price in a hedonic demand framework. Hedonic prices are defined as the implicit prices of attributes revealed to economic agents from observed prices of differentiated products and the specific amounts of characteristics associated with them (Rosen). In order to examine the effects of combine characteristics on selling price in an elasticity framework, we specify the model,

ln(P) = α0 + α1ln(INTS) + α2ln(SCAP) + α3ln(CCAP) + α4ln(AGE) + α5ln(LP) + β1DIESEL + β2ROTARY + β3AC + β4IH + β5MF + β6OTH + ε, (36)

where P is the offer price in dollars of a used combine, AGE is combine age in years, INTS is usage intensity (hours of use per year of age), SCAP and CCAP are separating and cleaning capacity in square inches, respectively, and LP is the list price of the combine when it was new. DIESEL is a binary variable representing fuel type (1 if diesel, 0 otherwise); ROTARY is a binary variable representing threshing method (1 if rotary, 0 if conventional threshing cylinder). The binary variables AC, IH, MF, and OTH represent the manufacturers Allis Chalmers (Gleaner), Case-International, Massey-Ferguson, and others (such as White and New Holland). John Deere is the default manufacturer binary variable. Finally, ln denotes the natural logarithm, α and β denote parameters to be estimated, and ε is a random error term.

Because of the log framework, an α slope coefficient can be interpreted directly as the elasticity of used price with respect to the corresponding explanatory variable. Halvorsen and Palmquist provide a method of calculating the relative effects of the binary variables (in %) as

γj = (exp(βj) − 1) * 100, (37)

where γj is the percentage change in the offer price of a used combine due to the jth binary variable, and βj is the corresponding coefficient estimate.
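
As a worked instance of equation (37), the short Python sketch below converts two of the OLS binary-variable coefficients reported in Table 3 into percentage effects. The function name `pct_effect` is a hypothetical label.

```python
import math

def pct_effect(beta):
    """Equation (37): percentage price change implied by a dummy coefficient."""
    return (math.exp(beta) - 1.0) * 100.0

rotary_ols = pct_effect(0.161)    # about 17.5, the % Value reported for ROTARY
mf_ols = pct_effect(-0.248)       # about -22.0, the % Value reported for MF
```

The percentage effect is not symmetric in the coefficient: a coefficient of 0.161 implies a premium larger than 16.1%, while a coefficient of −0.248 implies a discount smaller than 24.8%.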

To compare with the linear regression model in (36), we specify the general feedforward neural network model:

ln(P) = g { C, ln(INTS), ln(SCAP), ln(CCAP), ln(AGE), ln(LP), DIESEL, ROTARY, AC, IH, MF, OTH } + μ, (38)

where g is the functional descriptor denoting the entire network structure, C is a vector of 1's, μ is a random error, and all other variables are identical to those in (36). As in the linear model, the partial derivative of ln(P ) with respect to one of the logged explanatory variables is directly interpretable as the corresponding elasticity. The difference is that the elasticity is not constant, as it is in the case of the linear model in (36). Rather, it is evaluated at any data point.

Source of Data

The original list price, separating capacity, cleaning capacity, fuel type, and cylinder type were collected from Hot Line Farm Equipment Guide: Quick Reference Guide (Hot Line, Inc.). The balance of the data were collected from monthly issues of Hot Line: Farm Equipment Guide, ranging from 1982 to 1990. Only combine offerings within the Great Plains region were examined (North Dakota, South Dakota, Nebraska, Kansas, Oklahoma, and Texas), yielding 1,206 usable observations. Tables 1 and 2 provide summary statistics for the data used in the analysis. The average age of the used combines was around seven years. Used combine prices averaged 47% of the original list price. These combines were used around 200 hours per year. Over 60% of the combines were manufactured by John Deere. To facilitate model evaluation, every third observation of the data was set aside to construct a model validation data set, comprised of 402 observations, leaving 804 observations in the model-building (MB) data set.

Table 1. Summary Statistics of Continuous Variables in Great Plains Used Combine Data, 1982-90 (1,206 observations)

Variable                                 Mean          Standard Deviation
Offering Sale Price ($/used combine)     36,770.44     17,580.83
Separating Capacity (square inches)       7,850.88      2,935.47
Cleaning Capacity (square inches)         6,371.95      1,847.23
Hours Used                                1,273.35        662.13
Age (years)                                   6.93          3.37
Intensity of Use (hours/age)                197.07         94.6
List Price ($/new combine)               78,490.89     23,796.25

 

Table 2. Summary Statistics of Binary Variables in Great Plains Used Combine Data, 1982-90 (1,206 observations)

Variable                      Number     % of Total
Diesel Engine                 1,167      96.77
Rotary Threshing Cylinder       276      22.89
Allis Chalmers (Gleaner)         87       7.21
Case-International              195      16.17
John Deere                      742      61.53
Massey-Ferguson                 127      10.53
Other Manufacturers              55       4.5
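
The hold-out split described under Source of Data can be sketched as follows, assuming that "every third observation" means the 3rd, 6th, 9th, and so on. The starting offset is not stated in the text, so that choice is an assumption.

```python
import numpy as np

n_obs = 1206
idx = np.arange(n_obs)

# Assumption: observations 3, 6, 9, ... (zero-based 2, 5, 8, ...) are held out.
is_validation = (idx % 3 == 2)
validation_idx = idx[is_validation]   # model validation data set
mb_idx = idx[~is_validation]          # model-building (MB) data set
```

Any of the three possible offsets yields the same set sizes of 402 and 804 observations.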

Model Estimations

Equation (36) was fit to the MB data set using ordinary least squares regression. Equation (38) was fit with FNNs, trained using the BP estimation procedure. All FNN models were trained exactly according to the BP procedure set forth in equations (8) through (26), using Matlab Numeric Computation Software (Math Works, Inc.). Inputs were scaled into the [0, 1] interval and outputs into the [0.15, 0.85] interval. Weight matrices were initialized as described at Step 2 of the estimation algorithm. The sigmoid transfer function was used throughout. The learning coefficient, η, was fixed at 0.05 throughout. The momentum coefficient, α, was fixed at 0.5. Batch training was used. In all cases, training was stopped after the largest single weight change, occurring over the preceding 100 iterations, fell below 0.05 (the convergence criterion). Typically, in-sample SSEs was changing at only the fifth decimal place with each increase of 100 iterations at that point, and rapidly deteriorating to the sixth and seventh place, indicative of approaching a local minimum. The typical network trained in 3,000-5,000 iterations, with a few as high as 15,000. In line with our earlier notation, in all networks, h0=11 (the 11 explanatory variables), h2=1 (the single output variable), and h1=1, 2, 3, or 4 (the number of possible nodes considered in the single hidden layer).
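
The training settings above can be gathered into a compact sketch. The NumPy code below is illustrative rather than a port of the authors' Matlab program: the starting weights, the row-vector treatment of the bias terms, the toy data, and the reading of the convergence rule (compare every weight with its value 100 iterations earlier) are all assumptions, as are the names `minmax_scale` and `train_fnn`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minmax_scale(A, lo, hi):
    """Column-wise linear map of A onto the interval [lo, hi]."""
    a_min, a_max = A.min(axis=0), A.max(axis=0)
    return lo + (hi - lo) * (A - a_min) / (a_max - a_min)

def train_fnn(P0, Ys, h1, eta=0.05, alpha=0.5, tol=0.05, window=100,
              max_iter=15000, seed=0):
    """Batch backpropagation with momentum for one hidden layer."""
    rng = np.random.default_rng(seed)
    h0, h2 = P0.shape[1], Ys.shape[1]
    # Assumption: small uniform starting weights; Step 2 may differ.
    W1, w1 = rng.uniform(-0.5, 0.5, (h0, h1)), np.zeros((1, h1))
    W2, w2 = rng.uniform(-0.5, 0.5, (h1, h2)), np.zeros((1, h2))
    V1, v1, V2, v2 = (np.zeros_like(a) for a in (W1, w1, W2, w2))
    sse_path, snapshot = [], None
    for it in range(max_iter):
        P1 = sigmoid(P0 @ W1 + w1)                  # hidden-layer outputs
        P2 = sigmoid(P1 @ W2 + w2)                  # network outputs
        sse_path.append(float(np.sum((Ys - P2) ** 2)))
        D2 = P2 * (1 - P2) * (Ys - P2)              # eq. (14)
        D1 = P1 * (1 - P1) * (D2 @ W2.T)            # eq. (15)
        v2 = eta * D2.sum(axis=0, keepdims=True) + alpha * v2
        V2 = eta * P1.T @ D2 + alpha * V2
        v1 = eta * D1.sum(axis=0, keepdims=True) + alpha * v1
        V1 = eta * P0.T @ D1 + alpha * V1
        W1, w1, W2, w2 = W1 + V1, w1 + v1, W2 + V2, w2 + v2
        if it % window == 0:                        # convergence check
            params = np.concatenate([a.ravel() for a in (W1, w1, W2, w2)])
            if snapshot is not None and np.max(np.abs(params - snapshot)) < tol:
                break
            snapshot = params
    return (W1, w1, W2, w2), sse_path

# Made-up data: 30 observations, 2 inputs, 1 output.
rng = np.random.default_rng(1)
X = rng.uniform(1.0, 10.0, size=(30, 2))
y = np.log(X[:, :1]) + 0.5 * X[:, 1:]
P0 = minmax_scale(X, 0.0, 1.0)
Ys = minmax_scale(y, 0.15, 0.85)
weights, sse_path = train_fnn(P0, Ys, h1=2)
```

Scaling the outputs into [0.15, 0.85] keeps the targets away from the flat tails of the sigmoid, where the derivative terms in equations (14) and (15) shrink toward zero.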

To arrive at the final FNN model reported, and in the spirit of the three-way data division into training, testing, and validation sets, the following procedure was used. Four alternatively specified models were considered, with 1, 2, 3, or 4 hidden nodes. The MB data set was further divided into two independent and mutually exclusive testing sets, with 1-, 2-, 3-, and 4-node FNNs trained over the complement data set to each of the two testing sets, and subsequently used to predict the associated testing set output (given the testing set's inputs).13

13For example, suppose the MB data set is comprised of observations 1 through 804. Suppose that the first testing data set considered contains observations 1 through 402. In this case, alternative models are estimated using observations 403-804 (the complement set). The estimated parameters, along with the explanatory data of observations 1-402, are used to predict the dependent variable values for observations 1-402. Subsequently, observations 1-402 become the training set, while observations 403-804 become the testing set, and so on.

Ultimately, this results in predicting all 804 MB output values in an out-of-sample framework, for each of the four FNN specifications. The root mean squared forecast error (RMSE) was computed for the 804 observations for each of the four models. The model with the lowest RMSE was deemed the best model, given this MB set division.

In addition to the division into two testing sets described above, we also examined MB set divisions of thirds, fourths, and fifths. For example, in the case of division into fifths, five networks were estimated for each of the 1-, 2-, 3-, and 4-node FNN specifications, in order to build the out-of-sample forecast series of all 804 MB set output values associated with each of the four FNNs. As the divisions get smaller and smaller, the resultant subsample training sets would approach the full MB data set, predicting that set's outputs quite accurately, while potentially predicting poorly the outputs in the final validation data set. To avoid that possibility, we arbitrarily did not consider divisions finer than fifths.

Ultimately, 16 out-of-sample (but within the MB data set) RMSEs were compared—four alternatively specified FNNs over four different MB data set divisions. The FNN associated with the lowest RMSE, among the 16, determined the model structure of the final FNN reported. The FNN model chosen was a four-hidden-node model.14 Prior to use in validation, the weights of this FNN were re-estimated, using the full MB data set.

14The best network happened to occur when the MB data were divided into fourths.
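
The model-selection procedure above, in which each specification is evaluated out of sample over divisions of the MB set into halves, thirds, fourths, and fifths, can be sketched generically. In the Python sketch below, `oos_rmse` and `select_model` are hypothetical names, contiguous blocks are assumed as in footnote 13, and polynomial degree stands in for the number of hidden nodes so that the example stays short.

```python
import numpy as np

def oos_rmse(X, y, fit, predict, k):
    """RMSE of out-of-sample predictions when the data are split into k blocks."""
    n = len(y)
    preds = np.empty(n)
    for test in np.array_split(np.arange(n), k):   # contiguous testing blocks
        train = np.setdiff1d(np.arange(n), test)   # the complement set
        preds[test] = predict(fit(X[train], y[train]), X[test])
    return float(np.sqrt(np.mean((y - preds) ** 2)))

def select_model(X, y, specs, divisions=(2, 3, 4, 5)):
    """Return the (rmse, name, k) triple with the lowest RMSE, plus all triples."""
    results = [(oos_rmse(X, y, fit, predict, k), name, k)
               for name, (fit, predict) in specs.items() for k in divisions]
    return min(results), results

# Stand-in specifications (an assumption): polynomial degrees 1 to 4
# in place of FNNs with 1 to 4 hidden nodes.
def make_spec(degree):
    return (lambda X, y: np.polyfit(X, y, degree),
            lambda coef, X: np.polyval(coef, X))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 120)
y = 1.0 + 2.0 * X - 1.5 * X**2 + rng.normal(0.0, 0.05, 120)
specs = {f"degree {d}": make_spec(d) for d in (1, 2, 3, 4)}
best, results = select_model(X, y, specs)
```

Four specifications over four divisions give the 16 RMSEs that are compared; the winning specification would then be re-estimated on the full MB data set before validation.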

Model Validation

The OLS and FNN estimated models were evaluated two ways. First, both models were used to predict the actual outputs associated with the validation data set, conditional upon the validation data set's inputs.15 Second, the associated partial derivatives were computed for each model. In the case of the FNN, equation (30) was used to derive the derivatives. The derivatives were evaluated at each data point in the validation data set. Because they are constant, the OLS derivatives may also be considered evaluated at the same points. For each partial derivative derived from the FNN, the average of the 402 evaluation points was calculated, to be interpreted as the expected value of the partial effect associated with a randomly drawn combine. These calculations provide at least rough comparisons with the corresponding coefficient estimates of the OLS model. In addition, for each partial derivative, the standard deviation of the 402 evaluations was calculated as well. Although not equivalent to standard errors of coefficient estimates, these standard deviations provide some indication of the underlying variability, as well as an indication of the persistence of effects predicted by the FNN, across different combines. Finally, although technically only proper if we assume a constant partial derivative over the [0, 1] range for a given binary variable, we calculated equation (37) for both OLS and FNN models. For the FNN, we evaluated the γj at each of the 402 data points, and subsequently computed the average evaluation.16

15The ln(P) predictions for the FNN were transformed to P-predictions by simply inverting the log function. For the OLS, since it is a clearly defined statistical model, the inherent bias in the predictions was adjusted by multiplying the inverse log predictions by exp(½ s²), where s² was the regression residual variance (Neyman and Scott).

16This approach was taken (computing the mean and standard deviation of the partial effects across observations), rather than the bootstrapping approach described earlier, in order to acquire a feel for the variability of the partial effects without having to invest the computer time required of bootstrapping.

Empirical Results

The price forecasting accuracy for the OLS and FNN models is depicted in Figures 1 and 2, respectively. To make it easier to graphically see the relationship between actual price and forecasted price, combines were sorted by price before display. The FNN predictions are associated with a 9% lower RMSE than the OLS predictions, and a slightly higher forecast r² (squared linear correlation between forecasts and actuals), both indicative of greater prediction accuracy associated with the FNN compared to the OLS model. However, FNN forecasts were more biased than OLS forecasts, overvaluing the combines by $1,300 on average. The OLS procedure appeared to have slightly more difficulty forecasting the most expensive combines. In fact, examination of only the combines above the median price (observations 202-402 on the graphs in Figures 1 and 2) showed OLS to be associated with an RMSE of $9,718 and a mean error of $1,127 (undervaluing the combines), while the FNN had an RMSE of only $8,102 and a mean error of only $180. Thus, at least this graphical nonlinearity (fewer combines in each dollar interval at successively higher prices) appears to be better modeled by the FNN, out of sample. On the other hand, over the more graphically linear section of Figures 1 and 2 (observations 1-201), the OLS model overvalued combines less (mean error of −$1,842) and predicted more accurately (RMSE of $5,734) than did the FNN model (mean error of −$2,781 and RMSE of $6,378). These results indicate that, where data are more evenly distributed across successive dependent variable intervals (here, a constant number of combines representing each price interval), OLS may be superior to FNN.
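
The accuracy measures used in this comparison can be computed as in the sketch below. It assumes the sign convention implied by the text, where the error is actual minus forecast, so that a negative mean error means combines were overvalued. The function name and the toy prices are made up for illustration.

```python
import numpy as np

def forecast_stats(actual, forecast):
    """RMSE, mean error, and squared correlation of forecasts with actuals."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = actual - forecast                  # negative = overvalued
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mean_error = float(err.mean())
    r2 = float(np.corrcoef(actual, forecast)[0, 1] ** 2)
    return rmse, mean_error, r2

# Made-up prices in thousands of dollars.
rmse, mean_error, r2 = forecast_stats([10.0, 20.0, 30.0], [12.0, 19.0, 33.0])
```

Applying the same function separately to the observations below and above the median actual price reproduces the kind of split comparison reported for the two halves of Figures 1 and 2.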

The derivative results discussed earlier are displayed in Table 3. Persistence in the FNN, according to whether the absolute value of the mean of the 402 derivative evaluations is more than twice or three times as large as its standard deviation, generally appears closely associated with the 0.05 and 0.01 statistical significance levels of the parameter estimates in the OLS models, respectively. For example, and not surprisingly, intensity of use, age, and list price appear to be important determinants of used combine price in both the OLS model and the FNN model. Each is statistically significant at the 0.01 level of significance in the OLS model, and each is largely persistent in the FNN model (the mean derivative is more than three times the standard deviation).
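
The persistence rule just described reduces to a comparison of the average derivative with its standard deviation. A minimal Python sketch, using three FNN entries from Table 3 as inputs; the function name `persistence` is a hypothetical label.

```python
def persistence(mean, sd):
    """Return '##' if |mean| > 3*sd, '#' if |mean| > 2*sd, else ''."""
    ratio = abs(mean) / sd
    if ratio > 3:
        return "##"
    if ratio > 2:
        return "#"
    return ""

# (average derivative, standard deviation) pairs taken from Table 3.
flags = {
    "Separating Capacity": persistence(0.073, 0.026),
    "Cleaning Capacity": persistence(0.160, 0.033),
    "Diesel": persistence(0.237, 0.131),
}
```

Because the table entries are rounded to three decimals, a ratio sitting exactly on a threshold may classify differently than it did with the unrounded values.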

A combine which had been used twice as intensely as another was expected to be worth around 14% less at trade-in time. Doubling the age of a combine suggests a 70% drop in its value according to the FNN model, and only a 54% drop with the OLS model. The list price effect of the FNN model indicates much larger devaluations (the depreciation independent of age) for combines that were more expensive when they were new, as compared to the OLS model. It is interesting to note that separating capacity and cleaning capacity are each associated with persistent derivatives in the FNN, when neither is statistically significant in the OLS model. The OLS difficulty in discriminating these two effects may stem from the collinearity involved, as the linear correlation over the total data between these two variables is 0.63.

The diesel engine binary variable is neither statistically significant nor FNN-persistent. Perhaps the intuitively larger value associated with a diesel engine is captured in list price instead. Presence of a rotary cylinder is associated with an 18% (OLS) to 24% (FNN) larger combine value, indicating that combines with rotary threshing devices hold their values well. Allis Chalmers and Massey-Ferguson combines appear substantially discounted from comparable John Deere combines by both the OLS and FNN models, with discounts in the 16% to 18% range for Allis Chalmers, and around 22% for Massey-Ferguson. On the other hand, IH combines are not statistically or persistently lower valued than John Deere combines.

Figure 1. Actual and OLS Predictions of Used Combine Prices

 

Figure 2. Actual and FNN Predictions of Used Combine Prices

 

Table 3. Premiums and Discounts of Great Plains Used Combines: 1982-90

                         OLS Model                           FNN Model (a)
Variable                 Parameter   Standard   % Value      Average      Standard    Average
                         Estimate    Error                   Derivative   Deviation   % Value
Intercept                 3.235**    0.466
Continuous Variables:
  Intensity              −0.138**    0.020                   −0.150##     0.047
  Separating Capacity     0.145      0.083                    0.073#      0.026
  Cleaning Capacity       0.020      0.074                    0.160##     0.033
  Age                    −0.540**    0.018                   −0.698##     0.029
  List Price              0.661**    0.044                    0.379##     0.116
Binary Variables:
  Diesel                 −0.026      0.054      −2.6          0.237       0.131       27.8
  Rotary                  0.161*     0.078      17.5          0.212##     0.044       23.7
  AC                     −0.165**    0.056      −16.2        −0.210#      0.105       −18.5
  IH                     −0.063      0.055      −6.1         −0.117       0.067       −10.8
  MF                     −0.248**    0.038      −22.0        −0.255##     0.061       −22.4
  Other                  −0.005      0.063      −0.5         −0.116       0.062       −10.8
Model R²                  0.814                               0.860 (b)

Notes: Single and double asterisks (*) denote statistical significance for the OLS model at the 0.05 and the 0.01 level, respectively. Persistence and large persistence for the FNN are denoted by single and double # symbols, respectively, according to whether the absolute value of the average derivative was more than twice as large as the associated standard deviation or more than three times as large. The % values for OLS were derived using equation (37) in the text. The average % values for the FNN were also derived using equation (37); only FNN partial derivatives were substituted for the coefficient estimates and the results subsequently averaged across the points of derivative evaluation.

(a) Four-hidden-node, one-hidden-layer model, solved with the backpropagation process described in the "Model Estimations" section of the text.

(b) Squared linear correlation between in-sample, model-predicted combine prices and actual combine prices.

Conclusions

This article provides an introduction to the rationale underlying the development and implementation of an empirical feedforward neural network (FNN) analysis. Model training, testing, and validation are discussed, with intuitive links with traditional econometrics provided throughout. A matrix algebra approach to neural network modeling is developed that provides practical coding for solving an FNN using backpropagation, and for analytically deriving the partial effects and elasticities. The coding is shown to be readily extended to simulate the standard errors for the partial effects and elasticities in a conventional bootstrapping exercise.

A hedonic price model of used combine prices in the Great Plains from 1982-90 was examined with an ordinary least squares (OLS) regression model as well as with an FNN. Out-of-sample dependent variable predictions and out-of-sample partial derivatives were used to compare the OLS model with the FNN model. In a squared forecast error sense, the FNN predicted combine price more accurately than did the OLS model, but it also tended to be more biased. Where dependent variable intervals were poorly represented with data, the FNN outperformed the OLS model. That is, in our models, as higher and higher used combine values were supported with fewer and fewer combines, FNN appeared superior. On the other hand, where used combine values were more or less uniformly represented by data, the OLS model appeared superior to the FNN.

The mean and standard deviation of each FNN-derived partial derivative series, where the series is made up of the partial derivative evaluated at each data point, are at least heuristically comparable to the corresponding linear regression coefficient estimate and its standard error, respectively. The FNN model appeared superior to OLS in distinguishing the effects on used combine price associated with two highly correlated explanatory variables, separating capacity and cleaning capacity. More research will be required to determine whether the finding of FNN superiority in the case of sparse dependent variable data or collinear explanatory variables can be generalized.

As usual, the addition of a new tool to the economist's toolbox comes as a double-edged sword. The highly flexible feedforward neural network functional form provides increased latitude for the applied researcher. But, lacking economically- or statistically-driven functional form constraints, only the choice of which data to include in a model, and out-of-sample prediction, remain to prevent data mining from reaching new heights. Thus, neural nets should typically be validated in an out-of-sample framework. This validation requirement suggests that the forecasting literature may see increased relevance, even for those applied economists who do not traditionally see themselves as forecasters.

References

Akaike, H. "A New Look at Statistical Model Identification." IEEE Transactions on Automatic Control AC-19(1974):716-23.

Cheng, B., and D.M. Titterington. "Neural Networks: A Review from a Statistical Perspective" [along with comments from numerous authors]. Statistical Sci. 9(1994):2-54.

Claussen, K.L., and J.W. Uhrig. "Cash Soybean Price Prediction with Neural Networks." In Applied Commodity Price Analysis, Forecasting, and Market Risk Management, pp. 56-65. Proceedings of the NCR-134 Conference, Chicago, IL, April 18-19, 1994.

Cybenko, G. "Continuous Valued Neural Networks with Two Hidden Layers Are Sufficient." Technical report. Department of Computer Science, Tufts University, Medford, MA, 1988.

Dorsey, R., and W.J. Mayer. "Genetic Algorithms of Estimation Problems with Multiple Optima, Nondifferentiability, and Other Irregular Features." J. Bus. and Econ. Statis. 13(1995):53-66.

Eberhart, R.C., and R.W. Dobbins. "Implementations." In Neural Network PC Tools, edited by R.C. Eberhart and R.W. Dobbins. San Diego: Academic Press, 1990.

Gorr, W.L., D. Nagin, and J. Szczypula. "Comparative Study of Artificial Neural Network and Statistical Models for Predicting Student Grade Point Averages." Internat. J. Forecasting 10(1994):17-34.

Hagan, M.T., and M.B. Menhaj. "Training Feedforward Networks with the Marquardt Algorithm." IEEE Transactions on Neural Networks 5(1994):989-93.

Halvorsen, R., and R. Palmquist. "The Interpretation of Dummy Variables in Semilogarithmic Equations." Amer. Econ. Rev. 70(1980):474-75.

Hamm, L., B.W. Brorsen, and R. Sharda. "Futures Trading with a Neural Network." In Applied Commodity Price Analysis, Forecasting, and Market Risk Management, pp. 286-97. Proceedings of the NCR-134 Conference, Chicago, IL, April 19-20, 1993.

Hornik, K., M. Stinchcombe, and H. White. "Multilayer Feedforward Networks Are Universal Approximators." Neural Networks 2(1989):359-66.

Hot Line, Inc. Hot Line: Farm Equipment Guide. Fort Dodge, IA, various issues, 1982-90.

Hot Line, Inc. Hot Line Farm Equipment Guide: Quick Reference Guide. Fort Dodge, IA: Hot Line, Inc., various issues, 1982-90.

Institute of Electrical and Electronics Engineers, Inc. IEEE Transactions on Neural Networks. Washington, DC, various issues.

Joerding, W.H., Y. Li, and D.L. Young. "Feedforward Neural Network Estimation of a Crop Yield Response Function." J. Agr. and Appl. Econ. 26(1994):252-63.

Kasuba, T. "Simplified Fuzzy Artmap." AI Expert 8(November 1993):18-25.

Kohonen, T. Self-Organization and Associative Memory. New York: Springer-Verlag, 1988.

Kohzadi, N., M. Boyd, B. Kermanshahi, and I. Kaastra. "Forecasting Livestock Prices with an Artificial Neural Network versus Linear Time Series Models." In Applied Commodity Price Analysis, Forecasting, and Market Risk Management, pp. 131-43. Proceedings of the NCR-134 Conference, Chicago, IL, April 18-19, 1994.

Kolmogorov, A.N. "On the Representation of Continuous Function of Many Variables by Superposition of Continuous Functions of One Variable and Addition." Amer. Mathematical Society Translation 28(1963):55-59.

Kosko, B. Neural Networks and Fuzzy Systems. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1992.

Math Works, Inc. Matlab Numeric Computation Software. Natick, MA, 1994.

Miller, L.H. "Neural Networks in Agricultural Credit Management." Paper presented at the American Agricultural Economics Association annual meetings, San Diego, 7-10 August 1994.

More, J.J. "The Levenberg-Marquardt Algorithm: Implementation and Theory." In Numerical Analysis, edited by G.A. Watson. Lecture Notes in Mathematics no. 630. New York: Springer-Verlag, 1977.

Murata, N., and S. Yoshizawa. "Network Information Criterion - Determining the Number of Hidden Units for an Artificial Neural Network Model." IEEE Transactions on Neural Networks 5(1994):865-72.

Murray, D. "Tuning Neural Networks with Genetic Algorithms." AI Expert 9(June 1994):27-31.

National Agricultural Library. Agricola, Silver Platter Version 3.11. CD-ROM based database. Bethesda, MD, 1984-92.

NeuralWare, Inc. Using NeuralWorks: A Tutorial for NeuralWorks Professional II/PLUS and NeuralWorks Explorer. Pittsburgh, PA: NeuralWare, Inc., 1993.

Neyman, J., and E.L. Scott. "Correction for Bias Introduced by a Transformation of Variables." Ann. Mathematical Statis. 31(1960):643-55.

Rosen, S. "Hedonic Prices and Implicit Markets: Product Differentiation in Pure Competition." J. Polit. Econ. 82(1974):34-55.

Rumelhart, D.E., and J.L. McClelland. Parallel Distributed Processing, Explorations in the Microstructure of Cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press, 1986.

Salchenberger, L.M., E.M. Cinar, and N.A. Lash. "Neural Networks: A New Tool for Predicting Thrift Failures." In Neural Networks in Finance and Investing, edited by R.R. Trippi and E. Turban. Chicago: Probus, 1993.

SAS Institute, Inc. SAS/ETS User's Guide, Version 6, 2nd ed. Cary, NC: SAS Institute, Inc., 1993.

Uhrig, J.W., B.A. Engel, and W.L. Baker. "An Application of Neural Networks: Predicting Corn Yields." In Applied Commodity Price Analysis, Forecasting, and Market Risk Management, pp. 407-17. Proceedings of the NCR-134 Conference, Chicago, IL, April 20-22, 1992.

Weymaere, N., and J. Martens. "On the Initialization and Optimization of Multilayer Perceptrons." IEEE Transactions on Neural Networks 5(1994):738-51.

Wilson, H.W. Company. Business Periodicals Index, Version 3.0. Wilsondisc CD-ROM based database. New York, 1983 through 23 July 1994.

Wilson, H.W. Company. Social Sciences Index. CD-ROM based database. New York, various dates 1983-94.

 


This page was last modified on: 02/10/04


© 2002 Cornell University
Department of Applied Economics and Management