Backpropagation
For a single training example (x, y), we define the cost function with respect to that single example to be:

    J(W,b; x, y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2

This is a (one-half) squared-error cost function. Given a training set of m examples, we then define the overall cost function to be:

    J(W,b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J(W,b; x^{(i)}, y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l} \sum_{i} \sum_{j} \left( W_{ji}^{(l)} \right)^2

The first term in the definition of J(W,b) is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights and helps prevent overfitting. [Note: weight decay is usually not applied to the bias terms b_i^{(l)}, as reflected in our definition of J(W,b). Applying weight decay to the bias units usually makes only a small difference to the final network, however. If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.] The weight decay parameter λ controls the relative importance of the two terms. Note also the slightly overloaded notation: J(W,b; x, y) is the squared-error cost with respect to a single example, while J(W,b) is the overall cost function, which includes the weight decay term.

The cost function above is often used for both classification and regression problems. For classification, we let y = 0 or 1 represent the two class labels (recall that the sigmoid activation function outputs values in [0,1]; if we were using a tanh activation function, we would instead use −1 and +1 to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the [0,1] range (or, if we were using a tanh activation function, the [−1,1] range). Our goal is to minimize J(W,b) as a function of W and b.
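As a sketch of the two-term cost above (not part of the original tutorial; the function names `squared_error_cost` and `overall_cost` are illustrative), the average sum-of-squares term and the weight decay term could be computed as follows. Note that, as in the text, the decay term sums over the weight matrices only, not the biases:

```python
import numpy as np

def squared_error_cost(h, y):
    """Per-example cost J(W,b; x, y) = 1/2 * ||h - y||^2."""
    return 0.5 * np.sum((h - y) ** 2)

def overall_cost(predictions, targets, weights, lam):
    """Overall cost J(W,b): average per-example squared error
    plus the weight decay term (lambda / 2) * sum of squared weights.
    Weight decay is applied to the weight matrices only, not the bias terms."""
    m = len(predictions)
    data_term = sum(squared_error_cost(h, y)
                    for h, y in zip(predictions, targets)) / m
    decay_term = (lam / 2.0) * sum(np.sum(W ** 2) for W in weights)
    return data_term + decay_term

# Tiny usage example: one example with h = 1, y = 0, a single weight W = 2,
# and lambda = 0.1 gives 0.5 * 1 + 0.05 * 4 = 0.7.
cost = overall_cost([np.array([1.0])], [np.array([0.0])],
                    [np.array([[2.0]])], lam=0.1)
```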
To train our neural network, we will initialize each parameter W_{ij}^{(l)} and each b_i^{(l)} to a small random value near zero (say, according to a Normal(0, ε²) distribution for some small ε, say 0.01), and then apply an optimization algorithm such as batch gradient descent. Since J(W,b) is a non-convex function, gradient descent is susceptible to local optima; in practice, however, gradient descent usually works fairly well. Finally, note that it is important to initialize the parameters randomly rather than to all 0's. If all the parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input; random initialization serves the purpose of symmetry breaking.
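A minimal sketch of this initialization scheme (the helper name `init_params` and the layer-size convention are assumptions for illustration): each weight and bias is drawn from Normal(0, ε²) with ε = 0.01, so no two hidden units start out identical.

```python
import numpy as np

def init_params(layer_sizes, eps=0.01, seed=0):
    """Initialize weights and biases to small random values ~ Normal(0, eps^2).

    Drawing random (rather than all-zero) values breaks the symmetry
    between hidden units, so they can learn different functions of the input.
    layer_sizes, e.g. [3, 5, 2]: 3 inputs, 5 hidden units, 2 outputs.
    """
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append(rng.normal(0.0, eps, size=(n_out, n_in)))
        biases.append(rng.normal(0.0, eps, size=n_out))
    return weights, biases

# Usage: a 3-5-2 network gives weight matrices of shape (5, 3) and (2, 5).
W, b = init_params([3, 5, 2])
```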