Made by Bobr AI

Backpropagation Demystified: How Neural Networks Learn

Learn the fundamentals of neural network training: forward propagation, loss functions, gradient descent, the chain rule, and activation functions like ReLU.

#neural-networks #backpropagation #machine-learning #artificial-intelligence #deep-learning #gradient-descent #data-science
Neural Networks / Backpropagation
What is Backpropagation?
Slide 1 of 8
Training a neural network is an optimization problem: find the weights and biases that minimize error.
Backpropagation tells us which direction to adjust each parameter.
The network is one big nested function — calculus lets us trace how any weight affects the final loss.
Instinct version: drag a weight slider, watch loss go up → move it the other way. Backprop is the math of that instinct.
Diagram: forward pass carries input x through the neuron to output ŷ and loss L; the backward question: who's responsible?
Forward Propagation
The Network Makes a Prediction
Before we can go backward, we go forward.
Each neuron: multiply input by weight, add bias → output
neuron(x) = wx + b
Each layer's output becomes the next layer's input — a chain of functions.
Example: x = 2.1, w = 1, b = 0 → ŷ = (1)(2.1) + 0 = 2.1
The forward pass gives us: a prediction + all intermediate values needed for backprop.
Diagram: x = 2.1 → (× w = 1) → wx = 2.1 → (+ b = 0) → ŷ = 2.1.
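The forward pass above can be sketched in a few lines (Python for illustration; values match the slide's example):

```python
def neuron(x, w, b):
    """One linear neuron: multiply input by weight, add bias."""
    return w * x + b

# Slide's example: x = 2.1, w = 1, b = 0
x, w, b = 2.1, 1.0, 0.0
y_hat = neuron(x, w, b)
print(y_hat)  # 2.1
```

Each layer's output would feed the next `neuron` call, forming the chain of functions the slide describes.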
Measuring Mistakes
The Loss Function
We need a score for how wrong the network is — that's the loss.
Mean Squared Error (MSE):
Loss = (1/n) Σ (ŷᵢ − yᵢ)²
For each example: (predicted − true)² then average across all.
Squaring: negatives don't cancel, and big mistakes are penalized harder.
No loss = no signal. The loss is the score we minimize.
Example: (2.1 − 4)² = (−1.9)² = 3.61
Diagram: MSE visualized for input x: prediction ŷ, error ŷ − y, squared error.
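The MSE formula translates directly to code (a minimal sketch; the single-example case reproduces the slide's 3.61):

```python
def mse(predictions, targets):
    """Mean squared error: average of (y_hat - y)^2 over all examples."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Slide's single example: y_hat = 2.1, y = 4 → (−1.9)² = 3.61
print(mse([2.1], [4.0]))  # 3.61 (up to float rounding)
```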
Which Way is Down?
Gradient Descent
The gradient = direction of steepest ascent (loss increases fastest).
To lower loss: go the opposite direction — steepest descent.
Learning rate controls step size: too large = overshoot, too small = very slow.
  1. Start at random weights
  2. Compute gradient (steepest ascent)
  3. Flip it → steepest descent
  4. Take a small step
  5. Repeat until minimum
Diagram: the loss surface over weight w and bias b, descending from the start (high loss) to the minimum (loss ≈ 0).
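The five steps above fit in one loop. A minimal sketch on a toy one-parameter loss, f(w) = (w − 3)², whose gradient is 2(w − 3) (the function and learning rate here are illustrative choices, not from the slides):

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step opposite the gradient (steepest descent)."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # flip the ascent direction, take a small step
    return w

# Minimize f(w) = (w - 3)^2; the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)  # ≈ 3.0
```

Raising `lr` past 1.0 here makes the iterates overshoot and diverge, which is exactly the "too large = overshoot" failure mode.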
The Chain Rule
Connecting the Dots Backward
The network is a nested function: loss depends on neuron output, which depends on weights.
Chain rule: to find ∂loss/∂w, multiply upstream × local gradient.
∂loss/∂w = (∂loss/∂ŷ) × (∂ŷ/∂w)
∂loss/∂b = (∂loss/∂ŷ) × (∂ŷ/∂b)
Upstream gradient: how loss reacts to the layer's output.
Local gradient: how the layer's output reacts to its own parameter.
Key benefit: intermediate values from the forward pass get reused — no redundant computation.
Diagram: forward pass x = 2.1 → ŷ = wx + b → loss = (ŷ − y)²; backward pass applies the chain rule: ∂loss/∂w = (∂loss/∂ŷ) × (∂ŷ/∂w).
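For this single neuron with squared-error loss, the upstream and local gradients are simple enough to write out by hand (a sketch using the slide's values; ∂loss/∂ŷ = 2(ŷ − y), ∂ŷ/∂w = x, ∂ŷ/∂b = 1):

```python
x, w, b, y = 2.1, 1.0, 0.0, 4.0
y_hat = w * x + b            # forward pass; this value gets reused below

upstream = 2 * (y_hat - y)   # d loss / d y_hat  (upstream gradient)
dloss_dw = upstream * x      # local gradient d y_hat / d w = x
dloss_db = upstream * 1      # local gradient d y_hat / d b = 1
print(dloss_dw, dloss_db)    # ≈ -7.98, -3.8
```

Note that `y_hat` from the forward pass is reused to form `upstream`: that reuse is the "no redundant computation" benefit.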
Backward Propagation
Computing the Gradient
x = 2.1, w = 1, b = 0 → ŷ = 2.1, y = 4, loss = 3.61
∂loss/∂w = −7.98
∂loss/∂b = −3.8
lr = 0.01
w := 1 − (0.01)(−7.98) = 1.0798
b := 0 − (0.01)(−3.8) = 0.038
New loss = 2.87 ↓ (was 3.61) — saved 0.74 in one step!
💡
Negative gradient = loss drops when we increase the parameter → so we increase it.
Diagram: forward pass with x = 2.1, w = 1, b = 0 gives ŷ = 2.1 and loss = 3.61; the backward pass returns grad(w) = −7.98, grad(b) = −3.8.
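One full forward–backward–update step reproduces the slide's numbers (a sketch, not a library API):

```python
x, w, b, y, lr = 2.1, 1.0, 0.0, 4.0, 0.01

y_hat = w * x + b
loss = (y_hat - y) ** 2      # 3.61

upstream = 2 * (y_hat - y)   # -3.8
w -= lr * upstream * x       # w := 1 - (0.01)(-7.98) = 1.0798
b -= lr * upstream           # b := 0 - (0.01)(-3.8)  = 0.038

new_loss = (w * x + b - y) ** 2
print(round(new_loss, 2))    # 2.87
```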
Why We Need More Than a Line
Activation Functions
A single linear neuron can only produce straight lines — stacking more doesn't help.
Real data: curves, patterns, images, language. Needs non-linearity.
Three ways to add complexity: more neurons, more layers, activation functions.
f(x) = max(0, x)
input < 0: output 0 (off)
input > 0: pass through
Differentiable everywhere except at 0 (where a subgradient is used), so the chain rule still works and backprop still runs.
Diagrams: a linear model is a poor fit to curved data; a ReLU network is a great fit ✓. ReLU shape: f(x) = max(0, x), output 0 (off) below zero, pass through above.
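ReLU and its derivative are two one-liners (a sketch; the derivative uses the common convention of 0 at x = 0):

```python
def relu(x):
    """ReLU activation: 0 for negative inputs, identity otherwise."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 0 when the unit is off, 1 when it passes through."""
    return 0.0 if x <= 0 else 1.0

print(relu(-2.0), relu(3.5))            # 0.0 3.5
print(relu_grad(-2.0), relu_grad(3.5))  # 0.0 1.0
```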
Why Backpropagation Makes Learning Possible
The Full Picture
The procedure doesn't change whether you have 1 neuron or 175 billion parameters.
Deeper networks just mean more derivatives to chain; the logic is identical.
Allows any neural network — spam filter to language model — to learn from data.
"As far as neural networks reach, backpropagation will follow."
Training Loop (repeat ×1000s):
① Forward Pass: make a prediction
② Loss: measure how wrong
③ Backward Pass: chain rule finds the culprit
④ Update: nudge params down the gradient
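The four-step loop above, run on the slides' single example (a minimal sketch; the learning rate 0.1 and step count are illustrative):

```python
x, y = 2.1, 4.0        # one training example
w, b, lr = 1.0, 0.0, 0.1

for step in range(1000):
    y_hat = w * x + b            # 1. forward pass: make a prediction
    loss = (y_hat - y) ** 2      # 2. loss: measure how wrong
    upstream = 2 * (y_hat - y)   # 3. backward pass: chain rule
    w -= lr * upstream * x       # 4. update: nudge params down the gradient
    b -= lr * upstream

print(loss)  # approaches 0
```

After enough iterations the prediction wx + b matches y and the loss drops to (numerically) zero, which is the whole training loop in miniature.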