Last week, we covered forward and backward propagation - how neural networks make predictions and learn from their mistakes.
This week, let's talk about activation functions.
You've probably heard: "Neural networks need activation functions" or "They must be non-linear."
But why? What problem do they actually solve?
Let's break it down.
Why We Need Activation Functions
Imagine you build a neural network without any activation functions.
Just neurons doing multiplication and addition. Stack multiple layers. Connect them together.
Seems reasonable, right?
Here's the problem: it doesn't work.
Linear vs Non-Linear Functions
To understand why, we need to understand what "linear" and "non-linear" mean.
Linear Functions
A linear function is just multiplication and addition.
f(x) = 2x + 3
That's linear. Multiply by 2, add 3.
If you graph it, you get a straight line.
Change the input by 1? The output always moves by the same fixed amount (here, 2). Always a straight line.

Non-Linear Functions
A non-linear function creates curves and bends.
f(x) = x²
This is non-linear. Graph it, you get a curve (a parabola).
The output doesn't change proportionally. Sometimes it goes up slowly, sometimes quickly. It curves.

The Stacking Problem
Here's what happens when you stack linear functions.
Say you have two layers:
Layer 1: f(x) = 2x + 1
Layer 2: g(x) = 3x + 2
When you connect them (output of layer 1 goes into layer 2):
g(f(x)) = g(2x + 1)
= 3(2x + 1) + 2
= 6x + 3 + 2
= 6x + 5
Look at the result: 6x + 5
Still just multiplication and addition. Still linear.
You could have 10 layers, 100 layers, 1000 layers. Doesn't matter.
Mathematically, it all collapses into one equation: mx + b
All that depth? All those layers? Useless.
Your "deep" neural network is mathematically identical to a single-layer network.

Real-World Problems Need Non-Linearity
This is a problem because real-world data isn't linear.
Example 1: Image Classification
You're building a cat vs dog classifier.
The difference between "cat" and "dog" in pixel space isn't a straight line.
It's complex. It's curved.
A straight line can't separate cats from dogs. You need curves and complex shapes.
Example 2: Language Understanding
Consider: "This movie is not bad"
A linear model just adds up word scores: "not" looks negative, "bad" looks negative → it predicts negative.
But the actual meaning is mildly positive - "not" flips the meaning of "bad".
Capturing that interaction between words requires non-linear processing.
Example: XOR Problem
You have four points:
- (0,0) → Class A
- (0,1) → Class B
- (1,0) → Class B
- (1,1) → Class A
Try drawing a single straight line that separates Class A from Class B.
You can't. No straight line works.
You need a curved boundary. You need non-linearity.
| Input (x, y) | Class |
|---|---|
| (0, 0) | A |
| (0, 1) | B |
| (1, 0) | B |
| (1, 1) | A |
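To see what non-linearity buys you here, below is a tiny sketch of a two-layer network that gets XOR right. It uses ReLU (a simple activation function we'll meet properly below), and the weights are hand-picked for illustration - a real network would learn them - but the point stands: no single linear layer can do this.

```python
# A tiny two-layer network that solves XOR. Weights are hand-picked to
# illustrate the idea; ReLU(z) = max(0, z) is the activation function.

def relu(z):
    return max(0, z)

def xor_net(x, y):
    h1 = relu(x + y)        # hidden unit 1: grows when either input is 1
    h2 = relu(x + y - 1)    # hidden unit 2: only non-zero when BOTH inputs are 1
    return h1 - 2 * h2      # output layer: a linear mix of the hidden units

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), "->", xor_net(x, y))
# prints: (0, 0) -> 0, (0, 1) -> 1, (1, 0) -> 1, (1, 1) -> 0
# i.e. Class A = 0, Class B = 1 - exactly the XOR pattern.
```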
How Activation Functions Fix This
Activation functions introduce non-linearity into the network.
They break the "stacking linear = still linear" problem.
Without Activation Functions
Input → [Linear] → [Linear] → [Linear] → Output
Result: Everything collapses to mx + b
With Activation Functions
Input → [Linear] → [Activation] → [Linear] → [Activation] → Output
Result: Complex non-linear transformation
The activation function bends the mathematical space. Allows the network to create curves, not just straight lines.
Without them, depth is meaningless. With them, each layer can learn something new.
WITHOUT ACTIVATION FUNCTIONS:
Input → [Layer 1] → [Layer 2] → [Layer 3] → Output
        ↓
Collapses to:
Input → [Single Linear Function] → Output
WITH ACTIVATION FUNCTIONS:
Input → [Layer 1] → [ReLU] → [Layer 2] → [ReLU] → [Layer 3] → Output
(Each layer learns something new)
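Here are the same two toy layers from the stacking example, but with a ReLU squeezed between them (plain Python again). The result can't be written as a single mx + b anymore - its slope changes depending on where you look.

```python
def relu(z):
    return max(0, z)

def f(x):                    # Layer 1 from the stacking example
    return 2 * x + 1

def g(x):                    # Layer 2 from the stacking example
    return 3 * x + 2

def with_relu(x):            # Layer 1 -> ReLU -> Layer 2
    return g(relu(f(x)))

# A single straight line has ONE slope everywhere. This function doesn't:
print(with_relu(-1) - with_relu(-2))   # 0  (flat: ReLU zeroed out this region)
print(with_relu(2) - with_relu(1))     # 6  (steep: the linear part passes through)
```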
The Three Main Activation Functions
Now let's look at the three most common activation functions.
1. ReLU (Rectified Linear Unit)
The simplest one.
ReLU(x) = max(0, x)
What it does:
- If x is positive → keep it
- If x is negative → make it zero
Example:
ReLU(5) = 5
ReLU(-3) = 0
ReLU(0.7) = 0.7
ReLU(-0.2) = 0
If you graph it: diagonal line for positives, flat at zero for negatives.
That bend at zero makes it non-linear.
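In code, it really is one line (plain Python):

```python
def relu(x):
    return max(0, x)   # keep positives as-is, clamp negatives to zero

print(relu(5), relu(-3), relu(0.7), relu(-0.2))   # 5 0 0.7 0
```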

When to use it:
Default choice for hidden layers. 90% of the time, use ReLU.
Why?
- Fast to compute (just check if x > 0)
- Works well in deep networks
- Simple but effective
The problem:
Sometimes neurons "die" - get stuck at zero and never recover. Called the "dying ReLU" problem.
In practice, not a huge issue.
2. Sigmoid
Squashes everything between 0 and 1.
Sigmoid(x) = 1 / (1 + e^(-x))
What it does:
- Large positive input → close to 1
- Large negative input → close to 0
- Around zero → around 0.5
Creates a smooth S-shaped curve.
Example:
Sigmoid(5) ≈ 0.99
Sigmoid(0) = 0.5
Sigmoid(-5) ≈ 0.01
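In code (plain Python, using the standard math module):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))   # squashes any input into the range (0, 1)

print(round(sigmoid(5), 2), sigmoid(0), round(sigmoid(-5), 2))   # 0.99 0.5 0.01
```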

When to use it:
Mainly for output layers when you need probabilities.
Example: "Is this a cat?"
- Output 0.9 → 90% confident it's a cat
- Output 0.1 → 10% confident (probably not)
The problem:
"Vanishing gradient" - in deep networks, gradients become tiny. Learning becomes slow.
That's why we don't use sigmoid in hidden layers anymore. ReLU is better.
3. Tanh (Hyperbolic Tangent)
Similar to sigmoid, but outputs between -1 and 1.
Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
What it does:
- Large positive input → close to 1
- Large negative input → close to -1
- Around zero → around 0
Also S-shaped, but centered at zero.
Example:
Tanh(5) ≈ 0.9999
Tanh(0) = 0
Tanh(-5) ≈ -0.9999
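In code, tanh is already in Python's standard library:

```python
import math

print(round(math.tanh(5), 4), math.tanh(0), round(math.tanh(-5), 4))   # 0.9999 0.0 -0.9999
```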

When to use it:
Sometimes in hidden layers when you need negative values.
Better than sigmoid for hidden layers because it's zero-centered. But ReLU is usually still better.
The problem:
Also has vanishing gradient (less severe than sigmoid).
Quick Comparison
| Function | Range | Speed | Use Case | Main Issue |
|---|---|---|---|---|
| ReLU | 0 to ∞ | Very fast | Hidden layers (default) | Can "die" |
| Sigmoid | 0 to 1 | Slower | Output (binary) | Vanishing gradient |
| Tanh | -1 to 1 | Slower | Hidden layers | Vanishing gradient |

Practical Example
Building an image classifier: cat vs dog.
Your network:
Input (pixels)
  ↓
Hidden Layer 1 → ReLU
  ↓
Hidden Layer 2 → ReLU
  ↓
Hidden Layer 3 → ReLU
  ↓
Output Layer → Sigmoid
Why:
- Layers 1-3 use ReLU: Fast and efficient. They extract features (edges, shapes, textures).
- Output uses Sigmoid: Gives probability (0 = dog, 1 = cat).
If output is 0.85, you're 85% confident it's a cat.
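If you want to see that architecture as code, here's a minimal sketch using PyTorch. The 64x64 input size and the layer widths are made-up assumptions for illustration - the recipe above only fixes which activation goes where.

```python
import torch
import torch.nn as nn

# Assumption: 64x64 grayscale images flattened into 4096 pixel values.
model = nn.Sequential(
    nn.Linear(4096, 256), nn.ReLU(),    # Hidden Layer 1 -> ReLU
    nn.Linear(256, 64),   nn.ReLU(),    # Hidden Layer 2 -> ReLU
    nn.Linear(64, 16),    nn.ReLU(),    # Hidden Layer 3 -> ReLU
    nn.Linear(16, 1),     nn.Sigmoid(), # Output Layer -> probability in (0, 1)
)

x = torch.rand(1, 4096)        # one fake "image" as a flat vector of pixels
p_cat = model(x).item()        # untrained, so this will hover around 0.5
print(f"P(cat) = {p_cat:.2f}")
```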
Key Takeaway
Without activation functions:
- All layers collapse into one linear equation
- Network can only draw straight lines
- Deep learning doesn't work
With activation functions:
- Each layer learns something new
- Network creates complex, curved decision boundaries
- Deep learning becomes possible
Think of it this way:
Linear functions = trying to draw a circle using only a ruler.
Non-linear activation functions = putting the ruler down: now you can draw curves, circles, any shape you need.
Other Activation Functions
You'll hear about:
- Leaky ReLU
- ELU (Exponential Linear Unit)
- Swish
- GELU
These are variations trying to fix specific problems.
For learning, stick with ReLU, Sigmoid, and Tanh. They cover 90% of cases.
Once you understand the basics, explore the rest.
Next Week
Now you know why activation functions are critical and how the main ones work.
Next question: How does a neural network know if it's getting better or worse?
That's where loss functions come in.
Next week, we'll cover how neural networks measure their mistakes and why choosing the right loss function matters.
How was today's email?