Hey folks,

Over the past seven weeks, we covered the fundamentals: how neural networks work, how they learn, and how training transforms random weights into working models.

This week: The data structure that makes modern AI possible.

Everything we've covered so far works with simple inputs: individual numbers or small lists. But real-world AI handles images, videos, and text sequences. For that, we need tensors.

Let's break it down.

The Problem with Simple Data Structures

The neural networks from Issues #34-40 worked with straightforward inputs:

Example: Predicting house prices

Input: [3 bedrooms, 2000 sq ft, 2 bathrooms]
Output: $450,000

Three numbers in, one number out. Simple.

But what about an image classifier?

A small 28×28 pixel grayscale image contains 784 numbers (one per pixel). A color photo at 224×224 pixels contains 150,528 numbers (224 × 224 × 3 RGB channels).

How do you organize 150,528 numbers efficiently? How do you process 32 images simultaneously in a batch?

You need a better data structure. That's where tensors come in.

What Tensors Actually Are

A tensor is a multi-dimensional array of numbers with a well-defined shape.

Think of tensors as containers that organize data by dimensions:

0D Tensor (Scalar): A single number

5

1D Tensor (Vector): A list of numbers

[1, 2, 3, 4, 5]

2D Tensor (Matrix): A grid of numbers

[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]

3D Tensor: A cube of numbers (think: a stack of matrices)

Multiple 2D grids stacked on top of each other

4D, 5D, 6D+ Tensors: Even higher dimensional structures

The key insight: Tensors aren't complex math; they're organized containers for data.

The Dimension Ladder

Let's climb from simple to complex:

Scalar (0D Tensor): Single number. temperature = 72
Shape: () | Use: Single predictions

Vector (1D Tensor): List of numbers. [100, 150, 200, 175]
Shape: (4,) | Use: Word embeddings, audio samples

Matrix (2D Tensor): Grid of numbers.
Shape: (28, 28) | Use: Grayscale images, spreadsheets

3D Tensor: Stack of matrices.
Shape: (224, 224, 3) | Use: Color images (height × width × RGB channels)

4D Tensor: Batch of 3D tensors.
Shape: (32, 224, 224, 3) | Use: 32 color images for training

5D+ Tensor: Even higher dimensions.
Shape: (10, 30, 224, 224, 3) | Use: Video (10 clips, 30 frames each)
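
To make the ladder concrete, here's a minimal sketch in NumPy (NumPy is just one option here; PyTorch and TensorFlow tensors expose the same shapes):

import numpy as np

scalar = np.array(72)                    # 0D: a single number
vector = np.array([100, 150, 200, 175])  # 1D: shape (4,)
matrix = np.zeros((28, 28))              # 2D: a grayscale image
image = np.zeros((224, 224, 3))          # 3D: a color image
batch = np.zeros((32, 224, 224, 3))      # 4D: a batch of 32 color images

print(scalar.shape)  # ()
print(vector.shape)  # (4,)
print(batch.shape)   # (32, 224, 224, 3)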

Real-World Example: Understanding Image Tensors

Let's break down how images become tensors:

Grayscale Image (2D Tensor)

A 28×28 pixel grayscale image from MNIST (handwritten digits):

Shape: (28, 28)
Total values: 784 numbers
Each value: 0-255 (pixel brightness)

This is just a 2D matrix. Each number represents one pixel's brightness.

Color Image (3D Tensor)

A 224×224 pixel color photo:

Shape: (224, 224, 3)
Total values: 150,528 numbers
Dimensions: height × width × channels

The third dimension (3) represents RGB:

  • Channel 0: Red intensity at each pixel
  • Channel 1: Green intensity at each pixel
  • Channel 2: Blue intensity at each pixel

Think of it as three 224×224 grayscale images stacked together—one for red, one for green, one for blue.
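
Here's that stacking sketched in NumPy, using a zero-filled array as a stand-in for a real photo:

import numpy as np

photo = np.zeros((224, 224, 3))  # stand-in for a real 224x224 color photo
red = photo[:, :, 0]             # channel 0: red intensities
green = photo[:, :, 1]           # channel 1: green intensities
blue = photo[:, :, 2]            # channel 2: blue intensities
print(red.shape)                 # (224, 224) -- one grayscale-like layer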

Batch of Color Images (4D Tensor)

Training a neural network processes multiple images simultaneously:

Shape: (32, 224, 224, 3)
Total values: 4,816,896 numbers
Dimensions: batch_size × height × width × channels

This is 32 color images, each 224×224×3, organized into a single 4D tensor.

Why batch processing?

  • Faster training (GPUs process batches efficiently)
  • More stable gradient calculations
  • Better memory utilization
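
And here's how that 4D batch gets built in practice, sketched in NumPy with placeholder images:

import numpy as np

# 32 individual images, each (224, 224, 3) -- zero-filled placeholders here
images = [np.zeros((224, 224, 3)) for _ in range(32)]
batch = np.stack(images)  # stacking adds a new leading batch dimension
print(batch.shape)        # (32, 224, 224, 3)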

Shape vs Rank: Avoiding Confusion

Two critical tensor concepts:

Shape: The size of each dimension

tensor.shape = (32, 224, 224, 3)

Tells you: 32 images, each 224×224 pixels, with 3 color channels.

Rank: The number of dimensions

tensor.rank = 4

Tells you: This is a 4D tensor (4 dimensions).

Common confusion:

  • "3D tensor" doesn't mean 3D graphics
  • It means a tensor with 3 dimensions (like a color image: height × width × channels)
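
One note: tensor.rank above is shorthand for the concept. In NumPy and PyTorch the attribute is actually called ndim, and TensorFlow uses tf.rank(). A quick NumPy check:

import numpy as np

t = np.zeros((32, 224, 224, 3))
print(t.shape)  # (32, 224, 224, 3) -- the size of each dimension
print(t.ndim)   # 4 -- the rank: the number of dimensions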

Tensor Operations: What You Can Do

Tensors support mathematical operations optimized for GPUs:

Element-wise operations: Apply operations to each element. Adding tensors, multiplying by scalars. Used for normalizing pixel values, applying activation functions.

Matrix multiplication: Fundamental to neural networks. result = weights @ input. Every forward propagation step uses this.

Reshaping: Change shape without changing data. Flatten (28, 28) image to (784,) vector. Used when converting images for fully-connected layers.

Broadcasting: Automatically expand smaller tensors. Add scalar 10 to a matrix, and it applies to every element. Used for adding bias terms.
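
All four operations fit in a few lines of NumPy; this sketch uses tiny made-up tensors just to show the mechanics:

import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([[0.5, 0.0], [0.0, 0.5]])

doubled = x * 2      # element-wise: every entry is multiplied by 2
product = w @ x      # matrix multiplication, as in forward propagation
flat = x.reshape(4)  # reshaping: same 4 numbers, now shape (4,)
shifted = x + 10     # broadcasting: the scalar 10 reaches every element

print(product.shape)  # (2, 2)
print(flat)           # [1. 2. 3. 4.]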

Why GPUs Love Tensors

Tensors enable parallel processing, which is why GPUs accelerate deep learning so dramatically.

CPU approach (sequential):

Process pixel 1
Process pixel 2
Process pixel 3
... (150,528 pixels later)
Done!

GPU approach (parallel):

Process all 150,528 pixels simultaneously
Done!

Tensors organize data so GPUs can apply the same operation to thousands of elements at once. A task that takes a CPU 10 minutes might take a GPU 10 seconds.

This is why modern AI runs on GPUs: not because a single GPU core is faster than a CPU core, but because tensors expose the parallelism that lets thousands of cores work at once.
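
If you have PyTorch installed, you can see this hand-off yourself. This sketch assumes a CUDA GPU may or may not be present and falls back to the CPU:

import torch

x = torch.rand(32, 224, 224, 3)  # a batch of random stand-in images
if torch.cuda.is_available():
    x = x.to("cuda")             # move the whole tensor into GPU memory
y = x * 2.0                      # one call, applied to all 4.8M elements at once
print(y.device)                  # cuda:0 on a GPU machine, cpu otherwise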

Why Modern Architectures Need Tensors

Every modern neural network architecture operates on tensors:

Convolutional Neural Networks (CNNs):

  • Input: 4D tensor (batch, height, width, channels)
  • Process images by sliding filters across spatial dimensions
  • Output: 4D tensor with learned features

Recurrent Neural Networks (RNNs):

  • Input: 3D tensor (batch, time_steps, features)
  • Process sequences (text, audio, time series)
  • Output: 3D tensor with temporal patterns

Transformers:

  • Input: 3D tensor (batch, sequence_length, embedding_dimension)
  • Process text using attention mechanisms
  • Power models like ChatGPT, BERT, GPT-4

You can't understand these architectures without understanding tensors. They're the foundation everything else builds on.
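
One convention caveat before moving on: the shapes above follow the channels-last layout used by TensorFlow/Keras, while PyTorch expects channels-first, (batch, channels, height, width). A minimal PyTorch sketch of a CNN layer consuming an image batch:

import torch
import torch.nn as nn

x = torch.randn(32, 3, 224, 224)  # PyTorch puts channels before height/width
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
y = conv(x)                       # 16 filters slide across the spatial dims
print(y.shape)                    # torch.Size([32, 16, 222, 222])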

Practical Example: Image Classification Pipeline

Let's trace how tensors flow through an image classifier:

Step 1: Load 32 color photos (224×224 pixels each)

Step 2: Convert to tensor → Shape: (32, 224, 224, 3) = 4.8M numbers

Step 3: Normalize pixel values from [0, 255] to [0, 1]

Step 4: Convolutional layers process spatial dimensions → extract features

Step 5: Output predictions → Shape: (32, 1000) = 32 images, 1000 possible classes each

The entire pipeline operates on tensors from start to finish.
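
Sketched in code, with random numbers standing in for real photos (steps 4-5 would need a trained model, so they're left as a comment):

import numpy as np

# Steps 1-2: 32 "photos" as a single 4D tensor of raw pixel values
photos = np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)

# Step 3: normalize [0, 255] -> [0, 1]
batch = photos.astype(np.float32) / 255.0
print(batch.shape)  # (32, 224, 224, 3)

# Steps 4-5: a trained CNN would map this batch to predictions
# of shape (32, 1000) -- one score per class, per image.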

Common Tensor Shapes You'll See

Image classification:

  • Input: (batch, height, width, channels) → (32, 224, 224, 3)
  • Output: (batch, num_classes) → (32, 1000)

Text processing:

  • Input: (batch, sequence_length, embedding_dim) → (32, 512, 768)
  • Output: (batch, sequence_length, vocab_size) → (32, 512, 50000)

Time series:

  • Input: (batch, time_steps, features) → (32, 100, 10)
  • Output: (batch, prediction_window) → (32, 24)

Recognizing these patterns helps you understand how different architectures work.

Key Takeaway

Tensors are organized containers for multi-dimensional data.

They're not complex math—they're practical data structures:

  • 0D (scalar): A single number
  • 1D (vector): A list
  • 2D (matrix): A grid
  • 3D+: Stacked grids

Why they matter:

  • Real-world data is multi-dimensional (images, video, sequences)
  • Tensors organize this data efficiently
  • GPUs process tensors in parallel (massive speedup)
  • Every modern architecture (CNNs, RNNs, Transformers) operates on tensors

The insight: Tensors aren't optional advanced math. They're the fundamental data structure of modern AI.

Master tensors, and CNNs, RNNs, and Transformers become much easier to understand.

What's Next

Next week: Convolutional Neural Networks (CNNs) and how they process image tensors.

Read the full AI Learning series → Learn AI

Thanks for reading! Got questions, feedback, or want to chat about AI? Hit reply – I read and respond to every message. And if you found this valuable, feel free to forward it to a friend who'd benefit!

Pranay
Infolia.ai
