LSTM Networks

Long Short-Term Memory — A Deep Learning Architecture for Sequential Data

LSTM stands for Long Short-Term Memory. It is a type of Recurrent Neural Network (RNN), which places it firmly within the field of deep learning — a subset of machine learning that uses multi-layered neural networks to learn patterns from data.

Where LSTM Fits in the AI Landscape


Artificial Intelligence
  └── Machine Learning
        └── Deep Learning
              ├── CNNs        (images, spatial data)
              ├── Transformers (language models, attention-based)
              └── RNNs        (sequential / time-series data)
                    ├── Vanilla RNN   (simple, suffers from vanishing gradients)
                    ├── GRU           (simplified gating, faster)
                    └── LSTM          (full gating mechanism, best for long sequences)

The Problem LSTM Solves

Standard neural networks treat each input independently — they have no concept of order or memory. But many real-world problems are sequential: stock prices, weather, language, sensor readings, music. The current value depends on what came before it.

Vanilla RNNs attempted to solve this by feeding the hidden state back into the network at each time step, but they suffer from the vanishing gradient problem — during training, gradients shrink exponentially as they propagate backward through time, making it practically impossible to learn long-range dependencies. An RNN trained on 365 days of data can effectively "forget" what happened 60+ days ago.
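
To get a feel for the scale of the problem, here is a toy illustration (the 0.9 per-step decay factor is an assumed value for demonstration, not a measured gradient):


# If each backward step through time multiplies the gradient by ~0.9,
# the learning signal decays exponentially with distance into the past:
print(0.9 ** 60)    # ≈ 1.8e-03 → 60 days back: heavily attenuated
print(0.9 ** 365)   # ≈ 2.0e-17 → a year back: numerically gone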

LSTM was introduced by Hochreiter & Schmidhuber in 1997 specifically to fix this. It uses a gating mechanism that allows it to selectively remember or forget information over long sequences — hundreds or even thousands of time steps.

LSTM Cell Architecture

Each LSTM cell has three gates and a cell state. Think of the cell state as a conveyor belt — information flows along it largely unchanged, and the gates decide what to add or remove.


                  ┌─────────────────────────────────────────────┐
                  │                  LSTM Cell                  │
                  │                                             │
  Cell State ─────┼───── × ────────────── + ────────────────────┼──── Cell State
  (long-term      │      │                │                     │     (updated)
   memory)        │      │                │                     │
                  │   ┌──┴──┐     ┌───────┴───────┐   ┌────────┐│
                  │   │ Fg  │     │ Ig  ×  Cand   │   │   Og   ││
                  │   │gate │     │ gate    gate  │   │  gate  ││
                  │   └──┬──┘     └───────┬───────┘   └───┬────┘│
                  │      │                │               │     │
  Hidden State ───┼──────┴────────────────┴───────────────┴─────┼──── Hidden State
  (short-term     │     [h(t-1), x(t)] concat        tanh()     │     (output)
   memory)        │                                             │
                  └─────────────────────────────────────────────┘

                                   Input: x(t)

   Fg = Forget Gate   →  "What old info should I discard?"    σ(0 to 1)
   Ig = Input Gate    →  "What new info is worth storing?"    σ(0 to 1)
   Cand = Candidate   →  "What are the new candidate values?" tanh(-1 to 1)
   Og = Output Gate   →  "What part of cell state to output?" σ(0 to 1)

The Three Gates Explained

  1. Forget Gate (Fg) — looks at the previous hidden state and current input, outputs a number between 0 and 1 for each value in the cell state. A value of 0 means "completely forget this" and 1 means "keep this entirely."
  2. Input Gate (Ig) — decides which new information to store. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of candidate values. The two are multiplied together element-wise.
  3. Output Gate (Og) — determines what the cell outputs. The cell state is passed through tanh (squashing it to between -1 and 1) and multiplied by the sigmoid output of this gate, so only the chosen parts are sent forward, as sketched in the code below.
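
To make the gate arithmetic concrete, here is a minimal NumPy sketch of a single LSTM time step. The stacked weight matrix W and bias b are hypothetical placeholders (real implementations keep separate, learned parameters per gate):


import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to all four gate
    pre-activations, shape (hidden + features, 4 * hidden)."""
    z = np.concatenate([h_prev, x_t]) @ W + b
    f, i, cand, o = np.split(z, 4)

    f = sigmoid(f)          # forget gate: 0 = discard, 1 = keep
    i = sigmoid(i)          # input gate: which candidates to store
    cand = np.tanh(cand)    # candidate values, squashed to (-1, 1)
    o = sigmoid(o)          # output gate: what to expose

    c_t = f * c_prev + i * cand    # cell state update (the conveyor belt)
    h_t = o * np.tanh(c_t)         # new hidden state / output
    return h_t, c_t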

LSTM in Code — A Minimal Example

Here is a simple LSTM for time-series prediction using TensorFlow/Keras:


import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Suppose X_train has shape: (samples, timesteps, features)
# e.g. 1000 samples, 60-day lookback, 8 features per day
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(60, 8)),
    Dropout(0.2),

    LSTM(128, return_sequences=True),
    Dropout(0.2),

    LSTM(128, return_sequences=False),  # last layer returns single output
    Dropout(0.2),

    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1),  # predict one value (e.g. tomorrow's price)
])

model.compile(optimizer='adam', loss='huber', metrics=['mae'])
model.fit(X_train, y_train, epochs=100, batch_size=64,
          validation_split=0.1)
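
To produce X_train in that (samples, timesteps, features) shape, a sliding-window helper along these lines is typical (make_windows and the convention that column 0 holds the prediction target are illustrative assumptions):


import numpy as np

def make_windows(data, lookback=60):
    """Slice a (n_days, n_features) array into overlapping windows."""
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i - lookback:i])    # lookback days of history
        y.append(data[i, 0])              # next day's target value
    return np.array(X), np.array(y)

X_train, y_train = make_windows(scaled_data)  # scaled_data: hypothetical (n_days, 8) array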

Key Concepts to Understand

  1. Cell state vs. hidden state: long-term vs. short-term memory (see the diagram above).
  2. Gating: sigmoid outputs between 0 and 1 act as soft switches that control information flow.
  3. return_sequences: stacked LSTM layers need the full sequence from the layer below; the final LSTM layer usually returns only the last time step.

Common Use Cases

  1. Time-series forecasting (stock prices, weather, sensor readings)
  2. Language modelling and text generation
  3. Speech recognition
  4. Music generation


Practical LSTM Improvements for Time-Series Forecasting

Below are five targeted improvements that can significantly improve prediction quality, especially when working with long lookback windows and volatile time-series data.

1. Huber Loss Instead of MSE

Mean Squared Error (MSE) squares every error, which means large price moves (crashes, spikes) dominate the loss function. The model learns to "play it safe" and revert toward historical averages to minimize those squared penalties.

Huber loss behaves like MSE for small errors but switches to linear (MAE-like) behavior for large errors, controlled by a delta threshold. This makes the model robust to outlier moves without ignoring them entirely.


# MSE: loss = (y_true - y_pred)²         ← large errors get squared
# MAE: loss = |y_true - y_pred|           ← linear, but not smooth at 0
# Huber: best of both worlds

import tensorflow as tf

# delta controls the switchover point
loss_fn = tf.keras.losses.Huber(delta=1.0)

model.compile(optimizer='adam', loss=loss_fn, metrics=['mae'])

Loss
  │
  │   MSE ╱
  │      ╱         Huber loss: quadratic near zero,
  │     ╱          linear for large errors
  │    ╱  ╱ Huber
  │   ╱  ╱
  │  ╱  ╱
  │ ╱ ╱  ╱ MAE
  │╱╱╱
  └───────────── Error
       δ
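
For reference, here is a minimal NumPy sketch of the piecewise definition (equivalent in spirit to tf.keras.losses.Huber, not its exact implementation):


import numpy as np

def huber(y_true, y_pred, delta=1.0):
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2               # used where |error| <= delta
    linear = delta * (err - 0.5 * delta)     # used where |error| >  delta
    return np.where(err <= delta, quadratic, linear)

print(huber(0.0, 0.5))   # 0.125 → quadratic region
print(huber(0.0, 5.0))   # 4.5   → linear region (the quadratic rule would give 12.5)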

2. Sample Weighting — Recency Bias

With a long lookback (e.g. 1460 days), the model treats all historical data equally. But market conditions from 3 years ago may be irrelevant today. Sample weighting lets you tell the model: "recent data matters more."


import numpy as np

def make_sample_weights(n_samples, method='linear'):
    """Weight recent training samples more heavily."""
    t = np.linspace(0, 1, n_samples)

    if method == 'linear':
        w = 0.5 + 0.5 * t         # range: 0.5 → 1.0
    elif method == 'exponential':
        w = np.exp(3 * t)          # ~20× heavier at the end
    else:
        return None

    w /= w.mean()                  # normalise so mean weight = 1
    return w

weights = make_sample_weights(len(X_train), method='linear')

model.fit(X_train, y_train,
          sample_weight=weights,   # ← pass to .fit()
          epochs=100, batch_size=64)

Weight
 1.0 │                              ╱ linear
     │                          ╱╱╱
     │                      ╱╱╱
     │                  ╱╱╱
 0.5 │──────────────╱╱╱
     │
     └──────────────────────────────── Time
     oldest                      newest
     sample                      sample

3. Log Returns as a Feature

When data grows exponentially over time, MinMaxScaler compresses early values into a tiny band near zero and recent values near one. The LSTM struggles to learn from this distorted distribution. Log returns transform multiplicative price changes into additive ones, producing a roughly stationary series that the network can learn from much more effectively.


import numpy as np
import pandas as pd

# Raw price: [100, 110, 105, 120, 115]
# After MinMaxScaler: compressed, non-stationary

# Log returns: captures the *rate of change* regardless of price level
df['log_returns'] = np.log(df['price'] / df['price'].shift(1))

# A 10% gain at $100 and a 10% gain at $100,000
# both produce log_return ≈ 0.0953
# Without log returns, the $100 move is invisible to the scaler

Raw price (exponential growth):     Log returns (stationary):
  │            ╱                      │
  │           ╱                   0.1 │  ╷   ╷       ╷
  │         ╱╱                        │  │╷  │╷  ╷╷  │╷
  │       ╱╱                      0.0 │──┼┼──┼┼──┼┼──┼┼──
  │    ╱╱╱                            │  ╵│  ╵│  │╵  ╵
  │╱╱╱╱                          -0.1 │   ╵   ╵  ╵
  └────────── Time                    └────────────── Time

  Hard to learn from                  Easy to learn from
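
One practical follow-up: a model trained on log returns predicts returns, not prices. To recover a price path from the forecasts, exponentiate the cumulative sum (predicted_log_returns here is a hypothetical array of model outputs):


last_price = df['price'].iloc[-1]
predicted_prices = last_price * np.exp(np.cumsum(predicted_log_returns))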

4. Monte Carlo Dropout for Uncertainty Estimation

Standard LSTM prediction gives you a single line — no indication of how confident the model is. Monte Carlo Dropout runs the prediction multiple times (e.g. 50 runs) with dropout kept ON during inference. Each run produces a slightly different forecast, and the spread of those forecasts gives you an empirical, data-driven uncertainty band.


import numpy as np

def predict_monte_carlo(model, input_sequence, n_days, n_runs=50):
    """Run n_runs stochastic forward passes with dropout active."""
    all_forecasts = []

    for _ in range(n_runs):
        predictions = []
        seq = input_sequence.copy()

        for _ in range(n_days):
            # training=True keeps dropout ON → stochastic output
            pred = model(seq.reshape(1, *seq.shape), training=True).numpy()
            predictions.append(pred[0, 0])

            # assumes feature 0 is the predicted target:
            # roll the window forward one step (autoregressive)
            new_row = seq[-1].copy()
            new_row[0] = pred[0, 0]
            seq = np.vstack([seq[1:], new_row])

        all_forecasts.append(predictions)

    all_forecasts = np.array(all_forecasts)  # shape: (n_runs, n_days)

    median = np.median(all_forecasts, axis=0)
    lower  = np.percentile(all_forecasts, 5, axis=0)    # 5th percentile
    upper  = np.percentile(all_forecasts, 95, axis=0)   # 95th percentile

    return median, lower, upper

Price
  │          ╱╱╱╱╲╲ ← upper 95th percentile
  │        ╱╱╱╱╱╱╱╱╲╲
  │       ╱╱╱╱╱╱╱╱╱╱╱╱
  │     ╱╱╱ ── median ──╲╲
  │    ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱
  │   ╱╱╱╱╱╱╱╱╱╱╲╲
  │  ╱╱╱╱╲╲        ← lower 5th percentile
  │ │
  │─┤
  │ │ ← today
  └──────────────────── Time
     Wider band = more uncertainty
     (grows further into the future)
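
A usage sketch, assuming X_scaled is a hypothetical (timesteps, features) array holding the most recent scaled observations:


median, lower, upper = predict_monte_carlo(model, X_scaled[-60:], n_days=30)

# Plot `median` as the forecast line and shade between `lower` and
# `upper` for the 90% uncertainty band (5th to 95th percentile).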

5. Bollinger Band Division-by-Zero Guard

The Bollinger Band position feature calculates where the current price sits relative to the upper and lower bands. When volatility drops to near zero (flat price action), the bands collapse and the denominator approaches zero, producing inf or NaN values. These poison the MinMaxScaler and corrupt the entire training dataset.


import numpy as np

sma_20 = df['price'].rolling(20).mean()
std_20 = df['price'].rolling(20).std()
upper_band = sma_20 + 2 * std_20
lower_band = sma_20 - 2 * std_20
band_width = upper_band - lower_band

# BEFORE (dangerous):
# bb_position = (price - lower) / (upper - lower)
# → inf when upper == lower

# AFTER (safe):
bb_position = np.where(
    band_width > 0,
    (df['price'] - lower_band) / band_width,
    0.5  # neutral position when bands collapse
)

                  Normal bands:           Collapsed bands:
Price             upper ─────────         upper ═══════════
  │              ╱               ╲        lower ═══════════
  │         ────╱── price ────────╲──     price ═══════════
  │              ╲               ╱
  │               lower ────────          band_width ≈ 0
  │                                       division → inf ✗
  │               band_width > 0          use 0.5 instead ✓
  └──────────── Time

Summary Table

Improvement        Problem Solved                        Impact
─────────────────  ────────────────────────────────────  ─────────────────────────────────
Huber Loss         MSE over-penalizes large moves        Model stops reverting to the mean
Sample Weighting   Old data drowns out recent trends     Learns current market structure
Log Returns        Scaler compresses exponential data    Captures multiplicative patterns
MC Dropout         No confidence measure                 Real uncertainty bands
BB Zero Guard      inf values corrupt training           Stable feature engineering