Performance and Troubleshooting#

Concepts#

Load Average#

Load average is the average number of processes that are running, waiting to run, or (on Linux) blocked in uninterruptible sleep, usually on disk I/O, over the last 1, 5, and 15 minutes:

uptime
# 15:18:24 up 22:52,  1 user,  load average: 0.33, 0.14, 0.05
#                                            1min  5min  15min

cat /proc/loadavg
# 0.24 0.13 0.05 1/832 454403

How to interpret it:

  • Compare load to number of CPU cores (nproc)
  • Load = number of cores → CPU is fully utilized
  • Load > number of cores → processes are waiting (overloaded)
  • Load ≪ number of cores → system is mostly idle

# Number of logical CPUs available to the current process
nproc
# Example: 4

# Load of 4.0 on a 4-core system ≈ fully utilized
# Load of 8.0 on a 4-core system = overloaded (processes queuing)
# Load of 1.0 on a 4-core system ≈ 25% utilized
# These are approximations: Linux also counts tasks blocked on I/O.

Trends matter more than snapshots: if the 1-minute average is much higher than the 15-minute average, load is spiking. If all three are high, the system is consistently overloaded.
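
The comparison against the core count can be scripted. The following is a minimal sketch, assuming a Linux system with /proc/loadavg and nproc available; load_per_core is a hypothetical helper name introduced here for illustration, not a standard tool.

```shell
#!/bin/sh
# Sketch: express each load average as a ratio of the CPU count.
# load_per_core is a hypothetical helper, not part of any package.
load_per_core() {
    # $1 = load average, $2 = number of CPUs
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

if [ -r /proc/loadavg ]; then
    cpus=$(nproc)
    # The first three fields of /proc/loadavg are the 1, 5, and 15 minute averages.
    read -r one five fifteen rest < /proc/loadavg
    echo "CPUs: $cpus"
    echo "1 min:  $(load_per_core "$one" "$cpus") per CPU"
    echo "5 min:  $(load_per_core "$five" "$cpus") per CPU"
    echo "15 min: $(load_per_core "$fifteen" "$cpus") per CPU"
fi
```

A ratio near 1.00 corresponds to the "load = number of cores" case; comparing the 1-minute and 15-minute ratios shows whether load is rising or falling.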

CPU Monitoring#

top / htop#

top        # part of procps, preinstalled on most distributions
htop       # more user-friendly (install: sudo apt install htop)

top key columns:

  • %CPU — percentage of CPU used by the process
  • %MEM — percentage of RAM used
  • TIME+ — total CPU time consumed
  • S (state) — R=running, S=sleeping, D=uninterruptible (I/O wait), Z=zombie

top keyboard shortcuts:

  • P — sort by CPU
  • M — sort by memory
  • k — kill a process (enter PID)
  • 1 — toggle per-CPU display
  • q — quit

htop additions:

  • Visual CPU/memory bars
  • Mouse support
  • Tree view (F5)
  • Search (F3), filter (F4)
  • Kill with F9

mpstat — Per-CPU Statistics#

sudo apt install -y sysstat

# All CPUs
mpstat

# Per-CPU breakdown, every 2 seconds
mpstat -P ALL 2

# Key fields:
# %usr  — user-space CPU
# %sys  — kernel CPU
# %iowait — waiting for I/O (disk bottleneck indicator)
# %idle — unused CPU

High %iowait suggests a disk bottleneck, not a CPU bottleneck.
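
If sysstat is not installed, a rough system-wide iowait figure can be derived from the kernel counters in /proc/stat. This is a sketch under stated assumptions: iowait_pct is a hypothetical helper name, the field positions follow the "cpu" line layout documented in proc(5), and the one-second sampling interval is arbitrary.

```shell
#!/bin/sh
# Sketch: percentage of CPU time spent in iowait between two samples.
# iowait_pct is a hypothetical helper, not part of sysstat.
iowait_pct() {
    # $1, $2 = two aggregate "cpu ..." lines from /proc/stat, taken at different times
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2-9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6          # field 6 is iowait
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
            else       print "0.0"
        }'
}

if [ -r /proc/stat ]; then
    first=$(head -n 1 /proc/stat)
    sleep 1
    second=$(head -n 1 /proc/stat)
    echo "iowait over the last second: $(iowait_pct "$first" "$second")%"
fi
```

The counters are cumulative since boot, so only the difference between two samples is meaningful.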

Memory Monitoring#

# Overview
free -h

# Example output:
#               total   used   free   shared  buff/cache  available
# Mem:          7.7G    2.1G   3.2G   180M    2.4G        5.2G
# Swap:         2.0G    0B     2.0G

Key distinction:

  • used — RAM actively used by processes
  • buff/cache — RAM used for disk caches (automatically freed when needed)
  • available — RAM that can be used by new processes (free + reclaimable cache)

available is the number that matters, not free. Linux aggressively caches disk data in RAM, which is good — it makes free look low but available stays high.
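
To track the available figure in a script rather than by eye, the same value can be read from /proc/meminfo. A minimal sketch, assuming a kernel that exposes MemAvailable (present on current distributions); mem_available_pct is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: available memory as a percentage of total memory.
# mem_available_pct is a hypothetical helper, not a standard command.
mem_available_pct() {
    # $1 = file in /proc/meminfo format (defaults to /proc/meminfo)
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.0f\n", 100 * avail / total }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "Available memory: $(mem_available_pct)% of total"
fi
```

The optional file argument exists only so the function can be tried against a saved copy of /proc/meminfo.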

vmstat — Virtual Memory Statistics#

# One-shot
vmstat

# Every 2 seconds, 5 samples (the first line shows averages since boot)
vmstat 2 5

# Key columns:
# r   — runnable processes (running or waiting for CPU)
# b   — processes in uninterruptible sleep (usually blocked on I/O)
# si  — memory swapped in from disk (KiB/s)
# so  — memory swapped out to disk (KiB/s)    ← if consistently > 0, you need more RAM
# us  — user CPU %
# sy  — system CPU %
# wa  — I/O wait %
# id  — idle %

If si/so (swap in/out) are consistently non-zero, the system is swapping — a sign of insufficient RAM.
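
Checking for "consistently non-zero" can be automated by counting the samples that show swap traffic. The sketch below assumes the default vmstat column layout with a header row naming the si and so columns; swap_samples is a hypothetical helper name. It skips the first data row because that row reports averages since boot.

```shell
#!/bin/sh
# Sketch: count vmstat samples with swap activity.
# swap_samples is a hypothetical helper; it reads vmstat output on stdin
# and prints "<samples with swap traffic>/<samples examined>".
swap_samples() {
    awk '
        $1 == "r" && $2 == "b" {        # header row: locate the si and so columns
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        si && $1 ~ /^[0-9]+$/ {         # data row
            rows++
            if (rows == 1) next         # first row is the average since boot
            n++
            if ($si + $so > 0) hits++
        }
        END { printf "%d/%d\n", hits, n }'
}

# Only runs where vmstat is installed.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3 | swap_samples
fi
```

A result such as 0/2 means no swap traffic in the sampled interval; a result where most samples show traffic matches the "consistently non-zero" condition described above.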

Disk I/O Monitoring#

iostat#

# Requires sysstat
iostat

# Detailed, every 2 seconds
iostat -xz 2

# Key fields:
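# %util  — share of time the device was busy; near 100% suggests saturation
#          (less reliable for SSDs and RAID, which serve requests in parallel)
# await  — average time per I/O request in ms, including time spent queued
#          (shown as r_await / w_await in newer sysstat versions)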
# r/s, w/s — reads/writes per second
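
To pick out busy devices from a longer report, the %util column can be filtered with awk. This is a sketch, assuming the extended (-x) report format with a header row that starts with "Device" and contains a %util column; busy_disks and the 90% threshold are hypothetical choices for illustration.

```shell
#!/bin/sh
# Sketch: list devices whose %util is at or above a threshold.
# busy_disks is a hypothetical helper; it reads `iostat -x` output on stdin.
busy_disks() {
    # $1 = threshold in percent
    awk -v limit="$1" '
        $1 == "Device" || $1 == "Device:" {   # header row: locate the %util column
            for (i = 1; i <= NF; i++) if ($i == "%util") u = i
            next
        }
        # Device rows have at least as many fields as the header.
        u && NF >= u && $u + 0 >= limit { print $1, $u }'
}

# Only runs where sysstat is installed. The first report covers the time since boot.
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 2 | busy_disks 90
fi
```

The parsing assumes a locale that uses a decimal point; prefixing the command with LC_ALL=C avoids surprises on systems configured otherwise.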

iotop — I/O by Process#

sudo apt install -y iotop

# Show processes doing I/O
sudo iotop

# Only show processes actively doing I/O
sudo iotop -o

# Batch mode (for logging)
sudo iotop -b -n 5

Network Monitoring#

# Active connections
ss -tuln          # listening ports
ss -tunp          # active connections with process names

# Live bandwidth per connection on an interface (install: sudo apt install -y iftop)
sudo iftop

# Per-interface packet, byte, and error counters
cat /proc/net/dev

# Simple traffic check
ip -s link show

Identifying Bottlenecks#

A systematic approach:

System slow?
  │
  ├─ High load average?
  │    ├─ High %CPU → CPU bottleneck → find the process (top, sort by CPU)
  │    └─ High %iowait → Disk bottleneck → check iostat, iotop
  │
  ├─ Low available memory?
  │    ├─ Swapping (si/so > 0) → Memory bottleneck → find memory hog (top, sort by MEM)
  │    └─ OOM killer active → check dmesg for "Out of memory"
  │
  ├─ Disk full?
  │    └─ df -h → find large files (du, ncdu)
  │
  └─ Network issue?
       └─ ss, ping, traceroute, iftop
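
The tree above can be sketched as a small classifier that takes already-measured values and prints the matching branches. Everything specific in it is an assumption for illustration: triage is a hypothetical helper name, and the thresholds (load per CPU above 1.0, iowait at or above 20%, available memory below 10%, disk use at or above 90%) are not established limits. It also simplifies the first branch by treating high load with low iowait as a CPU bottleneck.

```shell
#!/bin/sh
# Sketch of the decision tree. All thresholds are illustrative assumptions.
# triage is a hypothetical helper name.
triage() {
    # $1 = load per CPU      $2 = %iowait      $3 = % of memory available
    # $4 = swap traffic (si + so)              $5 = highest filesystem use in %
    awk -v load="$1" -v wa="$2" -v avail="$3" -v swap="$4" -v disk="$5" 'BEGIN {
        if (load > 1.0) {
            found = 1
            if (wa >= 20) print "disk I/O bottleneck: check iostat and iotop"
            else          print "CPU bottleneck: check top, sorted by CPU"
        }
        if (avail < 10) {
            found = 1
            if (swap > 0) print "memory bottleneck: find the memory hog in top"
            else          print "low memory: check dmesg for OOM kills"
        }
        if (disk >= 90) {
            found = 1
            print "disk nearly full: check df -h, then du or ncdu"
        }
        if (!found) print "no local bottleneck found: check the network (ss, ping, traceroute, iftop)"
    }'
}

# Example with made-up measurements: high load and high iowait.
triage 2.5 35 60 0 40
```

Because the branches are independent, more than one line can be printed when several conditions hold at once.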

Quick Troubleshooting Commands#

# Overall health at a glance
uptime                    # load average
free -h                   # memory
df -h                     # disk space

# What's using CPU?
top -bn1 | head -20       # snapshot
ps aux --sort=-%cpu | head -10

# What's using memory?
ps aux --sort=-%mem | head -10

# What's using disk I/O?
sudo iotop -o -bn1 2>/dev/null

# What's using disk space?
du -sh /* 2>/dev/null | sort -rh | head -10

# Recent errors
sudo dmesg -T --level=err | tail -10    # sudo is needed where kernel.dmesg_restrict is enabled
journalctl -p err -b --no-pager | tail -20

# Failed services
systemctl --failed

# Disk space by directory (interactive)
sudo apt install -y ncdu
sudo ncdu /
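
The disk-space check can also be reduced to a one-line filter that reports only filesystems above a threshold. A minimal sketch, assuming the POSIX output format of df -P and mount points without spaces; full_filesystems and the 90% threshold are hypothetical choices.

```shell
#!/bin/sh
# Sketch: print mount points whose usage is at or above a threshold.
# full_filesystems is a hypothetical helper; it reads `df -P` output on stdin.
full_filesystems() {
    # $1 = threshold in percent
    awk -v limit="$1" 'NR > 1 {
        use = $5                 # Capacity column, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, $5
    }'
}

df -P | full_filesystems 90
```

No output means no filesystem has reached the threshold.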

Stress Testing (Optional)#

# Install stress tool
sudo apt install -y stress

# Stress CPU (4 workers for 30 seconds)
stress --cpu 4 --timeout 30s

# Stress memory (2 workers, 512MB each)
stress --vm 2 --vm-bytes 512M --timeout 30s

# Watch the impact in another terminal
htop

Lab#

Exercise 1: System Overview#

uptime
nproc
free -h
df -h

Exercise 2: CPU Analysis#

# Snapshot of top processes by CPU
ps aux --sort=-%cpu | head -10

# Interactive monitoring
top
# Press 1 to see per-CPU stats
# Press P to sort by CPU
# Press q to quit

Exercise 3: Memory Analysis#

free -h
vmstat 1 5
# Check si/so columns for swap activity

Exercise 4: Disk Analysis#

df -h
du -sh /var/* 2>/dev/null | sort -rh | head -10

# If sysstat is installed:
iostat -x 1 3 2>/dev/null

Exercise 5: Find Problems#

# Check for errors
journalctl -p err -b --no-pager | tail -10
dmesg --level=err 2>/dev/null | tail -10
systemctl --failed

# Check for OOM kills
sudo dmesg | grep -i "out of memory" | tail -5
journalctl -b | grep -i "oom" | tail -5

Review#

1. What does load average represent?

The average number of processes that are running, waiting to run, or (on Linux) blocked in uninterruptible sleep, measured over 1, 5, and 15 minutes. Compare it to the number of CPU cores: load equal to the core count means roughly full utilization; higher means processes are queuing.

2. Why is "available" memory more important than "free" memory?

Linux uses free RAM as disk cache (buff/cache), which makes “free” look low. This cache is automatically released when applications need memory. “Available” shows how much memory can actually be used by new processes, including reclaimable cache.

3. How do you tell if a system is swapping?

Run vmstat and check the si (swap in) and so (swap out) columns. If they are consistently non-zero, the system is swapping — meaning RAM is insufficient for the workload.

4. What does high %iowait in top or mpstat indicate?

The CPU is idle while I/O requests are outstanding, which points to a disk bottleneck rather than a CPU bottleneck. Use iostat to identify the disk and iotop to identify the process responsible.

5. How do you quickly find what is using the most CPU or memory?

Run ps aux --sort=-%cpu | head for CPU and ps aux --sort=-%mem | head for memory, or use top/htop interactively and sort by the relevant column.

