Sorting, Cutting, and Counting

Concepts

The Unix Philosophy in Action

The tools in this lesson are small and specialized: each does one thing well. Their power comes from combining them with pipes. This lesson covers the remaining essential text-processing commands.

sort: Sort Lines

sort file.txt                    # alphabetical sort
sort -r file.txt                 # reverse order
sort -n file.txt                 # numeric sort (1, 2, 10 not 1, 10, 2)
sort -h file.txt                 # human-readable numbers (1K, 2M, 3G)
sort -k2 file.txt                # sort by field 2 through end of line
sort -k2,2n file.txt             # sort by 2nd field numerically
sort -t: -k3 -n /etc/passwd      # sort by UID (field 3, delimiter :)
sort -u file.txt                 # sort and remove duplicates
sort -f file.txt                 # case-insensitive sort

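The -k flag specifies which field to sort by. A bare -k2 uses everything from field 2 to the end of the line as the key; write -k2,2 to restrict the key to field 2 alone. Fields are separated by whitespace by default (change with -t).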

# Sort by the 3rd column numerically, in reverse
sort -k3 -n -r data.txt

# Sort by multiple keys: first by field 2, then by field 3
sort -k2,2 -k3,3n data.txt
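
As a rough illustration of why the key range matters, the sketch below sorts a tiny made-up dataset twice. The names and salaries are hypothetical, and LC_ALL=C is set only to make the text comparison predictable across systems.

```shell
# Hypothetical sample data: name, department, salary.
data=$(printf '%s\n' 'Carol Engineering 85000' 'Alice Engineering 9000' 'Bob Admin 70000')

# -k2 alone: the key runs from field 2 to end of line and is compared as text,
# so "Engineering 85000" sorts before "Engineering 9000".
open_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k2)

# -k2,2 -k3,3n: department compared as text, then salary compared as a number.
bounded_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2 -k3,3n)

printf 'open-ended -k2:\n%s\n\nbounded keys:\n%s\n' "$open_key" "$bounded_key"
```

With the open-ended key, Carol (85000) is listed before Alice (9000); with bounded keys and a numeric salary, Alice comes first.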

uniq: Report or Filter Unique/Duplicate Lines

Important: uniq only detects adjacent duplicates. You almost always need to sort first.

sort file.txt | uniq             # remove adjacent duplicates
sort file.txt | uniq -c          # count occurrences (prepends count)
sort file.txt | uniq -d          # show only duplicated lines (one copy of each)
sort file.txt | uniq -u          # show only unique lines (appearing once)
sort file.txt | uniq -ci         # count, case insensitive

A very common pattern — frequency count:

# Count occurrences and sort by frequency (most common first)
sort file.txt | uniq -c | sort -rn
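
To see the effect of the adjacency rule, here is a minimal sketch on made-up input (the page names are hypothetical). It compares uniq on unsorted and sorted data, then runs the frequency-count pattern.

```shell
# Hypothetical input: five page hits in arrival order.
pages=$(printf '%s\n' /index /about /index /index /about)

# Without sorting, only the two adjacent /index lines collapse: 4 lines remain.
unsorted_count=$(( $(printf '%s\n' "$pages" | uniq | wc -l) ))

# After sorting, all duplicates are adjacent: 2 distinct lines remain.
sorted_count=$(( $(printf '%s\n' "$pages" | sort | uniq | wc -l) ))

# Frequency count, most common first; awk strips the count to show the winner.
top_page=$(printf '%s\n' "$pages" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "uniq alone: $unsorted_count lines; sort | uniq: $sorted_count lines; top: $top_page"
```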

cut: Extract Columns/Fields

# By field (delimiter-based)
cut -d: -f1 /etc/passwd          # field 1, delimiter :
cut -d: -f1,7 /etc/passwd        # fields 1 and 7
cut -d: -f1-3 /etc/passwd        # fields 1 through 3
cut -d, -f2 data.csv             # field 2, delimiter comma

# By character position
cut -c1-10 file.txt              # characters 1 through 10
cut -c5- file.txt                # character 5 to end of line
cut -c-20 file.txt               # first 20 characters

cut is simpler than awk for straightforward column extraction, but it cannot handle variable whitespace: every delimiter character starts a new field, so a run of spaces produces empty fields. For whitespace-aligned data, use awk.
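
The difference is easy to check on a single line with repeated spaces. This is a small sketch on a made-up line; squeezing the spaces with tr first is one possible workaround, not the only one.

```shell
# Hypothetical line with three spaces between each field.
line='alice   30   NYC'

# cut: field 2 is the empty string between the first and second space.
with_cut=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk: runs of whitespace count as one separator, so field 2 is "30".
with_awk=$(printf '%s\n' "$line" | awk '{print $2}')

# Workaround: squeeze repeated spaces with tr, then cut behaves as expected.
squeezed=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f2)

printf 'cut: [%s]  awk: [%s]  tr+cut: [%s]\n' "$with_cut" "$with_awk" "$squeezed"
```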

wc: Word Count

wc file.txt                     # lines, words, bytes
wc -l file.txt                  # lines only
wc -w file.txt                  # words only
wc -c file.txt                  # bytes only
wc -m file.txt                  # characters only (multibyte aware)
wc -l *.txt                     # line count for each file + total

# Count files in a directory
ls /etc | wc -l

# Count running processes (the result includes the ps header line)
ps aux | wc -l
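
When the count feeds a script, it usually helps to get the number without the file name. The sketch below, using a temporary file created only for the demonstration, shows one common way: read the file through stdin.

```shell
# Hypothetical three-line file created just for this demonstration.
tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"

with_name=$(wc -l "$tmp")          # count followed by the file name
count=$(( $(wc -l < "$tmp") ))     # count only; $(( )) drops any padding

echo "$with_name"
echo "$count"
rm -f "$tmp"
```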

tr: Translate or Delete Characters

tr works character by character. It reads from stdin only (does not accept filenames).

# Replace characters
echo "hello" | tr 'a-z' 'A-Z'          # HELLO (lowercase to uppercase)
echo "HELLO" | tr 'A-Z' 'a-z'          # hello (uppercase to lowercase)

# Replace specific characters
echo "hello world" | tr ' ' '_'         # hello_world (space to underscore)
echo "2024-10-15" | tr '-' '/'          # 2024/10/15

# Delete characters
echo "Hello 123 World" | tr -d '0-9'    # Hello  World (delete digits)
echo "hello   world" | tr -d ' '        # helloworld (delete spaces)

# Squeeze repeated characters
echo "heeellooo" | tr -s 'elo'          # helo (squeeze repeated e, l, o)
echo "hello    world" | tr -s ' '       # hello world (squeeze spaces)
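
Because tr reads only stdin, a file has to be redirected or piped into it. A minimal sketch, assuming a throwaway temporary file:

```shell
# Hypothetical one-line file created just for this demonstration.
tmp=$(mktemp)
printf 'hello   world\n' > "$tmp"

# Redirect the file into tr; passing the file name as an argument would fail.
upper=$(tr 'a-z' 'A-Z' < "$tmp")

# Translate and squeeze in one pass: spaces become underscores,
# and the run of underscores collapses to one.
snake=$(tr -s ' ' '_' < "$tmp")

echo "$upper"
echo "$snake"
rm -f "$tmp"
```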

column: Format Output into Columns

# Make output into neat columns
echo -e "Name Age City\nAlice 30 NYC\nBob 25 LA" | column -t
# Name   Age  City
# Alice  30   NYC
# Bob    25   LA

# With a specific delimiter
column -t -s: /etc/passwd

paste: Merge Lines Side by Side

# Merge two files side by side
paste file1.txt file2.txt

# Use a custom delimiter
paste -d, file1.txt file2.txt

# Merge all lines into one (serial)
paste -s file.txt

# Merge every 3 lines into one
paste - - - < file.txt
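
The three modes are easier to tell apart with concrete output. The sketch below uses two small made-up files in a temporary directory; the file names and contents are hypothetical.

```shell
# Hypothetical input files created in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' alice bob carol > "$dir/names.txt"
printf '%s\n' 30 25 28 > "$dir/ages.txt"

# Side by side with a comma: alice,30 / bob,25 / carol,28
side_by_side=$(paste -d, "$dir/names.txt" "$dir/ages.txt")

# Serial: all lines of one file joined into a single line.
joined=$(paste -s -d, "$dir/names.txt")

# Each "-" reads the next line of stdin, so two dashes pair up lines (tab-separated).
pairs=$(printf '%s\n' a b c d | paste - -)

printf '%s\n---\n%s\n---\n%s\n' "$side_by_side" "$joined" "$pairs"
rm -rf "$dir"
```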

head and tail: First and Last Lines

head -n 5 file.txt               # first 5 lines
head -c 100 file.txt             # first 100 bytes
tail -n 5 file.txt               # last 5 lines
tail -f /var/log/syslog          # follow (live output as new lines are added)
tail -f -n 0 /var/log/syslog     # follow, starting from now (no history)

tail -f is essential for monitoring logs in real time. Press Ctrl + C to stop.
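
head and tail also combine to pull out a range of lines. The following sketch is an addition to the commands listed above: it relies on tail -n +N, which starts output at line N, and the input is made up.

```shell
# Hypothetical five-line input: a header followed by four rows.
lines=$(printf '%s\n' header row1 row2 row3 row4)

# Lines 2-3: keep the first three lines, then the last two of those.
middle=$(printf '%s\n' "$lines" | head -n 3 | tail -n 2)

# Skip the header: tail -n +2 prints from line 2 to the end.
no_header=$(printf '%s\n' "$lines" | tail -n +2)

printf '%s\n---\n%s\n' "$middle" "$no_header"
```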

rev: Reverse Lines

echo "hello" | rev               # olleh
echo "/path/to/file" | rev | cut -d/ -f1 | rev    # file (extract filename)

Building Pipelines

The real power is combining these tools:

# Top 10 most common words in a file
tr -s ' ' '\n' < file.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -n 10

# Top 10 largest installed packages (Debian/Ubuntu)
dpkg-query -W --showformat='${Installed-Size}\t${Package}\n' | sort -rn | head -10

# Most common shells on the system
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn

# Disk usage by directory, sorted
du -sh /var/* 2>/dev/null | sort -rh | head -10

# Count unique IPs in an access log
awk '{print $1}' access.log | sort -u | wc -l
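
To make the word-count pipeline concrete, here is a sketch that runs a variant of it on one made-up sentence. Deleting punctuation first is an extra step added here so that "dog." and "dog" count as the same word.

```shell
# Hypothetical input text.
text='The cat saw the dog. The dog saw the cat, twice.'

top_word=$(printf '%s\n' "$text" |
  tr -d '[:punct:]' |      # drop punctuation
  tr 'A-Z' 'a-z' |         # fold to lowercase
  tr -s ' ' '\n' |         # one word per line
  sort | uniq -c |         # count each distinct word
  sort -rn | head -n 1 |   # keep the most frequent
  awk '{print $1, $2}')    # normalize spacing: "count word"

echo "$top_word"
```

Each stage can be checked on its own by cutting the pipeline short, which is a practical way to debug longer chains.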

Lab

Exercise 1: sort

mkdir -p ~/lab/texttools
cd ~/lab/texttools

cat > names.txt << 'EOF'
Charlie
alice
Bob
dave
Alice
bob
EOF

# Alphabetical sort (case ordering depends on the locale; with LC_ALL=C, uppercase sorts before lowercase)
sort names.txt

# Case-insensitive sort
sort -f names.txt

# Reverse sort
sort -r names.txt

# Sort and remove duplicates (case insensitive)
sort -fu names.txt

Exercise 2: sort with Fields

cd ~/lab/texttools

cat > employees.txt << 'EOF'
Alice Engineering 85000
Bob Marketing 72000
Carol Engineering 92000
Dave Marketing 68000
Eve Sales 78000
Frank Sales 81000
EOF

# Sort by department (field 2 only)
sort -k2,2 employees.txt

# Sort by salary (field 3) numerically
sort -k3 -n employees.txt

# Sort by salary descending
sort -k3 -nr employees.txt

# Sort by department, then by salary within each department
sort -k2,2 -k3,3n employees.txt

Exercise 3: uniq and Frequency Counting

cd ~/lab/texttools

cat > visits.txt << 'EOF'
/index.html
/about.html
/index.html
/contact.html
/index.html
/about.html
/products.html
/index.html
/contact.html
/index.html
EOF

# Count page visit frequency (sort first!)
sort visits.txt | uniq -c | sort -rn

# Show only pages visited more than once
sort visits.txt | uniq -d

# Show pages visited exactly once
sort visits.txt | uniq -u

Exercise 4: cut

cd ~/lab/texttools

# Extract usernames from /etc/passwd
cut -d: -f1 /etc/passwd | head -10

# Extract username and home directory
cut -d: -f1,6 /etc/passwd | head -10

# Extract the first 15 characters of each line
cut -c1-15 /etc/passwd | head -10

# Work with CSV data
cat > data.csv << 'EOF'
Name,Age,City,Score
Alice,30,New York,95
Bob,25,Los Angeles,87
Carol,28,Chicago,92
Dave,35,Houston,88
EOF

# Extract the Name column
cut -d, -f1 data.csv

# Extract Name and Score
cut -d, -f1,4 data.csv

Exercise 5: tr

cd ~/lab/texttools

# Uppercase conversion
echo "hello world" | tr 'a-z' 'A-Z'

# Replace spaces with newlines (one word per line)
echo "hello world foo bar" | tr ' ' '\n'

# Delete all digits
echo "abc123def456" | tr -d '0-9'

# Squeeze multiple spaces
echo "too    many     spaces" | tr -s ' '

# Replace all punctuation with spaces
echo "hello, world! how are you?" | tr '[:punct:]' ' '

Exercise 6: Building a Pipeline

cd ~/lab/texttools

# Find the top 3 highest-paid employees
sort -k3 -nr employees.txt | head -3

# Average salary (using awk for the math)
awk '{sum += $3} END {printf "Average salary: $%.2f\n", sum/NR}' employees.txt

# Department headcount
cut -d' ' -f2 employees.txt | sort | uniq -c | sort -rn

# Most common login shell on the system
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn | head -5

# Clean up
cd ~
rm -rf ~/lab/texttools

Review

1. Why must you sort before using uniq?

uniq only detects adjacent duplicate lines. Without sorting, duplicates scattered throughout the file will not be collapsed. sort | uniq is the standard pattern.

2. How do you sort numerically by the third column?

sort -k3 -n file.txt. Without -n, sort treats numbers as strings (so “10” comes before “2” alphabetically).

3. What is the standard pattern for counting frequencies?

sort | uniq -c | sort -rn — sort the data, count adjacent duplicates, then sort by count (numeric, descending) to see the most common items first.

4. What is the difference between `cut` and `awk` for field extraction?

cut is simpler and faster but treats each delimiter character individually (multiple spaces = multiple empty fields). awk treats runs of whitespace as a single delimiter and supports conditions and calculations. Use cut for simple delimited data (CSV, /etc/passwd); use awk for whitespace-separated or complex processing.

5. How does `tr` differ from `sed`?

tr translates character-by-character (replace every ‘a’ with ‘b’). sed works with strings and patterns (replace the word “apple” with “orange”). tr reads only from stdin; sed can read files. Use tr for simple character transformations; use sed for pattern-based substitutions.
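
A short sketch of the contrast, on a made-up string: tr maps individual characters wherever they occur, while sed replaces occurrences of the whole string.

```shell
# Hypothetical input string.
word='banana bandana'

# tr: every 'a' becomes 'A' and every 'n' becomes 'N', independently.
by_char=$(printf '%s\n' "$word" | tr 'an' 'AN')

# sed: the two-character string "an" is replaced, first match on the line only.
by_string=$(printf '%s\n' "$word" | sed 's/an/AN/')

# sed with g: every non-overlapping "an" is replaced; a lone 'a' is untouched.
by_string_all=$(printf '%s\n' "$word" | sed 's/an/AN/g')

printf '%s\n%s\n%s\n' "$by_char" "$by_string" "$by_string_all"
```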

6. What does `tail -f` do?

It follows a file, displaying new lines as they are appended. Essential for monitoring logs in real time. Press Ctrl + C to stop.

7. How do you count the number of lines in a file?

wc -l file.txt. When used in a pipeline: command | wc -l.

