Sorting, Cutting, and Counting
Concepts#
The Unix Philosophy in Action#
These small, specialized tools each do one thing well. Their power comes from combining them with pipes. This lesson covers the remaining essential text-processing commands.
sort — Sort Lines#
sort file.txt # alphabetical sort
sort -r file.txt # reverse order
sort -n file.txt # numeric sort (1, 2, 10 not 1, 10, 2)
sort -h file.txt # human-readable numbers (1K, 2M, 3G)
sort -k2 file.txt # sort by key starting at field 2 (runs to end of line)
sort -k2,2n file.txt # sort by 2nd field only, numerically
sort -t: -k3,3n /etc/passwd # sort by UID (field 3 only, delimiter :)
sort -u file.txt # sort and remove duplicates
sort -f file.txt # case-insensitive sort
The -k flag specifies the sort key. -kN starts the key at field N and extends it to the end of the line; -kN,N restricts it to field N alone. Fields are separated by whitespace by default (change with -t).
# Sort by the 3rd column numerically, in reverse
sort -k3 -n -r data.txt
# Sort by multiple keys: first by field 2, then by field 3
sort -k2,2 -k3,3n data.txt
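A minimal sketch of a two-key sort, using made-up inline rows rather than a real data.txt (the names and numbers are assumptions for illustration). Each key is limited to a single field with the N,N form, and LC_ALL=C is set only to make the text comparison predictable across systems.

```shell
# Hypothetical sample rows: name, department, salary
data=$(printf '%s\n' 'Alice Eng 85000' 'Bob Ops 72000' 'Carol Eng 92000' 'Dave Ops 68000')

# Key 1: field 2 only (text). Key 2: field 3 only (numeric, reversed).
by_dept_then_salary=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2 -k3,3nr)
printf '%s\n' "$by_dept_then_salary"
```

The rows come out grouped by department, with the highest salary first inside each group.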
uniq — Report or Filter Unique/Duplicate Lines#
Important: uniq only detects adjacent duplicates. You almost always need to sort first.
sort file.txt | uniq # remove adjacent duplicates
sort file.txt | uniq -c # count occurrences (prepends count)
sort file.txt | uniq -d # show only duplicates
sort file.txt | uniq -u # show only unique lines (appearing once)
sort -f file.txt | uniq -ci # count, case insensitive (sort -f keeps case variants adjacent)
A very common pattern — frequency count:
# Count occurrences and sort by frequency (most common first)
sort file.txt | uniq -c | sort -rn
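The frequency-count pattern can be tried on a few inline values. This is a sketch on hypothetical data (the fruit names are placeholders), and it also shows what happens when the initial sort is skipped.

```shell
# Hypothetical input: one item per line, duplicates scattered
items=$(printf '%s\n' apple banana apple cherry banana apple)

# Without sorting first, no duplicates are adjacent, so uniq removes nothing
unsorted_count=$(printf '%s\n' "$items" | uniq | wc -l)
echo "lines left without sort: $unsorted_count"

# sort groups duplicates, uniq -c counts each group, sort -rn ranks by count
ranked=$(printf '%s\n' "$items" | sort | uniq -c | sort -rn)
printf '%s\n' "$ranked"

# Drop the count column to get just the most frequent item
top_item=$(printf '%s\n' "$ranked" | head -n 1 | awk '{print $2}')
echo "most common: $top_item"
```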
cut — Extract Columns/Fields#
# By field (delimiter-based)
cut -d: -f1 /etc/passwd # field 1, delimiter :
cut -d: -f1,7 /etc/passwd # fields 1 and 7
cut -d: -f1-3 /etc/passwd # fields 1 through 3
cut -d, -f2 data.csv # field 2, delimiter comma
# By character position
cut -c1-10 file.txt # characters 1 through 10
cut -c5- file.txt # character 5 to end of line
cut -c-20 file.txt # first 20 characters
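cut is simpler than awk for straightforward column extraction, but it treats every delimiter character as a separate field boundary, so a run of spaces produces empty fields instead of acting as one delimiter. Its default delimiter is a tab. For whitespace-aligned input, use awk.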
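The difference shows up as soon as a line contains two spaces in a row. The sketch below uses a made-up line (an assumption for illustration) and compares cut, awk, and cut after squeezing spaces with tr -s.

```shell
# Hypothetical line with two spaces between the first two fields
line='alice  30 NYC'

# cut treats every space as a delimiter, so field 2 is the empty string
cut_field=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk splits on runs of whitespace, so field 2 is "30"
awk_field=$(printf '%s\n' "$line" | awk '{print $2}')

# Squeezing repeated spaces with tr first lets cut find the same field
squeezed_field=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f2)

printf 'cut: [%s]  awk: [%s]  tr+cut: [%s]\n' "$cut_field" "$awk_field" "$squeezed_field"
```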
wc — Word Count#
wc file.txt # lines, words, bytes
wc -l file.txt # lines only
wc -w file.txt # words only
wc -c file.txt # bytes only
wc -m file.txt # characters only (multibyte aware)
wc -l *.txt # line count for each file + total
# Count files in a directory
ls /etc | wc -l
# Count running processes
ps aux | wc -l # the count includes the ps header line
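When the count is needed inside a script, it helps to know that wc prints the filename after the number unless it reads from stdin. A small sketch, using a throwaway temp file with placeholder contents:

```shell
# Create a small throwaway file (hypothetical contents)
tmpfile=$(mktemp)
printf '%s\n' one two three > "$tmpfile"

# With a filename argument, wc prints the count followed by the name
wc -l "$tmpfile"

# Reading from stdin leaves only the number, which is easier to reuse
line_count=$(wc -l < "$tmpfile")
echo "lines: $line_count"

rm -f "$tmpfile"
```

Some wc implementations pad the number with leading spaces, so compare it numerically (for example with -eq) rather than as a string.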
tr — Translate or Delete Characters#
tr works character by character. It reads from stdin only (does not accept filenames).
# Replace characters
echo "hello" | tr 'a-z' 'A-Z' # HELLO (lowercase to uppercase)
echo "HELLO" | tr 'A-Z' 'a-z' # hello (uppercase to lowercase)
# Replace specific characters
echo "hello world" | tr ' ' '_' # hello_world (space to underscore)
echo "2024-10-15" | tr '-' '/' # 2024/10/15
# Delete characters
echo "Hello 123 World" | tr -d '0-9' # Hello  World (digits deleted; both spaces remain)
echo "hello world" | tr -d ' ' # helloworld (delete spaces)
# Squeeze repeated characters
echo "heeellooo" | tr -s 'elo' # helo (squeeze repeated e, l, o)
echo "hello     world" | tr -s ' ' # hello world (squeeze spaces)
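Because tr takes no filename argument, a file has to reach it through stdin. The sketch below assumes a small temp file with placeholder text and uses input redirection, plus POSIX character classes as an alternative to ranges like a-z.

```shell
# Throwaway file with hypothetical contents
tmpfile=$(mktemp)
printf '%s\n' 'Hello,   World' > "$tmpfile"

# Feed the file to tr with < since tr cannot open it by name
upper=$(tr '[:lower:]' '[:upper:]' < "$tmpfile")

# Chain two tr calls: squeeze runs of spaces, then delete punctuation
clean=$(tr -s ' ' < "$tmpfile" | tr -d '[:punct:]')

echo "$upper"
echo "$clean"
rm -f "$tmpfile"
```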
column — Format Output into Columns#
# Make output into neat columns
echo -e "Name Age City\nAlice 30 NYC\nBob 25 LA" | column -t
# Name   Age  City
# Alice  30   NYC
# Bob    25   LA
# With a specific delimiter
column -t -s: /etc/passwd
paste — Merge Lines Side by Side#
# Merge two files side by side
paste file1.txt file2.txt
# Use a custom delimiter
paste -d, file1.txt file2.txt
# Merge all lines into one (serial)
paste -s file.txt
# Merge every 3 lines into one
paste - - - < file.txt
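Each - argument tells paste to read the next line from stdin, which is why three dashes fold three input lines into one row. A short sketch, using seq as a stand-in for a real file:

```shell
# seq 1 6 prints the numbers 1 through 6, one per line
# Three dashes consume three input lines per output row
rows=$(seq 1 6 | paste -d, - - -)
printf '%s\n' "$rows"

# -s joins every input line into a single row instead
joined=$(seq 1 6 | paste -s -d, -)
echo "$joined"
```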
head and tail — First and Last Lines#
head -n 5 file.txt # first 5 lines
head -c 100 file.txt # first 100 bytes
tail -n 5 file.txt # last 5 lines
tail -f /var/log/syslog # follow (live output as new lines are added)
tail -f -n 0 /var/log/syslog # follow, starting from now (no history)
tail -f is essential for monitoring logs in real time. Press Ctrl + C to stop.
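head and tail also combine to pull a range out of the middle of a file. The sketch below uses seq and a made-up CSV snippet as stand-ins for real files, and also shows tail -n +N, which starts output at line N.

```shell
# seq 1 10 stands in for a 10-line file
# Keep the first 7 lines, then the last 3 of those: lines 5 through 7
middle=$(seq 1 10 | head -n 7 | tail -n 3)
printf '%s\n' "$middle"

# tail -n +2 starts at line 2, which skips a header row
body=$(printf '%s\n' 'Name,Age' 'Alice,30' 'Bob,25' | tail -n +2)
printf '%s\n' "$body"
```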
rev — Reverse Lines#
echo "hello" | rev # olleh
echo "/path/to/file" | rev | cut -d/ -f1 | rev # file (extract filename)
Building Pipelines#
The real power is combining these tools:
# Top 10 most common words in a file
cat file.txt | tr -s ' ' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -10
# Top 10 largest installed packages
dpkg-query -W --showformat='${Installed-Size}\t${Package}\n' | sort -rn | head -10
# Most common shells on the system
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
# Disk usage by directory, sorted
du -sh /var/* 2>/dev/null | sort -rh | head -10
# Count unique IPs in an access log
awk '{print $1}' access.log | sort -u | wc -l
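A variation on the word-count pipeline, sketched on a made-up sentence instead of a file. It adds one stage that deletes punctuation so that "dog." and "dog" count as the same word; the sample text is an assumption for illustration.

```shell
# Hypothetical input text
text='The cat saw the dog. The dog saw the cat, and the cat ran.'

# lowercase -> strip punctuation -> one word per line -> count -> rank
top_words=$(printf '%s\n' "$text" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d '[:punct:]' \
  | tr -s ' ' '\n' \
  | sort | uniq -c | sort -rn | head -n 2)
printf '%s\n' "$top_words"
```

Each stage can be checked on its own by cutting the pipeline short after it, which is usually the quickest way to debug a long chain.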
Lab#
Exercise 1: sort#
mkdir -p ~/lab/texttools
cd ~/lab/texttools
cat > names.txt << 'EOF'
Charlie
alice
Bob
dave
Alice
bob
EOF
# Alphabetical sort (how uppercase and lowercase are ordered depends on the locale; LC_ALL=C puts uppercase first)
sort names.txt
# Case-insensitive sort
sort -f names.txt
# Reverse sort
sort -r names.txt
# Sort and remove duplicates (case insensitive)
sort -fu names.txt
Exercise 2: sort with Fields#
cd ~/lab/texttools
cat > employees.txt << 'EOF'
Alice Engineering 85000
Bob Marketing 72000
Carol Engineering 92000
Dave Marketing 68000
Eve Sales 78000
Frank Sales 81000
EOF
# Sort by department (field 2 only)
sort -k2,2 employees.txt
# Sort by salary (field 3) numerically
sort -k3 -n employees.txt
# Sort by salary descending
sort -k3 -nr employees.txt
# Sort by department, then by salary within each department
sort -k2,2 -k3,3n employees.txt
Exercise 3: uniq and Frequency Counting#
cd ~/lab/texttools
cat > visits.txt << 'EOF'
/index.html
/about.html
/index.html
/contact.html
/index.html
/about.html
/products.html
/index.html
/contact.html
/index.html
EOF
# Count page visit frequency (sort first!)
sort visits.txt | uniq -c | sort -rn
# Show only pages visited more than once
sort visits.txt | uniq -d
# Show pages visited exactly once
sort visits.txt | uniq -u
Exercise 4: cut#
cd ~/lab/texttools
# Extract usernames from /etc/passwd
cut -d: -f1 /etc/passwd | head -10
# Extract username and home directory
cut -d: -f1,6 /etc/passwd | head -10
# Extract the first 15 characters of each line
cut -c1-15 /etc/passwd | head -10
# Work with CSV data
cat > data.csv << 'EOF'
Name,Age,City,Score
Alice,30,New York,95
Bob,25,Los Angeles,87
Carol,28,Chicago,92
Dave,35,Houston,88
EOF
# Extract the Name column
cut -d, -f1 data.csv
# Extract Name and Score
cut -d, -f1,4 data.csv
Exercise 5: tr#
cd ~/lab/texttools
# Uppercase conversion
echo "hello world" | tr 'a-z' 'A-Z'
# Replace spaces with newlines (one word per line)
echo "hello world foo bar" | tr ' ' '\n'
# Delete all digits
echo "abc123def456" | tr -d '0-9'
# Squeeze multiple spaces
echo "too    many    spaces" | tr -s ' '
# Replace all punctuation with spaces
echo "hello, world! how are you?" | tr '[:punct:]' ' '
Exercise 6: Building a Pipeline#
cd ~/lab/texttools
# Find the top 3 highest-paid employees
sort -k3 -nr employees.txt | head -3
# Average salary (using awk for the math)
awk '{sum += $3} END {printf "Average salary: $%.2f\n", sum/NR}' employees.txt
# Department headcount
cut -d' ' -f2 employees.txt | sort | uniq -c | sort -rn
# Most common login shell on the system
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn | head -5
# Clean up
cd ~
rm -rf ~/lab/texttools
Review#
1. Why must you sort before using uniq?
uniq only detects adjacent duplicate lines. Without sorting, duplicates scattered throughout the file will not be collapsed. sort | uniq is the standard pattern.
2. How do you sort numerically by the third column?
sort -k3 -n file.txt. Without -n, sort treats numbers as strings (so “10” comes before “2” alphabetically).
3. What is the standard pattern for counting frequencies?
sort | uniq -c | sort -rn — sort the data, count adjacent duplicates, then sort by count (numeric, descending) to see the most common items first.
4. What is the difference between `cut` and `awk` for field extraction?
cut is simpler and faster but treats each delimiter character individually (multiple spaces = multiple empty fields). awk treats runs of whitespace as a single delimiter and supports conditions and calculations. Use cut for simple delimited data (CSV, /etc/passwd); use awk for whitespace-separated or complex processing.
5. How does `tr` differ from `sed`?
tr translates character-by-character (replace every ‘a’ with ‘b’). sed works with strings and patterns (replace the word “apple” with “orange”). tr reads only from stdin; sed can read files. Use tr for simple character transformations; use sed for pattern-based substitutions.
6. What does `tail -f` do?
It follows a file, displaying new lines as they are appended. Essential for monitoring logs in real time. Press Ctrl + C to stop.
7. How do you count the number of lines in a file?
wc -l file.txt. When used in a pipeline: command | wc -l.