
10 Essential Linux Command-Line Tools for Data Scientists

Introduction

Working with data on Linux often means moving quickly between interactive analysis (Python, Jupyter) and fast, repeatable command-line operations. This guide covers 10 essential Linux command-line tools every data scientist should know: grep, awk, sed, cut, sort, uniq, wc, head/tail, find, jq — plus practical coverage of ls and disk-usage tools (du / find) for managing space. We'll use a small sample dataset and realistic terminal outputs so you can copy, run, and learn.

Prerequisites

  • A Linux machine (Ubuntu/Debian, CentOS/RHEL, Fedora, etc.).
  • Basic shell familiarity (bash).
  • Installed utilities: awk, sed, cut, sort, uniq, wc, head, tail, find, ls (these are standard). Install jq if you plan to parse JSON (installation shown below).

Installation (jq)

jq is not always installed by default. Use your distro package manager:

sudo apt update && sudo apt install -y jq
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  jq
0 upgraded, 1 newly installed, 0 to remove and 12 not upgraded.
Need to get 250 kB of archives.
After this operation, 800 kB of additional disk space will be used.
Setting up jq (1.6-1) ...
Processing triggers for man-db (2.9.1-1)...

Explanation: sudo runs the command as root; the package manager installs jq so you can parse JSON on the command line.

Project setup: create the sample dataset

Create a small e-commerce CSV we'll use throughout the article. The heredoc creates sales_data.csv in the current directory.

cat > sales_data.csv << 'EOF'
order_id,date,customer_name,product,category,quantity,price,region,status
1001,2024-01-15,John Smith,Laptop,Electronics,1,899.99,North,completed
1002,2024-01-16,Sarah Johnson,Mouse,Electronics,2,24.99,South,completed
1003,2024-01-16,Mike Brown,Desk Chair,Furniture,1,199.99,East,completed
1004,2024-01-17,John Smith,Keyboard,Electronics,1,79.99,North,completed
1005,2024-01-18,Emily Davis,Notebook,Stationery,5,12.99,West,completed
1006,2024-01-18,Sarah Johnson,Laptop,Electronics,1,899.99,South,pending
1007,2024-01-19,Chris Wilson,Monitor,Electronics,2,299.99,North,completed
1008,2024-01-20,John Smith,USB Cable,Electronics,3,9.99,North,completed
1009,2024-01-20,Anna Martinez,Desk,Furniture,1,399.99,East,completed
1010,2024-01-21,Mike Brown,Laptop,Electronics,1,899.99,East,cancelled
1011,2024-01-22,Emily Davis,Pen Set,Stationery,10,5.99,West,completed
1012,2024-01-22,Sarah Johnson,Monitor,Electronics,1,299.99,South,completed
1013,2024-01-23,Chris Wilson,Desk Chair,Furniture,2,199.99,North,completed
1014,2024-01-24,Anna Martinez,Laptop,Electronics,1,899.99,East,completed
1015,2024-01-25,John Smith,Mouse Pad,Electronics,1,14.99,North,completed
1016,2024-01-26,Mike Brown,Bookshelf,Furniture,1,149.99,East,completed
1017,2024-01-27,Emily Davis,Highlighter,Stationery,8,3.99,West,completed
1018,2024-01-28,NULL,Laptop,Electronics,1,899.99,South,pending
1019,2024-01-29,Chris Wilson,Webcam,Electronics,1,89.99,North,completed
1020,2024-01-30,Sarah Johnson,Desk Lamp,Furniture,2,49.99,South,completed
EOF

Explanation: Creating a sample CSV lets us demonstrate realistic commands and outputs; the heredoc writes multiple lines to sales_data.csv. No command output is produced for the redirection itself.

Essential command-line tools (examples + outputs)

1) grep — quick pattern search

What it does: search lines that match a pattern. Why use it: fast filtering for rows or log lines.

grep "John Smith" sales_data.csv
1001,2024-01-15,John Smith,Laptop,Electronics,1,899.99,North,completed
1004,2024-01-17,John Smith,Keyboard,Electronics,1,79.99,North,completed
1008,2024-01-20,John Smith,USB Cable,Electronics,3,9.99,North,completed
1015,2024-01-25,John Smith,Mouse Pad,Electronics,1,14.99,North,completed

Explanation: This shows every row containing “John Smith”. Use -i for case-insensitive, -n to show line numbers, and -v to invert match (exclude).
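For instance, -v is handy for dropping unwanted rows. A minimal sketch with inline sample rows standing in for sales_data.csv:

```shell
# Illustrative rows (a stand-in for sales_data.csv); -v keeps lines that do NOT match
printf '%s\n' \
  '1010,Mike Brown,cancelled' \
  '1011,Emily Davis,completed' \
  '1012,Sarah Johnson,completed' |
  grep -v "cancelled"
# prints only the two completed rows
```

The same pattern applied to the real file (grep -v "cancelled" sales_data.csv) filters out cancelled orders before any downstream aggregation.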

grep -c "Laptop" sales_data.csv
5

Explanation: -c counts matching lines — here 5 Laptop rows.

2) awk — field-aware processing & quick aggregation

What it does: a lightweight programming language for records and fields. Why use it: extract columns, compute sums, averages without loading Python.

awk -F',' '{print $4, $7}' sales_data.csv | head -6
product price
Laptop 899.99
Mouse 24.99
Desk Chair 199.99
Keyboard 79.99
Notebook 12.99

Explanation: -F',' sets comma as the field separator. {print $4, $7} prints product and price (columns 4 and 7). The header row is printed first.

awk -F',' 'NR>1 {sum+=$7} END {printf "Total Revenue: $%.2f\n", sum}' sales_data.csv
Total Revenue: $6342.80

Explanation: NR>1 skips the header; sum accumulates the price column; END prints the total. Note that this sums unit prices only; true revenue would weight each row by quantity ($6*$7). Use awk for fast single-pass aggregations.
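awk also does group-by aggregation in one pass using associative arrays. A hedged sketch with inline sample data (the qty/price/region column layout mirrors sales_data.csv but the rows are illustrative):

```shell
# Group-by in awk: accumulate quantity*price per region (sample data inlined)
printf '%s\n' \
  'id,qty,price,region' \
  '1,2,10.00,North' \
  '2,1,5.50,South' \
  '3,3,2.00,North' |
  awk -F',' 'NR>1 {rev[$4] += $2 * $3}
             END  {for (r in rev) printf "%s %.2f\n", r, rev[r]}' |
  sort
# North 26.00
# South 5.50
```

The trailing sort makes the output deterministic, since awk's for-in iteration order over array keys is unspecified.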

3) sed — stream editor for substitutions and transformations

What it does: edit text streams with scripts. Why use it: fix missing values or perform regex find-and-replace on the fly.

sed 's/NULL/Unknown/g' sales_data.csv | grep "Unknown"
1018,2024-01-28,Unknown,Laptop,Electronics,1,899.99,South,pending

Explanation: The s/NULL/Unknown/g expression replaces occurrences of “NULL” with “Unknown”. Useful to normalize missing-value markers before downstream processing.

sed '1d' sales_data.csv | head -3
1001,2024-01-15,John Smith,Laptop,Electronics,1,899.99,North,completed
1002,2024-01-16,Sarah Johnson,Mouse,Electronics,2,24.99,South,completed
1003,2024-01-16,Mike Brown,Desk Chair,Furniture,1,199.99,East,completed

Explanation: 1d deletes the first (header) line — a common step when piping CSV rows into analysis tools.
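To apply such a fix directly to a file while keeping a backup, GNU sed's -i flag works as sketched below (the /tmp path and demo file are illustrative):

```shell
# In-place substitution with a .bak backup copy (GNU sed syntax)
printf 'a,NULL\nb,ok\n' > /tmp/demo_null.csv
sed -i.bak 's/NULL/Unknown/g' /tmp/demo_null.csv
cat /tmp/demo_null.csv        # edited file: NULL is now Unknown
cat /tmp/demo_null.csv.bak    # backup keeps the original NULL marker
```

Note that BSD/macOS sed requires the suffix as a separate argument (sed -i '' or sed -i .bak), so scripts meant to be portable should not rely on the GNU one-word form.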

4) cut — simple column extraction

What it does: extract delimited fields. Why use it: fast and concise when no calculations are needed.

cut -d',' -f3,4 sales_data.csv | head -6
customer_name,product
John Smith,Laptop
Sarah Johnson,Mouse
Mike Brown,Desk Chair
John Smith,Keyboard
Emily Davis,Notebook

Explanation: -d',' specifies the delimiter; -f3,4 extracts customer name and product fields. Use cut for very fast column pulls in pipes.
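One gotcha worth knowing: cut always emits fields in file order, so -f4,3 prints the same thing as -f3,4. A quick sketch:

```shell
# cut cannot reorder columns: both commands print "a,b"
printf '1,a,b\n' | cut -d',' -f2,3
printf '1,a,b\n' | cut -d',' -f3,2
# use awk when you need a different column order:
printf '1,a,b\n' | awk -F',' '{print $3 "," $2}'   # prints "b,a"
```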

5) sort — ordering for analysis and deduplication

What it does: sort lines; supports numeric, reverse, and multi-key sorts. Why use it: prepare data for uniq, find top values, or reorder datasets.

sort -t',' -k7 -rn sales_data.csv | head -6
1001,2024-01-15,John Smith,Laptop,Electronics,1,899.99,North,completed
1006,2024-01-18,Sarah Johnson,Laptop,Electronics,1,899.99,South,pending
1010,2024-01-21,Mike Brown,Laptop,Electronics,1,899.99,East,cancelled
1014,2024-01-24,Anna Martinez,Laptop,Electronics,1,899.99,East,completed
1018,2024-01-28,NULL,Laptop,Electronics,1,899.99,South,pending
1009,2024-01-20,Anna Martinez,Desk,Furniture,1,399.99,East,completed

Explanation: -t',' sets the field separator; -k7 sorts by the 7th field (price); -rn sorts numerically in reverse order for highest-first.

6) uniq — count distinct values (after sort)

What it does: collapse adjacent duplicate lines and optionally count them. Why use it: quick frequency counts (like value_counts in pandas).

cut -d',' -f8 sales_data.csv | tail -n +2 | sort | uniq -c
   5 East
   7 North
   5 South
   3 West

Explanation: We extract the region column, skip the header (tail -n +2), sort (uniq requires sorted input), then count occurrences with uniq -c.
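To rank those counts largest-first — the shell analogue of pandas value_counts — add a second, numeric sort. A sketch with inline values:

```shell
# Frequency count, then rank by count descending; head -1 picks the most common value
printf '%s\n' North South North East North |
  sort | uniq -c | sort -rn | head -1 |
  awk '{print $1, $2}'    # awk normalizes uniq's leading spaces; prints "3 North"
```

The awk step is optional cosmetics: uniq -c left-pads its counts, which can trip up naive string comparisons downstream.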

7) wc — fast counts: lines, words, bytes

What it does: count lines/words/bytes. Why use it: quick sanity checks (rows present, file size).

wc -l sales_data.csv
21 sales_data.csv

Explanation: Counts lines in the file: 1 header + 20 records = 21. Use wc -c for bytes or wc -w for words.

8) head / tail — preview large files efficiently

What it does: show the first or last N lines. Why use it: inspect file headers and tail records without opening the whole file.

head -6 sales_data.csv
order_id,date,customer_name,product,category,quantity,price,region,status
1001,2024-01-15,John Smith,Laptop,Electronics,1,899.99,North,completed
1002,2024-01-16,Sarah Johnson,Mouse,Electronics,2,24.99,South,completed
1003,2024-01-16,Mike Brown,Desk Chair,Furniture,1,199.99,East,completed
1004,2024-01-17,John Smith,Keyboard,Electronics,1,79.99,North,completed
1005,2024-01-18,Emily Davis,Notebook,Stationery,5,12.99,West,completed

Explanation: head -6 shows header + first 5 rows for quick schema checks.

tail -5 sales_data.csv
1016,2024-01-26,Mike Brown,Bookshelf,Furniture,1,149.99,East,completed
1017,2024-01-27,Emily Davis,Highlighter,Stationery,8,3.99,West,completed
1018,2024-01-28,NULL,Laptop,Electronics,1,899.99,South,pending
1019,2024-01-29,Chris Wilson,Webcam,Electronics,1,89.99,North,completed
1020,2024-01-30,Sarah Johnson,Desk Lamp,Furniture,2,49.99,South,completed

Explanation: Useful for checking recent appends or last records.

9) find — locate files and run commands

What it does: search file trees with name, type, timestamp, size filters. Why use it: find datasets or run operations (wc, xargs) across many files.

mkdir -p data_project/{raw,processed,reports}
cp sales_data.csv data_project/raw/
cp sales_data.csv data_project/processed/sales_cleaned.csv
echo "Summary report" > data_project/reports/summary.txt
find data_project -name "*.csv"
data_project/raw/sales_data.csv
data_project/processed/sales_cleaned.csv

Explanation: Demonstrates building a small project structure and using find to locate all CSVs. find can also execute commands on matches with -exec.

find data_project -name "*.csv" -exec wc -l {} \;
21 data_project/raw/sales_data.csv
21 data_project/processed/sales_cleaned.csv

Explanation: For each CSV found, -exec runs wc -l to show line counts — helpful in audits or processing pipelines.
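find's filters compose with its actions; for example, matching only files above a size threshold can be sketched like this (the /tmp/find_demo path and file names are illustrative):

```shell
# Create one tiny and one ~2 kB file, then match only files larger than 1 kB
mkdir -p /tmp/find_demo
printf 'x\n' > /tmp/find_demo/small.csv
head -c 2048 /dev/zero > /tmp/find_demo/big.csv
find /tmp/find_demo -name "*.csv" -size +1k    # prints only big.csv
```

Combine with -mtime -7 to restrict to files modified in the last week, or pipe matches into xargs for batch processing.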

10) jq — JSON querying for API data

What it does: parse and transform JSON. Why use it: most web APIs return JSON; jq makes extraction and conversion fast.

cat > sales_sample.json << 'EOF'
{
  "orders": [
    {
      "order_id": 1001,
      "customer": "John Smith",
      "product": "Laptop",
      "price": 899.99,
      "region": "North",
      "status": "completed"
    },
    {
      "order_id": 1002,
      "customer": "Sarah Johnson",
      "product": "Mouse",
      "price": 24.99,
      "region": "South",
      "status": "completed"
    },
    {
      "order_id": 1006,
      "customer": "Sarah Johnson",
      "product": "Laptop",
      "price": 899.99,
      "region": "South",
      "status": "pending"
    }
  ]
}
EOF

Explanation: Create a small JSON sample to demonstrate jq queries (no output from heredoc itself).

jq '.' sales_sample.json
{
  "orders": [
    {
      "order_id": 1001,
      "customer": "John Smith",
      "product": "Laptop",
      "price": 899.99,
      "region": "North",
      "status": "completed"
    },
    {
      "order_id": 1002,
      "customer": "Sarah Johnson",
      "product": "Mouse",
      "price": 24.99,
      "region": "South",
      "status": "completed"
    },
    {
      "order_id": 1006,
      "customer": "Sarah Johnson",
      "product": "Laptop",
      "price": 899.99,
      "region": "South",
      "status": "pending"
    }
  ]
}

Explanation: jq '.' pretty-prints JSON. Use filters like .orders[] | select(.price > 100) to filter high-value orders.
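That select filter in action, sketched with an inline JSON snippet shaped like the sample file:

```shell
# Filter orders over $100 and print the product names (-r emits raw strings, no quotes)
echo '{"orders":[{"product":"Laptop","price":899.99},{"product":"Mouse","price":24.99}]}' |
  jq -r '.orders[] | select(.price > 100) | .product'
# prints: Laptop
```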

jq -r '.orders[] | [.order_id, .customer, .product, .price] | @csv' sales_sample.json
1001,"John Smith","Laptop",899.99
1002,"Sarah Johnson","Mouse",24.99
1006,"Sarah Johnson","Laptop",899.99

Explanation: Convert JSON objects into CSV rows with @csv for downstream ingestion into a CSV pipeline.

Bonus: ls — listing files and details

What it does: list directory contents with many useful flags. Why use it: check files, sizes, ownership quickly.

ls -lh
total 1.5M
-rw-r--r-- 1 user user 1.3K Feb 27 12:01 sales_data.csv
drwxr-xr-x 4 user user 4.0K Feb 27 12:02 data_project

Explanation: -l produces long listing (permissions, owner, size), -h makes sizes human-readable. Use ls -la to show hidden files, ls -lS to sort by size.

Disk usage & finding large directories / files (du, find)

Monitoring disk usage is essential when working with large datasets. Use du and find to discover space hogs.

du -hs * | sort -rh | head -5
1.2G	data_project
512M	videos
120M	projects
45M	downloads
4.0K	tmp

Explanation: du -hs * shows human-readable sizes per entry; piping to sort -rh gives largest-first. Use this to decide which directories to clean.

du -Sh | sort -rh | head -5
1.2G	./data_project
512M	./videos
120M	./projects
45M	./downloads
16M	./projects/old_model

Explanation: -S (--separate-dirs) reports each directory's own size without its subdirectories, so large nested folders (like ./projects/old_model here) show up individually instead of being hidden inside their parent's total.

find /home/youruser/Downloads/ -type f -exec du -Sh {} + | sort -rh | head -n 5
1.1G	/home/youruser/Downloads/movie_large.mkv
450M	/home/youruser/Downloads/dataset.tar.gz
120M	/home/youruser/Downloads/project-backup.zip
45M	/home/youruser/Downloads/image_collection.tar
10M	/home/youruser/Downloads/report.pdf

Explanation: Finds the largest individual files under a path. Use this for targeted cleanup or backup planning.

Combining these tools: practical pipelines

Example: Find the top 10 most expensive products (skip header, extract product and price, sort numeric desc):

tail -n +2 sales_data.csv | cut -d',' -f4,7 | sort -t',' -k2 -rn | head -10
Laptop,899.99
Laptop,899.99
Laptop,899.99
Laptop,899.99
Laptop,899.99
Desk,399.99
Monitor,299.99
Monitor,299.99
Desk Chair,199.99
Desk Chair,199.99

Explanation: This pipeline demonstrates why the Unix philosophy (small composable tools) is powerful for quick data summaries.
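Another composition in the same spirit — counting completed orders per category — might look like this (inline rows stand in for sales_data.csv):

```shell
# grep filters, cut projects, sort+uniq aggregate: value_counts for one status
printf '%s\n' \
  'category,status' \
  'Electronics,completed' \
  'Furniture,completed' \
  'Electronics,pending' \
  'Electronics,completed' |
  tail -n +2 | grep ',completed$' | cut -d',' -f1 |
  sort | uniq -c | awk '{print $1, $2}'
# 2 Electronics
# 1 Furniture
```

Anchoring the grep pattern with ',completed$' avoids accidentally matching "completed" inside other fields.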

Verification & quick checks

Verify created files and basic counts after running scripts.

ls -l sales_data.csv
-rw-r--r-- 1 user user 1320 Feb 27 12:01 sales_data.csv

Explanation: Checks the file exists and size. If missing, confirm working directory and earlier heredoc ran correctly.

wc -l sales_data.csv
21 sales_data.csv

Explanation: Confirms expected number of lines: header + 20 records.

Troubleshooting

  • Command not found: install the missing package with your distro's package manager (for example, sudo apt install jq on Debian/Ubuntu, or sudo dnf install jq on Fedora). Package names vary between distros, so search first with apt search or dnf search.

  • CSV fields misaligned: verify delimiter and escape characters; use awk -F',' and jq (for JSON) instead of naive splitting when fields contain commas.
  • Large-file slowness: prefer streaming tools (awk, sed) and avoid loading files into editors; for multi-GB files, use sort --parallel=N or disk-based external tools.

Best practices for data scientists

  • Keep the shell for exploratory tasks: quick counts, top-N, simple transformations before moving to Python.
  • Always inspect the header with head -1 or ls -l to confirm schema and permissions.
  • Use pipes to compose: each tool does one thing well — chain them.
  • Save reproducible one-liners as scripts in ./scripts/ and use version control.

Conclusion

Mastering these Linux command-line tools — grep, awk, sed, cut, sort, uniq, wc, head/tail, find, jq, plus core utilities like ls and du — will make your data workflows faster and more robust. Use the examples above as templates, adapt them to your own datasets, and fold the one-liners into version-controlled scripts and notebooks for repeatable results.
