In the high-stakes world of Linux system administration, stability is not just a goal; it is a requirement. Whether you are managing a single VPS for a blog or a massive cluster of microservices in a DevOps environment, the health of your system depends on how well you manage its core resources: CPU and Memory. A single poorly optimized query, a memory leak in a Node.js application, or a runaway process can escalate quickly, exhausting CPU or RAM until the kernel's OOM killer steps in or the system stops responding altogether.
While industry-standard tools like Prometheus, Grafana, and Datadog offer beautiful dashboards and historical data, they often function as passive observers. They will tell you that your server died, but they won’t always act to save it in real-time. This is where custom automation becomes invaluable. In this guide, we will build a proactive, “self-healing” Bash watchdog that identifies resource abuse, validates it through confirmation cycles, and takes corrective action automatically.
1. Understanding Linux Process Management and the Kernel
Before we write a single line of code, it is essential to understand what we are actually monitoring. In Linux, every process is an instance of a running program, identified by a unique PID (Process ID). The Linux kernel uses a scheduler to distribute CPU time among these PIDs.
When a process begins to consume excessive resources, it isn’t always a “bug.” It could be a legitimate heavy task like video encoding or database indexing. However, if a process stays at 99% CPU for an extended period with no such workload behind it, it often indicates an infinite loop or a busy-wait bug. Similarly, memory usage that keeps climbing often points to a memory leak, where an application never releases RAM it no longer needs. Our script’s job is to differentiate between a temporary “spike” and a persistent “threat.”
2. Architecture: Detection, Validation, and Whitelisting
Our script is built on three core pillars of system logic:
- Real-time Detection: We interface with the ps (process status) utility. Unlike top, which is interactive, ps allows us to snapshot the system state in a parseable format.
- Confirmation Logic: This is the “brain” of the script. To avoid killing a process that just had a 1-second spike, the script uses a counter. A process must violate the rules X times in a row before it is terminated.
- Whitelist Protection: Every server has critical “citizens.” You never want to kill your Database (MySQL/PostgreSQL) or your Web Server (Nginx) even if they are heavy. Our script includes a name-based whitelist, matched against whole process names, to protect these vital services.
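The confirmation logic can be sketched on its own, separate from the full script. This is a minimal illustration under stated assumptions: record_sample, STRIKES, THRESHOLD, and VERDICT are hypothetical names, and the readings are passed in by hand rather than read from ps.

```bash
#!/usr/bin/env bash
# Minimal sketch of the confirmation counter (hypothetical names throughout).
declare -A STRIKES          # consecutive violations, keyed by PID
THRESHOLD=85                # integer percentage limit (assumed value)
CONFIRMATIONS_NEEDED=3

# record_sample PID VALUE
# Sets VERDICT to "kill" once PID has exceeded THRESHOLD for
# CONFIRMATIONS_NEEDED samples in a row, "warn" while the streak is
# still building, and "ok" for a normal reading.
record_sample() {
    local pid=$1 value=$2
    if (( value > THRESHOLD )); then
        STRIKES[$pid]=$(( ${STRIKES[$pid]:-0} + 1 ))
        if (( STRIKES[$pid] >= CONFIRMATIONS_NEEDED )); then
            unset 'STRIKES[$pid]'
            VERDICT=kill
        else
            VERDICT=warn
        fi
    else
        unset 'STRIKES[$pid]'   # one normal reading resets the streak
        VERDICT=ok
    fi
}
```

The result is returned through a variable rather than printed, because calling the function inside $(...) would run it in a subshell and the updated counter would be lost.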
3. The Implementation: smart_monitor.sh
This script is designed for Bash 4.0+. It uses associative arrays to keep track of multiple rogue processes simultaneously.
#!/bin/bash
# ==============================================================================
# SCRIPT: smart_monitor.sh
# DESCRIPTION: Advanced Linux Resource Watchdog with Auto-Kill & Whitelisting
# ==============================================================================
# --- [1] CONFIGURATION SECTION ---
MAX_CPU=85 # Trigger alert if CPU > 85%
MAX_MEM=80 # Trigger alert if RAM > 80%
CHECK_INTERVAL=10 # Time between scans (seconds)
LOG_FILE="/var/log/smart_monitor.log"
# --- [2] WHITELIST SECTION ---
# List process names exactly as they appear in 'ps -o comm'
# (the executable name, e.g. 'mysqld' and 'postgres', not the package name)
# Example: "mysqld nginx sshd dockerd"
WHITELIST="mysqld mariadbd postgres dockerd nginx apache2 sshd systemd rsync"
# --- [3] AUTOMATION LOGIC ---
# Number of consecutive checks a process must fail before being killed
CONFIRMATIONS_NEEDED=3
# Initialize associative array to store violation counts per PID
declare -A VIOLATIONS
# Warn if the script is not running as root
if [[ $EUID -ne 0 ]]; then
    echo "WARNING: Not running as root. Only processes owned by this user can be terminated." >&2
    # Fallback for log file if not root
    LOG_FILE="./smart_monitor_local.log"
fi
echo "--------------------------------------------------------"
echo "WATCHDOG ACTIVE: Starting Resource Monitoring..."
echo "Config: CPU Limit: $MAX_CPU% | MEM Limit: $MAX_MEM%"
echo "Interval: Every $CHECK_INTERVAL seconds"
echo "--------------------------------------------------------"
while true; do
    # Fetch top 5 resource-intensive processes (sorted by CPU usage)
    # Format: CPU%, MEM%, PID, CommandName
    # comm comes last so that names containing spaces stay in one field
    # We use 'sed' to skip the header line
    PROCESS_LIST=$(ps -eo pcpu,pmem,pid,comm --sort=-pcpu | sed -n '2,6p')
    while read -r cpu mem pid comm; do
        # Skip blank input (e.g., if ps returned nothing)
        [[ -n "$pid" ]] || continue
        # [A] CHECK WHITELIST
        # We wrap names in spaces to prevent partial matching (e.g., 'sshd' vs 'sshd_miner')
        if [[ " $WHITELIST " == *" $comm "* ]]; then
            continue
        fi
        # [B] NORMALIZE DATA
        # Bash does not handle decimals. We strip everything after the dot.
        cpu_int=$(echo "$cpu" | cut -d. -f1)
        mem_int=$(echo "$mem" | cut -d. -f1)
        # [C] EVALUATE PERFORMANCE
        if [ "$cpu_int" -gt "$MAX_CPU" ] || [ "$mem_int" -gt "$MAX_MEM" ]; then
            # Increment the violation counter for this specific PID
            ((VIOLATIONS[$pid]++))
            # [D] DECISION ENGINE
            if [ "${VIOLATIONS[$pid]}" -ge "$CONFIRMATIONS_NEEDED" ]; then
                TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S")
                ALERT_MSG="[KILL ACTION] Process '$comm' (PID: $pid) violated limits for $CONFIRMATIONS_NEEDED cycles."
                # Write to Log
                echo "[$TIMESTAMP] $ALERT_MSG (CPU: $cpu%, MEM: $mem%)" >> "$LOG_FILE"
                # Broadcast to all terminals
                wall "RESOURCE CRITICAL: Terminating rogue process '$comm' (PID: $pid)."
                # Terminate the process (SIGKILL)
                kill -9 "$pid"
                # Clean up the array to free memory (quoted so the shell does not glob the brackets)
                unset 'VIOLATIONS[$pid]'
            else
                echo "$(date): WARNING: '$comm' (PID: $pid) is at CPU:${cpu}%, MEM:${mem}%. Violation #${VIOLATIONS[$pid]}" >> "$LOG_FILE"
            fi
        else
            # Drop the violation counter if the process returns to normal behavior
            unset 'VIOLATIONS[$pid]'
        fi
    done <<< "$PROCESS_LIST"
    # Sleep to prevent the script itself from becoming a resource hog
    sleep "$CHECK_INTERVAL"
done

4. Deep Dive: Technical Explanations
4.1 The Power of ‘ps’ over ‘top’
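Most beginners use the top command to see what’s happening. However, top is interactive by default: it redraws the screen on every refresh, and even its batch mode produces output meant for people rather than parsers. For automation, we use ps -eo. The -e flag stands for “every process,” and -o allows us to define a custom “output” format. This ensures our script only receives the data it needs (CPU, Mem, Name, PID), which keeps parsing fast and lightweight. One caveat: the %CPU column in procps ps is the CPU time a process has used divided by how long it has been running, so it is a lifetime average rather than an instantaneous reading. A long-running process that only recently started spinning will climb toward the limit more slowly in ps than it would in top.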
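To illustrate the snapshot-and-parse approach described above, here is a small sketch. The helper names snapshot and parse_rows are hypothetical, and the --sort and --no-headers options assume the procps version of ps found on most Linux distributions.

```bash
#!/usr/bin/env bash
# Sketch: snapshot processes with ps and parse each row safely.
# snapshot and parse_rows are hypothetical helper names.

# Emit "pcpu pmem pid comm" rows, highest CPU first, without a header.
snapshot() {
    ps -eo pcpu,pmem,pid,comm --sort=-pcpu --no-headers | head -n 5
}

# Read rows from stdin and print "pid|comm|cpu|mem".
# comm is read last so that names containing spaces stay in one field.
parse_rows() {
    local cpu mem pid comm
    while read -r cpu mem pid comm; do
        [[ -n $pid ]] || continue      # ignore blank lines
        printf '%s|%s|%s|%s\n' "$pid" "$comm" "$cpu" "$mem"
    done
}
```

Running snapshot | parse_rows prints one delimited line per process. Putting the command name in the last column matters because read assigns all remaining words to its final variable, so a name like “Web Content” is not split across fields.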
4.2 Handling Floating Points in Shell
A common pitfall in Bash scripting is the lack of native support for floating-point arithmetic (decimals). When ps reports 85.7%, a standard Bash comparison like if [ 85.7 -gt 80 ] fails with an “integer expression expected” error. To solve this, we use the cut -d. -f1 command. This treats the dot as a delimiter and takes only the first part, truncating the number to an integer. Truncation means 85.7 becomes 85, which does not satisfy -gt 85, so a limit of 85 effectively triggers at 86.0% and above. In the context of system monitoring, that margin is negligible.
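For comparison, here are two other ways to handle the decimal, shown as a sketch rather than as what the script uses. The function names to_int and exceeds are hypothetical. The first truncates with parameter expansion and avoids spawning cut; the second hands the comparison to awk so that no precision is lost.

```bash
#!/usr/bin/env bash
# Two alternative ways to compare a decimal percentage with an integer limit.
# to_int and exceeds are hypothetical helper names.

# 1) Truncate with parameter expansion (no external process).
to_int() {
    local n=${1%%.*}            # drop the first dot and everything after it
    printf '%s\n' "${n:-0}"     # ".5" truncates to an empty string, so default to 0
}

# 2) Compare without truncating by delegating to awk.
#    Succeeds (exit status 0) when VALUE is strictly greater than LIMIT.
exceeds() {
    awk -v v="$1" -v limit="$2" 'BEGIN { exit !((v + 0) > (limit + 0)) }'
}
```

With these, to_int 85.7 prints 85, while exceeds 85.7 85 succeeds, which shows the difference between truncating first and comparing the real value.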
4.3 Preventing Memory Bloat in the Script
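Because we use an Associative Array (VIOLATIONS), the script keeps one entry for each PID it is tracking. On a server with millions of short-lived processes, this array could grow if entries were never removed. To keep it small, an entry should be removed with unset (rather than set to 0) as soon as its process is killed or returns to normal levels, which lets the script run for months without a restart. One caveat remains: a process that drops out of the top-5 list while it still holds a count keeps its entry until it is seen again, and because Linux eventually reuses PIDs, a stale count can in principle be inherited by an unrelated process.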
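One way to limit stale entries is to prune counters for processes that no longer exist. The sketch below is an assumption about how that could be done, not part of the original script: prune_dead is a hypothetical helper, and it relies on the Linux convention that every live process has a /proc/<pid> directory.

```bash
#!/usr/bin/env bash
# Sketch: drop counters for PIDs that no longer exist.
# prune_dead is a hypothetical helper; VIOLATIONS matches the script's array name.
declare -A VIOLATIONS

prune_dead() {
    local pid
    # The key list is expanded before the loop starts, so removing
    # entries inside the loop is safe.
    for pid in "${!VIOLATIONS[@]}"; do
        [[ -d /proc/$pid ]] || unset 'VIOLATIONS[$pid]'
    done
}
```

Calling prune_dead once per scan, just before the sleep, would keep the array bounded by the number of flagged processes that are still alive.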
5. Deployment Strategy
5.1 Permissions and Security
Since the script has the power to terminate processes, it must be protected. Access should be restricted to the root user only:
sudo chown root:root smart_monitor.sh
sudo chmod 700 smart_monitor.sh
The 700 permission ensures that only the root user can read, write, or execute the script, preventing non-privileged users from seeing your whitelist or log paths.
5.2 Log Rotation (Preventing Disk Full Errors)
A monitoring script that can write a log entry every 10 seconds can quickly fill up a disk, so rotate its log with the logrotate utility. Create a new file /etc/logrotate.d/smart_monitor and paste the following:
/var/log/smart_monitor.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
This will keep only the last 7 days of logs and compress older ones, saving precious SSD space.
6. Running as a Professional Systemd Service
Don’t run your script in a random screen or tmux session. For production, you should create a systemd service. This allows the script to start automatically after a reboot and restart itself if it crashes.
Create the service file:
sudo nano /etc/systemd/system/resource-watchdog.service
Insert this configuration (it assumes the script has been copied to /usr/local/bin/smart_monitor.sh):
[Unit]
Description=Linux Resource Watchdog Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/smart_monitor.sh
Restart=always
RestartSec=10
User=root

[Install]
WantedBy=multi-user.target
Enable and start your new watchdog:
sudo systemctl daemon-reload
sudo systemctl enable resource-watchdog
sudo systemctl start resource-watchdog

7. Conclusion: The Path to Zero-Touch Admin
Building a self-healing system is a journey, not a destination. By implementing this script, you have moved beyond simple monitoring into the realm of automated orchestration. You have created a system that can defend itself against rogue software while you sleep.
The next steps for this project could include integrating Telegram or Slack Webhooks to get push notifications on your phone whenever the script kills a process. In the world of Linux, if you can measure it, you can automate it.
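As a starting point for the Slack idea, here is a hedged sketch of a notification helper. Everything named here is an assumption for illustration: SLACK_WEBHOOK_URL is a hypothetical environment variable holding an incoming-webhook URL, and json_escape, build_payload, and notify are hypothetical function names. It relies only on the fact that Slack incoming webhooks accept a JSON body with a text field.

```bash
#!/usr/bin/env bash
# Sketch: post a kill notification to a Slack incoming webhook.
# SLACK_WEBHOOK_URL, json_escape, build_payload and notify are hypothetical names.

# Escape backslashes and double quotes so the value is safe inside a JSON string.
# (Control characters are not handled in this sketch.)
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_payload COMM PID -> JSON body for the webhook
build_payload() {
    local comm
    comm=$(json_escape "$1")
    printf '{"text":"Watchdog killed %s (PID %s)"}' "$comm" "$2"
}

# notify COMM PID -> send the message, or do nothing if no URL is configured
notify() {
    [[ -n ${SLACK_WEBHOOK_URL:-} ]] || return 0
    curl -fsS -m 5 -X POST -H 'Content-Type: application/json' \
        --data "$(build_payload "$1" "$2")" "$SLACK_WEBHOOK_URL" >/dev/null
}
```

In the watchdog, a call such as notify "$comm" "$pid" could follow the kill command. The 5-second timeout keeps a slow network from stalling the monitoring loop.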