In the daily life of a Linux system administrator, encountering broken or failed services is a common challenge that demands a quick and effective response. Whether you’re maintaining a critical web server, a database backend, or a network service, knowing exactly which services have failed—and why—can save you hours of troubleshooting. This is where systemctl --failed becomes your go-to diagnostic tool. In this guide, we’ll explore how to find and fix broken services using systemctl, delving into practical examples and advanced techniques that unlock the full power of systemd service management on Debian, Ubuntu, RHEL, CentOS, and even Arch Linux. If you manage Linux servers in production environments, understanding this command inside and out is essential for rapid incident response and server health checks.

Understanding systemctl and the Role of --failed

Systemd is the init system responsible for booting your Linux system and managing ongoing services in the background. The systemctl command is the primary interface for interacting with systemd’s service units, socket units, timers, and more. When something goes wrong, running systemctl status on a single service can be useful, but it often results in overwhelming output if you don’t already know the failing service.

This is where the --failed flag shines—it provides a concise list of all currently failed units that systemd tracks. These units represent services or other systemd-managed entities that failed to start or have stopped unexpectedly. This lets an administrator quickly triage which services are problematic at the moment without sifting through pages of normal status messages.

sudo systemctl list-units --failed

  UNIT                               LOAD   ACTIVE SUB    DESCRIPTION
● apache2.service                    loaded failed failed The Apache HTTP Server
● postgresql.service                 loaded failed failed PostgreSQL RDBMS

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

2 loaded units listed.

This command lists all failed units, marking each one with a red dot on color terminals. It’s typically the first command to run when a server behaves unexpectedly, giving you immediate insight into what’s broken. Notice that --failed takes no arguments of its own; it simply filters the output to display only units in the failed state.
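Two related shortcuts are worth knowing here. Because list-units is systemctl’s default subcommand, the shorter form used throughout this guide behaves identically; and for checking a single unit in scripts, is-failed gives a machine-friendly answer:

```shell
# "list-units" is the default subcommand, so this shorthand is
# equivalent to "systemctl list-units --failed":
sudo systemctl --failed

# For one unit in a script: prints the unit's state and exits 0
# only when that unit is in the failed state
systemctl is-failed nginx.service && echo "nginx needs attention"
```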

Filtering for Services Only: Focusing Your Efforts

By default, systemctl list-units --failed returns all failed units, including timers, sockets, mount points, and more. However, a practical administrator is usually most concerned about failed services (.service units), since these directly impact application availability.

To filter down to just services, add the --type=service flag. This reduces noise and allows you to prioritize service failures immediately.

sudo systemctl list-units --failed --type=service

  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● nginx.service            loaded failed failed A high performance web server
● mariadb.service          loaded failed failed MariaDB database server

2 loaded units listed.

Remember, the parameter is singular: --type=service. Using the plural (--type=services) is a common beginner mistake and results in an “unknown unit type” error.
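As an aside, --failed is effectively shorthand for an explicit state filter, so the following two invocations should produce the same list:

```shell
# --failed is equivalent to filtering on the failed state explicitly
sudo systemctl list-units --type=service --failed
sudo systemctl list-units --type=service --state=failed
```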

Digging Deeper: Identifying Why Services Failed

The list of failed services is your starting point, but understanding why each service failed is crucial for remediation. The systemctl status command, when run with the service name, provides detailed information including recent log messages from the journal.

sudo systemctl status nginx.service

● nginx.service - A high performance web server
   Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2024-06-07 09:22:11 UTC; 3min ago
  Process: 1234 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=1/FAILURE)
 Jun 07 09:22:11 server systemd[1]: Starting A high performance web server...
 Jun 07 09:22:11 server nginx[1234]: nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
 Jun 07 09:22:11 server systemd[1]: nginx.service: Control process exited, code=exited status=1
 Jun 07 09:22:11 server systemd[1]: nginx.service: Failed with result 'exit-code'.
 Jun 07 09:22:11 server systemd[1]: Failed to start A high performance web server.

This output reveals the root cause—in this case, nginx fails to start because port 80 is already bound. The logs here are your treasure trove for troubleshooting real production outages.
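With the root cause identified, remediation follows directly. A minimal sketch for this particular port conflict, assuming the modern ss tool is available (on older systems, netstat -tlnp serves the same purpose):

```shell
# Find which process already holds port 80 (the cause shown in the log)
sudo ss -tlnp 'sport = :80'

# Once the conflicting process is stopped or reconfigured,
# restart the failed unit
sudo systemctl restart nginx.service

# Confirm recovery; prints "active" and exits 0 on success
systemctl is-active nginx.service
```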

When you have multiple failed services, repeating this manually can be tedious. A useful trick is to automate this with xargs to query the status for all failed services in one go:

sudo systemctl list-units --failed --type=service --no-legend --plain | awk '{print $1}' | xargs sudo systemctl status --no-pager -l

● nginx.service - A high performance web server
...[nginx status output]...

● mariadb.service - MariaDB database server
...[mariadb status output]...

This command pipeline efficiently handles multiple failures, providing full status details without interactive pagers. The flags --no-legend and --plain strip extra headers and formatting that can complicate scripting or parsing. With GNU xargs, adding -r (--no-run-if-empty) is also worthwhile, so that systemctl status isn’t invoked with no arguments when nothing has failed.
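When the abbreviated log excerpt in systemctl status isn’t enough, journalctl exposes a unit’s full log history. For example:

```shell
# Complete journal for the unit, without an interactive pager
sudo journalctl -u nginx.service --no-pager

# Limit to the current boot and to messages of priority "err" or worse
sudo journalctl -u nginx.service -b -p err --no-pager
```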

Best Practices for Using systemctl --failed Effectively

In my 15+ years managing Linux production servers, I’ve gathered some key best practices that every sysadmin should keep in mind:

  • Always use sudo: Without root privileges, systemctl status often shows incomplete journal output and some queries fail outright, which can mislead beginners into thinking logs are missing.
  • Combine with filtering: For fast triage, filter by unit type (--type=service) or by state (--state=failed) when needed.
  • Use --no-pager when scripting: Interactive pagers like less hang automation unless suppressed, wasting valuable time during outages.
  • Automate alerting: Extract counts with systemctl list-units --failed --no-legend --plain | wc -l to integrate into monitoring scripts or cron jobs for proactive service health checks.
  • Don’t confuse --failed with --state=failed: Both filter failed units, but --state= also accepts other states (active, inactive, and so on), making it the more flexible option once you’re comfortable with the basics.
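Putting the scripting-oriented points together, a minimal cron-friendly health check might look like the following sketch (the alert command is a placeholder; substitute your own mailer or webhook):

```shell
#!/bin/sh
# Count failed service units; --no-legend and --plain keep the
# output parseable, so wc -l yields one line per failed unit
failed_count=$(systemctl list-units --failed --type=service --no-legend --plain | wc -l)

if [ "$failed_count" -gt 0 ]; then
    # Placeholder alert: replace with mail, curl to a webhook, etc.
    echo "WARNING: $failed_count failed service(s) on $(hostname)" >&2
    exit 1
fi
```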

A Real-World Troubleshooting Scenario

In a recent production incident, a customer’s web server had been randomly going down during peak traffic hours. Running systemctl --failed --type=service immediately revealed a failed apache2.service. Checking its status exposed a log message about “port 80 already in use.” Further digging into netstat showed that a rogue instance of nginx was already bound to the same port, deployed by a configuration mistake during a recent application rollout.

Because the broken service was quickly identified, the team reversed the configuration and restarted Apache in minutes, avoiding a prolonged outage. This example highlights the importance of systemctl --failed as a triage tool—it focuses attention where it’s needed instead of digging through endless logs blind.
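One closing detail from incidents like this: if a broken unit has been retired rather than repaired, or has hit its start-rate limit, its failed state can linger in systemctl --failed output. The reset-failed subcommand clears it:

```shell
# Clear the failed state (and start-rate-limit counters) for one unit
sudo systemctl reset-failed apache2.service

# Or wipe the failed state of every unit at once
sudo systemctl reset-failed
```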

Conclusion

The systemctl --failed command is an indispensable Swiss Army knife for Linux system administrators, especially when managing complex production servers. It instantly surfaces broken services, helping you triage issues efficiently without wading through noise. By combining filtering options like --type=service, scripting-friendly flags such as --no-legend and --no-pager, and leveraging systemctl status for detailed diagnostics, you gain a powerful workflow for maintaining service uptime and diagnosing failures with confidence.

Next time your server acts up unexpectedly, use systemctl --failed as your first step—it will often cut your troubleshooting time dramatically. Once you’ve mastered this tool, move on to integrating its checks into your automated monitoring system to catch hidden failures before they impact users.