
Lesson 2: The Troubleshooter (Capstone Part 2)

It's 3 AM. The pager goes off: "Production server unresponsive." This is the moment everything you've learned comes together. You need to diagnose and fix the issue — fast.

The Troubleshooting Playbook

Experienced DevOps engineers work through the same checks in a fixed order instead of guessing:

1. CHECK LOGS     → What error messages exist?
2. CHECK PROCESSES → Is something hogging resources?
3. CHECK NETWORK  → Can the server communicate?
4. CHECK DISK     → Is storage full?
5. CHECK MEMORY   → Is RAM exhausted?
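
The memory check in item 5 comes down to comparing available RAM against the total. Here is a minimal sketch, assuming a Linux host whose /proc/meminfo has a MemAvailable field (kernels 3.14 and later). The function name mem_available_pct and its optional file argument are made up for illustration.

```shell
# mem_available_pct [FILE]: print the percentage of RAM still available.
# Reads a meminfo-style file (default: /proc/meminfo). The function name and
# the optional FILE argument are illustrative, not a standard tool.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%d\n", avail * 100 / total
            else exit 1
        }
    ' "${1:-/proc/meminfo}"
}

# Possible use on a live host:
#   [ "$(mem_available_pct)" -lt 10 ] && echo "Less than 10% of RAM available"
```

The percentage is truncated to a whole number, which is precise enough for a quick "is RAM exhausted?" answer.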

Step 1: Read the Logs

Logs are your crime scene evidence. Check them first:

grep -i 'error' /var/log/syslog | tail -20    # Recent errors (Debian/Ubuntu; RHEL-family systems use /var/log/messages)
journalctl -xe                                  # End of the systemd journal, with explanatory text
tail -f /var/log/nginx/error.log               # Watch logs live
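
When the log is noisy, it can help to see which program is producing the errors before reading individual lines. The sketch below assumes the traditional syslog layout ("Mon DD HH:MM:SS host program[pid]: message"), so the program tag is the fifth field. Other log formats would need a different field, and the function name errors_by_program is hypothetical.

```shell
# errors_by_program FILE: count case-insensitive "error" lines per program tag.
# Assumes the traditional syslog layout "Mon DD HH:MM:SS host program[pid]: msg",
# so the program tag is field 5. The function name is illustrative.
errors_by_program() {
    grep -i 'error' "$1" |
        awk '{
                 tag = $5
                 sub(/\[[0-9]+\]:?$/, "", tag)   # drop a trailing "[pid]:" if present
                 sub(/:$/, "", tag)              # drop a bare trailing colon
                 count[tag]++
             }
             END { for (t in count) print count[t], t }' |
        sort -rn                                 # noisiest program first
}

# Possible use on a live host:
#   errors_by_program /var/log/syslog | head -5
```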

Step 2: Hunt Rogue Processes

A runaway process can eat all CPU or memory:

ps aux --sort=-%cpu | head -10    # Top CPU consumers
ps aux --sort=-%mem | head -10    # Top memory consumers
kill <PID>                        # Ask it to exit cleanly (SIGTERM)
kill -9 <PID>                     # Force-kill only if it ignores SIGTERM
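
Force-killing with kill -9 gives a process no chance to flush data or release locks, so a gentler sequence is usually safer: send SIGTERM, wait briefly, and escalate only if the process is still alive. This is a minimal sketch of that idea. The function name stop_process and the 5-second default grace period are illustrative choices, not a standard.

```shell
# stop_process PID [GRACE_SECONDS]: try SIGTERM first, SIGKILL only if needed.
# The function name and the default grace period are illustrative.
stop_process() {
    local pid=$1
    local grace=${2:-5}
    kill -TERM "$pid" 2>/dev/null || return 0    # already gone, or not ours to signal
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # signal 0 only tests for existence
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null || true        # last resort: cannot be caught or ignored
}

# Possible use:
#   stop_process 4242 10
```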

Step 3: Verify Network

Is the server actually accepting connections?

ss -tlnp | grep ':80 '          # Is port 80 listening? (plain "grep 80" also matches 8080 or PIDs)
ping -c 3 google.com            # Can it resolve DNS and reach the internet?
curl -I http://localhost         # Is the web server responding?
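
Matching a port number with grep can give false positives, because the same digits may appear in another port or in a PID. The sketch below compares only the port part of the local address. It assumes the default column layout of ss -tln (state first, local address in column 4) and reads that output on stdin; the function name port_is_listening is hypothetical.

```shell
# port_is_listening PORT: succeed if a TCP socket is listening on exactly PORT.
# Reads `ss -tln` style output on stdin, so it can also be tried on saved output.
# The function name is illustrative.
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")          # local address:port is column 4
            if (parts[n] == port) found = 1    # compare only the text after the last colon
        }
        END { exit !found }
    '
}

# Possible use on a live host:
#   ss -tln | port_is_listening 80 && echo "port 80 is listening"
```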

Step 4: Check Storage

A full disk makes writes fail, so services can stop logging or refuse to start without an obvious error:

df -h                            # Overall disk usage
du -sh /var/log/* | sort -rh | head -5    # Largest files and directories under /var/log
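
Scanning df output by eye works for a handful of filesystems, but a numeric threshold is easier to reuse in scripts. This sketch assumes df -P output on stdin (the POSIX format keeps each filesystem on one line) and mount points without spaces. The function name disks_over and its threshold argument are illustrative.

```shell
# disks_over PCT: list mount points whose usage is at or above PCT percent.
# Reads `df -P` output on stdin. The function name is illustrative.
disks_over() {
    awk -v limit="$1" '
        NR > 1 {                               # skip the header line
            use = $5
            sub(/%$/, "", use)                 # "93%" -> "93"
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}

# Possible use on a live host:
#   df -P | disks_over 90
```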

Step 5: Generate a Report

Document your findings for the team. Including free -h in the report also covers the playbook's memory check:

echo "=== Health Report ===" > report.txt
date >> report.txt
free -h >> report.txt
df -h >> report.txt
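
The same report can be wrapped in a function so each section is labelled and a missing tool does not abort the run. This is a sketch under assumptions: the function name write_health_report, the section titles, and the choice of commands are illustrative, and it relies on sh-style word splitting to run each command string.

```shell
# write_health_report FILE: collect basic health data into FILE (overwrites it).
# The function name and section titles are illustrative. A command that is not
# installed, or that fails, is noted in the report instead of stopping the run.
write_health_report() {
    local out=$1 cmd
    {
        echo "=== Health Report ==="
        date
        for cmd in "uptime" "free -h" "df -h"; do
            echo
            echo "--- $cmd ---"
            if command -v "${cmd%% *}" >/dev/null 2>&1; then
                $cmd 2>&1 || echo "($cmd failed)"
            else
                echo "(${cmd%% *} not available)"
            fi
        done
    } > "$out"
}

# Possible use, with a timestamped file name so earlier reports are kept:
#   write_health_report "report-$(date +%Y%m%d-%H%M%S).txt"
```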

Final Mission

A production server is down. Follow the playbook:

  1. Read the evidence: Run grep -i 'error' /var/log/syslog | tail -5.
  2. Hunt the culprit: Find top CPU consumers with ps aux --sort=-%cpu | head -5.
  3. Test the network: Check if port 80 is listening with ss -tlnp | grep ':80 '.
  4. Check the storage: Look for filesystems at 80% usage or higher with df -h | grep -E '[89][0-9]%|100%'.
  5. Document it: Generate a health report to share with your team.

🎉 Congratulations!

You've completed the Linux Foundations for DevOps course! You now have the skills to navigate, secure, monitor, automate, and troubleshoot Linux servers — the foundation of every DevOps career.

Next Steps:

  • Practice these commands on a real Linux VM (try DigitalOcean or AWS Free Tier).
  • Learn Docker — containerization is the next level.
  • Explore CI/CD pipelines with GitHub Actions or Jenkins.
