So, you just got a shiny new shell script from ChatGPT (or Copilot, or your favorite AI buddy). It looks legit. It even feels right. But then that creeping doubt sets in:

> "Wait… is this thing safe to run in production?"

Welcome to the world of unit testing shell scripts generated by LLMs — where the stakes are high, `sudo` is dangerous, and one wrong `rm -rf` can ruin your whole day.

In this post, we'll walk through a battle-tested way to safely test and validate scripts that manage real services like PM2, Docker, Nginx, or anything that touches system state.

## The Problem With Trusting LLM Shell Scripts

<img src={require('./img/11527922.jpg').default} alt="Frustrated engineer realizing the risks of blindly trusting LLM-generated shell scripts" width="500" height="450"/>
<br/>

Large Language Models like ChatGPT are awesome at generating quick shell scripts. But even the best LLM:

* Can make assumptions about your environment
* Might use the wrong binary name (like `pgrep -x PM2` instead of `pm2`)
* Can forget that `systemctl restart docker` is anything but a harmless no-op

Even if the logic is 90% correct, that remaining 10% can:

* Restart your services at the wrong time
* Write to incorrect log paths
* Break idempotency (runs that shouldn't change state suddenly do)

[According to a recent study on AI-generated code](https://www.infosecurity-magazine.com/news/llms-vulnerable-code-default/), about 15% of LLM-generated shell scripts contain potentially dangerous commands when run in production environments.

## Strategy 1: Add a `--dry-run` Mode

Every LLM-generated script should support a `--dry-run` flag. This lets you preview what the script would do — without actually doing it.

Here's how you add it:

```bash
DRY_RUN=false
[[ "$1" == "--dry-run" ]] && DRY_RUN=true

log_action() {
  echo "$(date): $1"
  $DRY_RUN && echo "[DRY RUN] $1" || eval "$1"
}

# Example usage
log_action "sudo systemctl restart nginx"
```

This pattern gives you traceable operations that you can preview before anything actually runs. [For more on generating and testing scripts with LLMs, see this discussion](https://www.reddit.com/r/Frontend/comments/1i16clz/generating_unit_tests_with_llms/).

## Strategy 2: Mock External Commands

You don't want `docker restart` or `pm2 resurrect` running during testing. You can override them like this:

```bash
mkdir mock-bin
echo -e '#!/bin/bash\necho "[MOCK] $0 $@"' > mock-bin/docker
chmod +x mock-bin/docker
export PATH="$(pwd)/mock-bin:$PATH"
```

Now, any call to `docker` will echo a harmless line instead of nuking your containers. Create similar mocks for other dangerous binaries like `systemctl`, `pm2`, and `rm` as needed.

This PATH-override technique is widely used alongside the [Bash Automated Testing System (BATS)](https://github.com/bats-core/bats-core), where command mocking is a common practice.

## Strategy 3: Use `shellcheck`

LLMs sometimes mess up quoting, variables, or command usage. [`ShellCheck`](https://www.shellcheck.net/) is your best friend.

Just run:

```bash
shellcheck myscript.sh
```

And it'll tell you:

* If variables are unquoted (`"$var"` vs `$var`)
* If commands are used incorrectly
* If your `if` conditions are malformed

It's like a linter, but for your shell's sanity.

## Strategy 4: Use Functions, Not One Big Blob

Break your script into testable chunks:

```bash
check_pm2() {
  ps aux | grep '[P]M2' > /dev/null
}

restart_all() {
  pm2 resurrect
  docker restart my-app
  systemctl restart nginx
}
```

Now you can mock and call these functions directly in a test harness without running the whole script (see the sketch below).
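To make that concrete, here's a minimal sketch of such a harness. It assumes the functions above live in a hypothetical file called `deploy.sh` that defines `check_pm2` and `restart_all` and has no side effects at the top level when sourced, and it reuses the PATH-override mocking trick from Strategy 2:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Build a directory of mock binaries so nothing real gets touched (Strategy 2).
mock_dir="$(mktemp -d)"
for cmd in pm2 docker systemctl; do
  printf '#!/bin/bash\necho "[MOCK] %s $*"\n' "$cmd" > "$mock_dir/$cmd"
  chmod +x "$mock_dir/$cmd"
done
export PATH="$mock_dir:$PATH"

# 2. Source the script under test so its functions become callable (Strategy 4).
#    "deploy.sh" is a placeholder name for the LLM-generated script.
source ./deploy.sh

# 3. Call the functions directly and assert on what they *would* have run.
output="$(restart_all)"

if grep -q "\[MOCK\] pm2 resurrect" <<< "$output"; then
  echo "PASS: restart_all calls pm2 resurrect"
else
  echo "FAIL: restart_all never called pm2 resurrect"
  exit 1
fi

if check_pm2; then
  echo "PASS: check_pm2 found a PM2 process"
else
  echo "INFO: check_pm2 found no PM2 process (expected on a test box)"
fi

rm -rf "$mock_dir"
```

Because the mocks simply echo their arguments, the harness can verify which commands the script intended to run without ever touching PM2, Docker, or Nginx.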
This modular approach mirrors [modern software testing principles](https://martinfowler.com/bliki/UnitTest.html).

## Strategy 5: Log Everything. Seriously.

Log every decision point. Why? Because "works on my machine" isn't helpful when the container didn't restart or PM2 silently failed.

```bash
log() {
  echo "$(date '+%F %T') [LOG] $1" >> /var/log/pm2_watchdog.log
}
```

## Strategy 6: Test in a Sandbox

If you've got access to Docker or a VM, spin up a replica and try running the script in that environment. Better to break a fake server than your actual one.

Try:

```bash
docker run -it ubuntu:20.04
# Then apt install what you need: pm2, docker, nginx, etc.
```

For disposable, Docker-based test environments, [Testcontainers](https://www.testcontainers.org/) is also worth a look.

## Bonus: Tools You Might Love

<img src={require('./img/4264704.jpg').default} alt="Developer presenting useful tools for safely testing shell scripts generated by LLMs" width="500" height="450"/>
<br/>

* [BATS](https://github.com/bats-core/bats-core): Bash unit testing framework
* [shunit2](https://github.com/kward/shunit2): xUnit-style testing for POSIX shell
* [assert.sh](https://github.com/lehmannro/assert.sh): dead-simple shell assertion helper
* [shellspec](https://github.com/shellspec/shellspec): full-featured, RSpec-like shell test framework

## Final Thoughts: Don't Just Run It — Test It

<img src={require('./img/20944874.jpg').default} alt="Two engineers discussing safe testing practices for LLM-generated shell scripts" width="500" height="450"/>
<br/>

It's tempting to copy-paste that LLM-generated shell script and run it. But in production environments — especially ones with critical services like PM2 and Nginx — the safer path is to test before you trust.

Use dry-run flags. Mock your commands. Run scripts through `shellcheck`. Add logging. Test in Docker. Break things in safe places.

With these strategies, you can confidently validate AI-generated shell scripts and ensure they behave as expected before they hit your production servers.

[Nife](https://nife.io/), a hybrid cloud platform, offers a seamless solution for deploying and managing applications across edge, cloud, and on-premise infrastructure. If you're validating shell scripts that deploy services via Docker, PM2, or Kubernetes, it's worth exploring how Nife can simplify and secure that pipeline.

[Its containerized app deployment](https://nife.io/solutions/Deploy%20Containarized%20Apps) capabilities allow you to manage complex infrastructure with minimal configuration. Moreover, through features like [OIKOS Deployments](https://nife.io/oikos/features/deployments), you gain automation, rollback support, and a centralized view of distributed app lifecycles — all crucial for testing and observability.