An Awful Edge Case in Bash's `set -e`

For the last six months, I've been building out the automated testing infrastructure at a startup. Our infrastructure is mostly in Python, but writing Bash scripts is inevitable. At the end of the day, automated testing is all about running commands in a row, and Bash is the right tool for the job.

There are a whole bunch of articles about how to write safe Bash scripts, and the standard advice is to add set -euo pipefail to make your scripts "safe." In this article, I'm going to describe one edge case where set -e completely fails to work.

Background

Suppose we have two projects in a git repo, say MyLibrary and MyApplication, where MyApplication depends on MyLibrary. And suppose each project provides a script test.sh that looks something like this:

#!/bin/bash

set -e

./configure
make

python fancy_test_driver.py tests/first_tests
python fancy_test_driver.py --option-1 tests/second_tests
python fancy_test_driver.py --option-2 tests/third_tests

This is a pretty reasonable script. We can now require that all changes pass both my_library/test.sh and my_application/test.sh before being merged.

As time goes on, the number of tests (and projects!) can spiral out of control. Eventually, someone (me) gets tasked with trying to optimize things. One obvious thing to do is to abort early. If MyLibrary fails testing, we don't need to bother with testing MyApplication.

Of course, the developers who used to get errors from both projects aren't very happy about this change. Now passing the test.sh scripts is like peeling an onion: you resolve the first layer of errors only to find more errors lurking underneath, hidden by the early abort. However, there's a middle ground. We can still test MyApplication when MyLibrary fails while running test cases, and abort early only when MyLibrary fails during compilation. In that case, continuing on is pointless since MyApplication depends on MyLibrary.

So how do we distinguish when test.sh fails during compilation from when it fails during testing? Exit codes, naturally:

#!/bin/bash

set -e

COMPILE_FAILURE_CODE=79

./configure || exit $COMPILE_FAILURE_CODE
make || exit $COMPILE_FAILURE_CODE

python fancy_test_driver.py tests/first_tests
python fancy_test_driver.py --option-1 tests/second_tests
python fancy_test_driver.py --option-2 tests/third_tests

And we're done! An exit code of 0 means we passed the tests, an exit code of 79 means compilation failure (no need to test further projects), and any other exit code means we failed in testing—so we can continue testing the other projects.
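As a minimal sketch of how a caller could act on these exit codes: the classify_status function and the stand-in failure below are hypothetical illustrations, not part of the real scripts, and a real driver would run my_library/test.sh where the stand-in is.

```shell
#!/bin/bash
# Hypothetical driver logic: map a test.sh exit status to a decision.
COMPILE_FAILURE_CODE=79

classify_status() {
    # $1 is the exit status returned by a project's test.sh
    if [ "$1" -eq 0 ]; then
        echo "passed"
    elif [ "$1" -eq "$COMPILE_FAILURE_CODE" ]; then
        echo "compile-failure"   # dependents cannot build, so stop here
    else
        echo "test-failure"      # dependents can still be tested
    fi
}

# Stand-in for running my_library/test.sh; it pretends compilation failed.
status=0
(exit 79) || status=$?

case "$(classify_status "$status")" in
    compile-failure) decision="skip MyApplication" ;;
    *)               decision="test MyApplication" ;;
esac
echo "$decision"
```

Capturing the status with `|| status=$?` keeps the driver itself from aborting if it also runs under set -e.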

Finding the Problem

The above solution works fine when we only have two lines that need the special exit code. However, it quickly becomes unwieldy when it needs to be applied to more lines:

#!/bin/bash

set -e

COMPILE_FAILURE_CODE=79

pushd codegen_tool1 || exit $COMPILE_FAILURE_CODE
./configure || exit $COMPILE_FAILURE_CODE
make || exit $COMPILE_FAILURE_CODE
./codegen_tool1 || exit $COMPILE_FAILURE_CODE
popd || exit $COMPILE_FAILURE_CODE

pushd codegen_tool2 || exit $COMPILE_FAILURE_CODE
./configure || exit $COMPILE_FAILURE_CODE
make || exit $COMPILE_FAILURE_CODE
./codegen_tool2 || exit $COMPILE_FAILURE_CODE
popd || exit $COMPILE_FAILURE_CODE

./configure || exit $COMPILE_FAILURE_CODE
make || exit $COMPILE_FAILURE_CODE

# Run some tests...

Gross! Clearly, we should factor out the || exit $COMPILE_FAILURE_CODE line and have it apply to all of our lines at once. We can easily do this by creating a separate build.sh script:

#!/bin/bash

set -e

pushd codegen_tool1
./configure
make
./codegen_tool1
popd

# snip

./configure
make

And then adjusting test.sh to just have:

#!/bin/bash

set -e

COMPILE_FAILURE_CODE=79

./build.sh || exit $COMPILE_FAILURE_CODE

# Run some tests...

And this, too, works great! But wait! Why even use a second script? Can't we do the exact same thing with a subshell?

#!/bin/bash

set -e

COMPILE_FAILURE_CODE=79

(
    pushd codegen_tool1
    ./configure
    make
    ./codegen_tool1
    popd

    # snip

    ./configure
    make
) || exit $COMPILE_FAILURE_CODE

# Run some tests...

No! This subshell implementation is dangerously broken. The rest of this article will explore how and why the subshell code does not function as expected.

There's a Problem?

Let's run some simple bash programs and see what happens. First, we'll confirm set -e works as expected.

$ cat script1.sh
#!/bin/bash
echo "Statement 1"
(exit 3)
echo "Statement 2"

$ ./script1.sh
Statement 1
Statement 2

$ echo "$?"
0

$ cat script2.sh
#!/bin/bash
set -e
echo "Statement 1"
(exit 3)
echo "Statement 2"

$ ./script2.sh
Statement 1

$ echo "$?"
3

Yep, that's what we expected. Without set -e Bash ran every statement, and with set -e Bash stopped after the first non-zero exit code. What if we put our statements in a subshell?

$ cat script3.sh
#!/bin/bash
(
    echo "Statement 1"
    (exit 3)
    echo "Statement 2"
)

$ ./script3.sh
Statement 1
Statement 2

$ echo "$?"
0

$ cat ./script4.sh
#!/bin/bash
set -e
(
    echo "Statement 1"
    (exit 3)
    echo "Statement 2"
)

$ ./script4.sh
Statement 1

$ echo "$?"
3

And again, that's what we expected. The subshell made no difference. Note that set -e does propagate into the subshell. Alright. What if we mask the subshell's exit code?

$ cat script5.sh
#!/bin/bash
set -e
(
    echo "Statement 1"
    (exit 3)
    echo "Statement 2"
) || exit 9

$ ./script5.sh
Statement 1
Statement 2

$ echo "$?"
0

And again everything works as...

Wait, what?

Okay. Maybe set -e doesn't propagate?

$ cat script6.sh
#!/bin/bash
set -e
(
    set -e
    echo "Statement 1"
    (exit 3)
    echo "Statement 2"
) || exit 9

$ ./script6.sh
Statement 1
Statement 2

$ echo "$?"
0

$ cat script7.sh
#!/bin/bash
(
    set -e
    echo "Statement 1"
    (exit 3)
    echo "Statement 2"
) || exit 9

$ ./script7.sh
Statement 1
Statement 2

$ echo "$?"
0

Nope. Still doesn't work. Even if we just have set -e in the subshell and not in the outer script, it doesn't work.

So what's going on?

Well, if we dig into the Bash man page, we find this excerpt about set -e:

Exit immediately if a pipeline (which may consist of a single simple command), a list, or a compound command (see SHELL GRAMMAR above), exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test following the if or elif reserved words, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command's return value is being inverted with !.

So, the spec says that if you're using && or ||, only the last command's exit code can cause the shell to exit. This makes sense, because you expect command_1 || command_2 to execute command_2 if command_1 fails. Without this exception, it would be very hard to have any logical statements when -e is set.
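The list rule quoted above can be checked with a few one-liners. This is a small sketch rather than anything from the original test scripts. Each snippet runs in its own child bash so that an early exit is observable from the outside, and the variable names are only for illustration.

```shell
#!/bin/bash
# Each snippet runs in its own child bash so an early exit is observable.

# A failure on the left of || never triggers -e; the right side runs.
out_or=$(bash -c 'set -e; false || echo "fallback ran"; echo "still running"')

# A failure that is NOT the final command of an && list is also exempt.
out_and=$(bash -c 'set -e; false && echo "skipped"; echo "still running"')

# A failure in the FINAL command of the list does trigger -e.
status_last=0
out_last=$(bash -c 'set -e; true && false; echo "never printed"') || status_last=$?

printf '%s\n' "$out_or" "$out_and" "exit status of third snippet: $status_last"
```

The second case is the one that tends to surprise people: `false && something` quietly evaluates to a failure, yet the script carries on.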

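The behavior we just witnessed is, therefore, Working as Intended™. When we try to mask the subshell's exit code, we put the subshell at the start of an || list. So, the subshell's exit code will not cause an exit despite -e being set. But every single command inside the subshell is also considered part of the || list, and thus no exit code anywhere in the subshell can cause the subshell to exit. Recent versions of the man page state this directly: when a compound command or shell function executes in a context where -e is being ignored, none of the commands executed inside it are affected by the -e setting, even if -e is set and a command returns a failure status. It's as if set +e is being run implicitly in the subshell, only, as we've seen, we can't override it with an explicit set -e in the subshell.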
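The same rule is not specific to subshells. As a sketch under my own assumptions (build_steps and run_demo are hypothetical names, not from the real scripts), a shell function called on the left of || or as an if condition loses -e in the same way:

```shell
#!/bin/bash
# Sketch: build_steps is a hypothetical stand-in for a multi-step build.
# run_demo runs a child bash whose last line is the invocation passed in $1.
run_demo() {
    bash -c '
set -e
build_steps() {
    echo "step 1"
    false
    echo "step 2"
}
'"$1"
}

# Called as a plain command: -e applies and the function aborts at false.
status_plain=0
out_plain=$(run_demo 'build_steps') || status_plain=$?

# Called on the left of ||: -e is ignored for the whole function body.
status_or=0
out_or=$(run_demo 'build_steps || exit 9') || status_or=$?

# Called as an if condition: same thing, and the build even looks successful.
status_if=0
out_if=$(run_demo 'if build_steps; then echo "build ok"; fi') || status_if=$?

printf 'plain (status %s):\n%s\n' "$status_plain" "$out_plain"
printf 'with || (status %s):\n%s\n' "$status_or" "$out_or"
printf 'in if (status %s):\n%s\n' "$status_if" "$out_if"
```

In the last two cases the function's status is that of its final echo, so the failure in the middle is lost entirely.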

Is there anything we can do to fix this? Well, you're probably better off with one of the approaches I presented earlier. If you need to stay inside the same shell script, the solution with trap below is probably what you want. And if you truly need to use a subshell, I was able to come up with this mess:

$ cat script8.sh
#!/bin/bash
set -e
echo "Some stuff with -e set"

set +e
(
    set -e
    echo "Statement 1"
    (exit 3)
    echo "Statement 2"
)
[[ $? -ne 0 ]] && exit 9
set -e

echo "More code with -e set (unreachable)"

$ ./script8.sh
Some stuff with -e set
Statement 1

$ echo "$?"
9
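Another possible way to keep the subshell, sketched here as an alternative rather than something from the original scripts, is to run it in the background and put only the wait on the left of ||. The subshell itself is then not part of the || list, so -e stays in effect inside it. The wrapper below runs the script in a child bash purely so the result can be captured.

```shell
#!/bin/bash
# Sketch: the subshell runs asynchronously, so it is not part of the || list.
status=0
output=$(bash -c '
set -e
(
    echo "Statement 1"
    (exit 3)
    echo "Statement 2"
) &
wait "$!" || exit 9      # only wait sits on the left of ||

echo "More code with -e set (unreachable)"
') || status=$?

printf '%s\nexit status: %s\n' "$output" "$status"
```

One caveat to check before relying on this: in a non-interactive script, Bash redirects a background command's standard input from /dev/null unless it is redirected explicitly, so build steps that read from stdin may behave differently.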

Bonus Solution

I ended up using trap for error masking:

$ cat script9.sh
#!/bin/bash

set -e

trap 'exit 9' ERR

echo "Statement 1"
(exit 3)
echo "Statement 2"

trap - ERR

$ ./script9.sh
Statement 1

$ echo "$?"
9

The nice part about this solution is it allows you to stick with just one script (useful if you need to use functions), and the logic is straightforward. This option can be harder to make work if you're already using trap ERR for cleanup, though.
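Applied to the original problem, the trap approach could look roughly like the sketch below. The build function is a hypothetical stand-in for the real build steps. The -E flag is my addition: the Bash manual says the ERR trap is normally not inherited by shell functions or subshells unless -E (errtrace) is set, so it matters once the failing command lives inside a function.

```shell
#!/bin/bash
# Sketch: run the demo in a child bash so its exit status can be inspected.
status=0
output=$(bash -c '
set -eE                     # -E lets functions inherit the ERR trap
COMPILE_FAILURE_CODE=79

build() {
    echo "configure"
    false                   # stand-in for a failing make
    echo "unreachable"
}

trap "exit $COMPILE_FAILURE_CODE" ERR
build
trap - ERR                  # later failures keep their own exit status

echo "Run some tests..."
') || status=$?

printf '%s\nexit status: %s\n' "$output" "$status"
```

Removing the trap after the build section means a failing test later in the script exits with its own status, which keeps the "79 means compilation failure" convention intact.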