Friday, June 5, 2020

Golang Shutdown Flow



While doing some enhancements to the Golang microservices at work, I came across quite a few calls to logrus.Fatal late in the execution of a service. Some of these services are long-running processes that consume from Kafka and write to GCP Spanner. The problem with logrus.Fatal when called late in these services' lifecycles is that Fatal internally calls os.Exit(1). So let's examine why this is bad for the system in which the service is running.

On the surface, I wonder whether a non-recoverable error encountered late in the process is really "Fatal". The interpretation of what "Fatal" means is subjective. At the beginning of the process -- reading the configuration file, making external system connections -- a failure is Fatal because the service can't even get started. Some precondition failure -- yeah, that's Fatal. But if a service has been happily consuming messages and writing transformed data to a database, only to encounter a non-recoverable error -- is that "Fatal"?

But that's not what I wanted to show you. What I wanted to show you is why logrus.Fatal(...) is the wrong way to shut down a service that has encountered a fatal error.

First some basics: https://play.golang.org/p/b8CAlmiVZPH

package main

import (
	"errors"
	"fmt"
)

func main() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println("chance to recover:", err)
			panic(err)
		} else {
			fmt.Println("nothing to recover")
		}
	}()

	fmt.Println("Hello, playground")

	panic(errors.New("It's a perfect time to panic! -- Woody"))
}


This is basic Golang: the app panics, the deferred function executes, recover() consumes/catches the error, and then we re-panic (a naive solution, but it illustrates the point). The panic is reported, and the application exits with a non-zero status code.

But what about os.Exit(1)? We know that logrus calls os.Exit(int). We need to be aware of the impact on our defer statements when we use os.Exit(int). As noted in the godoc, os.Exit(int) does not run deferred functions. It's just going to shut down: https://play.golang.org/p/fB8GnFxEATs
package main

import (
	"fmt"
	"os"
)

func main() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println("chance to recover:", err)
		} else {
			fmt.Println("nothing to recover")
		}
	}()

	fmt.Println("Hello, playground")

	os.Exit(0)
}

Here the defer does not run. If you were going to gracefully release the connections to Kafka, Spanner, or any other external resource -- that did not happen. For illustration's sake, I also have this example returning the zero (success) status code.

There is another way to exit a Golang application. Well, sort of: runtime.Goexit(). As noted in the godoc, all registered deferred functions will be executed. However, it only exits the calling goroutine. So if this is called in your last remaining goroutine, your service will crash -- in the same fashion as when all goroutines are blocked, causing deadlock: https://play.golang.org/p/z0k56ZMoF8N
package main

import (
	"log"
	"runtime"
)

func main() {
	defer func() {
		if err := recover(); err != nil {
			log.Println("chance to recover:", err)
		} else {
			log.Println("nothing to recover")
		}
	}()

	log.Println("Hello, playground")

	runtime.Goexit()
}

Where does that leave us? Well, it's important to understand the impact of your libraries on the flow of execution. The logrus.Fatal(...) method is definitely useful. Use it with the full understanding of what it is doing. Use it during service initialization, before any defer functions have been registered. Use it when you know you want defer statements to be skipped.

Bonus:


It is important, when fixing these sorts of shutdown problems, to recognize the significance of your service's exit code. Your service's exit code is its last communication with the software architecture -- its dying breath, used to wheeze out one little death rattle. Are you running your services in Docker? Kubernetes?

The service exit code communicates to the container runtime whether the service exited successfully or crashed with an error. In those situations where you were calling logrus.Fatal, you likely do not want to just log the error and return normally. That would have your service exit with a zero status code, communicating success to the rest of the system. Make sure you also take into account the pod and its configured restartPolicy. If your service is shutting down because of a non-recoverable error, you likely want to wheeze out a death rattle of non-zero.