When e2e failed, background daemons are still running #397

Open
Tracked by #5
bap2pecs opened this issue Jun 21, 2024 · 2 comments

@bap2pecs
Contributor

This gave us a lot of trouble when debugging a failed test:

$ make test-e2e-op                                                                             [18:13:35]
cd tools; \
        go install -trimpath github.com/babylonchain/babylon/cmd/babylond
go test -mod=readonly -timeout=25m -v github.com/babylonchain/finality-provider/itest github.com/babylonchain/finality-provider/itest/opstackl2 -count=1 --tags=e2e_op
?       github.com/babylonchain/finality-provider/itest [no test files]
=== RUN   TestSubmitFinalitySignature
service injective.evm.v1beta1.Msg does not have cosmos.msg.v1.service proto annotation
service injective.evm.v1beta1.Msg does not have cosmos.msg.v1.service proto annotation
    test_manager.go:105: Babylon node is started
2024/06/21 18:17:56 Cannot remove dir 1
2024/06/21 18:17:56 Cannot remove dir 2
    test_manager.go:110: 
                Error Trace:    /Users/<redacted>/Documents/Projects/babylon-finality-provider/itest/opstackl2/test_manager.go:110
                                                        /opt/homebrew/Cellar/go/1.22.4/libexec/src/runtime/panic.go:770
                                                        /Users/<redacted>/Documents/Projects/babylon-finality-provider/cosmwasmclient/client/keys.go:18
                                                        /Users/<redacted>/Documents/Projects/babylon-finality-provider/itest/opstackl2/e2e_test.go:42
                                                        /Users/<redacted>/Documents/Projects/babylon-finality-provider/itest/opstackl2/e2e_test.go:77
                Error:          Received unexpected error:
                                exit status 1
                Test:           TestSubmitFinalitySignature
--- FAIL: TestSubmitFinalitySignature (1.88s)
FAIL
FAIL    github.com/babylonchain/finality-provider/itest/opstackl2       2.833s
FAIL
make: *** [test-e2e-op] Error 1

Then we realized it was because some processes were still running:

$ ps                                                                                           [18:17:56]
  PID TTY           TIME CMD
 8321 ttys001    2:12.87 babylond start --home=/var/folders/9_/q4wsdnh14_s60_74cd2rbztm0000gp/T/zBabylonTest2191261572/node0/babyl
 8329 ttys001    2:17.42 wasmd start --home /var/folders/9_/q4wsdnh14_s60_74cd2rbztm0000gp/T/ZWasmdTest3482039778 --rpc.laddr tcp:
92472 ttys001    0:00.29 /bin/zsh -il
99267 ttys033    0:00.21 -zsh

We found that the panic happened inside this function:

func (n *babylonNode) stop() (err error) {
	if n.cmd == nil || n.cmd.Process == nil {
		// return if not properly initialized
		// or error starting the process
		return nil
	}

	defer func() {
		err = n.cmd.Wait()
	}()

	if runtime.GOOS == "windows" {
		return n.cmd.Process.Signal(os.Kill)
	}
	return n.cmd.Process.Signal(os.Interrupt)
}

We should have a better way to deal with this here.

@bap2pecs
Contributor Author

bap2pecs commented Jun 30, 2024

Today we found that it's caused by code like this:

func fatal(err error) {
	fmt.Fprintf(os.Stderr, "[fpd] %v\n", err)
	os.Exit(1)
}

When os.Exit() is called, the process terminates immediately without running deferred functions: os.Exit never returns to the calling function, so the defer mechanism is bypassed.

So code like this is never executed:

func (ctm *OpL2ConsumerTestManager) Stop(t *testing.T) {
	var err error
	err = ctm.FpApp.Stop()
	require.NoError(t, err)
	err = ctm.BabylonHandler.Stop()
	require.NoError(t, err)
	ctm.EOTSServerHandler.Stop()
}

This leaves some processes dangling.

cc @SebastianElvis

@SebastianElvis
Member

Yeah, we are aware of this issue, and great work finding the root cause! It looks like we need a more graceful way to terminate the program than os.Exit(1).
