
fix: logout discover dead lock #1085

Merged

merged 1 commit into go1.19 from fix/logout_discover_dead_lock on Apr 15, 2024
Conversation

@qfrank
Collaborator

qfrank commented Apr 10, 2024

Please see the related issue, including its comments, for reference. This PR fixes the issue.

NOTE: creating a new goroutine for every call to PushToChan might introduce overhead, especially under high load. I'm not sure if we could instead use sync/atomic for simple state updates, or maybe redesign the service to minimize shared state and reduce the need for locks. But I think this PR is a good starting point for the discussion 🙂
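For illustration, a minimal sketch of what "a new goroutine for every call to PushToChan" could look like; discoveryService and pushAsync are hypothetical stand-ins for this discussion, not the actual diff or the go-waku types:

package main

import (
	"context"
	"fmt"
	"time"
)

type PeerData = string

type discoveryService struct {
	ctx     context.Context
	channel chan PeerData
}

// pushAsync never blocks the caller: the send happens in its own goroutine,
// which gives up once the service context is cancelled. Under high load this
// costs one extra goroutine per call, which is the overhead concern above.
func (sp *discoveryService) pushAsync(data PeerData) {
	go func() {
		select {
		case sp.channel <- data:
		case <-sp.ctx.Done():
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	sp := &discoveryService{ctx: ctx, channel: make(chan PeerData)}

	sp.pushAsync("alice") // returns immediately even though nobody is reading

	cancel() // the pending goroutine exits via ctx.Done()
	time.Sleep(100 * time.Millisecond)
	fmt.Println("the caller was never blocked")
}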

qfrank requested review from cammellos and vitvly on April 10, 2024 12:48
qfrank self-assigned this on Apr 10, 2024
@status-im-auto

status-im-auto commented Apr 10, 2024

Jenkins Builds

Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ e66538e #1 2024-04-10 12:52:57 ~4 min nix-flake 📄log
✔️ ea59f76 #2 2024-04-11 10:33:47 ~2 min nix-flake 📄log

@richard-ramos
Member

richard-ramos commented Apr 10, 2024

I'm confused. The code as it is right now is cancelable. Check this small example in the Go playground that simulates the behavior of the current logic:

package main

import (
	"context"
	"fmt"
	"time"
)

type PeerData = string

func main() {

	ctx, cancel := context.WithCancel(context.Background())
	channel := make(chan PeerData)

	// This simulates a blocking read
	go func() {

		time.Sleep(10 * time.Second)
		thePeerData := <-channel
		fmt.Println("received peerData " + thePeerData)
	}()

	// This simulates logging out
	go func() {
		time.Sleep(2 * time.Second)
		cancel()
	}()

	somePeerData := PeerData("alice")

	select {
	case channel <- somePeerData:
		fmt.Println("Pushed peerdata into channel succesfully")

	case <-ctx.Done():
		fmt.Println("Cancel was executed")
	}
}

It should print Cancel was executed

@cammellos
Contributor

I'm confused. The code as it is right now is cancelable. […]

@qfrank @richard-ramos one thing I noticed is that the channel is unbuffered and it's consumed outside the struct.
I think a better approach would be to make the channel buffered and to publish on it in a non-blocking way; otherwise the code will lock up if the listener (which you don't control, since it's outside the struct) doesn't drain the channel.
Could that be the cause of the deadlock?
i.e. the listener stops listening to the channel before the context is done, but after we try to publish.
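A minimal sketch of the buffered, non-blocking publish idea; the buffer size and the publish helper are illustrative assumptions, not the go-waku API:

package main

import "fmt"

type PeerData = string

func main() {
	// Buffered channel: sends succeed until the buffer fills, even if the
	// listener is slow or has stopped reading.
	channel := make(chan PeerData, 16)

	// Non-blocking publish: if the buffer is full, report failure (or drop/log)
	// instead of blocking the discovery loop forever.
	publish := func(data PeerData) bool {
		select {
		case channel <- data:
			return true
		default:
			return false
		}
	}

	fmt.Println(publish("alice")) // true: the buffer has room
}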

@cammellos
Contributor

I'm confused. The code as it is right now is cancelable. […]

@qfrank @richard-ramos one thing I noticed is that the channel is unbuffered and it's consumed outside the struct. […]

Ah, never mind, I noticed that's addressed by @richard-ramos. But one other thing I noticed is that we export mutexes from CommonService. If we do that, all bets are off; mutexes should ideally only be used for managing internal state. Most likely the cause is that CommonService and CommonDiscoveryService share a mutex, so the best way forward seems to be to rethink how the two coordinate and avoid sharing locks.

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

Sorry for the confusion (I thought this was explicit); the deadlock is actually not related to CommonDiscoveryService.channel. Let me explain how the deadlock was created.
After logout is called:
seedBootnodesForDiscV5 gets invoked first, but it hangs here (sp.WaitGroup().Wait()); goroutine details:

#	0x1024ee17b	sync.runtime_Semacquire+0x2b								/opt/homebrew/Cellar/[email protected]/1.20.11/libexec/src/runtime/sema.go:62
#	0x1024ff3e7	sync.(*WaitGroup).Wait+0x77								/opt/homebrew/Cellar/[email protected]/1.20.11/libexec/src/sync/waitgroup.go:116
#	0x1030206d3	github.com/waku-org/go-waku/waku/v2/service.(*CommonDiscoveryService).Stop.func1+0x43	/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_discovery_service.go:43
#	0x103020d97	github.com/waku-org/go-waku/waku/v2/service.(*CommonService).Stop+0x77			/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_service.go:55
#	0x10302065f	github.com/waku-org/go-waku/waku/v2/service.(*CommonDiscoveryService).Stop+0x3f		/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_discovery_service.go:41
#	0x10304a733	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).Stop+0x63			/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:246
#	0x1033df8bb	github.com/status-im/status-go/wakuv2.(*Waku).restartDiscV5+0xfb			/Users/frank/development/go/src/github.com/status-im/status-go/wakuv2/waku.go:1677
#	0x1033df093	github.com/status-im/status-go/wakuv2.(*Waku).seedBootnodesForDiscV5+0x453

We can see that sp.commonService.Stop acquired the lock first with sp.Lock().

Then another goroutine:

1 @ 0x1024bdcb4 0x1024cfe6c 0x1024cfe49 0x1024ee2c8 0x1030209dc 0x103020989 0x103020768 0x10304bfbc 0x10304b4b0 0x10304bea0 0x10304c3a4 0x10304a5ac 0x1024f2d24
#	0x1024ee2c7	sync.runtime_SemacquireRWMutexR+0x27							/opt/homebrew/Cellar/[email protected]/1.20.11/libexec/src/runtime/sema.go:82
#	0x1030209db	sync.(*RWMutex).RLock+0x6b								/opt/homebrew/Cellar/[email protected]/1.20.11/libexec/src/sync/rwmutex.go:71
#	0x103020988	github.com/waku-org/go-waku/waku/v2/service.(*CommonDiscoveryService).RLock+0x18	/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_discovery_service.go:66
#	0x103020767	github.com/waku-org/go-waku/waku/v2/service.(*CommonDiscoveryService).PushToChan+0x57	/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_discovery_service.go:52
#	0x10304bfbb	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).peerLoop.func1+0xab		/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:444
#	0x10304b4af	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).Iterate+0x39f			/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:337
#	0x10304be9f	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).peerLoop+0x13f		/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:437
#	0x10304c3a3	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).runDiscoveryV5Loop+0x133	/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:474
#	0x10304a5ab	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).start.func1+0x6b

PushToChan was trying to call sp.RLock(), which blocks, since sp.commonService.Stop hadn't released the lock yet, and sp.commonService.Stop was waiting (sp.WaitGroup().Wait()) for the WaitGroup (runDiscoveryV5Loop) to finish. @richard-ramos @cammellos
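A stripped-down reproduction of that interaction, in the same playground style as the example above; service, stop and pushToChan are stand-ins for the go-waku types, assuming Stop holds the write lock while it waits and PushToChan needs the read lock:

package main

import (
	"fmt"
	"sync"
	"time"
)

type service struct {
	mu sync.RWMutex
	wg sync.WaitGroup
}

// stop mirrors CommonService.Stop: it takes the write lock, then waits for the
// loop goroutine to finish while still holding that lock.
func (s *service) stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.wg.Wait() // never returns: the loop goroutine is stuck below
}

// pushToChan mirrors CommonDiscoveryService.PushToChan: it needs the read
// lock, which stop is already holding.
func (s *service) pushToChan() {
	s.mu.RLock() // blocks forever once stop holds the write lock
	defer s.mu.RUnlock()
}

func main() {
	s := &service{}

	s.wg.Add(1)
	go func() { // simulates runDiscoveryV5Loop -> peerLoop -> PushToChan
		defer s.wg.Done()
		time.Sleep(time.Second) // let stop grab the write lock first
		s.pushToChan()
	}()

	done := make(chan struct{})
	go func() { s.stop(); close(done) }() // simulates logout -> restartDiscV5 -> Stop

	select {
	case <-done:
		fmt.Println("stopped cleanly")
	case <-time.After(3 * time.Second):
		fmt.Println("deadlock: Stop waits on the WaitGroup while PushToChan waits on the lock")
	}
}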

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

// This is not a mutex protected function, it is up to the caller to use it in a mutex protected context
func (sp *CommonService) ErrOnNotRunning() error {
	if !sp.started {
		return ErrNotStarted
	}
	return nil
}

Can we use sync/atomic for sp.started?
If yes, then we may not need to call sp.RLock() within PushToChan, but we'd need to take care of how the channel data is consumed; I'm trying to solve it this way.
@richard-ramos
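A minimal sketch of the sync/atomic idea (atomic.Bool needs Go 1.19+); commonService here is a hypothetical stand-in, not the real go-waku CommonService:

package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrNotStarted = errors.New("service not started")

type commonService struct {
	started atomic.Bool // hypothetical replacement for the mutex-guarded flag
}

func (sp *commonService) Start() { sp.started.Store(true) }
func (sp *commonService) Stop()  { sp.started.Store(false) }

// ErrOnNotRunning no longer requires the caller to hold any lock.
func (sp *commonService) ErrOnNotRunning() error {
	if !sp.started.Load() {
		return ErrNotStarted
	}
	return nil
}

func main() {
	sp := &commonService{}
	fmt.Println(sp.ErrOnNotRunning()) // service not started
	sp.Start()
	fmt.Println(sp.ErrOnNotRunning()) // <nil>
}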

@cammellos
Contributor

runDiscoveryV5Loop

WaitGroup is exported in other places:

vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go
vendor/github.com/waku-org/go-waku/waku/v2/rendezvous/rendezvous.go

and it's combined with the lock, as you pointed out. We should definitely try to avoid this pattern and keep WaitGroup/lock/etc. coordination for internal usage, or top-down with a single agent; otherwise it's likely going to be whack-a-mole.

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

WaitGroup is exported in other places: […] otherwise it's likely going to be whack-a-mole.

agreed!

Can we use sync/atomic for sp.started?
If yes, then we may not need to call sp.RLock() within PushToChan, but we'd need to take care of how the channel data is consumed; I'm trying to solve it this way.

Do you agree with this as a quick fix? @cammellos

@cammellos
Contributor

Instinctively, I don't think that would work: if there's truly a race condition, sync/atomic will only protect that value and not the writes to channels.
Rather, I think having a lock specific to the channel is better, since it looks like the lock is currently used to cover multiple pieces of state, started and the channel, and that might be the issue. But I'd have to spend far more time to understand it properly.
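A minimal sketch of the "lock specific to the channel" suggestion; the split into stateMu and chanMu is a hypothetical illustration, not the actual go-waku layout:

package main

import (
	"fmt"
	"sync"
)

type PeerData = string

type discoveryService struct {
	stateMu sync.RWMutex // guards started and other service state
	started bool

	chanMu  sync.Mutex // guards only the channel
	channel chan PeerData
}

// pushToChan takes only the channel lock, so it cannot deadlock against a
// Stop that holds stateMu while waiting for goroutines to finish.
func (sp *discoveryService) pushToChan(data PeerData) bool {
	sp.chanMu.Lock()
	defer sp.chanMu.Unlock()
	select {
	case sp.channel <- data:
		return true
	default:
		return false
	}
}

func main() {
	sp := &discoveryService{channel: make(chan PeerData, 1)}
	fmt.Println(sp.pushToChan("alice")) // true
}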

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

Instinctively, I don't think that would work: if there's truly a race condition, sync/atomic will only protect that value and not the writes to channels. […]

Why do we need to protect the writes to channels? Could you explain a little bit? Thanks in advance! I see no reason for it, which is why I went with that quick fix. @cammellos

qfrank force-pushed the fix/logout_discover_dead_lock branch from e66538e to ea59f76 on April 11, 2024 10:31
@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

FYI, the test result shows that the latest commit also fixed the stuck-logout issue. @cammellos @richard-ramos

@cammellos
Contributor

Why do we need to protect the writes to channels? Could you explain a little bit? Thanks in advance! I see no reason for it, which is why I went with that quick fix. @cammellos

Writing to a closed channel would panic, that'd be my guess
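For reference, a runnable illustration of that panic:

package main

import "fmt"

func main() {
	channel := make(chan string, 1)
	close(channel)

	defer func() {
		fmt.Println("recovered:", recover()) // send on closed channel
	}()
	channel <- "alice" // panics: send on closed channel
}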

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

Why do we need to protect the writes to channels? Could you explain a little bit? Thanks in advance! I see no reason for it, which is why I went with that quick fix. @cammellos

Writing to a closed channel would panic, that'd be my guess

If this is the case, I see we already have sp.WaitGroup().Wait(), which ensures that closing the channel happens after the writes to the channel.
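A minimal sketch of the ordering being described: every writer is tracked by the WaitGroup, so closing the channel only after Wait returns cannot race a pending send (illustrative only, not the actual go-waku code):

package main

import (
	"fmt"
	"sync"
)

func main() {
	channel := make(chan string, 2)
	var wg sync.WaitGroup

	for _, name := range []string{"alice", "bob"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			channel <- n // every write finishes before Done is called
		}(name)
	}

	wg.Wait()      // all writers are done
	close(channel) // safe: no send can happen after this point

	for n := range channel {
		fmt.Println(n)
	}
}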

@richard-ramos
Member

richard-ramos left a comment

LGTM
I guess it should be okay to use this atomic... Just wondering whether this operation inside PushToChan is needed at all, and whether we shouldn't instead just pass a context, i.e., remove this:

if err := sp.ErrOnNotRunning(); err != nil {
	return false
}

and change the function signature to:

func (sp *CommonDiscoveryService) PushToChan(ctx context.Context, data PeerData) bool {
	select {
	case sp.channel <- data:
		return true
	case <-ctx.Done():
		return false
	case <-sp.Context().Done():
		return false
	}
}

Perhaps something to be done in future PRs.

@qfrank
Collaborator Author

qfrank commented Apr 15, 2024

Just wondering if this operation inside PushToChan is needed at all

I had the same feeling :), thank you for the review

qfrank merged commit 6d889ca into go1.19 on Apr 15, 2024
4 of 9 checks passed
qfrank deleted the fix/logout_discover_dead_lock branch on April 15, 2024 13:12