
fix: logout discover dead lock #1085

Merged

merged 1 commit into go1.19 from fix/logout_discover_dead_lock on Apr 15, 2024
Conversation

@qfrank
Collaborator

qfrank commented Apr 10, 2024

Please see the related issue, including its comments, for reference. This PR fixes the issue.

NOTE: creating a new goroutine for every call to PushToChan might introduce overhead, especially under high load. I'm not sure if we could instead use sync/atomic for simple state updates, or maybe redesign the service to minimize shared state and reduce the need for locks. But I think this PR is a good starting point for the discussion 🙂
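For illustration, a minimal sketch of what "a new goroutine for every call to PushToChan" could look like; discoveryService and pushAsync are hypothetical stand-ins for this discussion, not the actual diff or the go-waku types:

package main

import (
	"context"
	"fmt"
	"time"
)

type PeerData = string

type discoveryService struct {
	ctx     context.Context
	channel chan PeerData
}

// pushAsync never blocks the caller: the send happens in its own goroutine,
// which gives up once the service context is cancelled. Under high load this
// costs one extra goroutine per call, which is the overhead concern above.
func (sp *discoveryService) pushAsync(data PeerData) {
	go func() {
		select {
		case sp.channel <- data:
		case <-sp.ctx.Done():
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	sp := &discoveryService{ctx: ctx, channel: make(chan PeerData)}

	sp.pushAsync("alice") // returns immediately even though nobody is reading

	cancel() // the pending goroutine exits via ctx.Done()
	time.Sleep(100 * time.Millisecond)
	fmt.Println("the caller was never blocked")
}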

qfrank requested review from cammellos and vitvly on April 10, 2024 12:48
qfrank self-assigned this on Apr 10, 2024
@status-im-auto

status-im-auto commented Apr 10, 2024

Jenkins Builds

Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ e66538e #1 2024-04-10 12:52:57 ~4 min nix-flake 📄log
✔️ ea59f76 #2 2024-04-11 10:33:47 ~2 min nix-flake 📄log

@richard-ramos
Member

richard-ramos commented Apr 10, 2024

I'm confused. The code as it is right now is cancelable. Check this small example in the Go playground that simulates the behavior of the current logic:

package main

import (
	"context"
	"fmt"
	"time"
)

type PeerData = string

func main() {

	ctx, cancel := context.WithCancel(context.Background())
	channel := make(chan PeerData)

	// This simulates a blocking read
	go func() {

		time.Sleep(10 * time.Second)
		thePeerData := <-channel
		fmt.Println("received peerData " + thePeerData)
	}()

	// This simulates logging out
	go func() {
		time.Sleep(2 * time.Second)
		cancel()
	}()

	somePeerData := PeerData("alice")

	select {
	case channel <- somePeerData:
		fmt.Println("Pushed peerdata into channel succesfully")

	case <-ctx.Done():
		fmt.Println("Cancel was executed")
	}
}

It should print Cancel was executed

@cammellos
Contributor

I'm confused. The code as it is right now is cancelable. […]

@qfrank @richard-ramos one thing I noticed is that the channel is unbuffered and it's consumed outside the struct.
I think a better approach would be to make the channel buffered and to publish on it in a non-blocking way; otherwise the code will lock up if the listener (which you don't control, since it's outside the struct) doesn't drain the channel.
Could that be the cause of the deadlock?
i.e. the listener stops listening to the channel before the context is done, but after we try to publish.
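A minimal sketch of the buffered, non-blocking publish idea; the buffer size and the publish helper are illustrative assumptions, not the go-waku API:

package main

import "fmt"

type PeerData = string

func main() {
	// Buffered channel: sends succeed until the buffer fills, even if the
	// listener is slow or has stopped reading.
	channel := make(chan PeerData, 16)

	// Non-blocking publish: if the buffer is full, report failure (or drop/log)
	// instead of blocking the discovery loop forever.
	publish := func(data PeerData) bool {
		select {
		case channel <- data:
			return true
		default:
			return false
		}
	}

	fmt.Println(publish("alice")) // true: the buffer has room
}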

@cammellos
Contributor

I'm confused. The code as it is right now is cancelable. […]

@qfrank @richard-ramos one thing I noticed is that the channel is unbuffered and it's consumed outside the struct. […]

Ah, never mind, I noticed that's addressed by @richard-ramos. But one other thing I noticed is that we export mutexes from CommonService. If we do that, all bets are off; mutexes should ideally only be used for managing internal state. Most likely the cause is that CommonService and CommonDiscoveryService share a mutex, so the best way forward seems to be to rethink how the two coordinate and avoid sharing locks.

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

Sorry for the confusion (I thought this was explicit); the deadlock is actually not related to CommonDiscoveryService.channel. Let me explain how the deadlock was created.
After logout is called:
seedBootnodesForDiscV5 gets invoked first, but it hangs here (sp.WaitGroup().Wait()); goroutine details:

#	0x1024ee17b	sync.runtime_Semacquire+0x2b								/opt/homebrew/Cellar/[email protected]/1.20.11/libexec/src/runtime/sema.go:62
#	0x1024ff3e7	sync.(*WaitGroup).Wait+0x77								/opt/homebrew/Cellar/[email protected]/1.20.11/libexec/src/sync/waitgroup.go:116
#	0x1030206d3	github.com/waku-org/go-waku/waku/v2/service.(*CommonDiscoveryService).Stop.func1+0x43	/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_discovery_service.go:43
#	0x103020d97	github.com/waku-org/go-waku/waku/v2/service.(*CommonService).Stop+0x77			/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_service.go:55
#	0x10302065f	github.com/waku-org/go-waku/waku/v2/service.(*CommonDiscoveryService).Stop+0x3f		/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_discovery_service.go:41
#	0x10304a733	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).Stop+0x63			/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:246
#	0x1033df8bb	github.com/status-im/status-go/wakuv2.(*Waku).restartDiscV5+0xfb			/Users/frank/development/go/src/github.com/status-im/status-go/wakuv2/waku.go:1677
#	0x1033df093	github.com/status-im/status-go/wakuv2.(*Waku).seedBootnodesForDiscV5+0x453

We can see that sp.commonService.Stop acquired the lock first with sp.Lock().

Then another goroutine:

1 @ 0x1024bdcb4 0x1024cfe6c 0x1024cfe49 0x1024ee2c8 0x1030209dc 0x103020989 0x103020768 0x10304bfbc 0x10304b4b0 0x10304bea0 0x10304c3a4 0x10304a5ac 0x1024f2d24
#	0x1024ee2c7	sync.runtime_SemacquireRWMutexR+0x27							/opt/homebrew/Cellar/[email protected]/1.20.11/libexec/src/runtime/sema.go:82
#	0x1030209db	sync.(*RWMutex).RLock+0x6b								/opt/homebrew/Cellar/[email protected]/1.20.11/libexec/src/sync/rwmutex.go:71
#	0x103020988	github.com/waku-org/go-waku/waku/v2/service.(*CommonDiscoveryService).RLock+0x18	/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_discovery_service.go:66
#	0x103020767	github.com/waku-org/go-waku/waku/v2/service.(*CommonDiscoveryService).PushToChan+0x57	/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/service/common_discovery_service.go:52
#	0x10304bfbb	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).peerLoop.func1+0xab		/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:444
#	0x10304b4af	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).Iterate+0x39f			/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:337
#	0x10304be9f	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).peerLoop+0x13f		/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:437
#	0x10304c3a3	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).runDiscoveryV5Loop+0x133	/Users/frank/development/go/src/github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go:474
#	0x10304a5ab	github.com/waku-org/go-waku/waku/v2/discv5.(*DiscoveryV5).start.func1+0x6b

PushToChan was trying to call sp.RLock(), which blocks, since sp.commonService.Stop hadn't released the lock yet, and sp.commonService.Stop was waiting (sp.WaitGroup().Wait()) for the WaitGroup (runDiscoveryV5Loop) to finish. @richard-ramos @cammellos
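A stripped-down reproduction of that interaction, in the same playground style as the example above; service, stop and pushToChan are stand-ins for the go-waku types, assuming Stop holds the write lock while it waits and PushToChan needs the read lock:

package main

import (
	"fmt"
	"sync"
	"time"
)

type service struct {
	mu sync.RWMutex
	wg sync.WaitGroup
}

// stop mirrors CommonService.Stop: it takes the write lock, then waits for the
// loop goroutine to finish while still holding that lock.
func (s *service) stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.wg.Wait() // never returns: the loop goroutine is stuck below
}

// pushToChan mirrors CommonDiscoveryService.PushToChan: it needs the read
// lock, which stop is already holding.
func (s *service) pushToChan() {
	s.mu.RLock() // blocks forever once stop holds the write lock
	defer s.mu.RUnlock()
}

func main() {
	s := &service{}

	s.wg.Add(1)
	go func() { // simulates runDiscoveryV5Loop -> peerLoop -> PushToChan
		defer s.wg.Done()
		time.Sleep(time.Second) // let stop grab the write lock first
		s.pushToChan()
	}()

	done := make(chan struct{})
	go func() { s.stop(); close(done) }() // simulates logout -> restartDiscV5 -> Stop

	select {
	case <-done:
		fmt.Println("stopped cleanly")
	case <-time.After(3 * time.Second):
		fmt.Println("deadlock: Stop waits on the WaitGroup while PushToChan waits on the lock")
	}
}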

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

// This is not a mutex protected function, it is up to the caller to use it in a mutex protected context
func (sp *CommonService) ErrOnNotRunning() error {
	if !sp.started {
		return ErrNotStarted
	}
	return nil
}

Can we use sync/atomic for sp.started?
If yes, then we may not need to call sp.RLock() within PushToChan, but we'd need to take care of how the channel data is consumed; I'm trying to solve it this way.
@richard-ramos
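A minimal sketch of the sync/atomic idea (atomic.Bool needs Go 1.19+); commonService here is a hypothetical stand-in, not the real go-waku CommonService:

package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrNotStarted = errors.New("service not started")

type commonService struct {
	started atomic.Bool // hypothetical replacement for the mutex-guarded flag
}

func (sp *commonService) Start() { sp.started.Store(true) }
func (sp *commonService) Stop()  { sp.started.Store(false) }

// ErrOnNotRunning no longer requires the caller to hold any lock.
func (sp *commonService) ErrOnNotRunning() error {
	if !sp.started.Load() {
		return ErrNotStarted
	}
	return nil
}

func main() {
	sp := &commonService{}
	fmt.Println(sp.ErrOnNotRunning()) // service not started
	sp.Start()
	fmt.Println(sp.ErrOnNotRunning()) // <nil>
}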

@cammellos
Contributor

runDiscoveryV5Loop

WaitGroup is exported in other places:

vendor/github.com/waku-org/go-waku/waku/v2/discv5/discover.go
vendor/github.com/waku-org/go-waku/waku/v2/rendezvous/rendezvous.go

and it's combined with the lock, as you pointed out. We should definitely try to avoid this pattern and keep WaitGroup/lock/etc. coordination for internal usage, or top-down with a single agent; otherwise it's likely going to be whack-a-mole.

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

WaitGroup is exported in other places: […] otherwise it's likely going to be whack-a-mole.

agreed!

Can we use sync/atomic for sp.started?
If yes, then we may not need to call sp.RLock() within PushToChan, but we'd need to take care of how the channel data is consumed; I'm trying to solve it this way.

Do you agree with this as a quick fix? @cammellos

@cammellos
Contributor

Instinctively, I don't think that would work: if there's truly a race condition, sync/atomic will only protect that value and not the writes to channels.
Rather, I think having a lock specific to the channel is better, since it looks like the lock is currently used to cover multiple pieces of state, started and the channel, and that might be the issue. But I'd have to spend far more time to understand it properly.
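A minimal sketch of the "lock specific to the channel" suggestion; the split into stateMu and chanMu is a hypothetical illustration, not the actual go-waku layout:

package main

import (
	"fmt"
	"sync"
)

type PeerData = string

type discoveryService struct {
	stateMu sync.RWMutex // guards started and other service state
	started bool

	chanMu  sync.Mutex // guards only the channel
	channel chan PeerData
}

// pushToChan takes only the channel lock, so it cannot deadlock against a
// Stop that holds stateMu while waiting for goroutines to finish.
func (sp *discoveryService) pushToChan(data PeerData) bool {
	sp.chanMu.Lock()
	defer sp.chanMu.Unlock()
	select {
	case sp.channel <- data:
		return true
	default:
		return false
	}
}

func main() {
	sp := &discoveryService{channel: make(chan PeerData, 1)}
	fmt.Println(sp.pushToChan("alice")) // true
}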

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

Instinctively, I don't think that would work: if there's truly a race condition, sync/atomic will only protect that value and not the writes to channels. […]

Why do we need to protect the writes to channels? Could you explain a little bit? Thanks in advance! I see no reason for it, which is why I went with that quick fix. @cammellos

qfrank force-pushed the fix/logout_discover_dead_lock branch from e66538e to ea59f76 on April 11, 2024 10:31
@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

FYI, the test result shows that the latest commit also fixed the stuck-logout issue. @cammellos @richard-ramos

@cammellos
Contributor

Why do we need to protect the writes to channels? Could you explain a little bit? Thanks in advance! I see no reason for it, which is why I went with that quick fix. @cammellos

Writing to a closed channel would panic, that'd be my guess
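For reference, a runnable illustration of that panic:

package main

import "fmt"

func main() {
	channel := make(chan string, 1)
	close(channel)

	defer func() {
		fmt.Println("recovered:", recover()) // send on closed channel
	}()
	channel <- "alice" // panics: send on closed channel
}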

@qfrank
Collaborator Author

qfrank commented Apr 11, 2024

Why do we need to protect the writes to channels? Could you explain a little bit? Thanks in advance! I see no reason for it, which is why I went with that quick fix. @cammellos

Writing to a closed channel would panic, that'd be my guess

If this is the case, I see we already have sp.WaitGroup().Wait(), which ensures that closing the channel happens after the writes to the channel.
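A minimal sketch of the ordering being described: every writer is tracked by the WaitGroup, so closing the channel only after Wait returns cannot race a pending send (illustrative only, not the actual go-waku code):

package main

import (
	"fmt"
	"sync"
)

func main() {
	channel := make(chan string, 2)
	var wg sync.WaitGroup

	for _, name := range []string{"alice", "bob"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			channel <- n // every write finishes before Done is called
		}(name)
	}

	wg.Wait()      // all writers are done
	close(channel) // safe: no send can happen after this point

	for n := range channel {
		fmt.Println(n)
	}
}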

@richard-ramos
Member

richard-ramos left a comment

LGTM
I guess it should be okay to use this atomic... Just wondering whether this operation inside PushToChan is needed at all, and whether we shouldn't instead just pass a context, i.e., remove this:

if err := sp.ErrOnNotRunning(); err != nil {
	return false
}

and change the function signature to:

func (sp *CommonDiscoveryService) PushToChan(ctx context.Context, data PeerData) bool {
	select {
	case sp.channel <- data:
		return true
	case <-ctx.Done():
		return false
	case <-sp.Context().Done():
		return false
	}
}

Perhaps something to be done in future PRs.

@qfrank
Collaborator Author

qfrank commented Apr 15, 2024

Just wondering if this operation inside PushToChan is needed at all

I had the same feeling :), thank you for the review

qfrank merged commit 6d889ca into go1.19 on Apr 15, 2024
4 of 9 checks passed
qfrank deleted the fix/logout_discover_dead_lock branch on April 15, 2024 13:12