-
-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix hanging after exit is sent #4152 #4153
Conversation
We need to send the update of config to reactor thread to avoid the race condition. |
I don't understand how the fix works (will leave review to people who understand it better) but great that you managed to resolve it. |
I have not measured, But I don't think there is much, only 3 sec for each time a server trying to close when there is config update going on, most test won't update the config, and there might be long enough time before we exit for the loading of the initial config. Instead of the initial motivation to speed up the testcase. The PR is rather a bug fix that is to solve the potential dead lock situation from happenning during the shutdown and exit of HLS. |
Hmm, this is tricky. I'm not sure this is quite the right fix. Can we instead make it so that we throw in this case instead of hanging? |
Alternatively we could make the shutdown handler take away the ideStateVar to prevent the race condition. This would not change the thread the configupdate callback is running on,since we have only one users for this ideStateVar. I think we should fix this race condition one way or the other. |
I think that's a better approach. But I think maybe we can work at a lower level of granularity? Basically, I think the pattern we want is something like this:
In this case I think you've narrowed it down to the
What you're suggesting with the
So I think it would be best to try and localise the guard more. Ideally we would just make the thing that currently hangs fail instead. I don't know exactly where that is, but maybe you do? |
7aa27ac
to
dc53035
Compare
Yes, I agree about this. In summary, we actually have two problem to solve here.
I have update this branch with the gurad fix.
After the our gurad fix, we would have a second user: the shutdown handler. So there would be a second functionality for the
After writing the above, I believe |
Did you manage to figure out exactly which function is hanging?
Again, I'm not sure this is necessarily a problem. What happens if we restart the shake session after it is closed? It should fail! So I think it's fine if we allow "higher-level" things to run, so long as the "lower-level" things that can't work fail appropriately.
... and now this is an invariant we have to remember to use whenever we work with |
🤔 If we only make the "lower-level" to fail instead of hanging,
The exact function that is hanging: Err... The truth is even wilder than my imagination, the reactor thread do not even have time to ship out And |
I am creating a minimal example to see if I can reproduce the hanging. This is a minimal example of the hanging: {-# LANGUAGE RecordWildCards #-}
{-# LANGUAGE PackageImports #-}
{-# LANGUAGE LambdaCase #-}
import Control.Concurrent.STM
import Control.Monad (forever)
import Data.Functor (void)
import Control.Monad.IO.Unlift (MonadUnliftIO)
import UnliftIO.Concurrent
import UnliftIO (async, waitAnyCancel)
import GHC.IO (unsafeInterleaveIO)
import System.Timeout (timeout)
type Que = TQueue String
runWithDb :: (Que -> IO ()) -> IO ()
runWithDb k = do
q <- newTQueueIO
k q
main :: IO ()
main = do
queVar <- newEmptyMVar
lifetime <- newEmptyMVar
die <- newEmptyMVar
mainQue <- newTQueueIO
~que <- unsafeInterleaveIO $ takeMVar queVar
_ <- flip forkFinally (const $ pure ()) $ do
untilMVar lifetime $ runWithDb $ \q -> do
threadDelay 1000000
putMVar queVar q
forever $ do
putStrLn "Waiting for message"
msg <- atomically $ readTQueue mainQue
putStrLn $ "Received: " ++ msg
putMVar lifetime ()
_ <- forkIO $ do
x <- atomically $ readTQueue que
putStrLn $ "Received: " ++ x
putMVar die ()
timeout 3000000 (takeMVar die) >>= \case
Just _ -> putStrLn "Exiting"
Nothing -> putStrLn "Timeout in 3 s"
return ()
untilMVar :: MonadUnliftIO m => MVar () -> m () -> m ()
untilMVar mvar io = void $ do
waitAnyCancel =<< traverse async [ io , readMVar mvar ] |
Sure, but then we can work from there, e.g. we can ask: should we be catching exceptions during shutdown, or in the config handler, or whatever. At least the problem is not masked!
Huh, that is weird. AFAIK writing to a I also note that there's some stuff in there about |
You are right writing the TChan won't hang if the We have a very interesting situation: |
I also am not sure how that can happen! Surely creating the queue and then writing to it should be fine? |
Notice the computation of the If the thread that would fill the I am not sure why AFAIK, It should be fine if we remove |
Yeah, that's super weird and I don't know why we do it like that. If it works the other way that looks reasonable to me, what do you think @wz1000 ? In general it seems like anywhere we have a The other question of course is why the hanging thread doesn't get killed. I think it's probably because we process notifications synchronously, so if we get stuck processing a notification we will never see the I think we could also make this simpler on the |
@michaelpj |
That was just about the generic shutdown stuff. I would like to understand why there is a hang here. But it seems like at least one thing we should try is your commit that gets rid of the weirdness. Does that pass the tests? And does it solve the problem? (I still think a comment from @wz1000 would be helpful here.) |
Yes, it passes the test and solves the problem. Simply put, it hangs because we are trying to So two things have been done here so far:
The main concern is that removal of Also I would like to know if there would be unwanted side effect shipping the |
10eb6f1
to
f387ccd
Compare
Trying to fix #4152
Reactor thread offered a db connect to hieDB, if it exist, the db connection exit with it.
Operation might need the db connection should only run if the reactor thread is still running.
Otherwise a race condition might happened:
Shutdown handler would stop the reactor thread, cancel the shake seesion.
If an operation is still trying to reach the db, in this case, the config update callback would try to call
setSomethingModified
which need db connections. It hangs HLS forever.This PR ships the
config update callback
to the reactor thread to fix the deadlock.Alternative would be to make the shutdown handler take away the ideaVar.