Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement connection manager #1021

Conversation

sameh-farouk
Copy link
Member

@sameh-farouk sameh-farouk commented Nov 17, 2024

Connection Pool Manager Implementation

Overview

This PR introduces a connection pool manager for TFChain client to improve connection handling, and reliability. It replaces the previous simple single connection management with a pool management implementation.

Key Features

  • Connection Pooling: Maintains a configurable pool of connections (min/max size)
  • Automatic Recovery: Handles failed connections with automatic replacement
  • Connection Reuse: Efficiently reuses healthy connections
  • Configurable Settings: Customizable pool size, idle timeout, and connection timeouts
  • Thread Safety: thread-safe implementation with lock handling
  • Resource Management: Proper cleanup of idle and unhealthy connections
  • Tests:Added comprehensive test suite including stress tests

Are these changes back-compatible with old code

The Manager interface has been updated with new methods. The changes were almost compatible with the old code by opting for having aliases for deprecated/renamed methods and setting defaults for the pool configuration. However:

  • It's important to call the new mgr.Close() method when done with the manager.
  • If the defaults are not working for you, you should add a custom configuration by constructing the manager using the new NewManagerWithConfig function.

Deprecated Methods

No immediate action is required atm but these methods may be removed in the future

  • mgr.Raw() is deprecated
  • mgr.Substrate() is depricated and replaced with mgr.GetConnection().
  • sub.Close() is deprecated and replaced with sub.Release().

Checklist:

  • My change requires a change to the documentation and I have updated it accordingly
  • I have added tests to cover my changes.

@sameh-farouk sameh-farouk marked this pull request as ready for review November 18, 2024 03:24
@xmonader xmonader requested a review from iwanbk November 18, 2024 11:07
Comment on lines +286 to +287
urls := shuffle(m.unusedURLs())
urls = append(urls, m.urls...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it here, isn't the urls returned from unusedURLs are already in m.urls? why not just use m.urls? and why shuffling them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When creating new connections, I want to prioritize unused URLs but won’t restrict attempts to them if all unused URLs are down. also, unusedURLs() can return 0 URLs.
Shuffling was implemented to distribute loads. Without shuffling, when an RPC node goes down, all nodes would attempt to connect to the same URL, causing spikes and high load.

Comment on lines +196 to +198
if conn := m.getExistingConn(); conn != nil {
return conn, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're trying to get an existing connection here and then retrying again in the backoff retry function, should we remove this and only use the one in the retry

Copy link
Contributor

@rawdaGastan rawdaGastan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • didn't review tests
  • to be continued

// be behind before we don't accept it. block time is 6 seconds, so
// right now we only allow 2 blocks delay
acceptableDelay = 2 * 6 * time.Second
var (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why comments are removed

Copy link
Contributor

@xmonader xmonader Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely restore the comments in the PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to clean up my own comments but accidentally deleted old ones as well, thanks to dump code editor plugins :(

return err == nil
}

func (pc *poolConn) close() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need to update lastUsed and isUse?

// so instead we just check the health when we need to and do the maintanance in demand
// HealthCheckInterval time.Duration
// Timeout for creating new connections
ConnectionTimeout time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't get it? why setting a timeout for creating connections?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically, this is the behavior you want.
Mainly to avoid waiting indefinitely on a stuck connection instead of attempting to reconnect.
Also, this parameter is not used at all and should be cleaned up.
The reason I removed the timeout logic (not completely as you noticed) is that I found it already implemented in the inner substrate client out of the box, so it wasn't needed in our code.

// Minimum number of connections to maintain
MinPoolSize int
// Maximum time a connection can be idle before being closed (if the pool has more than MinPoolSize)
MaxIdleTime time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if I reached it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rawdaGastan the connection should get closed as far as i understand


type manager struct {
urls []string
pool []*poolConn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pools?

t, err := getTime(cl, meta)
if err != nil || time.Since(t) > AcceptableDelay {
cl.Client.Close()
return nil, nil, fmt.Errorf("node health check failed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not embedding the error?

if s.conn != nil && s.conn.isHealthy() {
conn := s.conn.conn
meta := s.conn.meta
s.conn.lastUsed.Store(time.Now().Unix())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we update inUse field here?

Msg("checking for healthy connections")

// Try getting existing connection first
if conn := m.getExistingConn(); conn != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why trying to get connection before using backoff.Retry?

return unused
}

func (m *manager) aquiredConnCount() int {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it ia about inUse count not all connections>

log.Debug().Msg("ensuring minimum connections in the pool")
inUseCount := m.aquiredConnCount()
urls := shuffle(m.unusedURLs())
urls = append(urls, m.urls...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why appending shuffled unused and all manager urls?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment here

}

func (m *manager) getHealthyConn() (*poolConn, error) {
log.Debug().Int("pool_size", len(m.pool)).Int("aquired_count", m.aquiredConnCount()).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw m.pool is accessed here without any locking which leads to data race. Try running the tests with -race flag

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also typo aquired should be acquired

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it does an io, it should have a context.

mgrImpl.mu.RLock()
defer mgrImpl.mu.RUnlock()

assert.LessOrEqual(t, len(mgrImpl.pool), mgrImpl.config.MinPoolSize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't len(mgrImpl.pool) be greater or equal the the MinPoolSize?

// right now we only allow 2 blocks delay
acceptableDelay = 2 * 6 * time.Second
var (
ErrInvalidVersion = fmt.Errorf("invalid version")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For static errors use only errors.New, fmt.Errorf communicates the intention of either wrapping or formatting

//ErrNotFound is returned if an object is not found
ErrNotFound = fmt.Errorf("object not found")
const (
AcceptableDelay = 2 * 6 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHy exported?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and restore the comment please

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed shouldn't be exported since it's an internal implementation

@@ -37,145 +36,437 @@ type Versioned struct {
type Conn = *gsrpc.SubstrateAPI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This alias is misleading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your suggestions here?

@@ -37,145 +36,437 @@ type Versioned struct {
type Conn = *gsrpc.SubstrateAPI
type Meta = *types.Metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

conn: conn,
meta: meta,
url: url,
lastUsed: atomic.Int64{},
Copy link
Contributor

@xmonader xmonader Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialized to what? use .Store with unix timestamp

meta: meta,
url: url,
lastUsed: atomic.Int64{},
inUse: atomic.Bool{},
Copy link
Contributor

@xmonader xmonader Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use .Store(false)

url[i], url[j] = url[j], url[i]
})
// Validate and adjust configuration
if config.MaxPoolSize < 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these validations should fail and never swap, override values provided by the user, we need to tell them they're wrong instead

ctx: ctx,
cancel: cancel,
config: config,
checkChan: make(chan struct{}, 1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why buffered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buffer size of 1 is sufficient because we only need to know if a check is pending or not.
If a check is already pending, other goroutines are not blocked (Non-blocking sends).
Also, Multiple check requests are coalesced, collapsing multiple requests into a single check operation (reduces contention).

 select {
    case m.checkChan <- struct{}{}: // Non-blocking send
        log.Debug().Msg("triggered connection check")
    default: // If channel is full, skip
        log.Debug().Msg("connection check already pending")
    }

Note that initially, I have the health check execute on demand—immediately when a connection fails—and periodically, however, I later reconsidered the impact of the side effects on the load and the actual benefits.

if len(url) == 0 {
panic("at least one url is required")
func NewManagerWithConfig(config ManagerConfig, urls ...string) Manager {
if len(urls) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never panic on input validation!

m := &manager{
urls: shuffle(urls),
pool: make([]*poolConn, 0, config.MaxPoolSize),
ctx: ctx,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redflag to store ctx on a struct.

urls []string
pool []*poolConn
mu sync.RWMutex
ctx context.Context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt we need that, it may even introduce more problems than it fixes

// the connection must be closed after you are done using it
func (p *mgrImpl) Substrate() (*Substrate, error) {
cl, meta, err := p.Raw()
func (m *manager) GetConnection(ctx context.Context) (*Substrate, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is that ctx used?

poolSize := len(m.pool)

if poolSize < m.config.MinPoolSize || (poolSize < m.config.MaxPoolSize && poolSize == inUseCount) {
if conn, err := m.createConnection(url); err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and in case of error?

func (s *Substrate) Close() {
s.Release()
}

func getTime(cl Conn, meta Meta) (t time.Time, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have context I believe.

if s.conn != nil {
s.conn.inUse.Store(false)
}
s.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's unlock case in 434, and always unlock on 440?

meta Meta
func (m *manager) Close() error {
m.cancel()
m.wg.Wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes we only created it with NewManagerWithConfig and may block forever.

return conn, err
}

func (m *manager) healthChecker() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    for {
        m.mu.Lock()
        for len(m.pool) == 0 {
            // Wait until there is work to do or the context is canceled
            m.cond.Wait()
        }
        m.mu.Unlock()

        // Perform health checks
        select {
        case <-m.ctx.Done():
            return
        default:
            m.checkConnections()
        }
    }

cancel context.CancelFunc
wg sync.WaitGroup
config ManagerConfig
checkChan chan struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace with a condition variable instead

}

// Default configuration
var DefaultConfig = ManagerConfig{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's common I'd prefer to have SubstrateManagerDefaultConfig as a function instead

conn Conn
meta Meta
url string
lastUsed atomic.Int64 // Unix timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe the poolConn is used anywhere without the mutex, so probably better to get rid of all of the atomics.

@iwanbk
Copy link
Member

iwanbk commented Nov 19, 2024

There are few things that i noticed here:

  1. why we need to shuffle? Because it also considering the url used by the pool element. Which means it eventually become round robin.
  2. looks like initializePool is not needed. We can initialize checkChan without buffer, then we can remove the initializePool
  3. Do we really need to have ready-to-use conn? That we need to have MinPoolSize conn. It has some costs:
  • code complexity
  • unnecessary connection creation -> unnecessary CPU usage on both server & client
  1. Do we really need to have this pool in the first place? If we use http, i've made some check:

https://github.com/centrifuge/go-substrate-rpc-client/blob/e3b938563803c6a71043c4e7ba2e0d27c400f514/gethrpc/http.go#L124-L126

I did a quick read on the websocket, but not sure about it.

If we use http, then i think we can simply use one substrate client for each URL

Please double check the substrate client internal

@iwanbk
Copy link
Member

iwanbk commented Nov 19, 2024

OK, found out that we use websocket.

I did a quick read on the websocket, but not sure about it.

We indeed need to have a pool if we use websocket, because it doesn't have the internal pool like the http one.

@iwanbk
Copy link
Member

iwanbk commented Nov 19, 2024

but the docs says the contrary: it is websocket which can do concurrent request

https://docs.substrate.io/build/application-development/#connecting-to-a-node

Most applications use a WebSocket port because a single connection can be used for multiple messages to and from a node. With an HTTP connection, applications can only send and receive responses one message at a time

@sameh-farouk
Copy link
Member Author

sameh-farouk commented Nov 25, 2024

Thanks all for the contributions and comments. there's a lot of work still needed here and some decisions were rushed.

I took a step back and discussed some high-level considerations, like whether we really need pool management, with @xmonader.
We agreed that a full connection pool is likely overkill since we only need a simple failover mechanism.

Complexity Considerations:

  • Fail-over: Simple, easy to understand
  • Pool: More complex, more moving parts

Resource Usage Considerations:

  • Fail-over: Single connection
  • Pool: Multiple connections, background goroutines

Performance Considerations:

  • Fail-over: Can have delays during failover
  • Pool: Better performance under load, instant failover

After considering the trade-offs, we decided that the fail-over mechanism is sufficient.
I will close this PR in favor of a new feature PR with simpler failover.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants