-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSDK-8765 - Add CloudMetadata to MachineStatus #4637
base: main
Are you sure you want to change the base?
Conversation
…metadata-to-maichine-status
// Status encapsulates a resource name along with state transition metadata. | ||
type Status struct { | ||
// NodeStatus encapsulates a resource name along with state transition metadata. | ||
type NodeStatus struct { | ||
Name Name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also move Name out of NodeStatus and into resource.Status, but keeping it there for now since I think it would be good to have the name come from the node and not the graph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://viam.atlassian.net/browse/RSDK-9550 is the ticket, marked it as low since it's low-impact
} | ||
|
||
func (w *GraphNode) resourceStatus() Status { | ||
var resName Name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the issue here is that test objects are not very flexible and there is not a good way to pass in the correct node name to the resource (especially if going across "remotes")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate on why this won't match the resource name? Is it only in the case of remote resources? Just want an example of a mismatch so I better understand in which cases the original config.Name
or current.Name()
are incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mock remote resources in tests - especially how I've opted to not actually spin up real servers, the "remote" resources are real resources and no way to get the names to be accurate.
robot/impl/local_robot.go
Outdated
if closeCtx.Err() != nil { | ||
return | ||
} | ||
if !rOpts.disableBackgroundReconfiguration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not strictly necessary, but added as a convenience to provide better guarantees to our tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my comment below, could we rename this option disableCompleteConfigWorker
? The goroutine here isn't fully Reconfigure
ing, it's only calling completeConfig
and updateWeakDependents
.
robot/impl/local_robot.go
Outdated
@@ -1157,7 +1159,7 @@ func (r *localRobot) reconfigure(ctx context.Context, newConfig *config.Config, | |||
if newConfig.Cloud != nil { | |||
r.Logger().CDebug(ctx, "updating cached config") | |||
if err := newConfig.StoreToCache(); err != nil { | |||
r.logger.CErrorw(ctx, "error storing the config", "error", err) | |||
r.logger.CDebugw(ctx, "error storing the config", "error", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flyby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be a real error scenario? Failure to cache the new config to disk?
md, _ := r.CloudMetadata(ctx) //nolint:errcheck | ||
for _, resourceStatus := range r.manager.resources.Status() { | ||
// if the resource is local, we can use the status as is and attach the cloud metadata of this robot. | ||
if !resourceStatus.Name.ContainsRemoteNames() && resourceStatus.Name.API != client.RemoteAPI { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nodes for remote robots also count as remote
|
||
// defaultRemoteMachineStatusTimeout is the default timeout for getting resource statuses from remotes. This prevents | ||
// remote cycles from preventing this call from finishing. | ||
var defaultRemoteMachineStatusTimeout = time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
open to keeping this simpler and not having this timeout as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prevents remote cycles
I wouldn't have expected this timeout to be "preventing" remote cycles, but I do think we need timeouts for any remote interactions for when the remote is unexpectedly offline. How does this timeout "prevent" cycles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this timeout "prevent" cycles by basically adding an upper limit to how long this call can run for - if it is an infinite loop (remote1 points to remote2 and remote2 points to remote1), at least the call will return after a minute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Some initial comments + qs.
} | ||
|
||
func (w *GraphNode) resourceStatus() Status { | ||
var resName Name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate on why this won't match the resource name? Is it only in the case of remote resources? Just want an example of a mismatch so I better understand in which cases the original config.Name
or current.Name()
are incorrect.
robot/impl/local_robot.go
Outdated
if closeCtx.Err() != nil { | ||
return | ||
} | ||
if !rOpts.disableBackgroundReconfiguration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my comment below, could we rename this option disableCompleteConfigWorker
? The goroutine here isn't fully Reconfigure
ing, it's only calling completeConfig
and updateWeakDependents
.
robot/impl/local_robot.go
Outdated
@@ -1157,7 +1159,7 @@ func (r *localRobot) reconfigure(ctx context.Context, newConfig *config.Config, | |||
if newConfig.Cloud != nil { | |||
r.Logger().CDebug(ctx, "updating cached config") | |||
if err := newConfig.StoreToCache(); err != nil { | |||
r.logger.CErrorw(ctx, "error storing the config", "error", err) | |||
r.logger.CDebugw(ctx, "error storing the config", "error", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be a real error scenario? Failure to cache the new config to disk?
|
||
// defaultRemoteMachineStatusTimeout is the default timeout for getting resource statuses from remotes. This prevents | ||
// remote cycles from preventing this call from finishing. | ||
var defaultRemoteMachineStatusTimeout = time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prevents remote cycles
I wouldn't have expected this timeout to be "preventing" remote cycles, but I do think we need timeouts for any remote interactions for when the remote is unexpectedly offline. How does this timeout "prevent" cycles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, just a couple nits.
@@ -287,6 +287,34 @@ func (r *localRobot) sendTriggerConfig(caller string) { | |||
} | |||
} | |||
|
|||
func (r *localRobot) completeConfigWorker() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] I realize you have the comment above the calling site of this method, but a comment above the method describing what it does would also be super helpful for posterity.
@@ -20,6 +20,9 @@ type options struct { | |||
shutdownCallback func() | |||
|
|||
enableFTDC bool | |||
|
|||
// disableCompleteConfigWorker starts the robot without any background processes - should only be used for tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// disableCompleteConfigWorker starts the robot without any background processes - should only be used for tests. | |
// disableCompleteConfigWorker starts the robot without the complete config worker - should only be used for tests. |
@@ -82,3 +85,10 @@ func WithShutdownCallback(shutdownFunc func()) Option { | |||
o.shutdownCallback = shutdownFunc | |||
}) | |||
} | |||
|
|||
// withDisableCompleteConfigWorker returns a Option which disables background reconfiguration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// withDisableCompleteConfigWorker returns a Option which disables background reconfiguration. | |
// withDisableCompleteConfigWorker returns an Option which disables the complete config worker. |
original PR: #4637
Tested a few scenarios
Local -> remote1, works
local -> remote1 -> remote2 also works