Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-8765 - Add CloudMetadata to MachineStatus #4637

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

cheukt
Copy link
Member

@cheukt cheukt commented Dec 16, 2024

original PR: #4637

Tested a few scenarios
Local -> remote1, works

  • if remote1 is running an old version of viam-server, remote1's resources get a blank cloud metadata
  • if local is not a cloud robot, remote1's resources still get cloud metadata

local -> remote1 -> remote2 also works

  • tried it with remote1 off (after initial connection), remote1 and remote2's resources exists but get a blank cloud metadata
  • remote2 with old rdk, remote2's resources get a blank cloud metadata
  • remote2 off, remote2's resources get a blank cloud metadata

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 16, 2024
@cheukt cheukt added the appimage-ignore-tests Build AppImage of PR and ignore tests label Dec 17, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 17, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 18, 2024
// Status encapsulates a resource name along with state transition metadata.
type Status struct {
// NodeStatus encapsulates a resource name along with state transition metadata.
type NodeStatus struct {
Name Name
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also move Name out of NodeStatus and into resource.Status, but keeping it there for now since I think it would be good to have the name come from the node and not the graph

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://viam.atlassian.net/browse/RSDK-9550 is the ticket, marked it as low since it's low-impact

}

func (w *GraphNode) resourceStatus() Status {
var resName Name
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue here is that test objects are not very flexible and there is not a good way to pass in the correct node name to the resource (especially if going across "remotes")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on why this won't match the resource name? Is it only in the case of remote resources? Just want an example of a mismatch so I better understand in which cases the original config.Name or current.Name() are incorrect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mock remote resources in tests - especially how I've opted to not actually spin up real servers, the "remote" resources are real resources and no way to get the names to be accurate.

if closeCtx.Err() != nil {
return
}
if !rOpts.disableBackgroundReconfiguration {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not strictly necessary, but added as a convenience to provide better guarantees to our tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my comment below, could we rename this option disableCompleteConfigWorker? The goroutine here isn't fully Reconfigureing, it's only calling completeConfig and updateWeakDependents.

@@ -1157,7 +1159,7 @@ func (r *localRobot) reconfigure(ctx context.Context, newConfig *config.Config,
if newConfig.Cloud != nil {
r.Logger().CDebug(ctx, "updating cached config")
if err := newConfig.StoreToCache(); err != nil {
r.logger.CErrorw(ctx, "error storing the config", "error", err)
r.logger.CDebugw(ctx, "error storing the config", "error", err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flyby

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be a real error scenario? Failure to cache the new config to disk?

md, _ := r.CloudMetadata(ctx) //nolint:errcheck
for _, resourceStatus := range r.manager.resources.Status() {
// if the resource is local, we can use the status as is and attach the cloud metadata of this robot.
if !resourceStatus.Name.ContainsRemoteNames() && resourceStatus.Name.API != client.RemoteAPI {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nodes for remote robots also count as remote


// defaultRemoteMachineStatusTimeout is the default timeout for getting resource statuses from remotes. This prevents
// remote cycles from preventing this call from finishing.
var defaultRemoteMachineStatusTimeout = time.Minute
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open to keeping this simpler and not having this timeout as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents remote cycles

I wouldn't have expected this timeout to be "preventing" remote cycles, but I do think we need timeouts for any remote interactions for when the remote is unexpectedly offline. How does this timeout "prevent" cycles?

Copy link
Member Author

@cheukt cheukt Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this timeout "prevent" cycles by basically adding an upper limit to how long this call can run for - if it is an infinite loop (remote1 points to remote2 and remote2 points to remote1), at least the call will return after a minute

@cheukt cheukt requested a review from benjirewis December 18, 2024 17:31
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 18, 2024
@cheukt cheukt marked this pull request as ready for review December 18, 2024 17:31
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 18, 2024
Copy link
Member

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Some initial comments + qs.

}

func (w *GraphNode) resourceStatus() Status {
var resName Name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on why this won't match the resource name? Is it only in the case of remote resources? Just want an example of a mismatch so I better understand in which cases the original config.Name or current.Name() are incorrect.

robot/impl/local_robot.go Show resolved Hide resolved
if closeCtx.Err() != nil {
return
}
if !rOpts.disableBackgroundReconfiguration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my comment below, could we rename this option disableCompleteConfigWorker? The goroutine here isn't fully Reconfigureing, it's only calling completeConfig and updateWeakDependents.

@@ -1157,7 +1159,7 @@ func (r *localRobot) reconfigure(ctx context.Context, newConfig *config.Config,
if newConfig.Cloud != nil {
r.Logger().CDebug(ctx, "updating cached config")
if err := newConfig.StoreToCache(); err != nil {
r.logger.CErrorw(ctx, "error storing the config", "error", err)
r.logger.CDebugw(ctx, "error storing the config", "error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be a real error scenario? Failure to cache the new config to disk?

robot/impl/local_robot_test.go Show resolved Hide resolved

// defaultRemoteMachineStatusTimeout is the default timeout for getting resource statuses from remotes. This prevents
// remote cycles from preventing this call from finishing.
var defaultRemoteMachineStatusTimeout = time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents remote cycles

I wouldn't have expected this timeout to be "preventing" remote cycles, but I do think we need timeouts for any remote interactions for when the remote is unexpectedly offline. How does this timeout "prevent" cycles?

robot/impl/robot_options.go Outdated Show resolved Hide resolved
testutils/resource_utils.go Show resolved Hide resolved
robot/impl/local_robot_test.go Outdated Show resolved Hide resolved
robot/impl/local_robot_test.go Show resolved Hide resolved
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 23, 2024
@cheukt cheukt requested a review from benjirewis December 23, 2024 16:28
Copy link
Member

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, just a couple nits.

@@ -287,6 +287,34 @@ func (r *localRobot) sendTriggerConfig(caller string) {
}
}

func (r *localRobot) completeConfigWorker() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] I realize you have the comment above the calling site of this method, but a comment above the method describing what it does would also be super helpful for posterity.

@@ -20,6 +20,9 @@ type options struct {
shutdownCallback func()

enableFTDC bool

// disableCompleteConfigWorker starts the robot without any background processes - should only be used for tests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// disableCompleteConfigWorker starts the robot without any background processes - should only be used for tests.
// disableCompleteConfigWorker starts the robot without the complete config worker - should only be used for tests.

@@ -82,3 +85,10 @@ func WithShutdownCallback(shutdownFunc func()) Option {
o.shutdownCallback = shutdownFunc
})
}

// withDisableCompleteConfigWorker returns a Option which disables background reconfiguration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// withDisableCompleteConfigWorker returns a Option which disables background reconfiguration.
// withDisableCompleteConfigWorker returns an Option which disables the complete config worker.

robot/impl/local_robot_test.go Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
appimage-ignore-tests Build AppImage of PR and ignore tests safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants