Race Conditions for Requests #359
There are a couple of requirements in SDPi that could be relevant to this.
Do these requirements address the race condition you raised? Some consumers are using the mdib version to check message ordering (R1005), though I think there are some problems with that. The solution proposed in that issue doesn't anticipate multiple connections, though. Nonetheless, could it help resolve the race condition you raised while allowing improved performance?
Thank you very much for pointing these out! We think R1001 is meant to apply to reports only, as indicated by its leading text, the chapter it appears in, and the fact that it cannot be fulfilled for services where the provider acts as the host. R1003 is not sufficient, as one may still set up multiple HTTP/1.1 connections.
The proposed solution of introducing a mechanism for "transmitted reports counting" for partial subscriptions does not solve the issue, as it applies only to reports, to which the issue does not apply, given R1001.
There's tricky language in all this, but I think you are right about R1001 being for outbound connections only.
Strictly, once a connection has been established for notifications, I think this requirement, as written now, says that one connection should be used both for sending notifications to a consumer and for receiving operation commands from that consumer. I wonder if that's the intent? I suspect it isn't, but if it is, it would create a bunch of new questions, e.g.:
I agree that R1003 isn't sufficient to avoid race conditions if the consumer is permitted to set up multiple connections to a single provider, and I can't see anything that bans that. However, this may be the same race condition that happens when multiple consumers each set up a single connection to the provider and send conflicting commands. I'm only just starting to look at invoked operations now, so I'm not sure how that's resolved. There are a bunch of requirements in 10207, §7.4.2 that may be relevant, though.
Problem:
SDC messages are subject to race conditions, and their order is not guaranteed. If implemented without proper safeguards, this may lead to hazardous situations with unexpected behavior.
There is no guarantee that messages are transmitted sequentially over the network.
At the same time, there is no requirement that SDC participants wait for a request to finish before starting the next one.
Example error-case scenario
A standard-compliant consumer may issue a series of set operations on a metric without waiting for a confirmation before issuing the next set operation. Intuitively, the consumer may assume that the metric will end up at the value of the last set operation. However, due to the lack of guaranteed message ordering, this assumption might not hold true.
Notably, when using the regular HTTP request/response pattern over a single socket (as most implementations presumably do), this issue does not arise. Requests are buffered until a response is received, and only then can the next request be processed. However, problems may occur when communication is established over multiple sockets, for example, to improve performance. Similarly, this issue is relevant when using gRPC instead of MDPWS.
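To make the failure mode concrete, here is a minimal sketch (plain Python, not SDC/MDPWS code): two set requests are issued back to back over what stands in for two independent connections, and the hypothetical send_set_request() applies whichever value arrives last at the provider.

```python
# Minimal sketch of the race: two set requests issued back to back over
# independent connections. send_set_request() is a stand-in for a real
# transport; the random delay models each connection's own network path.
import random
import threading
import time

provider_value = None          # last value the simulated provider applied
lock = threading.Lock()

def send_set_request(value: float) -> None:
    """Simulate one set request travelling over its own connection."""
    time.sleep(random.uniform(0.0, 0.1))   # independent, unordered network delay
    global provider_value
    with lock:
        provider_value = value             # provider applies whatever arrives last

# The consumer issues 1.0 first, then 2.0, without waiting for responses.
threads = [threading.Thread(target=send_set_request, args=(v,)) for v in (1.0, 2.0)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(provider_value)   # sometimes 2.0 (intended), sometimes 1.0 (the race)
```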
Additionally, this error case cannot be detected from the OperationInvoked reports. These reports provide information about the metric, operation handles, and invocation state, but not the value being set.
Likewise, setXXXResponse messages do not carry sufficient information to determine execution order, as the reported mdibVersion represents the version before applying the set request.
Solutions:
1) Add a requirement stating that the request/response pattern is to be followed at all times.
Most manufacturers likely already implement this pattern. It is a simple, safe, and backward-compatible solution that does not require significant changes.
On the other hand, this introduces performance limitations:
The queries per second (QPS) are limited to 1 / round-trip time, with the round-trip time being at least twice the latency. With a latency of 100 ms, this limits the rate to 5 queries per second. Depending on the application, this could be a critical safety issue.
In general, it doubles the distributed system's sensitivity to latency compared to solution 2).
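A minimal consumer-side sketch of this pattern, assuming a hypothetical blocking transport object (not an actual SDC API): all invocations are serialized on one lock so that only one request is ever outstanding, which directly yields the 1 / round-trip-time bound above.

```python
# Sketch of solution 1: allow only one outstanding request at a time by
# serializing all invocations on a single lock. `transport` is a hypothetical
# object with a blocking invoke(); it is not part of any SDC library.
import threading

class SerializedInvoker:
    def __init__(self, transport):
        self._transport = transport
        self._lock = threading.Lock()

    def invoke(self, request):
        with self._lock:                 # next request starts only after the response
            return self._transport.invoke(request)

# Throughput bound from the text: QPS <= 1 / RTT, with RTT >= 2 * latency.
latency_s = 0.100
max_qps = 1 / (2 * latency_s)            # = 5 queries per second at 100 ms latency
print(max_qps)
```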
2) Add a message counter per connection so that the recipient can order the incoming messages.
This requires introducing additional fields.
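A sketch of what such a counter could enable on the receiving side (the seq field and the class below are illustrative assumptions, not part of the SDC schema): messages are released strictly in counter order, and anything that arrives early is buffered.

```python
# Sketch of solution 2: the sender stamps each message with a per-connection
# counter; the receiver releases messages strictly in counter order and
# buffers anything that arrives early. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class Message:
    seq: int          # hypothetical per-connection message counter
    payload: str

class OrderingReceiver:
    def __init__(self) -> None:
        self._expected = 0
        self._pending: dict[int, Message] = {}

    def on_message(self, msg: Message) -> list[Message]:
        """Return the messages that may now be processed, in order."""
        self._pending[msg.seq] = msg
        released = []
        while self._expected in self._pending:
            released.append(self._pending.pop(self._expected))
            self._expected += 1
        return released

rx = OrderingReceiver()
print(rx.on_message(Message(1, "set 2.0")))   # [] - buffered until seq 0 arrives
print(rx.on_message(Message(0, "set 1.0")))   # both messages, now in order
```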
3) Adapt the OperationInvoked report so that out-of-order executions can be detected.
By also reporting the new mdibVersion, the order of execution could be determined.
Alternatively, the actual value could be included in the report.
This solves the issue only for set requests, not in general. It may still arise with other services, even though the already existing ones appear not to be affected by it (needs confirmation).
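A sketch of how a consumer could use such an extended report (the new_mdib_version field below is the proposed addition, not part of the current OperationInvoked report): sorting the reports by the MDIB version produced by each operation recovers the actual execution order.

```python
# Sketch of solution 3: if each OperationInvoked report also carried the MDIB
# version produced by the operation, the consumer could reconstruct the order
# in which the provider actually applied the requests. Proposed field only.
from dataclasses import dataclass

@dataclass
class OperationInvokedReport:
    transaction_id: int
    new_mdib_version: int      # proposed field: version *after* applying the request

def execution_order(reports: list[OperationInvokedReport]) -> list[int]:
    """Return transaction ids in the order the provider applied them."""
    return [r.transaction_id for r in sorted(reports, key=lambda r: r.new_mdib_version)]

reports = [OperationInvokedReport(transaction_id=42, new_mdib_version=101),
           OperationInvokedReport(transaction_id=41, new_mdib_version=102)]
print(execution_order(reports))   # [42, 41] -> request 41 was applied last
```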
4) Add an optional requirement to make manufacturers aware of this issue while maintaining maximal flexibility in implementation.
As mentioned in 1), not waiting for a response improves overall system responsiveness. As such, it may be desired behavior.
The requirement would make manufacturers aware of this issue/opportunity, reducing its risk.
Open question: If some manufacturers adhere to the requirement and some don't, is the overall system still safe? Does this need to be communicated during the handshake?