-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IF: propagate hs_*_messages #1548
Comments
Changes may instead need to be outside of net plugin, e.g. changes to qc_chain to support an qc tree analogous to fork database for reversible blocks. This would be needed to detect duplicates to drop them on the floor and avoid the infinite message loop problem. |
I'm thinking that we want the rebroadcasting to be encapsulated inside of the IF engine, since it is already tasked with calling "broadcast hs message (x)" as a side-effect of message processing. So the IF engine will also make the decision to disseminate (route) the message to the network during the processing of each incoming message. Here's my current thinking about what these decisions would look like, for each message type:
This is in addition to the broadcast-this-new-message requests that are already inside qc_chain. That would mean qc_chain is also a gossip router now. A free optimization can be achieved by not ever rebroadcasting a message to the original sender. This can be achieved by passing an identifier of the sender to the IF engine together with the incoming message notification, which can then pass that as an optional argument to the send/broadcast requests (i.e. do broadcast, but exclude this given destination/connection/node). A further optimization then is the creation of a "hotstuff" connection/channel/subscription type, similar to what is being done to block connections, transaction connections, and the new peer addressbook/discovery feature (last time I checked, month+ ago). All these broadcasts are limited to those connections only, which would allow e.g. hotstuff messages to be disseminated inside of the network core, eventually reaching all infrastructure nodes (and any external/downstream nodes interested in collecting this information and which have a "hotstuff" connection to a core network node). I am assuming that all of the hotstuff messages are going to be what we are seeing in I don't think the net_plugin should be making these decisions, as it would perhaps have to understand what the messages mean to be able to make optimal routing decisions. For a gossip network, pushing the routing into the IF engine will not needlessly complicate the interface between the two modules. All you need, in addition, is the original sender (net relayer, not HS message creator) information, which is ultimately meaningless to the IF engine, but is meaningful to the net_plugin, which just gets it back on a rebroadcast, without having to ever store it in the meantime. Depending on what information is required to implement all of the routing decisions, we will end up needing some additional data structures or a change to the existing one(s), as @arhag mentioned above. This spec can be slightly or completely off the target, so any discussions, completions, corrections here are of course welcome (and needed). |
I agree with the approach of the IF engine handling this functionality. It is not clear to me the complete logic for rebroadcast of a message. However, the following must hold: All hs messages must either generate a new/modified hs message OR forward the message to all peers (except the peer it came from). This ensures all message propagate throughout the entire network. I don't think limiting transmission of hs message to hs configured peers is needed. These messages are not large and I don't think the network overhead will warrant the added complexity. If we find it is useful, it can be added post-5.0. See #1559 for latest hs_message formats. |
EDIT: NOTE: This comment is obsolete. I believe this may be a good direction for vote rebroadcasting. I would use a second (local) bitset inside of
EDIT: If either of the bitsets on a proposal is set, the vote is new and is thus rebroadcasted. EDIT: I think this is wrong: only the leader should be aggregating signatures. A non-leader just reads the active_finalizers bitset, and sets the secondary bitset only. EDIT: I think that, aside any missing signature checking / authentication, this pseudocode does it:
EDIT: The above is still wrong; I confused Proposal with QC; Proposal doesn't have a sig bitset. |
EDIT: NOTE: This comment is obsolete. Tentative spec for the New Block message gossip.
EDIT: the |
Tentative spec for the New View message gossip. This has to be confirmed/reviewed by someone else, since I don't know enough about Hotstuff to know whether this is correct or not; however this cannot prevent liveness if we still have leader timeouts and/or leader rotation.
|
Tentative spec for the Proposal message gossip. Again, I don't know enough to be sure that this is correct, so I need a confirmation here. If you receive a proposal that you do insert in your store, then it is new and valid, and thus you rebroadcast it. If you are already discarding that proposal for whatever reasons, you do not rebroadcast it -- I don't think there's a case where you are inserting it on your store and don't rebroadcast, or a scenario where you are discarding it and do rebroadcast. And you do rebroadcast it before you broadcast votes on it (i.e. stuff it in the socket send buffers for each peer connection before the votes).
|
@heifner suggests GCs will be at LIB, meaning not time-based or based on a maximum buffer size and I like/agree. |
EDIT: NOTE: I am not sure this is relevant, but I'm leaving it here for now. If we decide(d) to merge the leader role with the proposer role for Leap 5, then we can perhaps defer the actual use of the New Block message, since its role as an actual network message seems to be to enable a separation between the proposer and the leader roles (and to enable unit tests which don't have another block propagation mechanism). If this separation does not happen in Leap 5, then we don't need that message, and thus we don't need to deal with it in the context of this ticket (i.e. coming up with a block ID cache to implement its dissemination). The message type still exists (for tests, for future use, and maybe as a synthetic message generated locally for the proposer role to communicate with the leader role inside the same node) but in the Leap 5 protocol it wouldn't be actually sent (over the actual network). Since IF removes the need for quick rotation in the producer schedule, the Leap protocol could also get rid of the 6-second/12-block producer window and enlarge it to match what a healthy window for Leader alternation would look like. In other words, if it would make more sense to rotate the leader (say) once every 12 seconds or 24 seconds, then you just make that the new production window and that allows you to have a single producer = (proposer + leader) window rotation that also eschews any need to design an explicit, event-driven, network-messaging-based, negotiated, active leader timeout and replacement solution, just like the producer window eschews any need to design such a thing for failing producers as well. In my view, the 6-second window for production that we are accustomed to serves two purposes. One is the purpose above: to be fast enough that we can accumulate implicit confirmations of block finality faster by spreading signature collection over the production of blocks (which each collect one signature; the "quorum certificate" as it were is implicit and spread over time). And the second purpose is to be small enough that when an elected and scheduled producer is completely disrupted, the disruption of service to the network is "not noticeable", in comparison with the typical expected QoS of blockchain networks -- for example, a 6-second outage due to a completely botched production window is roughly half of the block time of normal operation of the Ethereum network (something like 12 seconds per block), so even if one producer is completely down, you still get latency at least equivalent to e.g. Ethereum. But a block producer being down is a rare occurrence, especially with redundant production infrastructure that BPs should have. With IF, we would no longer need to worry about the total latency of a full production round. And if, in addition, we discard the preoccupation with "decreased network QoS due to producers that are completely down" (which is a problem addressable at the IT & governance level) then that would mean we are free to increase the production window to match something that would make sense for a forced leader rotation window which -- as discussed above -- frees us from having to worry about leader failure substitution; a failed leader role is equivalent to a failed proposer role, and both are equivalent to a failed producer entity, and thus we can assume the entire problem solves itself every X seconds, where X can be something other than 6 seconds. X can be anything that mitigates whatever costs (if any!) an automatic leader rotation imposes. Maybe I'm off here, but anyway, if that's true, then omitting the New Block message from the Leap 5 release would simplify this ticket. EDIT: Actually, having message dissemination for New Block solved (even if not optimally) is good in any case, since it keeps the engine ready for future use of the message in a Leap protocol that does decide to separate proposer and leader. EDIT: Even if you have redundant producer infrastructure (e.g. 2 block producers with the same block signing key) only one of them can be set up as the leader, with its unique BLS key, right? So even in that scenario, the proposer and the leader roles are in the same node (same BLS key). |
Current design direction, summed up:
EDIT: For the first item, going with this, which will be cleaner than passing connection_id as an additional argument:
|
For this to be secure, to avoid attacks where malicious peers are trying to censor message propagation, we need to compute a secure hash on top of the full byte serialization of every incoming and outgoing HS message. Other secure alternatives would involve signing and signature verification, which we don't want. Which means we probably want to do that at the net_plugin after all. So I will try to move the duplicate message filtering logic to the net_plugin. |
I don't follow, how does a hash help here? The malicious peer can just create a hash of its malicious message. The bls signatures in the messages are over the data that must be secure I thought. |
If you can guarantee that you are already and always going to pay the cost to verify a signature (of whatever type) over the message that you are receiving (which is to be determined by the application) then you can use that entire signature (say, 384 bits) as a key to a "don't rebroadcast this message in the future". In any other case, an attacker can just replay that signature (that is not verified and re-verified every time you receive it) over garbled content on the rest of the message, preventing the propagation of the "good" version of the message, in theory, which at that point could be irrelevant to node A receiving it (i.e. in application logic it is not verifying the signature, aggregating it, etc.) but is relevant to node B, which is counting on node A to rebroadcast it. Each HS message has its own structure. Some have a global signature object, others have a block ID and a justify-qc, like the New Block message. So you have to think about what piece of that message you are using to store on an index of "seen messages" for that specific message type. Each HS message type is getting its own protocol, in a sense. Maybe there's a way to leverage whatever bits of intelligence the qc_chain has about the meaning of the messages, but it is safer to, at least at first, treat messages as opaque bags of bits and the net_plugin just rebroadcasts any hotstuff message that it has "never seen before" (never seen recently, actually). It is better to deal with the spam problem later (i.e. an attacker is just manufacturing fake hotstuff messages to spam the network), than having to deal with a message censorship attack. In a spam attack you are just wasting bandwidth, but in a censorship attack the honest messages are being dropped. So my concern is that I don't want an honest message to be wrongly recognized as a message that has been already "seen". To do that, the easiest way is to just hash the byte serialization of messages. This way, an attacker cannot censor a message by sending that exact message, byte-by-byte, to be broadcast. The spam problem can be countered with an additional decision layer which is Hotstuff-specific; the net_plugin asks the qc_chain, after the message passes the duplicate check, if the message is "irrelevant". The node has to know that the qc_chain state it has is sufficient to make that call, and then know how to make that evaluation of what the message means to decide whether it's going to be dropped or not, even if it is apparently not a byte-by-byte duplicate of something previously seen (which is the case with spam). |
The root of the problem is the Multicast Problem, which is the reason why the open internet doesn't route multicast, but will happily route any unicast packet, no matter if it is meaningless and part of a DDoS attack or not (which it has no automated way of knowing). If we are introducing message broadcasting to the protocol and we don't have simple messages like "transactions" and "blocks", which have strict ordering, are always verified, etc., but instead we have these control messages that together implement a complex, critical and latency-sensitive protocol, then it becomes tricky to combine the fact that nodes all just trust each other to blindly multicast each other's messages to everyone. Maybe we should think about introducing the concept of a trusted mesh among the elected producer infrastructure. That would be analogous to a permissioned IP multicast subnet, for example. |
Byte-level duplicate message filtering will be done at net_plugin. If a message is a duplicate, it isn't even sent to the qc_chain. The decision to rebroadcast will be done at qc_chain. This can be a decision to rebroadcast always (always excluding the original sender), or a decision to rebroadcast conditionally. |
It was my impression that we were going to verify the signature on each and every hs message and that could be used as the key. I thought, each node would be maintaining its own view of finality and verifying all messages. |
We have decided today that we will be doing signature checks on all messages in one way or another, pretty much giving a good baseline solution or limiter to network-wide spam. We have also decided today that we will be doing duplicate checks by detecting the more general condition of an irrelevant IF message (in which duplicates are contained) -- a message that is discarded upon receipt by the qc_chain because it decides that the message serves no purpose, based on the contents of the message's fields and its own IF state. (EDIT: that means removing the SHA256 check that is there now, which may or may not be repurposed later -- see below). We have also decided that we want to get rid of the connection_id being passed into qc_chain. There is a way to get rid of the connection_id: add a return value to the entire chain of calls all the way from net_plugin connection::handle_message( const hs_message& msg ) to the qc_chain::process_proposal, process_vote, etc. methods that are all returning void now. They would return a status that would, for example, notify the caller connection (or the caller connection's my_impl-> (net_plugin_impl)) to rebroadcast the same message (but excluding that connection), or to notify of an invalid signature check or of an irrelevant message (to increase a spam/grief counter). On top of that, if we really need to, we can count peer spam/grief specifically on replay attacks by adding a specialized message sha256 check on each connection as well. Do we want a return value? EDIT: We have also decided to remove New Block as an actual HS message that is sent and seen on the wire (for starters). This is pretty much what I got for now. |
I think we want to keep the |
That part of my comment is nonsense since we can (if we want to) detect replay spam at qc_chain. The signature that is going to be verified for every hs message is the unique key we can hash and test for duplicates (or signature replay spam) at qc_chain itself (don't need to serialize the entire message; just hash the signature that is first verified for every message before we get to a duplicate check). EDIT: The parameters that are passed in/out (net_plugin <--> qc_chain) via callbacks can include a notification of any spam detected at qc_chain. |
Removed lgtm label because issue doesn't address how to do the work. |
PR #1613 will add the infrastructure that will allow the qc_chain to make Hotstuff message propagation decisions, and the Net plugin to make decisions on disconnecting peers that are sending (too frequent) invalid, irrelevant or duplicate Hotstuff messages. |
After today's discussion, I think the way forward to close this ticket is to: 1 - Let all the new finalizer_set logic make its way inside of qc_chain; 2 - Settle on a final spec for all the hotstuff message fields; Then (when everything that's required to make message filter-or-propagate decisions settles), 3 - Use all the information that is inside of qc_chain to implement all the filter-or-propagate-this-message decisions (probably inside of the 4 - If duplicate checks for message propagation are not addressed organically at that point (as part of naturally filtering messages that are irrelevant to the protocol, with the knowledge that any and every qc_chain can have at any point), then, as part of this ticket, add specific code for that (using for example a cache of SHA256 "seen signatures" or, the nuclear option, of "seen message content" [at the net_plugin level, that had been done before]); 5 - Write a specific multi-hop test for that; run tests; see that it works (e.g. no infinite propagation flooding; protocol still works), close this; Then, in the future, if needed (tracked in other, new issues -- other things that were discussed in the context of this ticket): 6 - Add logic at the net_plugin to react to notifications of lots of duplicate, invalid, etc. hotstuff messages and disconnect peer; 7 - Start pushing costly operations outside of qc_chain (or whatever other solution(s) to reduce qc_chain contention); 8 - Possibly refactor the IF engine (rename/reshape pacemaker etc.). |
- test_pacemaker allows each test to define a specific connection mesh between nodes - added a simple unit test (#7) that can be used to test a solution for #1548 (just needs to introduce a disconnect between nodes) - added support to write multi-hop unit tests (with selective message propagation by type) The hotstuff protocol has "phases", and between two adjacent phases, we are typically using different message types. This allows us to write tests where we loop propagating a specific message type across the network, for as many hops as necessary, until the `dispatch` method on the `test_pacemaker` returns an empty vector, signaling no more message propagation happened. And then, we can start propagating the message type that's expected for the next phase of the protocol (these were generated during propagation of the previous type, but were just retained). All unit tests still use a full connection mesh, which is easily created by calling `test_pacemaker::connect(vector-with-all-node-names)`. The added test is meant to be changed by the inclusion of an explicit disconnect between the two rotating leaders; this can be used as the first test of a solution to message propagation. Closes #1658
(EDIT) As discussed today, I'm going to write a propagation solution to the hotstuff messages as they are now, and test them with unit tests, and have that up in a branch. |
- qc_chain does not aggregate a signature from a finalizer that has already voted on a proposal - added test_pacemaker::duplicate() - added hotstuff_8 unit test to test for duplicate votes This fix is related to #1548 (lack of duplicate filtering only manifests with message propagation)
This issue is focused on the improvements to handling of network messages prior to unification of proposals with blocks. Unification will require further changes to this code, but that will be handled in another issue. |
Implemented by #1676 |
Need to support configuration where BPs are not directly connected. For example:
BP <-> relay-node <-> relay-node <-> BP
.Also need to prevent infinite message loop. This is accomplished for trx and blocks by keep track of what messages have been received on each connection and not echoing back the message to that connection.
The text was updated successfully, but these errors were encountered: