Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thresholds/v20 #11358

Closed
wants to merge 25 commits into from
Closed

Thresholds/v20 #11358

wants to merge 25 commits into from

Conversation

victorjulien
Copy link
Member

SV_BRANCH=OISF/suricata-verify#1938

#11324 rebased

Add support for 'by_flow' track option. This allows using the various
threshold options in the context of a single flow.

Example:

    alert tcp ... stream-event:pkt_broken_ack; \
        threshold:type limit, track by_flow, count 1, seconds 3600;

The example would limit the number of alerts to once per hour for
packets triggering the 'pkt_broken_ack' stream event.

Implemented as a special "flowvar" holding the threshold entries. This
means no synchronization is required, making this a cheaper option
compared to the other trackers.

Ticket: OISF#6822.
Allow rate_filter and thresholds from the global config to specify
tracking "by_flow".
Traffic variables (flowvars, flowbits, xbits, etc) use a smaller int for
their type than detection types. As a workaround make sure the values fit
in a uint8_t.
Limits propegation checked for DETECT_DEPTH as a content flag,
which appears to have worked by chance. After reshuffling the
keyword id's it no longer worked. This patch uses the proper
flag DETECT_CONTENT_DEPTH.
Thresholding often has 2 stages:

1. recording matches
2. appling an action, like suppress

E.g. with something like:
threshold:type limit, count 10, seconds 3600, track by_src;
the recording state is about counting 10 first hits for an IP,
then followed by the "suppress" state that might last an hour.

By_src/by_dst are expensive, as they do a host table lookup and lock
the host. If many threads require this access, lock contention becomes
a serious problem.

This patch adds a thread local cache to avoid the synchronization
overhead. When the threshold for a host enters the "apply" stage,
a thread local hash entry is added. This entry knows the expiry
time and the action to apply. This way the action can be applied
w/o the synchronization overhead.

A rbtree is used to handle expiration.
Packet pointer is not used during allocation.
Add a callback and helper function to handle data expiration.

Update datasets to explicitly not use expiration.
Instead of a Host and IPPair table thresholding layer, use a dedicated
THash to store both. This allows hashing on host+sid+tracker or
ippair+sid+tracker, to create more unique hash keys.

This allows for fewer hash collisions.

The per rule tracking also uses this, so that the single big lock is no
longer a single point of contention.

Reimplement storage for flow thresholds to reuse as much logic as
possible from the host/ippair/rule thresholds.

Ticket: OISF#426.
Use the same hash key as for the regular threshold storage,
so include gid, rev, tentant id.
@victorjulien victorjulien requested review from jufajardini and a team as code owners June 24, 2024 11:11
This was referenced Jun 24, 2024
@suricata-qa
Copy link

Information: QA ran without warnings.

Pipeline 21192

Comment on lines +339 to +346
if (track == TRACK_SRC) {
addr = p->src.addr_data32[0];
} else if (track == TRACK_DST) {
addr = p->dst.addr_data32[0];
} else {
return -1;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPv6?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, implementation is IPv4 only right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ticket required? Or implement before merge?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? It's an optimization, nothing is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreso for completeness. It is not discussed in any of the commits, or the tickets referenced by this PR that this is an IPv4 only optimization, so the lack of IPv6 looks like an oversight.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call sites all use PacketIsIPv4

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least to me that means where is the IPv6 :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can create a ticket if you want

{
if (SCTIME_CMP_GTE(a->expires_at, b->expires_at)) {
return 1;
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unneeded else

SCFree(n);
}
return -1;
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unneeded else

}

/** \brief Check Thread local thresholding cache
* \retval negative value cache miss (-1 miss, -2 expired)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: retval values missing

.retval = retval,
.addr = addr,
.sid = sid,
.expires_at = expires,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Do we need expires_at and retval here as they're not used for lookup?

};
ThresholdCacheItem *found = HashTableLookup(threshold_cache_ht, &lookup, 0);
if (found) {
if (SCTIME_CMP_GT(p->ts, found->expires_at)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: what if it has just timed out? So the = condition

THRESHOLD_CACHE_RB_REMOVE(&threshold_cache_tree, found);
HashTableRemove(threshold_cache_ht, found, 0);
cache_lookup_miss_expired++;
return -2; // CACHE MISS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: also mention expired?

typedef struct ThresholdEntry_ {
uint32_t sid; /**< Signature id */
uint32_t gid; /**< Signature group id */
int track; /**< Track type: by_src, by_src */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo by_src

Copy link
Member

@inashivb inashivb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, mostly. Will take another look when it is staged.
Nice thoughts around optimizations!

/* expired, so reset */
lookup_tsh->tv1 = p->ts;
lookup_tsh->current_count = 1;
if (SCTIME_CMP_LTE(p->ts, entry)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. so = is considered within the time limit.

@victorjulien
Copy link
Member Author

victorjulien commented Jun 26, 2024

Makes sense to me, mostly. Will take another look when it is staged. Nice thoughts around optimizations!

Please do the full review at this stage, not when things are staged into a next branch.

@inashivb
Copy link
Member

Please do the full review at this stage, not when things are staged into a next branch.

ok. Will take another look tomorrow. Thank you. This is big work.

.key[TC_GID] = gid,
.key[TC_REV] = rev,
.key[TC_TENANT] = p->tenant_id,
};
Copy link
Member

@jasonish jasonish Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be vlan aware here? And elsewhere with respect to thresholding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created https://redmine.openinfosecfoundation.org/issues/7125 for this

It's an issue in existing code. I do not intend to address it in this work.

@victorjulien
Copy link
Member Author

comments addressed in #11388

@victorjulien victorjulien deleted the thresholds/v20 branch July 2, 2024 11:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants