flow/manager: fix multi instance row tracking #12208

Closed
Conversation

victorjulien
Member

In multi instance flow manager setups, each flow manager gets a slice of the hash table to manage. Due to a logic error in the chunked scanning of the hash slice, instances beyond the first would always rescan the same (first) subslice of their slice.

The pos variable, which tracks the starting position of the next scan, was treated as if it held a value relative to the bounds of the slice. It was, however, holding an absolute position. As a result its bounds check always considered it out of bounds, which reset the subslice to be scanned to the first part of the instance's slice.

This patch addresses the issue by correctly handling the fact that the value is absolute.

Bug: #7365.

Fixes: e9d2417 ("flow/manager: adaptive hash eviction timing")

Replaces #12205, with an improved commit message.

https://redmine.openinfosecfoundation.org/issues/7365

In multi instance flow manager setups, each flow manager gets a slice
of the hash table to manage. Due to a logic error in the chunked
scanning of the hash slice, instances beyond the first would always
rescan the same (first) subslice of their slice.

The `pos` variable, which tracks the starting position of the next
scan, was treated as if it held a value relative to the bounds of the
slice. It was, however, holding an absolute position. As a result its
bounds check always considered it out of bounds, which reset the
subslice to be scanned to the first part of the instance's slice.

This patch addresses the issue by correctly handling the fact that the
value is absolute.

Bug: OISF#7365.

Fixes: e9d2417 ("flow/manager: adaptive hash eviction timing")
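
To make the relative-vs-absolute confusion concrete, here is a minimal standalone sketch of the failure mode. This is not the actual Suricata code: the struct and function names are invented for illustration, and only the bounds-check logic mirrors the description above.

#include <stdint.h>
#include <stdio.h>

/* Each flow manager instance owns rows [min, max) of the hash table and
 * scans them in chunks of rows_per_wu rows per pass. pos holds the
 * absolute row at which the next chunk should start. */
struct Instance {
    uint32_t min;
    uint32_t max;
    uint32_t pos;
};

/* Buggy check: compares the absolute pos against the slice *size*
 * (max - min). For any instance with min > 0 the position is
 * immediately "out of bounds", so every pass resets to the top of the
 * slice and only the first chunk ever gets rescanned. */
static uint32_t NextChunkBuggy(struct Instance *i, uint32_t rows_per_wu)
{
    if (i->pos >= i->max - i->min)
        i->pos = i->min;
    uint32_t start = i->pos;
    i->pos += rows_per_wu;
    return start;
}

/* Fixed check: treats pos as the absolute value it is and compares it
 * against the absolute bounds of the slice. */
static uint32_t NextChunkFixed(struct Instance *i, uint32_t rows_per_wu)
{
    if (i->pos < i->min || i->pos >= i->max)
        i->pos = i->min;
    uint32_t start = i->pos;
    i->pos += rows_per_wu;
    return start;
}

int main(void)
{
    /* Second instance of two, owning the upper half of a 65536-row table. */
    struct Instance one = { .min = 32768, .max = 65536, .pos = 32768 };
    for (int p = 0; p < 3; p++)
        printf("buggy pass %d starts at %u\n", p,
                (unsigned)NextChunkBuggy(&one, 3276));  /* 32768 every time */
    one.pos = one.min;
    for (int p = 0; p < 3; p++)
        printf("fixed pass %d starts at %u\n", p,
                (unsigned)NextChunkFixed(&one, 3276));  /* 32768, 36044, 39320 */
    return 0;
}

For instance 0 the slice starts at row 0, so relative and absolute coincide and the buggy check happens to work; only instances beyond the first keep rescanning their first subslice, which is the behaviour described above.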

codecov bot commented Dec 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.18%. Comparing base (e9173f3) to head (23568b3).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #12208   +/-   ##
=======================================
  Coverage   83.17%   83.18%           
=======================================
  Files         912      912           
  Lines      257111   257111           
=======================================
+ Hits       213856   213879   +23     
+ Misses      43255    43232   -23     
Flag Coverage Δ
fuzzcorpus 61.01% <0.00%> (-0.01%) ⬇️
livemode 19.41% <100.00%> (ø)
pcap 44.35% <100.00%> (-0.04%) ⬇️
suricata-verify 62.78% <100.00%> (+<0.01%) ⬆️
unittests 59.17% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Contributor

@jufajardini left a comment

Considering the explanation, this looks good.

@suricata-qa

Information:

ERROR: QA failed on SURI_TLPR1_alerts_cmp.

field baseline test %
SURI_TLPR1_stats_chk
.app_layer.flow.ftp 32421 36200 111.66%
.app_layer.flow.dcerpc_tcp 40 43 107.5%
.app_layer.error.http.parser 700 729 104.14%
.app_layer.error.ssh.parser 124 128 103.23%
.ftp.memuse 2906 3102 106.74%

Pipeline 23662

Member

@inashivb left a comment

  1. With your patch, I see both flow managers ending up looking at the same slice positions in the beginning. Does the following look correct to you?
instance: 1, hash 32768:65536 slice starting at 3276 with 3276 rows
instance: 0, hash 0:32768 slice starting at 3276 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 6552 with 3276 rows
instance: 0, hash 0:32768 slice starting at 6552 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 9828 with 3276 rows
instance: 0, hash 0:32768 slice starting at 9828 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 13104 with 3276 rows
instance: 0, hash 0:32768 slice starting at 13104 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 16380 with 3276 rows
instance: 0, hash 0:32768 slice starting at 16380 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 19656 with 3276 rows
instance: 0, hash 0:32768 slice starting at 19656 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 22932 with 3276 rows
instance: 0, hash 0:32768 slice starting at 22932 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 26208 with 3276 rows
instance: 0, hash 0:32768 slice starting at 26208 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 29484 with 3276 rows
instance: 0, hash 0:32768 slice starting at 29484 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 32760 with 3276 rows
instance: 0, hash 0:32768 slice starting at 32760 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 36036 with 3276 rows
instance: 0, hash 0:32768 slice starting at 3268 with 3276 rows
instance: 1, hash 32768:65536 slice starting at 39312 with 3276 rows
instance: 0, hash 0:32768 slice starting at 6544 with 3276 rows
  2. While pos should be absolute, from what I understand the absolute positions for each of the FMs will be different?
    With the following patch, I see a consistent increase in the row number per FM.
diff --git a/src/flow-manager.c b/src/flow-manager.c
index 9da986b22..d13ac72d1 100644
--- a/src/flow-manager.c
+++ b/src/flow-manager.c
@@ -855,7 +855,7 @@ static TmEcode FlowManager(ThreadVars *th_v, void *thread_data)
                 FlowTimeoutHash(&ftd->timeout, ts, ftd->min, ftd->max, &counters);
                 StatsIncr(th_v, ftd->cnt.flow_mgr_full_pass);
             } else {
-                SCLogDebug("hash %u:%u slice starting at %u with %u rows", ftd->min, ftd->max, pos,
+                SCLogNotice("instance %d, hash %u:%u slice starting at %u with %u rows", ftd->instance, ftd->min, ftd->max, ftd->min + pos,
                         rows_per_wu);
 
                 const uint32_t ppos = pos;
@@ -864,6 +864,7 @@ static TmEcode FlowManager(ThreadVars *th_v, void *thread_data)
                 if (ppos > pos) {
                     StatsIncr(th_v, ftd->cnt.flow_mgr_full_pass);
                 }
+                pos -= ftd->min;
             }
 
             const uint32_t spare_pool_len = FlowSpareGetPoolSize();

Output:

instance 0, hash 0:32768 slice starting at 0 with 3276 rows
instance 1, hash 32768:65536 slice starting at 32768 with 3276 rows
instance 1, hash 32768:65536 slice starting at 36044 with 3276 rows
instance 0, hash 0:32768 slice starting at 3276 with 3276 rows
instance 1, hash 32768:65536 slice starting at 39320 with 3276 rows
instance 0, hash 0:32768 slice starting at 6552 with 3276 rows
instance 1, hash 32768:65536 slice starting at 42596 with 3276 rows
instance 0, hash 0:32768 slice starting at 9828 with 3276 rows
instance 1, hash 32768:65536 slice starting at 45872 with 3276 rows
instance 0, hash 0:32768 slice starting at 13104 with 3276 rows
instance 1, hash 32768:65536 slice starting at 49148 with 3276 rows
instance 0, hash 0:32768 slice starting at 16380 with 3276 rows
instance 1, hash 32768:65536 slice starting at 52424 with 3276 rows
instance 0, hash 0:32768 slice starting at 19656 with 3276 rows
instance 1, hash 32768:65536 slice starting at 55700 with 3276 rows
instance 0, hash 0:32768 slice starting at 22932 with 3276 rows
instance 1, hash 32768:65536 slice starting at 58976 with 3276 rows
instance 0, hash 0:32768 slice starting at 26208 with 3276 rows
instance 1, hash 32768:65536 slice starting at 62252 with 3276 rows
instance 0, hash 0:32768 slice starting at 29484 with 3276 rows
instance 1, hash 32768:65536 slice starting at 65528 with 3276 rows
instance 0, hash 0:32768 slice starting at 32760 with 3276 rows
instance 1, hash 32768:65536 slice starting at 36036 with 3276 rows
instance 0, hash 0:32768 slice starting at 3268 with 3276 rows
instance 1, hash 32768:65536 slice starting at 39312 with 3276 rows
instance 0, hash 0:32768 slice starting at 6544 with 3276 rows
instance 1, hash 32768:65536 slice starting at 42588 with 3276 rows
instance 0, hash 0:32768 slice starting at 9820 with 3276 rows
instance 1, hash 32768:65536 slice starting at 45864 with 3276 rows
instance 0, hash 0:32768 slice starting at 13096 with 3276 rows
instance 1, hash 32768:65536 slice starting at 49140 with 3276 rows
instance 0, hash 0:32768 slice starting at 16372 with 3276 rows
instance 1, hash 32768:65536 slice starting at 52416 with 3276 rows

Lmk wdyt?

@victorjulien
Member Author

Good catch. Thinking about just initializing pos with ftd->min instead. Seems to work here.
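
For illustration only, that alternative amounts to seeding the scan state with the bottom of the instance's slice once, rather than rebasing it after every work unit, roughly:

/* Rough sketch of the alternative: initialize the absolute scan position
 * with the bottom of this instance's slice before the scan loop. The
 * surrounding declaration is assumed for illustration; this is not
 * necessarily the exact change that became #12218. */
uint32_t pos = ftd->min;

With pos seeded this way the absolute bounds check stays within [ftd->min, ftd->max) from the first pass, so no per-iteration adjustment like `pos -= ftd->min` is needed.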

@victorjulien mentioned this pull request Dec 4, 2024
3616f55

@victorjulien
Copy link
Member Author

replaced by #12218
