
RDMA failure #2265

Open
372046933 opened this issue May 28, 2023 · 75 comments

Comments

@372046933
Contributor

Describe the bug

W0528 17:36:56.793694   510 external/brpc/src/brpc/input_messenger.cpp:240] Close Socket{id=1248 fd=1395 addr=10.156.8.12:24452:24141} (0x55c9fa744b40) due to unknown message: \F87\88>[\BE\C0<\00\00\00\80\CA|\E1=\F87\88>[\BE\C0<\CA\E0U\BE\CA|\E1=\F87\88>[\BE\C0<\00\00\00\80\CA|\E1=\F87\88>[\BE\C0<S}:=O\E9\82\BD...<skipping 16256 bytes>
W0528 17:36:56.794202   510 external/brpc/src/brpc/policy/baidu_rpc_protocol.cpp:265] Fail to write into Socket{id=1248 fd=1395 addr=10.156.8.12:24452:24141} (0x55c9fa744b40): Invalid argument

To Reproduce
On the master branch with RDMA enabled, the failure reliably reproduces within a few hours.

Expected behavior

Versions
OS:
Compiler:
brpc:
protobuf:

Additional context/screenshots

@Tuvie
Contributor

Tuvie commented May 29, 2023

What kind of workload is this? Is it one of the examples? What does the test look like?

@372046933
Contributor Author

It is not an example. The scenario is a distributed training job, and the failure shows up after the RDMA switch is turned on. Is there any auxiliary logging that could help with debugging, similar to TF's VLOG?

@Tuvie
Contributor

Tuvie commented May 29, 2023

Errors like this may be caused by multiple threads accessing the same IOBuf concurrently. First disable zero copy and see whether it still reproduces: set rdma_recv_zerocopy=false (disables receive-side zerocopy) and rdma_zerocopy_min_size=[a large value] (disables send-side zerocopy). That will tell us whether the race is inside the RPC framework itself or between the RPC framework and the application on top of it.
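As a minimal sketch of one way to apply these settings (the helper name below is illustrative; passing --rdma_recv_zerocopy=false --rdma_zerocopy_min_size=1073741824 on the command line works just as well, and the 1 GiB threshold is simply an arbitrarily large value):

#include <gflags/gflags.h>

// Sketch only: raise the zerocopy threshold and turn off recv-side zerocopy
// via the standard gflags runtime API.
static void DisableRdmaZerocopyForDebugging() {
    google::SetCommandLineOption("rdma_recv_zerocopy", "false");
    // Any value larger than the biggest message effectively puts the
    // zerocopy threshold out of reach.
    google::SetCommandLineOption("rdma_zerocopy_min_size", "1073741824");
}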

@372046933
Contributor Author

With rdma_recv_zerocopy=false and rdma_zerocopy_min_size=1073741824 set, the failure still occurs.
Server side:

W0530 12:26:38.156749   812 external/brpc/src/brpc/input_messenger.cpp:240] Close Socket{id=678 fd=1331 addr=10.156.8.33:51496:19269} (0x562e43640000) due to unknown message: 7\D80\B5\B6\B6\E0\B8\B5k\8A\AE7H\CF\B27\F1\7F\947\92\E8\0F\B6\D1\A1\816\85j%\B7V\D6\BF7\C6\C0T\B7\9EP\80\B7`h\033\A5\BF68lG\F76\9E^\877\F5\B2w...<skipping 7205 bytes>

Client side:

[E1014]Got EOF of Socket{id=12 fd=1354 addr=10.156.8.33:12455:33556}

@Tuvie
Contributor

Tuvie commented May 30, 2023

Does the error come from the server side every time? Can you provide more details on how to reproduce it, or any characteristics of the moment the problem is triggered?

@372046933
Contributor Author

The server side reports the error first every time (Close Socket...), and at that moment the client sees a batch of RPC failures. I added retry logic on the client and the RPCs do succeed on retry, but what happened before the retry, and are there side effects? I am still verifying that.

No similar errors have ever shown up when using TCP.

@Tuvie
Contributor

Tuvie commented May 30, 2023

"at that moment the client sees a batch of RPC failures"

Is it multiple RPCs failing on a single client, or RPCs failing across multiple clients?

@372046933
Contributor Author

The former: multiple RPCs failing on a single client.

@Tuvie
Contributor

Tuvie commented May 30, 2023

"rdma_recv_zerocopy=false (disables receive-side zerocopy), rdma_zerocopy_min_size=[a large value] (disables send-side zerocopy)"

Sorry, I just checked the code and the statement above is wrong. rdma_zerocopy_min_size=[a large value] still only disables receive-side zerocopy; send-side zerocopy currently cannot be disabled via a flag. I will put together a version that supports turning off send-side zerocopy so you can try again.

@Tuvie
Contributor

Tuvie commented May 30, 2023

#2267
You can try this PR.

This kind of IOBuf data corruption is indeed hard to debug; logging everything would produce far too much data.

@372046933
Contributor Author

#2267 has a typo that needs fixing; the pipeline does not define BRPC_WITH_RDMA=true, so the checks do not catch it.

@372046933
Contributor Author

After disabling send_zerocopy:

W0601 12:06:39.288266 18290 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:1412] Fail to handle RDMA completion, error status(4): Socket{id=8589934594 fd=1337 addr=127.0.0.1:1234:24360} (0x0x5640308974c0): Resource temporarily unavailable
2023-06-01 12:06:39.288390: W ./tensorswitch/client/handle_base.h:119] Method:[Ping] Fail to get response from shard#0 [E3001]RDMA completion error(4) from Socket{id=8589934594 fd=1337 addr=127.0.0.1:1234:24360} (0x0x5640308974c0): RDMA verbs error, retry#9 request:

@Tuvie
Contributor

Tuvie commented Jun 1, 2023

Which RDMA-related flags are you using? Could you paste them here?

@372046933
Contributor Author

rdma_recv_zerocopy=true
rdma_zerocopy_min_size=512
rdma_send_zerocopy=false

@Tuvie
Contributor

Tuvie commented Jun 1, 2023

Any other flags set? Block size and so on?

@372046933
Contributor Author

372046933 commented Jun 1, 2023

No, nothing else. I found that disabling both send and recv zerocopy at the same time works; disabling send_zerocopy alone does not.

@372046933
Contributor Author

Multi-node runs still fail:

 Fail to handle RDMA completion, error status(4): Socket{id=32 fd=1334 addr=10.156.1.16:24615:18398} (0x0x5637deabe800): Resource temporarily unavailable

@Tuvie
Contributor

Tuvie commented Jun 1, 2023

Does the program call RegisterMemoryForRdma to register additional memory of its own?

@372046933
Contributor Author

Yes, it does. After zerocopy is disabled, is RegisterMemoryForRdma no longer allowed?

@Tuvie
Contributor

Tuvie commented Jun 1, 2023

So the program uses append_user_data to reference external memory directly, and registers that memory separately, right? I still have not been able to reproduce your error locally, though. Any other clues?

@372046933
Contributor Author

"So the program uses append_user_data to reference external memory directly, and registers that memory separately, right? I still have not been able to reproduce your error locally, though. Any other clues?"

Yes. The only difference is passing false for rdma_send/recv_zerocopy: it runs with true, but fails with false.

@Tuvie
Contributor

Tuvie commented Jun 2, 2023

I still cannot reproduce this problem locally. Does your program issue ibverbs sends directly? The error indicates that the data being sent is not inside a registered Memory Region.

@372046933
Contributor Author

"I still cannot reproduce this problem locally. Does your program issue ibverbs sends directly? The error indicates that the data being sent is not inside a registered Memory Region."

We obtain brpc::rdma::GetRdmaPd and then call ibv_reg_mr, but we do not call any ibv send/recv APIs directly.

@Tuvie
Contributor

Tuvie commented Jun 2, 2023

If you call ibv_reg_mr directly, the registered mkey is then passed in via append_user_data_with_meta, right?

@372046933
Contributor Author

Yes, otherwise it would not have run even before zerocopy was disabled.
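For readers following along, a minimal sketch of the pattern confirmed above: register an external buffer against brpc's RDMA protection domain and attach it to an IOBuf with the lkey as the meta. The header path, the assumption that GetRdmaPd's return value converts to ibv_pd*, and the exact append_user_data_with_meta signature should be checked against the brpc headers; the deleter and access flags here are illustrative.

#include <infiniband/verbs.h>
#include <butil/iobuf.h>
#include <brpc/rdma/rdma_helper.h>

// Caller keeps ownership of the buffer; the IOBuf must not outlive it.
static void NoopDeleter(void*) {}

// Register an external buffer against brpc's RDMA protection domain.
static ibv_mr* RegisterExternalBuffer(void* buf, size_t len) {
    ibv_pd* pd = (ibv_pd*)brpc::rdma::GetRdmaPd();
    if (pd == NULL) {
        return NULL;  // RDMA not initialized
    }
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}

// Attach the registered buffer to an IOBuf, passing the lkey as meta so the
// RDMA send path can use this MR instead of copying.
static int AppendRegisteredBuffer(butil::IOBuf* out, void* buf, size_t len,
                                  const ibv_mr* mr) {
    return out->append_user_data_with_meta(buf, len, NoopDeleter, mr->lkey);
}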

@Tuvie
Contributor

Tuvie commented Jun 2, 2023

It was probably not fully committed. Could you try again?

@Tuvie
Contributor

Tuvie commented Jun 2, 2023

It is unrelated to the earlier problem. The function is simply reused when zero_copy is disabled.

@372046933
Contributor Author

"It is unrelated to the earlier problem. The function is simply reused when zero_copy is disabled."

Without that change, is there anything wrong with the master branch?

@Tuvie
Contributor

Tuvie commented Jun 2, 2023

As long as send zero copy is not enabled, there should be no problem. If something were wrong, you would see errors like "Fail to handle RDMA completion, error status(4)".

@Tuvie
Contributor

Tuvie commented Jul 24, 2023

RTR means ready to receive packets; posting receives before that is fine. When you say it does not take effect, do you mean the test reported an error?

@372046933
Contributor Author

No error; I just wanted to confirm that these WRs are only processed once the QP reaches the RTR state, right?

Also, how should errors like Socket EOF be debugged? I am not clear on how the state machine here is maintained; from the logs it looks as if the socket state transitions went wrong.

W0724 15:26:52.624647   564 external/brpc/src/brpc/input_messenger.cpp:240] Close Socket{id=8589935851 fd=1410 addr=10.156.8.30:64506:16619} (0x55b62c9a4400) due to unknown message: \FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\00\00\00\00\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF\FF\7F\FF\FF...<skipping 7358 bytes>
W0724 15:26:52.624847   564 external/brpc/src/brpc/policy/baidu_rpc_protocol.cpp:265] Fail to write into Socket{id=8589935851 fd=1410 addr=10.156.8.30:64506:16619} (0x55b62c9a4400): Invalid argument
I0724 15:26:52.727919   551 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:559] Start handshake on Socket{id=8589935271 fd=1410 addr=10.156.8.30:26734:16619} (0x0x55b5ed984240)
W0724 15:26:52.727974   551 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:567] Fail to read Hello Message from client:Socket{id=8589935271 fd=1410 addr=10.156.8.30:26734:16619} (0x0x55b5ed984240) 10.156.8.30:26734: Got EOF
I0724 15:26:52.728424   542 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:559] Start handshake on Socket{id=8589937304 fd=1410 addr=10.156.8.30:26736:16619} (0x0x55b62e366000)
I0724 15:26:52.733604   560 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:695] Handshake ends (use rdma) on Socket{id=8589937304 fd=1410 addr=10.156.8.30:26736:16619} (0x0x55b62e366000)


I0724 15:20:20.888519  1066 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:559] Start handshake on Socket{id=2490 fd=1339 addr=10.156.0.33:47142:25191} (0x0x5630d760a900)
W0724 15:20:20.888569  1066 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:567] Fail to read Hello Message from client:Socket{id=2490 fd=1339 addr=10.156.0.33:47142:25191} (0x0x5630d760a900) 10.156.0.33:47142: Got EOF
I0724 15:20:25.786496  1085 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:559] Start handshake on Socket{id=2491 fd=1423 addr=10.156.0.33:47152:25191} (0x0x5630d760ab40)
I0724 15:20:25.793311  1090 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:695] Handshake ends (use rdma) on Socket{id=2491 fd=1423 addr=10.156.0.33:47152:25191} (0x0x5630d760ab40)

@Tuvie
Contributor

Tuvie commented Jul 24, 2023

They are not processed before the QP reaches the RTR state.
Are you saying that the problem above occurs right when the connection has just been created?
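To make the ordering concrete, a standalone ibverbs-only sketch (not brpc's code; the attribute values and the mask below are placeholders): receive WRs may be posted while the QP is still in INIT, but the HCA only starts consuming them once the QP has been moved to RTR.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

// Post a single receive WR; legal in INIT, RTR and RTS.
static int PostOneRecv(ibv_qp* qp, void* buf, uint32_t len, uint32_t lkey) {
    ibv_sge sge;
    sge.addr = (uintptr_t)buf;
    sge.length = len;
    sge.lkey = lkey;
    ibv_recv_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list = &sge;
    wr.num_sge = 1;
    ibv_recv_wr* bad = NULL;
    return ibv_post_recv(qp, &wr, &bad);
}

// Move the QP to RTR; only after this does the HCA start consuming posted
// receive WRs. Attribute values are illustrative, not brpc's.
static int BringToRtr(ibv_qp* qp, uint32_t remote_qpn, uint16_t dlid) {
    ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_RTR;
    attr.path_mtu = IBV_MTU_1024;
    attr.dest_qp_num = remote_qpn;
    attr.rq_psn = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer = 12;
    attr.ah_attr.dlid = dlid;
    attr.ah_attr.port_num = 1;
    return ibv_modify_qp(qp, &attr,
            IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
            IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}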

@372046933
Contributor Author

No, it does not happen right after creation; it happens in the middle of communication. Could it be caused by the send/recv pairing getting out of order? Judging by how often I see it, it is more likely to occur once QPS gets high.

@Tuvie
Contributor

Tuvie commented Jul 24, 2023

For the error above, could you paste the log from the other end, the 10.156.8.30 side?

@372046933
Contributor Author

The client finds Controller->Failed to be true:

LOG(WARNING) << method_name_ << " RPC failed, " << cntl.ErrorText()
2023-07-24 15:26:52.658294: W xxx/yyy.cc:394] StreamSend RPC failed, [E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R1][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R2][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R3][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R4][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R5][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R6][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R7][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R8][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R9][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R10][E112]Not connected to 10.156.9.20:16619 yet

...

W0724 15:26:52.656967  1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.
W0724 15:26:52.657038  1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.
W0724 15:26:52.657049  1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.

...

@Tuvie
Contributor

Tuvie commented Jul 25, 2023

From the logs, are you using RPC streams?

@372046933
Contributor Author

No streams; the RPC name just happens to contain "Stream".

LOG(WARNING) << method_name_ << " RPC failed, " ...

@Tuvie
Contributor

Tuvie commented Jul 25, 2023

Could you package up the logs from both sides and send them over?

@372046933
Contributor Author

There are not many logs. Some of them are from logging I added inside brpc myself:

diff --git a/src/brpc/rdma/rdma_endpoint.cpp b/src/brpc/rdma/rdma_endpoint.cpp
index d3a91560..e7e60ff4 100644
--- a/src/brpc/rdma/rdma_endpoint.cpp
+++ b/src/brpc/rdma/rdma_endpoint.cpp
@@ -251,7 +251,9 @@ void RdmaConnect::StartConnect(const Socket* socket,
     }
 }
 
-void RdmaConnect::StopConnect(Socket* socket) { }
+void RdmaConnect::StopConnect(Socket* socket) {
+    LOG(INFO) << "RdmaConnect StopConnect called";
+}
 
 void RdmaConnect::Run() {
     _done(errno, _data);
@@ -274,6 +276,7 @@ static void TryReadOnTcpDuringRdmaEst(Socket* s) {
                 break;
             }
         } else if (nr == 0) {
+            LOG(INFO) << "Got socket EOF, " << *s;
             s->SetEOF();
             return;
         } else {
@@ -330,7 +333,7 @@ void RdmaEndpoint::OnNewDataFromTcp(Socket* m) {
     }
 }
 
-bool HelloNegotiationValid(HelloMessage& msg) {
+bool HelloNegotiationValid(const HelloMessage& msg) {
     if (msg.hello_ver == g_rdma_hello_version &&
         msg.impl_ver == g_rdma_impl_version &&
         msg.block_size >= MIN_BLOCK_SIZE &&
@@ -360,6 +363,7 @@ int RdmaEndpoint::ReadFromFd(void* data, size_t len) {
                     }
                 }
             } else {
+                PLOG(WARNING) << "Failed to read";
                 return -1;
             }
         } else if (nr == 0) {  // Got EOF
@@ -1030,7 +1034,7 @@ int RdmaEndpoint::PostRecv(uint32_t num, bool zerocopy) {
         if (_rq_received == _rq_size) {
             _rq_received = 0;
         }
-    };
+    }
     return 0;
 }
 
@@ -1168,11 +1172,6 @@ int RdmaEndpoint::BringUpQp(uint16_t lid, ibv_gid gid, uint32_t qp_num) {
         return -1;
     }
 
-    if (PostRecv(_rq_size, true) < 0) {
-        PLOG(WARNING) << "Fail to post recv wr";
-        return -1;
-    }
-
     attr.qp_state = IBV_QPS_RTR;
     attr.path_mtu = IBV_MTU_1024;  // TODO: support more mtu in future
     attr.ah_attr.grh.dgid = gid;
@@ -1202,6 +1201,11 @@ int RdmaEndpoint::BringUpQp(uint16_t lid, ibv_gid gid, uint32_t qp_num) {
         return -1;
     }
 
+    if (PostRecv(_rq_size, true) < 0) {
+        PLOG(WARNING) << "Fail to post recv wr";
+        return -1;
+    }
+
     attr.qp_state = IBV_QPS_RTS;
     attr.timeout = TIMEOUT;
     attr.retry_cnt = RETRY_CNT;
@@ -1293,7 +1297,8 @@ void RdmaEndpoint::DeallocateResources() {
 static const int MAX_CQ_EVENTS = 128;
 
 int RdmaEndpoint::GetAndAckEvents() {
-    int events = 0; void* context = NULL;
+    int events = 0;
+    void* context = NULL;
     while (1) {
         if (IbvGetCqEvent(_resource->comp_channel, &_resource->cq, &context) < 0) {
             if (errno != EAGAIN) {
@@ -1354,11 +1359,13 @@ void RdmaEndpoint::PollCq(Socket* m) {
                 // that the event arrives after the poll but before the notify,
                 // we should re-poll the CQ once after the notify to check if
                 // there is an available CQE.
-                if (ibv_req_notify_cq(ep->_resource->cq, 1) < 0) {
+                if (ibv_req_notify_cq(ep->_resource->cq, /*solicited_only=*/1) <
+                    0) {
                     const int saved_errno = errno;
-                    PLOG(WARNING) << "Fail to arm CQ comp channel: " << s->description();
+                    PLOG(WARNING)
+                        << "Fail to arm CQ comp channel: " << s->description();
                     s->SetFailed(saved_errno, "Fail to arm cq channel from %s: %s",
-                            s->description().c_str(), berror(saved_errno));
+                                s->description().c_str(), berror(saved_errno));
                     return;
                 }
                 notified = true;
@@ -1387,7 +1394,8 @@ void RdmaEndpoint::PollCq(Socket* m) {
                 PLOG(WARNING) << "Fail to handle RDMA completion, error status("
                               << wc[i].status << "): " << s->description();
                 s->SetFailed(ERDMA, "RDMA completion error(%d) from %s: %s",
-                             wc[i].status, s->description().c_str(), berror(ERDMA));
+                             wc[i].status, s->description().c_str(),
+                             berror(ERDMA));
                 continue;
             }
 
diff --git a/src/brpc/socket.cpp b/src/brpc/socket.cpp
index c49ca083..4ebba8d2 100644
--- a/src/brpc/socket.cpp
+++ b/src/brpc/socket.cpp
@@ -803,6 +803,7 @@ int Socket::WaitAndReset(int32_t expected_nref) {
 
 #if BRPC_WITH_RDMA
     if (_rdma_ep) {
+        LOG(WARNING) << "Reset RdmaEndpoint, " << description();
         _rdma_ep->Reset();
         _rdma_state = RDMA_UNKNOWN;
     }

server

brpc log:
W0724 21:29:37.355377   978 external/brpc/src/brpc/input_messenger.cpp:247] Close Socket{id=8589934594 fd=297 addr=10.156.8.29:29765:37376} (0x560a5f5cc480): absolutely wrong message

Application-level log:
W0724 21:29:37.355577   838 ./xxx/yyy/zzz.h:119] Method:[FuncFoo] Fail to get response from shard#0 [E22]Close Socket{id=8589934594 fd=297 addr=10.156.8.29:29765:37376} (0x0x560a5f5cc480): absolutely wrong message

brpc log:
I0724 21:29:37.458484  1004 external/brpc/src/brpc/socket.cpp:2465] Checking Socket{id=8589934594 addr=10.156.8.29:29765} (0x560a5f5cc480)
I0724 21:29:37.458786  1016 external/brpc/src/brpc/socket.cpp:2525] Revived Socket{id=8589934594 addr=10.156.8.29:29765} (0x560a5f5cc480) (Connectable)
I0724 21:29:42.355888   985 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:418] Start handshake on 10.156.0.21:54384
I0724 21:29:42.362131   995 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:543] Handshake ends (use rdma) on Socket{id=8589934594 fd=297 addr=10.156.8.29:29765:54384} (0x0x560a5f5cc480)

client

I0724 21:29:37.446674  1045 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:279] Got socket EOF, Socket{id=25769805360 fd=300 addr=10.156.0.21:37376:29765} (0x557bf1110480)
I0724 21:29:37.449893  1090 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:560] Start handshake on Socket{id=8589943858 fd=1413 addr=10.156.0.21:54366:29765} (0x0x557ef5f2c000)
W0724 21:29:37.449923  1090 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:568] Fail to read Hello Message from client:Socket{id=8589943858 fd=1413 addr=10.156.0.21:54366:29765} (0x0x557ef5f2c000) 10.156.0.21:54366: Got EOF
I0724 21:29:42.347077  1047 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:560] Start handshake on Socket{id=17179869414 fd=1358 addr=10.156.0.21:54384:29765} (0x0x557bf0b22900)
I0724 21:29:42.353112  1097 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:696] Handshake ends (use rdma) on Socket{id=17179869414 fd=1358 addr=10.156.0.21:54384:29765} (0x0x557bf0b22900)

@372046933
Contributor Author

The RDMA handshake should only need to happen once to establish the QP, right? From the logs, another exchange takes place after the Close Socket.
It feels like some bug is being triggered...

@Tuvie
Contributor

Tuvie commented Jul 25, 2023

After Close Socket there should be an automatic reconnect. Does the client not have any auto-reconnect logs?

@372046933
Contributor Author

These should be the reconnect logs. But why would the socket suddenly be closed? Is that expected behavior?

I0724 21:29:37.458484  1004 external/brpc/src/brpc/socket.cpp:2465] Checking Socket{id=8589934594 addr=10.156.8.29:29765} (0x560a5f5cc480)
I0724 21:29:37.458786  1016 external/brpc/src/brpc/socket.cpp:2525] Revived Socket{id=8589934594 addr=10.156.8.29:29765} (0x560a5f5cc480) (Connectable)
I0724 21:29:42.355888   985 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:418] Start handshake on 10.156.0.21:54384
I0724 21:29:42.362131   995 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:543] Handshake ends (use rdma) on Socket{id=8589934594 fd=297 addr=10.156.8.29:29765:54384} (0x0x560a5f5cc480)

@Tuvie
Contributor

Tuvie commented Jul 25, 2023

The socket is closed because a malformed RPC message was detected.
The Close Socket then triggers a health check on the client (Checking Socket); if the TCP connection is established successfully, the health check is considered passed (Revived Socket). That health-check connection is then closed right away. The server still starts the RDMA handshake logic (Start handshake) because it cannot tell that this is only a health-check connection, and since the client has already closed the connection on its side, the server's handshake fails (Got EOF). This later sequence of states is fine.

The root problem is still the malformed RPC message.

@372046933
Contributor Author

Where would it be convenient to add logging to track this down? What usually causes malformed messages?

@Tuvie
Contributor

Tuvie commented Jul 25, 2023

You can try the change below to print the message content seen on both the send and receive paths.
The log volume will probably be very large, though. Rotate aggressively and see whether the error can be captured.

--- a/src/brpc/rdma/rdma_endpoint.cpp
+++ b/src/brpc/rdma/rdma_endpoint.cpp
@@ -896,6 +896,7 @@ ssize_t RdmaEndpoint::CutFromIOBufList(butil::IOBuf** from, size_t ndata) {
             return -1;
         }

+        LOG_IF(INFO, FLAGS_rdma_trace_verbose) << _sbuf[_sq_current];
         ++_sq_current;
         if (_sq_current == _sq_size - RESERVED_WR_NUM) {
             _sq_current = 0;
@@ -950,6 +951,7 @@ ssize_t RdmaEndpoint::HandleCompletion(ibv_wc& wc) {
     case IBV_WC_RECV: {  // recv completion
         // Please note that only the first wc.byte_len bytes is valid
         if (wc.byte_len > 0) {
+            LOG_IF(INFO, FLAGS_rdma_trace_verbose) << _rbuf[_rq_received];
             if (wc.byte_len < (uint32_t)FLAGS_rdma_zerocopy_min_size) {
                 zerocopy = false;
             }

@372046933
Contributor Author

OK, but that much logging probably cannot be written to disk; it is simply too large (several GB/s of log data). Is there any way to analyze this at the code level instead?

One more question: when wc.byte_len < (uint32_t)FLAGS_rdma_zerocopy_min_size, zerocopy is false. At that point, has the memory pointed to by _rbuf_data[_rq_received] already been allocated?

_socket->_read_buf.append(_rbuf_data[_rq_received], wc.byte_len);

@Tuvie
Contributor

Tuvie commented Jul 25, 2023

Yes, it has been allocated.

Please do try to collect the logs.

@372046933
Contributor Author

Could you point to where it is allocated?
Logging at that volume is not feasible. Even if everything were written out, a single RPC would take several to more than ten seconds, which makes the problem impossible to reproduce.

@Tuvie
Contributor

Tuvie commented Jul 25, 2023

It is allocated inside PostRecv.

@372046933
Contributor Author

You mean this place, right? When zerocopy is false, the address passed to DoPostRecv ends up in ibv_sge.addr. In that case, where is _rbuf_data[_rq_received] allocated and its MR registered?

int RdmaEndpoint::PostRecv(uint32_t num, bool zerocopy) {
    // We do the post repeatedly from the _rbuf[_rq_received].
    while (num > 0) {
        if (zerocopy) {
            _rbuf[_rq_received].clear();
            butil::IOBufAsZeroCopyOutputStream os(&_rbuf[_rq_received],
                    g_rdma_recv_block_size + IOBUF_BLOCK_HEADER_LEN);
            int size = 0;
            if (!os.Next(&_rbuf_data[_rq_received], &size)) {
                // Memory is not enough for preparing a block
                PLOG(WARNING) << "Fail to allocate rbuf";
                return -1;
            } else {
                CHECK(size == g_rdma_recv_block_size) << size;
            }
        }
        if (DoPostRecv(_rbuf_data[_rq_received], g_rdma_recv_block_size) < 0) {

@Tuvie
Contributor

Tuvie commented Jul 26, 2023

These use memory from the memory pool, which is registered once at startup.
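Conceptually (this is a sketch, not brpc's actual pool implementation), "registered once at startup" means one large slab is ibv_reg_mr'd a single time and fixed-size receive blocks are carved out of it afterwards, so every block handed to DoPostRecv already lives inside a registered Memory Region:

#include <stdlib.h>
#include <infiniband/verbs.h>

struct RegisteredPool {
    char*   base;        // one big slab
    size_t  block_size;
    size_t  num_blocks;
    size_t  next;
    ibv_mr* mr;          // MR covering the whole slab
};

// Allocate and register the slab once, at initialization time.
static bool InitPool(RegisteredPool* p, ibv_pd* pd,
                     size_t block_size, size_t num_blocks) {
    p->block_size = block_size;
    p->num_blocks = num_blocks;
    p->next = 0;
    p->base = (char*)malloc(block_size * num_blocks);
    if (p->base == NULL) {
        return false;
    }
    p->mr = ibv_reg_mr(pd, p->base, block_size * num_blocks,
                       IBV_ACCESS_LOCAL_WRITE);
    return p->mr != NULL;
}

// Blocks handed out from here need no further registration; their lkey is
// simply p->mr->lkey. A real pool would also recycle freed blocks.
static char* GetBlock(RegisteredPool* p) {
    if (p->next == p->num_blocks) {
        return NULL;
    }
    return p->base + (p->next++) * p->block_size;
}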

@372046933
Contributor Author

This error is also a bit odd; the ProtocolType should have been BAIDU_STD:

if (m->CreatedByConnect()) {
    if ((ProtocolType)cur_index == PROTOCOL_BAIDU_STD) {
        // baidu_std may fall to streaming_rpc.
        cur_index = (int)PROTOCOL_STREAMING_RPC;
        continue;
    } else {
        // The protocol is fixed at client-side, no need to try others.
        LOG(ERROR) << "Fail to parse response from " << m->remote_side()

E0726 12:28:46.931434   969 external/brpc/src/brpc/input_messenger.cpp:122] Fail to parse response from 10.156.15.27:17657 by baidu_std at client-side
W0726 12:28:46.931502   969 external/brpc/src/brpc/input_messenger.cpp:247] Close Socket{id=8589934594 fd=1409 addr=10.156.15.27:17657:38500} (0x5628b07bc480): absolutely wrong message

@372046933
Contributor Author

I0726 16:25:12.445185  1011 external/brpc/src/brpc/input_messenger.cpp:116] cur_index=1 preferred=1 result.error()=1 result.error_str()=try other protocols
I0726 16:25:12.445248  1011 external/brpc/src/brpc/input_messenger.cpp:116] cur_index=2 preferred=1 result.error()=1 result.error_str()=try other protocols
E0726 16:25:12.445270  1011 external/brpc/src/brpc/input_messenger.cpp:126] Fail to parse response from 10.156.15.16:20493 by baidu_std at client-side
W0726 16:25:12.445318  1011 external/brpc/src/brpc/input_messenger.cpp:251] Close Socket{id=8589934619 fd=1341 addr=10.156.15.16:20493:64774} (0x556c6752fcc0): absolutely wrong message
diff --git a/src/brpc/input_messenger.cpp b/src/brpc/input_messenger.cpp
index 43167d5b..787fc993 100644
--- a/src/brpc/input_messenger.cpp
+++ b/src/brpc/input_messenger.cpp
@@ -113,7 +113,11 @@ ParseResult InputMessenger::CutInputMessage(
             }
 
             if (m->CreatedByConnect()) {
-                if((ProtocolType)cur_index == PROTOCOL_BAIDU_STD) {
+                LOG(INFO) << "cur_index=" << cur_index
+                          << " preferred=" << preferred
+                          << " result.error()=" << result.error()
+                          << " result.error_str()=" << result.error_str();
+                if ((ProtocolType)cur_index == PROTOCOL_BAIDU_STD) {
                     // baidu_std may fall to streaming_rpc.
                     cur_index = (int)PROTOCOL_STREAMING_RPC;
                     continue;
@@ -332,7 +336,7 @@ void InputMessenger::OnNewMessages(Socket* m) {
     // - If the socket has only one message, the message will be parsed and
     //   processed in this bthread. nova-pbrpc and http works in this way.
     // - If the socket has several messages, all messages will be parsed (
-    //   meaning cutting from butil::IOBuf. serializing from protobuf is part of
+    //   meaning cutting from butil::IOBuf. Deserializing from protobuf is part of
     //   "process") in this bthread. All messages except the last one will be
     //   processed in separate bthreads. To minimize the overhead, scheduling
     //   is batched(notice the BTHREAD_NOSIGNAL and bthread_flush).

@Tuvie
Contributor

Tuvie commented Jul 26, 2023

These are all consequences of the corrupted data, not the cause. At the very least, try making it FATAL when the error occurs so you get a core dump.

@372046933
Contributor Author

Where should it abort? When the RPC failure surfaces at the application layer? And once there is a core dump, which variables should be inspected?

@Tuvie
Contributor

Tuvie commented Jul 26, 2023

Right where the protocol error is detected. Before the FATAL, also call brpc::rdma::DumpMemoryPoolInfo().

In addition, please paste all of your RDMA-related flag settings, and let us know how large your program's user_data is.
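One way this could be wired up, sketched as a standalone helper rather than an exact patch (the flag name and call site are hypothetical, and the exact DumpMemoryPoolInfo signature should be checked in brpc's rdma headers before calling it):

#include <gflags/gflags.h>
#include <butil/logging.h>
#include <brpc/socket.h>

DEFINE_bool(abort_on_wrong_message, false,
            "Abort on 'absolutely wrong message' to capture a core dump");

// Hypothetical helper, intended to be called from the branch in
// input_messenger.cpp that currently logs "absolutely wrong message".
// Dump brpc::rdma::DumpMemoryPoolInfo() output just before this, then let
// LOG(FATAL) abort the process so the offending IOBuf is preserved in the
// core dump.
static void MaybeAbortOnWrongMessage(brpc::Socket* s) {
    if (FLAGS_abort_on_wrong_message) {
        LOG(FATAL) << "Absolutely wrong message on " << s->description()
                   << ", aborting for offline analysis";
    }
}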

@live4thee
Contributor

live4thee commented Nov 26, 2024

A few additional findings, CC @Tuvie:

  1. When input_messenger.cpp reports the "due to unknown message" error, adding a log statement that prints RdmaEndpoint::DebugInfo() shows the following:
rdma_state=ON
handshake_state=ESTABLISHED
rdma_window_size=125
rdma_local_window_capacity=125
rdma_remote_window_capacity=125
rdma_sbuf_head=60
rdma_sbuf_tail=60
rdma_rbuf_head=118
rdma_unacked_rq_wr=32
rdma_received_ack=0
rdma_unsolicited_sent=0
rdma_unsignaled_sq_wr=14
  2. At the same moment, the client side of the corresponding RPC request prints the following RdmaEndpoint::DebugInfo():
rdma_state=ON
handshake_state=ESTABLISHED
rdma_window_size=19
rdma_local_window_capacity=125
rdma_remote_window_capacity=125
rdma_sbuf_head=67
rdma_sbuf_tail=86
rdma_rbuf_head=22
rdma_unacked_rq_wr=0
rdma_received_ack=0
rdma_unsolicited_sent=0
rdma_unsignaled_sq_wr=5

Analyzing the "unknown message" error logs shows that content which should have appeared in the attachment ended up inside the protobuf body instead.

In addition, if the RPC payload is small, for example limited to 64KB or less, the problem is much harder to reproduce.

@live4thee
Contributor

live4thee commented Dec 4, 2024

Dumping the request content right before the "unknown message" error shows that the content of one of the RdmaIOBufs is shifted.

Suppose the received RdmaIOBuf list is [B1, B2, ..., Bn], whose contents make up one RPC message cut from the stream. The symptom is that one Bi, which should hold content belonging to the current RPC, instead contains the RpcMeta of the second RPC, and the computed offset is short by exactly one RdmaIOBuf: 8160 bytes.

Updated:
After more test runs and analysis of the request contents, the common pattern is:

  1. The failing RPC request is missing content, and the missing size is always an integer multiple of the RdmaIOBuf size;
  2. Content from subsequent RPC requests is mixed into the failing RPC, which is why parsing the subsequent RPCs reports "unknown message".

Updated again:
There are also a few cases where the content is simply corrupted, with no RPC header metadata mixed in, as in the image below. The data is filled with 0xdeadface.
[image]
