RDMA failure #2265
Comments
What kind of workload is this? Is it one of the examples? What does the test look like? |
Not an example. The scenario is a distributed training job, and the failure appears after turning on the RDMA switch. Are there any diagnostic logs that could help with debugging, something like TF's VLOG? |
This kind of error is usually caused by multiple threads accessing the same IOBuf concurrently. First disable zero copy and see whether it still reproduces: set rdma_recv_zerocopy=false (disables receive-side zerocopy) and rdma_zerocopy_min_size=[a large value] (disables send-side zerocopy). That should tell us whether the race is inside the rpc library itself, or between rpc and the application above it. |
Set those flags.
Client side:
|
Does the error come from the server side every time? Can you provide more ways to reproduce it, or any characteristics observed when the problem triggers? |
The server side always reports the error first. With TCP we have never seen similar errors. |
Is it multiple RPCs from a single client failing, or RPCs from multiple clients? |
The former: multiple RPCs from one client fail. |
Sorry, I just checked the code and my statement above was wrong: rdma_zerocopy_min_size=[a large value] still only disables receive-side zerocopy. Send-side zerocopy currently cannot be disabled via a flag. I will put together a version that supports turning off send-side zerocopy for you to try. |
#2267 Debugging this kind of IOBuf data corruption is indeed difficult. Logging everything would produce far too much data. |
#2267 has a typo that needs a fix: pipeline is not defined. |
After disabling send_zerocopy,
|
Which rdma-related flags are you using? Could you paste them here? |
|
Any other flags set? block size and so on? |
No other flags.
Multi-node runs still fail.
|
Does the program call RegisterMemoryForRdma to register additional memory? |
Yes. After disabling zerocopy, is it no longer allowed to call RegisterMemoryForRdma? |
So the program uses append_user_data to attach external memory directly, and registers that memory separately, right? I still have not been able to reproduce this error locally. Any other clues? |
Yes. The only difference is the value passed to rdma_send/recv_zerocopy: with true it runs, with false it does not. |
I still cannot reproduce this locally. Does your program issue ibverbs sends directly? The error indicates the data being sent is not inside a registered Memory Region. |
Yes, we do obtain it. |
If you call ibv_reg_mr directly, the resulting mkey is passed in via append_user_data_with_meta, right? |
Yes. Otherwise it would not have run even before disabling zerocopy. |
Probably the commit was incomplete. Try again? |
That is unrelated to the earlier problem. This function is reused when zero_copy is disabled. |
Without that change, does the master branch show the problem? |
As long as send zero copy is not enabled, there is no real problem. When the problem does occur, we see errors like "Fail to handle RDMA completion, error status(4)". |
RTR means ready to receive packets; posting receives there is allowed. When you say it does not take effect, do you mean the test reported an error? |
No error. I just want to confirm: these WRs will actually wait until the RTR state before being processed, right? Also, how should errors like Socket EOF be investigated? I am not clear on how the state machine is maintained here; from the logs it looks like the socket state transitions are wrong.
|
They are not processed before the RTR state. |
It does not happen right after creation; it happens during communication. Could a mix-up of send/recv pairs cause it? From what I observe, it appears more easily once QPS gets high. |
For the error above, could you paste the log from the other end, 10.156.8.30? |
On the client side we see: LOG(WARNING) << method_name_ << " RPC failed, " << cntl.ErrorText() 2023-07-24 15:26:52.658294: W xxx/yyy.cc:394] StreamSend RPC failed, [E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R1][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R2][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R3][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R4][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R5][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R6][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R7][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R8][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R9][E112]Not connected to 10.156.9.20:16619 yet, server_id=47 [R10][E112]Not connected to 10.156.9.20:16619 yet
...
W0724 15:26:52.656967 1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.
W0724 15:26:52.657038 1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.
W0724 15:26:52.657049 1267 external/brpc/src/brpc/controller.cpp:675] `CanRetryBackoffInPthread()' returns false, skip retry backoff in pthread.
...
|
From the logs, are you using rpc streams? |
No Stream; the RPC name just happens to contain "Stream": LOG(WARNING) << method_name_ << " RPC failed, " ... |
Could you package up the logs from both sides and send them over? |
There are not many logs. Some of the log lines were added by me inside BRPC: diff --git a/src/brpc/rdma/rdma_endpoint.cpp b/src/brpc/rdma/rdma_endpoint.cpp
index d3a91560..e7e60ff4 100644
--- a/src/brpc/rdma/rdma_endpoint.cpp
+++ b/src/brpc/rdma/rdma_endpoint.cpp
@@ -251,7 +251,9 @@ void RdmaConnect::StartConnect(const Socket* socket,
}
}
-void RdmaConnect::StopConnect(Socket* socket) { }
+void RdmaConnect::StopConnect(Socket* socket) {
+ LOG(INFO) << "RdmaConnect StopConnect called";
+}
void RdmaConnect::Run() {
_done(errno, _data);
@@ -274,6 +276,7 @@ static void TryReadOnTcpDuringRdmaEst(Socket* s) {
break;
}
} else if (nr == 0) {
+ LOG(INFO) << "Got socket EOF, " << *s;
s->SetEOF();
return;
} else {
@@ -330,7 +333,7 @@ void RdmaEndpoint::OnNewDataFromTcp(Socket* m) {
}
}
-bool HelloNegotiationValid(HelloMessage& msg) {
+bool HelloNegotiationValid(const HelloMessage& msg) {
if (msg.hello_ver == g_rdma_hello_version &&
msg.impl_ver == g_rdma_impl_version &&
msg.block_size >= MIN_BLOCK_SIZE &&
@@ -360,6 +363,7 @@ int RdmaEndpoint::ReadFromFd(void* data, size_t len) {
}
}
} else {
+ PLOG(WARNING) << "Failed to read";
return -1;
}
} else if (nr == 0) { // Got EOF
@@ -1030,7 +1034,7 @@ int RdmaEndpoint::PostRecv(uint32_t num, bool zerocopy) {
if (_rq_received == _rq_size) {
_rq_received = 0;
}
- };
+ }
return 0;
}
@@ -1168,11 +1172,6 @@ int RdmaEndpoint::BringUpQp(uint16_t lid, ibv_gid gid, uint32_t qp_num) {
return -1;
}
- if (PostRecv(_rq_size, true) < 0) {
- PLOG(WARNING) << "Fail to post recv wr";
- return -1;
- }
-
attr.qp_state = IBV_QPS_RTR;
attr.path_mtu = IBV_MTU_1024; // TODO: support more mtu in future
attr.ah_attr.grh.dgid = gid;
@@ -1202,6 +1201,11 @@ int RdmaEndpoint::BringUpQp(uint16_t lid, ibv_gid gid, uint32_t qp_num) {
return -1;
}
+ if (PostRecv(_rq_size, true) < 0) {
+ PLOG(WARNING) << "Fail to post recv wr";
+ return -1;
+ }
+
attr.qp_state = IBV_QPS_RTS;
attr.timeout = TIMEOUT;
attr.retry_cnt = RETRY_CNT;
@@ -1293,7 +1297,8 @@ void RdmaEndpoint::DeallocateResources() {
static const int MAX_CQ_EVENTS = 128;
int RdmaEndpoint::GetAndAckEvents() {
- int events = 0; void* context = NULL;
+ int events = 0;
+ void* context = NULL;
while (1) {
if (IbvGetCqEvent(_resource->comp_channel, &_resource->cq, &context) < 0) {
if (errno != EAGAIN) {
@@ -1354,11 +1359,13 @@ void RdmaEndpoint::PollCq(Socket* m) {
// that the event arrives after the poll but before the notify,
// we should re-poll the CQ once after the notify to check if
// there is an available CQE.
- if (ibv_req_notify_cq(ep->_resource->cq, 1) < 0) {
+ if (ibv_req_notify_cq(ep->_resource->cq, /*solicited_only=*/1) <
+ 0) {
const int saved_errno = errno;
- PLOG(WARNING) << "Fail to arm CQ comp channel: " << s->description();
+ PLOG(WARNING)
+ << "Fail to arm CQ comp channel: " << s->description();
s->SetFailed(saved_errno, "Fail to arm cq channel from %s: %s",
- s->description().c_str(), berror(saved_errno));
+ s->description().c_str(), berror(saved_errno));
return;
}
notified = true;
@@ -1387,7 +1394,8 @@ void RdmaEndpoint::PollCq(Socket* m) {
PLOG(WARNING) << "Fail to handle RDMA completion, error status("
<< wc[i].status << "): " << s->description();
s->SetFailed(ERDMA, "RDMA completion error(%d) from %s: %s",
- wc[i].status, s->description().c_str(), berror(ERDMA));
+ wc[i].status, s->description().c_str(),
+ berror(ERDMA));
continue;
}
diff --git a/src/brpc/socket.cpp b/src/brpc/socket.cpp
index c49ca083..4ebba8d2 100644
--- a/src/brpc/socket.cpp
+++ b/src/brpc/socket.cpp
@@ -803,6 +803,7 @@ int Socket::WaitAndReset(int32_t expected_nref) {
#if BRPC_WITH_RDMA
if (_rdma_ep) {
+ LOG(WARNING) << "Reset RdmaEndpoint, " << description();
_rdma_ep->Reset();
_rdma_state = RDMA_UNKNOWN;
}
server
client
|
One RDMA handshake should be enough to establish the QP, right? From the logs, another information exchange happens after the Close Socket. |
After Close Socket there should be an automatic reconnect. Doesn't the client have reconnection logs? |
These look like the reconnection logs. But why would the socket suddenly be closed? Is that expected behavior?
|
The socket is closed because an RPC message format error was detected. The root problem is still that RPC message format error. |
Where should I add logging to narrow this down? What usually causes a message format error? |
You can try the change below to print the message content as seen on the send side and on the receive side. --- a/src/brpc/rdma/rdma_endpoint.cpp
+++ b/src/brpc/rdma/rdma_endpoint.cpp
@@ -896,6 +896,7 @@ ssize_t RdmaEndpoint::CutFromIOBufList(butil::IOBuf** from, size_t ndata) {
return -1;
}
+ LOG_IF(INFO, FLAGS_rdma_trace_verbose) << _sbuf[_sq_current];
++_sq_current;
if (_sq_current == _sq_size - RESERVED_WR_NUM) {
_sq_current = 0;
@@ -950,6 +951,7 @@ ssize_t RdmaEndpoint::HandleCompletion(ibv_wc& wc) {
case IBV_WC_RECV: { // recv completion
// Please note that only the first wc.byte_len bytes is valid
if (wc.byte_len > 0) {
+ LOG_IF(INFO, FLAGS_rdma_trace_verbose) << _rbuf[_rq_received];
if (wc.byte_len < (uint32_t)FLAGS_rdma_zerocopy_min_size) {
zerocopy = false;
} |
OK, but that volume of logging probably cannot be written to disk; it is simply too large (several GB/s). Is there any way to analyze this at the code level instead? Also, a question about brpc/src/brpc/rdma/rdma_endpoint.cpp Line 946 in 92ad882
|
It is allocated. Please try to collect the logs anyway. |
Could you point out where it is allocated? |
It is allocated inside PostRecv. |
You mean here, when zerocopy is false: brpc/src/brpc/rdma/rdma_endpoint.cpp Lines 1008 to 1024 in e6881ee
|
Those buffers come from the memory pool, which is registered at startup. |
This error also looks odd: brpc/src/brpc/input_messenger.cpp Lines 115 to 122 in 3bc4366
|
diff --git a/src/brpc/input_messenger.cpp b/src/brpc/input_messenger.cpp
index 43167d5b..787fc993 100644
--- a/src/brpc/input_messenger.cpp
+++ b/src/brpc/input_messenger.cpp
@@ -113,7 +113,11 @@ ParseResult InputMessenger::CutInputMessage(
}
if (m->CreatedByConnect()) {
- if((ProtocolType)cur_index == PROTOCOL_BAIDU_STD) {
+ LOG(INFO) << "cur_index=" << cur_index
+ << " preferred=" << preferred
+ << " result.error()=" << result.error()
+ << " result.error_str()=" << result.error_str();
+ if ((ProtocolType)cur_index == PROTOCOL_BAIDU_STD) {
// baidu_std may fall to streaming_rpc.
cur_index = (int)PROTOCOL_STREAMING_RPC;
continue;
@@ -332,7 +336,7 @@ void InputMessenger::OnNewMessages(Socket* m) {
// - If the socket has only one message, the message will be parsed and
// processed in this bthread. nova-pbrpc and http works in this way.
// - If the socket has several messages, all messages will be parsed (
- // meaning cutting from butil::IOBuf. serializing from protobuf is part of
+ // meaning cutting from butil::IOBuf. Deserializing from protobuf is part of
// "process") in this bthread. All messages except the last one will be
// processed in separate bthreads. To minimize the overhead, scheduling
// is batched(notice the BTHREAD_NOSIGNAL and bthread_flush). |
These are all consequences of the data corruption, not its cause. At the very least, make it FATAL and dump a core when the error occurs. |
Where should the abort go? When the RPC failure surfaces at the application layer? And once we have the core, which variables should we inspect? |
Right at the protocol error. Before the FATAL, also call brpc::rdma::DumpMemoryPoolInfo(). In addition, please paste all of your rdma-related flag settings, and how large is your user_data? |
One more finding, CC @Tuvie:
rdma_state=ON
handshake_state=ESTABLISHED
rdma_window_size=125
rdma_local_window_capacity=125
rdma_remote_window_capacity=125
rdma_sbuf_head=60
rdma_sbuf_tail=60
rdma_rbuf_head=118
rdma_unacked_rq_wr=32
rdma_received_ack=0
rdma_unsolicited_sent=0
rdma_unsignaled_sq_wr=14
rdma_state=ON
handshake_state=ESTABLISHED
rdma_window_size=19
rdma_local_window_capacity=125
rdma_remote_window_capacity=125
rdma_sbuf_head=67
rdma_sbuf_tail=86
rdma_rbuf_head=22
rdma_unacked_rq_wr=0
rdma_received_ack=0
rdma_unsolicited_sent=0
rdma_unsignaled_sq_wr=5
Analysis: moreover, if the RPC payload is small, say limited to 64KB or less, the problem becomes much harder to trigger. |
Describe the bug (描述bug)
To Reproduce (复现方法)
On the master branch with RDMA enabled, the failure reliably reproduces within a few hours.
Expected behavior (期望行为)
Versions (各种版本)
OS:
Compiler:
brpc:
protobuf:
Additional context/screenshots (更多上下文/截图)