Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/peer: Introducing peer "provider" with peer_cq to begin with #4

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

grom72
Copy link
Collaborator

@grom72 grom72 commented Dec 2, 2022

include/ofi_peer.h file along with prov/peer directory defines space where peer-related common code is placed.
peer_cq is the first structure that has been defined. It is based on the coll_cq structure. coll provider uses peer_cq instead of removed coll_cq.

See man/fi_peer.3.md

Signed-off-by: Tomasz Gromadzki [email protected]


This change is Reviewable

@grom72 grom72 force-pushed the peer_cq branch 3 times, most recently from bc6897d to c459fc3 Compare December 2, 2022 13:43
Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @grom72)


-- commits line 2 at r1:
introducing (small letter)

Code quote:

Introducing

-- commits line 5 at r1:
peer-related

Code quote:

peer related

-- commits line 9 at r1:
See man/fi_peer.3.md. - the commit message is in the plain-text format, not MD.

Code quote:

See [man/fi_peer.3.md](man/fi_peer.3.md)

Makefile.am line 89 at r1 (raw file):

	prov/util/src/cuda_ipc_monitor.c \
	prov/peer/src/peer_cq.c 	\
	prov/peer/src/peer.h 	        \

too many spaces or tabs (use only tabs, not spaces)

Code quote:

peer.h 	        \

include/ofi_peer.h line 2 at r1 (raw file):

/*
 * Copyright (c) 2019-2022 Intel Corporation, Inc.  All rights reserved.

Why not just 2022?

Code quote:

2019-2022

include/ofi_peer.h line 40 at r1 (raw file):

#include <ofi_list.h>
#include <ofi_atom.h>
#include <ofi_bitmask.h>

Are they really required here?

Code quote:

#include <ofi_atom.h>
#include <ofi_bitmask.h>

include/ofi_peer.h line 45 at r1 (raw file):

int ofi_peer_cq_open(struct fid_domain *domain, struct fi_cq_attr *attr,

Too many empty lines in between.

Code quote:

#include <ofi_util.h>



int ofi_peer_cq_open(struct fid_domain *domain, struct fi_cq_attr *attr,

include/ofi_peer.h line 52 at r1 (raw file):

			fi_addr_t src);

#endif // _OFI_PEER_H_

/* _OFI_PEER_H_ */ - // is a C++ comment

Code quote:

// _OFI_PEER_H_

prov/coll/src/coll_coll.c line 726 at r1 (raw file):

	ep = container_of(coll_op->ep, struct coll_ep, util_ep.ep_fid);

	ret = ofi_peer_cq_write(&ep->util_ep.tx_cq->cq_fid, coll_op->context, 

redundant space at the end

Code quote:


prov/peer/src/peer.h line 39 at r1 (raw file):

#if HAVE_CONFIG_H
#include <config.h>
#endif

What is it for?

Code quote:

#if HAVE_CONFIG_H
#include <config.h>
#endif

prov/peer/src/peer.h line 43 at r1 (raw file):

#include <ofi.h>
#include <ofi_util.h>

redundant empty line


prov/peer/src/peer_cq.c line 99 at r1 (raw file):

	ret = ofi_cq_init(provider, domain, attr, &cq->util_cq, 
		&ofi_cq_progress, NULL);

Doesn't it fit into one line?

Code quote:

	ret = ofi_cq_init(provider, domain, attr, &cq->util_cq,
		&ofi_cq_progress, NULL);

prov/peer/src/peer_cq.c line 122 at r1 (raw file):

ssize_t ofi_peer_cq_write(struct fid_cq *cq_fid, void *context, uint64_t flags,
			size_t len, void *buf, uint64_t data, uint64_t tag,
			fi_addr_t src)

Doesn't it fit into one line?

Code quote:

			size_t len, void *buf, uint64_t data, uint64_t tag,
			fi_addr_t src)

Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 7 files reviewed, 13 unresolved discussions (waiting on @ldorau)


-- commits line 2 at r1:

Previously, ldorau (Lukasz Dorau) wrote…

introducing (small letter)

No - see Sean's info: ofiwg#8278 (comment)


-- commits line 5 at r1:

Previously, ldorau (Lukasz Dorau) wrote…

peer-related

Done.


-- commits line 9 at r1:

Previously, ldorau (Lukasz Dorau) wrote…

See man/fi_peer.3.md. - the commit message is in the plain-text format, not MD.

Done.


Makefile.am line 89 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

too many spaces or tabs (use only tabs, not spaces)

Done.


include/ofi_peer.h line 2 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

Why not just 2022?

Done.


include/ofi_peer.h line 40 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

Are they really required here?

Done.


include/ofi_peer.h line 45 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

Too many empty lines in between.

Done.


include/ofi_peer.h line 52 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

/* _OFI_PEER_H_ */ - // is a C++ comment

Done.


prov/coll/src/coll_coll.c line 726 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

redundant space at the end

Done.


prov/peer/src/peer.h line 39 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

What is it for?

I do not know yet.

Code quote (from prov/coll/src/coll_coll.c):

ofi_peer_cq_write

prov/peer/src/peer.h line 43 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

redundant empty line

Done.


prov/peer/src/peer_cq.c line 99 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

Doesn't it fit into one line?

No


prov/peer/src/peer_cq.c line 122 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

Doesn't it fit into one line?

No - we have only 80 characters

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @grom72)


-- commits line 2 at r1:

Previously, grom72 (Tomasz Gromadzki) wrote…

No - see Sean's info: ofiwg#8278 (comment)

OK


prov/peer/src/peer_cq.c line 92 at r2 (raw file):

	cq->peer_cq = peer_context->cq;

	ret = ofi_cq_init(provider, domain, attr, &cq->util_cq, 

Redundant space at the end

Code quote:

Copy link

@haichangsi haichangsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @grom72)

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 9 files at r3, all commit messages.
Reviewable status: 3 of 11 files reviewed, 1 unresolved discussion (waiting on @grom72 and @haichangsi)

vidsouza and others added 8 commits December 7, 2022 07:21
Prior to this patch, the provider would ignore the domain
name in hints->domain_attr->name even if this was set.
This patch fixed this issue.

Signed-off-by: Vishwas Dsouza <[email protected]>
This test tests the -d <domain_name> argument of the fabtests.
All the EFA domain names are tested with this test with the
corresponding fabtest.

Signed-off-by: Vishwas Dsouza <[email protected]>
This patch added a call to FI_WARN to print endpoint's address
and libfabric version after endpoint is created.

Signed-off-by: Wei Zhang <[email protected]>
include/ofi_peer.h file along with prov/peer directory defines space
where peer-related common code is placed.
peer_cq is the first structure that has been defined. It is based on
coll_cq structure. coll provider uses peer_cq instead of removed coll_cq.

See man/fi_peer.3.md

Signed-off-by: Tomasz Gromadzki <[email protected]>
Adding struct fi_provider* as the first parameter to peer_cq_init()
makes it similar to ofi_cq_init().

Signed-off-by: Tomasz Gromadzki <[email protected]>
struct peer_cq is not used at all. All operation are based on fid_peer_cq.
There are no real CQ on util_coll. Only fid_peer_cq is binded to util_coll_ep.

Signed-off-by: Tomasz Gromadzki <[email protected]>
util_coll_cq is no longer needed as we use only fid_peer_cq to link util_coll_ep and
rxm_ep.util_cq directly with help of bind() operation.

Signed-off-by: Tomasz Gromadzki <[email protected]>
struct peer_cq is no longer needed.

Signed-off-by: Tomasz Gromadzki <[email protected]>
grom72 pushed a commit that referenced this pull request Mar 24, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
ofiwg#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
ofiwg#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants