Intrinsics support for Zihintntl extension #30

aswaterman opened this issue Jul 27, 2022 · 11 comments

@aswaterman
Contributor

The Zihintntl extension has recently passed AR. The spec is here: https://github.com/riscv/riscv-isa-manual/blob/10eea63205f371ed649355f4cf7a80716335958f/src/zihintntl.tex

During the AR, we wanted to raise the issue of whether and how the extension would be exposed in the RISC-V C API. Can y'all ponder the following and opine?

In x86, for example, _mm_stream_pi (https://github.com/gcc-mirror/gcc/blob/e75da2ace6b6f634237259ef62cfb2d3d34adb10/gcc/config/i386/xmmintrin.h#L1279-L1291) is roughly equivalent to c.ntl.all; sd in RISC-V.
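
For concreteness, here's a minimal hand-written sketch of that equivalent (illustration only, not a proposal), assuming an assembler that accepts the Zihintntl mnemonics; keeping the hint immediately adjacent to the access is exactly the fragile part a compiler intrinsic would have to guarantee:

#include <stdint.h>

/* Illustrative only: emit an NTL.ALL hint immediately before a 64-bit store.
 * The hint applies to the instruction that immediately follows it, so the
 * two instructions must stay adjacent. */
static inline void ntl_all_store_u64(uint64_t *p, uint64_t v)
{
    __asm__ volatile ("ntl.all\n\t"
                      "sd %1, 0(%0)"
                      : /* no outputs */
                      : "r"(p), "r"(v)
                      : "memory");
}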

(ARMv8 has LDNP/STNP instructions, but I couldn't find an intrinsics mapping for them.)

Zihintntl is more general than x86's solution in a few dimensions:

  • NTL hints can be used on both loads and stores (in x86, there are only non-temporal stores, AFAIK)
  • NTL hints can be used on any kind of load and store (in x86, it's only MMX/SSE/AVX, AFAIK)
  • NTL hints can express a memory hierarchy level (x86 can only express the equivalent of ntl.all, AFAIK)

With this in mind, the questions for the RISC-V C API folks are: how do we expose this facility in the RISC-V C API? How much of its generality do we expose? Do you foresee any impediments?

cc @kito-cheng @ptomsich

@cmuellner
Collaborator

The specification requires the memory access to be the "immediately subsequent instruction".
This is hard or impossible to guarantee with an intrinsic that does not include the memory access operation.
Therefore I would include the memory access in the API.

I see two possible solutions.

Proposal one: provide NTL loads and stores (similar to the atomics builtins):

type __riscv_ntl_load (type *ptr, enum ntl_domain domain);
void __riscv_ntl_store (type *ptr, type val, enum ntl_domain domain);

However, this will probably only work reasonably well for single-GPR/FPR-memory transfers. E.g. vector memory accesses probably need a _ntl variant of the vector load/store intrinsics.
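
A hypothetical usage sketch of proposal one, spelled out for a concrete uint64_t instantiation (the names, the domain enumeration, and the final signatures are all placeholders here):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical declarations matching the proposed shape above, written for a
 * concrete uint64_t instantiation so the example is self-contained; the real
 * builtins would presumably be type-generic and compiler-provided. */
enum ntl_domain { NTL_P1, NTL_PALL, NTL_S1, NTL_ALL };
uint64_t __riscv_ntl_load_u64(const uint64_t *ptr, enum ntl_domain domain);
void __riscv_ntl_store_u64(uint64_t *ptr, uint64_t val, enum ntl_domain domain);

/* Streaming copy where neither source nor destination is expected to be
 * reused soon: every access would carry an NTL hint for the chosen domain. */
void copy_no_reuse(uint64_t *dst, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        __riscv_ntl_store_u64(&dst[i], __riscv_ntl_load_u64(&src[i], NTL_ALL), NTL_ALL);
}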

Proposal two: no intrinsics, but an NTL function attribute that does not block inlining

// all memory accesses in this function emit an NTL hint before the memory access instruction
__attribute__((target("ntl_domain=DOMAIN")))
static inline void
my_read_buf (uint8_t* buf, size_t n_bytes, uint8_t *src)
{
    __builtin_memcpy(buf, src, n_bytes);
}

@aswaterman
Contributor Author

aswaterman commented Jul 30, 2022

I was also envisioning the intrinsic would emit the load or store in addition to the HINT. This matches how the non-temporal store intrinsics work on x86: the intrinsic actually performs the store, rather than annotating a separate assignment.

@kito-cheng
Collaborator

kito-cheng commented Aug 1, 2022

Proposal one: provide NTL loads and stores (similar to the atomics builtins):

That sounds good to me.

Proposal two: no intrinsics, but an NTL function attribute that does not block inlining

I don't like the idea of the function-attribute approach, but it inspires another possible solution: a variable attribute:

uint8_t* buf __attribute__((ntl_domain(DOMAIN)));

and any load or store through a pointer with this attribute would emit a hint instruction.

@aswaterman
Contributor Author

I like the pointer attribute approach, if it's feasible to implement.

And of course the x86-style intrinsic can be implemented using the pointer attribute approach with a simple wrapper function.
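
Something like the following sketch, assuming the hypothetical ntl_domain pointer attribute from above (neither the attribute spelling nor the wrapper name is settled):

#include <stdint.h>

/* Sketch only: the ntl_domain attribute and the NTL_ALL domain name are the
 * hypothetical ones proposed above.  The annotated pointer would make the
 * compiler emit an NTL hint in front of the store it generates for *p,
 * giving x86 _mm_stream-style behavior. */
static inline void riscv_stream_store_u64(uint64_t * __attribute__((ntl_domain(NTL_ALL))) p,
                                          uint64_t v)
{
    *p = v;   /* expected lowering: ntl.all; sd */
}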

@topperc
Contributor

topperc commented Aug 2, 2022

X86 has the MOVNTI instruction (part of SSE2) for a non-temporal store from a GPR.
X86 has MOVNTDQA for a non-temporal vector load.

@cmuellner
Collaborator

I don't like the idea of the function-attribute approach, but it inspires another possible solution: a variable attribute:

uint8_t* buf __attribute__((ntl_domain(DOMAIN)));

and any load or store through a pointer with this attribute would emit a hint instruction.

Yes, that's a better idea than a function attribute!

@kito-cheng
Collaborator

I like the pointer attribute approach, if it's feasible to implement.

We need to assess the implementation effort for the variable attribute in both compilers. I saw that load/store in LLVM IR can already encode non-temporal, but we might need to extend that to be able to express the different domains, so I think we need to introduce new intrinsics for NTL load/store as a first stage.

https://llvm.org/docs/LangRef.html#load-instruction

<result> = load [volatile] <ty>, ptr <pointer>[, align <alignment>][, !nontemporal !<nontemp_node>][, !invariant.load !<empty_node>][, !invariant.group !<empty_node>][, !nonnull !<empty_node>][, !dereferenceable !<deref_bytes_node>][, !dereferenceable_or_null !<deref_bytes_node>][, !align !<align_node>][, !noundef !<empty_node>]
<result> = load atomic [volatile] <ty>, ptr <pointer> [syncscope("<target-scope>")] <ordering>, align <alignment> [, !invariant.group !<empty_node>]
!<nontemp_node> = !{ i32 1 }
!<empty_node> = !{}
!<deref_bytes_node> = !{ i64 <dereferenceable_bytes> }
!<align_node> = !{ i64 <value_alignment> }

@aswaterman
Contributor Author

In any case, it seems we have a path to some solution.

@kito-cheng
Collaborator

SiFive folks are implementing the builtins now.

@ptomsich

Looks like we should also wire this up to the storent-optab.

Here are the equivalent patterns for x86:

; Expand patterns for non-temporal stores.  At the moment, only those
; that directly map to insns are defined; it would be possible to
; define patterns for other modes that would expand to several insns.

;; Modes handled by storent patterns.
(define_mode_iterator STORENT_MODE
  [(DI "TARGET_SSE2 && TARGET_64BIT") (SI "TARGET_SSE2")
   (SF "TARGET_SSE4A") (DF "TARGET_SSE4A")
   (V8DI "TARGET_AVX512F") (V4DI "TARGET_AVX") (V2DI "TARGET_SSE2")
   (V16SF "TARGET_AVX512F") (V8SF "TARGET_AVX") V4SF
   (V8DF "TARGET_AVX512F") (V4DF "TARGET_AVX") (V2DF "TARGET_SSE2")])

(define_expand "storent<mode>"
  [(set (match_operand:STORENT_MODE 0 "memory_operand")
        (unspec:STORENT_MODE
          [(match_operand:STORENT_MODE 1 "register_operand")]
          UNSPEC_MOVNT))]
  "TARGET_SSE")

@kito-cheng
Collaborator

Proposal for the intrinsic: #47
