Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf improvements #314

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from
Draft

Perf improvements #314

wants to merge 21 commits into from

Conversation

GregAC
Copy link
Contributor

@GregAC GregAC commented Nov 3, 2024

Proposed performance improvements (improving SRAM throughput and Ibex single cycle multiplier), rebased on top of the pinmux changes (#309).

Timing is passing and build time looking good.

Two things were required to achieve this:

  1. Increasing the optimization level (Increase optimisation level #302), though this had no clear impact on build times
  2. Adding a register stage to the internal SPI loopback (see commit dd85088). The timing path this cuts is odd, there's a single level of logic between two registers with almost 25ns worth of delay. I suspect it's something to do with the SPI timing constraints causing the tool to want the register to output at a hugely skewed time. We may just be on the edge of getting away with it and these changes have pushed it over the edge so regardless of whether we take these changes we should likely cut the internal loopback path.

I've also added saving the utilization and timing reports from the FPGA CI as artifacts so they can be examined after (in particular so you can look at the timing results for any particular PR).

HU90m and others added 16 commits November 1, 2024 15:07
`ios` isn't very accurate as these are only outputs.
Verilator's lint complains about this being implicit when selecting a
value of 8. See future pinmapping commit.
These pins are only used for spi so don't need to be pinmuxed.
This removes blocks that have no options from the register map
Updates the definitions of the number of GPIO, UART and SPI devices that
are available in the software definitions file for Sonata to match the
changed nubmer of devices in the new pin mapping configuration. This
commit aims to keep changes minimal to get tests passing - it does not
look at solving any other issues with e.g. outdated definitions for now.
Updates the manual `pinmux_checker` application along with the
`pinmux_all_blocks_check` that uses it to match the new pin mapping
changes under Pinmux for 1.0.

This involves changing the pinmux assignment for a couple of pins to
match the changed order in the pinmux config changes. It also involves
reducing the number of UARTs that can be used by the the pinmux checker
from 1-2 down from 1-4, since the overall number of UARTs has also been
reduced by recent harwdare changes, and thus these non-existent UART
devices should not be exposed.

It finally also changes the number of SPIs exposed to just SPI0 and
SPI1 which are now exposed via Pinmux, and makes the necessary changes
to pinmux checker to facilitate this.
…ernal flash

Updates the pinmux tests for the changes introduced with the new pinmux
mapping. This primarily involves just changing the pins / devices used
to the devices that are now mapped on those pins, and updating
documentation to reflect this.

Also resets all pinmux logic after muxing to ensure that if more tests
are added afterwards or if the tests are run multiple times, that errors
are not introduced by the pinmux tests changing state.

Since the Application flash pinmux pins are no longer available through
pinmux, this also involves converting the SPI Pinmux test to change to
instead test using an extrenal SPI PMOD SF3 conecting to PMOD1. This
repurposes the existing test logic used by the pinmux checker to reduce
the amonut of additional code.
Adds a basic PMOD SF3 DPI in Verilator, attached to the PMOD1 pins so
that the device can be used. This is modelled to be the same as the
existing Application Flash on Sonata, but with a different JEDEC ID that
it reports. Makes the Spi Flash take a JEDEC ID as a parameter to reduce
duplicated logic between these DPIs.

Disclaimer: I am not a hardware engineer, the hardware changes may
contain problems.
Co-authored-by: Elliot Baptist <[email protected]>
Timing failures have been observed on this loopback path, it has minimal
logic levels but a very long delay, possibly due to I/O timing
constraints.

Adding the register stage cuts the internal path. With the register
stage the internal loopback cannot run at full speed, however as this is
for testing purposes only this is acceptable.
This was referenced Nov 3, 2024
Need a minimum of 2 (this is what is used in OpenTitan) to enable back
to back requests without stall cycles.
Update code from upstream repository
https://github.com/lowrisc/cheriot-ibex.git to revision
ea2df9db3bcea776f0dc72d6d89c31c73798ecd4

* Feed RV32M through ibexc_top_tracing/ibexc_top (Greg Chadwick)
* Switch to no bitmanip by default (Greg Chadwick)
* Feed RV32B through in ibexc_top (Greg Chadwick)

Signed-off-by: Greg Chadwick <[email protected]>
This is effectively a no-op change. Before the latest Ibex was vendored
we had no bitmanip (the RV32BFull parameter was not fully passed
through) and RV32M was the fast multiplier.

Sadly the single cycle multiplier seems to be increasing timing
pressure. It does just meet timing but greatly increases synthesis
times. As it's implemented with in-built FPGA DSP blocks it shouldn't be
a big issue to use it so something to examine here but for now leave
things as they are.
@GregAC GregAC force-pushed the perf_improvements branch from dc38d53 to 8235bc7 Compare November 3, 2024 22:28
@GregAC
Copy link
Contributor Author

GregAC commented Nov 4, 2024

For reference here is the timing summary from the CI run (from the implementation reports artefact)

    WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints      WHS(ns)      THS(ns)  THS Failing Endpoints  THS Total Endpoints     WPWS(ns)     TPWS(ns)  TPWS Failing Endpoints  TPWS Total Endpoints  
    -------      -------  ---------------------  -------------------      -------      -------  ---------------------  -------------------     --------     --------  ----------------------  --------------------  
      0.043        0.000                      0                42250        0.066        0.000                      0                42249        1.741        0.000                       0                 15674  

Here's the utilization report

+----------------------------+-------+-------+------------+-----------+-------+
|          Site Type         |  Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+-------+
| Slice LUTs                 | 22736 |     0 |          0 |     32600 | 69.74 |
|   LUT as Logic             | 22628 |     0 |          0 |     32600 | 69.41 |
|   LUT as Memory            |   108 |     0 |          0 |      9600 |  1.13 |
|     LUT as Distributed RAM |   108 |     0 |            |           |       |
|     LUT as Shift Register  |     0 |     0 |            |           |       |
| Slice Registers            | 15330 |     0 |          0 |     65200 | 23.51 |
|   Register as Flip Flop    | 15330 |     0 |          0 |     65200 | 23.51 |
|   Register as Latch        |     0 |     0 |          0 |     65200 |  0.00 |
| F7 Muxes                   |   551 |     0 |          0 |     16300 |  3.38 |
| F8 Muxes                   |   142 |     0 |          0 |      8150 |  1.74 |
+----------------------------+-------+-------+------------+-----------+-------+

Build time was 11m 30s which is in line with previous build times in CI

@elliotb-lowrisc
Copy link
Contributor

I've also seen an LCD SPI path with 24 ns of delay on a single net. The hold slack on the path was only 13 ns, so I suppose it might be hold-fixing gone wrong.

@@ -86,6 +86,8 @@ jobs:
runs-on: [ubuntu-22.04-fpga, sonata]
env:
BITSTREAM_PATH: build/lowrisc_sonata_system_0/synth-vivado/lowrisc_sonata_system_0.bit
TIMING_RPT: build/lowrisc_sonata_system_0/synth-vivado/lowrisc_sonata_system_0.runs/impl_1/top_sonata_timing_summary_routed.rpt
Copy link
Contributor

@elliotb-lowrisc elliotb-lowrisc Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to make this conditional on whether "top_sonata_timing_summary_postroute_physopted.rpt" exists, so we avoid looking at the wrong report when we switch to the new 2024.1-compatible post-route optimisation flow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think easier to just remember to fix this up when we alter the tool version. It's reasonable for CI to specifically work with the currently agreed tool versions and need modest updates when we change versions.

assign spi_cipo = reg2hw.control.int_loopback.q ? spi_copi_o : spi_cipo_i;
logic spi_copi_q;

always_ff @(posedge clk_i) begin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have noticed the lack of reset but think it is fine. The u_spi_core/copi_shift_q driving it also lacks a reset, so this is just replicating the behaviour one cycle later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I prefer to leave resets off flops that don't strictly need them. Given the SPI ignores incoming data when it's not been specifically command to clock things in/out then it won't ever actually 'see' the first bit out of reset anyway.

@elliotb-lowrisc
Copy link
Contributor

I've run a bitstream build of this PR and a baseline build with the changes it is based on (up to and including the opt=2 change) using Vivado 2021.1 on my local machine. The breaking of the SPI loopback path did seem to improve the overall WNS (-1.451 -> -0.391). Though, for these two builds at least, the TNS went the other way (-8.168 -> -27.834), though the numbers are still relatively small. Build times were also increase substantially (19m -> 32m) though that is muddied by builds going on in parallel. All this is somewhat concerning (for the state of Sonata if not this PR) but not conclusive. More testing to follow

@elliotb-lowrisc
Copy link
Contributor

Adding timing constraint updates (#316) and building ono showed a more reasonable build time (15m) and timing passing.

@GregAC
Copy link
Contributor Author

GregAC commented Nov 4, 2024

Thanks for checking it out @elliotb-lowrisc. Frustrating it's giving you some failures.

I've certainly see some flakey behaviour from the implementation flow, one version of this (pre pinmux changes) actually gave a noticeable improvement in timing, taking us to 0.5 ns WNS (positive)!

I've rebased this locally now the pinmux changes are in and am doing another run, will see what results!

@GregAC
Copy link
Contributor Author

GregAC commented Nov 4, 2024

Well rebasing and running locally gives a terrible timing result (something like -1.5 WNS) and increased build times.

So sadly I think we'll have to reject this for 1.0 :( I will the try the two changes separately and see if they work in isolation.

As @elliotb-lowrisc says I think this is more a reflection of the current state of Sonata rather than the specific things this PR is doing, all too easy to push things the wrong way with what looks like modest design changes.

@elliotb-lowrisc
Copy link
Contributor

I think the SPI loopback change is definitely worth a quick PR and merge if possible

@GregAC
Copy link
Contributor Author

GregAC commented Nov 4, 2024

I think the SPI loopback change is definitely worth a quick PR and merge if possible

Agreed I'll spin out a separate PR

@GregAC
Copy link
Contributor Author

GregAC commented Nov 6, 2024

I have had this just passing timing on latest main but it does look very tight and was starting to push up build times.

So I think we're best off doing this post-1.0. With @elliotb-lowrisc's crossbar change and a couple of simple Ibex timing fixes I'd hope this becomes a lot more comfortable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants