floppy: apply NCO-based 2nd-order PLL to WDATA line #864

mx-shift · 2024-01-21T00:02:45Z

Centurion Finch Floppy Controller
(FFC, https://github.com/Nakazoto/CenturionComputer/wiki/FFC-Board) is an 8" floppy controller that applies up to 350ns write precompensation. FlashFloppy's existing approach of re-aligning the bit clock to the last pulse edge causes enough additional error for bitcells to be missed or injected in this case.

This commit applies a 2nd-order (proportional and integral) PLL to the incoming WDATA to mitigate both frequency offset and extreme write precompensation. The implementation is based on a virtual numerically-controlled oscillator with a frequency 2^16 times FlashFloppy's tick rate. Proportional and integral constants were chosen to nominally provide a 715kHz natural loop frequency and a damping factor of 1 and then adjusted to be powers of 2 to allow computation to use bit shifts instead of multiplication and division.

This commit has been successfully tested with a variety of PC-compatible floppy controllers as well as Centurion FFC.

keirf · 2024-01-21T05:40:42Z

Thanks for this! I need to consider whether to apply this as a default, or only, clock recovery. I can probably simplify the edges of this code a bit, where it interfaces to the rest of the firmware. My main concern is how long it takes to run per bit clock, as the old STM32 Goteks aren't very fast, and FlashFloppy is tested up to ED data rate (MFM 500ns clock). What Gotek model(s) has this been tested on?

However, fundamentally it is more principled than ae42450 and adjusting the frequency rather than messing directly with the clock phase seems obviously better for dealing with aggressive precomp. I can test this against Greaseweazle easily enough, but my gut says that this implementation shouls be quite a bit more robust.

Possibly I should try something like this in Greaseweazle too, which has a PLL that's not really a PLL. Albeit the existing implementation has proven to work pretty well even on marginal disks. I could have two implementations though, and then A/B them.

mx-shift · 2024-01-21T05:54:44Z

I only have 2x Gotek SFRKC30.AT4.35 so I've only tested with the Artery AT32F435. I was pretty careful to choose an implementation that avoids expensive math. There's one uint32_t multiply and one int32_t divide per sample. Everything else is adds, subtracts, and bit shifts. I didn't check the disassembly to see if the multiple and division are replaced by the compiler.

keirf · 2024-01-21T06:28:46Z

The other question that comes to mind is how were the PLL parameters arrived at? I'm no PLL expert, and I don't know whether for example the loop frequency and damping factors chosen are good ones, nor whether the resulting calculated k_l and k_i are good. Was there much trial and error?

mx-shift · 2024-01-21T07:14:07Z

I'll admit that I'm not particularly strong on PLL design either. I included a link to https://www.dsprelated.com/showarticle/967.php in the comment with the parameters as I referred to that site often while developing this implementation.

I did experiment with various parameters to see what range of loop frequency would work. I left damping factor alone as all the information I turned up said 1 was a safe choice. Ultimately, I found anything from 300kHz up to 750kHz worked fine for a 1us MFM clock with write precomp ranging from 100ns up to 350ns. Given that, I worked backwards from k_l and k_i that were convenient for bit shifting which also resulted in a loop frequency in that range. That resulted in k_l = 1/8, k_i = 1/256 (715kHz) and k_l = 1/16, k_i = 1/1024 (358kHz). Both worked fine with 3.5" HD and Centurion 8" DD flux images and on real hardware.

To do all this experimentation (and not need Usagi Electric to fire up the Centurion dozens of times), I wrote a tool (https://github.com/mx-shift/Centurion_Floppy/tree/main/kryoflux_to_flashfloppy) that converts Kryoflux files to a list of tick counts approximating how the AT32F435's timer capture unit generates samples via the DMA engine. This tool can also apply additional write precompensation. Then I extracted the core algorithm from IRQ_wdata_dma into another tool (https://github.com/mx-shift/Centurion_Floppy/tree/main/flashfloppy_to_hfe) so I could try different implementations and parameters. You'll notice that the tool has algorithms from FlashFloppy v4.31, a snapshot of master, Greaseweazle's two PLLs, a design based on an FDC9216, and finally this NCO-based PLL approach.

If you have Kryoflux files of other disks you'd like to verify these algorithms against, I'll gladly run them through the tooling and verify HxC can decode the generated HFEs. As I only got into floppy emulation because Usagi Electric was having difficulty with Centurion, I don't have a large library of flux images to draw from for verification.

keirf · 2024-01-21T08:15:43Z

Okay that's very awesome! I assume you found this NCO-PLL scheme to work best? Sounds like you used empirical methods to pick good k_l and k_i, and f_n emerged from that rather than working forwards, and that certainly seems fine to me.

Regarding the signed divisions, you should find that these compile to three instructions. Unfortunately conventional signed division (which rounds to zero) is not the same as arithmetic shift (which rounds to -inf). For the purposes here however, perhaps arithmetic shift is okay and will save two instructions? It is "safe enough" to try >> here because although its implementation dependent how C right shifts a negative number, in practice everyone uses arithmetic shift if it's available (which it is on almost any ISA including ARM). Leaving unsigned divisions as divisions is okay as the compiler can substitute a single LSR there itself, no problem.

The main weirdness of ASR is that a negative number repeatedly shifted will tend to -1 rather than 0.

keirf · 2024-01-21T08:37:38Z

Did you in the end find any reason to pick between 358kHz vs 715kHz? I'd certainly be inclined to go with what experiments say if they are conclusive either way but, since WDATA should have a very stable clock close to what FlashFloppy expects and basically all the noise is bit shifts introduced by precomp, the lower-bandwidth 358kHz parameters would be a natural choice.

By the way, you've probably seen this doc before, but I found this useful for thinking about PLL behaviours in relation to floppy drive data: https://info-coach.fr/atari/hardware/_fd-hard/AN-505.pdf
Yes, I do skip most of the maths!

mx-shift · 2024-01-21T08:53:58Z

This NCO-PLL scheme was the only one that could handle the Centurion's extreme write precomp. I found no difference in results between 358kHz and 715kHz though only 715kHz was tried against actual Centurion hardware. I expect 358kHz would work as well. I didn't want to trouble Usagi Electric is all. Re: bit shifts vs divisions. I tend to always write the code with division and let the compiler figure it out. Old habits from being a performance engineer: write the obvious thing until you have results showing you need to do something different. So the only explicit bit shifts in the code right now are the initial value of the PLL period and the unsigned scaling up of timer samples. Everything else is written with multiplication and division, most of them with power-of-2 constants. If it's a problem on older Goteks, I can look at what the compiler is doing to see how much further it can be streamlined.

…

On Sun, Jan 21, 2024, 12:37 AM Keir Fraser ***@***.***> wrote: Did you in the end find any reason to pick between 358kHz vs 715kHz? I'd certainly be inclined to go with what experiments say if they are conclusive either way but, since WDATA should have a very stable clock close to what FlashFloppy expects (say comfortably within 500ppm?) and basically all the noise is bit shifts introduced by precomp, the lower-bandwidth 358kHz parameters would be a natural choice. By the way, you've probably seen this doc before, but I found this useful for thinking about PLL behaviours in relation to floppy drive data: https://info-coach.fr/atari/hardware/_fd-hard/AN-505.pdf Yes, I do skip most of the maths! — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHF266G555W46WFH36WLYPTHV3AVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBSGU2TKMZQG4> . You are receiving this because you authored the thread.Message ID: ***@***.***>

keirf · 2024-01-21T08:59:30Z

Well that's fine, I have no strong reason to prefer 358kHz vs 715kHz. Clearly the latter has really good tolerance for bit shifts!

And yes makes sense I suppose to leave as is. I have a feeling this may only get applied for the '435 build of FlashFloppy anyway. That given you could probably even have used single-precision floats, no problem! Three-instruction divisions are unlikely to need optimisation there.

mx-shift · 2024-01-21T09:51:40Z

Someone just pointed out a slight numerical accuracy problem in the phase_integral. I need to move the divide by 256 down to the phase_step calculation. As is, small errors aren't being accumulated in the integral. I'll push an updated commit sometime tomorrow.

…

On Sun, Jan 21, 2024, 12:59 AM Keir Fraser ***@***.***> wrote: Well that's fine, I have no strong reason to prefer 358kHz vs 715kHz. Clearly the latter has really good tolerance for bit shifts! And yes makes sense I suppose to leave as is. I have a feeling this may only get applied for the '435 build of FlashFloppy anyway. That given you could probably even have used single-precision floats, no problem! Three-instruction divisions are unlikely to need optimisation there. — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHF4T4WATCW55IXYJ7NDYPTKHZAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBSGU2TSOBWHE> . You are receiving this because you authored the thread.Message ID: ***@***.***>

keirf · 2024-01-21T11:53:33Z

Interesting, having a play with your conversion and analysis tools. I don't actually find 350ns precomp reliable with the 715kHz PLL and an MFM clock of 1us. For example (paraphrasing slightly, but this is the sequence of commands ignoring paths):

dd if=/dev/urandom of=img1440.img bs=1024 count=1440
gw convert --format=ibm.1440 img1440.img img1440_00.0.raw
kryoflux_to_flashfloppy img1440_00.0.raw --write-precomp-ns 350
flashfloppy_to_hfe img1440_00.0.revolution1.ff_samples a.hfe nco_715k

If I inspect the result using HxC software's track analyzer, there are plenty of badly decoded bitcells and most sectors fail data CRC.

Also the resulting HFE seems a bit long, hence HxC software thinks the RPM is approx 298. This is probably just some artifact of your HFE-stuffing code.

keirf · 2024-01-21T17:12:27Z

Warning. Ramble:

Another thing I took to wondering is if there is a better way to handle precomp shifts, given that we basically know that the only high-frequency noise we will see is the deterministic shifting caused by precomp. Whenever I start thinking about it I reckon I always end up wanting something like a conservative low-bandwidth PLL though. It is interesting that a PLL (with appropriate parameters!) can tolerate even this aggressive shifting. It's not immediately obvious to me that it should be possible. After all the flux timings are expected to be written direct to disk, and only passed through a PLL on readback, at which point the precomp shifts should be mostly compensated by effects of the magnetic media.

mx-shift · 2024-01-21T19:44:55Z

Huh. Is gw applying any precomp? My tool adds additional precomp.

…

On Sun, Jan 21, 2024, 3:53 AM Keir Fraser ***@***.***> wrote: Interesting, having a play with your conversion and analysis tools. I don't actually find 350ns precomp reliable with the 715kHz PLL and an MFM clock of 1us. For example (paraphrasing slightly, but this is the sequence of commands ignoring paths): dd if=/dev/urandom of=img1440.img bs=1024 count=1440 gw convert --format=ibm.1440 img1440.img img1440_00.0.raw kryoflux_to_flashfloppy img1440_00.0.raw --write-precomp-ns 350 flashfloppy_to_hfe img1440_00.0.revolution1.ff_samples a.hfe nco_715k If I inspect the result using HxC software's track analyzer, there are plenty of badly decoded bitcells and most sectors fail data CRC. Also the resulting HFE seems a bit long, hence HxC software thinks the RPM is approx 298. This is probably just some artifact of your HFE-stuffing code. — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHF3EQHXH6CLFGX43YLTYPT6URAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBSGYYDKMBTGA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

keirf · 2024-01-21T20:55:44Z

No Greaseweazle doesn't apply any precomp to image conversions. Could the problem be that the data is more random than previous tests, and that certain bit sequences are defeating the higher bandwidth PLL? The lower bandwidth version worked okay on this test input. Worth investigating anyway.

mx-shift · 2024-01-21T21:11:03Z

I believe you are right. I'm starting to play with even lower bandwidth versions to see if 400ns can be tolerated.

keirf · 2024-01-21T21:59:52Z

Another interesting measurement for low bandwidth PLLs will be their tolerance to frequency shift. This can also be tested with Greaseweazle as a custom format can be specified with a shifted bitrate (eg 490-510 in place of 500kbps). We should make sure the PLL can lock and lock reasonably fast to deviations of a few percent I'd say?

mx-shift · 2024-01-21T23:58:00Z

I was thinking along the same lines. I started to work through the formulas again and remembered that f_n (loop natural frequency) determines how far the PLL will "pull in" an input that has a frequency offset.

One way to think of this PLL design is as a PLL that tracks a nominal 72MHz clock. Since MFM doesn't provide a clock pulse consistently, I'm simulating the PLL clock running N cycles until the MFM edge and then spreading the accumulated error found at that edge across those N cycles. So when thinking about how much the PLL will tolerate a frequency offset, it's in reference to the nominal 72MHz clock and the tolerance is divided across however many ticks occur before an MFM edge.

For nco_715k, this means the PLL tolerates roughly 72MHz +/- 715kHz. With a 1MHz MFM clock, it's possible to see 4us between MFM pulses. So the 715kHz tolerance is spread over 288 PLL clock ticks resulting in 2.482kHz tolerance on the incoming 1MHz MFM clock. The tolerance actually gets worse with slower MFM clocks.

To experiment with slower MFM clocks, I modified https://github.com/mx-shift/Centurion_Floppy/tree/main/flashfloppy_to_hfe to allow specifying the HFE bitrate which is used to calculate write_bc_ticks just like FlashFloppy does. With a 250kbps HFE, both nco_715k and nco_358k were able to track an ibm.360 image. (Aside: I also added https://github.com/mx-shift/Centurion_Floppy/blob/main/scripts/kryo_to_hfe.sh to quickly generate HFEs with 0, 300, 350, and 400ns precomp from a kryoflux image).

I also added nco_178k to see what happens if the loop natural frequency is dropped further. This works fine with ibm.1440 and ibm.360 images containing random data and converted with gw convert. Note that an ibm.360 converted to a 250kbps HFE with nco_178k spreads the 178k tolerance over 576 PLL clock ticks resulting in 309Hz tolerance. nco_358k is only slightly better at 621Hz.

One way to improve this is to slow the PLL virtual clock down to be much closer to the expected range of MFM clocks. That requires some thinking through how to handle the PLL virtual clock and the timer samples overflowing at different times.

mx-shift · 2024-01-22T03:25:33Z

Playing around a bit more, I think I was wrong about the tolerance being spread across the measurement time. The PLL only gets updated at the end of the next MFM edge. Even if that 4us from the previous edge, the phase error is divided by the number of PLL clock ticks before updating the PLL. So the PLL can tolerate x% at the update which translates to x% across the measurement window.

Zeta is what becomes important for tolerating jitter while keeping a wide loop natural frequency. As zeta approaches 0, the jitter tolerance improves until a point where the PLL loses stability. As zeta increases above 1, phase tracking to the input clock is improved until a point where the PLL loses stability.

Putting these two pieces together, I calculated parameters for loop natural frequency of 3%, 2%, and 1% of 72MHz and zeta at various points between 0 and 1. I've added all of these to https://github.com/mx-shift/Centurion_Floppy/tree/main/flashfloppy_to_hfe as nco__. No matter what I do, I can't tolerate 400ns of write precomp. nco_1440k_0p25 seems to be a good balance of 2% frequency tolerance while also tolerating 350ns write precomp. I have not tried this with intentional frequency offsets.

I did try it against https://github.com/Nakazoto/CenturionComputer/tree/main/Drives/Floppy/FluxReading/Old/RawDS_Kryoflux which is a Centurion floppy read via greaseweazle. Note that since this was created from actual media, it only has moderate write precomp remaining as most of it was compensated for by the drive and media. Also note that Centurion adds additional write precomp starting at track 42. So looking at track 42 with an additional 150ns or so of write precomp is an example of what Centurion FFC looks like going into FlashFloppy. It has both a small frequency offset (and drift) as well as extreme write precomp. nco_1440k_0p25 seems to handle this just fine.

mx-shift · 2024-01-22T05:20:07Z

Updated the PR with the numerical accuracy fix for the integral and to adjust the PLL to 1440kHz loop frequency and loop dampening of 0.25.

Centurion Finch Floppy Controller (FFC, https://github.com/Nakazoto/CenturionComputer/wiki/FFC-Board) is an 8" floppy controller that applies up to 350ns write precompensation. FlashFloppy's existing approach of re-aligning the bit clock to the last pulse edge causes enough additional error for bitcells to be missed or injected in this case. This commit applies a 2nd-order (proportional and integral) PLL to the incoming WDATA to mitigate both frequency offset and extreme write precompensation. The implementation is based on a virtual numerically-controlled oscillator nominally running at the same frequency at the WDATA timer but with a 16.16 fixed-point representation allowing for considerable precision in frequency adjustment. Proportional and integral constants were calcuated to nominally provide a 1440kHz natural loop frequency and a damping factor of 0.25 to allow for 2% frequency offset and strong jitter rejection against write precompenstaion.

keirf · 2024-01-22T08:12:16Z

Okay, I will have a play with this when I have some time. Could be next weekend. I can't really comment authoritatively on how to interpret PLL parameters, especially when it's a digital approximation of an APLL applied to an incomplete clock stream. I think the natural frequency is actually to do with the frequency that the PLL oscillates at when perturbed. I'm also not sure the absolute f_n and f_s have much meaning for us when the only parameters we depend on in the code (k_l and k_i) depend only on the ratio between f_n and f_s.

Anyway, empirical evaluation of sensible-looking 2^-N values for k_l and k_i seems a reasonable approach.

keirf · 2024-01-22T11:06:20Z

Interestingly, a quick fairly arbitrary test with data bitrate +/- 3% of nominal, and 350ns precomp, seems handled better by nco_358k than nco_1440k_0p25.

Seems there's possibility of searching this space somewhat automatically, at the cost of implementing a harness:

Greaseweazle can generate pseudorandom IBM 1.44M Kryoflux files with arbitrary bitrate
kryoflux_to_flashfloppy can add arbitrary precomp
flashfloppy_to_hfe could be generalised to accept arbitrary pow2 k_i and k_l divisors.
Greaseweazle can analyse resulting HFE looking for valid sectors

Then run a multi-dim search across (rate,precomp,k_l,k_i) -> nr_valid_sectors

Then pick optimal k_l,k_i for preferred tradeoff between worst-case rate vs precomp. We could chuck in a few of the other 'pll' algorithms as baselines.

Then the whole thing seems empirically watertight at least, within our constraints on values for k_l,k_i.

keirf · 2024-01-26T16:42:39Z

I forked and added a parametric NCO and a test harness at https://github.com/keirf/Centurion_Floppy

I searched across bit rates up to +/-8% from 500kbps nominal. And across precomp values up to 350ns. The best parameters I can find are k_l = 1/16 and k_i = 1/128 (though k_i = 1/256 is a very close second).

It is good that the k_l matches your preferred NCOs (358k and 1440k_0p25). While k_i sits somewhere between, but closer to nco_358k.

mx-shift · 2024-01-26T17:15:39Z

Neat! I haven't had a chance this week to poke at it. I do think I have a mistake in the phase error calculation. It should _not_ be divided by write_bc_ticks. First, the error might be accumulated over multiple bitcells. Second, dividing the error to smooth it across all the clock ticks would require replaying each tick to calculate a new phase_step and associated phase_error. Since we don't know anything about the nature of the phase error between the two edges, it's best to treat it as a single, large error introduced as a transient. Basically, remove the divide by write_bc_ticks and leave the rest of the code the same.

…

On Fri, Jan 26, 2024, 8:42 AM Keir Fraser ***@***.***> wrote: I forked and added a parametric NCO and a test harness at https://github.com/keirf/Centurion_Floppy I searched across bit rates up to +/-8% from 500kbps nominal. And across precomp values up to 350ns. The best parameters I can find are k_l = 1/16 and k_i = 1/128 (though k_i = 1/256 is a very close second). — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHF4G34XYDIANUOZ23T3YQPMITAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJSGM3DAOJYGA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

keirf · 2024-01-26T17:48:46Z

Dividing by write_bc_ticks is simply to remove the data frequency from phase calculations. It's nothing to do with averaging out the phase error across bitcells. That would be wrong, and it's not what's done here, so all good.

That said, I want to finesse the code some on the FlashFloppy side, and then I will want to re-extract that into the test program and re-run the search. But I haven't found anything in the existing NCO-PLL code I'd strictly call wrong.

mx-shift · 2024-01-26T18:00:58Z

The delta between the ideal bitcell center and the actual edge is the total phase error at that instant according to the PLL clock. I originally added the divide by write_bc_ticks thinking it would average the error but it didn't. It just feeds back less error equivalent to scaling k_l and k_i by the same amount.

…

On Fri, Jan 26, 2024, 9:48 AM Keir Fraser ***@***.***> wrote: Dividing by write_bc_ticks is simply to remove the data frequency from phase calculations. It's nothing to do with averaging out the phase error across bitcells. That would be wrong, and it's not what's done here, so all good. That said, I want to finesse the code some on the FlashFloppy side, and then I will want to re-extract that into the test program and re-run the search. But I haven't found anything in the existing NCO-PLL code I'd strictly call *wrong*. — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHFZVVKP3OBQ6D2JSR7DYQPUATAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJSGQ2DSNRSHA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

keirf · 2024-01-26T18:56:07Z

But think about the dimensions of the values. phase_step is dimensionless, then multipled by write_bc_ticks (clock ticks) to get bc_step (clock ticks). distance_from_curr_bc_left and bc_step are also in units of clock ticks. But then phase_error needs to be dimensionless because it is used to calculate phase_step, hence dividing through by something in units of clock ticks surely makes sense.

Anyway, inarguably the code does seem to work :)

mx-shift · 2024-01-26T21:22:51Z

phase_step is nco-ticks/timer-tick write_bc_ticks is timer-ticks/mfm-tick So phase_step*write_bc_ticks is nco-ticks/mfm-tick and current_bc_left_edge is supposed to be in nco-ticks (modulo 32-bits). phase_error should also be in nco-ticks.

…

On Fri, Jan 26, 2024, 10:56 AM Keir Fraser ***@***.***> wrote: But think about the dimensions of the values. phase_step is dimensionless, then multipled by write_bc_ticks (clock ticks) to get bc_step (clock ticks). distance_from_curr_bc_left and bc_step are also in units of clock ticks. But then phase_error needs to be dimensionless because it is used to calculate phase_step, hence dividing through by something in units of clock ticks surely makes sense. Anyway, inarguably the code does seem to work :) — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHFZXRTJDKRXWY73LXL3YQP35DAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJSGUZTQMJVHE> . You are receiving this because you authored the thread.Message ID: ***@***.***>

mx-shift · 2024-02-20T07:58:22Z

Finally found some time to revisit this. I cleaned up https://github.com/mx-shift/Centurion_Floppy/tree/main/flashfloppy_to_hfe quite a bit and replaced all the nco_* variations with a parameterized version as you had done in your fork. I also reworked your harness.py into sweep_ibm.py which scans across both ibm.360 and ibm.1440 formats to see what happens with different MFM clock rates. I also added summarize_results.py which will show how well the bitrate/precomp combinations are covered as well as the coverage of the set of parameters that had the widest coverage.

I've also worked up a revision of the NCO PLL that behaves like the math says it should. I call this nco_v2 in flashfloppy_to_hfe. sweep_ibm.py will run both nco_v1 and nco_v2 and generate results for each. I haven't let it run completely but early results show nco_v2 working quite well with p_mul=1, p_div=2048, i_mul=1, i_div=8192.

I also added a CSV output of phase_error over time to flashfloppy_to_hfe. TechTangents provided me with a kryoflux image of a floppy that failed formatting in a Xerox 645S word processor due to the drive spindle speed varying so much that it couldn't read it's own writes. I ran that through flashfloppy_master, greaseweazle_default_pll, nco_v1, and nco_v2 and plotted all of their phase errors:

nco_v2 slightly edges out nco_v1. I also did the same with a Centurion floppy as I know it has a modest amount of write precomp:

Again, nco_v2 is slightly better than nco_v1. I'll update again once my run of sweep_ibm.py finishes.

keirf · 2024-02-20T08:17:05Z

Nice. Can it tolerate 400ns precomp in a 1.44m pc disk?

mx-shift · 2024-02-20T08:19:29Z

Nope. That seems to be pretty difficult to do.

…

On Tue, Feb 20, 2024, 12:17 AM Keir Fraser ***@***.***> wrote: Nice. Can it tolerate 400ns precomp in a 1.44m pc disk? — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHF45JJTSAZO7MKAKESDYURLY5AVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJTGY4DINZVGY> . You are receiving this because you authored the thread.Message ID: ***@***.***>

keirf · 2024-02-20T08:53:46Z

Well this is great anyway. I still haven't got my head round the mathematical correctness of v1 versus v2, but I need to give it a bit more thought, and it's hard to argue against empirical results, plus dropping a divide, and the common-sense interpretation that we are propagating and accumulating errors with greater precision, in v2.

My main concern is whether v2 is sufficiently "scale invariant" to various write_bc_ticks values. Sweeping over precomp and clock variations at base clock 1us versus 2us will certainly be interesting: will the same parameters work best for both base clocks in v1 and/or v2.

mx-shift · 2024-02-21T04:28:09Z

TL;DR Use nco_v2[p_mul=1,p_div=2048,i_mul=1,i_div=8192] and it should work for everything up to and including ibm.2880.

Sweep finished overnight. Results are pretty clear: both nco_v1[p_mul=1,p_div=16,i_mul=1,i_div=128] and nco_v2[p_mul=1,p_div=2048,i_mul=1,i_div=8192] can cover:

ibm.360 with +/- 8% data rate and up to 400ns precomp
ibm.1440 with -8% to +6% data rate and up to 350ns precomp

nco_v1 can also do ibm.1440 with +7% data rate and 350ns precomp. I'm surprised they are different on this one data point.

I also looked at ibm.2880 for those two specific parameter sets. I can't use the same tooling since it is all built around generating HFE which can't store a track long enough for ibm.2880. Instead, I looked at the phase error plot of a nominal ibm.2880 just to check that nothing went particularly haywire.

nco_v1 does slightly better here but both nco_v1 and nco_v2 are worse than flashfloppy_master. The error phase is well within the tolerance needed for correct decode though (bitcells are 36 sample ticks wide for 1000kbps). I suspect the extra phase error noise for both nco_v1 and nco_v2 come down to using 32-bits for the NCO. Moving up to 64-bits might give it enough precision to settle down instead of oscillating around 35kHz which is very close to where the math says this filter should oscillate.

Full summaries of results:
ibm.360.summary.txt
ibm.1440.summary.txt

mx-shift · 2024-02-21T04:30:27Z

I'll gladly do the work to update this PR with nco_v2. Up to you.

keirf · 2024-02-21T06:18:18Z

Doesn't seem there's much between 'em.

You can do the ibm.2880 bit rate but not the exact format. Just do ibm.1440 but at rate=1000 and rpm=600. You'll need to add an rpm config parameter to your class Format and write that into the generated diskdef. Something like:

Format(
        'ibm.ed',
        heads=2,
        cylinders=80,
        sectors_per_cylinder=18,
        bytes_per_sector=512,
        data_rate_kbps=1000,
        rpm=600
    )

100ms of track should be plenty in which to evaluate the PLLs.

mx-shift · 2024-02-21T06:35:52Z

Thanks! I didn't think to shorten the tracks. You're right that nco_v1 and nco_v2 aren't very different in behavior or implementation. The extra divide in v1 mostly just scales the input parameters by write_bc_ticks (2048 / 72 is 28 which is close enough to 16 and 8192 / 72 is 113 which is close enough to 128). I thought that might be a problem for picking a single set of parameters that with across different data rates. For 250kbps, the tolerance is so wide it doesn't matter. It might for 1000. nco_v2 doesn't have this problem. The other difference is nco_v1 assumes both the PLL uint32 and the sample clock uint16 will overflow at the same time. Since the PLL frequency is constantly being adjusted, they won't. nco_v2 doesn't make that assumption. In practice, that error is tiny compared to the normal phase error seen so it doesn't meaningfully change the results at 500kbps. I doubt it will at 1000kbps either. So my preference for nco_v2 is mostly about removing the write_bc_ticks scaling so the PLL parameters are fixed across data rates. Rick

…

On Tue, Feb 20, 2024, 10:18 PM Keir Fraser ***@***.***> wrote: Doesn't seem there's much between 'em. You can do the ibm.2880 bit rate but not the exact format. Just do ibm.1440 but at rate=1000 and rpm=600. You'll need to add an rpm config parameter to your class Format and write that into the generated diskdef. Something like: Format( 'ibm.ed', heads=2, cylinders=80, sectors_per_cylinder=18, bytes_per_sector=512, data_rate_kbps=1000, rpm=600 ) 100ms of track should be plenty in which to evaluate the PLLs. — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHFZOEXOGYVJWVND3BG3YUWGTLAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJVHE3DQMJZHA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

keirf · 2024-02-21T07:20:36Z

Let's get the higher rate numbers and see. We do indeed want one set of parameters for all rates if possible. So whichever of v1 and v2 is best for that is going to win.

mx-shift · 2024-02-22T02:32:40Z

Results are not what I expected. Both nco_v1 and nco_v2 have a wide range of parameters that work fine at 1000kbps. What's really surprising is nco_v1[p_mul=1, p_div=16, i_mul=1, i_div=128] works for 1000kbps but nco_v2[p_mul=1, p_div=2048, i_mul=1, i_div=8192] does not. I've started another run with an nco_v3 to test a theory but it seems like nco_v1[p_mul=1, p_div=16, i_mul=1, i_div=128] is a solid choice for FlashFloppy.

ibm.ed.summary.txt

keirf · 2024-02-22T08:29:27Z

Yes interesting, not only is nco_v2[p_mul=1, p_div=2048, i_mul=1, i_div=8192] a narrow miss, but nco_v1[p_mul=1, p_div=16, i_mul=1, i_div=128] remains at the very heart of a successful 2d range of parameters. It wasn't obvious to me which would end up winning in practice.

Is there anything else from v2 that's worth considering in combo with v1 for a v3 or v4?

mx-shift · 2024-02-22T10:21:24Z

v3 is two small adjustments to v2. nco_v3[p_mul=1, p_div=2048, i_mul=1, i_div=8192] is just shy of nco_v1[p_mul=1, p_div=16, i_mul=1, i_div=128] in coverage. I could probably eek out those last few cells (ibm1440 >+6% RPM +350ns) with a bit more work but nco_v1 already handles this. I suspect i_div between 4096 and 8192 would pick them up but then we lose the performance advantage of a divide by power-of-2. Let me know which version you want to proceed with and I'll update this PR accordingly.

omnibus_summary.txt

keirf · 2024-02-22T11:22:10Z

I'd stick with nco_v1[p_mul=1, p_div=16, i_mul=1, i_div=128].

I also note from previous comment: The other difference is nco_v1 assumes both the PLL uint32 and the sample clock uint16 will overflow at the same time.

Is this the following nco_v2 code?

        if (pll_period < PLL_PERIOD_MIN) {
            pll_period = PLL_PERIOD_MIN;
        } else if (pll_period > PLL_PERIOD_MAX) {
            pll_period = PLL_PERIOD_MAX;
        }

keirf · 2024-02-22T11:24:13Z

Ultimately my path forward is as follows: Get an acceptable patch ready for firmware, then port that as directly as possible back out into the test harness (so the code is line for line as close to identical as possible), and check it against the nco_v1 baseline again. This will allow to confidently test firmware optimisations too.

mx-shift · 2024-02-22T16:48:18Z

No, that bit of code is limiting the tunable range of the NCO. If it is left unlimited, it can go so low that the decoding process gets very slow. The relevant piece from nco_v2 is: // If this edge would fall within the previous bitcell, it is a runt. // Since the pll timestamps will wrap on overflow, the timestamps can't // be directly compared. Instead, compare the distance from the // previous bitcell's left edge as the subtraction will account any // wrapping. uint32_t curr_pll_edge = prev_pll_edge + (uint32_t)duration_between_samples * PLL_PERIOD_NOMINAL; uint32_t curr_pll_bc_left = prev_pll_bc_left + pll_write_bc_ticks; if ((curr_pll_edge - prev_pll_bc_left) < (curr_pll_bc_left - prev_pll_bc_left)) { printf("Runt\n"); continue; } Note how prev_pll_edge and prev_pll_bc_left are both preserved from the previous iteration. curr_pll_edge is calculated with nominal PLL ticks while curr_pll_bc_left uses the current pll_period. Because those advance at different rates and thus might overflow at different times, they are always converted into a distance by subtracting from some other point, which takes care of the overflow, before any comparison. By preserving both across iterations, any accumulated skew from the different rates is preserved.

…

On Thu, Feb 22, 2024, 3:22 AM Keir Fraser ***@***.***> wrote: I'd stick with nco_v1[p_mul=1, p_div=16, i_mul=1, i_div=128]. I also note from previous comment: The other difference is nco_v1 assumes both the PLL uint32 and the sample clock uint16 will overflow at the same time. Is this the following nco_v2 code? if (pll_period < PLL_PERIOD_MIN) { pll_period = PLL_PERIOD_MIN; } else if (pll_period > PLL_PERIOD_MAX) { pll_period = PLL_PERIOD_MAX; } — Reply to this email directly, view it on GitHub <#864 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACIEHF35J4XWA7TXA3KIKYLYU4S63AVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJZGI2DKMZYGM> . You are receiving this because you authored the thread.Message ID: ***@***.***>

keirf · 2024-02-22T18:04:11Z

No, that bit of code is limiting the tunable range of the NCO. If it is left unlimited, it can go so low that the decoding process gets very slow.

That might be worth keeping then?

As for the rest, as long as phase_step (aka pll_period) doesn't get very large, I don't see that prev_sample and prev_bc_left can really deviate far enough that the arithmetic involving curr_sample will overflow if we use the latter (since they should be within plus or minus bc_step/2 of each other).

I also don't see how the more complex code in nco_v2 helps anyway: prev_pll_edge appears to always be prev_sample<<16 plus a fixed delta.

mx-shift force-pushed the nco_2nd_order branch from 473e8f6 to bdcbd3c Compare January 22, 2024 05:19

mx-shift force-pushed the nco_2nd_order branch from bdcbd3c to ee6882e Compare January 22, 2024 05:29

floppy: apply NCO-based 2nd-order PLL to WDATA line #864

Are you sure you want to change the base?

floppy: apply NCO-based 2nd-order PLL to WDATA line #864

Conversation

mx-shift commented Jan 21, 2024

keirf commented Jan 21, 2024

mx-shift commented Jan 21, 2024

keirf commented Jan 21, 2024

mx-shift commented Jan 21, 2024 • edited Loading

keirf commented Jan 21, 2024

keirf commented Jan 21, 2024 • edited Loading

mx-shift commented Jan 21, 2024 via email

keirf commented Jan 21, 2024

mx-shift commented Jan 21, 2024 via email

keirf commented Jan 21, 2024

keirf commented Jan 21, 2024

mx-shift commented Jan 21, 2024 via email

keirf commented Jan 21, 2024

mx-shift commented Jan 21, 2024

keirf commented Jan 21, 2024

mx-shift commented Jan 21, 2024

mx-shift commented Jan 22, 2024

mx-shift commented Jan 22, 2024 • edited Loading

keirf commented Jan 22, 2024

keirf commented Jan 22, 2024 • edited Loading

keirf commented Jan 26, 2024 • edited Loading

mx-shift commented Jan 26, 2024 via email

keirf commented Jan 26, 2024

mx-shift commented Jan 26, 2024 via email

keirf commented Jan 26, 2024

mx-shift commented Jan 26, 2024 via email

mx-shift commented Feb 20, 2024

keirf commented Feb 20, 2024

mx-shift commented Feb 20, 2024 via email

keirf commented Feb 20, 2024 • edited Loading

mx-shift commented Feb 21, 2024

mx-shift commented Feb 21, 2024

keirf commented Feb 21, 2024

mx-shift commented Feb 21, 2024 via email

keirf commented Feb 21, 2024

mx-shift commented Feb 22, 2024 • edited Loading

keirf commented Feb 22, 2024

mx-shift commented Feb 22, 2024 • edited Loading

keirf commented Feb 22, 2024 • edited Loading

keirf commented Feb 22, 2024 • edited Loading

mx-shift commented Feb 22, 2024 via email

keirf commented Feb 22, 2024 • edited Loading

mx-shift commented Jan 21, 2024 •

edited

Loading

keirf commented Jan 21, 2024 •

edited

Loading

mx-shift commented Jan 22, 2024 •

edited

Loading

keirf commented Jan 22, 2024 •

edited

Loading

keirf commented Jan 26, 2024 •

edited

Loading

keirf commented Feb 20, 2024 •

edited

Loading

mx-shift commented Feb 22, 2024 •

edited

Loading

mx-shift commented Feb 22, 2024 •

edited

Loading

keirf commented Feb 22, 2024 •

edited

Loading

keirf commented Feb 22, 2024 •

edited

Loading

keirf commented Feb 22, 2024 •

edited

Loading