BUG: Network jones2009_3x3_drives.json is incompatible with MPI #960

Open · 2 tasks · asoplata opened this issue Dec 13, 2024 · 1 comment
Labels: bug (Something isn't working), hnn-gui (HNN GUI testing)

@asoplata (Collaborator)

I ran into the following issue while investigating #871. Currently, this bug prevents us from setting the GUI's default core count to anything greater than one when MPI is enabled.

I was testing the GUI with MPI enabled and the default number of cores increased. Strangely, only some of the GUI tests were failing, and always at the actual simulation step. I eventually realized that the problem lies with this particular test network itself, not with the GUI. Here is an example script that reproduces the error when run from the hnn-core directory (assuming you have MPI installed):

```python
#!/usr/bin/env python

from pathlib import Path
from hnn_core import jones_2009_model, MPIBackend, simulate_dipole
from hnn_core import hnn_io as hio

hnn_core_root = Path('hnn_core')

###########################
# Network 1: the default case of HNN-Core's API. The config file is shown here
# ONLY for the sake of comparison.
config_net1 = (hnn_core_root / 'param' / 'default.json')
net1 = jones_2009_model()

###########################
# Network 2: the default case for the GUI. Note that this must be loaded
# differently than the classic case above.
config_net2 = (hnn_core_root / 'param' / 'jones2009_base.json')
net2 = hio.read_network_configuration(config_net2)

###########################
# Network 3: the network that is sometimes, but not always, the network used in
# actual GUI tests. Specifically, this is the network used in GUI tests when
# the `setup_gui` fixture is used.
config_net3 = (hnn_core_root / 'tests' / 'assets' / 'jones2009_3x3_drives.json')
net3 = hio.read_network_configuration(config_net3)

########################################################################
# Now let's test that they actually simulate using MPI.
# We will see that Networks 1 and 2 work, but 3 does not. Therefore, any GUI
# tests that use Network 3 and attempt to use MPI will fail.
for net in [net1, net2, net3]:
    try:
        with MPIBackend(n_procs=4):
            dpl = simulate_dipole(net, tstop=20., n_trials=1)
    except RuntimeError as err:
        print(err)
```

If you run this, you'll notice that net1 (the default model of the API) and net2 (the default model of the GUI) work fine, but net3 fails. The key part of the error appears to be the following:

```
Building the NEURON model
0 /opt/anaconda3/envs/hc12/bin/nrniv: usable mindelay is 0 (or less than dt for fixed step method)
0  near line 0
0  ^
        0 finitialize()
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[5323,1],0]
  Errorcode: -1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[Done]
```

Looking at the NEURON forums ( https://www.neuron.yale.edu/phpBB/viewtopic.php?t=3090 ), the "usable mindelay is 0" message indicates that some connection's NetCon delay is zero (or smaller than dt), which NEURON's parallel simulation cannot handle. In other words, some particular, as-yet-unidentified delay parameter is causing problems in net3 but not in net2.
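If that is the cause, one quick way to narrow it down would be to compare the minimum connection delay across the three networks. Below is a minimal diagnostic sketch; it assumes that each entry of `Network.connectivity` stores its NetCon parameters under `nc_dict`, with the delay keyed as `'A_delay'`, which should be verified against the actual hnn-core data structures:

```python
# Hypothetical diagnostic, appended to the script above: print the smallest
# connection delay in each network, assuming the nc_dict['A_delay'] layout.
for name, net in [('net1', net1), ('net2', net2), ('net3', net3)]:
    delays = [conn['nc_dict']['A_delay'] for conn in net.connectivity]
    print(f'{name}: min delay = {min(delays):.4f} across {len(delays)} connections')
```

If net3 reports a minimum delay of 0 (or below the simulation dt) where net1 and net2 do not, that would confirm the diagnosis.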

More generally, this points to an inconsistency in which networks we test and when: this issue caused many, but not all, of the GUI tests that run simulations to fail. As it turns out, the network actually being simulated varies across test_gui.py: sometimes it is the default GUI network (as in test_gui.py::test_gui_add_drives()), but sometimes it is instead the network tied to this issue (in any test that uses the test_gui.py::setup_gui() fixture). This also implies that many of our GUI tests are not actually testing the default GUI network!

Therefore, the tasks for this issue should include:

  • Diagnose and fix the issue preventing the network of jones2009_3x3_drives.json from using MPI. Based on its creation in test_io.py::jones_2009_network(), and the error output from NEURON and MPI, it is almost certainly an issue with a delay parameter in one of the additional drives added to this network, which are not present in the default network.
  • Properly introduce pytest fixture usage to GUI testing, and ensure that, at minimum, all tests exercise the default network (except where unreasonable); see the sketch after this list.
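
For the second task, a minimal sketch of what a shared default-network fixture could look like is shown below. The fixture name and the `HNNGUI` usage are assumptions for illustration, not a final design:

```python
# Hypothetical pytest fixture so that GUI tests exercise the default network
# unless they explicitly opt out.
import pytest
from hnn_core.gui import HNNGUI

@pytest.fixture
def default_gui():
    """Provide a GUI instance backed by the default network configuration."""
    gui = HNNGUI()
    _ = gui.compose()  # build the widget layout without displaying it
    return gui
```

Tests that genuinely need the jones2009_3x3_drives.json network could then request a separate, explicitly named fixture, making the network under test unambiguous.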
asoplata added the "bug" and "hnn-gui" labels Dec 13, 2024
asoplata self-assigned this Dec 13, 2024
asoplata changed the title from "Bug: Network jones2009_3x3_drives.json is incompatible with MPI" to "BUG: Network jones2009_3x3_drives.json is incompatible with MPI" Dec 13, 2024
asoplata added a commit to asoplata/hnn-core that referenced this issue Dec 13, 2024
This takes George's old GUI-specific `_available_cores()` method, moves
it, and greatly expands it to include updates to the logic about cores
and hardware-threading which was previously inside
`MPIBackend.__init__()`. This was necessary due to the number of common
but different outcomes based on platform, architecture,
hardware-threading support, and user choice. These changes do not
involve very many lines of code, but a good amount of thought and
testing has gone into them. Importantly, these `MPIBackend` API changes
are backwards-compatible, and no changes to current usage code are
needed. I suggest you read the long comments in
`parallel_backends.py::_determine_cores_hwthreading()` outlining how
each variation is handled.

Previously, if the user did not provide the number of MPI processes they
wanted to use, `MPIBackend` assumed that the number of detected "logical"
cores would suffice. As George previously showed, this does not work for HPC
environments like OSCAR, where the only true number of cores we are allowed
to use is found by `psutil.Process().cpu_affinity()`, the "affinity" core
count. There is a third important type of core count besides "logical" and
"affinity": "physical". However, an additional problem remained unaddressed
here: hardware-threading. Different platforms and situations report different
numbers of logical, affinity, and physical CPU cores, and one factor
affecting this is whether the machine has hardware-threading, such as Intel
Hyper-Threading. On an example Linux laptop with an Intel chip that has
Hyper-Threading, the logical and physical core counts differ from each other:
logical includes hyperthreads (e.g. `psutil.cpu_count(logical=True)` reports
8 cores), but physical does not (e.g. `psutil.cpu_count(logical=False)`
reports 4 cores). If we tell MPI to use 8 cores ("logical"), then we ALSO
need to tell it to enable the hardware-threading option. However, if the user
does not want to enable hardware-threading, then we need to make this an
option, tell MPI to use 4 cores ("physical"), and tell MPI not to use the
hardware-threading option. The "affinity" core count makes things even more
complicated: in the Linux laptop example it is equal to the logical core
count, but on OSCAR it is very different from the logical count, and on
macOS it is not available at all.
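
For reference, here is a short sketch of how these three core counts can be
queried with psutil (the guard for the affinity count is needed because
`Process.cpu_affinity()` does not exist on macOS):

```python
import psutil

# "logical": includes hardware threads (e.g. Intel Hyper-Threading)
logical = psutil.cpu_count(logical=True)
# "physical": real cores only, excluding hardware threads
physical = psutil.cpu_count(logical=False)
# "affinity": cores this process is actually allowed to use (what matters
# on HPC systems like OSCAR); absent on macOS, hence the guard
proc = psutil.Process()
affinity = len(proc.cpu_affinity()) if hasattr(proc, 'cpu_affinity') else None

print(f'logical={logical}, physical={physical}, affinity={affinity}')
```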

In `_determine_cores_hwthreading()`, if you read the lengthy comments, I have
thought through each common scenario and believe I have resolved what to do
for each, with respect to the number of cores to use and whether or not to
use hardware-threading. These scenarios include: the user choosing to use
hardware-threading (the default) or not, across macOS variations with and
without hardware-threading, local Linux computer variations with and without
hardware-threading, and Linux HPC (e.g. OSCAR) variations, which appear to
never support hardware-threading. In the Windows case, due to both
jonescompneurolab#589 and the currently-untested MPI integration on Windows,
I always report the machine as not having hardware-threading.
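
A heavily simplified skeleton of that decision logic might look like the
following. This is an illustrative sketch only, not the actual
`_determine_cores_hwthreading()` implementation (it omits the affinity
handling, for one):

```python
import platform
import psutil

def sketch_cores_hwthreading(use_hwthreading=True):
    """Illustrative only: choose an MPI core count and hwthreading flag."""
    if platform.system() == 'Windows':
        # Per the commit message: Windows is always reported as having
        # no hardware-threading for now.
        return psutil.cpu_count(logical=False), False
    logical = psutil.cpu_count(logical=True)
    physical = psutil.cpu_count(logical=False)
    has_hwthreads = logical > physical
    if use_hwthreading and has_hwthreads:
        # Use all hardware threads, and tell MPI to enable its
        # hardware-threading option.
        return logical, True
    # Otherwise restrict MPI to physical cores, with hwthreading off.
    return physical, False
```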

Additionally, previously, if the user did provide a number of MPI processes,
`MPIBackend` used some "heuristics" to decide whether to use MPI
oversubscription and/or hardware-threading, but the user could not override
these heuristics. Now, when a user instantiates an `MPIBackend` via
`__init__()` with the defaults, hardware-threading is detected more robustly
and enabled by default, and oversubscription is enabled based on its own
heuristics; this is the case when the new arguments `hwthreading` and
`oversubscribe` are left at their default value of `None`. However, if users
know what they are doing, they can also pass either `True` or `False` to
either of these options to force it on or off. Furthermore, in the case of
`hwthreading`, if the user indicates they do not want it, then
`_determine_cores_hwthreading()` correctly returns the number of
NON-hardware-threaded cores for MPI's use, instead of the count that includes
hardware threads.
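
Based on the arguments described above, usage would presumably look something
like this (a sketch using the argument names from this commit message):

```python
from hnn_core import MPIBackend, jones_2009_model, simulate_dipole

net = jones_2009_model()

# Defaults: process count, hardware-threading, and oversubscription
# are all auto-detected (hwthreading=None, oversubscribe=None).
with MPIBackend():
    dpl = simulate_dipole(net, tstop=20., n_trials=1)

# Forcing the behavior: 4 processes, with hardware-threading and
# oversubscription both explicitly disabled.
with MPIBackend(n_procs=4, hwthreading=False, oversubscribe=False):
    dpl = simulate_dipole(net, tstop=20., n_trials=1)
```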

I have also modified and expanded the appropriate testing to compensate
for these changes.

Note that this does NOT change the default number of jobs to use for the
GUI if MPI is detected. Such a change breaks the current `test_gui.py`
testing: see jonescompneurolab#960
asoplata added a commit to asoplata/hnn-core that referenced this issue Dec 17, 2024
@asoplata (Collaborator, Author)

This is almost certainly related to #663
