Installation and reproducibility issues #1

Open
yzimmermann opened this issue Nov 23, 2024 · 2 comments



yzimmermann commented Nov 23, 2024

I'm trying to install this library to perform some experiments and have run into a couple of issues.
I followed the instructions and installed via install_script.sh. However, when I run python scripts/execute_experiments.py -h I get:

Well, hello there!
Computer other metrics? - False
Traceback (most recent call last):
  File "scripts/execute_experiments.py", line 14, in <module>
    from scripts.model_trainer import ModelTrainer
  File "/home/ubuntu/cna_modules/src/scripts/model_trainer.py", line 18, in <module>
    from clustering.rationals_on_clusters import RationalOnCluster
  File "/home/ubuntu/cna_modules/src/clustering/__init__.py", line 1, in <module>
    from .rationals_on_clusters import RationalOnCluster
  File "/home/ubuntu/cna_modules/src/clustering/rationals_on_clusters.py", line 7, in <module>
    from activation_functions.power_mean import RationalPowerMeanModel
ModuleNotFoundError: No module named 'activation_functions'

I don't see any references to RationalPowerMeanModel at https://github.com/k4ntz/activation-functions, which seems to install a different activations package.

To still run some experiments, even without rational activations, I commented out the relevant imports and ran the following config:

{
    "experiment_number": 1,
    "epochs": 200,
    "lr_model": 0.001,
    "lr_activation": 1e-05,
    "weight_decay": 5e-6,
    "num_hidden_features": 280,
    "num_layers": 4,
    "with_clusters": true,
    "normalize": true,
    "clusters": 14,
    "activation_type": "ActivationType.RELU",
    "recluster_option": "ReclusterOption.ITR",
    "dataset_type": "Planetoid",
    "dataset_name": "Cora",
    "task": "node_classification",
    "model": "GCNConv"
}

Out of the box, this seems to produce a random 80/10/10 split, and the training output looks like this. Why is what you call the "validation accuracy" consistently (over multiple seeds) lower than the "test accuracy"? Now, when I change the number of clusters to 1, I get this. The test loss seems equally good, and the validation loss is still consistently lower. In my understanding, this should behave like "with_clusters": false. However, when I actually change that in the config file, I get this. The accuracy is suddenly substantially lower, but the split also seems to have automatically changed to "20 nodes per class" (140/500/1000).
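
For reference, that 140/500/1000 split matches PyTorch Geometric's default public Planetoid split, which can be checked directly (a quick snippet of my own; the root path is just a placeholder):

```python
from torch_geometric.datasets import Planetoid

# Load Cora with PyG's default split="public" ("20 nodes per class").
dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]
print(int(data.train_mask.sum()), int(data.val_mask.sum()), int(data.test_mask.sum()))
# Prints: 140 500 1000
```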
In the data loading code I found this (it overwrites the default split if mode is set):

if mode:
    # Calculate the number of nodes for each split based on the given percentages
    train_percentage, test_percentage, valid_percentage = (80, 10, 10)
    num_train_nodes = int(total_nodes * (train_percentage / 100))
    num_test_nodes = int(total_nodes * (test_percentage / 100))
    num_valid_nodes = int(total_nodes * (valid_percentage / 100))

    # Update masks accordingly
    dataset[0].train_mask.fill_(False)
    dataset[0].train_mask[:num_train_nodes] = 1
    dataset[0].val_mask.fill_(False)
    dataset[0].val_mask[num_train_nodes : num_train_nodes + num_valid_nodes] = 1
    dataset[0].test_mask.fill_(False)
    dataset[0].test_mask[
        num_train_nodes
        + num_valid_nodes : num_train_nodes
        + num_valid_nodes
        + num_test_nodes
    ] = 1

    dataset[0].transform = T.NormalizeFeatures()

and for another dataset e.g.

    if mode:
        train_percent, val_percent, test_percent = (80, 10, 10)
    else:
        train_percent, val_percent, test_percent = (60, 20, 20)
    assert train_percent + val_percent + test_percent == 100

mode seems to be set in execute_experiments.py

set_mode(config["with_clusters"] and config["normalize"])
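
If I read this correctly, mode is only True when both flags are set, which would explain why flipping "with_clusters" alone silently switches the split. A quick check of my own (it just mirrors the expression above; the split descriptions are what I observed, not taken from the code):

```python
# Hypothetical reproduction of the argument passed to set_mode for all flag combinations.
for with_clusters in (True, False):
    for normalize in (True, False):
        mode = with_clusters and normalize
        split = "80/10/10 index slice" if mode else "default split (e.g. 140/500/1000 on Cora)"
        print(f"with_clusters={with_clusters}, normalize={normalize} -> mode={mode}: {split}")
```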

What is going on here? What is the purpose of mode and the different splits?

Thanks in advance for your help and clarifications!

yzimmermann (Author) commented Nov 24, 2024

Some further observations (on Cora). The code for the 80/10/10 split (if mode) seems to be deterministic rather than random? This would explain the weird behavior with the test loss significantly higher than the validation loss. Why does results_collector.py pull the maximum test accuracy over all epochs and not the test accuracy at the maximum validation accuracy?
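
For clarity, the selection rule I would expect is roughly the following (my own sketch; val_accs and test_accs stand for per-epoch values collected during training), rather than the maximum of test_accs over all epochs:

```python
def test_acc_at_best_val(val_accs, test_accs):
    # Pick the epoch with the highest validation accuracy
    # and report the test accuracy at that epoch.
    best_epoch = max(range(len(val_accs)), key=lambda e: val_accs[e])
    return test_accs[best_epoch]
```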
With a random 80/10/10 split I get the following (with the parameters of the ablation study, over 50 seeds):

|  | Test acc. @ max(val acc.) | max(test acc.) over all runs | max(test acc.) per run | Paper reports |
| --- | --- | --- | --- | --- |
| Cluster + Normalize | 87.87 ± 1.76 | 93.33 | 89.71 ± 1.68 | 93.02 ± 0.36 |
| Normalize | 87.85 ± 1.79 | 93.33 | 89.71 ± 1.68 | 81.60 ± 0.72 |
| Vanilla GCN | 87.87 ± 1.76 | 93.33 | 89.72 ± 1.68 | 81.59 ± 0.43 |

For the "20 nodes per class" split I get

|  | Test acc. @ max(val acc.) | max(test acc.) over all runs | max(test acc.) per run | Paper reports |
| --- | --- | --- | --- | --- |
| Cluster + Normalize | 81.07 ± 0.80 | 83.3 | 81.49 ± 0.62 | 93.02 ± 0.36 |
| Normalize | 81.07 ± 0.80 | 83.3 | 81.49 ± 0.62 | 81.60 ± 0.72 |
| Vanilla GCN | 81.06 ± 0.81 | 83.3 | 81.48 ± 0.62 | 81.59 ± 0.43 |

I also report the maximum test accuracy over all runs because it is something computed in results_collector.py. As far as I understand, the values in the paper are based on the (1708/500/500) split (the third split from this paper).

askrix (Collaborator) commented Nov 29, 2024

Hello @yzimmermann 👋🏻,

Thank you for raising the issue with all the questions. I'll address them in the same order:

  1. The code has been cleaned up and the correct versions of the scripts have been uploaded. Please check it out; after a git pull, all experiments should run out of the box. Should that not be the case, please kindly let us know! Thanks in advance for that.
  2. Choice of splits:
  • As you pointed out, we are indeed using deterministic 80/10/10 splits for all GNN architectures on Pubmed, Cora and CiteSeer. In doing so, we achieve fair comparisons across runs and with other works. For all other datasets and tasks, we used the splits provided by PyTorch Geometric!

  • For node regression tasks, we use the splits exactly as defined in the Gradient Gating paper, and we achieved much higher performance.

  3. Please check out /src/utils/filter_plots_scripts/validation_result_collector.py, as we report the test accuracy as a function of the maximal validation accuracy achieved during training.

    • The two are not the same, and the validation accuracy is often lower than the test accuracy. Then again, this is not universally the case.
    • Our experiments with different datasets show that the difference between the two depends strongly on the dataset. Factors such as the distribution, the data partitioning and the complexity (e.g. homophily, class balance, number of nodes) play an important role.
    • I also reproduced the results of the ablation study on the Cora dataset using the latest uploaded version of the code. The table below shows that our results are reproducible:
| Cluster | Normalize | Activate | Reproduced results | Reported in paper |
| --- | --- | --- | --- | --- |
|  |  |  | 81.05 ± 0.74 | 81.59 ± 0.43 |
|  |  |  | 81.05 ± 0.73 | 81.25 ± 0.64 |
|  |  |  | 91.96 ± 0.69 | 93.02 ± 0.36 |
|  |  |  | 81.19 ± 0.79 | 81.64 ± 0.61 |
|  |  |  | 81.18 ± 0.81 | 81.49 ± 0.54 |
|  |  |  | 81.07 ± 0.73 | 81.60 ± 0.72 |
|  |  |  | 81.17 ± 0.80 | 81.60 ± 0.70 |
|  |  |  | 92.59 ± 0.62 | 93.66 ± 0.48 |

Should you have further questions, please feel free to reach out!

Cheers 🍻

P.S.: Should any of your questions from the previous messages remain unanswered, please point them out.
