Installation and reproducibility issues #1

Open
yzimmermann opened this issue Nov 23, 2024 · 2 comments



yzimmermann commented Nov 23, 2024

I'm trying to install this library to perform some experiments and have run into a couple of issues.
I followed the instructions and installed via install_script.sh. However, when I run python scripts/execute_experiments.py -h I get:

Well, hello there!
Computer other metrics? - False
Traceback (most recent call last):
  File "scripts/execute_experiments.py", line 14, in <module>
    from scripts.model_trainer import ModelTrainer
  File "/home/ubuntu/cna_modules/src/scripts/model_trainer.py", line 18, in <module>
    from clustering.rationals_on_clusters import RationalOnCluster
  File "/home/ubuntu/cna_modules/src/clustering/__init__.py", line 1, in <module>
    from .rationals_on_clusters import RationalOnCluster
  File "/home/ubuntu/cna_modules/src/clustering/rationals_on_clusters.py", line 7, in <module>
    from activation_functions.power_mean import RationalPowerMeanModel
ModuleNotFoundError: No module named 'activation_functions'

I don't see any references to RationalPowerMeanModel at https://github.com/k4ntz/activation-functions, which seems to install a different activations package.

To still run some experiments, even without rational activations, I commented out the relevant imports and ran the following config:

{
    "experiment_number": 1,
    "epochs": 200,
    "lr_model": 0.001,
    "lr_activation": 1e-05,
    "weight_decay": 5e-6,
    "num_hidden_features": 280,
    "num_layers": 4,
    "with_clusters": true,
    "normalize": true,
    "clusters": 14,
    "activation_type": "ActivationType.RELU",
    "recluster_option": "ReclusterOption.ITR",
    "dataset_type": "Planetoid",
    "dataset_name": "Cora",
    "task": "node_classification",
    "model": "GCNConv"
}

Out of the box, this seems to produce a random 80/10/10 split, and the training output looks like this. Why is what you call the "validation accuracy" consistently (over multiple seeds) lower than the "test accuracy"? Now, when I change the number of clusters to 1, I get this. The test loss seems equally good, and the validation loss is still consistently lower. In my understanding, this should behave like "with_clusters": false. However, when I actually change that in the config file, I get this. The accuracy is suddenly substantially lower, but the split also seems to have automatically changed to "20 nodes per class" (140/500/1000).
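
For reference, that 140/500/1000 split matches PyTorch Geometric's default public Planetoid split, which can be checked directly (a quick snippet of my own; the root path is just a placeholder):

```python
from torch_geometric.datasets import Planetoid

# Load Cora with PyG's default split="public" ("20 nodes per class").
dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]
print(int(data.train_mask.sum()), int(data.val_mask.sum()), int(data.test_mask.sum()))
# Prints: 140 500 1000
```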
In the data loading code I found this (it overwrites the default split if mode is set):

if mode:
    # Calculate the number of nodes for each split based on the given percentages
    train_percentage, test_percentage, valid_percentage = (80, 10, 10)
    num_train_nodes = int(total_nodes * (train_percentage / 100))
    num_test_nodes = int(total_nodes * (test_percentage / 100))
    num_valid_nodes = int(total_nodes * (valid_percentage / 100))

    # Update masks accordingly
    dataset[0].train_mask.fill_(False)
    dataset[0].train_mask[:num_train_nodes] = 1
    dataset[0].val_mask.fill_(False)
    dataset[0].val_mask[num_train_nodes : num_train_nodes + num_valid_nodes] = 1
    dataset[0].test_mask.fill_(False)
    dataset[0].test_mask[
        num_train_nodes
        + num_valid_nodes : num_train_nodes
        + num_valid_nodes
        + num_test_nodes
    ] = 1

    dataset[0].transform = T.NormalizeFeatures()

and for another dataset e.g.

    if mode:
        train_percent, val_percent, test_percent = (80, 10, 10)
    else:
        train_percent, val_percent, test_percent = (60, 20, 20)
    assert train_percent + val_percent + test_percent == 100

mode seems to be set in execute_experiments.py

set_mode(config["with_clusters"] and config["normalize"])
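
If I read this correctly, mode is only True when both flags are set, which would explain why flipping "with_clusters" alone silently switches the split. A quick check of my own (it just mirrors the expression above; the split descriptions are what I observed, not taken from the code):

```python
# Hypothetical reproduction of the argument passed to set_mode for all flag combinations.
for with_clusters in (True, False):
    for normalize in (True, False):
        mode = with_clusters and normalize
        split = "80/10/10 index slice" if mode else "default split (e.g. 140/500/1000 on Cora)"
        print(f"with_clusters={with_clusters}, normalize={normalize} -> mode={mode}: {split}")
```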

What is going on here? What is the purpose of mode and the different splits?

Thanks in advance for your help and clarifications!

yzimmermann (Author) commented Nov 24, 2024

Some further observations (on Cora). The code for the 80/10/10 split (if mode) seems to be deterministic rather than random? This would explain the weird behavior with the test loss significantly higher than the validation loss. Why does results_collector.py pull the maximum test accuracy over all epochs and not the test accuracy at the maximum validation accuracy?
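
For clarity, the selection rule I would expect is roughly the following (my own sketch; val_accs and test_accs stand for per-epoch values collected during training), rather than the maximum of test_accs over all epochs:

```python
def test_acc_at_best_val(val_accs, test_accs):
    # Pick the epoch with the highest validation accuracy
    # and report the test accuracy at that epoch.
    best_epoch = max(range(len(val_accs)), key=lambda e: val_accs[e])
    return test_accs[best_epoch]
```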
With a random 80/10/10 split I get the following (with the parameters of the ablation study, over 50 seeds):

|  | Test acc. @ max(val acc.) | max(test acc.) over all runs | max(test acc.) per run | Paper reports |
| --- | --- | --- | --- | --- |
| Cluster + Normalize | 87.87 ± 1.76 | 93.33 | 89.71 ± 1.68 | 93.02 ± 0.36 |
| Normalize | 87.85 ± 1.79 | 93.33 | 89.71 ± 1.68 | 81.60 ± 0.72 |
| Vanilla GCN | 87.87 ± 1.76 | 93.33 | 89.72 ± 1.68 | 81.59 ± 0.43 |

For the "20 nodes per class" split I get

|  | Test acc. @ max(val acc.) | max(test acc.) over all runs | max(test acc.) per run | Paper reports |
| --- | --- | --- | --- | --- |
| Cluster + Normalize | 81.07 ± 0.80 | 83.3 | 81.49 ± 0.62 | 93.02 ± 0.36 |
| Normalize | 81.07 ± 0.80 | 83.3 | 81.49 ± 0.62 | 81.60 ± 0.72 |
| Vanilla GCN | 81.06 ± 0.81 | 83.3 | 81.48 ± 0.62 | 81.59 ± 0.43 |

I also report the maximum test accuracy over all runs because it is something computed in results_collector.py. As far as I understand, the values in the paper are based on the (1708/500/500) split (the third split from this paper).

askrix (Collaborator) commented Nov 29, 2024

Hello @yzimmermann 👋🏻,

Thank you for raising the issue with all the questions. I'll address them in the same order:

  1. The code has been cleaned up and the correct versions of the scripts have been uploaded. Please check it out; after a git pull, all experiments should run out of the box. Should that not be the case, please kindly let us know! Thanks in advance for that.
  2. Choice of splits:
  • As you pointed out, we are indeed using deterministic 80/10/10 splits for all GNN architectures on Pubmed, Cora and CiteSeer. In doing so, we achieve fair comparisons across runs and with other works. For all other datasets and tasks, we used the splits provided by PyTorch Geometric!

  • For node regression tasks, we use the splits exactly as defined in the Gradient Gating paper, and we achieved much higher performance.

  3. Please check out /src/utils/filter_plots_scripts/validation_result_collector.py, as we report the test accuracy as a function of the maximal validation accuracy achieved during training.

    • The two are not the same, and the validation accuracy is often lower than the test accuracy. Then again, this is not universally the case.
    • Our experiments with different datasets show that the difference between the two depends strongly on the dataset. Factors such as the distribution, the data partitioning and the complexity (e.g. homophily, class balance, number of nodes) play an important role.
    • I also reproduced the results of the ablation study on the Cora dataset using the latest uploaded version of the code. The table below shows that our results are reproducible:
| Cluster | Normalize | Activate | Reproduced results | Reported in paper |
| --- | --- | --- | --- | --- |
|  |  |  | 81.05 ± 0.74 | 81.59 ± 0.43 |
|  |  |  | 81.05 ± 0.73 | 81.25 ± 0.64 |
|  |  |  | 91.96 ± 0.69 | 93.02 ± 0.36 |
|  |  |  | 81.19 ± 0.79 | 81.64 ± 0.61 |
|  |  |  | 81.18 ± 0.81 | 81.49 ± 0.54 |
|  |  |  | 81.07 ± 0.73 | 81.60 ± 0.72 |
|  |  |  | 81.17 ± 0.80 | 81.60 ± 0.70 |
|  |  |  | 92.59 ± 0.62 | 93.66 ± 0.48 |

Should you have further questions, please feel free to reach out!

Cheers 🍻

P.S.: Should any of your questions from the previous messages remain unanswered, please point them out.
