
Could you provide the input data format? #2

Open
velpc opened this issue Apr 8, 2021 · 15 comments

velpc commented Apr 8, 2021

Could you give a detailed explanation of the HDF5 instance format, and of the pyg, dgl, nx, or dict graph objects?

jzhou316 (Collaborator) commented Apr 8, 2021

Hi, we have detailed how the large graph datasets are stored in our unified HDF5 graph data format here. The pyg, dgl, nx, or dict graph objects are created when the graphs are loaded (check the source code here).
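
To make the on-disk layout concrete, here is a minimal sketch (not the project's actual writer) of building a file in the structure described in this thread: one HDF5 group per graph keyed by its id string, an edge_index dataset inside each group, and statistics stored as attributes. The names '0', 'edge_index', 'num_nodes', and 'num_graphs' come from this thread; everything else is illustrative.

import h5py
import numpy as np

# Sketch of the HDF5 graph layout described in this thread; values are toy data.
with h5py.File('toy_graphs.hdf5', 'w') as f:
    f.attrs['num_graphs'] = 2                          # dataset-level statistics as file attributes
    for gid in range(2):
        g = f.create_group(str(gid))                   # one group per graph, keyed by its id string
        edge_index = np.array([[0, 1, 2], [1, 2, 0]])  # shape (2, num_edges): source/target rows
        g.create_dataset('edge_index', data=edge_index)
        g.attrs['num_nodes'] = 3                       # per-graph statistics as group attributes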

velpc (Author) commented Apr 11, 2021

Thank you so much! They are very timely and helpful. Could you also provide information on how to generate an xx_split_idx.pkl file from the dataset, and on its storage format?

jzhou316 (Collaborator) commented

The xx_split_idx.pkl file stores the indices used to split the original large graph dataset in the HDF5 format into train/validation/test sets. It is a dictionary with keys "train", "val", and "test", where each value is a list of graph ids in the corresponding subset. We use this in preprocessing to split the original single dataset into separate train/validation/test storage, as done here. For our dataset, these splits are randomly generated based on the total number of graphs with an 8:1:1 train/validation/test ratio, and fixed thereafter for community use.
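
To make this concrete, here is a minimal sketch of generating such a split file; the 8:1:1 ratio and the "train"/"val"/"test" keys come from the comment above, while the file name, seed, and graph count are illustrative.

import pickle
import random

# Sketch: random 8:1:1 train/val/test split over graph ids, saved as a pickle
# in the dictionary format described above. Numbers are illustrative.
num_graphs = 960
ids = list(range(num_graphs))
random.seed(0)                        # fix the split so it stays the same thereafter
random.shuffle(ids)

n_train = int(0.8 * num_graphs)
n_val = int(0.1 * num_graphs)
split_idx = {
    'train': ids[:n_train],
    'val': ids[n_train:n_train + n_val],
    'test': ids[n_train + n_val:],
}

with open('xx_split_idx.pkl', 'wb') as f:
    pickle.dump(split_idx, f)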

jackd commented Apr 12, 2021

Thanks for the clean datasets! One issue I have regarding the data specification:

graph_data_storage.md specifies x as node signals/features, but I can't find these in any of the HDF5 files. Furthermore, README.md suggests these are featureless graphs. Can you clarify?

jzhou316 (Collaborator) commented

Yes, x stores the node features. As our graphs are featureless, we do not have them in the raw data. However, GNN algorithms need some values to operate on in order to propagate information through the topology. We simply add a dummy all-ones vector in x (this is done in our data processing when the dataset is constructed from the raw data), meaning that all nodes are treated homogeneously and we focus on learning purely from topology for our botnet dataset.

Also note that the data format we created for large graph datasets can easily be extended with other special graph attributes based on your problem. Hope this is helpful!
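
For illustration, here is roughly what adding the dummy all-ones feature looks like when a pyg graph object is constructed; this is a sketch, not the repository's actual preprocessing code, and the toy graph is illustrative.

import torch
from torch_geometric.data import Data

# Sketch: attach a dummy all-ones node feature vector to a featureless graph
# so GNN layers have values to propagate through the topology.
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]], dtype=torch.long)
num_nodes = 3
x = torch.ones(num_nodes, 1)          # all-ones features: every node is treated homogeneously
data = Data(x=x, edge_index=edge_index, num_nodes=num_nodes)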

jackd commented Apr 12, 2021

Thanks @jzhou316, that makes perfect sense - but it might be a nice addition to graph_data_storage.md :).

jzhou316 (Collaborator) commented

Cool, I'll add some details.

velpc (Author) commented Apr 13, 2021

The detailed instructions are very helpful. How should num_evils and num_evils_avg be set if our problem is multi-class classification rather than binary classification (evil/non-evil)?

jzhou316 (Collaborator) commented

@velpc These are dataset statistics stored in the HDF5 file (and may not be used by the model). For a different problem such as multi-class classification, you can write your own data following our format, with other dataset attributes. For example, you could have attributes such as "num_class_0", "num_class_1", "num_class_2", etc. to describe the dataset. We have some example code for writing these attributes here. Hope this answers your question!
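
For example, a sketch of writing such per-class statistics as HDF5 attributes, following the convention above; the attribute names come from this comment, while the file name and counts are illustrative.

import h5py

# Sketch: store per-class dataset statistics as file-level HDF5 attributes.
with h5py.File('my_multiclass_graphs.hdf5', 'a') as f:
    f.attrs['num_class_0'] = 4800    # e.g. count for class 0
    f.attrs['num_class_1'] = 150     # e.g. count for class 1
    f.attrs['num_class_2'] = 50      # e.g. count for class 2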

iohelder commented May 5, 2021

Hi @jzhou316, is there a much smaller dataset that can be used for quick testing of the algorithm? I wanted to try it out on a smaller subset without having to download the ones specified in the dataset_botnet.py file. Thanks

jzhou316 (Collaborator) commented May 5, 2021

@helmoai Sorry, we currently don't have an official mini dataset for quick testing. Could you download the data and take out a subset (e.g. a few graphs) to run a mini test? Otherwise I could generate a smaller subset from one of the datasets for you.
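
One way to take out such a subset, sketched under the layout described in this thread (per-graph groups keyed by id strings, with statistics in attributes); the file names and the number of graphs kept are illustrative.

import h5py

# Sketch: copy the first few graphs from a downloaded dataset file into a
# small HDF5 file for quick testing.
keep = 3
with h5py.File('chord.hdf5', 'r') as src, h5py.File('chord_mini.hdf5', 'w') as dst:
    for gid in range(keep):
        src.copy(str(gid), dst)       # copies the group, its datasets, and its attributes
    dst.attrs['num_graphs'] = keep    # update the dataset-level statistics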

iohelder commented Jul 9, 2021

> Hi, we have detailed how the large graph datasets are stored in our unified HDF5 graph data format here. The pyg, dgl, nx, or dict graph objects are created when the graphs are loaded (check the source code here).

I found an issue in the code you gave for reading the HDF5 files here: I think you missed the h5py.File call when opening the file. It should be:

import h5py
with h5py.File('filename', 'r') as f:
    e = f['0']['edge_index'][()]             # edge indices of the first graph, with id '0'
    num_nodes = f['0'].attrs['num_nodes']    # per-graph statistics stored in the group's attributes
    num_graphs = f.attrs['num_graphs']       # dataset-level statistics stored in the file's attributes

jzhou316 (Collaborator) commented Jul 9, 2021

@helmoai Yes, you are right. Thanks for pointing it out! Updated it.

whxuexi commented May 24, 2022

In scatter_ of common.py, the call scatter_(src, index, 0, out, dim_size, fill_value) passes 6 arguments, but the function only accepts 2 to 5 positional arguments.
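
This looks like a torch_scatter API change (an inference on my part, not confirmed in this thread): older torch_scatter releases accepted a fill_value argument, while torch_scatter >= 2.0 dropped it in favor of a reduce argument. A sketch of the newer-style call:

import torch
from torch_scatter import scatter

# Sketch: newer torch_scatter API without fill_value; uses `reduce` instead.
src = torch.randn(6, 4)
index = torch.tensor([0, 0, 1, 1, 2, 2])
out = scatter(src, index, dim=0, dim_size=3, reduce='sum')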

tillson commented Dec 11, 2023

> Yes, x stores the node features. As our graphs are featureless, we do not have them in the raw data. However, GNN algorithms need some values to operate on in order to propagate information through the topology. We simply add a dummy all-ones vector in x (this is done in our data processing when the dataset is constructed from the raw data), meaning that all nodes are treated homogeneously and we focus on learning purely from topology for our botnet dataset.
>
> Also note that the data format we created for large graph datasets can easily be extended with other special graph attributes based on your problem. Hope this is helpful!

I've been implementing this on a different network dataset and noticed a few gotchas related to this. If you use the botgen/ code to generate your data, it adds the dummy vector, so setting add_nfeat_ones=True to add it again at training time causes an error. Additionally, the botgen code does not add is_directed or self_directed to the data, so you will need to add those manually.
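
For instance, a sketch of adding those fields manually; treating is_directed and self_directed as per-graph HDF5 attributes is an assumption on my part, and only the field names come from this thread.

import h5py

# Sketch: manually add the fields that the botgen/ code omits.
# Storing them as per-graph attributes is an assumption; names from above.
with h5py.File('my_generated_graphs.hdf5', 'a') as f:
    for gid in f.keys():                      # iterate over per-graph groups
        f[gid].attrs['is_directed'] = False
        f[gid].attrs['self_directed'] = False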
