
Could you provide the input data format? #2

Open
velpc opened this issue Apr 8, 2021 · 15 comments

velpc commented Apr 8, 2021

Could you give a detailed explanation of the HDF5 instance format, and of the pyg, dgl, nx, or dict graph objects?

jzhou316 (Collaborator) commented Apr 8, 2021

Hi, we have detailed how the large graph datasets are stored in our unified HDF5 graph data format here. The pyg, dgl, nx, or dict graph objects are created when the graphs are loaded (check the source code here).
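
To make the on-disk layout concrete, here is a minimal sketch (not the project's actual writer) of building a file in the structure described in this thread: one HDF5 group per graph keyed by its id string, an edge_index dataset inside each group, and statistics stored as attributes. The names '0', 'edge_index', 'num_nodes', and 'num_graphs' come from this thread; everything else is illustrative.

import h5py
import numpy as np

# Sketch of the HDF5 graph layout described in this thread; values are toy data.
with h5py.File('toy_graphs.hdf5', 'w') as f:
    f.attrs['num_graphs'] = 2                          # dataset-level statistics as file attributes
    for gid in range(2):
        g = f.create_group(str(gid))                   # one group per graph, keyed by its id string
        edge_index = np.array([[0, 1, 2], [1, 2, 0]])  # shape (2, num_edges): source/target rows
        g.create_dataset('edge_index', data=edge_index)
        g.attrs['num_nodes'] = 3                       # per-graph statistics as group attributes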

velpc (Author) commented Apr 11, 2021

Thank you so much! They are very timely and helpful. Could you also provide information on how to generate an xx_split_idx.pkl file from the dataset, and on its storage format?

jzhou316 (Collaborator) commented

The xx_split_idx.pkl file stores the indices used to split the original large graph dataset in the HDF5 format into train/validation/test sets. It is a dictionary with keys "train", "val", and "test", where each value is a list of graph ids in the corresponding subset. We use this in preprocessing to split the original single dataset into separate train/validation/test storage, as done here. For our dataset, these splits are randomly generated based on the total number of graphs with an 8:1:1 train/validation/test ratio, and fixed thereafter for community use.
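
To make this concrete, here is a minimal sketch of generating such a split file; the 8:1:1 ratio and the "train"/"val"/"test" keys come from the comment above, while the file name, seed, and graph count are illustrative.

import pickle
import random

# Sketch: random 8:1:1 train/val/test split over graph ids, saved as a pickle
# in the dictionary format described above. Numbers are illustrative.
num_graphs = 960
ids = list(range(num_graphs))
random.seed(0)                        # fix the split so it stays the same thereafter
random.shuffle(ids)

n_train = int(0.8 * num_graphs)
n_val = int(0.1 * num_graphs)
split_idx = {
    'train': ids[:n_train],
    'val': ids[n_train:n_train + n_val],
    'test': ids[n_train + n_val:],
}

with open('xx_split_idx.pkl', 'wb') as f:
    pickle.dump(split_idx, f)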

jackd commented Apr 12, 2021

Thanks for the clean datasets! One issue I have regarding the data specification:

graph_data_storage.md specifies x as node signals/features, but I can't find these in any of the HDF5 files. Furthermore, README.md suggests these are featureless graphs. Can you clarify?

jzhou316 (Collaborator) commented

Yes, x stores the node features. As our graphs are featureless, we do not have them in the raw data. However, GNN algorithms need some values to operate on in order to propagate information through the topology. We simply add a dummy all-ones vector in x (this is done in our data processing when the dataset is constructed from the raw data), meaning that all nodes are treated homogeneously and we focus on learning purely from topology for our botnet dataset.

Also note that the data format we created for large graph datasets can easily be extended with other special graph attributes based on your problem. Hope this is helpful!
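
For illustration, here is roughly what adding the dummy all-ones feature looks like when a pyg graph object is constructed; this is a sketch, not the repository's actual preprocessing code, and the toy graph is illustrative.

import torch
from torch_geometric.data import Data

# Sketch: attach a dummy all-ones node feature vector to a featureless graph
# so GNN layers have values to propagate through the topology.
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]], dtype=torch.long)
num_nodes = 3
x = torch.ones(num_nodes, 1)          # all-ones features: every node is treated homogeneously
data = Data(x=x, edge_index=edge_index, num_nodes=num_nodes)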

jackd commented Apr 12, 2021

Thanks @jzhou316, that makes perfect sense - but it might be a nice addition to graph_data_storage.md :).

jzhou316 (Collaborator) commented

Cool, I'll add some details.

velpc (Author) commented Apr 13, 2021

The detailed instructions are very helpful. How should num_evils and num_evils_avg be set if our problem is multi-class classification rather than binary classification (evil/non-evil)?

jzhou316 (Collaborator) commented

@velpc These are dataset statistics stored in the HDF5 file (and may not be used by the model). For a different problem such as multi-class classification, you can write your own data following our format, with other dataset attributes. For example, you could have attributes such as "num_class_0", "num_class_1", "num_class_2", etc. to describe the dataset. We have some example code for writing these attributes here. Hope this answers your question!
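
For example, a sketch of writing such per-class statistics as HDF5 attributes, following the convention above; the attribute names come from this comment, while the file name and counts are illustrative.

import h5py

# Sketch: store per-class dataset statistics as file-level HDF5 attributes.
with h5py.File('my_multiclass_graphs.hdf5', 'a') as f:
    f.attrs['num_class_0'] = 4800    # e.g. count for class 0
    f.attrs['num_class_1'] = 150     # e.g. count for class 1
    f.attrs['num_class_2'] = 50      # e.g. count for class 2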

iohelder commented May 5, 2021

Hi @jzhou316, is there a much smaller dataset that can be used for quick testing of the algorithm? I wanted to try it out on a smaller subset without having to download the ones specified in the dataset_botnet.py file. Thanks

jzhou316 (Collaborator) commented May 5, 2021

@helmoai Sorry, we currently don't have an official mini dataset for quick testing. Could you download the data and take out a subset (e.g. a few graphs) to run a mini test? Otherwise I could generate a smaller subset from one of the datasets for you.
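
One way to take out such a subset, sketched under the layout described in this thread (per-graph groups keyed by id strings, with statistics in attributes); the file names and the number of graphs kept are illustrative.

import h5py

# Sketch: copy the first few graphs from a downloaded dataset file into a
# small HDF5 file for quick testing.
keep = 3
with h5py.File('chord.hdf5', 'r') as src, h5py.File('chord_mini.hdf5', 'w') as dst:
    for gid in range(keep):
        src.copy(str(gid), dst)       # copies the group, its datasets, and its attributes
    dst.attrs['num_graphs'] = keep    # update the dataset-level statistics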

iohelder commented Jul 9, 2021

> Hi, we have detailed how the large graph datasets are stored in our unified HDF5 graph data format here. The pyg, dgl, nx, or dict graph objects are created when the graphs are loaded (check the source code here).

I found an issue in the code you gave for reading the HDF5 files here: I think you missed the h5py.File call when opening the file. It should be:

import h5py
with h5py.File('filename', 'r') as f:
    e = f['0']['edge_index'][()]             # edge indices of the first graph, with id '0'
    num_nodes = f['0'].attrs['num_nodes']    # per-graph statistics stored in the group's attributes
    num_graphs = f.attrs['num_graphs']       # dataset-level statistics stored in the file's attributes

jzhou316 (Collaborator) commented Jul 9, 2021

@helmoai Yes, you are right. Thanks for pointing it out! Updated it.

whxuexi commented May 24, 2022

In scatter_ of common.py, the call scatter_(src, index, 0, out, dim_size, fill_value) passes 6 arguments, but the function only accepts 2 to 5 positional arguments.
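
This looks like a torch_scatter API change (an inference on my part, not confirmed in this thread): older torch_scatter releases accepted a fill_value argument, while torch_scatter >= 2.0 dropped it in favor of a reduce argument. A sketch of the newer-style call:

import torch
from torch_scatter import scatter

# Sketch: newer torch_scatter API without fill_value; uses `reduce` instead.
src = torch.randn(6, 4)
index = torch.tensor([0, 0, 1, 1, 2, 2])
out = scatter(src, index, dim=0, dim_size=3, reduce='sum')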

tillson commented Dec 11, 2023

> Yes, x stores the node features. As our graphs are featureless, we do not have them in the raw data. However, GNN algorithms need some values to operate on in order to propagate information through the topology. We simply add a dummy all-ones vector in x (this is done in our data processing when the dataset is constructed from the raw data), meaning that all nodes are treated homogeneously and we focus on learning purely from topology for our botnet dataset.
>
> Also note that the data format we created for large graph datasets can easily be extended with other special graph attributes based on your problem. Hope this is helpful!

I've been implementing this on a different network dataset and noticed a few gotchas related to this. If you use the botgen/ code to generate your data, it adds the dummy vector, so setting add_nfeat_ones=True to add it again at training time causes an error. Additionally, the botgen code does not add is_directed or self_directed to the data, so you will need to add those manually.
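
For instance, a sketch of adding those fields manually; treating is_directed and self_directed as per-graph HDF5 attributes is an assumption on my part, and only the field names come from this thread.

import h5py

# Sketch: manually add the fields that the botgen/ code omits.
# Storing them as per-graph attributes is an assumption; names from above.
with h5py.File('my_generated_graphs.hdf5', 'a') as f:
    for gid in f.keys():                      # iterate over per-graph groups
        f[gid].attrs['is_directed'] = False
        f[gid].attrs['self_directed'] = False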
