Skip to content
icoming edited this page Apr 23, 2014 · 22 revisions

Introduction This is a semi-external memory graph engine that stores the adjacency list of a vertex in an SSD array but contains the state of vertices in memory, with the assumption that there are many more edges than vertices in a graph. We hope to demonstrate that such a graph engine can deliver the same performance as in-memory graph engines and can also have a very flexible programming interface.

It adopts the vertex-centric programming model to express graph algorithms. i.e., a graph algorithm should be expressed from the perspective of vertices. Each vertex acts on its own. They can perform computation as well as interact with other vertices.

A graph algorithm usually progresses in iterations. In each iteration, the graph engine executes a user-defined task on each activated vertex. An iteration ends when there are no more active vertices in the iteration and no vertices have pending requests in the graph engine. An algorithm ends when there aren’t active vertices in the next iteration.

The graph engine has three main classes exposed to users to express graph algorithms: graph_engine: is the main class that executes users’ graph algorithms and provides facilities for vertices to interact with others and access data from adjacency lists on the SSDs. compute_vertex: is a base class for users to express their graph algorithms. graph_index: is a container of user-defined compute_vertex. This is an abstract class and there is one implementation called NUMA_graph_index, which partitions the graph and stores compute_vertex into multiple arrays. The graph engine also exposes two main helper classes to access vertices on SSDs: page_vertex: is an interface class that helps to access vertex information on the page cache. page_byte_array and its iterators: page_byte_array represents a byte array in the page cache. Although data in the byte array may be stored across multiple non-contiguous pages, it hides the non-contiguity and exposes vector-like interfaces. Its iterators are exposed to the graph engine users to access neighbors of vertices. graph_engine The graph engine creates its own worker threads to perform user computation. graph_engine has many methods. Some of them need to be invoked in the user’s thread and others need to be used in their own worker thread. Below is a list the methods in graph_engine, which are useful for users. Create and destroy The two static methods in graph_engine should be used to create and destroy a graph engine. static graph_engine *create(int num_threads, int num_nodes, const std::string &graph_file, graph_index *index); static void destroy(graph_engine *graph); In user’s thread void start(vertex_id_t ids[], int num) void start(std::shared_ptr<vertex_filter> filter); void start_all() These three methods trigger the graph engine to execute the user’s graph algorithm. They activate vertices for the first iteration to run. Users can start a graph algorithm on the specified vertices (the first method), or on the vertices that satisfy certain criteria (the second method), or on all vertices (the last method).

vertex_filter is an interface class, as shown below. keep() takes a compute_vertex and returns a boolean. If it returns true, the vertex will be activated. Otherwise, the vertex is ignored. class vertex_filter { public: virtual bool keep(compute_vertex &v) = 0; };

void wait4complete() Once a graph algorithm starts, users’ graph algorithms are performed inside the graph engine. User’s thread can call the following function to wait for a graph algorithm to complete if it has no other tasks to perform. A graph algorithm completes when there aren’t activated vertices. In vertext void activate_vertices(vertex_id_t ids[], int num) void activate_vertex(vertex_id_t vertex) A vertex can activate other vertices during the computation. Once a vertex is activated, the graph engine will perform the task associated with the vertex in the next iteration.

template void multicast_msg(vertex_id_t ids[], int num, const T &msg) template void send_msg(vertex_id_t dest, T &msg) A vertex can also send messages to other vertices. If a vertex needs to send the same message to many other vertices, it should use multicast instead of point-to-point communication. The message will be processed by the destination vertices at any time during the current iteration. run_on_messages() is invoked once a vertex receives a message.

compute_vertex &get_vertex(vertex_id_t id) This allows a vertex to access the state of other vertices. It only works on a shared-memory machine, so it may be replaced by request_vertices() or request_partial_vertices() in the future. compute_vertex compute_vertex is the base class for implementing a graph algorithm. Users need to inherit the class and implement three methods in the class to express their graph algorithms. The graph framework guarantees that these functions aren’t executed on a single vertex in multiple threads simultaneously.

void run(graph_engine &graph) If a vertex is active in an iteration, the graph engine will execute the method. This method is executed only once on an activated vertex in an iteration. Many graph algorithms can use this method to invoke request_vertices() or request_partial_vertices() to access the neighbor list of a vertex.

void run(graph_engine &graph, const page_vertex &vertex) If a graph algorithm invokes request_vertices() or request_partial_vertices(), the graph engine will execute this method in the same iteration when the neighbor list of the requested vertex is ready in the page cache. Each time it is executed, a vertex represented by page_vertex is passed to it. In an iteration, the method is invoked for the same number of times as the number of vertices requested by request_vertices() or request_partial_vertices() in the iteration.

void run_on_messages(graph_engine &, const vertex_message *msgs[], int num) Whenever a vertex receives messages from other vertices, the graph engine will invoke this method.

compute_vertex contains some basic vertex information obtained via the following methods: vertex_id_t get_id() const: the vertex Id off_t get_ext_mem_off() const: the location of the vertex in the graph file vsize_t get_ext_mem_size() const: the size of the vertex in the graph file.

void request_vertices(vertex_id_t ids[], int num) A vertex needs to read neighbors of itself or other vertices from SSDs. This method allows a vertex to access the entire vertex information (all edge lists) on SSDs asynchronously. Once the neighbor lists are read from SSDs, a version of run() of the vertex is invoked.

compute_vertex has two child classes: compute_directed_vertex and comptue_ts_vertex. These two classes can provide users more information and methods to customize the access of vertices on SSD.

compute_directed_vertex provides and maintains the number of in-edges and out-edges in memory and also allows users to request only the in-edge list or only the out-edge list of vertices. vsize_t get_num_in_edges() const; vsize_t get_num_out_edges() const; void request_partial_vertices(directed_vertex_request reqs[], int num);

compute_ts_vertex allows users to request vertices of the specified timestamps. void request_partial_vertices(ts_vertex_request reqs[], int num) graph_index graph_index is an interface class. Its implementation is a container of all compute_vertex in a graph. Currently, it has only one implementation: NUMA_graph_index. The implementation splits the list of compute_vertex into multiple partitions and partitions are evenly distributed in NUMA nodes.

graph_index has the following methods: compute_vertex &get_vertex(vertex_id_t id); It returns compute_vertex of the specified ID.

vertex_id_t get_max_vertex_id() const; It returns the largest vertex ID of the graph.

vertex_id_t get_min_vertex_id() const; It returns the smallest vertex ID of the graph.

size_t get_num_vertices() const; It returns the number of vertices in the graph.

const_iterator begin() const; const_iterator end() const; It returns an iterator that can iterate :w compute_vertex in the graph. page_vertex page_vertex represents a vertex in the page cache. This is a interface class and contains the following methods. size_t get_num_edges(edge_type type) const; It returns the number of edges of the specified type.

page_byte_array::const_iterator<vertex_id_t> get_neigh_begin(edge_type type) const; It returns an STL compatible iterator to access the edge list of the specified type. The returned iterator points to the beginning of the list. This iterator supports random access and can be used in std::sort().

page_byte_array::const_iterator<vertex_id_t> get_neigh_end(edge_type type) const; It returns the iterator pointing to the end of the list.

page_byte_array::seq_const_iterator<vertex_id_t> get_neigh_seq_it(edge_type type, size_t start, size_t end) const; It returns a Java-style iterator to sequentially access the edge list of the specified type. A user needs to specify the range [start, end) in the list where the iterator accesses.

vertex_id_t get_id() const; It returns the vertex ID.

bool is_complete() const; It indicates whether or not the vertex information in the page cache represents a complete vertex. Users are allowed to access part of a directed vertex and a time-series vertex. For example, users can only access the in-edge list.

edge_type is a enum type and defines four values: NONE: no edge IN_EDGE: in-edge OUT_EDGE: out-edge BOTH_EDGES: both in-edge and out-edge page_byte_array page_byte_array represents a byte array in the page cache. This class is to help to access data stored in non-contiguous pages in the page cache. It has two iterators that also help to interpret the data in the byte array.

page_byte_array::const_iterator is a STL-compatible read-only iterator. It supports sequential access as well as random access, and overrides ++, +=, ==, != operators.

page_byte_array::seq_const_iterator is a Java style iterator. It is much cheaper for sequential access than page_byte_array::seq_const_iterator. It has three methods: has_next, next and curr. has_next and next are the same as Java iterator and curr is to access the current element pointed by the iterator.

The configuration of SAFS and the graph engine Two global variables that contain all configurations of SAFS and the graph engine: params graph_conf

commands apps/tools/el2al convert an edge list to adjacency lists el2al [options] adj_list_file index_file edge_list_files (or directories) -u: undirected graph -d delimiter: the delimiter to seperate the input edge list -c: compress the graph (remove duplicated edges) -p: print adjacency list -v: verify the created adjacency list -t type: the type of edge data. Supported type: count, timestamp, -m: merge multiple edge lists into a single graph. -w: write the graph to a file

An example: apps/tools/el2al -w /mnt/nfs1/graph-data/wiki-Vote /mnt/nfs1/graph-data/wiki-Vote-index /mnt/nfs1/graph-data/wiki-Vote.txt

utils/SAFS-util SAFS-util conf_file command ... The supported commands are create file_name size: create a file with the specified size delete file_name: delete a file help: print the help info list: list existing files in SAFS load file_name [ext_file]: load data to the file verify file_name [ext_file]: verify data in the file

utils/SAFS-util test/conf/run_graph.txt create wiki-Vote-v3 1M utils/SAFS-util test/conf/run_graph.txt load wiki-Vote-v3 /mnt/nfs1/graph-data/wiki-Vote-v3

Clone this wiki locally