A Gentle Introduction to Graph Neural Networks
date: Oct 17, 2024
slug: gnn-1
status: Published
tags: AI, GNN
summary:
type: Post
Graph
- V: Vertex attributes, e.g., node identity, number of neighbors
- E: Edge attributes and directions, e.g., edge identity, edge weight
- U: Global (master node) attributes, e.g., number of nodes, longest path
Each node/edge/global attribute can be stored as a scalar or embedded as a vector.
If we use an adjacency matrix to represent the connections between nodes, several problems arise:
- Large and sparse: real graphs can have huge numbers of nodes while each node touches only a few edges, so most entries of the adjacency matrix are zero. Some examples (the last three columns are edges per node, i.e., degree):

| Dataset | Domain | graphs | nodes | edges | min degree | mean degree | max degree |
| --- | --- | --- | --- | --- | --- | --- | --- |
| karate club | Social network | 1 | 34 | 78 | | 4.5 | 17 |
| qm9 | Small molecules | 134k | ≤9 | ≤26 | 1 | 2 | 5 |
| Cora | Citation network | 1 | 23,166 | 91,500 | 1 | 7.8 | 379 |
| Wikipedia links, English | Knowledge graph | 1 | 12M | 378M | | 62.24 | 1M |
- A single graph can be represented by many different adjacency matrices (one for each ordering of the nodes), but a deep network is in general not permutation invariant: different input matrices can produce different outputs, even though they all describe the same graph.

One solution is to use an adjacency list.
Note that the i-th entry of the adjacency list represents the i-th edge. Here the entries of Nodes and Edges, as well as Global, are shown as scalars; in practice they can be vectors.
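As a concrete illustration, here is a minimal sketch (plain Python with made-up toy values; the field names are illustrative, not from any particular library) of a small graph stored as node attributes, edge attributes, an adjacency list, and a global attribute:

```python
# A toy graph with 4 nodes and 4 edges in "attributes + adjacency list" form.
graph = {
    # one scalar per node; in practice each entry could be an embedding vector
    "nodes": [0.1, 0.7, 0.3, 0.9],
    # one scalar per edge; the i-th entry describes the i-th edge
    "edges": [1.0, 2.0, 1.0, 3.0],
    # the i-th pair gives the two node indices connected by the i-th edge
    "adjacency_list": [(0, 1), (1, 2), (2, 3), (3, 0)],
    # a single global (master node) attribute for the whole graph
    "global": 0.5,
}

# Unlike a dense 4x4 adjacency matrix, this stores only the edges that exist,
# which stays compact even for large, sparse graphs.
```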
Types of Problems
Graph-level task
Predict a global property of the whole graph. In the figure above, the molecules are split into two classes according to the number of rings they contain.
Node-level task
Predict a property for each node. The figure above shows, for each node, whether it is more likely to end up in the blue or the red camp (a bit like a US election).
Edge-level task
Predict a property for each edge. The figure above infers the relationship between people from their relative positions in an image.
Graph Neural Networks
The simplest GNN updates the Vertex, Edge, and Global embeddings separately, which only requires three MLPs (a multilayer perceptron stacks many fully connected layers, each feeding its output into the next until the final output is produced; the first L−1 layers can be viewed as a representation and the last layer as a linear predictor).
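A minimal sketch of such a layer, assuming PyTorch and hypothetical embedding sizes; each MLP transforms its own attribute independently, and the graph's connectivity is not used at all:

```python
import torch
import torch.nn as nn

class SimplestGNNLayer(nn.Module):
    """Update node, edge and global embeddings with three independent MLPs."""

    def __init__(self, node_dim=16, edge_dim=16, global_dim=16, hidden=32):
        super().__init__()
        self.node_mlp = nn.Sequential(
            nn.Linear(node_dim, hidden), nn.ReLU(), nn.Linear(hidden, node_dim))
        self.edge_mlp = nn.Sequential(
            nn.Linear(edge_dim, hidden), nn.ReLU(), nn.Linear(hidden, edge_dim))
        self.global_mlp = nn.Sequential(
            nn.Linear(global_dim, hidden), nn.ReLU(), nn.Linear(hidden, global_dim))

    def forward(self, nodes, edges, global_attr):
        # nodes: [num_nodes, node_dim], edges: [num_edges, edge_dim], global_attr: [global_dim]
        return self.node_mlp(nodes), self.edge_mlp(edges), self.global_mlp(global_attr)
```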
Pooling Information
For the three kinds of problems mentioned above (predictions on nodes, edges, and global information), what if we lack the information at the very level we want to predict? For example, we want to predict edges but only have node information. In that case we can compute the information we need from the other levels; this is called pooling (aggregation).
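For instance, to make edge-level or graph-level predictions when only node embeddings are available, we can pool the relevant node embeddings by summing them. A minimal sketch, assuming PyTorch and the adjacency-list representation from above:

```python
import torch

def pool_nodes_to_edges(node_emb, adjacency_list):
    """For each edge (i, j), pool the two endpoint node embeddings by summing them."""
    # node_emb: [num_nodes, dim]; adjacency_list: list of (i, j) index pairs
    src = torch.tensor([i for i, _ in adjacency_list])
    dst = torch.tensor([j for _, j in adjacency_list])
    return node_emb[src] + node_emb[dst]   # [num_edges, dim]

def pool_nodes_to_global(node_emb):
    """For a graph-level prediction, pool all node embeddings into a single vector."""
    return node_emb.sum(dim=0)             # [dim]
```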
Passing messages between parts of the graph
Earlier we built three independent MLPs to update the Vertex, Edge, and Global embeddings, but this does not use the connectivity of the graph at all, e.g., which nodes connect to which nodes and edges. So when updating an embedding we can also take the information of the surrounding nodes/edges into account.
Taking nodes as an example:
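A minimal sketch of one such message-passing step for nodes, assuming sum aggregation over 1-hop neighbors and a hypothetical update_mlp that maps the concatenated [own embedding, aggregated neighbor embeddings] to the new node embedding:

```python
import torch

def message_passing_node_update(node_emb, adjacency_list, update_mlp):
    """One message-passing step: each node sums its neighbors' embeddings,
    concatenates the result with its own embedding, and applies an update MLP."""
    num_nodes, dim = node_emb.shape
    agg = torch.zeros(num_nodes, dim)
    for i, j in adjacency_list:        # treat each edge as undirected
        agg[i] += node_emb[j]
        agg[j] += node_emb[i]
    return update_mlp(torch.cat([node_emb, agg], dim=-1))  # [num_nodes, out_dim]
```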
Learning edge representations
Suppose we have no node data, only edge data; we can then use message passing to obtain node embeddings from the edge embeddings.
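A minimal sketch of this direction, assuming the same adjacency-list representation: each node sums the embeddings of its incident edges to obtain an initial node embedding.

```python
import torch

def pool_edges_to_nodes(edge_emb, adjacency_list, num_nodes):
    """Each node collects (sums) the embeddings of the edges incident to it."""
    node_emb = torch.zeros(num_nodes, edge_emb.shape[1])
    for e, (i, j) in enumerate(adjacency_list):
        node_emb[i] += edge_emb[e]
        node_emb[j] += edge_emb[e]
    return node_emb   # [num_nodes, dim]
```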
Which graph attributes we update, and in which order we update them, is one design decision when constructing GNNs. We could choose whether to update node embeddings before edge embeddings, or the other way around. This is an open area of research with a variety of solutions; for example, we could update in a 'weave' fashion where we have four updated representations that get combined into new node and edge representations: node to node (linear), edge to edge (linear), node to edge (edge layer), edge to node (node layer).
Adding global representations
There is one flaw with the networks we have described so far: nodes that are far away from each other in the graph may never be able to efficiently transfer information to one another, even if we apply message passing several times. For one node, if we have k layers, information will propagate at most k steps away. This can be a problem for situations where the prediction task depends on nodes, or groups of nodes, that are far apart. One solution would be to have all nodes be able to pass information to each other. Unfortunately, for large graphs this quickly becomes computationally expensive (although this approach, called 'virtual edges', has been used for small graphs such as molecules).
One solution to this problem is to use the global representation of the graph (U), sometimes called a master node or context vector. This global context vector is connected to all other nodes and edges in the network, and can act as a bridge between them to pass information, building up a representation of the graph as a whole. This creates a richer and more complex representation of the graph than could otherwise have been learned.
In this view all graph attributes have learned representations, so we can leverage them during pooling by conditioning the information of our attribute of interest with respect to the rest. For example, for one node we can consider information from neighboring nodes, connected edges, and the global information. To condition the new node embedding on all these possible sources of information, we can simply concatenate them. Alternatively, we may map them to the same space via a linear map and add them, or apply a feature-wise modulation layer, which can be considered a type of feature-wise attention mechanism.
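As a sketch of the conditioning step described above, assuming the neighbor and edge aggregates have already been pooled as in the earlier examples, and with a hypothetical update_mlp:

```python
import torch

def conditioned_node_update(node_emb, neighbor_agg, edge_agg, global_attr, update_mlp):
    """Condition each new node embedding on aggregated neighbor embeddings,
    aggregated incident-edge embeddings, and the global (master node) vector."""
    num_nodes = node_emb.shape[0]
    global_broadcast = global_attr.expand(num_nodes, -1)  # same global vector for every node
    features = torch.cat([node_emb, neighbor_agg, edge_agg, global_broadcast], dim=-1)
    return update_mlp(features)   # [num_nodes, out_dim]
```

Concatenation is the simplest choice here; projecting each source to a shared space with a linear layer and summing is a common alternative when the dimensions differ.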
Into the Weeds
Other types of graphs (multigraphs, hypergraphs, hypernodes, hierarchical graphs)
Sampling Graphs and Batching in GNNs
After many layers, a single node's embedding may contain information from many other nodes, or even the whole graph, which makes backpropagation very expensive. Therefore we can select some nodes to construct a subgraph of the original graph and do the computation on that subgraph; or construct several subgraphs and update on different subgraphs in mini-batches.
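A minimal sketch of hop-based subgraph sampling (purely illustrative; libraries such as PyTorch Geometric ship their own neighbor samplers):

```python
def sample_subgraph(adjacency_list, seed_nodes, num_hops=1):
    """Keep the seed nodes plus everything within num_hops hops of them,
    and only the edges whose both endpoints are kept."""
    kept = set(seed_nodes)
    for _ in range(num_hops):
        frontier = set()
        for i, j in adjacency_list:
            if i in kept:
                frontier.add(j)
            if j in kept:
                frontier.add(i)
        kept |= frontier
    sub_edges = [(i, j) for i, j in adjacency_list if i in kept and j in kept]
    return sorted(kept), sub_edges
```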
Comparing aggregation operations
Sometimes two different aggregation operations produce exactly the same result on a given neighborhood, so no single aggregation (sum, mean, max) can distinguish every case; each fails on inputs that another can tell apart.
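A small numeric illustration with made-up neighbor values:

```python
neighbors_a = [1.0, 1.0, 1.0, 1.0]
neighbors_b = [1.0]

# max and mean both see these two neighborhoods as identical...
print(max(neighbors_a), max(neighbors_b))                                         # 1.0 1.0
print(sum(neighbors_a) / len(neighbors_a), sum(neighbors_b) / len(neighbors_b))   # 1.0 1.0
# ...but sum tells them apart
print(sum(neighbors_a), sum(neighbors_b))                                         # 4.0 1.0

# Conversely, sum and mean cannot distinguish [0, 2] from [1, 1], but max can
print(sum([0.0, 2.0]), sum([1.0, 1.0]))   # 2.0 2.0
print(max([0.0, 2.0]), max([1.0, 1.0]))   # 2.0 1.0
```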
Graph Attention Networks
Another way of communicating information between graph attributes is via attention.
For example, when we consider the sum-aggregation of a node and its 1-degree neighboring nodes, we could also consider using a weighted sum. The challenge then is to associate weights in a permutation invariant fashion. One approach is to consider a scalar scoring function that assigns weights based on pairs of nodes, f(node_i, node_j). In this case, the scoring function can be interpreted as a function that measures how relevant a neighboring node is in relation to the center node. Weights can be normalized, for example with a softmax function, to focus most of the weight on the neighbor most relevant for a node in relation to a task. This concept is the basis of Graph Attention Networks (GAT) and Set Transformers. Permutation invariance is preserved, because scoring works on pairs of nodes. A common scoring function is the inner product, and nodes are often transformed before scoring into query and key vectors via a linear map to increase the expressivity of the scoring mechanism. Additionally, for interpretability, the scoring weights can be used as a measure of the importance of an edge in relation to a task.
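A minimal sketch of this attention-weighted aggregation, using an inner-product score between query/key projections of node pairs (the projection layers and the neighbor bookkeeping are hypothetical placeholders, e.g. nn.Linear layers and a precomputed neighbor dictionary):

```python
import torch
import torch.nn.functional as F

def attention_aggregate(node_emb, neighbors, query_proj, key_proj):
    """Weighted sum over each node's neighbors, with weights from a softmax
    over inner-product scores between query and key vectors."""
    # node_emb: [num_nodes, dim]; neighbors: dict {node index: list of neighbor indices}
    out = torch.zeros_like(node_emb)
    for i, nbrs in neighbors.items():
        q = query_proj(node_emb[i])          # query vector for the center node
        k = key_proj(node_emb[nbrs])         # key vectors for its neighbors, [deg, d_k]
        scores = k @ q                       # inner-product scores, [deg]
        weights = F.softmax(scores, dim=0)   # normalized attention weights
        out[i] = weights @ node_emb[nbrs]    # permutation-invariant weighted sum
    return out
```

The softmax weights computed here are also what one would inspect to gauge how important each edge is for the task, as noted above.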