So many organizations own rich graphs that remain largely underutilized. GraphBFF shows how to build practical, powerful Graph Foundation Models from these graphs end to end: from data curation and modeling choices to production. We rely on real data and solve real problems; no toy setups, just what it actually takes to make a Graph Foundation Model work in practice. We also present the first neural scaling laws for general graphs 🤩
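As a quick refresher on what a neural scaling law is (my illustration; the functional form and numbers GraphBFF actually reports may differ): one typically fits a saturating power law loss(N) ≈ a·N^(-b) + c to (scale, loss) pairs. A minimal sketch with made-up data points:

```python
# Minimal sketch: fitting a saturating power law L(N) = a * N**-b + c
# to hypothetical (model size, validation loss) pairs.
# All data points below are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

n_params = np.array([1e5, 1e6, 1e7, 1e8])      # model sizes (hypothetical)
val_loss = np.array([1.30, 0.95, 0.72, 0.58])  # observed losses (hypothetical)

(a, b, c), _ = curve_fit(power_law, n_params, val_loss,
                         p0=(1.0, 0.2, 0.3), maxfev=10000)
print(f"fit: loss(N) ~= {a:.3f} * N^(-{b:.3f}) + {c:.3f}")
```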
GraphNews
GraphBench: Next-generation graph learning benchmarking
We present GraphBench, a comprehensive graph learning benchmark spanning multiple domains and prediction regimes. GraphBench standardizes evaluation with consistent splits, metrics, and out-of-distribution checks, and includes a unified hyperparameter tuning framework. We also provide strong baselines with state-of-the-art message-passing and graph transformer models, plus easy plug-and-play code to get you started.
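GraphBench's real interface may look different; purely to illustrate why frozen, shared splits matter for comparability, here is a generic sketch (the file name and arrays are my assumptions, not GraphBench's API):

```python
# Generic sketch of split-standardized evaluation (not GraphBench's API).
# The point: every method sees the *same* frozen train/val/test indices,
# so reported numbers are comparable across papers.
import numpy as np

rng = np.random.default_rng(0)
num_nodes = 1000

# Freeze the splits once (in a real benchmark these ship with the dataset).
perm = rng.permutation(num_nodes)
splits = {"train": perm[:600], "valid": perm[600:800], "test": perm[800:]}
np.savez("splits.npz", **splits)

# Every evaluated model reloads the identical indices.
loaded = np.load("splits.npz")
labels = rng.integers(0, 2, size=num_nodes)  # hypothetical node labels
preds = rng.integers(0, 2, size=num_nodes)   # stand-in model predictions
for name in ("train", "valid", "test"):
    idx = loaded[name]
    acc = (preds[idx] == labels[idx]).mean()
    print(f"{name} accuracy: {acc:.3f}")
```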
One of my biggest contributions to the GraphFrames project is scalable graph embeddings. While not perfect, my implementation is inexpensive to compute and horizontally scalable. It uses a combination of random walks and Hash2Vec, an algorithm based on random projection theory.
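Before the engineering details, here is a tiny single-machine sketch of the core idea as I read it: every vertex gets a deterministic ±1 signature by hashing its id (a random projection), and a vertex's embedding is the average signature of nodes visited on short random walks started from it. This is my illustration, not the actual GraphFrames implementation:

```python
# Single-machine sketch of hash-based random-projection embeddings over
# random walks. An illustration of the idea, not the GraphFrames code.
import hashlib
import random
import numpy as np

DIM = 64

def signature(node_id: str, dim: int = DIM) -> np.ndarray:
    """Deterministic +/-1 random-projection vector derived by hashing the id."""
    seed = int.from_bytes(hashlib.sha256(node_id.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=dim)

def embed(adj: dict, node: str, walks: int = 10, length: int = 4) -> np.ndarray:
    """Average the signatures of nodes visited on short random walks."""
    acc = np.zeros(DIM)
    count = 0
    for _ in range(walks):
        cur = node
        for _ in range(length):
            neighbors = adj.get(cur)
            if not neighbors:
                break
            cur = random.choice(neighbors)
            acc += signature(cur)
            count += 1
    return acc / max(count, 1)

adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(embed(adj, "a")[:8])  # first few embedding coordinates
```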
In the post, I provide the full code and an explanation of all the engineering decisions I made. For example, I explain why I used Reservoir Sampling for neighbor aggregation and why I chose mapPartitions over the DataFrame API.
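For readers who haven't met it, reservoir sampling keeps a uniform fixed-size sample of a stream without knowing its length up front, which is exactly what you need to cap the neighbor lists of high-degree vertices. A textbook Algorithm R sketch, not the code from the PR:

```python
# Textbook reservoir sampling (Algorithm R): keep a uniform sample of k
# items from a stream of unknown length in O(k) memory. Handy for capping
# the neighbor list of high-degree vertices. Not the code from the PR.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```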
The pull request (PR) has not been merged yet, so if you have any ideas on how to improve the approach, I would love to hear them! Overall, it appears to be a good, inexpensive way to create scalable embeddings of graph vertices that can easily be incorporated into existing classification or recommender system pipelines. Finally, GraphFrames will have real capabilities for graph data science! At least, I hope so. :)
Recently, there has been a lot of criticism of popular graph ML benchmark datasets: lack of practical relevance, low structural diversity that leaves most of the possible graph-structure space unrepresented, low application-domain diversity, graph structure that provides no benefit for the considered tasks, and potential bugs in the data collection process. Some of these criticisms have previously appeared on this channel.
To provide the community with better benchmarks, we present GraphLand: a collection of 14 graph datasets for node property prediction coming from diverse real-world industrial applications of graph ML. What makes this benchmark stand out?
Diverse application domains: social networks, web graphs, road networks, and more. Importantly, half of the datasets feature node-level regression tasks, which are currently underrepresented in graph ML benchmarks but often encountered in real-world applications.
Range of sizes: from thousands to millions of nodes, providing opportunities for researchers with different computational resources.
Rich node attributes containing numerical and categorical features — these are more typical of industrial applications than the textual descriptions that are standard in current benchmarks.
Different learning scenarios. For all datasets, we provide two random data splits with low and high label rates. Further, many of our networks evolve over time, and for these we additionally provide more challenging temporal data splits and the opportunity to evaluate models in the inductive setting, where only an early snapshot of the evolving network is available at training time (a minimal example of such a split is sketched after this list).
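To make the temporal and inductive settings concrete, here is a minimal sketch of constructing such a split from node timestamps. The arrays and cut-off years are hypothetical, not GraphLand's actual format:

```python
# Minimal sketch of a temporal node split: train on an early snapshot,
# evaluate on nodes that appear later. Arrays and cut-offs are
# hypothetical, not GraphLand's actual format.
import numpy as np

node_time = np.array([2018, 2018, 2019, 2020, 2021, 2021])  # node arrival year
train_mask = node_time <= 2019  # early snapshot available at train time
test_mask = node_time >= 2021   # future nodes, unseen during training

print("train nodes:", np.where(train_mask)[0])
print("test nodes: ", np.where(test_mask)[0])
# In the inductive setting, edges incident to future nodes are also
# hidden from the model at training time.
```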
We evaluated a range of models on our datasets and found that, while GNNs achieve strong performance on industrial datasets, they can sometimes be rivaled by gradient-boosted decision trees, which are popular in industry, when the latter are provided with additional graph-based input features.
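To illustrate what such a baseline looks like (my sketch, not the paper's exact pipeline): compute simple structural features per node and feed them, alongside the raw attributes, to a gradient-boosted model.

```python
# Sketch of the GBDT-with-graph-features baseline: augment raw node
# attributes with structural features (degree, PageRank) and train a
# gradient-boosted model. An illustration, not the paper's pipeline.
import networkx as nx
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

G = nx.karate_club_graph()
n = G.number_of_nodes()

degree = np.array([G.degree(v) for v in G.nodes()])
pr = nx.pagerank(G)
pagerank = np.array([pr[v] for v in G.nodes()])
raw_attrs = np.random.default_rng(0).normal(size=(n, 8))  # stand-in features

X = np.column_stack([raw_attrs, degree, pagerank])
y = np.array([G.nodes[v]["club"] == "Mr. Hi" for v in G.nodes()])

clf = GradientBoostingClassifier().fit(X[:24], y[:24])
print("held-out accuracy:", clf.score(X[24:], y[24:]))
```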
Further, we evaluated several graph foundation models (GFMs). Despite much attention being paid to GFMs recently, we found that only a few of them can currently handle arbitrary node features (which is required for true generalization across different graphs), and that these GFMs produce very weak results on our benchmark. So the problem of developing general-purpose graph foundation models seemed far from solved, which motivated our research in this direction (see the next post).
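Part of why arbitrary node features are hard: every graph has its own feature dimensionality and semantics. One common trick, shown here purely as an illustration and not as what any specific GFM from the paper does, is to map each graph's native features into a shared fixed-width space, e.g. via a random projection:

```python
# Illustration of one way a model can accept arbitrary node features:
# project each graph's native feature matrix into a shared fixed-width
# space. A sketch, not what any specific GFM in the paper does.
import numpy as np

SHARED_DIM = 32

def to_shared_space(features: np.ndarray, seed: int = 0) -> np.ndarray:
    """Random-project an (n_nodes, d) matrix to (n_nodes, SHARED_DIM)."""
    n, d = features.shape
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(d, SHARED_DIM)) / np.sqrt(SHARED_DIM)
    return features @ proj

graph_a = np.random.default_rng(1).normal(size=(100, 300))  # 300-dim features
graph_b = np.random.default_rng(2).normal(size=(50, 7))     # 7-dim features
print(to_shared_space(graph_a).shape, to_shared_space(graph_b).shape)
```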
In my latest piece for Unite.AI, I dive into:
🔹 Why message passing alone isn't enough
🔹 How Graph Transformers use attention to overcome GNN limitations
🔹 Real-world applications in drug discovery, supply chains, recommender systems, and cybersecurity
🔹 The exciting frontier where LLMs meet graphs