A new notebook exploring Semantic Entity Resolution & Extraction using DSPy and Google's new LangExtract library.
Just released a new notebook exploring Semantic Entity Resolution & Extraction using DSPy (Community) and Google's new LangExtract library.
Inspired by Russell Jurney’s excellent work on semantic entity resolution, this demo follows his approach of combining:
✅ embeddings,
✅ kNN blocking,
✅ and LLM matching with DSPy (Community).
On top of that, I added a general extraction layer to test-drive LangExtract, a Gemini-powered, open-source Python library for reliable structured information extraction. The goal? Detect and merge mentions of the same real-world entities across text.
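The kNN blocking step in that flow can be sketched in a few lines. This is a minimal pure-Python illustration (not code from the notebook), assuming each record already has an embedding vector: only each record's k most similar neighbours become candidate pairs for the LLM matcher, so you avoid comparing every record against every other.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def knn_blocks(embeddings, k=3):
    """For each record, keep only its k most similar neighbours as
    candidate pairs for the LLM matcher (the 'blocking' step)."""
    blocks = []
    for i, u in enumerate(embeddings):
        sims = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        sims.sort(reverse=True)
        blocks.append([j for _, j in sims[:k]])
    return blocks
```

With n records this reduces the LLM matching workload from O(n²) pairs to O(n·k), which is what makes the DSPy matching stage affordable at scale.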
It’s an end-to-end flow tackling one of the most persistent data challenges.
Check it out, experiment with your own data, 𝐞𝐧𝐣𝐨𝐲 𝐭𝐡𝐞 𝐬𝐮𝐦𝐦𝐞𝐫 and let me know your thoughts!
cc Paco Nathan you might like this 😉
https://wor.ai/8kQ2qa

Stop manually building your company's brain. ❌
Having reviewed the excellent DeepLearning.AI lecture on Agentic Knowledge Graph Construction by Andreas Kollegger, and while writing a book on agentic graph systems with Sam Julien, I find it clear that agentic systems represent a shift in how we build and maintain knowledge graphs (KGs).
Most organizations are sitting on a goldmine of data spread across CSVs, documents, and databases.
The dream is to connect it all into a unified Knowledge Graph, an intelligent brain that understands your entire business.
The reality? It's a brutal, expensive, and unscalable manual process.
But a new approach is changing everything.
Here’s the new playbook for building intelligent systems:
🧠 Deploy an AI Agent Workforce
Instead of rigid scripts, you use a cognitive assembly line of specialized AI agents. A Proposer agent designs the data model, a Critic refines it, and an Extractor pulls the facts.
This modular approach is proven to reduce errors and improve the accuracy and coherence of the final graph.
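The cognitive assembly line above can be sketched as a simple control loop. This is an illustrative skeleton only: the callables stand in for LLM-backed agents, and none of the names come from the lecture.

```python
def agentic_schema_loop(proposer, critic, extractor, documents, max_rounds=3):
    """Proposer drafts a data model, Critic reviews it, Extractor applies it.
    The agent callables here are stand-ins for LLM-backed components."""
    schema = proposer(documents)
    for _ in range(max_rounds):
        feedback = critic(schema, documents)
        if feedback is None:  # the Critic is satisfied with the schema
            break
        schema = proposer(documents, feedback=feedback)
    return [extractor(schema, doc) for doc in documents]
```

The point of the modular split is that each agent can be prompted, evaluated, and swapped independently, instead of one monolithic prompt doing design and extraction at once.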
🎨 Treat AI as a Designer, Not Just a Doer
The agents act as data architects. In discovery mode, they analyze unstructured data (like customer reviews) and propose a new logical structure from scratch.
In an enterprise with an existing data model, they switch to alignment mode, mapping new information to the established structure.
🏛️ Use a 3-Part Graph Architecture
This technique is key to managing data quality and uncertainty. You create three interconnected graphs:
The Domain Graph: Your single source of truth, built from trusted, structured data.
The Lexical Graph: The raw, original text from your documents, preserving the evidence.
The Subject Graph: An AI-generated bridge that connects them. It holds extracted insights that are validated before being linked to your trusted data.
Jaro-Winkler is a string comparison algorithm that measures the similarity or edit distance between two strings. It can be used here for entity resolution, the process of identifying and linking entities from the unstructured text (Subject Graph) to the official entities in the structured database (Domain Graph).
For example, the algorithm compares a product name extracted from a customer review (e.g., "the gothenburg table") with the official product names in the database. If the Jaro-Winkler similarity score is above a certain threshold, the system automatically creates a CORRESPONDS_TO relationship, effectively linking the customer's comment to the correct product in the supply chain graph.
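Jaro-Winkler is easy to implement from scratch. Below is a minimal pure-Python version with an illustrative matching threshold (the 0.9 cutoff and the product names are examples, not values from the lecture):

```python
def jaro(s1, s2):
    """Jaro similarity: fraction of matching characters, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for strings sharing a common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Link an extracted mention to an official product name above a threshold
score = jaro_winkler("gothenburg table", "Gothenburg Table".lower())  # identical after normalisation
if score >= 0.9:  # illustrative threshold
    relationship = "CORRESPONDS_TO"
```

In practice you would normalise case and stop words first, then only create the CORRESPONDS_TO edge when the score clears the threshold, keeping borderline matches for human review.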
🤝 Augment Humans, Don't Replace Them
The workflow is Propose, then Approve. AI does the heavy lifting, but a human expert makes the final call.
This process is made reliable by tools like Pydantic and Outlines, which enforce a rigid contract on the AI's output, ensuring every piece of data is perfectly structured and consistent.
And once discovered and validated, a schema can be enforced.
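The "rigid contract" idea can be sketched with Pydantic. The field names below are illustrative, not from the lecture; the key point is that malformed LLM output fails validation instead of silently entering the graph.

```python
from pydantic import BaseModel, ValidationError

class ProductMention(BaseModel):
    """Contract the LLM's output must satisfy before it enters the graph."""
    product_name: str
    sentiment: str
    source_text: str

# A well-formed LLM response parses cleanly...
raw = ('{"product_name": "gothenburg table", "sentiment": "negative", '
       '"source_text": "the gothenburg table wobbles"}')
mention = ProductMention.model_validate_json(raw)

# ...while an incomplete one raises ValidationError instead of corrupting the graph
try:
    ProductMention.model_validate_json('{"product_name": "x"}')
    rejected = False
except ValidationError:
    rejected = True
```

Libraries like Outlines go a step further by constraining generation itself so the model can only emit output matching the schema.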
by J Bittner John Sowa once observed: In logic, the existential quantifier ∃ is a notation for asserting that something exists. But logic itself has no vocabulary for describing the things that exist.
FinReflectKG: Agentic Construction and Evaluation of Financial Knowledge Graphs
Sharing our recent research 𝐅𝐢𝐧𝐑𝐞𝐟𝐥𝐞𝐜𝐭𝐊𝐆: 𝐀𝐠𝐞𝐧𝐭𝐢𝐜 𝐂𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐨𝐟 𝐅𝐢𝐧𝐚𝐧𝐜𝐢𝐚𝐥 𝐊𝐧𝐨𝐰𝐥𝐞𝐝𝐠𝐞 𝐆𝐫𝐚𝐩𝐡𝐬. It is the largest financial knowledge graph built from unstructured data. The preprint of our article is out on arXiv now (link is in the comments). It is coauthored with Abhinav Arun | Fabrizio Dimino | Tejas Prakash Agrawal
While LLMs make it easier than ever to generate knowledge graphs, the real challenge lies in ensuring quality without hallucinations, with strong coverage, precision, comprehensiveness, and relevance. FinReflectKG tackles this through an iterative, evaluation-driven agentic approach, carefully optimized across multiple evaluation metrics to deliver a trustworthy and high-quality knowledge graph.
Designed to power use cases like entity search, question answering, signal generation, predictive modeling, and financial network analysis, FinReflectKG sets a new benchmark for building reliable financial KGs and showcases the potential of agentic workflows in LLM-driven systems.
We will be creating a suite of benchmarks using FinReflectKG for KG-related tasks in financial services. More details to come soon.
barnard59 is a toolkit to automate extract, transform, and load (ETL) tasks. It allows you to generate RDF out of non-RDF data sources.
Reliability in data pipelines depends on knowing what went wrong before your users do. With the new OpenTelemetry integration in our RDF ETL framework barnard59, every pipeline and API integration is now fully traceable!
Errors, validation results and performance metrics are automatically collected and visualised in Grafana. Instead of hunting through logs, you immediately see where time was spent and where an error occurred. This makes RDF-based ETL pipelines far more transparent and easier to operate at scale.
SynaLinks is an open-source framework designed to make it easier to pair language models (LMs) with your graph technologies. Since most companies are not in a position to train their own language models from scratch, SynaLinks empowers you to adapt existing LMs on the market to specialized tasks.
In the history of data standards, a recurring pattern should concern anyone working in semantics today. A new standard emerges, promises interoperability, gains adoption across industries or agencies, and for a time seems to solve the immediate need.
MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains
When AI Diagnoses Patients, Should Reasoning Be a Team Sport?
👉 Why Existing Approaches Fall Short
Medical question answering demands precision, but current AI methods struggle with two key issues:
1. Error Accumulation: Linear reasoning chains (like Chain-of-Thought) risk compounding mistakes—if the first step is wrong, the entire answer falters.
2. Flat Knowledge Retrieval: Traditional retrieval-augmented methods treat medical facts as unrelated text snippets, ignoring complex relationships between symptoms, diseases, and treatments.
This leads to unreliable diagnoses and opaque decision-making—a critical problem when patient outcomes are at stake.
👉 What MIRAGE Does Differently
MIRAGE transforms reasoning from a solo sprint into a coordinated team effort:
- Parallel Detective Work: Instead of one linear chain, multiple specialized "detectives" (reasoning chains) investigate different symptoms or entities in parallel.
- Structured Evidence Hunting: Retrieval operates on medical knowledge graphs, tracing connections between symptoms (e.g., "face pain → lead poisoning") rather than scanning documents.
- Cross-Check Consensus: Answers from parallel chains are verified against each other to resolve contradictions, like clinicians discussing differential diagnoses.
👉 How It Works (Without the Jargon)
1. Break It Down
- Splits complex queries ("Why am I fatigued with knee pain?") into focused sub-questions grounded in specific symptoms/entities.
- Example: "Conditions linked to fatigue" and "Causes of knee lumps" become separate investigation threads.
2. Graph-Guided Retrieval
- Each thread explores a medical knowledge graph like a map:
- Anchor Mode: Examines direct connections (e.g., diseases causing a symptom).
- Bridge Mode: Hunts multi-step relationships (e.g., toxin exposure → neurological symptoms → joint pain).
3. Vote & Verify
- Combines evidence from all threads, prioritizing answers supported by multiple independent chains.
- Discards conflicting hypotheses (e.g., ruling out lupus if only one chain suggests it without corroboration).
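The graph-guided retrieval modes and the consensus step above can be sketched over a toy knowledge graph. All names here are illustrative; this is not the paper's code.

```python
from collections import Counter

def anchor_mode(kg, entity):
    """Anchor Mode: direct (1-hop) connections of an entity in the graph."""
    return set(kg.get(entity, []))

def bridge_mode(kg, entity):
    """Bridge Mode: multi-step (2-hop) relationships via an intermediate node."""
    return {far for near in kg.get(entity, []) for far in kg.get(near, [])}

def vote_and_verify(chain_answers, min_support=2):
    """Keep only hypotheses corroborated by multiple independent chains;
    return the best-supported one, or None if nothing is corroborated."""
    counts = Counter(chain_answers)
    supported = [a for a, c in counts.items() if c >= min_support]
    return max(supported, key=counts.get) if supported else None
```

A hypothesis backed by several independent reasoning chains survives the vote; a hypothesis raised by only one chain (the lone "lupus" suggestion in the post's example) gets discarded.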
👉 Why This Matters
Tested on three medical benchmarks (including real clinician queries), MIRAGE:
- Outperformed GPT-4 and Tree-of-Thought variants in accuracy (84.8% vs. 80.2%)
- Reduced error propagation by 37% compared to linear retrieval-augmented methods
- Produced answers with traceable evidence paths, critical for auditability in healthcare
The Big Picture
MIRAGE shifts AI reasoning from brittle, opaque processes to collaborative, structured exploration. By mirroring how clinicians synthesize information from multiple angles, it highlights a path toward AI systems that are both smarter and more trustworthy in high-stakes domains.
Paper: Wei et al. MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains
𝗛𝗼𝘁 𝘁𝗮𝗸𝗲 𝗼𝗻 𝘁𝗵𝗲 “𝗳𝗮𝘀𝘁𝗲𝗿 𝘁𝗵𝗮𝗻 𝗗𝗶𝗷𝗸𝘀𝘁𝗿𝗮” 𝗵𝗲𝗮𝗱𝗹𝗶𝗻𝗲𝘀:
The recent result in this paper (https://lnkd.in/dQSbqrhD) is a breakthrough for theory. It beats Dijkstra’s classic worst-case bound for single-source shortest paths on directed graphs with non-negative weights. That’s big for the research community.
𝗕𝘂𝘁 𝗶𝘁 𝗱𝗼𝗲𝘀𝗻’𝘁 “𝗿𝗲𝘄𝗿𝗶𝘁𝗲” 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗿𝗼𝘂𝘁𝗶𝗻𝗴.
In practice, large-scale systems (maps, logistics, ride-hailing) moved past plain Dijkstra years ago. They rely on heavy preprocessing. Contraction Hierarchies, Hub Labels and other methods are used to answer point-to-point queries in milliseconds, even on large, continental networks.
𝗪𝗵𝘆 𝘁𝗵𝗲 𝗱𝗶𝘀𝗰𝗼𝗻𝗻𝗲𝗰𝘁?
• Different goals: The paper targets single-source shortest paths; production prioritizes point-to-point queries at interactive latencies.
• Asymptotics vs. constants: Beating O(m + n log n) matters in principle, but real systems live and die by constants, cache behavior, and integration with traffic/turn costs.
• Preprocessing wins: Once you allow preprocessing, the speedups from hierarchical/labeling methods dwarf Dijkstra and likely any drop-in replacement without preprocessing.
We should celebrate the theoretical advance and keep an eye on practical implementations. Just don’t confuse a sorting-barrier result with an immediate upgrade for Google Maps.
𝗕𝗼𝘁𝘁𝗼𝗺 𝗹𝗶𝗻𝗲: Great theory milestone. Production routing already “changed the rules” years ago with preprocessing and smart graph engineering.
4.7 times better write query price-performance with AWS Graviton4 R8g instances using Amazon Neptune v1.4.5 | Amazon Web Services
Amazon Neptune version 1.4.5 introduces engine improvements and support for AWS Graviton-based r8g instances. In this post, we show you how these updates can improve your graph database performance and reduce costs. We walk you through the benchmark results for Gremlin and openCypher comparing Neptune v1.4.5 on r8g instances against previous versions. You'll see performance improvements of up to 4.7x for write throughput and 3.7x for read throughput, along with the cost implications.
Faster than Dijkstra? Tsinghua University’s new shortest path algorithm just rewrote the rules of graph traversal.
🚀 Faster than Dijkstra? Tsinghua University’s new shortest path algorithm just rewrote the rules of graph traversal.
For 65+ years, Dijkstra’s algorithm was the gold standard for finding shortest paths in weighted graphs. But now, a team from Tsinghua University has introduced a recursive partial ordering method that outperforms Dijkstra—especially on directed graphs.
🔍 What’s different?
Instead of sorting all vertices by distance (which adds log-time overhead), this new approach uses a clever recursive structure that breaks the O(m + n log n) barrier ✨.
It’s faster, leaner, and already winning awards at STOC 2025 🏆.
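For context, the log factor being beaten comes from the priority queue in the classic algorithm. A textbook binary-heap Dijkstra looks like this (an illustration of the baseline, not the Tsinghua method):

```python
import heapq

def dijkstra(graph, source):
    """Classic Dijkstra with a binary heap. Each heap pop costs O(log n),
    which is where the n log n term in O(m + n log n) comes from."""
    dist = {source: 0}
    pq = [(0, source)]  # (distance, vertex)
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale entry; a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist
```

The new result avoids globally ordering vertices by distance, which is exactly the work the heap performs here.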
📍 Why it matters:
Think Google Maps, Uber routing, disaster evacuation planning, circuit design—any system that relies on real-time pathfinding across massive graphs.
Paper ➡ https://lnkd.in/dGTdRj2X
#Algorithms #ComputerScience #Engineering #Dijkstra #routing #planning #logistic
Quality metrics: mathematical functions designed to measure the “goodness” of a network visualization
I’m proud to share an exciting piece of work by my PhD student, Simon van Wageningen, whom I have the pleasure of supervising. Simon asked a bold question that challenges the state of the art in our field!
A bit of background first: together with Simon, we study network visualizations — those diagrams made of dots and lines. They’re more than just pretty pictures: they help us gain intuition about the structure of networks around us, such as social networks, protein networks, or even money-laundering networks 😉. But how do we know if a visualization really shows the structure well? That’s where quality metrics come in — mathematical functions designed to measure the “goodness” of a network visualization. Many of these metrics correlate nicely with human intuition. Yet, in our community, there has long been a belief — more of a tacit knowledge — that these metrics fail in certain cases.
This is exactly where Simon’s work comes in: he set out to make this tacit knowledge explicit. Take a look at the dancing man and the network in the slider — they represent the same network with very similar quality metric values. And yet, the dancing man clearly does not show the network's structure. This tells us something important: we can’t blindly rely on quality metrics.
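As one concrete example of such a quality metric, here is a common stress-style formulation comparing layout distances with graph-theoretic distances (a generic sketch, not Simon's code):

```python
from math import dist

def stress(layout, shortest_paths):
    """Stress metric: how faithfully 2-D distances preserve graph distances.
    Lower is better; 0 means perfect preservation."""
    total = 0.0
    nodes = list(layout)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            d_graph = shortest_paths[u][v]       # distance in the network
            d_layout = dist(layout[u], layout[v])  # distance on the canvas
            total += ((d_layout - d_graph) / d_graph) ** 2
    return total
```

The dancing-man example shows why a single number like this can mislead: two drawings can score almost identically while one of them reveals the network's structure and the other hides it.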
Simon’s work will be presented at the International Symposium on Graph Drawing and Network Visualization in Norrköping, Sweden this year. 🎉
If you’d like to dive deeper, here’s the link to the GitHub repository https://lnkd.in/eqw3nYmZ #graphdrawing #networkvisualization #qualitymetrics #research with Simon van Wageningen and Alex Telea
The New Dijkstra’s Algorithm: Shortest Route from Data to Insights (and Action)?
Reforms on the "Shortest Path" Algorithm, Parallels with Modular Data Architectures, and Diving Into Key Components: Product Buckets, Semantic Spine, & Insight Routers
True Cost of Enterprise Knowledge Graph Adoption from PoC to Production | LinkedIn
Enterprise Knowledge Graph costs scale in phases—from a modest $50K–$100K PoC, to a $1M–$3M pilot with infrastructure and dedicated teams, to a $10M–$20M enterprise-wide platform. Reusability reduces costs to ~30% of the original for new domains, with faster delivery and self-sufficiency typically b
Enabling Industrial AI: How Siemens and AIT Leverage TDengine and Ontop to Help TCG UNITECH Boost Productivity and Efficiency
I'm extremely excited to announce that Siemens and AIT Austrian Institute of Technology—two leaders in industrial innovation—chose TDengine as the time-series backbone for a groundbreaking project at TCG Unitech GmbH!
Here’s the magic: Imagine stitching together over a thousand time-series signals per machine with domain knowledge, and connecting it all through an intelligent semantic layer. With TDengine capturing high-frequency sensor data, PostgreSQL holding production context, and Ontopic virtualizing everything into a cohesive knowledge graph—this isn’t just data collection. It’s an orchestration that reveals hidden patterns, powers real-time anomaly and defect detection, supports traceability, and enables explainable root-cause analysis.
And none of this works without good semantics. The system understands the relationships—between sensors, machines, processes, and defects—which means both AI and humans can ask the right questions and get meaningful, actionable answers.
For me, this is the future of smart manufacturing: when data, infrastructure, and domain expertise come together, you get proactive, explainable, and scalable insights that keep factories running at peak performance.
It's a true pleasure working with Stefan B. from Siemens AG Österreich, Stephan Strommer and David Gruber from AIT, Peter Hopfgartner from Ontopic and our friends Klaus Neubauer, Herbert Kerbl, Bernhard Schmiedinger from TCG on this technical blog! We hope this will bring some good insights into how time-series data and semantics can transform the operations of modern manufacturing!
Read the full case study: https://lnkd.in/gtuf8KzU