Let's chat a bit about the use of graph databases in retrieval-augmented generation (RAG). One problem in GenAI is that while LLMs are fed a lot of text during training, a model may never have seen the specific information a user is asking about, which could live in a private corporate document. Since the dawn of GenAI, pipelines have existed to store private documents in a vector database and search it for text relevant to the user's question. This text is then fed to the LLM for use in generating the answer to the user's query.
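To make the pipeline concrete, here is a minimal sketch of the retrieval step. The "embedding" is a toy bag-of-words vector rather than a real neural embedding, and the documents and query are invented examples; a production system would use an embedding model and a vector database instead.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words term-frequency vector.
    # A real pipeline would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=2):
    # Rank stored documents by similarity to the query,
    # as a vector database would.
    qv = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(qv, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "The Q3 budget for the Atlas project was approved in March.",
    "Employee onboarding requires a signed NDA.",
    "Atlas project budget overruns were discussed in the Q3 review.",
]
context = retrieve("What happened with the Atlas budget in Q3?", docs)
# The retrieved chunks are then prepended to the prompt sent to the LLM.
prompt = "Answer using this context:\n" + "\n".join(context)
```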
One problem in such pipelines is that the document search may retrieve a lot of text containing terms similar to those in the user's query yet still irrelevant to answering it. At this point, many folks say, "knowledge graphs to the rescue!" After all, knowledge graphs can store information about the entities mentioned in private documents, so can't they help disambiguate user questions?
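The disambiguation idea can be sketched in a few lines: use the graph to keep only retrieved chunks connected to the entities the query is actually about. The triples, entity names, and chunks below are all invented for illustration, and real entity linking is far harder than substring matching.

```python
# Toy knowledge graph: (subject, relation, object) triples extracted
# from the private corpus. All names here are invented examples.
triples = [
    ("Atlas", "is_a", "infrastructure project"),
    ("Atlas", "led_by", "J. Rivera"),
    ("Orion", "is_a", "marketing campaign"),
]

def neighbors(entity):
    # Entities directly linked to `entity` in the graph.
    out = set()
    for s, _, o in triples:
        if s == entity:
            out.add(o)
        if o == entity:
            out.add(s)
    return out

def filter_chunks(chunks, query_entity):
    # Keep only retrieved chunks mentioning the query entity or one of
    # its graph neighbors, dropping lexically similar but unrelated text.
    related = {query_entity} | neighbors(query_entity)
    return [c for c in chunks if any(e in c for e in related)]

chunks = [
    "Atlas milestones slipped by two weeks.",
    "Orion ad spend doubled last quarter.",      # similar vocabulary, wrong project
    "J. Rivera approved the revised schedule.",  # relevant only via the graph
]
relevant = filter_chunks(chunks, "Atlas")
```

Note that the third chunk never mentions "Atlas" at all; only the graph edge to "J. Rivera" keeps it in the context, which is exactly the kind of win vector similarity alone cannot deliver.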
Graph DBs have been used in RAG for some time now; I started with them in 2021, before ChatGPT existed. There are various problems with using graph data in RAG. First off, the knowledge graphs we are trying to leverage are themselves generated by machine learning. But what guarantees do we have that ML engineers are training their models or agents to produce useful KGs? Are we even using the right kind of statistical learning, never mind agent architectures? After all, if you are going to build a KG from information in natural language, then you are parsing conceptual relations out of natural language, and those relations depend on syntax. So perhaps we should be applying machine learning to the syntactic parsing problem itself, ensuring, for instance, that a relation isn't added to the graph when the syntax expresses the negation of that relation.
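The negation point can be illustrated with a deliberately crude sketch: before admitting a (subject, verb, object) triple, check whether a negation cue precedes the verb in the clause. A real system would consult a dependency parse rather than token positions; the cue list and the example sentences are my own simplifications.

```python
# Negation cues that, appearing before the verb, should block the triple.
NEGATION_CUES = {"not", "never", "no", "n't", "cannot"}

def extract_relation(sentence, subject, verb, obj):
    # Returns the (subject, verb, object) triple, or None if the clause
    # appears to negate the relation. Token-position heuristic only;
    # a dependency parser would do this properly.
    tokens = sentence.lower().replace("n't", " n't").split()
    if verb.lower() not in tokens:
        return None
    verb_pos = tokens.index(verb.lower())
    if any(t in NEGATION_CUES for t in tokens[:verb_pos]):
        return None
    return (subject, verb, obj)

r1 = extract_relation("Rivera approved the budget",
                      "Rivera", "approved", "budget")
r2 = extract_relation("Rivera did not approve the budget",
                      "Rivera", "approve", "budget")
```

Here `r1` yields a triple while `r2` yields nothing, which is the behavior a naive co-occurrence extractor gets wrong: it would happily assert "Rivera approved budget" from the negated sentence.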
To graph data modelers, I maintain again that methods for extracting information from syntax have more bearing on the use of graph data in RAG than existing modeling techniques, which fail to account for natural language syntax, just as most ML inference does. And perhaps graph databases aren't even the right target for storing extracted conceptual relations; I switched to logic databases after a month of working with graphs. The use of KGs and logic bases in RAG needs to be tackled through innovations in syntax parsing, like semantic grammars, and through better techniques for performant inference than graph query, such as GPU-native parallel inference engines. This isn't a problem I expect to be solved through Kaggle competitions or corporate R&D leveraging recently minted ML engineers.
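To give a flavor of what a logic base buys you over plain graph pattern-matching, here is a toy forward-chaining engine: rules derive new facts from stored ones until a fixed point is reached, so the database can answer questions about facts no one ever wrote down. The facts, rule, and names are invented examples, and real engines (Datalog systems, and the GPU-parallel engines mentioned above) are vastly more sophisticated.

```python
# Fact base of (predicate, arg1, arg2) tuples. Invented example data.
facts = {("manages", "rivera", "chen"), ("manages", "chen", "patel")}

def reports_to_rule(facts):
    # manages(X, Y) implies reports_to(Y, X), and reports_to is transitive.
    derived = {("reports_to", y, x) for (p, x, y) in facts if p == "manages"}
    pool = facts | derived
    for (p1, a, b) in pool:
        for (p2, c, d) in pool:
            if p1 == p2 == "reports_to" and b == c and a != d:
                derived.add(("reports_to", a, d))
    return derived

def saturate(facts, rule):
    # Apply the rule until no new facts appear (a fixed point).
    while True:
        new = rule(facts) - facts
        if not new:
            return facts
        facts = facts | new

closure = saturate(facts, reports_to_rule)
```

After saturation, `("reports_to", "patel", "rivera")` is in the closure even though no stored fact links patel to rivera directly; a graph query language can express this particular transitive case, but chaining arbitrary rules like this is where dedicated inference engines pull ahead.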