<aside>
**Post VI of the Causal Discovery Series by https://diksha-shrivastava13.github.io/**
</aside>
The **On the Measure of Intelligence** paper (https://arxiv.org/abs/1911.01547), along with more recent debates, prompts the question: “is memorisation enough to mimic intelligence?” The hybrid vector-graph approaches I detailed in the previous sections are focused solely on memorisation. While new links can be added continuously, let me again use a simple real-life analogy to explain why this might not be enough:
Think back to the time you were in primary school. You were taught a lot of history, political science, literature and of course, mathematics and natural sciences. How much of it do you remember today? I, for one, only remember what made an impression on me, along with the mathematics and science since they’re building blocks, and little else of the literature. And I read a lot. Sometimes I end up reading things again, with a vague, puzzling memory that leaves me wondering why I know them word for word.
We don't consciously remember all the data we've encountered throughout our lives, but we form associations that help us reason about likely outcomes. Consider fire: when we see air shimmering around something hot and notice certain visual characteristics, we recognise these properties and instinctively know to avoid anything else that displays similar traits.
I’ve mentioned Property Graphs and GraphRAG a few times in this very long blog. To give a quick overview: a Property Graph index is a superset of hybrid Knowledge-Graph and Vector-Database approaches, storing entities and relationships as a graph while attaching properties (and embeddings) to them. GraphRAG, meanwhile, depends on query-focused summarisation over detected communities. See my implementation here: https://github.com/diksha-shrivastava13/graph-vector-rag-methods.
GraphRAG Pipeline taken from From Local to Global: A Graph RAG Approach to Query-Focused Summarisation https://arxiv.org/abs/2404.16130.
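The pipeline in the figure can be sketched roughly as: build an entity graph, group entities into communities, summarise each community, and answer queries map-reduce style over those summaries. Below is a minimal illustration of that flow, not the Microsoft implementation — connected components stand in for the Leiden clustering the paper uses, and `summarise` is a placeholder for an LLM call; all data and names are hypothetical.

```python
from collections import defaultdict

def connected_components(edges):
    """Group entities into 'communities' via connected components
    (a crude stand-in for the Leiden clustering used in GraphRAG)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components

def summarise(community, facts):
    # Placeholder for the LLM call that writes a community report.
    return "; ".join(f for f in facts if any(e in f for e in community))

# Hypothetical extracted entity graph and source facts.
edges = [("BMZ", "Report A"), ("Report A", "Budget"), ("Fire", "Heat")]
facts = ["BMZ publishes Report A", "Report A mentions Budget",
         "Fire produces Heat"]

communities = connected_components(edges)
summaries = [summarise(c, facts) for c in communities]
# Query answering would then map-reduce over `summaries`.
```

The key design point the paper makes is that community-level summaries let the system answer global, corpus-wide questions that per-chunk retrieval cannot.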
Property Graphs are the closest proxy to the analogies I talked about. A property graph index takes in your unstructured data and builds a graph in which the entities and their relationships are defined by the LLM. In theory, this makes sense. But as we saw in the experiment above, LLMs are not good at identifying hidden or abstract relationships across data from multiple subsystems unless those relationships are explicitly stated. Bringing back the BMZ use-case from the first two sections: the organisation’s entire need from AI systems was to uncover new insights that are not explicitly stated in the reports. And this is where we get stuck.
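To make the limitation concrete, here is a toy property graph index — my own illustrative sketch, not any particular library’s API, with hypothetical node names and properties. Only triples the LLM explicitly extracted are retrievable; an unstated cross-report link simply does not exist in the index.

```python
from dataclasses import dataclass, field

@dataclass
class PropertyGraph:
    """Toy property-graph index: nodes carry properties,
    edges are labelled (subject, relation, object) triples."""
    nodes: dict = field(default_factory=dict)    # name -> {property: value}
    triples: list = field(default_factory=list)  # (subj, rel, obj)

    def add_node(self, name, **props):
        self.nodes.setdefault(name, {}).update(props)

    def add_triple(self, subj, rel, obj):
        self.triples.append((subj, rel, obj))

    def neighbours(self, name):
        # Retrieval can only traverse edges that were explicitly extracted.
        return [(r, o) for s, r, o in self.triples if s == name]

# Triples an LLM might extract from a single report (hypothetical).
g = PropertyGraph()
g.add_node("Project X", country="Kenya", sector="health")
g.add_node("Budget Cut", year=2023)
g.add_triple("Project X", "affected_by", "Budget Cut")

# The stated relationship is there; the unstated *cause* of the
# budget cut, described in some other report, is not.
links = g.neighbours("Project X")
```

The gap is visible immediately: `links` contains only what was written down, so any insight that spans subsystems has to be inferred by something other than the index itself.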
What is the solution?
LLM-generated knowledge graph built from a private dataset using GPT-4 Turbo. Image taken from the GraphRAG paper.
I have talked about the multiple ways in which GraphRAG and Hybrid Graph-Vector methods can be improved upon in the Swan AI section. The core problem remains identifying the entities and their hidden relationships. Let me take you back to the question with which we started this section: “is memorisation enough to mimic intelligence?” Consider the graph above, and for ease of understanding, let’s assume for a moment that this is how knowledge is represented for a newborn child. As the child adapts to its surroundings and continuously comes across new things, this knowledge graph will keep growing. However, this is not scalable for a system we design, and as we established before, we do not consciously hold a memory of everything in our lives; instead, we learn from relationships.
Here I take inspiration from the LeanAgent paper (https://arxiv.org/abs/2410.06209), which explores a continual learning framework for formal theorem proving, with a dynamic database and the generation of hypotheses at every step of the proof. The paper provides evidence for two important concepts for any reasoning problem in a specific domain:
LeanAgent Framework. Image taken from the LeanAgent Paper.
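The loop in the framework above can be sketched very roughly: order theorems into a curriculum by difficulty, generate candidate proof steps informed by previously proven lemmas, and fold each success back into the dynamic database. This is my simplified reading of the paper, not its implementation — the difficulty proxy, the substring-based “proof search”, and all theorem names are hypothetical placeholders.

```python
def difficulty(theorem: str) -> int:
    # LeanAgent builds a curriculum sorted by difficulty; statement
    # length is a trivial stand-in for its actual difficulty measure.
    return len(theorem)

def generate_hypotheses(state: str, database: list) -> list:
    # Placeholder for the model proposing next proof steps, informed
    # by lemmas already stored in the dynamic database.
    return [lemma for lemma in database if lemma in state]

def continual_proving_loop(theorems: list, database: list) -> list:
    for thm in sorted(theorems, key=difficulty):  # curriculum order
        steps = generate_hypotheses(thm, database)
        proved = bool(steps)                      # stand-in for proof search
        if proved:
            database.append(thm)                  # dynamic database grows
    return database

# Hypothetical run: one known lemma, two candidate theorems.
db = continual_proving_loop(["add_comm_applied", "mystery"], ["add_comm"])
```

The point the sketch preserves is the feedback loop: what gets proven changes what the database contains, which changes what can be hypothesised next — learning relationships rather than memorising every instance.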