<aside>
**Post III of the Causal Discovery Series by https://diksha-shrivastava13.github.io/**
</aside>
The field moves so fast that some of these problems can be solved a lot better now than in June 2024. However, the core problem remains the same: the system's ability to find and take into consideration the relevant sections from multiple reports spread across a complex hierarchy, and to generate the final sections for the overviews. What makes this problem more complex than standard RAG use-cases is:
(i) the critical government use-case where mistakes can be disastrous,
(ii) while I’ve previously designed RAG-based systems for codebases and technical documentation, which require an Abstract Syntax Tree (AST) or a well-designed property graph, this problem involves complex entity relationships which need to be tracked across a five-level hierarchy and maintained over years,
(iii) there’s very little consistency in the structure of these reports, which makes it difficult to dynamically extract all the information correctly even from one report at a time.
Agent-based parsing with multi-hop reasoning and retrieval solves some of these issues. However, even with agent-guided search and retrieval, the application’s functionality is limited by further technical challenges:
To get exact performance from reasoning-based search, retrieval, parsing and generation, the system prompt needs to be detailed and cover the minute specifics of where the information can be found, what it might look like, and what to do with it.
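To make that concrete, here is a hypothetical sketch of what such a system prompt fragment might look like. The section names, report structure and instructions below are illustrative assumptions, not the actual production prompt.

```python
# Hypothetical sketch of how detailed the system prompt has to get.
# Section names, report structure and phrasing are illustrative assumptions,
# not the actual production prompt.
RETRIEVAL_SYSTEM_PROMPT = """
You are assisting with drafting overview sections from programme reports.

Where to look:
- Budget figures usually appear in a table titled 'Financial Overview',
  often on the second or third page of each country report.
- Risk assessments may appear as free text under 'Challenges' or 'Risks';
  the heading is not consistent across reports.

What the information might look like:
- Amounts may be written as 'EUR 1.2m', '1,200,000 EUR', or embedded in prose.

What to do with it:
- Always cite the report title and section the value came from.
- If two reports disagree, surface both values instead of choosing one.
"""
```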
<aside> 🤔
Before going ahead with more Agentic terminology, let me pause here and clarify what I mean by agents.
</aside>
Some time ago, someone asked a community of ML engineers and researchers what they understand by “Agent”, and everyone had widely varying answers. My personal answer is that **an Agent should have agency.** What happened with Character.ai is a very irresponsible use of “Agents”. For context: a kid was talking to an AI-generated character in a system without any guardrail to report suicidal talk or cease the conversation, and he died. This particular “agent” did not have any agency to act. For instance:
- I can call my dog a bird and tell him he can fly to the moon and back.
- This will not make him grow wings.
- I can supply him with gear to mimic a bird’s behaviour (function-calling in agents).
- Turns out he’s sad because birds cannot fly to the moon and back either.
While this is an analogy, it still shows that human error in deciding all the functions associated with an “agent” can lead to results ranging from unpleasant to terrible. It would have been more accurate to call this system of (system prompt + function-calling) a style guide or a machine. A system prompt, on its own, sadly does not have any agency to act. On the other hand, people are designing agents for niche tasks which hypothesise and decide on an action (like LeanAgent: Lifelong Learning for Formal Theorem Proving).
Even though we were limited by the requirement of keeping everything on Azure, which meant we could only use the OpenAI models, I’ve tested the system with the long context provided by Gemini, and it easily breaks. Long context makes it very difficult for the Large Language Model to keep track of all the instructions, as well as the many nodes (~2000) required to answer a single question from multiple reports at once.
I have tried LongContextReorder (Lost in the Middle: How Language Models Use Long Contexts) to handle the problem of rearranging context such that the most relevant data sits at the beginning or the end of the entire context, but there are three more issues with it:
(i) The entire set of ~2000 nodes is relevant and needed for the query,
(ii) This is only useful when your use-case deals with a ranking problem among the nodes.
(iii) There are minute details buried in the long context which might be very important and might be ignored by the LLM.
For this particular use-case, it is only after strict, graph-based ranking that the set of ~2000 nodes has been selected.
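For reference, this is roughly what LongContextReorder does, shown here as a minimal sketch assuming the LlamaIndex node postprocessor; the actual pipeline may wire it up differently.

```python
# Minimal sketch of LongContextReorder, assuming the LlamaIndex node
# postprocessor; the real system may use a different implementation.
from llama_index.core.postprocessor import LongContextReorder
from llama_index.core.schema import NodeWithScore, TextNode

# Hypothetical ranked nodes, standing in for the output of graph-based ranking.
nodes = [
    NodeWithScore(node=TextNode(text=f"Section {i} of some report"), score=1.0 - i * 0.1)
    for i in range(5)
]

# Reorders nodes so the highest-scoring ones sit near the start and end of
# the context window, pushing the least relevant ones towards the middle.
reordered = LongContextReorder().postprocess_nodes(nodes)
print([n.score for n in reordered])
```

Note that this only changes the ordering: when all ~2000 nodes are genuinely needed to answer the query, no reordering removes the underlying capacity problem.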
For more complex pipelines, such as chaining the response of one reasoning-retrieval-reasoning-parsing-generation process into the next such process with a different reasoning guideline and purpose, the response of the previous process adds to the context of the next one.
What results is absolute chaos, which this single reordering function cannot always handle effectively.
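A rough sketch of that chaining pattern is below; the stage names, guidelines and `run_process` function are hypothetical placeholders, not the actual pipeline.

```python
# Rough sketch of chained processes where each stage's response becomes part
# of the next stage's context. All names and guidelines are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ProcessContext:
    guideline: str                                  # reasoning guideline for this pass
    carried_context: list[str] = field(default_factory=list)

def run_process(ctx: ProcessContext, query: str) -> str:
    # Placeholder for one reasoning-retrieval-reasoning-parsing-generation pass.
    # In a real system this would call the LLM with ctx.guideline, the
    # retrieved nodes, and everything already in ctx.carried_context.
    return f"response for '{query}' under guideline '{ctx.guideline}'"

carried: list[str] = []
for guideline in ["extract funding changes", "summarise risks per country"]:
    ctx = ProcessContext(guideline=guideline, carried_context=list(carried))
    response = run_process(ctx, query="Draft the overview section")
    # The previous response is appended, so the context keeps growing
    # with every chained pass.
    carried.append(response)
```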
As the number of reports under consideration increases, the limitations of AI have to be handled by making better product and UI choices.