Cracking the Knowledge Code: Hybrid AI for matching Information Systems and Natural Language

This project explores data-driven automatic programming using graph transformers. Unlike traditional programming, where humans define transformations, this method leverages inductive logic programming to learn from data. By feeding the system input and output graphs, it creates a model (program) that generalizes these transformations, automating various tasks.

The core of the system is the graph transformer, which translates natural language commands (expressed as graphs) into executable programs. This is particularly valuable for complex domains and adaptable to a wide spectrum of business cases.

Here, a text-based user interface bridges the gap between humans and the system. To handle diverse information systems, the framework employs a platform-independent and expressive language inspired by Datalog, Lisp, and Prolog.

A key challenge lies in the implicit knowledge gap: the gap between what’s explicitly stated and what’s needed for understanding. This applies to both user input and legacy information systems, which often lack clear organization. To address this, the framework incorporates large language models and other deep learning tools for very specific tasks, described in the talk. This keeps the handling of text prompts flexible while keeping program generation precise and deterministic when accessing complex data sources.

In essence, this work proposes a framework that bridges the gap between human users and information systems. By combining graph transformers, deep learning, and symbolic processing, it aims to provide a robust and versatile approach to automatic programming, especially for tackling complex tasks that involve natural language and intricate data structures.

Graphs all the way down

Imagine information as a web of interconnected points. Can we capture everything in this way, using graphs where data resides in nodes and connections are represented by edges? This approach seems promising for the kind of information handled by information systems, like tables and programs themselves. Programs, for example, can be viewed as Abstract Syntax Trees (ASTs), a type of graph. Even neural networks – as their name suggests – with their interconnected neurons, are graphs. Perhaps a more intriguing question arises: are there fundamental forms of information that simply cannot be represented as graphs?
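To make the claim about programs concrete, here is a minimal sketch that uses Python's standard ast module to list the parent-to-child edges of a one-line program. The helper name ast_edges and the sample statement are assumptions chosen for illustration; they are not part of the project.

```python
import ast


def ast_edges(source: str) -> list[tuple[str, str]]:
    """Return (parent type, child type) pairs for every edge in the AST."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


# Hypothetical one-line program used as sample input.
edges = ast_edges("total = price * quantity")
for parent, child in edges:
    print(f"{parent} -> {child}")
```

Each printed line is one edge of the tree, so the program is literally available as graph data that another program could read or rewrite.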

Leaving that thought experiment aside, information systems clearly deal with a vast amount of interconnected data. Information processing can then be seen as the act of transforming one graph into another. Consider an e-commerce order. The contents of the customer’s cart become an order, triggering a cascade of actions: emails, notifications, inventory updates, and order management system alerts. Each of these steps takes one graph and produces another.
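The e-commerce example can be sketched as a hand-written graph transformation. Graphs are represented here as sets of (subject, label, object) triples; the edge labels, identifiers, and the function cart_to_order are all hypothetical and only illustrate the idea of mapping one graph onto another.

```python
Triple = tuple[str, str, str]  # (subject, edge label, object)


def cart_to_order(cart: set[Triple], order_id: str) -> set[Triple]:
    """Rewrite a cart graph into an order graph plus its follow-up actions."""
    order: set[Triple] = set()
    for _subject, label, obj in cart:
        if label == "contains":
            # Each cart item becomes an order line and an inventory update.
            order.add((order_id, "has_item", obj))
            order.add((obj, "reserved_in", "inventory"))
        elif label == "owned_by":
            # The cart owner becomes the buyer and is sent a confirmation.
            order.add((order_id, "placed_by", obj))
            order.add((obj, "receives", "confirmation_email"))
    return order


# Hypothetical cart graph.
cart = {
    ("cart42", "owned_by", "alice"),
    ("cart42", "contains", "sku-1"),
    ("cart42", "contains", "sku-2"),
}
order = cart_to_order(cart, "order-1")
```

In this sketch a programmer wrote the rules by hand, which is the traditional route.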

The graph transformer

Traditionally, programmers define transformations that manipulate information stored in interconnected structures. These transformations are the very essence of computer programming. Large language models offer exciting possibilities for automating this process, boosting programmer productivity. However, I’m exploring a different approach: data-driven automatic programming.

We’re familiar with machine learning, a form of inductive learning that identifies patterns between inputs and outputs. Here, though, we’re not concerned with statistical methods but rather a logic-based approach called inductive logic programming. Imagine feeding the system data points with input and output graphs. The goal is for it to create a model (program) that can transform any input graph into the expected output, generalizing as much as possible with minimal data.

With such a tool in hand, what problems could we solve? Imagine a detailed representation of a decentralized software architecture with enough training data. We could automate mitigation plans for security or disruption issues, for example. But let’s get even bolder and consider natural language. With its ambiguities, syntax rules, and exceptions, is there anything more complex?

Here, the system would have a text-based user interface interacting with a symbolic information system on the other end. The core of this system, the graph transformer, takes natural language commands expressed as graphs and translates them into executable programs.

The language designed for this project is both intuitive and effective for programming and data representation. Based on Datalog, it combines aspects of functional programming (like Lisp) and logic programming (like Prolog). This allows data and program control flows to be represented under a single framework, minimizing the cognitive load for users while maintaining high expressiveness for tackling complex computational tasks. This abstract language is platform-independent, meaning it lets you declare just the essence of what you want to achieve, leaving the details of querying the information system aside. The next step is then transforming the abstract program into a specific program for a particular platform (like SQL, SPARQL, etc.) or a more complex program to access remote information.
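The step from an abstract program to a platform-specific one can be sketched as follows. The Query structure below is a hypothetical stand-in for an abstract, platform-independent request and does not reproduce the project's actual language; to_sql is likewise an assumed name for one possible backend, here targeting SQL through Python's built-in sqlite3 module.

```python
import sqlite3
from dataclasses import dataclass

ALLOWED_OPS = {"=", "<", ">", "<=", ">="}


@dataclass(frozen=True)
class Query:
    """Hypothetical abstract query: what is wanted, not how to fetch it."""
    relation: str                               # e.g. "car"
    select: tuple[str, ...]                     # attributes to return
    where: tuple[tuple[str, str, object], ...]  # (attribute, operator, value)


def to_sql(query: Query) -> tuple[str, list[object]]:
    """Compile the abstract query into a parameterized SQL statement.

    Relation and attribute names are interpolated as-is, so this sketch
    assumes they come from a trusted schema, never from user text.
    """
    for _attr, op, _value in query.where:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
    sql = f"SELECT {', '.join(query.select)} FROM {query.relation}"
    if query.where:
        sql += " WHERE " + " AND ".join(f"{a} {op} ?" for a, op, _ in query.where)
    params = [value for _, _, value in query.where]
    return sql, params


# Hypothetical price list stored in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car (model TEXT, price INTEGER)")
conn.executemany("INSERT INTO car VALUES (?, ?)", [("X", 18000), ("Y", 21000)])

sql, params = to_sql(Query("car", ("model",), (("price", "<", 20000),)))
rows = conn.execute(sql, params).fetchall()
```

A second backend, for example one emitting SPARQL, could consume the same Query value, which is the point of keeping the abstract form free of platform details.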

The case of Question/Answering

Let’s consider a question-answering system. Imagine it can access your legacy CRM to provide specific customer management details, or answer questions about cars in your dealership’s price list. Such a system would need to bridge “information gaps” on both sides of the graph transformer interface. On the natural language side, the system would need to understand the user’s intent. On the legacy information service side, it would need to interpret the often-messy data structures.

Implicit information gap

Human communication thrives on shared understanding. When we say “car XYZ was amazing”, we naturally assume the listener knows we’re referring to speed, acceleration, and comfort. But imagine a system lacking this background knowledge. Our messages rely on a significant amount of implicit knowledge, which fills the gap between what’s explicitly stated and what’s needed for comprehension. This challenge applies to both sides of the system.

On the user side, the system needs to interpret the complexities of natural language, which is often ambiguous and filled with implicit meaning. For instance, a question like “a car X cheaper than Y” requires understanding that “cheaper” is a higher-order function: it compares the prices of two entities and selects the one with the lower value. How can such a concept be encoded in a basic question-answering system?
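One possible reading of “cheaper” as a higher-order function is sketched below: a function that takes an attribute name and returns a comparison between two entities. The name comparative, the dictionary layout, and the prices are assumptions for illustration, not the encoding used in the project.

```python
from typing import Callable

Entity = dict[str, object]


def comparative(attribute: str) -> Callable[[Entity, Entity], bool]:
    """Build a 'lower than' relation between entities from an attribute name."""
    def relation(a: Entity, b: Entity) -> bool:
        return a[attribute] < b[attribute]
    return relation


# "cheaper" is the comparative specialized to the price attribute.
cheaper = comparative("price")

# Hypothetical price list.
cars = [
    {"model": "X", "price": 18000},
    {"model": "Y", "price": 21000},
    {"model": "Z", "price": 25000},
]
reference = next(car for car in cars if car["model"] == "Y")

# "a car cheaper than Y": keep the cars for which the relation holds.
answers = [car["model"] for car in cars if cheaper(car, reference)]
```

The same comparative could be specialized to other attributes, which hints at how a single symbolic definition might cover a whole family of comparative adjectives.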

On the data side, the system might struggle with legacy systems, such as databases lacking proper metadata (a semantic map of their own contents). Extracting meaningful answers from such a system is far more difficult than using a knowledge graph, where information is explicitly linked and labeled. Accessing the information itself can be a source of further problems. Legacy systems often require specific protocols or navigation through complex structures. The system needs a way to find the relevant information, even if it’s buried deep within the labyrinth of the data source.

Convergence of heterogeneous systems

For complex scenarios, a cognitive system must address implicit knowledge gaps on both the user input (prompts) and the data source (backend information systems). While the graph transformer is the core of this prototype, it’s just one layer in a larger framework.

The variety of business cases this system can address necessitates a set of supporting systems to mitigate implicit knowledge gaps throughout processing. The framework will incorporate several deep learning models, including large language models, each specializing in specific tasks. However, the symbolic nature of underlying information systems necessitates symbolic processing and learning at the core of the solution. This ensures robustness in handling highly ambiguous and noisy user inputs while maintaining precision and determinism when querying specific data sources.

Giancarlo Frison