Unlocking the Power of Documents with Structure-Aware AI

Anthony

Chief Product Officer - AI Architect

The Untapped Wealth of Documents

Enterprise organizations handle vast volumes of unstructured business documents daily — including scanned forms, emails, legal contracts, medical records, financial statements, and more. It is estimated that over 80% of potentially usable data resides untouched within these text-heavy inputs.

That translates to a staggering amount of operational intelligence locked in formats like PDFs, scanned images, and plain-text files. Valuable details such as transaction dates, customer information, and product shipments lie dormant because systematically extracting and connecting this knowledge is difficult.

Yet, efficiently leveraging documents is critical for accelerating decision-making, improving process efficiency, ensuring regulatory compliance, and gaining competitive advantages. So there exists both a massive need and tremendous opportunity.

Recent Advances Set the Stage

Fortunately, rapid advances in AI over the past two years promise to crack this nut by bringing natural language processing capabilities to document understanding.

Specifically, the emergence of three key techniques has been game-changing:

1. Function Calls: Self-contained modules enabling specialized document processing logic for extraction, validation, transformation, and more.

2. Constraints: Configurable rules that model outputs must comply with, ensuring reliability.

3. Structured Outputs: Presenting model results in standardized formats like JSON to power integrated analytics.

With these tools, language models can now do more than just generate text — they can execute complex document handling workflows in a structured manner.
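As a minimal sketch of how these three pieces fit together (the function name, the regex constraint, and the JSON shape here are all illustrative assumptions, not any particular vendor's API):

```python
import json
import re

# 1. Function call: a self-contained module with specialized extraction logic.
def extract_invoice_date(text: str) -> dict:
    """Illustrative extractor: find the first MM/DD/YYYY date in free text."""
    match = re.search(r"\b(\d{2}/\d{2}/\d{4})\b", text)
    return {"field": "invoice_date", "value": match.group(1) if match else None}

# 2. Constraint: a rule the output must satisfy before downstream use.
def satisfies_constraint(result: dict) -> bool:
    value = result.get("value")
    return value is not None and re.fullmatch(r"\d{2}/\d{2}/\d{4}", value) is not None

# 3. Structured output: results serialized to a standard format (JSON).
document = "Invoice issued on 03/15/2024 to Acme Corp."
result = extract_invoice_date(document)
assert satisfies_constraint(result)
print(json.dumps(result))  # {"field": "invoice_date", "value": "03/15/2024"}
```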

The Coming of Structure-Aware AI Systems

By composing language models augmented with document functions, constrained by validation rules, and outputting normalized data structures — a new paradigm is fast emerging:

Structure-aware AI systems designed specifically to extract intelligence from unstructured document inputs and connect insights across massive corpora via structured representations.

These systems promise to automate document understanding by mimicking human-level comprehension, provide integrated analytics by linking siloed data, streamline manual business processes through contextual automation, and enhance productivity exponentially.

The raw material exists in enterprises waiting to be tapped. The technology has arrived to process it at scale. The potential of structure-aware AI promises to unleash tremendous efficiencies for organizations wise enough to seize the opportunity.

Augmenting Foundation Models for Documents

Both DocLLM and DocGraphLM build on top of advanced neural language models like BERT and GPT-3 to leverage their rich linguistic knowledge and text analysis capabilities.

However, they go beyond plain language models by augmenting the underlying architecture to handle document visual structure and semantics as well.

DocLLM — Spatial Layout Encoding

DocLLM introduces a disentangled spatial attention mechanism on top of the standard transformer self-attention layers used in language models.

This parallel stream allows the model to analyze positional and proximity relationships between text blocks and visual elements like tables or figures.

So the transformer encoder captures textual semantics, while the spatial encoder models the document layout structure.

This provides a unified framework handling both modalities — enabling stronger context modeling during document understanding tasks.
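A toy numerical sketch of the disentangled idea (not DocLLM's actual implementation; the embeddings and page coordinates below are invented): textual-similarity scores and spatial-proximity scores are computed in parallel streams and only combined before normalization.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy inputs: per-block text embeddings and page-layout centers (x, y).
text_emb = {"total_label": [1.0, 0.0], "amount_cell": [0.9, 0.1], "footer": [0.0, 1.0]}
centers  = {"total_label": (10, 50), "amount_cell": (60, 50), "footer": (10, 400)}

def attention_from(query: str) -> dict:
    keys = list(text_emb)
    # Stream 1: textual semantics (dot products between embeddings).
    text_scores = [sum(a * b for a, b in zip(text_emb[query], text_emb[k])) for k in keys]
    # Stream 2: spatial layout (closer blocks on the page score higher).
    spatial_scores = [-math.dist(centers[query], centers[k]) / 100 for k in keys]
    # Disentangled combination: the two streams are summed, then normalized.
    return dict(zip(keys, softmax([t + s for t, s in zip(text_scores, spatial_scores)])))

weights = attention_from("total_label")
# The nearby, semantically similar "amount_cell" outweighs the distant "footer".
assert weights["amount_cell"] > weights["footer"]
```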

DocGraphLM — Graphical Structure

Unlike DocLLM’s transformer approach, DocGraphLM uses graph neural networks to model documents.

The key idea is to create graph representations where nodes correspond to textual segments, and edges capture spatial/visual connections between elements.

This explicitly embeds the document structure within the neural architecture itself.

The model then combines the graphical semantics with pretrained language model node embeddings to unite both textual and structural understanding.

So in summary, DocLLM takes a transformer route with spatial augmentations, while DocGraphLM uses graphical networks — but both effectively equip foundation models with document layout comprehension capabilities.

Function Calls Enable Specialized Document Tasks

Documents present AI systems with significant challenges compared to plain text or structured data. They contain valuable information encoded in domain-specific layouts, free-flowing language, and contextual connections across evidence.

To reliably extract intelligence, systems need to handle specialized logic like deciphering complex table structures, linking entities across multiple sources, or transforming unstructured text into normalized formats.

Composable Functions as Custom Modules

This requirement motivates the incorporation of composable functions — self-contained subroutines executing particular document processing tasks:

Named Entity Recognition for identifying relevant names, dates, locations

Data Extraction based on visual cues like coordinates and proximity

Table Parsing to decode row/column relationships

Document Matching using similarity metrics to surface connections
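A sketch of what such composable functions might look like, with toy regex-based stand-ins for the model-backed logic (all names hypothetical); note how each emits plain structured data that the next function can consume:

```python
import re

# Hypothetical composable functions; each consumes and emits plain dict/list
# structures so they can be chained in whatever order a prompt requests.

def recognize_entities(text: str) -> list[dict]:
    """Toy NER: dates plus capitalized company names."""
    dates = [{"label": "DATE", "text": d} for d in re.findall(r"\d{2}/\d{2}/\d{4}", text)]
    orgs = [{"label": "ORG", "text": n} for n in re.findall(r"\b([A-Z]\w+ (?:Corp|Inc|LLC))\b", text)]
    return dates + orgs

def parse_table(lines: list[str]) -> list[dict]:
    """Toy table parser: pipe-delimited rows into column records."""
    header = [h.strip() for h in lines[0].split("|")]
    return [dict(zip(header, (c.strip() for c in row.split("|")))) for row in lines[1:]]

# Downstream code consumes the structured outputs directly.
entities = recognize_entities("Contract signed 01/02/2024 by Acme Corp.")
rows = parse_table(["item | qty", "widget | 4", "gear | 7"])
assert {"label": "ORG", "text": "Acme Corp"} in entities
assert rows[1] == {"item": "gear", "qty": "7"}
```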

Prompting Function Execution

Rather than hard coding workflows, prompts invoke desired sequences of functions required for a particular document understanding process:

Extract specified entities from these contracts using your best entity recognition functions.

Parse the tabular data on these pages into a normalized format with relevant table functions.

Structured Outputs Connect Tasks

Functions produce structured outputs — consistent data schemas that downstream functions can further process:

Entities formatted as JSON objects with labels, text snippets

Tables represented as row/column records

This pipes data effectively between subtasks, unlike fragmented text.

Composable functions allow customizing sophisticated document handling logic — facilitating robust information extraction, relationship detection and knowledge consolidation from unstructured sources.

Constraints and Rules Guide Valid Outputs

Techniques like DSPy Assertions allow imposing validation rules, data constraints, and trust conditions that model outputs must comply with. These computational constraints act like integrity checks in a database, ensuring reliable downstream consumption.

By embedding such constraints and rerunning models whenever outputs fail their checks, robust structure-aware systems emerge.

Enforcing Reliability via Declarative Rules

Robust enterprise-grade systems require guarantees that model outputs meet certain quality and integrity bars before operationalization.

Documents handled at scale introduce volatility from language ambiguity, complex layouts, and domain diversity. So solely relying on the natural strengths of language models is insufficient.

Additional mechanisms to validate outputs and enforce constraints are imperative.

Constraints as Computational Guardrails

Constraint rules address this requirement for reliability by acting as guardrails restricting model outputs to expected formats:

Date values standardized to MM/DD/YYYY

Addresses fully parsed into fields

Entities labeled with known categories like Person, Company

Defined upfront, these rules operationally enforce reliability — much like database constraints preserving referential integrity.
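One minimal way to express such guardrails is a table of declarative predicates checked before outputs are released downstream; the field names and rules below are illustrative, not a standard schema:

```python
import re

# Hypothetical declarative constraints: each maps a field to a predicate the
# model output must satisfy, analogous to database integrity checks.
CONSTRAINTS = {
    "date":     lambda v: re.fullmatch(r"\d{2}/\d{2}/\d{4}", v) is not None,
    "category": lambda v: v in {"Person", "Company", "Location"},
    "address":  lambda v: {"street", "city", "zip"} <= set(v),  # fully parsed fields
}

def violations(record: dict) -> list[str]:
    """Return the names of all fields that fail their constraint."""
    return [f for f, check in CONSTRAINTS.items() if f in record and not check(record[f])]

good = {"date": "03/15/2024", "category": "Company",
        "address": {"street": "1 Main St", "city": "Springfield", "zip": "01101"}}
bad = {"date": "March 15th", "category": "Company"}
assert violations(good) == []
assert violations(bad) == ["date"]
```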

Retries and Self-Correction

When a constraint is violated, execution reruns with feedback about the failure until the output conforms or a retry limit triggers an exception. This feedback loop lets the system self-correct.
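The retry pattern can be sketched as a small wrapper; `make_flaky_model` is a stand-in that simulates an LLM which only produces a conforming date on its third attempt:

```python
import re

# Minimal sketch of the rerun-until-valid pattern around a model call.

def make_flaky_model():
    attempts = {"n": 0}
    def model(prompt: str, feedback=None) -> str:
        attempts["n"] += 1
        return "03/15/2024" if attempts["n"] >= 3 else "March 15th"
    return model

def run_with_retries(model, prompt, is_valid, max_retries=5):
    feedback = None
    for _ in range(max_retries):
        output = model(prompt, feedback)
        if is_valid(output):
            return output
        # Feed the violation back so the model can self-correct on the next pass.
        feedback = f"Output {output!r} violated the format constraint; retry."
    raise ValueError("constraint could not be satisfied within the retry budget")

result = run_with_retries(make_flaky_model(), "Extract the date.",
                          lambda v: re.fullmatch(r"\d{2}/\d{2}/\d{4}", v) is not None)
assert result == "03/15/2024"
```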

Structuring Disorderly Documents

Together, constraints enable dependably structuring disorderly documents into orderly knowledge: extracting entities consistently, relating evidence logically, and detecting discrepancies accurately.

They form the foundational building blocks facilitating enterprise integration of AI document understanding with constrained outputs reliably fueling downstream data consumers.

Structured Representations Connect Insights

Unlocking Integrated Analytics via Structured Outputs

Successfully extracting intelligence from individual documents is foundational. But the broader potential lies in consolidating insights across massive corpora — connecting the dots, detecting patterns, and answering nuanced questions.

Documents represent fragmented islands of information. Deriving aggregated understanding requires building connected knowledge structures.

Standardized Structures Bridge Documents

This motivates outputting extractions in standardized structures — consistent schemas like JSON defining typed fields, taxonomies, and relationships.

Unlike disjointed text, structured outputs enable consolidation:

Linked entities use shared identifier labels

Objects present attributes in predictable locations

Nesting connects hierarchical information

Output uniformity drives integration scalability.
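A sketch of the consolidation this uniformity enables: because both documents below emit the same schema with shared identifier labels, merging their extractions reduces to a simple index build (the file names, IDs, and fields are invented):

```python
# Two documents emitting the same standardized structure.
doc_a = {"doc": "contract_17.pdf",
         "entities": [{"id": "org:acme", "type": "Company", "name": "Acme Corp"}]}
doc_b = {"doc": "invoice_04.pdf",
         "entities": [{"id": "org:acme", "type": "Company", "name": "Acme Corp"},
                      {"id": "per:jdoe", "type": "Person", "name": "J. Doe"}]}

def consolidate(docs: list) -> dict:
    """Merge extractions across documents by shared identifier labels."""
    index = {}
    for d in docs:
        for e in d["entities"]:
            index.setdefault(e["id"], {"entity": e, "mentions": []})["mentions"].append(d["doc"])
    return index

merged = consolidate([doc_a, doc_b])
# The shared "org:acme" identifier links the two previously siloed files.
assert merged["org:acme"]["mentions"] == ["contract_17.pdf", "invoice_04.pdf"]
```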

Knowledge Graphs Manifest Connections

Building on unified outputs, knowledge graphs explicitly model documents as networks of interrelated entities, events, and concepts.

Sophisticated graph algorithms can then analyze these connections to uncover insights like:

Cycles indicating contract renewal misalignments

Subgraphs representing fraud linkages

Centralities exposing compliance risks
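For instance, cycle detection over a toy contract-dependency graph takes only a few lines (the graph and its business interpretation are invented for illustration):

```python
# Toy knowledge graph as an adjacency list; a depth-first search detects cycles
# of the kind mentioned above (renewal dependencies that loop back on themselves).
graph = {
    "ContractA": ["ContractB"],   # A's renewal depends on B...
    "ContractB": ["ContractC"],
    "ContractC": ["ContractA"],   # ...which loops back to A: a misalignment.
    "ContractD": [],
}

def has_cycle(adj: dict) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / in progress / finished
    color = {n: WHITE for n in adj}
    def dfs(n):
        color[n] = GRAY
        for m in adj.get(n, []):
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and dfs(n) for n in adj)

assert has_cycle(graph)
assert not has_cycle({"A": ["B"], "B": []})
```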

Augmenting Humans with Connected Intelligence

Structured outputs unlock a profound vision — augmenting business professionals via instantly accessible information mapped across massive document collections.

Integrated knowledge surfaces contextual insights from across siloed sources, finally turning documents into assets rather than liabilities.

Enforcing Validity in LLM Planning with Formal-LLM

Large language models (LLMs) have shown tremendous capability in automatically generating multi-step plans for complex tasks. However, these models can sometimes produce invalid or non-executable action sequences due to their blackbox nature.

To address this, the proposed Formal-LLM framework integrates the precision of formal languages with the expressiveness of natural language to restrict LLM-based agents to valid plan spaces.

Representing Requirements as Automata

The key idea is to let users formulate planning requirements and constraints as a pushdown automaton (PDA), a computational model well suited to expressing context-free languages.

The PDA components include:

States denoting possible configurations

Transitions mapping actions between states

A stack to track execution history

Constraining the LLM to take only transitions permitted by the PDA bounds plan generation within the space of valid possibilities.

Prompting Under PDA Guidance

At each step, the framework prompts the LLM to choose an action based on:

The current PDA state

Feasible transitions from this state

Past plan execution history on the stack

By prompting dynamically per automaton guidance, the model is steered towards guaranteed valid plans.
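A heavily simplified sketch of this loop, reducing the PDA to its finite-state core (a full PDA would also consult the stack when deciding transitions); the transition table and the stub standing in for the LLM are assumptions for illustration:

```python
# Hypothetical automaton: states, and the valid actions leading between them.
PDA = {
    "start":     {"load_document": "loaded"},
    "loaded":    {"extract_entities": "extracted", "parse_tables": "extracted"},
    "extracted": {"emit_json": "done"},
    "done":      {},
}

def feasible_actions(state: str) -> list:
    return list(PDA[state])

def llm_choose(state, feasible, history):
    # Stand-in for prompting the LLM with the current state, the feasible
    # transitions, and the execution history; here it just picks the first option.
    return feasible[0]

def plan(max_steps: int = 10) -> list:
    state, history = "start", []
    while state != "done" and len(history) < max_steps:
        options = feasible_actions(state)
        action = llm_choose(state, options, history)
        assert action in options   # the model can never leave the valid plan space
        history.append(action)
        state = PDA[state][action]
    return history

assert plan() == ["load_document", "extract_entities", "emit_json"]
```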

Optimization with Reinforcement Learning

Valid plans may still be suboptimal. Further fine-tuning using reinforcement learning from task feedback allows improving plan quality while retaining validity.

Benefits Over Baselines

Experiments demonstrate Formal-LLM:

Provides 100% validity versus 76% for the next best method

Boosts executable plan generation rates substantially

Makes LLM-based planning more controllable

By accounting for the dual aspects of expressiveness and precision, the framework significantly advances the reliability of LLM agents for multi-step reasoning tasks.

Automating Enterprise Insights via Connected Document Workflows

While individual structure-aware AI capabilities offer promise, collectively they possess immense potential for unlocking process efficiencies across massive document collections.

By coordinating functions, constraints and structures, robust systems can automate end-to-end workflows — ingesting contracts, extracting intelligence, detecting patterns, and triggering downstream decisions.

Ingestion and Extraction

Specialized ingestion functions like form parsers and table detectors intake diverse document types into a standard format. Chained extraction functions then identify salient entities, relationships, and metadata via constraints. This facilitates systematically indexing varied corpora.

Knowledge Consolidation

Normalized structured outputs load into a consolidated knowledge graph encoding people, locations, transactions, durations, and hierarchies across sources.

This manifests contextual connections otherwise trapped in silos.

Analysis and Insights

With unified integrated data, graph algorithms surface network patterns, clusters, and anomalies — answering sophisticated business queries.

Dedicated agents continually mine documents to expand and refine the systemic intelligence.

Decision Workflows

Finally, the system triggers notifications, recommendations, transactions, and record updates per the document insights via API integrations.

This automates downstream processes like financial notices, supplier alerts, customer targeting, and more based on contractual developments.
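The four stages can be strung together in miniature; every function below is a trivial hypothetical stand-in for a model-backed module:

```python
import re

def ingest(raw_docs: dict) -> list:
    """1. Ingestion: normalize diverse inputs into a standard record format."""
    return [{"doc": name, "text": text} for name, text in raw_docs.items()]

def extract(records: list) -> list:
    """2. Extraction: identify salient entities per document (toy regex NER)."""
    for r in records:
        r["orgs"] = re.findall(r"\b[A-Z]\w+ Corp\b", r["text"])
    return records

def consolidate(records: list) -> dict:
    """3. Consolidation: build a cross-document index of mentions."""
    index = {}
    for r in records:
        for org in r["orgs"]:
            index.setdefault(org, []).append(r["doc"])
    return index

def decide(index: dict) -> list:
    """4. Decision: trigger downstream actions from the connected insights."""
    return [f"alert: {org} appears in {len(docs)} documents"
            for org, docs in index.items() if len(docs) > 1]

raw = {"a.pdf": "Supply deal with Acme Corp.", "b.pdf": "Acme Corp invoice overdue."}
alerts = decide(consolidate(extract(ingest(raw))))
assert alerts == ["alert: Acme Corp appears in 2 documents"]
```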

In totality, orchestrating robust ingestion, reliable extraction, connected analytics and automated decisions unlocks tremendous latent efficiencies at enterprise scale.