Introducing Assertions in DSPy: Self-Refining LLM Behavior

4 minute read


DSPy Logo

Here is a video of my talk to the Stanford NLP Group on ‘Self-Refinement in DSPy’:

The evolution of prompting Large Language Models (LLMs) has become a focal point of the ML/AI world. With every interaction, we aim to fine-tune it to best produce responses. However, this can often be a rather manual, often inefficient task. Instead, we can take advantage of the growing capability of these models to introspectively improve their responses with new information and refine their outputs. This self-refinement naturally points towards the direction of building autonomous and context-aware frameworks, reducing the dependency on manual model tuning.

As covered in an earlier post, DSPy is an emerging framework for solving advanced natural language tasks with language models and retrieval models. Through providing a unique, programmable interface, DSPy proposes a paradigm shift in replacing brittle “prompt engineering” techniques with a modular approach.

Despite its programmatic versatility of prompting models, DSPy lacks an apt way to express computational constraints. To accomplish this, my research introduces an innovative approach: integrating assertions in DSPy. Drawing from Pythonic assert statements, these assertions are designed to refine control over LLM behavior, enabling them to guide the behavior of LLM calls during compiling and inference for greater program behavior control. Assertions interact with LLMs under the hood to provide hints for compiling (e.g., add permanent guidelines in the instructions to avoid common failures), for accurate inference (e.g., facilitate backtracking and self-correcting calls to LMs), and for debugging (i.e., dual-use of the error messages for programmer’s benefit).

We introduce two key constructs to impose these computational constraints: Assertions and Suggestions (soft-Assertions).

  • Assertions: Serve as hard-enforced checkpoints during the pipeline, halting execution when conditions are not met.
  • Suggestions: Offer guidance to the LLMs, allowing for the pipelines to receive feedback on errors during the paragraphs and respectively refine the prompt.

A key component of this approach is the autonomous handling of occurrences where the suggestion is not met. This happens through the Retry module, which serves as a wrapper around the current program and enables the pipeline to backtrack and re-attempt execution. Essentially, on each Retry pass, the module first determines if the program’s generation passes the specified constraint through a validation function. If this validation check fails, the Retry module captures all of the past outputs of the wrapped module’s generations and feedback from the instruction message and passes this forward for the re-attempt pass. This dynamic modification of the LLM inputs enables self-awareness within the program’s prompt generations, ensuring each retry is informed by the history of its past attempts and thereby, allowing for the LLM to make more informed decisions over subsequent iterations.

Integrating assertions within DSPy programs serves to enhance the performance of downstream tasks in two evaluation components: Intrinsic Quality and Extrinsic Quality.

  • Intrinsic Quality: Emphasizes how closely the outputs adhere to the defined assertions in the program. This can be particularly expressed as fundamental benchmarks to the system’s performance on passing intended functionality validation checks.

  • Extrinsic Quality: Focuses on how the assertions serve as guiding principles that will shape performance on broader downstream metrics. The assertions in essence serve as proxies to more comprehensive objectives of the program, providing “hints” on the impact of integrating assertions and leading to improvements in task-specific metrics

We put assertions to the test in two question-answering tasks.

  1. Multi-Hop QA:
    • generating brief answers to natural-language questions from the HotPotQA dataset
    • we include constraints on refining search queries to improve retrieval effectiveness (intrinsic) and answer precision (extrinsic)
  2. LongForm Multi-Hop QA:
    • generating long-form paragraphs that cite retrieved references to answer questions in the HotPotQA dataset
    • we include constraints to verify citation faithfulness (intrinsic) along with citation recall, precision and overall answer accuracy (extrinsic)

Our paper details the extensive methodology and evaluation results of including assertions on these QA tasks.

LLM prompt sensitivity continues to impact growing prompting techniques. As we continue to expand DSPy with features like assertions to improve performance, we aim to push the boundaries of LLMs in innovative ways that reduce manual prompting. We envision the growth of DSPy as a framework that revolutionizes traditional methods and encourages autonomous, programmatic workflows, emphasizing an optimal approach to LLM prompting.

Further Reading

To dive deeper into DSPy, refer to my blog post, our paper, GitHub repository, and Twitter thread