Using grammars to constrain llama.cpp output

Context-free grammars have increased the accuracy of my large language model-based biomedical data extraction pipeline.

The llama.cpp project, a high-performance library for running LLMs locally on CPUs, GPUs, and Apple’s Metal graphics platform (e.g., M1, M2), has recently added support for grammars to guide and constrain the output of the LLM.

A grammar is a notation that describes the valid syntax of text.

The GGML grammar notation (GBNF) is documented in the llama.cpp repository, which also includes example grammars for generic JSON, the C programming language, and chess moves.
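To give a feel for the notation, here is a minimal GBNF grammar of my own (not one of the bundled examples) that restricts the model's entire output to a single yes/no answer:

```
root ::= "yes" | "no"
```

Every GBNF grammar starts from the root rule; literals are quoted, and | separates alternatives.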

I have gotten pretty good at crafting my grammars by hand, but there are tools that can help you get started.

For Python usage, this capability was exposed in the llama-cpp-python project starting in version 0.1.78.

To use it, construct a LlamaGrammar object and pass it to your Llama instance at inference time:

from llama_cpp.llama import Llama, LlamaGrammar

grammar = LlamaGrammar.from_string(grammar_text)  # grammar_text holds GBNF source
llm = Llama(model_path)
response = llm(prompt, grammar=grammar)
print(response["choices"][0]["text"])  # the constrained completion

LlamaGrammar also has a from_file helper function.

Grammars work by guiding and constraining the LLM as it is predicting the next token.
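To make that concrete, here is a toy illustration of the idea (not llama.cpp's actual implementation): at each decoding step, tokens that would violate the grammar are masked out before the highest-scoring remaining token is chosen. The "grammar" here is just a set of valid outputs, and the scores stand in for model logits.

```python
# Toy sketch of grammar-constrained greedy decoding.
# The "grammar" admits exactly these outputs; the vocabulary is single characters.
VALID_OUTPUTS = {"yes", "no"}
VOCAB = ["y", "e", "s", "n", "o", "x"]

def allowed_tokens(prefix):
    """Tokens that keep the prefix a valid start of some grammatical output."""
    return [t for t in VOCAB
            if any(v.startswith(prefix + t) for v in VALID_OUTPUTS)]

def constrained_decode(scores, prefix=""):
    """Greedily pick the best-scoring token among those the grammar allows.
    `scores` maps token -> score (a stand-in for the model's real logits)."""
    while prefix not in VALID_OUTPUTS:
        candidates = allowed_tokens(prefix)
        prefix += max(candidates, key=lambda t: scores.get(t, 0.0))
    return prefix

# Even though the raw scores favor "x", the grammar forces a valid answer.
scores = {"x": 9.0, "n": 2.0, "y": 1.0, "e": 1.0, "s": 1.0, "o": 1.0}
print(constrained_decode(scores))  # prints "no"
```

The real implementation works on the model's token vocabulary and a full GBNF parser state, but the principle is the same: the sampler only ever sees tokens the grammar permits.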

This feature eliminates the challenges of trying to force the model to generate well-formed JSON via prompt engineering or by post-processing the response text.
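As a sketch of what this looks like in practice, a GBNF grammar along these lines (the "answer" field name is illustrative, not from my pipeline) constrains the output to a single-field JSON object:

```
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 .,]* "\""
ws     ::= [ \t\n]*
```

Any completion the model produces under this grammar parses as JSON by construction.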

In addition to guaranteeing well-formed output, the overall quality and accuracy of the underlying response "logic" improve as well. The grammar acts like the guardrails in bowling, which not only prevent gutter balls (i.e., malformed JSON) but also increase the likelihood of a strike (i.e., the correct answer).
