
Using Promptfoo with cognee

Promptfoo is a tool for testing and optimizing prompt-based AI systems by comparing outputs against expected results. It uses assertions to validate LLM outputs in several ways, including the following (illustrated right after the list):

  • Equality Checks: Verifies if the LLM output matches the expected value exactly.
  • JSON Structure Validation: Ensures the output adheres to a valid JSON format.
  • Similarity Measures: Assesses how closely the LLM output resembles the expected result.
  • Custom Functions: Allows users to define specific validation logic using Python or JavaScript functions.
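
As an illustration, here is a minimal sketch of how those assertion types can appear in a single promptfoo test case, written as the kind of Python dictionary that later gets dumped to the YAML config. The concrete values, the similarity threshold, and the python expression are made up for the example.

# Illustrative test case combining the assertion types above; values are hypothetical.
example_test = {
    "vars": {"question": "What is cognee?"},
    "assert": [
        {"type": "equals", "value": "cognee is a memory layer for AI agents"},  # exact equality check
        {"type": "is-json"},  # output must be valid JSON
        {"type": "similar", "value": "cognee builds knowledge graphs from your data", "threshold": 0.8},  # similarity measure
        {"type": "python", "value": "len(output) > 0"},  # custom validation logic
    ],
}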

Promptfoo supports Model-Graded Metrics, which use a language model to grade outputs against a specified rubric. For instance, the llm-rubric assertion asks a model to evaluate the output against defined criteria and returns a score reflecting how well the output meets them. Cognee’s evaluation framework defines the LLM-as-a-judge metrics described in Microsoft’s GraphRAG paper and exposes them as promptfoo metric options:

  • Comprehensiveness: How much detail does the answer provide to cover all aspects and details of the question?
  • Diversity: How varied and rich is the answer in providing different perspectives and insights on the question?
  • Empowerment: How well does the answer help the reader understand and make informed judgements about the topic?
  • Directness: How specifically and clearly does the answer address the question?

We also provide correctness as an additional custom metric option.

  • Correctness: Is the actual output factually correct based on the expected output?
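
Under the hood, each selected metric becomes an llm-rubric assertion in the generated promptfoo config (this is what step 1 below does). A single entry looks roughly like the sketch here; the rubric wording is illustrative rather than cognee's exact judge prompt, and the dotted name follows the naming convention assumed in step 1.

# Roughly what one generated llm-rubric assertion looks like (rubric text is illustrative).
comprehensiveness_assert = {
    "type": "llm-rubric",
    "name": "promptfoo.comprehensiveness",
    "value": "Grade how much detail the answer provides to cover all aspects of the question.",
}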

Using Promptfoo with Cognee

Imagine you’re testing the Cognify pipeline to ensure its answers align with what users expect. Here’s how Promptfoo fits in:

  1. Define Your Metrics. Load the promptfoo config template and add an llm-rubric assertion for each metric you want to grade. (A sketch of the metric-name check assumed here follows the snippet.)
import os
import yaml

# Map each selected metric to its LLM-judge rubric prompt.
# Metric names are assumed to be dotted, e.g. "promptfoo.correctness".
prompts = {}
for metric_name in metric_name_list:
    if is_valid_promptfoo_metric(metric_name):
        prompts[metric_name] = llm_judge_prompts[metric_name.split(".")[1]]

# Load the promptfoo config template and attach one llm-rubric assertion per metric.
with open(os.path.join(os.getcwd(), "evals/promptfoo_config_template.yaml"), "r") as file:
    config = yaml.safe_load(file)

config["defaultTest"] = {
    "assert": [
        {"type": "llm-rubric", "value": prompt, "name": metric_name}
        for metric_name, prompt in prompts.items()
    ]
}
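
The helpers above (metric_name_list, llm_judge_prompts, is_valid_promptfoo_metric) come from cognee's evaluation code. As a hypothetical sketch of the naming convention the snippet assumes, a validity check could look like this:

# Hypothetical sketch: accept only names of the form "promptfoo.<metric>"
# for which a judge prompt is registered. Not cognee's actual implementation.
def is_valid_promptfoo_metric(metric_name: str) -> bool:
    parts = metric_name.split(".")
    return len(parts) == 2 and parts[0] == "promptfoo" and parts[1] in llm_judge_prompts
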
  2. Define Your Input. Create a list of questions and generate context for each of them with cognee, then store the questions and context as test cases in the config. Depending on your chosen metrics, the expected answer can be included as well. Finally, write out the updated config file. (A sketch of the context-generation helper follows the snippet.)
# Build one promptfoo test case per question, with cognee-generated context.
tests = []
for instance in instances:
    context = await get_context_with_cognee(instance)
    test = {
        "vars": {
            "name": instance["question"][:15],
            "question": instance["question"],
            "context": context,
        }
    }
    tests.append(test)
config["tests"] = tests

# Write the completed config (questions + context) to disk for promptfoo.
updated_yaml_file_path = os.path.join(os.getcwd(), "config_with_context.yaml")
with open(updated_yaml_file_path, "w") as file:
    yaml.dump(config, file)
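
get_context_with_cognee is the piece that actually runs the Cognify pipeline for a question. A rough sketch of such a helper, using cognee's add / cognify / search API, is shown below; the exact search signature and the dataset fields vary across cognee versions and datasets, so treat this as an outline rather than the implementation used in cognee's evals.

import cognee

# Rough sketch, not cognee's actual eval helper: ingest the instance's source
# text, build the knowledge graph, then retrieve context for the question.
async def get_context_with_cognee(instance):
    await cognee.add(instance["context"])  # assumes the dataset provides raw source text
    await cognee.cognify()
    search_results = await cognee.search(query_text=instance["question"])
    return "\n".join(str(result) for result in search_results)
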
  3. Run Promptfoo Tests. Feed the questions and the Cognify pipeline’s outputs into Promptfoo; it runs the LLM judge or computes your chosen metrics.

import shutil

# PromptfooWrapper (part of cognee's evaluation code) shells out to the promptfoo CLI.
promptfoo_path = shutil.which("promptfoo")
wrapper = PromptfooWrapper(promptfoo_path=promptfoo_path)
results = wrapper.run_eval(
    prompt_file=os.path.join(os.getcwd(), "evals/promptfooprompt.json"),
    config_file=os.path.join(os.getcwd(), "config_with_context.yaml"),
    out_format="json",
)
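
The shape of results depends on the promptfoo version and on how PromptfooWrapper parses the JSON output, so a practical first step is simply to dump it and look at the per-assertion scores before deciding what to refine:

import json

# Inspect the raw evaluation output; the exact schema depends on the promptfoo
# version and on cognee's wrapper, so start by looking at it directly.
print(json.dumps(results, indent=2, default=str))
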
  4. Refine Prompts. Based on the results, tweak your prompts. Maybe the pipeline misunderstood the question, or perhaps the expected answer was too rigid. Promptfoo helps you find the right balance.

Join the Conversation!

Join our community now to connect with professionals, share insights, and get your questions answered!