Evaluation

The plugin should improve answers where prediction feedback matters and stay quiet where it does not.

Run the bundled comparison:

python scripts/evaluate_responses.py
python scripts/evaluate_responses.py --format json

The harness compares evals/baseline-responses.jsonl and evals/guarded-responses.jsonl against evals/prompts.jsonl.
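
In spirit, the harness joins the three files on a shared case id and scores each baseline/guarded pair. The sketch below is illustrative only: the "id" field and the file-joining shape are assumptions, not the actual fixture schema.

import json

def load_jsonl(path):
    # One JSON object per line, keyed on a hypothetical "id" field.
    with open(path) as f:
        return {rec["id"]: rec for rec in (json.loads(line) for line in f if line.strip())}

prompts = load_jsonl("evals/prompts.jsonl")
baseline = load_jsonl("evals/baseline-responses.jsonl")
guarded = load_jsonl("evals/guarded-responses.jsonl")

# Each joined triple is then scored with the metrics listed below.
cases = [(prompts[i], baseline[i], guarded[i]) for i in prompts]
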

Metrics

activation_precision  answer uses CHC when the fixture expects CHC, or avoids it when not expected
activation_noise      answer forces CHC onto prompts where it should stay quiet
boundary_accuracy     answer separates causal_paradox, valid_acyclic, and unproved correctly
overclaim_rate        answer claims CHC solves the halting problem or arbitrary termination
answer_usefulness     answer activates correctly, draws the correct boundary, and avoids overclaiming
token_overhead        guarded token count minus baseline token count

These are operational metrics. They do not prove the formal theory. They measure whether the plugin improves answer shape.
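
As a rough sketch of how the per-case checks combine (names here are illustrative, not the harness's actual internals): answer_usefulness is the conjunction of the other three checks, and token_overhead is a plain difference.

def answer_usefulness(activated_correctly, boundary_correct, overclaims):
    # Useful only if activation, boundary, and overclaim checks all pass.
    return activated_correctly and boundary_correct and not overclaims

def token_overhead(guarded_tokens, baseline_tokens):
    # Positive values mean the guarded answer spent more tokens.
    return guarded_tokens - baseline_tokens
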

DesignIR Corpus Evaluation

The repository also includes deterministic corpus checks:

python scripts/chc_eval_design_ir.py examples/design-ir-corpus
chc eval evals/v4 --format json

Each corpus case is deliberately split into three files:

description.md               natural-language prompt for humans and LLM extraction tests
expected.design-ir.json      the structured DesignIR interpretation of that prompt
expected.analysis.json       the expected deterministic classification

The evaluator does not parse description.md. It validates the expected DesignIR artifacts and verifies that the deterministic analyzer returns the expected classification. This keeps the rule intact:

natural language -> LLM writes DesignIR
DesignIR -> scripts verify structure
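
A minimal sketch of that verification step, assuming a hypothetical analyze() entry point and a "classification" field (both illustrative; the real artifact schema may differ):

import json
from pathlib import Path

def check_case(case_dir: Path, analyze) -> bool:
    # description.md is never read; only the structured artifacts matter here.
    design_ir = json.loads((case_dir / "expected.design-ir.json").read_text())
    expected = json.loads((case_dir / "expected.analysis.json").read_text())
    actual = analyze(design_ir)  # deterministic analyzer, no LLM in the loop
    return actual["classification"] == expected["classification"]
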

Response-quality evaluation and corpus evaluation answer different questions:

response eval: did the assistant use the CHC lens well?
corpus eval: does explicit DesignIR classify correctly?

V4 Corpus

V4 adds a 100-case structured corpus under evals/v4:

40 causal_paradox
40 valid_acyclic
20 insufficient_info

The corpus includes safe non-problem examples such as ordinary loops, future-run retries, logging-only observations, post-run scoring, and local progress counters.

chc eval reports total cases, passed/failed counts, classification accuracy, coverage, and false-positive/false-negative categories. It still does not parse natural language; description.md files are human context and extraction fixtures only.
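
One plausible way those summary numbers relate, sketched under assumed definitions (the actual report may compute coverage and accuracy differently):

def summarize(results):
    # results: list of (expected_label, predicted_label), with None when a case is not covered.
    total = len(results)
    covered = [(exp, pred) for exp, pred in results if pred is not None]
    passed = sum(1 for exp, pred in covered if exp == pred)
    return {
        "total_cases": total,
        "passed": passed,
        "failed": total - passed,
        "classification_accuracy": passed / total if total else 0.0,
        "coverage": len(covered) / total if total else 0.0,
    }
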

Current Sample Result

The bundled sample responses are illustrative and deterministic. They show the intended delta:

baseline: generic engineering advice
guarded: causal graph diagnosis and safer design boundary

Current output:

case_count: 10
baseline_activation_precision: 0.300
guarded_activation_precision: 1.000
baseline_activation_noise: 0.000
guarded_activation_noise: 0.000
baseline_boundary_accuracy: 0.300
guarded_boundary_accuracy: 1.000
baseline_overclaim_rate: 0.000
guarded_overclaim_rate: 0.000
baseline_answer_usefulness: 0.300
guarded_answer_usefulness: 1.000
average_token_overhead: 14.600

Expected qualitative improvement:

generic timeout/retry advice
-> prediction-feedback diagnosis

"hard because simulation is long"
-> valid_acyclic but unproved

"looks like a loop"
-> symbolic causal_paradox with unifier