Scalable Summary Evaluation with LLMs

This UC San Diego capstone project builds a scalable system to generate and evaluate summaries, and to measure how well automated “LLM-as-judge” evaluations align with human judgments of quality. It is motivated by the need for reliable, repeatable evaluation: summaries are used to communicate information quickly, but their quality can vary widely across content, style, and constraints.
The team develops an end-to-end pipeline that ingests source material, segments it into coherent units, and produces candidate summaries under consistent constraints (for example, length and structure). This setup makes it possible to compare approaches and isolate which changes improve or degrade quality.
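As a rough sketch of that flow, the snippet below ingests a document, splits it into paragraph-level units, and requests summaries under a fixed length and bullet-point structure. The call_llm helper, the paragraph-based segmentation rule, and the 60-word limit are illustrative assumptions, not the project's actual components.

```python
# Minimal sketch of the ingest -> segment -> summarize flow, under assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    doc_id: str
    index: int
    text: str

def segment_document(doc_id: str, text: str) -> List[Segment]:
    """Split a document into coherent units (here: non-empty paragraphs)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [Segment(doc_id, i, p) for i, p in enumerate(paragraphs)]

def summarize_segments(
    segments: List[Segment],
    call_llm: Callable[[str], str],   # placeholder for the actual model client
    max_words: int = 60,              # illustrative length constraint
) -> List[dict]:
    """Produce candidate summaries under consistent length/structure constraints."""
    results = []
    for seg in segments:
        prompt = (
            f"Summarize the passage below in at most {max_words} words, "
            "as 2-3 bullet points.\n\n" + seg.text
        )
        results.append(
            {"doc_id": seg.doc_id, "segment": seg.index, "summary": call_llm(prompt)}
        )
    return results

if __name__ == "__main__":
    fake_llm = lambda prompt: "- placeholder summary"  # stand-in for a real model call
    segs = segment_document("demo", "First paragraph.\n\nSecond paragraph.")
    print(summarize_segments(segs, fake_llm))
```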
To score summary quality, the project uses a rubric-driven evaluator model to rate criteria such as coverage, faithfulness, organization, and clarity. These scores are paired with lightweight automatic checks that flag common failure modes like missing key points, unsupported claims, and constraint violations.
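A minimal sketch of how such a rubric-driven judge and the accompanying checks could be wired up is below. The rubric wording, the JSON score format, and the specific heuristics are assumptions for illustration, with call_llm again standing in for the judge model.

```python
# Sketch of a rubric-driven judge plus lightweight automatic checks (assumed design).
import json
from typing import Callable, Dict, List

RUBRIC_PROMPT = """Rate the summary against the source on a 1-5 scale for each
criterion: coverage, faithfulness, organization, clarity.
Respond with JSON only, e.g. {{"coverage": 4, "faithfulness": 5, "organization": 4, "clarity": 5}}.

Source:
{source}

Summary:
{summary}"""

def judge_summary(source: str, summary: str,
                  call_llm: Callable[[str], str]) -> Dict[str, int]:
    """Ask the judge model for rubric scores and parse its JSON reply."""
    raw = call_llm(RUBRIC_PROMPT.format(source=source, summary=summary))
    return json.loads(raw)

def automatic_checks(source: str, summary: str, max_words: int = 60) -> List[str]:
    """Cheap heuristics that flag common failure modes alongside the rubric scores."""
    flags = []
    if len(summary.split()) > max_words:
        flags.append("constraint_violation:length")
    # Rough unsupported-claim signal: longer words in the summary that never
    # appear in the source. A real pipeline would use something sturdier.
    novel = {w for w in summary.lower().split() if len(w) > 6} - set(source.lower().split())
    if len(novel) > 10:
        flags.append("possible_unsupported_claims")
    return flags
```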
To improve reliability, the evaluation can use multiple judge passes and aggregation to reduce variance and mitigate ordering or phrasing effects. The scoring outputs are also structured to support debugging, so teams can trace low scores back to specific issues rather than only getting a single number.
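The sketch below shows one way repeated judge passes and aggregation could look, again under assumptions: three passes, median aggregation per criterion, and a judge callable shaped like the one above. Keeping the per-pass scores in the output is what makes the traceability described above possible.

```python
# Sketch of variance reduction via repeated judge passes and aggregation (assumed setup).
import statistics
from typing import Callable, Dict, List

def aggregate_judgments(
    source: str,
    summary: str,
    judge: Callable[[str, str], Dict[str, int]],
    passes: int = 3,                      # illustrative number of judge passes
) -> Dict[str, object]:
    """Run the judge several times and aggregate per-criterion scores."""
    runs: List[Dict[str, int]] = [judge(source, summary) for _ in range(passes)]
    aggregated = {c: statistics.median(run[c] for run in runs) for c in runs[0]}
    # Keep the per-pass scores in the output so a low aggregate can be traced
    # back to specific disagreements instead of a single opaque number.
    return {"scores": aggregated, "passes": runs}
```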
This work will be continued into the spring quarter by expanding the benchmark size, adding systematic human ratings to calibrate and validate automated scores, and testing improvement strategies (for example, critique-guided rewriting) to quantify whether the pipeline can reliably raise summary quality across diverse inputs.
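One way the planned calibration against human ratings could be quantified is with a rank correlation between automated and human scores. The use of the Spearman statistic and the toy numbers below are illustrative assumptions, not results or methods confirmed by the project.

```python
# Sketch of judge-vs-human agreement via rank correlation (illustrative only).
from scipy.stats import spearmanr

def judge_human_agreement(judge_scores: list, human_scores: list) -> float:
    """Rank correlation between automated judge scores and human ratings."""
    rho, _pvalue = spearmanr(judge_scores, human_scores)
    return rho

# Made-up numbers: a higher rho means the judge ranks summaries
# more like the human raters do.
print(judge_human_agreement([3, 4, 5, 2, 4], [3, 5, 5, 2, 3]))
```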
Stay Connected
Follow our journey on Medium and LinkedIn.
