Scalable Summary Evaluation with LLMs

This UC San Diego capstone project builds a scalable system to generate and evaluate summaries, and to measure how well automated “LLM-as-judge” evaluations align with human judgments of quality. It is motivated by the need for reliable, repeatable evaluation: summaries are used to communicate information quickly, but their quality can vary widely across content, style, and constraints.
The team develops an end-to-end pipeline that ingests source material, segments it into coherent units, and produces candidate summaries under consistent constraints (for example, length and structure). This setup makes it possible to compare approaches and isolate which changes improve or degrade quality.
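As a rough sketch of that flow, the snippet below ingests a document, splits it into paragraph-level units, and requests summaries under a fixed length and bullet-point structure. The call_llm helper, the paragraph-based segmentation rule, and the 60-word limit are illustrative assumptions, not the project's actual components.

```python
# Minimal sketch of the ingest -> segment -> summarize flow, under assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    doc_id: str
    index: int
    text: str

def segment_document(doc_id: str, text: str) -> List[Segment]:
    """Split a document into coherent units (here: non-empty paragraphs)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [Segment(doc_id, i, p) for i, p in enumerate(paragraphs)]

def summarize_segments(
    segments: List[Segment],
    call_llm: Callable[[str], str],   # placeholder for the actual model client
    max_words: int = 60,              # illustrative length constraint
) -> List[dict]:
    """Produce candidate summaries under consistent length/structure constraints."""
    results = []
    for seg in segments:
        prompt = (
            f"Summarize the passage below in at most {max_words} words, "
            "as 2-3 bullet points.\n\n" + seg.text
        )
        results.append(
            {"doc_id": seg.doc_id, "segment": seg.index, "summary": call_llm(prompt)}
        )
    return results

if __name__ == "__main__":
    fake_llm = lambda prompt: "- placeholder summary"  # stand-in for a real model call
    segs = segment_document("demo", "First paragraph.\n\nSecond paragraph.")
    print(summarize_segments(segs, fake_llm))
```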
To score summary quality, the project uses a rubric-driven evaluator model to rate criteria such as coverage, faithfulness, organization, and clarity. These scores are paired with lightweight automatic checks that flag common failure modes like missing key points, unsupported claims, and constraint violations.
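A minimal sketch of how such a rubric-driven judge and the accompanying checks could be wired up is below. The rubric wording, the JSON score format, and the specific heuristics are assumptions for illustration, with call_llm again standing in for the judge model.

```python
# Sketch of a rubric-driven judge plus lightweight automatic checks (assumed design).
import json
from typing import Callable, Dict, List

RUBRIC_PROMPT = """Rate the summary against the source on a 1-5 scale for each
criterion: coverage, faithfulness, organization, clarity.
Respond with JSON only, e.g. {{"coverage": 4, "faithfulness": 5, "organization": 4, "clarity": 5}}.

Source:
{source}

Summary:
{summary}"""

def judge_summary(source: str, summary: str,
                  call_llm: Callable[[str], str]) -> Dict[str, int]:
    """Ask the judge model for rubric scores and parse its JSON reply."""
    raw = call_llm(RUBRIC_PROMPT.format(source=source, summary=summary))
    return json.loads(raw)

def automatic_checks(source: str, summary: str, max_words: int = 60) -> List[str]:
    """Cheap heuristics that flag common failure modes alongside the rubric scores."""
    flags = []
    if len(summary.split()) > max_words:
        flags.append("constraint_violation:length")
    # Rough unsupported-claim signal: longer words in the summary that never
    # appear in the source. A real pipeline would use something sturdier.
    novel = {w for w in summary.lower().split() if len(w) > 6} - set(source.lower().split())
    if len(novel) > 10:
        flags.append("possible_unsupported_claims")
    return flags
```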
To improve reliability, the evaluation can use multiple judge passes and aggregation to reduce variance and mitigate ordering or phrasing effects. The scoring outputs are also structured to support debugging, so teams can trace low scores back to specific issues rather than only getting a single number.
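The sketch below shows one way repeated judge passes and aggregation could look, again under assumptions: three passes, median aggregation per criterion, and a judge callable shaped like the one above. Keeping the per-pass scores in the output is what makes the traceability described above possible.

```python
# Sketch of variance reduction via repeated judge passes and aggregation (assumed setup).
import statistics
from typing import Callable, Dict, List

def aggregate_judgments(
    source: str,
    summary: str,
    judge: Callable[[str, str], Dict[str, int]],
    passes: int = 3,                      # illustrative number of judge passes
) -> Dict[str, object]:
    """Run the judge several times and aggregate per-criterion scores."""
    runs: List[Dict[str, int]] = [judge(source, summary) for _ in range(passes)]
    aggregated = {c: statistics.median(run[c] for run in runs) for c in runs[0]}
    # Keep the per-pass scores in the output so a low aggregate can be traced
    # back to specific disagreements instead of a single opaque number.
    return {"scores": aggregated, "passes": runs}
```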
This work will be continued into the spring quarter by expanding the benchmark size, adding systematic human ratings to calibrate and validate automated scores, and testing improvement strategies (for example, critique-guided rewriting) to quantify whether the pipeline can reliably raise summary quality across diverse inputs.
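One way the planned calibration against human ratings could be quantified is with a rank correlation between automated and human scores. The use of the Spearman statistic and the toy numbers below are illustrative assumptions, not results or methods confirmed by the project.

```python
# Sketch of judge-vs-human agreement via rank correlation (illustrative only).
from scipy.stats import spearmanr

def judge_human_agreement(judge_scores: list, human_scores: list) -> float:
    """Rank correlation between automated judge scores and human ratings."""
    rho, _pvalue = spearmanr(judge_scores, human_scores)
    return rho

# Made-up numbers: a higher rho means the judge ranks summaries
# more like the human raters do.
print(judge_human_agreement([3, 4, 5, 2, 4], [3, 5, 5, 2, 3]))
```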
Stay Connected
Follow our journey on Medium and LinkedIn.
