Dataflow architect

This capstone project from the University of Colorado Boulder tackles a common problem: organizations have data but don’t know how to use it. The team developed a tool that uses large language models (LLMs) to generate data analysis reports directly from uploaded CSV files. Users—especially non-technical domain experts—get instant suggestions for how their data could be used, cleaned, and analyzed, without needing to code or consult a data scientist upfront.
At its core, the system accepts various data inputs and summarizes their structure and quality. Instead of analyzing the raw data directly, it sends summary statistics (like data types and missing values) to the LLM to reduce privacy risks and avoid misleading computations. The LLM then returns a structured, readable report that outlines potential data science projects, from basic exploratory data analysis to machine learning applications.
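To make that privacy boundary concrete, here is a minimal sketch of the profiling step, assuming pandas; the function name and the exact statistics collected are illustrative rather than the team's published code:

```python
import pandas as pd

def summarize_csv(path_or_buffer) -> dict:
    """Profile a CSV at the schema level; no raw rows leave this function."""
    df = pd.read_csv(path_or_buffer)
    return {
        "n_rows": int(len(df)),
        "n_columns": int(df.shape[1]),
        "columns": {
            col: {
                "dtype": str(df[col].dtype),
                "missing_pct": round(float(df[col].isna().mean()) * 100, 2),
                "n_unique": int(df[col].nunique()),
            }
            for col in df.columns
        },
    }
```

Only a dictionary like this, serialized into the prompt, ever reaches the model; individual rows never appear in the prompt.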
The interface, built with Streamlit, lets users choose their expertise level (from beginner to expert) and pick a report format, either comprehensive or step-by-step. Simple controls handle uploading files, viewing the generated report, and giving feedback with thumbs-up/down reactions. Reports can be exported as PDFs for sharing with teams or stakeholders.
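A stripped-down version of that interface might look like the following Streamlit sketch; widget labels and options are illustrative, and a placeholder stands in for the LLM call:

```python
import pandas as pd
import streamlit as st

st.title("Dataflow Architect")

# Controls described above; labels and options are illustrative.
expertise = st.selectbox("Expertise level", ["Beginner", "Intermediate", "Expert"])
report_style = st.radio("Report format", ["Comprehensive", "Step-by-step"])
uploaded = st.file_uploader("Upload a CSV file", type="csv")

if uploaded is not None:
    df = pd.read_csv(uploaded)
    # In the real app, the schema summary plus these settings would be sent
    # to the LLM; here a placeholder stands in for the generated report.
    st.markdown(
        f"*{report_style} report ({expertise} level) for "
        f"{df.shape[0]} rows x {df.shape[1]} columns would render here.*"
    )

    # Thumbs-up/down feedback, as described above.
    up, down = st.columns(2)
    if up.button("👍"):
        st.success("Thanks for the feedback!")
    if down.button("👎"):
        st.success("Noted, thanks for the feedback.")
```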
The team focused heavily on prompt engineering, testing various open-source LLMs on Groq, and later transitioning to OpenAI’s GPT-4o mini model for better reliability and formatting. Reports were first prototyped in Markdown and iteratively improved by comparing them to a 25-page gold-standard benchmark built for one of the datasets. Based on these evaluations, the team refined prompts and optimized for clarity, brevity, and relevance.
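Once the prompt is settled, the model call itself is straightforward. Here is a sketch using the OpenAI Python SDK (v1); the system prompt wording and the generate_report function are assumptions for illustration, not the team's exact prompts:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_report(summary: dict, expertise: str, style: str) -> str:
    """Ask GPT-4o mini for a Markdown report from schema-level statistics."""
    system = (
        "You are a data science consultant. Given schema-level summary "
        "statistics of a dataset (never raw rows), suggest data cleaning "
        "steps, exploratory analyses, and candidate machine learning "
        f"projects. Write in Markdown for a {expertise}-level reader, "
        f"in a {style} format. Do not fabricate values you cannot see."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Dataset summary:\n{summary}"},
        ],
    )
    return response.choices[0].message.content
```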
To ensure scalability and reproducibility, the application was containerized using Docker. This makes it easy to deploy across systems and avoid dependency issues. Performance and cost trade-offs were carefully tracked for each model, including metrics like generation time, output token count, and overall report quality. The conclusion: newer compact models can offer competitive performance with fewer resources.
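Those trade-off metrics are cheap to capture at the call site. A sketch of how generation time and token counts could be recorded follows; the token fields come from the OpenAI SDK's usage object, while report quality itself was judged against the gold-standard benchmark:

```python
import time
from openai import OpenAI

client = OpenAI()

def timed_report(messages: list[dict], model: str = "gpt-4o-mini") -> dict:
    """Run one generation and record per-model cost/performance metrics."""
    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages)
    return {
        "model": model,
        "generation_seconds": round(time.perf_counter() - start, 2),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "report": response.choices[0].message.content,
    }
```

Swapping the model argument, or pointing the client at a different OpenAI-compatible endpoint, makes it easy to compare GPT-4o mini against the open-source models served on Groq under identical prompts.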
This project shows how AI can support—not replace—human analysts. By automating the tedious first steps of data exploration and making insights easier to interpret, the tool empowers users to ask better questions and work more efficiently with data teams. While LLMs still struggle with precise computation, their ability to contextualize and explain data unlocks real value for organizations that want to make smarter, faster decisions.
Stay Connected
Follow our journey on Medium and LinkedIn.