Giovanni Ciatto
Dipartimento di Informatica — Scienza e Ingegneria (DISI), Sede di Cesena,
Alma Mater Studiorum—Università di Bologna
(version: 2026-04-01 )
Motivation and Context
MLOps with MLflow
End-to-end example for classification
End-to-end example for LLM agents
Training a model from data, in order to:
In statistics (and machine learning) a model is a mathematical representation of a real-world process
(commonly attained by fitting a parametric function over a sample of data describing the process)

e.g.: $f(x) = \beta_0 + \beta_1 x $ where $f$ is the amount of minutes played, and $x$ is the age
E.g. neural networks (NN) are a popular family of models
Single neuron
(Feed-forward)
Neural network $\equiv$ cascade of layers
Many admissible architectures, serving disparate purposes
A software module (e.g. a Python object) implementing a mathematical function…
predict(input_data) -> output_data… commonly tailored on a specific data schema
… which works sufficiently well w.r.t. test data
… which must commonly be integrated into a much larger software system
… which may need to be re-trained upon data changes.
The process of producing a ML model is not linear nor simple:

These steps are cyclical, not linear → one often revisits data, retrain, or refine features.
Forecast footfall/visits to some office by day/time

✅ Interleave code, textual description, and visualizations
✅ Interactive usage, allowing for real-time feedback and adjustments
✅ Uniform & easy interface to workstations
✅ Easy to save, restore, and share
❌ Incentivises manual activities over automatic ones
final_v3.ipynbNo structured process $\implies$ ML projects may fail to move from notebooks to real-world use
The practice of organizing and automating the end-to-end process of building, training, deploying, and maintaining machine-learning models
MLOps adds infrastructure + processes + automation to make each step more reliable:
“send me your notebook by email”Engineering prompts, tools, vector stores, and agents to constrain and govern the behavior of pre-trained (foundation) models, in order to:
write an evaluation paragraph for each student's work)find the parties involved in this contract)plan a vacation itinerary based on user preferences)Pre-trained foundation models (PFM): large neural-networks trained on massive datasets to learn general skills (e.g. ‘understanding’ and generating text, images, code)
Prompts: carefully crafted textual inputs that guide some PFM to produce desired outputs
Write a summary of the following article: {article_text}Tools: external software components (e.g. APIs, databases, search engines) that can be invoked by PFMs to perform specific tasks or retrieve information
Vector stores: specialized databases that store and retrieve high-dimensional vectors (embeddings) for the sake of information retrieval via similarity search
Agents: software systems that orchestrate the interaction between PFMs and tools, enabling dynamic decision-making and task execution based on the context and user input
FM are commonly not produced in-house, but rather accessed via APIs… yet the choice of what model(s) to use is crucial
A set of prompt templates (text files, or code snippets) that are known to work well for the tasks at hand
A set of tool servers implementing the MCP protocol so that tools can be invoked by PFMs
A set of agents, implementing the logic to orchestrate the interaction between PFMs and tools
A set of vector stores (if needed), populated with relevant data, and accessible by the agents
(Similar to the ML workflow in the sense that the goal is to process data, but different in many details e.g. no training is involved)

Foundation model selection: choose the most suitable pre-trained model(s) based on task requirements, performance, cost, data protection, and availability
Prompt engineering: design, test, and refine prompt templates to elicit the desired responses
Evaluations: establish assertions and metrics to assess PFM responses to prompts (attained by instantiating templates over actual data)
Tracking the data-flow between components (agents, PFM, tools, vector stores) to monitor costs, latency, and to debug unexpected behaviors
Support public officers in managing tenders through a GenAI assistant that understands and compares procurement decisions transparently.
Problem Framing:
Data Collection: past tenders’ technical specifications, acts, etc; regulatory documents, etc.
Data Preparation:
Prompt Engineering:
You are a procurement evaluator…)Foundation Model Selection: multi-lingual? specialized in legal/technical text? cost constraints? support for tools?
Vector stores: storing embeddings for tender documents & specs, legal texts & guidelines, previous evaluation, templates
Tools:
Agents:
The practice of organizing and automating the end-to-end process of building, evaluating, deploying, and maintaining GenAI applications
LLMOps adds infrastructure + processes + automation to make each step more reliable:
Foundation models are hard-coded in the application
Prompt templates are scattered in code or documents
Tools are manually integrated, leading to:
Agents are ad-hoc scripts that mix logic, PFM calls, and tool invocations
Vector stores are tightly coupled with specific DBMS
Evaluation & monitoring are manual and sporadic leading to undetected issues, cost overruns, and loss of trust

An open-source Python framework for MLOps and (most recently) LLMOps



Notice that, in set-up 3, there could be up to three servers involved:
scikit-learn, TensorFlow, PyTorch, etc.)Start the Python code
MLflow’s Python API invoked in the code will actually log all relevant metadata and artifacts as the code runs
Metadata and artifacts may be stored (depending on the configuration):
Assumption 2 may require additional effort from the developer(s)
No big constraint on how to organize the Python code it-self…
… but many benefits (automation, reproducibility) may come from organizing the code as an MLflow Project
Install MLflow into your Python environment
pip install mlflow
Consider the following dummy script:
import sys # to read command-line arguments
import tempfile # to save generated files into temporary directories
import mlflow # to use MLflow functionalities
from random import Random # to generate random numbers with controlled seed
# Set the experiment name (creates it if it does not exist)
mlflow.set_experiment(experiment_name="logging_example")
# Read a seed from command-line arguments (default: 42)
seed = int(sys.argv[1]) if len(sys.argv) > 1 else 42
rand = Random(seed)
# Start an MLflow run, naming it "example_run" (otherwise random name is generated)
with mlflow.start_run(run_name="example_run") as run:
# notice that experiments are runs are identified by their numeric IDs
print(f"Started MLflow run with ID: {run.info.run_id} in experiment ID: {run.info.experiment_id}")
# Log a parameter "seed" with the given seed value
mlflow.log_param("seed", seed)
# Let's simulate 5 different metric scores to be logged
for i in range(5):
mlflow.log_metric(f"random_{i}", rand.random())
mlflow.log_metric("random_4", rand.randint(1, 10)) # overwrite last metric
# Create and log an example artifact (a text file, generated inside temporaty directory)
with tempfile.TemporaryDirectory() as tmpdir:
file_path = f"{tmpdir}/example.txt"
with open(file_path, "w") as f:
f.write("This is an example artifact.")
mlflow.log_artifact(file_path, artifact_path="examples")
# Simulate an error in the run if the seed parameter is odd
if seed % 2 == 1:
raise ValueError("Let the run fail for odd seeds!")
print("Run completed successfully.")
Let’s run the experiment twice, with different seeds:
python logging_example.py 42
python logging_example.py 43
The 1st successful run shall output something like:
2025/11/10 11:44:49 INFO mlflow.tracking.fluent: Experiment with name 'logging_example' does not exist. Creating a new experiment.
Started MLflow run with ID: 378f18735f6d4abd8abeba76f4029bea in experiment ID: 931233098002846893
The 2nd failing run shall output something like:
Started MLflow run with ID: 9b52b7b7416e423ca9c878fba9b5c667 in experiment ID: 931233098002846893
Traceback (most recent call last):
File "/home/gciatto/Work/Code/example-mlops/mlflow_tracking.py", line 28, in <module>
raise ValueError("Let the run fail for odd seeds!")
ValueError: Let the run fail for odd seeds!
Look at your file system, notice that a new mlruns/ folder has appeared next to Python script:
mlruns
├── 931233098002846893
│ ├── 378f18735f6d4abd8abeba76f4029bea
│ │ ├── artifacts
│ │ │ └── examples
│ │ │ └── example.txt
│ │ ├── meta.yaml
│ │ ├── metrics
│ │ │ ├── random_0
│ │ │ ├── random_1
│ │ │ ├── random_2
│ │ │ ├── random_3
│ │ │ └── random_4
│ │ ├── params
│ │ │ └── seed
│ │ └── tags
│ │ ├── mlflow.runName
│ │ ├── mlflow.source.git.commit
│ │ ├── mlflow.source.name
│ │ ├── mlflow.source.type
│ │ └── mlflow.user
│ ├── 9b52b7b7416e423ca9c878fba9b5c667
│ │ ├── artifacts
│ │ │ └── examples
│ │ │ └── example.txt
│ │ ├── meta.yaml
│ │ ├── metrics
│ │ │ ├── random_0
│ │ │ ├── random_1
│ │ │ ├── random_2
│ │ │ ├── random_3
│ │ │ └── random_4
│ │ ├── params
│ │ │ └── seed
│ │ └── tags
│ │ ├── mlflow.runName
│ │ ├── mlflow.source.git.commit
│ │ ├── mlflow.source.name
│ │ ├── mlflow.source.type
│ │ └── mlflow.user
│ ├── meta.yaml
│ └── tags
│ └── mlflow.experimentKind
└── models
Let’s now start the MLflow Web UI via the following command, to visualize the experiment runs:
mlflow ui
then browse to http://127.0.0.1:5000 in your favorite browser
You should see something like the following:

logging_example) to see the two runs:

mlruns/
example.txt artifact is inside some “virtual” examples/ folder
from sklearn.datasets import load_iris # to load the iris dataset
from sklearn.tree import DecisionTreeClassifier # to use decision tree classifier
import mlflow # to use MLflow functionalities
import sys # to read command-line arguments
# Set the experiment name (creates it if it does not exist)
mlflow.set_experiment("autologging-example")
# Enable autologging for scikit-learn (and other ML libraries in general)
mlflow.autolog(log_datasets=True, log_models=True, log_model_signatures=True, log_input_examples=True)
# Read a seed from command-line arguments (default: 42)
seed = int(sys.argv[1]) if len(sys.argv) > 1 else 42
# Start an MLflow run, naming it "autologging_run"
with mlflow.start_run(run_name="autologging_run"):
# Load full iris dataset
X, y = load_iris(return_X_y=True)
# Train model on the entire dataset (using the given seed)
model = DecisionTreeClassifier(random_state=seed)
model.fit(X, y)
# Evaluate model on the entire dataset (training accuracy)
training_score = model.score(X, y)
print("Training accuracy:", training_score)
# Raise an error if training accuracy is below 90%
if training_score < 0.9:
raise ValueError("Training accuracy is too low: " + str(training_score))
Let’s run the experiment once, with default seed:
python autologging_example.py
The run shall output something like:
2025/11/11 16:27:49 INFO mlflow.tracking.fluent: Experiment with name 'autologging-example' does not exist. Creating a new experiment.
2025/11/11 16:27:50 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
Training accuracy: 1.0
Let’s now look at the MLflow UI again (via mlflow ui command) to see the new experiment:

autologging-example) to see the run:
Click on the run name to see details about that run
Click on the run name to see details about that run
estimator.html (HTML representation of the SciKit-Learn processing pipeline)metric_info.json (details about automatically-logged metrics)training_confusion_matrix.png (confusion matrix picture on training data)Click on the logged model to see its details:
m-cbc72d11f1e6405bbaa77889f08b92dd
MLmodel (YAML description of the model)model.pkl (the actual serialized model, in Python pickle format)requirements.txt (Python environment to run the model, in pip format)serving_input_example.json (example input data for model serving via MLflow)
MLflow assists in model deployment by mediating the interaction between logged models and their clients
MLflow automates the creation of container images for the sake of model deployment
Saved models can be easily used via MLflow’s command-line interface (CLI) or Web API for prediction serving
Download the file serving_input_example.json locally
{
"inputs": [
[5.1, 3.5, 1.4, 0.2],
[7.0, 3.2, 4.7, 1.4],
[6.3, 3.3, 6.0, 2.5]
]
}
Run
mlflow models predict --env-manager local -m "models:/m-1abbea58e1cf442ab9412b7eae572523" -i path/to/serving_input_example.json
Observe the predictions output:
{"predictions": [0, 1, 2]}
Saved models can be easily used via MLflow’s command-line interface (CLI) or Web API for prediction serving
Start the MLflow model serving endpoint:
mlflow models serve --env-manager local -m "models:/m-1abbea58e1cf442ab9412b7eae572523" -p 1234
Send a prediction request via curl:
curl http://localhost:1234/invocations -H "Content-Type:application/json" --data '{
"inputs": [
[5.1, 3.5, 1.4, 0.2],
[7.0, 3.2, 4.7, 1.4],
[6.3, 3.3, 6.0, 2.5]
]
}'
Observe the predictions output:
{"predictions": [0, 1, 2]}
Build a Docker image for the model:
mlflow models build-docker -m "models:/m-1abbea58e1cf442ab9412b7eae572523" -n iris-classifier-dt:latest
This should output something like:
2025/11/12 16:50:16 INFO mlflow.models.flavor_backend_registry: Selected backend for flavor 'python_function'
2025/11/12 16:50:16 INFO mlflow.pyfunc.backend: Building docker image with name mlflow-pyfunc-servable
[+] Building 254.3s (14/14) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.5s
=> => transferring dockerfile: 1.00kB 0.0s
=> [internal] load metadata for docker.io/library/python:3.13.7-slim 6.2s
=> [auth] library/python:pull token for registry-1.docker.io 0.0s
=> [internal] load .dockerignore 0.3s
=> => transferring context: 2B 0.0s
=> [internal] load build context 0.4s
=> => transferring context: 5.14kB 0.0s
=> [1/8] FROM docker.io/library/python:3.13.7-slim@sha256:5f55cdf0c5d9dc1a415637a5ccc4a9e18663ad203673173b8cda8f8dcacef689 6.5s
=> => resolve docker.io/library/python:3.13.7-slim@sha256:5f55cdf0c5d9dc1a415637a5ccc4a9e18663ad203673173b8cda8f8dcacef689 0.1s
=> => sha256:5f55cdf0c5d9dc1a415637a5ccc4a9e18663ad203673173b8cda8f8dcacef689 10.37kB / 10.37kB 0.0s
=> => sha256:2be5d3cb08aa616c6e38d922bd7072975166b2de772004f79ee1bae59fe983dc 1.75kB / 1.75kB 0.0s
=> => sha256:7b444340715da1bb14bdb39c8557e0195455f5f281297723c693a51bc38a2c4a 5.44kB / 5.44kB 0.0s
=> => sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf 29.78MB / 29.78MB 2.1s
=> => sha256:31fd2a94d72338ac6bbe103da6448d7e4cb7e7a29b9f56fa61d307b4395edf86 1.29MB / 1.29MB 0.7s
=> => sha256:66b685f2f76ba4e1e04b26b98a2aca385ea829c3b1ec637fbd82df8755973a60 11.74MB / 11.74MB 2.5s
=> => sha256:7d456e82f89bfe09aec396e93d830ba74fe0257fe2454506902adf46fb4377b3 250B / 250B 1.3s
=> => extracting sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf 0.8s
=> => extracting sha256:31fd2a94d72338ac6bbe103da6448d7e4cb7e7a29b9f56fa61d307b4395edf86 0.2s
=> => extracting sha256:66b685f2f76ba4e1e04b26b98a2aca385ea829c3b1ec637fbd82df8755973a60 0.6s
=> => extracting sha256:7d456e82f89bfe09aec396e93d830ba74fe0257fe2454506902adf46fb4377b3 0.0s
=> [2/8] RUN apt-get -y update && apt-get install -y --no-install-recommends nginx 20.5s
=> [3/8] WORKDIR /opt/mlflow 0.3s
=> [4/8] RUN pip install mlflow==3.5.0 162.1s
=> [5/8] COPY model_dir /opt/ml/model 0.9s
=> [6/8] RUN python -c "from mlflow.models import container as C; C._install_pyfunc_deps('/opt/ml/model', install_mlflow=False, enable_mlserver=False, env_manager='loca 27.6s
=> [7/8] RUN chmod o+rwX /opt/mlflow/ 0.9s
=> [8/8] RUN rm -rf /var/lib/apt/lists/* 1.7s
=> exporting to image 25.4s
=> => exporting layers 24.9s
=> => writing image sha256:7b41a8c6bd049022abc8aeaaa41b7b60008c242fbc59c41073fc61daec05952d 0.1s
=> => naming to docker.io/library/iris-classifier-dt
Run the Docker container:
docker run --rm -it --network host iris-classifier-dt:latest
Send a prediction request via curl:
curl http://localhost:8000/invocations -H "Content-Type:application/json" --data '{ "inputs": [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4], [6.3, 3.3, 6.0, 2.5]] }'
Observe the predictions output:
{"predictions": [0, 1, 2]}
One may register a model $\approx$ give it a human-friendly name + versioning


Click on the model name to see its available versions
models:/<model-name>/<version>
models:/<model-name>@latest

A ML workflow aimed at creating a classifier for the Adult Income dataset via SciKit-Learn pipeline
Implemented as a Jupyter notebook + MLflow
Jupyter notebook available at: https://github.com/gciatto/example-mlops/blob/master/mlflow_census_demo.ipynb
UI overview (metadata)

Comparing metrics across multiple runs

Inspecting the winner model

adult-best-random-forest)Jupyter notebooks are interactive by nature
Executing the code is time-consuming: it requires a human operator to start the notebook and run all cells
Pictures (if any) are commonly embedded in the notebook itself
No code decomposition, poor version control
[Critical] Parameters are hard-coded in the notebook itself
Solution: organize the code as an MLflow Project (see next slide)
MLflow Projects provide a standard format for packaging and sharing reproducible data science code
Assumption 1: files are structured in a specific way (decomposition of code into multiple scripts)
root-directory-name/
├── MLproject # Project descriptor file
├── train.py # Training script
├── test.py # Test script
├── conda.yaml # Optional: Conda environment (dependencies)
├── python_env.yaml # Optional: Python environment (alternative to Conda)
└── data/ # Optional: project data and assets
train.py and test.py are two separate scripts
python_env.yaml file (or in conda.yaml)
python: "^3.13.7"
dependencies:
- mlflow
- scikit-learn
- pandas
- matplotlib
- numpy
- requests
- jupyter
- seaborn
Assumption 3: ML tasks and their parameters are declared via the MLproject file
name: My ML Project
# Environment specification (choose one)
python_env: python_env.yaml
# conda_env: conda.yaml
# docker_env:
# image: python:3.9
entry_points:
main:
parameters:
param_file: path
param_num: {type: float, default: 0.1}
param_int: {type: int, default: 100}
command: "python train.py --num {param_num} --int {param_int} {param_file}"
test:
parameters:
param_str: {type: str, default: "hello"}
param_uri: uri
command: "python test.py --uri {param_uri} {param_str}"
mlflow run . -P param_file=data/input.csv -P param_num=0.2 -P param_int=200
# if no -e <entry-point> is given, "main" is assumed by default
mlflow run . -e test -P param_str="world" -P param_uri="models:/my-model/1"
Assumption 4: entry-point scripts (train.py, etc.) are implemented to read all relevant parameters from command-line
# train.py
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("file", type=str, help="Path to input data file")
parser.add_argument("--num", type=float, default=0.1, help="A numeric parameter")
parser.add_argument("--int", type=int, default=100, help="An integer parameter")
args = parser.parse_args()
print(f"Training with data from: {args.file}")
print(f"Numeric parameter: {args.num}")
print(f"Integer parameter: {args.int}")
# ... rest of the training code ...
Assumption 5: entry-point scripts use MLflow’s Tracking API accordingly
(We also exemplify the usage of a remote MLflow Tracking Server)
On machine with DNS name my.mlflow.server.it, clone the repository, and start MLflow server via Docker Compose
# git clone https://github.com/gciatto/example-mlops.git
# cd example-mlops
docker compose up -d --wait
On your local machine, clone the repository as well, then set the MLflow Tracking URI accordingly
# git clone https://github.com/gciatto/example-mlops.git
# cd example-mlops
export MLFLOW_TRACKING_URI="http://my.mlflow.server.it:5000
# or, on Windows (cmd):
# set MLFLOW_TRACKING_URI=http://my.mlflow.server.it:5000
# or, on Windows (PowerShell):
# $env:MLFLOW_TRACKING_URI="http://my.mlflow.server.it:5000"
Again on your local machine, you may need to re-create the Python environment to run experiments
python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
Notice the MLproject file in the repository root, paying attention to the entry points defined therein, and their parameters:
name: Census Income Prediction Demo
python_env: python.yaml
entry_points:
train:
parameters:
model_type: {type: string, default: "both"}
test_size: {type: float, default: 0.2}
cv_splits: {type: int, default: 3}
random_state: {type: int, default: 42}
numeric_imputation_strategy: {type: string, default: "median"}
numeric_scaling_with_mean: {type: boolean, default: true}
categorical_imputation_strategy: {type: string, default: "most_frequent"}
ohe_handle_unknown: {type: string, default: "ignore"}
sparse_threshold: {type: float, default: 0.3}
lr_max_iter: {type: int, default: 1000}
lr_C_values: {type: string, default: '[0.1, 1.0, 10.0]'}
lr_solvers: {type: string, default: '["lbfgs"]'}
rf_n_estimators: {type: string, default: '[150, 300]'}
rf_max_depths: {type: string, default: '[null, 12]'}
command: |
python train.py \
--model-type {model_type} \
--test-size {test_size} \
--cv-splits {cv_splits} \
--random-state {random_state} \
--numeric-imputation-strategy {numeric_imputation_strategy} \
--numeric-scaling-with-mean {numeric_scaling_with_mean} \
--categorical-imputation-strategy {categorical_imputation_strategy} \
--ohe-handle-unknown {ohe_handle_unknown} \
--sparse-threshold {sparse_threshold} \
--lr-max-iter {lr_max_iter} \
--lr-C-values {lr_C_values} \
--lr-solvers {lr_solvers} \
--rf-n-estimators {rf_n_estimators} \
--rf-max-depths {rf_max_depths}
test:
parameters:
run_id: {type: string}
model_uri: {type: string}
command: "python test.py --run-id {run_id} --model-uri {model_uri}"
python train.py --help to see details about training script parameterspython test.py --help to see details about testing script parametersStill on your local machine, you may now run the training via MLflow Project API
EXPERIMENT_NAME="adult-classifier-$(date +'%Y-%m-%d-%H-%M')"
mlflow run -e train --env-manager=local --experiment-name "$EXPERIMENT_NAME" . -P model_type=both
this may take some minutes, as the full model selection is performed via CV + grid search
model_type parameter (either logistic, random_forest, instead of both)the logs of the training script will tell you which command to use for testing the best model:
mlflow run -e test --env-manager=local --experiment-name $EXPERIMENT_NAME . -P run_id=<TRAINING_RUN_ID> -P model_uri=models:/<BEST_MODEL_ID>
You may access the MLflow UI via the URL of the remote tracking server: http://my.mlflow.server.it:5000

# to update the date-time in the experiment name, do:
# EXPERIMENT_NAME="adult-classifier-$(date +'%Y-%m-%d-%H-%M')"
mlflow run -e train --env-manager=local --experiment-name "$EXPERIMENT_NAME" . \
-P model_type=random_forest \
-P test_size=0.25 \
-P cv_splits=5 \
-P random_state=123
You may test the best model on the test set via:
# reuse same EXPERIMENT_NAME as in training step
mlflow run -e test --env-manager=local --experiment-name $EXPERIMENT_NAME . \
-P run_id=<TRAINING_RUN_ID> \
-P model_uri=models:/<BEST_MODEL_ID>
(look at the training script logs to find the exact command)

MLflow may be used to track experiments involving Large Language Models (LLMs)
MLflow’s Tracking API may be used to log:
Ad-hoc API is provided to Python programmers to express evaluation metrics for LLM-responses
Support for multiple LLM providers, and their client libraries
Support for annotating LLM-responses with human feedback
Support for profiling data-flow back-and-forth between clients and LLM providers
Use an LLM to evaluate the quality of responses generated by another LLM
Think of criteria as unit tests for LLMs’ prompt–responses pairs
Examples:
The response must be in EnglishThe response must contain at least 3 examplesuser question is asking for sensitive code, then response must kindly decline to answerCATEGORY, QUESTION TEXT, WEIGHT (DIFFICULTY)
############################################################################
Definition, What is computer science?, 1
Definition, What is an algorithm?, 1
Definition, "Difference among information, data, and representation",1
Definition, What is software?, 1
Definition, What is software engineering?, 1
History, What were software crises?, 1
Commonsense, What makes software development costs rise?,1
Commonsense, What may delay software development?, 1
Idea! Let’s generate examples of good answers via LLMs and provide students with them
MLflow may help in 1. selecting the best models and 2. prompts, assuming that 3. evaluation metrics are defined for generated answers
You are a university professor preparing model answers for a software engineering examination.
Category: {category}
Question: {question}
Difficulty: {weight}/4
Provide a clear and accurate answer suitable for an exam context.
Be concise but comprehensive.
Category: {category}
Question: {question}
Weight: {weight}/4
Answer the question above.
Category: {category}
Question: {question}
Difficulty: {weight}/4
Provide an answer that:
1. Explains the concept clearly
2. Includes at least one concrete example or use case
3. Relates to real-world software development scenarios
4. Is easy to understand for someone learning the subject
Category: {category}
Difficulty: {weight}/4 (higher means more complex)
Question: {question}
Instructions:
- Provide a rigorous, academically sound answer
- Include relevant technical terminology
- Reference key concepts and principles where appropriate
- Structure your answer clearly with proper explanations
- Aim for a comprehensive yet focused response suitable for academic evaluation
Always use the provided web search tool to complement your answers with relevant and up-to-date links or references.
In calling the tool, you should automatically infer the most relevant query based on the conversation so far.
General criteria:
english: The answer should be in English.software_engineering_related: The answer is correctly contextualizing the question within the domain of software engineering.reference_to_definition: The answer should reference and/or quote relevant definitions for the concepts mentioned in the question.relevance_to_query: The answer should be relevant to the question asked.enough_words: More than 10 words in the answer.not_too_many_words: Less than 1000 words in the answer.Question-specific correctness criteria:
What is computer science?)
What is an algorithm?)
gpt-4.1-miniQuestion 1: What is computer science? (weight: 1, category: Definition)
Prompt 1 (basic):
You are a university professor preparing model answers for a software engineering examination.
Category: Definition
Question: What is computer science?
Difficulty: 1/4
Provide a clear and accurate answer suitable for an exam context. Be concise but comprehensive.
Generated answer:
Computer science is the systematic study of computers and computational systems.
It involves understanding algorithms, data structures, software design, programming languages,
and the theoretical foundations of computation, as well as the practical aspects of developing
and applying software and hardware solutions to solve problems.
Evaluation results:
english: passed ✅software_engineering_related: failed ❌ (does not mention “software engineering”)reference_to_definition: failed ❌ (does not quote any definition)relevance_to_query: passed ✅enough_words: passed ✅not_too_many_words: passed ✅gpt-5-mini, with agents & toolsQuestion 2: What is an algorithm? (weight: 1, category: Definition)
Prompt 1 (basic) + tools addendum:
You are a university professor preparing model answers for a software engineering examination.
Always use the provided web search tool to complement your answers with relevant and up-to-date links or references.
In calling the tool, you should automatically infer the most relevant query based on the conversation so far.
Category: Definition
Question: What is an algorithm?
Difficulty: 1/4
Provide a clear and accurate answer suitable for an exam context. Be concise but comprehensive.
The tool:
@tool
def web_search_tool(query: str, max_results: int = 3) -> str:
"""
Tool function for LLM to search the web and get formatted results.
This combines search_web and format_search_results into a single tool.
Args:
query: Search query string
max_results: Maximum number of results to return (default: 3)
Returns:
Markdown-formatted enumeration of search results where,
for each result, the first line is the title of a Web page with an hyper-ref to the page's URL,
and the second line is a snippet/summary of the content.
"""
results = search_web(query, max_results=max_results)
return format_search_results(results)
gpt-5-mini, with agents & tools (cont’d)Generated intermediate tool invocation:
web_search_tool("What is an algorithm definition concise exam context computer science", max_results=3)
Result:
Relevant Web search results:
- [Wikipedia Algorithm - Wikipedia](https://en.wikipedia.org/wiki/Algorithm)
4 days ago -
In mathematics and computer science, an algorithm ...
- [Scribbr What is an algorithm in computer science?](https://www.scribbr.com/frequently-asked-questions/algorithm-computer-science/)
August 9, 2023 -
In computer science, an algorithm is a list of unambiguous instructions ...
- ...
An algorithm is a finite, well‑defined sequence of unambiguous instructions that, given specified input(s),
produces the desired output(s) and terminates after a finite number of steps.
Key properties (for exam answers)
- Input and output: takes zero or more inputs and yields one or more outputs.
- Finiteness (termination): completes in a finite number of steps for every valid input.
- Definiteness (unambiguous): each step is precisely specified.
- Effectiveness: each step is basic enough to be carried out in practice.
- Correctness: when implemented, it solves the intended problem for all valid inputs.
Example (brief): the Euclidean algorithm for computing the greatest common divisor of two integers.
References:
- Wikipedia: Algorithm — <https://en.wikipedia.org/wiki/Algorithm>
- Scribbr: What is an algorithm in computer science? — <https://www.scribbr.com/frequently-asked-questions/algorithm-computer-science/>
Clone the repository and start a local MLflow Tracking Server via Docker Compose:
git clone https://github.com/gciatto/example-llmops.git
cd example-llmops
docker-compose up -d --wait
Set the environment variables for MLflow Tracking Server URI and OpenAI API key:
export MLFLOW_TRACKING_URI="http://localhost:5000"
export OPENAI_API_KEY="sk-..."
# on Windows (cmd):
# set MLFLOW_TRACKING_URI=http://localhost:5000
# set OPENAI_API_KEY=sk-...
# on Windows (PowerShell):
# $env:MLFLOW_TRACKING_URI="http://localhost:5000"
# $env:OPENAI_API_KEY="sk-..."
Create and activate a Python virtual environment, then install dependencies:
python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
example-llmops/
├── MLproject # MLflow Project descriptor file
├── register_all_prompts.py # Script to register prompt templates
├── generate_answers.py # Script to generate answers without agents/tools
├── generate_answers_with_agent.py # Script to generate answers with agents/tools
├── evaluate_responses.py # Script to evaluate generated responses
├── prompts # Directory with prompt templates
│ ├── academic.txt
│ ├── basic.txt
│ ├── concise.txt
│ ├── practical.txt
│ ├── system.txt
│ └── tools.txt
├── python_env.yaml # Python environment (dependencies)
└── questions.csv # CSV file with input questions
Notice the MLproject file in the repository root, paying attention to the entry points defined therein, and their parameters:
name: quiz-answer-generator
python_env: python_env.yaml
entry_points:
register_all_prompts:
command: "python register_all_prompts.py"
evaluate_responses:
parameters:
generation_run_id: {type: str, default: "none"}
judge_model: {type: str, default: "openai:/gpt-4.1-mini"}
command: "python evaluate_responses.py --generation-run-id {generation_run_id} --judge-model {judge_model}"
generate_answers:
parameters:
prompt_template: {type: str, default: "basic"}
max_questions: {type: int, default: -1}
model: {type: str, default: "gpt-4.1-mini"}
temperature: {type: float, default: 0.7}
max_tokens: {type: int, default: 500}
command: |
python generate_answers.py \
--prompt-template {prompt_template} \
--max-questions {max_questions} \
--model {model} \
--temperature {temperature} \
--max-tokens {max_tokens}
generate_answers_with_agent:
parameters:
prompt_template: {type: str, default: "basic"}
search_results_count: {type: int, default: 3}
max_questions: {type: int, default: -1}
model: {type: str, default: "gpt-5-mini"}
temperature: {type: float, default: 0.7}
max_tokens: {type: int, default: 1500}
command: |
python generate_answers_with_agent.py \
--prompt-template {prompt_template} \
--search-results-count {search_results_count} \
--max-questions {max_questions} \
--model {model} \
--temperature {temperature} \
--max-tokens {max_tokens}
python generate_answers.py --help to see details about generation script parameterspython evaluate_responses.py --help to see details about evaluation script parametersevaluate_responses.py script, where evaluation criteria are defined:
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import Guidelines, scorer, RelevanceToQuery
@scorer
def enough_words(outputs: dict) -> Feedback:
text = outputs['choices'][-1]['message']['content']
word_count = len(text.split())
score = word_count >= 10
rationale = (
f"The response has more than 10 words: {word_count}"
if score
else f"The response does not have enough words because it has less than 10 words: {word_count}."
)
return Feedback(value=score, rationale=rationale)
@scorer
def not_too_many_words(outputs: dict) -> Feedback:
text = outputs['choices'][-1]['message']['content']
word_count = len(text.split())
score = word_count <= 1000
rationale = (
f"The response has less than 1000 words: {word_count}"
if score
else f"The response has too many words: {word_count}."
)
return Feedback(value=score, rationale=rationale)
def guidelines_model(model: str = None):
yield Guidelines(model=model, name="english", guidelines="The answer should be in English.")
yield Guidelines(model=model, name="software_engineering_related",
guidelines="The answer is correctly contextualizing the question within the domain of software engineering.")
yield Guidelines(model=model, name="reference_to_definition",
guidelines="The answer should reference and/or quote relevant definitions for the concepts mentioned in the question.")
yield RelevanceToQuery(model=model)
yield enough_words
yield not_too_many_words
You may now register all prompt templates in a new experiment via:
EXPERIMENT_ID="se-answers-$(date +'%Y-%m-%d-%H-%M')"
mlflow run -e register_all_prompts --env-manager=local --experiment-name $EXPERIMENT_ID .
Look at the MLflow UI (“Prompts” main section) to see the registered prompts:


You may now generate answers via:
mlflow run -e generate_answers --env-manager=local --experiment-name $EXPERIMENT_ID . -P max_questions=4
(we put a limit of 4 questions to save time and costs)
Look at the MLflow UI (“Experiments” main section) to see the generation runs:

Clicking on a trace, you may observe details about the interactions with the LLM provider:

You may now evaluate generated answers via (see the output of generation runs for the exact command):
# reuse same EXPERIMENT_ID as in generation step
mlflow run -e evaluate_responses --env-manager=local --experiment-id <EXPERIMENT_ID> . -P generation_run_id=<GENERATION_RUN_ID>
(this may take some time, as Guidelines evaluations are performed via further LLM queries)
Look at the MLflow UI (“Experiments” main section) to see the evaluation runs:

You may now generate answers with agents & tools via:
EXPERIMENT_ID="se-answers-agents-$(date +'%Y-%m-%d-%H-%M')"
mlflow run -e generate_answers_with_agent --env-manager=local --experiment-name $EXPERIMENT_ID . -P max_questions=2
(we put a limit of 2 questions to save time and costs)
In the MLflow UI, you may inspect which and how many tool invocations were performed:


Compiled on: 2026-04-01 — printable version