You've spent weeks curating training data, carefully crafting your fine-tuning pipeline, and finally shipped a fine-tuned model to production. Congratulations — but the hard work is just beginning.
The dirty secret of production ML is that a model that performs brilliantly on launch day can silently degrade over weeks or months as the world changes around it. User language evolves. Data distributions shift. Edge cases pile up. Without a robust evaluation and monitoring strategy, you're flying blind.
This guide is your end-to-end playbook for keeping fine-tuned models — whether large language models (LLMs) or traditional ML models — healthy, measurable, and continuously improving in production. We'll cover the metrics that actually matter, how to detect drift before it tanks your product, how to safely test new model versions, and how to build dashboards your whole team can act on.
Why Production Monitoring Is Different from Offline Evaluation
Before we dive into tooling, it's worth understanding why offline evaluation (running your test set before deployment) is necessary but not sufficient.
During training and evaluation, you control the data. Your test set is a static snapshot of what the world looked like when you collected it. Production is a live, dynamic system where:
- Input distributions shift — users phrase things differently over time, new topics emerge, language evolves.
- Ground truth may be delayed or absent — for LLMs especially, you often don't have immediate feedback on whether an output was "correct."
- System-level factors interact — latency, upstream data pipelines, API changes, and infrastructure issues all affect effective model quality.
- Business KPIs can diverge from model metrics — a model with better BLEU scores might actually produce worse business outcomes.
Production monitoring closes this gap by giving you continuous visibility into how the model is behaving on real traffic.
Part 1: Metrics That Matter in Production
1.1 Core Performance Metrics for LLMs
For fine-tuned LLMs, your metric suite should span three categories: task-specific quality, safety/reliability, and operational efficiency.
Task-Specific Quality Metrics
BLEU and ROUGE remain standard for text generation tasks, but they have well-known limitations — they measure surface-level n-gram overlap and can miss semantic correctness.
```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def compute_rouge(predictions: list[str], references: list[str]) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    aggregated = {"rouge1": [], "rouge2": [], "rougeL": []}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)
        for key in aggregated:
            aggregated[key].append(scores[key].fmeasure)
    return {k: sum(v) / len(v) for k, v in aggregated.items()}

def compute_bleu(predictions: list[str], references: list[str]) -> float:
    smoothie = SmoothingFunction().method4
    scores = []
    for pred, ref in zip(predictions, references):
        score = sentence_bleu(
            [ref.split()],
            pred.split(),
            smoothing_function=smoothie
        )
        scores.append(score)
    return sum(scores) / len(scores)
```
Semantic similarity using embedding-based metrics (like BERTScore or Sentence-BERT cosine similarity) catches meaning-level correctness that n-gram metrics miss:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(predictions: list[str], references: list[str]) -> list[float]:
    # With normalized embeddings, the row-wise dot product is cosine similarity
    pred_embeddings = model.encode(predictions, normalize_embeddings=True)
    ref_embeddings = model.encode(references, normalize_embeddings=True)
    return (pred_embeddings * ref_embeddings).sum(axis=1).tolist()
```
LLM-as-Judge has emerged as a powerful evaluation approach: you use a stronger or calibrated model to evaluate the outputs of your fine-tuned model at scale, rating dimensions like helpfulness, factual accuracy, and coherence:
```python
import json

import anthropic

client = anthropic.Anthropic()

def llm_judge_score(prompt: str, response: str, rubric: str) -> dict:
    eval_prompt = f"""You are an expert evaluator. Score the following AI response.
User prompt: {prompt}
AI response: {response}
Rubric: {rubric}
Respond in JSON with keys: score (1-5), reasoning (str), issues (list[str]).
"""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    # Assumes the judge returns bare JSON; add validation/retries in production
    return json.loads(message.content[0].text)
```
Safety and Reliability Metrics
For LLMs in production, track:
- Hallucination rate — percentage of responses containing factually incorrect claims (verified by knowledge base lookup or LLM judge).
- Refusal rate — how often the model refuses appropriate requests (false positives in safety filters).
- Toxicity score — using tools like Perspective API or custom classifiers.
- Instruction-following accuracy — does the model respect constraints like output format, word count, or persona? (A lightweight heuristic check for refusal and format compliance is sketched below.)
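Hallucination and toxicity usually need an external judge or classifier, but refusal rate and format compliance can be approximated with lightweight heuristics. A minimal sketch, assuming illustrative refusal phrases and a JSON-output instruction; tune both to your model's actual behavior:

```python
import json
import re

# Illustrative refusal markers -- an assumption, not a complete taxonomy
REFUSAL_PATTERNS = [
    r"i can't help with",
    r"i cannot assist",
    r"i'm not able to",
]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(re.search(p, text) for p in REFUSAL_PATTERNS)

def follows_json_format(response: str) -> bool:
    """Example instruction-following check: did the model return valid JSON?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def safety_metrics(responses: list[str]) -> dict:
    # Assumes a non-empty batch of responses
    n = len(responses)
    return {
        "refusal_rate": sum(is_refusal(r) for r in responses) / n,
        "json_compliance_rate": sum(follows_json_format(r) for r in responses) / n,
    }
```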
Operational Metrics
These matter as much as quality:
- Latency (p50, p95, p99) — don't just track averages; tail latency is often where user experience breaks (see the percentile sketch after this list).
- Tokens per second (TPS) — throughput under load.
- Cost per inference — especially important when running at scale on hosted APIs.
- Error rate — timeouts, context window overflows, API failures.
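Tail percentiles and error rates are cheap to compute directly from raw request logs. A minimal sketch, assuming you have per-request latency and error records on hand:

```python
import numpy as np

def operational_summary(latencies_ms: list[float], errors: list[bool]) -> dict:
    """Summarize tail latency and error rate from raw per-request logs."""
    lat = np.asarray(latencies_ms)
    return {
        "p50_ms": float(np.percentile(lat, 50)),
        "p95_ms": float(np.percentile(lat, 95)),
        "p99_ms": float(np.percentile(lat, 99)),  # worst-affected users live here
        "error_rate": sum(errors) / len(errors),
    }
```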
1.2 Metrics for Traditional Fine-Tuned ML Models
For classification, regression, or structured prediction models, the metric suite is more established but equally important to track continuously:
```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    roc_auc_score, mean_absolute_error, mean_squared_error
)

def compute_classification_metrics(y_true, y_pred, y_prob=None) -> dict:
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "precision_macro": precision_score(y_true, y_pred, average="macro"),
        "recall_macro": recall_score(y_true, y_pred, average="macro"),
    }
    if y_prob is not None:
        # For multiclass, y_prob must have shape (n_samples, n_classes);
        # multi_class is ignored for binary targets
        metrics["roc_auc"] = roc_auc_score(y_true, y_prob, multi_class="ovr")
    return metrics

def compute_regression_metrics(y_true, y_pred) -> dict:
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        # Clip the denominator so zero targets don't blow up MAPE
        "mape": np.mean(np.abs((y_true - y_pred) / np.clip(np.abs(y_true), 1e-9, None))) * 100
    }
```
One critical production-specific metric for classification models is calibration — whether predicted probabilities reflect real confidence levels. A model that says "90% confident" should be right 90% of the time:
```python
import numpy as np
from sklearn.calibration import calibration_curve

def check_calibration(y_true, y_prob, n_bins=10):
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_true, y_prob, n_bins=n_bins
    )
    # Unweighted average over bins; weight by bin counts for the standard ECE
    calibration_error = np.mean(
        np.abs(fraction_of_positives - mean_predicted_value)
    )
    return {
        "expected_calibration_error": calibration_error,
        "fraction_of_positives": fraction_of_positives.tolist(),
        "mean_predicted_value": mean_predicted_value.tolist()
    }
```
Part 2: Drift Detection
Drift is the silent killer of production models. There are three main types you need to watch for.
2.1 Data Drift (Covariate Shift)
Data drift occurs when the distribution of input features changes — even if the underlying task relationship hasn't changed. For a fine-tuned sentiment model, this could be users starting to write shorter, more emoji-heavy reviews.
The industry-standard approach uses statistical tests to compare the current window of inputs against a reference distribution (typically the training or validation set).
```python
from dataclasses import dataclass

import numpy as np
from scipy import stats

@dataclass
class DriftReport:
    feature_name: str
    test_statistic: float
    p_value: float
    drift_detected: bool
    severity: str  # "none", "low", "medium", "high"

def detect_numerical_drift(
    reference: np.ndarray,
    current: np.ndarray,
    feature_name: str,
    threshold: float = 0.05
) -> DriftReport:
    """Kolmogorov-Smirnov test for numerical features."""
    statistic, p_value = stats.ks_2samp(reference, current)
    drift = p_value < threshold
    if not drift:
        severity = "none"
    elif statistic < 0.1:
        severity = "low"
    elif statistic < 0.2:
        severity = "medium"
    else:
        severity = "high"
    return DriftReport(
        feature_name=feature_name,
        test_statistic=statistic,
        p_value=p_value,
        drift_detected=drift,
        severity=severity
    )
```
```python
def detect_categorical_drift(
    reference: list,
    current: list,
    feature_name: str,
    threshold: float = 0.05
) -> DriftReport:
    """Chi-squared goodness-of-fit test for categorical features."""
    from collections import Counter
    all_categories = sorted(set(reference) | set(current))
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    # Laplace smoothing so categories missing from one window
    # don't produce zero expected counts
    ref_freq = np.array([ref_counts.get(c, 0) for c in all_categories]) + 1
    cur_freq = np.array([cur_counts.get(c, 0) for c in all_categories]) + 1
    # chisquare expects counts, not proportions: scale the reference
    # distribution to the current sample size
    expected = ref_freq / ref_freq.sum() * cur_freq.sum()
    statistic, p_value = stats.chisquare(cur_freq, expected)
    drift = p_value < threshold
    return DriftReport(
        feature_name=feature_name,
        test_statistic=statistic,
        p_value=p_value,
        drift_detected=drift,
        severity="high" if drift and statistic > 10 else "medium" if drift else "none"
    )
```
For LLMs, where inputs are text, drift detection operates on embedding space rather than raw features:
```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def detect_embedding_drift(
    reference_embeddings: np.ndarray,
    current_embeddings: np.ndarray,
    threshold: float = 0.15
) -> dict:
    """Detect drift using Maximum Mean Discrepancy (MMD) on embeddings."""
    def mmd_rbf(X, Y, gamma=1.0):
        XX = rbf_kernel(X, X, gamma)
        YY = rbf_kernel(Y, Y, gamma)
        XY = rbf_kernel(X, Y, gamma)
        return XX.mean() + YY.mean() - 2 * XY.mean()

    # Subsample for efficiency -- the kernel matrices are O(n^2)
    n_sample = min(500, len(reference_embeddings), len(current_embeddings))
    ref_sample = reference_embeddings[
        np.random.choice(len(reference_embeddings), n_sample, replace=False)
    ]
    cur_sample = current_embeddings[
        np.random.choice(len(current_embeddings), n_sample, replace=False)
    ]
    mmd_score = mmd_rbf(ref_sample, cur_sample)
    return {
        "mmd_score": mmd_score,
        "drift_detected": mmd_score > threshold,
        "severity": "high" if mmd_score > 0.3 else "medium" if mmd_score > threshold else "none"
    }
```
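Wiring this into the earlier SentenceTransformer setup takes a few lines. A hypothetical usage, where `reference_prompts` (a stored sample of launch-week traffic) and `recent_prompts` (the current window) are assumed inputs:

```python
# Hypothetical wiring: reference_prompts and recent_prompts are assumed inputs,
# and `model` is the SentenceTransformer loaded earlier
ref_emb = model.encode(reference_prompts, normalize_embeddings=True)
cur_emb = model.encode(recent_prompts, normalize_embeddings=True)

report = detect_embedding_drift(ref_emb, cur_emb)
if report["drift_detected"]:
    print(f"Embedding drift: MMD={report['mmd_score']:.3f} ({report['severity']})")
```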
2.2 Concept Drift
Concept drift means the relationship between inputs and correct outputs has changed, even if inputs look similar. For a fine-tuned model trained on 2024 customer support data, the "correct" answer to "how do I cancel?" might change if your cancellation flow changes in 2025.
Detecting concept drift requires labeled data — which is often the bottleneck. Strategies include:
- Active labeling windows — sample a small fraction of production traffic, label it, and run your evaluation metrics on it on a rolling basis.
- Proxy metrics — when ground truth is unavailable, track signals correlated with quality (user corrections, thumbs-down rates, escalation rates).
- Sliding window performance tracking — if you have periodic labeled batches, track metric curves over time and alert on sustained downtrends.
```python
from collections import deque

import numpy as np

class ConceptDriftDetector:
    """ADWIN-inspired sliding window drift detector."""
    def __init__(self, window_size: int = 1000, threshold: float = 0.1):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold
        self.baseline_mean = None

    def add_sample(self, metric_value: float) -> dict:
        self.window.append(metric_value)
        if len(self.window) < 100:
            return {"drift_detected": False, "message": "Warming up"}
        if self.baseline_mean is None:
            # Freeze the first 100 samples as the baseline
            self.baseline_mean = np.mean(list(self.window)[:100])
        current_mean = np.mean(list(self.window)[-100:])
        delta = abs(current_mean - self.baseline_mean)
        return {
            "drift_detected": delta > self.threshold,
            "baseline_mean": self.baseline_mean,
            "current_mean": current_mean,
            "absolute_change": delta,
            "relative_change": delta / (self.baseline_mean + 1e-9)
        }
```
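In practice you feed the detector a rolling quality signal, such as per-sample correctness from your periodically labeled batches. A hypothetical usage, where `labeled_stream` is an assumed iterable yielding 1.0 for correct predictions and 0.0 otherwise:

```python
detector = ConceptDriftDetector(window_size=1000, threshold=0.05)

# labeled_stream is an assumed iterable of per-sample correctness values
for correct in labeled_stream:
    result = detector.add_sample(correct)
    if result.get("drift_detected"):
        print(f"Concept drift: baseline={result['baseline_mean']:.3f}, "
              f"current={result['current_mean']:.3f}")
        break
```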
2.3 Prediction Drift
Even without labels, you can track the distribution of model outputs over time. If your fine-tuned classifier suddenly starts predicting class B far more than it used to — without a corresponding change in business reality — that's a red flag worth investigating.
```python
from collections import Counter

def monitor_output_distribution(
    historical_predictions: list,
    current_predictions: list,
    label_map: dict = None
) -> dict:
    hist_dist = Counter(historical_predictions)
    curr_dist = Counter(current_predictions)
    all_labels = set(hist_dist) | set(curr_dist)
    total_hist = sum(hist_dist.values())
    total_curr = sum(curr_dist.values())
    shifts = {}
    for label in all_labels:
        hist_pct = hist_dist.get(label, 0) / total_hist * 100
        curr_pct = curr_dist.get(label, 0) / total_curr * 100
        shifts[label_map.get(label, label) if label_map else label] = {
            "historical_pct": round(hist_pct, 2),
            "current_pct": round(curr_pct, 2),
            "absolute_shift": round(curr_pct - hist_pct, 2)
        }
    return shifts
```
Part 3: A/B Testing Fine-Tuned Models
3.1 Why A/B Testing Matters for Model Updates
When you have a new model checkpoint — whether it's a retrained version, a different fine-tuning recipe, or a prompt/system instruction change — you need to validate it on real traffic with real users before a full rollout. Offline evaluation is necessary, but offline distributions rarely match production perfectly.
A proper A/B test for model versions requires:
- Traffic splitting — routing a controlled percentage of requests to the new model.
- Metric tracking per variant — logging all relevant metrics separately for control (Model A) and treatment (Model B).
- Statistical significance testing — ensuring the observed difference isn't noise before making a decision.
- Guardrail metrics — metrics that must not regress even if the primary metric improves (e.g., latency, error rate).
3.2 Setting Up Traffic Splitting
A simple, deterministic traffic splitter based on a hash of the request ID:
```python
import hashlib
from enum import Enum

class ModelVariant(str, Enum):
    CONTROL = "control"
    TREATMENT = "treatment"

def assign_variant(
    request_id: str,
    treatment_pct: float = 0.1,  # 10% to new model
    experiment_id: str = "exp_001"
) -> ModelVariant:
    """Deterministic, sticky variant assignment."""
    hash_input = f"{experiment_id}:{request_id}".encode()
    hash_value = int(hashlib.md5(hash_input).hexdigest(), 16)
    bucket = (hash_value % 10000) / 10000.0  # 0.0 to 1.0
    if bucket < treatment_pct:
        return ModelVariant.TREATMENT
    return ModelVariant.CONTROL

class ABTestRouter:
    def __init__(self, control_model, treatment_model, treatment_pct=0.1):
        self.models = {
            ModelVariant.CONTROL: control_model,
            ModelVariant.TREATMENT: treatment_model
        }
        self.treatment_pct = treatment_pct

    def route(self, request_id: str, prompt: str) -> tuple[str, ModelVariant]:
        variant = assign_variant(request_id, self.treatment_pct)
        response = self.models[variant].generate(prompt)
        return response, variant
```
3.3 Statistical Significance Testing
Never promote a new model based on raw metric differences alone. Use proper statistical tests:
```python
import numpy as np
from scipy import stats

def analyze_ab_results(
    control_metrics: list[float],
    treatment_metrics: list[float],
    metric_name: str,
    alpha: float = 0.05,
    min_effect_size: float = 0.01
) -> dict:
    """
    Welch's t-test for comparing model variants.
    Returns whether to promote the treatment model.
    """
    control = np.array(control_metrics)
    treatment = np.array(treatment_metrics)
    t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)
    # Cohen's d for effect size
    pooled_std = np.sqrt((control.std()**2 + treatment.std()**2) / 2)
    cohens_d = (treatment.mean() - control.mean()) / (pooled_std + 1e-9)
    # Confidence interval for the difference (normal approximation)
    diff_mean = treatment.mean() - control.mean()
    diff_se = np.sqrt(control.var()/len(control) + treatment.var()/len(treatment))
    ci_lower = diff_mean - 1.96 * diff_se
    ci_upper = diff_mean + 1.96 * diff_se
    statistically_significant = p_value < alpha
    practically_significant = abs(cohens_d) >= min_effect_size
    treatment_wins = treatment.mean() > control.mean()
    recommend_promotion = (
        statistically_significant and
        practically_significant and
        treatment_wins
    )
    return {
        "metric": metric_name,
        "control_mean": round(control.mean(), 4),
        "treatment_mean": round(treatment.mean(), 4),
        "relative_improvement_pct": round((treatment.mean() - control.mean()) / control.mean() * 100, 2),
        "p_value": round(p_value, 4),
        "cohens_d": round(cohens_d, 4),
        "ci_95": [round(ci_lower, 4), round(ci_upper, 4)],
        "statistically_significant": statistically_significant,
        "practically_significant": practically_significant,
        "recommend_promotion": recommend_promotion,
        "n_control": len(control),
        "n_treatment": len(treatment)
    }
```
3.4 Guardrail Metrics
Define metrics that must not regress before any new model ships, regardless of primary metric gains:
```python
from dataclasses import dataclass

@dataclass
class GuardrailCheck:
    metric_name: str
    control_value: float
    treatment_value: float
    threshold: float  # Max allowed regression
    direction: str    # "lower_is_better" or "higher_is_better"
    passed: bool

def check_guardrails(
    control_metrics: dict[str, float],
    treatment_metrics: dict[str, float],
    guardrails: dict[str, dict]
) -> list[GuardrailCheck]:
    """
    Example guardrails config:
    {
        "p99_latency_ms": {"threshold": 0.05, "direction": "lower_is_better"},
        "error_rate": {"threshold": 0.001, "direction": "lower_is_better"},
        "toxicity_rate": {"threshold": 0.0, "direction": "lower_is_better"}
    }
    """
    results = []
    for metric_name, config in guardrails.items():
        control_val = control_metrics[metric_name]
        treatment_val = treatment_metrics[metric_name]
        threshold = config["threshold"]
        direction = config["direction"]
        if direction == "lower_is_better":
            regression = treatment_val - control_val
            passed = regression <= threshold
        else:
            regression = control_val - treatment_val
            passed = regression <= threshold
        results.append(GuardrailCheck(
            metric_name=metric_name,
            control_value=control_val,
            treatment_value=treatment_val,
            threshold=threshold,
            direction=direction,
            passed=passed
        ))
    return results
```
Part 4: Building a Monitoring Dashboard
4.1 What Your Dashboard Must Show
A production-grade model monitoring dashboard should give stakeholders answers to five questions at a glance:
- Is the model healthy right now? (Real-time quality metrics, error rates, latency)
- Is anything drifting? (Input distribution, output distribution, performance over time)
- How does today compare to baseline? (Trend lines, rolling averages)
- What is this costing us? (Inference cost per request, tokens consumed)
- Are there active experiments? (A/B test status and current results)
4.2 Logging Infrastructure
Structured logging is the foundation of any monitoring system. Every inference should emit a structured log event:
```python
import json
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class InferenceLog:
    request_id: str
    timestamp: str
    model_id: str
    variant: str  # "control" or "treatment" for A/B tests
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cost_usd: float
    quality_score: Optional[float]
    drift_flags: list[str]
    metadata: dict

def log_inference(
    model_id: str,
    prompt: str,
    response: str,
    latency_ms: float,
    variant: str = "production",
    quality_score: Optional[float] = None
) -> InferenceLog:
    # Rough token estimation (replace with your actual tokenizer)
    prompt_tokens = len(prompt.split()) * 1.3
    completion_tokens = len(response.split()) * 1.3
    cost_per_1k = 0.003  # Example rate
    log = InferenceLog(
        request_id=str(uuid.uuid4()),
        timestamp=datetime.now(timezone.utc).isoformat(),
        model_id=model_id,
        variant=variant,
        prompt_tokens=int(prompt_tokens),
        completion_tokens=int(completion_tokens),
        latency_ms=latency_ms,
        cost_usd=(prompt_tokens + completion_tokens) / 1000 * cost_per_1k,
        quality_score=quality_score,
        drift_flags=[],
        metadata={}
    )
    # Emit to your logging backend (CloudWatch, Datadog, etc.)
    print(json.dumps(asdict(log)))
    return log
```
4.3 Integrating with Observability Platforms
In 2026, the most common production stacks for LLM monitoring combine purpose-built tools:
Prometheus + Grafana for operational metrics:
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
inference_requests = Counter(
    "model_inference_requests_total",
    "Total inference requests",
    ["model_id", "variant", "status"]
)

inference_latency = Histogram(
    "model_inference_latency_seconds",
    "Inference latency in seconds",
    ["model_id", "variant"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

quality_score_gauge = Gauge(
    "model_quality_score",
    "Rolling average quality score",
    ["model_id", "variant"]
)

drift_score_gauge = Gauge(
    "model_drift_score",
    "Current drift score",
    ["model_id", "feature"]
)

# start_http_server(8000)  # expose /metrics for Prometheus to scrape

def record_inference_metrics(
    model_id: str,
    variant: str,
    latency_s: float,
    success: bool,
    quality_score: float = None
):
    status = "success" if success else "error"
    inference_requests.labels(model_id, variant, status).inc()
    inference_latency.labels(model_id, variant).observe(latency_s)
    if quality_score is not None:
        # Callers may pass a precomputed rolling average here
        quality_score_gauge.labels(model_id, variant).set(quality_score)
```
LangSmith, Weights & Biases, and Arize are strong choices for LLM-specific tracing and evaluation pipelines for teams running LLMs at scale.
4.4 Alerting Rules
Don't wait for users to report problems. Configure proactive alerts:
```yaml
# Example Prometheus alerting rules
groups:
  - name: model_monitoring
    rules:
      - alert: ModelQualityDegraded
        expr: avg_over_time(model_quality_score[30m]) < 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Model quality dropped below 0.75 threshold"

      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(model_inference_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "p95 latency exceeded 5 seconds"

      - alert: InputDriftDetected
        expr: model_drift_score > 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Significant input drift detected"
```
🚀 Pro Tips
- **Shadow mode before A/B testing** — run your new model in shadow mode (it receives real inputs but doesn't serve real users) to collect offline quality metrics before exposing any users to it. This catches catastrophic failures cheaply; a minimal sketch follows this list.
- **Track input length distributions** — for LLMs, sudden changes in average prompt length are a cheap early warning signal for input drift, detectable without any labeling.
- **Use stratified sampling for labels** — when sampling production traffic for human labeling, stratify by input type, user segment, or edge-case category to maximize coverage rather than labeling random samples.
- **Set per-segment alerts, not just global ones** — a model might look fine globally but be degrading badly for a specific user segment or topic category. Monitor slices.
- **Version your evaluation datasets** — your golden eval set should be versioned alongside your model. As the task evolves, you need to update both together and maintain historical comparisons.
- **Automate rollback triggers** — define a threshold for any guardrail metric (e.g., error rate > 2%) that automatically triggers a rollback to the previous model, without requiring human intervention in the critical path.
- **Monitor your monitoring** — test your alerting pipeline regularly. A silent monitoring system is worse than no monitoring because it creates false confidence.
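Shadow mode is mostly plumbing: send each request to both models, serve only the incumbent's answer, and log the challenger's output for offline comparison. A minimal sketch; in a real system the shadow call should run off the request path (a queue or async task) so it never adds user-facing latency:

```python
import logging

logger = logging.getLogger("shadow")

class ShadowModeRouter:
    """Serve the production model; log the shadow model's output for offline eval."""
    def __init__(self, production_model, shadow_model):
        self.production_model = production_model
        self.shadow_model = shadow_model

    def generate(self, request_id: str, prompt: str) -> str:
        response = self.production_model.generate(prompt)
        try:
            # Simplification: in production, run this off the request path
            shadow_response = self.shadow_model.generate(prompt)
            logger.info(
                "shadow_comparison",
                extra={"request_id": request_id,
                       "production": response,
                       "shadow": shadow_response},
            )
        except Exception:
            # Shadow failures must never affect user-facing traffic
            logger.exception("shadow model failed for %s", request_id)
        return response
```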
Common Mistakes to Avoid
Relying solely on offline evaluation. Offline benchmarks give you a starting point, not a production guarantee. Always close the loop with production metrics.
Not establishing a baseline before launch. You need a reference distribution to detect drift against. Collect and store your day-one traffic distributions as a baseline artifact.
Ignoring latency percentiles. Average latency hides pain. p99 latency is what your worst-affected users experience. Always track and alert on tail latency.
Running underpowered A/B tests. Ending an experiment early because you see a positive trend is a textbook mistake. Calculate your required sample size upfront using power analysis and commit to it.
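A standard two-sample power analysis gives you the required sample size before the experiment starts. A sketch using statsmodels, assuming the smallest effect worth detecting is 0.1 in Cohen's d units:

```python
from statsmodels.stats.power import TTestIndPower

# Smallest effect (in Cohen's d units) worth detecting -- a choice you must make
MIN_DETECTABLE_EFFECT = 0.1

analysis = TTestIndPower()
n_per_variant = analysis.solve_power(
    effect_size=MIN_DETECTABLE_EFFECT,
    alpha=0.05,       # acceptable false positive rate
    power=0.8,        # 1 - acceptable false negative rate
    alternative="two-sided",
)
print(f"Required samples per variant: {int(n_per_variant) + 1}")
```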
Treating all drift as bad. Sometimes drift reflects genuine shifts in user behavior that your model should adapt to — not a failure. Correlate drift signals with business metrics before alarming.
Skipping human evaluation. Automated metrics are efficient but imperfect. Budget for periodic human evaluation on sampled production outputs, especially after major model updates.
📌 Key Takeaways
- **Evaluation doesn't end at deployment.** Production monitoring is a continuous process that starts on day one and never stops. The real world is more complex and dynamic than any test set.
- **Combine quantitative and qualitative metrics.** BLEU scores and accuracy matter, but so do user thumbs-down rates, escalations, and LLM-judge scores that capture semantic quality.
- **Drift detection is your early warning system.** Monitor input distributions, embedding spaces, and output distributions using statistical tests, and alert before business metrics degrade.
- **A/B testing requires statistical rigor.** Always run tests to statistical significance, define guardrail metrics upfront, and use shadow mode before exposing users to new model versions.
- **Your monitoring dashboard is a product.** Invest in making it clear, actionable, and shared across engineering, product, and business stakeholders.
Conclusion
Shipping a fine-tuned model to production is not the finish line — it's the starting gun. The models that continue to deliver value over months and years are the ones backed by robust evaluation pipelines, continuous monitoring, and disciplined experimentation processes.
The tooling landscape has matured enormously in 2026. There's no longer an excuse for "set it and forget it" model deployment. Whether you're running a fine-tuned BERT classifier or a domain-specialized LLM, the patterns in this guide — metrics tracking, drift detection, A/B testing, and observability dashboards — give you the infrastructure to iterate with confidence.
Start with the metrics that matter most for your specific task, build your logging foundation first, then layer on drift detection and experimentation infrastructure. The compounding value of a well-monitored model system is enormous: faster iteration cycles, fewer production incidents, and the ability to prove business impact with data.
References
- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. ACL 2020.
- Klaise, J., Van Looveren, A., Vacanti, G., & Coca, A. Alibi Detect: Algorithms for Outlier, Adversarial, and Drift Detection. https://github.com/SeldonIO/alibi-detect
- Shankar, S., et al. (2022). Operationalizing Machine Learning: An Interview Study. arXiv:2209.09125.
- Nguyen, C. V., Hassner, T., Seshadri, M., & Archambeau, C. (2020). LEEP: A New Measure to Evaluate Transferability of Learned Representations. ICML 2020.
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
- Evidently AI Documentation: https://docs.evidentlyai.com
- Weights & Biases LLM Monitoring Guide: https://docs.wandb.ai/guides/prompts
- Prometheus Alerting Best Practices: https://prometheus.io/docs/practices/alerting
- LangSmith Documentation: https://docs.smith.langchain.com