Research

Selected papers on LLM reliability, evaluation, and robust systems.

Papers

2026

CertiPatch: Specification Repair for Frozen Language Models with Replayable Empirical Certificates

A train-certify-verify pipeline that combines counterexample-guided constrained repair with reproducibility gates, producing fail-closed evidence for production sign-off.

Why this matters: Turns model repair into a verifiable workflow teams can trust during production sign-off.

Authors: Ali Uyar

llm-reliability, reproducibility, certification

DOI: 10.5281/zenodo.18541322

2026

AvalancheLLM: Token-Layer Activation Event Cascades in LLMs under Gain Scaling

Rate-matched diagnostics with strong controls and deterministic artifact packaging to detect instability under gain and scaling changes.

Why this matters: Helps teams detect brittle model behavior before it becomes a production incident.

Authors: Ali Uyar

mechanistic-analysis, llm, stability

DOI: 10.5281/zenodo.18435925

2026

RIA: Retokenization Invariance Atlas

Deterministic audits of how semantics-preserving formatting changes affect LLM QA, guarded by no-truncation and semantics gates.

Why this matters: Provides auditable reliability evidence when prompt formatting changes could alter outcomes.

Authors: Ali Uyar

tokenization, evaluation, auditability

DOI: 10.5281/zenodo.18682875

2026

ScrubID: Identifiability-Aware Auditing for Mechanistic Interpretability

Claim auditing with non-identifiability certificates, providing verifiable interpretability evidence.

Why this matters: Reduces interpretability overclaims by requiring evidence that can be independently checked.

Authors: Ali Uyar

interpretability, auditing, safety

DOI: 10.5281/zenodo.18330420

2026

CacheMedic++: Robust KV-Cache Stabilization via Self-Distillation

Deterministic KV-cache corruption protocols and distillation-driven repair operators with seed/holdout replication and bootstrap confidence intervals.

Why this matters: Improves serving stability by making KV-cache failures reproducible, diagnosable, and repairable.

Authors: Ali Uyar

kv-cache, distillation, robustness

DOI: 10.5281/zenodo.18669268

2026

CIS Technical Report

Technical report documenting the CIS reliability approach, evaluation criteria, and implementation notes for production-grade AI systems.

Why this matters: Translates deterministic reliability design into a practical report teams can adopt in real delivery environments.

Authors: Ali Uyar

technical-report, reliability, systems

DOI: 10.5281/zenodo.18234776

2026

Discover-Then-Distill: Separating Test-Time Adaptation from Self-Distillation Consolidation

Compute-matched discover-then-distill pipeline with ablation suite and retention-transfer diagnostics for clear separation of adaptation and consolidation effects.

Why this matters: Clarifies how to separate short-term adaptation from durable model improvement in production tuning loops.

Authors: Ali Uyar

test-time-adaptation, distillation, evaluation

DOI: 10.5281/zenodo.18634022

Research Repositories

Related code and experiment repositories.

retokenization-invariance-atlas

Prompt reformatting can silently change model outputs. The retokenization-invariance-atlas repository is a deterministic benchmark for retokenization effects that helps teams validate robustness before deployment.

ScrubID

Interpretability claims are often hard to verify. ScrubID is an identifiability-aware auditing toolkit for mechanistic studies that supports evidence-based, reproducible conclusions.

certipatch

Frozen models can fail their specifications with no safe retraining path available. The certipatch repository is a reproducible specification-repair framework with replayable certificates that enables fix-and-verify workflows teams can trust.

discover-then-distill-paper

Adaptation gains can be confused with distillation gains. The discover-then-distill-paper repository is a compute-matched experimental framework that separates the two effects, yielding clearer tuning decisions.

AvalancheLLM

Activation cascades can make model behavior unstable under scaling. AvalancheLLM is an experimental analysis suite for token-layer event cascades that helps detect brittleness earlier.

cis-technical-report

Reliability practices are often documented inconsistently. The cis-technical-report repository holds a structured technical report on the CIS methodology, giving teams a practical implementation reference.

cachemedicpp-research

KV-cache failures can be hard to reproduce and fix. The cachemedicpp-research repository is an experimental toolkit for cache corruption and stabilization studies that supports more reliable inference.