Research
Selected papers on LLM reliability, evaluation, and robust systems.
Papers
2026
CertiPatch: Specification Repair for Frozen Language Models with Replayable Empirical Certificates
Train-certify-verify pipeline with counterexample-guided constrained repair and reproducibility gates for fail-closed evidence in production sign-off.
Why this matters: Turns model repair into a verifiable workflow teams can trust during production sign-off.
Authors: Ali Uyar
Tags: llm-reliability, reproducibility, certification
DOI: 10.5281/zenodo.18541322
2026
AvalancheLLM: Token-Layer Activation Event Cascades in LLMs under Gain Scaling
Rate-matched diagnostics with strong controls and deterministic artifact packaging to detect instability under gain and scaling changes.
Why this matters: Helps teams detect brittle model behavior before it becomes a production incident.
Authors: Ali Uyar
Tags: mechanistic-analysis, llm, stability
DOI: 10.5281/zenodo.18435925
2026
RIA: Retokenization Invariance Atlas
Deterministic audits for semantics-preserving formatting effects in LLM QA with no-truncation and semantics gates.
Why this matters: Provides auditable reliability evidence when prompt formatting changes could alter outcomes.
Authors: Ali Uyar
Tags: tokenization, evaluation, auditability
DOI: 10.5281/zenodo.18682875
2026
ScrubID: Identifiability-Aware Auditing for Mechanistic Interpretability
Claim auditing with non-identifiability certificates to provide verifiable interpretability evidence and reduce overclaiming.
Why this matters: Reduces interpretability overclaims by requiring evidence that can be independently checked.
Authors: Ali Uyar
Tags: interpretability, auditing, safety
DOI: 10.5281/zenodo.18330420
2026
CacheMedic++: Robust KV-Cache Stabilization via Self-Distillation
Deterministic KV-cache corruption protocols and distillation-driven repair operators with seed/holdout replication and bootstrap confidence intervals.
Why this matters: Improves serving stability by making KV-cache failures reproducible, diagnosable, and repairable.
Authors: Ali Uyar
Tags: kv-cache, distillation, robustness
DOI: 10.5281/zenodo.18669268
2026
CIS Technical Report
Technical report documenting the CIS reliability approach, evaluation criteria, and implementation notes for production-grade AI systems.
Why this matters: Translates deterministic reliability design into a practical report teams can adopt in real delivery environments.
Authors: Ali Uyar
Tags: technical-report, reliability, systems
DOI: 10.5281/zenodo.18234776
2026
Discover-Then-Distill: Separating Test-Time Adaptation from Self-Distillation Consolidation
Compute-matched discover-then-distill pipeline with ablation suite and retention-transfer diagnostics for clear separation of adaptation and consolidation effects.
Why this matters: Clarifies how to separate short-term adaptation from durable model improvement in production tuning loops.
Authors: Ali Uyar
Tags: test-time-adaptation, distillation, evaluation
DOI: 10.5281/zenodo.18634022
Research Repositories
Related code and experiment repositories.
retokenization-invariance-atlas: a deterministic benchmark for retokenization effects. Prompt reformatting can silently change model outputs, and this benchmark helps teams validate robustness before deployment.
ScrubID: an identifiability-aware auditing toolkit for mechanistic studies. Interpretability claims are often hard to verify, and this toolkit supports evidence-based, reproducible conclusions.
certipatch: a reproducible specification-repair framework with replayable certificates. Frozen models can violate specifications with no safe path to retraining, and this framework supports a fix-and-verify workflow.
discover-then-distill-paper: a compute-matched experimental framework that separates adaptation gains from distillation gains. The two are easily confused, and separating them leads to clearer tuning decisions.
AvalancheLLM: an experimental analysis suite for token-layer event cascades. Activation cascades can make model behavior unstable under scaling, and this suite helps detect that brittleness earlier.
cis-technical-report: a structured technical report on the CIS methodology. Reliability practices are often documented inconsistently, and this report gives teams a practical implementation reference.
cachemedicpp-research: an experimental toolkit for cache corruption and stabilization studies. KV-cache failures can be hard to reproduce and fix, and this toolkit supports more reliable inference.