Regression Test Selection for Updated Capability Modules in Compositional ML Systems via Atomic-Quality Probes
Overview
arXiv:2604.26689v4
Compositional machine-learning (ML) systems assemble runtime behavior from libraries of independently re-trained capability modules. Replacing one module raises a regression-testing question that static dependence analysis cannot answer: which existing compositions stay valid, and at what test cost? We frame capability updates as regression test selection (RTS) and contribute four results. First, a paired cross-version swap protocol isolates the marginal effect of a single module update. Second, on two contact-rich manipulation tasks we characterize a dominant-skill effect: one capability module reaches 88.0% atomic success while siblings stay at or below 32.0%, and its inclusion shifts composition success by up to 52 percentage points; a controlled weight-space interpolation tracks composition success against atomic quality point-by-point (pooled Pearson r=0.94), and the effect replicates on a second task, where the governing module must lie on the critical path of the phase sequence. Third, off-policy behavioral-distance metrics fail to identify the dominant module. Fourth, a margin-gated Hybrid Selector matches full revalidation at zero per-decision test cost (75.0% gold-label agreement, with no detectable difference) and reaches 81.25% match at half of full-revalidation cost, beating a cost-matched random budget (Monte-Carlo p=0.039). A resolution analysis shows that coarse evaluation overstates the apparent advantage of full revalidation. The atomic-quality probe gives a principled test-selection criterion for capability-update regression testing in compositional ML systems.
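The abstract names a margin-gated selector driven by an atomic-quality probe but does not spell out its decision rule. The sketch below is one plausible reading, not the authors' implementation: it assumes the probe yields an atomic success rate for the old and the new module version, decides from the probe alone when the two rates differ by at least a margin, and falls back to composition-level revalidation otherwise. The names `margin_gated_select`, `Decision`, `revalidate`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Decision:
    accept_update: bool  # keep compositions that use the updated module?
    tests_run: int       # composition-level test episodes spent on this decision
    source: str          # "probe" or "revalidation"


def margin_gated_select(
    atomic_old: float,
    atomic_new: float,
    margin: float,
    revalidate: Callable[[], Tuple[bool, int]],
) -> Decision:
    """Hypothetical margin gate (an assumption, not the paper's exact rule).

    atomic_old / atomic_new: atomic success rates in [0, 1] from the probe.
    margin: minimum absolute gap at which the probe alone is trusted.
    revalidate: callback that runs composition tests and returns
        (update_is_acceptable, number_of_tests_run).
    """
    delta = atomic_new - atomic_old
    if abs(delta) >= margin:
        # Clear gap: decide from the probe, spending no composition tests.
        return Decision(accept_update=delta > 0, tests_run=0, source="probe")
    # Ambiguous gap: pay for composition-level revalidation.
    ok, cost = revalidate()
    return Decision(accept_update=ok, tests_run=cost, source="revalidation")


def fake_revalidation() -> Tuple[bool, int]:
    # Stand-in for a real test run; the verdict and cost are made up.
    return True, 50


# Made-up probe readings, chosen only to exercise both branches.
clear = margin_gated_select(0.60, 0.85, margin=0.10, revalidate=fake_revalidation)
close = margin_gated_select(0.60, 0.62, margin=0.10, revalidate=fake_revalidation)
print(clear)
print(close)
```

In this sketch the margin is the knob that trades test cost against risk: a wider margin sends more decisions to revalidation, while a narrower one trusts the probe more often.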
Source
Originally published at arxiv.org.
Source: https://arxiv.org/abs/2604.26689