Independent Security Research — Whitby, North Yorkshire, United Kingdom
DOI: 10.5281/zenodo.19855615 · ORCID: 0009-0008-4518-0064
Automated vulnerability triage of compiled binaries presents a combinatorial challenge that scales poorly with binary size and complexity. We present TriageForge, a four-stage pipeline applying random matrix theory (RMT), Boolean satisfiability phase-transition analysis, cyclomatic complexity gating, and symbolic execution to identify anomalous code structure in macOS ARM64 binaries. The spectral screen (C2) models a binary’s call graph as a weighted adjacency matrix and z-scores three statistics — largest eigenvalue (λmax), graph energy, and eigenvalue entropy — against a configuration-model null calibrated to the observed degree sequence. Functions passing the screen are prioritised by a SAT backbone proximity score (C1); a cyclomatic complexity gate (C3) applies five structural dataflow templates; high-priority candidates receive full symbolic taint analysis (C6) via angr. Applied to 335 macOS 26 PrivateFrameworks binaries, the pipeline yielded a 3.6% anomaly detection rate with zero exploitable buffer overflows confirmed. We characterise three false-positive sources and discuss theoretical limitations.
Keywords: binary analysis, random matrix theory, SAT phase transitions, symbolic execution, vulnerability triage, call graph spectral analysis
Security researchers face an adversarial triage problem: given a large corpus of compiled binaries, identify the small fraction containing exploitable vulnerabilities and, within those, the precise functions warranting costly symbolic execution or manual reverse engineering. The state of practice relies on human expertise developed over years of experience — expertise that does not scale. macOS 26 ships with over 3,000 Mach-O binaries in framework and private framework directories, each containing several hundred functions on average.
TriageForge is built around the observation that call graph spectral properties discriminate between typical and atypical code structure in a statistically principled way. The approach draws on three bodies of theory: random matrix theory [1,2], SAT phase-transition analysis [6,10], and symbolic execution [13]. Together, these provide a fast, theory-grounded screening mechanism that reduces the candidate space before expensive analysis begins.
The paper is organised as follows. Section 2 surveys the relevant mathematical background. Section 3 describes the TriageForge pipeline. Section 4 provides theoretical analysis. Section 5 presents an empirical evaluation on 335 macOS PrivateFrameworks binaries. Sections 6 and 7 discuss limitations and conclude.
Random matrix theory, initiated by Wigner in the study of nuclear energy levels [1,2], characterises the statistical properties of eigenvalue spectra of large random matrices. The central result is the Wigner semicircle law: for an n×n symmetric matrix with i.i.d. entries of mean zero and variance σ², the empirical spectral distribution converges as n→∞ to:
The Tracy–Widom distribution [3] characterises the limiting distribution of the largest eigenvalue: the normalised quantity n2/3(λmax / (σ√n) − 2) converges in distribution to the Tracy–Widom law of order 1 (GOE).
Graph energy, defined by Gutman [9] as E(G) = ∑|λi|, captures total spectral mass. Eigenvalue entropy H = −∑ pi log pi, where pi = |λi| / ∑|λj|, measures dispersion of spectral mass and is sensitive to structural regularities absent from random models.
For call graphs with non-uniform degree sequences, the appropriate null distribution is the configuration model [4]: the maximum-entropy random graph ensemble consistent with the observed degree sequence. TriageForge z-scores each spectral statistic against this null, using an analytical approximation for λmax and graph energy calibrated by Monte Carlo degree-sequence-preserving rewiring [16].
Random 3-SAT instances exhibit a sharp satisfiability phase transition [6]. As the clause-to-variable ratio α = m/n increases through a critical threshold αc, satisfiability probability drops from near-1 to near-0. Extensive numerical studies place this at αc ≈ 4.267 [6,10], with theoretical support from the cavity method [7]. Near the transition, backbone variables — those taking the same value in all satisfying assignments — constitute a significant fraction, correlating with computational hardness [10].
McCabe [5] defined cyclomatic complexity V(G) = E − N + 2P for a program control flow graph G = (N, E). V(G) > 10 is the classical threshold for high-complexity functions, widely used as a proxy for defect likelihood [17].
TriageForge processes binaries through four sequential stages with aggressive early termination. C2 provides a binary-level triage signal in approximately 28 seconds; full C1–C6 analysis of a flagged binary averages 8–12 minutes.
| Stage | Name | Input | Output |
|---|---|---|---|
| C1 | SAT Backbone | Control flow graph | Function rank |
| C2 | RMT Screen | Call graph | Binary flag (NORMAL / ANOMALOUS) |
| C3 | Template Scan | C1 top-N functions | Dataflow hits |
| C6 | Symbolic Taint | C3 hits | PoC candidate |
The function's control flow graph is encoded as 3-SAT-like reachability constraints. The clause-to-variable ratio α(f) is computed and a chi-squared proximity score χ²(f) = (α(f) − αc)² / αc is derived. Functions with small χ² — near the phase transition — are ranked highest, as proximity to αc correlates empirically with tightly-coupled computational structure.
The call graph is extracted via lief [11] for Mach-O parsing and capstone [12] for control-flow reconstruction. For ARM64e binaries, authenticated pointer chains are resolved through the DYLD chained fixup map. Three spectral statistics are z-scored against the configuration-model null. A binary is flagged ANOMALOUS if |z| > 2.0 for any statistic.
Engineering note. Swift and Objective-C binaries produce degenerate z = 0 results because BLR (branch-to-register) indirect calls are unresolvable statically. This is an expected limitation of static call graph analysis.
Five structural dataflow templates are evaluated on C1-ranked functions: MACH_OOB, XPC_TYPE, INT_OVF, PORT_UAF, and IOKIT_OOB. Each specifies a source (attacker-controlled input), a sink (memory operation), and barrier conditions (bounds checks, type validations). A V(G) > 10 cyclomatic complexity gate is enforced.
The angr framework [8] places taint marks at source operations and propagates them symbolically. Z3 synthesises a concrete input on finding a taint-to-sink path without intervening bounds checks. On Apple ARM64e, Pointer Authentication Codes (PAC) block function-pointer hijacking even where write primitives exist; C6 adjusts impact classification accordingly.
RMT universality results [2,3] establish that spectral statistics of many random matrix ensembles follow the same limiting distributions regardless of entry distribution. Two mechanisms drive discrimination:
Negative deviations. Cryptographic S-boxes and sorting routines implement near-regular call subgraphs (uniform out-degree in S-boxes; near-binary-tree structure in merge sort). For a call graph with heterogeneous degree distribution, such regularity produces large negative z-scores on energy and entropy — spectral mass is more concentrated than the null predicts.
Positive deviations. Complex XPC dispatch tables generate hub nodes with high in-degree, shifting λmax upward and producing large positive z-scores. Both anomaly directions are diagnostically useful but require different interpretations.
The mapping from function control flow graphs to SAT-like constraints follows naturally from symbolic execution [13]: the control flow graph induces reachability constraints between basic blocks, and encoding these in conjunctive normal form yields a clause-to-variable ratio α characterising the function's logical density.
The C1 score is a heuristic prioritisation signal rather than a rigorous phase-transition argument: software functions are far from random 3-SAT instances. It performs well in practice because high-scoring functions consistently implement complex protocol parsing or message dispatch — precisely the functions of security interest.
Symbolic execution is theoretically sound for bounded programs but suffers from path explosion in practice. TriageForge mitigates this by applying C6 only to C1–C3 pre-screened functions and by using angr's loop bounding and state merging. The C6 stage is a candidate PoC generator, not a sound vulnerability prover.
The evaluation corpus comprises 335 Mach-O binaries from /System/Library/PrivateFrameworks/ on macOS 26.4.1 (build 25E253, Apple Silicon ARM64e). Binary sizes ranged from 47 KB to 84 MB. All analyses used the ARM64e slice.
C2 flagged 12 of 335 binaries (3.6%) as ANOMALOUS. The remaining 323 (96.4%) were classified NORMAL and required no further analysis. All 12 candidates were fully resolved through C3 evaluation and manual reverse engineering. Zero exploitable buffer overflows were confirmed.
| Binary | Anomaly | Verdict |
|---|---|---|
| FindMyMacd | E, H | CLOSED — 37/37 memcpy callers have explicit bounds checks |
| mediasharingd | H (−8.44) | CLOSED — DMAP/DAAP TLV source is constant global |
| AppleCredentialManagerDaemon | E, H | CLOSED — entitlement gate; all callers checked |
| ecosystemd | λmax (+2.56) | CLOSED — Swift runtime sorting artifact |
| slapadd / slapacl / slapauth / slapcat | E, λmax | ARTIFACT — OpenLDAP S-box; do not file |
Category 1: Cryptographic lookup tables. OpenLDAP slap* S-box tables produce near-regular subgraphs with extreme negative z-scores on energy and entropy (z(energy) ∈ [−0.61, −1.55]). All four are third-party code bundled with macOS. Detection heuristic: identical binary family with uniform anomaly signature.
Category 2: Standard library sorting. Swift runtime tagged-pointer merge sort produces a near-regular subgraph indistinguishable spectrally from cryptographic regularity. Heuristic: presence of swift_retain/swift_release import symbols.
Category 3: No network attack surface. catutil and VoiceBankingDiagnostics have anomalous spectral properties from complex parsing routines but receive no data from unprivileged network sources. Automated attack surface reachability assessment remains an open problem.
On a 64 GB AMD Ryzen 9 Linux workstation, C2 processes a 3 MB ARM64e binary in approximately 28 seconds (FastC2 path: lief+capstone, no symbolic execution). Full C1–C6 analysis averages 8–12 minutes. The 335-binary sweep ran as a multi-day distributed campaign.
The configuration model null conditions on degree sequence only. A more accurate null would also condition on clustering coefficient [19], capturing the modular structure inherent in software call graphs. This extension is computationally feasible and is a direction for future work.
The C1 SAT backbone mapping is heuristic. Software control flow graphs are far from random 3-SAT instances; proximity to αc is a useful empirical signal but carries no formal guarantee of backbone fraction or computational hardness.
TriageForge is a prioritisation tool, not a sound vulnerability prover. The empirical false negative rate is unknown without a corpus of ground truth vulnerabilities. All closed findings were verified by full manual analysis. The pipeline was not deployed on any binary subsequently found to contain an exploitable vulnerability, leaving detection coverage an open question.
(1) Binary provenance heuristics to suppress third-party artifact signatures. (2) Dynamic call graph enrichment via DTrace to resolve BLR indirect calls. (3) Null distributions conditioned on clustering coefficient. (4) Extension to /System/Library/Frameworks for cross-corpus comparison.
We have presented TriageForge, a binary vulnerability triage pipeline grounded in random matrix theory, SAT phase transition analysis, cyclomatic complexity theory, and symbolic execution. The C2 spectral screen provides a statistically principled binary-level signal in approximately 28 seconds per binary, reducing a 335-binary corpus to 12 candidates (3.6%) for deeper analysis.
All 12 candidates were fully resolved; zero exploitable vulnerabilities were confirmed. The false positive taxonomy — cryptographic lookup tables, standard library sorting, and no-attack-surface binaries — provides actionable guidance for pipeline refinement. Spectral complexity screening represents a promising, mathematically principled direction for large-scale binary security analysis.
The author is neurodivergent (autism, ADHD). Claude (Anthropic) was used as assistive technology during the preparation of this paper: for drafting and proofreading, structural editing, formatting of equations, tables and citations, and discussion of mathematical clarity. The underlying research — problem formulation, pipeline design, software implementation, empirical evaluation on the 335-binary corpus, theoretical analysis, and interpretation of results — is the author’s own work.
The use of AI assistive technology is consistent with the principles of the Equality Act 2010: disability is defined as a protected characteristic under Section 6; reasonable adjustments are contemplated by Sections 20–21; and discrimination arising from disability is addressed by Section 15. This acknowledgement is provided in the spirit of transparent and accessible research practice.