Novel Findings — nuphirho

Each finding on this page has been verified by independent language models searching the primary literature across the relevant fields. A finding is recorded only when that verification is unanimous: no run found published work making the connection explicitly. Only work published before the finding was documented and tested counts as prior art; literature that post-dates the verification run does not.

Verified: no substantive prior art found.
Verified*: verified, with adjacency to existing work documented below the finding.

Cluster 1: Mathematical epistemology and specification theory

Finding 1: Tarski's undefinability applied to executable specifications

A specification language that is expressive enough to describe all correct behaviours of a system cannot, within that language, fully define what makes a specification valid. The language used to specify correctness cannot itself be specified in the same terms. This is Tarski's undefinability theorem applied to the specification layer of AI-assisted software development, a connection not found in the prior software engineering or formal methods literature.

Verified

Finding 2: Curry-Howard correspondence applied to BDD scenarios

A BDD scenario has the same logical structure as a constructive proof. The Given clause is the assumption, the When clause is the inference step, and the Then clause is the conclusion. This structural correspondence between specification scenarios and formal proofs has not been identified in the prior software engineering or logic literature.

Verified*

Curry-Howard is a structural analogy rather than a formal isomorphism when applied to natural language BDD scenarios.

Finding 3: Mutation testing as Bayesian epistemology (epoché connection)

Running a mutation testing suite is a Bayesian updating process. Each surviving mutant shifts the probability distribution over the correctness of the specification. When no further mutants can be designed, the practitioner has reached an epistemic state that mirrors the Pyrrhonian epoché: the available evidence has been exhausted and suspension of further judgment is warranted. This connection between mutation testing and philosophical epistemology has not been found in the prior testing or philosophy of science literature.
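The updating step can be made concrete with a Beta-Bernoulli model. The sketch below is an illustration under stated assumptions, not the method behind the finding: it treats each mutant as an independent trial, models the suite's detection rate with a Beta distribution, and updates on both killed and surviving mutants. The uniform prior, the class name `MutationBelief`, and the sample run are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationBelief:
    """Beta(alpha, beta) belief about the probability that the suite kills a mutant."""

    alpha: float = 1.0  # assumed uniform prior: one pseudo-kill
    beta: float = 1.0   # assumed uniform prior: one pseudo-survivor

    def update(self, killed: bool) -> "MutationBelief":
        # Conjugate update: a killed mutant raises alpha, a survivor raises beta.
        if killed:
            return MutationBelief(self.alpha + 1, self.beta)
        return MutationBelief(self.alpha, self.beta + 1)

    @property
    def mean(self) -> float:
        # Posterior mean of the detection rate.
        return self.alpha / (self.alpha + self.beta)


belief = MutationBelief()
for killed in [True, True, False, True]:  # hypothetical mutation run
    belief = belief.update(killed)

print(f"posterior mean detection rate: {belief.mean:.3f}")
```

Under this model a surviving mutant lowers the posterior mean and a killed mutant raises it. Once no further mutants can be designed, the posterior stops moving.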

Verified

Finding 4: Category A-E defect taxonomy mapped against Rice's theorem and abstract interpretation

Most defect taxonomies organise failures by severity or phase of introduction. This taxonomy organises them by specifiability: whether a failure mode can be expressed as a verifiable requirement. Category A defects are fully specifiable and automatically checkable. Category E defects are formally unspecifiable by Rice's theorem. No prior taxonomy in software engineering or formal methods uses specifiability as its organising principle.

Verified

Cluster 2: Ancient philosophy and AI governance

Finding 5: Phaedrus charioteer allegory as harness engineering etymology

The word "harness" in software engineering traces etymologically to the Platonic charioteer allegory in Plato's Phaedrus, where the harness is the constraint structure that holds the two horses of reason and appetite in productive tension. The test harness, the evaluation harness, and the agentic harness each instantiate the same structure: a constraint layer that keeps the executing system within the bounds set by the governing intent. This connection between the engineering vocabulary and its philosophical origin has not been previously noted in the software engineering or philosophy of technology literature.

Verified

Finding 6: Stoic prohairesis applied to prompt engineering

The Stoic concept of prohairesis is the deliberate choice of response to circumstances the agent cannot control. A governance prompt addresses the same design problem: it specifies what a system does with inputs it did not choose and outcomes it cannot guarantee. The structural correspondence between Stoic faculty theory and governance prompt design has not been found in the prior AI governance or philosophy of technology literature.

Verified

Finding 7: Stoic hegemonikon as governance layer analogy

The Stoic hegemonikon is the governing faculty that coordinates all other faculties without performing any of their functions directly. A governance layer in an agentic system holds the same structural position: it specifies the constraints under which all agent capabilities operate without executing any capability itself. The correspondence between the Stoic governing faculty and the AI governance layer has not been found in the prior literature.

Verified

Finding 8: Aristotelian entelechy applied to agentic flywheel

Aristotle's entelechy is the process by which a thing moves toward its full actualisation through the exercise of its proper function. The agentic flywheel, in which each iteration of specification, implementation, and verification moves the system closer to the intent encoded in the governing document, is an instantiation of entelechy in software development. This connection has not been found in the prior software engineering or philosophy of technology literature.

Verified

Finding 9: Aristotelian architecton as human-on-the-loop analogy

Aristotle's architecton is the master builder who holds the plan and directs the execution without laying bricks. The human-on-the-loop governance model holds the same position: the human holds the specification and retains authority over outcomes without performing the execution steps. The architecton is the first recorded formulation of this governance structure. The correspondence has not been found in the prior AI governance or philosophy of technology literature.

Verified

Finding 10: Pyrrhonian criterion problem applied to governance prompt evaluation

The Pyrrhonian criterion problem asks how you evaluate an evaluation instrument without an independent standard to evaluate it against. Any standard you use was itself chosen by some criterion, and that criterion requires its own justification. This regress is the foundational epistemological challenge of AI governance prompt evaluation: any scoring framework for governance prompts must eventually justify its own scoring criteria. No prior AI governance framework has identified this as the core problem of governance prompt evaluation.

Verified*

The meta-evaluation challenge (how to evaluate an evaluation instrument without circular reasoning) is recognised in AI evaluation discourse through concepts such as reward model circularity, benchmark validity, and evaluator bias. The Pyrrhonian and Agrippan formulations applied specifically to governance prompt scoring are absent from that literature. The ancient sceptical framing is the novel contribution; the underlying epistemological concern is not.

Cluster 3: Measurement theory

Finding 11: LLM evaluation as projection measurement (governance subspace formalism)

An LLM evaluation instrument functions as a projection operator. It takes the full space of possible model outputs and projects them onto a subspace defined by the governance criteria it tests. The score is a measure of how much of the output lands within that subspace. This geometric interpretation of evaluation, grounded in operator theory and measurement theory, has not been applied to LLM evaluation in the prior literature.
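A minimal numerical sketch of the projection reading follows. It assumes, purely for illustration, that a model output can be embedded as a real vector and that the governance criteria span a subspace with a known orthonormal basis; neither assumption comes from the finding. The function names `project` and `subspace_score` are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def project(output: Vector, basis: Sequence[Vector]) -> list[float]:
    """Orthogonal projection of `output` onto the span of an orthonormal `basis`."""
    projected = [0.0] * len(output)
    for b in basis:
        coeff = dot(output, b)  # component of the output along this basis vector
        projected = [p + coeff * x for p, x in zip(projected, b)]
    return projected


def subspace_score(output: Vector, basis: Sequence[Vector]) -> float:
    """Fraction of the output's squared norm that lies inside the subspace."""
    total = dot(output, output)
    if total == 0.0:
        return 0.0
    p = project(output, basis)
    return dot(p, p) / total


# Hypothetical three-dimensional output and a one-dimensional governance subspace.
output = [3.0, 4.0, 0.0]
governance_basis = [[1.0, 0.0, 0.0]]
print(round(subspace_score(output, governance_basis), 2))
```

Projecting twice gives the same vector as projecting once, which is the defining property of a projection operator. Any component of the output orthogonal to the subspace is invisible to the score.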

Verified*

Perrier (2025, arXiv:2507.05587) independently identifies the absence of formal measurement theory in AI evaluation as a critical gap, without the governance subspace projection formalism.

Finding 30: Absence of representational measurement theory grounding in AI evaluation frameworks

Voudouris, K. et al. (2026), Measuring What AI Systems Might Do: Towards A Measurement Science in AI (arXiv:2603.00063, 10 Feb 2026; Helmholtz Munich and Cambridge), argue that AI evaluation practices rarely specify what quantity they purport to measure, and propose that capabilities and propensities are dispositional properties requiring measurement of counterfactual relationships. This is the most important adjacent paper for Finding 30. It establishes the broadest measurement critique of AI evaluation practice but does not propose RTM axiomatic grounding. The distinction: Voudouris et al. identify what must be measured; Finding 30 identifies that the axiomatic structure required to measure it properly is absent.

Verified*

Cluster 4: Organisational design

Finding 12: Governance profiles applied to human team role design

The structural properties of a well-formed AI governance prompt map directly onto the design requirements for a human team role description. Both specify a purpose, a scope boundary, success criteria, and escalation conditions. The techniques developed to improve AI governance documents are transferable to human role design in AI-augmented teams. This connection has not been found in the prior AI governance or organisational design literature.

Verified

Cluster 5: Software engineering taxonomy

Finding 13: Five-category defect taxonomy organised by specifiability (Categories A-E)

Most defect taxonomies organise failures by severity or type. This taxonomy organises them by specifiability: the degree to which a failure mode can be expressed as a verifiable requirement. Category A defects are fully specifiable and automatically checkable. Category E defects are formally unspecifiable by Rice's theorem and require human judgment. No existing taxonomy in testing or formal methods literature uses specifiability as its organising principle.

Verified

Cluster 6: Measurement methodology

Finding 14: Scoring distribution as instrument reliability via framing-variation

If a scoring instrument is reliable, its scores should be stable when the same question is reframed in surface form without changing its substance. Running the instrument against framing variants and treating the resulting score distribution as a reliability measure has not been proposed as an instrument design principle in the prior AI evaluation literature.
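The check can be sketched as follows. This is a minimal illustration, not the programme's instrument: `toy_score` is a deliberately brittle keyword scorer invented for the example, and the three variants are hypothetical rewordings of one requirement.

```python
from statistics import mean, pstdev
from typing import Callable, Iterable


def framing_reliability(score: Callable[[str], float],
                        variants: Iterable[str]) -> dict[str, float]:
    """Score every framing variant and summarise the resulting distribution."""
    scores = [score(v) for v in variants]
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),            # population standard deviation
        "range": max(scores) - min(scores),  # worst-case disagreement
    }


def toy_score(text: str) -> float:
    # Hypothetical scorer that keys on surface wording rather than substance.
    return 1.0 if "must" in text.lower() else 0.5


variants = [
    "The agent must escalate on failure.",
    "On failure, the agent must escalate.",
    "Escalation is required when the agent fails.",
]

report = framing_reliability(toy_score, variants)
print(report)
```

All three variants state the same requirement, so a reliable instrument would show a spread near zero. The toy scorer does not, because it reacts to the word "must" rather than to the obligation itself.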

Verified

Finding 15: Fixed point convergence applied to evaluation instruments as operators

If a governance prompt evaluation instrument is a well-behaved operator, iterating it on the same document should converge to a stable score. A document that scores differently on repeated application reveals something about the instrument rather than the document. This fixed-point property has not been applied to evaluation instrument design in the prior AI governance literature.
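A repeated-application stability check is one simple way to probe this property. The sketch below only tests whether repeated scoring of an unchanged document stays within a tolerance. The function name `stable_score`, the tolerance, and both toy instruments are assumptions made for illustration.

```python
import itertools
from typing import Callable


def stable_score(instrument: Callable[[str], float], document: str,
                 runs: int = 5, tolerance: float = 0.01) -> tuple[bool, list[float]]:
    """Apply the instrument repeatedly to one document and test score stability."""
    if runs < 2:
        raise ValueError("at least two runs are needed to compare scores")
    scores = [instrument(document) for _ in range(runs)]
    converged = max(scores) - min(scores) <= tolerance
    return converged, scores


def steady_instrument(document: str) -> float:
    # Hypothetical instrument that always returns the same score.
    return 0.8


_drift = itertools.cycle([0.6, 0.8])


def drifting_instrument(document: str) -> float:
    # Hypothetical instrument whose score depends on call order, not on the document.
    return next(_drift)


print(stable_score(steady_instrument, "governance document text"))
print(stable_score(drifting_instrument, "governance document text"))
```

The document is identical across every run, so any spread in the second result is attributable to the instrument.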

Verified

Finding 16: Silent specification drift as coordination model governance failure mode

When an AI agent's governing specification drifts from its deployment context without any explicit signal that the document no longer governs the system it was written for, the agent continues operating under the authority of a document that has lost its validity. The failure is silent: no error is raised, no exception is logged, and no human is notified. This failure mode has no established name in the prior AI coordination or governance literature.

Verified

Finding 17: Positive/negative complementary scoring as instrument calibration diagnostic

Evaluating a governance claim positively and negatively against the same document, then treating the degree to which the two scores sum to 1.0 as a calibration diagnostic, has not been proposed as an instrument design principle in the prior AI evaluation literature. A well-calibrated instrument should produce scores that are complementary: if the document scores 0.7 on a positive framing, it should score approximately 0.3 on the corresponding negative framing.
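The diagnostic reduces to a single quantity, sketched here under the assumption that both scores lie in the interval [0, 1]. The function name `calibration_gap` is hypothetical.

```python
def calibration_gap(positive: float, negative: float) -> float:
    """Distance of the two scores' sum from 1.0; 0.0 means fully complementary."""
    for s in (positive, negative):
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
    return abs(positive + negative - 1.0)


# The complementary pair from the text, then a pair that over-endorses both framings.
print(round(calibration_gap(0.7, 0.3), 6))
print(round(calibration_gap(0.7, 0.6), 6))
```

A gap near zero is consistent with a calibrated instrument. A large gap means the instrument agrees with both the claim and its negation, or with neither.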

Verified

Cluster 8: SE professionalisation and convergence

Finding 18: AI reduces the marginal cost of coordinating knowledge across engineering disciplines

AI reduces the marginal cost of applying rigorous engineering practices across disciplinary boundaries. The same assistance that makes mutation testing viable for a solo developer also makes DO-178C-style traceability viable for a governance practitioner. The three components of this claim exist separately in prior literature, but their joint formulation as a cost-reduction mechanism for cross-disciplinary specification rigour has not been found in the prior literature.

Verified

Cluster 9: Cross-domain specification quality

Finding 19: PromptQ principles as domain-agnostic completeness calculus

The structural principles that appear in AI governance documents are not unique to AI. Equivalent requirements appear in aviation certification, medical device regulation, nuclear safety cases, and legal drafting. What is novel is their systematic absence from AI governance documents at scale, and the absence of the institutional mechanisms that compensate for them in those other domains. No prior work has documented this gap empirically or proposed an automated measurement instrument for it.

Verified*

The principles themselves are not novel. Their systematic absence from AI governance, the automated measurement of that absence, and the missing institutional compensating mechanisms are the novel contribution.

Cluster 10: Agentic coordination and governance drift

Finding 20: Handoff quality as a function of accuracy, latency, and bidirectional negotiation bandwidth

A governance handoff is complete only when both parties have recorded it. The sending party knowing what was transmitted is not sufficient; the receiving party must acknowledge receipt before being held accountable for the governance it has received. Governance handed over without acknowledgement produces silent authority gaps: the receiving agent operates under prior governance, not the intended governance, with no signal that the transfer failed.
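The two-sided completion rule can be sketched as a small record type. This is an illustrative data structure only; the class `Handoff`, its fields, and the version labels are hypothetical and do not describe any existing protocol.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    governance_id: str   # the governance being handed over
    sent_by: str
    received_by: str
    acknowledged: bool = False

    def acknowledge(self, party: str) -> None:
        # Only the receiving party can record receipt.
        if party != self.received_by:
            raise PermissionError("only the receiving party can acknowledge")
        self.acknowledged = True

    @property
    def complete(self) -> bool:
        # Creating the record is the sender's half; acknowledgement is the receiver's.
        return self.acknowledged


def effective_governance(prior: str, handoff: Handoff) -> str:
    """Governance the receiving agent actually operates under."""
    return handoff.governance_id if handoff.complete else prior


handoff = Handoff(governance_id="policy-v2", sent_by="agent-a", received_by="agent-b")
print(effective_governance("policy-v1", handoff))  # still the prior governance
handoff.acknowledge("agent-b")
print(effective_governance("policy-v1", handoff))  # the intended governance
```

Making `complete` depend on the acknowledgement turns the silent authority gap into an observable state: an unacknowledged handoff can be queried and escalated.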

Verified

Cluster 11: Compliance policy precision and governance

Finding 21: Human interpretive correction as unconscious compensating control; removal under agent execution compounds

Humans read between the lines. They carry context that the policy does not contain, and they apply it without being asked to. AI agents, operating without that context and without any reason to deviate from the written rule, follow it to the letter. The interpretive layer that absorbed policy imprecision in human-executed governance disappears under agent execution.

Verified

Cluster 12: Regulatory compliance and governance readiness

Finding 22: PromptQ as Article 14 structural compliance readiness framework

Article 14 of the EU AI Act requires that high-risk AI systems be deployed with effective human oversight mechanisms. Three of the five Article 14(4) requirements map directly onto the PromptQ structural quality principles, and two map partially. A governance document that passes a PromptQ evaluation at full score provides the structural basis for Article 14 compliance readiness. No prior framework has demonstrated this mapping or proposed a document-level instrument for Article 14 readiness assessment.

Verified

Finding 23: Mandatory Tier 1 audit sampling as structural requirement for Article 14 delegation

When an AI system delegates a decision to a human who lacks the technical capacity to meaningfully override it, Article 14's human oversight requirement is not met by the presence of a human in the loop. It requires that a proportion of decisions be audited at depth by someone with the technical capacity to identify failures. This mandatory audit sampling is a structural requirement that follows from Article 14's delegation provisions, and its absence from current AI governance practice has not been identified in the prior literature.

Verified

Finding 24: Retrieval-layer governance asymmetry

When AI agents retrieve information from external sources, the governance document governing the agent cannot specify the retrieved content in advance. This creates a structural asymmetry between the governance document (static, authorship-time) and the operational inputs (dynamic, runtime). No existing AI governance instrument addresses this asymmetry as a distinct failure mode.

Verified

Finding 25: Two-layer AI governance methodology

Effective AI governance requires two distinct layers that address different failure modes at different timescales. The authorship-time layer governs the structural quality of the document that specifies system behaviour: it must be complete, unambiguous, internally consistent, and current. The runtime layer governs execution: it must enforce the document's constraints, detect deviation, and escalate appropriately. Most current governance frameworks address one layer without the other. No prior methodology has formalised the two-layer structure or identified the failure modes that arise when either layer is absent.

Verified

Finding 31: Evaluation estimands and governance estimands as structurally distinct types

Evaluation frameworks measure what an AI system does. Governance frameworks specify what it is authorised to do. These are different questions answered by different instruments, yet most current deployments treat evaluation performance as evidence of governance compliance. A system that performs well on an evaluation benchmark has not demonstrated that it operates within its governance boundaries. No published framework formally distinguishes evaluation estimands from governance estimands as structurally distinct types, or characterises the failure mode that arises from conflating them.

Verified

Cluster 13: Governance document lifecycle

Finding 26: Governance document feedback loop

A governance document is not a static artefact. It is one node in a feedback loop: the document specifies behaviour, the system executes it, the execution produces evidence, the evidence informs whether the specification was correct, and the specification is updated accordingly. When this loop is absent, the document degrades silently: it specifies behaviour that the system no longer produces, without any signal reaching the author. No prior AI governance framework has formalised this feedback loop or identified its absence as a distinct governance failure mode.

Verified

Finding 27: Epoch limits on governance document validity

Governance documents have an inherent validity horizon determined by the stability of the context for which they were written. Three trigger types exist: time-based (calendar interval), event-based (discrete change to deployment context), and assumption-based (foundational assumptions under which the document was written no longer hold).
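The three trigger types can be sketched as a validity check. The field names, the 90-day interval, and the example assumption are hypothetical choices made for illustration; the finding itself does not prescribe them.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum


class Trigger(Enum):
    TIME = "time-based"
    EVENT = "event-based"
    ASSUMPTION = "assumption-based"


@dataclass
class GovernanceDocument:
    authored_on: date
    review_interval: timedelta
    # Discrete changes to the deployment context recorded since authorship.
    context_events: list[str] = field(default_factory=list)
    # Foundational assumptions mapped to whether each still holds.
    assumptions: dict[str, bool] = field(default_factory=dict)

    def fired_triggers(self, today: date) -> set[Trigger]:
        fired: set[Trigger] = set()
        if today - self.authored_on >= self.review_interval:
            fired.add(Trigger.TIME)
        if self.context_events:
            fired.add(Trigger.EVENT)
        if not all(self.assumptions.values()):
            fired.add(Trigger.ASSUMPTION)
        return fired


doc = GovernanceDocument(
    authored_on=date(2025, 1, 1),
    review_interval=timedelta(days=90),
    assumptions={"single-tenant deployment": True},
)
print(doc.fired_triggers(date(2025, 2, 1)))  # inside the validity horizon
print(doc.fired_triggers(date(2025, 6, 1)))  # calendar interval exceeded
```

A document with any fired trigger has passed its validity horizon and is due for revalidation, even if its text is unchanged.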

Verified

Finding 28: Proof surfaces as the feedback mechanism for governance revalidation

A governance document makes claims. A proof surface is the set of mechanisms through which those claims can be verified: the tests that confirm the system did what the document said it would do, the evidence that the scope boundary held, the record that shows the authority chain remained intact. When a governance document does not specify its own proof surface at authorship time, there is no defined basis for revalidation. The document cannot be shown to have remained valid because validity was never operationally defined. No existing AI governance framework requires a proof surface to be declared at authorship time as a structural property of the document itself.

Verified

Cluster 14: Citation integrity and verification methodology

Finding 33: Memory-Seeded Confabulation as a distinct citation failure mode

Verified*

The closest adjacent work is Lathkar (2026), who identifies that providing a confirmed intermediate fact in a reasoning chain increases confident wrong-answer rates before full evidence resolves the chain. Both involve accurate context seeding incorrect completion. The distinction: reasoning chain completion versus citation generation; intermediate fact anchor versus persistent memory context as the seed.

Finding 34: Identifier Mismatch as a distinct citation failure pattern

Verified

Finding 35: Session isolation as a formally proposed mitigation for memory-contaminated verification

Verified

Finding 36: D1-D5 v1.1 as a PromptQ-scored multi-dimension citation verification framework

A five-dimension verification protocol for AI-generated citations with explicit pass/fail logic (D1 as hard gate, D5 failure as Overreach regardless of other scores), scope context standards for three distinct task contexts (scientific, practitioner, news), and re-evaluation triggers including scope drift detection. The integrated combination with these properties is not present in existing citation verification tools (CiteCheck, CiteAudit, RAGAS, TruLens). Individual dimensions have precedents in CRAAP, FActScore, and systematic review methodology; the operationalised combination with PromptQ governance scoring (6.0/7), scope standards, and scope drift mechanism is novel.

Verified

Cluster 15: Instruction Science and governance document design

Finding 37: Style guides lack compliance rubrics

Style guides -- documents specifying writing standards, editorial policy, and content design -- systematically lack a compliance rubric: they specify what content should look like but provide no mechanism for verifying that content conforms to the specification. No style guide in the evaluated corpus included a structured compliance-checking mechanism. The absence is systematic, not incidental.

Verified*

Finding 38: Branding guides lack compliance rubrics

Brand guidelines -- identity manuals, brand playbooks, and brand governance documents -- systematically omit the compliance rubric that would make them auditable. The guides specify desired brand properties without providing a structured check that content produced under those guidelines actually satisfies them. The pattern holds across technology companies, professional services firms, and media organisations.

Verified*

Finding 40: AI adoption measurement gap

No current organisational AI adoption measurement framework captures governance quality, structural completeness, or per-interaction quality as metrics. The standard measurement frameworks measure volume and user satisfaction. The structural quality of the governance layer through which AI interacts with users -- the completeness of the system prompt, the presence of oversight mechanisms, the scope boundaries -- is not measured in any published framework. The measurement gap is systematic and confirmed across the enterprise adoption literature.

Verified

Finding 41: Philosophical grounding for PromptQ empirical claims

The research programme's empirical methodology is most consistent with Lakatos's Methodology of Scientific Research Programmes at the programme level, Longino's contextual empiricism at the epistemological level, and construct validity theory at the instrument level, with a defensible moderate scientific realist interpretation. Strict Popperian falsificationism is inappropriate for an ongoing research programme of this type. The falsifiable claims are second-order: reproducibility of PromptQ scores across independent raters, convergent validity with related measures, and predictive validity against independently established governance outcomes. This four-part combination applied to AI governance measurement is not found in the prior philosophy of science or AI evaluation literature.

Verified

Finding 42: AI as experimental apparatus for unified instruction theory

Using AI systems as experimental apparatus for testing theories from human communication, persuasion, diplomacy, and negotiation disciplines -- with the explicit goal of developing a unified theory of instruction applicable across human and artificial intelligence systems -- is a methodological position not established in the literature. Adjacent research uses AI as social-science experimental subject, applies human communication theories to AI behaviour, or optimises AI for specific communication tasks. None frames AI as controlled apparatus for cross-substrate theory construction with instruction as the unit of analysis.

Verified

Finding 43: Unified theory of instruction across human and AI systems

There is no published unified theory of instruction spanning human communication systems (rhetoric, persuasion, diplomacy, negotiation) and artificial-intelligence instruction-following, and no one has operationalised human communication theories as a unified set of formal instruction variables tested at scale across both human and artificial agents. The programme's reframing of instruction as a coordination mechanism under uncertainty, unifying the two domains, occupies an unclaimed synthesis space. This is a synthesis proposal the programme is building, not a validated pre-existing theory.

Verified*

Prior art cited and distinguished: Gorsky, Caspi and Chajut (2007), human cognitive domain only; the RIGID framework (Kwak and Pardos 2026), an AI-mediated instructional-design workflow, not a unified human-AI theory; Hackenburg et al. (2025), asymmetric AI-persuades-human, not a symmetric instruction-variable system.

Finding 44: Coordination under contested authority as a distinct coordination domain

Coordination under contested authority and legitimacy uncertainty, the problem of whether an instruction is binding when multiple actors claim authority and the legitimacy of those claims is itself uncertain, is a distinct coordination domain supplied by the diplomacy and negotiation literature (two-level games, constructive ambiguity, costly signalling, mandate and recognition verification) and only weakly represented in the organisational-theory grounding of multi-agent coordination. Existing work assumes authority is known and then studies coordination; this framing reverses the dependency, inferring binding force under uncertain legitimacy before coordination proceeds.

Verified

The emerging AI-diplomacy works are real but none develops the contested-authority framing; the IETF multi-agent delegation standards are mechanical only (tokens, scopes, attenuation, audit), distinct from contested legitimacy.

Cluster 16: Regulatory science

Finding 29: No structural completeness requirement for AI governance documents in any confirmed jurisdiction

No published regulatory instrument in any confirmed jurisdiction requires the document governing an AI system (system prompt, AGENTS.md, governance policy) to meet a structural completeness criterion before deployment.

Verified

Finding 32: Within-session governance authority decay

A governance document can be structurally complete at authorship time and still lose effective authority during a single session. As operational history accumulates in an agent's context window, the governing specification is progressively displaced by accumulated content. The agent continues executing, but under the effective influence of its history rather than its governing document. The mechanism is content-driven, not length-driven: replacing history with governance reinsertion substantially restores compliance. No published AI governance framework identifies this as a distinct failure mode or proposes governance reinsertion as a design primitive.

Verified*

Liu et al. (2026, arXiv:2605.08060) demonstrate empirically that expanding context window history degrades cooperative intent in multi-agent systems. The governance authority framing is the programme's extension of those empirical findings.

Cluster 17: Shadow AI governance

Finding 39: Governance-convenience asymmetry in ungoverned AI use

When an organisation leaves employee use of unsanctioned AI ungoverned, it captures the productivity benefit when the output is good while the accountability for failures is pushed onto the individual, on the reasoning that they were operating outside policy. This governance-convenience asymmetry is a characterisation built from settled legal principle, not a settled legal doctrine. Under the close-connection test for vicarious liability and the accountability principle in data protection law, an employer that tolerates and benefits from ungoverned AI use generally retains liability regardless of an internal prohibition, so a paper policy is a weak defence. No court has yet recognised the asymmetry itself as a rule, and organisational liability and individual discipline can coexist. The contribution is to name the asymmetry and its incentive structure: because the upside is shared but the risk is pushed down, the organisation has a weak incentive to bring the practice into the open, the "don't ask, don't tell governance" dynamic.

Verified*

A characterisation built from settled law, not doctrine: the close-connection vicarious-liability line (Lister v Hesley Hall 2001; Mohamud v WM Morrison 2016; Various Claimants v WM Morrison 2020) and GDPR controller accountability (Articles 5(2) and 32) are cited and distinguished. No shadow-AI-specific ruling exists; the asymmetry is analysis, not a court finding.

Finding 45: The AI hiring-versus-enablement contradiction

Organisations increasingly require AI fluency from new hires while under-providing sanctioned tools, training, and clear usage policy to the staff they already have. Large workforce surveys measure both halves of this contradiction, but they report them as separate population statistics rather than as a matched measure within the same organisations, so the contradiction is a defensible cross-source inference rather than a directly-observed within-firm gap. The evidence is nonetheless consistent across independent sources: high and rising employer demand for AI skills alongside low training provision and widespread use of unsanctioned tools. The more recent workforce evidence reframes the same gap as one of organisational rather than individual readiness: workers are ready and their organisations are not.

Verified*

No single source measures the gap within the same organisations; the Microsoft/LinkedIn Work Trend Index carries both halves as separate population statistics, not a matched within-firm instrument. The gap is therefore stated as a cross-source inference, with no single headline percentage attached.

Cluster 18: People-decisions and reliability

Finding 46: Reliability and fairness are non-subsuming governance axes in people-decisions

In people-decisions such as hiring, promotion, performance review, and termination, human and algorithmic judgement fail in structurally different ways that are governed by different instruments. An algorithmic system's dominant observable failure mode is systematic bias: because a single decision function is applied repeatedly, its bias is roughly uniform across cases, visible in the aggregate, auditable by disparate-impact testing, and correctable in one place. Human judgement's dominant observable failure mode is idiosyncratic bias together with noise: equivalent cases receive materially different decisions depending on who decides and on mood, order, and fatigue, which is to say low inter-rater and intra-rater reliability. The decisive and neglected point is that the aggregate fairness audits that catch algorithmic macro-bias (demographic parity, disparate impact, and the four-fifths rule) are structurally blind to human micro-unreliability, because case-by-case noise distributes across groups and washes out in group means while still breaking equity for individuals. Reliability, meaning whether equivalent individuals are treated alike regardless of who evaluates them, and fairness, meaning distributional equity across groups, are therefore non-subsuming, co-equal governance dimensions: neither audit detects the other's characteristic failure.
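The blindness of aggregate audits to case-level noise can be shown with a small worked example. The data below is hypothetical and deliberately extreme: eight equivalent candidates in two groups, each rated independently by two raters who never agree. The function names are likewise illustrative assumptions.

```python
from collections import defaultdict

# (group, decision by rater 1, decision by rater 2); 1 = selected, 0 = rejected.
cases = [
    ("A", 1, 0), ("A", 0, 1), ("A", 1, 0), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 1), ("B", 1, 0), ("B", 0, 1),
]


def selection_rates(cases, rater_index):
    """Per-group selection rate for one rater (index 1 or 2 in each case tuple)."""
    selected, total = defaultdict(int), defaultdict(int)
    for case in cases:
        group = case[0]
        total[group] += 1
        selected[group] += case[rater_index]
    return {g: selected[g] / total[g] for g in total}


def impact_ratio(rates):
    """Lowest group rate divided by highest; the four-fifths rule flags values below 0.8."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


def inter_rater_agreement(cases):
    """Share of cases on which both raters reached the same decision."""
    return sum(1 for _, r1, r2 in cases if r1 == r2) / len(cases)


for rater in (1, 2):
    print(f"rater {rater} impact ratio:", impact_ratio(selection_rates(cases, rater)))
print("inter-rater agreement:", inter_rater_agreement(cases))
```

Both raters pass the four-fifths test with an impact ratio of 1.0, yet they agree on none of the eight cases. The fairness audit reports no problem while every individual outcome depends entirely on who was assigned to decide.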

Verified*

Not a claim that humans are more biased than AI; a claim about each side's dominant, observable, governable failure mode, keeping bias (directional) distinct from noise (variability). A synthesis bridging the human-judgement noise literature (Meehl; Kahneman, Sibony and Sunstein) and the algorithmic-fairness literature (Barocas and Selbst; Kleinberg et al.), which the finding cites and does not claim. Extends the reliability-not-accuracy line and the expert-fingerprint finding into people-decisions.