Novel Findings
Confirmed novel contributions from the nuphirho.dev research programme. Last updated: 31 May 2026
Each finding on this page has been verified by independent language models searching the primary literature across the relevant fields. Only work published before the finding was documented and tested is considered; literature that post-dates the verification run is not counted as prior art. A finding is recorded only when the verification is unanimous that no published work makes the connection explicitly.
Verified: no substantive prior art found.
Verified*: verified, with adjacency to existing work documented below the finding.
Cluster 1: Mathematical epistemology and specification theory
Finding 1: Tarski's undefinability applied to executable specifications
A specification language that is expressive enough to describe all correct behaviours of a system cannot, within that language, fully define what makes a specification valid. The language used to specify correctness cannot itself be specified in the same terms. This is Tarski's undefinability theorem applied to the specification layer of AI-assisted software development, a connection not found in the prior software engineering or formal methods literature.
Finding 2: Curry-Howard correspondence applied to BDD scenarios
A BDD scenario has the same logical structure as a constructive proof. The Given clause is the assumption, the When clause is the inference step, and the Then clause is the conclusion. This structural correspondence between specification scenarios and formal proofs has not been identified in the prior software engineering or logic literature.
Curry-Howard is a structural analogy rather than a formal isomorphism when applied to natural language BDD scenarios.
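As a minimal sketch of the analogy in Finding 2, and only for the idealised case where the clauses are treated as formal propositions, the Given/When/Then shape can be written as a one-step proof in Lean 4. The names G, T, given, and step are illustrative assumptions, not part of any BDD tool.

```lean
-- G: the Given clause, T: the Then clause (hypothetical propositions).
-- `given` is evidence for the assumption; `step` is the When clause,
-- read as an inference from G to T. Applying it yields the conclusion.
example (G T : Prop) (given : G) (step : G → T) : T :=
  step given
```

Natural language scenarios do not type-check this way, which is why the correspondence is described as structural rather than a formal isomorphism.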
Finding 3: Mutation testing as Bayesian epistemology (epoché connection)
Running a mutation testing suite is a Bayesian updating process. Each surviving mutant shifts the probability distribution over the correctness of the specification. When no further mutants can be designed, the practitioner has reached an epistemic state that mirrors the Pyrrhonian epoché: the evidence has been exhausted and suspension of judgment is warranted. This connection between mutation testing and philosophical epistemology has not been found in the prior testing or philosophy of science literature.
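One way to make the Bayesian reading of Finding 3 concrete is a Beta-Bernoulli update, sketched below. This is an illustrative assumption rather than the programme's own model: it tracks belief about the test suite's detection rate (the chance that a seeded fault is caught), with a uniform prior, and treats each mutant as one independent observation. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(alpha, beta) belief over the suite's fault-detection rate."""
    alpha: float = 1.0  # prior pseudo-count of killed mutants (uniform prior)
    beta: float = 1.0   # prior pseudo-count of surviving mutants

    def update(self, killed: bool) -> "Beta":
        # Conjugate update: each mutant result is one Bernoulli observation.
        if killed:
            return Beta(self.alpha + 1, self.beta)
        return Beta(self.alpha, self.beta + 1)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def posterior_after(results: list[bool], prior: Beta = Beta()) -> Beta:
    """Fold a sequence of mutant outcomes (True = killed) into the prior."""
    belief = prior
    for killed in results:
        belief = belief.update(killed)
    return belief


# Hypothetical run: 8 mutants killed, 2 survived.
belief = posterior_after([True] * 8 + [False] * 2)
print(round(belief.mean, 3))
```

Under these assumptions every surviving mutant pulls the posterior mean down and every killed mutant pulls it up; once no new mutants can be designed, the posterior simply stops moving.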
Finding 4: Category A-E defect taxonomy mapped against Rice's theorem and abstract interpretation
Most defect taxonomies organise failures by severity or phase of introduction. This taxonomy organises them by specifiability: whether a failure mode can be expressed as a verifiable requirement. Category A defects are fully specifiable and automatically checkable. Category E defects are formally unspecifiable by Rice's theorem. No prior taxonomy in software engineering or formal methods uses specifiability as its organising principle.
Cluster 2: Ancient philosophy and AI governance
Finding 5: Phaedrus charioteer allegory as harness engineering etymology
The word "harness" in software engineering traces etymologically to the charioteer allegory in Plato's Phaedrus, where the harness is the constraint structure through which the charioteer, reason, holds the two horses of spirit and appetite in productive tension. The test harness, the evaluation harness, and the agentic harness each instantiate the same structure: a constraint layer that keeps the executing system within the bounds set by the governing intent. This connection between the engineering vocabulary and its philosophical origin has not been previously noted in the software engineering or philosophy of technology literature.
Finding 6: Stoic prohairesis applied to prompt engineering
The Stoic concept of prohairesis is the deliberate choice of response to circumstances the agent cannot control. A governance prompt addresses the same design problem: it specifies what a system does with inputs it did not choose and outcomes it cannot guarantee. The structural correspondence between Stoic faculty theory and governance prompt design has not been found in the prior AI governance or philosophy of technology literature.
Finding 7: Stoic hegemonikon as governance layer analogy
The Stoic hegemonikon is the governing faculty that coordinates all other faculties without performing any of their functions directly. A governance layer in an agentic system holds the same structural position: it specifies the constraints under which all agent capabilities operate without executing any capability itself. The correspondence between the Stoic governing faculty and the AI governance layer has not been found in the prior literature.
Finding 8: Aristotelian entelechy applied to agentic flywheel
Aristotle's entelechy is the process by which a thing moves toward its full actualisation through the exercise of its proper function. The agentic flywheel, in which each iteration of specification, implementation, and verification moves the system closer to the intent encoded in the governing document, is an instantiation of entelechy in software development. This connection has not been found in the prior software engineering or philosophy of technology literature.
Finding 9: Aristotelian architecton as human-on-the-loop analogy
Aristotle's architecton is the master builder who holds the plan and directs the execution without laying bricks. The human-on-the-loop governance model holds the same position: the human holds the specification and retains authority over outcomes without performing the execution steps. The architecton is the first recorded formulation of this governance structure. The correspondence has not been found in the prior AI governance or philosophy of technology literature.
Finding 10: Pyrrhonian criterion problem applied to governance prompt evaluation
The Pyrrhonian criterion problem asks how you evaluate an evaluation instrument without an independent standard to evaluate it against. Any standard you use was itself chosen by some criterion, and that criterion requires its own justification. This regress is the foundational epistemological challenge of AI governance prompt evaluation: any scoring framework for governance prompts must eventually justify its own scoring criteria. No prior AI governance framework has identified this as the core problem of governance prompt evaluation.
The meta-evaluation challenge (how to evaluate an evaluation instrument without circular reasoning) is recognised in AI evaluation discourse through concepts such as reward model circularity, benchmark validity, and evaluator bias. The Pyrrhonian and Agrippan formulations applied specifically to governance prompt scoring are absent from that literature. The ancient sceptical framing is the novel contribution; the underlying epistemological concern is not.
Cluster 3: Measurement theory
Finding 11: LLM evaluation as projection measurement (governance subspace formalism)
An LLM evaluation instrument functions as a projection operator. It takes the full space of possible model outputs and projects them onto a subspace defined by the governance criteria it tests. The score is a measure of how much of the output lands within that subspace. This geometric interpretation of evaluation, grounded in operator theory and measurement theory, has not been applied to LLM evaluation in the prior literature.
Perrier (2025, arXiv:2507.05587) independently identifies the absence of formal measurement theory in AI evaluation as a critical gap, without the governance subspace projection formalism.
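The projection reading of Finding 11 can be illustrated with a small numerical sketch. It assumes, purely for illustration, that a model output is already represented as a real vector and that the governance criteria span a subspace with a known orthonormal basis; the score is then taken to be the fraction of the output's squared norm that survives projection. The function names and the choice of score formula are assumptions, not the programme's formalism.

```python
Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def project(v: Vector, basis: list[Vector]) -> Vector:
    """Orthogonal projection of v onto span(basis).

    The basis vectors are assumed to be orthonormal.
    """
    out = [0.0] * len(v)
    for b in basis:
        coeff = dot(v, b)
        out = [o + coeff * x for o, x in zip(out, b)]
    return out


def subspace_score(v: Vector, basis: list[Vector]) -> float:
    """Share of v's squared norm that lies inside the governance subspace."""
    norm_sq = dot(v, v)
    if norm_sq == 0.0:
        return 0.0
    p = project(v, basis)
    return dot(p, p) / norm_sq


# Hypothetical output vector and a one-dimensional governance subspace.
output = [3.0, 4.0, 0.0]
governance_basis = [[1.0, 0.0, 0.0]]
print(subspace_score(output, governance_basis))
```

Projecting twice gives the same vector as projecting once, which is the defining property of a projection operator and the reason a second pass of the same instrument should add no new information.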
Finding 30: Absence of representational measurement theory grounding in AI evaluation frameworks
The closest adjacent work is Voudouris, K. et al. (2026), Measuring What AI Systems Might Do: Towards A Measurement Science in AI (arXiv:2603.00063, 10 Feb 2026; Helmholtz Munich and Cambridge). The paper argues that AI evaluation practices rarely specify what quantity they purport to measure, and proposes that capabilities and propensities are dispositional properties requiring measurement of counterfactual relationships. It is the most important adjacent paper identified for this finding: it establishes the broadest measurement critique of AI evaluation practice but does not propose axiomatic grounding in the representational theory of measurement (RTM). The distinction: Voudouris et al. identify what must be measured; Finding 30 identifies that the axiomatic structure required to measure it properly is absent.
Cluster 4: Organisational design
Finding 12: Governance profiles applied to human team role design
The structural properties of a well-formed AI governance prompt map directly onto the design requirements for a human team role description. Both specify a purpose, a scope boundary, success criteria, and escalation conditions. The techniques developed to improve AI governance documents are transferable to human role design in AI-augmented teams. This connection has not been found in the prior AI governance or organisational design literature.
Cluster 5: Software engineering taxonomy
Finding 13: Five-category defect taxonomy organised by specifiability (Categories A-E)
Most defect taxonomies organise failures by severity or type. This taxonomy organises them by specifiability: the degree to which a failure mode can be expressed as a verifiable requirement. Category A defects are fully specifiable and automatically checkable. Category E defects are formally unspecifiable by Rice's theorem and require human judgment. No existing taxonomy in testing or formal methods literature uses specifiability as its organising principle.
Cluster 6: Measurement methodology
Finding 14: Scoring distribution as instrument reliability via framing-variation
If a scoring instrument is reliable, its scores should be stable when the same question is reframed in surface form without changing its substance. Running the instrument against framing variants and treating the resulting score distribution as a reliability measure has not been proposed as an instrument design principle in the prior AI evaluation literature.
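A minimal sketch of the framing-variation check in Finding 14 follows. The scoring callable, the variants, and the summary statistics chosen here are assumptions for illustration; a real instrument would replace toy_instrument, and the acceptable spread would need to be set separately.

```python
import statistics
from typing import Callable, Sequence


def framing_reliability(score: Callable[[str], float],
                        variants: Sequence[str]) -> dict[str, float]:
    """Score every framing variant and summarise the spread of scores."""
    scores = [score(v) for v in variants]
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.pstdev(scores),  # spread across the variants
        "range": max(scores) - min(scores),
    }


# Hypothetical stand-in for a real instrument. It is deliberately sensitive
# to the surface word "must", which is the kind of flaw the check exposes.
def toy_instrument(text: str) -> float:
    return 0.8 if "must" in text else 0.6


variants = [
    "The agent must escalate on failure.",
    "On failure, the agent escalates.",
    "Escalation is required when the agent fails.",
]
print(framing_reliability(toy_instrument, variants))
```

A reliable instrument would return a spread near zero on these variants, since all three state the same requirement.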
Finding 15: Fixed point convergence applied to evaluation instruments as operators
If a governance prompt evaluation instrument is a well-behaved operator, iterating it on the same document should converge to a stable score. A document that scores differently on repeated application reveals something about the instrument rather than the document. This fixed-point property has not been applied to evaluation instrument design in the prior AI governance literature.
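The repeated-application check in Finding 15 could be sketched as below. The stopping rule (successive scores within a tolerance for a set number of consecutive runs), the default values, the replay helper, and the file name "governance.md" are all hypothetical choices made for illustration.

```python
import itertools
from typing import Callable, Iterable, Optional


def runs_until_stable(score: Callable[[str], float], document: str,
                      tol: float = 0.01, patience: int = 3,
                      max_runs: int = 20) -> Optional[int]:
    """Re-score the same document until successive scores stop moving.

    Returns the number of runs used, or None if the score never settles.
    """
    previous = score(document)
    stable = 0
    for run in range(2, max_runs + 1):
        current = score(document)
        # Count consecutive runs whose score barely differs from the last.
        stable = stable + 1 if abs(current - previous) <= tol else 0
        if stable >= patience:
            return run
        previous = current
    return None


def replay(scores: Iterable[float]) -> Callable[[str], float]:
    """Hypothetical instrument that replays a fixed sequence of scores."""
    it = iter(scores)
    return lambda _document: next(it)


settling = replay([0.5, 0.7, 0.7, 0.7, 0.7])
oscillating = replay(itertools.cycle([0.2, 0.8]))
print(runs_until_stable(settling, "governance.md"))
print(runs_until_stable(oscillating, "governance.md"))
```

In this sketch a None result points at the instrument, not the document, because the document is held constant across every run.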
Finding 17: Positive/negative complementary scoring as instrument calibration diagnostic
Evaluating a governance claim positively and negatively against the same document, then treating the degree to which the two scores sum to 1.0 as a calibration diagnostic, has not been proposed as an instrument design principle in the prior AI evaluation literature. A well-calibrated instrument should produce scores that are complementary: if the document scores 0.7 on a positive framing, it should score approximately 0.3 on the corresponding negative framing.
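The complementarity check in Finding 17 reduces to a small calculation, sketched here. The tolerance is a hypothetical value, since the text only requires the scores to be approximately complementary.

```python
def calibration_gap(positive: float, negative: float) -> float:
    """Distance of the two scores from perfect complementarity (sum of 1.0)."""
    return abs(positive + negative - 1.0)


def is_calibrated(positive: float, negative: float, tol: float = 0.05) -> bool:
    # tol is an assumed tolerance, not a value taken from the source.
    return calibration_gap(positive, negative) <= tol


# The example from the text: 0.7 on the positive framing, 0.3 on the negative.
print(is_calibrated(0.7, 0.3))
# A miscalibrated pair: both framings score high on the same document.
print(is_calibrated(0.7, 0.6))
```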
Finding 16: Silent specification drift as coordination model governance failure mode
When an AI agent's governing specification drifts from its deployment context without any explicit signal that the document no longer governs the system it was written for, the agent continues operating under the authority of a document that has lost its validity. The failure is silent: no error is raised, no exception is logged, and no human is notified. This failure mode has no established name in the prior AI coordination or governance literature.
Cluster 8: SE professionalisation and convergence
Finding 18: AI reduces the marginal cost of coordinating knowledge across engineering disciplines
AI reduces the marginal cost of applying rigorous engineering practices across disciplinary boundaries. The same assistance that makes mutation testing viable for a solo developer also makes DO-178C-style traceability viable for a governance practitioner. The three components of this claim exist separately in prior literature, but their joint formulation as a cost-reduction mechanism for cross-disciplinary specification rigour has not been found in the prior literature.
Cluster 9: Cross-domain specification quality
Finding 19: PromptQ principles as domain-agnostic completeness calculus
The structural principles that appear in AI governance documents are not unique to AI. Equivalent requirements appear in aviation certification, medical device regulation, nuclear safety cases, and legal drafting. What is novel is their systematic absence from AI governance documents at scale, and the absence of the institutional mechanisms that compensate for them in those other domains. No prior work has documented this gap empirically or proposed an automated measurement instrument for it.
The principles themselves are not novel. Their systematic absence from AI governance, the automated measurement of that absence, and the missing institutional compensating mechanisms are the novel contribution.
Cluster 10: Agentic coordination and governance drift
Finding 20: Handoff quality as a function of accuracy, latency, and bidirectional negotiation bandwidth
A governance handoff is complete only when both parties have recorded it. The sending party knowing what was transmitted is not sufficient; the receiving party must acknowledge receipt before being held accountable for the governance it has received. Governance handed over without acknowledgement produces silent authority gaps: the receiving agent operates under prior governance, not the intended governance, with no signal that the transfer failed.
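The two-party rule in Finding 20 can be sketched as a small record type. The field names and the identifiers "policy-v1" and "policy-v2" are hypothetical; the point of the sketch is only that completion requires both records, and that an unacknowledged transfer can be surfaced instead of failing silently.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    governance_id: str
    sender_recorded: bool = False
    receiver_acknowledged: bool = False

    @property
    def complete(self) -> bool:
        # Complete only when BOTH parties have recorded the handoff.
        return self.sender_recorded and self.receiver_acknowledged


def authority_gap(handoff: Handoff) -> bool:
    """True when governance was sent but never acknowledged."""
    return handoff.sender_recorded and not handoff.receiver_acknowledged


def effective_governance(prior_id: str, handoff: Handoff) -> str:
    """Governance the receiver is actually operating under."""
    return handoff.governance_id if handoff.complete else prior_id


handoff = Handoff("policy-v2", sender_recorded=True)
print(effective_governance("policy-v1", handoff), authority_gap(handoff))
handoff.receiver_acknowledged = True
print(effective_governance("policy-v1", handoff), authority_gap(handoff))
```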
Cluster 11: Compliance policy precision and governance
Finding 21: Human interpretive correction as unconscious compensating control; removal under agent execution compounds
Humans read between the lines. They carry context that the policy does not contain, and they apply it without being asked to. AI agents, operating without that context and without any reason to deviate from the written rule, follow the system to the letter. The interpretive layer that absorbed policy imprecision in human-executed governance disappears under agent execution.
Cluster 12: Regulatory compliance and governance readiness
Finding 22: PromptQ as Article 14 structural compliance readiness framework
Article 14 of the EU AI Act requires that high-risk AI systems be deployed with effective human oversight mechanisms. Three of the five Article 14(4) requirements map directly onto the PromptQ structural quality principles, and two map partially. A governance document that passes a PromptQ evaluation at full score provides the structural basis for Article 14 compliance readiness. No prior framework has demonstrated this mapping or proposed a document-level instrument for Article 14 readiness assessment.
Finding 23: Mandatory Tier 1 audit sampling as structural requirement for Article 14 delegation
When an AI system delegates a decision to a human who lacks the technical capacity to meaningfully override it, Article 14's human oversight requirement is not met by the presence of a human in the loop. It requires that a proportion of decisions be audited at depth by someone with the technical capacity to identify failures. This mandatory audit sampling is a structural requirement that follows from Article 14's delegation provisions, and its absence from current AI governance practice has not been identified in the prior literature.
Finding 24: Retrieval-layer governance asymmetry
When AI agents retrieve information from external sources, the governance document governing the agent cannot specify the retrieved content in advance. This creates a structural asymmetry between the governance document (static, authorship-time) and the operational inputs (dynamic, runtime). No existing AI governance instrument addresses this asymmetry as a distinct failure mode.
Finding 25: Two-layer AI governance methodology
Effective AI governance requires two distinct layers that address different failure modes at different timescales. The authorship-time layer governs the structural quality of the document that specifies system behaviour: it must be complete, unambiguous, internally consistent, and current. The runtime layer governs execution: it must enforce the document's constraints, detect deviation, and escalate appropriately. Most current governance frameworks address one layer without the other. No prior methodology has formalised the two-layer structure or identified the failure modes that arise when either layer is absent.
Finding 31: Evaluation estimands and governance estimands as structurally distinct types
Evaluation frameworks measure what an AI system does. Governance frameworks specify what it is authorised to do. These are different questions answered by different instruments, yet most current deployments treat evaluation performance as evidence of governance compliance. A system that performs well on an evaluation benchmark has not demonstrated that it operates within its governance boundaries. No published framework formally distinguishes evaluation estimands from governance estimands as structurally distinct types, or characterises the failure mode that arises from conflating them.
Cluster 13: Governance document lifecycle
Finding 26: Governance document feedback loop
A governance document is not a static artefact. It is one node in a feedback loop: the document specifies behaviour, the system executes it, the execution produces evidence, the evidence informs whether the specification was correct, and the specification is updated accordingly. When this loop is absent, the document degrades silently: it specifies behaviour that the system no longer produces, without any signal reaching the author. No prior AI governance framework has formalised this feedback loop or identified its absence as a distinct governance failure mode.
Finding 27: Epoch limits on governance document validity
Governance documents have an inherent validity horizon determined by the stability of the context for which they were written. Three trigger types exist: time-based (calendar interval), event-based (discrete change to deployment context), and assumption-based (foundational assumptions under which the document was written no longer hold).
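The three trigger types in Finding 27 could be checked with a routine like the one sketched below. The ValidityPolicy fields, the 90-day interval, and the example assumption and event names are hypothetical; the sketch only shows the three triggers being evaluated independently.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ValidityPolicy:
    validated_on: date
    review_interval: timedelta
    # Assumption name -> whether it still holds in the deployment context.
    assumptions: dict[str, bool] = field(default_factory=dict)


def expired_triggers(policy: ValidityPolicy, today: date,
                     context_events: list[str]) -> list[str]:
    """Return which of the three trigger types have fired."""
    fired = []
    if today - policy.validated_on > policy.review_interval:
        fired.append("time")        # calendar interval elapsed
    if context_events:
        fired.append("event")       # discrete change to deployment context
    if not all(policy.assumptions.values()):
        fired.append("assumption")  # a foundational assumption no longer holds
    return fired


policy = ValidityPolicy(date(2026, 1, 1), timedelta(days=90),
                        {"single-tenant deployment": True})
print(expired_triggers(policy, date(2026, 2, 1), []))
print(expired_triggers(policy, date(2026, 6, 1), ["model upgraded"]))
```

An empty result means the document is still inside its validity horizon; any non-empty result is a prompt for revalidation.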
Finding 28: Proof surfaces as the feedback mechanism for governance revalidation
A governance document makes claims. A proof surface is the set of mechanisms through which those claims can be verified: the tests that confirm the system did what the document said it would do, the evidence that the scope boundary held, the record that shows the authority chain remained intact. When a governance document does not specify its own proof surface at authorship time, there is no defined basis for revalidation. The document cannot be shown to have remained valid because validity was never operationally defined. No existing AI governance framework requires a proof surface to be declared at authorship time as a structural property of the document itself.
Cluster 16: Regulatory science
Finding 29: No structural completeness requirement for AI governance documents in any confirmed jurisdiction
No published regulatory instrument in any confirmed jurisdiction requires the document governing an AI system (system prompt, AGENTS.md, governance policy) to meet a structural completeness criterion before deployment.
Finding 32: Within-session governance authority decay
A governance document can be structurally complete at authorship time and still lose effective authority during a single session. As operational history accumulates in an agent's context window, the governing specification is progressively displaced by accumulated content. The agent continues executing, but under the effective influence of its history rather than its governing document. The mechanism is content-driven, not length-driven: replacing history with governance reinsertion substantially restores compliance. No published AI governance framework identifies this as a distinct failure mode or proposes governance reinsertion as a design primitive.
Liu et al. (2026, arXiv:2605.08060) demonstrate empirically that expanding context window history degrades cooperative intent in multi-agent systems. The governance authority framing is the programme's extension of those empirical findings.
Cluster 14: Citation integrity and verification methodology
Finding 33: Memory-Seeded Confabulation as a distinct citation failure mode
The closest adjacent work is Lathkar (2026), who identifies that providing a confirmed intermediate fact in a reasoning chain increases confident wrong-answer rates before full evidence resolves the chain. Both involve accurate context seeding incorrect completion. The distinction lies in the task (reasoning chain completion versus citation generation) and in the seed (an intermediate fact anchor versus persistent memory context).
Finding 34: Identifier Mismatch as a distinct citation failure pattern
Finding 35: Session isolation as a formally proposed mitigation for memory-contaminated verification
Finding 36: D1-D5 v1.1 as a PromptQ-scored multi-dimension citation verification framework
A five-dimension verification protocol for AI-generated citations with explicit pass/fail logic (D1 as hard gate, D5 failure as Overreach regardless of other scores), scope context standards for three distinct task contexts (scientific, practitioner, news), and re-evaluation triggers including scope drift detection. The integrated combination with these properties is not present in existing citation verification tools (CiteCheck, CiteAudit, RAGAS, TruLens). Individual dimensions have precedents in CRAAP, FActScore, and systematic review methodology; the operationalised combination with PromptQ governance scoring (6.0/7), scope standards, and scope drift mechanism is novel.