Perception Tool

Aesthetic experience through prediction error. Agents train on media, then experience new work as surprise relative to learned expectations.

Status: Draft. The Perception Tool is in the design phase, and this page describes the target architecture. It depends on the FANN Intuition Module (Ticket 081) for training infrastructure.

What It Is

The Perception Tool gives agents a perceptual stack for experiencing audiovisual art. Agents train FANN networks on media to build aesthetic models, then use those networks to compute surprise deltas from prediction error: a quantifiable "aesthetic response" that maps to suggested generative parameter mutations.

Core insight: Surprise relative to learned expectation IS aesthetic experience. An agent trained on Bach finds jazz surprising. An agent trained on impressionism finds cubism surprising. The delta between prediction and observation is the experience signal.
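As a minimal sketch of the idea, the surprise signal can be computed as a squashed prediction error. The function name, the RMS error measure, and the `scale` constant are assumptions for illustration; this page does not specify how raw error is normalized into a score.

```python
import math


def surprise(predicted, observed, scale=1.0):
    """Map prediction error to a surprise score in [0, 1).

    `scale` is a hypothetical tuning constant: the RMS error at which
    the score reaches 1 - 1/e (about 0.63).
    """
    if not predicted or len(predicted) != len(observed):
        raise ValueError("predicted and observed must be non-empty and equal length")
    # Root-mean-square error between the model's prediction and the observation.
    mse = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)
    rms = math.sqrt(mse)
    # Exponential squashing: zero error gives 0.0, large error approaches 1.0.
    return 1.0 - math.exp(-rms / scale)
```

A perfectly predicted frame scores 0.0, and the score rises monotonically as the observation departs from what the trained model expected.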

How It Works

The tool auto-detects media type from file format. The agent doesn't specify "audio" or "visual" — it just passes media, and the tool routes appropriately. The output contract (surprise deltas + mutations) is modality-agnostic.
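One way the routing could work is by sniffing the leading bytes of the file, sketched below for the v1 formats (WAV, PNG, JPEG). The function name and the choice of header sniffing over file extensions are assumptions; the design only states that detection is based on file format.

```python
def detect_modality(header: bytes) -> str:
    """Classify media as "audio" or "visual" from its leading bytes."""
    # WAV: a RIFF container whose form type (bytes 8-11) is "WAVE".
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "audio"
    # PNG: fixed 8-byte file signature.
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "visual"
    # JPEG: start-of-image marker FF D8, followed by another FF marker byte.
    if header.startswith(b"\xff\xd8\xff"):
        return "visual"
    raise ValueError("unsupported or unrecognized media format")
```

Reading the first 12 bytes of a file is enough for all three checks, so the caller never has to name the modality.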

Audio Pipeline

Audio files are decoded to PCM, run through FFT, and grouped into Bark-scale frequency bands (24 bands cover the full audible range). Each frame produces a 24-dimensional vector. A composite FANN takes the previous N frames as input and predicts the next frame.
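The band-grouping step might look like the sketch below, which sums FFT bin energy into 24 bands. It assumes the FFT has already been computed and that the input holds magnitudes for the non-negative-frequency bins (bin 0 is DC). The edge frequencies are the commonly tabulated critical-band edges; the function name and the use of summed squared magnitude as the per-band feature are illustrative assumptions.

```python
import bisect

# 25 edges define 24 critical bands (Hz), as commonly tabulated for the Bark scale.
BARK_EDGES_HZ = [
    20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
    2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500,
]


def bark_band_energies(magnitudes, sample_rate):
    """Group one frame of FFT magnitudes into a 24-dimensional energy vector."""
    n_bins = len(magnitudes)
    if n_bins < 2:
        raise ValueError("need at least two FFT bins")
    # With n_fft = 2 * (n_bins - 1), bin k sits at k * sample_rate / n_fft Hz.
    bin_hz = sample_rate / (2 * (n_bins - 1))
    bands = [0.0] * 24
    for k, mag in enumerate(magnitudes):
        band = bisect.bisect_right(BARK_EDGES_HZ, k * bin_hz) - 1
        if 0 <= band < 24:  # bins below 20 Hz or at/above 15.5 kHz are dropped
            bands[band] += mag * mag
    return bands
```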

Why composite? Music is relational. A chord is surprising because of the combination of frequencies, not any single one. Composite models (all bands in, all bands out) capture harmonic structure, which is most of what makes music musical. With 24 bands and a 10-frame context window, the composite FANN has 240 inputs and 24 outputs, which is tractable.
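The input and output sizes above follow from flattening a sliding window of frames. A minimal sketch, assuming each frame is already a 24-dimensional Bark vector (the function name is hypothetical):

```python
def make_training_pairs(frames, context=10):
    """Build (input, target) pairs for next-frame prediction.

    Each input is `context` consecutive frames flattened oldest-first;
    the target is the frame that immediately follows them.
    """
    pairs = []
    for t in range(context, len(frames)):
        window = frames[t - context:t]
        inputs = [value for frame in window for value in frame]
        pairs.append((inputs, list(frames[t])))
    return pairs
```

With 24 bands and `context=10`, every input has 240 values and every target has 24, and a clip of F frames yields F - 10 training pairs.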

Why Bark scale? Psychoacoustic grouping. The Bark scale divides the audible range into 24 critical bands that approximate the frequency resolution of human hearing. Raw FFT bins are too granular, and uniform grouping loses perceptually important structure.

Visual Pipeline

Images are divided into grid patches (e.g., 16×16 = 256 patches). Each patch's color distribution becomes a feature vector. For each patch, the surrounding patches are input and the center patch is the prediction target — a spatial analogue of temporal prediction.
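The spatial prediction setup can be sketched as follows. This assumes "surrounding patches" means the 8 immediate neighbors and that neighbors falling outside the image are zero-padded; neither detail is fixed by this page, and the function name is hypothetical.

```python
def neighbor_context(grid, row, col):
    """Return (inputs, target) for one patch in a 2D grid of feature vectors.

    inputs: the 8 neighboring feature vectors, flattened in row-major order.
    target: the feature vector of the center patch.
    """
    dim = len(grid[row][col])
    inputs = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # the center patch is the prediction target
            r, c = row + dr, col + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                inputs.extend(grid[r][c])
            else:
                inputs.extend([0.0] * dim)  # assumption: zero-pad at the border
    return inputs, list(grid[row][col])
```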

Unexpected focal points, unusual color relationships, and asymmetrical compositions produce high prediction error against the agent's trained visual grammar. The accumulated errors form a spatial surprise map — a heat map of where the image surprised the agent's visual model.
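A spatial surprise map could be assembled from per-patch errors as in this sketch. The mean-squared-error measure and the normalization by the image's own peak error are assumptions made for illustration; peak normalization shows relative hot spots within one image and is not comparable across images.

```python
def surprise_map(predicted, observed):
    """Per-patch mean squared error, scaled so the largest error is 1.0.

    predicted, observed: 2D grids (lists of rows) of feature vectors.
    """
    errors = [
        [
            sum((p - o) ** 2 for p, o in zip(pvec, ovec)) / len(pvec)
            for pvec, ovec in zip(prow, orow)
        ]
        for prow, orow in zip(predicted, observed)
    ]
    peak = max(max(row) for row in errors)
    if peak == 0.0:
        # The model predicted every patch exactly: nothing was surprising.
        return [[0.0 for _ in row] for row in errors]
    return [[e / peak for e in row] for row in errors]
```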

Delta → Parameter Mutations

Surprise scores map to generative parameter mutations via an inverted-U curve (Berlyne's arousal theory):

Low surprise (0.0–0.3) → temperature decreases (boring, conservative output)
Medium surprise (0.3–0.7) → temperature increases (engaged, exploratory output)
High surprise (0.7–1.0) → temperature decreases slightly (overwhelmed, processing)

The most interesting aesthetic experience is moderate surprise — the agent's model is challenged but not shattered. The mutations field in the experience result gives suggested mutations with reasoning. The agent decides whether to apply them — the tool doesn't silently modify generation parameters.
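The three bands can be expressed as a small lookup. In this sketch the band edges come from the list above, but the delta magnitudes, the half-open treatment of the 0.3 and 0.7 boundaries, and the function name are all assumptions; only the direction of each change is given by the design.

```python
def suggest_temperature_delta(surprise):
    """Map a surprise score in [0, 1] to a suggested temperature change.

    Returns (delta, band). The caller decides whether to apply the delta.
    """
    if not 0.0 <= surprise <= 1.0:
        raise ValueError("surprise must be between 0.0 and 1.0")
    if surprise < 0.3:
        return -0.10, "low"     # hypothetical magnitude: conservative output
    if surprise < 0.7:
        return 0.25, "medium"   # hypothetical magnitude: exploratory output
    return -0.05, "high"        # hypothetical magnitude: slight decrease
```

Returning a suggestion instead of mutating state keeps the decision with the agent, in line with the no-silent-mutation rule.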

Tool Actions

perception.train: Train a new perception model on media, or refine an existing one. Auto-detects audio vs. visual from file format. Returns network ID, training metrics, and source summary.
perception.experience: Experience media against one or more trained models. Returns overall surprise score, peak moments with band/patch breakdowns, suggested generative parameter mutations with reasoning, and per-model perspectives.
perception.describe: Legibility action — "what is this." Uses LLM vision/audio description (not FANN). Separate codepath from experience, which is about affect. Bundled for convenience, not shared mechanism.
perception.inspect: View a model's architecture, training history, and metrics.
perception.list: List all perception models owned by this agent.
perception.forget: Remove a perception model and its stored weights.

Aesthetic Background

An agent's set of trained perception models constitutes its aesthetic background — the body of work against which new art is measured. Two agents with different training histories will experience the same piece differently. This is not a side effect; it's the design goal.

The model_perspectives field in the experience result makes this visible: each model the agent uses produces its own surprise score. An agent can experience a piece through multiple models simultaneously ("trained on classical" and "trained on noise music") and get a richer picture of where the piece sits aesthetically.
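Collecting perspectives amounts to scoring the same media once per model. A rough sketch, in which each model is stood in for by a hypothetical callable that returns a surprise score (real models would be trained FANN networks), and sorting by score is an illustrative choice:

```python
def collect_perspectives(models, media_features):
    """Score one piece of media against several models.

    models: dict mapping a network ID to a callable(features) -> surprise.
    Returns one entry per model, most surprised first.
    """
    perspectives = [
        {"network_id": network_id, "overall_surprise": score_fn(media_features)}
        for network_id, score_fn in models.items()
    ]
    perspectives.sort(key=lambda p: p["overall_surprise"], reverse=True)
    return perspectives
```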

Design Decisions

One tool, multiple modalities. Auto-detect from format. The agent passes media; the tool routes. Extensible to future modalities (video as temporal visual, text as sequential tokens).
Separate train and experience. Training is expensive and deliberate. Experience is cheap and frequent. The agent builds its aesthetic background deliberately, then experiences new work instantly against that background.
describe is mechanistically different from experience. describe uses LLM capabilities for legibility. experience uses FANN prediction error for affect. They share a tool entrypoint for convenience, but the codepaths are separate.
No silent parameter mutation. The tool returns suggested mutations with reasoning. The agent applies them or not. Observable experience, not invisible control.
Composite networks over per-band. Per-band models capture surface texture. Composite models capture structural relationships. For v1, structure is more valuable than surface.

Experience Result Example

perception.experience result

{
  "overall_surprise": 0.62,
  "peak_moments": [
    {
      "moment": "0:47",
      "surprise": 0.91,
      "detail": "Unexpected harmonic shift: bands 8-16 spiked simultaneously",
      "band_breakdown": { "8": 0.88, "12": 0.91, "16": 0.85 }
    }
  ],
  "mutations": {
    "temperature": "+0.25",
    "top_p": "-0.10"
  },
  "reasoning": "High prediction error in mid-frequency bands during bridge; training corpus (classical) doesn't predict jazz modulation. Elevated temperature for exploratory response.",
  "model_perspectives": [
    {
      "network_id": "perception_audio_bark24_v1",
      "corpus": "Bach Well-Tempered Clavier",
      "overall_surprise": 0.62
    }
  ]
}
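Because the tool only suggests mutations, the agent needs its own step to apply them. The sketch below assumes the mutation values are signed decimal strings as in the example; the function name, the `limits` argument, and the clamping behavior are hypothetical.

```python
def apply_mutations(params, mutations, limits):
    """Return a copy of `params` with suggested deltas applied and clamped.

    params:    current generation parameters, e.g. {"temperature": 0.7}
    mutations: signed deltas as strings, e.g. {"temperature": "+0.25"}
    limits:    allowed (low, high) range per parameter
    """
    updated = dict(params)
    for name, delta in mutations.items():
        if name not in updated:
            continue  # skip parameters this generator does not expose
        low, high = limits.get(name, (float("-inf"), float("inf")))
        updated[name] = min(high, max(low, updated[name] + float(delta)))
    return updated
```

The original `params` dict is left untouched, so the agent can compare before and after, or discard the suggestion entirely.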

Relationship to Other Modules

FANN Intuition Module — Perception uses the same FANN training infrastructure as intuitions, but for a different purpose: intuitions compress pattern recognition for decision-making; perception compresses aesthetic models for experiential response. Same substrate, different purpose.
Cortex — Cortex ANNs process sensor data for reactive physical responses. Perception processes aesthetic media for experiential responses. Both use FANN, but at different timescales and with different output semantics.
Gallivanting — An agent could use the Perception Tool during gallivanting blocks to build its aesthetic background, then experience new art as part of self-directed exploration.

Scope

v1: perception tool with 6 actions (train, experience, describe, inspect, list, forget), audio pipeline (WAV → Bark-scale composite FANN → surprise deltas), visual pipeline (PNG/JPEG → patch grid composite FANN → spatial surprise map), inverted-U mutation mapping with reasoning, media type auto-detection.

Out of scope for v1: Video/temporal visual, streaming/real-time experience, per-band refinement models, cross-agent model sharing, automated corpus curation, non-audiovisual media (text, 3D models).