Research

Open problems we are funding the work to solve.

For technical reviewers. This page describes the unsolved engineering and AI-safety problems we are researching to operate a public marketplace of third-party agent skills at small-business unit economics. The marketing site tells you what the warehouse stocks. This page tells you why the warehouse can be trusted to stock it.
Thesis

A marketplace of open-source agent skills serving 33 million U.S. small businesses poses a different infrastructure problem from the one mainstream agent platforms were built to solve.

Autonomous AI agents are entering the small-business market — the 33 million U.S. firms that employ fewer than 500 people and account for 99.9% of American businesses. The agent platforms that exist today were architected for a single enterprise team with engineers, a security review process, and the budget to absorb the per-task language-model cost.

AgentDepot ships a different shape of product: a public catalog of agent skills sourced from the open-source ecosystem on GitHub, scored daily on a 100-point production-readiness model, and deployed for an SMB operator into AWS Bedrock AgentCore in five minutes. That shape — third-party code, composed by a non-technical buyer, running with that buyer's real credentials — exposes four problems that no shipping platform has yet solved end-to-end. We believe these are research problems, not product problems. Solving them is the safety floor required for an SMB agent marketplace to exist responsibly.

Open problem 1

Production-readiness scoring of open-source agent skills at population scale.

State of the art

GitHub now hosts tens of thousands of repositories that present themselves as "agent skills" — packaged tool-using workflows, MCP servers, sub-agent prompts, and end-to-end automations. Mainstream agent platforms either (a) accept any tool a customer pastes in and accept the resulting blast radius, or (b) hand-curate a small fixed catalog of internally-built skills. Neither approach generalizes to the SMB market.

The known surface signal — stars, forks, last commit date, README completeness, presence of a test suite — is the same quality signal published-package ecosystems (npm, PyPI) have used for a decade. It is empirically known to be weakly correlated with runtime behavior. A maintained-looking repository can still hallucinate, leak credentials at the boundary of an unfamiliar API, blow up cold-start cost on a real workload, or emit unbounded retry storms against a third-party rate limit. The signal that matters is dynamic, not static.

Why this is hard

A useful production-readiness score for an agent skill cannot be computed from the repository alone. It requires:

  • executing the skill in a sandboxed runtime against a representative input distribution
  • observing the resulting tool-call trace, latency profile, cost-per-action, error-recovery behavior, and credential handling
  • doing this continuously on a population that turns over weekly — new skills are pushed, deprecated skills are abandoned, dependencies change underneath maintained skills
  • doing it cheaply enough that it is economically possible to re-score the entire catalog on a daily cadence

The published evaluation suites for agent skills (AgentBench, SWE-bench multi-agent variants, ToolBench) measure peak capability of named systems. They are not designed for continuous, low-cost screening of a moving population of third-party packages. The gap between "publishable evaluation" and "production triage at marketplace cadence" is the research target.

Our hypothesis

Three combined techniques:

  1. A two-stage triage pipeline. A cheap static-signal classifier prunes the candidate population from 50k+ to a few thousand; a more expensive dynamic-eval stage is reserved for skills that pass the static gate.
  2. Probe suites per skill category. Marketing-aisle skills, operations-aisle skills, and customer-service-aisle skills have different invariants that must hold (idempotency for booking, suppression for outreach, escalation for support). The probe suite is category-typed, not universal.
  3. Drift-detection re-runs. Skills are re-scored on a weekly cadence against a held-out regression suite. A score that drops more than 25 points between runs is pulled from the shelf and the maintainer is notified; a minimal sketch of this check follows this list.
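
To make the third technique concrete, here is a minimal sketch of the weekly drift check in Python. The helpers run_probe_suite, pull_from_shelf, and notify_maintainer and the skill attributes are hypothetical placeholders, not AgentDepot APIs; the sketch illustrates the 25-point pull rule only, not our production scorer.

    # Minimal sketch of the weekly drift-detection re-run (technique 3).
    # Helper functions and skill attributes are illustrative placeholders.
    from dataclasses import dataclass

    DRIFT_THRESHOLD = 25  # points on the 100-point production-readiness scale

    @dataclass
    class ScoreRecord:
        skill_id: str
        previous_score: float  # last accepted weekly score
        current_score: float   # score from this week's regression suite

    def should_pull(record: ScoreRecord) -> bool:
        """Return True if the skill should be removed from the catalog."""
        return (record.previous_score - record.current_score) > DRIFT_THRESHOLD

    def weekly_rescore(catalog, run_probe_suite, pull_from_shelf, notify_maintainer):
        for skill in catalog:
            current = run_probe_suite(skill)  # category-typed probe suite
            record = ScoreRecord(skill.id, skill.last_score, current)
            if should_pull(record):
                pull_from_shelf(skill)            # remove the listing immediately
                notify_maintainer(skill, record)  # include the failing probes
            skill.last_score = current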

Phase I deliverable

An open production-readiness benchmark for agent skills: 200 held-out skills hand-scored by domain experts, plus the automated pipeline that scores them. Target: ≥ 0.75 Spearman correlation between automated and human-expert scores, at a steady-state cost below $0.10 per skill per day.
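
As an illustration of how that acceptance criterion would be checked, a minimal sketch using SciPy's spearmanr, with the two score lists standing in for the automated and hand-scored results on the 200 held-out skills (SciPy is an assumed dependency here, not part of the benchmark itself):

    # Sketch of the Phase I acceptance check: Spearman correlation between
    # automated and human-expert scores on the held-out skill set.
    from scipy.stats import spearmanr

    def meets_target(automated_scores, expert_scores, target=0.75):
        rho, _p_value = spearmanr(automated_scores, expert_scores)
        return rho >= target, rho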

Cited prior art

  • AgentBench: Evaluating LLMs as Agents (Liu et al., 2023) — capability eval baseline.
  • SWE-bench / SWE-bench Multimodal (Jimenez et al., 2024) — held-out task eval lineage.
  • ToolBench / ToolEval (Qin et al., 2023) — tool-using capability measurement.
  • The npm/PyPI supply-chain literature on package-quality scoring and the 2024–2025 line of work on third-party model auditing.

What remains research, not engineering: continuous, low-cost, category-typed dynamic eval of a moving population of agent skills is, to our knowledge, an unstudied infrastructure problem.

Open problem 2

Per-tenant runtime isolation when the runtime executes third-party code from unknown maintainers.

State of the art

Today's mainstream agent platforms — LangChain agent loops, OpenAI Assistants, the various GPT-Action-style tool-calling frameworks — share a runtime across customers. Tools, code execution, and tenant credentials co-exist in the same process or container. This works because the customer is typically a single engineering team that owns the deployment and reviews the tool code itself.

In a marketplace model, the runtime hosts code written by arbitrary open-source maintainers, executed on behalf of an SMB operator, with that operator's real production credentials for Stripe, Twilio, GoHighLevel, Calendly, and a long tail of other third-party services. A single prompt-injection vulnerability in one skill, a single credential exfiltration through one skill's tool call, or a single sandbox escape becomes a cross-tenant incident and a supply-chain incident at the same time. The 2025 Replit and Cursor MCP incidents, where adversarial inputs reached privileged tool calls in shared runtimes, are early proofs of concept of this failure mode.

The known robust answer is hardware-backed isolation: each (tenant, skill) pair runs in its own VM. AWS Lambda demonstrates this is operationally feasible at scale using Firecracker microVMs. But Lambda is a stateless function platform; an agent skill is stateful, and pricing the stateful per-(tenant, skill) case at SMB unit economics ($0.001–$0.01 per agent action) is unsolved.

Why this is hard

A stateful agent-skill runtime is not a function invocation. It carries:

  • multi-megabyte conversation history and tool-call traces
  • the skill bundle itself (a chunk of third-party code)
  • live credential leases for the customer's connected services (Stripe, Twilio, GHL)
  • in-flight retry state for partially completed actions

Cold-starting that into a fresh Firecracker microVM today takes 2–4 seconds and several hundred milliseconds of CPU work. For a real-time voice receptionist deployed from the catalog, the latency budget is ≤ 700 ms end-to-end. For an SMS reply skill invoked from a webhook, the cost-per-invocation must stay below half a cent for the marketplace business model to hold. Closing both gaps simultaneously — sub-second stateful cold-start, sub-cent per-task — is the research target.

Our hypothesis

Three combined techniques:

  1. Snapshot-restore with delta-only state hydration. The per-(tenant, skill) microVM image is restored from an immutable base snapshot, and a small per-action delta is rehydrated from a tenant-scoped object store.
  2. Capability-scoped credential brokering. The skill runtime never holds long-lived customer credentials. It requests a short-lived (≤ 5 minute) action-scoped delegation from a separate broker, which eliminates credential persistence inside any third-party skill's context even if that skill is later found to be malicious or compromised. A minimal sketch of this broker follows this list.
  3. Predictive warm pools by aisle. Aisle-specific traffic patterns (an HVAC operator's customer-service skill gets a phone-call burst at 7–9am Mountain Time) drive a scheduler that pre-warms microVMs against forecasted demand.
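
A minimal sketch of the credential-broker interface from the second technique, assuming a hypothetical vault object that holds the long-lived customer secrets outside the skill microVM. The method names and the mint_scoped_token call are illustrative, not a production API; the point is that only a short-lived, action-scoped lease ever crosses into the skill's context.

    # Minimal sketch of capability-scoped credential brokering (technique 2).
    # The broker runs outside the skill microVM; names are illustrative.
    import time
    from dataclasses import dataclass

    MAX_LEASE_SECONDS = 300  # ceiling of 5 minutes, per the hypothesis

    @dataclass
    class CredentialLease:
        tenant_id: str
        service: str      # e.g. "stripe", "twilio"
        action: str       # e.g. "sms.send", "charge.create"
        token: str
        expires_at: float

    class CredentialBroker:
        def __init__(self, vault):
            self._vault = vault  # holds the long-lived customer credentials

        def request_lease(self, tenant_id: str, service: str, action: str,
                          ttl: int = 60) -> CredentialLease:
            ttl = min(ttl, MAX_LEASE_SECONDS)
            # Exchange the long-lived credential for a short-lived, action-scoped
            # token; the long-lived secret never leaves the broker process.
            token = self._vault.mint_scoped_token(tenant_id, service, action, ttl)
            return CredentialLease(tenant_id, service, action, token,
                                   expires_at=time.time() + ttl)

        def validate(self, lease: CredentialLease, service: str, action: str) -> bool:
            return (lease.service == service and lease.action == action
                    and time.time() < lease.expires_at)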

Phase I deliverable

A measured cold-start latency distribution across 1,000 stateful microVM restorations spanning 30+ catalog skills, with target p95 < 800 ms and per-invocation cost < $0.005, on a representative SMB workload trace. Plus a credential-broker compliance audit showing zero long-lived customer credentials persisted inside any skill context across the test corpus.

Cited prior art

  • Firecracker: Lightweight Virtualization for Serverless Applications (Agache et al., NSDI 2020) — the microVM substrate.
  • Faasm: Lightweight Isolation for Efficient Stateful Serverless Computing (Shillaker & Pietzuch, USENIX ATC 2020) — stateful serverless isolation.
  • Catalyzer: Sub-millisecond Startup for Serverless Computing with Initialization-less Booting (Du et al., ASPLOS 2020) — snapshot-restore lineage.
  • AWS re:Invent 2025 sessions on Bedrock AgentCore Runtime — the production substrate we build on.

What remains research: none of the above papers measured stateful workloads where the executing code itself is third-party and the state includes live customer credential leases. That is the gap we are filling.

Open problem 3

Cross-skill coordination safety in composed multi-skill deployments.

State of the art

A single agent that hallucinates, loops, or makes a wrong tool call is a relatively well-studied failure mode. Static safety filters, LLM judges, and human-in-the-loop checkpoints all address the single-agent case.

An AgentDepot deployment is, by design, composed: an SMB operator might stock five skills from five different maintainers — one books service appointments in Calendly, one sends SMS confirmations in Twilio, one updates the customer record in GoHighLevel, one runs the daily report, one handles inbound SMS replies. None of those maintainers can be expected to know about each other. They share state — the same Twilio account, the same Calendly calendar, the same CRM — but they do not coordinate.

This produces a class of failure we call emergent coordination failure:

  • Action conflict: two skills independently take incompatible actions on the same external resource (skill A cancels the appointment skill B just confirmed).
  • Recursive prompting: skill A's output triggers skill B which triggers skill A in a runaway loop with no economic stop condition.
  • Livelock: each skill waits on a state the other has not produced; the composed deployment makes no progress but consumes inference cost.
  • Compensating-action drift: a recovery skill reverses an action whose original justification has since become valid, so the reversal itself is now the error; the net effect is invisible damage to the customer's data.

These failures do not appear in single-skill test suites. They are not detectable from a model's output alone. They only appear when you observe the joint trajectory of multiple skills acting on shared external state.

Why this is hard

Published work on multi-agent LLM systems (AutoGen, MetaGPT, CrewAI, the agentic benchmarks like AgentBench and SWE-bench multi-agent variants) measures task success on the happy path, and assumes the agents are co-designed. In a marketplace, they are not. There is, to our knowledge, no published runtime safety supervisor for cross-author multi-skill action conflict on production external resources, with a target detection latency that lets it intercept the second skill's action before it commits. Detection latencies in the seconds range are research-grade results; we need sub-200 ms, because that is the gap between "the conflicting Twilio SMS is queued" and "the conflicting Twilio SMS has been delivered to the customer's phone."

Our hypothesis

A runtime coordination supervisor that observes every tool call from every skill in the tenant's deployment before it executes, maintains a fast in-memory model of which external resources are being mutated in this cycle, and flags or blocks an action whose target intersects an in-flight action by a peer skill. The supervisor's decision must run in < 50 ms to fit inside the tool-call latency budget. We hypothesize this is achievable with a small classifier (not an LLM judge) trained on a corpus of synthetic and recorded multi-skill traces with labeled conflict outcomes.
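
A minimal sketch of the resource-intersection gate described above. The trained conflict classifier is out of scope here; the sketch shows only the in-memory in-flight registry that a blocking decision would consult, and the class, field names, and 30-second window are illustrative assumptions rather than our implementation.

    # Minimal sketch of the coordination supervisor's resource-intersection gate.
    # The decision is a dictionary lookup, so it fits well inside a 50 ms budget.
    import time
    from dataclasses import dataclass

    @dataclass
    class ToolCall:
        skill_id: str
        resource_id: str  # e.g. "calendly:event/abc123", "twilio:+13035550100"
        mutating: bool    # does this call change external state?

    class CoordinationSupervisor:
        def __init__(self, window_seconds: float = 30.0):
            self._in_flight = {}  # resource_id -> (skill_id, started_at)
            self._window = window_seconds

        def review(self, call: ToolCall) -> str:
            """Return "proceed" or "block" before the tool call executes."""
            now = time.monotonic()
            # Drop stale entries that fell outside the coordination window.
            self._in_flight = {r: v for r, v in self._in_flight.items()
                               if now - v[1] < self._window}
            holder = self._in_flight.get(call.resource_id)
            if call.mutating and holder and holder[0] != call.skill_id:
                return "block"  # a peer skill is mutating the same resource
            if call.mutating:
                self._in_flight[call.resource_id] = (call.skill_id, now)
            return "proceed"

        def complete(self, call: ToolCall) -> None:
            self._in_flight.pop(call.resource_id, None)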

Phase I deliverable

An open evaluation harness: 200 multi-skill scenarios drawn from real SMB workflows (booking + reminder + reply, posted by independent maintainers), each annotated with the gold-standard "should be blocked" or "should proceed" label. Target: ≥ 90% precision on block decisions (false positives are a customer-experience cost), with end-to-end supervisor latency p99 < 100 ms.

Cited prior art

  • AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al., 2023).
  • Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) — single-agent guardrail lineage.
  • The Off-Switch Game (Hadfield-Menell et al., 2017) — corrigibility theory we extend to the multi-skill case.
  • Cooperative Inverse Reinforcement Learning (Hadfield-Menell et al., 2016) — formal frame for inter-agent coordination.

What remains research: all of the above either treat single-agent safety or treat coordination as an idealized game-theoretic problem, with co-designed agents. The production runtime supervisor for production tool calls across independently authored skills on third-party APIs is an empirical research artifact that does not yet exist.

Open problem 4

Operator-readable audit-by-construction across heterogeneous skill providers.

State of the art

In an enterprise deployment, the AI safety story relies on engineers reviewing logs, on SOC 2 controls maintained by a security team, and on a CISO who can read CloudTrail. None of those exist at a five-person HVAC business. The customer is the owner-operator. They cannot read JSON logs and they cannot interpret an audit trail expressed in IAM events.

The marketplace model adds a layer: each skill in the operator's composed deployment was authored by someone different. If each maintainer ships their own log format, the operator is asked to read five different idioms. The trustworthy-AI subproblem is this: how do you give a non-technical operator the ability to verify, in plain English, what every skill in their deployment did on their behalf, and the ability to reverse it if it was wrong, without giving up the autonomy that makes the catalog valuable in the first place?

Why this is hard

Three constraints fight each other. First, the audit must be tamper-evident: a malicious skill or a compromised process must not be able to retroactively edit its own history. Second, the audit must be replayable: an operator must be able to ask "what would have happened if I had said no to that action three hours ago?" and get a deterministic answer. Third, the surface must be readable by someone who has never seen a log line, which means the underlying record cannot be a verbose machine log — it has to be a structured, semantically rich event that renders into prose. And it has to be uniform across skills written by maintainers who never coordinated.

Each of these constraints is well-studied in isolation. Tamper-evidence is solved by hash-chained logs (Merkle trees, certificate transparency). Replayability is solved by deterministic event-sourcing. Plain-English rendering is solved by templated event-to-prose generators. The combination, particularly the requirements that the LLM-generated prose cannot be the source of truth and that the structured event schema must be enforced uniformly across heterogeneous third-party skills, is unsolved at the SMB-marketplace layer.

Our hypothesis

Cryptographically chained per-action event records emitted by the runtime, not by the skill. Each agent action produces a structured event with: (a) the skill's identity and version pin, (b) the input state hash, (c) the tool call's signed parameters, (d) the external resource ID, (e) the prior event's hash. The chain is anchored periodically to an immutable store. Operator-facing prose is rendered deterministically from the structured event by a small templated generator — the LLM is not in the audit path, eliminating hallucination as a class of audit failure. Because the event is emitted by the runtime, a malicious skill cannot opt out of being audited.
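
A minimal sketch of the runtime-emitted, hash-chained event record, covering fields (a) through (e) above. Anchoring to the immutable store and the deterministic prose renderer are omitted, and the field names are illustrative assumptions rather than a fixed schema.

    # Minimal sketch of a hash-chained per-action audit event emitted by the
    # runtime (never by the skill). Field names are illustrative.
    import hashlib
    import json
    from dataclasses import dataclass, asdict

    @dataclass
    class AuditEvent:
        skill_id: str           # (a) skill identity
        skill_version: str      # (a) version pin
        input_state_hash: str   # (b) hash of the state the action observed
        tool_call: dict         # (c) signed tool-call parameters
        resource_id: str        # (d) external resource acted on
        prev_event_hash: str    # (e) hash of the prior event in the chain

        def digest(self) -> str:
            payload = json.dumps(asdict(self), sort_keys=True).encode()
            return hashlib.sha256(payload).hexdigest()

    def append_event(chain: list, event: AuditEvent) -> None:
        expected_prev = chain[-1].digest() if chain else "genesis"
        assert event.prev_event_hash == expected_prev, "chain integrity violated"
        chain.append(event)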

Phase I deliverable

A reference implementation of the audit chain integrated with our existing AgentCore-based runtime, plus a usability study with 12 SMB operators measuring whether they can correctly identify a planted erroneous skill action from the operator-facing audit view. Target: ≥ 80% identification rate at the first reading.

Cited prior art

  • Certificate Transparency (Laurie et al., RFC 6962) — public hash-chained log lineage.
  • Practical Byzantine Fault Tolerance (Castro & Liskov, 1999) — tamper-evidence heritage.
  • Trustworthy AI: A Computational Perspective (Liu et al., 2022) — survey we extend.

What remains research: usability of cryptographic audit at the small-business operator layer — and the specific marketplace constraint that the audit schema must be enforced by the runtime, not by the skill maintainer — is, to our knowledge, an unstudied empirical question.

Phase I objectives

Six falsifiable claims.

Each can fail. Failure of any one is itself a publishable research result.

O1 · Continuous dynamic eval of OSS agent skills
  • Measurable target: ≥ 0.75 Spearman correlation between automated production-readiness score and human expert score on 200 held-out skills
  • Technical risk: Static GitHub signals (commits, README, tests) underfit runtime behavior

O2 · Drift detection on the live catalog
  • Measurable target: ≥ 90% recall on planted regressions across 1k weekly re-scores at < $0.10/skill-day
  • Technical risk: Probe-suite cost grows superlinearly with skill diversity

O3 · Per-(tenant, skill) microVM cold-start
  • Measurable target: p95 < 800 ms across 1k restorations of stateful agent images
  • Technical risk: Snapshot-delta hydration unproven for the agent-skill state class

O4 · Capability-scoped credential brokering
  • Measurable target: Zero long-lived customer credentials inside any skill runtime across the test corpus
  • Technical risk: Action-scoping for arbitrary third-party APIs is heterogeneous

O5 · Cross-skill action-conflict detection
  • Measurable target: ≥ 90% precision, p99 < 100 ms, across 200 multi-skill SMB scenarios
  • Technical risk: No labeled corpus exists; we will create one

O6 · Operator-readable audit chain
  • Measurable target: ≥ 80% planted-error identification rate by SMB operators on first reading
  • Technical risk: Usability of cryptographic audit at this audience is untested
Forthcoming · in progress

Four planned publications, one per open problem.

Forthcoming · target: MLSys / OSDI

Continuous production-readiness scoring of open-source agent skills at population scale

Empirical study of static-vs-dynamic signal value for predicting runtime behavior of 50k+ candidate agent skills. Drift-detection harness with weekly re-scoring at under $0.10 per skill-day. Open dataset and probe suite.

Forthcoming · target: USENIX ATC / NSDI

Per-(tenant, skill) microVM isolation for marketplace-hosted agent code

Stateful Firecracker microVM cold-start under marketplace workload patterns. Snapshot-restore and delta-hydration measurements across 1,000 restorations against a p95 < 800 ms target. Capability-scoped credential brokering measurements across 30+ heterogeneous third-party APIs.

Forthcoming · target: NeurIPS Safety / IEEE S&P

A runtime supervisor for cross-skill coordination failure in composed agent deployments

200-scenario evaluation harness drawn from real SMB composed-skill workflows (booking + reminder + reply, posted by 5 different maintainers). Sub-200ms classifier with ≥ 90% precision target. Open-sourced harness and dataset.

Forthcoming · target: CHI / USENIX Security

Operator-readable cryptographic audit chains for marketplace agent deployments

Reference implementation of a uniform Merkle-anchored per-action event record across heterogeneous skill providers, with deterministic plain-English rendering. Usability study with 12 SMB operators measuring planted-error identification rate.

Bibliography

Cited prior art our research builds on.

Firecracker: Lightweight Virtualization for Serverless Applications

Agache et al. · NSDI 2020

MicroVM substrate underlying per-tenant isolation (Problem 2).

Faasm: Lightweight Isolation for Efficient Stateful Serverless Computing

Shillaker & Pietzuch · USENIX ATC 2020

Stateful serverless isolation lineage.

Catalyzer: Sub-millisecond Startup for Serverless Computing

Du et al. · ASPLOS 2020

Snapshot-restore lineage we extend to stateful skill runtimes.

What's in your Pre-trained Model? An Auditor's Perspective

Various — empirical ML supply-chain literature · 2023–2025

Adjacent work on third-party model auditing; we extend to packaged skills.

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Wu et al. · 2023

Multi-agent baseline that measures task success but not coordination safety.

Constitutional AI: Harmlessness from AI Feedback

Bai et al. · 2022

Single-agent guardrail lineage we extend to coordinated multi-skill case.

The Off-Switch Game

Hadfield-Menell et al. · 2017

Corrigibility theory — formal frame we extend empirically across skills.

Cooperative Inverse Reinforcement Learning

Hadfield-Menell et al. · 2016

Game-theoretic frame for inter-agent coordination.

Certificate Transparency

Laurie et al. · RFC 6962

Public hash-chained log lineage underlying the audit chain (Problem 4).

Practical Byzantine Fault Tolerance

Castro & Liskov · 1999

Tamper-evidence heritage for the audit chain.

Trustworthy AI: A Computational Perspective

Liu et al. · 2022

Trustworthy-AI survey we extend at the SMB-marketplace layer.

Why this work matters beyond AgentDepot

The 33-million-firm U.S. small-business sector is the part of the economy with the least capacity to safely adopt autonomous AI on its own.

SMBs have no internal security teams, no compliance staff, and no software engineers to read the audit logs. If autonomous agents are deployed to that audience using the architecture mainstream platforms ship today — shared runtime, shared credentials, no coordination supervision, opaque automation — the resulting failure surface will be measured in compromised customer-payment systems, leaked health data at dental practices, and contract violations at small legal firms.

An open marketplace for agent skills is the model most likely to put broad AI capability into the hands of those firms quickly. It is also the model that exposes the four problems above most acutely. The research above is the safety floor required for that marketplace to exist responsibly. We are doing it because we have to do it for our own product to be defensible. Publishing it — Phase I deliverables will go to peer-reviewed venues and will be open-sourced where commercially compatible — is the way the rest of the industry can build on it without reinventing the safety floor independently.

Contact

For preliminary measurement traces, collaboration, or questions on the work above.

research@agentdepot.cloud