Software supply chain: package hallucinations & slopsquatting

Large language models routinely invent packages that do not exist. Across standard web-development queries, package hallucinations average 19.6% of generated dependency suggestions (arXiv 2501.19012). Open-source models are worse, hallucinating at 21.7% on average, with CodeLlama 7B and 34B exceeding a third of all outputs, while proprietary models hold around 5.2%.

The danger is repeatability: 43% of hallucinated packages reappear when the same prompt is run again, and 58% appear more than once across ten queries. That predictability is what makes slopsquatting viable: attackers register the fictitious names an LLM keeps suggesting, then wait for developers (and agents) to install them (Aikido, Trend).

These are not hypotheticals. A malicious huggingface-cli impersonation drew over 30,000 downloads in three months, and a hallucinated react-codeshift spread to 237 GitHub repositories (Trend Micro).

How PeakStack addresses this

On every commit, PeakStack checks the dependencies in each manifest against the live npm, PyPI, and crates.io registries. It flags packages that don't exist (hallucinations) and names a short edit distance from popular packages (slopsquatting and typosquatting), the exact vectors described above.
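As a rough illustration of the two checks described above (this is a sketch, not PeakStack's implementation), the Python below pairs a registry existence lookup with a Levenshtein edit-distance comparison against a list of popular names. The function names, the `max_distance=2` threshold, and the popular-package list are all assumptions made for the example. `pypi_exists` relies on PyPI's JSON endpoint returning 404 for unknown projects; the lookup is injectable so the logic can be tried offline.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def pypi_exists(name: str) -> bool:
    """True if PyPI's JSON endpoint knows the project; a 404 means it does not."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def audit(names: Iterable[str],
          popular: Iterable[str],
          exists: Callable[[str], bool] = pypi_exists,
          max_distance: int = 2) -> dict[str, str]:
    """Label each suspicious dependency name; clean names are omitted."""
    popular = sorted(popular)
    findings: dict[str, str] = {}
    for name in names:
        if not exists(name):
            findings[name] = "missing"  # likely hallucinated
            continue
        if name in popular:
            continue
        near = [p for p in popular if edit_distance(name, p) <= max_distance]
        if near:
            findings[name] = f"lookalike of {near[0]}"
    return findings


# Offline usage with a made-up registry snapshot instead of a network call.
fake_registry = {"requests", "reqeusts", "numpy"}
report = audit(["requests", "reqeusts", "totally-made-up-pkg"],
               popular=["requests", "numpy"],
               exists=lambda n: n in fake_registry)
```

A fixed distance threshold is a blunt instrument: short names produce many false matches, so a real checker would likely scale the threshold with name length or add an allowlist.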

Insecure execution paths & logic flaws in machine-generated code

While traditional DevSecOps tools focus on deep dependency vulnerabilities and zero-day compliance, the primary failure modes of vibe- and agentic-coded applications are structural: fundamental logic errors, hallucinated packages, and runaway serverless infrastructure. These are a different class of risk than the human-written bugs legacy scanners were built to find.

Statistically, roughly 45% of AI-generated code contains security vulnerabilities (Wiz, Databricks). Aggregated across six major models, 25.1% of samples contain confirmed vulnerabilities that map directly to the OWASP Top 10:

Model	Confirmed vulnerability rate
Claude Opus 4.6	29.2%
Llama 4 Maverick	29.2%
DeepSeek V3	29.2%
Gemini 2.5 Pro	25.1%
Grok 4	25.1%
GPT-5.2	19.1%

Rates reflect confirmed (manually triaged) vulnerabilities mapped to the OWASP Top 10, aggregated from the cross-model SAST analyses cited above (Checkmarx, Wiz, Databricks). Model versions are illustrative of the variance reported across 2025–2026 studies; absolute rates vary with prompt set and triage methodology.

The problem compounds over time: the average number of vulnerabilities per codebase has more than doubled, rising 107% to 581.6. Detection is also unreliable: 78.3% of confirmed vulnerabilities in AI code are flagged by only one of five major SAST tools (Checkmarx), so any single scanner misses most issues. Common vectors include unsafe deserialization, heap-based buffer overflows, and exposed endpoints, as in the Base44 authentication bypass, where a vibe-coding platform exposed apps through a logic flaw (Wiz analysis, Calcalist).
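To make the single-scanner point concrete, here is a small sketch of how the "flagged by only one tool" share and per-tool coverage could be computed from triaged results. The finding IDs, the tool names, and the data are invented for illustration and do not come from the cited studies.

```python
from collections import Counter


def single_tool_share(findings: dict[str, set[str]]) -> float:
    """Fraction of confirmed findings that exactly one tool flagged."""
    if not findings:
        return 0.0
    solo = sum(1 for tools in findings.values() if len(tools) == 1)
    return solo / len(findings)


def per_tool_recall(findings: dict[str, set[str]]) -> dict[str, float]:
    """Share of all confirmed findings that each individual tool flagged."""
    counts = Counter(tool for tools in findings.values() for tool in tools)
    return {tool: n / len(findings) for tool, n in counts.items()}


# Hypothetical triage output: finding id -> tools that reported it.
triaged = {
    "vuln-1": {"tool_a"},
    "vuln-2": {"tool_b"},
    "vuln-3": {"tool_a", "tool_c"},
    "vuln-4": {"tool_c"},
}

share = single_tool_share(triaged)   # 0.75
recall = per_tool_recall(triaged)    # tool_a 0.5, tool_b 0.25, tool_c 0.5
```

In this toy data no single tool covers more than half of the confirmed findings, which is the pattern the paragraph above describes.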

How PeakStack addresses this

PeakStack reviews every change for security and logic flaws and returns each finding with its severity, the exact file and line, a plain-English explanation of why it matters, and a concrete fix, surfacing the structural issues a single scanner's blind spots would miss.

Agentic supply chain & configuration risks

Autonomous agents amplify every risk above. They can read configuration files, install packages, and execute code without a human in the loop, bypassing the review steps that normally catch a bad dependency or a dangerous command. The OWASP project now tracks this directly via the Agentic Skills Top 10 (AST-2026) and the broader Top 10 for Agentic AI (ASI-2026) (Promptfoo mapping).

Real CVEs already exist: CVE-2025-59536 in Claude Code is a concrete example of an agentic tool introducing an exploitable flaw (analysis).

How PeakStack addresses this

PeakStack analyzes every commit the same way whether a human or an agent authored it, running the same live registry checks over the packages an agent pulls in, and the same code review over the configuration and source files it changes.

Computational cost spirals in serverless runtimes

AI-generated infrastructure frequently ships with cost time-bombs. Recursive serverless triggers, in which a function writes to the queue that invokes it, can loop indefinitely; a single continuously running agent has generated a $30,000 cost spike (FinOps Foundation, AWS recursive loops).
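One common defence against the recursive-trigger pattern is a hop counter carried in the message itself. The sketch below simulates that idea with an in-memory queue. `MAX_HOPS`, the `hops` field, and the dead-letter list are assumptions for the example and not any cloud provider's API.

```python
from collections import deque

MAX_HOPS = 3  # assumed budget; tune to the deepest legitimate chain


class HopLimitExceeded(RuntimeError):
    """Raised when a message has been re-enqueued too many times."""


def handle(message: dict, queue: deque) -> None:
    """A handler that writes back to the queue that invokes it."""
    hops = message.get("hops", 0)
    if hops >= MAX_HOPS:
        raise HopLimitExceeded(f"dropping message after {hops} hops")
    # ... real work would happen here ...
    # The follow-up message carries an incremented counter, so the chain
    # terminates even though the handler feeds its own trigger.
    queue.append({"body": message["body"], "hops": hops + 1})


def drain(queue: deque) -> tuple[int, list[dict]]:
    """Process until the queue is empty; park over-budget messages."""
    processed, dead_letters = 0, []
    while queue:
        message = queue.popleft()
        try:
            handle(message, queue)
            processed += 1
        except HopLimitExceeded:
            dead_letters.append(message)
    return processed, dead_letters


processed, dead_letters = drain(deque([{"body": "job-1"}]))
```

Without the `hops` check, `drain` would never return. With it, the loop is cut after three invocations and the fourth message is parked for inspection.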

Even “normal” code leaks money. A high-volume Lambda that processes 10 million transactions/day and logs each full 5 KB JSON payload ingests 50 GB of logs daily, about $750/month in logging costs alone (nOps, Kloudr). Polling overhead from event-source mappings adds more (the $0.04 bill that wasn't).
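The logging figure can be reproduced with a few lines of arithmetic. The $0.50 per GB ingestion rate and the 30-day month are assumptions inferred from the numbers above ($750 for 1,500 GB), not quoted prices, and sizes use decimal units (1 GB = 1,000,000 KB).

```python
def monthly_log_cost(events_per_day: int,
                     payload_kb: float,
                     usd_per_gb: float = 0.50,  # assumed ingestion rate
                     days: int = 30) -> tuple[float, float]:
    """Return (GB ingested per day, USD per month) for logged payloads."""
    gb_per_day = events_per_day * payload_kb / 1_000_000  # KB -> GB (decimal)
    return gb_per_day, gb_per_day * days * usd_per_gb


full_payload = monthly_log_cost(10_000_000, 5)    # (50.0, 750.0)
trimmed = monthly_log_cost(10_000_000, 0.5)       # (5.0, 75.0)
```

The second call shows the same workload logging a 0.5 KB summary instead of the full payload: the cost scales linearly with payload size, so trimming what is logged is the direct lever.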

How PeakStack addresses this

PeakStack estimates the monthly cost of each capability, broken down across compute, database, storage, bandwidth, AI, and third-party services, and attaches an estimated dollar impact to cost findings, so an expensive pattern is visible before the bill arrives.

Unit economics: per-user cost visibility before launch

Traditional SaaS enjoys near-zero marginal cost and gross margins of 80%+. AI changes the math: inference cost scales with usage. Scaling AI startups average gross margins of just 25%, and even mature AI platforms stabilize around 60%, with infrastructure costs eroding margins for 84% of enterprises (unit economics of GenAI, the software paradox).

Flat-rate pricing becomes a trap. A marketing-agent tool priced at a flat $20/month might see average users consume $2.40–$4.00 in monthly inference, but power users can burn $8.00–$15.00, turning the best customers into the biggest losses (pricing models compared).
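A quick way to see how usage erodes a flat-rate plan is to compute the margin left after inference for each usage level. The sketch below counts inference cost only (hosting, support, and payment fees would lower every figure further), and the 80/20 user mix in the blended example is a made-up assumption.

```python
def margin_after_inference(price: float, inference_cost: float) -> float:
    """Share of the subscription price left after paying for inference."""
    return (price - inference_cost) / price


def blended_margin(price: float, segments: list[tuple[float, float]]) -> float:
    """Margin across user segments given as (share_of_users, monthly_cost)."""
    average_cost = sum(share * cost for share, cost in segments)
    return margin_after_inference(price, average_cost)


PRICE = 20.00  # flat monthly price from the example above

for cost in (2.40, 4.00, 8.00, 15.00):
    print(f"${cost:>5.2f} inference -> {margin_after_inference(PRICE, cost):.0%} margin")

# Hypothetical mix: 80% average users at $4.00, 20% power users at $15.00.
mix = blended_margin(PRICE, [(0.8, 4.00), (0.2, 15.00)])
```

Under these assumptions the heaviest users keep only a quarter of the price as margin before any other cost is counted, which is why per-user cost visibility matters before the price is set.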

How PeakStack addresses this

PeakStack estimates per-request and per-user cost for each capability from the infrastructure and third-party APIs (such as OpenAI or Pinecone) it detects in the code. You can see which features are expensive to serve per user before deployment, and before a flat-rate plan turns your heaviest users into your biggest losses.

A platform for automated repository governance

The research converges on a pre-production governance platform with four modules: (1) dependency verification against live registries; (2) deterministic static analysis for scalability bottlenecks such as N+1 queries, full-table scans, unbounded queries, and blocking I/O; (3) LLM-driven code review for security, architecture, and production readiness; and (4) per-capability cost estimation across compute, database, and third-party APIs. PeakStack implements this approach by connecting to GitHub and analyzing every commit so risks are caught before they reach production.

Critically, this is a hybrid deterministic/probabilistic pipeline rather than raw code fed to an LLM. PeakStack runs fast, deterministic passes first (live registry verification for dependencies and static pattern analysis for scalability bottlenecks), then applies targeted LLM review to reason about security, architecture, and cost, and to explain each finding in plain English with a concrete fix. The deterministic stages keep analysis fast and false positives low; the model layer is reserved for the work that genuinely needs reasoning.
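The ordering described above can be sketched as a small pipeline in which cheap deterministic passes run first and their findings are handed to a reviewer as context. Everything here is hypothetical: the `Finding` type, the naive string-based pattern check, and the injected `review` callable standing in for a model call are illustration only, not PeakStack's actual design.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    stage: str    # "registry", "pattern", or "llm"
    message: str


def registry_pass(dependencies: list[str], known: set[str]) -> list[Finding]:
    """Deterministic: flag dependencies missing from a registry snapshot."""
    return [Finding("registry", f"unknown package: {name}")
            for name in dependencies if name not in known]


def pattern_pass(source: str) -> list[Finding]:
    """Deterministic: a deliberately naive unbounded-query check."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        lowered = line.lower()
        if "select *" in lowered and "limit" not in lowered:
            findings.append(Finding("pattern", f"line {lineno}: unbounded query"))
    return findings


Reviewer = Callable[[str, list[Finding]], list[Finding]]


def analyze(dependencies: list[str], known: set[str],
            source: str, review: Reviewer) -> list[Finding]:
    """Run deterministic passes first, then the reviewer with their output."""
    findings = registry_pass(dependencies, known) + pattern_pass(source)
    return findings + review(source, findings)


def stub_review(source: str, prior: list[Finding]) -> list[Finding]:
    """Stand-in for a model call; a real reviewer would reason about the code."""
    return [Finding("llm", f"reviewed with {len(prior)} deterministic findings")]


report = analyze(["requests", "fastjsonx"], {"requests"},
                 'rows = db.execute("SELECT * FROM orders")', stub_review)
```

Passing the reviewer in as a callable keeps the deterministic stages testable on their own and makes the expensive step easy to skip when the cheap passes already block the change.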

References

This page summarizes PeakStack's internal research report. Statistics are reproduced from the cited third-party sources; figures reflect the studies available at time of writing.

Vulnerability & Financial Risk Analysis of Vibe- and Agentic-Coded Applications
