Laravel 12 Performance Optimization for AI SaaS: Octane, FrankenPHP, Queues & Caching (2026 Guide)

By LaraSpeed Team · 23 min read

1. Why AI SaaS Breaks Traditional Laravel Stacks

A classic Laravel CRUD app can happily serve thousands of users on a $20 VPS. AI SaaS cannot. The moment you introduce streaming LLM responses, long-running agent workflows, and concurrent embedding jobs, the default PHP-FPM model collapses: each request holds a worker for 3–60 seconds, your pool exhausts, and your Time-to-First-Byte goes from 50ms to 12,000ms.

The fix is not a bigger server. It’s a different runtime model. In 2026, a fast Laravel AI SaaS runs on Octane with FrankenPHP, moves long work to horizontal queue workers, caches aggressively at four separate layers, and streams LLM output without holding a web worker hostage.

This guide is the full playbook. Every section has a real measurement, a real code snippet, and a real tradeoff — not just "enable OPcache and pray". By the end, you’ll know exactly where your latency is hiding and how to remove it.

2. Measure First: Benchmarks That Actually Matter

Optimizing without measuring is theater. Before you change a single line of config, capture these five numbers on a staging environment that mirrors production.

  • p50 / p95 / p99 response time per critical endpoint (login, dashboard, AI chat, billing).
  • Requests per second a single worker can sustain at 1x CPU.
  • Queue depth & worker saturation during peak AI usage.
  • Database p95 query time and slow-query count.
  • Time-to-First-Token (TTFT) for streaming AI responses.

Load Testing With k6

// tests/load/ai-chat.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
    stages: [
        { duration: '30s', target: 20 },
        { duration: '2m',  target: 100 },
        { duration: '30s', target: 0 },
    ],
    thresholds: {
        http_req_duration: ['p(95)<1500', 'p(99)<3000'],
        http_req_failed:   ['rate<0.01'],
    },
};

export default function () {
    const res = http.post('https://staging.app.test/ai/chat', JSON.stringify({
        message: 'Summarize this article in 3 bullets.',
    }), { headers: { 'Content-Type': 'application/json', 'Authorization': 'Bearer ' + __ENV.TOKEN }});

    check(res, { 'status 200': (r) => r.status === 200 });
}

Run this against both PHP-FPM and Octane. The delta is the argument you need to justify migration effort.

3. Laravel Octane: The 5-10x Baseline Speedup

Standard PHP-FPM boots Laravel from scratch on every request. Service providers register, config loads, routes compile — it’s cold-start every time. Octane keeps the framework in memory between requests, so each request only runs your actual handler code.

The speedup is real and measurable. On a typical Laravel SaaS dashboard with 15 Eloquent queries, a fresh view render, and a policy check:

| Runtime             | p50 (ms) | p95 (ms) | RPS (single core) |
|---------------------|----------|----------|-------------------|
| PHP-FPM 8.3         | 92       | 210      | ~110              |
| Octane + Swoole     | 14       | 42       | ~1,100            |
| Octane + FrankenPHP | 11       | 36       | ~1,300            |

Installation

composer require laravel/octane
php artisan octane:install --server=frankenphp
php artisan octane:start --workers=4 --max-requests=1000

--max-requests restarts each worker after N requests to reclaim any leaked memory — a cheap safety net while you hunt for leaks properly (section 6).
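Worker behavior is tuned in config/octane.php. A hedged excerpt of the published config file — the `warm`/`flush` entries shown are placeholders for your own services, not defaults:

```php
<?php
// config/octane.php (excerpt) - a sketch; adjust the lists to your app.
// 'warm' pre-resolves heavy services once per worker boot;
// 'flush' forces services to be re-resolved between requests.

return [
    'server' => env('OCTANE_SERVER', 'frankenphp'),

    'warm' => [
        // ...Octane's default services, plus your own heavy singletons, e.g.:
        // App\Services\PricingEngine::class,
    ],

    'flush' => [
        // Services that capture per-request state and must be rebuilt, e.g.:
        // App\Services\TenantContext::class,
    ],
];
```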

4. FrankenPHP: The Modern Default for 2026

FrankenPHP is a modern PHP runtime built on the Caddy web server, with native support for HTTP/2, HTTP/3, automatic HTTPS, Early Hints, and worker mode. It’s become the recommended Octane backend for new projects because it ships as a single static binary and handles TLS without a reverse proxy.

Why It Matters for AI SaaS

  • Early Hints (103 responses) let you pre-warm critical CSS/JS in the browser while the LLM is still thinking — shaves 100–300ms off perceived load time on the first page visit. (HTTP/2 server push is deprecated in major browsers; Early Hints is its replacement.)
  • Real HTTP/3 support matters for users on flaky mobile connections, who are a big chunk of chatbot traffic.
  • Native SSE & WebSocket handling — no nginx proxy buffering to disable, no hidden 60s timeouts eating your streaming responses.
  • Single binary deployment — one container, no PHP-FPM, no nginx, no confused health checks.

Production Dockerfile

# Dockerfile (FrankenPHP + Octane, multi-stage)
FROM dunglas/frankenphp:1-php8.3 AS base

RUN install-php-extensions \
    pcntl redis pdo_pgsql opcache intl bcmath

WORKDIR /app
COPY . .
RUN composer install --no-dev --optimize-autoloader --no-interaction \
 && php artisan config:cache \
 && php artisan route:cache \
 && php artisan view:cache \
 && php artisan event:cache

EXPOSE 80 443

# Let Octane manage the FrankenPHP worker script itself - no hand-rolled
# FRANKENPHP_CONFIG or Caddyfile needed for a basic setup.
CMD ["php", "artisan", "octane:frankenphp", "--host=0.0.0.0", "--port=80", "--workers=4", "--max-requests=1000"]

Pair this with Laravel Cloud, Fly.io, or a Forge Docker recipe and you’re shipping a production-grade Octane stack in under 10 minutes.

5. FrankenPHP vs Swoole vs RoadRunner

Octane supports three runtimes. Here’s the honest comparison for an AI SaaS workload in 2026:

| Runtime    | Strengths                                                  | Watch out for                                                        |
|------------|------------------------------------------------------------|----------------------------------------------------------------------|
| FrankenPHP | Single binary, HTTP/3, native SSE, auto-TLS, easy Docker   | Newer ecosystem, fewer tuning knobs than Swoole                      |
| Swoole     | Highest raw RPS, coroutines, deep tuning, mature ecosystem | Requires separate nginx, PECL build, some packages misbehave         |
| RoadRunner | Written in Go, stable, good for gRPC, plays nice with nginx | Slightly lower peak RPS than Swoole; fewer Laravel-specific features |

Our recommendation for 2026: start with FrankenPHP. Migrate to Swoole only if you hit a specific bottleneck that requires coroutine-level parallelism, like thousands of concurrent SSE streams on a single node.

6. Octane Pitfalls: Memory Leaks & Stale State

Octane’s speedup comes from a dangerous tradeoff: the PHP process never dies. Globals, static properties, and singletons persist across requests. Patterns that were harmless under PHP-FPM will now leak memory or leak data between users.

The Classic Offenders

// DANGEROUS - static caches that grow forever
class PriceLookup
{
    private static array $cache = [];

    public static function get(string $sku): Price
    {
        return self::$cache[$sku] ??= Price::findBySku($sku); // leaks!
    }
}

// DANGEROUS - singletons holding request data
$this->app->singleton(CurrentUser::class, fn () => auth()->user());
// Request 1: stores User A
// Request 2: CurrentUser is still User A!

// DANGEROUS - binding in boot() with closures capturing request
public function boot(): void
{
    $this->app->bind(Report::class, fn () => new Report(request()->team_id));
    // request() is the FIRST request forever
}

The Safe Patterns

  • Use $this->app->scoped() instead of singleton() for per-request state. Octane flushes scoped bindings between requests.
  • Clear static caches explicitly via Octane::tick() listeners if they must exist.
  • Never cache request(), auth()->user(), or Tenant::current() at boot time. Resolve them lazily inside handlers.
  • Soak-test in staging with memory profiling — replay thousands of requests per worker and watch for growth before shipping.
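The first two patterns above, sketched in code. `PriceLookup::flush()` is a hypothetical method you would add to the static-cache class from the earlier example; the rest uses documented container and Octane APIs:

```php
<?php
// In a service provider's register():
// scoped() bindings are flushed by Octane between requests, so
// per-request state (current user, tenant) stays per-request.
$this->app->scoped(CurrentUser::class, fn () => auth()->user());

// In a service provider's boot():
// If a static cache must exist, clear it on a tick so it cannot grow forever.
use Laravel\Octane\Facades\Octane;

Octane::tick('flush-price-cache', function () {
    PriceLookup::flush(); // hypothetical: resets the static $cache array
})->seconds(60);
```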

Memory Leak Detection

// app/Console/Commands/OctaneMemoryCheck.php
namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Container\Container;

class OctaneMemoryCheck extends Command
{
    protected $signature = 'octane:memory-check';

    public function handle(): void
    {
        $before = memory_get_usage();

        // Resolve and exercise the suspect service repeatedly, as Octane would.
        for ($i = 0; $i < 1000; $i++) {
            Container::getInstance()->make(YourSuspectService::class)->doWork();
        }

        $growth = memory_get_usage() - $before;
        $this->info('Growth after 1000 calls: ' . number_format($growth) . ' bytes');
        // > 500 KB growth? You have a leak.
    }
}

7. Request-Level Optimizations That Matter

Even on Octane, a single slow handler eats your throughput. The boring fundamentals still dominate.

Kill N+1 Queries for Good

// AppServiceProvider::boot()
use Illuminate\Database\Eloquent\Model;

Model::preventLazyLoading(! $this->app->isProduction());
Model::preventAccessingMissingAttributes(! $this->app->isProduction());

In development, any accidental lazy load throws LazyLoadingViolationException. You’ll find — and fix — every N+1 in your app within a week.

Eager Load Aggregates, Not Relations

// BAD - hydrates every message model just to count them
$counts = Chat::with('messages')->get()->map(fn ($chat) => $chat->messages->count());

// GOOD - single query, zero loaded models
$chats = Chat::withCount('messages')->get();

// BETTER - only the sum of tokens this week
$chats = Chat::withSum(
    ['messages as tokens_this_week' => fn ($q) => $q->whereDate('created_at', '>=', now()->subWeek())],
    'tokens'
)->get();

Cast Once, Not Per Access

Eloquent casts run on every attribute access. For a list of 1,000 records, a json cast on a 50KB column can add 200ms. Use toBase() for read-only aggregates:

// Returns a collection of stdClass - no model hydration, no casts
$rows = DB::table('ai_usage')
    ->where('team_id', $team->id)
    ->whereDate('created_at', '>=', now()->subMonth())
    ->select('feature', DB::raw('SUM(credits_charged) as credits'))
    ->groupBy('feature')
    ->get();

8. The Four Caching Layers Every SaaS Needs

"Caching" is not one thing — it’s four layers that each solve a different problem. A mature Laravel AI SaaS uses all of them.

Layer 1: OPcache + Route/Config Cache

php artisan config:cache
php artisan route:cache
php artisan view:cache
php artisan event:cache

; php.ini
opcache.enable=1
opcache.memory_consumption=256
opcache.max_accelerated_files=20000
opcache.jit=tracing
opcache.jit_buffer_size=100M
opcache.validate_timestamps=0   ; production only - requires a deploy/restart to pick up code changes

Layer 2: Application Cache (Redis)

// Cache expensive computations with proper invalidation
public function getDashboardStats(Team $team): array
{
    return Cache::tags(["team:{$team->id}"])->remember(
        "dashboard:{$team->id}",
        now()->addMinutes(5),
        fn () => $this->computeStats($team),
    );
}

// Invalidate everything for a team in one call on any mutation
Cache::tags(["team:{$team->id}"])->flush();

Layer 3: HTTP Response Cache

For mostly-static endpoints (marketing pages, public docs, public blog), let the CDN do the work. Set Cache-Control and stale-while-revalidate:

return response($html)
    ->header('Cache-Control', 'public, max-age=60, s-maxage=300, stale-while-revalidate=3600');

Layer 4: AI Response Cache

This is the highest-ROI caching layer in an AI SaaS. Identical prompts at low temperature produce interchangeable outputs — cache them and you cut LLM spend and latency dramatically on repeated requests.

public function complete(array $messages, string $model): string
{
    $key = 'llm:' . $model . ':' . md5(json_encode($messages));

    return Cache::remember($key, now()->addDay(), function () use ($messages, $model) {
        return $this->openai->chat()->create(['model' => $model, 'messages' => $messages])
            ->choices[0]->message->content;
    });
}

Combine this with Anthropic’s prompt caching or OpenAI’s cached-input pricing and you get 50–90% LLM cost reduction on repeated system prompts. See the AI SaaS monetization guide for how this ties into margin control.

9. Queue Architecture for Concurrent AI Calls

The cardinal rule of an AI SaaS: no web request should wait on an LLM call that takes longer than 2 seconds. Anything longer goes to a queue and streams its result back via Broadcasting, SSE, or WebSockets.

Separate Queues by SLA

// config/queue.php
'connections' => [
    'redis' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => 'default',
        'retry_after' => 300,
        'block_for' => 1,
    ],
],

// Dispatch to named queues
BlogGeneratorJob::dispatch($team, $prompt)->onQueue('ai-fast');    // < 10s, 20 workers
AgentWorkflowJob::dispatch($team, $task)->onQueue('ai-long');     // up to 5 min, 5 workers
EmbeddingBatchJob::dispatch($team, $docs)->onQueue('ai-bulk');    // batch, 2 workers

Horizon Supervisor Config

// config/horizon.php - these supervisors nest under 'environments' => ['production' => [...]]
'production' => [
    'ai-fast' => [
        'connection' => 'redis',
        'queue' => ['ai-fast'],
        'balance' => 'auto',
        'minProcesses' => 4,
        'maxProcesses' => 20,
        'tries' => 3,
        'timeout' => 30,
    ],
    'ai-long' => [
        'queue' => ['ai-long'],
        'minProcesses' => 2,
        'maxProcesses' => 8,
        'timeout' => 300,
    ],
],

Job-Level Patterns That Scale

  • ShouldBeUnique — prevents double-processing on double-clicks.
  • Bus::batch([...]) — parallelize 100 embeddings, wait for all, then fan in.
  • $this->release($seconds) — honor the provider's Retry-After / rate-limit headers explicitly instead of blind exponential backoff.
  • WithoutOverlapping middleware — one generation per user at a time on free plans.
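Several of these patterns combine naturally in one job class. A hedged sketch — `RateLimitedException` and its `retryAfterSeconds` property are hypothetical stand-ins for whatever your LLM client throws:

```php
<?php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\Middleware\WithoutOverlapping;

class GenerateCompletionJob implements ShouldQueue, ShouldBeUnique
{
    use Queueable;

    public function __construct(public int $userId, public string $prompt) {}

    // One generation per user at a time (free-plan throttle).
    public function middleware(): array
    {
        return [new WithoutOverlapping("user:{$this->userId}")];
    }

    // ShouldBeUnique key: a double-click dispatches this job only once.
    public function uniqueId(): string
    {
        return "gen:{$this->userId}:" . md5($this->prompt);
    }

    public function handle(): void
    {
        try {
            // ... call the LLM ...
        } catch (RateLimitedException $e) { // hypothetical exception type
            // Honor the provider's Retry-After instead of blind backoff.
            $this->release($e->retryAfterSeconds ?? 30);
        }
    }
}
```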

For multi-step orchestration, see the Laravel AI workflows & pipelines guide — it layers on top of this queue foundation.

10. Database Tuning: Indexes, Replicas, Pooling

At 10,000 users your bottleneck is almost never PHP — it’s the database. These three moves move the needle most.

Find Missing Indexes

// Log every query > 100ms in production
DB::listen(function ($query) {
    if ($query->time > 100) {
        Log::channel('slow-queries')->warning('Slow query', [
            'sql' => $query->sql,
            'bindings' => $query->bindings,
            'time' => $query->time,
        ]);
    }
});

Then in Postgres: EXPLAIN (ANALYZE, BUFFERS) <your query> and add a composite index on whatever’s doing a Seq Scan. For a multi-tenant SaaS, every big table needs a (team_id, created_at) index at minimum.
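Adding that composite index is a one-line migration. The table and index names below are illustrative:

```php
<?php
// Migration sketch for the (team_id, created_at) composite index.

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('ai_usage', function (Blueprint $table) {
            // Covers "WHERE team_id = ? ORDER BY created_at" scans -
            // the hottest query shape in a multi-tenant SaaS.
            $table->index(['team_id', 'created_at'], 'ai_usage_team_created_idx');
        });
    }

    public function down(): void
    {
        Schema::table('ai_usage', function (Blueprint $table) {
            $table->dropIndex('ai_usage_team_created_idx');
        });
    }
};
```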

Read Replicas for Reports

// config/database.php
'pgsql' => [
    'read' => [
        'host' => [env('DB_READ_HOST_1'), env('DB_READ_HOST_2')],
    ],
    'write' => [
        'host' => env('DB_HOST'),
    ],
    'sticky' => true, // reads after writes hit primary for consistency
    // ... rest of config
],

Connection Pooling with PgBouncer

Octane opens one DB connection per worker. At 50 workers × 3 nodes = 150 connections, which most Postgres instances don’t want. Put PgBouncer (transaction mode) between Laravel and Postgres and you can scale to thousands of workers with <50 real backend connections.
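A minimal pgbouncer.ini sketch — the pool sizes are illustrative starting points, not tuned numbers:

```ini
; pgbouncer.ini (excerpt) - adjust sizes to your workload
[databases]
app = host=10.0.0.5 port=5432 dbname=app

[pgbouncer]
pool_mode = transaction      ; reuse a backend connection per transaction
default_pool_size = 25       ; real Postgres connections per database/user pair
max_client_conn = 2000       ; headroom for Octane workers across all nodes
server_idle_timeout = 60
```

One caveat: transaction mode breaks session-level features such as named prepared statements. Either enable emulated prepares on the Laravel side (`PDO::ATTR_EMULATE_PREPARES => true` in the pgsql `options` array) or run a PgBouncer version recent enough to track protocol-level prepared statements (1.21+).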

11. Streaming AI Responses Without Blocking Workers

Naive streaming holds a web worker for the full duration of an LLM response — 5 to 30 seconds. A handful of concurrent chats is enough to exhaust a typical worker pool. The fix depends on your runtime.

Option A: Laravel Reverb + Queue (Recommended)

Web request dispatches a job and returns immediately. The job calls the LLM and broadcasts each chunk to the user’s private channel. Web workers stay free; only queue workers block.

// Job
public function handle(): void
{
    $stream = $this->openai->chat()->createStreamed([
        'model' => 'gpt-4o-mini',
        'messages' => $this->messages, // createStreamed sets stream: true itself
    ]);

    foreach ($stream as $chunk) {
        $delta = $chunk->choices[0]->delta->content ?? '';
        if ($delta !== '') {
            broadcast(new AiChunkReceived(
                chatId: $this->chatId,
                delta: $delta,
            ));
        }
    }

    broadcast(new AiStreamCompleted($this->chatId));
}

Option B: Native SSE on FrankenPHP

FrankenPHP handles SSE responses without nginx buffering headaches. This is simpler but does keep a worker busy for the response duration — fine at small scale, not at 1000 concurrent streams. Full implementation is in the AI chatbot streaming guide.

Measure Time-to-First-Token

Users perceive a chatbot as "fast" based on TTFT, not total duration. Log it per model and alert when p95 exceeds 800ms — that’s usually the sign that a provider is degraded and you should failover to your secondary model.
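To make TTFT measurable, wrap the chunk loop from Option A in a small helper. A hedged sketch — `measureTtft` and its callback are my names, not a Laravel or OpenAI API; the callback is where you would log or feed Pulse:

```php
<?php
// Hypothetical helper: times the gap between starting to consume a chunk
// stream and the first non-empty delta. Framework-free so it is easy to test.

function measureTtft(iterable $stream, callable $onFirstToken): array
{
    $start = hrtime(true);
    $ttftMs = null;
    $chunks = [];

    foreach ($stream as $delta) {
        if ($delta !== '' && $ttftMs === null) {
            $ttftMs = (hrtime(true) - $start) / 1e6; // ns -> ms
            $onFirstToken($ttftMs);                  // log / alert / Pulse
        }
        $chunks[] = $delta; // in the real job, broadcast the chunk here
    }

    return ['ttft_ms' => $ttftMs, 'text' => implode('', $chunks)];
}

// Usage with a fake stream:
$result = measureTtft(['', 'Hel', 'lo'], fn ($ms) => null);
// $result['text'] is 'Hello'; $result['ttft_ms'] is a small float
```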

12. Edge, CDN & HTTP Caching

Every megabyte you serve from PHP is a megabyte you’re paying for twice: once in bandwidth and once in worker time. Push static assets, public pages, and JSON responses to the edge.

  • Static assets: immutable Cache-Control: public, max-age=31536000, immutable — Vite already fingerprints file names.
  • Marketing pages: Cloudflare/Fastly with s-maxage=300 — regenerate every 5 min without hitting origin.
  • Public JSON APIs: ETag + Last-Modified so clients send 304 Not Modified for free.
  • Laravel Cloud / Vercel / Cloudflare Workers: terminate TLS and cache at 250+ edge POPs — latency drops from 200ms to 20ms for users in other regions.
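The ETag bullet above, spelled out as middleware. A hedged sketch — the class name is made up, and the explicit version is shown to expose the mechanics:

```php
<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;

class SetEtag
{
    public function handle(Request $request, Closure $next)
    {
        $response = $next($request);

        // Hash the rendered body; a strong ETag is fine for JSON payloads.
        $etag = '"' . md5($response->getContent()) . '"';
        $response->headers->set('ETag', $etag);

        // If the client already has this exact body, send an empty 304.
        if ($request->headers->get('If-None-Match') === $etag) {
            $response->setStatusCode(304);
            $response->setContent('');
        }

        return $response;
    }
}
```

In practice Symfony's `$response->setEtag()` plus `$response->isNotModified($request)` handles the same dance with more edge cases covered.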

13. Monitoring: Pulse, Clockwork, Sentry

You can’t fix what you don’t see. Three tools cover 95% of the visibility a Laravel SaaS needs.

  • Laravel Pulse — first-party real-time dashboard. Slow queries, slow jobs, slow requests, exception rates, cache hit ratios. Free. Install it on day one.
  • Clockwork — browser extension showing per-request timeline with queries, cache hits, and job dispatches. Dev-only, invaluable for spotting N+1 visually.
  • Sentry (or Flare) — exception tracking with release tagging, user context, and performance transactions. Tie it to Slack for on-call.

Pulse Custom Recorder for AI Calls

// Track p95 latency per LLM model
Pulse::record('llm_latency', "{$provider}:{$model}", $latencyMs)
     ->avg()
     ->count();

A few lines and you have a per-model latency chart in the Pulse dashboard. Add an alert for when any provider’s p95 crosses 3s, so you know about an OpenAI incident before your users do.

14. Production Performance Checklist

Before you scale past your first 1,000 users, confirm every item below:

  1. Octane + FrankenPHP running in production with --max-requests safety.
  2. OPcache with JIT enabled and validate_timestamps=0.
  3. Config / route / view / event cached via deploy script.
  4. preventLazyLoading enabled in non-production; zero N+1 in hot paths.
  5. Queue separation — ai-fast, ai-long, ai-bulk on Horizon with sized pools.
  6. No LLM call on the web path — everything >2s is queued & streamed via broadcast.
  7. AI response cache on identical prompt+model, ≥ 1 day TTL.
  8. Prompt caching enabled on Anthropic/OpenAI for repeated system prompts.
  9. Redis for cache & queues, warm and metrics exported.
  10. PgBouncer in transaction mode in front of Postgres.
  11. Composite indexes on every (team_id, created_at) high-traffic table.
  12. Read replicas for analytics & admin queries.
  13. CDN + edge cache on static assets, marketing pages, public APIs.
  14. Laravel Pulse + Sentry live, dashboards bookmarked, alerts wired.
  15. k6 load test in CI hitting staging weekly — thresholds fail the build on regression.
  16. TTFT < 800ms p95 for streaming AI endpoints.
  17. Memory stable across 10,000 requests per Octane worker.

15. Conclusion

"Laravel is slow" is a myth from 2018. A correctly-tuned Laravel 12 AI SaaS on Octane + FrankenPHP sustains thousands of requests per second, streams LLM output at sub-second TTFT, and costs a fraction of what a Node or Go rewrite would take to build and maintain. Speed in Laravel is an architectural choice, not a language choice.

Start with measurement. Move web work off the critical path with queues. Cache at four layers. Index and pool your database. Monitor with Pulse and Sentry. Run a k6 test in CI so regressions fail the build. Every one of these is a small, boring change — and together they’re the difference between a product that melts at 500 users and one that cruises past 50,000 on a modest cluster.

LaraSpeed ships the fast path by default. Octane + FrankenPHP in the Dockerfile, Horizon with AI-tuned queues, Redis-backed cache with tags, Pulse preconfigured, preventLazyLoading in dev, broadcast-based streaming, and composite indexes on every tenant table. The performance work is already done — you just ship product.

Launch a Laravel AI SaaS that’s actually fast

LaraSpeed is the production-ready Laravel 12 SaaS starter kit. Octane + FrankenPHP, Horizon, Redis, Stripe, AI features, multi-tenancy, and full admin panel — wired for speed from day one.

Get LaraSpeed — Starting at $49


One-time purchase · 14-day money-back guarantee