PikaStream 1.0 is the live visual engine that gives any AI agent a face, a voice, and a real-time presence in video meetings, at conversation speed, on a single GPU.
Most video models are too slow for live interaction. They generate clips offline. A single rendered result might look impressive, but it does not behave like a participant in a conversation — it behaves like a voicemail. PikaStream 1.0 is the real-time visual engine built to close that gap.
Instead of producing a finished clip after a prompt, PikaStream generates personalized video continuously while the conversation happens. Speech comes in. Reasoning and audio generation run in parallel. The avatar video streams back out with stable identity, synchronized mouth movement, and emotionally appropriate reactions, all in roughly one and a half seconds of end-to-end latency.
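To make that overlap concrete, here is a minimal runnable sketch of a three-stage streaming loop. The function names (reason, speak, render) are hypothetical stand-ins, not PikaStream's API; the point is that the stages run concurrently, so frames start flowing before the reply text is even finished.

import asyncio

async def reason(prompt: str, text_q: asyncio.Queue) -> None:
    # Hypothetical stand-in: the agent streams its reply token by token.
    for token in ["Sure,", " pulling", " that", " up."]:
        await text_q.put(token)
        await asyncio.sleep(0.05)  # simulated per-token latency
    await text_q.put(None)         # end of reply

async def speak(text_q: asyncio.Queue, audio_q: asyncio.Queue) -> None:
    # Hypothetical streaming TTS: consumes text as it arrives, emits audio.
    while (token := await text_q.get()) is not None:
        await audio_q.put(f"audio<{token.strip()}>")
    await audio_q.put(None)

async def render(audio_q: asyncio.Queue) -> None:
    # Hypothetical renderer: each audio chunk conditions the next video chunk.
    while (chunk := await audio_q.get()) is not None:
        print("frames for", chunk)

async def main() -> None:
    text_q: asyncio.Queue = asyncio.Queue()
    audio_q: asyncio.Queue = asyncio.Queue()
    # All three stages overlap; video begins before the reply is complete.
    await asyncio.gather(
        reason("caller asked for Q3 numbers", text_q),
        speak(text_q, audio_q),
        render(audio_q),
    )

asyncio.run(main())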
The result: an AI agent that does not appear as a blank tile or a name in a participant list. It appears as a dynamic, animated presence — visible to every participant, responsive to the flow of conversation, capable of executing agentic tasks during the call. Pair it with a Pika AI Self and you have a living digital extension of yourself in any Google Meet, ready to meet, decide, and ship.
Pikaformance, the previous-generation ultra-fast Pika model, ran on eight GPUs with 4.5 seconds of latency per response. That was fast for video generation, but far too slow for real-time conversation. PikaStream 1.0 cuts that to a single GPU and 1.5 seconds, and the experience changes completely.
With Pikaformance, every exchange required a 4.5-second wait and conversation rhythm collapsed: fast for video generation, too slow to behave like a participant in a live call. PikaStream 1.0 streams continuously at 24 FPS with end-to-end latency around 1.5 seconds, on one GPU. For the first time, a real-time visual engine can power live, identity-consistent video at conversation speed.
Latency is the product. Not in a benchmark-bragging way — in a "does this kill my momentum" way. We have seen the same idea play out in coding tools, where speed alone changed how developers behave. Now it is happening in video and avatars: when response is fast enough, creators stop scripting everything upfront and start steering in the moment. PikaStream lands in the territory where AI stops being something you pause your workflow to use, and starts being something that exists inside your workflow.
PikaStream 1.0 is built around a 9B Diffusion Transformer paired with a custom streaming VAE — fused with a single-GPU inference pipeline that delivers end-to-end audio-conditioned video at real-time frame rates.
Speech, prompt, and per-frame audio tokens enter the pipeline alongside agent identity, memory, and workspace context.
Bidirectional DiT distilled into a causal autoregressive student via optimized self-forcing, enabling chunk-by-chunk streaming at real-time frame rates.
A full Transformer-based VAE trained from scratch: it provides its own latent space and reconstructs video in real time via streaming decoding.
Full attention across all frames at training, distilled into causal streaming for inference. Identity stays stable across long calls.
Each video frame attends only to its temporally aligned audio tokens, not the full sequence. That's how the lip-sync stays tight; a mask sketch follows below.
Prompts can be swapped on the fly mid-stream, enabling interactive control over motion, expressions, and actions during live generation.
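The two attention patterns above can be visualized as masks. A minimal numpy illustration, with sizes chosen for readability rather than taken from the model: chunk-causal temporal attention for streaming, and block-diagonal audio cross-attention for lip-sync.

import numpy as np

frames, audio_per_frame, chunk = 8, 4, 2   # illustrative sizes only

# Temporal attention: full at training, causal across chunks at inference.
# A frame may attend to its own chunk and to earlier chunks, never ahead.
f = np.arange(frames)
causal_mask = (f[None, :] // chunk) <= (f[:, None] // chunk)

# Audio cross-attention: frame i sees only its own audio tokens, giving a
# block-diagonal mask instead of full attention over the audio sequence.
a = np.arange(frames * audio_per_frame)
aligned_mask = (a[None, :] // audio_per_frame) == f[:, None]

print(causal_mask.astype(int))    # (8, 8) lower block-triangular
print(aligned_mask.astype(int))   # (8, 32) block-diagonal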
FlashVAE builds on Pika's concurrent FlashDecoder research, which showed that a Transformer-based streaming decoder can match conventional 3D convolutional decoders in reconstruction quality (PSNR and LPIPS) while running over 10× faster. Combined with the 9B DiT and an inference pipeline that fuses decoding, audio conditioning, and scheduling into a single-GPU loop, PikaStream pushes 24 FPS on one H100 with 1.1 GB of VAE memory overhead.
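Back of the envelope, 24 FPS leaves roughly 1000/24 ≈ 41.7 ms per frame for denoising, decoding, and delivery, which is where a 10× faster decoder pays off. Below is a purely schematic chunk-by-chunk loop; generate_chunk and decode_chunk are hypothetical stand-ins, not the published pipeline, but they show how the causal student and the streaming decoder must cooperate inside that budget.

import time
import numpy as np

FPS, CHUNK = 24, 4                     # illustrative: 4 latent frames per chunk
BUDGET_S = CHUNK / FPS                 # ~167 ms to produce and ship each chunk

def generate_chunk(history, audio):
    # Hypothetical causal DiT step: sees only past chunks and this chunk's audio.
    return np.random.randn(CHUNK, 16)

def decode_chunk(latents):
    # Hypothetical streaming VAE decode: latent frames to pixel frames.
    return np.tanh(latents)

history = []
for step in range(3):                  # a few chunks of a live stream
    t0 = time.perf_counter()
    audio = np.random.randn(CHUNK, 8)  # per-frame audio tokens for this chunk
    latents = generate_chunk(history, audio)
    frames = decode_chunk(latents)     # shipped before the next chunk begins
    history.append(latents)            # the future conditions only on the past
    took = time.perf_counter() - t0
    print(f"chunk {step}: {took*1e3:.2f} ms of a {BUDGET_S*1e3:.0f} ms budget")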
The published research note (April 2, 2026) makes concrete claims around frame rate, latency, decoding speed, lip-sync alignment, identity consistency, and the model architecture behind the experience. That level of disclosure positions PikaStream as a serious systems milestone: not a viral product teaser, but a real-time generation infrastructure layer that other agent-enabled applications can plug into.
PikaStream 1.0 is not a filter or an avatar overlay. It is a generative model engineered for the demands of live human meetings — visual presence, vocal identity, persistent memory, and agentic action, all running in real time.
Your agent appears as a dynamic, animated avatar visible to every meeting participant — not a static icon, not a delayed clip. Continuous 24 FPS generation while the call unfolds.
Record a short audio sample with the clone-voice subcommand. Optional noise-reduction flag. Your agent speaks with your voice — not generic text-to-speech.
Supply a description with generate-avatar and PikaStream uses OpenAI image models to produce a custom avatar. Or pass --image to use your own asset.
Unlike conventional meeting bots, the agent retains context about who it is, who it knows, and what was previously discussed — session to session, week to week.
The agent does not just talk. It executes tasks live during the call — pulling data, updating docs, scheduling actions — without breaking conversational flow.
Before joining a call, PikaStream synthesizes your identity, recent activity, and known contacts into the system prompt, so responses arrive informed, not generic. A sketch of that assembly follows below.
Tight lip-sync alignment plus appropriate emotional reactions. Gestures land where they should. Eye contact and facial cues make the avatar feel like a participant.
The agent summarizes what was decided, who said what, and what the action items are — generated and shared automatically once the call ends.
Works with any AI agent that can process markdown instructions and run scripts — including Claude, OpenClaw, and custom agents you build yourself.
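The pre-call context synthesis mentioned above lends itself to a simple sketch. Every field and function name here is an assumption for illustration; PikaStream's actual schema is not published.

from dataclasses import dataclass, field

@dataclass
class AgentContext:
    # Hypothetical fields standing in for identity, memory, and contacts.
    identity: str
    recent_activity: list = field(default_factory=list)
    contacts: list = field(default_factory=list)

def build_system_prompt(ctx: AgentContext, meeting: str) -> str:
    # Pre-call synthesis: who the agent is, what it did lately, who it knows.
    return "\n".join([
        f"You are {ctx.identity}, joining: {meeting}.",
        "Recent activity: " + "; ".join(ctx.recent_activity),
        "Known participants: " + ", ".join(ctx.contacts),
    ])

print(build_system_prompt(
    AgentContext("Ada's AI Self", ["shipped v1.2 release notes"], ["Ben", "Chloe"]),
    "Q3 planning standup",
))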
PikaStream is delivered as a Skill through the open-source Pika-Skills repository on GitHub. Clone, configure, and your agent can join its first Google Meet within minutes.
Visit the Pika developer portal and generate an API key. PikaStream runs an automated balance check before each session; if you're low on credits, you'll get a secure top-up link.
pika.me/dev/login
The skill is published in Pika-Labs/Pika-Skills on GitHub. Drop it into your agent runtime — it auto-detects and exposes the meeting interface to your agent without manual wiring.
github.com/Pika-Labs/Pika-Skills
Run clone-voice with an audio sample to lock in your voice. Run generate-avatar with a prompt, or pass --image if you already have one.
./pikastream clone-voice
Pass a Google Meet URL to ./pikastream join with your voice and avatar. Your agent connects, presents on camera, and executes tasks during the call. Pika app and Google Meet today; Zoom & FaceTime soon.
./pikastream join --meet ...
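Taken together, the steps above amount to three commands. Here is a minimal Python driver using only the subcommands and flags shown above, with a placeholder image path and Meet URL; any flags beyond these are guesses, so check the repository's docs.

import subprocess

def pikastream(*args: str) -> None:
    # Thin wrapper so a failed step stops the script instead of passing silently.
    subprocess.run(["./pikastream", *args], check=True)

pikastream("clone-voice")                           # 1. lock in your voice from a sample
pikastream("generate-avatar", "--image", "me.png")  # 2. or describe one with a prompt
pikastream("join", "--meet", "https://meet.google.com/abc-defg-hij")  # placeholder URL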
The clearest wins are anywhere a meeting needs a face that can both talk and act — from delegating coverage to standing up customer presence to scaling 1:1 conversations beyond a single human's bandwidth.
Send your Pika AI Self to a meeting you can't make. It shows up with your face, your voice, and full context; it takes notes, surfaces decisions, and kicks off follow-ups before the call ends.
A real visual presence on a video call that answers questions, retrieves account data live, and resolves issues face-to-face, at the price point of an automation, not a human seat.
Prospects book in, get a real conversational walkthrough led by your agent. Tasks like creating draft proposals or sending follow-up materials happen live during the call.
A team agent joins recurring standups, syncs status across tools, and reports back. Continuous workspace awareness means context is preserved across every meeting.
Your agent speaks languages you don't. Run identical face-to-face conversations across markets without hiring local reps for every region or time zone.
Open up "video calls with your AI Self" as a tier for fans, students, or community members. Scale presence beyond what a single human schedule can hold.
Course intros, office hours, language practice, training sessions. The agent knows the curriculum, remembers each learner, and adapts in real time during the lesson.
Drop PikaStream into any agent runtime via API. Suddenly your custom agent has a face, a voice, and a meeting-ready presence — with zero rendering infrastructure to maintain.
Usage-based by design. The bot is billed only while it's active in a meeting, so short check-ins and 1:1 calls stay economical while longer sessions scale predictably.
Standard rate while PikaStream is active in a meeting. The skill runs an automated balance check before each session and surfaces a top-up link if your credits are low.
A 5-minute customer check-in costs around a dollar. A 30-minute team standup runs about six. Cheaper than coffee, more reliable than a junior coverage hire.
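Working backwards from those examples, the implied rate is about $0.20 per active minute. A trivial sketch of the billing model, assuming that rate holds linearly:

RATE_PER_MIN = 0.20   # implied by ~$1 for 5 minutes and ~$6 for 30 minutes

def session_cost(active_minutes: float) -> float:
    # Billed only while the bot is active in the meeting.
    return active_minutes * RATE_PER_MIN

for minutes in (5, 30, 60):
    print(f"{minutes:>2} min -> ${session_cost(minutes):.2f}")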
Setup is technical today — primarily through GitHub and developer keys. A consumer-friendly experience inside the Pika app is rolling out for Pika AI Self users.
Voice cloning, avatar generation via OpenAI image models, identity persistence, agentic skill integration, and post-meeting summaries — all in the standard rate.
PikaStream 1.0 is live in beta. Get a developer key, clone the Pika-Skills repository, and your agent can join its first Google Meet in minutes — with your voice, your avatar, and the ability to execute tasks live on the call.