PikaStream 1.0 is the live visual engine that gives any AI agent a face, a voice, and a real-time presence in video meetings, at conversation speed, on a single GPU.
Most video models are too slow for live interaction. They generate clips offline. A single rendered result might look impressive, but it does not behave like a participant in a conversation — it behaves like a voicemail. PikaStream 1.0 is the real-time visual engine built to close that gap.
Instead of producing a finished clip after a prompt, PikaStream generates personalized video continuously while the conversation happens. Speech comes in. Reasoning and audio generation run in parallel. The avatar video streams back out with stable identity, synchronized mouth movement, and emotionally appropriate reactions, all in roughly one and a half seconds of end-to-end latency.
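To make that overlap concrete, here is a minimal runnable sketch of a three-stage streaming loop. The function names (reason, speak, render) are hypothetical stand-ins, not PikaStream's API; the point is that the stages run concurrently, so frames start flowing before the reply text is even finished.

import asyncio

async def reason(prompt: str, text_q: asyncio.Queue) -> None:
    # Hypothetical stand-in: the agent streams its reply token by token.
    for token in ["Sure,", " pulling", " that", " up."]:
        await text_q.put(token)
        await asyncio.sleep(0.05)  # simulated per-token latency
    await text_q.put(None)         # end of reply

async def speak(text_q: asyncio.Queue, audio_q: asyncio.Queue) -> None:
    # Hypothetical streaming TTS: consumes text as it arrives, emits audio.
    while (token := await text_q.get()) is not None:
        await audio_q.put(f"audio<{token.strip()}>")
    await audio_q.put(None)

async def render(audio_q: asyncio.Queue) -> None:
    # Hypothetical renderer: each audio chunk conditions the next video chunk.
    while (chunk := await audio_q.get()) is not None:
        print("frames for", chunk)

async def main() -> None:
    text_q: asyncio.Queue = asyncio.Queue()
    audio_q: asyncio.Queue = asyncio.Queue()
    # All three stages overlap; video begins before the reply is complete.
    await asyncio.gather(
        reason("caller asked for Q3 numbers", text_q),
        speak(text_q, audio_q),
        render(audio_q),
    )

asyncio.run(main())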
The result: an AI agent that does not appear as a blank tile or a name in a participant list. It appears as a dynamic, animated presence — visible to every participant, responsive to the flow of conversation, capable of executing agentic tasks during the call. Pair it with a Pika AI Self and you have a living digital extension of yourself in any Google Meet, ready to meet, decide, and ship.
Pikaformance, the previous-generation ultra-fast Pika model, ran on eight GPUs with 4.5 seconds of latency per response. That was fast for video generation, but far too slow for real-time conversation. PikaStream 1.0 cuts that to a single GPU and 1.5 seconds, and the experience changes completely.
With Pikaformance, every exchange required a 4.5-second wait and conversation rhythm collapsed: fast for video generation, too slow to behave like a participant in a live call. PikaStream 1.0 streams continuously at 24 FPS with end-to-end latency around 1.5 seconds, on one GPU. For the first time, a real-time visual engine can power live, identity-consistent video at conversation speed.
Latency is the product. Not in a benchmark-bragging way — in a "does this kill my momentum" way. We have seen the same idea play out in coding tools, where speed alone changed how developers behave. Now it is happening in video and avatars: when response is fast enough, creators stop scripting everything upfront and start steering in the moment. PikaStream lands in the territory where AI stops being something you pause your workflow to use, and starts being something that exists inside your workflow.
PikaStream 1.0 is built around a 9B Diffusion Transformer paired with a custom streaming VAE — fused with a single-GPU inference pipeline that delivers end-to-end audio-conditioned video at real-time frame rates.
Speech, prompt, and per-frame audio tokens enter the pipeline alongside agent identity, memory, and workspace context.
Bidirectional DiT distilled into a causal autoregressive student via optimized self-forcing, enabling chunk-by-chunk streaming at real-time frame rates.
A full Transformer-based VAE trained from scratch: it provides its own latent space and reconstructs video in real time via streaming decoding.
Full attention across all frames at training, distilled into causal streaming for inference. Identity stays stable across long calls.
Each video frame attends only to its temporally aligned audio tokens, not the full sequence. That's how the lip-sync stays tight; a mask sketch follows below.
Prompts can be swapped on the fly mid-stream, enabling interactive control over motion, expressions, and actions during live generation.
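The two attention patterns above can be visualized as masks. A minimal numpy illustration, with sizes chosen for readability rather than taken from the model: chunk-causal temporal attention for streaming, and block-diagonal audio cross-attention for lip-sync.

import numpy as np

frames, audio_per_frame, chunk = 8, 4, 2   # illustrative sizes only

# Temporal attention: full at training, causal across chunks at inference.
# A frame may attend to its own chunk and to earlier chunks, never ahead.
f = np.arange(frames)
causal_mask = (f[None, :] // chunk) <= (f[:, None] // chunk)

# Audio cross-attention: frame i sees only its own audio tokens, giving a
# block-diagonal mask instead of full attention over the audio sequence.
a = np.arange(frames * audio_per_frame)
aligned_mask = (a[None, :] // audio_per_frame) == f[:, None]

print(causal_mask.astype(int))    # (8, 8) lower block-triangular
print(aligned_mask.astype(int))   # (8, 32) block-diagonal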
FlashVAE builds on Pika's concurrent FlashDecoder research, which showed that a Transformer-based streaming decoder can match conventional 3D convolutional decoders in reconstruction quality (PSNR and LPIPS) while running over 10× faster. Combined with the 9B DiT and an inference pipeline that fuses decoding, audio conditioning, and scheduling into a single-GPU loop, PikaStream pushes 24 FPS on one H100 with 1.1 GB of VAE memory overhead.
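Back of the envelope, 24 FPS leaves roughly 1000/24 ≈ 41.7 ms per frame for denoising, decoding, and delivery, which is where a 10× faster decoder pays off. Below is a purely schematic chunk-by-chunk loop; generate_chunk and decode_chunk are hypothetical stand-ins, not the published pipeline, but they show how the causal student and the streaming decoder must cooperate inside that budget.

import time
import numpy as np

FPS, CHUNK = 24, 4                     # illustrative: 4 latent frames per chunk
BUDGET_S = CHUNK / FPS                 # ~167 ms to produce and ship each chunk

def generate_chunk(history, audio):
    # Hypothetical causal DiT step: sees only past chunks and this chunk's audio.
    return np.random.randn(CHUNK, 16)

def decode_chunk(latents):
    # Hypothetical streaming VAE decode: latent frames to pixel frames.
    return np.tanh(latents)

history = []
for step in range(3):                  # a few chunks of a live stream
    t0 = time.perf_counter()
    audio = np.random.randn(CHUNK, 8)  # per-frame audio tokens for this chunk
    latents = generate_chunk(history, audio)
    frames = decode_chunk(latents)     # shipped before the next chunk begins
    history.append(latents)            # the future conditions only on the past
    took = time.perf_counter() - t0
    print(f"chunk {step}: {took*1e3:.2f} ms of a {BUDGET_S*1e3:.0f} ms budget")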
The published research note (April 2, 2026) makes concrete claims around frame rate, latency, decoding speed, lip-sync alignment, identity consistency, and the model architecture behind the experience. That level of disclosure positions PikaStream as a serious systems milestone: not a viral product teaser, but a real-time generation infrastructure layer that other agent-enabled applications can plug into.
PikaStream 1.0 is not a filter or an avatar overlay. It is a generative model engineered for the demands of live human meetings — visual presence, vocal identity, persistent memory, and agentic action, all running in real time.
Your agent appears as a dynamic, animated avatar visible to every meeting participant — not a static icon, not a delayed clip. Continuous 24 FPS generation while the call unfolds.
Record a short audio sample with the clone-voice subcommand. Optional noise-reduction flag. Your agent speaks with your voice — not generic text-to-speech.
Supply a description with generate-avatar and PikaStream uses OpenAI image models to produce a custom avatar. Or pass --image to use your own asset.
Unlike conventional meeting bots, the agent retains context about who it is, who it knows, and what was previously discussed — session to session, week to week.
The agent does not just talk. It executes tasks live during the call — pulling data, updating docs, scheduling actions — without breaking conversational flow.
Before joining a call, PikaStream synthesizes your identity, recent activity, and known contacts into the system prompt, so responses arrive informed, not generic. A sketch of that assembly follows below.
Tight lip-sync alignment plus appropriate emotional reactions. Gestures land where they should. Eye contact and facial cues make the avatar feel like a participant.
The agent summarizes what was decided, who said what, and what the action items are — generated and shared automatically once the call ends.
Works with any AI agent that can process markdown instructions and run scripts — including Claude, OpenClaw, and custom agents you build yourself.
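The pre-call context synthesis mentioned above lends itself to a simple sketch. Every field and function name here is an assumption for illustration; PikaStream's actual schema is not published.

from dataclasses import dataclass, field

@dataclass
class AgentContext:
    # Hypothetical fields standing in for identity, memory, and contacts.
    identity: str
    recent_activity: list = field(default_factory=list)
    contacts: list = field(default_factory=list)

def build_system_prompt(ctx: AgentContext, meeting: str) -> str:
    # Pre-call synthesis: who the agent is, what it did lately, who it knows.
    return "\n".join([
        f"You are {ctx.identity}, joining: {meeting}.",
        "Recent activity: " + "; ".join(ctx.recent_activity),
        "Known participants: " + ", ".join(ctx.contacts),
    ])

print(build_system_prompt(
    AgentContext("Ada's AI Self", ["shipped v1.2 release notes"], ["Ben", "Chloe"]),
    "Q3 planning standup",
))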
PikaStream is delivered as a Skill through the open-source Pika-Skills repository on GitHub. Clone, configure, and your agent can join its first Google Meet within minutes.
Visit the Pika developer portal and generate an API key. PikaStream runs an automated balance check before each session; if you're low on credits, you'll get a secure top-up link.
pika.me/dev/login
The skill is published in Pika-Labs/Pika-Skills on GitHub. Drop it into your agent runtime — it auto-detects and exposes the meeting interface to your agent without manual wiring.
github.com/Pika-Labs/Pika-Skills
Run clone-voice with an audio sample to lock in your voice. Run generate-avatar with a prompt, or pass --image if you already have one.
./pikastream clone-voice
Pass a Google Meet URL to ./pikastream join with your voice and avatar. Your agent connects, presents on camera, and executes tasks during the call. Pika app and Google Meet today; Zoom & FaceTime soon.
./pikastream join --meet ...
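Taken together, the steps above amount to three commands. Here is a minimal Python driver using only the subcommands and flags shown above, with a placeholder image path and Meet URL; any flags beyond these are guesses, so check the repository's docs.

import subprocess

def pikastream(*args: str) -> None:
    # Thin wrapper so a failed step stops the script instead of passing silently.
    subprocess.run(["./pikastream", *args], check=True)

pikastream("clone-voice")                           # 1. lock in your voice from a sample
pikastream("generate-avatar", "--image", "me.png")  # 2. or describe one with a prompt
pikastream("join", "--meet", "https://meet.google.com/abc-defg-hij")  # placeholder URL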
The clearest wins are anywhere a meeting needs a face that can both talk and act — from delegating coverage to standing up customer presence to scaling 1:1 conversations beyond a single human's bandwidth.
Send your Pika AI Self to a meeting you can't make. It shows up with your face, your voice, and full context; it takes notes, surfaces decisions, and kicks off follow-ups before the call ends.
A real visual presence on a video call that answers questions, retrieves account data live, and resolves issues face-to-face, at the price point of an automation, not a human seat.
Prospects book in, get a real conversational walkthrough led by your agent. Tasks like creating draft proposals or sending follow-up materials happen live during the call.
A team agent joins recurring standups, syncs status across tools, and reports back. Continuous workspace awareness means context is preserved across every meeting.
Your agent speaks languages you don't. Run identical face-to-face conversations across markets without hiring local reps for every region or time zone.
Open up "video calls with your AI Self" as a tier for fans, students, or community members. Scale presence beyond what a single human schedule can hold.
Course intros, office hours, language practice, training sessions. The agent knows the curriculum, remembers each learner, and adapts in real time during the lesson.
Drop PikaStream into any agent runtime via API. Suddenly your custom agent has a face, a voice, and a meeting-ready presence — with zero rendering infrastructure to maintain.
Usage-based by design. The bot is billed only while it's active in a meeting, so short check-ins and 1:1 calls stay economical while longer sessions scale predictably.
Standard rate while PikaStream is active in a meeting. The skill runs an automated balance check before each session and surfaces a top-up link if your credits are low.
A 5-minute customer check-in costs around a dollar. A 30-minute team standup runs about six. Cheaper than coffee, more reliable than a junior coverage hire.
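Working backwards from those examples, the implied rate is about $0.20 per active minute. A trivial sketch of the billing model, assuming that rate holds linearly:

RATE_PER_MIN = 0.20   # implied by ~$1 for 5 minutes and ~$6 for 30 minutes

def session_cost(active_minutes: float) -> float:
    # Billed only while the bot is active in the meeting.
    return active_minutes * RATE_PER_MIN

for minutes in (5, 30, 60):
    print(f"{minutes:>2} min -> ${session_cost(minutes):.2f}")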
Setup is technical today — primarily through GitHub and developer keys. A consumer-friendly experience inside the Pika app is rolling out for Pika AI Self users.
Voice cloning, avatar generation via OpenAI image models, identity persistence, agentic skill integration, and post-meeting summaries — all in the standard rate.
PikaStream 1.0 is live in beta. Get a developer key, clone the Pika-Skills repository, and your agent can join its first Google Meet in minutes — with your voice, your avatar, and the ability to execute tasks live on the call.