LLMs gave us reasoning. RAG gave us retrieval. Tool calling gave us action. What’s missing in the modern agent stack is perception: the ability to see, hear, and remember the world as it happens.
This workshop is a practical walkthrough of building a perception layer for agents using VideoDB. You’ll learn how to convert continuous media (screen, mic, camera, RTSP, files) into a structured context your agent can use:
- Indexes (searchable understanding)
- Events (real-time triggers)
- Memory (episodic recall with playable evidence)
We’ll implement the core loop:
Continuous Media → Perception Layer (VideoDB) → Agent (reasoning + action) → Output grounded in evidence
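As an illustrative sketch (hypothetical names, not VideoDB's actual API), the loop above reduces to: perception turns raw media into timestamped records, the agent queries them, and every answer points back to a playable time range.

```python
from dataclasses import dataclass

@dataclass
class Moment:
    start: float  # seconds into the stream
    end: float
    text: str     # what was perceived (transcript, caption, label)

class PerceptionLayer:
    """Hypothetical stand-in for the perception layer: stores timestamped moments."""
    def __init__(self):
        self.moments: list[Moment] = []

    def ingest(self, start: float, end: float, text: str) -> None:
        self.moments.append(Moment(start, end, text))

    def search(self, query: str) -> list[Moment]:
        # Toy keyword match; a real index would use semantic search.
        q = query.lower()
        return [m for m in self.moments if q in m.text.lower()]

def agent_answer(layer: PerceptionLayer, question: str) -> str:
    hits = layer.search(question)
    if not hits:
        return "No evidence found."
    m = hits[0]
    # Output grounded in evidence: a claim plus a playable time range.
    return f"Found: '{m.text}' (evidence: {m.start:.0f}s-{m.end:.0f}s)"

layer = PerceptionLayer()
layer.ingest(12, 18, "presenter shares the quarterly dashboard")
layer.ingest(95, 102, "error dialog appears on screen")
print(agent_answer(layer, "error"))
```

The point of the sketch is the shape of the contract, not the implementation: the agent never sees raw pixels, only searchable moments it can cite.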
Who should attend:
- Engineers building agents that need continuous and temporal awareness (not one-shot screenshots).
- Research teams building in physical AI, desktop robots, and wearables.
- Product teams building meeting bots, desktop copilots, monitoring/ops, and QA/compliance tools.
- Founders building multimodal apps where “show me the moment” matters.

What you’ll discover:
- What “perception” actually means for agents: continuous, temporal, multi-source, searchable, actionable.
- How to support three input modes with one mental model: files, live streams, desktop capture.
- How to build searchable memory so your agent can retrieve results with playable evidence, not vibes.
- How to move from batch video AI to real-time event streams your agent can react to immediately.
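The last point, batch to real-time, can be pictured with a minimal sketch (hypothetical callback names, not VideoDB's actual event API): instead of running a job over a finished file, you register a predicate, and a handler fires the moment an incoming description matches.

```python
from typing import Callable

class EventStream:
    """Toy real-time event layer: handlers fire as media arrives,
    rather than after a batch job over the whole file."""
    def __init__(self):
        self.rules: list[tuple[Callable[[str], bool], Callable[[str], None]]] = []

    def on(self, predicate: Callable[[str], bool],
           handler: Callable[[str], None]) -> None:
        self.rules.append((predicate, handler))

    def push(self, description: str) -> None:
        # Called once per perceived moment, as it happens.
        for predicate, handler in self.rules:
            if predicate(description):
                handler(description)

alerts: list[str] = []
stream = EventStream()
stream.on(lambda d: "error" in d, alerts.append)

stream.push("user opens settings")
stream.push("error dialog: payment failed")
print(alerts)  # ['error dialog: payment failed']
```

The design choice worth noticing: the agent subscribes to events rather than polling, so reaction latency is bounded by ingestion, not by batch-job scheduling.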
Plus:
- A starter template you can reuse: “Index + Events + Memory” as the default perception stack
- Networking with builders working on agents + multimodal infra
Learn more and register here.