Artly Barista Bot: imitation learning, motion-capture training, autonomous latte art

Posted by – January 11, 2026
Category: Exclusive videos

Artly positions its VA platform as a “robot training school” for physical AI: instead of scripting a single demo, they build a reusable skill library that can drive a robotic barista workflow and then expand into other manipulation tasks. In this interview, CEO/co-founder Yushan Chen frames the coffee system as the first high-volume application, where the robot has to execute a full sequence—grind dose, tamp, pull espresso, steam milk, pour, and finish latte art—with repeatable timing and tool handling. https://www.artly.ai/
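The "skill library" idea can be made concrete with a small sketch. This is not Artly's actual software; it is a minimal illustration of the pattern described in the interview, where the barista workflow is just one ordered sequence of reusable skills (all names here are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Skill:
    """One reusable skill in the library (hypothetical structure)."""
    name: str
    run: Callable[[Dict], Dict]  # consumes and returns a shared task state

def make_step(label: str) -> Skill:
    """Build a placeholder skill that just logs its execution."""
    def run(state: Dict) -> Dict:
        state.setdefault("log", []).append(label)
        return state
    return Skill(label, run)

# The full latte sequence called out in the interview, as a skill pipeline.
LATTE_SEQUENCE: List[Skill] = [
    make_step(s)
    for s in ["grind_dose", "tamp", "pull_espresso",
              "steam_milk", "pour", "latte_art"]
]

def execute(sequence: List[Skill], state: Dict) -> Dict:
    """Run each skill in order, threading the shared state through."""
    for skill in sequence:
        state = skill.run(state)
    return state
```

The appeal of this structure is that a new application (say, drink garnish) is just a different list of skills drawn from the same library, with no change to the executor.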

A key technical idea here is learning-from-demonstration (imitation learning): an engineer performs the task while wearing sensors (motion capture / teleoperation style inputs), and the robot later reproduces the same trajectories. During training, the platform records synchronized action data plus camera streams, then uses perception to re-localize target objects at runtime. In the demo, the arm-mounted vision stack identifies items like oranges and apples and closes the loop so the robot can continue a pick-and-place motion even when the scene is slightly different each try.
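One simple way to picture the "reproduce the trajectory, but re-localize the target" step is to store the demonstrated waypoints relative to the object's pose at demo time, then re-anchor them on the pose the vision stack reports at runtime. This is a deliberately minimal sketch of that idea, not Artly's pipeline (real imitation-learning policies are learned models, not fixed offsets):

```python
import numpy as np

def record_demonstration(waypoints, object_pose):
    """Store demo waypoints as offsets from the demonstrated object's position."""
    anchor = np.asarray(object_pose, dtype=float)
    return [np.asarray(w, dtype=float) - anchor for w in waypoints]

def replay(relative_waypoints, new_object_pose):
    """Re-anchor the recorded trajectory on the object re-localized by vision."""
    anchor = np.asarray(new_object_pose, dtype=float)
    return [w + anchor for w in relative_waypoints]

# Demo: approach from above an object at (1, 0, 0), then descend to it.
demo = record_demonstration([[1, 0, 5], [1, 0, 1]], object_pose=[1, 0, 0])
# At runtime the camera finds the same object at (3, 2, 0) instead.
runtime_path = replay(demo, new_object_pose=[3, 2, 0])
```

This object-relative framing is why the demo keeps working "even when the scene is slightly different each try": the trajectory shape comes from the demonstration, while the anchor comes from perception on every attempt.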

They also call out Intel RealSense depth cameras for object perception, which fits the need for 3D pose estimation, reach planning, and gentle grasp control around deformable objects. The robot detects failed grasps, retracts, and retries—suggesting basic recovery logic plus confidence checks that keep the arm from “committing” to a bad pickup. Even with a short training session (they mention about two minutes), you can see how fast a narrow, well-instrumented skill can be brought to a usable level.
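The detect-retract-retry behavior described above maps onto a simple control loop: check perception confidence before committing, verify the grasp after closing the gripper, and retry a bounded number of times. The sketch below assumes hypothetical sensor callbacks and a made-up threshold; it only illustrates the recovery pattern, not any real grasp controller:

```python
GRASP_CONFIDENCE_THRESHOLD = 0.8  # hypothetical tuning value
MAX_ATTEMPTS = 3                  # bound retries so the arm doesn't loop forever

def attempt_grasp(sense_confidence, close_gripper, grasp_succeeded):
    """Retry loop: skip low-confidence poses, verify each grasp, retract on failure.

    sense_confidence: () -> float, perception's confidence in the current grasp pose
    close_gripper:    () -> None, commands the gripper closed
    grasp_succeeded:  () -> bool, e.g. gripper force/width confirms an object is held
    """
    for _ in range(MAX_ATTEMPTS):
        if sense_confidence() < GRASP_CONFIDENCE_THRESHOLD:
            continue  # don't "commit" to a bad pickup; re-sense instead
        close_gripper()
        if grasp_succeeded():
            return True
        # Failed grasp detected: (implicitly) retract and try again
    return False
```

The key design point is that failure is treated as a normal branch of the loop rather than an exception, which is what lets a narrow skill reach usable reliability quickly.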

Beyond the lab, Artly says it has around 40 deployments across North America, and the point of that footprint is data: every real execution can become additional training signal to refine the policy and improve robustness across different cups, fruit sizes, and counter layouts. The video itself was filmed at CES Las Vegas 2026, where this kind of closed-loop manipulation is showing up on the show floor less as a novelty and more as a practical "physical AI" pattern for retail automation.

Artly’s roadmap in the conversation is basically dexterity plus generality: better end-effectors (including more hand-like grippers), richer sensory feedback, and progressively harder latte-art patterns that demand tighter control of flow rate, tilt angle, and microfoam texture. If the platform can keep turning demonstrations into dependable, auditable skills—perception, grasping, tool use, and recovery—it becomes a template for other tasks like drink garnish or fresh-ingredient handling without major changes to the training loop. That generalization step is the interesting part to watch next year.

I’m publishing 100+ videos from CES 2026, uploading about four per day at 5AM/11AM/5PM/11PM CET/EST. Check out all my CES 2026 videos in my playlist here: https://www.youtube.com/playlist?list=PL7xXqJFxvYvjaMwKMgLb6ja_yZuano19e

This video was filmed using the DJI Pocket 3 ($669 at https://amzn.to/4aMpKIC) with the dual wireless DJI Mic 2 microphones and the DJI lapel microphone (https://amzn.to/3XIj3l8). Watch all my DJI Pocket 3 videos here: https://www.youtube.com/playlist?list=PL7xXqJFxvYvhDlWIAxm_pR9dp7ArSkhKK

Click the “Super Thanks” button below the video to send a highlighted comment under the video! Brands I film are welcome to support my work in this way 😁

Check out my video with Daylight Computer about their revolutionary Sunlight Readable Transflective LCD Display for Healthy Learning: https://www.youtube.com/watch?v=U98RuxkFDYY

source https://www.youtube.com/watch?v=B_TZLnS5Mw8