{"id":17405,"date":"2026-03-03T08:29:24","date_gmt":"2026-03-03T01:29:24","guid":{"rendered":"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/"},"modified":"2026-03-03T08:29:24","modified_gmt":"2026-03-03T01:29:24","slug":"nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video","status":"publish","type":"post","link":"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/","title":{"rendered":"Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video"},"content":{"rendered":"<\/p>\n<p><a href=\"https:\/\/www.yankodesign.com\/tag\/ces-2026\/\">CES 2026<\/a> was crowded with humanoids doing simple household tasks such as folding laundry or loading the dishwasher. One thing I was sure of, watching this influx of robots at the world\u2019s biggest tech event, was that such service bots are going to be the next big thing to invade our households.<\/p>\n<p>Staying with that thought: the robotics industry\u2019s biggest challenge right now is teaching robots to operate in the messy real world. Unstructured environments mean robots need massive amounts of data to learn. Gathering and structuring that data is the costliest part of robotics, and perhaps its biggest impediment, slowing the entire development process.<\/p>\n<p>Designer: <a href=\"https:\/\/dreamdojo-world.github.io\/\">DreamDojo<\/a><\/p>\n<\/p>\n<p><a href=\"http:\/\/yankodesign.com\/tag\/nvidia\">NVIDIA<\/a> believes it has created a workaround. The company has released DreamDojo, an open-source \u201cworld model\u201d intended to help robots learn the intuitive physics of the real world by watching humans act in it first. 
So, instead of relying on painstaking programming or teleoperation, DreamDojo would allow robots to train on 44,000 hours of egocentric human video showing humans handling tools, assembling objects, and doing laundry.<\/p>\n<\/p>\n<p>NVIDIA calls the underlying dataset the \u201clargest dataset to date for world model training.\u201d Named DreamDojo-HV (Human Video), it comprises 44,711 hours of footage covering 6,015 unique tasks and more than a million trajectories. Nvidia bills it as 15 times larger, about 96 times richer in skills, and roughly 2,000 times more diverse in scenes than the previous largest datasets for world model training.<\/p>\n<h2>Two-phase robotic course for being human<\/h2>\n<p>Collecting robot-specific data is the biggest bottleneck in the industry. By substituting abundant human video, Nvidia is trying to make learning cheaper and more convenient for robotics companies betting on humanoids. To me, this possibility of learning by seeing before touching physical objects is compelling. 
Its execution is divided into two phases: pre-training and post-training.<\/p>\n<\/p>\n<p>First, the model pre-trains on large-scale human video using what Nvidia calls \u201clatent actions.\u201d Since human videos do not come with joint torque labels or motor commands, Nvidia trained a \u201c700-million-parameter spatiotemporal Transformer\u201d to extract \u201cproxy actions\u201d from visual changes between frames, allowing the model to \u201ctreat any human video as if it came with motor commands attached.\u201d Second, the model post-trains on a specific robot body with \u201ccontinuous robot actions.\u201d The idea is to separate physical understanding from hardware control, so that the robot learns the rules of the physical world first and then adapts them to its own body and limb constraints.<\/p>\n<h2>Real-time dreaming<\/h2>\n<p>With a world model designed to have robots watch humans first, Nvidia is suggesting that the best and fastest way to scale humanoids isn\u2019t more robot data; it is probably exposure to more human experience. It\u2019s worth noting that this is not the first world model. Many have been devised before, but they have been considerably slower. NVIDIA has picked up the pace by distilling DreamDojo to run at 10.81 frames per second in real time for over a minute. DreamDojo-HV has been demonstrated across humanoid platforms like GR-1, G1, AgiBot, and YAM robots, the company says, and has shown what it calls \u201crealistic action-conditioned rollouts\u201d across diverse environments and object interactions.<\/p>\n<\/p>\n<p>From what I see, if DreamDojo works as the press information suggests, it could make life easier for startups and robotics teams that lack the resources to collect large robot-specific datasets of their own. 
But until more use cases trained on the Nvidia world model show up, I remain skeptical of how they will perform in ever-changing real-world conditions, which are never identical from one moment to the next.<\/p>\n<\/p>\n<p>The post <a href=\"https:\/\/www.yankodesign.com\/2026\/03\/02\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/\">Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video<\/a> first appeared on <a href=\"https:\/\/www.yankodesign.com\/\">Yanko Design<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>CES 2026 was crowded with humanoids doing simple household tasks such as folding laundry or loading the dishwasher. One thing I was sure of, watching this influx of robots at the world\u2019s biggest tech &hellip; <\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v16.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video - Blog TSK<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video - Blog TSK\" \/>\n<meta property=\"og:description\" content=\"CES 2026 was crowded with humanoids doing simple household tasks such as folding laundry or stacking up the dishwasher. 
One thing I was sure of seeing this influx of robots at the world\u2019s biggest tech &hellip;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/\" \/>\n<meta property=\"og:site_name\" content=\"Blog TSK\" \/>\n<meta property=\"article:published_time\" content=\"2026-03-03T01:29:24+00:00\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/cstc.vn\/blogtsk\/#website\",\"url\":\"https:\/\/cstc.vn\/blogtsk\/\",\"name\":\"Blog TSK\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/cstc.vn\/blogtsk\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/#webpage\",\"url\":\"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/\",\"name\":\"Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video - Blog 
TSK\",\"isPartOf\":{\"@id\":\"https:\/\/cstc.vn\/blogtsk\/#website\"},\"datePublished\":\"2026-03-03T01:29:24+00:00\",\"dateModified\":\"2026-03-03T01:29:24+00:00\",\"author\":{\"@id\":\"\"},\"breadcrumb\":{\"@id\":\"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/cstc.vn\/blogtsk\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video\"}]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video - Blog TSK","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/","og_locale":"en_US","og_type":"article","og_title":"Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video - Blog TSK","og_description":"CES 2026 was crowded with humanoids doing simple household tasks such as folding laundry or stacking up the dishwasher. 
One thing I was sure of seeing this influx of robots at the world\u2019s biggest tech &hellip;","og_url":"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/","og_site_name":"Blog TSK","article_published_time":"2026-03-03T01:29:24+00:00","twitter_misc":{"Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebSite","@id":"https:\/\/cstc.vn\/blogtsk\/#website","url":"https:\/\/cstc.vn\/blogtsk\/","name":"Blog TSK","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/cstc.vn\/blogtsk\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/#webpage","url":"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/","name":"Nvidia wants robots to learn before executing tasks by watching 44,000 hours of human video - Blog TSK","isPartOf":{"@id":"https:\/\/cstc.vn\/blogtsk\/#website"},"datePublished":"2026-03-03T01:29:24+00:00","dateModified":"2026-03-03T01:29:24+00:00","author":{"@id":""},"breadcrumb":{"@id":"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/cstc.vn\/blogtsk\/nvidia-wants-robots-to-learn-before-executing-tasks-by-watching-44000-hours-of-human-video\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/cstc.vn\/blogtsk\/"},{"@type":"ListItem","position":2,"name":"Nvidia 
wants robots to learn before executing tasks by watching 44,000 hours of human video"}]}]}},"_links":{"self":[{"href":"https:\/\/cstc.vn\/blogtsk\/wp-json\/wp\/v2\/posts\/17405"}],"collection":[{"href":"https:\/\/cstc.vn\/blogtsk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cstc.vn\/blogtsk\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/cstc.vn\/blogtsk\/wp-json\/wp\/v2\/comments?post=17405"}],"version-history":[{"count":0,"href":"https:\/\/cstc.vn\/blogtsk\/wp-json\/wp\/v2\/posts\/17405\/revisions"}],"wp:attachment":[{"href":"https:\/\/cstc.vn\/blogtsk\/wp-json\/wp\/v2\/media?parent=17405"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cstc.vn\/blogtsk\/wp-json\/wp\/v2\/categories?post=17405"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cstc.vn\/blogtsk\/wp-json\/wp\/v2\/tags?post=17405"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}