Drawn in by the promise of an inside look at Tesla’s autonomous driving software, I watched Tesla’s AI day live, popcorn in hand. Prior to the event, rumours were circulating that there would be an announcement about a foray into general-purpose robotics. Like many who follow the company, I imagined they would extend their vision system to control high-degree-of-freedom robots to improve manufacturing in their factories. But you already know how that story ends…
After picking my jaw up off the floor, I listened to the pitch: generalized human-like robots that will perform menial, repetitive, and dangerous tasks. If you asked anyone off the street, they would consider this decades away.
Yet, when I thought back to what was presented about their autonomous driving stack, I realized how wrong that timeline was. Tesla is more than prepared to create these robots, and AI day was the final piece of the puzzle. Let me explain:
The largest robot manufacturer in the world
If there’s one thing we know about Tesla, it’s that they are the world’s leading maker of electric cars. However, AI day showed us that Tesla thinks of their cars not as vehicles, but as robots. This is the key insight behind the seemingly disjointed product roadmap introduced by the Tesla Bot. And it’s not intuitive.
For decades cars have been incredibly manual and relatively ‘dumb’ compared to other technology platforms. Only in the last decade or so have we begun to outfit cars with sensors, electronics, and actuators that can both monitor and control the vehicle. The original intention here was to introduce minor, software-controlled safety features, but a critical threshold has been crossed. Beneath our very noses, cars became robots, capable of being controlled entirely by software.
And no car is more robot than a Tesla.
It’s surreal to say, but this announcement makes Tesla’s entire autonomous driving effort look like a pilot project for an even larger use case: using AI to control robots. In other words, the goal post has moved back. Tesla’s new terminal point is to create a general-purpose AI stack to operate arbitrary autonomous robots.
To achieve this, they’ve revealed a full-stack strategy: a novel model architecture, in-house data labelling, and a training/deployment system that pushes consistently improving releases to the fleet of Full Self-Driving cars. This process is designed to improve upon itself and, over time, feed an ever-stronger flywheel.
Tesla’s AI Flywheel
By flywheel, we mean that this AI stack is designed for an endless process of improvement. This is the famous ‘march of 9s’. It’s not enough to be 99% effective at autonomous driving - the remaining 1% of unsolved scenarios could be disastrous. To truly solve driving autonomy, we must strive to add as many 9s as possible: 99.99999999…% (you get the point). What Tesla really unveiled at AI day are the human processes that work in concert with their technology stack to achieve greater and greater levels of performance in autonomous robotics.
The Inner Loop: Model Training and Planning
The core functionality of their stack is as follows:
Create a high fidelity map of the environment: This involves synthesizing inputs from multiple cameras to create a model of the world in computerized space (i.e. vector space). This is closely analogous to how humans use their eyes to create a holistic interpretation of physical space. You’re reading this with a view synthesized from both your eyes.
Extract key features from the world: Humans have a mental model of the world, but we don’t pay attention to EVERY detail of it. This would be wildly distracting and inefficient. Instead, we have a smaller, more selective ‘spotlight’ of focus, as well as the capacity to be broadly aware of visual disturbances caused by motion, and changes in light or color. Here, motion, light, and color are the important ‘features’ we need for our awareness and proper functioning in the world. An AI model will also need to learn its own set of features.
Learn narrowly defined tasks from a common set of features: Think of the model now as breaking off into different ‘heads’. Each head is specialized for a narrowly defined task and can be trained specifically to improve on it. This is immensely useful because adding more functionality only requires adding additional ‘heads’. We can see how this comes together in the picture below. Notice how the features split off into different networks that can now have different architectures, each specializing for a unique purpose.
Feed the output of all networks into planning software: This is the decision making layer that interfaces between the model’s output and control of the vehicle. The planner acts as a helpful quality-assurance layer: its outputs (the car’s planned path, or ‘decisions’) can be examined for the level of safety, comfort, and efficiency they provide.
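To make the shared-backbone-plus-heads idea concrete, here is a minimal toy sketch of the pattern. This is my own illustration, not Tesla’s actual architecture: the layer sizes, task names, and the use of a single linear layer per component are all assumptions chosen for brevity.

```python
import numpy as np

class SharedBackbone:
    """Toy stand-in for the vision backbone: maps a raw camera input
    to a common feature vector (the shared 'features')."""
    def __init__(self, in_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, feat_dim)) * 0.1

    def __call__(self, x):
        return np.maximum(x @ self.W, 0.0)  # ReLU features

class TaskHead:
    """One specialized 'head' for a narrowly defined task
    (e.g. lane detection, traffic lights, potholes)."""
    def __init__(self, feat_dim, out_dim, seed=1):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((feat_dim, out_dim)) * 0.1

    def __call__(self, feats):
        return feats @ self.W

# One backbone, many heads. Adding a capability = adding a head;
# the backbone and the other heads are untouched.
backbone = SharedBackbone(in_dim=64, feat_dim=32)
heads = {
    "lanes": TaskHead(32, 4, seed=1),
    "traffic_lights": TaskHead(32, 3, seed=2),
}
heads["potholes"] = TaskHead(32, 2, seed=3)  # new functionality, new head

frame = np.random.default_rng(42).standard_normal(64)
feats = backbone(frame)  # expensive feature extraction runs once
outputs = {name: head(feats) for name, head in heads.items()}
```

The point is the shape of the design: features are computed once and every head reads from them, so each narrow task can be trained and improved independently.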
The Outer Loop: Data Collecting and Labelling
While impressive, it’s not enough to have a competent model architecture. This would be about as useful as a human brain without any connecting neurons (i.e. not at all useful). Huge amounts of data are needed to train the model; to wire the neurons in a meaningful way. But it’s not just a game of quantity, Tesla has to select for the highest quality data sources to train on. So how should Tesla choose from its massive data set of real world driving scenarios?
The task of driving can be broken into the (effectively) infinite set of situations that cars find themselves in. Some of the more common scenarios include bumper to bumper traffic, stopping at stop signs, and navigating city streets. Other scenarios are far less common, such as debris falling off the back of a truck or an animal jumping across the road. When we say we want to solve autonomous driving, we’re really saying that we want to train a model that can behave safely in each of these scenarios while still accomplishing the overall goal (get from point A to point B).
Tesla has invested heavily in their data strategy by bringing 1000 full-time data labellers in house, developing custom tools to improve the accuracy and efficiency of their labels, and using highly realistic simulations to supplement the data coming in from their fleet.
So now let’s get to the flywheel.
Imagine Tesla identifies that the #1 issue their autonomous software is experiencing is that it fails to avoid potholes, leading to jarring bumps and sub-optimal path planning. Their action plan might look as follows:
Query the fleet to find all occurrences of cars driving through potholes. Collect this into a new dataset.
Use available auto-label software and the army of data labellers to create ground truth labels in vector space.
Create a new network ‘head’ that specializes in identifying potholes of all types.
Train this head to suggest that the area of vector space taken up by a pothole is sub-optimal to drive on.
Evaluate the performance of this network in the planner software, to ensure that avoiding potholes doesn’t sacrifice safety, comfort, or efficiency. There is almost certainly a benchmark set of scenarios that act as ‘driving unit tests’ to ensure that increasing performance on pothole avoidance didn’t degrade some other area.
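The steps above can be sketched as one pass of the outer loop. Everything below is hypothetical scaffolding - the function names, the stub behaviour, and the lambda ‘benchmarks’ are stand-ins I invented to show the shape of the process, not Tesla’s tooling.

```python
def query_fleet(scenario):
    """Pretend fleet query: return raw clips matching a scenario."""
    return [f"{scenario}_clip_{i}" for i in range(3)]

def auto_label(clips):
    """Pretend auto-labelling: attach a ground-truth label to each clip."""
    return [(clip, "pothole_region") for clip in clips]

def train_head(dataset):
    """Pretend training: here a 'head' is just a record of what it saw."""
    return {"task": "pothole_avoidance", "examples": len(dataset)}

def passes_benchmarks(head, benchmarks):
    """The 'driving unit tests': the new head must not regress other skills."""
    return all(bench(head) for bench in benchmarks)

# Toy benchmark suite standing in for the regression scenarios.
benchmarks = [
    lambda head: head["examples"] > 0,             # it learned from real data
    lambda head: head["task"] != "lane_keeping",   # it didn't replace another skill
]

dataset = auto_label(query_fleet("pothole"))   # steps 1-2: collect and label
head = train_head(dataset)                     # steps 3-4: train the new head
deployed = passes_benchmarks(head, benchmarks) # step 5: gate on the benchmarks
```

Each pass of this loop retires one scenario class and tightens the benchmark suite for the next one - that accumulation is the flywheel.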
For a harder, less frequent scenario, like debris falling off the back of a truck, Tesla might use simulation to generate their entire dataset.
Once Tesla considers a certain class of scenario to be solved, they move on to the next most impactful scenario, and so on, and so on… and the march of 9s continues.
This all comes together to create a flywheel AI stack that Tesla can improve. There is the inner loop of model training and planning, wrapped within an outer loop that works to curate/generate a dataset of highly useful driving scenarios to train the model on.
But what about the human robots?
Great question. We’ve drifted far from the original question so let’s use this moment to take stock. Excluding their AI capabilities, we know Tesla has core competencies in the following areas:
Development of electric motors, batteries, heating/cooling systems, and new materials
Design of complex, software-integrated mechanical products
Supply chain management and design of the manufacturing process
The first two are robotics table stakes, and Tesla has a world class team on each of these fronts. Perhaps most important to making this vision feasible is Tesla’s manufacturing expertise. For costs to come down, the Tesla Bot will need to be made at scale.
We’ve also established, through the lens of autonomous driving, that Tesla has a system for creating an AI model that can understand the world and be trained to perform on arbitrarily many tasks. Read that again. Arbitrarily many tasks.
This is not AGI. Rather, this is a robot capable of learning a variety of narrow use cases. Think of this more like Alexa skills - software with a set of defined competencies and use cases. Tesla’s job will be to identify and train for the fundamental tasks and capabilities that will combine together to make the robot useful in a wide variety of situations.
For a self-driving car, these fundamental tasks might be: change lanes, park, slow down, speed up, turn left and right. Such skills are then orchestrated by the planning layer, which exercises each one to accomplish a task: turning left at an intersection may involve changing lanes, slowing down, turning on your signal, etc.
For a humanoid robot, we can imagine a different set of fundamentals, such as: walk from point A to B, pick up an object, avoid obstacles. Now imagine you want this robot to pick up your groceries for you. The planning routine will take the more abstract goal and apply the right fundamentals at the right time. When you view the world as a series of scenarios that can be solved with a simple set of fundamental actions, it takes the mystique out of autonomous robots.
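Here is a small sketch of that idea: an abstract goal decomposed by a planner into a sequence of fundamental skills. The skills, the goal, and the plan are hypothetical examples mirroring the groceries scenario, not anything Tesla has shown.

```python
# A small library of 'fundamentals' - each skill is a narrow, trainable action.
SKILLS = {
    "walk_to": lambda target: f"walked to {target}",
    "pick_up": lambda obj: f"picked up {obj}",
    "avoid_obstacles": lambda _: "path adjusted",
}

def plan(goal):
    """Toy planner: map an abstract goal to (skill, argument) steps."""
    if goal == "fetch groceries":
        return [
            ("walk_to", "car"),
            ("pick_up", "grocery bag"),
            ("avoid_obstacles", None),
            ("walk_to", "kitchen"),
        ]
    raise ValueError(f"no plan for goal: {goal}")

def execute(goal):
    """Run the planned sequence of fundamentals, returning an action log."""
    return [SKILLS[skill](arg) for skill, arg in plan(goal)]

log = execute("fetch groceries")
```

The division of labour is the interesting part: the skills stay simple and individually trainable, while all the apparent intelligence lives in the planner’s sequencing.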
The Endless Video Game
In a 2D platformer like Super Mario Bros, we have a fairly complex world that can be interacted with via only five inputs: move left, right, up (jump), down, and some generic action (throw shell/fireball).
Take a moment to think of how astonishing it is that you can navigate every platform, Goomba, and boss level in Super Mario Bros with only FIVE INPUTS. That’s because your brain is a pretty good planning layer for those five actions. When you lose to a challenge you’ve never encountered, you come back even better. That’s because your brain also has a really great data collection and training layer - you’ve collected (and synthesized) more data on how to handle that scenario the next time!
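The whole action space fits in a few lines. The enum below is just my shorthand for the article’s point, and the sample ‘plan’ is an invented sequence, but it shows how any amount of Mario mastery is ultimately a stream drawn from five symbols.

```python
from enum import Enum

class MarioInput(Enum):
    """The complete action space of Super Mario Bros."""
    LEFT = "left"
    RIGHT = "right"
    JUMP = "jump"        # up
    DOWN = "down"
    ACTION = "action"    # throw shell / fireball

# A 'plan' is just a sequence drawn from these five inputs -- the
# planning layer (your brain) decides the order and the timing.
clear_first_goomba = [MarioInput.RIGHT, MarioInput.JUMP, MarioInput.RIGHT]
```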
Now, nothing is more complicated than the real world, but we’ve shown that we can view the world as a series of scenarios that can be navigated with well-planned simple actions. Tesla’s bet is that they can learn to navigate a meaningful enough set of scenarios that the Tesla Bot can unlock significant economic value.
What exactly will these robots do?
The answer to that question will rely on the many tradeoffs that will be made to establish a prototype. For instance, an early prototype might only have enough degrees of freedom and load capacity in its limbs to move lightweight objects. But that’s okay. These won’t be robots that can do every task. They will be robots that can do many useful tasks. Even a robot that can reliably lift and place 15 pound objects would have substantial economic value. They could fulfill last mile package delivery, organize warehouses, or relocate materials on a construction site.
Why does it have to be human?
This section was a recent addition after I saw a video explainer of the Tesla Bot by Marques Brownlee (aka MKBHD). In it, he asked the question we are all thinking: “Why does it have to be a humanoid robot?”. It seems excessive and dramatically more difficult to construct. Marques made a sound argument: you wouldn’t have a humanoid robot vacuuming your floor, you would just buy a Roomba.
I think the humanoid shape reflects the loftiness of Tesla’s goals. Reducing a robot to a single use case would certainly change its ideal form factor. But this would result in a world where 1000 companies make 1000 different robots to solve 1000 of our most pressing tasks. In a world where one company makes one robot to automate as many tasks as possible, this robot would have to be humanoid. Our world is ergonomically adapted to being navigated and interacted with by a human. The human form actually IS the ideal form for generalized productivity in the world today.
The Long Game
Most importantly, this shows us that Tesla has no plans to slow down. Elon Musk knows how to use ambitious goals to motivate his teams. After all, this announcement was so shocking it completely overshadowed Tesla’s attempt at making a Westworld-scale supercomputer in project Dojo.
The scope of challenges like these contribute to a sense of existential urgency, both in and out of the company. Many would say this has driven much of Tesla’s success so far.
It’s not enough to make electric vehicles, but to make them at a pace fast enough that we have a chance of preventing the worst effects of climate change. It’s not enough to make reusable rockets, but to commit to timelines so ambitious that watching your life’s labour explode is a joy compared to the thought of never making our species multi-planetary.
No matter how this plays out, we should see this as an attempt to kickstart our imaginations. Contemplating a future filled with humanoid robots is the greatest dose of sci-fi many of us have received in a long while.