Introduction
Welcome to the world of robots with AI brains! In today’s digital age, where information is constantly being shared and consumed, developing and harnessing artificial intelligence is more important than ever. The advancement of machines, whether through emergent reasoning or programmed instructions, raises profound concerns about their impact on the real world. Large language models (LLMs) in particular cause unease because of their inherent unpredictability and complexity, and many in the field of robotics are wary of letting them guide robotic actions without restriction.
Robots are Evolving in Intelligence and Capabilities
Robots are now taking over kitchens in restaurants worldwide, from bustling Shanghai to vibrant New York City. They skillfully prepare a variety of dishes including burgers, dosas, pizzas, and stir-fries, employing the same methodical approach that robots have utilized for the past half-century: precise adherence to instructions, executing repetitive tasks in identical sequences.
However, Ishika Singh envisions a different kind of robot—one capable of crafting dinner autonomously. This innovative machine would venture into a kitchen, sift through cabinets and the fridge, select ingredients that harmonize into delectable dishes, and even set the table—a task so intuitive that even a child could manage it, yet currently beyond the capabilities of any robot. Singh, a Ph.D. candidate in computer science at the University of Southern California, attributes this limitation to the conventional planning pipeline used by roboticists. She explains that it meticulously defines every action, its prerequisites, and anticipated outcomes, leaving little room for unforeseen circumstances. Despite numerous iterations and extensive coding, such an approach invariably produces robots ill-equipped to handle unanticipated challenges.
Crafting a viable “policy” for a dinner-preparing robot necessitates comprehensive awareness not only of the culinary norms of the locale (“What constitutes ‘spiciness’ here?”) but also the specific kitchen environment (“Is there a hidden rice cooker?”), dietary preferences of individuals (“Hector will be ravenous after his workout”), and any unique circumstances (“Aunt Barbara’s visit necessitates gluten- and dairy-free options”). Moreover, the robot must exhibit adaptability in response to unexpected mishaps (“Oops! I spilled the butter—what’s a suitable substitute?”).
Jesse Thomason, Singh’s supervising professor at USC, views this aspiration as a monumental challenge with transformative potential, envisioning a future where robots assume a myriad of human tasks, revolutionizing industries and simplifying everyday life.
Despite the plethora of impressive robot demonstrations showcased on platforms like YouTube—ranging from warehouse operations to pet-like companions and automated healthcare assistants—none possess the versatility and adaptability inherent in human cognition. Naganand Murty, CEO of Electric Sheep, emphasizes the fragility of classical robotics, constrained by the need to impart a fixed map of the world to the robot, a world constantly in flux. As a result, contemporary robots, much like their predecessors, operate within narrowly defined parameters, executing repetitive tasks in controlled environments.
The desire to imbue robots with practical intelligence has long been the aspiration of robotics engineers. Until recently, though, computers were just as incapable of supplying that kind of everyday knowledge as the robots themselves. Then, in 2022, the advent of ChatGPT, a user-friendly interface to the large language model GPT-3, heralded a breakthrough. Endowed with vast knowledge about dinners, kitchens, and recipes, ChatGPT and similar LLMs can address virtually any query a robot might encounter while navigating the intricacies of meal preparation in a specific kitchen.
Empowering Robots with Language Models: Bridging Knowledge and Action
LLMs possess the wealth of knowledge that robots lack, encompassing everything from quantum physics to K-pop to the nuances of defrosting a salmon fillet. Conversely, robots possess physical bodies capable of interacting with their environment, bridging the gap between words and reality. It appears only natural to merge mindless robots with bodiless LLMs so that, as a paper from 2022 suggests, “the robot can act as the language model’s ‘hands and eyes,’ while the language model supplies high-level semantic knowledge about the task.”
While many of us use LLMs for leisure or academic pursuits, certain roboticists view them as a means for robots to transcend preprogrammed limitations. According to an op-ed by security technologist Bruce Schneier and data scientist Nathan Sanders, the emergence of human-like models has triggered a “race across industry and academia to find the best ways to teach LLMs how to manipulate tools.”
While some technologists anticipate a significant advancement in robot comprehension, others express skepticism due to occasional odd errors, biased language, and breaches of privacy associated with LLMs. Despite their human-like qualities, LLMs fall short of human expertise, often “hallucinating” or generating erroneous content. Concerns abound regarding their integration with robots.
The debut of ChatGPT in late 2022 marked a pivotal moment for engineers at Levatas, a West Palm Beach firm specializing in software for robots deployed in industrial settings. By connecting ChatGPT to a Boston Dynamics robot dog, the company developed a prototype that can hold verbal exchanges, responding to questions and commands in everyday language. This eliminates the need for extensive training of the human workers who interact with the robot. According to CEO Chris Nielsen, the goal is to let ordinary industrial employees communicate naturally with the robot, instructing it to perform tasks such as sitting or returning to its dock.
The LLM-enhanced robot at Levatas demonstrates an understanding of language nuances and intent. It recognizes that different phrasings—such as “back up” and “get back”—convey the same meaning. Instead of analyzing data from patrol logs manually, workers can simply inquire about anomalies during the robot’s last inspection.
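As a loose, purely illustrative sketch of that kind of intent matching, the snippet below asks an LLM to map a worker's free-form phrasing onto one of a few canonical commands. The command list and the ask_llm helper are invented for this example; they are not Levatas's actual software.

    # Illustrative only: map free-form speech to a fixed set of robot commands.
    # `ask_llm` is a hypothetical helper that sends a prompt to some LLM API.
    CANONICAL_COMMANDS = ["sit", "stand", "back up", "return to dock", "start patrol"]

    def interpret(utterance, ask_llm):
        prompt = (
            "Choose the single command from this list that best matches the request, "
            f"or answer 'none': {CANONICAL_COMMANDS}\n"
            f"Request: {utterance}\nCommand:"
        )
        answer = ask_llm(prompt).strip().lower()
        return answer if answer in CANONICAL_COMMANDS else None

    # interpret("get back", ask_llm) and interpret("back up", ask_llm)
    # should both resolve to the same canonical command, "back up".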
While Levatas’s integrated software streamlines operations in specific industrial environments, household robot dogs remain a distant prospect. The Levatas system excels in confined settings where it performs a handful of well-defined tasks; it is not built for a robot that would romp with the kids or navigate an unpredictable home.
Understanding Robotics and Machine Learning: Sensors, Software, and Large Language Models
No matter how intricate, every robot works with a limited set of sensors, such as cameras, radar, lidar, microphones, and carbon monoxide detectors, which gather information about its surroundings. These connect to a finite number of mechanical components such as arms, legs, grippers, or wheels. Mediating between the robot's perceptions and its actions is its computer, which processes sensor data and instructions from its programmer, translating everything into binary code, the 0s and 1s that represent the flow of electricity through circuits.
Utilizing its software, the robot evaluates its repertoire of available actions and selects those most aligned with its directives. It then transmits electrical signals to its mechanical parts, initiating movement, and subsequently learns from its sensors how its actions have influenced the environment, prompting further responses. This process is grounded in the physical realm, where metal, plastic, and electricity interact within the actual workspace of the robot.
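For readers who like to see the loop spelled out, here is a minimal sketch of that sense-decide-act cycle in Python; the sensors, actuators, choose_action, and goal objects are hypothetical placeholders rather than any particular robot's API.

    # Minimal sketch of a classical sense-plan-act loop (placeholder objects).
    def control_loop(sensors, actuators, choose_action, goal):
        while not goal.satisfied():
            observation = sensors.read()               # gather data about the surroundings
            action = choose_action(observation, goal)  # pick the action best aligned with the directive
            actuators.execute(action)                  # send signals to motors, grippers, or wheels
            # The next pass through the loop reads the sensors again,
            # so the robot learns how its action changed the environment.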
In contrast, machine learning operates in a more abstract space. It is carried out by a "neural net," a piece of software made up of simple computational units, or "cells," organized in layers and loosely inspired by the structure of the human brain. Each cell receives signals over hundreds of connections, and each connection carries a weight. The cell sums its weighted inputs to decide whether to stay quiet or "fire," passing its own signal on to other cells. Much as more pixels make for a more detailed image, more connections in the model make for more nuanced results. In "machine learning," learning means adjusting those weights so that the model's output comes closer to the desired outcome.
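A bare-bones sketch of that weighted-sum idea, with numbers and update rule chosen only for illustration:

    # One artificial "cell": weight its inputs, sum them, and decide whether to fire.
    def cell_output(inputs, weights, threshold=0.0):
        total = sum(x * w for x, w in zip(inputs, weights))
        return 1.0 if total > threshold else 0.0   # fires only if the weighted sum is large enough

    # "Learning" means nudging the weights so the output moves toward the desired answer.
    def nudge_weights(inputs, weights, desired, rate=0.1):
        actual = cell_output(inputs, weights)
        return [w + rate * (desired - actual) * x for w, x in zip(weights, inputs)]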
Over the last 15 years, machine learning has demonstrated remarkable prowess in specialized tasks such as protein folding or candidate selection for job interviews. However, large language models (LLMs) represent a form of machine learning not limited to specific missions—they can engage in conversations on diverse topics.
Though an LLM's response is essentially a prediction about which words go together, the program itself does not understand what it is saying; the person reading it does. Moreover, LLMs operate in plain language, so interacting with them requires no specialized training or technical expertise. Users can engage with them in many languages, although some languages remain poorly represented in LLM training data.
When presented with a prompt—a question, request, or instruction—the model converts words into numerical representations of their relationships, facilitating prediction. The resulting numerical data is converted back into text. The “largeness” of large language models refers to the vast number of adjustable parameters they possess. OpenAI’s GPT-1, unveiled in 2018, had roughly 120 million parameters, whereas GPT-4 reportedly exceeds a trillion. Wu Dao 2.0, from the Beijing Academy of Artificial Intelligence, boasts 1.75 trillion parameters.
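Here is a toy sketch of that words-to-numbers-and-back pipeline. The four-word vocabulary and the hard-coded probabilities are invented; a real model works over tens of thousands of tokens and billions of tuned parameters.

    # Toy illustration: text in, numbers through the model, text out.
    vocabulary = {"the": 0, "robot": 1, "cooks": 2, "dinner": 3}
    id_to_word = {i: w for w, i in vocabulary.items()}

    def encode(text):
        return [vocabulary[w] for w in text.lower().split()]   # words -> numbers

    def predict_next(token_ids):
        # A real LLM runs these ids through billions of tuned parameters;
        # here we simply hard-code a made-up probability distribution.
        fake_probabilities = {2: 0.1, 3: 0.8, 0: 0.1}
        return max(fake_probabilities, key=fake_probabilities.get)

    prompt_ids = encode("the robot cooks")
    print(id_to_word[predict_next(prompt_ids)])   # prints "dinner"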
LLMs excel at prediction thanks to their extensive parameter tuning and the vast amount of language in their training sets, which often serves as a surrogate for the common sense and background knowledge robots lack. As Jesse Thomason explains, this leap removes the need to spell out extensive background information, such as the details of a kitchen. Instead, the LLM draws on a wealth of recipes, supplying the knowledge a robot needs to put together, say, a potato hash.
Enhancing Robotic Capabilities through Language Models: ProgPrompt and SayCan Innovations
A robot paired with an LLM presents an asymmetrical arrangement: boundless linguistic capabilities tied to a robotic body capable of only a fraction of human functions. For instance, a robot equipped with a two-fingered gripper cannot delicately fillet a salmon. If tasked with preparing dinner, the LLM, drawing from vast linguistic data, might propose actions beyond the robot’s capabilities.
Compounding these limitations is what philosopher José A. Benardete termed “the sheer cussedness of things” in the real world. Altering the position of a curtain affects light reflection, potentially impairing a robot’s vision. Similarly, a gripper optimized for round objects might struggle with irregular shapes. To mitigate these challenges, roboticists often test software on virtual-reality robots before deployment.
Stefanie Tellex, a roboticist at Brown University, humorously remarks, “the robots have to get better to keep up” with advancements in language understanding. Recognizing this bottleneck, Thomason and Singh sought to align LLM-generated instructions with the robot’s capabilities. Singh proposed leveraging a technique used to prevent mathematical and logical errors: providing prompts with sample questions and solutions.
Singh envisioned adapting this method to constrain LLM responses within the robot’s operational range. By presenting simple, executable steps—”go to refrigerator” or “pick up salmon”—she could guide the LLM in generating Python code for the robot. This approach, named ProgPrompt, underwent testing on both physical and virtual robots.
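The snippet below is a loose sketch of what such a prompt might look like: the robot's available actions appear as Python calls, followed by a worked example, and the LLM is asked to continue in the same style. The function names and the ask_llm helper are invented for illustration, not ProgPrompt's actual code.

    # Loose sketch of a ProgPrompt-style prompt: show the LLM the robot's
    # available actions as Python calls, plus a worked example, then ask for a new plan.
    ROBOT_ACTIONS = ["go_to(object)", "pick_up(object)", "put_in(object, container)", "open(object)"]

    EXAMPLE_PLAN = """\
    # task: put the apple in the fridge
    go_to('apple')
    pick_up('apple')
    go_to('fridge')
    open('fridge')
    put_in('apple', 'fridge')
    """

    def build_prompt(task):
        return (
            "You control a robot. Use only these functions:\n"
            + "\n".join(ROBOT_ACTIONS) + "\n\n"
            + EXAMPLE_PLAN + "\n"
            + f"# task: {task}\n"
        )

    # plan_code = ask_llm(build_prompt('put the salmon in the fridge'))
    # Each line of plan_code should then be an action the robot can actually execute.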
In virtual simulations, ProgPrompt devised executable plans with a high success rate, surpassing previous systems. Real-world testing, focusing on simpler sorting tasks, yielded similarly promising results, with the robot succeeding almost consistently.
At Google, research scientists Karol Hausman, Brian Ichter, and their team have taken their own approach to translating LLM outputs into robot actions. Their SayCan system pairs Google's PaLM LLM with a predefined list of simple behaviors the robot can execute, and the LLM is instructed to compose its responses from items on that list. Following a human request in conversational English, French, or Chinese, the LLM selects the behaviors it deems most likely to fulfill the request.
During a demonstration, a researcher types, "I just worked out, can you bring me a drink and a snack to recover?" The LLM ranks "find a water bottle" above "find an apple" as the more suitable response. The robot, a one-armed, wheeled machine that looks like a cross between a crane and a floor lamp, rolls to the lab kitchen, retrieves a water bottle, and delivers it to the researcher. It then returns to the kitchen. With the water already delivered, the LLM now favors "find an apple," and the robot retrieves the apple. Because the LLM registers the workout-related language in the request, the system refrains from offering sugary drinks or junk food.
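A rough sketch of that kind of action selection, in the spirit of SayCan's published recipe of weighing the language model's judgment of usefulness against the robot's own estimate of feasibility; the scoring functions here are placeholders, not Google's code.

    # Rough sketch of SayCan-style action selection (placeholder scoring functions).
    def choose_skill(request, skills, llm_usefulness, robot_feasibility):
        best_skill, best_score = None, float("-inf")
        for skill in skills:
            usefulness = llm_usefulness(request, skill)   # how well the skill helps the request, per the LLM
            feasibility = robot_feasibility(skill)        # how likely the robot can perform it right now
            score = usefulness * feasibility
            if score > best_score:
                best_skill, best_score = skill, score
        return best_skill

    # skills might be ["find a water bottle", "find an apple", "go to the kitchen", ...];
    # once the water is delivered, "find an apple" becomes the top-scoring choice.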
Fei Xia, one of the creators of SayCan, highlights the system’s ability to comprehend and fulfill nuanced requests, such as bringing coffee when someone mentions poor sleep. However, probing deeper into LLM understanding raises the question of whether these language programs merely manipulate words mechanically or retain some conceptual model of their meaning.
In a recent experiment conducted by Anirudha Majumdar, a professor of engineering at Princeton University, and his colleagues, they tapped into an LLM’s implicit world map to address a fundamental robotics challenge: enabling a robot to handle unfamiliar tools. Their system demonstrated signs of meta-learning, the ability to apply prior learning to new situations.
The researchers leveraged GPT-3’s responses to detailed questions about various tools to train a virtual robotic arm. When faced with a crowbar, the conventionally trained robot attempted to pick it up from its curved end. In contrast, the GPT-3-enhanced robot correctly grasped the crowbar’s long end. This ability to generalize, akin to human behavior, stemmed from the robot’s exposure to various tools and their characteristics through the LLM’s responses.
Navigating Concerns and Innovations: The Role of LLMs in Robotics
Whether machines engage in emergent reasoning or simply follow predefined instructions, their capabilities raise significant concerns about their real-world impact. LLMs are inherently less reliable and less predictable than classical programming, which makes many in the field uneasy. According to Thomason, some roboticists caution against giving robots unrestricted license to act, fearing what could go wrong.
Despite acknowledging the novelty of Google’s PaLM-SayCan project, Gary Marcus, a psychologist and tech entrepreneur, voiced skepticism about LLMs, highlighting potential dangers if they misinterpret human intentions or fail to grasp the full ramifications of requests. Thomason echoes concerns about deploying LLMs, particularly in client-facing roles, citing safety issues. He rejected the idea of integrating LLMs into assistive technology for the elderly, emphasizing the need to leverage LLMs where they excel—conveying knowledgeable discourse.
In Thomason and Singh’s latest collaboration, an LLM devises plans for robot actions, but a separate program, employing traditional AI methods, translates these plans into executable instructions. This hybrid approach leverages the LLM’s simulated common sense while mitigating the risk of follies.
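One way to picture that division of labor, sketched with invented helper names rather than the authors' actual code: the LLM proposes a plan in plain steps, and a conventional layer rejects any step the robot cannot actually execute.

    # Illustrative hybrid pipeline: the LLM proposes, a classical layer disposes.
    ALLOWED_ACTIONS = {"go_to", "pick_up", "put_down", "open", "close"}

    def validate_plan(plan_steps):
        """Keep only steps whose action the robot is actually able to perform."""
        checked = []
        for step in plan_steps:
            action = step.split("(")[0].strip()
            if action not in ALLOWED_ACTIONS:
                raise ValueError(f"Plan rejected: '{step}' is not an executable action")
            checked.append(step)
        return checked

    # plan = ask_llm('List steps, one per line, to put the salmon in the fridge')
    # safe_plan = validate_plan(plan.splitlines())
    # Only a validated plan is handed to the robot's traditional planner for execution.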
Critics also underscore the biases baked into LLMs, rooted in the biased data used to train them. Joy Buolamwini's finding that facial-recognition systems perform worst on darker-skinned women illustrates how such biases surface in deployed AI. Moreover, LLMs lack comprehensive knowledge, particularly about underrepresented languages and cultures, and concerns linger that they will perpetuate stereotypes present in their training data.
As LLMs evolve, researchers face challenges in regulating their behavior. Some propose developing “multimodal” models to generate not only language but also images, sounds, and action plans. Nonetheless, current concerns primarily revolve around the misuse of LLMs’ capabilities rather than their integration into robot bodies.
Despite the complexities associated with LLMs, the immediate dangers lie more in their replication of human behaviors, both positive and negative, rather than in their integration into robots. Tellex emphasizes that LLMs encapsulate both the best and worst aspects of the internet. Thus, deploying them in robots might be among the safer uses of such models compared to their potential for generating phishing emails, spam, or fake news.