A month ago, the New York cognitive scientist and machine-learning researcher Gary Marcus argued in an essay that deep learning was at a dead end (in the accompanying tweet he stated: "Deep learning is hitting a wall", triggering a debate). The capabilities of large language models such as OpenAI's GPT-3 are already impressive. But the question of whether such models will gain significantly in capability through further scaling to ever larger neural networks in the hundreds-of-billions-of-parameters range, and whether the growing quantity (and quality) of training data will take them beyond pure pattern recognition in the manner of the proverbial "stochastic parrots" to something like reasoning and a deeper understanding of language and the world, divides the experts: a number of them consider strong AI (Artificial General Intelligence, AGI for short) with potentially superhuman intelligence to be quite possible, while others consider it humbug and fundamentally impossible.
PaLM: Google's transformer model with 540 billion parameters explains jokes, writes code, and masters arithmetic
The Pathways Language Model (PaLM), Google's new large language model, impressively shows that the ceiling has not yet been reached and that deep learning still has room for improvement. PaLM comprises 540 billion parameters and is a transformer model that, in addition to the natural language processing skills known from large language models (language understanding and text generation), shines with some new capabilities, especially in arithmetic and in recognizing humor. According to the team, PaLM achieves outstanding performance across a wide range of natural language processing (NLP) applications, in reasoning, and in programming (coding) tasks.
Multi-level logical inference and chains of thought
PaLM generates explicit explanations for more complex scenarios that require multi-level logical inference, suggesting deep language understanding and some "world knowledge". It can recognize jokes as such and also provides high-quality explanations for newly invented jokes that appeared neither in its training data set nor on the web. The model is said to distinguish between cause and effect and to understand concepts in context. The Google team fed PaLM "high-quality" web documents, books, Wikipedia, open-access conversations (chats) from the web, and GitHub code as training data. Also important, from the team's point of view, was that numbers were split into individual digit tokens and that Unicode characters outside the existing vocabulary were encoded as bytes, producing a "lossless" vocabulary.
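The idea behind such a lossless scheme can be illustrated with a minimal Python sketch: digits become individual tokens, and anything outside the vocabulary falls back to its raw UTF-8 bytes. The toy vocabulary and helper names here are invented for illustration and do not reflect PaLM's actual tokenizer.

```python
# Minimal sketch of a "lossless" tokenization scheme in the spirit of
# PaLM's: digits become individual tokens, and characters outside the
# vocabulary fall back to raw UTF-8 bytes. Names and vocabulary are
# hypothetical, for illustration only.

VOCAB = {"the", "answer", "is"}  # toy vocabulary (hypothetical)

def tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.split():
        if word.isdigit():
            tokens.extend(list(word))          # "1234" -> "1", "2", "3", "4"
        elif word in VOCAB:
            tokens.append(word)                # known word -> single token
        else:
            # out-of-vocabulary: fall back to UTF-8 bytes, so nothing is lost
            tokens.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
    return tokens

print(tokenize("the answer is 1234"))
# ['the', 'answer', 'is', '1', '2', '3', '4']
print(tokenize("naïve"))
# byte-level fallback tokens such as '<0x6E>', '<0x61>', ...
```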
As a breakthrough in the area of reasoning, the team emphasizes in its blog entry that PaLM responds to input prompts with output in the form of chains of thought (chain-of-thought prompting). For example, the model answers arithmetic word problems by working through the task step by step and noting intermediate results. The team tested PaLM on three arithmetic data sets and two common-sense reasoning data sets, on which it sometimes showed remarkable performance. With a 58 percent solve rate on the GSM8K benchmark, a data set of grade-school math word problems, its score approaches the roughly 60 percent problem-solving rate of 9- to 12-year-old children.
PaLM handles multi-level logical inference: chain-of-thought prompting versus traditional prompting
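The difference between the two prompting styles can be sketched as plain prompt strings; the exemplar below follows the well-known "tennis balls" example from the chain-of-thought literature, while the second question is invented for illustration.

```python
# Sketch of chain-of-thought prompting versus traditional prompting: the
# few-shot exemplar demonstrates intermediate reasoning steps, which the
# model is then expected to imitate for the new question.

standard_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: 11

Q: A baker makes 4 trays of 12 rolls and sells 30. How many are left?
A:"""

chain_of_thought_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: A baker makes 4 trays of 12 rolls and sells 30. How many are left?
A:"""

# With the chain-of-thought exemplar, the model tends to continue with
# intermediate steps ("4 * 12 = 48, 48 - 30 = 18, the answer is 18")
# instead of guessing a bare number.
```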
Strong in coding too
PaLM appears to show remarkable coding performance in converting natural language to program code (text-to-code), in translating code between programming languages, and in fixing compile errors (code-to-code). This is surprising given that only 5 percent of its pre-training data set contained code. In terms of performance, PaLM is considered on par with the specialized model Codex, which was built explicitly for coding tasks and pair programming.
PaLM translates code from C to Python
Codex comprises 12 billion parameters and had "seen" about fifty times as much Python code during training as PaLM, which is why the latter's coding performance came as quite a surprise to the Google team. When repairing defective programs in the C language, PaLM is said to have restored the programs with an 82 percent success rate, far above the roughly 72 percent achieved so far, which counts as the state of the art (DeepFix served as the comparison here).
An example from the DeepFix repertoire: on the right, the version repaired by the PaLM coder, with the compile errors resolved
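How such code-to-code tasks are typically posed to a language model can be sketched as a few-shot prompt; the prompt format, the function names, and the placeholder model_generate() call below are assumptions for illustration, not PaLM's actual interface.

```python
# Sketch of posing a code-to-code task (C -> Python translation) as a
# few-shot text prompt. model_generate() is a placeholder for whatever
# LLM API is available; PaLM itself exposes no public endpoint here.

def build_translation_prompt(c_source: str) -> str:
    example = (
        "Translate C to Python.\n\n"
        "C:\n"
        "int add(int a, int b) { return a + b; }\n\n"
        "Python:\n"
        "def add(a, b):\n"
        "    return a + b\n\n"
    )
    return example + "C:\n" + c_source + "\n\nPython:\n"

def model_generate(prompt: str) -> str:
    raise NotImplementedError("placeholder for an actual LLM call")

prompt = build_translation_prompt(
    "int square(int x) { return x * x; }"
)
# translated = model_generate(prompt)
```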
Under the hood: new TPU architecture and Pathways
PaLM was trained on a TPU-based system using Google's Pathways system. Tensor Processing Units (TPUs) are application-specific chips for accelerating machine-learning workloads; their architecture lets them process numerous data flows in parallel and thus increases processing speed. Google introduced them in 2016 for its in-house machine-learning framework TensorFlow, and the hardware is now in its fourth generation (TPU v4, from 2021). According to the team, the system owes its ability to scale to 540 billion parameters and 780 billion training tokens to the acceleration and parallelization of data flows by the TPU chips.
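What this kind of parallelization looks like from the programmer's side can be illustrated with a toy data-parallel sketch using the jax library (Google's own stack for TPU workloads); this is not PaLM's training code, merely the general pattern of splitting a batch across devices and running the same computation on each shard simultaneously.

```python
# Toy sketch of data-parallel execution across accelerator devices with
# jax.pmap, the kind of parallelism TPU pods are built for. Runs on CPU
# too (with a single device); the model here is a stand-in.
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()  # 1 on CPU, more on a TPU/GPU host

@jax.pmap
def forward(x):
    # stand-in for a model's forward pass
    return jnp.tanh(x @ x.T)

# shape: (devices, per-device batch, features)
batch = jnp.ones((n_dev, 4, 8))
out = forward(batch)          # runs on all devices simultaneously
print(out.shape)              # (n_dev, 4, 4)
```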
The Google team emphasizes a special feature: the model is also said to require significantly less energy to operate than comparably large models to date, since it is "sparsely activated". This means that completing a task does not always activate the entire network. Similar to the human brain, PaLM activates small paths (pathways) through the neural network that only spring into action when required. The model is supposed to learn dynamically which parts of its network suit which type of task and then to access only the areas recognized as relevant. According to the Google team, this speeds up processing and reduces energy consumption.
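The principle can be sketched with a toy router that scores several expert sub-networks and runs only the best match for a given input; this mixture-of-experts-style sketch is our own illustration of sparse activation, not Google's Pathways implementation.

```python
# Toy illustration of sparse activation: a router scores several expert
# sub-networks, and only the top-k of them compute for a given input.
# Conceptual sketch only; expert scales and the routing rule are invented.
import math

SCALES = (0.5, 1.0, 2.0, 4.0)

def expert(scale):
    return lambda x: [math.tanh(scale * v) for v in x]

EXPERTS = [expert(s) for s in SCALES]

def router_scores(x):
    # hypothetical routing rule: prefer the expert whose scale is
    # closest to the input's mean value
    mean = sum(x) / len(x)
    return [-abs(mean - s) for s in SCALES]

def sparse_forward(x, k=1):
    scores = router_scores(x)
    top_k = sorted(range(len(EXPERTS)), key=lambda i: scores[i])[-k:]
    # only the selected experts compute; the rest stay idle (saving energy)
    outputs = [EXPERTS[i](x) for i in top_k]
    return [sum(vals) / k for vals in zip(*outputs)]

print(sparse_forward([0.4, 0.6, 0.5], k=1))  # only one expert runs
```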
FLOPs as a benchmark for hardware in machine learning
According to the blog entry, 6144 TPU v4 chips were used for PaLM, the largest TPU-based training system at Google to date. Previous models such as GLaM, LaMDA, the Megatron-Turing model, or Gopher were in part still trained on conventional graphics processors (GPUs), or on at most around 4000 chips of the previous TPU v3 generation. According to the blog announcement, PaLM achieves "state-of-the-art" (SOTA) performance on most of the comparison metrics usual in machine learning. Its training efficiency is 57.8 percent hardware FLOPs utilization, which according to the Google team is currently the highest measured value for large language models (LLMs) of this size.
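Hardware FLOPs utilization simply relates the floating-point operations actually executed per second to the combined theoretical peak of all chips; a minimal sketch of that calculation follows, with the throughput figure invented purely to land near the reported value.

```python
# Minimal sketch of computing hardware FLOPs utilization: observed
# floating-point operations per second divided by the theoretical peak
# of all chips combined. The observed throughput is invented here so the
# example lands near the reported 57.8 percent.

def flops_utilization(observed_flops_per_s, peak_flops_per_chip, n_chips):
    peak_total = peak_flops_per_chip * n_chips
    return observed_flops_per_s / peak_total

# 6144 chips at an assumed 275 TFLOP/s peak each
util = flops_utilization(
    observed_flops_per_s=9.77e17,   # invented measurement
    peak_flops_per_chip=2.75e14,
    n_chips=6144,
)
print(f"{util:.1%}")  # ~57.8%
```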
For comparison: GPT-3 came in at 55 percent here. FLOPS stands for floating-point operations per second and is a measure of the speed of computers and processors: it denotes the number of floating-point operations that can be carried out per second and can be used, among other things, to compare the performance of different processors. In conclusion, with PaLM the Google AI team has come a good deal closer to its vision of a generalizing AI system for thousands upon thousands of different tasks. Looking ahead, several providers are currently working on large models that can handle different types of data (multimodality) and a wide variety of tasks within a single system. A multimodal revolution, it seems, is already underway.
More information about the Pathways Language Model (PaLM) and the architecture of the underlying Pathways system can be found in the blog post announcing the model and in a technical paper by the Google team on arXiv.org ("Pathways: Asynchronous Distributed Dataflow for ML").