
Putting GPT-3.5 into a phone? There is no magic behind the boom in small models.

Author: Friends of 36Kr | Published: 2024-04-26

The Scaling Law is the golden rule of the current AI field. In simple terms: more data, more parameters, and more compute yield a stronger model. It is this rule that convinces OpenAI that AGI may be achievable, so long as the data and the parameter count are large enough.

However, this rule seems to be challenged, even reversed, by a series of recently released models.

Llama 3 has been out for only a week, yet its 70B "small-parameter" model can compete with GPT-4, rumored to have 1.8T parameters, while Llama 3 8B outperforms Llama 2 70B, a model nearly ten times its size.

According to Meta's official blog, the main architectural changes in Llama 3 over Llama 2 are a tokenizer vocabulary expanded from 32K to 128K tokens, the adoption of grouped-query attention (GQA), and RoPE-based techniques to extend the context length. None of these amounts to a major architectural overhaul, yet the small model's capabilities took a leap.
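For intuition on what grouped-query attention buys, here is a minimal, illustrative NumPy sketch: several query heads share a single key/value head, which shrinks the KV cache by the ratio of query heads to KV heads. All dimensions and weights below are made up for illustration and are not Llama 3's actual configuration.

```python
import numpy as np

def grouped_query_attention(x, n_q_heads=8, n_kv_heads=2, d_head=16):
    """Minimal GQA sketch: groups of query heads attend through a shared
    key/value head, cutting KV-cache size by n_q_heads / n_kv_heads.
    Weights are random placeholders; dimensions are illustrative only."""
    seq_len, d_model = x.shape
    rng = np.random.default_rng(0)
    # Full set of query projections, but far fewer key/value projections.
    Wq = rng.standard_normal((d_model, n_q_heads * d_head)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_model, n_kv_heads * d_head)) / np.sqrt(d_model)
    Wv = rng.standard_normal((d_model, n_kv_heads * d_head)) / np.sqrt(d_model)

    q = (x @ Wq).reshape(seq_len, n_q_heads, d_head)
    k = (x @ Wk).reshape(seq_len, n_kv_heads, d_head)
    v = (x @ Wv).reshape(seq_len, n_kv_heads, d_head)

    group = n_q_heads // n_kv_heads  # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # which shared KV head this query head uses
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq_len, n_q_heads * d_head)

x = np.random.default_rng(1).standard_normal((4, 32))
y = grouped_query_attention(x)
print(y.shape)  # (4, 128)
```

With 8 query heads sharing 2 KV heads, the cached keys and values are a quarter the size of full multi-head attention, which matters most for long-context inference.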

From this perspective, the golden rule of the Scaling Law, that more parameters make a stronger model, seems to be losing its force.

Right on cue, Microsoft released the Phi-3 series in the past few days. Its mini version, at only 3.8B parameters, is claimed to be on par with GPT-3.5 and to outperform Llama 3 8B and Mistral 7B, models roughly twice its size. Microsoft even ran it directly on an iPhone's Apple A16 chip, where it occupies only 1.8 GB of memory and runs smoothly.

The Phi-3 paper describes this performance, which defies common expectations, as deviating from the standard Scaling Law.

Do small models really break the Scaling Law? To answer that, we first need to examine the methods these small models use to "break" it.

Two paths to "breaking" the Scaling Law

Although Llama 3 8B and Phi-3 3.8B both perform remarkably well, they took different paths to get there. Of the three levers of large models, architecture, data, and parameters, the parameter count is fixed to be small, an MoE design makes little sense at such a small scale, and the architectural changes available are limited, so data is the only lever left to pull.

Llama 3: The path of extravagance

The path taken by Llama 3 8B is to massively increase the training data: for a model with 8 billion (8B) parameters, Meta used 15 trillion (15T) tokens of training data, the same order of magnitude it used to train the 70B model. This still complies with the Scaling Law; the scaling just happens in the data rather than in the parameters.

If that's the case, why have so few people previously attempted to feed super-large data to small parameter models?

Because the large language model community has long followed a rule known as Chinchilla scaling, from a 2022 paper by Hoffmann et al. that set out to find the compute-optimal amount of training data for a given parameter count. Using three fitting methods, they found that training on roughly 20 tokens per parameter is most efficient (a tokens-to-parameters ratio of about 20:1). With less data than that, adding parameters brings little benefit; with far more data than 20 times the parameters, the performance gains are smaller than those from training a larger model instead. So when enough compute is available to train on more data, most teams scale the parameters up accordingly, because that yields the best performance, and the most generalization, for a given compute budget.
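As a quick worked example, the 20:1 heuristic turns a parameter count into a compute-optimal token budget. Note this is a rule of thumb; the exact Chinchilla fit varies, which is why the heuristic's ~160B figure for an 8B model and the ~200B figure Meta cites are the same order but not identical.

```python
def chinchilla_optimal_tokens(n_params: float, ratio: float = 20.0) -> float:
    """Chinchilla rule of thumb: compute-optimal training uses roughly
    `ratio` tokens per model parameter (about 20:1)."""
    return ratio * n_params

# Compute-optimal data budgets under the 20:1 heuristic.
for params in [8e9, 70e9]:
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> ~{tokens / 1e9:.0f}B tokens")
# 8B params -> ~160B tokens
# 70B params -> ~1400B tokens
```

Against these budgets, Meta's 15T tokens for the 8B model is nearly two orders of magnitude past "optimal."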

But Meta stress-tested the Chinchilla scaling law while developing Llama 3. According to Llama 3's brief technical write-up, the Chinchilla-optimal training budget for an 8B-parameter model is roughly 200B tokens, yet Meta found that performance kept improving even after training on two orders of magnitude more data. So Meta simply fed 15T tokens to both the 8B and 70B models and observed that their capabilities continued to improve log-linearly.
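"Log-linear" improvement means the benchmark score rises by a roughly constant number of points for every tenfold increase in training tokens. The sketch below fits that trend to entirely made-up scores (illustrative numbers only, not Meta's data):

```python
import numpy as np

# Hypothetical benchmark scores at increasing token counts, purely
# illustrative of the log-linear trend the Llama 3 team reported.
tokens = np.array([0.2e12, 1e12, 5e12, 15e12])   # training tokens
score = np.array([55.0, 60.1, 65.2, 68.7])       # made-up benchmark score

# Fit score = a * log10(tokens) + b
a, b = np.polyfit(np.log10(tokens), score, 1)
print(f"gain per 10x more tokens: ~{a:.1f} points")
```

Under such a trend there is no sharp plateau at the Chinchilla-optimal point, just a steady (if slowing in absolute terms) climb as data grows.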

In response, OpenAI co-founder Andrej Karpathy tweeted after Llama 3's release that as long as you keep increasing the data, the model keeps improving. He noted two reasons people don't do this. One is a misconception: the belief that a model's gains converge sharply once the Chinchilla-optimal data volume is exceeded, which Llama 3 has now disproved in practice. The other is economics: amid the current GPU shortage, it is not cost-effective to keep training a small model on so much data, because the same compute and data would produce a more powerful large model.

So only Meta, which owns 350,000 H100s and has no shortage of cards, dares to probe the Scaling Law through data expansion alone.

Phi-3: The path of craftsmanship

Although Microsoft is not short of cards either, it clearly cares more about cost-effectiveness. According to Phi-3's technical report, the mini version was trained on 3.3 trillion (3.3T) tokens, far more than the Chinchilla-optimal amount but only about a fifth of what Llama 3 8B used.

Since its first generation, the Phi series has leaned toward a different path: optimizing the data. Beyond carefully curating data, Microsoft also uses larger models to generate textbook-style material and exercise sets, specifically to strengthen the model's reasoning ability.

On the subject of data optimization: most datasets used to train large models today are obtained by crawling the web. They are very messy, and a sizable portion is spam or advertising, repetitive content that adds no informational richness. Cleaning this data can significantly improve the results of training on it.

For example, Hugging Face recently released a dataset called FineWeb, training more than 200 ablation models to carefully analyze and filter duplicate content out of Common Crawl snapshots from 2013 to 2024, yielding a 15T-token training set. Models trained on it show clearly better final performance.
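The simplest building block of such a cleaning pipeline is exact deduplication by content hash. Real pipelines like FineWeb go much further (fuzzy MinHash dedup, language identification, quality heuristics), so the sketch below shows only the baseline idea:

```python
import hashlib

def dedup_exact(docs):
    """Drop exact-duplicate documents by normalized content hash.
    A baseline sketch only: production pipelines (e.g. FineWeb) add
    fuzzy dedup and quality filtering on top of this."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

docs = ["Buy cheap pills NOW", "buy cheap pills now  ", "A real article."]
print(dedup_exact(docs))  # ['Buy cheap pills NOW', 'A real article.']
```

Even this trivial pass shrinks crawled corpora noticeably, since boilerplate and spam are repeated across millions of pages.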

Phi-3's core data-processing method and logic have not changed much from earlier generations; the work is mainly expansion and optimization, growing the dataset from 1.5T to 3.3T tokens. For the detailed processing logic, see the earlier article "Microsoft's AI Overtaking on the Bend: Big Models Can't Keep Up, I Must Take the Lead".

Of course, Phi-3's approach is more elaborate. Its data has two main components: (a) high-quality web data filtered by large language models, further screened by "educational level" to retain pages that improve the model's reasoning ability; and (b) synthetic data generated by large language models, used specifically to teach logical reasoning and skills in particular domains.

Because Phi-3 mini's capacity is relatively small and cannot absorb all the training data, training is split into two independent stages. The first stage mainly uses web sources to teach the model general knowledge and language understanding; the second mixes more strictly filtered web data with synthetic data to improve logical reasoning and domain-specific skills. The second stage overwrites some of the less important common-sense data from the first, making room for reasoning-related data.
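A two-stage curriculum like this can be expressed as a data-mixture schedule that shifts with the training step. The proportions and source names below are hypothetical, invented to illustrate the idea rather than taken from the Phi-3 report:

```python
def phase_mixture(step: int, phase1_steps: int) -> dict:
    """Hypothetical two-phase data schedule in the spirit of Phi-3's
    training: phase 1 is mostly raw web data for general knowledge;
    phase 2 shifts weight to filtered web and synthetic reasoning data.
    All proportions here are made up for illustration."""
    if step < phase1_steps:
        return {"web": 0.9, "filtered_web": 0.1, "synthetic": 0.0}
    return {"web": 0.1, "filtered_web": 0.5, "synthetic": 0.4}

print(phase_mixture(100, 1000))   # phase 1: mostly raw web
print(phase_mixture(1500, 1000))  # phase 2: filtered web + synthetic
```

The design choice is the same one the article describes: spend the small model's limited capacity on reasoning skills rather than on memorizing every fact the web contains.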

Through this meticulous processing and carving of the data, Phi-3 mini achieves reasoning ability comparable to that of GPT-3.5, a model roughly 50 times its size.

Of course, Phi-3 mini's stunning performance does not shake the Scaling Law itself. At most it shows that while brute force works, a little skill in wielding that force (the data) makes the brick fly farther.

The stronger small models get, the closer large models come to our lives.

Recent doubts about the Scaling Law go beyond the "off-script" performance of small models. After Llama 3's release, Mark Zuckerberg said in an interview that scaling has now hit an energy bottleneck: from here on, progress in large models will be incremental rather than leapfrogging, and achieving AGI by 2025 is unlikely.

Other experts, from AI pioneer Yoshua Bengio to longtime critic Gary Marcus, have argued that without fundamental innovation, the development and scaling of AI will slow down under the current, inefficient Transformer architecture.

This is also visible in the AI giants' practice. According to earlier foreign media reports, Microsoft built a training cluster of 100,000 H100s for GPT-6, but the current capacity of the US power grid cannot sustain such energy consumption: deploying more than 100,000 H100 GPUs in a single state would bring down the entire grid.

If the Scaling Law really hits an energy wall, what should the next step for the big companies be?

In fact, the development logic of internet giants is always the same: if growth cannot be sustained, quickly convert what you have into money-making use cases and stabilize the business.

Yet to this day, practical use cases for AI remain scarce. Partly this is because the technology needs time: approaches like agents, which could genuinely unlock practical applications, are still maturing. Partly it is because the high inference cost of large models keeps many projects with less obvious returns from ever getting off the ground.

But now, with the emergence of Llama 3 8B or Phi3 mini, the path to make large models practical is becoming clearer and clearer.

As netizens have pointed out, although training a high-performance small model is expensive, its inference is cheap, so it is cheaper overall; especially when serving a large user base, the per-query inference cost becomes very low.
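The economics are easy to see once the one-off training bill is amortized over the user base. Every number in this sketch is hypothetical, chosen only to show the shape of the calculation:

```python
def cost_per_user(train_cost: float, infer_cost_per_query: float,
                  users: int, queries_per_user: int) -> float:
    """Amortized cost per user: a one-time training bill spread across
    the user base, plus per-query inference. All inputs are hypothetical
    illustration values, not real pricing."""
    return train_cost / users + infer_cost_per_query * queries_per_user

# E.g. a $10M training run, $0.0002/query inference, 50M users,
# 100 queries each: training adds only $0.20 per user.
print(f"${cost_per_user(10_000_000, 0.0002, 50_000_000, 100):.4f}")
# $0.2200
```

At scale, the per-query inference cost dominates, which is exactly why a small model with cheap inference can be the cheaper system overall despite its oversized training run.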

Whether it is installed on increasingly powerful AI devices or simply providing low-cost cloud services, high-performance small models mean that AI will be easier to break free from cost constraints and be more effectively applied.

The dominance of small models actually brings large models closer to us.


Copyright © 2024 newsaboutchina.com