To find training data for large models, some have turned to "internet archaeology"

Author: San Yi Life | Published: 2024-05-06

Over the past year, large AI models have undoubtedly been the technology industry's most prominent story. From FAAMG to BAT and on to a multitude of startups, countless talented people and vast resources have been poured into a race that is expected to liberate human productivity. In the drive to build more powerful models, tech giants and AI unicorns have gone through round after round of fierce competition. The contest over algorithms, data, and computing power has reached a fever pitch, and data has become the top priority. After all, without data as fuel, stronger models cannot be trained, and existing models may "stall."

To collect more data to feed large models, many AI companies have settled on a simple solution: "buy, buy, buy." Google, for example, reportedly pays about $60 million a year for data from Reddit, and OpenAI pays for content from publications owned by Axel Springer. Watching deep-pocketed AI companies wave cash around, more and more people are beginning to realize the value of training data.

Recently, the photo-sharing community EyeEm abruptly changed its terms of service, announcing that photos on the platform will be used by default to train large AI models.

According to reports, EyeEm notified users by email that it had added a new clause to its terms and conditions, granting it the right to "copy, distribute, publicly display, transform, adapt, create derivative works, communicate to the public and/or promote" user content, including for training, developing, and improving software, algorithms, and machine learning models. Users have 30 days to opt out; otherwise they are deemed to have consented to this use. Going forward, it may also take up to 180 days for content that users delete to be removed from EyeEm and its partner platforms.

The move caused an immediate stir. EyeEm has all but openly declared that it covets its users' photos. At a time when users generally care about personal privacy, abruptly changing the user agreement to signal that already-collected user data will become AI training material amounts to a direct affront to users.

So why would EyeEm make a move that all but severs its ties with users? Simply put, it has no choice.

Founded in 2010, EyeEm was once seen as a European competitor to Instagram, the globally renowned photo-sharing social platform. At its peak, EyeEm had more than 20 million active, talented visual creators. Unlike Instagram, EyeEm was especially popular among photographers because of its commercially oriented Mission feature, which let brands crowdsource photos from the EyeEm community and helped photographers on the platform earn money.

Unfortunately, although EyeEm successfully combined commercialization with community building, it was eventually overshadowed by Instagram, which swept the world on the strength of Meta's social network after Meta acquired it. EyeEm began to decline after 2018, and in 2021 it was acquired by the Swiss social website Talenthouse for $40 million. Even under Talenthouse, EyeEm failed to recover, because ordinary users did not need two photo-sharing communities.

By mid-2022, EyeEm could no longer pay photographers on time. In April 2023, it officially filed for bankruptcy protection. In October of the same year, the company, by then down to three employees, was acquired by Freepik, a Spanish online graphic design resource website.

Clearly, by the time it filed for bankruptcy protection, EyeEm had become a mere shell, and its user base had shrunk to 150,000. For an internet company that is driven by neither technology nor product, a fall from 20 million users to 150,000 means that EyeEm can no longer persuade users to keep using its product.

Freepik acquired EyeEm for its library of 160 million images. In effect, Freepik has become a data broker, buying the moribund EyeEm in order to sell training data to AI model developers. In a sense, Freepik showed real insight in spotting the residual value of a doomed internet company like EyeEm.

Since the turn of the century, countless teams have tried to build businesses in the internet industry, but successes like Meta, X, and Reddit are few and far between, and many startups ended up as "cannon fodder." Many of these failures were once prominent before fading away for various reasons. Before the AI model boom, companies like EyeEm were essentially worthless: their business models had failed and their competitors had won.

Today's red-hot large AI models, however, need massive amounts of data for training. It has become an industry consensus that, all else being equal, the more data a model is fed during pre-training, the better it performs. High-quality data, though, is always scarce. According to a forecast by the AI research institute Epoch, language data may be exhausted between 2030 and 2040, and the high-quality language data needed to train better-performing models may run out as early as 2026. Under these circumstances, EyeEm and similar companies that have accumulated data suddenly become valuable.

With Freepik leading the way, more and more companies may start mining failed internet startups for usable data, turning internet archaeology from a hobby of some netizens into something that could well become a real business.


Copyright © 2024 newsaboutchina.com