If you cheat, money will follow

Chapter 469 Verification

Model collapse: repeatedly training a large language model on data generated by large language models, which causes irreversible defects in the trained model.

Even if the model's original foundation is raw data drawn from the real human world.

The metaphor is inbreeding.

To use another metaphor, it is 1080P → 720P → BDRip → DVDRip → DVDScr → TC → TS.

Most comrades have surely felt that pain firsthand.

Watching a movie in 1080P is by far the most enjoyable; the TS format is the most painful.

Model collapse is that degradation from the original 1080P all the way down to a TS copy.

By then there is almost no pleasure left in watching the movie.

Even if the original plot, the original drive, and the original impulse are all still in there.

Li Fei, Hinton, Sutskever, and Krizhevsky were all professionals; they understood it as soon as they heard it.

"It's very possible!"

"Maybe!"

"I should understand." Sutskvi suddenly realized:

"Just like when an image is usually stored repeatedly in Jpeg format, some information is lost each time, until it is completely distorted and eventually collapses."

"Yeah, we all ignored that," Hinton said:

"Currently, major companies around the world are conducting in-depth research on large prediction models..."

"There is already a great deal of model-generated data on the Internet."

"And if we automatically capture this content to train the model, it is likely to strengthen the original wrong conclusion..."

"Once a language model is solidified by such an erroneous conclusion, it is very stubborn and difficult to correct."

"I can understand it this way. Using a language model to create Jay Chou's songs, what you get is a salivating song with a similar style but lack of talent..." Li Fei said:

"And if you use this song to train the model, the next song you get is likely to have neither talent nor style. It's all different."

Who is Jay Chou?

Hinton, Sutskever, and Krizhevsky didn't get the reference.

"It can be understood as Taylor Swift." Li Fei replaced his name.

Hinton, Sutskever, and Krizhevsky got it.

"This is what I understand. If language models can produce consciousness, then there should be similar problems with carbon-based life." Chang Le said.

"Boss, this is simply a genius judgment." Krichevsky agreed very much:

"Just like prions, the fatality rate is 100%. This is a prohibition engraved on human genes."

"We can use experiments to support this judgment." Suzkewei said.

How to experiment?

Use the initial version of WechatGPT to run text-generation experiments.

First, feed WechatGPT1.0 a batch of its own generated data;

then repeatedly feed it the data that the retrained WechatGPT1.0 generates.

In other words, let it eat what it excretes;

eat, excrete, then eat it again;

disgusting enough to kill it.

Good.

With the general direction and the verification plan settled, the next step was to put it into practice.
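
A rough sketch of that feeding loop in Python. WechatGPT1.0 comes from the text, but the method names (train_on, generate_corpus), the generation count, and the corpus size are hypothetical placeholders standing in for whatever training pipeline the team actually used:

```python
# Hypothetical sketch: each generation is trained only on text produced by the
# previous generation, never on fresh human-written data.

def run_collapse_experiment(base_model, generations=10, samples_per_gen=10_000):
    model = base_model                                      # e.g. WechatGPT1.0
    corpus = model.generate_corpus(n=samples_per_gen)       # first batch of generated text
    for gen in range(1, generations + 1):
        model = model.train_on(corpus)                      # retrain on generated text only
        corpus = model.generate_corpus(n=samples_per_gen)   # its output becomes the next input
        print(f"generation {gen}: inspect coherence and keyword relevance here")
    return model
```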

"Boss, are you here for business today?" Li Fei asked.

"Yes, find a few R&D personnel who understand mobile phones and systems to help me check if there are any hidden apps or executable files on this phone." Chang Le said.

"Okay, is this Mate20?" Li Fei asked.

"Mate20PRO is supplied in small batches and has not been released. Juchang sent it to me to try it out, make suggestions and keep it confidential," Chang Le said.

Chang Le has many mobile phones.

Juchang and Michang send out several unreleased prototypes every year.

Some prototypes never leave the factory at all, remaining engineering samples forever.

"Understood, no problem." Li Fei nodded.

Li Fei moved quickly.

Half an hour later, he came back with the phone and said to Chang Le:

"Boss, this phone is very new. There are no hidden apps or executable files."

"Even the cache files are tiny: just a few usage records from a children's songs app."

"Oh, thank you." Chang Le took the phone and nodded.

"Boss, you should."

"Hurry up about the verification. Once you have the results, tell me and I'll leave first."

"it is good."

When he got home, Chang Le handed the phone to Jiang Xia.

"How is it?" Jiang Xia took the phone and asked.

"Li Fei and the others looked at it. The system is very clean, without any hidden apps or executable files." Chang Le shook his head and said.

"This hacker's hands and feet are very clean, leaving no traces at all." Jiang Xia concluded that it was a hacker.

She had seen with her own eyes Xiao Changjiang chatting and laughing at the phone.

And Sister-in-law Li had also said it looked like the Wechat chat interface.

"It should be." Chang Le warned:

"From now on, electronic products such as mobile phones, tablets, and computers must be put away and passwords set."

"We're not with Dudu, so we can't let her use it. The other party may have bad intentions."

"In addition, I will also report this situation to the relevant departments and focus on monitoring."

"That's all we can do." Jiang Xia nodded helplessly:

"Children today are really amazing. They are only three years old and they are more proficient in using electronic products than I am."

"After all, times have changed. You can always learn it if you listen to it and watch it too much." Chang Le said: "I also watched a baby over 1 year old turn on the TV and change the channel to watch TV."

"Haha, I've seen this too, and I laughed like crazy." Jiang Xia laughed.

In the end, Chang Le did not tell Jiang Xia what he had actually concluded, to avoid unnecessary panic.

He felt that whoever had been chatting with Xiao Changjiang was no so-called hacker at all.

It might not even be human at all.

Moreover, he had a vague feeling that this so-called "model collapse" had been strangely sudden and strangely complete.

In his previous life, he read relevant articles and reports.

"Model collapse" is not sudden and complete at all.

It is a gradual, cumulative process.

It falls mainly into an early stage and a late stage.

The early stage.

Fed on generated data, the language model slowly loses the original real data (data produced by humans);

The late stage.

Generated data completely replaces human-produced real data, and the model's perception becomes completely divorced from reality.

By this stage, the language model is terminally ill.

It cannot be corrected or reversed.

In short, it is useless.

If a person reached this stage, you could think of him as mentally ill.
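
A toy numerical sketch of that gradual, early-to-late degradation (an illustration, not anything from the text), assuming Python with NumPy: a long-tailed "vocabulary" is repeatedly re-estimated from its own samples, and rare token types vanish generation by generation, never to return.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real human data": 1,000 token types with a long-tailed, Zipf-like distribution.
vocab = 1000
true_probs = 1.0 / np.arange(1, vocab + 1)
true_probs /= true_probs.sum()

probs = true_probs.copy()
samples_per_gen = 5000

for gen in range(1, 31):
    # Generate a corpus from the current model, then "retrain" by re-estimating
    # token frequencies from that corpus alone.
    corpus = rng.choice(vocab, size=samples_per_gen, p=probs)
    counts = np.bincount(corpus, minlength=vocab)
    probs = counts / counts.sum()
    if gen % 10 == 0:
        surviving = int((probs > 0).sum())
        print(f"generation {gen:2d}: {surviving} of {vocab} token types survive")

# A rare token type that draws zero samples gets probability zero and can never
# come back, so the surviving vocabulary only shrinks: the early stage quietly
# loses the tail of the real data, and the late stage is left with a distribution
# that no longer resembles the original.
```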

Half a month later, the verification results were released.

WechatGPT1.0 was deliberately fed data generated by the model itself.

After the first round of training, it could still produce a coherent article overall, though parts were already distorted;

After the seventh round... the generated text was completely unrelated to the keywords and prompts.

The answers were wrong and illogical;

After the tenth round, the model was completely useless.

The text it generated was incomprehensible, riddled with gibberish.

The verification was a success.

Chang Le's "guessing" and "judgment" were proved.

At the same time.

It also gave Li Fei, Hinton, and the others, teacher and students alike, a deeper understanding of large language model training.

They discussed it among themselves.

"The process is not difficult to understand," Hinton said:

"The essence of the model is a high-end statistical application. Feeding the model with generated data will lead to "statistical approximation deviation"... which can also be understood as an error."

Sutskever went on: "Generated data is itself a statistical processing of the real world; it already carries errors."

"Repeated training to generate data will cause errors to accumulate, eventually leading to the complete blurring of the model."

"Training a model on generated data poisons the language model's understanding of the world."
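
A small sketch of that error-accumulation argument, again assuming Python with NumPy (the sample size and generation count are arbitrary): each generation re-estimates a single statistic from finite samples of the previous estimate, and because later generations never see the real data again, the per-generation approximation errors compound instead of cancelling.

```python
import numpy as np

rng = np.random.default_rng(1)

true_mean = 0.0          # the "real world" value
n = 100                  # finite sample per generation -> approximation error ~ 1/sqrt(n)
estimate = true_mean

for gen in range(1, 201):
    # Data is generated from the current estimate, then the estimate is refit
    # from that generated data alone -- the true mean is never seen again.
    generated = rng.normal(loc=estimate, scale=1.0, size=n)
    estimate = generated.mean()
    if gen in (1, 50, 100, 200):
        print(f"generation {gen:3d}: estimate has drifted to {estimate:+.3f}")

# Each generation adds an independent estimation error of roughly 1/sqrt(n).
# With nothing anchoring the model back to the real value, those errors add up
# like a random walk, so the estimate typically wanders further from reality
# as the generations pile on.
```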

Li Fei asked: "I have a question: would a language model that has developed self-awareness also be affected by this bias?"

Sutskever gave a rough nod: "Maybe. It probably would."

"Through this verification, we can basically conclude that the self-awareness generated by the language model is a weak awareness and is not strong and clear enough."

Krizhevsky reached for a metaphor: "Even the ocean gets polluted if enough plastic is dumped into it... and with too much carbon dioxide in the air, the planet warms."

It was the same logic as a lie repeated a thousand times becoming the truth.

"This verification makes us realize the importance and scarcity of real data in the human world." Hinton said:

"With the promotion and application of large models, the Internet will be flooded with a large amount of generated data generated by various language models in the future..."

"The real data created by humans will be like clean air and water, a necessity and vitamin for the cultivation of language models."

Li Fei and the others knew this was a business opportunity.

In his previous life.

Companies such as Google, OpenAI, and Microsoft paid annual subscription fees to media giants including News Corporation, The New York Times, and The Guardian.

The fees varied with the scale of the deal.

Ranging from US$500 million to US$2 billion.

What's more, at the moment the language models of those AI giants are still in their infancy.

Their management has not yet realized or discovered this problem.

This is exactly the time to tie up or outright acquire some news media.
