Long Text Capability: The New "Standard Configuration" for Large Models
From 4,000 to 400,000 tokens, large models are expanding their ability to process long text at a rapid pace.
The ability to handle long text seems to have become a new "standard feature" for large model vendors.
Internationally, OpenAI has upgraded repeatedly, raising the context input length of GPT-3.5 from 4,000 to 16,000 tokens and GPT-4 from 8,000 to 32,000 tokens. Its competitor Anthropic expanded the context length to 100,000 tokens in a single step, and LongLLaMA has pushed it to 256,000 tokens or beyond.
Domestically, some large model startups have also made breakthroughs in this area. One company, for example, has released an intelligent-assistant product that accepts input of 200,000 Chinese characters, roughly 400,000 tokens. A research team has also developed LongLoRA, a technique that extends the text length of a 7B model to 100,000 tokens and of a 70B model to 32,000 tokens.
Currently, many top model technology companies and research institutions both domestically and internationally have made expanding context length a key focus for updates and upgrades.
Most of these companies and institutions are favorites of the capital markets. OpenAI, for example, has received nearly $12 billion in investment; Anthropic's latest valuation may reach $30 billion; and a domestic company founded only half a year ago has already closed multiple funding rounds, with a valuation exceeding $300 million.
Why are large model companies so focused on long text technology? What does it mean to expand the context length by 100 times?
On the surface, this means the model can take in longer input and read more: where it could once manage only a short essay, it can now digest a full-length novel.
At a deeper level, long text technology is driving the adoption of large models in specialized fields such as finance, law, and scientific research, where summarizing, understanding, and answering questions over long documents is a basic requirement and an area in urgent need of intelligent upgrades.
However, longer is not necessarily better. Research shows that a model's support for longer context input does not correlate directly with better performance; what truly matters is how effectively the model uses the contextual content.
Currently, the exploration of text length both domestically and internationally has not yet reached its limits. 400,000 tokens may just be the beginning, as major companies continue to break through this technical barrier.
Why improve long text processing capabilities?
A founder of a large model company has said that input-length limits are behind many of the difficulties in getting large model applications to land in practice. This is also why many companies are now focusing on long text technology.
For example, in virtual-character scenarios, insufficient long text capability means characters forget important information. When building script-based games, a short input limit forces developers to cut back rules and settings, hurting the game experience. And in professional fields such as law and finance, in-depth content analysis and generation are often constrained.
On the road to future AI applications, long texts still play an important role. AI agents need to rely on historical information for decision-making, and native AI applications require context to maintain a coherent and personalized user experience.
The founder believes that losslessly compressing massive amounts of data, whether text, speech, or video, can yield a high degree of intelligence. The ceiling of a large model is determined jointly by its single-step capability and its number of execution steps: single-step capability is tied to parameter count, while the number of execution steps is the context length.
At the same time, even models with a large number of parameters find it difficult to completely avoid hallucination issues. Compared to short texts, long texts can provide more context and detailed information, helping models to more accurately judge semantics, reduce ambiguity, and improve the accuracy of reasoning.
It is evident that long text technology can solve some early problems of large models and is also one of the key technologies to promote industrial applications. This indicates that general large models are entering a new stage, moving from LLM to the era of Long LLM.
Recently released products offer a glimpse of what large models can do at the Long LLM stage:
First, extraction, summarization, and analysis of key information from very long texts: quickly distilling the main ideas of an article, pulling key figures from a financial report, or running Q&A over an entire book.
On the code side, such models can generate code directly from text and even reproduce the implementation described in a paper, a clear step up from the earlier demos that generated website code from sketches.
In long dialogue scenarios, more vivid role-playing can be achieved. By inputting specific character corpora and setting tone and personality, one can have one-on-one conversations with virtual characters.
These examples indicate that chatbots are developing towards specialization, personalization, and depth, which may be another lever for driving industry applications.
A company is aiming for the next consumer-facing super application: leveraging long text technology to derive multiple applications from foundational models. The founder of the company predicts that the domestic large model market will be divided into two camps: enterprises and consumers, and super applications based on self-developed models will emerge in the consumer market.
However, there is still a lot of room for optimization in long text dialogue scenarios currently available on the market. For example, some do not support online access to the latest information, cannot pause and modify during the generation process, and may still produce incorrect information even with background material support.
Technical challenges of long texts
In the field of long text technology, there is an "impossible triangle": a three-way tension among text length, attention, and computing power.
It shows up as follows: the longer the text, the harder it is to concentrate enough attention on it; under constrained attention, short texts cannot fully convey complex information; and processing long texts demands large amounts of compute, driving up costs.
The root of this predicament lies in the fact that most models are based on the Transformer architecture. The self-attention mechanism, which is the most important part of this architecture, allows the model to flexibly analyze the relationships between information, but its computational load increases quadratically with the length of the context.
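To make the quadratic scaling concrete, here is a minimal numpy sketch of single-head scaled dot-product attention (an illustration only, not any vendor's production implementation). The (n, n) score matrix is the culprit: doubling the context length quadruples both its arithmetic and its memory.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over n tokens.

    X: (n, d) token representations. The (n, n) `scores` matrix is
    where the quadratic cost lives: doubling n quadruples both the
    FLOPs and the memory it needs.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n, n) pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                          # (n, d) context-mixed output

# Toy usage: 8 tokens with 4-dim embeddings.
rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 4); the hidden (8, 8) scores scale as n^2
```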
Some studies indicate that excessively long contexts can significantly reduce the proportion of relevant information, making attention dispersion seem unavoidable. This constitutes a contradiction between text length and attention, and is also the fundamental reason why large models struggle to handle long text.
At the same time, computing power remains a scarce resource. In real deployments, enterprise customers can rarely supply much compute, so vendors must keep a tight rein on compute consumption when scaling up parameters or text length. Yet supporting longer texts usually takes more compute, creating another tension between text length and computing power.
Industry experts note that there is, as yet, no unified solution for long-text modeling in large models; the root of the problem lies in the Transformer architecture itself, and entirely new architectures are under development.
At present, there are three main approaches:
Use external tools to assist in processing long texts. The main method is to split the long text into multiple short texts for processing, loading only the necessary short text segments each time, thereby avoiding the issue of the model being unable to read the entire long text at once.
Reconstruct the self-attention computation, for example by dividing the long text into groups and computing attention within each group rather than between every pair of tokens, cutting the computational load and improving speed (a minimal sketch of this idea follows the list).
Optimize the model itself. For example, fine-tune the existing model to allow it to extrapolate to longer sequences; or improve the context length by reducing the number of parameters.
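To make the second approach concrete, here is a minimal numpy sketch of block-local attention with fixed, non-overlapping groups. This is a simplified illustration, similar in spirit to windowed or blockwise attention schemes rather than any specific vendor's method: tokens cannot see across block boundaries, which real systems mitigate with overlapping windows or global tokens.

```python
import numpy as np

def block_local_attention(X, Wq, Wk, Wv, block_size=128):
    """Self-attention restricted to fixed, non-overlapping blocks.

    Instead of one (n, n) score matrix, this computes n / block_size
    score matrices of shape (block_size, block_size), so for a fixed
    block size the cost grows linearly in n rather than quadratically.
    """
    n, d = X.shape
    out = np.empty_like(X)
    for start in range(0, n, block_size):
        blk = X[start:start + block_size]            # (b, d) with b <= block_size
        Q, K, V = blk @ Wq, blk @ Wk, blk @ Wv
        scores = Q @ K.T / np.sqrt(d)                # (b, b) instead of (n, n)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # softmax within each block
        out[start:start + block_size] = w @ V
    return out
```

The trade-off is plain to see: global dependencies between distant tokens are sacrificed in exchange for compute that scales linearly with text length rather than with its square.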
The "impossible triangle" dilemma of long texts may currently have no solution, but it also clarifies the exploration direction for large model manufacturers: to find the optimal balance among text length, attention, and computing power, so that it can handle sufficient information while also considering the limitations of attention calculation and computing power costs.