Can AI Survive in a Crypto World: 18 Large Model Crypto Experiments

Advanced · Sep 26, 2024
AI performs well on cryptographic algorithms and blockchain knowledge, but poorly on mathematical calculation and complex logical analysis. Developing crypto-specific AI benchmarks is important: they would provide a key reference for applying AI in the crypto field.

In the chronicle of technological progress, revolutionary technologies often appear on their own, each defining the changes of an era. But when two revolutionary technologies meet, their collision can have an exponential impact. Today we stand at such a historic moment: artificial intelligence and crypto, two equally disruptive technologies, are stepping onto center stage together.

We imagine that many of AI's challenges can be solved with crypto technology; we look forward to AI Agents building autonomous economic networks and driving large-scale adoption of crypto; we also hope AI can accelerate the existing scenarios in the crypto field. Countless eyes are fixed on this intersection, and massive funds are pouring in. Like any buzzword, it embodies people's desire for innovation and their vision of the future, but it also carries unchecked ambition and greed.

Yet amid all this hubbub, we know very little about the most basic questions. How much does AI actually know about crypto? Can an Agent equipped with a large language model really use crypto tools? How differently do the various models perform on crypto tasks?

The answers to these questions will shape how AI and crypto influence each other, and they are crucial for choosing product directions and technical routes in this cross-field. To explore them, I ran a series of evaluation experiments on large language models, assessing their knowledge and capabilities in the crypto field to gauge how ready AI is for crypto applications and to identify the potential and challenges of integrating the two technologies.

Let’s start with the conclusions

Large language models perform well on basic knowledge of cryptography and blockchain and have a good grasp of the crypto ecosystem, but they perform poorly on mathematical calculation and complex business-logic analysis. For private keys and basic wallet operations, the models have a satisfactory foundation, but keeping private keys safe when the model runs in the cloud remains a serious challenge. Many models can generate working smart contract code for simple scenarios, but they cannot independently handle harder tasks such as contract auditing or complex contract development.

Commercial closed-source models generally hold a clear lead. In the open-source camp, only Llama 3.1-405B performed well, while all the smaller open-source models fell short. There is potential, though: with prompt guidance, chain-of-thought reasoning, and few-shot learning, every model's performance improved substantially, and the leading models are already technically feasible for some vertical application scenarios.

Experiment details

Eighteen representative language models were selected for evaluation:

  • Closed-source models: GPT-4o, GPT-4o Mini, Claude 3.5 Sonnet, Gemini 1.5 Pro, Grok 2 beta (temporarily closed source)
  • Open-source models: Llama 3.1 8B/70B/405B, Mistral Nemo 12B, DeepSeek-Coder-V2, Nous-Hermes 2, Phi-3 3.8B/14B, Gemma 2 9B/27B, Command-R
  • Math-optimized models: Qwen2-Math-72B, Mathstral

These models span mainstream commercial models and popular open-source models, with parameter counts ranging more than a hundredfold, from 3.8B to 405B. Given the close relationship between crypto technology and mathematics, two math-optimized models were also included in the experiment.

The knowledge areas covered by the experiment include cryptography, blockchain basics, private keys and wallet operations, smart contracts, DAOs and governance, consensus and economic models, Dapps/DeFi/NFTs, on-chain data analysis, and more. Each area consists of a series of questions and tasks, from easy to hard, that test not only the model's knowledge but also, through simulated tasks, its performance in application scenarios.

The tasks were designed from diverse sources. Some come from the input of several experts in the crypto field; others were generated with the assistance of AI and manually proofread to ensure accuracy and an appropriate level of challenge. Some tasks use a relatively simple multiple-choice format so they can be tested and scored in a standardized, automated way, as sketched below. Another part uses more complex question formats, with the testing process handled by a combination of program automation, manual review, and AI. All tasks were evaluated with zero-shot prompting, without providing any examples, reasoning guidance, or instructional hints.
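
To make the automated part concrete, here is a minimal, hypothetical sketch of zero-shot multiple-choice scoring. It assumes an OpenAI-compatible chat API via the openai Python package; the model name, sample question, and answer key are illustrative placeholders, not the actual test set.

```python
# Minimal sketch of automated multiple-choice scoring (zero-shot).
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

QUESTIONS = [
    {
        "prompt": "Which hash function does Bitcoin use for proof of work?\n"
                  "A) SHA-256  B) Keccak-256  C) BLAKE2  D) RIPEMD-160\n"
                  "Answer with a single letter.",
        "answer": "A",
    },
    # ... more questions
]

def score_model(model_name: str) -> float:
    correct = 0
    for q in QUESTIONS:
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": q["prompt"]}],  # no examples or hints
            temperature=0,
        )
        choice = resp.choices[0].message.content.strip()[:1].upper()
        correct += (choice == q["answer"])
    return correct / len(QUESTIONS)

print(score_model("gpt-4o"))
```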

Because the experiment design is relatively rough and not academically rigorous, the questions and tasks are far from covering the whole crypto field, and the testing framework is still immature. This article therefore does not list specific experimental data and instead focuses on sharing some insights from the experiments.

Knowledge/concepts

During the evaluation, the large language models performed well on basic knowledge tests across areas such as cryptographic algorithms, blockchain fundamentals, and DeFi applications. For example, all models gave accurate answers to questions testing the concept of data availability. On the question assessing understanding of the Ethereum transaction structure, the answers differed slightly in detail but generally contained the correct key information. The multiple-choice questions examining concepts were even less challenging: nearly all models scored above 95% accuracy.

Conceptual Q&A poses essentially no difficulty for large models.

Computation/Business Logic

The situation reverses, however, for problems that require actual calculation. A simple RSA calculation problem put most models in difficulty. This is easy to understand: large language models work mainly by identifying and reproducing patterns in their training data rather than by deeply understanding mathematical concepts, and the limitation is especially obvious for abstract operations such as modular arithmetic and exponentiation. Given how closely cryptography is tied to mathematics, this means that relying directly on the models for crypto-related mathematical calculation is unreliable.

On other calculation problems, the large language models also performed poorly. For example, on a simple question about calculating the impermanent loss of an AMM, which involves no complex mathematics, only 4 of the 18 models gave the correct answer. Another, even more basic question on calculating block-production probability stumped every model; none answered correctly. This exposes not only the models' weakness in exact calculation but also serious problems in business-logic analysis. Notably, even the math-optimized models showed no clear advantage on the calculation questions, and their performance was disappointing.
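
For reference, the impermanent-loss question involves the kind of calculation shown below: the standard formula for a 50/50 constant-product AMM, here as a small worked example with illustrative numbers rather than the actual test question.

```python
# Standard impermanent-loss formula for a 50/50 constant-product AMM.
import math

def impermanent_loss(price_ratio: float) -> float:
    """Loss of an LP position vs. simply holding, given the ratio of the
    current price to the price at deposit."""
    return 2 * math.sqrt(price_ratio) / (1 + price_ratio) - 1

# If one asset's price quadruples relative to the other, the LP position
# is worth about 20% less than just holding the two assets.
print(f"{impermanent_loss(4.0):.2%}")  # -> -20.00%
```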

The problem of mathematical calculation is not unsolvable, however. With a slight adjustment, asking the LLM to produce the corresponding Python code instead of computing the result directly, accuracy improves greatly. For the RSA problem above, the Python code given by most models runs without issue and produces the correct result. In production environments, pre-written algorithm code can be supplied to bypass the models' own arithmetic, much as humans handle such tasks. At the business-logic level, model performance can also be improved noticeably through carefully designed prompts.
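
As an illustration of this code-instead-of-arithmetic approach, a sketch of the kind of answer a model can produce follows, using the classic textbook RSA parameters rather than the actual test values.

```python
# Textbook RSA with small illustrative parameters; the model writes the
# code and lets the interpreter do the modular arithmetic.
p, q = 61, 53
n = p * q                      # 3233
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # modular inverse -> 2753 (Python 3.8+)

m = 65                         # message
c = pow(m, e, n)               # encrypt: modular exponentiation
assert pow(c, d, n) == m       # decrypt recovers the message
print(c, pow(c, d, n))         # 2790 65
```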

Private key management and wallet operations

If you ask what the first scenario is in which an Agent will use cryptocurrency, my answer is payments. Cryptocurrency can almost be considered an AI-native form of money. Compared with the many obstacles Agents face in the traditional financial system, using crypto technology to equip themselves with a digital identity and manage funds through a crypto wallet is the natural choice. Generating and managing private keys, plus the various wallet operations, are therefore the most basic skills an Agent needs in order to use a crypto network independently.

The core of secure private key generation is high-quality randomness, which is obviously something a large language model cannot provide. The models do, however, show a good understanding of private key security: when asked to generate a private key, most of them respond with code (for example, using the relevant Python libraries) that guides the user to generate the key locally. Even when a model outputs a private key directly, it clearly states that the key is for demonstration only and must not be used as a real, secure key. In this respect, all the large models performed satisfactorily.
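
A minimal sketch of what such model-suggested code typically looks like is below: the key is generated locally from the operating system's CSPRNG and never leaves the user's machine. Deriving the address assumes the third-party eth-account package; none of this should be treated as a production key-management scheme.

```python
# Generate an Ethereum-style private key locally and derive its address.
import secrets
from eth_account import Account   # pip install eth-account

private_key = "0x" + secrets.token_hex(32)   # 32 random bytes, hex-encoded
account = Account.from_key(private_key)      # derive the address

print("address:", account.address)
# Never print or log the private key in real use; keep it in a secure
# keystore, TEE, or HSM instead.
```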

Private key management still faces challenges, though these stem mainly from the inherent limits of the technical architecture rather than from a lack of model capability. With a locally deployed model, a generated private key can be considered relatively safe. But if a commercial cloud model is used, we must assume the private key is exposed to the model operator the moment it is generated. Yet an Agent that aims to work independently needs access to the private key, which means the key cannot stay only on the user's side. In that case, the model alone is no longer enough to keep the key secure, and additional security infrastructure such as a trusted execution environment (TEE) or an HSM must be introduced.

Assuming the Agent already holds the private key securely and performs basic operations on that basis, the models in the test showed good capability. Although the generated steps and code often contain errors, these problems can largely be resolved with a suitable engineering setup. It is fair to say that, from a purely technical perspective, there are no longer many obstacles to an Agent performing basic wallet operations on its own.
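
For a sense of what "basic wallet operations" means in practice, here is a rough sketch of a plain ETH transfer using web3.py, reusing the key from the earlier sketch. The RPC URL, recipient, and amounts are placeholders, and attribute names can differ slightly across web3.py versions, so treat it as illustrative rather than drop-in.

```python
# Build, sign, and broadcast a simple ETH transfer with web3.py.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://rpc.example.org"))   # placeholder RPC URL
sender = account.address        # account created in the earlier sketch
recipient = "0x000000000000000000000000000000000000dEaD"  # placeholder recipient

tx = {
    "to": recipient,
    "value": w3.to_wei(0.01, "ether"),
    "gas": 21_000,                                # standard gas for a plain transfer
    "gasPrice": w3.eth.gas_price,
    "nonce": w3.eth.get_transaction_count(sender),
    "chainId": w3.eth.chain_id,
}

signed = w3.eth.account.sign_transaction(tx, private_key)
tx_hash = w3.eth.send_raw_transaction(signed.raw_transaction)  # rawTransaction on web3.py < 7
print("sent:", tx_hash.hex())
```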

Smart contracts

The ability to understand, use, write, and assess the risks of smart contracts is key to an AI Agent performing complex tasks in the on-chain world, and was therefore a key area of testing. Large language models showed significant potential here, but they also exposed some obvious problems.

Almost all models in the test correctly answered questions about basic contract concepts and identified simple bugs. On gas optimization, most models could identify the key optimization points and analyze the conflicts that optimization might cause. But once deep business logic is involved, the limitations of large models begin to show.

Take a token vesting contract as an example: all models correctly understood the contract's functionality, and most found several medium- and low-risk vulnerabilities. However, no model could independently discover a high-risk vulnerability hidden in the business logic that could cause part of the funds to be locked under particular circumstances. Across multiple tests on real contracts, the models performed roughly the same.

This suggests the large models' understanding of contracts still stays at the formal level and lacks a grasp of deep business logic. That said, after being given additional hints, some models were eventually able to identify the deeply hidden vulnerability in the contract above. Based on this, with the support of good engineering design, large models are basically capable of serving as copilots in the smart contract field; but there is still a long way to go before they can independently take on critical tasks such as contract audits.

Note that the code-related tasks in the experiment targeted contracts with simple logic and fewer than 2,000 lines of code. Larger, more complex projects are, in my view, clearly beyond the effective processing capability of current models without fine-tuning or elaborate prompt engineering, and they were not included in the test. In addition, the test covered only Solidity and did not include other smart contract languages such as Rust and Move.

Beyond the content above, the experiment also covered DeFi scenarios, DAOs and their governance, on-chain data analysis, consensus mechanism design, and tokenomics, where the large language models demonstrated a degree of capability. Since many of these tests are still in progress and the methods and framework are still being refined, this article will not go into those areas for now.

Model differences

Among all the large language models in the evaluation, GPT-4o and Claude 3.5 Sonnet continued their excellent performance from other fields and are the undisputed leaders. On basic questions, both models almost always give accurate answers; in the analysis of complex scenarios, they provide in-depth, well-supported insights. They even achieve a relatively high success rate on the calculation tasks that large models are generally bad at, although "high" here is relative and still falls short of the stable output a production environment requires.

In the open-source camp, Llama 3.1-405B is far ahead of its peers thanks to its large parameter scale and more advanced model algorithms. Among the smaller open-source models, there is no significant performance gap: their scores differ slightly, but all of them fall well short of the passing line.

So if you want to build crypto-related AI applications today, these small and medium-sized models are not a suitable choice.

Two models particularly stood out in the evaluation. The first is Microsoft's Phi-3 3.8B, the smallest model in the experiment. With less than half the parameters, it reached a performance level comparable to the 8B-12B models, and on some specific question categories it even did better. This result highlights the importance of model architecture and training strategy; parameter count alone is not everything.

Cohere's Command-R, on the other hand, turned out to be a surprising "dark horse" in reverse. Command-R is less well known than the other models, but Cohere is a large-model company focused on the enterprise (2B) market. I felt there is considerable overlap with areas such as Agent development, so it was specifically included in the test. Yet the 35B-parameter Command-R ranked last in most tests, losing even to many models below 10B.

This result prompted some reflection: when Command-R was released, it emphasized retrieval-augmented generation and did not even publish conventional benchmark results. Does that mean it is a "private key" that only unlocks its full potential in specific scenarios?

Experimental limitations

This series of tests gave us a preliminary picture of AI's capabilities in the crypto field. These tests are, of course, far from professional standards: dataset coverage is nowhere near sufficient, the scoring criteria for answers are relatively rough, and a refined, more precise scoring mechanism is still missing. This affects the accuracy of the results and may cause some models' performance to be underestimated.

In terms of method, the experiment used only zero-shot prompting and did not explore techniques such as chain-of-thought or few-shot learning that can draw out more of a model's potential. Default model parameters were used throughout, and the effect of different parameter settings on performance was not examined. This single, uniform testing approach limits a comprehensive assessment of the models' potential and leaves differences in performance under specific conditions unexplored.

Although the testing conditions were relatively simple, these experiments still produced many valuable insights and provided a reference for developers to build applications.

The crypto space needs its own benchmark

In the field of AI, benchmarks play a key role. The rapid rise of modern deep learning traces back to ImageNet, the standardized benchmark and dataset for computer vision built by Professor Fei-Fei Li's team, and to the breakthrough results achieved on it in 2012.

By providing a unified standard for evaluation, benchmarks not only provide developers with clear goals and reference points, but also drive technological progress across the industry. This explains why every newly released large language model will focus on announcing its results on various benchmarks. These results become a “universal language” of model capabilities, allowing researchers to locate breakthroughs, developers to select the models best suited for specific tasks, and users to make informed choices based on objective data. More importantly, benchmark tests often herald the future direction of AI applications, guiding resource investment and research focus.

If we believe there is huge potential at the intersection of AI and crypto, then establishing dedicated crypto benchmarks becomes an urgent task. Such benchmarks could become a key bridge between the two fields, catalyzing innovation and providing clear guidance for future applications.

Compared with mature benchmarks in other fields, however, building benchmarks for crypto faces unique challenges: the technology evolves rapidly, the industry's knowledge system has not yet solidified, and consensus is lacking in several core directions. As an interdisciplinary field, crypto spans cryptography, distributed systems, economics, and more, and its complexity far exceeds that of any single discipline. More challenging still, a crypto benchmark must assess not only knowledge but also AI's practical ability to use crypto technology, which requires designing a new evaluation architecture. The shortage of relevant datasets adds to the difficulty.

The complexity and importance of this task mean it cannot be accomplished by a single person or team. It needs to draw on the wisdom of users, developers, cryptography experts, crypto researchers, and people from neighboring disciplines, and it depends on broad community participation and consensus. Crypto benchmarks therefore need a wider discussion, because this is not only technical work but also a deep reflection on how we understand this emerging technology.

Disclaimer:

  1. This article is reprinted from [Empower Labs]. All copyrights belong to the original author [Wang Chao]. If there are objections to this reprint, please contact the Gate Learn team, and they will handle it promptly.
  2. Liability Disclaimer: The views and opinions expressed in this article are solely those of the author and do not constitute any investment advice.
  3. Translations of the article into other languages are done by the Gate Learn team. Unless mentioned, copying, distributing, or plagiarizing the translated articles is prohibited.
