🧠 No Data, No Dice

Data is the fuel that keeps the AI engine running


🌵 The Intersection of Crypto & AI 🌵

Big Brain Breakdown

Market Metrics

Total Crypto Market Cap: down 1% to $2.38T
Total AI Sector Market Cap: up 2.2% to $26.28B

Top Movers (24hrs):

📈 EnqAI (ENQAI): up 22.5% to $0.04407
📈 dotmoovs (MOOV): up 19.7% to $0.00691
📈 Akash Network (AKT): up 14.3% to $5.09

Daily News

🟠 Virtuals Protocol released a whitepaper on its Game Agentic Module Engine (GAME), enabling the use of AI agents in gaming through “cognitive capabilities and API architectures.”

🟠 ORA Protocol team member @0xKartin teased Initial Agent Offerings and Inference Assets, which appear to be two new offerings the ORA team plans to ship in collaboration with Lab7007, Virtuals Protocol, Talus Network, and NIM Network.

🟠 NEAR Protocol’s Fast Finality Layer (NFFL) is now live on testnet. NFFL aims to enable faster cross-rollup transactions and reduce fragmentation of liquidity and state, among other goals.

🟠 Nillion released a demo of a new privacy-preserving AI feature that “blindly analyzes large amounts of data and predicts who is at risk of developing potential diseases.”

🟠 AI Protocol released a new series of Galxe quests, enabling community members to enter raffles and earn tokens, points, OATs, and other rewards.

🧠 Big Brain Breakdown

Welcome back to another Big Brain Breakdown, where we help you understand the fundamentals of blockchain AI projects so you can stay ahead of the herd and invest in projects poised for outperformance.

As AI continues to advance at a rapid pace, a new bottleneck has emerged that is stalling progress: the lack of quality datasets.

AI model training requires vast amounts of high-quality, clean data to learn from and produce accurate results. While companies have successfully assembled datasets to train LLMs like ChatGPT, these models’ outputs are typically quite general and often lack the granularity required to be helpful in fields that demand highly specific and reliable answers.

Ideas for domain-specific AI models are constantly floated for industries like law, healthcare, and finance. In theory this is a great idea; in practice, obtaining a dataset that is high-quality, clean, and specific is either extremely difficult, because the data is often proprietary or sensitive, or simply impossible, because it doesn’t yet exist.

Moreover, even when data is available, it may not be of the desired quality. AI models trained on incomplete, biased, or inaccurate data can and do produce flawed results, leading to poor performance and unreliable outputs. This is especially concerning in critical fields like healthcare, where AI-assisted decision-making can have life-altering consequences, or in finance, where a single bad trade could push a firm into bankruptcy.

To overcome the data shortage, some developers have turned to synthetic, AI-generated data, and in some cases end up training on effectively poisoned datasets. While this approach can help fill the gaps, it comes with its own set of challenges, including the risk of model overfitting: AI models trained on synthetic data are prone to overfitting because they inherit the biases and shortcomings of the algorithms used to create that data. This can lead to a phenomenon known as model degradation, where the quality of AI models deteriorates over time as they learn from flawed replicas of real-world data, limiting their effectiveness when applied to new, unseen scenarios.
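
To make this failure mode concrete, here is a minimal toy sketch, not tied to any specific project or production pipeline, in which each model generation is fit only on the previous generation’s synthetic samples, with the tails clipped to mimic how generative models tend to underrepresent rare examples. The learned distribution visibly narrows after a few generations.

```python
# Toy sketch of model degradation: a model repeatedly refit on its own
# synthetic output drifts away from the real data distribution. The clipping
# step stands in for the tendency of generative models to underrepresent rare
# (tail) examples. Purely illustrative; not any specific project's pipeline.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 learns from "real-world" data with plenty of tail examples.
real_data = rng.normal(loc=0.0, scale=10.0, size=10_000)
mu, sigma = real_data.mean(), real_data.std()
print(f"gen 0: std={sigma:.2f}")

for gen in range(1, 6):
    # Sample from the current model, but lose the tails (a common failure
    # mode of synthetic data generators).
    synthetic = rng.normal(mu, sigma, size=10_000)
    synthetic = synthetic[np.abs(synthetic - mu) < 2 * sigma]
    # Refit the next generation on synthetic data only.
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"gen {gen}: std={sigma:.2f}")

# The spread shrinks every generation: each model inherits the previous one's
# blind spots, so its view of the data keeps narrowing.
```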

Why Overcoming This Bottleneck Is Important

Overcoming the data bottleneck would not only enable the development of more accurate and reliable general-purpose AI models like ChatGPT, but would also enable domain-specific AI models trained on highly specific datasets.

These specialized models can tackle complex problems that require deep knowledge and understanding of a particular field. For example, an AI model trained on a comprehensive legal database could assist lawyers in researching cases, identifying relevant precedents, and even predicting the outcomes of legal proceedings. Similarly, AI models trained on detailed manufacturing data could optimize production processes, reduce waste, and improve overall efficiency.

Beyond industry-specific applications, overcoming the data bottleneck is crucial for the development of more advanced and capable AI systems. As AI models become more sophisticated, they require ever-larger and more diverse datasets to learn from. By providing access to such data, we can push the boundaries of what AI can achieve, paving the way for breakthroughs in areas like natural language processing, computer vision, and robotics.

Furthermore, addressing the data bottleneck can help ensure that AI is developed in a fair, unbiased, and ethical manner. By training AI models on diverse and representative datasets, we can reduce the risk of perpetuating societal biases. This is particularly important as AI systems become more integrated into decision-making processes that affect people's lives, such as hiring, finances, and criminal justice.

Projects Tackling This Challenge

While the AI sector has been quick to respond to the challenge posed by the rapidly growing demand for compute power, it has been slower to react to the lack of quality datasets suitable for training AI models.

There are two notable projects tackling this issue, and as @lempheter nicely articulates in this X post, each takes a different approach to the problem.

One of these projects is Grass, which takes a "top-down" approach to the problem. Grass focuses on scraping top-of-funnel data, filtering it, and then matching it with potential AI customers. By aggregating data from various sources and processing it to meet specific requirements, Grass aims to provide AI developers with the data they need to train their models effectively.
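
The details of Grass’s pipeline aren’t covered here, so the following is only a hypothetical sketch of what a generic “top-down” scrape-filter-match flow can look like; every function, field, and source name below is made up for illustration.

```python
# Hypothetical "top-down" data flow: collect broad web data first, filter it
# for quality, then match it against a specific AI customer's requirements.
# Names and sources are illustrative stand-ins, not Grass's actual code.
from dataclasses import dataclass


@dataclass
class Record:
    source: str
    topic: str
    text: str


def scrape_sources(sources: list[str]) -> list[Record]:
    """Top-of-funnel collection; hard-coded stand-ins instead of real scraping."""
    return [
        Record("forum.example", "healthcare", "Notes on documented drug interactions and dosages..."),
        Record("blog.example", "finance", "Commentary on quarterly earnings and guidance..."),
        Record("blog.example", "finance", "lol"),  # junk the filter should drop
    ]


def filter_records(records: list[Record], min_len: int = 20) -> list[Record]:
    """Basic quality filtering: drop very short or duplicate snippets."""
    seen, kept = set(), []
    for r in records:
        if len(r.text) >= min_len and r.text not in seen:
            seen.add(r.text)
            kept.append(r)
    return kept


def match_to_customer(records: list[Record], wanted_topic: str) -> list[Record]:
    """Match the filtered pool against one customer's stated data need."""
    return [r for r in records if r.topic == wanted_topic]


pool = filter_records(scrape_sources(["forum.example", "blog.example"]))
print(match_to_customer(pool, wanted_topic="finance"))
```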

The other is Synesis One, which takes a "bottom-up" approach: it first identifies the specific data needs of its customers and then launches dedicated “Train2Earn” campaigns to fulfill those requirements, incentivizing users to contribute to the creation of high-quality datasets. The project has created a marketplace where data contributors are rewarded with tokens for their efforts. This incentive model encourages people to participate in data collection and labeling tasks, helping to build comprehensive and diverse datasets.

While these projects offer promising solutions, they also face several challenges. One key challenge is incentivizing participants correctly. In the case of Synesis One, which relies on human contributors, there is a risk that some participants submit low-quality or inaccurate data in pursuit of token rewards. To mitigate this, the project rewards reliable participants, but deciding who counts as reliable is itself a subjective labeling process.
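
Synesis One’s actual reward logic isn’t specified here, so the snippet below is a hypothetical sketch of one common mitigation for this incentive problem: scaling token payouts by how often a contributor’s labels agree with the majority label for the same item. Majority vote only shifts the subjectivity rather than removing it, which is exactly the caveat raised above.

```python
# Hypothetical quality-weighted payout for a "Train2Earn"-style labeling
# campaign: contributors whose labels agree with the consensus earn full
# rewards; unreliable contributors earn proportionally less. Illustrative
# only; this is not Synesis One's actual reward mechanism.
from collections import Counter, defaultdict

# (item_id, contributor, label) triples submitted during a campaign.
submissions = [
    ("img1", "alice", "cat"), ("img1", "bob", "cat"), ("img1", "carol", "dog"),
    ("img2", "alice", "dog"), ("img2", "bob", "dog"), ("img2", "carol", "dog"),
]

# Consensus label per item via simple majority vote.
labels_by_item = defaultdict(list)
for item, who, label in submissions:
    labels_by_item[item].append(label)
consensus = {item: Counter(labels).most_common(1)[0][0] for item, labels in labels_by_item.items()}

BASE_REWARD = 10.0  # tokens per submission, an arbitrary illustrative value
agree, total = Counter(), Counter()
for item, who, label in submissions:
    total[who] += 1
    agree[who] += int(label == consensus[item])

for who in total:
    reliability = agree[who] / total[who]
    payout = BASE_REWARD * total[who] * reliability
    print(f"{who}: reliability={reliability:.2f}, payout={payout:.1f} tokens")
```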

AI Art of the Day

Disclaimer: This newsletter is provided for educational and informational purposes only and is not intended as legal, financial, or investment advice. The content is not to be construed as a recommendation to buy or sell any assets or to make any financial decisions. The reader should always conduct their own due diligence and consult with professional advisors for legal and financial advice specific to their situation.