Expert-Speak: Not All Perfect (Yet) / The Limitations of Representing Rare Information in Generative AI

Published 10 hours ago
Tshilidzi Marwala. The writer is the 7th Rector of the United Nations (UN) University and a UN Under-Secretary-General.
Generative AI has revolutionized content creation, from narratives to images. Central to many of these models is maximum likelihood estimation (MLE), which allows systems to learn from large datasets by finding patterns that maximize the probability of producing outputs that resemble the data. While effective in generating realistic content, MLE struggles when tasked with rare or underrepresented information.

This limitation became clear in a recent interaction I had with ChatGPT. When asked, “Who is Tshianeo Marwala?” the AI responded, “Tshianeo Marwala is the wife of Professor Tshilidzi Marwala.” However, Tshianeo Marwala is my grandmother, not my spouse. This mistake occurred because MLE-based systems prioritize statistically common relationships: in the training data, public figures are far more frequently associated with spouses than with grandparents.

MLE’s reliance on frequent patterns introduces significant challenges when dealing with uncommon data, and this issue is compounded by AI systems’ dependence on centralized sources like Wikipedia. While a valuable resource, Wikipedia’s gaps in coverage, particularly around marginalized groups and global perspectives, restrict the diversity of information AI systems can learn from. This reliance on concentrated sources exacerbates biases, limiting AI’s ability to represent underrepresented voices accurately.

MLE works by selecting parameters that maximize the likelihood of the observed data. It is highly effective with frequent data patterns but less so with rare or complex relationships. For example, generative AI trained on datasets that frequently discuss public figures and their spouses will naturally associate these roles more often than other family connections. Our 2000 paper, “Detection and Classification of Faults Using Maximum Likelihood and Bayesian Approaches”, showed that while MLE works effectively with frequent, well-defined data, it faces challenges in rare cases. Similarly, in generative AI, MLE handles common information well but tends to make errors with rare or less frequent data.
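For a categorical distribution, the maximum likelihood estimate is simply each outcome's relative frequency in the data, which makes it easy to see why rare facts get squeezed out. The toy sketch below uses invented counts (not real training data) to show how a "generator" that always picks the most likely relation will answer "spouse" even when "grandmother" is the correct answer:

```python
from collections import Counter

# Toy corpus of relation mentions for a public figure;
# "grandmother" is deliberately rare (illustrative numbers only).
corpus = ["spouse"] * 90 + ["child"] * 8 + ["grandmother"] * 2

# For a categorical distribution, the MLE of each probability
# is just the relative frequency of that outcome.
counts = Counter(corpus)
total = sum(counts.values())
mle = {relation: c / total for relation, c in counts.items()}

# A generator that greedily picks the maximum-likelihood relation
# will always answer "spouse", regardless of the actual question.
most_likely = max(mle, key=mle.get)
print(mle)           # {'spouse': 0.9, 'child': 0.08, 'grandmother': 0.02}
print(most_likely)   # spouse
```

Real language models are vastly more complex than a frequency table, but the same pressure applies: decoding strategies that favor high-probability continuations systematically suppress the tails of the distribution, where rare relationships live.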

MLE can also stifle innovation by focusing on the most common patterns, sidelining rare ideas that could lead to breakthroughs. In fields where creativity and uniqueness are essential — such as product design and content creation — this narrowing effect could lead to homogenization rather than diversity. When AI systems converge too much on common patterns, they fail to explore the fringes of probability distributions where novel ideas often reside.

When AI systems rely on data from centralized platforms like Wikipedia or mainstream news outlets, they reflect the biases of these sources. Generative AI will excel at reproducing dominant perspectives but struggle to represent underrepresented or niche knowledge, such as indigenous cultures or less-documented traditions.

This systemic bias limits the AI’s ability to reflect the full spectrum of human experience accurately.

In my grandmother’s case, ChatGPT’s error highlights a broader challenge in AI’s ability to handle complex, less-represented relationships. If a system is trained predominantly on data that overrepresents certain groups or patterns, it will likely misinterpret rare information, perpetuating existing biases.

Beyond limiting diversity, MLE can amplify societal biases present in training data. AI models will reinforce these patterns if a dataset disproportionately represents certain genders, ethnicities, or regions.

To overcome these limitations, we must develop generative AI systems that synthesize information from diverse sources rather than relying on a few centralized platforms. This would enable AI to better capture inclusive insights and improve its accuracy on underrepresented data.

Generative AI must also continually adapt to new information, ensuring its output reflects the latest knowledge. As data generation accelerates, AI systems must effectively filter, assess, and integrate new information. By doing so, AI models can remain relevant, nuanced, and reflective of evolving contexts.

Researchers are exploring alternatives to MLE to address these challenges. One promising approach is reinforcement learning, which allows AI to learn through feedback and adjust based on incentives, improving accuracy and diversity. This method, though not without challenges, can help ensure that rare data is better represented. Another solution is data augmentation, where the training data is enriched with synthetic or curated examples of underrepresented information. This technique helps AI systems learn from a more diverse set of inputs, resulting in more accurate outputs.
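As a concrete illustration of the data-augmentation idea, the sketch below oversamples underrepresented relations in a toy dataset until each reaches a minimum count. The relation names, counts, and threshold are invented for the example; real augmentation pipelines would add curated or synthetic examples rather than simple duplicates:

```python
import random

random.seed(0)

# Toy training set: the rare relation is heavily underrepresented.
data = [("spouse", i) for i in range(90)] + [("grandmother", i) for i in range(2)]

def augment(examples, target_count):
    """Oversample rare relations until each appears at least
    target_count times (a minimal augmentation sketch)."""
    by_relation = {}
    for relation, item in examples:
        by_relation.setdefault(relation, []).append((relation, item))
    augmented = list(examples)
    for relation, items in by_relation.items():
        while sum(1 for r, _ in augmented if r == relation) < target_count:
            augmented.append(random.choice(items))
    return augmented

balanced = augment(data, target_count=50)
# "grandmother" now appears at least 50 times, so an MLE-trained
# model no longer treats it as a negligible tail event.
```

The design choice here is deliberate: rebalancing changes the effective distribution the model learns from, so MLE itself is untouched and only the data it sees is made more representative.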

Collaboration between AI developers and experts in underrepresented fields is also critical. Such interdisciplinary partnerships ensure that AI systems are not solely reliant on statistical likelihood but incorporate a broader range of knowledge and experiences.

In conclusion, while MLE has played a vital role in generative AI, it has notable limitations. Addressing them is essential for ensuring that AI systems are fair, inclusive, and accurate. Moving beyond MLE requires both technical innovation and a commitment to diversity. By designing AI systems that synthesize knowledge from distributed sources and continuously update with new information, we can create more adaptive models that represent the full breadth of human knowledge and experience.
