This is where the data to build AI comes from
Artificial intelligence (AI) depends on vast amounts of data to train its algorithms, yet the origins of that data are often unclear, and data collection practices have not kept pace with the sophistication of the models themselves. Before transformers were introduced in 2017, AI data sets were deliberately assembled from a diverse range of materials, from encyclopedias to parliamentary transcripts, each chosen to suit a specific task. Transformers revolutionized AI and prompted a shift toward extensive data from the web. Today most AI data sets are amassed from the internet with little distinction between kinds of sources, leading to a heavy reliance on web-based material.
This shift has concentrated data in a few hands. Alphabet, through YouTube, controls a significant portion of the data relevant to video models. Such concentration raises concerns about who can access these resources: companies like Google both control the data and develop AI models, a dual role that could hinder competition.
Data sets are also geographically and culturally skewed toward Western, English-speaking perspectives. More than 90% of the data sets evaluated originated in Europe and North America, which leaves out much of the world and reinforces cultural biases in AI models. As a result, AI systems may lean toward a US-centric view, reducing their effectiveness and fairness in other regions. The growing dominance of a few tech giants over large parts of the internet further exacerbates this imbalance.
Moreover, the exclusive data-sharing arrangements some companies have cultivated with major online platforms could entrench this power, making it harder for researchers and non-profits to access such data. These trends underscore the need for intentional, inclusive data collection practices, so that AI development reflects global diversity, equity, and a broader spectrum of human experience.