From CAPTCHA to navigation… Every movement on the Internet becomes data for artificial intelligence

Among artificial intelligence technologies, large language models (LLMs) such as ChatGPT, Gemini and Claude, which can generate human-like text, are used by many people every day.

Text from books, websites, articles, and other written materials is used to train these models.

Although training material can be collected from publicly available sources, recent debate has focused on how much of it is compiled from Internet users' online activity.

The invisible work of the “I am not a robot” box

“CAPTCHA” and “reCAPTCHA” tests, which are designed to confirm that a user is a human and not a robot before granting access to an online service, are seen as more than just a security measure for technology companies.

For years there has been debate over whether these tests, which ask users to perform simple tasks such as typing the letters shown in an image or picking out certain objects, are being used to train artificial intelligence tools.

The tests used by Google often ask about objects such as pedestrian crossings, traffic lights and vehicles, leading to claims that the data obtained from them will be used for artificial intelligence-assisted driverless vehicles.

“reCAPTCHA user data will not be used for any purpose other than improving the reCAPTCHA service, and this is clearly stated in the Terms of Service,” a Google Cloud spokesperson said in a statement.

A realistic world map from a mobile game

Discussions about the use of everyday applications in artificial intelligence training have recently expanded to other areas such as games.

“Pokemon Go”, a game released by the US company Niantic in 2016 that quickly won a large audience in many countries, has recently become a focus of criticism.

The game, in which players search for characters from the animated series “Pokemon” in the real world using GPS and cameras on their cell phones, has created a large data pool of street images.

According to MIT Technology Review magazine, artificial intelligence company Niantic Spatial has created a realistic virtual model of the real world using 30 billion images collected from gamers.

Niantic announced that it has developed technology that allows people to find their location on a map by uploading photos of their surroundings.

The company also wants to use this model to develop technology that will make it easier for robots to navigate in places where GPS is not reliable.

A statement posted on the company's website in November 2024 confirmed that it used data provided by players through real-world scanning, but emphasized that this feature was “entirely optional.”

Users directly contribute to the improvement of LLMs

Professor Christian Peukert from the University of Lausanne in Switzerland assessed the balance between the materials used in training artificial intelligence and the security and privacy of Internet users.

Professor Peukert explained that in older versions of CAPTCHA tests, users were asked to decipher two words, one of which was already known to the system while the other was not.

Peukert said that the word known to the system was used to verify that the user was human, while the response to the unknown word was stored as data for digitization efforts such as e-book applications.
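The two-word check described above can be illustrated with a rough Python sketch. It is not Google's or any provider's actual code: the function names, the in-memory vote store and the three-vote threshold are all hypothetical and chosen only for illustration. The sketch assumes that only the known word decides whether the user passes, and that answers to the unknown word are simply tallied.

```python
from collections import Counter, defaultdict

# Hypothetical store: unknown-word image id -> tally of submitted transcriptions.
unknown_word_votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return text.strip().lower()


def check_two_word_captcha(control_answer, typed_control, unknown_id, typed_unknown):
    """Return True if the user passes the known-word check.

    Only the control word decides pass/fail. The answer for the unknown
    word is stored as data, and only when the control word was correct.
    """
    passed = normalize(typed_control) == normalize(control_answer)
    if passed:
        unknown_word_votes[unknown_id][normalize(typed_unknown)] += 1
    return passed


def best_guess(unknown_id, min_votes=3):
    """Return the most common transcription once it has min_votes, else None.

    The default threshold of 3 is an arbitrary illustrative value.
    """
    tally = unknown_word_votes[unknown_id]
    if not tally:
        return None
    word, count = tally.most_common(1)[0]
    return word if count >= min_votes else None
```

In this sketch, a user who types the known word correctly is let through whatever they enter for the second word, and that second answer becomes one vote toward a transcription of the unrecognized scan.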

Peukert emphasized that this means that “users contribute directly to improving text recognition systems,” adding, “Most artificial intelligence training is based on passive data that users produce on the Internet, often without realizing it.”

Tags on image platforms help label visual data

Christian Peukert gave examples of other areas, besides reCAPTCHA, in which data on the Internet is used to train artificial intelligence:

“Social media platforms such as Reddit and Twitter provide large amounts of text that train language models. On image platforms such as Instagram, descriptions and tags (added to posts) help label visual data. Searches on Google help develop language understanding and ranking systems. Navigation applications such as Google Maps and Waze collect movement data that is used to train predictive models. Conversations with chatbots and voice assistants are often recorded and used to improve systems.”

Privacy and security concerns

Christian Peukert emphasized that these processes raise privacy and security issues, explaining that the large-scale accumulation of data can lead to “tagging,” “production of fake content,” and “users feeding systems that compete with them.”

Professor Peukert emphasized that individual measures alone will not be enough to reduce data consumption: “Most of the data used for training has already been collected, is publicly available or replicated between systems. Once data is contained in large data sets, it is difficult to regain control.”

On the other hand, Peukert noted that this data contribution also has some advantages, citing the use of human data in everyday services such as language technologies, translation, accessibility tools, scientific research and search engines.