Monday, 30 July 2018

Twitter's vast metadata haul is a privacy nightmare for users

via ResearchBuzz Firehose

Metadata is everywhere.

It accompanies everything you tweet, every picture you take, and every status update you post on Facebook. Police and security forces use it to identify people who try to hide their identities and locations, while metadata embedded in selfies can inadvertently ensnare criminals who are unaware that the data can destroy their alibi.

And metadata on Twitter can also be used to identify each and every one of us with extreme precision, according to a new paper by researchers at University College London and the Alan Turing Institute.

Original article by Chris Stokel-Walker published in WIRED

Working with publicly available metadata from Twitter, a machine learning algorithm was able to identify users with 96.7 per cent accuracy

Conference paper (PDF 10pp)

Abstract

Metadata are associated to most of the information we produce in our daily interactions and communication in the digital world. Yet, surprisingly, metadata are often still categorized as non-sensitive. Indeed, in the past, researchers and practitioners have mainly focused on the problem of the identification of a user from the content of a message.

In this paper, we use Twitter as a case study to quantify the uniqueness of the association between metadata and user identity and to understand the effectiveness of potential obfuscation strategies. More specifically, we analyze atomic fields in the metadata and systematically combine them in an effort to classify new tweets as belonging to an account using different machine learning algorithms of increasing complexity.

We demonstrate that through the application of a supervised learning algorithm, we are able to identify any user in a group of 10,000 with approximately 96.7% accuracy. Moreover, if we broaden the scope of our search and consider the 10 most likely candidates we increase the accuracy of the model to 99.22%.
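The two figures quoted above differ only in how a prediction is scored: the first counts a hit when the single most likely candidate is the right account, the second when the right account is anywhere among the 10 most likely. A minimal sketch of that scoring rule follows, assuming the classifier returns one score per candidate account. The function name `top_k_accuracy` and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring candidates.

    scores: one list of per-candidate scores for each sample (hypothetical classifier output).
    true_labels: index of the correct candidate for each sample.
    """
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Rank candidate indices by descending score and keep the first k.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if truth in ranked:
            hits += 1
    return hits / len(true_labels)


# Toy data: 4 tweets, 3 candidate accounts (made-up numbers for illustration).
scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
true_labels = [0, 1, 2, 2]

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
```

Widening k can only keep or raise the score, which is why the top-10 figure is higher than the top-1 figure.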

We also found that data obfuscation is hard and ineffective for this type of data: even after perturbing 60% of the training data, it is still possible to classify users with an accuracy higher than 95%.
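The abstract does not say how the training data were perturbed, so the following is only one plausible reading, offered as a sketch: pick a fraction of the records at random and replace each of their field values with a value drawn from that field's observed pool. The helper `perturb_records` and the field names `followers` and `statuses` are hypothetical and are not taken from the paper.

```python
import random


def perturb_records(records, fraction, rng):
    """Return (perturbed copy, indices of perturbed records).

    A `fraction` of the records is chosen at random; every field of a chosen
    record is resampled from the values observed for that field across all
    records. This is an assumed obfuscation scheme, not the paper's.
    """
    n_perturb = round(fraction * len(records))
    chosen = set(rng.sample(range(len(records)), n_perturb))
    fields = list(records[0].keys())
    # Pool of observed values for each field, used as the resampling source.
    pools = {f: [r[f] for r in records] for f in fields}
    out = []
    for i, rec in enumerate(records):
        if i in chosen:
            out.append({f: rng.choice(pools[f]) for f in fields})
        else:
            out.append(dict(rec))  # untouched records are copied as-is
    return out, chosen


# Ten made-up metadata records with hypothetical field names.
records = [{"followers": 10 * i, "statuses": 100 + i} for i in range(10)]
perturbed, chosen = perturb_records(records, fraction=0.6, rng=random.Random(0))
```

A classifier would then be retrained on `perturbed` and evaluated on clean data to see how much accuracy survives.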

These results have strong implications in terms of the design of metadata obfuscation strategies, for example for data set release, not only for Twitter, but, more generally, for most social media platforms.

