Understanding writing style in social media with a supervised contrastively pre-trained transformer

We introduce the Style Transformer for Authorship Representations (STAR) to detect and characterize writing style in social media. The model is trained on a large, heterogeneous corpus derived from public sources, comprising 4.5·10⁶ authored texts from 70k authors, using a Supervised Contrastive Loss to minimize the distance between texts authored by the same individual. This pretext pre-training task yields competitive zero-shot performance on the PAN attribution and clustering challenges, and we attain promising results on the PAN verification challenges using STAR as a feature extractor. Finally, we present results on our Reddit test partition, where, using a support base of 8 documents of 512 tokens each, we can discern authors from sets of up to 1616 authors with at least 80% accuracy. We share our pre-trained model on Hugging Face (AIDA-UPM/star), and our code is available at jahuerta92/star.
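The supervised contrastive objective mentioned above can be sketched as follows. This is a minimal NumPy illustration of the standard supervised contrastive loss (every pair of texts by the same author is treated as a positive), not the authors' training code; the function and variable names are our own.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss over a batch of text embeddings.

    embeddings: (N, D) array of representations (one per text).
    labels:     (N,) array of author ids; same id => positive pair.
    Pulls same-author embeddings together and pushes others apart.
    """
    # L2-normalise so the dot product is a cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Exclude each anchor from its own denominator.
    sim = np.where(self_mask, -np.inf, sim)
    # Row-wise log-softmax over all other samples in the batch.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # Positives: same author, but not the anchor itself.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = np.maximum(pos.sum(axis=1), 1)
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / pos_count
    # Average only over anchors that actually have a positive.
    return per_anchor[pos.any(axis=1)].mean()
```

As a sanity check, a batch in which same-author embeddings are close yields a lower loss than the same embeddings with mismatched author labels, which is exactly the geometry the pre-training task optimizes for.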