SpADe: Multi-Stage Spam Account Detection for Online Social Networks

In recent years, Online Social Networks (OSNs) have radically changed the way people communicate. The most widely used platforms, such as Facebook, YouTube, and Instagram, claim more than one billion monthly active users each. Beyond these, news-oriented micro-blogging services, e.g., Twitter, are accessed daily by more than 120 million users sharing content from all over the world. Unfortunately, legitimate users of the OSNs are mixed with malicious ones, who are interested in spreading unwanted, misleading, harmful, or discriminatory content. Spam detection in OSNs is generally approached by considering the characteristics of the account under analysis, its connection with the rest of the network, as well as data and metadata representing the content shared. However, obtaining all this information can be computationally expensive, or even unfeasible, on massive networks. Driven by these motivations, in this article we propose SpADe, a multi-stage Spam Account Detection algorithm with reject option, whose purpose is to exploit less costly features at the early stages, while progressively extracting more complex information only for those accounts that are difficult to classify. Experimental evaluation shows the effectiveness of the proposed algorithm compared to single-stage approaches, which are much more complex in terms of feature processing and classification time.


INTRODUCTION
The widespread diffusion of Online Social Networks (OSNs) has enabled new forms of communication that allow people to regularly share almost any kind of information within a virtual community. Nowadays, a number of OSNs are available to address the needs of different types of users, providing them with a variety of services and objectives. Some aim to create networks of people who know each other (e.g., Facebook), or to connect people interested in news coming from all over the world (e.g., Twitter); others are oriented to professional networking (e.g., LinkedIn), some offer instant messaging services, such as WhatsApp, Telegram, or Viber, while some others are mainly intended for sharing multimedia contents, e.g., Instagram and YouTube.
Thanks to their ease of use, popular OSNs claim billions of active users, most of whom are unfortunately unaware of the threats coming from cyberspace. This is the main reason why malicious users are attracted to social networks as much as, or even more than, legitimate ones.
Research on social network security covers a wide number of topics, from account hijacking, fraud and impersonation attacks to malware distribution [1]. Besides these, spam detection is a well-known, and still open, challenge which affects social networks as well as any other type of network-based application [2].
In general terms, spammers are entities (real users or automated software agents) that repeatedly send unsolicited messages for various purposes, e.g., supporting commercial, slandering, or proselytizing campaigns [3]. Even though several spam detection techniques have been proposed in the literature, the art of spamming continuously evolves and new intelligent approaches for identifying spammers are constantly needed. The behavior of early social bots, for instance, was quite simplistic as they were just intended to spread messages to as many users as possible. As soon as the spam detection algorithms became able to identify the typical characteristics of these bots, such as the presence of a biased following/followers ratio (FF) as compared to real users, the attackers quickly improved their strategy [4]. A trustworthy FF value, for instance, could be easily forged by relying on groups of social bots which cooperate to mimic the interactions among normal OSN users, thus avoiding the corresponding countermeasures [5].
As a consequence, spam analysis in online social networks is generally approached by considering different levels of information that describe the user as a whole. To this aim, a variety of features and classification algorithms exist. Whereas the latter are typically borrowed from those adopted in other machine learning contexts, the feature extraction process is strictly dependent on the set of information that the OSN makes available. This commonly includes the characteristics of the account, its connection with the rest of the social network, as well as data and metadata representing the content shared. However, what is never considered in the existing works is the effort required to extract each feature, which deeply impacts the capability of the classifier to provide timely results.
Driven by these motivations, in this paper we propose SpADe, a multi-stage Spam Account Detection technique with reject option, whose purpose is to exploit less costly features at the early stages, while progressively extracting more complex information only for those accounts that are more challenging to classify.
SpADe consists of four stages of analysis that progressively combine information about (i) the general characteristics of the account, (ii) the URLs shared, (iii) the similarity of the published contents, and (iv) the relationship between the user and the rest of the social network. Bayes classifiers are adopted to implement the accept/reject mechanism in all the stages except the last one, in which decision trees are exploited to make a decision regardless of the uncertainty degree. The effectiveness of the approach was proven by considering as case study the most widely diffused microblogging platform, i.e., Twitter. Nevertheless, it is worth noting that the features we have chosen, as well as the classification algorithm, are reasonable for any other OSN.
The major contributions of this work are summarized as follows.
- The paper presents SpADe, a novel Spam Account Detection approach that takes into account the effectiveness of each feature as well as its observation (collection/processing) cost; to the best of our knowledge, this is the first work in which the two aspects are considered together.
- The most representative features presented in the literature were selected and organized into four consistent categories, each of which captures a different facet of spamming behaviors and is characterized by a homogeneous observation cost.
- A novel multi-stage classification algorithm with reject option that incrementally exploits sets of features of increasing complexity was designed. This makes it possible to classify an account as soon as the chosen confidence level is reached, without the need to capture the whole feature set for every account under analysis.
- SpADe is evaluated both on a dataset of about 40,000 users we retrieved from the Twitter stream during the last year, and on a popular public reference dataset of about 11,000 users collected in 2017. Comparing the results obtained on datasets of different sizes, acquired in different epochs, made it possible to carry out a robust evaluation of the proposed method.

The remainder of the paper is organized as follows: related works are outlined in Section 2. The mathematical background of the proposed multi-stage classification algorithm is provided in Section 3. Section 4 presents SpADe and the features exploited at each stage, highlighting their role in spam detection. Experimental settings and results are discussed in Section 5. Conclusions follow in Section 6.

RELATED WORK
Spam detection is a popular research topic that has been widely addressed in the last decades. More recently, the focus has shifted towards the detection of spam campaigns on Online Social Networks (OSNs), which represent one of the most fertile grounds for this type of cyber threat [6]. The main reason is that users of OSNs can share information in many different ways, and so do spammers, making their behavior difficult to predict. In this paper, we focus on Twitter analysis because tweets generally refer to popular events and are therefore characterized by a high information content.
This section presents a review of the state of the art by following the evolution of spam detection systems. Related works are arranged in categories, which reflect the most important characteristics of the studies discussed.
Honey-Profiles. Early spam detection was mainly based on statistical analysis of account activities. This type of system required a protected environment in which the spammer could act undisturbed, allowing the detection algorithm to monitor and learn its behavior. In the Social Honeypot Project [7], for instance, an automated bot is assigned to every account to be analyzed in order to capture meaningful features that may reveal a malicious activity. The authors of [8] exploited 60 Twitter bots as honeypots to attract a total of 36,000 accounts, which were analyzed by observing their activities and their relationships with their neighborhood. In [9], an extensive study on how spammers operate to target Facebook, Twitter and MySpace is presented. In order to observe four categories of spammers, called displayers, braggers, posters and whisperers, large sets of honey profiles were created with the aim of capturing information about the accounts they are connected with and the messages they received. Then, users were classified by Random Forest exploiting conventional features, such as following/follower ratios, URLs, message similarity, account activity and quality of the neighborhood. Such an analysis also revealed the possibility of identifying spam campaigns [10] in which several bots cooperate towards the same goal. The use of social honeypots is also discussed in [11], which highlights how these traps can be effective in identifying previously unknown spamming patterns. The major limitation of these solutions is that several honeypots must be deployed in order to make the approaches effective; when dealing with large communities of spammers, this turns out to be computationally expensive, or even unfeasible.

URL Blacklists. URLs are frequently injected by spammers into trending topics and related messages. WarningBird [12] aimed at detecting spammers by following the URLs through all their redirections so as to obtain the target IP addresses; then, a set of features is computed and analyzed in order to assign the suspicious label to the corresponding URLs. Results show the good performance of the system; nevertheless, the effectiveness of WarningBird drops dramatically when an obfuscation mechanism is applied to the URLs, e.g., via URL shortening services. The authors of [13] extended the analysis of URLs by also considering how the links are received by the community, i.e., counting the actual number of clicks. However, the analysis is limited to a few shortening services, and the correlation between URLs and other types of features is not considered.
Wide Feature Sets and ML. In order to identify the distinctive characteristics of a broader set of spamming strategies, several works proposed a variety of features [3], [14] that can be exploited as the basis for Machine Learning (ML) models. The system presented in [15], for instance, leverages characteristics that capture the way tweets are written, as well as the user's posting frequency, social interactions, and influence on the Twitter network. These features are exploited to train a Support Vector Machine (SVM) classifier capable of correctly identifying 70% of spammers and 96% of non-spammers. Even though these results are notable, the method does not consider other relevant aspects that are typical of spammers. In [16], elements such as behavioral and content entropy, bait techniques, and profile vectors are considered. The corresponding features were used to train four different supervised learning algorithms, namely Decision Tree, Random Forest, Bayes Networks, and Decorate. Results indicate that such a feature set achieves good performance with any of the four algorithms. A different kind of features, aimed at modeling the interactions between users and their followers, is exploited in [17]. The idea is that spammers can easily alter features regarding their own behavior, while those based on their relationships with the community are more difficult to change. Nevertheless, complex attacks based on sybil networks [18] might seriously reduce the effectiveness of this kind of features. Sybil account detection is addressed in [19], where a method called Ianus is proposed to discover fake accounts according to registration information. The study moves from the observation that sybil accounts are characterized by different registration patterns than legitimate ones.
Then, sybil detection is solved as a graph inference problem in which registrations are modeled as nodes, and strongly connected nodes are more likely to represent sybils. Another approach to deal with compromised accounts is discussed in [20]. Malicious changes are distinguished from legitimate ones through statistical analysis and anomaly detection techniques. The system, called COMPA, exploits features capable of capturing recurring temporal patterns in the account usage, information about the messages (e.g., language, topic, the application used to share them), as well as the presence of URLs/mentions and the connections of the user with the social graph. Social graphs for spammer detection are also examined in [21], where Graph Convolutional Networks (GCNs) and Markov Random Fields (MRFs) are combined to exploit neighbor message-passing and capture human insights in user following relations. The analysis of communities, and in particular of the topics that spread through them, can make it possible to identify groups of accounts with abnormal behaviors. POISED [22] is a system that models the different propagation paths of benign and malicious messages in order to distinguish between legitimate and spam accounts. Experimental evaluation performed on Twitter data shows the effectiveness of this approach, even against poisoning and evasion adversarial attacks. The detection of anomalous topics is addressed in [23], where a topology-based method to detect cooperative and organized spammer groups in micro-blogging communities is proposed. An anomaly detection problem is also formulated in [24], where spammers are described by means of 107 features. This system combines two data stream clustering [25] algorithms, namely StreamKM++ and DenStream, which make it possible to correctly identify most of the spammers, with a low percentage of false positives.
Data stream clustering is also discussed in [26], where a modified version of DenStream based on a set of incremental Bayes classifiers is presented. In this case, the feature set is designed so as to capture relevant characteristics of both the user's behavior and the tweet content. A different approach is presented in [27], where the acceptance of a user from other members of the community is considered as an indicator of its reliability. In particular, the authors propose an unsupervised spam detection algorithm in which high peer acceptability values are assigned to users that have common interests, e.g., users discussing the same topics and/or sharing the same contents.
Deep Learning. Other works proposed the use of Deep Learning (DL) because of the lower effort required for feature extraction [31]. In [30], a novel DL technique is shown to outperform two machine-learning classifiers. The DL approach considers only users' tweets and relies on Google's Word2vec algorithm to learn tweet syntax, while the ML algorithms exploit nine easy-to-extract account metadata and text-based features. Word2vec is also used in [32] to convert tweets into dense vectors, which are analyzed by means of Recurrent Neural Networks (RNNs). The limitations of this approach are common to all deep learning based techniques [33]: the reasons for a certain output are difficult to understand, and there is no standard theory to guide the selection of the right DL strategy.
Other Learning Methods. Approaches based on pure ML suffer from the constant evolution of spammers, which continuously degrades the performance of existing methods. Incremental learning aims to keep the models up to date in order to deal with new attack strategies. This aspect is deeply analyzed in [28], where a model called Lfun (Learning from unlabeled tweets) is proposed to include new unlabeled spam tweets in the classifier training process. AdaGraph [34] operates on the same principle: it is an unsupervised graph-based technique that dynamically builds and updates a graph of behaviors to detect spam in OSNs. Although the performance of this approach is quite good, it cannot be adopted for massive graph analysis because of the cost of collecting community-based features. Social FingerPrinting [29] combines supervised and unsupervised techniques in order to identify two different types of automated spammers, namely those interested in advertising products on e-commerce platforms and those promoting a political candidate during an electoral campaign. The behavior of each account is encoded as a sequence of characters that represents a sort of digital DNA; then, a similarity measure between DNA sequences is used to distinguish genuine accounts from spamming ones.
Analysis of the literature reveals that spammers' behavior can be modeled through a variety of feature sets, capable of capturing the essence of a tweet, the characteristics of the account, as well as the interactions between users in the network. However, none of the works described so far explicitly considers the cost of obtaining the features, which in many cases is prohibitive and can significantly affect the classifier's ability to provide timely results. For this reason, our approach aims to progressively exploit features of increasing complexity, depending on the peculiarities of each spammer. The idea is somewhat similar to the process of diagnosing a clinical condition through a series of investigations of increasing cost and complexity [35].
The characteristics of SpADe are summarized in Table 1, which also provides a comparison with some of the works discussed in this section.

DECISION UNDER UNCERTAINTY
[Table legend: Random Forest (RF), Decision Tree (DT), Support-Vector Machine (SVM), Bayesian Network (BN), k-Nearest Neighbors (k-NN), and Other.]

Bayesian decision theory assumes that decisions are taken according to the probability of a possible outcome. Given the problem of associating an observation $X$ with a class from the finite set $\Omega = \{\omega_1, \omega_2, \ldots, \omega_Y\}$, a generic decision rule would suggest choosing the class that minimizes the classification error, i.e., the one whose posterior probability given $X$ is the greatest:

$$p(\omega_y|X) > p(\omega_c|X), \quad c = 1, \ldots, Y, \; c \neq y, \qquad (1)$$

or equivalently, in terms of class-conditional and prior probabilities:

$$p(X|\omega_y)\,p_y > p(X|\omega_c)\,p_c, \quad c = 1, \ldots, Y, \; c \neq y. \qquad (2)$$

Such a decision process can also be seen as splitting the observation space into a set of regions, $\Gamma = \{\gamma_1, \ldots, \gamma_Y\}$, such that if $X \in \gamma_y$ then $X$ is associated with the class $\omega_y$. Because of their probabilistic nature, Bayes decision rules are not free from errors; hence, assuming that some classification errors are more costly than others, it is reasonable to associate with each decision $d_i$, $i \in \{1, \ldots, Y\}$, a loss function $\lambda_{i,y}$ that quantifies the penalty of classifying as $\omega_i$ when the actual class is $\omega_y$. The zero-one loss function, for instance, assigns no loss to a correct decision, and a unit loss to any error:

$$\lambda_{i,y} = \begin{cases} 0 & i = y \\ 1 & i \neq y. \end{cases} \qquad (3)$$

Given the loss function $\lambda_{i,y}$, the conditional risk associated with the $i$-th decision is defined as

$$R(d_i) = \sum_{y=1}^{Y} \lambda_{i,y}\, p(\omega_y|X). \qquad (4)$$

Considering a simple scenario in which only two classes exist, i.e., $\Omega = \{\omega_1, \omega_2\}$ and $D = \{d_1, d_2\}$, the decision rule suggests classifying $X$ as $\omega_1$ if $R(d_1) < R(d_2)$, and as $\omega_2$ if $R(d_2) < R(d_1)$, whereas the choice is arbitrary if the two risks are equal. Choosing the lowest conditional risk minimizes the overall risk $R_T$, which is defined as

$$R_T = \int R(d(X))\, p(X)\, dX. \qquad (5)$$

Based on Eq. (3), such a risk can be rewritten as

$$R_T = \int_{\gamma_1} p(X|\omega_2)\,p_2\, dX + \int_{\gamma_2} p(X|\omega_1)\,p_1\, dX, \qquad (6)$$

which corresponds to the average probability error.
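As a concrete illustration of the decision rule and conditional risk above, consider the following sketch (toy likelihoods and priors, not the paper's implementation):

```python
def conditional_risks(likelihoods, priors, loss):
    """R(d_i) = sum_y lambda_{i,y} * p(w_y | X), with posteriors from Bayes' rule."""
    joint = [l * p for l, p in zip(likelihoods, priors)]   # p(X|w_y) * p_y
    evidence = sum(joint)                                  # p(X)
    posteriors = [j / evidence for j in joint]             # p(w_y | X)
    return [sum(l * q for l, q in zip(row, posteriors)) for row in loss]

# Two-class toy example with the zero-one loss of Eq. (3).
likelihoods = [0.6, 0.1]          # p(X | w_1), p(X | w_2)
priors      = [0.5, 0.5]          # p_1, p_2
zero_one    = [[0, 1],            # lambda_{i,y}: 0 on the diagonal, 1 otherwise
               [1, 0]]

risks = conditional_risks(likelihoods, priors, zero_one)
decision = min(range(len(risks)), key=risks.__getitem__)  # lowest conditional risk
# Decision index 0 corresponds to d_1 in the paper's notation; its risk is 1/7.
```

With zero-one loss, minimizing the conditional risk is exactly picking the largest posterior, which is why Eqs. (1) and (4) lead to the same choice.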

Classification With Reject
Unfortunately, the decision rules from Eqs. (1) and (2) do not directly consider the conditional risk of a wrong decision; hence, they always classify an input, whatever the classification error is. This can sometimes lead to an excessive misclassification rate; for this reason, if the risk is too high, the possibility of refusing to decide is introduced. In this case, given Eq. (6), the decision to classify or reject can be made according to a threshold $Q_R \in [0, 1]$ on the overall risk $R_T$ [36]: an observation is rejected whenever

$$\max_y\, p(\omega_y|X) < 1 - Q_R. \qquad (7)$$

Rejection, denoted by $d_0$, provides an extra choice within the decision space $D = \{d_0, d_1, \ldots, d_Y\}$, and corresponds to defining a new region $\gamma_0^R$ within the observation space, i.e.,

$$\gamma_0^R = \{X : \max_y\, p(\omega_y|X) < 1 - Q_R\}.$$

When the reject option is considered, the loss function is also redefined as

$$\lambda_{i,y} = \begin{cases} 0 & i = y \\ \lambda_c & i = 0 \\ 1 & \text{otherwise,} \end{cases} \qquad (8)$$

where $\lambda_c \in (0, 1)$ is the value of the loss cost, i.e., the penalty occurring when the decision $d_0$ is made [35]. The advantages gained from rejecting are demonstrated by the following results.

Lemma 1. Given a classification problem with $Y$ classes, if the reject threshold $Q_R \geq 1 - \frac{1}{Y}$, then the decision to reject is never made.
Proof. Since the posterior probabilities of all the $Y$ outcomes sum to 1, it holds that

$$\max_y\, p(\omega_y|X) \geq \frac{1}{Y}.$$

As a consequence, if $Q_R \geq 1 - \frac{1}{Y}$, then $1 - Q_R \leq \frac{1}{Y}$, and the reject condition defined in Eq. (7), $\max_y\, p(\omega_y|X) < 1 - Q_R$, is never verified, for any $Y$. □

Lemma 2. The average probability error of a classifier that includes the reject option, $p(\mathrm{error}_R|X)$, is not higher than the error $p(\mathrm{error}|X)$ made by a classifier in which the reject option is not available:

$$p(\mathrm{error}_R|X) \leq p(\mathrm{error}|X).$$

Proof. The proof is given for a binary classification problem and can be easily extended to a multi-class scenario. Each region within the observation space can be defined as

$$\gamma_1 = \{X : p(X|\omega_1)\,p_1 > p(X|\omega_2)\,p_2\}, \qquad \gamma_2 = \{X : p(X|\omega_2)\,p_2 > p(X|\omega_1)\,p_1\}.$$

According to Eq. (7), the reject option impacts the observation space by redefining the regions as

$$\gamma_0^R = \{X : \max_y\, p(\omega_y|X) < 1 - Q_R\}, \qquad \gamma_i^R = \gamma_i \setminus \gamma_0^R, \; i = 1, 2.$$

Following Eq. (6), the average probability error in a system with reject depends on these regions and can be expressed as

$$p(\mathrm{error}_R|X) = \int_{\gamma_1^R} p(X|\omega_2)\,p_2\, dX + \int_{\gamma_2^R} p(X|\omega_1)\,p_1\, dX.$$

Then, the difference between $p(\mathrm{error}|X)$ and $p(\mathrm{error}_R|X)$ is

$$\Delta = \int_{\gamma_1 \cap \gamma_0^R} p(X|\omega_2)\,p_2\, dX + \int_{\gamma_2 \cap \gamma_0^R} p(X|\omega_1)\,p_1\, dX \geq 0.$$

Now, let us evaluate the relationship between $\Delta$ and $Q_R$.

Case $Q_R = 0$: according to Eq. (7), a zero threshold causes any observation $X$ to be rejected; thus, $\gamma_i^R = \emptyset \; \forall i > 0$, and the integral over $\gamma_i$ results in $\Delta = p(\mathrm{error}|X)$.
Case $0 < Q_R < 0.5$: rejection can be chosen, and the error mass falling within $\gamma_0^R$ is removed from the total error. As a consequence, $\Delta$ depends on $\gamma_0^R$; that is, the reduction of the error is proportional to the size of the reject region.
□

The outcomes of the properties demonstrated so far can also be observed by comparing the plots in Fig. 1, in which the simplest case of a two-class problem is illustrated for the sake of clarity.
The first plot (Fig. 1a) refers to the classifier without reject; here, two decision regions exist, namely $\gamma_1$ and $\gamma_2$, and the classification errors for the classes $\omega_1$ and $\omega_2$ correspond to the violet and grey areas respectively, as defined by Eq. (6). The other three plots show the effect of introducing the reject option. For instance, if a threshold $Q_R = 0$ is considered (Fig. 1b), every observation is rejected; as a consequence, since the classification is not performed, all the errors are null and only the reject region $\gamma_0^R$ exists. As the threshold value increases (Fig. 1c), the errors are reduced by an amount that depends on the size of the reject region. However, when $Q_R = 0.5$, the region $\gamma_0^R$ is null (Fig. 1d) and the errors, due to the wrong classification of $\omega_1$ and $\omega_2$, are the same as in Fig. 1a. The same holds for any value $Q_R > 0.5$, as proved in Lemma 1.
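The accept/reject mechanism discussed in this subsection can be condensed into a few lines (toy posterior values; rejection is encoded as decision $d_0$, returned here as 0):

```python
REJECT = 0   # d_0: refuse to classify

def chow_decision(posteriors, q_r):
    """Chow-style rule: reject when the top posterior falls below 1 - Q_R,
    otherwise return the 1-based index of the winning class."""
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    if posteriors[best] < 1.0 - q_r:
        return REJECT
    return best + 1

print(chow_decision([0.55, 0.45], q_r=0.3))   # too uncertain: rejected (0)
print(chow_decision([0.95, 0.05], q_r=0.3))   # confident: class w_1 (1)
# Lemma 1: with Y = 2 classes and Q_R >= 1 - 1/Y = 0.5, rejection never fires.
print(chow_decision([0.50, 0.50], q_r=0.5))   # still decides (1)
```

Lowering `q_r` grows the reject region and shrinks the error, mirroring the trade-off shown in Figs. 1b–1d.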

Multi-Stage Classification With Reject
Given that proper threshold values are demonstrated to reduce the classification error, a new problem has to be faced: how to deal with the observations that are rejected.
A multi-stage classifier can be designed to address this issue by introducing $S$ stages, $s \in \{1, \ldots, S\}$, each of which applies the Bayes decision rule to a partial observation vector $x^s \subseteq X$. As one would expect, the decision at stage $s$ must take into account the reject decisions made at the previous stages. Such a sequence of decisions can be seen as a first-order Markov chain [37], where the decision at stage $s$ depends only on stage $s - 1$. Thus, starting from Eq. (4), the conditional risk for the multi-stage classifier is defined as

$$R(d_i^s) = R(d_0^{s-1}) + \sum_{y=1}^{Y} \lambda_{i,y}\, p(\omega_y|x^s),$$

where $R(d_0^{s-1})$ is the conditional risk of the previous stage, and $\lambda_{i,y}$ is the loss defined in Eq. (8).

[Fig. 3. Data collection procedure. An initial set of static keywords is used to query the Twitter stream; then, topic detection is performed to find out new terms emerging from the topics and to update the queries. Tweets are analyzed to retrieve the corresponding metadata and authors. For each author, the timeline and the neighbors are explored. Then, the procedure is repeated for the follower and following accounts extracted in the previous step.]
In the scenario addressed here, such a loss strictly depends on the cost $\phi^s$ of making an observation $x^s$. As a consequence, the multi-stage loss function can be obtained from Eq. (8) by setting the reject cost at stage $s$ to the cost of acquiring the next observation:

$$\lambda_{i,y}^s = \begin{cases} 0 & i = y \\ \phi^{s+1} & i = 0 \\ 1 & \text{otherwise.} \end{cases}$$

Such a multi-stage classification process brings advantages in terms of recognition performance, because a decision is made only when the inputs are certain enough, or when moving to the next stage would be too costly. Moreover, an optimal multi-stage classifier will exhibit the property that rejection is never chosen at the last stage, in accordance with Lemma 1 [38].
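The cascade described in this subsection can be condensed into a short sketch (the per-stage posteriors are supplied directly for illustration; a real implementation would recompute them from each stage's cumulative feature vector):

```python
def multistage_classify(stage_posteriors, thresholds):
    """Cascade of Bayes decisions with reject: each stage either accepts its
    most probable class or defers (d_0) to the next, richer-featured stage.
    The last stage always decides, as no further observations are possible."""
    n_stages = len(stage_posteriors)
    for s, (post, q_r) in enumerate(zip(stage_posteriors, thresholds), start=1):
        best = max(range(len(post)), key=post.__getitem__)
        if s == n_stages or post[best] >= 1.0 - q_r:
            return s, best            # (deciding stage, class index)

# Stage 1 sees cheap features and is unsure (0.60 < 1 - 0.2 = 0.8);
# stage 2, with more costly features, is confident enough to stop there.
stage, label = multistage_classify(
    stage_posteriors=[[0.60, 0.40], [0.90, 0.10], [0.99, 0.01]],
    thresholds=[0.2, 0.2, 0.5],
)
```

The saving comes from the early exits: the expensive later observations are never made for inputs that are easy at a cheap stage.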

SPADE OVERVIEW
A common assumption of machine learning models is that a collection $D$ of heterogeneous data can be described by a finite set of features $f = \{f_1, f_2, \ldots, f_N\}$. The cost of observing each feature value $f_n$ depends on two quantities, namely, the time $T_C$ required to collect the subset of data $d_n \subseteq D$ from which $f_n$ can be computed, and the time $T_P$ spent by the algorithms that actually process $d_n$ in order to produce the feature value $f_n$. However, it is frequent that some groups of features $f_G \subseteq f$ may be computed from the same subset of data, while also exploiting algorithms that have similar complexities. Therefore, groups of homogeneous features can be selected by imposing some constraints on the values of $T_C$ and $T_P$:

$$\tau_1 \leq T_C(f_G) \leq \tau_2, \qquad \pi_1 \leq T_P(f_G) \leq \pi_2, \qquad (19)$$

where $[\tau_1, \tau_2]$ and $[\pi_1, \pi_2]$ define a range of collection and processing times, respectively.

In SpADe, $f$ consists of 39 features (Table 2) that have been deeply analyzed in the literature and are demonstrated to be effective in capturing the characteristics of different spam behaviors. The criteria in Eq. (19) were applied in order to split $f$ into homogeneous groups; as a result, four groups were identified: the properties of the account are observed at the first stage, the URLs shared at the second, the published contents at the third, and the neighborhood of the user at the last stage. An analysis of the works in which the adopted features were first described (see Table 3) revealed that Bayesian Network (BN) and Random Forest (RF) are the most frequently chosen algorithms for their classification. The former is particularly suitable to evaluate the rejection due to its probabilistic nature, while the latter has proved to be one of the most appropriate classifiers when dealing with large feature sets [28], [41]. These considerations led us to make SpADe exploit Bayes classifiers to implement the accept/reject mechanism of the first three stages, while the last stage relies on Random Forest.
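The cost-based grouping can be sketched as follows (the per-feature costs and stage bounds are invented purely to illustrate the bucketing; they are not measurements from the paper):

```python
# Hypothetical per-feature observation costs in seconds: (T_C, T_P).
costs = {
    "followers_count": (0.1, 0.01),   # one profile API call
    "ff_ratio":        (0.1, 0.01),
    "url_ratio":       (2.0, 0.05),   # needs the full timeline
    "near_duplicates": (2.0, 5.00),   # timeline plus clustering
    "reciprocity":     (30.0, 1.00),  # needs the neighborhood graph
}

def group_by_cost(costs, stage_bounds):
    """Assign each feature to the first stage whose collection/processing
    bounds (tau, pi) cover its (T_C, T_P) -- homogeneous-cost grouping."""
    stages = [[] for _ in stage_bounds]
    for feat, (tc, tp) in sorted(costs.items()):
        for s, (tau, pi) in enumerate(stage_bounds):
            if tc <= tau and tp <= pi:
                stages[s].append(feat)
                break
    return stages

stages = group_by_cost(costs, [(0.5, 0.02), (2.5, 0.1), (2.5, 10.0), (60.0, 10.0)])
# account features land in stage 1, URLs in stage 2, contents in 3, neighborhood in 4
```

The upper bound of each stage grows in both dimensions, so every feature finds a home and cheap features are never deferred to an expensive stage.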
Given the methodological framework presented in the previous section, the classification performed at every stage $s$ is based on a cumulative feature vector $F^s$ that includes all the observations made so far, i.e., $F^s = \bigcup_{j=1}^{s} x^j$. An overview of the multi-stage classification process is provided in Fig. 2. As long as the classification confidence does not reach the desired acceptance threshold, the process is repeated by choosing the reject option; at the last stage, however, when no further examinations are possible, a decision is made regardless of the achieved confidence. Multi-stage classification also results in a higher processing speed, since the average number of features per stage is substantially lower than that required in a single-stage approach [42].

Feature Extraction
The following subsections describe the feature extraction processes that characterize each stage. A quantitative evaluation of the observation costs of the four feature sets is presented in Section 5.2, while other possible configurations are discussed in Section 5.3.

Stage 1: Account Analysis
Account analysis is the easiest to perform because the related features (top block in Table 2) can be easily extracted from public Twitter profiles. The features $f_1$ and $f_2$ capture the tendency of spammers to have a low number of followers and friends; $f_3$ and $f_4$, together, make it possible to detect accounts that, despite having been recently created, have produced a large number of tweets, which could indicate an automated spamming behavior. Finally, features $f_5$, $f_6$, and $f_7$ contain important statistics about the presence of predatory elements, such as favorites and hashtags. Although these features are extremely easy to compute, they can be altered just as easily, e.g., by buying followers in the so-called social media black market. Thus, it is reasonable to exploit the set $f_A$ to perform an early classification (at the first stage) only if the risk associated with a given observation is low; otherwise, it is more convenient to extract more complex features at the next stages.
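A sketch of how such first-stage features could be computed from a public profile (the field names mirror the Twitter v1.1 user object; the exact feature definitions here are illustrative, not the paper's):

```python
from datetime import datetime, timezone

def account_features(profile, now=None):
    """Stage-1 features from public profile fields only (no timeline needed)."""
    now = now or datetime.now(timezone.utc)
    age_days = max((now - profile["created_at"]).days, 1)
    followers = profile["followers_count"]
    friends = profile["friends_count"]
    return {
        "followers": followers,                            # ~ f_1
        "friends": friends,                                # ~ f_2
        "ff_ratio": followers / max(friends, 1),           # biased for early bots
        "account_age_days": age_days,                      # ~ f_3
        "tweets_per_day": profile["statuses_count"] / age_days,  # ~ f_4
    }

profile = {
    "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "followers_count": 12, "friends_count": 4800, "statuses_count": 9600,
}
feats = account_features(profile, now=datetime(2024, 4, 10, tzinfo=timezone.utc))
# A 100-day-old account tweeting ~96 times/day with FF << 1 looks suspicious.
```

Note how cheap this is: a single profile object, no timeline or graph crawling, which is exactly why these features sit at the first stage.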

Stage 2: URLs Analysis
Tweets containing links to external websites are more likely to be retweeted, which serves the primary goal of most spammers, i.e., to reach as many people as possible as quickly as possible. In order to evaluate the quality and the quantity of the URLs shared by a user, we chose to rely on the 7 features reported in the second block of Table 2. The most intuitive element to observe in the user's timeline is the amount of URLs shared; in the simplest case, if every tweet contains a URL, then the probability that the user is a spammer is very high. This aspect is captured by means of the features $f_8$, $f_9$, $f_{10}$, $f_{11}$. Aside from sharing a large number of URLs, some types of spamming activities aim to promote specific URLs, such as those related to commercial products or untrusted/malicious websites. The features $f_{12}$, $f_{13}$, $f_{14}$ have been introduced to examine this kind of behavior: while $f_{12}$ traces all URLs that contain spam-related keywords, such as those regarding money gain and adult contents [43], the feature $f_{13}$ counts URLs that point to the same IP/domain, and $f_{14}$ tests URLs for malicious contents by relying on third-party services, such as Google Safe Browsing.
Since the computation of this set of features requires retrieving all the user's tweets, the observation cost is clearly higher than the one measured at the first stage. Moreover, $f_{14}$ relies on external safe-browsing services, whose response time is not predictable.
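A minimal sketch of this kind of timeline-level URL feature extraction (the keyword list, regex, and feature mapping are illustrative assumptions, not the paper's definitions):

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")
SPAM_WORDS = ("free", "casino", "earn", "adult")   # illustrative keyword list

def url_features(tweets):
    """Stage-2 URL statistics computed over a user's timeline."""
    urls = [u for t in tweets for u in URL_RE.findall(t)]
    domains = Counter(urlparse(u).netloc for u in urls)
    n_tweets = max(len(tweets), 1)
    return {
        "urls_per_tweet": len(urls) / n_tweets,                        # ~ f_8
        "frac_tweets_with_url": sum(1 for t in tweets
                                    if URL_RE.search(t)) / n_tweets,
        "spam_keyword_urls": sum(any(w in u.lower() for w in SPAM_WORDS)
                                 for u in urls),                       # ~ f_12
        "max_same_domain": max(domains.values(), default=0),           # ~ f_13
    }

timeline = [
    "win big at http://free-casino.example.com now",
    "check this out http://free-casino.example.com/promo",
    "good morning everyone",
]
feats = url_features(timeline)
# Two of three tweets push URLs on the same spam-flavored domain.
```

A safe-browsing check in the spirit of $f_{14}$ would slot in here as an extra lookup per unique URL, which is precisely the unpredictable external cost noted above.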

Stage 3: Content Analysis
A straightforward characteristic of spammers is their tendency to repeatedly share the same information, or slightly different information that conveys the same content. In order to capture different aspects of content-based spamming, we selected the 16 features listed in the third block of Table 2.
The subset from $f_{15}$ to $f_{23}$ is intended to point out the differences between real tweets and those forged by automated spammers. The latter, for instance, are inclined to abuse popular hashtags so that their tweets can be more easily found; thus, the ratio of hashtags used in the tweets to the total number of tweets ($f_{15}$) is higher for spammers than for trusted users. The same considerations apply to mentions, retweets, replies, and so on.
The analysis of timing is also important to evaluate whether contents are shared by following regular (artificial) patterns. We address this aspect by means of the features $f_{24}$ to $f_{29}$. For instance, the variance in the time taken by an account to post tweets ($f_{24}$), as well as the variance in the number of tweets ($f_{25}$), are two useful parameters to distinguish between bots and legitimate users, as the latter are expected to tweet stochastically [16].
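The intuition behind the timing features can be shown with a small sketch (interpreting a feature like $f_{24}$ as the population variance of inter-tweet gaps is an assumption made here for illustration):

```python
from statistics import pvariance

def inter_tweet_variance(timestamps):
    """Variance of the gaps between consecutive posts (seconds).
    Bots posting on a fixed schedule show near-zero variance; humans do not."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return pvariance(gaps) if len(gaps) > 1 else 0.0

bot_like   = inter_tweet_variance([0, 600, 1200, 1800, 2400])  # every 10 minutes
human_like = inter_tweet_variance([0, 120, 4500, 4700, 86000]) # bursty, irregular
# bot_like is exactly 0.0; human_like is several orders of magnitude larger.
```

A scheduler-driven bot is betrayed by the degenerate distribution of its gaps even when every individual tweet looks innocuous.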
A further analysis processes the user's timeline to detect tweets that are not exact copies of each other, but differ in a few characters. The feature f30, which highlights the presence of near-duplicate contents, is computed through an effective clustering approach based on the combination of two algorithms, namely MinHash and Locality-Sensitive Hashing (LSH) [44], [45]. The output of the near-duplicate detection is a collection of clusters containing similar tweets. Thus, the feature f30 is computed on this output and consists of a set of values representing the size and the number of clusters obtained from each timeline.
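A compact sketch of how such a near-duplicate detector can be built is shown below (an illustrative implementation, not the one used in SpADe; shingle length, signature size, and band count are assumptions):

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """Character k-shingles of a tweet (tweets are short, so small k works)."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(sh, num_hashes=64):
    """MinHash signature: for each of num_hashes salted hash functions,
    keep the minimum hash value over the shingle set."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_hashes))

def lsh_clusters(tweets, num_hashes=64, bands=16):
    """Band the signatures; tweets sharing a band bucket are near-duplicate
    candidates, which are then merged into clusters."""
    rows = num_hashes // bands
    sigs = [minhash_signature(shingles(t), num_hashes) for t in tweets]
    buckets = defaultdict(set)
    for idx, sig in enumerate(sigs):
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    # union-find merge of candidate pairs into clusters
    parent = list(range(len(tweets)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for members in buckets.values():
        members = sorted(members)
        for m in members[1:]:
            parent[find(m)] = find(members[0])
    clusters = defaultdict(list)
    for i in range(len(tweets)):
        clusters[find(i)].append(i)
    return [c for c in clusters.values() if len(c) > 1]
```

Two tweets that differ only in the final characters of a URL land in the same cluster with very high probability, while unrelated tweets remain singletons; the sizes and number of the resulting clusters are exactly the kind of values f30 summarizes.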

Stage 4: Neighborhood Analysis
The last, and most computationally expensive, stage of analysis concerns the evaluation of the user as a member of a community. As observed in [46], [47], it is quite difficult for a spammer to alter, or even influence, the behavior of its neighborhood, especially when it is composed of genuine users.
The most relevant characteristics to look at when performing neighborhood analyses are reported in the last block of Table 2. Some of them, i.e., f31–f34, describe the degree of interaction between users and their followers or friends. The features f35 and f36 measure, respectively, the probability that two users become followers of each other (expected to be high for genuine users, who usually send requests to accounts they actually know), and the level of trust existing between two nodes, which, in the OSN domain, depends on how close the connected nodes are, e.g., based on common friendships. The remaining two features, f38 and f39, provide an evaluation of the account according to the community it belongs to. The former calculates the average reputation of the community in terms of reciprocity rate, i.e., the fraction of users who follow back in response to followings. The latter studies the degree to which accounts tend to cluster together within the community. For both features, the lower the value, the more likely it is that the account is a spammer.
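For illustration, the two community-level quantities can be sketched as follows (hypothetical function names; the actual features average these per-node values over the whole community):

```python
def reciprocity_rate(followers, followings):
    """Fraction of the accounts a user follows that follow back
    (the quantity averaged by the f38-style reputation feature)."""
    if not followings:
        return 0.0
    return len(followings & followers) / len(followings)

def local_clustering(graph, node):
    """Degree to which a node's neighbors are connected to each other
    (the quantity behind the f39-style clustering feature).
    `graph` is an undirected adjacency dict: node -> set of neighbors."""
    neigh = graph.get(node, set())
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for u in neigh for v in neigh
                if u < v and v in graph.get(u, set()))
    return 2 * links / (k * (k - 1))
```

Low values of both quantities (few follow-backs, loosely connected neighbors) are the signature of a spam account embedded in a community that does not reciprocate.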
Obtaining community-based features requires exploring both the target user and all the accounts in its neighborhood. Hence, these features should be computed only for the small number of users for which the previous classification stages have not led to a decision.

EXPERIMENTAL RESULTS
After describing the data collection process, the following sections present a set of experiments aimed at tuning the system parameters and evaluating the classification performance. Then, comparative analyses are provided to assess the performance of SpADe with respect to some relevant related works.

Data Collection
Experiments have been carried out on two different datasets, whose characteristics are summarized in Table 4. The former is a public reference dataset, named 1KS-10KN [48], which consists of 11,000 accounts and more than 1 million tweets crawled by means of the Twitter APIs in the period April–July 2010. The dataset also contains a set of features regarding the accounts and their timelines, as well as information about the URLs contained in the tweets. As its name suggests, 1KS-10KN is characterized by a spammer-to-genuine ratio of 1:10, i.e., 1,000 accounts are labeled as spammers and the remaining 10,000 as genuine. More details about the dataset and the features provided can be found in [46].
The second dataset was collected by carrying out the procedure summarized in Fig. 3. Data collection starts by querying the Twitter stream through a set of static keywords, which is subsequently refined. The initial keywords include elements that spammers generally use to reach as many users as possible. In particular, we chose a set of common spammy words [43] (such as "earn money", "free money", "no credit check", "viagra", "enlargement pill", "legal bud", etc.), and a list of trending topics/hashtags obtained from a preliminary API request. Tweets matching the queries are analysed by a topic detection algorithm [49], [50] with the aim of grouping similar tweets and finding important terms emerging from them, such as keywords that characterize newly discovered topics or recently popular hashtags and mentions. Hence, the initial set of keywords is progressively updated by including these terms and deleting unused ones. Such a strategy allows us to acquire a large volume of data while keeping the focus on relevant topics. As a result, between June and December 2020 we collected 8 million tweets and 40,000 accounts, which were processed to compute all the features required by SpADe. For each tweet we retrieved both the associated metadata (e.g., tweet ID, date of creation, and so on) and the author's ID, which is essential for obtaining account-related information. Then, for each account, the tweets in the timeline and the sets of followers and followings were acquired. Finally, timeline and neighbor extraction was performed also on follower and following accounts, so as to capture the information needed for computing the neighborhood-based features.
A semi-automatic labeling procedure [44] was adopted to assign ground truth to the collected data. The scheme consists of three phases. The first two automatically ascribe labels to "easy" users based on the analysis of the URLs and on the similarity of the shared content: if URLs are malicious (e.g., blacklisted) or the timelines contain many repeated elements, users are labelled as spammers. Otherwise, manual annotation is required. In this case, the third phase aims to minimize the labeling effort by creating groups of similar users, performing manual annotation of just a few samples per group, and then extending the label to the whole group. The entire process is detailed in [44].
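The three phases can be summarized by a sketch like the following (the function name, threshold, and fallback behavior are illustrative assumptions, not the values used in [44]):

```python
def label_account(has_blacklisted_url, near_dup_ratio,
                  dup_threshold=0.5, group_label=None):
    """Three-phase semi-automatic labeling sketch:
    1) any blacklisted URL in the timeline -> spammer;
    2) timeline dominated by near-duplicates -> spammer;
    3) otherwise fall back to the (group-propagated) manual label."""
    if has_blacklisted_url:
        return "spammer"
    if near_dup_ratio >= dup_threshold:
        return "spammer"
    return group_label if group_label is not None else "unlabeled"
```

Only accounts that fall through to the third phase require human effort, and even there a single manual label is propagated to the whole group of similar users.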

Observation Cost Evaluation
Quantifying the effort needed to observe the four categories of features outlined in Table 2 is crucial to implementing the reject mechanism. According to Eqs. (18) and (19), the observation time of each group fG is determined by the time required to collect and to process the relative raw data.
The former amount depends on the capability of the observer to query the data source, e.g., by means of the available APIs. For instance, if the analyses are performed by the social media company itself, the collection time is negligible, since all the data needed to calculate the features are immediately available. In the more general case, however, potential beneficiaries of SpADe include organizations involved in countering cybercrimes (e.g., cyberbullying or other phenomena conveyed by OSNs), marketing and advertising companies interested in distinguishing real and fake accounts, government agencies working in the field of cybersecurity (e.g., for discovering data flows that could trigger devious political campaigns and misinformation), and even academic researchers, from both computer and social sciences, who, being external to the OSN, would need a certain amount of time to collect the data to be processed. In general terms, this time for a group of homogeneous features, i.e., TC(fG), can be defined as a function of four parameters, namely the amount of data to collect (d), the maximum number of data items that can be retrieved from a single API call (max_d), the maximum number of API calls allowed within a certain time window (max_c), and the duration of the time window itself (Δt).

Fig. 7. According to the confidence value and the reject ratio r, the accounts analysed at each stage can be Accurately classified, Misclassified, Rejected, or Not rejected. As a result, four sets can be identified: aAN, aMN, aAR, and aMR.
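Given these four parameters, a rough lower bound on the collection time TC(fG) can be sketched as follows (ignoring network latency unless a per-call cost is supplied; the rate-limit values in the example are assumptions modeled on Twitter-like APIs):

```python
import math

def collection_time(d, max_d, max_c, delta_t, t_call=0.0):
    """Lower bound (seconds) on the time needed to collect d items through a
    rate-limited API: ceil(d / max_d) calls are required, only max_c of them
    fit in each window of length delta_t, and each call costs t_call."""
    calls = math.ceil(d / max_d)
    full_windows = (calls - 1) // max_c  # rate-limit windows we must wait out
    return full_windows * delta_t + calls * t_call
```

For example, with max_d = 200 items per call and max_c = 900 calls per 900-second window, a single 3,200-tweet timeline needs 16 calls and no waiting, whereas 10,000 such timelines need 160,000 calls, i.e., roughly 44 hours of rate-limit waits alone.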
The second component of the observation time, i.e., the processing time T P , depends on the complexity of the algorithms chosen to actually calculate the feature values. For instance, the core processing tasks performed in SpADe involve timeline browsing (TL), verification of blacklisted URLs (BL), near duplicate clustering (ND), analysis of tweets' contents (TC), community detection (CD) and browsing (CB), and neighborhood browsing (NB).
The observation times of the URL-, content-, and neighborhood-based features discussed so far were assessed through an experimental analysis conducted on the two datasets presented in Section 5.1. Tests were run on a multi-core server equipped with 4 Intel Xeon CPUs at 2.00 GHz, and the results are summarized in Fig. 4. Note that the evaluation of TO(fA) is omitted, since fA is obtained directly from raw account information. The TC and TP bars exhibit similar trends on both datasets; in particular, it can be observed that the neighborhood-based features (fN) are the most expensive to obtain, both in terms of collection (striped bars) and processing (solid bars) times. The groups fU and fC are characterized by the same collection time, since URLs are embedded in the tweets; however, processing content-based features takes longer because of the computational complexity of the near-duplicate analysis. Overall, the results confirm an increasing trend in costs when moving from URL- to neighborhood-based features. It is worth noting that even in the specific case of the OSN data owner, though the collection time is negligible, the processing time is progressively higher.
These outcomes can be exploited to quantify the effort required to move from one stage to another. According to Eq. (17), the loss φ(xs) of the multi-stage classifier at stage s is strictly related to the cost of making the observation xs. For instance, at the second stage of SpADe, x2 corresponds to the set F2 = {fA, fU}. Thus, by averaging and normalizing the results shown in Fig. 4, the following loss values are chosen: φ(Fs) ≈ {0, 10^-2, 10^-1, 1}. It is worth noting that the loss of the last stage is about 1, since the maximum cost is reached when all the features are considered together.

Stages Order Selection
The performance of a multi-stage classifier depends heavily on the order in which the features are employed. Using the most effective features in the early stages could in fact reduce the error, but may in some cases increase the overall complexity of the system.
One way to find the optimal sequence is to test all possible permutations and determine the best trade-off between cost and error. However, if the number of features is large, this process becomes impractical. As introduced in Section 4, and further discussed in Section 5.2, the 39 features adopted in SpADe are grouped into four categories according to their semantics and observation costs. This design choice reduced the search space from 39! to just 4! permutations.
A different approach is suggested in [51], which addresses the problem of establishing a good order for the features adopted in a multi-stage system. In particular, CASCARO is a method based on a variant of Monte Carlo Tree Search (MCTS), in which the problem of variable ordering is treated as a search problem in a tree of depth D + 1, where D is the number of features. Each path in the tree is associated with a reward that depends on L, a parameter representing the penalty in case of misclassification. To demonstrate the effectiveness of the feature order used in SpADe, we applied the CASCARO procedure to a system with 39 stages (one for each feature).
Results of the tests performed with L ∈ [1, 20] are illustrated in Figs. 5 and 6. The former shows the percentage of times the feature fn was chosen by CASCARO at the generic stage si; thus, the darker the cell, the more frequently that feature was used. As can be observed, the simplest features (e.g., account-based ones, with n ∈ [1, 7]) are usually chosen in the initial stages, while as the value of n increases, the choice of the corresponding features is postponed to later stages.
To further analyze this aspect, the individual features used in the CASCARO experiment were correlated with the groups employed in SpADe. Fig. 6 shows how many times the i-th stage processed by CASCARO used features belonging to each of the four groups in SpADe. The different colors highlight the existence of patterns that, with a few minor exceptions, match well the distribution of the features we considered in SpADe. In particular, in the early stages of CASCARO, account-related features are regularly selected; then, from stage 9 to 15, features ascribable to the URL group are chosen more frequently. Stages 16 to 29 rely extensively on content-based features, while the remaining stages prefer neighborhood information. These groups, regardless of the inner order in which the features are picked, correspond to those at the core of our method. Thus, these results confirm the validity of the order adopted in the four stages.

Reject Threshold Evaluation
Choosing the proper threshold ΘsR at each stage is essential to balance the reject and classification rates, on which the performance of the spam detection system ultimately depends. On the one hand, low thresholds may increase the system accuracy, as the decision to classify is made only when the outcome is almost certain; however, rigid thresholds could also cause inputs that would have been classified correctly at the current stage to be discarded. Conversely, high threshold values could lead the system to never reject, even when the outcome is uncertain. We present here a set of experiments aimed at finding the proper threshold value for each classification stage.
In order to measure the performance of the classifier as a function of the fraction r of accounts rejected at each stage, we adopted the evaluation metrics proposed in [52]. Given a certain value of r, the classification of A accounts produces as output four distinct sets (see Fig. 7): aAN, accounts Accurately classified and Not rejected; aMN, accounts Misclassified and Not rejected; aAR, accounts Accurately classified and Rejected; aMR, accounts Misclassified and Rejected. According to these quantities, two performance measures are defined, namely the non-rejected accuracy (NA) and the classification quality (CQ). The non-rejected accuracy measures the ability of the system to properly classify the samples (accounts, in our case) that are not rejected; its values can be used as rough indicators of the effectiveness of both the classifier and the features, evaluated on the "most evident" inputs, i.e., those that are not rejected. A more in-depth analysis of the reject region can be carried out through the classification quality index, which assesses the performance of the classifier on the set of non-rejected accounts, and of the reject policy on the set of misclassified accounts.

Fig. 8 shows the values of the two metrics computed on our dataset while varying the fraction of rejected inputs from r = 0 (no reject) to r = 1 (reject all). Observing the results of the first stage (Fig. 8a), we can notice that the more inputs are rejected, the higher the non-rejected accuracy. The reason is that most of the inputs are uncertain because of the weakness of the features adopted; hence, in order to achieve a proper level of accuracy, it is necessary to impose a threshold that rejects most of the inputs. For instance, a threshold that guarantees a non-rejected accuracy of 90% would discard about 90% of the observed accounts (i.e., accept only 10% of them).
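Following the definitions in [52], the two metrics can be sketched as follows (a minimal version operating on the sizes of the four sets; written from the verbal definitions above, so the exact formulas should be checked against [52]):

```python
def non_rejected_accuracy(a_AN, a_MN):
    """NA: accuracy computed only on the accounts the stage did not reject."""
    kept = a_AN + a_MN
    return a_AN / kept if kept else 0.0

def classification_quality(a_AN, a_MN, a_AR, a_MR):
    """CQ: rewards correct decisions on both sides of the reject boundary,
    i.e., accurately classified kept accounts and misclassified rejected ones."""
    total = a_AN + a_MN + a_AR + a_MR
    return (a_AN + a_MR) / total if total else 0.0
```

For example, a stage that keeps 100 accounts (90 of them correct) and rejects 50 (20 of which would have been misclassified) has NA = 0.9 but a lower CQ, since 30 correctly classifiable accounts were rejected.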
The first-stage results in Fig. 8a also provide a general measure of the effectiveness of the account-based feature set, which would yield an accuracy of 60% without the reject option (r = 0).
Given a target reject ratio of 0.9, the classification quality observed at this first stage is quite low; in fact, striving for high accuracy in the non-reject region inevitably leads to rejecting also some "good" samples that could have been classified correctly. With respect to Eq. (21), this means that the sizes of aAN and aMR decrease as the reject ratio increases.
The performance observed at the second stage is more promising, as summarized in Fig. 8b. Here, in order to achieve the same target accuracy of 90%, only half of the analyzed accounts need to be rejected (r = 0.5). Even when no sample is rejected (r = 0), the features adopted at this stage are more accurate than the previous ones; this proves the usefulness of including URL information in the classification process. Moreover, the classification quality also increases, confirming that a proper threshold yields a satisfying number of correctly classified samples (aAN), as well as an adequate number of misclassified and correctly rejected samples (aMR).
This trend is confirmed at the third stage, in which the reject ratio that maximizes the accuracy (r = 0.3) coincides with the maximum classification quality, as depicted in Fig. 8c. It is worth noting that the results shown in each column of the figure refer to the subset of samples rejected at the previous stage; thus, choosing a threshold that achieves an accuracy of 90% at the third stage, for instance, would cause it to reject only 30% of the accounts rejected at the second stage, which in turn are half of those rejected at the first stage, which were 90% of the initial accounts. The reject ratios strictly depend on the reject thresholds chosen at each stage; thus, other tests were performed in order to highlight the relationship between the two values. We can observe from Fig. 8 that the lowest reject ratios for a target accuracy of 90% are 0.9, 0.5, and 0.3 at stage 1, stage 2, and stage 3, respectively. Fig. 9 then shows that these ratios can be obtained by choosing the thresholds Θ1R = 0.03, Θ2R = 0.15, and Θ3R = 0.02. Moreover, the same figure points out that the reject thresholds decrease as the system moves to the later stages; this indicates progressively reduced uncertainty due to the increasing significance of the features.

Classification Performance
Once the observation costs were estimated and the classifier was tuned with the proper threshold values, a 10-fold cross-validation was performed in order to assess the performance of SpADe in terms of overall accuracy, F-score, and percentage of classified accounts. Tests were repeated multiple times on a balanced subset of the dataset, randomly selecting the same number of spammers and genuine accounts.
Results, summarized in Table 5, indicate that the idea of progressively rejecting uncertain accounts allows the system to achieve an adequate (above 90%) classification rate at every stage. The F-score values also highlight the ability of the system to drastically reduce the number of false positives and false negatives. Furthermore, it is possible to note that, stage by stage, the reject rate decreases as the feature sets become more and more significant. This last result suggests that the effort required to process the features is also progressively reduced; for instance, the neighborhood features are computed at stage four for only 17% of the initial set of accounts.
Since the performance evaluation might be biased by the ratio n:m between spammers (n) and genuine accounts (m), other tests were performed by considering different versions of the dataset with ratios 1:2, 1:5, and 1:10, which are increasingly more representative of real social networks. Table 6 shows that the system performance does not change significantly as different ratios are considered. Slightly better results are obtained when the proportion between spammers and genuine accounts is moderately unbalanced (e.g., 1:2); however, the average accuracy and F-score achieved in the 1:10 scenario are still above 90%. The percentages of inputs discarded at each stage are also comparable, thus confirming the quality of the rejection strategy.
The choice of the feature sets to adopt in each of the four stages was also supported by an experimental evaluation.
For each permutation of the four sets fA, fU, fC, and fN, a multi-stage system was defined and tuned by following the procedure described in Section 5.4. The performance of the 24 variants of the multi-stage system was measured in terms of the classification cost Φ = (1/A) Σs asc · φ(Fs), where asc = |aAN| + |aMN| is the number of accounts classified at stage s by means of the cumulative feature vector Fs. It is worth noting that the value of Φ is maximum in a single-stage system, as all data are required at once in order to perform the classification. Table 7 reports the percentages of accounts classified at each stage when imposing a target accuracy of 90%; each row indicates a different sequence of feature sets and the overall observation cost of the resulting multi-stage classifier. The first sequence of features is the one adopted in SpADe, which exhibits the lowest value of Φ. The last six rows show the highest percentage of accounts classified at the first stage, which demonstrates the effectiveness of the community-based features. However, these configurations also yield the highest observation costs, reflecting the effort required to compute the set fN on a large number of accounts.
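A plausible reading of this cost can be sketched as follows (an illustrative interpretation: each account pays the loss of the stage at which it is finally classified, normalized by the total number of accounts):

```python
def multistage_cost(classified_per_stage, stage_losses, total_accounts):
    """Classification cost of a multi-stage system: each account classified
    at stage s contributes the loss of that stage's cumulative feature set.
    A single-stage system pays the maximum loss (1) for every account."""
    return sum(a * loss for a, loss in
               zip(classified_per_stage, stage_losses)) / total_accounts
```

With the loss values of Section 5.2 and, say, 10%, 40%, 33%, and 17% of accounts classified at the four stages, the multi-stage cost is about 0.21, against 1.0 for a single-stage system that observes every feature for every account.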
The costs of the other combinations we tested depend on how discriminative the feature sets are and on the order in which they are used in the processing chain. In any case, none of the systems reaches the maximum complexity, stressing again the benefits of the proposed method over traditional approaches.

Comparison With State-of-the-Art ML Approaches
The last set of experiments compares SpADe with three "single-stage" machine learning techniques without reject option, namely Random Forest (RF), Decision Trees (DT), and Bayesian Networks (BN). This choice is motivated by the results summarised in Table 3, which show that these algorithms are the most frequently employed in spam detection scenarios.
The main difference between our algorithm and those presented in the literature is undoubtedly the characterization of the observation cost. Hence, in addition to the standard classification metrics, the observation cost was also measured; the results are summarized in Table 8. Although these techniques are generally applied in balanced scenarios, we selected a subset of our dataset in order to test their validity also with a spammer-to-genuine ratio of 1:10. Moreover, in order to make a fair comparison in terms of observation cost, we also considered a non-optimal configuration of SpADe based on the sequence of groups {fN, fU, fC, fA}. We refer to this system as SpADe*, which is characterized by an observation cost similar to that of the baselines. All methods achieve fairly comparable performance in terms of accuracy and F-score, but the classification cost of SpADe is far lower. The slight deterioration in the performance of SpADe* compared to the optimal configuration is mainly due to the fact that, although the SpADe reject option was tuned on a target accuracy of 90%, the last stage has no chance to reject and is thus more prone to misclassification errors. The probability of error is then higher when poorly discriminative features, such as fA in the case considered, are used at the last stage.
The effectiveness and generality of SpADe were further assessed by setting up a new four-stage classification system that exploits a different set of features, namely the four categories described in [17]: metadata, content, interaction, and network. For this system, tested on the public 1KS-10KN dataset [48], we also tuned the reject capability by computing the observation cost for the new feature sets, as well as the thresholds to be chosen at each stage in order to achieve an accuracy of 90%.
Results, reported in Table 9, show that SpADe still outperforms the considered competitors. Moreover, the performance of RF, DT, and BN is lower than that measured with the proposed feature set; this suggests that the features from [17] are probably less general in describing spammers' behaviors. The costly version of the system, SpADe*, is characterized by a reduced detection accuracy that is very similar to that of RF and DT, and still superior to that of BN. These trends are almost the same regardless of the imbalance ratio, although the overall results are slightly worse in the 1:10 case.
In order to better understand the different performance of the baseline methods, a final set of experiments was carried out to assess the contribution that different groups of features may have in such a "single-stage" classification process.
To this aim, the four groups fA, fU, fC, and fN were considered both individually and in combination, thus making the cost of observing the feature subsets progressively higher. The leftmost plot in Fig. 10 shows that the accuracy of RF, DT, and BN increases as multiple groups are considered together, up to the best case, when the whole feature set is taken as input of the baselines. The curves indicate a very similar trend for the three classifiers, with RF providing slightly better performance regardless of the chosen feature group. The observation costs measured in the different cases provide further insights into the relationship between the accuracy and the efficiency of the classifiers.
The same procedure was repeated for the groups used in the experiments of Table 9, namely fm, fc, fi, and fn. In this case as well, we can observe a general trend indicating an improvement in performance as the observation cost increases, even though the accuracy values are slightly lower than those observed on the private dataset.
These outcomes indicate that all the features contribute to the classification, i.e., they are not redundant, although certain groups (e.g., fC and fN, or fi and fn) impact the performance more than others. This is also consistent with the other findings discussed in this section, which confirm the convenience of postponing the observation of expensive features to later stages, when fewer samples are evaluated.

CONCLUSION
In this paper we addressed the problem of identifying spam accounts in social networks from a different perspective. Related works are generally oriented towards proposing new features capable of capturing the behavior of spammers, as well as new classifiers tuned on increasingly larger feature sets. However, feature acquisition and processing may be very costly in OSNs with millions of users. For this reason, we presented a multi-stage spam account detection technique with reject option, whose purpose is to initially exploit the features that are easier to compute, while progressively extracting more complex information only for those accounts that have not yet been classified.
The proposed system has been validated both on a dataset we retrieved from the Twitter stream and on a public reference dataset. Its performance has also been compared with single-stage state-of-the-art techniques that do not include the reject option, namely Random Forest, Decision Trees, and Bayesian Networks. The results highlighted the effectiveness of the multi-stage approach, which achieves high accuracy in distinguishing between spammers and genuine accounts while keeping the overall complexity extremely low. These two characteristics are mainly due to the stage-by-stage analysis of increasingly significant features, and to the ability of the system to classify an account only when it is sufficiently confident of the outcome. Moreover, we observed that the accuracy of the multi-stage algorithm is comparable to that of a single-stage classifier that uses all the features at once; nevertheless, our approach allows a spammer to be detected sooner, which also results in a lower complexity of the classification process.
The current approach weighs the misclassification of spammers and genuine accounts equally. However, while false positives could erroneously block honest users, undetected spammers could compromise the trustworthiness of the whole social network. As future work, this issue can be addressed by evaluating the effectiveness of a different loss function, capable of assigning different penalties to each type of incorrect classification.
The proposed solution could be integrated into a more complex system in which the last classification stage is performed by entities that have knowledge of the problem, namely experts [53]. Indeed, it happens more and more frequently, especially in critical systems, that machine learning algorithms are assisted by human experts who are able to better untangle uncertain situations. This kind of approach must face a number of relevant open challenges, the most crucial of which is finding the right balance between classification accuracy and human workload. These aspects will also be studied in future research.
Giuseppe Lo Re (Senior Member, IEEE) received the laurea degree in computer science from the University of Pisa, in 1990, and the PhD degree in computer engineering from the University of Palermo, in 1999. He has been a full professor of computer engineering with the University of Palermo since 2019. In 1991, he joined the Italian National Research Council (CNR), where he achieved the senior researcher position. His current research interests are in the area of computer networks and distributed systems, focusing on reputation and security systems. He is a senior member of the IEEE Communications Society and of the Association for Computing Machinery.
Marco Morana received the laurea and the PhD degrees in computer engineering from the University of Palermo, Italy, in 2007 and 2011, respectively. He is an assistant professor of computer engineering with the University of Palermo since 2016. During his PhD studies, his research mainly focused on computer vision and pattern recognition. His current research interests include parallel and distributed computing, social network analysis, cyber security, intelligent data analysis for user profiling, data fusion and reasoning in smart environments.
Sajal K. Das (Fellow, IEEE) received the PhD degree in computer science from the University of Central Florida, Orlando, Florida, in 1988. He is a professor with the Department of Computer Science and the Daniel St. Clair Endowed Chair at the Missouri University of Science and Technology, Rolla, Missouri. His research interests include cyber-physical systems, security and privacy, smart environments (smart city, smart grid, smart healthcare), IoT, wireless and sensor networks, mobile and pervasive computing, Big Data analytics, parallel, distributed, and cloud computing, social networks, systems biology, applied graph theory, and game theory. He is the founding editor-in-chief of Elsevier's Pervasive and Mobile Computing journal and an associate editor of several journals, including IEEE Transactions on Mobile Computing, IEEE Transactions on Dependable and Secure Computing, and ACM Transactions on Sensor Networks.
Open Access funding provided by 'Università degli Studi di Palermo' within the CRUI CARE Agreement.