Sentiment Analysis and Topic Modeling on Arabic Twitter Data during Covid-19 Pandemic

Twitter sentiment analysis is the task of detecting opinions and sentiments in tweets using different algorithms. In this work, we analyze and compare different machine learning algorithms (MLAs) for this classification task. To that end, we collected 37 875 Moroccan tweets during the COVID-19 pandemic, from 01 March 2020 to 28 June 2020. The analysis was done using six classification algorithms (Naive Bayes, Logistic Regression, Support Vector Machine, K-Nearest Neighbors, Decision Tree, and Random Forest), with accuracy, recall, precision, and F-score as evaluation parameters. We then applied topic modeling to the three classes of tweets (negative, positive, and neutral) using Latent Dirichlet Allocation (LDA), which is among the most effective approaches for extracting discussed topics. As a result, the logistic regression classifier gave the best sentiment predictions, with an accuracy of 68.80%.


INTRODUCTION
Twitter is a social networking site that allows its users to express their thoughts, interests, and views on different topics. A large number of users worldwide, and in Morocco in particular, use this micro-blogging website to share their views freely; these opinions carry negative, positive, or neutral sentiment on a specific subject.
Sentiment Analysis (SA) is, in general, the process of detecting and categorizing the polarity of a given text at the document, sentence, or phrase level, typically by applying text classifiers built with various natural language processing (NLP) techniques.
There is a lot of research on sentiment analysis in social media, especially on Twitter. In this section we present some existing work. In (Maheshwari et al., 2019), the authors classified opinions about the iPhone 6 using a Naïve Bayes classifier with different training datasets, and compared it with a baseline algorithm that grouped the tweets into negative and positive classes with 88.32% accuracy. Similarly, the authors in (Tiwari et al., 2019) analyzed the opinions of travelers about an airline company and classified them as negative or positive. In their experiments, they used BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) as a hierarchical clustering method, and association rule mining to find associations between variables and to analyze the dataset.
Sentiment analysis on Twitter is used in different fields, especially marketing (Smailovic et al., 2014) and politics (Ringsquandl and Petkociv, n.d.). Many algorithms are used to detect sentiment in text. In (Smailovic et al., 2014), the authors used a Support Vector Machine (SVM) to classify 152,570 tweets concerning eight companies as positive, negative, or neutral. The authors in (Fatahillah et al., 2017) used Naïve Bayes to group Indonesian tweets into positive and negative speech, and found that the Naïve Bayes classifier achieves high accuracy (93%) and a speed comparable to that of selected neural network and decision tree classifiers. In contrast, the authors in (Devika, n.d.) found that Random Forest gives better and more stable results than Naïve Bayes and K-Nearest Neighbors (KNN) in detecting spammers and non-spammers in English tweets concerning Apple; the classifiers were compared on accuracy, precision, F-measure, and recall. Using the same measures, the authors of (Ahuja et al., 2019) found that logistic regression is the best classifier for sentiment analysis compared with five other algorithms: K-Nearest Neighbors, Decision Tree, Support Vector Machine, Naïve Bayes, and Random Forest. They ran their experiments on the manually labeled SS-Tweet (Sentiment Strength Twitter Dataset) and extracted features using two techniques, N-gram and Term Frequency-Inverse Document Frequency (TF-IDF). They concluded that TF-IDF at the word level is the better feature extraction technique, with sentiment analysis performance 3-4% higher than that of N-gram.
Some researchers have focused on analyzing the content of tweets (topic modeling) in addition to sentiment analysis, using various methods. LDA is one of the most popular techniques in text mining; it can offer good descriptions of broader topics compared with NMF (Non-negative Matrix Factorization) (O'Callaghan et al., 2015). For short texts, the authors in (Ramage et al., 2009) were the first to experiment with topic modeling for classification tasks.
Generally, most researchers opt for LDA, a basic generative probabilistic model used for topic modeling in NLP and text mining. In (Zhang et al., 2018), the authors used LDA to extract topics from chemotherapy radiation therapy or chemotherapy surgery tweets, between individuals and organizations in 2009/2010. To assess patients' perceptions of chemotherapy, the method was also combined with another method (Hajjem and Latiri, 2017), without modifying the basic mechanism of LDA. Beyond the medical domain, LDA was used in the political domain (Hagen, 2018), where 87% of LDA-generated topics were meaningful to human judges; LDA was therefore considered an efficient method of topic modeling.
In our study, the dataset consists of about 37 875 tweets collected from 01 March 2020 to 28 June 2020, during the COVID-19 pandemic. The six classifiers used are Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), and Decision Tree (DT). The performance of each model is then evaluated based on precision, accuracy, recall, and F-measure. Finally, the topics under each category (positive, neutral, and negative tweets) are extracted using the LDA model.
The rest of this paper is organized as follows. Section 2 summarizes related works. Algorithms and evaluation parameters are described in Section 3. Section 4 presents the experimental results. We finish with a conclusion that closes this paper and outlines future work.

Proposed architecture
The proposed architecture (illustrated in Fig. 1) follows these steps:
Step-1: Getting a dataset. We retrieved tweets using the Tweepy library ('Tweepy,' n.d.) and stored the collected tweets in a MongoDB database ('MongoDB: the application data platform | MongoDB,' n.d.).
Step-2: The tweets were preprocessed to make them suitable for the feature extraction phase.
Step-3: After preprocessing, the data was passed to the trained classifier, which assigned each tweet to the negative, positive, or neutral class.
Step-4: Finally, we extracted topics from the three classes of tweets using LDA.

Sentiment Analysis

Naïve Bayes
Naïve Bayes (Fatahillah et al., 2017) is a supervised MLA used for classification, based on Bayes' theorem. The Naïve Bayes algorithm is easy to apply and particularly practical for large datasets. Bayes' theorem allows the posterior probability P(y|x) to be calculated from P(y), P(x), and P(x|y), as shown in the following expression:
\[ P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)} \]
Where:
P(y|x): the posterior probability of the class (target y) given the predictor (attributes x).
P(x|y): the likelihood of the predictor given the class.
P(y): the prior probability of class y.
P(x): the prior probability of the predictor.
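
As a small worked illustration of Bayes' theorem as described above, the sketch below plugs made-up numbers into the rule. The function name `posterior` and all probabilities are assumptions chosen for illustration; they are not values from this study.

```python
def posterior(prior_y, likelihood_x_given_y, evidence_x):
    """Return P(y|x) = P(x|y) * P(y) / P(x)."""
    if evidence_x <= 0:
        raise ValueError("P(x) must be positive")
    return likelihood_x_given_y * prior_y / evidence_x


# Hypothetical numbers: 40% of tweets are negative, the word "sad" occurs in
# 30% of negative tweets and in 15% of all tweets.
p_negative_given_sad = posterior(
    prior_y=0.40, likelihood_x_given_y=0.30, evidence_x=0.15
)
print(round(p_negative_given_sad, 2))
```

A Naïve Bayes classifier applies the same rule to every class, with P(x|y) factored over the individual features under the independence assumption, and predicts the class with the largest posterior.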

Support Vector Machine (SVM)
SVMs ('1.4. Support Vector Machines,' n.d.) are a set of supervised learning methods used for regression, classification, and outlier detection. An SVM is represented by 2 vectors, each of size k. It is a classifier that partitions the data so that the margin between classes is as large as possible.
The aim of an SVM defined on an n-dimensional vector space is to find a surface in that space that separates the data points into multiple classes (Kelleher et al., n.d.). In two dimensions, this surface is often a straight line. In three dimensions, the support vector machine often finds a plane. In general, it finds a hyperplane. These surfaces are optimal in the sense that, based on the information available to the machine, they optimize the separation of the n-dimensional space. In general, an SVM can be used to partition a space into any number of classes by generalizing the task itself.

Logistic regression
Logistic regression (Lutfullaeva et al., 2018) is among the most popular algorithms for classification problems; it is a transformation of linear regression using the sigmoid function. In general, LR is used to relate one categorical dependent variable to one or more independent variables, and its equation has the following form:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
where z combines the independent variables v (the feature vector) with their corresponding weights w for n variables:
\[ z = \sum_{i=1}^{n} w_i v_i \]
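
The sigmoid transformation described above can be sketched in a few lines of Python. The weights and feature values below are hypothetical, and the helper names `sigmoid` and `predict_proba` are ours; a trained model would learn the weights from data.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, features):
    """Weighted sum of the features, passed through the sigmoid."""
    z = sum(w * v for w, v in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights w and feature vector v for n = 3 variables.
weights = [0.8, -1.2, 0.5]
features = [1.0, 0.5, 2.0]
p = predict_proba(weights, features)
print(f"z = 1.2 gives probability {p:.3f}")
```

A probability above a chosen threshold (commonly 0.5) is read as membership of the positive class; multi-class problems such as the three sentiment classes need an extension of this binary form, for example one-vs-rest.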

Decision tree
A Decision tree (Ansari et al., 2020) is a classification algorithm that uses a tree-like graph of decisions and their possible outcomes. The main idea is to partition the dataset into smaller and smaller subsets while an associated tree is built incrementally. The result is a tree made of decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a decision. The topmost node is the root node. The goal is to find the smallest tree that fits the data.

K Nearest Neighbor
The K Nearest Neighbor (KNN) (Imandoust and Bolandraftar, 2013) method is a classification technique based on the closest training examples in the feature space. KNN is among the simplest classification methods and is useful when the distribution of the data is not well known. The method stores all training data during the learning phase and assigns to each new item the label that is most frequent among its k nearest neighbors in the training set. The distance between the existing data points and the test point whose class we want to determine is computed using the Euclidean distance or another measure such as the Manhattan distance. Once the k nearest neighbors have been found, the class of the new data point is decided by majority vote among them.
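
The voting rule described above can be illustrated with a minimal from-scratch sketch. The two-dimensional points and labels are toy data invented for the example, and the function names are ours; this is not the implementation used in the experiments.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy feature vectors (hypothetical) with their sentiment labels.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ["negative", "negative", "negative", "positive", "positive"]
print(knn_predict(points, labels, (0.15, 0.15), k=3))
```

Swapping `euclidean` for a Manhattan distance only changes the distance function; the voting step stays the same.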

Random forest
Random forest classifier (RFC) (Al Amrani et al., 2018) was proposed by Adèle Cutler and Leo Breiman in 2001. It is a machine learning technique that combines two notions: 'bagging' and random subspaces. An RFC is built from multiple decision trees, each trained on a slightly different subset of the data. The classifier consists of a collection of tree-structured classifiers $\{f(x, \beta_i),\ i = 1, \dots\}$, where the $\{\beta_i\}$ are independent and identically distributed random vectors and every tree casts a vote for the most popular class at input x. By reducing overfitting and variance, RFC gives more precise predictions than a single Decision Tree model.

Evaluation parameters
To draw conclusions about the performance of the different algorithms, we used the evaluation parameters precision, accuracy, recall, and F-measure. The calculation of these parameters is based on false statements: for example, an expression that is labeled positive although it is negative, or an expression that is labeled neutral although it is negative or positive. To uncover false statements (Fitriet et al., 2019), we must carry out a performance evaluation. Each prediction falls into one of four categories:
1. True Positive: T_Pos
2. False Positive: F_Pos
3. True Negative: T_Neg
4. False Negative: F_Neg

Accuracy
The accuracy is the number of correctly predicted instances divided by the total number of observations.
\[ \text{Accuracy} = \frac{T\_Pos + T\_Neg}{T\_Pos + T\_Neg + F\_Pos + F\_Neg} \qquad (5) \]

Precision
The precision is the positive predictive value: the proportion of the instances assigned to a class by the algorithm that actually belong to it.
\[ \text{Precision} = \frac{T\_Pos}{T\_Pos + F\_Pos} \]

Recall
The recall is the ratio of correctly classified positive observations to all observations that actually belong to the class.
\[ \text{Recall} = \frac{T\_Pos}{T\_Pos + F\_Neg} \]
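
As a quick check of these definitions, the sketch below computes the measures from the four counts for a single class. The counts are invented for illustration and the function name `binary_metrics` is ours; the F-measure is computed here as the harmonic mean of precision and recall, which is the usual definition.

```python
def binary_metrics(t_pos, f_pos, t_neg, f_neg):
    """Compute accuracy, precision, recall and F-measure from the four counts.

    Assumes every denominator is non-zero (at least one predicted positive
    and at least one actual positive).
    """
    total = t_pos + f_pos + t_neg + f_neg
    accuracy = (t_pos + t_neg) / total
    precision = t_pos / (t_pos + f_pos)
    recall = t_pos / (t_pos + f_neg)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for one class, not results from this study.
metrics = binary_metrics(t_pos=40, f_pos=10, t_neg=30, f_neg=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With three sentiment classes, the same computation is typically repeated per class (treating that class as "positive" and the other two as "negative") and the per-class values are then averaged.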

F-measure
It is a measure of the accuracy of the test, defined as the harmonic mean of recall and precision.
\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

In Figure 2, the boxes drawn as 'plates' represent replicates: the green plate represents the documents, the red plate marks the repeated selection of words and topics within a document, and the blue part denotes the latent topics concealed in the collection of documents.
We define the following terms formally. The basic unit of discrete data is the word, an item from a vocabulary indexed by $\{1, \dots, V\}$. Words are represented by unit-basis vectors: the $v$-th word of the vocabulary is a vector $w$ with a single component equal to one ($w^v = 1$) and all others equal to zero ($w^u = 0$ for $u \neq v$).
A document is a sequence of $N$ words, denoted by $D = (w_1, w_2, \dots, w_N)$, where $w_n$ is the $n$-th word of the sequence. A corpus is a collection of $M$ documents, denoted by $C = (d_1, d_2, \dots, d_M)$.
The generative procedure for every document in the text archive is as follows:
1. For each document $d_m$, with $m$ ranging from 1 to $M$ in the $M$-document corpus, draw $\theta_m \sim \text{Dirichlet}(\alpha)$;
2. For every word $w_{m,n}$ in the document $d_m$:
a. Draw a topic assignment $z_{m,n} \sim \text{Multinomial}(\theta_m)$;
b. Get the corresponding topic distribution $\varphi_{z_{m,n}} \sim \text{Dirichlet}(\beta)$;
c. Sample a word $w_{m,n} \sim \text{Multinomial}(\varphi_{z_{m,n}})$.
Repeating this generative process $M$ times, once for each document, as shown in Fig. 1 (where α and β are the two hyper-parameters of the Dirichlet priors), gives the probability of the corpus:
\[ p(C \mid \alpha, \beta) = \prod_{m=1}^{M} \int p(\theta_m \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_{m,n}} P(z_{m,n} \mid \theta_m)\, P(w_{m,n} \mid \varphi_{z_{m,n}}) \right) d\theta_m \]
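
To make the generative procedure above concrete, the following sketch samples a tiny synthetic corpus using only the Python standard library. It is an illustration under stated assumptions, not the model fitted in this study: the vocabulary, the number of topics, and the values of α and β are hypothetical, the Dirichlet priors are taken to be symmetric, and the topic-word distributions are drawn once per topic, as in the usual formulation of LDA.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_length, vocab, num_topics, alpha, beta, seed=0):
    """Sample documents following the LDA generative process."""
    rng = random.Random(seed)
    # One word distribution phi_k per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Step 1: topic proportions theta_m for this document.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_length):
            # Step 2a: topic assignment z_{m,n} ~ Multinomial(theta_m).
            z = rng.choices(range(num_topics), weights=theta)[0]
            # Step 2c: word w_{m,n} ~ Multinomial(phi_z).
            doc.append(rng.choices(vocab, weights=phi[z])[0])
        corpus.append(doc)
    return corpus


# Hypothetical vocabulary and hyper-parameters.
vocab = ["virus", "corona", "patient", "time", "end", "long"]
corpus = generate_corpus(
    num_docs=3, doc_length=5, vocab=vocab, num_topics=2, alpha=0.5, beta=0.5, seed=1
)
for doc in corpus:
    print(doc)
```

Fitting LDA runs this process in reverse: given only the observed words, inference estimates the topic proportions and the topic-word distributions that most plausibly generated them.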

Data
In this work, we collected 37 875 tweets published from 01 March 2020 to 28 June 2020 by Moroccan users and stored them in a MongoDB database. The Twitter platform provides the Twitter API to pull data from Twitter. To use it, we created an account on https://apps.twitter.com, which gave us permission to access the data using four private credentials (access token, access secret token, consumer key, and consumer secret key) and to retrieve tweets through the REST API.
We filtered tweets by place and language to obtain Moroccan tweets written in one of the standard languages (Arabic, French, or English), and used the Python library Tweepy to handle these data. Finally, to store the collected data in a MongoDB database, we used the Python library Pymongo (Siddharth et al., n.d.).

Preprocessing data
We used various NLP techniques to preprocess the stored tweets. The applied procedures are presented below:
1. Translation of Arabic and French tweets to English using Python code based on Google Translate.
2. Cleaning up irrelevant data to keep only relevant tweet content. This cleaning consists of removing hyperlinks, usernames preceded by '@', and hash symbols while preserving the tag text.
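
The cleaning step described above could be sketched with regular expressions as follows. The patterns and the function name `clean_tweet` are our assumptions about one reasonable way to do it, not the authors' exact code, and the sample tweet is invented.

```python
import re


def clean_tweet(text):
    """Remove hyperlinks and @usernames, and drop '#' while keeping the tag word."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = re.sub(r"@\w+", " ", text)                   # usernames such as @user1
    text = text.replace("#", "")                        # keep the tag text itself
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover spaces


print(clean_tweet("Stay home @user1 #COVID19 https://t.co/abc123 please"))
# prints: Stay home COVID19 please
```

Keeping the tag word matters for the later steps, since hashtags often carry the topic of the tweet.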

Sentiment analysis
After preprocessing the data, we extracted features using TF-IDF, an important statistical measure that takes a collection of raw documents as input and outputs a matrix of term frequency weighted by inverse document frequency.
We used the training corpus of Sentiment140 ('For Academics - Sentiment140 - A Twitter Sentiment Analysis Tool,' n.d., p. 140) as training data. It is well suited to this task because of its large size (1,600,000 tweets). These tweets are labeled as follows: 0 for negative tweets, 2 for neutral expressions, and 4 for positive tweets.
We split this dataset into a training set (80%) and a testing set (20%). We then trained the six classification algorithms on the training data and applied the resulting models to the testing data. Finally, we carried out a performance evaluation to compare the six classifiers. We used scikit-learn ('scikit-learn: machine learning in Python - scikit-learn 1.0.2 documentation,' n.d.) to conduct these experiments.
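
To show what such a matrix contains, here is a minimal pure-Python sketch of the textbook TF-IDF weighting (term frequency times the natural log of N over document frequency). This is an assumption made for illustration: library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant by default, so their numbers differ. The three documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, matrix) with one row per non-empty document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix


docs = ["stay home stay safe", "home is safe", "virus spreads fast"]
vocabulary, matrix = tf_idf(docs)
stay_weight = matrix[0][vocabulary.index("stay")]
print(vocabulary)
print(round(stay_weight, 4))
```

A term that is frequent in one document but rare across the collection (here "stay") receives a high weight, while a term shared by several documents (here "home") is weighted down.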
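
The 80/20 split and the Sentiment140 label coding can be illustrated with a short plain-Python sketch. The experiments themselves were run with scikit-learn; the helper below, its name, the seed, and the toy samples are assumptions made only to show the idea.

```python
import random

# Sentiment140 label coding as described in the text.
LABEL_NAMES = {0: "negative", 2: "neutral", 4: "positive"}


def split_train_test(samples, train_fraction=0.8, seed=42):
    """Shuffle the samples and split them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy labeled samples (hypothetical), 30 in total.
samples = [(f"tweet {i}", label) for i, label in enumerate([0, 2, 4] * 10)]
train, test = split_train_test(samples)
print(len(train), len(test))
print(LABEL_NAMES[train[0][1]])
```

Shuffling before the cut keeps the class mix of both parts close to that of the full dataset; fixing the seed makes the split reproducible across the six classifiers.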

Topic modeling
To uncover the hidden topics in the tweets classified by the logistic regression classifier, we implemented an LDA model. First, we generated a dictionary from the corpus with the help of the gensim package ('gensim • PyPI,' n.d.). This dictionary contains the unique terms in the collection of documents. We then used the dictionary to create a document-term matrix as input to the model. One extracted topic, for example, consists of the words ['virus', 'time', 'long', 'thing', 'still', 'minute', 'incredible', 'patient', 'corona', 'end'].
Table 1 presents the three most frequent topics extracted by the LDA model (each comprising 10 words) for the three categories (positive, neutral, and negative) classified using logistic regression. For instance, positive tweets are about feelings in Topic 01, life in Topic 02, and travel and holidays in Topic 03. Neutral tweets are about events in Morocco in Topic 01, the weather in Topic 02, and hobbies in Topic 03. Negative tweets are about feelings in Topic 01, the death of George Floyd in the USA in Topic 02, and the coronavirus in Topic 03.
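
The dictionary and document-term matrix mentioned above can be pictured with a plain-Python analogue. This is not gensim's API; the helper names and the two tokenized documents are hypothetical and only show the data structures that a topic model takes as input.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign an integer id to every unique term, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


def to_bag_of_words(doc, token_to_id):
    """Represent a document as sorted (term id, count) pairs."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


# Hypothetical preprocessed tweets.
docs = [["virus", "corona", "patient", "virus"], ["time", "long", "virus"]]
dictionary = build_dictionary(docs)
doc_term_matrix = [to_bag_of_words(doc, dictionary) for doc in docs]
print(dictionary)
print(doc_term_matrix)
```

Each row of the resulting sparse matrix records which terms occur in a document and how often; word order is discarded, which is the bag-of-words assumption that LDA relies on.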

CONCLUSION
In this paper, we applied six machine learning algorithms to classify Moroccan Twitter data from the COVID-19 lockdown period, in order to determine the general sentiment, and extracted the topics discussed under each category (positive, negative, and neutral tweets) using the LDA model. From the sentiment analysis of these tweets, we concluded that the logistic regression classifier yielded the best sentiment predictions, with better performance on all four evaluation measures: accuracy, precision, recall, and F-measure. We also used the generative statistical model LDA to discover the distribution of topics within Moroccan tweets, and thus gain a better understanding of the Moroccan mood. In future work, we plan to use deep learning algorithms (e.g., CNN, LSTM, Bi-LSTM).