
Analyzing the Arabic web


Analyzing text creates insights into what’s going on and how your audience is reacting to your posts. Better yet, you can analyze the audience themselves and their interactions with each other!

In this post I will analyze Facebook comments on a sample of posts by I believe in Science, a community for science awareness. I chose it because its comments tend to represent the audience more than the content: they reflect personal opinions and beliefs rather than simple reactions to the posts.

So here we go, let’s see how this goes.

Scrape the data!

To scrape the comments I’m going to need a tool that integrates with Facebook’s API. The first result on Google shows this Python tool, which turned out to be exactly what I needed and works quite well. I created a Facebook App, fed the scraper the credentials, and let it run for a while. With a few thousand comments I think we’re good to go.
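For the curious, here’s a rough sketch of what such a scraper does under the hood: it pages through a post’s comments edge on the Graph API using an access token. The post ID, token, and API version below are placeholders, and the real tool additionally handles rate limits and writes everything out as CSV.

import requests

ACCESS_TOKEN = "APP_ID|APP_SECRET"   # hypothetical placeholder credentials
POST_ID = "SOME_POST_ID"             # hypothetical placeholder post id

url = "https://graph.facebook.com/v2.12/{}/comments".format(POST_ID)
params = {"access_token": ACCESS_TOKEN,
          "fields": "message,created_time,like_count",
          "limit": 100}

comments = []
while url:
    resp = requests.get(url, params=params).json()
    comments.extend(resp.get("data", []))
    # follow the pagination cursor until the API runs out of pages
    url = resp.get("paging", {}).get("next")
    params = {}  # the "next" URL already carries the query string

print(len(comments))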

Load and explore the data

The scraper outputs the data in CSV format. I’ll use Pandas to load it and take a look inside.

import pandas as pd
data = pd.read_csv("IbelieveInSci_facebook_comments.csv")
In [4]: data.shape
Out[4]: (11036, 14)

11k comments, but what are those 14 columns?

In [6]: data.columns
Out[6]: 
Index(['comment_id', 'status_id', 'parent_id', 'comment_message',
       'comment_author', 'comment_published', 'num_reactions', 'num_likes',
       'num_loves', 'num_wows', 'num_hahas', 'num_sads', 'num_angrys',
       'num_special'],
      dtype='object')

Let’s look at the page’s user interactions. First, how many comments does the top author have?

In [12]: data.groupby('comment_author').size().max()
Out[12]: 123

That means some user has made 123 comments across different threads on the page (either comments on posts or replies to other comments).
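As a side note, if we also want to see who the most active commenters are rather than just the top count, a quick groupby does it:

# the five most active commenters and their comment counts
data.groupby('comment_author').size().sort_values(ascending=False).head(5)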

What about a user’s commenting rate: how often does a user comment on a new post by the page?

# 1 if a given user commented on a given post at least once, 0 otherwise
user_has_commented_on_post = data.groupby(['status_id', 'comment_author'])['comment_message'].count().clip(upper=1)
num_posts = data['status_id'].unique().shape[0]

user_comment_rate = user_has_commented_on_post.groupby('comment_author').sum() / num_posts
In [53]: user_comment_rate.describe()
Out[53]: 
count    6734.000000
mean        0.005479
std         0.005525
min         0.000000
25%         0.004255
50%         0.004255
75%         0.004255
max         0.238298
Name: comment_message, dtype: float64

This is reasonable given the small sample (around 250 unique posts) and the fact that people don’t comment on every new post by the pages they like.

Now let’s explore the comments that received the most reactions.

print(data.iloc[data['num_reactions'].sort_values()[-3:].index]['comment_message'].values)


[ ‘قَالَ النَّبِيُّ صَلَّى اللَّهُ عَلَيْهِ وَسَلَّمَ : ( إِذَا أَتَيْتَ مَضْجَعَكَ فَتَوَضَّأْ وُضُوءَكَ لِلصَّلاةِ ثُمَّ اضْطَجِعْ عَلَى شِقِّكَ الأيْمَنِ … الحديث ) رواه البخاري ومسلم.’

‘و نحن بنشوف فضلاتهم تحترف و بنفكرها نيزك !!! و بنتمنى امنية كمان !!! بطلت العب’

‘قوة الوعي هي اثر في تكوين الشخصية\nحيث ان الطفل سيتعلم حركات القرد ولو استمر معه دون وجود ولداها سيتغير بشكل مخيف\nهذا لأن الاستجابات لدى الانسان اسرع ومستقرة لسبب قوة الوعي ..ولو عكسنا الموضوع لكنا القرد سيتعلم اشياء نسبية بسبب الوعي المحدود.’]

These were the comments that received the most reactions: one religious comment, one satirical comment, and one that appears to be about psychology.

Now let’s dive into analyzing the content of the comments themselves.

Text analytics

We will now start analyzing the content of the comments. This typically includes topic modelling and sentiment analysis.

Let’s see what topics we can extract from the comments.

Here’s a little script that trains an LDA (Latent Dirichlet Allocation) model on these comments and queries it for the top 5 topics it can infer from the data.

The model defines a topic as a probability distribution over the vocabulary; in the output, each topic is shown as its most probable words, each weighted by its probability (printed as probability*"word").

You can interpret these as the terms that represent a certain topic. For example, a space-travel topic would have terms like: Space – Ship – Shuttle – SpaceX – Moon – Landing, ….

from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

texts = data['comment_message'].dropna().values
texts = [comment.split(" ") for comment in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=5, chunksize=5000, passes=1)
lda.print_topics(5)


[(0,

‘0.028*”” + 0.010*”لا” + 0.006*”في” + 0.006*”من” + 0.006*”[[STICKER]]” + 0.005*”ما” + 0.005*”و” + 0.004*”يا” + 0.003*”كل” + 0.003*”على”‘),

(1,

‘0.017*”و” + 0.015*”من” + 0.011*”في” + 0.009*”لا” + 0.005*”ان” + 0.005*”الله” + 0.005*”هذا” + 0.005*”على” + 0.004*”” + 0.004*”ما”‘),

(2,

‘0.017*”من” + 0.016*”و” + 0.011*”في” + 0.006*”ما” + 0.005*”،” + 0.004*”لا” + 0.004*”على” + 0.003*”كل” + 0.003*”” + 0.003*”ان”‘),

(3,

‘0.021*”و” + 0.009*”من” + 0.006*”ان” + 0.006*”في” + 0.005*”لا” + 0.005*”على” + 0.005*”عن” + 0.003*”،” + 0.003*”او” + 0.003*””‘),

(4,

‘0.016*”من” + 0.011*”في” + 0.010*”و” + 0.009*”على” + 0.007*”ما” + 0.005*”ان” + 0.004*”أن” + 0.004*”هو” + 0.004*”لا” + 0.004*”هذا”‘)]

They don’t seem to be very useful; let’s try again with the stop words removed.

I will be using this list of Arabic stop words. I added it to the NLTK stopwords zip file for convenience so that stopwords.words('arabic') works, but you don’t have to do that.
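If you’d rather not touch NLTK’s data files, loading the list from a plain text file into a set works just as well (the file name below is hypothetical):

# alternative: read the stop word list from a plain text file, one word per line
with open("arabic_stopwords.txt", encoding="utf-8") as f:
    stop_words = set(line.strip() for line in f if line.strip())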

from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore
from nltk.corpus import stopwords

stop_words = set(stopwords.words('arabic'))

texts = data['comment_message'].dropna().values
texts = [comment.split(" ") for comment in texts]
texts = [[word for word in comment if word not in stop_words] for comment in texts]


dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=5, chunksize=5000, passes=1)
lda.print_topics(5)

[(0,

‘0.004*”:” + 0.003*”” + 0.003*”الله” + 0.003*”…” + 0.002*”العلم” + 0.002*”..” + 0.002*”\n-” + 0.002*”.” + 0.002*”الملحد” + 0.002*”اللي”‘),

(1,

‘0.006*”الله” + 0.005*”” + 0.005*”.” + 0.002*”يعني” + 0.002*”:” + 0.002*”الكون” + 0.002*”والله” + 0.002*”😂😂😂” + 0.002*”شيء” + 0.002*”الارض”‘),

(2,

‘0.007*”..” + 0.004*”” + 0.003*”يعني” + 0.003*”العلم” + 0.002*”الله” + 0.002*”الكون” + 0.002*”Mohamed” + 0.002*”شي” + 0.002*”الدين” + 0.002*”محمد”‘),

(3,

‘0.021*”” + 0.003*”؟” + 0.003*”الله” + 0.003*”😂😂” + 0.002*”.” + 0.002*”😂” + 0.002*”..” + 0.002*”[[STICKER]]” + 0.002*”التطور” + 0.002*”1″‘),

(4,

‘0.010*”” + 0.004*”الله” + 0.004*”[[STICKER]]” + 0.003*”😂” + 0.003*”.” + 0.002*”؟” + 0.002*”العلم” + 0.002*”..” + 0.002*”ده” + 0.001*”كنت”‘)]

A little better, but there’s still noise. Let’s add a stricter regular expression to remove all non-Arabic characters.

from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore
from nltk.corpus import stopwords
import re

stop_words = set(stopwords.words('arabic'))
regex = re.compile('[^ا-ي]')

texts = data['comment_message'].dropna().values
texts = [comment.split(" ") for comment in texts]
texts = [[regex.sub('', word) for word in comment if word not in stop_words] for comment in texts]


dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=5, chunksize=5000, passes=1)
lda.print_topics(5)

[(0,

‘0.233*”” + 0.004*”العلم” + 0.003*”يعني” + 0.003*”الارض” + 0.002*”الله” + 0.002*”محمد” + 0.002*”لا” + 0.002*”نظرية” + 0.002*”التطور” + 0.002*”ممكن”‘),

(1,

‘0.306*”” + 0.003*”الله” + 0.003*”شي” + 0.002*”يعني” + 0.002*”الكون” + 0.001*”و” + 0.001*”الملحد” + 0.001*”اللي” + 0.001*”الانسان” + 0.001*”الارض”‘),

(2,

‘0.028*”” + 0.011*”الله” + 0.003*”شي” + 0.002*”الكون” + 0.002*”العلم” + 0.002*”ده” + 0.002*”لان” + 0.002*”الشمس” + 0.002*”يعني” + 0.002*”الانسان”‘),

(3,

‘0.022*”” + 0.002*”العلم” + 0.002*”يعني” + 0.002*”شي” + 0.002*”مو” + 0.002*”يوجد” + 0.002*”التطور” + 0.002*”الدين” + 0.002*”العالم” + 0.002*”لان”‘),

(4,

‘0.086*”” + 0.007*”الله” + 0.005*”شي” + 0.003*”الكون” + 0.002*”اللي” + 0.002*”المذيع” + 0.002*”العلم” + 0.002*”الملحد” + 0.002*”فعلا” + 0.001*”يعني”‘)]

So much better, yet there’s still an empty string in each topic; we should get rid of it.

from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore
from nltk.corpus import stopwords
import re

stop_words = set(stopwords.words('arabic'))
regex = re.compile('[^ا-ي]')

texts = data['comment_message'].dropna().values
texts = [comment.split(" ") for comment in texts]
texts = [[regex.sub('', word) for word in comment if word not in stop_words] for comment in texts]
texts = [[word for word in comment if len(word) > 0] for comment in texts]
texts = [comment for comment in texts if len(comment) > 0]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=5, chunksize=5000, passes=1)
lda.print_topics(5)

[(0,

‘0.009*”الله” + 0.007*”شي” + 0.003*”الكون” + 0.003*”يعني” + 0.002*”الانسان” + 0.002*”العلم” + 0.002*”كنت” + 0.002*”لا” + 0.002*”لان” + 0.002*”مو”‘),

(1,

‘0.005*”الله” + 0.004*”يعني” + 0.003*”التطور” + 0.003*”العلم” + 0.003*”الكون” + 0.002*”الدين” + 0.002*”نظرية” + 0.002*”الارض” + 0.002*”شو” + 0.002*”العدم”‘),

(2,

‘0.004*”الله” + 0.003*”العلم” + 0.003*”يعني” + 0.002*”الارض” + 0.002*”لا” + 0.002*”و” + 0.002*”نظرية” + 0.002*”شو” + 0.002*”البشر” + 0.001*”ههههه”‘),

(3,

‘0.005*”الله” + 0.005*”شي” + 0.003*”العلم” + 0.003*”الملحد” + 0.002*”المذيع” + 0.002*”ممكن” + 0.002*”الكون” + 0.002*”اللي” + 0.002*”هيك” + 0.002*”يعني”‘),

(4,

‘0.004*”الله” + 0.003*”يعني” + 0.003*”العلم” + 0.003*”ده” + 0.003*”الارض” + 0.003*”الشمس” + 0.002*”شي” + 0.002*”و” + 0.002*”انو” + 0.002*”اللي”‘)]

Let’s try a different algorithm for a different perspective. I’m interested in seeing how NMF (Non-Negative Matrix Factorization) would do on this dataset.

From now on I’m going to use the following cell to preprocess the dataset, since I’m happy with its output and repeating it in every snippet would be redundant.

from nltk.corpus import stopwords
import re

stop_words = set(stopwords.words('arabic'))
regex = re.compile('[^ا-ي]')

texts = data['comment_message'].dropna().values
texts = [comment.split(" ") for comment in texts]
texts = [[regex.sub('', word) for word in comment if word not in stop_words] for comment in texts]
texts = [[word for word in comment if len(word) > 0] for comment in texts]
texts = [comment for comment in texts if len(comment) > 0]

Now let’s get going with NMF. First we need to run a TF-IDF vectorizer on our text data: it turns each comment into a fixed-length vector that represents it numerically. In other words, we transform our textual input into numerical input. You can take a look at this post for more details about the algorithm.

from sklearn.feature_extraction.text import TfidfVectorizer

# Because TF-IDF works on documents rather than lists of tokens, we have to join our token lists back into strings
texts = [' '.join(comment) for comment in texts]

vect = TfidfVectorizer()

texts = vect.fit_transform(texts)
In [50]: print(texts.shape)
(6976, 28524)

In [51]: print(type(texts))
<class 'scipy.sparse.csr.csr_matrix'>

TfidfVectorizer returns a sparse matrix because each document contains only a small fraction of the vocabulary, so almost all entries in its vector are zero.
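You can check how sparse it really is by comparing the number of stored (non-zero) entries to the full matrix size; the density should come out well under 1% here:

# fraction of cells in the TF-IDF matrix that are actually non-zero
density = texts.nnz / (texts.shape[0] * texts.shape[1])
print(density)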

Now we can use NMF to decompose our matrix into topics.

from sklearn.decomposition import NMF

nmf = NMF(n_components=10)

nmf.fit(texts)

Once this is done we’ll have our NMF model fitted and ready to be used.
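For instance, the fitted model can assign topics to individual comments via transform, which returns a topic-weight vector per document; argmax gives each comment’s dominant topic:

# topic weights for every comment (one row per comment, one column per topic)
doc_topics = nmf.transform(texts)
print(doc_topics.argmax(axis=1)[:10])   # dominant topic of the first 10 comments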

I will be using this method taken from this tutorial to display topics.

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

Let’s see the topics it finds

tfidf_feature_names = vect.get_feature_names()
print_top_words(nmf, tfidf_feature_names, 25)

Topic #0: الله العظيم شا خلق الخالق رسول حمانا لا اله خالق ربي الكون ستغفر عز تعالى وجل يرحمهه قبله بعين اخي له اعلم الخالقين يكن الخلق
Topic #1: راع مقال كلام كان رابط اروع فعلا العلم كومنت مشكووور سعيكم الريبلاير ريبلاي منشور ساجان كارل الراع عمل الجانب لكني التركيز دايما الاجابة موضوع تصميم
Topic #2: شكرا لكم لك التصحيح مقال اه للتوضيح خي طفال للعلم اوكي المعلومه للتصحيح جزيلا المعلومات يادكتور لجهودك احمد معك يليق لجهودكم مقالة مفيدة مرورك للمشاركة
Topic #3: شي العدم هيك اخر هم يوجد خالق الشي يكن عقلك الالحاد خلق بالنسبة حل ولكنه عجيب انو يتي العظيم سبحانه لكل معقووول وقيلا قبله لازمني
Topic #4: محمد احمد التايه جميل حمد حسن ماهر البو جواري رسول يوسف ياسين الربيعي زعيم حميد العقاب الدعجة الحسيني السامعي المندلاوي غنايم رفيدة حسنت عبد عبدالله
Topic #5: يعني ايه معك شو مثلا ولاد بدي ال حرام شلون اش هيك تكذب مافهمت مرتزقة الفلوث نرجسيون الغبا انك الوجود ماكو وافرض شاطر مناصر لحقوق
Topic #6: ههههه القمر حلوين اخي لازم السبب نيااازييك صدمتني لصلو احسن تخافي ريحتيني بوليوود لالو بالصدفة وحش بلوتو تاغ مهجورة السينيال تذكرتك ريك مجرة قديمة تعمل
Topic #7: ههههههه ماشي عكس صحيح اضحكني ام حقير اكسبنسف اخي هههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههههه حكيو دوز للخاص طلع بعثلن النجازات تعودو وماصار كوارث قلتو يقلب تتحقق البلها خطر عقولكم
Topic #8: محمود سنان عيد الحمصي شفت فعلا لازم يسل الواحد مصطفى العظيم عليمىتصدق طلعت جابها بالتفاصيل جمال دكتور مين العالم مساره ينحرف نسى سنانه نشالله قوال
Topic #9: العلم ده الارض صحيح اللي والله الكون ايه ممكن التطور لا صدق نظرية مو الكلام شو كلام الانسان زي دي دا الرض الشمس عشان القمر

These topics are very weird but they look consistent to some extent.

Let’s draw a Word cloud of these comments and see what’s common.

We will be using the Wordcloud pip package

pip install wordcloud

It requires the input as one long string, so we will turn our preprocessed token lists (from the preprocessing cell above, not the TF-IDF matrix) back into a single string.

texts = [' '.join(comment) for comment in texts]
texts = ' '.join(texts)

A common problem with Arabic is that it is written right-to-left, which causes a lot of problems in visualization libraries, so let’s fix that.

We will be using these two packages to solve our problem:

pip install arabic_reshaper

pip install python-bidi

import arabic_reshaper
from bidi.algorithm import get_display

reshaped_text = arabic_reshaper.reshape(texts)
bidi_text = get_display(reshaped_text)

Finally, because the Wordcloud package’s default font does not support Arabic, you’ll have to download Tahoma.ttf (or any other font that supports Arabic) and supply WordCloud with its path as the font_path argument.

from wordcloud import WordCloud

# Generate a word cloud image
wordcloud = WordCloud(font_path="tahoma.ttf").generate(bidi_text)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
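If you’d rather save the image to disk instead of just displaying it, WordCloud can write it out directly (the file name here is arbitrary):

# write the rendered word cloud to a PNG file
wordcloud.to_file("arabic_comments_wordcloud.png")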

And here’s the final result.

One last thing I’m interested in analyzing in this dataset is the sentiment. I’d love to see insights about polarity in this data.

For sentiment analysis I will follow the approach I used earlier for analyzing an Arabic dataset, and I’ll publish the results in a separate post once I have enough time.

Hopefully this has given you a basic walkthrough of performing analytics on an Arabic text dataset built from Facebook using Python. You can follow the same approach to mine insights for any page, product, or hashtag.

Published in Data Science

2 Comments

  1. Ahmed Maher

    Thanks Omar for the detailed explanation.
    Can we achieve the same results using an Elasticsearch custom tokenizer/analyzer + Kibana to visualize the results? I think it would be easier and faster.

    • Omar Essam

      I haven’t used them before. I prefer using Python/Pandas because the data size is small, but if the data were huge I think Elasticsearch might be useful.
