Descriptive Analysis of Twitter Post Comments

Andressa Contarato
7 min read · Sep 10, 2020


Part 2: Brawl Stars

Elaboration: Author’s own.

([SPOILER] YES! We will do this … but first …)

Hi! As mentioned in the previous post, this is a trilogy of posts on text analysis. The dataset comes from Twitter, more specifically from comments on the Brawl Stars page.

If you arrived at this post directly, check out the first post below.

I hope that you enjoy this second post!

Let’s do it!


First, import the packages:

import glob
import os
import re
import time
import json
import collections
from collections import Counter
from datetime import datetime
from itertools import combinations

import pandas as pd
import numpy as np
import nltk
import nltk.collocations
from nltk import word_tokenize, Text
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import unidecode
import spacy
from spacy.lang.pt.examples import sentences
from spacy.symbols import ORTH, POS, NOUN, VERB
import seaborn as sns
import networkx as nx

nltk.download('punkt')
nltk.download('stopwords')  # needed below by stopwords.words(...)

Now, the next step is to import the dataset and analyze all the columns:

# import dataset
data = pd.read_csv("collect_tweets_brawlstars.csv")

The dataset has 625 observations and 3 columns.

data.shape
# (625, 3)
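
To look over all the columns (names, dtypes, missing values), pandas' info() helps:

# inspect column names, dtypes, and non-null counts
data.info()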

The sample:

data.head()
Elaboration: Author’s own.

Graphical analysis:

# time series
reactions = data.groupby(['data_create']).count()
ax = reactions.replies.plot(figsize=(15,6),ls='--',c='red')
ax.xaxis.grid(True)
ax.yaxis.grid(True)

The data were collected between 2020-09-10 14:12:31 and 2020-09-10 14:56:40, at the same time that a new update of the game came out. The largest volumes of comments are concentrated between 14:06 and 14:50.

Evolution of comment volume on Brawl Stars' Twitter page over time

Elaboration: Author’s own.
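
Since the grouping above is per raw timestamp (per second), a sketch that buckets the comments per minute gives a smoother curve (assuming data_create parses with pd.to_datetime):

# sketch: bucket the comments per minute instead of per raw timestamp
ts = pd.to_datetime(data['data_create'])
per_minute = data.assign(minute=ts.dt.floor('min')).groupby('minute').replies.count()
ax = per_minute.plot(figsize=(15, 6), ls='--', c='red')
ax.xaxis.grid(True)
ax.yaxis.grid(True)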

The next step is to show the top 3 texts with the most reactions:

# top 3 texts
reactions = data.iloc[:,[1,2]].groupby(['replies']).count()
reactions.sort_values(by=['data_create'],ascending=False).iloc[0:3,:]
Elaboration: Author’s own.

Word Cloud

Now, let’s build a word cloud to analyze the most used words in the comments.

We can build the word cloud in Python itself, but I will also show you a very good online website for building these clouds.

Structure the dataset:

data = data.dropna(subset=['replies'])
data = data.drop_duplicates('replies')
data = data.reset_index(drop=True)
data.shape

Analyze the words:

def analysis_words(data, message):
    df = data.dropna(subset=[message])
    df = df.reset_index()
    text = ""
    for i in range(0, len(df[message])):
        text = text + " " + df[message][i]

    sw = stopwords.words('portuguese')
    words = word_tokenize(text)
    words = [w for w in words if w not in sw]
    text = ""
    for i in words:
        text = text + " " + str(i)
    d = Counter(words)  # word frequencies, handy for inspecting the top terms
    return text

text = analysis_words(data, "replies")
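
The post uses an online site for the cloud itself, but since the wordcloud package is already imported, a minimal sketch in Python would be:

# sketch: render the cloud directly in Python with the wordcloud package
wc = WordCloud(width=800, height=400, background_color='white',
               stopwords=set(stopwords.words('portuguese'))).generate(text)
plt.figure(figsize=(15, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()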

Word Cloud

Elaboration: Author’s own.

To improve the cloud, short words, accents (as in the Brazilian Portuguese case), and special characters can be removed. And, since we are dealing with social network data, links to websites should also be removed.
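
A minimal sketch of that cleanup (the link pattern is an assumption; the pad_words function further below does something very similar):

# sketch: strip links, accents, special characters, and short words before building the cloud
def clean_for_cloud(s):
    s = re.sub(r'http\S+|\S+\.com\S*', ' ', s)  # remove website links
    s = unidecode.unidecode(s)                  # strip accents (Portuguese text)
    s = re.sub(r'[^a-zA-Z\s]', ' ', s.lower())  # drop special characters
    return ' '.join(w for w in s.split() if len(w) >= 3)  # drop short words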

Interpretation:

  • Tara/Sprout are very commented characters; it would be interesting to analyze them and promote improvements
  • Mecha is a skin of one of the characters that glitched; the game is giving an error
  • Favor/Connection: requests for help
  • There are a lot of words in Portuguese; the Brazilian community is very engaged in the game s2
  • Atualizei/Update: the day I collected the data was the day a new game update came out
  • Colt + Collete = Coltllete: people want these two characters to fall in love in the game

To improve the accuracy of the word analysis, let's build trigrams: trios of words that are related, or arranged together in the text. That is, when one of the words is used, there is a great chance the other two appear in the same context.
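
For intuition, a tiny example (the sentence is made up):

# toy example: trigrams are consecutive triples of tokens
tokens = word_tokenize("brawl stars released a new update")
list(nltk.ngrams(tokens, 3))
# [('brawl', 'stars', 'released'), ('stars', 'released', 'a'),
#  ('released', 'a', 'new'), ('a', 'new', 'update')]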

def pad_words(s):
    s = unidecode.unidecode(s)

    # keep only letters and whitespace
    s = re.sub(r'[^a-zA-Z\s]', '', s)
    s = s.lower()
    s = re.sub('_', '', s)

    # collapse any run of whitespace into a single space
    s = re.sub(r'\s+', ' ', s)

    s = word_tokenize(s)
    s = [w for w in s if w.isalpha() and len(w) >= 3]
    text = ""

    for i in s:
        text = text + " " + str(i)

    return text

Structure

text = data.replies.dropna().unique().tolist()
text = [pad_words(x) for x in text]

Trigram

stop_words = set(stopwords.words('english'))

for i, line in enumerate(text):
    text[i] = ' '.join([x for x in nltk.word_tokenize(line) if (x not in stop_words)])

# trigrams
vector_text = CountVectorizer(ngram_range=(3, 3))
X = vector_text.fit_transform(text)
f_text = vector_text.get_feature_names()

# TF-IDF
vector_text = TfidfVectorizer(ngram_range=(3, 3))
Y = vector_text.fit_transform(text)

# ranking - top 10
sums = Y.sum(axis=0)
df = []
for col, term in enumerate(f_text):
    df.append((term, sums[0, col]))

ranking = pd.DataFrame(df, columns=['Words Trigram', 'Ranking'])
words = ranking.sort_values('Ranking', ascending=False)
words.head(10)
Elaboration: Author’s own.

In the trigrams we see users thanking the game and talking about buying gifts in the shop. In addition, they talk about the structure of the characters. "Bring back old" shows users asking to go back to the previous version, because they didn't like the update.
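
To visualize the ranking, a quick bar-chart sketch with seaborn (already imported above):

# sketch: bar chart of the 10 trigrams with the highest TF-IDF sums
top10 = words.head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x='Ranking', y='Words Trigram', data=top10, color='steelblue')
plt.xlabel('TF-IDF sum')
plt.tight_layout()
plt.show()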

Graphs

Graph theory is a branch of mathematics that studies the relationships between objects (represented by points, the nodes) and the connections between them (represented by lines, the edges). One of its original motivations was the Königsberg bridges problem, which Euler proved to have no solution.
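
As a tiny illustration with networkx: the Königsberg layout, modeled as a multigraph with four land masses and seven bridges, admits no walk that crosses every bridge exactly once:

# the Koenigsberg bridges as a multigraph: 4 land masses (A-D), 7 bridges
K = nx.MultiGraph()
K.add_edges_from([('A', 'B'), ('A', 'B'), ('A', 'C'), ('A', 'C'),
                  ('A', 'D'), ('B', 'D'), ('C', 'D')])
nx.has_eulerian_path(K)  # False: every land mass has odd degree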

Let’s create the data structure for creating a graph and build it in Gephi software.

Functions

def flatten_text_tranform(a):
    for i in a:
        if isinstance(i, collections.abc.Iterable) and not isinstance(i, (str, bytes)):
            yield from flatten_text_tranform(i)
        else:
            yield i

def remove_emoji_text(text):
    emoji_pattern = re.compile(
        u'(\U0001F1F2\U0001F1F4)|'
        u'([\U0001F1E6-\U0001F1FF]{2})|'
        u'([\U0001F600-\U0001F64F])'
        '+', flags=re.UNICODE)
    return emoji_pattern.sub('', text)

def clean_text(text, stopwords=None, words_to_replace=None):
    text = re.compile(r'(\r)|(\n)').sub(r' ', text)
    text = text.lower()
    text = re.compile(r'(#\S+)|(@\S+)|(http\S+)|([^\w\s])|(\S+\.com\S+)').sub(r' ', text)
    text = re.compile(r'\S+(\w)\1{2,}\S+').sub(r' ', text)
    text = re.compile(r'\S+(ha)\1{2,}\S+').sub(r' ', text)
    text = remove_emoji_text(text)

    words = Text(word_tokenize(text))
    words = [word for word in words]

    if stopwords:
        words = [word for word in words if (word not in stopwords) and (len(word) > 1)]

    return words

def get_bigrams(words, window=3):
    result = []
    bigram = []

    for i in range(1, len(words) - 1):
        wword = words[i-1:i+(window-1)]
        combns = list(combinations(wword, 2))
        bigram.extend(combns)

    bigram = [tuple(sorted(tup)) for tup in bigram]
    result.extend(bigram)
    return result

def graph(bigram, label):
    d = {}
    for node in bigram:
        try:
            d[node] = d[node] + 1
        except KeyError:
            d[node] = 1

    g = nx.DiGraph()
    for k, v in d.items():
        g.add_edge(k[0], k[1], label=label, weight=v)

    return g

Structure

data["words"] = data.replies.apply(lambda x: clean_text(text=x))
data["text"] = data["words"].apply(lambda x: ' '.join(x))

Graph Structure

# graph structure
words = list(flatten_text_tranform(data.words.to_list()))
bg = get_bigrams(words)
G = graph(bg, 'brawlstars')
nx.write_graphml(G, "brawlstars.graphml")
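
Before moving to Gephi, the graph can also be previewed directly in Python; a quick sketch with networkx's drawing helpers:

# optional: quick preview of the graph in Python before opening Gephi
plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, k=0.3, seed=42)
nx.draw_networkx(G, pos, node_size=20, font_size=6,
                 edge_color='lightgray', arrows=False)
plt.axis('off')
plt.show()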

Gephi


Gephi is open-source software for building graphs. It has 3 windows: Overview, for elaborating the graph; Data Laboratory, for manipulating the data; and Preview, for the graph export settings.

Import dataset:

Data Laboratory > Import Spreadsheet

Elaboration: Author’s own.

Run: Modularity and Avg. Weighted Degree.

Elaboration: Author’s own.

I chose to color the nodes of the graph according to the modularity metric.

Elaboration: Author’s own.

The OpenOrd layout method was applied with the parameters in the figure above.

Elaboration: Author’s own.

Interpretation

  • The blue cluster (13.41% modularity) is a cluster with Portuguese words.
  • The pink cluster (11.17% modularity) talks about the updates of a character, with the terms: paladin, mecha, how, old and stage.
  • The green cluster (11.17% modularity) talks about game updates, with the terms: logo, idea, charge grass and up.
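
As a cross-check, the community structure can also be approximated in Python with networkx's greedy modularity communities (a sketch; Gephi's Modularity statistic uses the Louvain method, so the clusters may differ slightly):

# sketch: approximate the Gephi clusters in Python
from networkx.algorithms import community

U = G.to_undirected()
clusters = community.greedy_modularity_communities(U, weight='weight')
for i, c in enumerate(clusters[:3]):
    print(i, len(c), sorted(c)[:10])  # cluster id, size, and a few member words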

All the code is available on my GitHub:

Well... that's it. See you soon with more news!
