Descriptive Analysis of Twitter Post Comments

Andressa Contarato
7 min read · Sep 10, 2020


Part 2: Brawl Stars

Elaboration: Author’s own.

([SPOILER] YES! We will do this … but first …)

Hi! As mentioned in the previous post, this is a trilogy of posts on text analysis. The dataset comes from Twitter, more specifically from comments on the Brawl Stars page.

If you arrived at this post directly, check out the first post below.

I hope that you enjoy this second post!

Let’s do it!


First, import the packages:

import glob
import os
import re
import time
import json
import collections
from collections import Counter
from datetime import datetime
from itertools import combinations

import pandas as pd
import numpy as np
import nltk
import nltk.collocations
from nltk import word_tokenize, Text
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import unidecode
import spacy
from spacy.lang.pt.examples import sentences
from spacy.symbols import ORTH, POS, NOUN, VERB
import seaborn as sns
import networkx as nx

nltk.download('punkt')
nltk.download('stopwords')  # needed below by stopwords.words(...)

Now, the next step is to import the dataset and analyze all the columns:

# import dataset
data = pd.read_csv("collect_tweets_brawlstars.csv")

The dataset has 625 observations and 3 columns.

data.shape
# (625, 3)
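
To look over all the columns (names, dtypes, missing values), pandas' info() helps:

# inspect column names, dtypes, and non-null counts
data.info()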

The sample:

data.head()
Elaboration: Author’s own.

Graphical analysis:

# time series
reactions = data.groupby(['data_create']).count()
ax = reactions.replies.plot(figsize=(15,6),ls='--',c='red')
ax.xaxis.grid(True)
ax.yaxis.grid(True)

The data were collected between 2020-09-10 14:12:31 and 2020-09-10 14:56:40, at the same time that a new update of the game came out. The largest volumes of comments are concentrated between 14:06 and 14:50.

Evolution of comment volume on Brawl Stars' Twitter page over time

Elaboration: Author’s own.
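
Since the grouping above is per raw timestamp (per second), a sketch that buckets the comments per minute gives a smoother curve (assuming data_create parses with pd.to_datetime):

# sketch: bucket the comments per minute instead of per raw timestamp
ts = pd.to_datetime(data['data_create'])
per_minute = data.assign(minute=ts.dt.floor('min')).groupby('minute').replies.count()
ax = per_minute.plot(figsize=(15, 6), ls='--', c='red')
ax.xaxis.grid(True)
ax.yaxis.grid(True)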

The next step is to show the top 3 texts with the most reactions:

# top 3 texts
reactions = data.iloc[:,[1,2]].groupby(['replies']).count()
reactions.sort_values(by=['data_create'],ascending=False).iloc[0:3,:]
Elaboration: Author’s own.

Word Cloud

Now, let’s build a word cloud to analyze the most used words in the comments.

We can build the word cloud in Python itself, but I will also show you a very good online website for building these clouds.

Structure the dataset:

data = data.dropna(subset=['replies'])
data = data.drop_duplicates('replies')
data = data.reset_index(drop=True)
data.shape

Analyze the words:

def analysis_words(data, message):
    df = data.dropna(subset=[message])
    df = df.reset_index()
    text = ""
    for i in range(0, len(df[message])):
        text = text + " " + df[message][i]

    sw = stopwords.words('portuguese')
    words = word_tokenize(text)
    words = [w for w in words if w not in sw]
    text = ""
    for i in words:
        text = text + " " + str(i)
    d = Counter(words)  # word frequencies, handy for inspecting the top terms
    return text

text = analysis_words(data, "replies")
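
The post uses an online site for the cloud itself, but since the wordcloud package is already imported, a minimal sketch in Python would be:

# sketch: render the cloud directly in Python with the wordcloud package
wc = WordCloud(width=800, height=400, background_color='white',
               stopwords=set(stopwords.words('portuguese'))).generate(text)
plt.figure(figsize=(15, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()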

Word Cloud

Elaboration: Author’s own.

To improve the cloud, short words, accents (as in the Brazilian Portuguese case), and special characters can be removed. And, since we are dealing with social network data, links to websites should also be removed.
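
A minimal sketch of that cleanup (the link pattern is an assumption; the pad_words function further below does something very similar):

# sketch: strip links, accents, special characters, and short words before building the cloud
def clean_for_cloud(s):
    s = re.sub(r'http\S+|\S+\.com\S*', ' ', s)  # remove website links
    s = unidecode.unidecode(s)                  # strip accents (Portuguese text)
    s = re.sub(r'[^a-zA-Z\s]', ' ', s.lower())  # drop special characters
    return ' '.join(w for w in s.split() if len(w) >= 3)  # drop short words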

Interpretation:

  • Tara/Sprout are very commented characters; it would be interesting to analyze them and promote improvements
  • Mecha is a skin of one of the characters that glitched; the game is giving an error
  • Favor/Connection: requests for help
  • There are a lot of words in Portuguese; the Brazilian community is very engaged in the game s2
  • Atualizei/Update: the day I collected the data was the day a new game update came out
  • Colt + Collete = Coltllete: people want these two characters to fall in love in the game

To improve the accuracy of the word analysis, let's build trigrams: trios of words that are related, or arranged together in the text. That is, when one of the words is used, there is a great chance the other two appear in the same context.
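
For intuition, a tiny example (the sentence is made up):

# toy example: trigrams are consecutive triples of tokens
tokens = word_tokenize("brawl stars released a new update")
list(nltk.ngrams(tokens, 3))
# [('brawl', 'stars', 'released'), ('stars', 'released', 'a'),
#  ('released', 'a', 'new'), ('a', 'new', 'update')]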

def pad_words(s):
    s = unidecode.unidecode(s)

    # keep only letters and whitespace
    s = re.sub(r'[^a-zA-Z\s]', '', s)
    s = s.lower()
    s = re.sub('_', '', s)

    # collapse any run of whitespace into a single space
    s = re.sub(r'\s+', ' ', s)

    s = word_tokenize(s)
    s = [w for w in s if w.isalpha() and len(w) >= 3]
    text = ""

    for i in s:
        text = text + " " + str(i)

    return text

Structure

text = data.replies.dropna().unique().tolist()
text = [pad_words(x) for x in text]

Trigram

stop_words = set(stopwords.words('english'))

for i, line in enumerate(text):
    text[i] = ' '.join([x for x in nltk.word_tokenize(line) if (x not in stop_words)])

# trigrams
vector_text = CountVectorizer(ngram_range=(3, 3))
X = vector_text.fit_transform(text)
f_text = vector_text.get_feature_names()

# TF-IDF
vector_text = TfidfVectorizer(ngram_range=(3, 3))
Y = vector_text.fit_transform(text)

# ranking - top 10
sums = Y.sum(axis=0)
df = []
for col, term in enumerate(f_text):
    df.append((term, sums[0, col]))

ranking = pd.DataFrame(df, columns=['Words Trigram', 'Ranking'])
words = ranking.sort_values('Ranking', ascending=False)
words.head(10)
Elaboration: Author’s own.

In the trigrams we see users thanking the game and talking about buying gifts in the shop. In addition, they talk about the structure of the characters. "Bring back old" shows users asking to go back to the previous version, because they didn't like the update.
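
To visualize the ranking, a quick bar-chart sketch with seaborn (already imported above):

# sketch: bar chart of the 10 trigrams with the highest TF-IDF sums
top10 = words.head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x='Ranking', y='Words Trigram', data=top10, color='steelblue')
plt.xlabel('TF-IDF sum')
plt.tight_layout()
plt.show()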

Graphs

Graph theory is a branch of mathematics that studies the relationships between objects (represented by points, the nodes) and the connections between them (represented by lines, the edges). One of its original motivations was the Königsberg bridges problem, which Euler proved to have no solution.
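
As a tiny illustration with networkx: the Königsberg layout, modeled as a multigraph with four land masses and seven bridges, admits no walk that crosses every bridge exactly once:

# the Koenigsberg bridges as a multigraph: 4 land masses (A-D), 7 bridges
K = nx.MultiGraph()
K.add_edges_from([('A', 'B'), ('A', 'B'), ('A', 'C'), ('A', 'C'),
                  ('A', 'D'), ('B', 'D'), ('C', 'D')])
nx.has_eulerian_path(K)  # False: every land mass has odd degree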

Let’s create the data structure for creating a graph and build it in Gephi software.

Functions

def flatten_text_tranform(a):
    for i in a:
        if isinstance(i, collections.abc.Iterable) and not isinstance(i, (str, bytes)):
            yield from flatten_text_tranform(i)
        else:
            yield i

def remove_emoji_text(text):
    emoji_pattern = re.compile(
        u'(\U0001F1F2\U0001F1F4)|'
        u'([\U0001F1E6-\U0001F1FF]{2})|'
        u'([\U0001F600-\U0001F64F])'
        '+', flags=re.UNICODE)
    return emoji_pattern.sub('', text)

def clean_text(text, stopwords=None, words_to_replace=None):
    text = re.compile(r'(\r)|(\n)').sub(r' ', text)
    text = text.lower()
    text = re.compile(r'(#\S+)|(@\S+)|(http\S+)|([^\w\s])|(\S+\.com\S+)').sub(r' ', text)
    text = re.compile(r'\S+(\w)\1{2,}\S+').sub(r' ', text)
    text = re.compile(r'\S+(ha)\1{2,}\S+').sub(r' ', text)
    text = remove_emoji_text(text)

    words = Text(word_tokenize(text))
    words = [word for word in words]

    if stopwords:
        words = [word for word in words if (word not in stopwords) and (len(word) > 1)]

    return words

def get_bigrams(words, window=3):
    result = []
    bigram = []

    for i in range(1, len(words) - 1):
        wword = words[i-1:i+(window-1)]
        combns = list(combinations(wword, 2))
        bigram.extend(combns)

    bigram = [tuple(sorted(tup)) for tup in bigram]
    result.extend(bigram)
    return result

def graph(bigram, label):
    d = {}
    for node in bigram:
        try:
            d[node] = d[node] + 1
        except KeyError:
            d[node] = 1

    g = nx.DiGraph()
    for k, v in d.items():
        g.add_edge(k[0], k[1], label=label, weight=v)

    return g

Structure

data["words"] = data.replies.apply(lambda x: clean_text(text=x))
data["text"] = data["words"].apply(lambda x: ' '.join(x))

Graph Structure

# graph structure
words = list(flatten_text_tranform(data.words.to_list()))
bg = get_bigrams(words)
G = graph(bg, 'brawlstars')
nx.write_graphml(G, "brawlstars.graphml")
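
Before moving to Gephi, the graph can also be previewed directly in Python; a quick sketch with networkx's drawing helpers:

# optional: quick preview of the graph in Python before opening Gephi
plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, k=0.3, seed=42)
nx.draw_networkx(G, pos, node_size=20, font_size=6,
                 edge_color='lightgray', arrows=False)
plt.axis('off')
plt.show()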

Gephi


Gephi is open-source software for building graphs. It has 3 windows: Overview, for elaborating the graph; Data Laboratory, for manipulating the data; and Preview, for the graph export settings.

Import dataset:

Data Laboratory > Import Spreadsheet

Elaboration: Author’s own.

Run: Modularity and Avg. Weighted Degree.

Elaboration: Author’s own.

I chose to color the nodes of the graph according to the modularity metric.

Elaboration: Author’s own.

The OpenOrd layout method was applied with the parameters in the figure above.

Elaboration: Author’s own.

Interpretation

  • The blue cluster (13.41% modularity) is a cluster with Portuguese words.
  • The pink cluster (11.17% modularity) talks about the updates of a character, with the terms: paladin, mecha, how, old and stage.
  • The green cluster (11.17% modularity) talks about game updates, with the terms: logo, idea, charge grass and up.
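
As a cross-check, the community structure can also be approximated in Python with networkx's greedy modularity communities (a sketch; Gephi's Modularity statistic uses the Louvain method, so the clusters may differ slightly):

# sketch: approximate the Gephi clusters in Python
from networkx.algorithms import community

U = G.to_undirected()
clusters = community.greedy_modularity_communities(U, weight='weight')
for i, c in enumerate(clusters[:3]):
    print(i, len(c), sorted(c)[:10])  # cluster id, size, and a few member words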

All the code is available on my GitHub:

Well... that's it. See you soon with more news!
