Word2Vec

Andressa Contarato
3 min read · Sep 11, 2020

Part 3: Brawl Stars


If you landed on this post directly, check out the first two posts below.

Part 1:

Part 2:

I hope you enjoy this last post about Brawl Stars!

Before talking about Word2Vec, I will introduce the idea of an artificial neural network.

Artificial Intelligence (AI) is a broad field of study that includes Machine Learning (ML), and Machine Learning in turn includes Deep Learning (DL).

Deep Learning is thus a sub-area of Machine Learning, and its main purpose is the application of deep neural networks.

A neural network is based on a biological neuron like the one below.

Representation of the biological neuron


The simplest structure of an artificial neural network is: input, hidden layer, and output, as below:

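To make this concrete, here is a minimal sketch of a forward pass through such a network in numpy (my own illustration, with arbitrary sizes; this is not the network gensim trains):

# minimal sketch of a one-hidden-layer network: input -> hidden -> output
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input vector with 3 features
W1 = rng.normal(size=(4, 3))  # weights: input -> hidden (4 units)
W2 = rng.normal(size=(2, 4))  # weights: hidden -> output (2 units)

hidden = np.tanh(W1 @ x)      # hidden layer with tanh activation
output = W2 @ hidden          # output layer (left linear here)
print(output)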

There is much more to study about Deep Learning!

Now, on to Word2Vec… The main idea of Word2Vec is to represent words as vectors so that we can analyze the intrinsic relationships between them.

The two neural network architectures used to train Word2Vec are:

  • CBOW (Continuous Bag of Words): predicts a target word from the context words surrounding it.
  • Skip Gram: does the opposite, predicting the surrounding context words from a target word (see the sketch after this list).
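To make the difference concrete, here is a small sketch (my own illustration, not gensim internals) of the training pairs each architecture builds from one sentence with a window of 1:

# sketch: training pairs produced by CBOW vs Skip Gram, window = 1
sentence = ["i", "love", "this", "game"]
window = 1

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBOW: predict the target word from its surrounding context
    print("CBOW:     ", context, "->", target)
    # Skip Gram: predict each surrounding word from the target
    for c in context:
        print("Skip Gram:", target, "->", c)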

Let’s do it!


First, the packages:

# packages
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings(action = 'ignore')

# the NLTK tokenizers need the punkt data on the first run
nltk.download('punkt')

Second, import dataset and adjust the structure:

# import dataset
data = pd.read_csv("collect_tweets_brawlstars.csv")

# structure: concatenate all replies into a single string
a = ''
for i in range(0, len(data.replies)):
    a = a + data.replies[i]

words = a.replace("\n", " ")
df = []

for elem in sent_tokenize(words):
    aux = []

    # tokenization
    for word in word_tokenize(elem):
        aux.append(word.lower())

    df.append(aux)
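At this point, df is a list of token lists, one list per sentence (something like [['i', 'love', 'this', 'game'], ...], with the actual tokens depending on the collected tweets). This is exactly the input format gensim's Word2Vec expects.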

And now, let’s apply the two models:

  • CBOW (Continuous Bag of Words)
# CBOW model
# note: gensim >= 4.0 renamed the `size` parameter to `vector_size`
cbo_model = gensim.models.Word2Vec(df, min_count = 2, size = 200, window = 10)

We can see the similarity between two words:

# results (similarity is computed on the model's word vectors, model.wv)
print("Cosine similarity between 'game' " + "and 'update' - CBOW : ",
      cbo_model.wv.similarity('game', 'update'))
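The score above is just the cosine of the angle between the two learned vectors. You can also inspect a raw vector itself (the values change on every training run):

# the raw 200-dimensional vector learned for a word
vec = cbo_model.wv['game']
print(vec.shape)  # (200,)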
  • Skip Gram model
# Skip Gram model (sg = 1 selects the skip-gram architecture)
sg_model = gensim.models.Word2Vec(df, min_count = 2, size = 200, window = 10, sg = 1)

And we can list the words most similar to a specific word:

sg_model.wv.most_similar(positive=["game"])

[('join', 0.9270744323730469),
 ('good', 0.913809597492218),
 ('conection', 0.9052560329437256),
 ('suggestions', 0.9031330347061157),
 ('any', 0.8990808129310608),
 ('have', 0.7886446714401245),
 ('cant', 0.7721467614173889),
 ('the', 0.7458540201187134),
 ('i', 0.7391091585159302),
 ('appstore', 0.7330736517906189)]
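Under the hood these scores are cosine similarities between the learned vectors, so the same number can be recomputed by hand with numpy (using the top word from the output above):

# recompute the similarity between 'game' and 'join' manually
v1 = sg_model.wv['game']
v2 = sg_model.wv['join']
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)  # should match sg_model.wv.similarity('game', 'join')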

This kind of analysis is important for studying discourse, not only on social media but in any historical context.

All the code is available on my GitHub:

And that's it. See you!



Andressa Contarato

I’m a statistician with a postgraduate degree in Finance from UF, a specialization in Data Science from DSA, an MBA in Data Science, and an MBA in Project Management from USP.