Word2Vec
Part 3: Brawl Stars
If you accessed this post directly, check out the first posts below.
Part 1:
Part 2:
I hope you enjoy this last post about Brawl Stars!
Before talking about Word2Vec, I will briefly introduce the idea of an artificial neural network.
Artificial Intelligence (AI) is a broad field of study that contains Machine Learning (ML), which in turn contains Deep Learning (DL).
Deep Learning is thus a sub-area of Machine Learning, and its main purpose is the application of deep neural networks.
A neural network is based on a biological neuron like the one below.
Representation of the biological neuron
The simplest structure of an artificial neural network has three parts: input, hidden layer, and output, as below:
To study more about Deep Learning!
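As a rough sketch of that input → hidden → output structure (the layer sizes and weights here are arbitrary, chosen only for illustration), a single forward pass can be written in a few lines of NumPy:

```python
import numpy as np

# Hypothetical sizes: 3 inputs, 4 hidden neurons, 2 outputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input -> hidden weights
W2 = rng.normal(size=(4, 2))   # hidden -> output weights

def forward(x):
    hidden = np.tanh(x @ W1)   # hidden layer with tanh activation
    return hidden @ W2         # output layer (linear)

x = np.array([1.0, 0.5, -0.2])  # one example input
print(forward(x))               # a vector with 2 outputs
```

Training would adjust `W1` and `W2` from data; here they are just random, since the point is only the layered structure.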
Now, Word2Vec … The main idea of Word2Vec is to analyze words and the intrinsic relationships between them.
The types of neural networks used to carry out the Word2Vec methodology are:
- CBOW (Continuous Bag of Words): predicts a target word from the surrounding context words.
- Skip Gram: does the reverse, predicting the surrounding context words from a given target word.
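To make the difference concrete, here is a toy illustration (not gensim internals) of how the two architectures frame the prediction task for a window of size 1; the sentence is made up:

```python
# CBOW: (context words) -> target; Skip-gram: target -> each context word
sentence = ["i", "love", "playing", "brawl", "stars"]
window = 1

cbow_pairs = []       # one (context, target) pair per position
skipgram_pairs = []   # one (target, context word) pair per neighbor
for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window),
                              min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))
    for c in context:
        skipgram_pairs.append((target, c))

print(cbow_pairs[1])       # (['i', 'playing'], 'love')
print(skipgram_pairs[:2])  # [('i', 'love'), ('love', 'i')]
```

CBOW averages the context to guess one word; Skip-gram turns each position into several (target, neighbor) examples, which is why it tends to work better for rare words.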
Let’s do it!
First, the packages:
# packages
import pandas as pd
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings(action = 'ignore')
Second, import dataset and adjust the structure:
# import dataset
data = pd.read_csv("collect_tweets_brawlstars.csv")

# structure: concatenate all replies into one string
a = ''
for i in range(0, len(data.replies)):
    a = a + data.replies[i]
words = a.replace("\n", " ")

# tokenization: one list of lowercased tokens per sentence
df = []
for elem in sent_tokenize(words):
    aux = []
    for word in word_tokenize(elem):
        aux.append(word.lower())
    df.append(aux)
And now, let’s apply the two models:
- CBOW model
# CBOW model (sg=0, the default)
# note: in gensim < 4.0 the vector_size parameter was called size
cbow_model = gensim.models.Word2Vec(df, min_count = 2, vector_size = 200, window = 10)
We can see the similarity between two words:
# results
print("Cosine similarity between 'game' and 'update' - CBOW : ",
      cbow_model.wv.similarity('game', 'update'))
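What `similarity` returns is the cosine similarity between the two word vectors. As a hedged sketch (the vectors below are made up, not real embeddings from the model), it can be computed by hand:

```python
import numpy as np

# cosine similarity: dot product of the vectors divided by
# the product of their norms; ranges from -1 to 1
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

game = np.array([0.9, 0.1, 0.3])    # toy vector for "game"
update = np.array([0.8, 0.2, 0.4])  # toy vector for "update"
print(cosine(game, update))         # close to 1: similar directions
```

A value near 1 means the two words appear in very similar contexts in the corpus; a value near 0 means their contexts are unrelated.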
- Skip Gram model
# Skip Gram model
sg_model = gensim.models.Word2Vec(df, min_count = 2, vector_size = 200, window = 10, sg = 1)
And we can see the words most positively related to a specific word:
sg_model.wv.most_similar(positive=["game"])
The output:
[('join', 0.9270744323730469),
 ('good', 0.913809597492218),
 ('conection', 0.9052560329437256),
 ('suggestions', 0.9031330347061157),
 ('any', 0.8990808129310608),
 ('have', 0.7886446714401245),
 ('cant', 0.7721467614173889),
 ('the', 0.7458540201187134),
 ('i', 0.7391091585159302),
 ('appstore', 0.7330736517906189)]
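Under the hood, `most_similar` simply ranks every other word in the vocabulary by cosine similarity to the query vector. A toy sketch with a made-up two-dimensional embedding table (not the model's real vectors):

```python
import numpy as np

# tiny made-up embedding table for illustration only
emb = {
    "game": np.array([0.9, 0.2]),
    "join": np.array([0.8, 0.3]),
    "good": np.array([0.7, 0.4]),
    "the":  np.array([0.1, 0.9]),
}

# rank every other word by cosine similarity to the query word
def most_similar(word, topn=3):
    q = emb[word]
    scores = {w: float(np.dot(q, v) /
                       (np.linalg.norm(q) * np.linalg.norm(v)))
              for w, v in emb.items() if w != word}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

print(most_similar("game"))  # 'join' ranks first, 'the' last
```

In the real model the same ranking runs over hundreds of 200-dimensional vectors, which is how the list above was produced.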
This kind of analysis is very useful for studying discourse, not only on social media but in any historical context.
All the code is available on my GitHub:
See you!
References