Word2Vec

Andressa Contarato
3 min read · Sep 11, 2020

Part 3: Brawl Stars


If you landed on this post directly, check out the first two posts below.

Part 1:

Part 2:

I hope you enjoy this last post about Brawl Stars!

Before talking about Word2Vec, I will introduce the idea of an artificial neural network.

Artificial Intelligence (AI) is a broad field of study that includes Machine Learning (ML), and Machine Learning in turn includes Deep Learning (DL).

Deep Learning is thus a sub-area of Machine Learning, and its main purpose is the application of deep neural networks.

A neural network is based on a biological neuron like the one below.

Representation of the biological neuron


The simplest structure of an artificial neural network is: input, hidden layer, and output, as below:

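To make this concrete, here is a minimal sketch of a forward pass through such a network in numpy (my own illustration, with arbitrary sizes; this is not the network gensim trains):

# minimal sketch of a one-hidden-layer network: input -> hidden -> output
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input vector with 3 features
W1 = rng.normal(size=(4, 3))  # weights: input -> hidden (4 units)
W2 = rng.normal(size=(2, 4))  # weights: hidden -> output (2 units)

hidden = np.tanh(W1 @ x)      # hidden layer with tanh activation
output = W2 @ hidden          # output layer (left linear here)
print(output)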

There is much more to study about Deep Learning!

Now, on to Word2Vec… The main idea of Word2Vec is to represent words as vectors so that we can analyze the intrinsic relationships between them.

The two neural network architectures used to train Word2Vec are:

  • CBOW (Continuous Bag of Words): predicts a target word from the context words surrounding it.
  • Skip Gram: does the opposite, predicting the surrounding context words from a target word (see the sketch after this list).
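To make the difference concrete, here is a small sketch (my own illustration, not gensim internals) of the training pairs each architecture builds from one sentence with a window of 1:

# sketch: training pairs produced by CBOW vs Skip Gram, window = 1
sentence = ["i", "love", "this", "game"]
window = 1

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBOW: predict the target word from its surrounding context
    print("CBOW:     ", context, "->", target)
    # Skip Gram: predict each surrounding word from the target
    for c in context:
        print("Skip Gram:", target, "->", c)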

Let’s do it!


First, the packages:

# packages
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings(action = 'ignore')

# the NLTK tokenizers need the punkt data on the first run
nltk.download('punkt')

Second, import dataset and adjust the structure:

# import dataset
data = pd.read_csv("collect_tweets_brawlstars.csv")

# structure: concatenate all replies into a single string
a = ''
for i in range(0, len(data.replies)):
    a = a + data.replies[i]

words = a.replace("\n", " ")
df = []

for elem in sent_tokenize(words):
    aux = []

    # tokenization
    for word in word_tokenize(elem):
        aux.append(word.lower())

    df.append(aux)
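At this point, df is a list of token lists, one list per sentence (something like [['i', 'love', 'this', 'game'], ...], with the actual tokens depending on the collected tweets). This is exactly the input format gensim's Word2Vec expects.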

And now, let’s apply the two models:

  • CBOW (Continuous Bag of Words)
# CBOW model
# note: gensim >= 4.0 renamed the `size` parameter to `vector_size`
cbo_model = gensim.models.Word2Vec(df, min_count = 2, size = 200, window = 10)

We can see the similarity between two words:

# results (similarity is computed on the model's word vectors, model.wv)
print("Cosine similarity between 'game' " + "and 'update' - CBOW : ",
      cbo_model.wv.similarity('game', 'update'))
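The score above is just the cosine of the angle between the two learned vectors. You can also inspect a raw vector itself (the values change on every training run):

# the raw 200-dimensional vector learned for a word
vec = cbo_model.wv['game']
print(vec.shape)  # (200,)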
  • Skip Gram model
# Skip Gram model (sg = 1 selects the skip-gram architecture)
sg_model = gensim.models.Word2Vec(df, min_count = 2, size = 200, window = 10, sg = 1)

And we can list the words most similar to a specific word:

sg_model.wv.most_similar(positive=["game"])

[('join', 0.9270744323730469),
 ('good', 0.913809597492218),
 ('conection', 0.9052560329437256),
 ('suggestions', 0.9031330347061157),
 ('any', 0.8990808129310608),
 ('have', 0.7886446714401245),
 ('cant', 0.7721467614173889),
 ('the', 0.7458540201187134),
 ('i', 0.7391091585159302),
 ('appstore', 0.7330736517906189)]
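Under the hood these scores are cosine similarities between the learned vectors, so the same number can be recomputed by hand with numpy (using the top word from the output above):

# recompute the similarity between 'game' and 'join' manually
v1 = sg_model.wv['game']
v2 = sg_model.wv['join']
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)  # should match sg_model.wv.similarity('game', 'join')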

This kind of analysis is important for studying discourse, not only on social media but in any historical context.

All the code is available on my GitHub:

And that's it. See you!



Andressa Contarato

I’m a statistician with a postgraduate degree in Finance from UF, a specialization in Data Science from DSA, an MBA in Data Science, and an MBA in Project Management from USP.