An Analysis of Word2Vec for the Italian Language

Abstract

Word representation is fundamental in NLP tasks, because it is precisely the encoding of semantic closeness between words that makes it possible to teach a machine to understand text. Despite the spread of word embedding techniques, there are still few results for languages other than English. In this work we analyse the semantic capacity of the Word2Vec algorithm and produce an embedding for the Italian language. We explore parameter settings such as the number of epochs, the size of the context window and the number of negatively sampled words.
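As a reference for where these parameters enter the training, a minimal gensim sketch might look as follows (parameter names follow gensim 3.x; the 300-dimensional size, the skip-gram choice and the toy corpus are illustrative assumptions, not the exact setup used here):

from gensim.models import Word2Vec

# Toy tokenised corpus; in practice this is the pre-processed Italian corpus.
sentences = [["il", "gatto", "dorme"], ["la", "regina", "governa", "il", "regno"]]

model = Word2Vec(
    sentences,
    size=300,      # embedding dimension (illustrative value)
    window=10,     # size of the context window
    negative=20,   # number of negatively sampled words per update
    iter=50,       # number of training epochs
    sg=1,          # skip-gram variant (an assumption; CBOW would be sg=0)
    min_count=1,   # keep every token of the toy corpus
)
model.wv.save("W2V.kv")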

Figure: Trend of accuracy as the number of epochs changes, for different configurations of Word2Vec.

Link between window size (W) and negative sampling (N)

Learning the structure appears to depend on an appropriate joint sizing of the context window and of the number of negative samples.
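A minimal sketch of how such a joint exploration could be organised, reusing the training call above (the value grids are illustrative, not the exact configurations tested):

from itertools import product
from gensim.models import Word2Vec

sentences = [["il", "gatto", "dorme"], ["la", "regina", "governa", "il", "regno"]]  # toy corpus as above

windows = [5, 10]     # candidate context-window sizes (illustrative)
negatives = [5, 20]   # candidate numbers of negative samples (illustrative)

models = {}
for w, n in product(windows, negatives):
    # One model per (window, negative) pair, so the two parameters are sized jointly.
    models[(w, n)] = Word2Vec(sentences, size=300, window=w, negative=n,
                              iter=50, sg=1, min_count=1)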

Figure: Accuracy up to the 50th epoch.

The importance of epochs

The number of epochs over which learning is carried out proves to be fundamental, so much so that even the worst models manage to reach (and even surpass) the models previously proposed in the literature by the 50th epoch.
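One way to measure this trend is to evaluate the analogy accuracy of the vectors saved at different epochs; a minimal sketch, assuming hypothetical per-epoch files and an Italian analogy file named questions-words-ITA.txt:

from gensim.models import KeyedVectors

# Hypothetical file names: one KeyedVectors file saved at each checkpointed epoch.
for epoch in (10, 20, 30, 40, 50):
    wv = KeyedVectors.load("W2V_epoch%d.kv" % epoch, mmap='r')
    # Overall accuracy over the semantic and syntactic analogy questions.
    score, sections = wv.evaluate_word_analogies("questions-words-ITA.txt")
    print("epoch %d: analogy accuracy %.2f%%" % (epoch, 100 * score))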

                       Semantic   Syntactic   Total
3COSADD   Our model    58.42%     40.92%      48.81%
          Tipodi's     53.21%     37.37%      44.51%
          Berardi's    48.81%     32.62%      39.91%
3COSMUL   Our model    58.31%     42.51%      49.62%
          Tipodi's     55.56%     39.60%      46.79%
          Berardi's    49.59%     33.70%      40.86%
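For reference, 3COSADD and 3COSMUL are the two standard scoring functions for the analogy task: given an analogy a : a* = b : b*, the answer is the vocabulary word x maximising, respectively,

3COSADD:  cos(x, b) + cos(x, a*) - cos(x, a)
3COSMUL:  cos(x, b) * cos(x, a*) / (cos(x, a) + eps)

where eps is a small constant that avoids division by zero.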

Best model

Our best model (window = 10, negative sampling = 20, at the 50th epoch) outperforms the two models currently present in the literature. The embedding of this model can be downloaded from the following link.

The KeyedVectors of the model can be loaded and used with the following code:

from gensim.models import KeyedVectors

# Memory-map the saved vectors so the matrix is not fully loaded into RAM
word_vectors = KeyedVectors.load("W2V.kv", mmap='r+')
vocabs = word_vectors.index2word   # list of vocabulary words (gensim < 4.0 attribute)
vectors = word_vectors.vectors     # numpy matrix with the corresponding embeddings

Please note that it is unfortunately necessary to downgrade NumPy to version 1.15.4:

pip install numpy==1.15.4
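Once loaded, the vectors can be queried directly. A small usage sketch, assuming a lower-cased vocabulary containing tokens such as "roma", "re", "uomo" and "donna" (an assumption about the actual pre-processing):

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load("W2V.kv", mmap='r+')

# Nearest neighbours of a single word
print(word_vectors.most_similar("roma", topn=5))

# Analogy queries: 3COSADD (most_similar) and 3COSMUL (most_similar_cosmul)
print(word_vectors.most_similar(positive=["re", "donna"], negative=["uomo"], topn=3))
print(word_vectors.most_similar_cosmul(positive=["re", "donna"], negative=["uomo"], topn=3))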

Conclusion

We have analysed the Word2Vec model for the Italian language, obtaining a substantial increase in performance with respect to the other two models in the literature (despite the fixed size of the embedding). These results, in addition to the number of learning epochs, are probably also due to a different data pre-processing phase, which carefully performs a complete cleaning of the text and, above all, substitutes numerical values with a single special token. We have observed that the number of epochs is an important parameter, and increasing it leads even our two worst models to reach (and even exceed) the values obtained by the others. In the future, thanks to the collaboration in the Laila project, we intend to expand the dataset by adding more user chats, so as to verify whether the use of a less formal language improves accuracy in the syntactic macro-area.
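As an illustration of the number substitution mentioned above, a minimal pre-processing sketch might look as follows (the <num> token and the cleaning rules are assumptions, not the pipeline actually used):

import re

def preprocess(text):
    # Lower-case and replace every numerical value with a single special token.
    text = text.lower()
    text = re.sub(r"\d+(?:[.,]\d+)?", "<num>", text)
    # Drop remaining punctuation and tokenise on whitespace.
    text = re.sub(r"[^\w<>\s]", " ", text)
    return text.split()

print(preprocess("Il 25 aprile 1945 l'Italia fu liberata."))
# ['il', '<num>', 'aprile', '<num>', 'l', 'italia', 'fu', 'liberata']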