“Poetry is, at its core, the art of identifying and manipulating linguistic similarity.” – Allison Parrish
Believe it or not, computers, without any labeled data, are capable of understanding the relationships between everyday things. By this I mean a computer can tell you that a shirt is more similar to a shoe than it is to a car.
In the past
In the past, to enable a computer to understand the relationships between various things, you had to provide it with a labeled dataset.
| Item  | Weight (kg) | Volume (liters) | Average length of use (years) |
| ----- | ----------- | --------------- | ----------------------------- |
| Shirt | 0.2         | 1               | 2                             |
| Shoe  | 0.5         | 2               | 4                             |
| Car   | 1500        | 8000            | 12                            |
In this table you can see that the shirt and the shoe are much more similar to each other than either is to the car. A computer, with some simple math, can see this too.
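To make that concrete, here's a minimal sketch in plain Python using the values from the table above. It measures similarity with Euclidean distance, which is just one of several distance metrics you could pick:

```python
import math

# Hand-labeled features from the table: (weight kg, volume liters, years of use)
items = {
    "shirt": [0.2, 1, 2],
    "shoe":  [0.5, 2, 4],
    "car":   [1500, 8000, 12],
}

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(items["shirt"], items["shoe"]))  # small number: very similar
print(euclidean(items["shirt"], items["car"]))   # huge number: very different
```

The smaller the distance, the more similar the two items are, so the shirt–shoe pair wins easily.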
Now
Now, computers can understand semantics, the meaning behind a word, without labeled datasets. A computer can tell you that a shoe is more similar to a shirt than to a car with nothing but the words themselves as input.
Computers can also perform linguistic operations like this:
(King – Man) + Woman ≈ Queen
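You can try this yourself with the gensim library, which can download a set of pretrained word vectors and do the arithmetic for you. A sketch, assuming you're okay with it fetching the (relatively small) 50-dimensional GloVe vectors on first run:

```python
import gensim.downloader as api

# Downloads pretrained 50-dimensional GloVe word vectors the first time it runs
model = api.load("glove-wiki-gigaword-50")

# (king - man) + woman ≈ ?
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # the top match should be 'queen' with a high similarity score
```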
Amazing!
But how? Let me introduce you to embeddings.
Embeddings
Simply put, an embedding is a numerical way of representing a word. For example, if I embedded the word cat, it would look something like this: [0.94, -0.223, 0.54, …, 0.45, 1.34]. This is a vector of 300+ numbers (the exact size depends on the embedding model you use).
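You can pull one of these vectors out of a pretrained model and look at it. A sketch using the same gensim GloVe model as above (this particular model uses 50 dimensions rather than 300+, which keeps the download small):

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

vec = model["cat"]   # the embedding vector for the word "cat"
print(vec.shape)     # (50,) -- this model represents each word with 50 floats
print(vec[:5])       # the first few numbers of the vector
```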
Once you have this array of floats, you can plot it! Our brains can't picture such a high-dimensional space, but it's no problem for a computer.
Once the computer plots the word in vector space, it can understand the relationships between words based on their locations. In this higher-dimensional space you'd see cat, dog, and rabbit plotted close together; you'd also see car, motorcycle, and RV clustered in another region of the space. This multidimensional space has hundreds of axes that we can't yet label linguistically, but together they properly represent the word or phrase at hand.
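You can probe those neighborhoods directly by comparing word pairs. Cosine similarity (closer to 1 means closer together in the space) is the usual measure, and gensim exposes it as a one-liner. Again a sketch with the same GloVe model:

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# Words in the same "neighborhood" score high; unrelated words score lower.
print(model.similarity("cat", "dog"))         # high: the animal region
print(model.similarity("car", "motorcycle"))  # high: the vehicle region
print(model.similarity("cat", "car"))         # noticeably lower: different regions
```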
That’s it!
That’s it! Embeddings are a crazily complex beast of computation that represents words as vectors. You can calculate the distance between two vectors to determine their similarity!
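If you'd rather compute that similarity yourself instead of relying on a library, cosine similarity is just a dot product divided by the two vectors' lengths (and 1 minus it gives you a distance). A minimal numpy sketch, using the sample cat vector from above truncated to five dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Works on any pair of embedding vectors of the same length
a = np.array([0.94, -0.223, 0.54, 0.45, 1.34])
b = np.array([0.91, -0.20, 0.50, 0.40, 1.30])
print(cosine_similarity(a, b))  # close to 1.0: nearly identical directions
```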
But how do you train a model that makes embeddings? What are some real-world use cases for embeddings? How accessible are they?
Stay tuned!
