
How Google converted translation into a problem of vector space mathematics
An article from Technology Review about how Google Translate works. Excerpt:
The new approach is relatively straightforward. It relies on the notion that every language must describe a similar set of ideas, so the words that do this must also be similar. For example, most languages will have words for common animals such as cat, dog, cow and so on. And these words are probably used in the same way in sentences such as “a cat is an animal that is smaller than a dog.”
The same is true of numbers. The image [in the original article] shows the vector representations of the numbers one to five in English and Spanish and demonstrates how similar they are.
This is an important clue. The new trick is to represent an entire language using the relationship between its words. The set of all the relationships, the so-called “language space”, can be thought of as a set of vectors that each point from one word to another. And in recent years, linguists have discovered that it is possible to handle these vectors mathematically. For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that is similar to ‘queen’.
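If you want to try that vector arithmetic yourself, here's a minimal sketch in Python using gensim and its pretrained Google News word2vec vectors (assuming you have gensim installed; the model is roughly a 1.6 GB download the first time):

```python
# Minimal demonstration of word-vector arithmetic with pretrained
# word2vec vectors via gensim's downloader.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # KeyedVectors

# 'king' - 'man' + 'woman' lands near 'queen' in the vector space.
result = vectors.most_similar(positive=["king", "woman"],
                              negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```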
It turns out that different languages share many similarities in this vector space. That means the process of converting one language into another is equivalent to finding the transformation that converts one vector space into the other.
This turns the problem of translation from one of linguistics into one of mathematics. So the problem for the Google team is to find a way of accurately mapping one vector space onto the other. For this they use a small bilingual dictionary compiled by human experts: comparing the same corpus of words in two different languages gives them a ready-made linear transformation that does the trick.
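Here's a minimal numpy sketch of that mapping step. Each dictionary pair gives one training example for a linear transformation W, found by ordinary least squares; the random vectors here are stand-ins for real embeddings, and the dimensions and dictionary size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300
n_pairs = 5000                             # bilingual dictionary entries

X = rng.standard_normal((n_pairs, dim))    # e.g. English word vectors
Z = rng.standard_normal((n_pairs, dim))    # e.g. Spanish translations

# Solve min_W ||X W - Z||^2: the learned W maps the source vector
# space onto the target vector space.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

# To translate a new word: map its vector across, then look up the
# nearest neighbour in the target vocabulary (search not shown).
x_new = rng.standard_normal(dim)
z_hat = x_new @ W
```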
It seems like this would be a fairly good technique for more isolating languages, where the strings that occur between white spaces are individual words that can be mapped onto each other fairly directly.
I’m wondering how well it works for really heavily agglutinative or polysynthetic languages though: since morphemes in these languages correspond to separate words in others, I guess you’d need to first parse the words into morphemes and then map them into the same vector space, which seems like it would be a bit harder.
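As a toy illustration of that extra step, here's what morpheme-level mapping might look like for a Turkish word. The segmentation is hardcoded for the example; a real system would need a morphological analyser or learned subword units:

```python
# Segment a word into morphemes before embedding, so each morpheme
# (rather than each whitespace-delimited word) gets its own vector
# in the shared space.
toy_segmentation = {
    # Turkish "evlerimizden" ~ "from our houses":
    # ev (house) + ler (plural) + imiz (our) + den (from)
    "evlerimizden": ["ev", "ler", "imiz", "den"],
}

def morphemes(word: str) -> list[str]:
    """Return the morphemes of a word, falling back to the whole word."""
    return toy_segmentation.get(word, [word])

print(morphemes("evlerimizden"))  # ['ev', 'ler', 'imiz', 'den']
```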
At any rate, another entry for linguistics jobs.