
Every language can be represented as a physical shape. Take the Universal Declaration of Human Rights, translate it into pure IPA phonetics, and map the contextual patterns of those sounds into a 2D space, and the physical geometry of human speech reveals itself:
(1) Look at the Romance languages (Spanish, French, Italian, Portuguese, Catalan, Romanian) in crimson. They group into nearly identical crescent shapes, sharing much the same geometric rhythm. You can hear this shared acoustic footprint in the word for "freedom": whether it is "libertad" in Spanish, "liberté" in French, or "libertà" in Italian, they all share a similar phonetic bounce.
(2) German, Dutch, and Swedish (in blue) are a different story. They stretch into a different quadrant of the map, carving out their own distinct structural rules, and they rely on sharper, more consonant-heavy clusters. For the same concept of freedom, German gives us "Freiheit", Dutch uses "vrijheid", and Swedish says "frihet". These structurally similar sounds sit together on the map.
(3) And of course, my favourite, the outlier: Hungarian (purple). Because Hungarian is a Uralic language, not Indo-European like the other 11, its footprint sits well apart from the rest. It forms a tight, isolated cluster far to the left, visually reflecting its unique origins. While the Romance and Germanic languages echo variations of "liberty" or "freedom", the Hungarian word is "szabadság", a completely different phonetic reality, and the geometry shows it clearly.
The grey background represents the universal corpus of all sounds combined. No single language covers the whole area because every language has specific rules about what sounds can go together, restricting them to their own specific islands.
How was this mapped? I used the event2vector package, which processes the sound sequences and plots their contextual embeddings without any prior linguistic training.
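For anyone wondering what "contextual embeddings of sounds in 2D" can look like in practice, here is a minimal sketch of the general idea. It is not event2vector's API and not necessarily how the plots above were produced: it swaps in a plain co-occurrence count plus PCA (via SVD) using only NumPy. The function names `context_embeddings` and `project_2d`, the window size, and the toy sequences are all hypothetical; the toy strings are rough ASCII stand-ins, not real IPA transcriptions of the UDHR.

```python
import numpy as np

def context_embeddings(sequences, window=2):
    """Describe each symbol by the distribution of symbols seen near it."""
    symbols = sorted({s for seq in sequences for s in seq})
    index = {s: i for i, s in enumerate(symbols)}
    counts = np.zeros((len(symbols), len(symbols)))
    for seq in sequences:
        for i, s in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[s], index[seq[j]]] += 1
    # Row-normalise so each row is a context distribution (rows sum to 1).
    row_sums = counts.sum(axis=1, keepdims=True)
    return symbols, counts / np.maximum(row_sums, 1)

def project_2d(matrix):
    """PCA via SVD: centre the rows, keep the first two principal components."""
    centred = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :2] * s[:2]

# Toy stand-ins for phonetic sequences (assumption: NOT actual IPA data).
toy = [list("libertad"), list("liberte"), list("fraihait")]
symbols, emb = context_embeddings(toy)
coords = project_2d(emb)

for sym, (x, y) in zip(symbols, coords):
    print(f"{sym}: ({x:+.3f}, {y:+.3f})")
```

Note that the two output axes here are just the first two principal components of the context vectors, so they carry no direct linguistic label; only relative positions are meaningful.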
by sulcantonin
13 Comments
Source corpus is UDHR: [https://www.un.org/en/about-us/universal-declaration-of-human-rights](https://www.un.org/en/about-us/universal-declaration-of-human-rights)
Tool used is event2vector: [https://github.com/sulcantonin/event2vec_public](https://github.com/sulcantonin/event2vec_public), also available via pip as event2vector.
Really should have x and y axis labeled. Otherwise very cool
This sounds so tantalisingly cool, but it’s hard to get anything from a graph with no axis labels, and thus no clarity on what the subtle differences in shapes (to a lay person) might mean. Would love to share in the coolness, but right now I’ve mostly got cloudy blobs.
It would be interesting to add Finnish
To the untrained eye these all look incredibly similar with the exception of the arbitrary colors assigned. I honestly don’t see crescents for the Romance languages. All the axes are the same…but no real differences apparent. Somehow you need to highlight some other dimension(s) to give this more meaning.
It’s interesting, but if they weren’t colour coded I don’t think I could have grouped them… I’m not sure I’m seeing the differences.
Needs axis labels/explanations. Also, given that most of us on this site speak English, English would be nice to have as a comparison.
Interesting concept! It would be even cooler to have more linguistic diversity. How about including some Asian and African languages?
Interesting!! I’d love to see similar charts for Danish, Norwegian, and English.
https://preview.redd.it/69uzglf6i1tg1.png?width=1795&format=png&auto=webp&s=040ef9878065348342e1801a0983a7b0dfa33e05
This data makes more sense as a PCA of all the data combined. It doesn’t really give intel on much with two dimensions and with each language separated out. Plot it all on a UMAP or something.
I can’t understand why you wouldn’t include English, since it’s a reference point for everyone here on reddit
As a linguist, I was momentarily excited. But I don’t see much use to this. The axes are unlabeled, though if I understand correctly, they don’t make much intuitive sense because they are 2D compressions of a PCA? So how is it even meant to be interpreted?