Can A.I. Help Solve the Mystery of Lost Languages?

There are many things that distinguish humans from other species, but the most important is language. The ability to string together various elements into essentially infinite combinations is a trait that “often in the past has been considered a basic defining feature of modern humans, a source of human creativity, cultural enrichment, and complex social structure,” as linguist Noam Chomsky once said.

But for as long as language has been part of human development, there is still a lot we do not know about how languages have evolved. While dead languages such as Latin left behind a wealth of written records and descendants through which we can better understand them, some languages are lost to history.

Researchers have been able to reconstruct some lost languages, but the process of deciphering them can be a lengthy one. For example, the ancient script Linear B was “solved” half a century after its discovery, and some of the people who worked on it did not live to see the work completed. An older script called Linear A, the writing system of the Minoan civilization, remains undeciphered.

Modern linguists have a powerful tool at their disposal, however: artificial intelligence. By training AI to detect patterns in undeciphered languages, researchers can reconstruct them, uncovering the mysteries of the ancient world. Recently, a novel neural approach by researchers at the Massachusetts Institute of Technology (MIT) has already shown success in deciphering Linear B, and may one day lead to the decipherment of other lost languages.

Reviving the dead (languages)

Much like skinning a cat, there is more than one way to decipher a lost language. In some cases, there is no written record of the language, so linguists try to reconstruct it by tracing the development of sounds through its descendants. Such is the case with Proto-Indo-European, the hypothesized ancestor of many languages across Europe and Asia.

In other cases, archaeologists unearth written records, which was the case with Linear B. After archaeologists discovered tablets on the island of Crete, researchers spent decades puzzling over the writing before eventually deciphering it. Unfortunately, this is not currently possible with Linear A, as researchers do not have nearly as much source material to study. But that may not be necessary.

A project by researchers at MIT illustrates AI’s potential to revolutionize the field as well as the difficulties of decipherment. The researchers developed a neural approach to deciphering lost languages “informed by patterns in language change documented in historical linguistics.” As detailed in a 2019 paper, previous AI approaches had to be tailored to a specific language, whereas this one does not.

“If you look at any commercially available translator or translation product,” says Jiaming Luo, the lead author on the paper, “all of these technologies have access to a large amount of what we call parallel data. You can think of them as Rosetta Stones, but in very large quantities.”

A parallel corpus is a collection of the same texts in two different languages. For example, imagine a series of sentences in both English and French. Even if you do not know French, by comparing the two sets and observing patterns, you can map words in one language onto their equivalents in the other.

“If you train a human to do this, if you look at 40-plus-million parallel sentences,” Luo explains, “I believe you will be able to work out a translation.”
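To make the idea of mapping words through a parallel corpus concrete, here is a minimal Python sketch. It is not how commercial translators or the MIT model work; it simply counts which words tend to show up in the same sentence pairs. The four-sentence corpus, the `map_words` function, and the choice of the Dice coefficient as a score are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def map_words(pairs):
    """Guess a translation for each source word from aligned sentence pairs."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Count every source/target word pair seen in the same sentence pair.
        joint.update(product(s_words, t_words))

    mapping = {}
    for s in src_count:
        # Dice coefficient: high when two words appear in the same pairs
        # and rarely appear without each other.
        def dice(t):
            return 2 * joint[(s, t)] / (src_count[s] + tgt_count[t])
        mapping[s] = max(sorted(tgt_count), key=dice)
    return mapping

# Hypothetical toy corpus; a real one would hold millions of sentence pairs.
toy_corpus = [
    ("the dog sleeps", "le chien dort"),
    ("the cat sleeps", "le chat dort"),
    ("the dog eats", "le chien mange"),
    ("a cat eats", "un chat mange"),
]
mapping = map_words(toy_corpus)
print(mapping["dog"], mapping["cat"], mapping["sleeps"])
```

Even with only four sentence pairs, the counts are enough to pair "dog" with "chien" and "cat" with "chat". The approach depends entirely on having the same text in both languages, which is exactly what a lost language lacks.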

But English and French are living languages with centuries of cultural overlap. Deciphering a lost language is far more difficult.

“We don’t have the luxury of parallel data,” Luo explains. “So we have to rely on some specific linguistic knowledge about how language evolves, how words evolve into their descendants.”

To create a model that can be used regardless of the languages involved, the team set constraints based on trends that can be observed in how languages evolve.

“We have to rely on two levels of linguistic insight,” Luo says. “One is at the character level: what we know is that when words evolve, they usually evolve from left to right. You can think of this evolution as something like a string. So if there is a string ABCDE in Latin, most likely it is going to change to ABD or ABC, and you still preserve the original order in a way. We call that monotonic.”
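The monotonic constraint Luo describes can be illustrated with a classic string comparison. The sketch below is not the MIT team's neural model; it uses plain edit distance (Levenshtein distance) as a stand-in, because that measure only allows characters to be dropped, inserted, or substituted in left-to-right order and never reordered. The function name `monotonic_cost` is made up for this example.

```python
def monotonic_cost(ancestor, descendant):
    """Levenshtein distance: the cost of the cheapest left-to-right alignment."""
    prev = list(range(len(descendant) + 1))
    for i, a in enumerate(ancestor, start=1):
        curr = [i]
        for j, d in enumerate(descendant, start=1):
            curr.append(min(
                prev[j] + 1,             # drop a character of the ancestor
                curr[j - 1] + 1,         # insert a character in the descendant
                prev[j - 1] + (a != d),  # keep (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]

# Luo's example: ABCDE plausibly becomes ABD or ABC, with order preserved.
print(monotonic_cost("ABCDE", "ABD"))    # 2
print(monotonic_cost("ABCDE", "ABC"))    # 2
# A scrambled string is penalized, since characters cannot swap places.
print(monotonic_cost("ABCDE", "EDCBA"))  # 4
```

Order-preserving changes come out cheap, while a reordered string is expensive. That is the kind of bias a monotonic constraint builds into a model.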

At the level of vocabulary (the words that make up a language), the team used a technique called “one-to-one mapping.”

“This means that if you pull out the entire vocabulary of Latin and pull out the entire vocabulary of Italian, you will see some kind of one-to-one matching,” Luo says as an example. “The Latin word for ‘dog’ will probably evolve into the Italian word for ‘dog,’ and the Latin word for ‘cat’ will probably evolve into the Italian word for ‘cat.’”
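A one-to-one matching between two small vocabularies can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the method from the paper: similarity is measured with plain edit distance, every possible pairing is tried by brute force, and the three-word Latin and Italian lists are chosen by hand. The names `edit_distance` and `one_to_one_match` are hypothetical.

```python
from itertools import permutations

def edit_distance(a, b):
    """Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def one_to_one_match(lost_words, known_words):
    """Pair each lost word with a distinct known word, minimizing total distance."""
    best = min(
        permutations(known_words, len(lost_words)),
        key=lambda perm: sum(edit_distance(l, k) for l, k in zip(lost_words, perm)),
    )
    return dict(zip(lost_words, best))

# Hand-picked toy vocabularies: dog, cat, wolf (Italian list is shuffled).
latin = ["canis", "cattus", "lupus"]
italian = ["gatto", "lupo", "cane"]
pairs = one_to_one_match(latin, italian)
print(pairs)
```

Because each Italian word can be used only once, the matching is decided for the vocabulary as a whole rather than word by word. Brute force is only workable for a handful of words; a realistic vocabulary would need a proper assignment or flow solver.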

To test the model, the team used several datasets. They translated the ancient language Ugaritic into Hebrew and Linear B into Greek, and, to confirm the efficacy of the model, performed cognate detection (cognates are words with common ancestry) within the Romance languages Spanish, Italian, and Portuguese.

This was the first known attempt to decipher Linear B automatically, and the model successfully translated 67.3% of cognates. The system also improved on previous models for translating Ugaritic. Given that the languages come from different families, this shows that the model is flexible, as well as more accurate than previous systems.


Linear A remains one of the great mysteries of linguistics, and cracking that ancient nut would be a notable achievement for AI. For now, says Luo, something like that is purely theoretical, for a couple of reasons.

First, Linear A offers an even smaller quantity of data than Linear B. There is also the matter of figuring out what kind of script Linear A even is.

“I’d say the unique challenge for Linear A is that you have a lot of pictorial or logographic characters or symbols,” Luo says. “And usually when you have a lot of these symbols, it’s going to be much harder.”

As an example, Luo compares English and Chinese.

“If you do not count capitalization, there are 26 letters in English and 33 in Russian. These are called alphabetic systems. So you just have to figure out a mapping for these 26 or 30-something characters,” he says.

“But for Chinese, you have to deal with thousands of them,” he continues. “I think the minimum number of characters you need just to read a newspaper would be estimated at around 3,000 or 5,000. Linear A is not Chinese, but because of its pictorial or logographic symbols and stuff like that, it’s definitely harder than Linear B.”

Although Linear A remains undeciphered, the success of MIT’s novel neural decipherment approach in automatically deciphering Linear B, moving beyond the need for a parallel corpus, is a promising sign.
