Scientific discovery is one of the engines of human and societal progress. Scientists spend years learning, devising hypotheses, carrying out experiments, and publishing results, all in pursuit of advancing the state of the art. What if a machine were able to do science, only much faster?

This is exactly what a recent study in the field of materials science has demonstrated. Researchers have developed an AI system that was able to find previously unknown compounds.

How do they know that their system actually works? The ingenious solution was to metaphorically go back in time and test whether the AI system could predict discoveries that actually happened. For instance, if a new compound was discovered in 2018, they fed the system only data published before that year. A virtual time machine of sorts.
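The "time machine" test can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the study's actual pipeline: the `Abstract` record, the function names, the `top_k` threshold, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Abstract:
    year: int   # publication year
    text: str   # abstract text (placeholder strings in this toy example)


def training_corpus(abstracts, discovery_year):
    """Keep only abstracts published strictly before the discovery year."""
    return [a for a in abstracts if a.year < discovery_year]


def was_anticipated(ranked_predictions, discovered_material, top_k=10):
    """Check whether the later discovery appears among the top-ranked predictions."""
    return discovered_material in ranked_predictions[:top_k]


# Hypothetical toy data: two abstracts from before 2018, one from 2018 itself.
abstracts = [
    Abstract(2015, "older abstract"),
    Abstract(2017, "recent abstract"),
    Abstract(2018, "abstract reporting the discovery"),
]

past_only = training_corpus(abstracts, discovery_year=2018)
```

A model trained on `past_only` has never seen the 2018 publication, so if the discovered material still shows up near the top of its predictions, that counts as evidence that the system works.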

A machine on the shoulders of giants

In more detail, the system uses text from scientific publications to create embeddings, that is, high-dimensional representations of words that make it possible to calculate how their meanings relate to one another. If two molecules are described as having similar characteristics, the embedding will place them close to each other in a high-dimensional space, even if the texts appear in different publications (see examples in the figure below, where high dimensions are flattened to 2D).

By representing molecules with their context words, it is possible to create a map in which the materials cluster into applications, such as photovoltaic and organic.
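To make the idea of "closeness" in an embedding space concrete, here is a minimal sketch. It assumes cosine similarity as the measure, which is a common choice for word embeddings but is not stated in this article. The three-dimensional vectors and the material names are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (invented numbers, not taken from the study).
embeddings = {
    "material_a": [0.9, 0.1, 0.0],
    "material_b": [0.8, 0.2, 0.1],
    "material_c": [0.0, 0.1, 0.9],
}

sim_ab = cosine_similarity(embeddings["material_a"], embeddings["material_b"])
sim_ac = cosine_similarity(embeddings["material_a"], embeddings["material_c"])
```

In this toy setup `sim_ab` is much larger than `sim_ac`: materials described in similar contexts end up pointing in similar directions, regardless of which publication mentioned them.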

How did the scientists go from this representation to finding yet-to-be-discovered compounds? By using the power of abstract representation: both the text describing the molecules and the text about the applications can be represented in the same high-dimensional space. This means that their similarity can be measured simply by calculating their distance in that space. The image below shows the result as shades of purple: the more purple the point, the more amenable the material is to the application in question, in this case the conversion of heat to electricity. Among the purple dots, many have never been considered as having thermoelectric properties.
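The ranking step can be sketched as follows. Again, this is an illustration under assumptions rather than the researchers' code: it uses cosine similarity as the measure, and the vectors, the material names, and the `already_known` set are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(material_vectors, application_vector, already_known):
    """Rank materials not yet linked to the application, most similar first."""
    scores = {
        name: cosine_similarity(vec, application_vector)
        for name, vec in material_vectors.items()
        if name not in already_known
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embedding of the application word (e.g. "thermoelectric").
application_vector = [1.0, 0.0, 0.2]

# Hypothetical material embeddings living in the same space.
material_vectors = {
    "material_a": [0.9, 0.1, 0.1],
    "material_b": [0.1, 0.9, 0.0],
    "material_c": [0.7, 0.2, 0.3],
    "material_d": [0.95, 0.05, 0.2],
}

# Materials already studied for this application are excluded from the ranking.
already_known = {"material_d"}

ranking = rank_candidates(material_vectors, application_vector, already_known)
```

The materials at the top of `ranking` correspond to the darkest purple dots: candidates whose descriptions sit close to the application, even though nobody has reported them for that use.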

This feels counterintuitive because in the previous map the materials grouped together by application, while here the candidates are more scattered. This is one example of being data-informed: what counts is what the data show, not what feels intuitive to us.

Moreover, it’s a prime example of the power of visualization: it makes complex concepts easier to understand. As Professor Munzner (2009) observed: “Visualization allows people to offload cognition to the perceptual system, using carefully designed images as a form of external memory.” The freed cognitive capabilities can then be used to understand more complex relationships.

Another point from the study worth emphasizing is that the AI system uses scientific publications as input data to find new compounds. This is remarkable because it means that the content for this specific kind of scientific discovery is already available in previous works. As the saying goes, it is a matter of standing on the shoulders of giants. Hence, in this case the task of scientists is to combine available information in the right way and distill it into new knowledge.

The grain of salt in the compound

In conclusion, remarkable studies like this one always require more scrutiny. The bolder the claim, the more evidence is needed, as my PhD advisor used to say. The study certainly caters to the narrative of AI becoming more and more capable. On the other hand, the scientists’ role in creating the right form of visualization and, most importantly, asking the right questions should not be underestimated. Ultimately, this is the real power of AI and Data Science now and in the coming years: they allow us to ask better and more interesting questions, advancing the state of the art and the progress of human society much faster.


Nicola Rohrseitz

Lead, Strategic AI Program

Chief Architecture and Technology Organization