Picture This: Quantum Approach Matches Classical AI in Text-Image Tasks

Insider Brief

  • Researchers at University College London have developed a multimodal quantum framework, called MultiQ-NLP, that integrates language and image data into a unified, structure-aware quantum model.
  • By translating both text and images into quantum circuits, the approach leverages quantum computing’s natural capacity for handling complex tensor structures, potentially enhancing the transparency and interpretability of AI systems.
  • Tested on a mainstream image classification task (SVO-Probes), the best quantum-based model performed on par with leading classical methods, suggesting that quantum-enhanced approaches may one day tackle language-and-image understanding.

Quantum computing researchers at University College London have introduced a new framework that may help bring clarity to the “black box” nature of large language models. Their approach, called MultiQ-NLP, encodes both text and images into a unified quantum model designed to highlight and preserve the structural relationships that make language meaningful.

The researchers, who published their findings on the pre-print server arXiv, report their method could one day match state-of-the-art classical models in classifying image-text pairs, potentially laying the groundwork for more interpretable and robust artificial intelligence (AI) systems.

Large Language Models Lack Transparency

Modern large language models have made major advances in natural language processing, but their billions of parameters form a dense web that is nearly impossible to fully understand. Their decision-making processes remain opaque, making it hard to know why they choose certain words or how they reason about images paired with text.

The UCL team’s work aims to address this challenge by treating language and images as mathematical structures that quantum computers are well-suited to handle. Rather than relying on brute-force pattern-matching, the researchers organize data around syntax, grammar, and compositional meaning—elements they say can be modeled naturally using the tools of quantum theory.

Translating Text and Images into Quantum Circuits

At the heart of this approach is the idea that language and its building blocks—words, sentences, and their grammatical roles—can be represented as higher-order tensors. Tensors are multi-dimensional arrays that capture how words relate to one another. Traditionally, training tensor-based models on classical hardware is prohibitively expensive. But on a quantum processor, tensors can be encoded as states of qubits, offering a more direct and potentially more efficient way to handle complex linguistic relationships.
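As an illustration of the tensor idea described above, the following sketch (not the paper's actual model, just a minimal numpy toy with made-up dimensions and random values) shows how a transitive verb can be represented as an order-3 tensor that combines a subject vector and an object vector into a sentence vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension

# Noun meanings as vectors; a transitive verb as an order-3 tensor
# that maps (subject, object) pairs to a sentence meaning.
dog, road = rng.normal(size=d), rng.normal(size=d)
sits_on = rng.normal(size=(d, d, d))  # verb tensor V[i, j, k]

# Sentence meaning = contract the verb tensor with its arguments:
# sentence_i = sum_jk V[i, j, k] * subject_j * object_k
sentence = np.einsum("ijk,j,k->i", sits_on, dog, road)
print(sentence.shape)  # (4,)
```

The cost of this contraction grows quickly with the tensor's order and dimension on classical hardware, which is why encoding such tensors as qubit states is attractive.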

The new MultiQ-NLP framework extends an existing method for quantum natural language processing (QNLP) to incorporate images. The researchers enrich the underlying “types” and “type homomorphisms” in their model to account for both text and images. By representing images as feature vectors extracted by a classical neural network (ResNet-50) and then turning these vectors into quantum states, the method places language and visuals into the same mathematical framework.

In essence, each word and image feature corresponds to a set of qubits, and the meaning of sentences combined with images emerges from how these qubits interact. The operations that link words together—akin to “function-argument” relationships in linguistics—are mapped onto quantum gates that entangle states, preserving the compositional structure of language in a quantum format. Similarly, image features undergo a dimensional reduction into a small vector that can be encoded as quantum rotations, capturing visual properties that the circuit can process alongside linguistic data.
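To make the "quantum rotations" encoding concrete, here is a minimal sketch of one common scheme, angle encoding, where each reduced image feature sets the rotation angle of one qubit. This is an illustrative assumption about the encoding style, not the paper's exact circuit:

```python
import numpy as np

def angle_encode(features):
    """Encode a small feature vector as a product state of qubits,
    one RY(theta) rotation per feature:
    |psi> = tensor product of (cos(t/2)|0> + sin(t/2)|1>)."""
    state = np.array([1.0])
    for theta in features:
        qubit = np.array([np.cos(theta / 2), np.sin(theta / 2)])
        state = np.kron(state, qubit)  # tensor product grows the state
    return state

# e.g. 4 reduced image features -> a 2**4 = 16-dimensional quantum state
features = np.array([0.3, 1.2, -0.5, 2.0])
psi = angle_encode(features)
print(psi.shape, np.isclose(psi @ psi, 1.0))  # (16,) True
```

Note how n features occupy only n qubits while spanning a 2^n-dimensional state space, which is part of the appeal of quantum encodings for tensor-structured data.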

Matching State-of-the-Art Classical Performance

To test their approach, the researchers turned to a mainstream image classification task from Google’s SVO-Probes dataset. This dataset challenges models to match captions to images by focusing on the roles of subjects, verbs, and objects. A sentence like “A dog is sitting on the road” might be paired with two images: one of a dog actually sitting on a road, and another of a dog doing something else, like running.

The best of the quantum-based models, which fully integrated syntactic structure, performed on par with top classical models. This finding is notable. It not only shows that the quantum approach can keep pace with established methods — it does so while exposing the underlying structure of the language and images. The researchers argue that this more “transparent” modeling could pave the way for more interpretable AI. Instead of relying solely on statistical patterns gleaned from massive training sets, the quantum method encodes explicit structural information, giving developers and users a clearer understanding of how and why the model arrives at its conclusions.

Structured Versus Unstructured Data

The team tested their models on two types of tasks. In the “unstructured” scenario, the model simply had to tell which image matched a given sentence when verb usage varied. In the “structured” scenario, the model confronted a trickier linguistic puzzle: subject-object swaps. In one scenario, the model might encounter both “A child holds the mother’s hand” and “A mother holds the child’s hand” alongside a single image that only matches one of the sentences. The structure-aware quantum models excelled here, reinforcing the idea that capturing grammar and syntax pays off when language gets more complex.
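The subject-object swap makes the payoff of structure easy to see. In the classical toy model below (an illustration with random vectors, not the researchers' implementation), a bag-of-words representation literally cannot distinguish the two swapped sentences, while a verb-tensor representation that respects grammatical roles can:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
vecs = {w: rng.normal(size=d) for w in ["child", "mother", "holds", "hand"]}
holds_T = rng.normal(size=(d, d, d))  # order-3 tensor for the verb "holds"

def bag_of_words(words):
    # Order-blind: just sum the word vectors.
    return sum(vecs[w] for w in words)

def structured(subj, obj):
    # Contract the verb tensor with subject and object in their roles.
    return np.einsum("ijk,j,k->i", holds_T, vecs[subj], vecs[obj])

s1 = ["child", "holds", "mother", "hand"]
s2 = ["mother", "holds", "child", "hand"]

print(np.allclose(bag_of_words(s1), bag_of_words(s2)))  # True: swap is invisible
print(np.allclose(structured("child", "mother"),
                  structured("mother", "child")))       # False: roles preserved
```

The same contrast carries over to the quantum setting, where role-sensitive entangling gates play the part of the verb tensor.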

Interestingly, even a simpler “bag-of-words” quantum model — one that ignores syntax and treats each sentence as a jumbled collection of words — performed relatively well on the simpler, unstructured data. But when it came to the more complex structured data, the syntax-driven quantum models significantly outperformed the bag-of-words approach, according to the researchers, who suggest this underlines the importance of grammar-sensitive modeling.

Limitations and Next Steps

While these results are encouraging, the researchers acknowledge that significant work remains and that several limitations point toward future research. First, the experiments were conducted on simulators rather than actual quantum hardware. Quantum simulations on classical machines are computationally expensive, forcing the team to work with small datasets and reduced feature dimensions for the images. The researchers note that they used only about 20 features per image, far fewer than the thousands that classical image recognition models typically employ.

They also used a particular training method — an optimization algorithm called Simultaneous Perturbation Stochastic Approximation (SPSA) — that introduces some noise and may have limited the model’s potential. With more refined optimization techniques, better hardware, and larger training sets, the team believes performance could improve further.
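SPSA itself is a standard optimizer (due to Spall) whose hallmark is estimating the gradient of a noisy loss from just two evaluations per step, regardless of the number of parameters. This is what makes it popular for variational quantum circuits, but the stochastic gradient estimate is itself noisy. A minimal generic sketch (not the paper's training code; the toy loss and gain constants are illustrative):

```python
import numpy as np

def spsa_minimize(loss, theta, iters=1000, a=0.1, c=0.1, seed=0):
    """Simultaneous Perturbation Stochastic Approximation:
    estimate the gradient from two loss evaluations per step,
    using a random +/-1 perturbation of all parameters at once."""
    rng = np.random.default_rng(seed)
    for k in range(1, iters + 1):
        ak = a / k**0.602   # decaying step size (standard gain schedule)
        ck = c / k**0.101   # decaying perturbation size
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        ghat = (loss(theta + ck * delta) - loss(theta - ck * delta)) \
               / (2 * ck) * (1 / delta)
        theta = theta - ak * ghat
    return theta

# Toy quadratic loss with minimum at theta = [1, -2]
loss = lambda t: (t[0] - 1) ** 2 + (t[1] + 2) ** 2
theta = spsa_minimize(loss, np.array([0.0, 0.0]))
print(np.round(theta, 2))
```

The two-evaluation trick keeps quantum-circuit training cheap, but the noise it injects into each update is exactly the limitation the researchers mention.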

Scaling up appears to be the next big challenge. While their dataset already surpasses what early QNLP papers used, it is still modest by today’s machine learning standards. Realizing the full promise of quantum language-and-image processing may require running on more advanced quantum devices or leveraging GPU acceleration to simulate larger quantum circuits more efficiently.

Implications for AI and Quantum Computing

If the approach can scale, the implications could be wide-ranging. Large language models have transformed fields like search, recommendation systems and content generation, but their black-box nature remains a concern in high-stakes areas like healthcare, finance and law. A quantum method that is inherently more interpretable could offer a way to assure users and regulators that these systems are making logical, justifiable decisions.

Integrating structured representations with quantum states could also open new directions for quantum machine learning. Quantum computers are still in their infancy, but this work aligns with the broader vision of using quantum devices not only as faster computers but as engines for new forms of problem-solving—ones that leverage quantum properties to represent and manipulate data in ways classical machines cannot.

On the Road to More Transparent AI

This new MultiQ-NLP framework shows that quantum methods can hold their own against classical models on challenging multimodal tasks. Perhaps more importantly, it does so while preserving a compositional structure that could make models more interpretable and trustworthy. As quantum computing matures and researchers find smarter ways to encode and process data, approaches like MultiQ-NLP may play a key role in shaping a future where AI is both powerful and transparent.

The research team included Hala Hawashin and Mehrnoosh Sadrzadeh, both of University College London. Dimitri Kartsaklis, of Quantinuum, also offered insights about the project to the team.

For a deeper, more technical dive — which this article can’t provide — please read the paper here. Please also note that pre-print servers like arXiv offer researchers a way to gain immediate feedback on new work, but work posted there has not been officially peer-reviewed, a key step in the scientific process.

Matt Swayne

With a background in journalism and communications spanning several decades, Matt Swayne has worked as a science communicator for an R1 university for more than 12 years, specializing in translating high tech and deep tech for a general audience. He has served as a writer, editor and analyst at The Quantum Insider since its inception. Matt also develops and teaches courses to improve the media and communications skills of scientists. [email protected]
