Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

2025/11/20 00:00

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

Figure 7 gives an overview of the architecture. QFormer is initialized as a BERT-based model [8] with L = 12 layers. Unlike typical BERT models, which process textual inputs, QFormer takes R = 32 learnable query embeddings as input. These embeddings extract visual information from the input visual data during Stage-1 pretraining in BLIP2 [22]. After projection, they serve as visual prompt embeddings for the LLM input.
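To make the role of the R = 32 query embeddings concrete, here is a minimal sketch of queries attending over visual embeddings. It is an illustration under stated assumptions, not the QFormer implementation: it uses a single head, omits the learned query/key/value projections, and the sizes `N` (number of visual embeddings) and `D` (embedding width) are hypothetical toy values.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, visual):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned projections; queries act directly as attention
    queries and the visual embeddings act as both keys and values.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # One score per visual embedding.
        scores = [sum(qi * vi for qi, vi in zip(q, v)) / math.sqrt(d)
                  for v in visual]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the visual embeddings.
        outputs.append([sum(w[j] * visual[j][k] for j in range(len(visual)))
                        for k in range(d)])
    return outputs, weights


random.seed(0)
R = 32   # number of learnable queries, as stated in the text
N = 10   # hypothetical number of visual embeddings
D = 8    # hypothetical embedding width (toy value)

queries = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(R)]
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

prompts, attn = cross_attention(queries, visual)
print(len(prompts), len(prompts[0]))  # one output vector per query
```

Whatever the number of visual embeddings, the output always has R rows, which is why the queries can serve as a fixed-length visual prompt.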

Inside QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (a Linear layer, LayerNorm, and a residual connection). A cross-attention module, randomly initialized, is inserted every G layers; in it, the learnable query embeddings interact with the visual embeddings. For conciseness, the main paper condenses the multi-head attention and forward modules into single self-attention and cross-attention modules. It also illustrates only the modifications made to the cross-attention module in MIVPG, because the self-attention modules remain unchanged. The final QFormer output is the set of query embeddings from the last layer.
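The layer layout described above can be sketched as a simple plan listing which modules each layer contains. This is a hedged illustration only: the text does not give the value of G, so `cross_attn_every=2` is an assumed value, and placing the first cross-attention module in layer 0 is likewise an assumption. The function name `qformer_layer_plan` is hypothetical, and only the modules named in the text are listed.

```python
def qformer_layer_plan(num_layers=12, cross_attn_every=2):
    """Return, per layer, the list of attention modules it contains.

    Assumptions: cross_attn_every stands in for G (value not given in the
    text), and cross-attention is placed in layers whose index is a
    multiple of G, starting at layer 0.
    """
    plan = []
    for i in range(num_layers):
        modules = ["self_attention"]  # present in every layer
        if i % cross_attn_every == 0:
            modules.append("cross_attention")  # queries meet visual embeddings
        plan.append(modules)
    return plan


plan = qformer_layer_plan()
cross_layers = [i for i, m in enumerate(plan) if "cross_attention" in m]
print(cross_layers)  # [0, 2, 4, 6, 8, 10]
```

Under these assumptions, half of the 12 layers let the queries read from the visual embeddings, while every layer lets the queries interact with one another through self-attention.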

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
