This section examines how Transformers get beyond the local reasoning barrier with the aid of scratchpads, which are explicit sequences of reasoning steps. By teaching models intermediate steps such as cumulative operations or depth-first search (DFS) traces, traditional "educated scratchpads" effectively reduce the locality of a task. However, these methods often limit out-of-distribution generalization by overfitting to particular sequence lengths. To overcome this, the study introduces the inductive scratchpad, a new formulation that emulates algorithmic induction by having the model iteratively update a state variable, using attention masking and reindexed token positions.

How Inductive Scratchpads Help Transformers Learn Beyond Their Training Data


Abstract and 1. Introduction

1.1 Syllogisms composition

1.2 Hardness of long compositions

1.3 Hardness of global reasoning

1.4 Our contributions

  2. Results on the local reasoning barrier

    2.1 Defining locality and auto-regressive locality

    2.2 Transformers require low locality: formal results

    2.3 Agnostic scratchpads cannot break the locality

  3. Scratchpads to break the locality

    3.1 Educated scratchpad

    3.2 Inductive Scratchpads

  4. Conclusion, Acknowledgments, and References

A. Further related literature

B. Additional experiments

C. Experiment and implementation details

D. Proof of Theorem 1

E. Comment on Lemma 1

F. Discussion on circuit complexity connections

G. More experiments with ChatGPT


3 Scratchpads to break the locality

Prior literature. It has been shown that training Transformers with the intermediate steps required to solve a problem can enhance learning. This idea is usually referred to as providing models with a scratchpad [32]. The improved performance due to scratchpads has been reported on a variety of tasks, including mathematical tasks and program state evaluation [32, 11, 17]. See Appendix A for further references.

3.1 Educated scratchpad

Figure 4: (Left) Learning the cycle task with a scratchpad. (Right) OOD generalization for the DFS and inductive scratchpads (see Section 3.2.1).


Results for the parity task. Consider the parity task with a scratchpad that computes the cumulative product (i.e., the partial parities) of the input bits one bit at a time, so that each intermediate value depends only on the previous partial parity and the next input bit.

Lemma 2. The parity task with the cumulative product scratchpad has a locality of 2.

Transformers with such a scratchpad can in fact easily learn parity targets; see Appendix B.3.
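As a concrete illustration, here is a minimal sketch (not the paper's code; the exact tokenization of Appendix B.3 may differ) of how such a cumulative scratchpad string could be generated, using 0/1 bits and XOR in place of a ±1 product:

```python
def parity_scratchpad(bits):
    """Build a cumulative (partial-parity) scratchpad string for an input bit string.

    Each intermediate value depends only on the previous partial parity and the
    next input bit, which is what keeps the locality of the task at 2.
    """
    partials = []
    acc = 0
    for b in bits:
        acc ^= b                      # partial parity after consuming this bit
        partials.append(str(acc))
    # question, intermediate steps, and final answer
    return "".join(map(str, bits)) + "?" + ">".join(partials) + ";" + str(acc)

# Example: the parity of 1,0,1,1 is 1
print(parity_scratchpad([1, 0, 1, 1]))   # 1011?1>1>0>1;1
```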

Results for the cycle task. Consider the cycle task and a scratchpad that learns the depth-first search (DFS) algorithm from the source query node.[9] For example, consider the following input corresponding to two cycles a, x, q and n, y, t: a>x; n>y; q>a; t>n; y>t; x>q; a?t;. In this case, doing a DFS from node a gives a>x>q>a where the fact that we have returned to the source node a and not seen the destination t indicates that the two nodes are not connected. Therefore, the full scratchpad with the final answer can be designed as a>x>q>a;0. Similarly, if the two nodes were connected the scratchpad would be a>…>t;1. One can easily check that the cycle task becomes low-locality with the DFS scratchpad.
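For concreteness, here is a minimal sketch of how this DFS scratchpad string could be generated from the successor map of the two cycles. The format follows the example above; the paper's actual data-generation code (see the linked repository) may differ.

```python
def dfs_scratchpad(edges, source, dest):
    """Sketch of the DFS scratchpad for the cycle task.

    `edges` maps each node to its successor (the graph is a union of directed
    cycles), so the DFS trace is a walk from the source until we either reach
    the destination (answer 1) or return to the source (answer 0).
    """
    trace = [source]
    node = edges[source]
    while node != source and node != dest:
        trace.append(node)
        node = edges[node]
    trace.append(node)
    answer = 1 if node == dest else 0
    return ">".join(trace) + ";" + str(answer)

# Two cycles a->x->q->a and n->y->t->n, query "a?t" (not connected)
edges = {"a": "x", "x": "q", "q": "a", "n": "y", "y": "t", "t": "n"}
print(dfs_scratchpad(edges, "a", "t"))   # a>x>q>a;0
```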

Lemma 3. The cycle task with the DFS scratchpad has a locality of 3.

This follows from the fact that one only needs to find the next node in the DFS path (besides the final label), which one can guess with inverse-polynomial probability by checking a single edge.

In Figure 4a we show that a decoder-only Transformer with the DFS scratchpad in fact learns the cycle task as n scales.

Remark 4. If one has full knowledge of the target function, one could break the target into sub-targets using an educated scratchpad to keep the locality low and thus learn more efficiently (of course, one does not need to learn a target that is already fully known, but one may still want a model to learn it in order to develop useful representations in a broader or meta-learning context). In theory, one could push this to learning any poly-time computable target by emulating a Turing machine in the steps of the scratchpad so as to keep the overall locality low. Some works have derived results in that direction, such as [33] for a type of linear autoregressive model, or [12] for more abstract neural nets that emulate any Turing machine with SGD training. However, these are mostly theory-oriented works. In practice, one may instead be interested in devising a more ‘generic’ scratchpad. In particular, a relevant feature in many reasoning tasks is the power of induction; this appears in the last two examples (parity and cycle task), where it is useful to learn inductive steps when possible.

3.2 Inductive Scratchpads

As discussed previously, scratchpads can break the local reasoning barrier when they use appropriate intermediate steps. In this part, however, we show that fully educated scratchpads can be sensitive to the number of reasoning steps, translating into poor out-of-distribution (OOD) generalization. As a remedy, we put forward the concept of an inductive scratchpad, which applies to a variety of reasoning tasks such as those of the previous sections.

3.2.1 Educated scratchpads can overfit in-distribution samples

Consider the cycle task with 40 nodes. For the test distribution, we use the standard version of the cycle task, i.e., either two cycles of size 20 with the query nodes not connected, or a single cycle of size 40 where the distance between the query nodes is 20. For the train distribution, we keep the same number of nodes and edges (so the model does not need to rely on new positional embeddings for the input) but break the cycles into uneven lengths: (1) a cycle of size 10 and a cycle of size 30 when the two nodes are not connected (the source query node is always in the cycle of size 10), or (2) a cycle of size 40 where the query nodes are at distance 10. Thus, in all cases, the graphs have 40 nodes and 40 edges; however, the length of the DFS path (i.e., the number of reasoning steps) is 10 at training time and 20 at test time. We trained our model on this version of the task with the DFS scratchpad. The results are shown in Figure 4b. We observe that the model quickly achieves perfect accuracy on the training distribution, yet it fails to generalize OOD, as it overfits to the scratchpad length seen during training. In the next part, we introduce the notion of inductive scratchpad to fix this problem.
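For concreteness, the sketch below shows one way to generate such train/test instances; node naming and tokenization here are illustrative assumptions, not the paper's exact generation code.

```python
import random

def cycle_task_instance(cycle_sizes, connected, distance=None):
    """Illustrative generator for the cycle task.

    Train distribution: cycle_sizes=[10, 30] (not connected) or [40] with distance=10.
    Test distribution:  cycle_sizes=[20, 20] (not connected) or [40] with distance=20.
    """
    nodes = list(range(sum(cycle_sizes)))
    random.shuffle(nodes)
    edges, start = [], 0
    for size in cycle_sizes:
        cyc = nodes[start:start + size]
        edges += [(cyc[i], cyc[(i + 1) % size]) for i in range(size)]
        start += size
    if connected:
        src, dst = nodes[0], nodes[distance]          # same cycle, `distance` edges apart
    else:
        src, dst = nodes[0], nodes[cycle_sizes[0]]    # source in the first (small) cycle
    random.shuffle(edges)
    return edges, (src, dst)

train_graph, train_query = cycle_task_instance([10, 30], connected=False)
test_graph, test_query = cycle_task_instance([40], connected=True, distance=20)
```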

3.2.2 Inductive scratchpad: definition and experimental results

In a large class of reasoning tasks, one can iteratively apply an operation to some state variable (e.g., a state array) to compute the output. This applies in particular to numerous graph algorithms (e.g., shortest path algorithms such as BFS or Dijkstra’s algorithm), optimization algorithms (such as genetic algorithms or gradient descent), and arithmetic tasks.

Definition 5 (Inductive tasks). Let Q be the question (input). We say that a task can be solved inductively when there is an induction function (or state transition function) g such that

s[1] = g(Q, ∅), s[2] = g(Q, s[1]), …, s[k] = g(Q, s[k−1]),

where s[1], …, s[k] are the steps (or states) that are computed inductively. For example, the steps/states could be an array or the state of an automaton that is being updated. Note that termination is determined by the state. In the context of Transformers, one can use the generation of the end-of-sequence token to terminate.
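For concreteness, here is a minimal sketch of Definition 5 as code, with a hypothetical induction function g and a termination predicate supplied by the task (for a Transformer, termination would instead be signalled by generating the end-of-sequence token):

```python
def solve_inductively(g, question, is_terminal, max_steps=1000):
    """Iterate the induction function g of Definition 5 until the state signals
    termination: s[1] = g(Q, None), s[i+1] = g(Q, s[i])."""
    state = g(question, None)            # s[1] = g(Q, ∅)
    steps = [state]
    while not is_terminal(state) and len(steps) < max_steps:
        state = g(question, state)       # s[i+1] = g(Q, s[i])
        steps.append(state)
    return steps

# Toy instance: the cycle task, where the state is the current DFS node.
edges = {"a": "x", "x": "q", "q": "a", "n": "y", "y": "t", "t": "n"}
Q = {"edges": edges, "source": "a", "dest": "t"}
g = lambda Q, s: Q["edges"][Q["source"] if s is None else s]
is_terminal = lambda s: s in (Q["source"], Q["dest"])
print(solve_inductively(g, Q, is_terminal))   # ['x', 'q', 'a']
```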

Inductive tasks with a fully educated scratchpad can overfit proofs. The fully educated scratchpad for an input question Q would be s[1];s[2];…;s[k]<EOS>, where the <EOS> token ends the generation. However, this method may not fully exploit the fact that each state is generated only from the last state by applying the same (set of) operation(s). In particular, s[k] typically attends to all of the previous states. Further, the model may not be able to increase the number of induction steps beyond what it has seen during training, as shown in Figure 4b for the cycle task.

Now we show that by using attention masking and reindexing the positions of the tokens, one can promote the desired ‘inductive’ behavior. We call this the inductive scratchpad. As three showcases, we demonstrate that the inductive scratchpad can improve OOD generalization on the cycle task and length generalization on parity and addition tasks.

Figure 5: Length generalization for parity and addition tasks using different random seeds. The medians of the results are highlighted in bold.

Inductive scratchpad implementation. The inductive scratchpad for an inductive task is similar in format to the fully educated scratchpad, with the following modifications. (1) Tokens: two new special tokens are used: a question-separator token, which separates the question from the intermediate states, and a state-separator token (denoted # hereafter), which separates consecutive states. Using these tokens, for an input question Q, the inductive scratchpad reads s[1]#s[2]#…#s[k]. (2) Generation: we want the model to follow the induction and thus ‘forget’ all previous states except the last one when producing the next state; i.e., the tokens of s[i+1] should be generated as if the input were Q s[i]#. This can be implemented with attention masking and reindexed positions (so that each step looks identical), or simply by removing the previous states at each step. (3) Training: when training with the scratchpad, we want the model to learn the induction function g, i.e., to output s[i+1]# from Q s[i]#, which can again be achieved with attention masking and position reindexing. As a result, the inductive scratchpad can be easily integrated into common language models without changing their behavior on other tasks/data. We refer to Appendix C.2 for a detailed description of the inductive scratchpad implementation.
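To illustrate points (2) and (3), the sketch below builds a boolean attention mask and reindexed positions for a sequence segmented into a question followed by states. It is only an assumption-level rendering of the idea (the exact masking/reindexing recipe of Appendix C.2 may differ): each state is reindexed to start right after the question, and its tokens may attend only to the question, the previous state, and earlier tokens of the same state.

```python
def inductive_mask_and_positions(q_len, state_lens):
    """Build a causal attention mask and reindexed positions for a sequence
    laid out as: question tokens, then the tokens of s[1], s[2], ..., s[k]."""
    # segment id per token: -1 for question tokens, i for the tokens of state s[i+1]
    seg = [-1] * q_len
    pos = list(range(q_len))
    for i, length in enumerate(state_lens):
        seg += [i] * length
        pos += [q_len + j for j in range(length)]   # every state restarts right after Q

    n = len(seg)
    allowed = [[False] * n for _ in range(n)]
    for t in range(n):                 # query (generating) token
        for u in range(t + 1):         # key token, causal
            if seg[t] == -1:
                allowed[t][u] = (seg[u] == -1)       # question tokens attend within the question
            else:
                # state tokens: the question, the previous state, and the same state
                allowed[t][u] = seg[u] == -1 or seg[u] in (seg[t] - 1, seg[t])
    return allowed, pos

# Question of 4 tokens followed by three states of 2 tokens each.
mask, positions = inductive_mask_and_positions(4, [2, 2, 2])
print(positions)      # [0, 1, 2, 3, 4, 5, 4, 5, 4, 5]
print(sum(mask[8]))   # 7: the question (4), the previous state (2), and itself (1)
```

With identical positions for every state and older states masked out, each induction step looks the same to the model, which is what allows it to run more steps at test time than it saw during training.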

Inductive scratchpad for the cycle task. The DFS scratchpad of the cycle task can be made inductive by making each step of the DFS algorithm a state. E.g., for the input a>x;n>y;q>a;t>n;y>t;x>q;a?t;, the DFS scratchpad is a>x>q>a;0, and the inductive scratchpad becomes a#x#q#a;0<EOS>, where each state tracks the current node in the DFS. In Figure 4b, we show that the inductive scratchpad for the cycle task can generalize to more reasoning steps than seen during training, and thus generalize OOD when the distance between the query nodes is increased.

Length generalization for parity and addition tasks. We can use inductive scratchpads to achieve length generalization for the parity and addition tasks. For parities, we insert random spaces between the bits, design an inductive scratchpad based on the positions of the bits, and then compute the parity iteratively. Using this scratchpad, we can train a Transformer on inputs of up to 30 bits and generalize to inputs of up to 50 or 55 bits. Results are provided in Figure 5a. For the addition task, we consider two inductive scratchpad formats. The first requires random spaces between the digits of the input and uses the positions of the digits to compute the addition digit by digit (similar to the parity scratchpad); with it, we can generalize to numbers with 20 digits while training on numbers with up to 10 digits. The second format uses random tokens in the input and computes the addition digit by digit by shifting the operands; it enables us to generalize from 4 to 26 digits at the cost of a less natural input format. The results for different seeds are provided in Figure 5b. See the details of these scratchpads in Appendices B.4 and B.5.[10] Also see Appendix A for a comparison with recent methods.
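As a rough illustration of the parity format, the sketch below scatters the bits with random spacing and records, at each step, the position of the bit just consumed together with the running parity; the actual format of Appendix B.4 may differ.

```python
import random

def parity_inductive_scratchpad(bits, max_gap=3):
    """Hypothetical generator in the spirit of the parity scratchpad of Appendix B.4
    (the exact format there may differ). Random runs of spaces are inserted between
    the bits, and each state records the position of the bit just consumed together
    with the running parity, so an induction step depends only on the previous state
    and a single input token."""
    question, positions = "", []
    for b in bits:
        question += " " * random.randint(0, max_gap)   # random spacing
        positions.append(len(question))                # position of this bit
        question += str(b)

    states, parity = [], 0
    for p, b in zip(positions, bits):
        parity ^= b
        states.append(f"{p},{parity}")                 # (bit position, running parity)
    return question + "?" + "#".join(states) + ";" + str(parity) + "<EOS>"

random.seed(0)
print(parity_inductive_scratchpad([1, 0, 1, 1]))
```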


:::info Authors:

(1) Emmanuel Abbe, Apple and EPFL;

(2) Samy Bengio, Apple;

(3) Aryo Lotfi, EPFL;

(4) Colin Sandon, EPFL;

(5) Omid Saremi, Apple.

:::


:::info This paper is available on arxiv under CC BY 4.0 license.

:::

[9] For the particular graph structure of the two cycles task, DFS is the same as the breadth-first search (BFS).

[10] Our code is also available at https://github.com/aryol/inductive-scratchpad.
