You’ll train a simple MLP on MNIST using TensorFlow Core plus DTensor in a data-parallel setup: create a one-dimensional mesh (“batch”), keep model weights replicated (DVariables), shard the global batch across devices via pack/repack, and run a standard loop with tf.GradientTape, custom Adam, and accuracy/loss metrics. The code shows how mesh/layout choices propagate through ops, how to write DTensor-aware layers, and how to evaluate/plot results. Saving is limited today: DTensor models must be fully replicated to export, and saved models lose DTensor annotations.

Data Parallel MNIST with DTensor and TensorFlow Core

2025/09/09 16:00

Content Overview

  • Introduction
  • Overview of data parallel training with DTensor
  • Setup
  • The MNIST Dataset
  • Preprocessing the data
  • Build the MLP
  • The dense layer
  • The MLP sequential model
  • Training metrics
  • Optimizer
  • Data packing
  • Training
  • Performance evaluation
  • Saving your model
  • Conclusion


Introduction

This notebook uses the TensorFlow Core low-level APIs and DTensor to demonstrate a data-parallel distributed training example.

Visit the Core APIs overview to learn more about TensorFlow Core and its intended use cases. Refer to the DTensor Overview guide and Distributed Training with DTensors tutorial to learn more about DTensor.

This example uses the same model and optimizer as those shown in the Multilayer Perceptrons tutorial. See this tutorial first to get comfortable with writing an end-to-end machine learning workflow with the Core APIs.


:::tip Note: DTensor is still an experimental TensorFlow API which means that its features are available for testing, and it is intended for use in test environments only.

:::


Overview of data parallel training with DTensor

Before building an MLP that supports distribution, take a moment to explore the fundamentals of DTensor for data parallel training.

DTensor allows you to run distributed training across devices to improve efficiency, reliability and scalability. DTensor distributes the program and tensors according to the sharding directives through a procedure called Single program, multiple data (SPMD) expansion. A variable of a DTensor aware layer is created as dtensor.DVariable, and the constructors of DTensor aware layer objects take additional Layout inputs in addition to the usual layer parameters.

The main ideas for data parallel training are as follows:

  • Model variables are replicated on N devices each.
  • A global batch is split into N per-replica batches.
  • Each per-replica batch is trained on the replica device.
  • The gradients are reduced before the weight update is collectively performed on all replicas.
  • Data parallel training provides nearly linear speedup with respect to the number of devices.
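The steps above can be checked with a few lines of plain Python. This is a framework-free sketch, not DTensor code: it assumes a hypothetical one-parameter model y = w * x with a mean squared error loss, and the names grad_mse and data_parallel_grad are made up for illustration. It splits a global batch into equal per-replica shards and confirms that averaging the per-replica gradients gives the same value as the gradient over the whole batch.

```python
def grad_mse(w, xs, ys):
    # Gradient of mean((w*x - y)^2) with respect to w.
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_replicas):
    # Split the global batch into equal per-replica shards.
    if len(xs) % num_replicas != 0:
        raise ValueError("global batch size must be divisible by num_replicas")
    shard = len(xs) // num_replicas
    local_grads = []
    for r in range(num_replicas):
        lo, hi = r * shard, (r + 1) * shard
        # Every replica holds the same w and sees only its own shard.
        local_grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # Stand-in for the collective reduction: average the per-replica gradients.
    return sum(local_grads) / num_replicas


xs = [float(i) for i in range(16)]
ys = [3.0 * x + 1.0 for x in xs]
global_grad = grad_mse(0.5, xs, ys)
reduced_grad = data_parallel_grad(0.5, xs, ys, num_replicas=8)
print(global_grad, reduced_grad)
```

The equality relies on every shard having the same size, which is one reason the global batch size should divide evenly by the number of replicas.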

Setup

DTensor is included in TensorFlow starting with the 2.9.0 release.

    #!pip install --quiet --upgrade --pre tensorflow

    import matplotlib
    from matplotlib import pyplot as plt
    # Preset Matplotlib figure sizes.
    matplotlib.rcParams['figure.figsize'] = [9, 6]

    import tensorflow as tf
    import tensorflow_datasets as tfds
    from tensorflow.experimental import dtensor
    print(tf.__version__)
    # Set random seed for reproducible results
    tf.random.set_seed(22)

    2.17.0

Configure 8 virtual CPUs for this experiment. DTensor can also be used with GPU or TPU devices. Given that this notebook uses virtual devices, the speedup gained from distributed training is not noticeable.

    def configure_virtual_cpus(ncpu):
      phy_devices = tf.config.list_physical_devices('CPU')
      tf.config.set_logical_device_configuration(phy_devices[0], [
            tf.config.LogicalDeviceConfiguration(),
        ] * ncpu)

    configure_virtual_cpus(8)

    DEVICES = [f'CPU:{i}' for i in range(8)]
    devices = tf.config.list_logical_devices('CPU')
    device_names = [d.name for d in devices]
    device_names

    ['/device:CPU:0',
     '/device:CPU:1',
     '/device:CPU:2',
     '/device:CPU:3',
     '/device:CPU:4',
     '/device:CPU:5',
     '/device:CPU:6',
     '/device:CPU:7']

The MNIST Dataset

The dataset is available from TensorFlow Datasets. Split the data into training and testing sets. Only use 5000 examples for training and testing to save time.

    train_data, test_data = tfds.load("mnist", split=['train[:5000]', 'test[:5000]'], batch_size=128, as_supervised=True)

Preprocessing the data

Preprocess the data by reshaping it to be 2-dimensional and by rescaling it to fit into the unit interval, [0,1].

    def preprocess(x, y):
      # Reshaping the data
      x = tf.reshape(x, shape=[-1, 784])
      # Rescaling the data
      x = x/255
      return x, y

    train_data, test_data = train_data.map(preprocess), test_data.map(preprocess)
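To make the two preprocessing steps concrete, here is a plain-Python sketch for a single image. It is only an illustration under stated assumptions: the synthetic image and the helper name preprocess_image are hypothetical, and the tutorial's preprocess function applies the same idea to a whole batch of tensors at once.

```python
def preprocess_image(image):
    # image: 28 rows of 28 integer pixel values in [0, 255].
    # Flatten row by row into one 784-element vector, then rescale to [0, 1].
    return [pixel / 255 for row in image for pixel in row]


# A synthetic 28x28 "image" whose pixel values cycle through 0..255.
image = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]
flat = preprocess_image(image)
print(len(flat), min(flat), max(flat))
```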

Build the MLP

Build an MLP model with DTensor aware layers.

The dense layer

Start by creating a dense layer module that supports DTensor. The dtensor.call_with_layout function can be used to call a function that takes in a DTensor input and produces a DTensor output. This is useful for initializing a DTensor variable, dtensor.DVariable, with a TensorFlow supported function.

    class DenseLayer(tf.Module):

      def __init__(self, in_dim, out_dim, weight_layout, activation=tf.identity):
        super().__init__()
        # Initialize dimensions and the activation function
        self.in_dim, self.out_dim = in_dim, out_dim
        self.activation = activation

        # Initialize the DTensor weights using the Xavier scheme
        uniform_initializer = tf.function(tf.random.stateless_uniform)
        xavier_lim = tf.sqrt(6.)/tf.sqrt(tf.cast(self.in_dim + self.out_dim, tf.float32))
        self.w = dtensor.DVariable(
          dtensor.call_with_layout(
              uniform_initializer, weight_layout,
              shape=(self.in_dim, self.out_dim), seed=(22, 23),
              minval=-xavier_lim, maxval=xavier_lim))

        # Initialize the bias with the zeros
        bias_layout = weight_layout.delete([0])
        self.b = dtensor.DVariable(
          dtensor.call_with_layout(tf.zeros, bias_layout, shape=[out_dim]))

      def __call__(self, x):
        # Compute the forward pass
        z = tf.add(tf.matmul(x, self.w), self.b)
        return self.activation(z)
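The Xavier bound used by the layer is sqrt(6) / sqrt(in_dim + out_dim). The following plain-Python sketch evaluates that bound for a 784 x 700 layer and draws a matrix within it. It is an illustration only: the helper names are hypothetical, and Python's random module stands in for tf.random.stateless_uniform, so the sampled values will not match the tutorial's.

```python
import math
import random


def xavier_limit(in_dim, out_dim):
    # Xavier (Glorot) uniform bound: sqrt(6 / (fan_in + fan_out)).
    return math.sqrt(6.0 / (in_dim + out_dim))


def xavier_uniform(in_dim, out_dim, seed=22):
    # Draw an in_dim x out_dim weight matrix uniformly from [-limit, limit].
    rng = random.Random(seed)
    limit = xavier_limit(in_dim, out_dim)
    return [[rng.uniform(-limit, limit) for _ in range(out_dim)]
            for _ in range(in_dim)]


limit = xavier_limit(784, 700)
weights = xavier_uniform(784, 700)
print(f"limit = {limit:.4f}")
```

Wider layers get a smaller bound, which keeps the scale of the activations roughly constant from layer to layer.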

The MLP sequential model

Now create an MLP module that executes the dense layers sequentially.

    class MLP(tf.Module):

      def __init__(self, layers):
        self.layers = layers

      def __call__(self, x, preds=False):
        # Execute the model's layers sequentially
        for layer in self.layers:
          x = layer(x)
        return x

Data-parallel training with DTensor is equivalent to using tf.distribute.MirroredStrategy: each device runs the same model on its own shard of the data batch. To set this up, you need the following:

  • dtensor.Mesh with a single "batch" dimension
  • dtensor.Layout for all the weights that replicates them across the mesh (using dtensor.UNSHARDED for each axis)
  • dtensor.Layout for the data that splits the batch dimension across the mesh
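One way to read these layouts is to ask how much of a tensor each device ends up holding. The sketch below works this out in plain Python. It is a simplified model rather than the DTensor implementation: UNSHARDED is a local stand-in for dtensor.UNSHARDED, the function local_shard_shape is hypothetical, and the sketch assumes every sharded axis divides evenly by its mesh dimension.

```python
UNSHARDED = "unsharded"  # stand-in for dtensor.UNSHARDED in this sketch


def local_shard_shape(global_shape, sharding_specs, mesh_dims):
    # mesh_dims maps a mesh dimension name to its size, e.g. {"batch": 8}.
    # sharding_specs names, per tensor axis, the mesh dimension that axis is
    # split over, or UNSHARDED if every device keeps the full axis.
    local = []
    for size, spec in zip(global_shape, sharding_specs):
        if spec == UNSHARDED:
            local.append(size)
        else:
            parts = mesh_dims[spec]
            if size % parts != 0:
                raise ValueError(f"axis of size {size} is not divisible by {parts}")
            local.append(size // parts)
    return tuple(local)


mesh_dims = {"batch": 8}
# Replicated weights: every device holds the full 784 x 700 matrix.
weight_shard = local_shard_shape((784, 700), [UNSHARDED, UNSHARDED], mesh_dims)
# Batch-sharded inputs: each device holds 128 / 8 = 16 rows.
input_shard = local_shard_shape((128, 784), ["batch", UNSHARDED], mesh_dims)
print(weight_shard, input_shard)
```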

Create a DTensor mesh that consists of a single batch dimension, where each device becomes a replica that receives a shard from the global batch. Use this mesh to instantiate an MLP model with the following architecture:

Forward Pass: ReLU(784 x 700) x ReLU(700 x 500) x Softmax(500 x 10)

    mesh = dtensor.create_mesh([("batch", 8)], devices=DEVICES)
    weight_layout = dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh)

    input_size = 784
    hidden_layer_1_size = 700
    hidden_layer_2_size = 500
    # Use a separate name for the 10-class output so that
    # hidden_layer_2_size is not overwritten
    output_size = 10

    mlp_model = MLP([
        DenseLayer(in_dim=input_size, out_dim=hidden_layer_1_size,
                   weight_layout=weight_layout,
                   activation=tf.nn.relu),
        DenseLayer(in_dim=hidden_layer_1_size, out_dim=hidden_layer_2_size,
                   weight_layout=weight_layout,
                   activation=tf.nn.relu),
        DenseLayer(in_dim=hidden_layer_2_size, out_dim=output_size,
                   weight_layout=weight_layout)])
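Because the weights are replicated, every device keeps a full copy of the model. As a rough sanity check, this plain-Python sketch counts the parameters of the 784 x 700, 700 x 500, 500 x 10 architecture described above and estimates the storage per replica. The helper dense_param_count is hypothetical, and the byte figures assume 4-byte float32 values and ignore optimizer slots.

```python
def dense_param_count(in_dim, out_dim):
    # One weight per (input, output) pair plus one bias per output.
    return in_dim * out_dim + out_dim


layer_dims = [(784, 700), (700, 500), (500, 10)]
total_params = sum(dense_param_count(i, o) for i, o in layer_dims)
bytes_per_replica = total_params * 4  # float32 values take 4 bytes each
num_replicas = 8

print(f"parameters per replica: {total_params}")
print(f"float32 bytes per replica: {bytes_per_replica}")
print(f"float32 bytes across all replicas: {bytes_per_replica * num_replicas}")
```

Replication trades memory for simplicity: no weight has to be communicated during the forward pass, and only the gradients are reduced across devices.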

Training metrics

Use the cross-entropy loss function and accuracy metric for training.

    def cross_entropy_loss(y_pred, y):
      # Compute cross entropy loss with a sparse operation
      sparse_ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=y_pred)
      return tf.reduce_mean(sparse_ce)

    def accuracy(y_pred, y):
      # Compute accuracy after extracting class predictions
      class_preds = tf.argmax(y_pred, axis=1)
      is_equal = tf.equal(y, class_preds)
      return tf.reduce_mean(tf.cast(is_equal, tf.float32))
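For readers who want to see the arithmetic behind these two metrics, here is a plain-Python sketch that works on lists of logits and integer labels. It is an approximation for illustration, not the TensorFlow kernels: the function names are hypothetical, and the loss is written with the log-sum-exp trick as one common way to compute it stably.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    # -log(softmax(logits)[label]), computed with the log-sum-exp trick.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mean_loss(batch_logits, labels):
    # Average the per-example losses over the batch.
    losses = [sparse_softmax_cross_entropy(z, y)
              for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def accuracy(batch_logits, labels):
    # Argmax over classes, then the fraction of predictions matching labels.
    preds = [max(range(len(z)), key=z.__getitem__) for z in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 3.0]]
labels = [0, 1, 1]
print(mean_loss(batch_logits, labels), accuracy(batch_logits, labels))
```

Note that the loss consumes raw logits, so the model's final layer does not need to apply a softmax itself.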

Optimizer

Using an optimizer can result in significantly faster convergence compared to standard gradient descent. The Adam optimizer is implemented below and has been configured to be compatible with DTensor. In order to use Keras optimizers with DTensor, refer to the experimental tf.keras.dtensor.experimental.optimizers module.

    class Adam(tf.Module):

      def __init__(self, model_vars, learning_rate=1e-3, beta_1=0.9, beta_2=0.999, ep=1e-7):
        # Initialize optimizer parameters and variable slots
        self.model_vars = model_vars
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.learning_rate = learning_rate
        self.ep = ep
        self.t = 1.
        self.v_dvar, self.s_dvar = [], []
        # Initialize optimizer variable slots
        for var in model_vars:
          v = dtensor.DVariable(dtensor.call_with_layout(tf.zeros, var.layout, shape=var.shape))
          s = dtensor.DVariable(dtensor.call_with_layout(tf.zeros, var.layout, shape=var.shape))
          self.v_dvar.append(v)
          self.s_dvar.append(s)

      def apply_gradients(self, grads):
        # Update the model variables given their gradients
        for i, (d_var, var) in enumerate(zip(grads, self.model_vars)):
          self.v_dvar[i].assign(self.beta_1*self.v_dvar[i] + (1-self.beta_1)*d_var)
          self.s_dvar[i].assign(self.beta_2*self.s_dvar[i] + (1-self.beta_2)*tf.square(d_var))
          v_dvar_bc = self.v_dvar[i]/(1-(self.beta_1**self.t))
          s_dvar_bc = self.s_dvar[i]/(1-(self.beta_2**self.t))
          var.assign_sub(self.learning_rate*(v_dvar_bc/(tf.sqrt(s_dvar_bc) + self.ep)))
        self.t += 1.
        return
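The update rule in apply_gradients can be followed on a single scalar. The sketch below repeats the same arithmetic in plain Python for a hypothetical objective f(w) = (w - 3)^2. The function adam_step and the toy objective are assumptions made for illustration. It shows the effect of bias correction: on the first step the update has a magnitude of roughly the learning rate, whatever the size of the gradient.

```python
import math


def adam_step(var, grad, v, s, t, learning_rate=1e-3,
              beta_1=0.9, beta_2=0.999, ep=1e-7):
    # Exponential moving averages of the gradient and the squared gradient.
    v = beta_1 * v + (1 - beta_1) * grad
    s = beta_2 * s + (1 - beta_2) * grad ** 2
    # Bias correction compensates for v and s starting at zero.
    v_bc = v / (1 - beta_1 ** t)
    s_bc = s / (1 - beta_2 ** t)
    var = var - learning_rate * v_bc / (math.sqrt(s_bc) + ep)
    return var, v, s


# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, v, s = 0.0, 0.0, 0.0
first_step = None
for t in range(1, 101):
    grad = 2.0 * (w - 3.0)
    new_w, v, s = adam_step(w, grad, v, s, t)
    if t == 1:
        first_step = new_w - w
    w = new_w
print(first_step, w)
```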

Data packing

Start by writing a helper function for transferring data to the device. This function should use dtensor.pack to send (and only send) the shard of the global batch that is intended for a replica to the device backing the replica. For simplicity, assume a single-client application.

Next, write a function that uses this helper function to pack the training data batches into DTensors sharded along the batch (first) axis. This ensures that DTensor evenly distributes the training data to the 'batch' mesh dimension. Note that in DTensor, the batch size always refers to the global batch size; therefore, the batch size should be chosen such that it can be divided evenly by the size of the batch mesh dimension. Additional DTensor APIs to simplify tf.data integration are planned, so please stay tuned.

    def repack_local_tensor(x, layout):
      # Repacks a local Tensor-like to a DTensor with layout
      # This function assumes a single-client application
      x = tf.convert_to_tensor(x)
      sharded_dims = []

      # For every sharded dimension, use tf.split to split the tensor along the dimension.
      # The result is a nested list of split-tensors in queue[0].
      queue = [x]
      for axis, dim in enumerate(layout.sharding_specs):
        if dim == dtensor.UNSHARDED:
          continue
        num_splits = layout.shape[axis]
        queue = tf.nest.map_structure(lambda x: tf.split(x, num_splits, axis=axis), queue)
        sharded_dims.append(dim)

      # Now you can build the list of component tensors by looking up the location in
      # the nested list of split-tensors created in queue[0].
      components = []
      for locations in layout.mesh.local_device_locations():
        t = queue[0]
        for dim in sharded_dims:
          split_index = locations[dim]  # Only valid on single-client mesh.
          t = t[split_index]
        components.append(t)

      return dtensor.pack(components, layout)

    def repack_batch(x, y, mesh):
      # Pack training data batches into DTensors along the batch axis
      x = repack_local_tensor(x, layout=dtensor.Layout(['batch', dtensor.UNSHARDED], mesh))
      y = repack_local_tensor(y, layout=dtensor.Layout(['batch'], mesh))
      return x, y
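The divisibility requirement is easy to check for this tutorial's numbers. The following plain-Python sketch is a simplified stand-in for the packing step, with a hypothetical helper named shard_batch. It splits a list of examples into equal contiguous shards and applies it to the batch sizes that 5000 examples and a batch size of 128 produce. The final, smaller batch has 8 examples, which still divides evenly across 8 devices.

```python
def shard_batch(batch, num_shards):
    # Split a global batch (a list of examples) into equal, contiguous shards.
    if len(batch) % num_shards != 0:
        raise ValueError(
            f"global batch of {len(batch)} is not divisible by {num_shards} shards")
    per_shard = len(batch) // num_shards
    return [batch[i * per_shard:(i + 1) * per_shard] for i in range(num_shards)]


num_examples, batch_size, num_devices = 5000, 128, 8
full_batches, last_batch = divmod(num_examples, batch_size)
print(f"{full_batches} full batches, final batch of {last_batch}")

# A full batch gives each device 16 examples; the final batch gives each one.
full_shards = shard_batch(list(range(batch_size)), num_devices)
last_shards = shard_batch(list(range(last_batch)), num_devices)
print(len(full_shards[0]), len(last_shards[0]))
```

With a different dataset size, the final batch might not divide evenly. In that case one option is to drop or pad the remainder before packing.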

Training

Write a traceable function that executes a single training step given a batch of data. This function does not require any special DTensor annotations. Also write a function that executes a test step and returns the appropriate performance metrics.

    @tf.function
    def train_step(model, x_batch, y_batch, loss, metric, optimizer):
      # Execute a single training step
      with tf.GradientTape() as tape:
        y_pred = model(x_batch)
        batch_loss = loss(y_pred, y_batch)
      # Compute gradients and update the model's parameters
      grads = tape.gradient(batch_loss, model.trainable_variables)
      optimizer.apply_gradients(grads)
      # Return batch loss and accuracy
      batch_acc = metric(y_pred, y_batch)
      return batch_loss, batch_acc

    @tf.function
    def test_step(model, x_batch, y_batch, loss, metric):
      # Execute a single testing step
      y_pred = model(x_batch)
      batch_loss = loss(y_pred, y_batch)
      batch_acc = metric(y_pred, y_batch)
      return batch_loss, batch_acc

Now, train the MLP model for 3 epochs with a batch size of 128.

    # Initialize the training loop parameters and structures
    epochs = 3
    batch_size = 128
    train_losses, test_losses = [], []
    train_accs, test_accs = [], []
    optimizer = Adam(mlp_model.trainable_variables)

    # Format training loop
    for epoch in range(epochs):
      batch_losses_train, batch_accs_train = [], []
      batch_losses_test, batch_accs_test = [], []

      # Iterate through training data
      for x_batch, y_batch in train_data:
        x_batch, y_batch = repack_batch(x_batch, y_batch, mesh)
        batch_loss, batch_acc = train_step(mlp_model, x_batch, y_batch, cross_entropy_loss, accuracy, optimizer)
        # Keep track of batch-level training performance
        batch_losses_train.append(batch_loss)
        batch_accs_train.append(batch_acc)

      # Iterate through testing data
      for x_batch, y_batch in test_data:
        x_batch, y_batch = repack_batch(x_batch, y_batch, mesh)
        batch_loss, batch_acc = test_step(mlp_model, x_batch, y_batch, cross_entropy_loss, accuracy)
        # Keep track of batch-level testing
        batch_losses_test.append(batch_loss)
        batch_accs_test.append(batch_acc)

      # Keep track of epoch-level model performance
      train_loss, train_acc = tf.reduce_mean(batch_losses_train), tf.reduce_mean(batch_accs_train)
      test_loss, test_acc = tf.reduce_mean(batch_losses_test), tf.reduce_mean(batch_accs_test)
      train_losses.append(train_loss)
      train_accs.append(train_acc)
      test_losses.append(test_loss)
      test_accs.append(test_acc)
      print(f"Epoch: {epoch}")
      print(f"Training loss: {train_loss.numpy():.3f}, Training accuracy: {train_acc.numpy():.3f}")
      print(f"Testing loss: {test_loss.numpy():.3f}, Testing accuracy: {test_acc.numpy():.3f}")

    Epoch: 0
    Training loss: 1.850, Training accuracy: 0.343
    Testing loss: 1.375, Testing accuracy: 0.504
    Epoch: 1
    Training loss: 1.028, Training accuracy: 0.674
    Testing loss: 0.744, Testing accuracy: 0.782
    Epoch: 2
    Training loss: 0.578, Training accuracy: 0.839
    Testing loss: 0.486, Testing accuracy: 0.869

Performance evaluation

Start by writing a plotting function to visualize the model's loss and accuracy during training.

    def plot_metrics(train_metric, test_metric, metric_type):
      # Visualize metrics vs training Epochs
      plt.figure()
      plt.plot(range(len(train_metric)), train_metric, label = f"Training {metric_type}")
      plt.plot(range(len(test_metric)), test_metric, label = f"Testing {metric_type}")
      plt.xlabel("Epochs")
      plt.ylabel(metric_type)
      plt.legend()
      plt.title(f"{metric_type} vs Training Epochs");

    plot_metrics(train_losses, test_losses, "Cross entropy loss")

    plot_metrics(train_accs, test_accs, "Accuracy")


Saving your model

The integration of tf.saved_model and DTensor is still under development. As of TensorFlow 2.9.0, tf.saved_model only accepts DTensor models with fully replicated variables. As a workaround, you can convert a DTensor model to a fully replicated one by reloading a checkpoint. However, after a model is saved, all DTensor annotations are lost and the saved signatures can only be used with regular Tensors. This tutorial will be updated to showcase the integration once it is solidified.

Conclusion

This notebook provided an overview of distributed training with DTensor and the TensorFlow Core APIs. Here are a few more tips that may help:

  • The TensorFlow Core APIs can be used to build highly-configurable machine learning workflows with support for distributed training.
  • The DTensor concepts guide and Distributed training with DTensors tutorial contain the most up-to-date information about DTensor and its integrations.

For more examples of using the TensorFlow Core APIs, check out the guide. If you want to learn more about loading and preparing data, see the tutorials on image data loading or CSV data loading.


:::info Originally published on the TensorFlow website, this article appears here under a new headline and is licensed under CC BY 4.0. Code samples shared under the Apache 2.0 License.

:::

\

Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact service@support.mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.
Share Insights

You May Also Like

Ethereum unveils roadmap focusing on scaling, interoperability, and security at Japan Dev Conference

Ethereum unveils roadmap focusing on scaling, interoperability, and security at Japan Dev Conference

The post Ethereum unveils roadmap focusing on scaling, interoperability, and security at Japan Dev Conference appeared on BitcoinEthereumNews.com. Key Takeaways Ethereum’s new roadmap was presented by Vitalik Buterin at the Japan Dev Conference. Short-term priorities include Layer 1 scaling and raising gas limits to enhance transaction throughput. Vitalik Buterin presented Ethereum’s development roadmap at the Japan Dev Conference today, outlining the blockchain platform’s priorities across multiple timeframes. The short-term goals focus on scaling solutions and increasing Layer 1 gas limits to improve transaction capacity. Mid-term objectives target enhanced cross-Layer 2 interoperability and faster network responsiveness to create a more seamless user experience across different scaling solutions. The long-term vision emphasizes building a secure, simple, quantum-resistant, and formally verified minimalist Ethereum network. This approach aims to future-proof the platform against emerging technological threats while maintaining its core functionality. The roadmap presentation comes as Ethereum continues to compete with other blockchain platforms for market share in the smart contract and decentralized application space. Source: https://cryptobriefing.com/ethereum-roadmap-scaling-interoperability-security-japan/
Share
BitcoinEthereumNews2025/09/18 00:25
Understanding Bitcoin Mining Through the Lens of Dutch Disease

Understanding Bitcoin Mining Through the Lens of Dutch Disease

There’s a paradox at the heart of modern economics: sometimes, discovering a valuable resource can make a country poorer. It sounds impossible — how can sudden wealth lead to economic decline? Yet this pattern has repeated across decades and continents, from the Netherlands’ natural gas boom in the 1960s to oil discoveries in numerous developing countries. Economists have a name for this phenomenon: Dutch Disease. Today, as Bitcoin Mining operations establish themselves in regions around the world, attracted by cheap resources. With electricity and favorable regulations, economists are asking an intriguing question: Does cryptocurrency mining share enough characteristics with traditional resource booms to trigger similar economic distortions? Or is this digital industry different enough to avoid the pitfalls that have plagued oil-rich and gas-rich nations? The Kazakhstan Case Study In 2021, Kazakhstan became a global Bitcoin mining hub after China’s cryptocurrency ban. Within months, mining operations consumed nearly 8% of the nation’s electricity. The initial windfall — investment, jobs, tax revenue — quickly turned to crisis. By early 2022, the country faced rolling blackouts, surging energy costs for manufacturers, and public protests. The government imposed strict mining limits, but damage to traditional industries was already done. This pattern has a name: Dutch Disease. Understanding Dutch Disease Dutch Disease describes how sudden resource wealth can paradoxically weaken an economy. The term comes from the Netherlands’ experience after discovering North Sea gas in 1959. Despite the windfall, the Dutch economy suffered as the booming gas sector drove up wages and currency values, making traditional manufacturing uncompetitive. The mechanisms were interconnected: Foreign buyers needed Dutch guilders to purchase gas, strengthening the currency and making Dutch exports expensive. 
The gas sector bid up wages, forcing manufacturers to raise pay while competing in global markets where they couldn’t pass those costs along. The most talented workers and infrastructure investment flowed to gas extraction rather than diverse economic activities.

When gas prices eventually fell in the 1980s, the Netherlands found itself with a hollowed-out industrial base: wealthier in raw terms but economically weaker. The textile factories had closed. Manufacturing expertise had evaporated. The younger generation possessed skills in gas extraction but limited training in other industries.

This pattern has repeated globally. Nigeria’s oil discovery devastated its agricultural sector. Venezuela’s resource wealth correlates with chronic economic instability. The phenomenon is so familiar that economists call it the “resource curse”, the observation that countries with abundant natural resources often perform worse economically than countries without them.

Bitcoin mining creates similar dynamics. Mining operations are essentially warehouses of specialized computers solving mathematical puzzles to earn bitcoin rewards (currently worth over $200,000 per block). The catch is massive electricity consumption: a single facility can consume as much power as a small city, creating economic pressures comparable to those of traditional resource booms.

How Mining Crowds Out Other Industries

Dutch Disease operates through four interconnected channels:

Resource Competition: Mining operations consume massive amounts of electricity at preferential rates, leaving less capacity for factories, data centers, and residential users. In constrained power grids, this creates a zero-sum competition in which mining’s profitability directly undermines other industries. Textile manufacturers in El Salvador reported a 40% increase in electricity costs within a year of nearby mining operations, costs that made global competitiveness untenable.

Price Inflation: Mining operators bidding aggressively for electricity, real estate, technical labor, and infrastructure drive up input costs across regional economies. Small and medium enterprises operating on thin margins are particularly vulnerable to these shocks.

Talent Reallocation: High mining wages draw skilled electricians, engineers, and technicians from traditional sectors. Universities report declining enrollment in manufacturing engineering as students pivot toward cryptocurrency specializations, skills that may prove narrow if mining operations relocate or profitability collapses.

Infrastructure Lock-In: Grid capacity, cooling systems, and telecommunications networks optimized for mining rather than diversified development make regions increasingly dependent on a single volatile industry. This specialization makes economic diversification progressively more difficult and expensive.

Where Vulnerability Is Highest

The risk of mining-induced Dutch Disease depends on several structural factors:

Small, undiversified economies face the most significant risk. When mining represents 5–10% of GDP or electricity consumption, it can dominate economic outcomes. El Salvador’s embrace of Bitcoin and Central Asian republics with significant mining operations exemplify this concentration risk.

Subsidized energy creates perverse incentives. When governments provide electricity at a loss, mining operations enjoy artificial profitability that attracts excessive investment, intensifying Dutch Disease dynamics. The disconnect between private returns and social costs ensures mining expands beyond economically efficient levels.

Weak governance limits effective responses. Without robust monitoring, transparent pricing, or enforceable frameworks, governments struggle to course-correct even when distortions become apparent.

Rapid, unplanned growth creates an immediate crisis.
When operations scale faster than infrastructure can accommodate, the result is blackouts, equipment damage, and cascading economic disruptions.

Why Bitcoin Mining Differs from Traditional Resource Curses

Several distinctions suggest mining-induced distortions may be more manageable than historical resource curses:

Operational Mobility: Unlike oil fields, mining facilities can relocate relatively quickly. When China banned mining in 2021, operators moved to Kazakhstan, the U.S., and elsewhere within months. This mobility creates different dynamics: governments have leverage through regulation and pricing, but also face competition. The threat of exit disciplines both miners and regulators, potentially yielding more efficient outcomes than traditional resource sectors, where geographic necessity reduces flexibility.

No Currency Appreciation: Classical Dutch Disease devastated manufacturing due to currency appreciation. Bitcoin mining doesn’t trigger this mechanism: mining revenues are traded globally and typically converted offshore, avoiding the local currency effects that made Dutch products uncompetitive in the 1960s. Export-oriented manufacturing can remain price-competitive if direct resource competition and input costs are managed.

Profitability Volatility: Mining economics are extraordinarily sensitive to Bitcoin prices, network difficulty, and energy costs. When Bitcoin fell from $65,000 to under $20,000 in 2022, many operations became unprofitable and shut down rapidly. This boom-bust cycle, while disruptive, prevents the permanent structural transformation characterizing oil-dependent economies. Resources get released back to the broader economy during busts.

Repurposable Infrastructure: Mining facilities can be repurposed as regular data centers. Electrical infrastructure serves other industrial uses. Telecommunications upgrades benefit diverse businesses.
Unlike exhausted oil fields requiring environmental cleanup, mining infrastructure can support cloud computing, AI research, or other digital economy activities, creating potential for positive spillovers.

Managing the Risk: Three Approaches

Bitcoin stakeholders and host regions should consider three strategies to capture benefits while mitigating Dutch Disease risks:

Dynamic Energy Pricing: Moving from fixed, subsidized rates toward pricing that reflects actual resource scarcity and opportunity costs. Iceland and Nordic countries have implemented time-of-use pricing and interruptible contracts that allow mining during off-peak periods while preserving capacity for critical uses during demand surges. Transparent, rule-based pricing formulas that adjust for baseline generation costs, grid congestion during peak periods, and environmental externalities let mining flourish when economically appropriate while automatically constraining it during resource competition.

The challenge is political: subsidized electricity often exists for good reasons, including supporting industrial development and helping low-income residents. But allowing below-cost electricity to attract mining operations that may harm more than help represents a false economy. Different jurisdictions are finding different balances: some embrace market-based pricing, others maintain subsidies while restricting mining access, and some ban mining outright.

Concentration Limits: Formal constraints on mining’s share of regional electricity and economic activity can prevent dominance. Norway has experimented with caps limiting mining to specific percentages of regional power capacity. The logic is straightforward: if mining represents 10–15% of electricity use, it’s significant but doesn’t dominate. If it reaches 40–50%, Dutch Disease risks become severe.

These caps create certainty for all stakeholders. Miners understand expansion parameters. Other industries know they won’t be entirely squeezed out.
Grid operators can plan with more explicit constraints. The challenge lies in determining appropriate thresholds: too low forgoes legitimate opportunity, too high fails to prevent problems. Smaller, less diversified economies warrant more conservative limits than larger, more robust ones.

Multi-Purpose Infrastructure: Rather than specializing exclusively in mining, strategic planning should ensure investments serve broader purposes. Grid expansion benefiting diverse industrial users, telecommunications targeting rural connectivity alongside mining needs, and workforce programs emphasizing transferable skills (data center operations, electrical systems management, cybersecurity) can treat mining as a bridge industry, justifying infrastructure that enables broader digital economy development.

Singapore’s evolution from an oil-refining hub to a diversified financial and technology center provides a valuable template: leverage the initial high-value industry to build capabilities that support economic complexity, rather than becoming path-dependent on a single volatile sector. Some regions are applying this thinking to Bitcoin mining, asking what infrastructure serves mining today but could enable cloud computing, AI research, or other digital activities tomorrow.

Conclusion

The parallels between Bitcoin mining and Dutch Disease are significant: sudden, high-value activity that crowds out traditional industries through resource competition, price inflation, talent reallocation, and infrastructure specialization. Kazakhstan’s 2021–2022 experience demonstrates this pattern can unfold rapidly. Yet essential differences exist. Mining’s mobility, currency neutrality, profitability volatility, and repurposable infrastructure create policy opportunities unavailable to governments confronting traditional resource curses.
The question isn’t whether mining causes economic distortion (in some contexts it clearly has) but whether stakeholders will act to channel this activity toward sustainable development.

For the Bitcoin community, this means recognizing that long-term industry viability depends on avoiding the resource curse pattern. Regions devastated by boom-bust cycles will ultimately restrict or ban mining regardless of short-term benefits. Sustainable growth requires accepting pricing that reflects actual costs, respecting concentration limits, and contributing to infrastructure that serves broader economic purposes.

For host regions, the challenge is capturing mining’s benefits without sacrificing economic diversity. History shows resource booms that seem profitable in the moment often weaken economies in the long run. The key is recognizing risks during the boom, when everything seems positive and there’s pressure to embrace the opportunity uncritically, rather than waiting until damage becomes undeniable.

The next decade will determine whether Bitcoin mining becomes a cautionary tale of resource misallocation or a case study in integrating volatile, technology-intensive industries into developing economies without triggering historical pathologies. The outcome depends not on the technology itself, but on whether humans shaping investment and policy decisions learn from history’s repeated lessons about how sudden wealth can become an economic curse.

Understanding Bitcoin Mining Through the Lens of Dutch Disease was originally published in Coinmonks on Medium.
Medium, 2025/11/05 13:53