NeuroMesh: A Distributed Training Network for
Large AI Models

NeuroMesh Team
[email protected]


Website: nmesh.io

World’s Largest AI Model, Trained by Everyone, for Everyone

Abstract

This paper introduces NeuroMesh, an innovative distributed training protocol designed to organize the underutilized computing power of devices worldwide for the development of large AI models. Faced with the monopolization of AI research and development by a few large corporations, NeuroMesh proposes a decentralized approach that democratizes access to large AI model training. By leveraging Predictive Coding Networks (PCNs) instead of the conventional Backpropagation (BP) algorithm, NeuroMesh achieves efficient parallelization, asynchronous learning, and low network costs, enabling scalable distributed training.

The protocol aims to accelerate AI advancements by aggregating global computational power and addresses the equitable distribution of AI’s benefits. By establishing a framework that encourages the contribution and collaboration of computational resources worldwide, NeuroMesh aspires to democratize large AI model research and mark a pivotal shift towards a more accessible and equitable AI research and development landscape. Moreover, our research team will also use the protocol to train the largest AI models seen to date.

1 Introduction

In the current landscape, the development and deployment of large AI models are controlled by a handful of large corporations. This monopolization risks stifling diversity in research and development, as smaller entities may find it difficult to access essential tools and data, potentially limiting the range of AI applications explored. Furthermore, it raises concerns about the equitable distribution of AI’s benefits, as the monopoly could hinder broader societal participation and contribution to AI advancements.

Additionally, there exists an extensive reservoir of idle computing power in personal computing devices across the globe, yet the current technological landscape lacks an effective mechanism to organize this idle power for training large AI models. The absence of a platform that can efficiently organize and deploy these resources toward productive computational tasks is a critical bottleneck: it not only prevents distributed computing power from being leveraged, but also restricts the scalability and development of AI models that require substantial computational resources.

In light of these challenges, our objective is to mobilize this spare computing power toward the development of the largest AI model the world has yet seen. We will build a protocol that allows anyone around the world to train large AI models, democratizing access to high-level computational capabilities and advancing the field of artificial intelligence.

The currently predominant training algorithm, Backpropagation (BP), incurs large network costs and cannot be distributed efficiently. We propose a novel distributed training network based on Predictive Coding Networks (PCNs), a cutting-edge, biologically inspired class of models that can be fully parallelized, support arbitrary structural topologies, and allow asynchronous learning. These properties allow us to efficiently distribute and decentralize the training network, removing the need for GPU clusters. Based on prior academic research by team members and early results from the NeuroMesh research and development team, we have shown that PCNs can achieve results comparable to BP.

In addition to its technical advancements, NeuroMesh innovates through a training protocol that allows everyone, everywhere, to train AI models on NeuroMesh’s distributed training infrastructure without having to manage the underlying distribution of the computation. This will accelerate AI research and break the existing training monopoly held by large corporations.

In Section 2, we dive into the technical details of PCNs, analyzing their computational and network costs; most of these research results were published by NeuroMesh team members or advisors. In Section 3, we show how the properties of PCNs allow us to distribute training. Then, in Section 4, we propose the NeuroMesh protocol, which is structured around integrating blockchain technology with distributed training. Finally, we conclude the paper in Section 5.

2 BP vs PCN

In this section, we cover the technical details of both Backpropagation (BP) and Predictive Coding Networks (PCNs). Please note that this section is more specialized and technical; readers may skip directly to Section 3.

2.1 Backpropagation


Figure 1: An illustration of a multi-layer feed-forward neural network in BP.

Backpropagation [13] is the backbone of nearly all modern neural networks. To illustrate how it works, let us consider a simple neural network with 1 input layer, N-1 intermediate hidden layers, and 1 output layer. Let \(\overline {D}, \overline {L}\) be a training input and label pair from the training dataset \(\mathcal {D}\). Let the input layer be \(\overline {X}_0\), the output layer be \(\overline {X}_N\), and the intermediate layers be \(\overline {X}_1, \overline {X}_2, ..., \overline {X}_{N-1}\). The dimension of layer \(\overline {X}_d\) is \(n_d\). Note that in fully connected networks a bias term is usually added to each layer; for simplicity, we assume it is absent. The connections between \(\overline {X}_{d-1}\) and \(\overline {X}_{d}\) are represented by a matrix \(\boldsymbol {W}_d\). Finally, we use \(\sigma \) to denote the non-linear activation function and \(\ell \) to denote the loss function to minimize. Such a network is illustrated in Figure 1.

There are two steps during the learning phase [9, Chapter 2]: the forward pass and the backward pass. In the forward pass, we compute the loss function \(\ell (\overline {L}, \overline {X}_N)\) between the output of the network \(\overline {X}_N\) and the training label \(\overline {L}\). To compute the output \(\overline {X}_N\) we first need to compute the intermediate layers, with \(\overline {X}_{d} = \sigma (\overline {Z}_d)\) and \(\overline {Z}_d=\boldsymbol {W}_d\overline {X}_{d-1}\). Therefore, \begin {equation} \label {eq:2:bq_loss} \ell (\overline {L}, \overline {X}_N) = \ell (\overline {L}, \sigma (\boldsymbol {W}_N\sigma (\boldsymbol {W}_{N-1}\sigma (...\boldsymbol {W}_1\overline {X}_0)))) \end {equation}

Note that the \(\overline {X}_d\) are referred to as the network’s activations. After the loss is computed, we differentiate it with respect to the weights, obtaining the gradient of \(\ell \) with respect to \(\boldsymbol {W}_d\), denoted \(\nabla _{\boldsymbol {W}_d} \ell \). Since we want to minimize the loss function, we move the weights in the negative direction of the gradient. Let \(t\) be the current time step; then \(\boldsymbol {W}^{t+1}_d=\boldsymbol {W}^{t}_d - \alpha \nabla _{\boldsymbol {W}_d}\ell \), where \(\alpha \) is the learning rate.

To compute \(\nabla _{\boldsymbol {W}_d} \ell \) for \(d\in \{1, ..., N\}\), we apply the chain rule. We first compute the auxiliary quantity \(\delta _d\), which is defined recursively: \begin {equation} \delta _d = \begin {cases} \frac {\partial \ell }{\partial \overline {Z}_N} = \sigma '(\overline {Z}_N) \odot \frac {\partial \ell }{\partial \overline {X}_N}, & \text {if}\ d=N, \\ \sigma '(\overline {Z}_d) \odot (\boldsymbol {W}^T_{d+1} \delta _{d+1}), & \text {if}\ d=1,...,N-1, \end {cases} \label {eq:2:bp_gradient} \end {equation} where \(\odot \) denotes element-wise multiplication.

Then, \begin {equation} \nabla _{\boldsymbol {W}_d} \ell =\frac {\partial \ell }{\partial \boldsymbol {W}_d} = \delta _d \overline {X}_{d-1}^T. \end {equation}

This backward flow of information is called the backward pass, which gives the method its name, Backpropagation. A whole pass of this algorithm over the entire dataset \(\mathcal {D}\) is called an epoch. The model is trained for as many epochs \(E\) as necessary, until the loss converges to a minimum.
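To make the forward and backward passes concrete, below is a minimal NumPy sketch of one training step for a small fully connected network. The layer sizes, the tanh activation, the squared-error loss, and the learning rate are illustrative choices for this sketch, not prescriptions from the paper.

    import numpy as np

    def sigma(z):                                # activation function
        return np.tanh(z)

    def sigma_prime(z):                          # its derivative
        return 1.0 - np.tanh(z) ** 2

    # Illustrative sizes: n_0 = 4 inputs, two hidden layers, n_N = 2 outputs.
    sizes = [4, 8, 8, 2]
    rng = np.random.default_rng(0)
    W = [0.1 * rng.standard_normal((sizes[d], sizes[d - 1])) for d in range(1, len(sizes))]

    def forward(x0):
        """Forward pass: returns pre-activations Z_d and activations X_d = sigma(Z_d)."""
        Z, X = [None], [x0]                      # Z[0] is unused; X[0] is the input layer
        for W_d in W:
            Z.append(W_d @ X[-1])                # Z_d = W_d X_{d-1}
            X.append(sigma(Z[-1]))               # X_d = sigma(Z_d)
        return Z, X

    def backward(Z, X, label, lr=0.01):
        """Backward pass for the squared-error loss l = 0.5 * ||X_N - L||^2."""
        N = len(W)
        delta = sigma_prime(Z[N]) * (X[N] - label)                    # delta_N
        for d in range(N, 0, -1):
            grad = np.outer(delta, X[d - 1])                          # dl/dW_d = delta_d X_{d-1}^T
            if d > 1:
                delta = sigma_prime(Z[d - 1]) * (W[d - 1].T @ delta)  # delta_{d-1}
            W[d - 1] -= lr * grad                                     # gradient-descent update of W_d

    Z, X = forward(rng.standard_normal(4))
    backward(Z, X, label=np.array([1.0, 0.0]))

Note how the delta recursion forces the update of every weight matrix to wait for information propagated back from the output layer; this is the non-locality that the next sections contrast with PCNs.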

After the network is trained, we want to use it to predict the output given a new input. To predict using the network we feed forward the new data \(\overline {D}^{new}\) to compute \(\overline {X}_N^{new}\), the prediction of the network. We often apply this process to existing labeled data to evaluate the model’s performance.

Because the weight updates depend on the global loss function, the update of a weight between two neurons depends on information computed at the output layer, which is not local. For this reason, BP is not considered biologically plausible: in the human brain, synaptic weight updates depend only on the activity of adjacent neurons, a principle known as Hebbian plasticity.

2.2 Predictive Coding Networks


Figure 2: An illustration of a multi-layer feed-forward neural network in PCN.

Now, let us explain PCNs using the notation established for BP. In PCNs, the layers are formed by value nodes, each containing a value \(z_{d, i}\); we use \(\overline {Z}_d\) to refer to all value nodes within the same layer. Note that in PCNs these value nodes are also treated as parameters. Each value node has an associated error node \(\varepsilon _{d, i}\); error nodes within the same layer are denoted \(\overline {\varepsilon }_d\), where the index \(i\) indicates the position in the layer. The difference with BP is that, instead of propagating neuron activity from the output layer back to the input layer via the gradient of a global loss function, PCNs propagate neuron activity via the error nodes.

Let \(\overline {\mu }_d = \boldsymbol {W}_d \overline {X}_{d-1} = \boldsymbol {W}_d\sigma (\overline {Z}_{d-1})\); then the newly introduced error nodes can be computed as \(\overline {\varepsilon }_d = \overline {Z}_d - \overline {\mu }_d\). These error nodes represent the difference between the actual and predicted value of a neuron. In PCN, we want to minimize the following energy function [2] \begin {equation} \label {eq:energy_function} \mathcal {F} = \frac {1}{2}\sum _{i,d}(\varepsilon _{d, i})^2, \end {equation} which is essentially the sum of squared errors over all error nodes. Minimizing it also means that \(\overline {\mu }_d\) will move closer to \(\overline {Z}_d\) over time.

Following [20], the update used for optimizing the value nodes is: \begin {equation} \label {eq:optimization_x} \Delta z_{d, i} = \begin {cases} 0, & \text {if}\ d=0, \\ \gamma \cdot (-\varepsilon _{d, i} + \sigma '(z_{d,i})\sum ^{n_{d+1}}_{k=1}\varepsilon _{d+1, k} w_{d+1, k, i}), & \text {if}\ 0<d<N, \\ 0, & \text {if}\ d=N \text { during training}, \\ \gamma \cdot (-\varepsilon _{d,i}), & \text {if}\ d=N \text { during prediction}, \end {cases} \end {equation} where \(\gamma \) is the value-node learning rate. Then, the weights are updated to minimize the same objective function \(\mathcal {F}\) as follows: \begin {equation} \label {eq:2:pcn_weight_update} \Delta w_{d, i,j} = -\alpha \cdot \frac {\partial \mathcal {F}}{\partial w_{d,i,j}} = \alpha \cdot \varepsilon _{d, i}\sigma (z_{d-1, j}), \end {equation} where \(\alpha \) is the weight learning rate. In contrast with BP, it is important to remark that updating the weights and value nodes of the current layer only requires information from adjacent layers; for the weights of layer \(d\), these are the errors \(\varepsilon _{d, i}\) and the value nodes \(\sigma (z_{d-1, j})\), \(\forall i,j\). There is no need to wait for a loss to backpropagate from the output layer.
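As a concrete reading of Eqs. (5) and (6), the sketch below updates the value nodes and the incoming weights of a single intermediate layer d, touching only quantities owned by layer d and its two neighbours. The tanh activation, toy dimensions, and learning rates are assumptions made for this example, not part of the method’s specification.

    import numpy as np

    sigma = np.tanh
    sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

    def update_layer(z_d, z_dm1, eps_d, eps_dp1, W_d, W_dp1, gamma=0.1, alpha=0.01):
        """Update the value nodes of layer d (Eq. 5) and the weights W_d (Eq. 6).

        Everything required is local: the layer's own state (z_d, eps_d, W_d) and
        the adjacent layers' quantities (z_{d-1}, eps_{d+1}, W_{d+1}).
        """
        z_d_new = z_d + gamma * (-eps_d + sigma_prime(z_d) * (W_dp1.T @ eps_dp1))
        W_d_new = W_d + alpha * np.outer(eps_d, sigma(z_dm1))
        return z_d_new, W_d_new

    # Toy dimensions: layer d-1 has 5 units, layer d has 4, layer d+1 has 3.
    rng = np.random.default_rng(0)
    z_dm1, z_d = rng.standard_normal(5), rng.standard_normal(4)
    W_d, W_dp1 = rng.standard_normal((4, 5)), rng.standard_normal((3, 4))
    eps_d, eps_dp1 = rng.standard_normal(4), rng.standard_normal(3)
    z_d, W_d = update_layer(z_d, z_dm1, eps_d, eps_dp1, W_d, W_dp1)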

2.3 Inference Learning (IL) and its development

Inference Learning (IL) [8]

Multiple algorithms can be used to train PCNs. We start with the simplest one, called inference learning. Here, we explain how to train the network for one training data entry \((\overline {D}, \overline {L})\).

During the training phase, the input (layer \(0\)) and output (layer \(N\)) value nodes are fixed to \(\overline {D}\) and \(\overline {L}\), respectively. Then, the other value nodes \(z_{d, i}\) are modified to minimize the overall energy \(\mathcal {F}\); this is the inference step. When the inference converges (in practice, the number of iterations is set to \(T\)), the weights are updated once to minimize the energy function \(\mathcal {F}\), following Eq. (6). Inference and weight update together, applied to all the input data, constitute a single training epoch. As in BP, this is repeated for as many epochs as necessary.

During the prediction phase, only the input is fixed to \(\overline {D}\), and the error nodes are optimized, decaying to 0 as \(t\rightarrow \infty \). The output layer is not clamped, so the output \(\overline {X}_N\) of the network is free to be updated. See Algorithm 1 for the IL training algorithm.
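Below is a compact sketch of this schedule (clamp the input and output, run \(T\) inference iterations, then apply one weight update), together with the prediction phase in which only the input is clamped. Network sizes, the activation, the learning rates, the value of \(T\), and the convention of reading the prediction directly from \(\overline {Z}_N\) are placeholder choices for illustration; Algorithm 1 and [8] remain the authoritative description.

    import numpy as np

    sigma = np.tanh
    sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

    def errors(Z, W):
        """eps_d = Z_d - mu_d with mu_d = W_d sigma(Z_{d-1}), for d = 1..N."""
        N = len(Z) - 1
        return [None] + [Z[d] - W[d] @ sigma(Z[d - 1]) for d in range(1, N + 1)]

    def il_train_step(Z, W, x, label, T=20, gamma=0.1, alpha=0.01):
        """One IL step for a single (D, L) pair: T inference iterations, then one weight update."""
        N = len(Z) - 1
        Z[0], Z[N] = x, label                                # clamp input and output layers
        for _ in range(T):                                   # inference: relax the hidden value nodes
            eps = errors(Z, W)
            for d in range(1, N):
                Z[d] = Z[d] + gamma * (-eps[d] + sigma_prime(Z[d]) * (W[d + 1].T @ eps[d + 1]))
        eps = errors(Z, W)
        for d in range(1, N + 1):                            # single weight update minimizing F
            W[d] = W[d] + alpha * np.outer(eps[d], sigma(Z[d - 1]))
        return Z, W

    def il_predict(Z, W, x, T=100, gamma=0.1):
        """Prediction: only the input is clamped; the output layer relaxes freely."""
        N = len(Z) - 1
        Z[0] = x
        for _ in range(T):
            eps = errors(Z, W)
            for d in range(1, N):
                Z[d] = Z[d] + gamma * (-eps[d] + sigma_prime(Z[d]) * (W[d + 1].T @ eps[d + 1]))
            Z[N] = Z[N] + gamma * (-eps[N])                  # output update, Eq. (5) prediction case
        return Z[N]

    # Toy usage with a 4-8-2 network (one hidden layer).
    sizes = [4, 8, 2]
    rng = np.random.default_rng(0)
    Z = [np.zeros(n) for n in sizes]
    W = [None] + [0.1 * rng.standard_normal((sizes[d], sizes[d - 1])) for d in range(1, len(sizes))]
    Z, W = il_train_step(Z, W, rng.standard_normal(4), np.array([1.0, 0.0]))
    prediction = il_predict(Z, W, rng.standard_normal(4))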

Incremental Predictive Coding (iPC)

Yuhang Song et al. [16] proposed Z-IL to improve learning performance, since IL is quite slow. This is a step in the right direction; however, the algorithm is not fully autonomous and needs external signals, just like IL.

In a recent paper [15], Salvatori et al. proposed a new training algorithm for PCNs, called incremental predictive coding (iPC), that is fully autonomous, parallel, and much more efficient. The algorithm can be summarized as follows.

Essentially, inference and weight updates run simultaneously, achieving training performance similar to, or even better than, BP.
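A minimal sketch of the difference from IL: each iteration applies both the value-node update (Eq. 5) and the weight update (Eq. 6), instead of waiting for inference to converge first. The hyperparameters here are placeholders; see [15] for the exact algorithm.

    import numpy as np

    sigma = np.tanh
    sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

    def ipc_step(Z, W, x, label, gamma=0.1, alpha=0.01):
        """One iPC iteration on a clamped (D, L) pair: inference and learning happen together."""
        N = len(Z) - 1
        Z[0], Z[N] = x, label                                    # input and output stay clamped
        eps = [None] + [Z[d] - W[d] @ sigma(Z[d - 1]) for d in range(1, N + 1)]
        for d in range(1, N):                                    # value-node update (Eq. 5) ...
            Z[d] = Z[d] + gamma * (-eps[d] + sigma_prime(Z[d]) * (W[d + 1].T @ eps[d + 1]))
        for d in range(1, N + 1):                                # ... and weight update (Eq. 6)
            W[d] = W[d] + alpha * np.outer(eps[d], sigma(Z[d - 1]))
        return Z, W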

Other local learning algorithms

Predictive coding networks can be seen as an umbrella term encompassing a variety of models trained with algorithms that respect some key properties, namely locality and independence of the weight updates, and the existence of a separate set of parameters representing the current state of the network, which is used to infer the final output of the model. Recent works explore these properties individually and show how they can be used to scale PCNs to large-scale tasks. In [10], Ororbia et al. show how to train a ResNet-18 in a parallel and localized way by independently propagating the error to different chunks of the architecture using local targets. These chunks act as separate nodes during the learning step. The authors also highlight the advantages of such an approach when applied to clustered training, as each node could be executed on a separate GPU. Similarly, [4] shows how local targets can be used to train a neural network and improve its robustness compared to BP on recent architectures such as VGG16 and EfficientNetB0.

2.4 PCNs and transformers

Predictive Coding Interpreted as a Generative Model

Up until this point, we have discussed the usage of PCNs with a strong underlying assumption: as a generative model, PC limits itself to the Gaussian form. This means that PC fails to keep up with more complex modern model structures such as transformers [19], because it is difficult to approximate transformers under Gaussian assumptions alone. Pinchetti et al. [11] solved this problem by generalizing PCNs to arbitrary distributions. In this section, we dive into the mathematical details of how this was achieved.

First, let us revisit predictive coding from a probabilistic view [2]. We can interpret PC as a variational inference problem [1], where the neural activities \(\phi _i\) inside the value nodes \(\overline {X}_d\) parameterize probability distributions. Assume we have a generative model \(\mathcal {D} = f(X)\), where \(\mathcal {D}\) is a data point and \(X\) is the set of latent variables. We can then write its joint probability \(p(\mathcal {D}, X) = p(\mathcal {D}| X)p(X)\). Training the network amounts to finding values for the latent variables \(X\) given \(\mathcal {D}\). Applying Bayes’ rule, we are interested in the quantity \(p(X|\mathcal {D}) = \frac {p(\mathcal {D}, X)}{p(\mathcal {D})}\). This is called the posterior, and computing it exactly is generally intractable. Thus, in variational inference it is approximated with a family of distributions \(q_\phi (X|\mathcal {D})\) whose parameters \(\phi \) we need to learn. Usually, we use the KL divergence [6] to measure the disparity between the approximate posterior and the exact posterior. We want to find the optimum \(\phi \)

\begin {equation} \label {qe:2:optimum_phi} q^*_\phi = \underset {\phi }{\operatorname {argmin}} D_{KL}[q_\phi (X|\mathcal {D})||p(X|\mathcal {D})] \end {equation} by minimizing an upper bound called the variational free energy \(\mathcal {F}\) \begin {equation} \label {eq:2:generative_model_energy} \mathcal {F} := D_{KL}[q_{\phi }(X|\mathcal {D})||p(\mathcal {D},X)] \geq D_{KL}[q_{\phi }(X|\mathcal {D})||p(\mathcal {D},X)] + \ln p(\mathcal {D}) = D_{KL}[q_\phi (X|\mathcal {D})||p(X|\mathcal {D})]. \end {equation}
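Spelled out for readability (this is standard variational-inference algebra, included only to connect the two equations): expanding the definitions gives \begin {equation*} D_{KL}[q_\phi (X|\mathcal {D})\|p(X|\mathcal {D})] = \mathbb {E}_{q_\phi (X|\mathcal {D})}[\ln q_\phi (X|\mathcal {D}) - \ln p(\mathcal {D}, X)] + \ln p(\mathcal {D}) = \mathcal {F} + \ln p(\mathcal {D}), \end {equation*} and since \(\ln p(\mathcal {D})\) does not depend on \(\phi \), minimizing \(\mathcal {F}\) with respect to \(\phi \) also minimizes the KL divergence to the exact posterior.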

The Gaussian assumption arises here: \begin {equation} p(\mathcal {D}, X) = p(\mathcal {D}| X)p(X) = \mathcal {N}(\mathcal {D}; g(X, \theta ), \Sigma _2)\, \mathcal {N}(X; \mu , \Sigma _1), \end {equation} where \(\Sigma _2\), \(\Sigma _1\), and \(\mu \) are priors that can also be learned. Using the Dirac-delta distribution \(q_\phi (X|\mathcal {D}) = \delta (X-\phi )\) as the variational posterior, we get [11, Sec. 2.1] \begin {equation} \label {eq:rep:3} \mathcal {F} = \mathbb {E}_{q_{\phi }(X|\mathcal {D})}[\ln q_{\phi }(X|\mathcal {D})] - \mathbb {E}_{q_{\phi }(X|\mathcal {D})}[\ln p(\mathcal {D}, X)] = -\mathbb {E}_{q_{\phi }(X|\mathcal {D})}[\ln p(\mathcal {D}, X)] = -\ln p(\mathcal {D}, \phi ), \end {equation} since the entropy of \(q\) is 0. We can apply this result to deep neural networks, where \(X\) represents the multiple layers of a PCN, \(X_0, X_1, X_2, ..., X_N\). The generative model then becomes

\begin {equation} \label {eq:rep:4} p(X_{0:N}) = p(X_0) \prod _{d=1}^{N} p(X_d | X_{d-1}) = \mathcal {N}(X_0; \mu _0, \Sigma _0) \prod _{d=1}^{N} \mathcal {N}(X_d; \mu _d, \Sigma _d), \end {equation}

where \(\mu _d = g_d(X_{d-1}, \theta _d)\), \(X_N = \mathcal {D}\), \(\mu _0\) is an arbitrary prior set for some data \(\mathtt {d}\), and \(\Sigma _d\) are prior diagonal covariance matrices. The energy then becomes \begin {equation} \label {eq:rep:5} \widetilde {\mathcal {F}} = -\mathbb {E}_{q(X_{0:N} | \mathtt {d},\mathcal {D})}[\ln p(X_{0:N})] = \sum _{d=0}^{N} - \ln p(\phi _d | \mu _d) = \frac {1}{2} \sum _{d=0}^{N} \sum _{i=1}^{w_d} \left ( \Sigma ^{-1}_{d,i}\varepsilon ^2_{d,i} + \ln \Sigma _{d, i} \right ) + k, \end {equation} where \(k\) is a constant and \(\varepsilon _d = \phi _d - \mu _d\). The total energy, given by the sum of the energies \(\mathcal {E}_d\) over all the layers, when assuming identity covariance matrices, is [12]: \begin {equation} \mathcal {F} = \sum _{d=0}^{N} - \ln p(\phi _d | \mu _d) = \sum _{d=0}^{N} \mathcal {E}_d = \frac {1}{2}\sum ^{N}_{d=0} \|\overline {\varepsilon }_{d}\|^2, \end {equation} which justifies the minimization of the sum of squared errors established in Section 2.2. This shows how the PCN energy function is grounded in a generative probabilistic model.

Generalization to Arbitrary Distributions

However, the Gaussian assumption is limiting when we want to generalize to more complex networks. Pinchetti et al. [11] established the following generalized energy function, bounded above by a cross-entropy term, and conducted several experiments showing that PCNs can be used to train transformer blocks. This opens the door for PCNs to be used in larger and more complex networks.

\begin {equation} \label {eq:rep:9} \mathcal {F}_{KL} = \sum _{d=0}^N \mathcal {E}_d := \sum _{d=0}^N D_{KL}[\mathcal {X}_d(\phi _d^D) \| \widehat {\mathcal {X}}_d(\mu _d^D)] \leq \sum _{d=0}^N \mathcal {H}(\mathcal {X}_d(\phi _d^D), \widehat {\mathcal {X}}_d(\mu _d^D)). \end {equation}

2.5 Summary and Prospects

Here we summarize the most important properties of PCN:

  1. It only requires local error minimization instead of depending on a global loss function in the output layer [2]. As a result, PCN models can be trained in a localized way: if we partition the model into multiple blocks and assign each to its own GPU, only the two boundary layers of each block must communicate with other nodes.
  2. PCN can be fully parallelized. This means that we can train PCN models asynchronously.
  3. PCN supports arbitrary graph topologies when defining models [7]. This means that models trained with PC can have any structure, not only the usual layered networks imposed by BP. For example, it has been shown that introducing cyclic connections in a PCN can improve its capabilities as an associative memory and yield models closer to the hippocampus-neocortex structure of the human brain [18].
  4. Slightly modified PCN models have been shown to be able to memorize sequences of images [17]. PC can also be used to model how our brain associates memories (text, speech, etc.) over multiple timescales [3].
  5. PCN can achieve similar performance as BP in traditional feed-forward networks and in memorizing information [14].
  6. PCN trained with Z-IL and iPC can achieve similar training time performance as BP [16].
  7. It can be generalized to more complex state-of-the-art network structures such as transformers and ResNets.

These properties of PCN make the model ideal for fully parallelized, distributed, and decentralized training. In the next section, we will dive into how we can use PCN’s properties to construct a distributed training network with heterogeneous nodes that can join or leave at any moment.

We want to unify the spare computing resources from all around the world to construct the largest AI model seen to date. We can then use this network to push advances across scientific fields such as medicine, mathematics, physics, aerospace engineering, and computer science.

PCNs are not yet mainstream for several reasons. First, BP has accumulated a large body of research since 1986 and has become the industry standard, whereas PCNs have only received attention from researchers (mainly at the University of Oxford) in the past decade; this also means there is little open-source support. Second, IL, the algorithm traditionally used to train PCNs, is much slower than BP (5-6 times slower), although this has been addressed by Z-IL and other modern algorithms in the last couple of years. Lastly, large companies such as Google, Meta, and OpenAI have no incentive to reinvent the wheel: BP has performed well, and their GPUs are grouped in centralized clusters where network bandwidth is not a prominent issue.

3 Distributed Training

3.1 PCN Properties and Distribution


Figure 3: An illustration of why in PCN the network cost of distributing training is \(O(n_{d+1})\).

Now that we have an initial understanding of PCNs, in this section we propose the NeuroMesh protocol, which will allow everyone to train models in a distributed way. To complement the protocol, we will build a distributed training network that interfaces with it and trains the defined models. In this way we invite anyone interested in training large AI models, from researchers to small companies to individuals, to use the protocol to advance AI research and development.

Under the protocol and network, the training workload of a large neural network is divided among Trainer nodes. The properties of PCN ensure that this can be done in a distributed environment.

PCN Localized Property Explained

First, let us assume we have two nodes \(A\) and \(B\) connected as \(A \rightarrow B\). An illustration of the boundary between \(A\) and \(B\) in a PCN is given in Figure 3; the red bar marks the boundary between nodes \(A\) and \(B\).

According to Eq. (5), in order to update \(\overline {Z}_d\) we need \(\overline {Z}_d\), \(\overline {\varepsilon }_d\), \(\boldsymbol {W}_{d+1}\) and \(\overline {\varepsilon }_{d+1}\). Both \(\overline {Z}_d\) and \(\overline {\varepsilon }_d\) are available to \(A\), and we also design the connection boundary so that \(\boldsymbol {W}_{d+1}\) is kept within \(A\); therefore we only need to pass back \(\overline {\varepsilon }_{d+1}\) from node \(B\). This has a cost of \(O(n_{d+1})\), the dimension of layer \(d+1\). According to Eq. (6), in order to update \(\boldsymbol {W}_{d+1}\) we need \(\overline {Z}_d\) and \(\overline {\varepsilon }_{d+1}\). \(\overline {Z}_d\) is readily available, and \(\overline {\varepsilon }_{d+1}\) is already being passed back for the value-node update of \(\overline {Z}_d\), so this step introduces no extra network cost. Finally, \(B\) requires \(\overline {\mu }_{d+1}\) to keep updating the network. Since \(\boldsymbol {W}_{d+1}\) is stored in \(A\), we just need to compute \(\overline {\mu }_{d+1} = \boldsymbol {W}_{d+1}\overline {X}_d\) and communicate this result forward. This has a network cost of \(O(n_{d+1})\) as well. Combining these results, we see that in PCN the overall network cost is \(O(2n_{d+1}) = O(n_{d+1})\), and we have shown that each layer operates in a localized manner.
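To make the message pattern explicit, here is a toy sketch of the boundary between nodes \(A\) and \(B\): the only data that crosses the boundary per step is \(\overline {\mu }_{d+1}\) going forward and \(\overline {\varepsilon }_{d+1}\) coming back, each of size \(n_{d+1}\). The class interfaces, toy dimensions, and learning rates are illustrative and not part of the NeuroMesh implementation.

    import numpy as np

    sigma = np.tanh
    sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

    class NodeA:
        """Owns the layers up to d, plus the boundary weights W_{d+1}."""
        def __init__(self, W_dp1, z_d, eps_d):
            self.W_dp1, self.z_d, self.eps_d = W_dp1, z_d, eps_d

        def forward_message(self):
            # A computes mu_{d+1} = W_{d+1} sigma(Z_d) and sends it to B: O(n_{d+1}) values.
            return self.W_dp1 @ sigma(self.z_d)

        def apply_boundary_errors(self, eps_dp1, gamma=0.1, alpha=0.01):
            # With eps_{d+1} received from B, both updates (Eqs. 5 and 6) are local to A.
            dW = alpha * np.outer(eps_dp1, sigma(self.z_d))
            self.z_d = self.z_d + gamma * (-self.eps_d + sigma_prime(self.z_d) * (self.W_dp1.T @ eps_dp1))
            self.W_dp1 = self.W_dp1 + dW

    class NodeB:
        """Owns layer d+1 onward."""
        def __init__(self, z_dp1):
            self.z_dp1 = z_dp1

        def backward_message(self, mu_dp1):
            # B computes eps_{d+1} = Z_{d+1} - mu_{d+1} and sends it back to A: O(n_{d+1}) values.
            return self.z_dp1 - mu_dp1

    # Toy boundary: layer d has 4 units, layer d+1 has 3 units.
    rng = np.random.default_rng(0)
    A = NodeA(rng.standard_normal((3, 4)), rng.standard_normal(4), rng.standard_normal(4))
    B = NodeB(rng.standard_normal(3))
    mu_dp1 = A.forward_message()                 # A -> B: n_{d+1} numbers
    eps_dp1 = B.backward_message(mu_dp1)         # B -> A: n_{d+1} numbers
    A.apply_boundary_errors(eps_dp1)             # no further communication needed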

PCN and Distribution

Keeping in mind the properties explained in Section 2.5, we explain here how PCN can be used to train models in a distributed way.

  1. The network can have an arbitrary topology, which allows nodes to be of any size and to be connected in any possible way. This means we can accept a wide range of GPUs with almost any amount of computing power (say, anything at least as capable as an Nvidia GTX 1060). Also, each node can have an arbitrary number of neighbors, making dependency on adjacent neighbors a lesser concern. For example, PC allows us to build brain-like structures, with clusters of densely connected neurons that are sparsely connected to each other.
  2. Using PCN’s localized and fully parallel properties, we can achieve much better performance when pipelining the model compared to BP: the training pipeline will not have any bubbles [5] after the model has been warmed up. The localization property also means that we can connect new nodes anywhere in the network.
  3. The training nodes can progress asynchronously. This gives the network flexibility, allowing nodes to join and leave at any moment. We can quickly replace failed nodes or mark them as deprecated and stop synchronizing their activations. This is analogous to how the brain usually rewires around dead neurons: when a nearby node malfunctions, the current node can keep training using data from other adjacent nodes instead of stopping completely. This is not possible in BP, since the whole model is strongly coupled.
  4. The inter-layer communication cost can be further reduced by optimizing the transmission of errors: we can propagate only the error terms that are large enough to have an actual influence on parameter updates, instead of transmitting the entire error vector (a minimal sketch of this thresholding follows below).
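The thresholding idea from point 4 can be sketched as follows: only error entries whose magnitude crosses a threshold are sent as (index, value) pairs, and the receiver rebuilds a sparse error vector. The threshold value and the message format are illustrative assumptions.

    import numpy as np

    def sparsify_errors(eps, threshold=1e-3):
        """Sender side: keep only error entries large enough to matter for the updates."""
        idx = np.flatnonzero(np.abs(eps) > threshold)
        return idx, eps[idx]                      # (index, value) pairs cross the network

    def densify_errors(idx, values, size):
        """Receiver side: rebuild the (mostly zero) error vector."""
        eps = np.zeros(size)
        eps[idx] = values
        return eps

    eps = np.array([0.2, 1e-5, -0.4, 2e-4, 0.0])
    idx, vals = sparsify_errors(eps)              # only 2 of the 5 entries are transmitted
    reconstructed = densify_errors(idx, vals, eps.size)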

Figure 4: Difference in topology between a typical BP network (top) and a sketch of a PC network of structural connections that link distinct neuronal elements in a brain (bottom). Figure taken from [7].

Therefore, we can construct a training network that anyone can join and contribute to, and that is resilient to failed nodes. A possible PCN network structure is given in Figure 4, which highlights the main structural limitation of BP-based networks compared with PC-based ones. Now imagine the world’s computing power coming together, forming one huge training network and contributing to the training of large AI models.

Data Parallelization

Currently, when large AI models are trained on vast amounts of data, data parallelism is required; otherwise the model would take far too long to process all the provided data. In BP-based data parallelism, model weights essentially need to be synchronized every batch. Using ChatGPT-3.5 (175B parameters) as an example, roughly 700 GB of data must be synchronized per batch. This level of network bandwidth can only be achieved with professional GPUs such as the A100 or H100, and it is the main reason why large BP models can only be trained in centralized clusters.
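The 700 GB figure is a back-of-the-envelope estimate assuming each parameter is stored as a 32-bit float (half-precision weights would halve it, without changing the order of magnitude): \begin {equation*} 175\times 10^{9}\ \text {parameters} \times 4\ \text {bytes per parameter} = 700\times 10^{9}\ \text {bytes} = 700\ \text {GB per synchronization}. \end {equation*}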

This problem would also be present in PCN if we employed data parallelism. However, we will use PCN’s main innovative properties to avoid weight synchronization altogether. Instead of training copies of the same model under data parallelization, we will use an innovative model design in which we increase the number of input nodes to achieve data parallelism, rather than duplicating the model over different clusters as in BP. Conventional data parallelism is akin to forcing all human brains to be identical, letting each learn from different data, and then forcing them to be identical again, over and over; in its essence, it is not biologically plausible.

Let us use a biological analogy for our approach. We, as humans, have vision, hearing, touch, smell, etc. as sensory inputs to the brain. Similarly, we add more sensory inputs to the neural network and learn from more data at the same time by increasing the number of input and output nodes, achieving pseudo-data parallelism by imitating the brain.
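The structural idea can be sketched as follows: several "sensory" input blocks project into one shared core, so that different samples can be presented to different blocks within the same inference phase. How the core combines multiple parents (here, by summing their projections) and all sizes are illustrative assumptions; the actual NeuroMesh model design is not specified at this level of detail.

    import numpy as np

    def build_multi_input_topology(num_inputs, input_size, core_size, seed=0):
        """Illustrative sketch: one shared core with several 'sensory' input blocks,
        each with its own weight matrix into the core."""
        rng = np.random.default_rng(seed)
        return [0.1 * rng.standard_normal((core_size, input_size)) for _ in range(num_inputs)]

    def core_prediction(input_weights, samples):
        """Assumed combination rule: the core's prediction is the sum of all projections."""
        return sum(W @ x for W, x in zip(input_weights, samples))

    # Four input blocks of size 8 feeding a shared core of size 16.
    W_in = build_multi_input_topology(num_inputs=4, input_size=8, core_size=16)
    rng = np.random.default_rng(1)
    samples = [rng.standard_normal(8) for _ in range(4)]   # four different training samples
    mu_core = core_prediction(W_in, samples)               # processed in a single inference phase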

3.2 Experimental results

Figure 5: Training results of BP vs PCN using 3 different RNN models [11]: (a) MNIST, (b) CIFAR10.

Dataset   Model   PCN-IL   BP
MNIST     m1      62.81    248.17
MNIST     m2      63.24    252.72
MNIST     m3      62.75    255.21
CIFAR     m1      61.85    235.34
CIFAR     m2      62.77    233.75
CIFAR     m3      69.14    238.42

Table 1: Comparing PCN and BP training speed on the same hardware environment and dataset, measured in iterations per second [11].

Our research and development team has verified some of the results of Sections 2 and 3. Our comparative experiments on small models show that the training performance of PCN is almost equivalent to that of BP in terms of accuracy. Figure 5 displays the training results of PCN and BP on the MNIST and CIFAR datasets under different neural network structures. We can see that the accuracy and loss curves when training PCN with IL are very close to those of BP, with no significant difference.

In terms of speed, BP trains about 3-4 times faster than PCN with IL, as shown in Table 1. Although PCN does not have a speed advantage over BP in training, it requires far less GPU network bandwidth than BP, which also lowers its hardware requirements. GPUs designed for professional AI training (such as the A100 and H100) cost more than ten times as much as consumer GPUs (such as the RTX 3090 and RTX 4090), while the difference in computing power is not substantial; in some cases the latter may even be higher. The price difference stems mainly from the GPUs’ data transfer capabilities. PCN can leverage a network composed of a large number of low-cost consumer graphics cards to train very large AI models. Therefore, we have reason to believe that the actual training cost of PCN can be lower than that of BP. This also contributes to NeuroMesh’s competitive edge.

Moreover, the performance of IL is not a bottleneck, since newer algorithms such as Fa-ZIL, proposed by Song et al. [16], or iPC [15] incur no performance penalty over BP and can even be more performant in specific settings, such as full-batch training. Furthermore, it has been shown that backpropagation-free algorithms (such as the one in [10]) can learn as efficiently as BP on large and complex networks while employing local targets.

Now, it’s just a matter of combining all of these ideas to create something efficient and distributed, utilizing the spare computing resources from around the world to build the largest AI model seen to date.

4 The NeuroMesh Protocol

4.1 Roles Within the NeuroMesh Protocol


Figure 6: Illustration of a possible NeuroMesh network composed of Trainers, Synchronizers, and Predictors.

The NeuroMesh protocol, a groundbreaking framework for AI model training, introduces a semi-decentralized architecture distinguished by three essential roles, each pivotal to the protocol’s operation and success:

  1. Trainers: nodes that contribute computing power and carry out the distributed training workload on their assigned segments of the model.
  2. Synchronizers: nodes with advanced permissions that verify the work of Trainers and maintain the oversight needed for the network’s efficiency and the correctness of model training.
  3. Predictors: nodes that host the trained models and offer prediction services, forming the protocol’s application layer.

4.2 The Distributed Network of the NeuroMesh Protocol

The NeuroMesh protocol adopts a distributed but not fully decentralized network structure, aiming to surmount the challenges of efficiency and practicality that pure decentralization faces in the context of training large AI models. This approach enables a scalable and robust framework for AI development:

The protocol allows for dynamic participation, with Trainer nodes able to join or leave the network freely, promoting a resilient and adaptable training environment. This flexibility is balanced with the operational efficiency ensured by the Synchronizers, who, despite the decentralized ethos, maintain a level of oversight critical for the network’s functionality.

Synchronizers, vested with advanced permissions, verify the work of Trainers and ensure the network’s efficiency and the correctness of model training. This semi-decentralized structure leverages the benefits of distributed computing while ensuring the quality and efficiency of the training process.

Predictors, by hosting the trained models and offering prediction services, embody the protocol’s application layer. They ensure that the training process’s outcomes are accessible, transparent, and timely to the users, bridging the gap between the training and application phases of AI models. From a broader perspective, the PCN design inherent to the NeuroMesh protocol enhances the network’s fault tolerance and robustness. Even if some Trainer nodes become inactive or behave maliciously, the network’s overall operation remains unaffected, mirroring the resilience of biological neural networks.

4.3 Validating Contributions and Incentivizing Trainers in NeuroMesh

NeuroMesh introduces a comprehensive incentive and verification framework designed to ensure active participation from Trainer nodes and guarantee the integrity of their contributions. This section delves into the mechanisms behind motivating Trainer nodes to engage in the network’s activities and how Synchronizers validate that these nodes have honestly completed their assigned tasks.

Incentivizing Trainer Participation. This system utilizes a proof-of-contribution model, evaluating each Trainer’s input based on computational efforts and training outcome quality. Tokens are allocated at the conclusion of each training epoch, with distribution proportional to the individual contributions, fostering a competitive yet fair environment that encourages efficiency and optimal performance. The token economy within NeuroMesh is designed to create a self-sustaining ecosystem where tokens serve multiple purposes, from accessing AI services to participating in governance, thereby enhancing engagement and long-term commitment.
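A hypothetical sketch of the proportional payout described above; the contribution scores themselves (computational effort, outcome quality) are treated as inputs here, since the scoring function is not specified in this section.

    def distribute_epoch_rewards(contributions, epoch_reward):
        """Hypothetical proportional payout: contributions maps trainer id -> contribution score."""
        total = sum(contributions.values())
        if total == 0:
            return {trainer: 0.0 for trainer in contributions}
        return {trainer: epoch_reward * score / total
                for trainer, score in contributions.items()}

    # Example: three Trainers share 1000 tokens allocated for one training epoch.
    rewards = distribute_epoch_rewards({"trainer-1": 5.0, "trainer-2": 3.0, "trainer-3": 2.0}, 1000.0)
    # rewards == {"trainer-1": 500.0, "trainer-2": 300.0, "trainer-3": 200.0}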

Adaptive Continuous Validation. To maintain the integrity and accuracy of contributions without relying on static benchmarks, NeuroMesh employs an adaptive continuous validation approach. Synchronizers play a pivotal role in this process, analyzing data patterns and comparing model parameter evolutions against anticipated trends. This dynamic validation mechanism helps identify inconsistencies or dishonest contributions effectively.

Shadow Computing Verification. Further solidifying the protocol’s trustworthiness, NeuroMesh leverages shadow computing, wherein Synchronizers replicate selected computations to serve as a dynamic benchmark. This process involves:

  1. Random Auditing via Shadow Computing: Random model segments or data processed by Trainers undergo shadow computation, creating an unpredictable verification system that deters dishonesty.
  2. Real-time Feedback and Correction: Discrepancies trigger immediate feedback from Synchronizers to Trainers, allowing for swift corrections and enhancing overall model accuracy.
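A hypothetical sketch of the random-audit loop: the Synchronizer re-runs a sampled subset of the Trainers' reported updates on its own hardware and flags mismatches for feedback. The interfaces, sampling rate, and tolerance are illustrative and not protocol constants.

    import numpy as np

    def shadow_audit(reported_updates, recompute_fn, sample_rate=0.05, tol=1e-4, seed=None):
        """reported_updates maps segment id -> (inputs, update reported by the Trainer);
        recompute_fn(inputs) re-runs the same computation on the Synchronizer's hardware."""
        rng = np.random.default_rng(seed)
        flagged = []
        for seg_id, (inputs, reported) in reported_updates.items():
            if rng.random() > sample_rate:                # audit only a random subset
                continue
            shadow = recompute_fn(inputs)                 # the "shadow" computation
            if not np.allclose(shadow, reported, atol=tol):
                flagged.append(seg_id)                    # discrepancy -> feedback / penalties
        return flagged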

Trainers consistently underperforming or acting dishonestly face consequences, including reduced token rewards or exclusion, ensuring a high-trust, efficient training environment.

4.4 Pioneering the Web3 AI Training Ecosystem

In the burgeoning landscape of Web3 and decentralized technologies, the advent of NeuroMesh signifies a pivotal evolution in the AI training domain. As an integral component of this ecosystem, NeuroMesh positions itself at a critical junction, bridging the gap between resource provision and the pressing need for comprehensive training platforms.

Bridging Gaps in the Web3 AI Training Landscape. The Web3 AI training ecosystem is marked by diversity, with projects focused on different AI development aspects. Some initiatives provide essential computational resources like GPUs and bandwidth, while others seek efficient AI model training mechanisms but face access challenges. NeuroMesh emerges as a vital component of the Web3 AI training ecosystem. By connecting computational resource providers with AI development projects, NeuroMesh fills a pivotal gap, enhancing the Web3 AI development cycle’s utility and efficiency. Its unique position facilitates seamless integration with the diverse landscape of Web3 projects, elevating the collective potential for innovation in AI training and development.

A Comprehensive Training Platform for AI Development. At its core, NeuroMesh offers a training platform that simplifies the complexities of distributed AI model training. It empowers developers and organizations to focus on the conceptual and business aspects of AI projects, providing a robust infrastructure that handles the technical demands of training large models. This democratizes AI development, making advanced training capabilities accessible to a broader range of innovators, regardless of their technical resource base.

Facilitating a Decentralized AI Training Service. NeuroMesh not only serves as a bridge within the Web3 ecosystem but also establishes itself as a provider of decentralized AI training services. By offering these services, NeuroMesh enables a wide range of applications, from startups experimenting with new AI-driven products to established entities seeking to enhance their offerings with advanced AI capabilities.

The platform’s service-oriented architecture ensures scalability, allowing it to accommodate the growing demands of the AI industry. Furthermore, by operating within the Web3 framework, NeuroMesh upholds the principles of decentralization, ensuring data privacy, security, and ownership remain in the hands of users.

Remarks. As the Web3 ecosystem continues to evolve, NeuroMesh stands out as a crucial enabler of AI model training, providing a necessary link between resource provisioning and AI development needs. Its development of a comprehensive training platform marks a significant stride towards democratizing AI, making it more accessible and manageable for users across the spectrum. In doing so, NeuroMesh is not just contributing to the Web3 AI training landscape but is actively shaping the future of AI development and deployment in a decentralized world.

5 Conclusion

As we stand on the brink of a new era in AI development, NeuroMesh emerges as a beacon of innovation within the Web3 AI training landscape. Through its pioneering architecture, NeuroMesh not only addresses the pressing need for efficient and accessible AI model training but also sets a new standard for the democratization of computational resources. By bridging the gap between resource provisioning and AI development, NeuroMesh paves the way for a more inclusive and collaborative future in AI research and application.

The introduction of a token-based incentive system further underscores NeuroMesh’s commitment to fostering a vibrant and self-sustaining ecosystem. This system not only rewards participants for their contributions but also encourages a culture of collaboration and open innovation. Furthermore, the implementation of shadow computing for verification purposes ensures the integrity and reliability of the training process, solidifying NeuroMesh’s position as a trustworthy platform for AI model development.

Looking ahead, the potential of NeuroMesh extends far beyond its current capabilities. As the platform evolves, it will continue to lower the barriers to AI model training, enabling a wider array of participants to contribute to and benefit from AI advancements. The vision for NeuroMesh is not just about creating a decentralized AI training platform but about redefining the possibilities of AI development in a Web3 world.

In conclusion, NeuroMesh represents a significant leap forward in the quest for a decentralized, efficient, and inclusive framework for AI model training. It embodies the spirit of innovation that drives the Web3 community, offering a glimpse into a future where AI development is more accessible, transparent, and equitable. As we move forward, the continued evolution of NeuroMesh will undoubtedly play a crucial role in shaping the landscape of AI training, bringing us closer to realizing the full potential of distributed computing in AI research and development.

References

[1]   David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, April 2017.

[2]   Rafal Bogacz. A tutorial on the free-energy framework for modeling perception and learning. Journal of Mathematical Psychology, 76:198–211, 2017. Model-based Cognitive Neuroscience.

[3]   Charlotte Caucheteux, Alexandre Gramfort, and Jean-Rémi King. Evidence of a predictive coding hierarchy in the human brain listening to speech. Nature Human Behaviour, 7(3):430–441, March 2023.

[4]   Bhavin Choksi, Milad Mozafari, Callum Biggs O’May, Benjamin Ador, Andrea Alamia, and Rufin VanRullen. Predify: Augmenting deep neural networks with brain-inspired predictive coding dynamics. Advances in Neural Information Processing Systems, 34:14069–14083, 2021.

[5]   Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018.

[6]   James M. Joyce. Kullback-Leibler Divergence, pages 720–722. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.

[7]   Beren Millidge, Tommaso Salvatori, Yuhang Song, Rafal Bogacz, and Thomas Lukasiewicz. Predictive coding: Towards a future of deep learning beyond backpropagation?, 2022.

[8]   Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. A theoretical framework for inference and learning in predictive coding networks, 2022.

[9]   Michael A. Nielsen. Neural networks and deep learning, 2018.

[10]   Alexander G. Ororbia, Ankur Mali, Daniel Kifer, and C. Lee Giles. Backpropagation-free deep learning with recursive local representation alignment. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23. AAAI Press, 2023.

[11]   Luca Pinchetti, Tommaso Salvatori, Yordan Yordanov, Beren Millidge, Yuhang Song, and Thomas Lukasiewicz. Predictive coding beyond gaussian distributions, 2022.

[12]   R. P. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 1999.

[13]   David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning Representations by Back-propagating Errors. Nature, 323(6088):533–536, 1986.

[14]   Tommaso Salvatori, Yuhang Song, Yujian Hong, Simon Frieder, Lei Sha, Zhenghua Xu, Rafal Bogacz, and Thomas Lukasiewicz. Associative memories via predictive coding, 2021.

[15]   Tommaso Salvatori, Yuhang Song, Yordan Yordanov, Beren Millidge, Zhenghua Xu, Lei Sha, Cornelius Emde, Rafal Bogacz, and Thomas Lukasiewicz. A stable, fast, and fully automatic learning algorithm for predictive coding networks, 2024.

[16]   Yuhang Song, Thomas Lukasiewicz, Zhenghua Xu, and Rafal Bogacz. Can the brain do backpropagation? — exact implementation of backpropagation in predictive coding networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 22566–22579. Curran Associates, Inc., 2020.

[17]   Mufeng Tang, Helen Barron, and Rafal Bogacz. Sequential memory with temporal predictive coding, 2023.

[18]   Mufeng Tang, Tommaso Salvatori, Beren Millidge, Yuhang Song, Thomas Lukasiewicz, and Rafal Bogacz. Recurrent predictive coding models for associative memory employing covariance learning. PLoS computational biology, 19(4):e1010719, 2023.

[19]   Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

[20]   James C. R. Whittington and R. Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation, 29:1229 – 1262, 2017.