A Model Decomposition-in-Time of Recurrent Neural Networks: A Feasibility Analysis

L D’Amore; L D’Amore

doi:DOI not found for this article.

ISSN: 2641-3086

Trends in Computer Science and Information Technology

Short Communication Open Access Peer-Reviewed

A Model Decomposition-in-Time of Recurrent Neural Networks: A Feasibility Analysis

L D’Amore*

University of Naples Federico II, Naples, Italy

Author and article information

*Corresponding author: L D’Amore, University of Naples Federico II, Naples, Italy, E-mail: [email protected]; [email protected]

doi : 10.17352/tcsit.000091

Received: 03 February, 2025 | Accepted: 18 February, 2025 | Published: 19 February, 2025

Keywords: Parallel-in-time approach; Domain decomposition; Additive objective function; Constrained least square problems; Recurrent neural networks; Parallel algorithm

Cite this as

D’Amore L. A Model Decomposition-in-Time of Recurrent Neural Networks: A Feasibility Analysis. Trends Comput Sci Inf Technol. 2025;10(2):007-010. Available from: 10.17352/tcsit.000091

Copyright License

© 2025 D’Amore L. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

In the context of Recurrent Neural Networks, minimization of the Loss Function (LF) causes the most training overhead. Following the Parallel In-Time approaches, we introduce an ab-initio decomposition across time direction. The key point of our approach lies in the innovative definition of local objective functions which allows us to overcome the sequential nature of the network and the management of dependencies between time steps. In particular, we define local RNNs by adding a suitable overlapping operator to the local objective functions which guarantees their matching between adjacent subsequences. In this way, we get to a fully parallelizable decomposition of the RNN whose implementation avoids global synchronizations or pipelining. Nearest neighbours communications guarantee the algorithm’s convergence. We hope that these findings encourage readers to further extend the framework according to their specific application requirements.

Main article text

1. Introduction

Recurrent Neural Networks (RNNs) are a class of Deep Learning models that are fundamentally designed to handle sequential data. Unlike feedforward neural networks, RNNs possess the unique feature of maintaining a memory of previous inputs by using their internal state to process sequences of inputs. This makes them ideally suited for applications where the order of data points is crucial [1]. With the increase in the size of the dataset and the improvement of the complexity of the model, the demand for computational strength and storage space in the training process also increases proportionally. The present work is placed in the context of the design of parallel algorithms for RNNs in large-scale applications where parallelizing RNNs can be challenging due to their sequential nature. We focus on RNNs with an additive loss function consisting of a large number of component functions, let us say f_i, such as

$L (θ) = m i n \sum_{i} f_{i} (θ) (1)$

where each term corresponds to the error between some data and the output of a parametric model, with being the vector of parameters. An example is linear least squares problems, where f_i has a quadratic structure, except for a regularization function. A more general class of additive cost problems is nonlinear least squares [2,3].

Minimization of the Loss Function (LF) causes the most training overhead. The structure of the additive cost function has facilitated the use of parallel computing approaches that are well-suited for the incremental approach [4]. Very often, besides the exploitation of Graphics Processing Units' acceleration [5], concurrency is introduced inside the operations of each minimization step, at the cost of an all-to-one data synchronization at the end of each step (this is the simplest way to achieve what is defined in the context of deep learning as data parallelism or the mini-batch gradient descent approach). In the alternative, the incremental computing of the values and subgradients of the components f_i could be performed in a distributed manner. This is defined as model parallelism. Naive model parallelism is relatively simple. It’s straightforward to split a large model into chunks of consecutive layers. However, there’s a sequential dependency between inputs and outputs of layers, so a naive implementation can lead to a large amount of idle time while a worker waits for outputs from the previous machine to be used as its inputs. We can reuse the ideas from data parallelism by having each worker only process a subset of data elements at one time, allowing us to cleverly overlap new computations with wait time. Pipeline parallelism allows the execution of a model to be partitioned such that multiple micro-batches can execute different parts of the model code concurrently [6-11].

In the presence of evolutionary problems, in the last decades, Parallel-In-Time methods have been investigated for reducing the temporal dimensionality of such problems. Since Nievergelt, in 1964, proposed for the first time the decomposition algorithm for finding the parallel solutions of evolutionary ordinary differential equations [12], the methods of time-parallel time integration have been extensively expanded and several relevant works can be found in the literature. An extensive and updated literature list can be found on the website [13] collecting information about people, methods, and software in the field of parallel-in-time integration methods. By relying on results we have obtained on the Kalman Filter and on variational methods on the uncertainty quantification models [14-16], in this work we present the main idea underlying an ab-initio model decomposition in time of RNNs. This approach implies both the time series and the objective function decomposition among computing resources. Local functionals are suitably modified, by imposing a regularization constraint in order to enforce the matching of their solutions between adjacent sub-sequences. Finally, according to the Additive Schwarz Method (ASM), synchronization of local solutions is imposed iteratively. Such synchronization guarantees the model convergence [16].

2. Model decomposition-in-time of RNNs

In the following, we summarize the key components of the hybrid decomposition of RNNs. In particular, we introduce the operators defining data reduction and localization then, finally, we give the mathematics underlying decomposition in time of RNNs. For simplicity, we consider the simplest RNN, with a self-connected hidden state, through which information cycles across time steps [17]. The decomposition approach could be extended to other types of RNN.

2.1 Recurrent neural networks

RNNs are a class of deep learning models that are fundamentally designed to handle sequential data. At each time step, t = 1,...,q, the RNN takes an input vector, $x_{t} \in ℜ^{m_{0}}$ and returns the output vector $y_{t} \in ℜ^{m_{1}}$ , s.t.

$y_{t} = σ_{y} (W_{h t} h_{t} + b_{t}) (2)$

where

$\begin{array}{l} h_{t} = σ_{h} (W_{x h} x_{t} + W_{h h} h_{t - 1} + b_{h}) t >0, \\ h_{0} =0 t =0, (3) \end{array}$

is the hidden state vector that captures information about previous inputs, W_xh is the weight matrix between the input and hidden layer, W_hh is the weight matrix for the recurrent connection (hidden-to-hidden), b_t and b_h are the bias vectors, and finally σy and σh are the activation functions. The weight matrices and the bias functions characterize the unknown model. These parameters are obtained by minimizing a predefined objective Function.

For convenience, at each time t = 1,..., q we concatenate vectors x_t and h_t−1 to form a m₀ + m₁ dimensional vector a_t. Then, if we pose

$W_{I} =(W_{x h}), W_{L} =(W_{h h})$

where $W_{I} \in ℝ^{m_{1} \times m_{0}}, W_{L} \in ℝ^{m_{1} \times m_{1}}, x_{t} \in ℝ^{m_{0}}, h_{t - 1} \in ℝ^{m_{1}}$ , we write all the weights together in the matrix

$W =(W_{I}, W_{L}) \in ℜ^{m_{1} \times (m_{0} + m_{1})}$

It results that equations (2) and (3) can be written as

$y_{t} = σ_{t} (W a_{t}) t =1, \dots, q . (4)$

Let $G_{t} \in ℜ^{m_{1} \times (m_{0} + m_{1})}$ be the matrix obtained by each r.h.s. of equation in (4); $g = v e c (G_{t}) \in ℝ^{(m_{0} + m_{1}) \times (m_{1} \times q)}$ be the vector obtained by stacking the t r.h.s of the equation in (4) on top of one another; let $z \in ℜ^{m_{1} \times q}$ be the column vector obtained collecting the vector y_t (output of the RNN) q times. Finally, let $A \in ℜ^{q \times (m_{0} + m 1)}$ be the matrix whose t-th row is obtained by the vectors a_t, t = 1...,q. The equation in (4) can be written as the overdetermined linear system [18,19].

Ag = z

(Least Square Minimization in RNNs) The least-square minimization problem refers to the computation of g such that [20]:

$\hat{g} = a r g m i n_{g} J (g) (5)$

with objective function

$J (g)=|| A g - z {||}_{2} (6)$

2.2 Model decomposition in time

The Model Decomposition in Time direction implies the reduction of both the time series and the objective function among computing resources. This is obtained by data reduction operation and functional localization, respectively. Local functions are obtained as a suitable modification of the original objective function. In particular, local functions are composed of two parts: the first part is the restriction of the original objective function to the sub-interval, then a regularization constraint is added in order to enforce the matching of their solutions between adjacent sub-sequences.

(Data Reduction) Let w = [w_t w_t+1 ... w_n]^T ∈ R^s be a vector with t ≥ 1 , n > 0, s = n − t and I_r = {1,...,r}, r > n and n > t. The extension of w to I_r is:

$E O_{I_{r}} : w \in ℝ^{s} \to E O_{I_{r}} (w)=[{\bar{w}}_{1} {\bar{w}}_{2} ... {\bar{w}}_{r}]^{T} \in ℝ^{r}, (7)$

where for i = 1,...,r

${\bar{w}}_{i} = {\begin{cases} w_{i} i f t \leq i \leq n \\ 0 i f i > n a n d i < t \end{cases} (8)$

Let B = [B¹ B² ... Bⁿ] ∈ R^m×n be such that B^j is the j −th column of B. For i < j (i = 1,...,n − 1 and j = 2,...,n) we define the set I_i,j = {i,...,j}. The restriction of B to the set I_j is:

$|_{I_{j}} : B \in ℝ^{m \times n} \to B |_{I_{j}} =[B^{1} B^{2} ... B^{j}] \in ℝ^{m \times j}, (9)$

and to I_i,j is:

$|_{I_{i, j}} : B \in ℝ^{m \times n} \to B |_{I_{i, j}} =[B^{i} B^{i + 1} ... B^{j}] \in ℝ^{m \times j - i} . (10)$

(Objective Function Reduction) Let n₁ ,n₂ > 0 two natural numbers less than q. Let

$J |_{(I_{n_{i}}, I_{n_{j}})} :(u |_{I_{n_{i}}}, u |_{I_{n_{j}}}) \mapsto J |_{(I_{n_{i}}, I_{n_{j}})} (u |_{I_{n_{i}}}, u |_{I_{n_{j}}}) \forall i, j =1,2 (11)$

denote the restriction of J defined in (6).

For simplicity of notations, we let J_i,jm ≡ J|_(Ii,Ij) with i, j = 1,2. We consider the decomposition of RNNs in two subsets only.

(In Time DD-RNN setup) Let I = {1,...,q} be the index set of rows of A. DD-RNN setup consists in decomposing I into 2 sets:

$I_{n_{1}} ={1,..., n_{1}}, I_{n_{2}} ={n_{1} - s + 1,..., q}, (12)$

where $| I_{1} |= n_{1} >0$ , $| I_{2} |= n_{2} >0$ , and

$I_{1,2} ={n_{1} - s + 1,..., n_{1}}, (13)$

is the overlap set. If s = 0, then I₁ ∩ I₂ = ∅ and I_1,2 ≠ ∅. Restrictions of A to I₁ and I₂ are:

$A_{1} = A |_{I_{1}} \in ℝ^{n_{1} \times (m_{0} + m_{1})}, A_{2} = A |_{I_{2}} \in ℝ^{n_{2} \times (m_{0} + m_{1})}, (14)$

(In Time DD-RNN Algorithm) According to the ASM in [21], we introduce the loop for k = 0,1,2,..., the model decomposition-in-time leads to two RNNs solving iteratively the following reduced least square minimization problems:

$\begin{array}{l} P_{1}^{k + 1} : {\hat{g}}_{1}^{k + 1} & = a r g m i n_{g_{1}^{k + 1} \in ℝ^{n_{1}}} J_{1} (g_{1}^{k + 1}, g_{2}^{k}) \\ = a r g m i n_{g_{1}^{k + 1} \in ℝ^{k_{1}}} [J |_{(I_{1}, I_{2})} (g_{1}^{k + 1}, g_{2}^{k}) + μ \cdot O_{1,2} (g_{1}^{k + 1}, g_{2}^{k})] \end{array} (15)$

$\begin{array}{l} P_{2}^{n + 1} : {\hat{g}}_{2}^{k + 1} & = a r g m i n_{g_{2}^{k + 1} \in ℝ^{n_{2}}} J_{2} (g_{2}^{k + 1}, g_{1}^{k + 1}) \\ = a r g m i n_{g_{2}^{k + 1} \in ℝ^{k_{2}}} [J |_{(I_{2}, I_{1})} (g_{2}^{k + 1}, g_{1}^{k + 1}) + μ \cdot O_{1,2} (g_{2}^{k + 1}, g_{1}^{k})] \end{array} (16)$

O_1,2 is the overlapping operator representing the data exchange on the overlap set I_1,2, and µ > 0 is the regularization parameter.

In particular, we pose

$O_{1,2} (g_{i}, g_{j})=|| E O_{I_{i}} (g_{i} |_{I_{1,2}}) - E O_{I_{i}} (g_{j} |_{I_{1,2}})||, i, j =1,2 (17)$

with $E O_{I_{i}} g_{1} |_{I_{1,2}}$ , $E O_{I_{i}} g_{2} |_{I_{1,2}}$ be the extension to I_i, of restriction to I_1,2 in (13) of $g_{1} \in ℝ^{n_{1}}$ and $g_{2} \in ℝ^{n_{2}}$ , respectively.

As a solution, we get the sequence ${g^{k + 1}}_{k \in ℕ_{0}}$ such that:

$g^{k + 1} = {\begin{array}{l} {\hat{g}}_{1}^{k + 1} |_{I_{1} \ I_{1,2}} & o n I_{1} \ I_{1,2} \\ \frac{μ}{2} ({\hat{g}}_{2}^{k + 1} |_{I_{1,2}} + {\hat{g}}_{1}^{k + 1} |_{I_{1,2}}) & o n I_{1,2} \\ {\hat{g}}_{2}^{k + 1} |_{I_{2} \ I_{1,2}} & o n I_{2} \ I_{1,2} \end{array}, (18)$

with the sets I₁, I₂ defined in (12) and I_1,2 in (13).

Algorithm implementing local RNN on I_n1 and I_n2,

1: repeat

2: k:= k + 1

3: Call Loc-RNN

4: Exchange x^k_i between adjacent subdomains and compute O_1,2

5: until $∥ g_{i}^{k} - g_{i}^{k - 1} ∥ < e p s$

3. Discussion

The aim of this work is only to illustrate the framework underlying such innovative ideas. Many aspects have been left intentionally unexplained. For example, regarding the size of the overlapping interval, we note that the majority of parallelization techniques first decompose the domain in nonoverlapping subdomains and then extend each subdomain with halo regions to enable stencil operations in a parallel context. Instead, the algorithm allows one to reduce communications among adjacent subdomains to only those nodes lying on the interfaces. In particular, if the widths of the halo regions are equal to one grid node wide, the halo nodes are actually the nodes of the overlapping region. Otherwise, halo regions contain more nodes than overlapping nodes and this means more communications among adjacent subdomains. Hence, the algorithm reduces communication times which depends on the number of inner nodes in overlap regions. In addition, the extra work performed on the overlapped region with an increasing size can be seen as the effect of a preconditioner on the overlapping region which can overestimate the solution [21].

Regarding the regularization parameter, we observe that the regularization parameter plays an important trade-off role. As is evident, when the regularization parameter overfitting occurs larger values are expected to produce reliable solutions even if the DD-step number increases. The choice of a good regularization parameter is one of the most important issues in solving unconstrained minimization problems and there exists a significant amount of research in the literature on the development of appropriate strategies for selecting regularization parameters. Parameter choice methods can be classified according to the input they require. There are two basic types: a-priori methods, requiring information about the noise level on data. The discrepancy principle, developed and analyzed by Morozov is the oldest one of them; data-driven methods, require no extra information. Generalized Cross-Validation (GCV), due to Wahba, is one of the most popular methods [22].

Another issue is the employment of a dynamic load balancing scheme based on adaptive and dynamic redefining of initial decomposition. Specifically, in order to optimally choose the domain decomposition configuration, the partitioning into subdomains must satisfy certain conditions. First, the computational load assigned to subdomains must be equally distributed. Good quality partitioning also requires the volume of communication during calculation to be kept at its minimum. In [15] the authors employed a dynamic load balancing scheme based on adaptive and dynamic redefining of initial decomposition, aimed to balance load between processors according to data location. In particular, the authors focused on the introduction of a dynamic redefining of initial DD in order to deal with problems where the observations are non uniformly distributed and generally sparse.

The most significant finding is that we can solve several smaller problems improving the accuracy-per-parameter metric. Most importantly, subproblems can be solved in parallel, leading to a scalable algorithm where the workers locally exchange parameter updates via a nearest-neighbourhood communication scheme, which does not require a fully connected network. In contrast to other decomposition-in-time approaches, in our approach local solvers run concurrently from the beginning. Overall, the employment of the framework on hybrid high-performance computing systems seems to be a fruitful research area [23].