22 April 2023


Theoretical foundations of the transformer

View transformer as response of system

View transformer as nonlinear regression

Kernel transformer

Generalized Fourier Integral Theorems and their applications to the transformer

Functional Model

Mixing MLP





The transformer is a neural network architecture that has become increasingly popular in natural language processing tasks such as machine translation, language modeling, and text classification. Its theoretical foundation can be viewed from several perspectives.

One way to understand the transformer is as the response of a system to an input signal. In this view, the transformer is a dynamical system that transforms an input sequence of vectors into an output sequence of vectors. Each layer passes the sequence through a series of nonlinear transformations, applied in parallel across all of its elements, which allows the transformer to capture complex dependencies between elements of the input sequence.
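As a minimal sketch of this system view, a single self-attention step maps an input sequence of vectors to an output sequence of the same length, with every output a mixture of all inputs computed in parallel. The projection matrices below are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Map a sequence X of shape (n, d) to a new sequence of shape (n, d).

    Every output vector is a convex combination of all value vectors,
    so the transformation acts in parallel across the whole sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise similarities
    return softmax(scores, axis=-1) @ V      # weighted mix of values

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)
print(Y.shape)  # (5, 8): same sequence length in, same out
```

Note that the output sequence has the same shape as the input, which is what lets layers of this transformation be stacked into a deep system.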

Another way to view the transformer is as a nonlinear regression model. In this view, the transformer can be seen as a function that maps an input sequence to an output sequence. The transformer learns this mapping by minimizing a loss function that measures the discrepancy between the predicted and actual output sequences. This approach allows the transformer to capture complex patterns in the input sequence that may be difficult to model with simpler linear models.
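The regression view can be sketched with a toy stand-in for the transformer's learned mapping: a small nonlinear model fit by minimizing a mean-squared-error loss between predicted and target sequences with gradient descent. The architecture and hyperparameters here are illustrative, not the transformer itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy nonlinear regressor y_i = tanh(x_i W1) W2, standing in for the
# transformer's sequence-to-sequence mapping.
def predict(W1, W2, X):
    return np.tanh(X @ W1) @ W2

def fit(X, Y, W1, W2, lr=0.1, steps=200):
    """Minimize mean squared error by gradient descent."""
    for _ in range(steps):
        H = np.tanh(X @ W1)
        E = H @ W2 - Y                             # prediction error
        dP = 2 * E / E.size                        # d(loss)/d(prediction)
        dW2 = H.T @ dP
        dW1 = X.T @ ((dP @ W2.T) * (1 - H ** 2))   # backprop through tanh
        W1, W2 = W1 - lr * dW1, W2 - lr * dW2
    return W1, W2

X = rng.normal(size=(20, 3))
Y = np.sin(X @ rng.normal(size=(3, 2)))            # nonlinear target mapping
W1 = 0.5 * rng.normal(size=(3, 8))
W2 = 0.5 * rng.normal(size=(8, 2))
before = np.mean((predict(W1, W2, X) - Y) ** 2)
W1, W2 = fit(X, Y, W1, W2)
after = np.mean((predict(W1, W2, X) - Y) ** 2)
print(after < before)  # loss decreases as the mapping is learned
```

The transformer is trained the same way in principle, just with a far richer function class and loss functions suited to the task.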

The transformer can also be viewed from the perspective of kernel methods. In this view, the attention mechanism applies a kernel function to pairs of sequence elements, implicitly mapping them into a high-dimensional feature space. This mapping allows the transformer to capture complex nonlinear relationships between elements of the input sequence.
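To make the kernel view concrete, attention weights can be written as a row-normalized kernel matrix, w_ij = k(q_i, k_j) / Σ_j' k(q_i, k_j'). With the exponential kernel k(q, x) = exp(q·x/√d) this recovers standard softmax attention; swapping in another positive kernel (a Gaussian below, as an illustrative choice) yields a kernel-attention variant:

```python
import numpy as np

def kernel_attention(Q, K, V, kernel):
    """Attention as a row-normalized kernel matrix applied to values."""
    W = np.array([[kernel(q, k) for k in K] for q in Q])
    W = W / W.sum(axis=-1, keepdims=True)   # normalize each query's weights
    return W @ V

d = 4
# exp kernel: recovers standard softmax attention exactly
exp_kernel = lambda q, k: np.exp(q @ k / np.sqrt(d))
# Gaussian kernel: an alternative positive-definite similarity
gauss_kernel = lambda q, k: np.exp(-np.sum((q - k) ** 2) / (2 * np.sqrt(d)))

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(5, d)) for _ in range(3))
out_softmax = kernel_attention(Q, K, V, exp_kernel)
out_gauss = kernel_attention(Q, K, V, gauss_kernel)
print(out_softmax.shape, out_gauss.shape)  # (5, 4) (5, 4)
```

Because only the kernel changes, the rest of the architecture is untouched, which is what makes this perspective useful for designing attention variants.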

Generalized Fourier Integral Theorems can also be applied to the transformer. These theorems allow the attention output to be expressed as a combination of basis functions, analogous to the Fourier series expansion of a periodic signal. This view gives a handle on the transformer's approximation properties and behavior.
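As a sketch of the underlying machinery, the classical Fourier integral theorem reconstructs a function from cosine basis functions:

```latex
% Classical Fourier integral theorem: f is recovered by integrating
% cosine basis functions over frequencies t and locations y.
f(x) = \frac{1}{(2\pi)^d} \lim_{R \to \infty}
       \int_{[-R,R]^d} \int_{\mathbb{R}^d}
       \cos\bigl(t^\top (x - y)\bigr)\, f(y)\, \mathrm{d}y\, \mathrm{d}t
```

Generalized versions replace the cosine with other kernels. Viewed this way, an attention output at a query point resembles a nonparametric estimate of this kind built from the finitely many key-value pairs in the sequence.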

The transformer can also be viewed as a functional model that operates on entire sequences rather than individual elements. This allows it to capture higher-level properties of the input sequence, such as its overall structure and context.

In addition, the transformer architecture often incorporates a mixing MLP (multilayer perceptron) that combines information from different parts of the input sequence. This helps the transformer to capture long-range dependencies and enables it to model more complex relationships between elements of the input sequence.
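A token-mixing MLP of the kind popularized by MLP-Mixer-style architectures acts across the sequence dimension rather than the feature dimension, so every output position combines information from all positions, including distant ones. A minimal sketch (shapes and weights are illustrative):

```python
import numpy as np

def mlp(X, W1, W2):
    """Two-layer MLP with a ReLU nonlinearity."""
    return np.maximum(X @ W1, 0) @ W2

def token_mixing(X, W1, W2):
    """Mix across tokens: transpose so the MLP acts along the
    sequence dimension, letting every output position draw on
    every input position, however far apart."""
    return mlp(X.T, W1, W2).T

rng = np.random.default_rng(3)
n, d, hidden = 6, 4, 16
X = rng.normal(size=(n, d))          # n tokens, d features each
W1 = rng.normal(size=(n, hidden))    # weights span the token dimension
W2 = rng.normal(size=(hidden, n))
Y = token_mixing(X, W1, W2)
print(Y.shape)  # (6, 4): shape preserved, tokens mixed
```

Because the weight matrices span the token dimension, the mixing range is the entire sequence in a single layer, which is one way such blocks model long-range dependencies.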
