LSSA: A Quick Overview

Beyond Legacy Vectors: Architecting a New Semantic Space for AI

This document is part of the LSSA project

It follows this article.

Dense Vector Spaces

The heart of LSSA is certainly its vector space, whose characteristics are completely different from the usual ones. In the classic approach, the vector space, that is, the representation of information, is dense: it is occupied by multidimensional vectors built from the statistical co-occurrence of concepts, on the principle that if two concepts often appear together, their meanings are probably close. Now, aside from the fact that this is by no means certain, the method opens the way to a great many problems. To name a few:

  • There is no knowledge of the reasons behind the statistical proximity; we could call this method a semantic ordering devoid of semantic knowledge.
  • Tokens are excessively interdependent, which makes modifying the structure practically impossible.
  • Linguistic ambiguities cannot be resolved in practice.
  • Both building and using the structure carry an enormous computational burden.

In essence, an information structure realized through a dense vector space has serious problems in tackling simple concepts like “trumpet and piano love each other in jazz but don’t know each other in the town band.”

The Cognitive Vector Space

Imagine a boundless library, with millions of books sorted by shape, color, size… but without a catalog. Every time you want to find a piece of information, you cannot search for it by subject. You literally have to go shelf by shelf, book by book, flipping through everything until — perhaps — you find what you need. This is, in essence, what happens in dense vector spaces: a gigantic, powerful representation, but devoid of a true semantic indexing system.

LSSA solves precisely this: it introduces a “conceptual index,” an organization by semantic affinity that allows for targeted cognitive paths. Instead of activating everything, it activates only what is needed. It doesn’t explore the entire library at every step but goes directly to the right shelf. And this is where the difference arises between a network that resembles a jumbled archive and one that can truly begin to think.

In LSSA we have taken a completely different approach from the dense one: its vector space is not semantically blind but, on the contrary, originates precisely from the reasons for the pairings. To achieve this, we first created a structure containing all possible basic information, and only afterwards did we create the links that tie individual concepts together. It may sound like an unattainable goal, but it is not at all; in fact, the reason no one had ever used this method probably lies precisely in the fact that, at first glance, it seems unfeasible.

The first step: cataloging (primary training)

How many base concepts can be represented, in total? Well, we have counted them, perhaps for the first time: for a complex language like Italian, considering all possible variants, proper nouns of every kind, imported foreign expressions… in short, everything, they do not exceed 4.5 million. From a computational standpoint, four and a half million possible semantic tokens is not a large number at all; it is, in fact, a rather small quantity.

Unlike the classic representation, where each semantic token, paradoxically, carries no information about its own semantics, in LSSA tokens are arranged on “layers of semantic proximity.” Imagine them as Cartesian planes containing tokens of similar meaning. For example, “trumpet” and “piano” sit on the semantic affinity plane of musical instruments; “jazz” and “band music” (it sounds a bit awkward, but I checked, it is a real term) on that of musical genres; and “love,” along with “affinity,” on that of relationships. This representational structure lets us know that “jazz,” “piano,” and “trumpet” find links in the relationships layer, while they have none for “band music.” This is obviously a simplification to aid understanding; what actually happens is that “piano” and “trumpet” will often find pairings on multiple layers in the case of “jazz,” while they will find few or none in the case of “band music.”
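
To make the idea concrete, here is a minimal sketch in Python of tokens arranged on affinity layers; the layer names and token sets are invented for illustration and do not come from the real LSSA catalog. The point is simply that a token’s meaning is expressed by the planes it sits on, and that two tokens can be compared through the layers they share.

```python
# Illustrative sketch only: invented layers and tokens, not the LSSA catalog.
layers = {
    "musical_instruments": {"trumpet", "piano", "violin"},
    "musical_genres": {"jazz", "band music", "blues"},
    "relationships": {"love", "affinity", "friendship"},
}

def layers_of(token):
    """Return the names of every affinity layer that contains the token."""
    return {name for name, tokens in layers.items() if token in tokens}

# "trumpet" and "piano" share the plane of musical instruments; whether they
# are also linked through "jazz" or "band music" is decided later, by the
# relationship vectors built during secondary training.
print(layers_of("trumpet") & layers_of("piano"))   # {'musical_instruments'}
```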

To create this structure, which can be visualized as a series of Cartesian planes stacked along the z-axis, we use a classifier: a large language model that we feed, one page at a time, with content from the most varied branches of knowledge on Wikipedia, together with the list of already existing layers and operational instructions. The classifier examines each term and evaluates whether it can be placed in an existing layer or whether a new one must be created. Once it has decided, it outputs a short sequence of instructions, which an algorithm then executes to carry out the actual operations. In addition to the structure of semantic proximity layers, LSSA has several other data structures; the main ones are an index-tree containing the (x, y, z) coordinates of each semantic token and a map of the free/occupied space in each layer. There are about 300 layers, again a small number.

NOTE: If a token can have more than one meaning, thus belonging to more than one affinity layer, the classifier creates a separate entry for it on each plane: in LSSA, semantic attributions are precise.
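
The bookkeeping described above can be sketched in a few lines of Python; the coordinate scheme, the slot-search strategy, and the function names are assumptions made for illustration, not the actual LSSA code. The example also shows how a token with more than one meaning, as in the note, simply receives one entry per plane: here “piano” as a musical instrument and as the Italian adverb for “softly.”

```python
from collections import defaultdict

# Assumed structures for illustration: an index-tree mapping each semantic
# token to its (x, y, z) positions and a per-layer map of occupied cells.
index_tree = defaultdict(list)   # token -> list of (x, y, z) positions
occupancy = defaultdict(set)     # layer z -> set of occupied (x, y) cells
layer_names = {}                 # layer z -> human-readable name

def place_token(token, layer_z, layer_name):
    """Carry out one classifier instruction: place `token` on layer `layer_z`."""
    layer_names.setdefault(layer_z, layer_name)
    # naive slot search: walk along one row of the plane until a cell is free
    x, y = 0, 0
    while (x, y) in occupancy[layer_z]:
        x += 1
    occupancy[layer_z].add((x, y))
    index_tree[token].append((x, y, layer_z))

# A token with two meanings gets one entry per plane (layer numbers invented):
place_token("piano", layer_z=12, layer_name="musical_instruments")
place_token("piano", layer_z=87, layer_name="manner_of_action")
print(index_tree["piano"])   # [(0, 0, 12), (0, 0, 87)]
```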

The creation of relationships (secondary training)

At this point the basic structure is ready: every possible semantic token is in its place, ordered by proximity layer. But how are the relationships between them created? Secondary training builds an initial representation of the relationships between semantic tokens, and it does so with an entirely algorithmic solution, therefore very fast and with low computational cost. The algorithm is fed numerous texts of various kinds, explored sentence by sentence. Using the index-tree, it creates the vectors needed to represent the sentence under examination or, if they already exist, strengthens their weight. At the end of this operation we have a set of vectors connecting the tokens present in the structure, each with a weight given by its pairing frequency.

We will not examine here the consequences of having a differentiated weight for each vector, for which we refer to the technical documentation, but we do emphasize how richly differentiated the resulting representative structure is, built not on simple statistical proximity but on cognitive paths: for every relationship created, LSSA knows the source token, the destination token, the proximity layers involved, and the pairing frequency. A wealth of information, where representation by statistical proximity knows… nothing! In addition, the real LSSA assigns each vector a creation timestamp, a last-use timestamp, and a “lock” indicator, for which, again, we refer to the detailed documentation.
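
The secondary-training pass lends itself to a very short sketch. The one below is a simplification under assumed names: it treats every pair of known tokens in a sentence as a pairing, creating a vector on the first encounter and reinforcing its weight afterwards, and it stores the metadata mentioned above (creation timestamp, last-use timestamp, lock indicator).

```python
import time
from itertools import combinations

vectors = {}   # (source_token, destination_token) -> vector record

def record_pair(src, dst):
    """Create the vector src -> dst, or reinforce it if it already exists."""
    now = time.time()
    if (src, dst) in vectors:
        vectors[(src, dst)]["weight"] += 1       # pairing seen again
        vectors[(src, dst)]["last_used"] = now
    else:
        vectors[(src, dst)] = {
            "weight": 1,
            "created": now,
            "last_used": now,
            "locked": False,   # the "lock" indicator mentioned in the text
        }

def train_on_sentence(sentence, known_tokens):
    """One step of the pass: pair up every known token found in the sentence."""
    tokens = [t for t in sentence.lower().split() if t in known_tokens]
    for src, dst in combinations(tokens, 2):
        record_pair(src, dst)

train_on_sentence("The trumpet and the piano love jazz",
                  known_tokens={"trumpet", "piano", "love", "jazz"})
print(vectors[("trumpet", "piano")]["weight"])   # 1
```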

Inference

In LSSA, inference is guided by explicit relationships between tokens and between semantic affinity planes. These relationships are evaluated locally, through the weights assigned to individual vectors, which lets the system choose the most effective cognitive paths. In technical terms, each step translates into a sequence of operations with O(1) computational cost: there is no need to evaluate all possible paths, because the structure itself contains enough information to define them precisely and immediately. We can picture the representation of information in LSSA as a large city with a detailed map: every place is reachable by following well-traced paths, and for each path we also know its quality and how often it is used. One no longer proceeds by trial and error, but with awareness.
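
As a rough illustration of what this local evaluation means in practice, the sketch below, with invented weights, picks the next step by looking only at the outgoing vectors already stored on the current token, with no search over the rest of the space.

```python
# Invented example data: outgoing vectors of a token, destination -> weight.
outgoing = {
    "trumpet": {"piano": 42, "jazz": 17, "band music": 1},
}

def next_step(current_token):
    """Choose the strongest outgoing relationship of the current token."""
    candidates = outgoing.get(current_token)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

print(next_step("trumpet"))   # 'piano'
```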

Formation of new tokens and affinity planes, conclusion

Unlike dense representation, in which everything is static because of the high computational cost of change, LSSA allows new tokens and even entire semantic affinity planes to be created dynamically during inference. Inference itself, upon noticing that an element needed to represent new information is missing, can autonomously generate it: a new semantic token is inserted into the relevant affinity plane and into the index-tree, marking the new position on the cognitive map, and the corresponding vector is created from the originating node. This represents a huge leap in quality compared to dense-representation models, where any modification requires new training, often expensive and rigidly centralized.
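
A minimal, self-contained sketch of that dynamic creation might look as follows; the names, coordinates, and example token are assumptions for illustration. An unknown concept is written into the relevant affinity plane and into the index-tree, and a first vector is created from the node that required it.

```python
import time

# Assumed structures for illustration.
index_tree = {}              # token -> list of (x, y, z) positions
occupancy = {12: {(0, 0)}}   # layer z -> occupied (x, y) cells
vectors = {}                 # (source, destination) -> vector record

def create_token_during_inference(new_token, layer_z, origin_token):
    """Insert a new token met during inference and link it to its origin."""
    cells = occupancy.setdefault(layer_z, set())
    x, y = 0, 0
    while (x, y) in cells:           # naive search for a free cell on the plane
        x += 1
    cells.add((x, y))
    index_tree.setdefault(new_token, []).append((x, y, layer_z))
    now = time.time()
    vectors[(origin_token, new_token)] = {
        "weight": 1, "created": now, "last_used": now, "locked": False,
    }

create_token_during_inference("theremin", layer_z=12, origin_token="trumpet")
print(index_tree["theremin"])   # [(1, 0, 12)]
```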

In conclusion, what LSSA proposes is not simply an evolution of the classic data structure used in artificial intelligence, but a true refoundation of it. A representation that does not merely amass concepts in a dense and immutable space, but organizes them into navigable, modifiable, and structurally readable semantic planes. In LSSA, every connection has an origin, a destination, and a reason to exist. Where dense space requires millions for each update, LSSA is capable of evolving with operations at minimal computational cost. And all this, without sacrificing precision, coherence, and scalability.

LSSA presents itself as a structure that is simple to implement and use, with a remarkably low computational cost. It thus solves the numerous problems of dense and static representations, which are poorly suited to describe thought: a phenomenon by its nature continuously evolving and intrinsically founded on semantic relationships.