Thoughts on LGM (Large General Model)
Some late-night thoughts while waiting for the alarm...
I recently posted a review of Google DeepMind's Genie paper, which described an autoregressive model that takes frames as input and outputs frames.
As we know, many researchers worldwide are trying to get around the limitations of discrete tokenizers. In LLMs, the model receives discrete signals in effectively unlimited quantities (the 'internet') and uses them for base pre-training.
In Genie and similarly sophisticated methods, the tokenizer acts as a kind of codebook (or, more generally, the latent space of an autoencoder), and its discrete outputs are treated as that same kind of effectively unlimited signal.
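As a rough illustration of that codebook idea (shapes, sizes, and names here are my own assumptions, not Genie's actual architecture), a VQ-style tokenizer just replaces each continuous embedding with the index of its nearest learned code vector:

```python
# Minimal sketch (hypothetical sizes/names): a VQ-style tokenizer turning a
# continuous frame embedding into discrete codebook indices.
import numpy as np

rng = np.random.default_rng(0)

codebook = rng.normal(size=(1024, 64))      # 1024 learned code vectors, dim 64
frame_patches = rng.normal(size=(256, 64))  # one frame -> 256 patch embeddings

# Nearest-neighbor lookup: each patch becomes the index of its closest code.
dists = ((frame_patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
frame_tokens = dists.argmin(axis=1)         # shape (256,), integers in [0, 1024)

print(frame_tokens[:10])  # the frame is now just a short sequence of discrete IDs
```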
This led me to wonder. Current multimodal methods are essentially hybrid approaches: several different tokenizers whose outputs are concatenated into a single stream, on which a model is then trained autoregressively. For example, a frame is mapped to discrete indices through a codebook, text is mapped to its own tokens, and the two are concatenated into one signal.
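In code, that hybrid concatenation might look something like the following sketch (the vocabulary sizes and the offset trick are assumptions for illustration, not a specific paper's recipe):

```python
# Minimal sketch: flattening per-modality token streams into one sequence
# for an autoregressive model. Vocab sizes are placeholders.
TEXT_VOCAB = 50_000   # e.g. a BPE tokenizer for text
IMAGE_VOCAB = 1_024   # e.g. the VQ codebook size from the sketch above

BOS_IMAGE = TEXT_VOCAB + IMAGE_VOCAB  # special marker: "image tokens start here"

def interleave(text_tokens, image_tokens):
    """Map both modalities into one shared ID space and concatenate."""
    # Image IDs are shifted past the text vocabulary so the ranges never collide.
    shifted = [TEXT_VOCAB + t for t in image_tokens]
    return text_tokens + [BOS_IMAGE] + shifted

sequence = interleave([17, 842, 3], [5, 900, 12, 77])
print(sequence)  # one flat stream the model predicts token by token
```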
Could we take a sequence of bits representing any type of computational information (image, audio, sensor data, text) and treat them as the tokens to predict autoregressively?
This would mean having one single tokenizer, with no need for the various hybrid approaches, since we'd be working at the most basic computational unit of information.
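A toy sketch of what that "universal tokenizer" could look like (the file names are placeholders; this is just the idea, not a proposal that it trains well):

```python
# Minimal sketch: skip modality-specific tokenizers and treat the raw bytes of
# any file (image, audio, sensor log, text) as the token stream. The vocabulary
# is just 256 symbols, or 2 if you go all the way down to bits.
from pathlib import Path

def bytes_to_tokens(path):
    """Any computational artifact becomes a sequence of integers in [0, 256)."""
    return list(Path(path).read_bytes())

def bytes_to_bits(path):
    """Or one level lower: a sequence over a 2-symbol vocabulary."""
    return [int(b) for byte in Path(path).read_bytes()
            for b in format(byte, "08b")]

# tokens = bytes_to_tokens("frame.png")  # the same function works for .wav, .csv, .txt ...
# print(tokens[:16])
```

The obvious trade-off is sequence length: byte- or bit-level streams are far longer than codebook or BPE streams, which is exactly the efficiency problem today's tokenizers exist to solve.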
What are your thoughts?