Improved Baselines with Visual Instruction Tuning
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
LIMA: Less Is More for Alignment
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language models are increasingly being deployed for general problem solving
across a wide range of tasks, but are still confined to token-level,
left-to-right decision-making processes during inference. This means they can
fall short in tasks that require exploration, strategic lookahead, or where
initial decisions play a pivotal role. To surmount these challenges, we
introduce a new framework for language model inference, Tree of Thoughts (ToT),
which generalizes over the popular Chain of Thought approach to prompting
language models, and enables exploration over coherent units of text (thoughts)
that serve as intermediate steps toward problem solving. ToT allows LMs to
perform deliberate decision making by considering multiple different reasoning
paths and self-evaluating choices to decide the next course of action, as well
as looking ahead or backtracking when necessary to make global choices. Our
experiments show that ToT significantly enhances language models'
problem-solving abilities on three novel tasks requiring non-trivial planning
or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in
Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of
tasks, our method achieved a success rate of 74%. Code repo with all prompts:
https://github.com/ysymyth/tree-of-thought-llm.
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
TRAIN SHORT, TEST LONG: ATTENTION WITH LINEAR BIASES ENABLES INPUT LENGTH EXTRAPOLATION
IMAGEBIND: One Embedding Space To Bind Them All
Quantization - Qdrant
Qdrant is an open-source vector database and vector search engine written in Rust. It provides a fast and scalable vector similarity search service with a convenient API.
Quantization is an optional feature in Qdrant that enables efficient storage and search of high-dimensional vectors.
By transforming the original vectors into new representations, quantization compresses the data while approximately preserving the relative distances between vectors.
Different quantization methods have different mechanics and tradeoffs. We will cover them in this section.
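As a minimal sketch of the general idea (not Qdrant's actual implementation), scalar quantization can be illustrated in NumPy: each float32 component is mapped to one of 256 uint8 codes, shrinking storage 4x while keeping pairwise distances approximately intact. The value range and function names below are illustrative assumptions.

```python
import numpy as np

def quantize_uint8(v, lo=-1.0, hi=1.0):
    """Map each float component in [lo, hi] to one of 256 uint8 codes."""
    v = np.clip(v, lo, hi)
    return np.round((v - lo) / (hi - lo) * 255).astype(np.uint8)

def dequantize(codes, lo=-1.0, hi=1.0):
    """Reconstruct approximate float values from the uint8 codes."""
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

rng = np.random.default_rng(0)
a, b = rng.uniform(-1, 1, 64), rng.uniform(-1, 1, 64)
d_orig = np.linalg.norm(a - b)
d_quant = np.linalg.norm(dequantize(quantize_uint8(a)) - dequantize(quantize_uint8(b)))
# storage drops from 4 bytes to 1 byte per component; the distance error is
# bounded by the quantization step size
```

Real systems additionally pick `lo`/`hi` per dataset (or per dimension) from observed statistics, which is what makes the distance distortion small in practice.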
Backpropagation through time - Wikipedia
Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks, such as Elman networks. The algorithm was independently derived by numerous researchers.
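To make the mechanics concrete, here is a small NumPy sketch of BPTT for a one-layer Elman-style RNN, h_t = tanh(W_h h_{t-1} + W_x x_t) with a squared-error loss at every step: the network is unrolled over all time steps and the gradient is accumulated backwards through the recurrence. All sizes and names are illustrative, and this is a teaching sketch rather than production code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, X = 5, 4, 3                     # time steps, hidden size, input size
Wh = rng.normal(scale=0.5, size=(H, H))
Wx = rng.normal(scale=0.5, size=(H, X))
xs = rng.normal(size=(T, X))          # inputs
ys = rng.normal(size=(T, H))          # per-step targets

def forward(Wh, Wx):
    """Unroll the Elman RNN h_t = tanh(Wh h_{t-1} + Wx x_t); sum per-step losses."""
    hs = [np.zeros(H)]
    for t in range(T):
        hs.append(np.tanh(Wh @ hs[-1] + Wx @ xs[t]))
    loss = 0.5 * sum(np.sum((hs[t + 1] - ys[t]) ** 2) for t in range(T))
    return hs, loss

def bptt(Wh, Wx):
    """Accumulate weight gradients backwards through the unrolled recurrence."""
    hs, _ = forward(Wh, Wx)
    dWh, dWx = np.zeros_like(Wh), np.zeros_like(Wx)
    dh_next = np.zeros(H)                       # gradient arriving from step t+1
    for t in reversed(range(T)):
        dh = (hs[t + 1] - ys[t]) + dh_next      # local loss grad + recurrent grad
        da = dh * (1 - hs[t + 1] ** 2)          # back through tanh
        dWh += np.outer(da, hs[t])
        dWx += np.outer(da, xs[t])
        dh_next = Wh.T @ da                     # pass gradient to step t-1
    return dWh, dWx
```

The backward loop is where the "through time" part lives: the same weight matrices receive gradient contributions from every time step, which is also the source of the well-known vanishing/exploding gradient problem.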
Recurrent Memory Transformer
Transformer-based models are effective across multiple domains and tasks. Self-attention combines information from all sequence elements into context-aware representations. However, global and local information must be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention.
In this work, we propose and study a memory-augmented segment-level recurrent Transformer (RMT). Memory stores and processes local and global information and, with the help of recurrence, passes information between segments of a long sequence.
We implement the memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The model is then trained to control both memory operations and sequence representation processing.
Experimental results show that RMT performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Transformer-XL can improve its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning long-term dependencies and general-purpose in-memory processing, such as algorithmic tasks and reasoning.
The paper "Recurrent Memory Transformer" proposes a memory-augmented segment-level recurrent Transformer (RMT) that stores and processes global and local information by adding memory tokens to the input or output sequence. RMT performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks requiring longer sequence processing.
Key insights and lessons learned:
The self-attention mechanism in Transformer-based models has quadratic computational complexity for long sequences and limits the amount of global and local information that can be stored and processed.
Adding memory tokens to the input or output sequence of a Transformer-based model allows for memory-augmentation and the storage and processing of global and local information, as well as the passing of information between segments of long sequences with the help of recurrence.
The proposed RMT model performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it for longer sequence processing tasks.
The memory-token mechanism is generic: it requires no architectural changes, so it could in principle be applied to a wide range of Transformer-based models and long-sequence tasks.
Emergent and Predictable Memorization in Large Language Models
Memorization, or the tendency of large language models (LLMs) to output
entire sequences from their training data verbatim, is a key concern for safely
deploying language models. In particular, it is vital to minimize a model's
memorization of sensitive datapoints such as those containing personal
identifiable information (PII). The prevalence of such undesirable memorization
can pose issues for model trainers, and may even require discarding an
otherwise functional model. We therefore seek to predict which sequences will
be memorized before a large model's full train-time by extrapolating the
memorization behavior of lower-compute trial runs. We measure memorization of
the Pythia model suite, and find that intermediate checkpoints are better
predictors of a model's memorization behavior than smaller fully-trained
models. We additionally provide further novel discoveries on the distribution
of memorization scores across models and data.
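Memorization is typically quantified per sequence as the fraction of tokens in the model's greedy continuation that reproduce the true training continuation; the sketch below illustrates that kind of score (the paper's exact metric may differ in details, so treat the definition and names here as assumptions).

```python
def memorization_score(true_tokens, generated_tokens):
    """Fraction of positions where the model's greedy continuation matches
    the original training sequence. A score of 1.0 means the continuation
    was reproduced verbatim."""
    assert len(true_tokens) == len(generated_tokens)
    matches = sum(t == g for t, g in zip(true_tokens, generated_tokens))
    return matches / len(true_tokens)
```

Scoring the same training sequences at several checkpoints (or on smaller trial models) is what makes the extrapolation approach in the paper possible.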
The paper "Emergent and Predictable Memorization in Large Language Models" by Stella Biderman et al. studies memorization in large language models. It proposes predicting which sequences will be memorized before a model's full training run by extrapolating the memorization behavior of lower-compute trial runs, and provides novel insights into the distribution of memorization scores across models and data.
Key insights and lessons learned from the paper:
Memorization is a key concern for deploying large language models safely, particularly for sensitive datapoints such as PII.
Intermediate checkpoints are better predictors of memorization behavior than smaller fully-trained models.
Memorization scores follow a power-law distribution across models and data, with some datapoints being more prone to memorization than others.
Fine-tuning can mitigate memorization to some extent, but not completely.
The Forward-Forward Algorithm: Some Preliminary Investigations
Evidence of a predictive coding hierarchy in the human brain listening to speech - Nature Human Behaviour
Current machine learning language algorithms make adjacent word-level predictions. In this work, Caucheteux et al. show that the human brain probably uses long-range and hierarchical predictions, taking into account up to eight possible words into the future.
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster
Moravec's paradox - Wikipedia
Moravec's paradox is the observation by artificial intelligence and robotics researchers that, contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. The principle was articulated by Hans Moravec, Rodney Brooks, Marvin Minsky and others in the 1980s. Moravec wrote in 1988, "it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".[1]
PANGU-Σ: TOWARDS TRILLION PARAMETER LANGUAGE MODEL WITH SPARSE HETEROGENEOUS COMPUTING
ReAct: Synergizing Reasoning and Acting in Language Models
SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions
Joint Embedding Methods - Contrastive · Deep Learning
Joint Embedding methods try to make their backbone network robust to certain distortions and are invariant to data augmentation.
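One common concrete instance of this family is a contrastive (InfoNCE-style) objective, sketched below in NumPy: the embeddings of two augmentations of the same input are pulled together, while non-matching pairs in the batch are pushed apart. The function name, temperature value, and exact loss form are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive loss between two augmented views: z1[i] and z2[i] are
    embeddings of two augmentations of the same input (the positive pair);
    all other rows of z2 act as negatives for z1[i]."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature            # pairwise cosine similarities
    # cross-entropy with the matching view (the diagonal) as the correct class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss makes the backbone invariant to the chosen augmentations, which is exactly the robustness property described above; non-contrastive joint-embedding methods pursue the same invariance without explicit negatives.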
The Model That Changes Everything: Alpaca Breakthrough (ft. Apple's LLM, BritGPT, Ernie and AlexaTM)
8 years of cost reduction in 5 weeks: how Stanford's Alpaca model changes everything, including the economics of OpenAI and GPT 4. The breakthrough, using self-instruct, has big implications for Apple's secret large language model, Baidu's ErnieBot, Amazon's attempts and even governmental efforts, like the newly announced BritGPT.
I will go through how Stanford put the model together, why it costs so little, and demonstrate it in action against ChatGPT and GPT-4. And what are the implications of short-circuiting human annotation like this? With analysis of a tweet by Eliezer Yudkowsky, I delve into the workings of the model and the questions it raises.
Web Demo: https://alpaca-ai0.ngrok.io/
Alpaca: https://crfm.stanford.edu/2023/03/13/alpaca.html
Ark Forecast: https://research.ark-invest.com/hubfs/1_Download_Files_ARK-Invest/Big_Ideas/ARK%20Invest_013123_Presentation_Big%20Ideas%202023_Final.pdf
Eliezer Tweet: https://twitter.com/ESYudkowsky/status/1635577836525469697
https://twitter.com/ESYudkowsky/status/1635667349792780288
Self-Instruct: https://arxiv.org/pdf/2212.10560.pdf
InstructGPT: https://openai.com/research/instruction-following
OpenAI Terms: https://openai.com/policies/terms-of-use
MMLU Test: https://arxiv.org/pdf/2009.03300.pdf
Apple LLM: https://www.nytimes.com/2023/03/15/technology/siri-alexa-google-assistant-artificial-intelligence.html
GPT 4 API: https://openai.com/pricing
Llama Models: https://arxiv.org/pdf/2302.13971.pdf
BritGPT: https://www.theguardian.com/technology/2023/mar/15/uk-to-invest-900m-in-supercomputer-in-bid-to-build-own-britgpt
Amazon: https://www.businessinsider.com/amazons-ceo-andy-jassy-on-chat-cpt-ai-2023-2?r=US&IR=T
AlexaTM: https://arxiv.org/pdf/2208.01448.pdf
Baidu Ernie: https://www.nytimes.com/2023/03/16/world/asia/china-baidu-chatgpt-ernie.html
PaLM API: https://developers.googleblog.com/2023/03/announcing-palm-api-and-makersuite.html
https://www.patreon.com/AIExplained
GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
Embedding Arithmetic of Multimodal Queries for Image Retrieval (Couairon et al., CVPRW 2022)
GPT-4 Technical Report
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Training Compute-Optimal Large Language Models
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
Product quantization for nearest neighbor search
ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics
LLaMA: Open and Efficient Foundation Language Models