What's in the RedPajama-Data-1T LLM training set

Description

RedPajama is “a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens”. It’s a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, …
