Great to see @Apple unveil its own language model, DCLM-7B. In response, @polyverse_ai has begun integrating @Apple's DCLM datasets and tools, laying the groundwork for optimizing AI training datasets to improve language model performance. The DCLM-Baseline dataset was built by applying a series of cleaning, filtering, and deduplication steps to the raw Common Crawl data (DCLM-Pool), as sketched below.
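To make that clean → filter → dedup flow concrete, here is a minimal sketch; the heuristics and function names are illustrative stand-ins, not the actual DCLM tooling, which uses far more sophisticated quality and language filters.

```python
import hashlib

def clean(text: str) -> str:
    # Normalize whitespace; real pipelines also strip HTML boilerplate.
    return " ".join(text.split())

def passes_filters(text: str) -> bool:
    # Toy heuristics standing in for DCLM's quality/language filters.
    return len(text.split()) >= 50 and text.isascii()

def dedup_key(text: str) -> str:
    # Exact-match dedup via a content hash; DCLM also applies fuzzier methods.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_baseline(raw_docs):
    # Yield documents that survive cleaning, filtering, and deduplication.
    seen = set()
    for doc in raw_docs:
        doc = clean(doc)
        if not passes_filters(doc):
            continue
        key = dedup_key(doc)
        if key in seen:
            continue
        seen.add(key)
        yield doc

# Usage: kept = list(build_baseline(iter_common_crawl_docs()))
```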
🌐 A 7-billion-parameter base model trained on 2.5 trillion tokens drawn from open datasets.
📊 Training data is predominantly English, with a context window of up to 2048 tokens.
📈 Training data combines DCLM-BASELINE, StarCoder, and ProofPile-2.
🧠 Performs on par with models like Mistral that were trained on proprietary datasets.
🔬 Training was done in PyTorch using the OpenLM framework (see the loading sketch below).
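For anyone wanting to try the model, here is a hedged sketch of loading it with Hugging Face transformers; it assumes the checkpoint is published on the Hub as apple/DCLM-7B and that installing the open_lm package registers the architecture with transformers, so check the model card before relying on it.

```python
# Assumes: pip install open_lm transformers, and the "apple/DCLM-7B" Hub repo.
from open_lm.hf import *  # registers OpenLM model classes with transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("apple/DCLM-7B")
model = AutoModelForCausalLM.from_pretrained("apple/DCLM-7B")

# Generate a short continuation (prompt is arbitrary).
inputs = tokenizer("Machine learning is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```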