Parakeet: A Tiny LLM
17 points by razodactyl | 4 comments on Hacker News.
Hi all,

https://ift.tt/avSCxde

This post is mostly for gauging interest; I plan to release the entire end-to-end code, including:

- Dataset curation (including citations).
- Model checkpoints.
- Inference code.
- Synthetic data generation.
- etc.

Parakeet is the name of a small language model I've designed from the ground up for research purposes. The challenge was to see how far I could push the limits of LLM tech given a massively constrained environment. It was trained on a 3080 Ti and still has considerably more training ahead of it, but here are the results so far.

Specs:

- 18 layers / 18 heads.
- 8K context.
- 1152 embedding dimension.
- cl100k tokenizer (TikToken).
- ALiBi (the longest sequence I can train on is 1,200 tokens, so this was crucial; see the sketch below).
- KV caching for improved inference.
- Grouped Query Attention (2 layers per group / speeds up inference; see the attention sketch below).
- `min_p` sampling: cuts off low-quality tokens (see the sampling sketch below).
- Softmax1 (https://ift.tt/DOdwVR1): not sure if this really made much of a difference; it's hard to train comparable models when compute resources are limited.
- Sub-400M parameters (378M, from memory).

Edit: things I forgot to mention:

- No RLHF / DPO; it's entirely dataset-driven.
- The model seems mostly harmless, since it was trained only on synthetic data.
- A side-effect of being trained only on synthetic data is that the model learns quite fast.
- There's probably less than two weeks of actual training time in the model so far.
- You don't need to start from scratch when altering model parameters: weights can be copied/merged in and out of smaller/larger models (see the transplant sketch below).

Why?

- Curiosity got the better of me: I wanted to know what would happen if a model with a comparatively small number of parameters was bombarded with data.
- There have been plenty of results showing these language models still have room for more training, yet instead of training them further, many are simply scaled up.
- I wanted to see what happens if you just keep training them.

References (training datasets; loading sketch below):

- ./datasets/wikipedia-20220301.en.jsonl
- ./datasets/euclaise_littletown.jsonl (https://ift.tt/A47OvVf)
- ./datasets/squad-v2.0-processed.jsonl
- ./datasets/huggingface_ultrachat200k.jsonl (https://ift.tt/jzwOqTW)
- ./datasets/wizardlm_evol_instruct_v2_196k.jsonl (https://ift.tt/n96beRI)
- ./datasets/crumb_clean_instruct_440k.jsonl (https://ift.tt/6Y1gklw)
  - Generate a story starting with the sentence "It was already late when they stepped out of the house".
- ./datasets/openorca_4m.jsonl (https://ift.tt/9TJnbXS)
- ./datasets/databricks_dolly15k.jsonl (https://ift.tt/IeYZzjd)
  - Common-sense reasoning.
- ./datasets/teven_code_contests4m.jsonl (https://ift.tt/DCQVSZL)
  - ['PYTHON', 'PYTHON3', 'JAVA', 'CPP']
- ./datasets/squad-v2.0-summaries.jsonl
- ./datasets/google-boolq.jsonl
- ./datasets/stingning_ultrachat.jsonl (https://ift.tt/YQpjC53)
- ./datasets/wikimovies-train.jsonl
- ./datasets/kunishou-databricks-dolly-15k-ja.jsonl
- ./datasets/wizardlm_evol_instruct_70k.jsonl (https://ift.tt/9bVudJz)
- ./datasets/map_codefeedback.jsonl

Sorry for the bad formatting! ...continues in reply due to the 4,000-character limit.
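ALiBi sketch: a minimal PyTorch rendition of the linear attention bias from the ALiBi paper, not the Parakeet code (which hasn't been released yet). The slope schedule follows the paper's rule, including the non-power-of-two case that 18 heads falls into.

```python
import math
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Head-specific slopes: a geometric sequence for a power-of-two head count,
    # extended with interleaved slopes otherwise (e.g. 18 heads).
    def power_of_2(n):
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]
    if math.log2(n_heads).is_integer():
        slopes = power_of_2(n_heads)
    else:
        closest = 2 ** math.floor(math.log2(n_heads))
        slopes = power_of_2(closest) + power_of_2(2 * closest)[0::2][: n_heads - closest]
    return torch.tensor(slopes)

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = slope[h] * (j - i): a penalty that grows linearly with
    # distance into the past; future positions are masked for causal attention.
    slopes = alibi_slopes(n_heads)                          # (H,)
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]                       # (T, T): j - i
    bias = slopes[:, None, None] * rel[None, :, :].float()  # (H, T, T), <= 0 in the past
    return bias.masked_fill(rel > 0, float("-inf"))
```

Because the bias is recomputed from positions at every forward pass, no positional embeddings are learned, which is what lets a model trained on 1,200-token windows be run at the 8K context mentioned above.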
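Attention sketch: a hedged PyTorch sketch of grouped-query attention with the softmax1 variant from the linked post, again not the released code. The 9 KV heads (2 query heads sharing each KV head) are one reading of the "2 per group" note above and are an assumption; the real grouping may differ.

```python
import torch
import torch.nn as nn

def softmax1(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # "Quiet softmax": exp(x_i) / (1 + sum_j exp(x_j)).
    # The extra 1 lets a head put (close to) zero weight everywhere.
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0.0)   # shift for numerical stability
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

class GroupedQueryAttention(nn.Module):
    """Sketch: n_q query heads share n_kv key/value heads (n_kv < n_q)."""
    def __init__(self, d_model: int = 1152, n_q: int = 18, n_kv: int = 9):
        super().__init__()
        assert n_q % n_kv == 0
        self.n_q, self.n_kv = n_q, n_kv
        self.d_head = d_model // n_q                          # 1152 / 18 = 64
        self.wq = nn.Linear(d_model, n_q * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv * self.d_head, bias=False)
        self.wv = nn.Linear(d_model, n_kv * self.d_head, bias=False)
        self.wo = nn.Linear(n_q * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); bias: ALiBi-plus-causal bias of shape (n_q, T, T).
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        # Each KV head serves n_q // n_kv query heads; only the n_kv heads need to
        # be stored in the KV cache, which is where the inference speed-up comes from.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        att = q @ k.transpose(-2, -1) / self.d_head ** 0.5 + bias
        att = softmax1(att, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, -1)
        return self.wo(out)
```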
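Sampling sketch: the min_p rule as commonly described (keep only tokens whose probability is at least min_p times the top token's probability). The 0.1 threshold and temperature are placeholders, not the values Parakeet uses.

```python
import torch

def sample_min_p(logits: torch.Tensor, min_p: float = 0.1, temperature: float = 1.0) -> int:
    # Discard tokens far below the most likely one, renormalise, then sample.
    # logits: (vocab_size,)
    probs = torch.softmax(logits / temperature, dim=-1)
    keep = probs >= min_p * probs.max()
    probs = torch.where(keep, probs, torch.zeros_like(probs))
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
```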
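Transplant sketch: one way the "copy weights into a smaller/larger model" trick can be done, assuming matching parameter names between the two checkpoints. This is a hypothetical helper, not the author's script.

```python
import torch

def transplant_weights(small_state: dict, large_state: dict) -> dict:
    # For every parameter that exists under the same name in both checkpoints,
    # copy the overlapping slice of the smaller tensor into the corner of the
    # larger one; newly added rows/columns keep their fresh initialisation.
    merged = {name: tensor.clone() for name, tensor in large_state.items()}
    for name, small in small_state.items():
        if name not in merged or small.dim() != merged[name].dim():
            continue
        region = tuple(slice(0, min(a, b)) for a, b in zip(small.shape, merged[name].shape))
        merged[name][region].copy_(small[region])
    return merged

# Usage (hypothetical file names):
#   merged = transplant_weights(torch.load("parakeet_small.pt"), big_model.state_dict())
#   big_model.load_state_dict(merged)
```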
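Loading sketch: the JSONL paths in the references look like they come straight from the training script, so here is a hedged sketch of how such a mixture might be read. The actual schema, weighting, filtering, and citation handling are unknown until the curation code is released.

```python
import json
import random

def load_mixture(paths, sample_rate=1.0, seed=0):
    # Assumes one JSON object per line in each file; subsamples at sample_rate
    # and shuffles the combined pool.
    rng = random.Random(seed)
    examples = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if rng.random() <= sample_rate:
                    examples.append(json.loads(line))
    rng.shuffle(examples)
    return examples

# e.g. load_mixture(["./datasets/databricks_dolly15k.jsonl",
#                    "./datasets/squad-v2.0-processed.jsonl"], sample_rate=0.25)
```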