Using Axolotl; Brute Forcing It.
Archived from occybyte.com/resources · 2023-11-29
https://github.com/OpenAccess-AI-Collective/axolotl
Okay, to set up, just create a directory wherever you want it to be. I have mine in:
.LLM/axolotl
The easiest way to set it up really is by using Docker; I decided to use Anaconda Prompt to actually go through all of that. I didn’t create an Anaconda Prompt shortcut with admin privileges, because navigating to the directory by hand is good practice for me. Instead I just:
Then that’s when I did this whole part:
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'
Unless there is a new update, I just use docker compose up -d to run it. Interestingly enough, I run into issues where I need to reset Docker back to factory defaults whenever I launch and exit it too many times in one session.
If it’s the first time, copy and paste this:
docker run --gpus '"all"' --rm -it winglian/axolotl:main-py3.10-cu118-2.0.1
Otherwise, if it’s not the first time, use this instead:
docker compose up -d
Datasets & YML
The biggest issue I ran into is that Axolotl doesn’t exactly explain what the different dataset types are. A lot of this is contextual lingo: you know what a dataset is, but the type being alpaca only tells you a lot if you have been doing this for a while. It doesn’t explain anything to a newbie.
# huggingface repo
datasets:
  - path: vicgalle/alpaca-gpt4
    type: alpaca # format from earlier

# huggingface repo with specific configuration/subset
datasets:
  - path: EleutherAI/pile
    name: enron_emails
    type: completion # format from earlier
    field: text # Optional[str] default: text, field to use for completion data
The value after the colon (:) on the type line of each datasets entry says what type of dataset it is.
- alpaca is “instruction” / “output”
- most of these will be sharegpt, context & alpaca types.
- sharegpt is conversational
- context is like text generation, summarization and such.
This means that if axolotl gives an error asking for an instruction, you likely need an instruction key inside your data file, whether it is json, jsonl, csv, parquet, etc.
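A quick way to catch that kind of error before launching a run is to scan the rows for the keys the type expects. This is a minimal sketch, not part of axolotl: find_missing_keys and ALPACA_KEYS are hypothetical names, and the two required keys are just the “instruction” / “output” pair mentioned above.

```python
# Hypothetical helper, not an axolotl API: report rows that lack the keys
# an alpaca-style dataset is expected to have.
ALPACA_KEYS = ("instruction", "output")


def find_missing_keys(rows, required=ALPACA_KEYS):
    """Return (row_number, missing_keys) for every row lacking a required key."""
    problems = []
    for number, row in enumerate(rows, start=1):
        missing = [key for key in required if key not in row]
        if missing:
            problems.append((number, missing))
    return problems


# Made-up rows: the second one uses "prompt" where "instruction" is expected.
rows = [
    {"instruction": "Name a color.", "input": "", "output": "Blue."},
    {"prompt": "Name a color.", "output": "Blue."},
]

problems = find_missing_keys(rows)
print(problems)
```

Any row listed in the result is one that would likely trigger the “asking for an instruction” kind of error.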
- If your dataset type doesn’t match any of the predefined ones, you’ll have to write a custom type section for it.
One of mine looks something like the one below.
datasets:
  - path: json
    ds_type: jsonl
    data_files: /localdataset/occybyte_existencetypes.jsonl
    type:
      system_prompt: ""
      field_system: system
      format: "[ID] {ID} [TEXT] {Text} [TYPE] {ExistenceType} [INFO] {InformationType} [TRUTH] {Truth} [NOTE] {Note}"
      no_input_format: "[ID] {ID} [TEXT] {Text} [TYPE] {ExistenceType} [INFO] {InformationType} [TRUTH] {Truth} [NOTE] {Note}"
This says that the dataset is a JSON Lines (jsonl) file that can be found at that specific path. Everything under type: describes the custom format that I created.
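To make the placeholder idea concrete, here is a rough sketch of how a template like that lines up with the keys of one JSONL row. It uses plain Python str.format purely as an illustration; axolotl's own prompt rendering may work differently, and the record values below are made up.

```python
# Illustration only: this is NOT axolotl's rendering code, just str.format
# applied to the same template string used in the custom type above.
TEMPLATE = (
    "[ID] {ID} [TEXT] {Text} [TYPE] {ExistenceType} "
    "[INFO] {InformationType} [TRUTH] {Truth} [NOTE] {Note}"
)

# Hypothetical row; the real file's values are not shown in this post.
record = {
    "ID": 1,
    "Text": "A unicorn",
    "ExistenceType": "fictional",
    "InformationType": "concept",
    "Truth": False,
    "Note": "example only",
    "system": "",
}


def render(template, row):
    """Fill each {Placeholder} with the value stored under the same key."""
    # Extra keys (like "system") are ignored; a missing key raises KeyError.
    return template.format(**row)


prompt = render(TEMPLATE, record)
print(prompt)
```

The point is only that every name in curly braces has to exist as a key in each row, spelled and capitalized the same way.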
Other resources to use:
- YAML Validator
- JSONL Validator
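If you would rather check a JSONL file locally, something like the sketch below does the basic job with only the standard library. validate_jsonl is a hypothetical helper name, and the rule that every line must be a JSON object is my assumption about what a dataset row should look like.

```python
import json


def validate_jsonl(lines):
    """Return (line_number, message) for every line that is not a JSON object."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            errors.append((number, "not a JSON object"))
    return errors


# Made-up sample: line 2 is malformed JSON, line 3 is valid JSON but a list.
sample = [
    '{"instruction": "Say hi.", "output": "Hi."}',
    '{"instruction": "Broken line", "output": }',
    "[1, 2, 3]",
]

errors = validate_jsonl(sample)
for number, message in errors:
    print(f"line {number}: {message}")
```

For a real file, pass the open file handle instead of the sample list, for example `with open(path, encoding="utf-8") as handle: validate_jsonl(handle)`.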