
Full fine-tuning is the traditional method for adapting large language models (LLMs): it updates all of the model's parameters. While more resource-intensive than parameter-efficient fine-tuning (PEFT) and other methods, it allows for deeper and more comprehensive customization, especially when adapting to significantly different tasks or domains.
This is Part 5, the last in my series on LLM Customization. In this post, we'll explore full fine-tuning: why it's still relevant, how it works, and when it's worth the extra compute. By fine-tuning the entire model, we give it the flexibility to learn complex patterns and fully integrate new knowledge. Though it demands more memory and processing power, full fine-tuning is ideal when maximum performance and deep task adaptation are required, especially in production scenarios or domain-specific applications.
Despite the popularity and advantages of PEFT, full fine-tuning remains the most powerful and flexible method for adapting large language models. It allows every layer of the model to be updated, enabling the model to learn deep, task-specific patterns and adjust its entire internal reasoning process. This is especially important when the task or domain is significantly different from what the model was originally trained on, such as in specialized domains or when dealing with private data.
Process
- Prepare the fine-tuning dataset. This is labeled, domain- or task-specific, often private data. It is normally, though not always, smaller than the original training dataset.
- Prepare the base model.
- Train the model. Like other deep-learning models, this involves cost functions, backpropagation, and evaluation (a minimal training loop is sketched below).
- Test the performance of the LLM on the test data.
Compared to PEFT, full fine-tuning is more compute-intensive, takes longer to train, and carries a higher risk of overfitting—but it allows for maximum model flexibility and task adaptation.
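To make "updating every parameter" concrete, here is a minimal, self-contained sketch of the kind of loop that runs under the hood. A tiny stand-in network and synthetic batches take the place of the 1.1B-parameter model and real dataset we use later; the Hugging Face Trainer will handle all of this for us in the actual training script.
import torch
import torch.nn as nn

# Stand-in model and synthetic data; in this post the model is TinyLlama (1.1B parameters)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # every parameter is trainable
loss_fn = nn.MSELoss()

model.train()
for epoch in range(2):
    for x, y in data:
        loss = loss_fn(model(x), y)  # forward pass and cost function
        loss.backward()              # backpropagation through every layer
        optimizer.step()             # update all weights
        optimizer.zero_grad()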
Components
GPU
In the case of full fine-tuning, you really need a strong GPU with sufficient memory to make training reasonably efficient. In my case, I will be using an NVIDIA RTX 4090 GPU with 24 GB of VRAM rented from Vast.ai.
CUDA
CUDA, or Compute Unified Device Architecture, is a parallel computing platform and programming model developed by NVIDIA. It allows software developers to use NVIDIA GPUs (Graphics Processing Units) for general-purpose computing tasks, not just graphics.
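To verify that PyTorch can actually reach the GPU through CUDA, a quick check like the following helps; the device name printed will of course depend on your hardware.
import torch

# Verify that PyTorch can see the GPU through CUDA
print(torch.cuda.is_available())          # True if CUDA is set up correctly
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the RTX 4090 on the rented instance
    print(torch.version.cuda)             # CUDA version this torch build targets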
Torch
Underneath all of these is Torch. When we use torch in Python, it usually refers to PyTorch. The original Torch was a deep learning framework written in Lua, and PyTorch is considered its spiritual successor. With its Python interface (hence the name PyTorch), it serves as a powerful framework for building, training, and deploying neural networks, powering the entire machine learning and model training pipeline.
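At its core, PyTorch provides tensors and automatic differentiation, which is what powers the backpropagation step in fine-tuning. A tiny illustration:
import torch

# Tensors plus automatic differentiation are PyTorch's core building blocks
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
loss = (w * x - 9.0) ** 2   # a simple squared error
loss.backward()             # backpropagation computes d(loss)/dw
print(w.grad)               # tensor(-18.)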
Hugging Face
Hugging Face is a company that has become the go-to hub for machine learning models and tools, especially in natural language processing (NLP). They develop and maintain the open-source libraries used throughout this series:
- datasets – The datasets library gives you access to thousands of ready-to-use datasets — NLP, audio, vision, and more. This includes the Alpaca Dataset.
- transformers – The transformers library provides access to thousands of pretrained models for NLP, computer vision, and more. This includes the TinyLlama Model.
- peft – The peft library provides a framework and tools for applying PEFT techniques to LLMs, including prefix tuning, adapters, and of course LoRA.
Alpaca Dataset
The Alpaca Dataset is a high-quality dataset specifically designed for fine-tuning language models on instruction-following tasks. It was created by researchers at Stanford to provide a good resource for training models to better understand and respond to natural language instructions, mimicking the style of instruction-based tasks that are prevalent in models like OpenAI's GPT series. It was used to fine-tune Meta's Llama 7B model to create the Alpaca model. We will be using the same approach here, but with full fine-tuning of a much smaller model and, due to resource constraints, only a small subset of the data. We will not see dramatic changes in output, but the approach is what we're after.
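To get a feel for the data before training, the snippet below peeks at a few records of the same yahma/alpaca-cleaned dataset used in the training script later on; the slice size is arbitrary.
from datasets import load_dataset

# Peek at a few Alpaca records to see the instruction / input / output structure
sample = load_dataset("yahma/alpaca-cleaned", split="train[:5]")
print(sample.column_names)       # the fields the formatting step relies on
print(sample[0]["instruction"])  # a natural-language instruction
print(sample[0]["output"])       # the expected response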
TinyLlama
TinyLlama is a small and efficient open-source large language model based on Meta's original LLaMA (Large Language Model Meta AI) architecture. It was developed to provide high-quality performance while being lightweight enough to run on consumer-grade hardware. Despite its name, it is actually a capable 1.1-billion-parameter model. We will be using TinyLlama/TinyLlama-1.1B-Chat-v1.0, a chat-optimized version tuned for conversational tasks. It is no ChatGPT, which is exactly why we will fine-tune it as an exercise.
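If you want to verify the size claim yourself, you can load the model and count its parameters, which also doubles as a check that the weights download correctly.
from transformers import AutoModelForCausalLM

# Load TinyLlama and count its parameters (roughly 1.1 billion)
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")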
Quantization
In PEFT, we used quantization during training due to resource constraints. However, this is not recommended with full fine-tuning. Quantization is the process of converting a model's weights from high-precision floating-point numbers (usually 32-bit floats) to lower-precision formats like 8-bit integers (int8) or 16-bit floats (float16 or bfloat16). Unfortunately, this loss of precision (8-bit offers only 256 discrete values; 16-bit, 65,536) can lead to poor convergence of the cost function during training. We will have to use higher precision and accept the larger memory footprint.
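The snippet below illustrates the precision loss with a crude manual int8 quantization of a single tensor; it is only meant to show the rounding error, not the calibrated schemes real quantization libraries use.
import torch

# Full-precision weights
w = torch.randn(5, dtype=torch.float32)

# Crude symmetric int8 quantization: map to [-127, 127], round, map back
scale = w.abs().max() / 127
w_int8 = torch.round(w / scale).to(torch.int8)
w_dequant = w_int8.to(torch.float32) * scale

print(w)
print(w_dequant)
print((w - w_dequant).abs().max())  # the rounding error introduced by quantization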
Install Packages
First, we make sure to install torch with CUDA support.
pip uninstall torch
pip install torch --index-url https://download.pytorch.org/whl/cu118
Then we install the rest of the packages.
pip install transformers datasets
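Before moving on, it is worth confirming that the CUDA build of torch (and not the CPU-only wheel) was actually installed; the exact version numbers will vary.
import torch
import transformers
import datasets

# Confirm the CUDA build of torch and the other libraries are installed
print(torch.__version__)   # should end in +cu118 for the CUDA 11.8 wheel
print(transformers.__version__, datasets.__version__)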
Inference With Base Model
To set the baseline, we first run inference with the original (base) model.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Load tokenizer and model without quantization
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # TinyLlama has no pad token by default
# Choose device manually
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load base model on the selected device
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)
model.eval()
# Generate function
def generate(model, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
    return tokenizer.decode(output[0], skip_special_tokens=True)
# Alpaca-style instruction prompt
prompt = """### Instruction:
Explain the difference between renewable and non-renewable energy sources.
### Response:
"""
print("=== BEFORE Fine-Tuning ===")
print(generate(model, prompt))
Training
Next, we run the full fine-tuning training process. As mentioned earlier, you will need a strong GPU with sufficient memory for this. If you don't have one, you can rent one from Vast.ai.
Here’s an overview of the training code:
- Load the model – Load the base TinyLlama 1.1B Chat model from Hugging Face.
- Load the tokenizer – Load the tokenizer that matches the model and set the padding token to the end-of-sequence token.
- Load and format the dataset – Load the Alpaca-cleaned dataset from Hugging Face and use a sample subset (first 1,000 entries) for purposes of illustration. Format each training example in an instruction-following style with clear sections for instruction, input, and response.
- Tokenize the data – Convert the formatted text into token IDs that the model understands, and create matching labels for training.
- Set up trainer – Define hyperparameters such as batch size, learning rate, number of epochs, logging intervals, checkpoint saving, and use of half-precision (fp16). Set up the Hugging Face Trainer with the model, training data, tokenizer, and a data collator that handles padding and batching.
- Train the model – Run the training loop using the prepared dataset and model.
- Save the fine-tuned model – Save the final fine-tuned model and tokenizer locally for future use or inference.
And here’s the code:
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling
)
# === Load base model and tokenizer ===
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Required for training
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto"
)
# === Load and format Alpaca dataset ===
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")
def format_alpaca(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return {"text": prompt}
dataset = dataset.map(format_alpaca)
# === Tokenize dataset ===
def tokenize(example):
    result = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )
    result["labels"] = result["input_ids"].copy()
    return result
tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
# === Set up Trainer ===
training_args = TrainingArguments(
output_dir="./tinyllama-full",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=5e-5,
num_train_epochs=2,
logging_steps=10,
save_steps=100,
save_total_limit=1,
fp16=True, # Keep True for GPU
report_to="none"
)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
tokenizer=tokenizer,
data_collator=data_collator
)
# === Train the model ===
trainer.train()
# === Save fine-tuned model and tokenizer ===
model.save_pretrained("tinyllama-alpaca-full")
tokenizer.save_pretrained("tinyllama-alpaca-full")
Inference With Full Fine-Tuned Model
After training is done, we run inference with the fully fine-tuned model.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Model name — make sure this matches what you used earlier
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model_path = "tinyllama-alpaca-full" # Path to your full fine-tuned model
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Choose device manually
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load fully fine-tuned model
model = AutoModelForCausalLM.from_pretrained(model_path)
model.to(device)
model.eval()
# Generate function
def generate(model, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
    return tokenizer.decode(output[0], skip_special_tokens=True)
# Alpaca-style instruction prompt
prompt = """### Instruction:
Explain the difference between renewable and non-renewable energy sources.
### Response:
"""
print("=== AFTER Full Fine-Tuning ===")
print(generate(model, prompt))
And that’s it. Thanks to Hugging Face, everything’s relatively straightforward.
The code is available in the GitHub repo as well as in the Google Colab Notebook.