Implementing a ChatGPT-Style Chatbot with GPT-2 and Hugging Face Transformers
This guide walks you through building a ChatGPT-like conversational AI using open-source tools. You'll learn how to load a pre-trained language model, fine-tune it on custom dialogue data, and deploy it as a chat interface. The guide covers:
- Loading the pre-trained GPT-2 model
- Creating and preparing custom data
- Fine-tuning (basic training structure)
- Inference (chatbot conversation)
- Simple UI with Gradio
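Before you begin, install the libraries used throughout this guide: Hugging Face Transformers and Datasets, PyTorch, and Gradio (exact version pins are up to you):
pip install transformers datasets torch gradio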
1. Setup & Load GPT-2
Code:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
# GPT-2 has no dedicated pad token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
# Resize embeddings in case the tokenizer vocabulary changed (a no-op here, but harmless)
model.resize_token_embeddings(len(tokenizer))
🖥️ Output:
Downloading (…)solve/main/vocab.json: 100%
Downloading (…)okenizer_config.json: 100%
Downloading (…)pytorch_model.bin: 100%
Model and tokenizer loaded successfully.
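As a quick sanity check (not part of the original steps), you can confirm the freshly loaded base model generates text before any fine-tuning:
# Optional: generate a short continuation with the base GPT-2 model
inputs = tokenizer("Hello, I am", return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=10,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))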
2. Sample Dataset (Custom Instruction Format)
Sample JSONL Data (chat_data.jsonl):
{"prompt": "User: What is AI?\nAssistant:", "response": "AI stands for Artificial Intelligence, a field that enables machines to mimic human intelligence."}
{"prompt": "User: Tell me a joke\nAssistant:", "response": "Why did the computer show up at work late? It had a hard drive!"}
Save this as chat_data.jsonl.
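If you prefer to create the file from Python (for example inside a notebook), a minimal sketch that writes the same two records looks like this:
import json

examples = [
    {"prompt": "User: What is AI?\nAssistant:", "response": "AI stands for Artificial Intelligence, a field that enables machines to mimic human intelligence."},
    {"prompt": "User: Tell me a joke\nAssistant:", "response": "Why did the computer show up at work late? It had a hard drive!"},
]
# Write one JSON object per line (JSONL)
with open("chat_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")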
3. Load and Tokenize Dataset
Code:
from datasets import load_dataset
dataset = load_dataset("json", data_files="chat_data.jsonl")
def tokenize(example):
    # For a causal LM like GPT-2, train on the concatenated prompt + response text
    texts = [p + " " + r for p, r in zip(example["prompt"], example["response"])]
    tokens = tokenizer(texts, truncation=True, padding="max_length", max_length=128)
    # Labels mirror input_ids; the data collator used below also handles this for causal LM training
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize, batched=True)
🖥️ Output:
DatasetDict({
train: Dataset({
features: ['prompt', 'response', 'input_ids', 'attention_mask', 'labels'],
num_rows: 2
})
})
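To verify that each prompt and response were joined into a single training sequence (assuming the tokenize function above), decode the first example back to text:
# Decode the first training example to confirm the text round-trips
first = tokenized_dataset["train"][0]
print(tokenizer.decode(first["input_ids"], skip_special_tokens=True))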
4. (Optional) Fine-tune Model on Instruction Data
For simplicity, let's fine-tune for a single epoch:
Code:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
training_args = TrainingArguments(
    output_dir="./chatgpt-model",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=1,
    save_steps=50,
    fp16=False
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)
trainer.train()
# Save the fine-tuned model (and tokenizer) so it can be reloaded later
trainer.save_model("./chatgpt-model")
Output (sample logs):
***** Running training *****
Epoch 1/1
Step Training Loss
10 2.205700
20 1.943200
...
Training completed. Saving model to ./chatgpt-model
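To reuse the fine-tuned weights in a later session, reload them from the output directory (this assumes the save step above ran and wrote to ./chatgpt-model):
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Reload the fine-tuned model and tokenizer from the training output directory
model = GPT2LMHeadModel.from_pretrained("./chatgpt-model")
tokenizer = GPT2Tokenizer.from_pretrained("./chatgpt-model")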
5. Inference: Generate Chat Responses
Code:
def chat_with_bot(user_input):
    prompt = f"User: {user_input}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=100,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_p=0.95,
        temperature=0.9,
    )
    reply = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Strip the prompt so only the assistant's continuation is returned
    return reply[len(prompt):].strip()
Try it:
print(chat_with_bot("What is AI?"))
print(chat_with_bot("Tell me a joke"))
Sample Output:
AI stands for Artificial Intelligence. It allows machines to perform tasks that normally require human intelligence like decision-making and language understanding.
Why did the computer show up late to work? It had a hard drive!
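chat_with_bot handles a single turn. If you want the model to see earlier turns, one option is to prepend prior exchanges to the prompt; the chat_with_history helper below is a hypothetical sketch, not part of the original code:
def chat_with_history(user_input, history=None):
    # Prepend earlier turns so the model sees conversation context
    history = history or []
    context = "".join(f"User: {u}\nAssistant: {a}\n" for u, a in history)
    prompt = f"{context}User: {user_input}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=60,
        do_sample=True,
        top_p=0.95,
        temperature=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    reply = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):].strip()
    history.append((user_input, reply))
    return reply, history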
6. Deploy with Gradio UI
Code:
import gradio as gr
def gradio_chat(user_input, history):
    history = history or []  # gr.State() starts as None on the first call
    response = chat_with_bot(user_input)
    history.append((user_input, response))
    return history, history

gr.Interface(
    fn=gradio_chat,
    inputs=[gr.Textbox(label="Type here"), gr.State()],
    outputs=[gr.Chatbot(label="ChatGPT Lite"), gr.State()],
    title="ChatGPT Clone",
).launch()
🖥️ Output:
A browser window will open with a chat interface.
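On recent Gradio versions, gr.ChatInterface can replace the manual state handling above; this optional alternative (not in the original code) manages the chat history for you:
import gradio as gr

def respond(message, history):
    # ChatInterface tracks the running history; we only need to return the reply
    return chat_with_bot(message)

gr.ChatInterface(fn=respond, title="ChatGPT Clone").launch()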
Summary Output
| Feature | Output Example |
|---|---|
| Model Response | “AI stands for Artificial Intelligence...” |
| Training | Model saved to ./chatgpt-model/ |
| UI | Fully working chatbot in browser |
| Inference Time | ~2s per response (on CPU), faster on GPU |