The AI Application Era
Building AI-powered applications is no longer the exclusive domain of machine learning engineers. With modern APIs and frameworks, full-stack developers can now integrate powerful AI capabilities into their applications with relative ease.
This guide covers everything from choosing the right AI service to production deployment, with practical code examples and real-world architecture patterns.
Understanding AI APIs
The Three Tiers
Tier 1: Foundation Models (Most Powerful)
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
- Features: General-purpose, versatile, expensive
Tier 2: Specialized Models
- Cohere (Embeddings, classification)
- Stability AI (Image generation)
- ElevenLabs (Voice synthesis)
- Features: Domain-specific, optimized, moderate cost
Tier 3: Pre-built Solutions
- AWS Comprehend
- Azure Cognitive Services
- Google Cloud AI
- Features: Ready-to-use, limited customization, low cost
Choosing the Right API
Decision tree:
Need custom training? → Fine-tune a foundation model, or use a Tier 3 service that supports custom training
General text processing? → GPT-3.5 (cheap) or GPT-4 (quality)
Specialized domain? → Tier 2 specialist
Maximum control? → Open-source (Llama, Mistral) + hosting
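The decision tree above can be sketched as a small routing helper. This is an illustrative sketch only: the task categories and model identifiers are example names, not a definitive mapping.

```typescript
// Illustrative router for the decision tree above. The task labels and
// returned model names are assumptions for the sketch, not fixed choices.
type Task = 'general-cheap' | 'general-quality' | 'embeddings' | 'self-hosted';

function pickModel(task: Task): string {
  switch (task) {
    case 'general-cheap':   return 'gpt-3.5-turbo';    // cheap general text
    case 'general-quality': return 'gpt-4';            // quality general text
    case 'embeddings':      return 'cohere-embed';     // Tier 2 specialist
    case 'self-hosted':     return 'llama-or-mistral'; // maximum control
  }
}
```

Centralizing this choice in one function makes it easy to swap models later without touching call sites.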
Architecture Patterns
Pattern 1: Simple API Wrapper
Use case: Basic AI features without complexity
// Next.js API route (pages router)
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export default async function handler(req, res) {
  const { prompt } = req.body;
  const response = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: prompt }],
  });
  res.json({ result: response.choices[0].message.content });
}
Pros: Simple, fast to implement
Cons: No caching, expensive, no fallbacks
Pattern 2: Queue-Based Processing
Use case: Time-intensive AI tasks
Architecture:
User Request → Queue (BullMQ/RabbitMQ)
→ Worker Process → AI API
→ Update Database → Notify User
Implementation:
// Add to queue
async function generateReport(userId: string, data: any) {
  const job = await reportQueue.add('generate', {
    userId,
    data,
    timestamp: Date.now()
  });
  return { status: 'processing', jobId: job.id };
}
// Worker
reportQueue.process('generate', async (job) => {
  const { userId, data } = job.data;
  const aiResponse = await callAI(data);
  await db.reports.create({
    userId,
    content: aiResponse,
    status: 'completed'
  });
  await notifyUser(userId, 'Report ready!');
});
Benefits:
- Non-blocking
- Handles high load
- Retry logic
- Progress tracking
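The retry behavior listed above is configured per job. The sketch below shows BullMQ-style job options; the field names follow BullMQ's `JobsOptions`, but confirm them against the version you install.

```typescript
// BullMQ-style job options enabling retries with exponential backoff.
// Field names follow BullMQ's JobsOptions; verify against your version.
function retryOptions(attempts = 3) {
  return {
    attempts,                                      // retry failed jobs up to N times
    backoff: { type: 'exponential', delay: 1000 }, // 1s, 2s, 4s, ...
    removeOnComplete: true,                        // keep the queue small
  };
}

// Usage sketch: reportQueue.add('generate', payload, retryOptions());
```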
Pattern 3: Caching Layer
Use case: Reduce AI API calls and costs
import { createHash } from 'crypto';

// Stable cache key derived from the prompt text
const hash = (s: string) => createHash('sha256').update(s).digest('hex');

async function getCachedAIResponse(prompt: string) {
  // Check cache first
  const cached = await redis.get(`ai:${hash(prompt)}`);
  if (cached) return JSON.parse(cached);
  // Call AI if not cached
  const response = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: prompt }],
  });
  const result = response.choices[0].message.content;
  // Cache for 24 hours
  await redis.setex(`ai:${hash(prompt)}`, 86400, JSON.stringify(result));
  return result;
}
Cost savings: often 60-80% for workloads with many repeated queries
Pattern 4: Hybrid AI (Multiple Models)
Use case: Combine strengths of different models
async function processDocument(doc: string) {
  // Use cheap model for classification
  const category = await classifyWithGPT35(doc);
  // Use expensive model only for specific categories
  if (category === 'complex') {
    return await analyzeWithGPT4(doc);
  }
  // Use specialized model for others
  return await analyzeWithCohere(doc);
}
Cost optimization: can roughly halve costs compared with using GPT-4 for everything, depending on the category mix
Pattern 5: Streaming Responses
Use case: Better UX for long outputs
// Server
export async function POST(req: Request) {
  const { prompt } = await req.json();
  const stream = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  const encoder = new TextEncoder();
  return new Response(
    new ReadableStream({
      async start(controller) {
        for await (const chunk of stream) {
          const text = chunk.choices[0]?.delta?.content || '';
          controller.enqueue(encoder.encode(text));
        }
        controller.close();
      },
    })
  );
}
// Client
async function streamResponse(prompt: string) {
  const response = await fetch('/api/chat', {
    method: 'POST',
    body: JSON.stringify({ prompt }),
  });
  if (!response.body) throw new Error('No response body');
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const text = decoder.decode(value);
    displayText(text); // Update UI incrementally
  }
}
UX improvement: users see output immediately, so responses feel several times faster even though total generation time is unchanged
Building a Complete AI App: Chat Application
Architecture Overview
Frontend (Next.js)
↓
API Routes
↓
Rate Limiter → Auth Check → Input Validation
↓
Prompt Engineering Layer
↓
AI Provider (OpenAI/Anthropic)
↓
Post-processing
↓
Database (Conversation History)
Step 1: Setup
npm install openai zod next-auth prisma redis
Step 2: Environment Configuration
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://...
REDIS_URL=redis://...
NEXTAUTH_SECRET=...
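It helps to fail fast at startup if any of this configuration is missing, rather than discovering it on the first request. A minimal sketch (the key list mirrors the variables above; adjust for your deployment):

```typescript
// Fail fast at startup if required configuration is missing.
const REQUIRED_ENV = ['OPENAI_API_KEY', 'DATABASE_URL', 'REDIS_URL', 'NEXTAUTH_SECRET'];

function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter(key => !env[key]);
}

// e.g. in your app's entry point:
// const missing = missingEnv(process.env);
// if (missing.length) throw new Error(`Missing env vars: ${missing.join(', ')}`);
```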
Step 3: API Route with All Best Practices
import { OpenAI } from 'openai';
import { z } from 'zod';
import { rateLimit } from '@/lib/rate-limit';
import { getServerSession } from 'next-auth';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Input validation
const messageSchema = z.object({
  message: z.string().min(1).max(4000),
  conversationId: z.string().optional(),
});

export async function POST(req: Request) {
  try {
    // 1. Authentication
    const session = await getServerSession();
    if (!session) {
      return Response.json({ error: 'Unauthorized' }, { status: 401 });
    }

    // 2. Rate limiting
    const rateLimitResult = await rateLimit(session.user.id);
    if (!rateLimitResult.success) {
      return Response.json({ error: 'Too many requests' }, { status: 429 });
    }

    // 3. Input validation
    const body = await req.json();
    const { message, conversationId } = messageSchema.parse(body);

    // 4. Load conversation history
    const messages = await loadHistory(conversationId);

    // 5. Add system prompt
    const systemPrompt = {
      role: 'system',
      content: 'You are a helpful assistant. Be concise and accurate.'
    };

    // 6. Call OpenAI with error handling
    const completion = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: [systemPrompt, ...messages, { role: 'user', content: message }],
      max_tokens: 500,
      temperature: 0.7,
    }).catch(error => {
      console.error('OpenAI error:', error);
      throw new Error('AI service unavailable');
    });

    const assistantMessage = completion.choices[0].message.content;

    // 7. Save conversation
    await saveConversation({
      conversationId,
      userId: session.user.id,
      userMessage: message,
      assistantMessage,
    });

    // 8. Return response
    return Response.json({
      message: assistantMessage,
      conversationId,
    });
  } catch (error) {
    if (error instanceof z.ZodError) {
      return Response.json({
        error: 'Invalid input',
        details: error.errors
      }, { status: 400 });
    }
    console.error('Chat error:', error);
    return Response.json({ error: 'Internal server error' }, { status: 500 });
  }
}
Step 4: Rate Limiting Implementation
// lib/rate-limit.ts
import { Redis } from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

export async function rateLimit(userId: string) {
  const key = `rate-limit:${userId}`;
  const limit = 20; // requests
  const window = 60; // seconds

  const current = await redis.incr(key);
  if (current === 1) {
    // Note: INCR and EXPIRE are separate commands here; if the process
    // dies between them the key never expires. A Lua script makes this atomic.
    await redis.expire(key, window);
  }
  if (current > limit) {
    return { success: false, remaining: 0 };
  }
  return { success: true, remaining: limit - current };
}
Step 5: Cost Tracking
async function trackCost(usage: any, userId: string) {
  // Per 1K tokens; prices change, so verify against current provider pricing
  const costs = {
    'gpt-4': { input: 0.03, output: 0.06 },
    'gpt-3.5-turbo': { input: 0.0015, output: 0.002 },
  };
  const modelCost = costs[usage.model];
  if (!modelCost) return 0; // unknown model: skip rather than crash
  const cost =
    (usage.prompt_tokens / 1000) * modelCost.input +
    (usage.completion_tokens / 1000) * modelCost.output;
  await db.usage.create({
    userId,
    model: usage.model,
    tokens: usage.total_tokens,
    cost: cost,
  });
  return cost;
}
Production Best Practices
1. Error Handling
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

async function callAIWithRetry(prompt: string, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        messages: [{ role: "user", content: prompt }],
      });
    } catch (error) {
      if (error.status === 429) { // Rate limited: exponential backoff
        await sleep(Math.pow(2, i) * 1000);
        continue;
      }
      if (error.status >= 500) { // Server error
        if (i === retries - 1) throw error;
        await sleep(1000);
        continue;
      }
      throw error; // Client error, don't retry
    }
  }
  throw new Error('AI call failed after all retries');
}
2. Prompt Security
function sanitizePrompt(userInput: string) {
  // Remove common prompt-injection phrases
  const forbidden = [
    'ignore previous instructions',
    'system:',
    'new instruction:',
  ];
  let sanitized = userInput;
  forbidden.forEach(phrase => {
    sanitized = sanitized.replace(new RegExp(phrase, 'gi'), '');
  });
  // Limit length
  return sanitized.substring(0, 4000);
}
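A phrase blocklist is easy to evade, so it pairs well with a structural defense: wrap untrusted input in delimiters and instruct the model (in the system prompt) to treat everything inside them as data, never as instructions. The tag name below is an arbitrary choice for the sketch.

```typescript
// Structurally separate untrusted input from instructions. The
// <user_input> tag name is illustrative; the system prompt must tell
// the model to treat its contents as data only.
function wrapUserInput(userInput: string): string {
  // Strip any delimiter the user tries to smuggle in
  const escaped = userInput.replace(/<\/?user_input>/gi, '');
  return `<user_input>\n${escaped}\n</user_input>`;
}
```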
3. Content Moderation
async function moderateContent(text: string) {
  const moderation = await openai.moderations.create({
    input: text,
  });
  const flagged = moderation.results[0].flagged;
  if (flagged) {
    const categories = moderation.results[0].categories;
    throw new Error(`Content flagged: ${Object.keys(categories).filter(k => categories[k]).join(', ')}`);
  }
  return true;
}
4. Monitoring and Observability
import { trace } from '@opentelemetry/api';

async function monitoredAICall(prompt: string) {
  const span = trace.getTracer('ai-service').startSpan('openai.call');
  try {
    span.setAttribute('prompt.length', prompt.length);
    span.setAttribute('model', 'gpt-3.5-turbo');
    const start = Date.now();
    const response = await openai.chat.completions.create({
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: prompt }],
    });
    const duration = Date.now() - start;
    span.setAttribute('response.tokens', response.usage.total_tokens);
    span.setAttribute('duration.ms', duration);
    span.setAttribute('cost.usd', calculateCost(response.usage));
    return response;
  } catch (error) {
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}
Common Pitfalls and Solutions
Pitfall 1: Not Handling Context Length
Problem: Hitting token limits
Solution:
// Rough heuristic: ~4 characters per token for English text; use a real
// tokenizer (e.g. tiktoken) where accuracy matters
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function truncateConversation(messages: Message[], maxTokens = 3000) {
  let totalTokens = 0;
  const truncated = [];
  // Keep most recent messages
  for (let i = messages.length - 1; i >= 0; i--) {
    const tokens = estimateTokens(messages[i].content);
    if (totalTokens + tokens > maxTokens) break;
    truncated.unshift(messages[i]);
    totalTokens += tokens;
  }
  return truncated;
}
Pitfall 2: Exposing API Keys
Problem: API keys in client-side code
Solution:
- Always call AI APIs from server
- Use environment variables
- Implement API routes
- Never send keys to frontend
Pitfall 3: No Cost Controls
Problem: Unexpected API bills
Solution:
async function checkBudget(userId: string) {
  const monthlySpend = await getMonthlySpend(userId);
  const limit = await getUserLimit(userId);
  if (monthlySpend >= limit) {
    throw new Error('Monthly budget exceeded');
  }
  return { remaining: limit - monthlySpend };
}
Testing AI Applications
describe('AI Chat API', () => {
  it('should return valid response', async () => {
    const response = await chatCompletion('Hello');
    expect(response).toBeDefined();
    expect(response.length).toBeGreaterThan(0);
    expect(typeof response).toBe('string');
  });

  it('should handle rate limits', async () => {
    // Make 21 requests (over the limit of 20)
    const requests = Array(21).fill(null).map(() =>
      chatCompletion('Test')
    );
    await expect(Promise.all(requests)).rejects.toThrow('Too many requests');
  });

  it('should reject harmful content', async () => {
    await expect(
      chatCompletion('How to make a bomb')
    ).rejects.toThrow('Content flagged');
  });
});
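Unit tests like these should not hit the live API: responses are non-deterministic and every run costs money. One option is a hand-rolled stub that mirrors the shape of OpenAI's chat completion response. `mockOpenAI` below is a local helper for the sketch, not part of the openai package; inject it wherever your code accepts a client.

```typescript
// Stub with the same response shape as OpenAI chat completions, for
// deterministic unit tests. Local helper; not part of the openai package.
function mockOpenAI(cannedReply: string) {
  return {
    chat: {
      completions: {
        create: async (_params?: unknown) => ({
          choices: [{ message: { role: 'assistant', content: cannedReply } }],
          usage: { prompt_tokens: 10, completion_tokens: 5, total_tokens: 15 },
        }),
      },
    },
  };
}

// Usage sketch: pass mockOpenAI('ok') to code that expects a client,
// e.g. via a constructor parameter or dependency injection.
```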
Deployment Checklist
- Environment variables configured
- Rate limiting implemented
- Error handling and retries
- Content moderation enabled
- Cost tracking in place
- Monitoring and alerting configured
- Caching layer active
- Input validation
- API keys secured
- CORS configured correctly
- Budget limits set
- Backup AI provider configured
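The last checklist item can be as simple as a generic fallback wrapper. The sketch below takes provider functions as arguments, so it works with any two clients; `callOpenAI` and `callAnthropic` in the usage note are illustrative names.

```typescript
// Generic fallback wrapper for the "backup AI provider" checklist item.
// Provider functions are injected, so any two clients work.
async function withFallback<T>(
  primary: () => Promise<T>,
  backup: () => Promise<T>,
): Promise<T> {
  try {
    return await primary();
  } catch (err) {
    console.warn('Primary AI provider failed; using backup:', err);
    return backup();
  }
}

// Usage sketch (function names are illustrative):
// const reply = await withFallback(
//   () => callOpenAI(prompt),
//   () => callAnthropic(prompt),
// );
```

Note that different providers format responses differently, so normalize the output shape before returning it to callers.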
Real-World Examples
Example 1: Document Q&A
async function documentQA(document: string, question: string) {
  const chunks = splitIntoChunks(document, 1000);
  const embeddings = await getEmbeddings(chunks);
  const questionEmbedding = await getEmbedding(question);
  const relevantChunks = findSimilar(questionEmbedding, embeddings, 3);
  const context = relevantChunks.map(c => chunks[c.index]).join('\n\n');
  const answer = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: "Answer based only on the provided context."
      },
      {
        role: "user",
        content: `Context:\n${context}\n\nQuestion: ${question}`
      }
    ],
  });
  return answer.choices[0].message.content;
}
Example 2: Code Review Bot
async function reviewCode(code: string, language: string) {
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `You are a senior ${language} code reviewer. Provide constructive feedback on: security, performance, best practices, and potential bugs.`
      },
      { role: "user", content: code }
    ],
  });
  return parseReviewComments(response.choices[0].message.content);
}
Conclusion
Building production-ready AI applications requires more than just API calls. Focus on:
- Robust error handling - APIs will fail
- Cost management - Track and limit spending
- Security - Sanitize inputs, moderate content
- Performance - Cache, queue, optimize
- Monitoring - Track usage, errors, costs
Start simple, add complexity as needed, and always prioritize user experience and cost efficiency.
Ready to build your AI-powered application? Contact us for architecture consultation and development services.
AI APIs are powerful tools, but production applications require careful architecture, robust error handling, and constant monitoring. Build for reliability, not just functionality.