Building Custom Claude Code Skills for Infrastructure Automation

2026-01-24 - 11 min read

Daniel Young

Founder, DRYCodeWorks

Your team runs the same Terraform plans, database migrations, and deployment scripts daily. Instead of documenting procedures that nobody reads, encode them as Claude Code skills that guide the AI through your exact workflows. Here's how to build reusable skills that turn tribal knowledge into executable automation.

You've probably experienced this: a new team member asks how to run database migrations for the staging environment. You point them to a Confluence page that was last updated eighteen months ago. Half the commands are wrong. The environment variables have different names now. They spend two hours debugging something that should take five minutes.

Documentation rots. Tribal knowledge lives in Slack threads and people's heads. And every time someone leaves the team, you lose a little more institutional memory.

What if instead of writing documentation that goes stale, you could encode your team's workflows as executable instructions that an AI assistant follows step-by-step? That's the promise of Claude Code skills—reusable markdown files that turn your procedures into guided automation.

What Are Claude Code Skills?

Skill - A markdown file that provides Claude Code with domain-specific instructions for completing a task. Skills live in .claude/skills/ and are invoked by name (e.g., /deployment:deploy-staging).

Agentic Workflow - A multi-step process where Claude Code executes commands, reads output, makes decisions, and continues autonomously until the task is complete or requires human input.

CLAUDE.md - A project-level configuration file that provides persistent context to Claude Code about your codebase, conventions, and workflows.

Skills differ from simple prompts in a crucial way: they're procedural. A skill doesn't just describe what you want—it specifies the exact steps, decision points, error handling, and verification criteria for a workflow. Claude Code follows the skill like a runbook, but with the intelligence to adapt when things go wrong.

Invoking Skills

Skills are invoked by name using a slash prefix. If you have a skill at .claude/skills/deployment/deploy-staging.md with name: deployment:deploy-staging, you invoke it by typing:

text

/deployment:deploy-staging

/deployment:deploy-staging

Claude Code loads the skill file and follows its instructions. You can also invoke skills programmatically using the Skill tool, which is useful when one skill needs to call another.

Skills use colon notation for namespacing:

text

/database:clickhouse-migrate
/infrastructure:terraform-plan

/database:clickhouse-migrate
/infrastructure:terraform-plan

You can pass arguments after the skill name:

text

/deployment:deploy staging
/database:backup production --skip-verification

/deployment:deploy staging
/database:backup production --skip-verification

The skill receives these arguments and can reference them in its workflow. We'll cover argument handling later in this article.

Anatomy of a Skill File

Skills use YAML frontmatter for metadata and markdown for instructions:

name - The identifier used to invoke the skill. Use colon notation for namespacing (e.g., deployment:deploy-staging).

description - A one-line summary shown when Claude Code lists available skills. This helps users (and Claude) understand when to use the skill.

The frontmatter is minimal by design. Unlike configuration files that need extensive options, skills are primarily defined by their markdown content. The intelligence comes from how you write the instructions, not from metadata flags.

Here's a complete example:

.claude/skills/deployment/deploy-staging.mdmarkdown

---
name: deployment:deploy-staging
description: Deploy the current branch to staging environment with verification
---

# Deploy to Staging

**Announce at start:** "I'm using deploy-staging to deploy your changes."

## Prerequisites

Before deploying, verify:
- Current branch has no uncommitted changes
- All tests pass locally
- You're authenticated to AWS

## Workflow

### Step 1: Run Tests

```bash
npm test
```

If tests fail, stop and report the failures.

### Step 2: Build for Production

```bash
npm run build
```

### Step 3: Deploy

```bash
serverless deploy --stage staging
```

### Step 4: Verify

After deployment, run smoke tests:

```bash
curl https://staging.example.com/health
```

Report the deployment status and health check results.

---
name: deployment:deploy-staging
description: Deploy the current branch to staging environment with verification
---

# Deploy to Staging

**Announce at start:** "I'm using deploy-staging to deploy your changes."

## Prerequisites

Before deploying, verify:
- Current branch has no uncommitted changes
- All tests pass locally
- You're authenticated to AWS

## Workflow

### Step 1: Run Tests

```bash
npm test
```

If tests fail, stop and report the failures.

### Step 2: Build for Production

```bash
npm run build
```

### Step 3: Deploy

```bash
serverless deploy --stage staging
```

### Step 4: Verify

After deployment, run smoke tests:

```bash
curl https://staging.example.com/health
```

Report the deployment status and health check results.

The structure matters. Claude Code reads the skill and follows it sequentially, executing commands and checking results at each step. The more explicit you are about decision points and success criteria, the better the automation works.

What Execution Looks Like

When you invoke /deployment:deploy-staging, Claude Code shows its progress:

text

> /deployment:deploy-staging

I'm using deployment:deploy-staging to deploy your changes.

Checking prerequisites...

$ git status
On branch feature/user-auth
nothing to commit, working tree clean

✓ No uncommitted changes

$ npm test
PASS  src/auth.test.ts
PASS  src/api.test.ts
Test Suites: 2 passed, 2 total

✓ All tests pass

$ npm run build
Build completed successfully.

✓ Production build ready

$ serverless deploy --stage staging
Deploying to staging...
Service deployed to: https://staging.example.com

$ curl https://staging.example.com/health
{"status": "healthy", "version": "1.2.3"}

✓ Health check passed

**Deployment complete.** Your changes are live at https://staging.example.com

> /deployment:deploy-staging

I'm using deployment:deploy-staging to deploy your changes.

Checking prerequisites...

$ git status
On branch feature/user-auth
nothing to commit, working tree clean

✓ No uncommitted changes

$ npm test
PASS  src/auth.test.ts
PASS  src/api.test.ts
Test Suites: 2 passed, 2 total

✓ All tests pass

$ npm run build
Build completed successfully.

✓ Production build ready

$ serverless deploy --stage staging
Deploying to staging...
Service deployed to: https://staging.example.com

$ curl https://staging.example.com/health
{"status": "healthy", "version": "1.2.3"}

✓ Health check passed

**Deployment complete.** Your changes are live at https://staging.example.com

Each step shows the command, its output, and a status indicator. When something fails, Claude Code stops at that step and reports the issue before asking how to proceed.

Building Your First Infrastructure Skill

Let's build a real skill: running ClickHouse database migrations safely. This workflow has several steps that are easy to get wrong:

Verify you're targeting the correct environment
Run migrations in the proper order
Validate the schema after migration
Handle rollback if something fails

Here's how to encode this as a skill:

.claude/skills/database/clickhouse-migrate.mdmarkdown

---
name: database:clickhouse-migrate
description: Run ClickHouse migrations safely with validation and rollback support
---

# ClickHouse Migration

**Announce at start:** "I'm using database:clickhouse-migrate to run database migrations."

## Prerequisites

Verify environment before proceeding:

```bash
echo "Target: $CLICKHOUSE_HOST"
```

Ask user to confirm this is the correct environment.

## Workflow

### Step 1: Check Current State

```bash
alembic -c alembic-clickhouse.ini current
```

Report the current migration version.

### Step 2: Show Pending Migrations

```bash
alembic -c alembic-clickhouse.ini history --indicate-current
```

List all pending migrations and ask user to confirm.

### Step 3: Run Migration

```bash
alembic -c alembic-clickhouse.ini upgrade head
```

If migration fails:
1. Report the error
2. Ask user: "Should I attempt rollback?"
3. If yes, run: `alembic -c alembic-clickhouse.ini downgrade -1`

### Step 4: Validate Schema

```bash
clickhouse-client --query "DESCRIBE TABLE logs"
```

Compare output to expected schema. Report any discrepancies.

### Step 5: Verify Data Integrity

```bash
clickhouse-client --query "SELECT count() FROM logs"
```

Report row count to confirm data is accessible.

---
name: database:clickhouse-migrate
description: Run ClickHouse migrations safely with validation and rollback support
---

# ClickHouse Migration

**Announce at start:** "I'm using database:clickhouse-migrate to run database migrations."

## Prerequisites

Verify environment before proceeding:

```bash
echo "Target: $CLICKHOUSE_HOST"
```

Ask user to confirm this is the correct environment.

## Workflow

### Step 1: Check Current State

```bash
alembic -c alembic-clickhouse.ini current
```

Report the current migration version.

### Step 2: Show Pending Migrations

```bash
alembic -c alembic-clickhouse.ini history --indicate-current
```

List all pending migrations and ask user to confirm.

### Step 3: Run Migration

```bash
alembic -c alembic-clickhouse.ini upgrade head
```

If migration fails:
1. Report the error
2. Ask user: "Should I attempt rollback?"
3. If yes, run: `alembic -c alembic-clickhouse.ini downgrade -1`

### Step 4: Validate Schema

```bash
clickhouse-client --query "DESCRIBE TABLE logs"
```

Compare output to expected schema. Report any discrepancies.

### Step 5: Verify Data Integrity

```bash
clickhouse-client --query "SELECT count() FROM logs"
```

Report row count to confirm data is accessible.

Notice the decision points: "Ask user to confirm," "If migration fails." These create natural breakpoints where Claude Code pauses for human input rather than charging ahead blindly.

Skill Design Patterns

After building dozens of skills across infrastructure, deployment, and data pipelines, patterns emerge for what works well.

Pattern 1: Announce-Verify-Execute-Confirm

Every skill should follow this rhythm:

Announce - Tell the user what's about to happen
Verify - Check prerequisites and current state
Execute - Run the actual commands
Confirm - Validate the outcome and report results

This pattern prevents surprises and creates audit trails. When something goes wrong three weeks from now, you'll have Claude Code's output showing exactly what commands ran and what the state was at each step.

Pattern 2: Explicit Decision Trees

Don't leave decisions implicit. If a command might fail in different ways, enumerate them:

markdown

### Handle Terraform Plan Output

After running `terraform plan`:

**If no changes:** Report "No infrastructure changes needed" and exit.

**If only additions:** Show the additions and ask "Proceed with apply?"

**If modifications or deletions:**
1. Highlight the destructive changes in a warning
2. Ask user to type "confirm" to proceed
3. Do NOT proceed without explicit confirmation

### Handle Terraform Plan Output

After running `terraform plan`:

**If no changes:** Report "No infrastructure changes needed" and exit.

**If only additions:** Show the additions and ask "Proceed with apply?"

**If modifications or deletions:**
1. Highlight the destructive changes in a warning
2. Ask user to type "confirm" to proceed
3. Do NOT proceed without explicit confirmation

Pattern 3: Checkpoints for Long Workflows

For workflows that take more than a few minutes, add explicit checkpoints:

markdown

### Checkpoint: Database Backup Complete

✅ Backup created: `backup-2026-01-24-staging.sql`
✅ Backup verified: 2.3GB, checksum matches
✅ Uploaded to S3: s3://backups/staging/

**Continue with migration?** (The backup can be restored if needed)

### Checkpoint: Database Backup Complete

✅ Backup created: `backup-2026-01-24-staging.sql`
✅ Backup verified: 2.3GB, checksum matches
✅ Uploaded to S3: s3://backups/staging/

**Continue with migration?** (The backup can be restored if needed)

This prevents the frustration of a workflow failing 80% through with no way to resume.

Pattern 4: Tool Integration

Skills can invoke other skills for modularity:

markdown

### Step 3: Create Backup

Invoke **backup-database** skill before proceeding with destructive operations.

### Step 3: Create Backup

Invoke **backup-database** skill before proceeding with destructive operations.

This lets you compose complex workflows from simpler, tested components.

Pattern 5: Handling Arguments

Skills can accept arguments passed after the skill name. Reference them in your skill by describing what Claude Code should extract:

.claude/skills/deployment/deploy.mdmarkdown

---
name: deployment:deploy
description: Deploy to a specified environment (staging, production)
---

# Deploy

**Arguments:** This skill expects an environment name (staging or production).

If no environment is provided, ask the user which environment to deploy to.

## Validation

Valid environments: staging, production

If the provided environment is not in this list, report the error and exit.

## Workflow

### For Staging

```bash
serverless deploy --stage staging
```

### For Production

Production deployments require additional verification:

1. Confirm all staging tests have passed
2. Ask user to type "deploy production" to confirm
3. Run: `serverless deploy --stage production`

---
name: deployment:deploy
description: Deploy to a specified environment (staging, production)
---

# Deploy

**Arguments:** This skill expects an environment name (staging or production).

If no environment is provided, ask the user which environment to deploy to.

## Validation

Valid environments: staging, production

If the provided environment is not in this list, report the error and exit.

## Workflow

### For Staging

```bash
serverless deploy --stage staging
```

### For Production

Production deployments require additional verification:

1. Confirm all staging tests have passed
2. Ask user to type "deploy production" to confirm
3. Run: `serverless deploy --stage production`

The skill doesn't parse arguments programmatically—Claude Code understands natural language and extracts the environment from whatever the user provides: /deployment:deploy staging, /deployment:deploy to production, or /deployment:deploy --env=staging all work.

Organizing Skills for Teams

As your skill library grows, organization becomes important. Skills use colon notation for namespacing:

text

.claude/
├── skills/
│   ├── infrastructure/
│   │   ├── terraform-plan.md      # name: infrastructure:terraform-plan
│   │   ├── terraform-apply.md     # name: infrastructure:terraform-apply
│   │   └── rotate-secrets.md      # name: infrastructure:rotate-secrets
│   ├── database/
│   │   ├── clickhouse-migrate.md  # name: database:clickhouse-migrate
│   │   ├── postgres-backup.md     # name: database:postgres-backup
│   │   └── restore-from-backup.md # name: database:restore-from-backup
│   └── deployment/
│       ├── deploy-staging.md      # name: deployment:deploy-staging
│       └── deploy-production.md   # name: deployment:deploy-production
└── CLAUDE.md

.claude/
├── skills/
│   ├── infrastructure/
│   │   ├── terraform-plan.md      # name: infrastructure:terraform-plan
│   │   ├── terraform-apply.md     # name: infrastructure:terraform-apply
│   │   └── rotate-secrets.md      # name: infrastructure:rotate-secrets
│   ├── database/
│   │   ├── clickhouse-migrate.md  # name: database:clickhouse-migrate
│   │   ├── postgres-backup.md     # name: database:postgres-backup
│   │   └── restore-from-backup.md # name: database:restore-from-backup
│   └── deployment/
│       ├── deploy-staging.md      # name: deployment:deploy-staging
│       └── deploy-production.md   # name: deployment:deploy-production
└── CLAUDE.md

The filesystem structure is for your organization—Claude Code uses the name field in frontmatter for invocation. To run a database migration:

text

/database:clickhouse-migrate

/database:clickhouse-migrate

The colon notation groups related skills while keeping invocation names flat. You don't need deep directory nesting—the namespace prefix provides the organization. Keep your CLAUDE.md file focused on project-wide context: coding standards, architectural decisions, and common patterns. Skills handle specific workflows.

Testing and Iteration

Skills benefit from the same iterative refinement as code. Here's how to debug when things go wrong.

Watch the Execution

When Claude Code runs a skill, it shows each step in real-time. Watch for:

Skipped steps - Claude Code interpreted your instructions differently than intended
Wrong commands - Ambiguous phrasing led to incorrect command construction
Missing confirmations - Decision points that should pause but didn't

Common Failure Modes

"Claude Code ran the wrong command"

Your instruction was ambiguous. Instead of:

markdown

Run the tests.

Run the tests.

Be explicit:

markdown

```bash
npm test
```

Run this exact command. Do not substitute with yarn, pnpm, or other alternatives.

```bash
npm test
```

Run this exact command. Do not substitute with yarn, pnpm, or other alternatives.

"Claude Code didn't stop when it should have"

Decision points need explicit stop instructions:

markdown

If the health check fails:
1. Report the failure with the HTTP status code
2. **STOP and ask the user** whether to proceed or rollback
3. Do NOT continue to the next step without explicit confirmation

If the health check fails:
1. Report the failure with the HTTP status code
2. **STOP and ask the user** whether to proceed or rollback
3. Do NOT continue to the next step without explicit confirmation

"Claude Code asked too many questions"

You added too many confirmation points. Reserve confirmations for destructive operations or genuine decision points—not every step.

Test Systematically

Run your skill in these scenarios:

Happy path - Everything works as expected
Missing prerequisites - What happens when AWS credentials are missing?
Command failures - Does it handle non-zero exit codes correctly?
Ambiguous input - What if the user provides unexpected arguments?

A well-tuned skill feels like pair programming with someone who knows your infrastructure. A poorly-written skill feels like giving directions to a city you've never visited—technically correct words that lead somewhere unexpected. The difference is specificity.

When Skills Beat Traditional Automation

Skills aren't a replacement for proper CI/CD pipelines or infrastructure-as-code. They excel in specific scenarios:

Infrequent operations - Tasks you run monthly don't justify a full automation pipeline
Judgment-required workflows - Operations where human review is needed mid-process
Debugging and investigation - Following a systematic troubleshooting process
Onboarding acceleration - New team members can execute complex workflows immediately

The meta-benefit is that writing a skill forces you to articulate your workflow precisely. That precision is valuable whether Claude Code runs it or a human follows it manually.

Quick Reference

Skill File Location

text

.claude/skills/<namespace>/<skill-name>.md

.claude/skills/<namespace>/<skill-name>.md

Frontmatter

yaml

---
name: namespace:skill-name
description: One-line description for skill discovery
---

---
name: namespace:skill-name
description: One-line description for skill discovery
---

Invocation

text

/namespace:skill-name
/namespace:skill-name argument1 argument2

/namespace:skill-name
/namespace:skill-name argument1 argument2

Essential Patterns

Announce - Start with "I'm using [skill] to [action]"
Explicit commands - Use fenced code blocks for exact commands
Decision points - Say "STOP and ask" for destructive operations
Validation - End with verification commands and success criteria

Common Phrases

Intent	Phrase
Require confirmation	"Ask user to confirm before proceeding"
Hard stop	"STOP and ask the user"
Conditional	"If [condition], then [action]. Otherwise, [alternative]."
Report	"Report [metric] to the user"
Invoke another skill	"Invoke namespace:other-skill before proceeding"

Getting Started

If you're new to Claude Code skills, start small:

Pick one workflow your team runs weekly
Write out the steps as you'd explain them to a new team member
Convert those steps into a skill file with explicit commands
Run it with Claude Code and note where it deviates from your intent
Refine until it works reliably

The first skill takes an hour to get right. The tenth skill takes ten minutes. The investment compounds as your library grows and your team's tribal knowledge becomes executable infrastructure.

Your documentation doesn't have to rot. Your procedures don't have to live in people's heads. With Claude Code skills, you can encode your team's expertise in a format that stays current and executes reliably—because Claude Code follows the instructions every time, exactly as written.