## Table of Contents
- Overall Architecture of Codex Code Review Automation
- GitHub Actions Workflow YAML Configuration in Detail
- Implementing the Codex API Call Script
- Posting Codex Code Review Automation Results as PR Comments
- Prompt Design and Review Quality Improvement
- Token Cost Management and Rate Limit Handling
- Production Considerations and Future Directions
```yaml
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  codex-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Get PR diff
        run: git diff origin/${{ github.base_ref }}...HEAD > diff.txt
      - name: Run Codex review
        run: python review.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
This single workflow forms the backbone of Codex code review automation. Every time a PR is opened or a commit is pushed, it extracts the diff, calls the OpenAI API to generate review comments, and posts them directly to the PR via the GitHub API. Basic code quality checks are already done before a human reviewer even looks at the code.
Manual code reviews often can’t keep up with the pace of incoming PRs. In microservice architectures where dozens of PRs land per day, reviewer bottlenecks slow down the entire deployment cycle. Integrating the Codex API with GitHub Actions can significantly reduce this bottleneck.
## Overall Architecture of Codex Code Review Automation
The entire pipeline breaks down into three stages: PR event detection, diff analysis and API call, and result posting.
```
 PR opened/synchronize
           │
           ▼
 ┌───────────────────┐
 │  GitHub Actions   │
 │ Workflow Trigger  │
 └─────────┬─────────┘
           │
           ▼
 ┌───────────────────┐     ┌───────────────┐
 │ git diff extract  │────▶│  OpenAI API   │
 │ (file filtering)  │     │ (Codex model) │
 └───────────────────┘     └───────┬───────┘
                                   │
                                   ▼
                          ┌───────────────────┐
                          │  GitHub REST API  │
                          │ PR Review Comment │
                          └───────────────────┘
```
GitHub Actions’ pull_request event supports the opened, synchronize, and reopened types. Only opened and synchronize need to be triggered. Including reopened leads to duplicate comments on PRs that have already been reviewed.
synchronize fires when new commits are pushed to an existing PR. This includes force-pushes, so the diff range may change — something to keep in mind.
The key is controlling the size of the diff. Sending the entire diff directly to the API risks exceeding the token limit or causing costs to spike. Preprocessing steps like filtering by file extension, capping the number of changed lines, and excluding binary files are essential.
## GitHub Actions Workflow YAML Configuration in Detail
The workflow file lives in the .github/workflows/ directory. The filename is flexible, but an explicit name like codex-review.yml makes management easier.
### fetch-depth Setting in the Checkout Step
Setting fetch-depth: 0 in actions/checkout fetches the full git history. If this value is left at the default (1), git diff can’t compute the difference against the base branch. In a shallow clone, the comparison target commit simply doesn’t exist.
```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 0
```
However, in monorepos with tens of thousands of commits, fetch-depth: 0 significantly increases checkout time. In such cases, limiting it to something like fetch-depth: 50 and explicitly fetching the base branch is a better approach.
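That alternative can be sketched as follows (the depth of 50 is an illustrative value; pick one larger than your typical PR's commit count):

```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 50
- name: Fetch base branch
  run: git fetch origin ${{ github.base_ref }} --depth=50
```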
### Diff Extraction and File Filtering
Using the raw diff as-is includes unnecessary changes like package-lock.json, .min.js, and image binaries. Extension-based filtering is needed.
```shell
git diff origin/$BASE_BRANCH...HEAD \
  -- '*.py' '*.js' '*.ts' '*.go' '*.java' \
  ':!*lock*' ':!*.min.*' \
  > diff.txt
```
This command includes only Python, JavaScript, TypeScript, Go, and Java files while excluding lock files and minified files. Adjust the extension list based on the project.
Adding a guard that skips the API call when the diff file exceeds a certain size (e.g., 100KB) is a safe practice. This prevents unnecessary costs on large refactoring PRs.
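A minimal sketch of that guard inside the review script (the `should_review` helper and the 100KB constant are illustrative, not part of any library):

```python
import os

MAX_DIFF_BYTES = 100 * 1024  # illustrative 100KB cap; tune per project

def should_review(path="diff.txt", limit=MAX_DIFF_BYTES):
    """Return False when the diff is missing, empty, or too large to send."""
    try:
        return 0 < os.path.getsize(path) <= limit
    except OSError:
        # diff file doesn't exist -> nothing to review
        return False
```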
## Implementing the Codex API Call Script
The core piece is a Python script that reads the diff file, sends it to the OpenAI API, and parses the response. It uses OpenAI’s Chat Completions API.
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def read_diff(path="diff.txt"):
    with open(path, "r") as f:
        return f.read()

def request_review(diff_content):
    response = client.chat.completions.create(
        # reasoning models such as o3-mini reject the temperature
        # parameter, so use a model that supports it
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": diff_content}
        ],
        temperature=0.2
    )
    return response.choices[0].message.content
```
Setting temperature low ensures consistency in code reviews. A high temperature generates different feedback for the same code each time, reducing review reliability. A range of 0.1–0.3 works well for code analysis.
The system prompt determines review quality. Without specific instructions, only generic comments are produced.
```python
SYSTEM_PROMPT = """You are a senior code reviewer.
Analyze the git diff and provide:
1. Bug risks (null checks, race conditions, resource leaks)
2. Security issues (SQL injection, XSS, hardcoded secrets)
3. Performance concerns (N+1 queries, unnecessary allocations)

Output format: JSON array of objects with keys:
- file: filename
- line: line number in diff
- severity: critical|warning|info
- comment: explanation
"""
```
Specifying JSON output makes downstream parsing stable. Accepting free-form text requires regex parsing and becomes fragile against model response variations.
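Even with a JSON instruction, models occasionally wrap the array in a markdown fence, so a tolerant parser keeps the pipeline stable. A sketch (the `parse_review` name is illustrative):

```python
import json

def parse_review(raw):
    """Parse the model's JSON reply, tolerating a stray markdown fence.

    Returns [] instead of raising, so one malformed response
    doesn't fail the whole workflow run.
    """
    text = raw.strip()
    if text.startswith("```"):
        # drop the opening fence line and a trailing fence, if present
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit("```", 1)[0]
    try:
        items = json.loads(text)
    except json.JSONDecodeError:
        return []
    return items if isinstance(items, list) else []
```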
## Posting Codex Code Review Automation Results as PR Comments
After parsing the API response, review comments are posted to the PR via the GitHub REST API. There are two approaches.
| Approach | Endpoint | Characteristics | Best For |
|---|---|---|---|
| PR Comment | `issues/{number}/comments` | Single comment on the entire PR | Summary reviews |
| Review Comment | `pulls/{number}/reviews` | Inline comments per file and line | Detailed reviews |
Inline comments feel much closer to a real reviewer experience. However, implementation complexity is higher — the diff’s line numbers must be converted into the position values that the GitHub API requires.
### PR-Level Comment Approach
This is the simpler approach. Review results are formatted as a markdown table and posted as a single comment.
```python
import os

import requests

def post_comment(review_json, pr_number):
    token = os.environ["GITHUB_TOKEN"]
    repo = os.environ["GITHUB_REPOSITORY"]  # "owner/name"
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    body = format_review_as_markdown(review_json)
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json"
    }
    resp = requests.post(url, json={"body": body}, headers=headers, timeout=30)
    resp.raise_for_status()  # fail the job loudly if posting failed
```
### Inline Review Comment Approach
To post per-file, per-line comments, use the pulls/{number}/reviews endpoint. Each comment in the comments array requires path, position, and body. Here, position refers to the relative line position within the diff, so separate diff parsing logic is needed.
position is not the absolute line number of the file — it’s the relative position within the diff hunk. Incorrect calculation results in comments appearing on the wrong line or the API returning a 422 error. Refer to the GitHub REST API Pull Request Review documentation for the exact calculation method.
## Prompt Design and Review Quality Improvement
The structure of the system prompt determines review quality. A generic "review this code" instruction won’t produce meaningful results.
Three principles in prompt design determine review accuracy. First, specify the review scope — whether to prioritize security, performance, or bug risks. Second, fix the output format as JSON to ensure parsing stability. Third, provide project conventions as context. Including lint rules from .eslintrc or pyproject.toml in the system prompt produces feedback aligned with the project’s style.
```python
def build_system_prompt(lint_config=None):
    base = "You are a code reviewer. Focus on bugs and security."
    if lint_config:
        base += f"\nProject conventions:\n{lint_config}"
    base += "\nOutput: JSON array [{file, line, severity, comment}]"
    return base
```
If false positives pile up in review results, developers start ignoring the automated reviews. Fine-grained severity classification helps, and wrapping info-level comments in collapsible <details> tags is a practical approach.
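One possible shape for the `format_review_as_markdown` helper referenced earlier, folding info-level items into a collapsible block (the layout and icon choices are illustrative):

```python
SEVERITY_ICONS = {"critical": "🔴", "warning": "🟡"}

def format_review_as_markdown(items):
    """Render review items; hide info-level notes behind <details>."""
    major = [i for i in items if i.get("severity") != "info"]
    minor = [i for i in items if i.get("severity") == "info"]
    lines = ["### 🤖 [Codex Auto-Review]", ""]
    for i in major:
        icon = SEVERITY_ICONS.get(i.get("severity"), "")
        lines.append(f"- {icon} `{i['file']}:{i['line']}`: {i['comment']}")
    if minor:
        lines += ["", "<details><summary>Minor notes</summary>", ""]
        for i in minor:
            lines.append(f"- `{i['file']}:{i['line']}`: {i['comment']}")
        lines += ["", "</details>"]
    return "\n".join(lines)
```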
System prompts deserve the same version control treatment as code. Separating the prompt into a dedicated file (e.g., prompts/review-v1.txt) and tracking it with git makes it possible to trace review quality changes at the commit level.
## Token Cost Management and Rate Limit Handling
Cost is the first barrier in Codex code review automation. Calling the API on every PR accumulates token consumption.
### Reducing Token Consumption
Instead of sending the entire diff, splitting it by changed functions saves tokens. Using a parser like tree-sitter to extract function boundaries and sending only the functions containing changes is an effective approach.
```python
def chunk_diff_by_function(diff_text, max_tokens=3000):
    chunks = split_by_file(diff_text)
    result = []
    for chunk in chunks:
        if estimate_tokens(chunk) > max_tokens:
            sub_chunks = split_by_hunk(chunk)
            result.extend(sub_chunks)
        else:
            result.append(chunk)
    return result
```
Token estimation doesn’t need to be precise. Approximating at roughly 1 token ≈ 4 characters for English and 1 token ≈ 1–2 characters for Korean is sufficient. For exact counts, use the tiktoken library.
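The rough heuristic above can be written as a one-screen helper (the ~1.5 chars/token figure for non-ASCII text is an approximation, not a tokenizer guarantee):

```python
def estimate_tokens(text):
    """Rough token estimate: ~4 chars/token for ASCII, ~1.5 for non-ASCII.

    Good enough for chunking decisions; use tiktoken when exact
    counts matter.
    """
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    other_chars = len(text) - ascii_chars
    return ascii_chars // 4 + int(other_chars / 1.5)
```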
### Rate Limit Handling
The OpenAI API has per-minute request (RPM) and per-minute token (TPM) limits. Large PRs or simultaneous PRs can hit rate limits.
```python
import time

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if "rate_limit" in str(e).lower():
                wait = 2 ** attempt * 10  # 10s, 20s, 40s
                time.sleep(wait)
            else:
                raise
    raise RuntimeError("Max retries exceeded")
```
Retry with exponential backoff, but cap at 3 attempts. The execution time limit of GitHub Actions workflows is also a factor. The default timeout is 6 hours, but if review automation takes tens of minutes, it delays the entire CI pipeline.
| Strategy | Token Savings | Implementation Complexity | Review Quality Impact |
|---|---|---|---|
| Send entire diff | None | Low | Maximum context |
| Split by file | Medium | Medium | Misses cross-file relationships |
| Split by function | High | High | Analyzes only within functions |
| Changed lines ±10 only | Very high | Low | Insufficient context |
Choose the appropriate strategy based on project size and PR frequency. For small teams, sending the entire diff is sufficient. For teams with 50+ PRs per day, function-level splitting is cost-effective.
## Production Considerations and Future Directions
There are three things to consider when putting Codex code review automation into a production pipeline.
First, API key management. Storing OPENAI_API_KEY in GitHub Secrets is the baseline, but a key rotation schedule is also needed. Using organization-level secrets allows sharing a single key across multiple repos, but makes cost tracking difficult. Issuing per-repo keys is better for cost visibility.
Second, indicating review result trustworthiness. Without clearly marking comments as AI-generated, junior developers may mistake AI comments for a senior reviewer’s feedback. Prefixing comments with something like 🤖 [Codex Auto-Review] is standard practice.
Third, selective execution. Running automated reviews on every PR creates not just cost issues but noise as well. Code review is unnecessary for README edits or dependency update PRs. Adding a label-based trigger (only run on PRs with the needs-ai-review label) or a changed-file-count filter addresses this.
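The label gate can be expressed as a job-level `if` condition (using the needs-ai-review label name from above):

```yaml
jobs:
  codex-review:
    # run only when a maintainer has added the needs-ai-review label
    if: contains(github.event.pull_request.labels.*.name, 'needs-ai-review')
    runs-on: ubuntu-latest
```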
The workflow's paths filter can restrict execution to changes in specific directories. Setting paths: ['src/**', 'lib/**'] to trigger only on source code changes reduces unnecessary runs. See the GitHub Actions workflow trigger documentation for detailed syntax.
One extension direction is automated suggested changes. Using GitHub’s suggestion syntax, reviewers can apply fixes with a single "Apply suggestion" button click. Design the prompt so the Codex API response includes corrected code, then format it as a suggestion markdown block.
```suggestion
def calculate_total(items):
    if not items:
        return 0
    return sum(item.price for item in items)
```
Inserting this format into PR comments lets reviewers fix code with a single click.
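Composing that comment body is a one-liner; a sketch (the `as_suggestion` helper name is illustrative):

```python
FENCE = "`" * 3  # build the backtick fence without clashing with markdown

def as_suggestion(comment, fixed_code):
    """Wrap a model-proposed fix in GitHub's suggestion syntax."""
    return f"{comment}\n\n{FENCE}suggestion\n{fixed_code}\n{FENCE}"
```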
The next step is building a review results dashboard. Aggregating which categories (security, performance, bugs) produce the most findings and which files repeatedly surface the same issues reveals team-level code quality trends. Monitoring token consumption patterns on the OpenAI API usage management page is also essential for cost optimization. The ultimate value of Codex code review automation isn't review speed — it's serving as a safety net that catches repetitive mistakes early.