Automate Code Reviews with GitHub Actions and Codex in 5 Steps

on:
  pull_request:
    types: [opened, synchronize]
jobs:
  codex-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Get PR diff
        run: git diff origin/${{ github.base_ref }}...HEAD > diff.txt
      - name: Run Codex review
        run: python review.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

This single workflow forms the backbone of Codex code review automation. Every time a PR is opened or a commit is pushed, it extracts the diff, calls the OpenAI API to generate review comments, and posts them directly to the PR via the GitHub API. Basic code quality checks are already done before a human reviewer even looks at the code.

Manual code reviews often can’t keep up with the pace of incoming PRs. In microservice architectures where dozens of PRs land per day, reviewer bottlenecks slow down the entire deployment cycle. Integrating the Codex API with GitHub Actions can significantly reduce this bottleneck.

Overall Architecture of Codex Code Review Automation

The entire pipeline breaks down into three stages: PR event detection, diff analysis and API call, and result posting.

PR opened/synchronize
         │
         ▼
┌──────────────────┐
│ GitHub Actions   │
│ Workflow Trigger │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐     ┌───────────────┐
│ git diff extract │────▶│ OpenAI API    │
│ (file filtering) │     │ (Codex model) │
└──────────────────┘     └───────┬───────┘
                                 │
                                 ▼
                      ┌───────────────────┐
                      │ GitHub REST API   │
                      │ PR Review Comment │
                      └───────────────────┘

GitHub Actions’ pull_request event supports the opened, synchronize, and reopened types, but triggering on only opened and synchronize is enough: including reopened produces duplicate comments on PRs that have already been reviewed.

Choosing PR Event Types
synchronize fires when new commits are pushed to an existing PR. This includes force-pushes, so the diff range may change — something to keep in mind.

The key is controlling the size of the diff. Sending the entire diff directly to the API risks exceeding the token limit or causing costs to spike. Preprocessing steps like filtering by file extension, capping the number of changed lines, and excluding binary files are essential.

GitHub Actions Workflow YAML Configuration in Detail

The workflow file lives in the .github/workflows/ directory. The filename is flexible, but an explicit name like codex-review.yml makes management easier.

fetch-depth Setting in the Checkout Step

Setting fetch-depth: 0 in actions/checkout fetches the full git history. If this value is left at the default (1), git diff can’t compute the difference against the base branch. In a shallow clone, the comparison target commit simply doesn’t exist.

- uses: actions/checkout@v4
  with:
    fetch-depth: 0

However, in monorepos with tens of thousands of commits, fetch-depth: 0 significantly increases checkout time. In such cases, limiting it to something like fetch-depth: 50 and explicitly fetching the base branch is a better approach.
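A sketch of that shallow-fetch variant (the depth of 50 is an arbitrary cutoff, and the explicit refspec makes sure origin/&lt;base&gt; exists even though actions/checkout narrows the fetch refspec; the diff step will fail if the merge base is older than the fetched window):

```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 50
- name: Fetch base branch
  run: git fetch --depth=50 origin ${{ github.base_ref }}:refs/remotes/origin/${{ github.base_ref }}
```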

Diff Extraction and File Filtering

Using the raw diff as-is includes unnecessary changes like package-lock.json, .min.js, and image binaries. Extension-based filtering is needed.

git diff origin/$BASE_BRANCH...HEAD \
  -- '*.py' '*.js' '*.ts' '*.go' '*.java' \
  ':!*lock*' ':!*.min.*' \
  > diff.txt

This command includes only Python, JavaScript, TypeScript, Go, and Java files while excluding lock files and minified files. Adjust the extension list based on the project.

Adding a Diff Size Limit
Adding a guard that skips the API call when the diff file exceeds a certain size (e.g., 100KB) is a safe practice. This prevents unnecessary costs on large refactoring PRs.
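A minimal sketch of such a guard, assuming the diff was written to diff.txt as in the workflow above (the 100KB cap mirrors the example figure; the helper name is hypothetical):

```python
import os

MAX_DIFF_BYTES = 100 * 1024  # example cap from the text; tune per project

def diff_within_budget(path="diff.txt", max_bytes=MAX_DIFF_BYTES):
    """Return True when the diff file is small enough to send to the API."""
    return os.path.getsize(path) <= max_bytes
```

review.py can call this first and exit 0 when it returns False, so oversized PRs skip the review without failing CI.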

Implementing the Codex API Call Script

The core piece is a Python script that reads the diff file, sends it to the OpenAI API, and parses the response. It uses OpenAI’s Chat Completions API.

import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def read_diff(path="diff.txt"):
    with open(path, "r") as f:
        return f.read()

def request_review(diff_content):
    # Use a model that accepts the temperature parameter;
    # o-series reasoning models (e.g. o3-mini) reject it.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": diff_content}
        ],
        temperature=0.2
    )
    return response.choices[0].message.content

Setting temperature low ensures consistency in code reviews. A high temperature generates different feedback for the same code each time, reducing review reliability. A range of 0.1–0.3 works well for code analysis.

The system prompt determines review quality. Without specific instructions, only generic comments are produced.

SYSTEM_PROMPT = """You are a senior code reviewer.
Analyze the git diff and provide:
1. Bug risks (null checks, race conditions, resource leaks)
2. Security issues (SQL injection, XSS, hardcoded secrets)
3. Performance concerns (N+1 queries, unnecessary allocations)

Output format: JSON array of objects with keys:
- file: filename
- line: line number in diff
- severity: critical|warning|info
- comment: explanation in Korean
"""

Specifying JSON output makes downstream parsing stable. Accepting free-form text requires regex parsing and becomes fragile against model response variations.
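Even with JSON requested, a fence-stripping fallback keeps parsing robust, since models sometimes wrap output in ```json fences anyway. This parse_review_json helper is a hypothetical sketch (not part of the original script) that returns an empty list on failure rather than crashing the job:

```python
import json
import re

def parse_review_json(raw):
    """Parse the model's JSON array, tolerating a surrounding markdown fence."""
    text = raw.strip()
    # Drop a leading/trailing code fence if the model added one.
    text = re.sub(r"^```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []
```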

Posting Codex Code Review Automation Results as PR Comments

After parsing the API response, review comments are posted to the PR via the GitHub REST API. There are two approaches.

| Approach | Endpoint | Characteristics | Best For |
| --- | --- | --- | --- |
| PR Comment | `issues/{number}/comments` | Single comment on the entire PR | Summary reviews |
| Review Comment | `pulls/{number}/reviews` | Inline comments per file and line | Detailed reviews |

Inline comments feel much closer to a real reviewer experience. However, implementation complexity is higher — the diff’s line numbers must be converted into the position values that the GitHub API requires.

PR-Level Comment Approach

This is the simpler approach. Review results are formatted as a markdown table and posted as a single comment.

import os
import requests

def post_comment(review_json, pr_number):
    token = os.environ["GITHUB_TOKEN"]
    repo = os.environ["GITHUB_REPOSITORY"]
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"

    body = format_review_as_markdown(review_json)
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json"
    }
    resp = requests.post(url, json={"body": body}, headers=headers)
    resp.raise_for_status()  # surface auth/permission failures in the job log

Inline Review Comment Approach

To post per-file, per-line comments, use the pulls/{number}/reviews endpoint. Each comment in the comments array requires path, position, and body. Here, position refers to the relative line position within the diff, so separate diff parsing logic is needed.

Watch Out When Calculating position
position is not the absolute line number of the file — it’s the relative position within the diff hunk. Incorrect calculation results in comments appearing on the wrong line or the API returning a 422 error. Refer to the GitHub REST API Pull Request Review documentation for the exact calculation method.

Prompt Design and Review Quality Improvement

The structure of the system prompt determines review quality. A generic "review this code" instruction won’t produce meaningful results.

Three principles in prompt design determine review accuracy. First, specify the review scope — whether to prioritize security, performance, or bug risks. Second, fix the output format as JSON to ensure parsing stability. Third, provide project conventions as context. Including lint rules from .eslintrc or pyproject.toml in the system prompt produces feedback aligned with the project’s style.

def build_system_prompt(lint_config=None):
    base = "You are a code reviewer. Focus on bugs and security."
    if lint_config:
        base += f"\nProject conventions:\n{lint_config}"
    base += "\nOutput: JSON array [{file, line, severity, comment}]"
    return base

If false positives pile up in review results, developers start ignoring the automated reviews. Fine-grained severity classification helps, and wrapping info-level comments in collapsible <details> tags is a practical approach.
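The format_review_as_markdown helper used earlier might look like the sketch below, which collapses info-level findings inside <details> so they don't drown out critical and warning items (the icon mapping and "Minor notes" label are arbitrary choices):

```python
SEVERITY_ICONS = {"critical": "🔴", "warning": "🟡"}

def format_review_as_markdown(review_items):
    """Render parsed review findings as a single PR comment body."""
    major = [r for r in review_items if r.get("severity") != "info"]
    minor = [r for r in review_items if r.get("severity") == "info"]
    lines = ["🤖 [Codex Auto-Review]", ""]
    for r in major:
        icon = SEVERITY_ICONS.get(r.get("severity"), "")
        lines.append(f"- {icon} `{r['file']}:{r['line']}` {r['comment']}")
    if minor:
        # Collapse low-severity notes so they stay available but unobtrusive.
        lines += ["", "<details><summary>Minor notes</summary>", ""]
        for r in minor:
            lines.append(f"- ℹ️ `{r['file']}:{r['line']}` {r['comment']}")
        lines += ["", "</details>"]
    return "\n".join(lines)
```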

Version-Controlling Prompts
System prompts deserve the same version control treatment as code. Separating the prompt into a dedicated file (e.g., prompts/review-v1.txt) and tracking it with git makes it possible to trace review quality changes at the commit level.

Token Cost Management and Rate Limit Handling

Cost is the first barrier in Codex code review automation. Calling the API on every PR accumulates token consumption.

Reducing Token Consumption

Instead of sending the entire diff, splitting it by changed functions saves tokens. Using a parser like tree-sitter to extract function boundaries and sending only the functions containing changes is an effective approach.

def chunk_diff_by_function(diff_text, max_tokens=3000):
    """Split a diff into API-sized chunks: per file, then per hunk.

    Hunk boundaries approximate function boundaries well enough when a
    real parser (e.g. tree-sitter) isn't wired in.
    """
    chunks = split_by_file(diff_text)
    result = []
    for chunk in chunks:
        if estimate_tokens(chunk) > max_tokens:
            # A file-level chunk is still too big: fall back to hunks.
            result.extend(split_by_hunk(chunk))
        else:
            result.append(chunk)
    return result

Token estimation doesn’t need to be precise. Approximating at roughly 1 token ≈ 4 characters for English and 1 token ≈ 1–2 characters for Korean is sufficient. For exact counts, use the tiktoken library.
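A heuristic consistent with those ratios might look like this (the 1.5 chars/token figure for non-ASCII text splits the stated 1–2 range and is an assumption; swap in tiktoken when exact counts matter):

```python
def estimate_tokens(text):
    """Cheap token estimate: ~4 chars/token for ASCII, ~1.5 otherwise."""
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    other_chars = len(text) - ascii_chars
    return ascii_chars // 4 + round(other_chars / 1.5)
```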

Rate Limit Handling

The OpenAI API has per-minute request (RPM) and per-minute token (TPM) limits. Large PRs or simultaneous PRs can hit rate limits.

import time

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            # Only retry on rate-limit errors; re-raise everything else.
            if "rate_limit" not in str(e).lower():
                raise
            if attempt == max_retries - 1:
                break  # no retry left, so don't sleep
            time.sleep(2 ** attempt * 10)  # exponential backoff: 10s, 20s
    raise RuntimeError("Max retries exceeded")

Retry with exponential backoff, but cap at 3 attempts. The execution time limit of GitHub Actions workflows is also a factor. The default timeout is 6 hours, but if review automation takes tens of minutes, it delays the entire CI pipeline.
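A job-level timeout keeps a stalled review from holding up the pipeline; the 10-minute figure below is an arbitrary example, not a recommendation from GitHub:

```yaml
jobs:
  codex-review:
    runs-on: ubuntu-latest
    timeout-minutes: 10  # fail the job fast instead of blocking CI
```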

| Strategy | Token Savings | Implementation Complexity | Review Quality Impact |
| --- | --- | --- | --- |
| Send entire diff | None | Low | Maximum context |
| Split by file | Medium | Medium | Misses cross-file relationships |
| Split by function | High | High | Analyzes only within functions |
| Changed lines ±10 only | Very high | Low | Insufficient context |

Choose the appropriate strategy based on project size and PR frequency. For small teams, sending the entire diff is sufficient. For teams with 50+ PRs per day, function-level splitting is cost-effective.

Production Considerations and Future Directions

Three things to consider when putting Codex code review automation into a production pipeline.

First, API key management. Storing OPENAI_API_KEY in GitHub Secrets is the baseline, but a key rotation schedule is also needed. Using organization-level secrets allows sharing a single key across multiple repos, but makes cost tracking difficult. Issuing per-repo keys is better for cost visibility.

Second, indicating review result trustworthiness. Without clearly marking comments as AI-generated, junior developers may mistake AI comments for a senior reviewer’s feedback. Prefixing comments with something like 🤖 [Codex Auto-Review] is standard practice.

Third, selective execution. Running automated reviews on every PR creates not just cost issues but noise as well. Code review is unnecessary for README edits or dependency update PRs. Adding a label-based trigger (only run on PRs with the needs-ai-review label) or a changed-file-count filter addresses this.

Using GitHub Actions Path Filters
The workflow’s paths filter can restrict execution to changes in specific directories. Setting paths: ['src/**', 'lib/**'] to trigger only on source code changes reduces unnecessary runs. See the GitHub Actions workflow trigger documentation for detailed syntax.
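Combining the label gate and path filter might look like this sketch (the needs-ai-review label name comes from the text; labeled is added to the event types so that applying the label retriggers the run):

```yaml
on:
  pull_request:
    types: [opened, synchronize, labeled]
    paths:
      - 'src/**'
      - 'lib/**'

jobs:
  codex-review:
    if: contains(github.event.pull_request.labels.*.name, 'needs-ai-review')
    runs-on: ubuntu-latest
```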

One extension direction is automated suggested changes. Using GitHub’s suggestion syntax, reviewers can apply fixes with a single "Apply suggestion" button click. Design the prompt so the Codex API response includes corrected code, then format it as a suggestion markdown block.

```suggestion
def calculate_total(items):
    if not items:
        return 0
    return sum(item.price for item in items)
```

Inserting this format into PR comments lets reviewers fix code with a single click.
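A tiny helper for that formatting step (hypothetical, not part of the original script):

```python
def format_suggestion(fixed_code):
    """Wrap corrected code in GitHub's suggestion block syntax.

    Posted as an inline review comment on the lines to replace, this
    renders with an "Apply suggestion" button in the PR UI.
    """
    return "```suggestion\n" + fixed_code.rstrip("\n") + "\n```"
```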

The next step is building a review results dashboard. Aggregating which categories (security, performance, bugs) produce the most findings and which files repeatedly surface the same issues reveals team-level code quality trends. Monitoring token consumption patterns on the OpenAI API usage management page is also essential for cost optimization. The ultimate value of Codex code review automation isn't review speed — it's serving as a safety net that catches repetitive mistakes early.
