GitHub CodeQL’s code scanning is more than just a fancy linter; it’s a static analysis engine that treats your code as a database, allowing for sophisticated queries to find subtle bugs and security vulnerabilities.

Here’s CodeQL in action, finding a classic SQL injection vulnerability. Imagine this Python code:

import sqlite3
from flask import Flask, request

app = Flask(__name__)

@app.route('/user')
def get_user():
    user_id = request.args.get('id')
    conn = sqlite3.connect('users.db')
    cursor = conn.cursor()
    # Vulnerable query!
    cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")
    user_data = cursor.fetchone()
    conn.close()
    return str(user_data)

if __name__ == '__main__':
    app.run(debug=True)

When CodeQL scans this, it doesn’t just look for SELECT statements. It builds a relational database of your code’s structure and data flow. It knows that user_id comes directly from an untrusted web request (request.args.get('id')) and that this untrusted input is then directly embedded into a SQL query executed by cursor.execute(). This direct flow from "tainted" input to a "sensitive" sink (the database query) is what CodeQL flags.
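For contrast, here is a minimal sketch of the same lookup written so that the tainted value never becomes part of the SQL text. It uses sqlite3's "?" placeholder; the fetch_user and demo helpers, the in-memory database, and the two-column users table are assumptions made up for illustration, not part of the original app.

```python
import sqlite3


def fetch_user(conn: sqlite3.Connection, user_id: str):
    """Look up a user with a parameterized query instead of an f-string."""
    cursor = conn.cursor()
    # The "?" placeholder passes user_id as data; it is never parsed as SQL.
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
    return cursor.fetchone()


def demo():
    # Hypothetical in-memory database standing in for users.db.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
    normal = fetch_user(conn, "1")
    # A classic injection payload is compared as a plain value and matches nothing.
    injected = fetch_user(conn, "1 OR 1=1")
    conn.close()
    return normal, injected
```

With the placeholder in place, the request value still reaches cursor.execute(), but only as a bound parameter, so the query text itself is a constant.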

The problem CodeQL solves is the inherent difficulty in manually tracking data flow and identifying security weaknesses across large, complex codebases. Traditional security tools often rely on pattern matching or heuristics, which can lead to high false positive rates or miss novel vulnerabilities. CodeQL, by modeling code as data, allows for precise, logic-based vulnerability detection.

Internally, CodeQL works in three stages: extraction, analysis, and presentation.

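  1. Extraction: CodeQL’s extractor processes your code and generates a database from it. For interpreted languages such as Python it reads the source files directly; for compiled languages it typically observes the build to see exactly what gets compiled. The resulting database represents the code’s structure, including abstract syntax trees (ASTs), control flow graphs (CFGs), and data flow graphs (DFGs).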
  2. Analysis: This is where the magic happens. You run CodeQL queries written in the QL language against the generated database. These queries are essentially logical statements that define patterns or properties of code. For example, a query might look for data that flows from a web request parameter into a database query without sanitization.
  3. Presentation: The results of the QL queries are presented as alerts in your GitHub repository, indicating the location and nature of the vulnerability.
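The three stages can be imitated at toy scale with Python's standard ast module. This is only an illustration of the "code as data" idea, not how CodeQL is implemented: the row format, the function names, and the purely syntactic f-string check are all assumptions chosen to keep the sketch short.

```python
import ast

# Hypothetical code under analysis, kept as a string so the sketch is self-contained.
SOURCE = '''
def get_user(cursor, user_id):
    cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")

def get_all(cursor):
    cursor.execute("SELECT * FROM users")
'''


def extract_calls(source: str):
    """Extraction: turn source text into rows of facts about method calls."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            first_arg = node.args[0] if node.args else None
            rows.append({
                "method": node.func.attr,
                "line": node.lineno,
                # ast.JoinedStr is the AST node type for an f-string.
                "first_arg_is_fstring": isinstance(first_arg, ast.JoinedStr),
            })
    return rows


def query_fstring_execute(rows):
    """Analysis: select execute() calls whose first argument is an f-string."""
    return [r for r in rows if r["method"] == "execute" and r["first_arg_is_fstring"]]


def present(results):
    """Presentation: render each result as a one-line alert."""
    return [f"line {r['line']}: SQL built with an f-string" for r in results]


alerts = present(query_fstring_execute(extract_calls(SOURCE)))
```

Note that this check is exactly the kind of pattern matching that data flow analysis improves on: it would miss a query assembled in a separate variable, and it would flag an f-string that only interpolates constants.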

The levers you control are primarily the CodeQL queries you choose to run. GitHub provides a rich set of built-in queries covering common vulnerability types (like SQL injection, cross-site scripting, path traversal, etc.) for various languages. You can also write your own custom QL queries to detect specific patterns or vulnerabilities relevant to your project. Configuring the GitHub Actions workflow is another key lever, determining which languages are scanned, which query suites are used, and how frequently scans are performed.

The surprising thing about CodeQL is that you never write analysis code in the language being scanned. The QL language is declarative and logic-based, designed to express properties of code rather than imperative steps. Queries themselves are still language-specific: each supported language has its own extractor, database schema, and standard library, so a query written against the Python library will not run against a Java database. What carries over is the approach, because the data flow and taint-tracking libraries follow a similar design across languages. You’re not writing Python code to find Python vulnerabilities; you’re writing QL, a language for describing code properties, which then operates on a CodeQL database extracted from your Python code.

One of the most powerful, yet often overlooked, aspects of CodeQL is its ability to model taint flow. This isn’t just about tracking a variable; it’s about tracking the origin of data and where it reaches sensitive operations. A "taint source" is any input that could be controlled by an attacker (like HTTP request parameters, file uploads, or environment variables). A "taint sink" is a function or operation that could be exploited if it receives tainted data (like executing a shell command, performing a database query, or rendering HTML). CodeQL’s taint-tracking queries explicitly model this flow, identifying when tainted data reaches a sink without proper sanitization. This is far more precise than simple static analysis that might just look for certain function calls.
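The source, sink, and sanitizer vocabulary can be made concrete with a deliberately small taint tracker built on Python's ast module. This is a rough sketch under heavy assumptions, not CodeQL's algorithm: the SOURCES, SINKS, and SANITIZERS sets are hypothetical name-based stand-ins, and the analysis only follows straight-line assignments inside a single function.

```python
import ast

SOURCES = {"get"}      # hypothetical: any .get(...) call yields attacker-controlled data
SINKS = {"execute"}    # hypothetical: any .execute(...) call is a sensitive operation
SANITIZERS = {"int"}   # hypothetical: int(...) is treated as cleaning its argument

# Hypothetical code under analysis.
EXAMPLE = '''
def get_user(request, cursor):
    user_id = request.args.get("id")
    query = f"SELECT * FROM users WHERE id = {user_id}"
    cursor.execute(query)

def get_user_checked(request, cursor):
    user_id = int(request.args.get("id"))
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
'''


def call_name(node):
    """Return the called function or method name, or None if node is not a call."""
    if not isinstance(node, ast.Call):
        return None
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def is_tainted(expr, tainted):
    """Tainted if it contains a source call or a tainted name, unless sanitized."""
    if call_name(expr) in SANITIZERS:
        return False
    for node in ast.walk(expr):
        if call_name(node) in SOURCES:
            return True
        if isinstance(node, ast.Name) and node.id in tainted:
            return True
    return False


def find_flows(source):
    """Return line numbers of sink calls that receive tainted arguments."""
    alerts = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        tainted = set()  # tainted variable names, tracked per function
        # Only top-level statements are visited; branches and loops are ignored.
        for stmt in func.body:
            if isinstance(stmt, ast.Assign):
                names = {t.id for t in stmt.targets if isinstance(t, ast.Name)}
                if is_tainted(stmt.value, tainted):
                    tainted |= names   # taint propagates through the assignment
                else:
                    tainted -= names   # a clean value overwrites earlier taint
            elif isinstance(stmt, ast.Expr) and call_name(stmt.value) in SINKS:
                if any(is_tainted(arg, tainted) for arg in stmt.value.args):
                    alerts.append(stmt.lineno)
    return alerts
```

Running find_flows(EXAMPLE) reports only the execute() call in get_user, because the taint travels from the request through user_id and query to the sink, while get_user_checked passes the value through the assumed sanitizer first. A real engine also has to handle calls between functions, branches, aliasing, and library models, which is the work CodeQL's data flow libraries take on.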

Once you’ve mastered detecting common vulnerabilities with CodeQL, the next step is to start writing custom queries to find logic errors or project-specific security flaws.
