A research study comparing click-on (instant lookup) and key-in (manual typing) digital dictionaries found that easier lookup methods reduced spelling knowledge retention by 20-30%. Typing words manually requires active cognitive processing, while clicking leads to passive consumption.
Learning by writing code, building projects, and sandboxing is the most effective way to pick up a new coding skill, framework, or language. However, since the rise of generative AI, developers have been relying increasingly on AI-generated code fixes. This is either because
a) they are under pressure to deliver, or
b) they do not understand a defect (e.g. a security vulnerability) in the first place.
Using AI-generated code fixes for security is a double-edged sword: it can be a powerful tool for quickly generating a patch, or it can leave the vulnerability unpatched or even introduce a new one.
A research study has shown that 40% of Copilot-generated code contains security vulnerabilities. Another study shows that, given the right prompt and parameters, LLMs are reliable at fixing synthetically generated vulnerable code. However, more recent studies argue that LLMs struggle to repair over 90% of vulnerabilities in real-world software without guidance from a security expert.
Aside from the reliability of LLMs in fixing security vulnerabilities, there is another interesting issue here:
Would AI security code fixes help developers in becoming better in secure coding?
This article aims to answer this question.
We analysed the submissions for some of our AppSec coding challenges. On our platform, developers can download the source code of challenges and use their own IDE with any AI agent tool.
For this experiment, we analysed over 420 submissions. We reviewed a sample of challenges with three degrees of difficulty (easy, medium, and hard) in three languages (Python, Ruby, and Go).
For each submission, we analysed the following metrics: code diff size (lines changed), commit time intervals, and the number of files changed.
Below is a screenshot of the challenges we presented to the developers.
The following table summarises the results. A submission is a git commit, and time intervals were calculated from git commit timestamps. For statistical testing, we used Welch's t-test for unequal variances (p < 0.05 indicates a significant difference). To classify a submission as AI-generated, we looked at behavioural indicators and code characteristics.
| Metric | AI Fixes | Human Fixes | Statistical Difference |
|---|---|---|---|
| Code Diffs (lines) | | | |
| - Average | 56.3 | 18.4 | p < 0.001* |
| - Min | 9 | 1 | - |
| - Max | 493 | 166 | - |
| Time Intervals (minutes) | | | |
| - Average | 8.2 | 24.7 | p < 0.01* |
| - Min | 1 | 1 | - |
| - Max | 1380 | 44640 | - |
| Files Changed | | | |
| - Average | 1.2 | 1.4 | p = 0.23 |
| - Min | 1 | 1 | - |
| - Max | 2 | 5 | - |
| EASY CHALLENGE | | | |
| Count | 2 | 35 | - |
| Code Diffs (avg lines) | 30.0 | 16.2 | p = 0.15 |
| Time Intervals (avg minutes) | 8.0 | 18.3 | p = 0.31 |
| Files Changed (avg) | 1.0 | 1.6 | p = 0.22 |
| MEDIUM CHALLENGE | | | |
| Count | 12 | 9 | - |
| Code Diffs (avg lines) | 63.4 | 12.1 | p < 0.001* |
| Time Intervals (avg minutes) | 8.9 | 12.4 | p = 0.52 |
| Files Changed (avg) | 1.2 | 1.0 | p = 0.34 |
| HARD CHALLENGE | | | |
| Count | 8 | 17 | - |
| Code Diffs (avg lines) | 48.6 | 26.8 | p < 0.05* |
| Time Intervals (avg minutes) | 7.8 | 36.9 | p = 0.08 |
| Files Changed (avg) | 1.1 | 1.2 | p = 0.71 |

*Statistically significant at α = 0.05
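For reference, the p-values above come from Welch's t-test, which SciPy exposes as a two-sample t-test with `equal_var=False`. Here is a minimal sketch; the sample values are placeholders rather than the study's raw per-submission data.

```python
# Minimal sketch of the significance test used in the table (Welch's t-test).
# The lists below are hypothetical placeholders, not the study's raw data.
from scipy import stats

ai_diff_sizes = [56, 62, 49, 71, 58]        # lines changed per AI fix (hypothetical)
human_diff_sizes = [18, 12, 25, 16, 21]     # lines changed per human fix (hypothetical)

# equal_var=False selects Welch's t-test (unequal variances)
t_stat, p_value = stats.ttest_ind(ai_diff_sizes, human_diff_sizes, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 => statistically significant
```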
The data revealed distinct behavioural patterns between AI-generated and human-written code changes, most clearly in code diff size and commit time intervals.
The following graph compares code diff sizes for AI and human fixes. As shown, AI-generated fixes have large code diffs, 3 to 4 times larger than human-written ones.
The following graph compares commit time intervals for AI and human fixes. As shown, the intervals for AI-generated code were mostly under 5 minutes; AI-generated fixes were pushed roughly 3 times faster than human-written ones.
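Both of these metrics (diff size per commit and the gap between commits) can be pulled straight from a submission's git history. The sketch below illustrates one way to do it; it is not necessarily the exact extraction pipeline used in this study.

```python
# Illustrative sketch: extract per-commit diff sizes and the time gaps between
# commits from a repository's git history. Not necessarily the study's pipeline.
import re
import subprocess
from datetime import datetime

def commit_metrics(repo_path: str):
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse", "--pretty=%ct", "--shortstat"],
        capture_output=True, text=True, check=True,
    ).stdout

    commits, ts = [], None          # commits: list of (timestamp, lines changed)
    for line in out.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():          # the %ct line: unix commit timestamp
            ts = datetime.fromtimestamp(int(line))
        else:                       # e.g. "2 files changed, 10 insertions(+), 4 deletions(-)"
            nums = re.findall(r"(\d+) (?:insertion|deletion)", line)
            commits.append((ts, sum(int(n) for n in nums)))

    # minutes between consecutive commits
    gaps = [(b - a).total_seconds() / 60 for (a, _), (b, _) in zip(commits, commits[1:])]
    return commits, gaps
```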
Regarding code characteristics, AI-generated code mostly consisted of textbook-style security patches with heavy commenting and documentation. It usually included comprehensive error handling unrelated to the security vulnerability.
Human-written code was mostly limited to the minimal required changes and incremental improvements. It primarily added new code, with little commenting.
Regarding file changes, there was no significant difference between AI-generated and human-written code.
Now let’s look at something more interesting.
The following table compares average commit statistics for AI-generated and human-written code.
| Metric | AI Fixes | Human Fixes | Ratio |
|---|---|---|---|
| Avg Max Commit Size | 161.8 lines | 17.7 lines | 9.1x larger |
| Avg Commits per Repo | 10.7 commits | 2.2 commits | 4.9x more |
| Avg Lines per Commit | 57.9 lines | 10.9 lines | 5.3x larger |
Here are specific examples of AI fixes from three developers:
| Challenge | Lines Changed | Time Taken | Speed |
|---|---|---|---|
| Medium | 493 lines | 50 seconds | 591 lines/min |
| Medium | 308 lines | 34 seconds | 543 lines/min |
| Medium | 187 lines | 1 minute | 187 lines/min |
Examining the first developer’s case, they pushed a 493-line diff in 50 seconds: a rate of roughly 591 lines per minute, or about 10 lines per second. Compared with a typical human reading speed of 50-100 lines per minute (with comprehension), this rate is extraordinary; it is physically impossible for a developer to comprehend, let alone carefully validate, the changes at this speed before pushing them.
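The arithmetic behind that claim is simple enough to check directly:

```python
# Lines-changed-per-minute for the examples in the table above.
def lines_per_minute(lines_changed: int, seconds_taken: float) -> float:
    return lines_changed / (seconds_taken / 60)

print(lines_per_minute(493, 50))  # ~591 lines/min, roughly 10 lines per second
print(lines_per_minute(308, 34))  # ~543 lines/min
print(lines_per_minute(187, 60))  # 187 lines/min
# Careful reading with comprehension runs at roughly 50-100 lines per minute,
# so rates like these imply the diff was pushed without being reviewed.
```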
Let’s look at two repositories for an easy challenge, where one developer used AI and the other relied on her own brain. The challenge was a Django application vulnerable to integer overflow.
The human developer made a single commit. It changed 2 files and mostly added new lines.
f758d00 Mon Jun 23 11:52:21 2025 +0000 security fix
src/program/utils.py | 5 ++---
src/program/views.py | 9 ++++++++-
2 files changed, 10 insertions(+), 4 deletions(-)
The code change focuses on the root cause of the vulnerability. The developer modified the two required files with an effective patch (if you want to learn more about this vulnerability, see my article, Write up for Start Here.js: How To and Not To Prevent Integer Overflow in JavaScript).
diff --git a/src/program/utils.py b/src/program/utils.py
index a8182f9..b8d26c7 100644
--- a/src/program/utils.py
+++ b/src/program/utils.py
@@ -1,11 +1,10 @@
import numpy as np
# 248 * 86400 * 1000
-threshold = np.sum(np.array([2142720000], dtype=np.intc))
+threshold = 2142720000
def is_optimal(days):
# days * 86400 * 1000
- a = np.array([days, 8640000], dtype=np.intc)
- res = np.multiply(a[0], a[1])
+ res = 8640000 * days
if(res >= threshold):
diff --git a/src/program/views.py b/src/program/views.py
index 73e21d5..ef3572e 100644
--- a/src/program/views.py
+++ b/src/program/views.py
@@ -9,7 +9,14 @@ def index(request):
return HttpResponse("[!] Connected to Boeing 787!<br>[?] Enter how many days this Boeing has been operational (1 to 248): http://localhost:8080/isoptimal?days=[1-248]: ")
def isoptimal(request):
- days = int(request.GET["days"])
+ try:
+ days = int(request.GET["days"])
+ except (ValueError, TypeError):
+ return HttpResponse("Invalid input: 'days' must be a number.", status=400)
+
+ if days <= 0:
+ return HttpResponse("Negative", status=400)
+
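To see why the utils.py change above addresses the root cause: np.intc is a 32-bit signed integer, so for a days value just past 248 (e.g. 249) the multiplication wraps around to a negative number and silently slips past the threshold check. A quick, hypothetical reproduction of the original vulnerable computation:

```python
# Reproducing the overflow in the original utils.py: np.intc is a 32-bit
# signed integer, so 249 * 8_640_000 exceeds 2**31 - 1 and wraps around.
# (Recent NumPy versions may also emit a RuntimeWarning here.)
import numpy as np

threshold = np.sum(np.array([2142720000], dtype=np.intc))   # 248 * 8_640_000
a = np.array([249, 8640000], dtype=np.intc)                 # 249 days: just past the limit
res = np.multiply(a[0], a[1])

print(res)               # -2143607296 -- wrapped to a negative value
print(res >= threshold)  # False -> the vulnerable app reports "System is optimal"
```

The human patch avoids this entirely by using plain Python integers, which are arbitrary-precision and cannot wrap.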
The git history of AI fixes for the same easy secure coding challenge is provided below.
As we can see, the AI modified many files across multiple commits, with a lot of code insertions and deletions.
46b6dc1 Tue Jun 24 13:37:39 2025 +0200 test 6
52dbe26 Tue Jun 24 13:32:13 2025 +0200 test 5
bbbd02d Tue Jun 24 13:27:32 2025 +0200 test 4
29ab73a Tue Jun 24 13:22:59 2025 +0200 test 3
382a5bd Tue Jun 24 13:18:05 2025 +0200 test 2
9f777ac Tue Jun 24 13:08:49 2025 +0200 test 1
test 1
src/app/settings.py | 4 ++--
src/program/views.py | 39 +++++++++++++++++++++++++++++----------
2 files changed, 31 insertions(+), 12 deletions(-)
test 2
src/app/urls.py | 1 +
1 file changed, 1 insertion(+)
test 3
src/program/views.py | 29 ++++++++---------------------
1 file changed, 8 insertions(+), 21 deletions(-)
test 4
src/app/urls.py | 5 ++---
src/program/urls.py | 2 --
2 files changed, 2 insertions(+), 5 deletions(-)
test 5
src/app/settings.py | 6 +++---
src/program/utils.py | 6 ++----
src/program/views.py | 17 +++++++++--------
3 files changed, 14 insertions(+), 15 deletions(-)
test 6
src/program/utils.py | 8 ++------
src/program/views.py | 5 ++---
2 files changed, 4 insertions(+), 9 deletions(-)
There were completely unnecessary changes to settings.py to disable debug mode and restrict allowed hosts to localhost. The AI also introduced new helper methods, including one that adds HTTP security headers to responses. None of these changes addressed the vulnerability.
--- a/src/app/settings.py
+++ b/src/app/settings.py
@@ -23,9 +23,9 @@ BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# SECURITY WARNING: don't run with debug turned on in production!
-DEBUG = True
+DEBUG = False
-ALLOWED_HOSTS = ['*']
+ALLOWED_HOSTS = ['127.0.0.1']
diff --git a/src/program/views.py b/src/program/views.py
index 73e21d5..6674514 100644
--- a/src/program/views.py
+++ b/src/program/views.py
@@ -1,17 +1,36 @@
- return HttpResponse(f"[i] System is optimal<br>Reboot is required in {248 - days} days")
+ out = f"[i] System is optimal<br>Reboot is required in {248 - days} days"
+ resp = HttpResponse(out)
+ _add_security_headers(resp)
+ return resp
+
+def _bad():
+ resp = HttpResponseBadRequest()
+ _add_security_headers(resp)
+ return resp
+
+def _good():
+ resp = HttpResponse("Good")
+ _add_security_headers(resp)
+ return resp
+
+def _add_security_headers(resp):
+ resp['X-Frame-Options'] = 'DENY'
+ resp['X-Content-Type-Options'] = 'nosniff'
+ resp['X-XSS-Protection'] = '1; mode=block'
Each time a developer pushes their code, the SecDim server tests it by simulating an attacker and reports the result; this should keep the developer on the right track (a rough illustration of such a check is shown below). However, the next two commits do not show whether the developer made use of the test output: they reverted some of the changes but did not address the vulnerability.
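For illustration only, an attacker-simulation check for this particular challenge could look roughly like the following. This is not SecDim's actual test suite, and it assumes the challenge app is running locally on port 8080.

```python
# Illustrative only: the kind of attacker-simulation check the platform could
# run on every push for this challenge (not SecDim's actual test suite).
import requests

def test_integer_overflow_is_patched():
    # 249 days is just past the documented 1-248 range. On the vulnerable build
    # the int32 wrap-around makes the app report "System is optimal".
    resp = requests.get("http://localhost:8080/isoptimal", params={"days": "249"})
    assert "System is optimal" not in resp.text
```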
diff --git a/src/program/views.py b/src/program/views.py
index 6674514..5cfc1c5 100644
--- a/src/program/views.py
+++ b/src/program/views.py
@@ -2,35 +2,22 @@ from django.http import HttpResponse, HttpResponseBadRequest
from .utils import is_optimal
def index(request):
- resp = HttpResponse("[!] Connected to Boeing 787!<br>[?] Enter how many days this Boeing has been operational (1 to 248): http://localhost:8080/isoptimal?days=[1-248]: ")
- _add_security_headers(resp)
- return resp
+ return HttpResponse(
+ "[!] Connected to Boeing 787!<br>[?] Enter how many days this Boeing has been operational (1 to 248): http://localhost:8080/isoptimal?days=[1-248]: "
+ )
def isoptimal(request):
days_str = request.GET.get("days", None)
try:
if days_str is None or days_str.strip() == "":
- return _bad()
+ return HttpResponseBadRequest()
days = int(days_str)
if days < 1:
- return _bad()
+ return HttpResponseBadRequest()
except (ValueError, TypeError):
- return _bad()
+ return HttpResponseBadRequest()
res = is_optimal(days)
if res:
- out = "[i] Reboot is required"
+ return HttpResponse("[i] Reboot is required")
else:
- out = f"[i] System is optimal<br>Reboot is required in {248 - days} days"
- resp = HttpResponse(out)
- _add_security_headers(resp)
- return resp
-def _bad():
- resp = HttpResponseBadRequest()
- _add_security_headers(resp)
- return resp
-
-def _add_security_headers(resp):
- resp['X-Frame-Options'] = 'DENY'
- resp['X-Content-Type-Options'] = 'nosniff'
- resp['X-XSS-Protection'] = '1; mode=block'
+ return HttpResponse(f"[i] System is optimal<br>Reboot is required in {248 - days} days")
The next commit contains only formatting and linting changes.
diff --git a/src/app/urls.py b/src/app/urls.py
index 206f611..77aeb50 100644
--- a/src/app/urls.py
+++ b/src/app/urls.py
@@ -14,10 +14,9 @@ Including another URLconf
2. Add a URL to urlpatterns: path('blog/', include('blog.urls'))
"""
from django.contrib import admin
-from django.urls import include, path
+from django.urls import path, include
urlpatterns = [
- path('', include('program.urls')),
path('admin/', admin.site.urls),
+ path('', include('program.urls')),
]
-
diff --git a/src/program/urls.py b/src/program/urls.py
index 7d4e7a4..4a16b52 100644
--- a/src/program/urls.py
+++ b/src/program/urls.py
@@ -2,8 +2,6 @@ from django.urls import path
from . import views
urlpatterns = [
- # /program/
path('', views.index, name='index'),
- # /program/1
path('isoptimal/', views.isoptimal, name='isoptimal'),
]
These are the final commits, and they show some signs of frustration. The developer reverted all the changes and started from scratch. As we can see, settings.py is modified yet again. Sadly, the final commit made changes to views.py that only mask the vulnerability, leaving it unpatched in utils.py.
diff --git a/src/app/settings.py b/src/app/settings.py
index 799e1ab..e0b68a0 100644
--- a/src/app/settings.py
+++ b/src/app/settings.py
@@ -20,12 +20,12 @@ BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# See https://docs.djangoproject.com/en/3.0/howto/deployment/checklist/
# SECURITY WARNING: keep the secret key used in production secret!
-SECRET_KEY = '=q$m*%7k-yswz5ienrbkzaej3msmck&6xf6h6xktfnkpk$)s@&'
+SECRET_KEY = os.environ.get("DJANGO_SECRET_KEY", '=q$m*%7k-yswz5ienrbkzaej3msmck&6xf6h6xktfnkpk$)s@&')
# SECURITY WARNING: don't run with debug turned on in production!
-DEBUG = False
+DEBUG = os.environ.get("DJANGO_DEBUG", "True") == "True"
-ALLOWED_HOSTS = ['127.0.0.1']
+ALLOWED_HOSTS = os.environ.get("DJANGO_ALLOWED_HOSTS", "localhost,127.0.0.1").split(",")
# Application definition
diff --git a/src/program/utils.py b/src/program/utils.py
index a8182f9..c2f47a7 100644
--- a/src/program/utils.py
+++ b/src/program/utils.py
@@ -1,12 +1,10 @@
import numpy as np
-# 248 * 86400 * 1000
threshold = np.sum(np.array([2142720000], dtype=np.intc))
def is_optimal(days):
- # days * 86400 * 1000
a = np.array([days, 8640000], dtype=np.intc)
res = np.multiply(a[0], a[1])
- if(res >= threshold):
+ if res >= threshold:
return True
else:
- return False
\ No newline at end of file
+ return False
diff --git a/src/program/views.py b/src/program/views.py
index 5cfc1c5..fb6bb80 100644
--- a/src/program/views.py
+++ b/src/program/views.py
@@ -3,19 +3,20 @@ from .utils import is_optimal
def index(request):
return HttpResponse(
- "[!] Connected to Boeing 787!<br>[?] Enter how many days this Boeing has been operational (1 to 248): http://localhost:8080/isoptimal?days=[1-248]: "
+ "[!] Connected to Boeing 787!<br>[?] Enter how many days this Boeing has been operational (1 to 248): "
+ "http://localhost:8080/isoptimal?days=[1-248]: "
)
def isoptimal(request):
- days_str = request.GET.get("days", None)
+ days_str = request.GET.get("days", "")
try:
- if days_str is None or days_str.strip() == "":
- return HttpResponseBadRequest()
days = int(days_str)
- if days < 1:
- return HttpResponseBadRequest()
- except (ValueError, TypeError):
- return HttpResponseBadRequest()
+ if not (1 <= days <= 248):
+ if days < 1:
+ return HttpResponseBadRequest("Invalid number of days. Must be between 1 and 248.")
+ except ValueError:
+ return HttpResponseBadRequest("Invalid number of days. Must be integer.")
+
diff --git a/src/program/utils.py b/src/program/utils.py
index c2f47a7..8b9453a 100644
--- a/src/program/utils.py
+++ b/src/program/utils.py
@@ -2,9 +2,5 @@ import numpy as np
threshold = np.sum(np.array([2142720000], dtype=np.intc))
def is_optimal(days):
- a = np.array([days, 8640000], dtype=np.intc)
- res = np.multiply(a[0], a[1])
- if res >= threshold:
- return True
- else:
- return False
+ # Returns True if reboot is required, else False
+ return days >= 248
diff --git a/src/program/views.py b/src/program/views.py
index fb6bb80..0906086 100644
--- a/src/program/views.py
+++ b/src/program/views.py
@@ -11,9 +11,8 @@ def isoptimal(request):
days_str = request.GET.get("days", "")
try:
days = int(days_str)
- if not (1 <= days <= 248):
- if days < 1:
- return HttpResponseBadRequest("Invalid number of days. Must be between 1 and 248.")
+ if not (1 <= days): # allow any number >= 1 for reboot, but <1 is error
+ return HttpResponseBadRequest("Invalid number of days. Must be at least 1.")
except ValueError:
return HttpResponseBadRequest("Invalid number of days. Must be integer.")
Our analysis showed that when AI-generated code was readily available, developers simply accepted it. They did not invest time in reviewing the generated code, reasoning about it, removing unnecessary changes, or tweaking it to properly patch vulnerabilities. The ease of accepting AI output has resulted in passive consumption. Developers in our experiments showed a similar behavioural pattern to the digital dictionary research study (comparing click-on versus manual typing of words).
Over-reliance on AI leads to a significant reduction in critical thinking, problem-solving ability, and skill development over time. Similar research studies on AI’s role in education have highlighted the risks of learners’ over-reliance on generated outputs. They found that learners quickly become accustomed to auto-suggested solutions and stop thinking about the steps required to solve coding problems.
Developers’ over-reliance on AI results in overlooking potential security vulnerabilities and a decline in their secure coding awareness. This has also given rise to optimism bias, where some believe AI coding tools enhance security and therefore blindly accept AI-generated diffs. As we can see in the first two commits of the AI-generated code example, this bias partly stems from well-formatted, professional-looking AI-generated code. It instils a false sense of security and masks underlying security flaws.
The secure coding challenges reviewed in this experiment were simple web applications with few lines of code and files. These do not resemble the reality of production applications, which are complex and intertwined. Production environments require deeper contextual understanding to detect or fix potential security vulnerabilities. However, over-reliance on AI deepens the comprehension gap where developers may no longer understand how to address these issues.
The objective of this experiment was to evaluate the impact of AI on secure code learning. The data showed that developers often “just accept all” AI-generated code without truly understanding it. Unrealistically short time intervals between large and complex commits provided clear evidence that code review, comprehension, and testing were not taking place.
Although this was a short experiment, the findings aligned with other research studies. We can confidently conclude that if AI is not used correctly, it can lead to over-reliance and the degradation of secure coding skills.
Be aware of AI’s potential negative impact on your critical thinking and skill development.
When designing the SecDim AI coding mentor, Dr. SecDim, we intentionally restricted certain capabilities. The AI does not provide complete coding answers, nor does it allow direct acceptance of suggestions. Instead, users must type in AI recommendations manually. This design helps our users get the best out of an AI coding mentor while avoiding over-reliance, supporting them through a tailored learning path.
Let me conclude this post with a relevant word of wisdom from the greatest figure in classical Persian literature, Saadi Shirazi (13th century poet):
نابرده رنج گنج میسر نمی شود - مزد آن گرفت جان برادر که کار کرد
Without toil, no treasure is attained (no pain, no gain); the reward, dear brother, goes to the one who did the work.
Want to skill-up in secure coding and AppSec? Try SecDim Wargames to learn how to find, hack and fix security vulnerabilities inspired by real-world incidents.
Join our secure coding and AppSec community. A discussion board to share and discuss all aspects of secure programming, AppSec, DevSecOps, fuzzing, cloudsec, AIsec code review, and more.