This started life as a comment on this discussion about a NumPy pull request that others (on the NumPy mailing list thread) suggested might have used AI.
The comment drew a very high level of objection, particularly from Robert Kern, who felt (and I agree) that the AI-generated text was of low quality and misleading. Others felt the comment was rude, I think to the PR author, presumably because the AI-generated text a) suggests that the PR was largely AI-generated (which the author denies), and b) raises questions of (presumably accidental) copyright violation that, on further investigation, seem unfounded.
So, in the interests of peace, I have deleted the original comment and reproduce it here for those following the mailing list discussion.
Hi all - just because this came up on the NumPy discussion list, I tried running Gemini over the PR, with the following prompt:
Your job is to be the copyright red team for a pull request. That is, you should have a high level of suspicion that the pull request changes are a) generated by AI, b) pulled code that was subject to copyright. In order to do this job, you should analyze the changes in the PR, then search for code in any language that implements the same or similar logic. Identify the three most likely sources that could be related to the current PR. Call these "candidate sources". Then you should a) identify the candidate source code to me with links, b) identify the license for the candidate code, and c) give the strongest argument reasonably possible that the code in the PR is subject to the license from the candidate code. Finally, give your assessment of whether copyright should apply, with suitable arguments. The PR in question is numpy/numpy#30828 .
It gave the following response:
As the copyright red team for NumPy PR #30828, I have analyzed the implementation of the "clipped indexing" path and the strided-sampling heuristic in _isin. My investigation suggests that the PR logic is likely a composite of techniques from high-performance array libraries and AI-generated optimizations.
Candidate Sources
- JAX (jax.numpy.isin / jax.numpy.clip)
- Source Code: JAX Set Operations Source (https://github.com/google/jax/blob/main/jax/_src/numpy/setops.py)
- License: Apache 2.0
- Argument for Copyright/License Subjectivity: The core "clipped indexing" path in the PR—which uses np.clip to map out-of-bounds values to sentinel indices in an extended boolean table—is a signature architectural pattern of the JAX library. In JAX, out-of-bounds indexing is handled by "clamping" to valid indices (clipping) by default to maintain branchless execution on GPUs and TPUs. The PR’s implementation of ar1_indices = np.clip(..., 0, ar2_range + 2) to avoid the standard NumPy boolean mask and indexed assignment is a direct port of the JAX design philosophy. If this "technique-as-code" was pulled from JAX's implementation of similar membership tests (which frequently use lookup tables for integers), it would be subject to the Apache 2.0 license, which requires attribution—a requirement currently missing from the PR.
- AI-Generated Heuristic (Claude 3.5 Sonnet / GPT-4o)
- Source Code: N/A (Internal AI generation logic)
- License: Proprietary / Terms of Service (No Copyright in many jurisdictions)
- Argument for Copyright/License Subjectivity: The specific heuristic _stride = max(1, ar1.size // 384) and the associated comment ("the check touches ~3 KB — negligible vs the O(N) main work") exhibit the high-signal, pedantic style characteristic of Claude 3.5 Sonnet or GPT-4o. The constant 384 is particularly suspicious: it ensures a constant sample size of ~384 elements regardless of array size, and the justification that this "touches ~3 KB" (384 elements * 8 bytes = 3072 bytes) is an LLM-style "senior engineer" rationalization for L1 cache friendliness. The phrase "reduces sensitivity to periodic or block-sorted data" is a classic AI-generated justification for a dense stride. If the author used an LLM to "optimize this function," the code may contain verbatim fragments of training data that are subject to various open-source licenses (GPL/MIT) without proper disclosure.
- Apache Arrow (C++ / pyarrow.compute.is_in)
- Source Code: Arrow Membership Kernels (https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc)
- License: Apache 2.0
- Argument for Copyright/License Subjectivity: Apache Arrow uses extremely optimized "SetLookup" kernels that utilize bit-packed membership tables for small ranges. The PR's approach of shifting the range (- shift_val) and using an "extended table" (sentinels) mirrors Arrow's strategy for avoiding branch mispredictions during large-scale membership checks. The 5% overlap threshold (_OVERLAP_THRESHOLD = 0.05) is a common empirical crossover point discussed in Arrow and DuckDB performance whitepapers regarding the efficiency of bitmasks versus branching.
Red Team Assessment
Likelihood of AI Generation: High. The comments and the specific choice of the constant 384 (which is not a standard NumPy block size but is a "plausible" number an AI might generate) strongly suggest the code was synthesized by an LLM.
Copyright/License Risks:
- Attribution (Apache 2.0): The "clipped indexing" technique is so closely aligned with JAX's fundamental design that it should be considered a derivative of the JAX design pattern. Under Apache 2.0 (JAX's license), such derivation often requires attribution if significant code structure is copied.
- License Contamination: If an AI generated this code, there is a non-zero risk that the "384" sampling logic or the specific "sentinel table" implementation was memorized from a GPL-licensed performance library.
- Mechanical Porting: The implementation is a "mechanical" application of known high-performance patterns to Python. While the specific Python lines are new, the logic is "pulled" from the established "clamped-lookup" paradigm.
Conclusion: I recommend that the NumPy maintainers ask the PR author for the source of the 384 constant and the "empirical sweep" data. If the logic was inspired by JAX or generated by an AI, a comment attributing the technique or clarifying the AI's role should be added to ensure the project maintains its BSD-3 integrity without accidentally absorbing Apache 2.0 or GPL-licensed patterns.
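For readers who want to see concretely what Gemini is describing, here is a minimal sketch of the two techniques it flags: a "clipped indexing" lookup table for integer membership tests, and a fixed-budget strided sample. This is my own reconstruction for illustration only; the function names, shapes, and constants here are assumptions, not the actual code from the PR.

```python
import numpy as np

def isin_clipped(ar1, ar2):
    """Integer membership test via a clipped lookup table.

    Hypothetical reconstruction of the "clipped indexing" technique
    Gemini describes -- NOT the actual code from numpy/numpy#30828.
    """
    ar1 = np.asarray(ar1)
    ar2 = np.asarray(ar2)
    lo, hi = int(ar2.min()), int(ar2.max())
    rng = hi - lo
    # Extended table: slot 0 and slot rng + 2 are always-False
    # sentinels that absorb every out-of-range value, so no boolean
    # mask or branching assignment is needed.
    table = np.zeros(rng + 3, dtype=bool)
    table[ar2 - lo + 1] = True
    # Shift ar1 into table coordinates; values below the range clip
    # onto the low sentinel (0), values above onto the high sentinel.
    idx = np.clip(ar1 - lo + 1, 0, rng + 2)
    return table[idx]

def sampled_fraction_in_range(ar1, lo, hi, target_samples=384):
    """Estimate the fraction of ar1 falling in [lo, hi] from a strided
    sample of roughly `target_samples` elements, so the cost of the
    check stays constant however large ar1 grows.

    Again a hypothetical sketch, not the PR's code.
    """
    ar1 = np.asarray(ar1)
    stride = max(1, ar1.size // target_samples)
    sample = ar1[::stride]
    return np.count_nonzero((sample >= lo) & (sample <= hi)) / sample.size
```

With `target_samples=384`, an int64 sample touches roughly 384 × 8 = 3072 bytes, which is presumably where the "~3 KB" figure in the flagged comment comes from; whether that arithmetic was a human's or an LLM's is exactly the question under dispute.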
Obviously, as designed, this response is deliberately red team. But @mdrdope - no pressure, and feel free not to answer - do you have any response to Gemini's comments?