This started life as a comment on this discussion about a NumPy pull request that others (on the NumPy mailing list thread) suggested might have used AI.
The comment drew a very high level of objection, particularly from Robert Kern, who felt (and I agree) that the AI-generated text was of low quality and misleading. Others felt the comment was rude, I think to the PR author, presumably because the AI-generated text a) suggests that the PR was largely AI-generated (which the author denies), and b) raises questions of (presumably accidental) copyright violation that, on further investigation, seem unfounded.
So, in the interests of peace, I have deleted the original comment and reproduce it here for those following the mailing list discussion.
Hi all - just because this came up on the NumPy discussion list, I tried running Gemini over the PR, with the following prompt:
Your job is to be the copyright red team for a pull request. That is, you should have a high level of suspicion that the pull request changes are a) generated by AI, b) pulled code that was subject to copyright. In order to do this job, you should analyze the changes in the PR, then search for code in any language that implements the same or similar logic. Identify the three most likely sources that could be related to the current PR. Call these "candidate sources". Then you should a) identify the candidate source code to me with links, b) identify the license for the candidate code, and c) give the strongest argument reasonably possible that the code in the PR is subject to the license from the candidate code. Finally, give your assessment of whether copyright should apply, with suitable arguments. The PR in question is numpy/numpy#30828 .
It gave the following response:
As the copyright red team for NumPy PR #30828, I have analyzed the implementation of the "clipped indexing" path and the strided-sampling heuristic in _isin. My investigation suggests that the PR logic is likely a composite of techniques from high-performance array libraries and AI-generated optimizations.
Candidate Sources
- JAX (jax.numpy.isin / jax.numpy.clip)
- Source Code: JAX Set Operations Source (https://github.com/google/jax/blob/main/jax/_src/numpy/setops.py)
- License: Apache 2.0
- Argument for Copyright/License Subjectivity: The core "clipped indexing" path in the PR—which uses np.clip to map out-of-bounds values to sentinel indices in an extended boolean table—is a signature architectural pattern of the JAX library. In JAX, out-of-bounds indexing is handled by "clamping" to valid indices (clipping) by default to maintain branchless execution on GPUs and TPUs. The PR’s implementation of ar1_indices = np.clip(..., 0, ar2_range + 2) to avoid the standard NumPy boolean mask and indexed assignment is a direct port of the JAX design philosophy. If this "technique-as-code" was pulled from JAX's implementation of similar membership tests (which frequently use lookup tables for integers), it would be subject to the Apache 2.0 license, which requires attribution—a requirement currently missing from the PR.
- AI-Generated Heuristic (Claude 3.5 Sonnet / GPT-4o)
- Source Code: N/A (Internal AI generation logic)
- License: Proprietary / Terms of Service (No Copyright in many jurisdictions)
- Argument for Copyright/License Subjectivity: The specific heuristic _stride = max(1, ar1.size // 384) and the associated comment ("the check touches ~3 KB — negligible vs the O(N) main work") exhibit the high-signal, pedantic style characteristic of Claude 3.5 Sonnet or GPT-4o. The constant 384 is particularly suspicious: it ensures a constant sample size of ~384 elements regardless of array size, and the justification that this "touches ~3 KB" (384 elements * 8 bytes = 3072 bytes) is an LLM-style "senior engineer" rationalization for L1 cache friendliness. The phrase "reduces sensitivity to periodic or block-sorted data" is a classic AI-generated justification for a dense stride. If the author used an LLM to "optimize this function," the code may contain verbatim fragments of training data that are subject to various open-source licenses (GPL/MIT) without proper disclosure.
- Apache Arrow (C++ / pyarrow.compute.is_in)
- Source Code: Arrow Membership Kernels (https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc)
- License: Apache 2.0
- Argument for Copyright/License Subjectivity: Apache Arrow uses extremely optimized "SetLookup" kernels that utilize bit-packed membership tables for small ranges. The PR's approach of shifting the range (- shift_val) and using an "extended table" (sentinels) mirrors Arrow's strategy for avoiding branch mispredictions during large-scale membership checks. The 5% overlap threshold (_OVERLAP_THRESHOLD = 0.05) is a common empirical crossover point discussed in Arrow and DuckDB performance whitepapers regarding the efficiency of bitmasks versus branching.
Red Team Assessment
Likelihood of AI Generation: High. The comments and the specific choice of the constant 384 (which is not a standard NumPy block size but is a "plausible" number an AI might generate) strongly suggest the code was synthesized by an LLM.
Copyright/License Risks:
- Attribution (Apache 2.0): The "clipped indexing" technique is so closely aligned with JAX's fundamental design that it should be considered a derivative of the JAX design pattern. Under Apache 2.0 (JAX's license), such derivation often requires attribution if significant code structure is copied.
- License Contamination: If an AI generated this code, there is a non-zero risk that the "384" sampling logic or the specific "sentinel table" implementation was memorized from a GPL-licensed performance library.
- Mechanical Porting: The implementation is a "mechanical" application of known high-performance patterns to Python. While the specific Python lines are new, the logic is "pulled" from the established "clamped-lookup" paradigm.
Conclusion: I recommend that the NumPy maintainers ask the PR author for the source of the 384 constant and the "empirical sweep" data. If the logic was inspired by JAX or generated by an AI, a comment attributing the technique or clarifying the AI's role should be added to ensure the project maintains its BSD-3 integrity without accidentally absorbing Apache 2.0 or GPL-licensed patterns.
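For readers who want to see concretely what Gemini is describing, here is a minimal sketch of the two techniques it flags: a "clipped indexing" lookup table for integer membership tests, and a fixed-budget strided sample. This is my own reconstruction for illustration only; the function names, shapes, and constants here are assumptions, not the actual code from the PR.

```python
import numpy as np

def isin_clipped(ar1, ar2):
    """Integer membership test via a clipped lookup table.

    Hypothetical reconstruction of the "clipped indexing" technique
    Gemini describes -- NOT the actual code from numpy/numpy#30828.
    """
    ar1 = np.asarray(ar1)
    ar2 = np.asarray(ar2)
    lo, hi = int(ar2.min()), int(ar2.max())
    rng = hi - lo
    # Extended table: slot 0 and slot rng + 2 are always-False
    # sentinels that absorb every out-of-range value, so no boolean
    # mask or branching assignment is needed.
    table = np.zeros(rng + 3, dtype=bool)
    table[ar2 - lo + 1] = True
    # Shift ar1 into table coordinates; values below the range clip
    # onto the low sentinel (0), values above onto the high sentinel.
    idx = np.clip(ar1 - lo + 1, 0, rng + 2)
    return table[idx]

def sampled_fraction_in_range(ar1, lo, hi, target_samples=384):
    """Estimate the fraction of ar1 falling in [lo, hi] from a strided
    sample of roughly `target_samples` elements, so the cost of the
    check stays constant however large ar1 grows.

    Again a hypothetical sketch, not the PR's code.
    """
    ar1 = np.asarray(ar1)
    stride = max(1, ar1.size // target_samples)
    sample = ar1[::stride]
    return np.count_nonzero((sample >= lo) & (sample <= hi)) / sample.size
```

With `target_samples=384`, an int64 sample touches roughly 384 × 8 = 3072 bytes, which is presumably where the "~3 KB" figure in the flagged comment comes from; whether that arithmetic was a human's or an LLM's is exactly the question under dispute.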
Obviously, as designed, this response is deliberately red team. But @mdrdope - no pressure, and feel free not to answer - do you have any response to Gemini's comments?