Skip to content

Instantly share code, notes, and snippets.

@Hermann-SW
Last active January 7, 2026 00:22
Show Gist options
  • Select an option

  • Save Hermann-SW/94241e719dd3518a3276d804b627319d to your computer and use it in GitHub Desktop.

Select an option

Save Hermann-SW/94241e719dd3518a3276d804b627319d to your computer and use it in GitHub Desktop.
Determine pi_{k^2+(k+1)^2}(10^n) (for all k<=10^n)
// NOLINT(legal/copyright)
// g++ par.cc -O3 -Wall -pedantic -Wextra -o par -fopenmp
// (cpplinted and cppchecked)
//
#include <string.h>
#include <iostream>
#include <cassert>
#include <tuple>
#include <queue>
typedef __int128_t num;
const int lev = 16; // results in ceil((fib(19)-1)/2) = 2090 work items
num c;
char *V;
std::queue<std::tuple<num, num, bool, bool>> Q[2090];
int nxt = 0;
const bool lft = false;
const bool rgt = true;
num other(num v, num p) {
num x = 2*v*(v+1)+1;
assert(x%p == 0);
return x/p;
}
num next(num v, num p) {
num r = v%p;
return v + (2*r < p ? p-2*r-1 : 2*(p-r)-1);
}
void right(num v, num p, int l, bool b);
void left(num v, num p, int l, bool b) {
if (l == 0) { Q[nxt++].push({v, p, lft, b}); return; }
V[v] = 1;
num q = other(v, p);
if (q != p) left(next(v, q), q, l-1, false);
right(next(v, p), p, l-1, b);
}
void right(num v, num p, int l, bool b) {
if (l == 0) { Q[nxt++].push({v, p, rgt, b}); return; }
V[v] = 1;
num q = other(v, p);
left(next(v, p), p, l-1, false);
if (b) right(next(v, q), q, l-1, true);
}
int main(int argc, char *argv[]) {
assert(argc == 2);
assert(atoi(argv[1]) >= 4);
num mx = 1;
for (int e = 1; e <= atoi(argv[1]); ++e) {
mx *= 10;
}
V = new char[1 + mx];
assert(V);
explicit_bzero(V, 1+mx);
left(3, 5, lev, true);
_Pragma("omp parallel for")
for (int i = 0; i < 2090; ++i) {
while (!Q[i].empty()) {
num v = std::get<0>(Q[i].front());
num p = std::get<1>(Q[i].front());
bool l = std::get<2>(Q[i].front());
bool b = std::get<3>(Q[i].front());
Q[i].pop();
if (v > mx) continue;
V[v] = 1;
num q = other(v, p);
if (l == lft) {
if (q != p) Q[i].push({next(v, q), q, lft, false});
Q[i].push({next(v, p), p, rgt, b});
} else {
Q[i].push({next(v, p), p, lft, false});
if (b) Q[i].push({next(v, q), q, rgt, true});
}
}
} // _Pragma("omp parallel for")
num c = mx, i;
for (i = 1; i <= mx; ++i) c-=V[i];
std::cout << static_cast<__int64_t>(c) << "\n";
return 0;
}
@Hermann-SW
Copy link
Author

Hermann-SW commented Jan 3, 2026

Takes 25GB resident memory for 10^10.
Started with nohup time ./par 10 took 1:17h on 32GB RAM 16C/32T AMD 7950X CPU.

$2090 = \lceil\frac{fib(19)-1}{2}\rceil$ work items in Q after limited height initial DFS.
For later OpenMP parallelization and run on 1TB RAM 8-socket 192C/384T server with Intel Xeon 8890v4 CPUs.
Plan is to determine exact pi values for 10^11 and 10^12 as well.

Maximal queue size 340152917 during the sequential BFS computation.

Result is $\ pi_{k^2+(k+1)^2}(10^{10}) = 614,983,774$, so there are $614,983,774$ primes of form $k^2+(k+1)^2$ for $k\leq 10^{10}$.
There are quite more primes in that sequence than primes in natural numbers: $\ pi(10^{10}) = pi_{k}(10^{10}) = 455,052,511$.

The algorithm does equivalent of sieve of Eratosthenes for primes of that sum of successive squares form.
Composites are marked with 1 in 10GB array V, primes are marked with 0.
Algorithm traverses (left half of) great-grandparent DAG (as defined in A000071) of composite sum of successive squares.

$ cat nohup.out
2090
614983774
340152917 Q max size
4611.24user 8.06system 1:16:59elapsed 99%CPU (0avgtext+0avgdata 26704340maxresident)k
0inputs+8outputs (0major+6795216minor)pagefaults 0swaps
$ 

P.S:
great-grandparnt DAG is described in followup gist here.

@Hermann-SW
Copy link
Author

Hermann-SW commented Jan 4, 2026

Especially the tree corresponding to prime 5 takes much longer than the other trees.
With revision 2 now 3.4× times faster with 16 OpenMP threads:

$ time numactl -C 0-15 ./par 10
614983774

real	22m40.991s
user	88m26.700s
sys	0m5.347s
$

@Hermann-SW
Copy link
Author

Hermann-SW commented Jan 4, 2026

Ongoing run on 8-socket 192C/384T Lenovo x3950-X6 server with Intel Xeon 8890v4 CPUs.

Started with nohup ./doit:

hermann@x3950-X6:~$ cat doit 
#!/bin/bash
time numactl -C 0-191 ./par 11
hermann@x3950-X6:~$ 

par process uses 141.8g resident and 151.4g virtual memory.

@Hermann-SW
Copy link
Author

Hermann-SW commented Jan 5, 2026

Started with 19200 %CPU, then quickly dropped.
For many hours did run with 300 %CPU, then more hours with 200 %CPU.
Finally completed after many more hours with 100 %CPU.
14:28h wall-clock time in total (71:40h user time):

hermann@x3950-X6:~$ cat nohup.out 
5573589175

real	867m26.910s
user	4299m20.034s
sys	172m37.703s
hermann@x3950-X6:~$ 
hermann@x3950-X6:~$ stat nohup.out 
  File: nohup.out
  Size: 63        	Blocks: 8          IO Block: 4096   regular file
Device: 8,2	Inode: 144770086   Links: 1
Access: (0600/-rw-------)  Uid: ( 1000/ hermann)   Gid: ( 1000/ hermann)
Access: 2026-01-05 01:00:21.628167853 +0100
Modify: 2026-01-04 23:54:27.976743735 +0100
Change: 2026-01-04 23:54:27.976743735 +0100
 Birth: 2026-01-04 09:27:01.065733044 +0100
hermann@x3950-X6:~$ 

@Hermann-SW
Copy link
Author

Hermann-SW commented Jan 6, 2026

From previous comment, the first parallel version did take only 22min to determine a(10):

$ time numactl -C 0-15 ./par 10
614983774

real	22m40.991s
user	88m26.700s
sys	0m5.347s
$

Previous (sequential) computation of a(10) with PARI/GP took 23h on AMD 7950X CPU:

hermann@7950x:~$ gp -q
? a(n)=sum(k=1, 10^n, isprime(k^2+(k+1)^2));
? #
   timer = 1 (on)
? a(10)

cpu time = 33h, 53min, 56,970 ms, real time = 23h, 3min, 55,987 ms.
614983774
? 

Today I used parsum() instead of sum() on 192C/384T 8-socket server with slower Intel Xeon 8890v4 CPUS.
Reduction of runtime from 23h to 19 minutes ... wow:

hermann@x3950-X6:~$ numactl -C 0-191 gp -q
? a(n)=parsum(k=1, 10^n, isprime(k^2+(k+1)^2));
? #
   timer = 1 (on)
? a(10)
cpu time = 59h, 57min, 42,370 ms, real time = 18min, 46,398 ms.
614983774
? 

For numbers represented as 64bit long isprime() calls ispseudoprime() which is much faster than isprime() and is correct up to 2^64. Computation for 10^10 ks without ispseudoprime speedup takes 1h:

? parsum(k=10^10+1,2*10^10,isprime(k^2+(k+1)^2));
cpu time = 191h, 45min, 10,976 ms, real time = 1h, 3,472 ms.
?

@Hermann-SW
Copy link
Author

Followup work is happening in gist parallel_dfs_bfs_evenly.cc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment