Aidan Do (aidando73)

  • fireworks.ai
  • San Francisco Bay Area
aidando73 / swizzle_mma_repro.cu
Created February 18, 2026 03:27
Minimal repro: fp8 SS MMA with NoSwizzle vs SW32 vs SW64 vs SW128 on SM100 (B200). All four swizzle modes produce identical MMA throughput.
// Minimal repro: fp8 SS MMA with NoSwizzle vs SW32 vs SW64 vs SW128.
// All four produce identical MMA throughput (129 cycles/iter).
// Only depends on CUTLASS (no FlashMLA, no mla_decode_v21).
//
// Build & run (requires sm_100a GPU + CUTLASS source):
// nvcc -std=c++20 -O2 --generate-code=arch=compute_100a,code=[sm_100a] \
// -I third-party/cutlass/include \
// -o scripts/swizzle_mma_repro scripts/swizzle_mma_repro.cu \
// && ./scripts/swizzle_mma_repro
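For context on what the swizzle modes do: SW128 corresponds (if I have the parameters right) to CUTLASS's `cute::Swizzle<3,4,3>` layout, where each 128-byte row's eight 16-byte chunks are permuted by XOR with the row index. A minimal Python model of that XOR pattern (my illustrative sketch, not the CUTLASS implementation):

```python
def sw128(byte_offset):
    # Illustrative model of a 128-byte swizzle (cute::Swizzle<3,4,3>-style):
    # XOR the 16-byte chunk index (bits 4-6) with the row index (bits 7-9).
    return byte_offset ^ ((byte_offset >> 3) & 0x70)

# Each "column" of 16B chunks across 8 rows lands in 8 distinct chunk slots,
# which is what avoids shared-memory bank conflicts on column accesses.
for chunk in range(8):
    slots = {(sw128(row * 128 + chunk * 16) >> 4) & 7 for row in range(8)}
    assert len(slots) == 8
```

Since all four modes feed the same MMA, identical throughput across them suggests the repro's bottleneck is the MMA issue rate rather than shared-memory access patterns.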
Metric                  v6         v4          v6/v4 change
-----------------------------------------------------------
Duration                8,550 us   26,510 us   3.10x faster
Grid Size               144        48          3x
Block Size              512        512         same
Registers Per Thread    32         32          same
Local Memory Spilling   0          0           same
Theoretical Occupancy   100%       100%        same
Achieved Occupancy      25.01%     25.00%      same
Executed IPC Active     1.30       1.25        +4%
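A quick sanity check on these numbers (my arithmetic, using only the figures from the table): the 3.10x speedup is almost exactly the 3x grid-size ratio times the ~4% IPC gain, i.e. near-linear scaling in the number of blocks.

```python
# Profiler numbers from the table above.
duration_v4, duration_v6 = 26_510, 8_550   # microseconds
grid_v4, grid_v6 = 48, 144                 # blocks
ipc_v4, ipc_v6 = 1.25, 1.30

speedup = duration_v4 / duration_v6        # ~3.10x
grid_ratio = grid_v6 / grid_v4             # 3.0x

# Speedup ~= grid ratio x IPC improvement.
assert grid_ratio == 3.0
assert abs(speedup - grid_ratio * (ipc_v6 / ipc_v4)) < 0.05
```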
aidando73 / README.md
Last active December 26, 2024 02:17
SWE-Bench-Lite CodeAct 2.1 - Llama 405B/3.3 70B

Llama 3.3 70B Instruct

  • Score: 0.047
  • model: "openrouter/meta-llama/llama-3.3-70b-instruct"
aidando73 / !lambdalabs-setup.md
Last active February 22, 2025 01:03
Setup for lambdalabs.com boxes

After booting a new instance:

curl -H 'Cache-Control: no-cache' https://gist.githubusercontent.com/aidando73/2876aabaae7ded0dec68d04d34b70086/raw/new_instance.bash | bash

If you want to use conda:

source ~/miniconda3/bin/activate
aidando73 / README.md
Last active November 21, 2024 01:07
Updating train_gpt2.cu for inference (llm.c)
aidando73 / README.md
Last active November 20, 2024 02:41
llm.c vislog.py
aidando73 / gist:fc7a6069fef9b718576ff9adbc8fe6b7
Last active November 21, 2024 22:05
Training GPT-2 124M


I tried the GPT-2 124M parameter model on 3 different GPU setups, following karpathy/llm.c#481. Here's how long each run took and how much it cost:

                1x A10G   1x A100 40GB   8x A100 40GB
training time   48h       15h            2h
cost            $163      $27            $45
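The tradeoff in the table is worth spelling out (my arithmetic, using only the numbers above): the 8-GPU run finishes 7.5x sooner than a single A100 for roughly 1.67x the money.

```python
# Hours and total cost per setup, taken from the table above.
setups = {
    "1x A10G":      (48, 163),
    "1x A100 40GB": (15, 27),
    "8x A100 40GB": (2, 45),
}

hours_1x, cost_1x = setups["1x A100 40GB"]
hours_8x, cost_8x = setups["8x A100 40GB"]

# 8x A100 finishes 7.5x sooner than 1x A100 for ~1.67x the money.
assert hours_1x / hours_8x == 7.5
assert round(cost_8x / cost_1x, 2) == 1.67
```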

Most of the completions are fairly nonsensical, but here are some interesting ones:

aidando73 / strings.txt
Last active September 26, 2023 05:18
Naughty Strings
Trailing
~!@#$%^&*()_+{}|:"<>?[]\',./`
👾 🙇 💁 🙅 🙆 🙋 🙎 🙍
Ṱ̺̺̕o͞ ̷i̲̬͇̪͙n̝̗͕v̟̜̘̦͟o̶̙
<sc<script>ript>alert(13)</sc</script>ript>
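The nested-tag string on the last line is the classic counterexample to "sanitizing" HTML by deleting `<script>` tags: a single removal pass splices the surrounding fragments back into a live tag. A quick Python demonstration (my sketch; real sanitizers should escape output or use an HTML parser, not regex deletion):

```python
import re

s = '<sc<script>ript>alert(13)</sc</script>ript>'

# Naive "sanitizer": strip literal <script> / </script> tags in one pass.
stripped = re.sub(r'</?script>', '', s)

# The removal reassembles a working script tag.
assert stripped == '<script>alert(13)</script>'
```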