Created
January 14, 2026 08:28
-
-
Save jphme/668c2ce4445ef7a34b16e1b40b19ada4 to your computer and use it in GitHub Desktop.
Log KL Debug Output (Step1)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| 2026-01-14 09:22:23,310 - xyz_agent.training.train - INFO - Reward distribution: retrieval(+24/-40), correctness(+18/-46), citations(+8/-56), format_penalty(1/64), no_final_answer(0/64) | |
| 2026-01-14 09:22:23,311 - xyz_agent.training.train - INFO - Group advantages: min=-0.475, max=1.100, mean=-0.000 | |
| 2026-01-14 09:22:23,311 - xyz_agent.training.train - INFO - Preparing training data... | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - Prepared 64 training examples | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - First datum model_input length: 7599 | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - target_tokens: dtype=int64, shape=[7599] | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - logprobs: dtype=float32, shape=[7599] | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - advantages: dtype=float32, shape=[7599] | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - mask: dtype=float32, shape=[7599] | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - Updating policy (num_substeps=1)... | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - Substep 1/1: 64 examples | |
| 2026-01-14 09:22:23,463 - xyz_agent.training.train - INFO - Pass 1: Computing training logprobs... | |
| 2026-01-14 09:22:40,772 - xyz_agent.training.train - INFO - KL penalty applied: avg_magnitude=0.0043 | |
| 2026-01-14 09:22:40,772 - xyz_agent.training.train - INFO - Pass 2: Forward-backward... | |
| 2026-01-14 09:23:25,742 - xyz_agent.training.train - INFO - Forward-backward completed - metrics: {'clock_cycle:unique': 8790639.0, 'clock_cycle:unique_1': 8790640.0, 'clock_cycle:unique_2': 8790641.0, 'clock_cycle:unique_3': 8790642.0, 'loss:sum': -456.5468291938305} | |
| 2026-01-14 09:23:26,731 - xyz_agent.training.train - INFO - Optimizer step 1/1 completed | |
| 2026-01-14 09:23:26,743 - xyz_agent.training.train - INFO - [KL DEBUG] Sample 0: | |
| 2026-01-14 09:23:26,743 - xyz_agent.training.train - INFO - sampling_logprobs shape: torch.Size([7599]) | |
| 2026-01-14 09:23:26,743 - xyz_agent.training.train - INFO - train_lp shape: torch.Size([7599]) | |
| 2026-01-14 09:23:26,743 - xyz_agent.training.train - INFO - action_mask sum (total action tokens): 352 | |
| 2026-01-14 09:23:26,743 - xyz_agent.training.train - INFO - non_forced_mask sum (trainable tokens): 168 | |
| 2026-01-14 09:23:26,743 - xyz_agent.training.train - INFO - forced tokens in actions: 184 | |
| 2026-01-14 09:23:26,744 - xyz_agent.training.train - INFO - sampling_lp_actions[:5]: [-5.627412796020508, -0.7875568866729736, -1.0823413133621216, -0.25068995356559753, -1.209656834602356] | |
| 2026-01-14 09:23:26,744 - xyz_agent.training.train - INFO - training_lp_actions[:5]: [-6.494776725769043, -0.7751193046569824, -2.4692530632019043, -0.1334303617477417, -0.7405458688735962] | |
| 2026-01-14 09:23:26,744 - xyz_agent.training.train - INFO - diff[:5]: [0.8673639297485352, -0.012437582015991211, 1.3869117498397827, -0.11725959181785583, -0.46911096572875977] | |
| 2026-01-14 09:23:26,746 - xyz_agent.training.train - INFO - diff mean: 0.2327, std: 0.7734 | |
| 2026-01-14 09:23:26,747 - xyz_agent.training.train - INFO - [OUTLIERS] 23 tokens with |diff| > 1.0 | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - Outlier 1: pos=1693, tok_id=8711 | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - s_lp=-1.082, t_lp=-2.469, diff=1.387 | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - Context IDs: [200008, 15143, 1078, 8711, 355, 32011, 23292] | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - Outlier 2: pos=1719, tok_id=8711 | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - s_lp=-3.512, t_lp=-4.601, diff=1.088 | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - Context IDs: [53, 3201, 30, 8711, 355, 382, 5985] | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - Outlier 3: pos=1725, tok_id=14012 | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - s_lp=-3.369, t_lp=-5.928, diff=2.559 | |
| 2026-01-14 09:23:26,749 - xyz_agent.training.train - INFO - Context IDs: [5985, 32011, 30, 14012, 29235, 6716, 316] | |
| 2026-01-14 09:23:26,750 - xyz_agent.training.train - INFO - Outlier 4: pos=1727, tok_id=6716 | |
| 2026-01-14 09:23:26,750 - xyz_agent.training.train - INFO - s_lp=-6.445, t_lp=-5.433, diff=-1.012 | |
| 2026-01-14 09:23:26,750 - xyz_agent.training.train - INFO - Context IDs: [30, 14012, 29235, 6716, 316, 15861, 53] | |
| 2026-01-14 09:23:26,750 - xyz_agent.training.train - INFO - Outlier 5: pos=1729, tok_id=15861 | |
| 2026-01-14 09:23:26,750 - xyz_agent.training.train - INFO - s_lp=-2.718, t_lp=-4.188, diff=1.470 | |
| 2026-01-14 09:23:26,750 - xyz_agent.training.train - INFO - Context IDs: [29235, 6716, 316, 15861, 53, 3201, 30] | |
| 2026-01-14 09:23:26,751 - xyz_agent.training.train - INFO - [KL DEBUG] Sample 1: | |
| 2026-01-14 09:23:26,751 - xyz_agent.training.train - INFO - sampling_logprobs shape: torch.Size([4947]) | |
| 2026-01-14 09:23:26,751 - xyz_agent.training.train - INFO - train_lp shape: torch.Size([4947]) | |
| 2026-01-14 09:23:26,751 - xyz_agent.training.train - INFO - action_mask sum (total action tokens): 272 | |
| 2026-01-14 09:23:26,751 - xyz_agent.training.train - INFO - non_forced_mask sum (trainable tokens): 105 | |
| 2026-01-14 09:23:26,751 - xyz_agent.training.train - INFO - forced tokens in actions: 167 | |
| 2026-01-14 09:23:26,752 - xyz_agent.training.train - INFO - sampling_lp_actions[:5]: [-0.14155146479606628, -2.1010820865631104, -1.0547370910644531, -0.023898281157016754, -0.019786830991506577] | |
| 2026-01-14 09:23:26,752 - xyz_agent.training.train - INFO - training_lp_actions[:5]: [-0.11977655440568924, -2.3091163635253906, -0.7210659980773926, -0.657882571220398, -0.0002791491860989481] | |
| 2026-01-14 09:23:26,752 - xyz_agent.training.train - INFO - diff[:5]: [-0.021774910390377045, 0.20803427696228027, -0.33367109298706055, 0.6339842677116394, -0.01950768195092678] | |
| 2026-01-14 09:23:26,752 - xyz_agent.training.train - INFO - diff mean: 0.2391, std: 0.7435 | |
| 2026-01-14 09:23:26,752 - xyz_agent.training.train - INFO - [OUTLIERS] 10 tokens with |diff| > 1.0 | |
| 2026-01-14 09:23:26,752 - xyz_agent.training.train - INFO - Outlier 1: pos=1714, tok_id=193114 | |
| 2026-01-14 09:23:26,770 - xyz_agent.training.train - INFO - s_lp=-0.641, t_lp=-3.223, diff=2.583 | |
| 2026-01-14 09:23:26,770 - xyz_agent.training.train - INFO - Context IDs: [2624, 13012, 7534, 193114, 355, 21417, 5591] | |
| 2026-01-14 09:23:26,770 - xyz_agent.training.train - INFO - Outlier 2: pos=1722, tok_id=4294 | |
| 2026-01-14 09:23:26,770 - xyz_agent.training.train - INFO - s_lp=-1.717, t_lp=-4.026, diff=2.309 | |
| 2026-01-14 09:23:26,770 - xyz_agent.training.train - INFO - Context IDs: [262, 78542, 81208, 4294, 11557, 7534, 6359] | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - Outlier 3: pos=4779, tok_id=50005 | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - s_lp=-0.836, t_lp=-1.842, diff=1.006 | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - Context IDs: [81208, 26300, 90470, 50005, 5126, 2866, 51514] | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - Outlier 4: pos=4781, tok_id=2866 | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - s_lp=-2.069, t_lp=-5.490, diff=3.420 | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - Context IDs: [90470, 50005, 5126, 2866, 51514, 86788, 13] | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - Outlier 5: pos=4783, tok_id=86788 | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - s_lp=-1.711, t_lp=-0.108, diff=-1.603 | |
| 2026-01-14 09:23:26,771 - xyz_agent.training.train - INFO - Context IDs: [5126, 2866, 51514, 86788, 13, 730, 23604] | |
| 2026-01-14 09:23:26,777 - xyz_agent.training.train - INFO - [KL DEBUG] Sample 2: | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - sampling_logprobs shape: torch.Size([8027]) | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - train_lp shape: torch.Size([8027]) | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - action_mask sum (total action tokens): 506 | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - non_forced_mask sum (trainable tokens): 185 | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - forced tokens in actions: 321 | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - sampling_lp_actions[:5]: [-0.14155146479606628, -0.7260820865631104, -6.013871192932129, -0.5163002610206604, -0.6825945973396301] | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - training_lp_actions[:5]: [-0.11977655440568924, -1.0591163635253906, -5.98875617980957, -0.26570484042167664, -0.1287568211555481] | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - diff[:5]: [-0.021774910390377045, 0.3330342769622803, -0.025115013122558594, -0.25059542059898376, -0.553837776184082] | |
| 2026-01-14 09:23:26,778 - xyz_agent.training.train - INFO - diff mean: 0.1508, std: 0.4796 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - [OUTLIERS] 13 tokens with |diff| > 1.0 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - Outlier 1: pos=1725, tok_id=193114 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - s_lp=-0.564, t_lp=-2.285, diff=1.721 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - Context IDs: [2624, 13012, 7534, 193114, 355, 32080, 98828] | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - Outlier 2: pos=4708, tok_id=3100 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - s_lp=-2.011, t_lp=-3.539, diff=1.528 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - Context IDs: [23958, 220, 18, 3100, 8698, 13, 19792] | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - Outlier 3: pos=4710, tok_id=13 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - s_lp=-1.130, t_lp=-2.272, diff=1.143 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - Context IDs: [18, 3100, 8698, 13, 19792, 16723, 13] | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - Outlier 4: pos=5358, tok_id=35644 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - s_lp=-4.018, t_lp=-6.876, diff=2.858 | |
| 2026-01-14 09:23:26,779 - xyz_agent.training.train - INFO - Context IDs: [200006, 173781, 200005, 35644, 316, 28, 44580] | |
| 2026-01-14 09:23:26,780 - xyz_agent.training.train - INFO - Outlier 5: pos=7723, tok_id=1921 | |
| 2026-01-14 09:23:26,780 - xyz_agent.training.train - INFO - s_lp=-5.386, t_lp=-6.574, diff=1.187 | |
| 2026-01-14 09:23:26,780 - xyz_agent.training.train - INFO - Context IDs: [3684, 11, 581, 1921, 23958, 220, 18] | |
| 2026-01-14 09:23:26,823 - xyz_agent.training.train - WARNING - [KL WARNING] High KL detected: kl_v1=0.1204 (unclipped=0.1937) | |
| 2026-01-14 09:23:26,823 - xyz_agent.training.train - WARNING - Expected: < 0.01-0.1 for on-policy training | |
| 2026-01-14 09:23:26,823 - xyz_agent.training.train - WARNING - Non-forced action tokens: 16356 (forced excluded: 18090) | |
| 2026-01-14 09:23:26,823 - xyz_agent.training.train - WARNING - Clipped tokens: 1739 (10.63%) | |
| 2026-01-14 09:23:26,823 - xyz_agent.training.train - INFO - KL alignment metrics: kl_v1=0.1204 (unclipped=0.1937), entropy=1.1224, forced_ratio=52.52%, clipped=10.63% | |
| 2026-01-14 09:23:26,823 - xyz_agent.training.train - INFO - Policy updated successfully | |
| 2026-01-14 09:23:26,823 - xyz_agent.training.train - INFO - Step 1 metrics: avg_reward=0.497 (p10=0.000, p90=1.500), retrieval=0.375, correctness=0.281, citations=0.125, format_penalty=0.016, soft_errors=0.02, no_final_answer=0.000, avg_turns=3.2, avg_tokens=7222, valid=64, excluded=0 | |
| 2026-01-14 09:23:26,823 - xyz_agent.training.train - INFO - Updating sampling client with new weights... | |
| 2026-01-14 09:23:49,552 - xyz_agent.training.train - INFO - [KL POST DEBUG] Sample 0: | |
| 2026-01-14 09:23:49,552 - xyz_agent.training.train - INFO - sampling_logprobs shape: torch.Size([7599]) | |
| 2026-01-14 09:23:49,552 - xyz_agent.training.train - INFO - eval_lp shape: torch.Size([7599]) | |
| 2026-01-14 09:23:49,552 - xyz_agent.training.train - INFO - action_mask sum (total): 352 | |
| 2026-01-14 09:23:49,552 - xyz_agent.training.train - INFO - non_forced_mask sum (trainable): 168 | |
| 2026-01-14 09:23:49,552 - xyz_agent.training.train - INFO - forced tokens in actions: 184 | |
| 2026-01-14 09:23:49,553 - xyz_agent.training.train - INFO - diff[:5]: [0.8543539047241211, -0.24739938974380493, 1.4684909582138062, -0.1501888930797577, -0.4545387029647827] | |
| 2026-01-14 09:23:49,553 - xyz_agent.training.train - INFO - diff mean: 0.2391, std: 0.7745 | |
| 2026-01-14 09:23:51,919 - xyz_agent.training.train - INFO - [KL POST DEBUG] Sample 1: | |
| 2026-01-14 09:23:51,919 - xyz_agent.training.train - INFO - sampling_logprobs shape: torch.Size([4947]) | |
| 2026-01-14 09:23:51,919 - xyz_agent.training.train - INFO - eval_lp shape: torch.Size([4947]) | |
| 2026-01-14 09:23:51,919 - xyz_agent.training.train - INFO - action_mask sum (total): 272 | |
| 2026-01-14 09:23:51,919 - xyz_agent.training.train - INFO - non_forced_mask sum (trainable): 105 | |
| 2026-01-14 09:23:51,919 - xyz_agent.training.train - INFO - forced tokens in actions: 167 | |
| 2026-01-14 09:23:51,919 - xyz_agent.training.train - INFO - diff[:5]: [-0.03425846993923187, 0.2451791763305664, -0.4994897246360779, 0.5439993739128113, -0.019507920369505882] | |
| 2026-01-14 09:23:51,919 - xyz_agent.training.train - INFO - diff mean: 0.2621, std: 0.7558 | |
| 2026-01-14 09:23:54,295 - xyz_agent.training.train - INFO - [KL POST DEBUG] Sample 2: | |
| 2026-01-14 09:23:54,295 - xyz_agent.training.train - INFO - sampling_logprobs shape: torch.Size([8027]) | |
| 2026-01-14 09:23:54,295 - xyz_agent.training.train - INFO - eval_lp shape: torch.Size([8027]) | |
| 2026-01-14 09:23:54,295 - xyz_agent.training.train - INFO - action_mask sum (total): 506 | |
| 2026-01-14 09:23:54,295 - xyz_agent.training.train - INFO - non_forced_mask sum (trainable): 185 | |
| 2026-01-14 09:23:54,295 - xyz_agent.training.train - INFO - forced tokens in actions: 321 | |
| 2026-01-14 09:23:54,295 - xyz_agent.training.train - INFO - diff[:5]: [-0.02168755978345871, 0.2484782338142395, 0.29035139083862305, -0.2962920665740967, -0.3598055839538574] | |
| 2026-01-14 09:23:54,295 - xyz_agent.training.train - INFO - diff mean: 0.1476, std: 0.4855 | |
| 2026-01-14 09:23:56,950 - xyz_agent.training.train - INFO - KL post-update (n=4): kl_v1=0.1268 (unclipped=0.1949), entropy=0.9563, forced=59.09%, clipped=10.42% |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment