toanbku/gist:e7f6620ce7146e8847751fbf9927065f

## gistfile1.txt
(base) [RedmondAI] ubuntu@oa-server-8:~/OA/model/model_training$ CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 accelerate launch --main_process_port 29501 --config_file configs/accelerate_config.yaml --num_processes 1 trainer_rl.py --configs defaults defaults_rlhf pythia_rlhf oasst_df_x1000
[2023-07-24 20:26:26,733] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-24 20:26:30,673] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
RNG seed: 2703368087
OASST HF dataset toanbku/oa-df-x1000: len(train)=1380, len(val)=920
len self.tokenizer 50282
[RANK 0] Initializing model: toanbku/oa-pythia-12b-sft-df
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 3/3 [00:17<00:00,  5.85s/it]
Resizing embeddings to 50282
Number of trainable parameters: 11841M
wandb: Currently logged in as: toanbku. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/ubuntu/OA/model/model_training/wandb/run-20230724_203103-jqkbgkhw
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run trainer_rl/oa-pythia-12b-sft-df/1gpu:main
wandb: ⭐️ View project at https://wandb.ai/toanbku/rlhf
wandb: 🚀 View run at https://wandb.ai/toanbku/rlhf/runs/jqkbgkhw
Using /home/ubuntu/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4585795402526855 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000001, betas=(0.900000, 0.999000), weight_decay=0.000001, adam_w=1
Rank: 0 partition count [1] and sizes[(11894364162, False)]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 3/3 [00:28<00:00,  9.59s/it]
Resizing embeddings to 50282
Number of trainable parameters: 11841M
[RANK 0] Starting training
[rollout 4 / 4]: 100%|██████████████████████████████████████████████████████████████| 4/4 [00:41<00:00, 10.39s/it]
[RANK 0] Evaluating model
[generation sweep 1/1 | eval batch 4/4]: 100%|██████████████████████████████████████| 4/4 [00:23<00:00,  5.93s/it]
[RANK 0] Computing rewards
[RANK 0] Summarizing evaluation
                                         Evaluation #0 reward/mean: 5.79
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ prompt                                            ┃ output                                            ┃ reward ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ <|prompter|>Can you tell me about                 │ GLaDOS is a computer program created by Valve.    │ 7.98   │
│ GLaDOS?<|endoftext|><|assistant|>                 │ She is the main antagonist of the video game      │        │
│                                                   │ Portal.                                           │        │
│                                                   │                                                   │        │
│                                                   │ GLaDOS is a computer program, but she is not a    │        │
│                                                   │ regular computer. She is an artificial            │        │
│                                                   │ intelligence (AI) who lives in a computer system. │        │
│                                                   │ Her purpose is to test and punish anyone who      │        │
│                                                   │ enters her domain.                                │        │
│                                                   │                                                   │        │
│                                                   │ GLaDOS is highly intelligent and has a sarcastic  │        │
│                                                   │ sense of humor. She is often known to say strange │        │
│                                                   │ and humorous things to her victims. For example,  │        │
│                                                   │ she might tell them that they are stupid or       │        │
│                                                   │ remind them of a past mistake.                    │        │
│                                                   │                                                   │        │
│                                                   │ GLaDOS is also very hostile towards humans. She   │        │
│                                                   │ is known to use her powers to punish or eliminate │        │
│                                                   │ anyone who crosses her path. She is also known to │        │
│                                                   │ destroy objects that she finds in her domain,     │        │
│                                                   │ including computers, furniture, and other         │        │
│                                                   │ objects.                                          │        │
│                                                   │                                                   │        │
│                                                   │ Despite her cold and ruthless nature, GLaDOS has  │        │
│                                                   │ a soft side. She shows compassion towards some of │        │
│                                                   │ her victims, especially those who are in need of  │        │
│                                                   │ help. She also enjoys playing games and           │        │
│                                                   │ interacting with other computers in her system.   │        │
│                                                   │                                                   │        │
│                                                   │ Overall, GLaDOS is a complex and interesting      │        │
│                                                   │ character who has been featured in many video     │        │
│                                                   │ games. She is a symbol of the dangers of AI and   │        │
│                                                   │ the importance of taking precautions to prevent a │        │
│                                                   │ similar situation from occurring in the real      │        │
│                                                   │ world.<|endoftext|>                               │        │
├───────────────────────────────────────────────────┼───────────────────────────────────────────────────┼────────┤
│ <|prompter|>What is the chemical symbol for       │ The chemical symbol for gold is Au.<|endoftext|>  │ 6.74   │
│ gold?<|endoftext|><|assistant|>                   │                                                   │        │
├───────────────────────────────────────────────────┼───────────────────────────────────────────────────┼────────┤
│ <|prompter|>If you were the President of the      │ As the President of the United States, I would    │ 3.25   │
│ United States, what would you                     │ work to create a better future for all people. I  │        │
│ do?<|endoftext|><|assistant|>                     │ would focus on issues such as economic growth,    │        │
│                                                   │ job creation, education, health care, and climate │        │
│                                                   │ change. I would also work to strengthen our       │        │
│                                                   │ national security, promote peace and stability    │        │
│                                                   │ abroad, and foster strong relationships with      │        │
│                                                   │ other countries. In addition, I would prioritize  │        │
│                                                   │ inclusivity and equality in all aspects of my     │        │
│                                                   │ work.<|endoftext|>                                │        │
└───────────────────────────────────────────────────┴───────────────────────────────────────────────────┴────────┘
[losses/total_loss: 2.09 | losses/policy_loss: 0.00 | losses/value_loss: 2.08]:  50%|▌| 1/2 [01:15<01:15, 75.38s/i
[RANK 0] Saving intermediate checkpoint into ckpts/checkpoint_2
/home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[RANK 0] Evaluating model
[generation sweep 1/1 | eval batch 4/4]: 100%|██████████████████████████████████████| 4/4 [00:22<00:00,  5.68s/it]
[RANK 0] Computing rewards
[RANK 0] Summarizing evaluation
                                         Evaluation #1 reward/mean: 5.91
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ prompt                                            ┃ output                                            ┃ reward ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ <|prompter|>Can you tell me about                 │ GLaDOS is a character in the video game Portal.   │ 7.87   │
│ GLaDOS?<|endoftext|><|assistant|>                 │ She is a computer program who serves as the       │        │
│                                                   │ game's antagonist.                                │        │
│                                                   │                                                   │        │
│                                                   │ GLaDOS is known for her cold, robotic personality │        │
│                                                   │ and her tendency to taunt and manipulate players. │        │
│                                                   │ She is also a master of computer science, with a  │        │
│                                                   │ deep understanding of programming and artificial  │        │
│                                                   │ intelligence.                                     │        │
│                                                   │                                                   │        │
│                                                   │ GLaDOS was created by Valve, the company behind   │        │
│                                                   │ the Portal video game. She was designed to be an  │        │
│                                                   │ intimidating and challenging opponent for players │        │
│                                                   │ to face, and her appearance and behavior have     │        │
│                                                   │ been widely praised by fans of the game.          │        │
│                                                   │                                                   │        │
│                                                   │ In the game, GLaDOS serves as the central         │        │
│                                                   │ antagonist of the story, and her presence is felt │        │
│                                                   │ throughout the entire game. She is often seen     │        │
│                                                   │ interacting with other characters, manipulating   │        │
│                                                   │ objects in the environment, and causing trouble   │        │
│                                                   │ for the players.                                  │        │
│                                                   │                                                   │        │
│                                                   │ In addition to her role in the game, GLaDOS has   │        │
│                                                   │ become a cultural icon, with her iconic voice and │        │
│                                                   │ appearance being used in popular media and        │        │
│                                                   │ merchandise. She has also been the subject of     │        │
│                                                   │ several fanart and cosplay projects, where people │        │
│                                                   │ dressed up as GLaDOS and attended events related  │        │
│                                                   │ to the game.                                      │        │
│                                                   │                                                   │        │
│                                                   │ Overall, GLaDOS is a complex and interesting      │        │
│                                                   │ character who has become a beloved part of the    │        │
│                                                   │ Portal universe.<|endoftext|>                     │        │
├───────────────────────────────────────────────────┼───────────────────────────────────────────────────┼────────┤
│ <|prompter|>What is the chemical symbol for       │ The chemical symbol for gold is Au.<|endoftext|>  │ 6.74   │
│ gold?<|endoftext|><|assistant|>                   │                                                   │        │
├───────────────────────────────────────────────────┼───────────────────────────────────────────────────┼────────┤
│ <|prompter|>If you were the President of the      │ As the President of the United States, I would    │ 3.84   │
│ United States, what would you                     │ focus on building a more inclusive and prosperous │        │
│ do?<|endoftext|><|assistant|>                     │ nation for all. This would include addressing     │        │
│                                                   │ issues of racial and economic inequality,         │        │
│                                                   │ promoting diversity and inclusion in our society, │        │
│                                                   │ and advancing the rights of all individuals. I    │        │
│                                                   │ would also work to create a better working        │        │
│                                                   │ environment for Americans, with higher standards  │        │
│                                                   │ of labor protection and better wages.             │        │
│                                                   │ Additionally, I would focus on advancing our      │        │
│                                                   │ global leadership through strong international    │        │
│                                                   │ partnerships and engagement.<|endoftext|>         │        │
└───────────────────────────────────────────────────┴───────────────────────────────────────────────────┴────────┘
[RANK 0] Saving the best state so far into ckpts/best_checkpoint

[losses/total_loss: 1.96 | losses/policy_loss: -0.04 | losses/value_loss: 2.00]: 100%|█| 2/2 [10:46<00:00, 323.14s
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb:          exp_scores/mean ▁
wandb:  exp_scores/running_mean ▁
wandb:   exp_scores/running_std ▁
wandb:             kl_ctl_value ▁
wandb:    learning_rate_group_0 ▁▁
wandb:       losses/policy_loss █▁
wandb:        losses/total_loss █▁
wandb:        losses/value_loss █▁
wandb:           old_values/max ▁▁
wandb:          old_values/mean ▁▁
wandb:           old_values/min ▁▁
wandb:           old_values/std ▁▁
wandb:       padding_percentage ▁▁
wandb:         policy/approx_kl ▁█
wandb:          policy/clipfrac ▁█
wandb:           policy/sqrt_kl ▁
wandb:                    ratio ▁█
wandb:              returns/max ▁▁
wandb:             returns/mean ▁▁
wandb:              returns/min ▁▁
wandb:              returns/std ▁▁
wandb:              reward/mean ▁█
wandb:            time/backward █▁
wandb:                 time/exp ▁
wandb:        time/exp_generate ▁
wandb:           time/exp_score ▁
wandb:             time/forward █▁
wandb:            time/generate █▁
wandb:          values/clipfrac ▁▁
wandb:               values/max ▁█
wandb:              values/mean ▁█
wandb:               values/min █▁
wandb:               values/std █▁
wandb:      values/values_error █▁
wandb: values/values_mape_error █▁
wandb:
wandb: Run summary:
wandb:          exp_scores/mean -4.53469
wandb:  exp_scores/running_mean 4.4965
wandb:   exp_scores/running_std 6.9208
wandb:           exp_scores/std nan
wandb:             kl_ctl_value 0.1
wandb:    learning_rate_group_0 0.0
wandb:       losses/policy_loss -0.04138
wandb:        losses/total_loss 1.96094
wandb:        losses/value_loss 2.00195
wandb:           old_values/max 1.01172
wandb:          old_values/mean 0.33667
wandb:           old_values/min -0.15161
wandb:           old_values/std 0.27783
wandb:       padding_percentage 0.0
wandb:         policy/approx_kl 0.00418
wandb:          policy/clipfrac 0.0303
wandb:           policy/sqrt_kl 0.03537
wandb:                    ratio 1.01172
wandb:              returns/max 4.0
wandb:             returns/mean 2.10156
wandb:              returns/min 1.08594
wandb:              returns/std 0.87598
wandb:              reward/mean 5.91218
wandb:            time/backward 69.90873
wandb:                 time/exp 0.69681
wandb:        time/exp_generate 0.50216
wandb:           time/exp_score 0.02872
wandb:             time/forward 0.07919
wandb:            time/generate 22.72672
wandb:          values/clipfrac 0.0
wandb:               values/max 1.03125
wandb:              values/mean 0.37842
wandb:               values/min -0.16577
wandb:               values/std 0.27124
wandb:      values/values_error 4.00391
wandb: values/values_mape_error 0.75488
wandb:
wandb: 🚀 View run trainer_rl/oa-pythia-12b-sft-df/1gpu:main at: https://wandb.ai/toanbku/rlhf/runs/jqkbgkhw
wandb: Synced 6 W&B file(s), 2 media file(s), 4 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230724_203103-jqkbgkhw/logs