Skip to content

Instantly share code, notes, and snippets.

@TC-Alex
Created October 12, 2020 19:32
Show Gist options
  • Select an option

  • Save TC-Alex/3bf351c445594589ad9647c7ec91ba67 to your computer and use it in GitHub Desktop.

Select an option

Save TC-Alex/3bf351c445594589ad9647c7ec91ba67 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{"cells":[{"metadata":{"trusted":true},"cell_type":"code","source":"import pandas as pd\nimport numpy as np\nimport csv\nimport os\nimport mysql.connector\nimport datetime as dt","execution_count":1,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## DA-868 - YTSR Full Catalog Audit\n\nThe purpose of this notebook is to pull in all YTSR monetizations that are currently in YouTube's ecosystem and all YTSR monetizations that are marked active and delivered in tc-www and find the delta.\n\nThere is reason to believe that many YTSR assets are missing from YouTube due to issues with the delivery system.\n\nThe laziest way to check if an asset is missing is to find all song_ids that SHOULD be in YouTube's ecosystem that do NOT have a corresponding custom_id record in asset file.\n\nEssentially, custom_id for song_id 123456 will show up as SID_123456 in the current system. This is good. Easy 1:1 way to see if it is there. It's possible that a lot of old references do not have this as I suspect\nthis is something that was added in YouTube's internal system after we started making this service available to clients.\n\nNaturally, this notebook will do the following:\n\n1. Load in necessary columns in YT asset file (some for debugging only)\n2. Pull in all song ids from tc-www that should be there (with some extra details so our export is more meaningful to stakeholders)\n3. Find all song_ids that do not have a corresponding custom_id in asset file\n4. Export clean .csv with song data for all of those that are missing"},{"metadata":{"trusted":true},"cell_type":"code","source":"# STEP 1\n\n# This is ~ 1.2GB of data, so it will take some time. Hopefully the usecols will cut down on a lot of wasted memory overhead.\nyt_assets = pd.read_csv('asset_full_report_tunecore_L_v1-2.csv',usecols=['asset_id','asset_type','status','custom_id','isrc','upc','asset_title','album'])\nyt_assets = yt_assets[(yt_assets['asset_type']=='SOUND_RECORDING')&(yt_assets['status']=='ACTIVE')].reset_index(drop=True) # only pull in active YTSR entries. excludes stale and/or Art Track references\n\n# summary stats\nprint(\"There are currently {n} active YTSR assets in YouTube's system (as of when the asset file was exported).\".format(n=len(yt_assets)))\nprint(yt_assets.head())\n\n\n# STEP 2\n\ncreds = os.environ['MYSQL_REPORTING_CREDS'].split('@#$')\ncnx = mysql.connector.connect(host=creds[0],user=creds[1],password=creds[2],database=creds[3])\ncursor = cnx.cursor()\n\nquery = \"\"\"\nselect \n tm.song_id,\n s.tunecore_isrc,\n s.optional_isrc,\n coalesce(if(length(optional_isrc)<12,NULL,optional_isrc),tunecore_isrc) deliverable_isrc,\n a.id album_id,\n a.takedown_at,\n group_concat(distinct u.number) all_track_upcs\nfrom track_monetizations tm\njoin songs s on s.id = tm.song_id\njoin albums a on a.id = s.album_id\njoin upcs u on u.upcable_id = s.album_id and u.upcable_type = 'Album' and u.inactive=0\nwhere tm.state = 'delivered' and tm.takedown_at is null and tm.store_id = 48\ngroup by 1\n\"\"\"\n\nprint('Pulling track metadata from TC database...')\n\ncursor.execute(query)\nsongs = pd.DataFrame(cursor.fetchall(),columns=cursor.column_names)\n\ncursor.close()\ncnx.close()\n\nprint('There are currently {n} song records that should be in YT according to tc-www.'.format(n=len(songs)))","execution_count":9,"outputs":[{"output_type":"stream","text":"There are currently 1830430 active YTSR assets in YouTube's system (as of when the asset file was exported).\n asset_id asset_type status custom_id isrc \\\n0 A977865443302302 SOUND_RECORDING ACTIVE 117182 TCABK1205936 \n1 A261467628552459 SOUND_RECORDING ACTIVE SID_8580725 TCADK1780725 \n2 A850199892412334 SOUND_RECORDING ACTIVE SID_9659176 TCADW1859176 \n3 A444079630320172 SOUND_RECORDING ACTIVE SID_1575364 USTCF1075364 \n4 A148576287358588 SOUND_RECORDING ACTIVE SID_7553321 TCADA1753321 \n\n upc asset_title album \n0 8.597091e+11 Crooked Officer (feat. Mr Klean) We Made the Bricks \n1 NaN Coffee Pot NaN \n2 NaN 99 Rico NaN \n3 NaN Sabre Dance NaN \n4 NaN Hercules NaN \nPulling track metadata from TC database...\nThere are currently 2015311 song records that should be in YT according to tc-www.\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"# STEP 3\ndef scrub(custom_id):\n # Takes custom_id and returns an integer song_id if 'SID_' is present. Otherwise, returns NULL.\n custom_id = str(custom_id)\n if 'SID_' in custom_id:\n try:\n result = int(custom_id.replace('SID_',''))\n return result\n except:\n return None\n else:\n return None\n\nyt_assets['song_id'] = yt_assets['custom_id'].map(lambda x: scrub(x))\nmissing_song_ids = list(set(songs['song_id']) - set(songs['song_id']).intersection(set(yt_assets['song_id'])))\nyt_assets['song_id']","execution_count":10,"outputs":[{"output_type":"execute_result","execution_count":10,"data":{"text/plain":"0 NaN\n1 8580725.0\n2 9659176.0\n3 1575364.0\n4 7553321.0\n ... \n1830425 10054261.0\n1830426 8849837.0\n1830427 9276054.0\n1830428 3576249.0\n1830429 7402619.0\nName: song_id, Length: 1830430, dtype: float64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#len(missing_song_ids) -> 299761 that's a lot\nsongs[songs['song_id'].isin(missing_song_ids)].to_csv('Potentially Missing Songs - DA868 v2.csv',index=False)","execution_count":11,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Important Finding!\n\nBased on the huge number of song ids missing from YouTube's system, it is clear that something is wrong with this approach. After talking to Will Ezor and Wes Sprague, I learned that YouTube has an idiosyncratic way of handling duplicate asset content. For the sake of YTSR, YouTube creates \"asset shares\" for every asset reference we send them. If we send them a duplicate audio file (i.e. one they already have in their system, and they deem it to be identical to another one of our assets already in their system), they will make the new asset reference \"active\" and the existing one will become an inactive share. It appears that inactive shares are not able to monetize at all - relinquishing all monetization privileges to the new asset reference. \n\n#### What does this mean in terms of data?\n\nYouTube's CMS has an asset share report available for export underneath the main asset report that we have been using as reference for a long time. Every asset share refers to its parent asset (the \"active\" asset that currently invalidates it) in the **parent_asset_id** column. We can leverage this to find out if the assets missing from the above exercise are present in the asset share report. Their existence here would prove that they ARE in YouTube's system as inactive assets. **Redelivering this content would not increase the total number of live assets our YTSR catalog.**"},{"metadata":{"trusted":true},"cell_type":"code","source":"asset_shares = pd.read_csv('asset_share_report_tunecore_D_v1-0.csv',usecols=['parent_asset_id','custom_id'])\nasset_shares.head()","execution_count":12,"outputs":[{"output_type":"execute_result","execution_count":12,"data":{"text/plain":" parent_asset_id custom_id\n0 A866250641231873 NaN\n1 A316671909428712 NaN\n2 A651737084177139 SID_7432763\n3 A526927396241505 NaN\n4 A501329789060693 NaN","text/html":"<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>parent_asset_id</th>\n <th>custom_id</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>A866250641231873</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>1</th>\n <td>A316671909428712</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>2</th>\n <td>A651737084177139</td>\n <td>SID_7432763</td>\n </tr>\n <tr>\n <th>3</th>\n <td>A526927396241505</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>4</th>\n <td>A501329789060693</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n</div>"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#len(asset_shares[asset_shares['custom_id'].isin(['SID_{0}'.format(str(x)) for x in missing_song_ids])]) -> 123040 (some duplicates)\nknown_shares = asset_shares[asset_shares['custom_id'].isin(['SID_{0}'.format(str(x)) for x in missing_song_ids])].reset_index(drop=True)\n\nnot_actually_missing_song_ids = list(set(known_shares['custom_id'].map(lambda x: int(x.replace('SID_','')))))\npotential_tracks = songs[(songs['song_id'].isin(missing_song_ids))&(~songs['song_id'].isin(not_actually_missing_song_ids))]#.to_csv('Still Potentially Missing Songs - DA868 v1.csv',index=False)","execution_count":30,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#asset_shares[asset_shares['parent_asset_id']=='AAVZGJYJMAJYPR2']\n#yt_assets[yt_assets['asset_id']=='AAVZGJYJMAJYPR2'] \n# ok.... some of these valid asset_ids just do NOT show up in their exports! wth!\npotential_tracks[~potential_tracks['deliverable_isrc'].isin(yt_assets['isrc'])].to_csv('Really Just Grasping at Straws Now - DA868 v1.csv',index=False)","execution_count":33,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"potential_tracks[(~potential_tracks['deliverable_isrc'].isin(yt_assets['isrc']))&(np.isnan(potential_tracks['takedown_at']))]","execution_count":34,"outputs":[{"output_type":"execute_result","execution_count":34,"data":{"text/plain":" song_id tunecore_isrc optional_isrc deliverable_isrc album_id \\\n96 2646 USTC10602646 None USTC10602646 472 \n98 2648 USTC10602648 None USTC10602648 472 \n99 2650 USTC10602650 None USTC10602650 472 \n113 2891 USTC10602891 None USTC10602891 510 \n118 2898 USTC10602898 None USTC10602898 510 \n... ... ... ... ... ... \n2015306 12890720 TCAFD2090720 TCAFD2090720 TCAFD2090720 3930669 \n2015307 12891133 TCAFD2091133 QMZEF2001010 QMZEF2001010 3930860 \n2015308 12891359 TCAFD2091359 TCAFD2091359 TCAFD2091359 3930924 \n2015309 12891887 TCAFD2091887 TCAFD2091887 TCAFD2091887 3931155 \n2015310 12892851 TCAFD2092851 TCAFD2092851 TCAFD2092851 3931554 \n\n takedown_at all_track_upcs \n96 NaT 859700004722 \n98 NaT 859700004722 \n99 NaT 859700004722 \n113 NaT 859700005101 \n118 NaT 859700005101 \n... ... ... \n2015306 NaT 859742717901 \n2015307 NaT 859742720017 \n2015308 NaT 859742720727 \n2015309 NaT 859742723216 \n2015310 NaT 859742727511 \n\n[24466 rows x 7 columns]","text/html":"<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>song_id</th>\n <th>tunecore_isrc</th>\n <th>optional_isrc</th>\n <th>deliverable_isrc</th>\n <th>album_id</th>\n <th>takedown_at</th>\n <th>all_track_upcs</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>96</th>\n <td>2646</td>\n <td>USTC10602646</td>\n <td>None</td>\n <td>USTC10602646</td>\n <td>472</td>\n <td>NaT</td>\n <td>859700004722</td>\n </tr>\n <tr>\n <th>98</th>\n <td>2648</td>\n <td>USTC10602648</td>\n <td>None</td>\n <td>USTC10602648</td>\n <td>472</td>\n <td>NaT</td>\n <td>859700004722</td>\n </tr>\n <tr>\n <th>99</th>\n <td>2650</td>\n <td>USTC10602650</td>\n <td>None</td>\n <td>USTC10602650</td>\n <td>472</td>\n <td>NaT</td>\n <td>859700004722</td>\n </tr>\n <tr>\n <th>113</th>\n <td>2891</td>\n <td>USTC10602891</td>\n <td>None</td>\n <td>USTC10602891</td>\n <td>510</td>\n <td>NaT</td>\n <td>859700005101</td>\n </tr>\n <tr>\n <th>118</th>\n <td>2898</td>\n <td>USTC10602898</td>\n <td>None</td>\n <td>USTC10602898</td>\n <td>510</td>\n <td>NaT</td>\n <td>859700005101</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>2015306</th>\n <td>12890720</td>\n <td>TCAFD2090720</td>\n <td>TCAFD2090720</td>\n <td>TCAFD2090720</td>\n <td>3930669</td>\n <td>NaT</td>\n <td>859742717901</td>\n </tr>\n <tr>\n <th>2015307</th>\n <td>12891133</td>\n <td>TCAFD2091133</td>\n <td>QMZEF2001010</td>\n <td>QMZEF2001010</td>\n <td>3930860</td>\n <td>NaT</td>\n <td>859742720017</td>\n </tr>\n <tr>\n <th>2015308</th>\n <td>12891359</td>\n <td>TCAFD2091359</td>\n <td>TCAFD2091359</td>\n <td>TCAFD2091359</td>\n <td>3930924</td>\n <td>NaT</td>\n <td>859742720727</td>\n </tr>\n <tr>\n <th>2015309</th>\n <td>12891887</td>\n <td>TCAFD2091887</td>\n <td>TCAFD2091887</td>\n <td>TCAFD2091887</td>\n <td>3931155</td>\n <td>NaT</td>\n <td>859742723216</td>\n </tr>\n <tr>\n <th>2015310</th>\n <td>12892851</td>\n <td>TCAFD2092851</td>\n <td>TCAFD2092851</td>\n <td>TCAFD2092851</td>\n <td>3931554</td>\n <td>NaT</td>\n <td>859742727511</td>\n </tr>\n </tbody>\n</table>\n<p>24466 rows × 7 columns</p>\n</div>"},"metadata":{}}]},{"metadata":{},"cell_type":"markdown","source":"Does YouTube remove assets of ALL types if we issue a takedown from one service? Mainly, if an art track asset is revoked... does it disappear from YT entirely?"}],"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"toc":{"nav_menu":{},"number_sections":false,"sideBar":false,"skip_h1_title":false,"base_numbering":1,"title_cell":"Table of Contents","title_sidebar":"Contents","toc_cell":false,"toc_position":{},"toc_section_display":false,"toc_window_display":false},"language_info":{"name":"python","version":"3.6.5","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":4}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment