TC-Alex/DA-868.ipynb

## DA-868.ipynb
{"cells":[{"metadata":{"trusted":true},"cell_type":"code","source":"import pandas as pd\nimport numpy as np\nimport csv\nimport os\nimport mysql.connector\nimport datetime as dt","execution_count":1,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## DA-868 - YTSR Full Catalog Audit\n\nThe purpose of this notebook is to pull in all YTSR monetizations that are currently in YouTube's ecosystem and all YTSR monetizations that are marked active and delivered in tc-www and find the delta.\n\nThere is reason to believe that many YTSR assets are missing from YouTube due to issues with the delivery system.\n\nThe laziest way to check if an asset is missing is to find all song_ids that SHOULD be in YouTube's ecosystem that do NOT have a corresponding custom_id record in asset file.\n\nEssentially, custom_id for song_id 123456 will show up as SID_123456 in the current system.  This is good.  Easy 1:1 way to see if it is there.  It's possible that a lot of old references do not have this as I suspect\nthis is something that was added in YouTube's internal system after we started making this service available to clients.\n\nNaturally, this notebook will do the following:\n\n1. Load in necessary columns in YT asset file (some for debugging only)\n2. Pull in all song ids from tc-www that should be there (with some extra details so our export is more meaningful to stakeholders)\n3. Find all song_ids that do not have a corresponding custom_id in asset file\n4. Export clean .csv with song data for all of those that are missing"},{"metadata":{"trusted":true},"cell_type":"code","source":"# STEP 1\n\n# This is ~ 1.2GB of data, so it will take some time.  Hopefully the usecols will cut down on a lot of wasted memory overhead.\nyt_assets = pd.read_csv('asset_full_report_tunecore_L_v1-2.csv',usecols=['asset_id','asset_type','status','custom_id','isrc','upc','asset_title','album'])\nyt_assets = yt_assets[(yt_assets['asset_type']=='SOUND_RECORDING')&(yt_assets['status']=='ACTIVE')].reset_index(drop=True) # only pull in active YTSR entries.  excludes stale and/or Art Track references\n\n# summary stats\nprint(\"There are currently {n} active YTSR assets in YouTube's system (as of when the asset file was exported).\".format(n=len(yt_assets)))\nprint(yt_assets.head())\n\n\n# STEP 2\n\ncreds = os.environ['MYSQL_REPORTING_CREDS'].split('@#$')\ncnx = mysql.connector.connect(host=creds[0],user=creds[1],password=creds[2],database=creds[3])\ncursor = cnx.cursor()\n\nquery = \"\"\"\nselect \n  tm.song_id,\n  s.tunecore_isrc,\n  s.optional_isrc,\n  coalesce(if(length(optional_isrc)<12,NULL,optional_isrc),tunecore_isrc) deliverable_isrc,\n  a.id album_id,\n  a.takedown_at,\n  group_concat(distinct u.number) all_track_upcs\nfrom track_monetizations tm\njoin songs s on s.id = tm.song_id\njoin albums a on a.id = s.album_id\njoin upcs u on u.upcable_id = s.album_id and u.upcable_type = 'Album' and u.inactive=0\nwhere tm.state = 'delivered' and tm.takedown_at is null and tm.store_id = 48\ngroup by 1\n\"\"\"\n\nprint('Pulling track metadata from TC database...')\n\ncursor.execute(query)\nsongs = pd.DataFrame(cursor.fetchall(),columns=cursor.column_names)\n\ncursor.close()\ncnx.close()\n\nprint('There are currently {n} song records that should be in YT according to tc-www.'.format(n=len(songs)))","execution_count":9,"outputs":[{"output_type":"stream","text":"There are currently 1830430 active YTSR assets in YouTube's system (as of when the asset file was exported).\n           asset_id       asset_type  status    custom_id          isrc  \\\n0  A977865443302302  SOUND_RECORDING  ACTIVE       117182  TCABK1205936   \n1  A261467628552459  SOUND_RECORDING  ACTIVE  SID_8580725  TCADK1780725   \n2  A850199892412334  SOUND_RECORDING  ACTIVE  SID_9659176  TCADW1859176   \n3  A444079630320172  SOUND_RECORDING  ACTIVE  SID_1575364  USTCF1075364   \n4  A148576287358588  SOUND_RECORDING  ACTIVE  SID_7553321  TCADA1753321   \n\n            upc                       asset_title               album  \n0  8.597091e+11  Crooked Officer (feat. Mr Klean)  We Made the Bricks  \n1           NaN                        Coffee Pot                 NaN  \n2           NaN                           99 Rico                 NaN  \n3           NaN                       Sabre Dance                 NaN  \n4           NaN                          Hercules                 NaN  \nPulling track metadata from TC database...\nThere are currently 2015311 song records that should be in YT according to tc-www.\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"# STEP 3\ndef scrub(custom_id):\n    # Takes custom_id and returns an integer song_id if 'SID_' is present.  Otherwise, returns NULL.\n    custom_id = str(custom_id)\n    if 'SID_' in custom_id:\n        try:\n            result = int(custom_id.replace('SID_',''))\n            return result\n        except:\n            return None\n    else:\n        return None\n\nyt_assets['song_id'] = yt_assets['custom_id'].map(lambda x: scrub(x))\nmissing_song_ids = list(set(songs['song_id']) - set(songs['song_id']).intersection(set(yt_assets['song_id'])))\nyt_assets['song_id']","execution_count":10,"outputs":[{"output_type":"execute_result","execution_count":10,"data":{"text/plain":"0                 NaN\n1           8580725.0\n2           9659176.0\n3           1575364.0\n4           7553321.0\n              ...    \n1830425    10054261.0\n1830426     8849837.0\n1830427     9276054.0\n1830428     3576249.0\n1830429     7402619.0\nName: song_id, Length: 1830430, dtype: float64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#len(missing_song_ids) -> 299761 that's a lot\nsongs[songs['song_id'].isin(missing_song_ids)].to_csv('Potentially Missing Songs - DA868 v2.csv',index=False)","execution_count":11,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Important Finding!\n\nBased on the huge number of song ids missing from YouTube's system, it is clear that something is wrong with this approach.  After talking to Will Ezor and Wes Sprague, I learned that YouTube has an idiosyncratic way of handling duplicate asset content.  For the sake of YTSR, YouTube creates \"asset shares\" for every asset reference we send them.  If we send them a duplicate audio file (i.e. one they already have in their system, and they deem it to be identical to another one of our assets already in their system), they will make the new asset reference \"active\" and the existing one will become an inactive share.  It appears that inactive shares are not able to monetize at all - relinquishing all monetization privileges to the new asset reference. \n\n#### What does this mean in terms of data?\n\nYouTube's CMS has an asset share report available for export underneath the main asset report that we have been using as reference for a long time.  Every asset share refers to its parent asset (the \"active\" asset that currently invalidates it) in the **parent_asset_id** column.  We can leverage this to find out if the assets missing from the above exercise are present in the asset share report.  Their existence here would prove that they ARE in YouTube's system as inactive assets.  **Redelivering this content would not increase the total number of live assets our YTSR catalog.**"},{"metadata":{"trusted":true},"cell_type":"code","source":"asset_shares = pd.read_csv('asset_share_report_tunecore_D_v1-0.csv',usecols=['parent_asset_id','custom_id'])\nasset_shares.head()","execution_count":12,"outputs":[{"output_type":"execute_result","execution_count":12,"data":{"text/plain":"    parent_asset_id    custom_id\n0  A866250641231873          NaN\n1  A316671909428712          NaN\n2  A651737084177139  SID_7432763\n3  A526927396241505          NaN\n4  A501329789060693          NaN","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>parent_asset_id</th>\n      <th>custom_id</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>A866250641231873</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>A316671909428712</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>A651737084177139</td>\n      <td>SID_7432763</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>A526927396241505</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>A501329789060693</td>\n      <td>NaN</td>\n    </tr>\n  </tbody>\n</table>\n</div>"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#len(asset_shares[asset_shares['custom_id'].isin(['SID_{0}'.format(str(x)) for x in missing_song_ids])]) -> 123040 (some duplicates)\nknown_shares = asset_shares[asset_shares['custom_id'].isin(['SID_{0}'.format(str(x)) for x in missing_song_ids])].reset_index(drop=True)\n\nnot_actually_missing_song_ids = list(set(known_shares['custom_id'].map(lambda x: int(x.replace('SID_','')))))\npotential_tracks = songs[(songs['song_id'].isin(missing_song_ids))&(~songs['song_id'].isin(not_actually_missing_song_ids))]#.to_csv('Still Potentially Missing Songs - DA868 v1.csv',index=False)","execution_count":30,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#asset_shares[asset_shares['parent_asset_id']=='AAVZGJYJMAJYPR2']\n#yt_assets[yt_assets['asset_id']=='AAVZGJYJMAJYPR2'] \n# ok.... some of these valid asset_ids just do NOT show up in their exports! wth!\npotential_tracks[~potential_tracks['deliverable_isrc'].isin(yt_assets['isrc'])].to_csv('Really Just Grasping at Straws Now - DA868 v1.csv',index=False)","execution_count":33,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"potential_tracks[(~potential_tracks['deliverable_isrc'].isin(yt_assets['isrc']))&(np.isnan(potential_tracks['takedown_at']))]","execution_count":34,"outputs":[{"output_type":"execute_result","execution_count":34,"data":{"text/plain":"          song_id tunecore_isrc optional_isrc deliverable_isrc  album_id  \\\n96           2646  USTC10602646          None     USTC10602646       472   \n98           2648  USTC10602648          None     USTC10602648       472   \n99           2650  USTC10602650          None     USTC10602650       472   \n113          2891  USTC10602891          None     USTC10602891       510   \n118          2898  USTC10602898          None     USTC10602898       510   \n...           ...           ...           ...              ...       ...   \n2015306  12890720  TCAFD2090720  TCAFD2090720     TCAFD2090720   3930669   \n2015307  12891133  TCAFD2091133  QMZEF2001010     QMZEF2001010   3930860   \n2015308  12891359  TCAFD2091359  TCAFD2091359     TCAFD2091359   3930924   \n2015309  12891887  TCAFD2091887  TCAFD2091887     TCAFD2091887   3931155   \n2015310  12892851  TCAFD2092851  TCAFD2092851     TCAFD2092851   3931554   \n\n        takedown_at all_track_upcs  \n96              NaT   859700004722  \n98              NaT   859700004722  \n99              NaT   859700004722  \n113             NaT   859700005101  \n118             NaT   859700005101  \n...             ...            ...  \n2015306         NaT   859742717901  \n2015307         NaT   859742720017  \n2015308         NaT   859742720727  \n2015309         NaT   859742723216  \n2015310         NaT   859742727511  \n\n[24466 rows x 7 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>song_id</th>\n      <th>tunecore_isrc</th>\n      <th>optional_isrc</th>\n      <th>deliverable_isrc</th>\n      <th>album_id</th>\n      <th>takedown_at</th>\n      <th>all_track_upcs</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>96</th>\n      <td>2646</td>\n      <td>USTC10602646</td>\n      <td>None</td>\n      <td>USTC10602646</td>\n      <td>472</td>\n      <td>NaT</td>\n      <td>859700004722</td>\n    </tr>\n    <tr>\n      <th>98</th>\n      <td>2648</td>\n      <td>USTC10602648</td>\n      <td>None</td>\n      <td>USTC10602648</td>\n      <td>472</td>\n      <td>NaT</td>\n      <td>859700004722</td>\n    </tr>\n    <tr>\n      <th>99</th>\n      <td>2650</td>\n      <td>USTC10602650</td>\n      <td>None</td>\n      <td>USTC10602650</td>\n      <td>472</td>\n      <td>NaT</td>\n      <td>859700004722</td>\n    </tr>\n    <tr>\n      <th>113</th>\n      <td>2891</td>\n      <td>USTC10602891</td>\n      <td>None</td>\n      <td>USTC10602891</td>\n      <td>510</td>\n      <td>NaT</td>\n      <td>859700005101</td>\n    </tr>\n    <tr>\n      <th>118</th>\n      <td>2898</td>\n      <td>USTC10602898</td>\n      <td>None</td>\n      <td>USTC10602898</td>\n      <td>510</td>\n      <td>NaT</td>\n      <td>859700005101</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>2015306</th>\n      <td>12890720</td>\n      <td>TCAFD2090720</td>\n      <td>TCAFD2090720</td>\n      <td>TCAFD2090720</td>\n      <td>3930669</td>\n      <td>NaT</td>\n      <td>859742717901</td>\n    </tr>\n    <tr>\n      <th>2015307</th>\n      <td>12891133</td>\n      <td>TCAFD2091133</td>\n      <td>QMZEF2001010</td>\n      <td>QMZEF2001010</td>\n      <td>3930860</td>\n      <td>NaT</td>\n      <td>859742720017</td>\n    </tr>\n    <tr>\n      <th>2015308</th>\n      <td>12891359</td>\n      <td>TCAFD2091359</td>\n      <td>TCAFD2091359</td>\n      <td>TCAFD2091359</td>\n      <td>3930924</td>\n      <td>NaT</td>\n      <td>859742720727</td>\n    </tr>\n    <tr>\n      <th>2015309</th>\n      <td>12891887</td>\n      <td>TCAFD2091887</td>\n      <td>TCAFD2091887</td>\n      <td>TCAFD2091887</td>\n      <td>3931155</td>\n      <td>NaT</td>\n      <td>859742723216</td>\n    </tr>\n    <tr>\n      <th>2015310</th>\n      <td>12892851</td>\n      <td>TCAFD2092851</td>\n      <td>TCAFD2092851</td>\n      <td>TCAFD2092851</td>\n      <td>3931554</td>\n      <td>NaT</td>\n      <td>859742727511</td>\n    </tr>\n  </tbody>\n</table>\n<p>24466 rows × 7 columns</p>\n</div>"},"metadata":{}}]},{"metadata":{},"cell_type":"markdown","source":"Does YouTube remove assets of ALL types if we issue a takedown from one service?  Mainly, if an art track asset is revoked... does it disappear from YT entirely?"}],"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"toc":{"nav_menu":{},"number_sections":false,"sideBar":false,"skip_h1_title":false,"base_numbering":1,"title_cell":"Table of Contents","title_sidebar":"Contents","toc_cell":false,"toc_position":{},"toc_section_display":false,"toc_window_display":false},"language_info":{"name":"python","version":"3.6.5","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":4}
No results found