{"componentChunkName":"component---src-templates-blog-post-js","path":"/get-youtube-data-aws","result":{"data":{"markdownRemark":{"html":"<p><img style=\"max-width: 50%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"1\" src=\"https://github.com/user-attachments/assets/05e115e6-d836-4d8e-b8d6-887f5735bb6b\"/></p>\n<p>In this post, let's build a project that uses AWS Lambda functions to load data from YouTube and store it in an S3 bucket.</p>\n<p>This project can be found on <a href=\"https://github.com/MalshaL/youtube-data-analysis/tree/master\">GitHub</a> as well.</p>\n<p>To make the project more interesting, let's define the problem we're trying to solve using YouTube data.</p>\n<p>What are the key factors that influence video engagement (likes, comments, and shares) and audience retention (watch time and drop-off rates) on YouTube, and how can creators optimize their content to maximize these metrics?</p>\n<p>The overall architecture of the solution we'll implement is outlined below.\nIn this post, we'll work on the first part: extracting raw data into the S3 bucket.</p>\n<p><img src=\"https://github.com/user-attachments/assets/62e136c7-0849-4efc-a885-8e37b365fa84\" alt=\"1-1\"></p>\n<p>It's worth understanding why we've chosen AWS Lambda and S3 among the many services available on the AWS platform.\nBeing serverless, Lambda functions are easy to build and deploy, which makes them a good fit for a relatively short script such as the one that fetches and stores our data. The script also runs well within the <a href=\"https://docs.aws.amazon.com/lambda/latest/dg/configuration-timeout.html#:~:text=maximum%20value%20of%20900%20seconds%20(15%20minutes).\">15 minute maximum timeout value</a> for Lambda functions.</p>\n<p>We're using an S3 bucket to store the raw data as it supports open data formats, and is scalable, durable, and cost-effective.</p>\n<h3>1. 
Getting started with the YouTube API</h3>\n<p>The first step is to use the YouTube API from Python to get the data we are looking for.</p>\n<p>YouTube has a helpful <a href=\"https://developers.google.com/youtube/v3/getting-started\">guide</a> on getting started with the API. </p>\n<p>Create a new project in <a href=\"https://console.developers.google.com/\">Google Developers Console</a> and give it a name; I've called mine <code class=\"language-text\">youtube-etl</code>.</p>\n<p><img src=\"https://github.com/user-attachments/assets/66cae20e-af56-469a-918e-ca404e204468\" alt=\"aws-yt-2\"></p>\n<p>Create credentials for the project. Because we're not using private data, we're choosing to create an API key.</p>\n<p><img src=\"https://github.com/user-attachments/assets/d1189edb-0d2b-46be-9f40-279ad1901d4e\" alt=\"aws-yt-3\"></p>\n<p>Head to the API Console to enable the YouTube API.</p>\n<p><img src=\"https://github.com/user-attachments/assets/5a3c0695-5776-484a-b7eb-d51a8737ee2f\" alt=\"aws-yt-4\"></p>\n<p>Enable the <code class=\"language-text\">YouTube Data API v3</code> for the project.</p>\n<p><img src=\"https://github.com/user-attachments/assets/93ed2887-6bd1-4016-9fc6-8b7d3ca61b80\" alt=\"aws-yt-5\"></p>\n<p>It should now appear in the list of enabled APIs.</p>\n<p><img style=\"max-width: 50%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"6\" src=\"https://github.com/user-attachments/assets/5959ffe2-e421-4ef5-b7aa-e9c009eda45b\"/></p>\n<p>Back on the Credentials page, restrict the API key we created so that it can only be used with the YouTube API.</p>\n<p><img style=\"max-width: 40%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"7\" src=\"https://github.com/user-attachments/assets/a4c6e413-bd42-41a1-94be-634f094a7666\"/></p>\n<p>The YouTube API has a default daily limit of 10,000 quota units. 
You can see this limit under Enabled APIs and Services > YouTube Data API > Quotas and system limits.</p>\n<p><img src=\"https://github.com/user-attachments/assets/22dcae03-803b-454e-a2d4-7cf96d37941c\" alt=\"aws-yt-8\"></p>\n<h6><em>Data Model for the dashboard</em></h6>\n<p>The <a href=\"https://developers.google.com/youtube/v3/docs/videos/list\">API Reference</a> lists the quota cost of each request.</p>\n<p><img src=\"https://github.com/user-attachments/assets/263491e9-5b87-40e9-8de6-19fe38f556e8\" alt=\"aws-yt-9\"></p>\n<h3>2. Using the YouTube API</h3>\n<p>Before using the API, there are a few terms worth knowing in order to use it effectively.</p>\n<p>When an API request is sent, a list of items is returned. These items are referred to as <code class=\"language-text\">resources</code>.\nEach resource has groups of properties which are known as <code class=\"language-text\">parts</code>. See the list of parts available for the <code class=\"language-text\">video</code> resource <a href=\"https://developers.google.com/youtube/v3/getting-started#partial\">here</a>.\nEach part contains a set of properties known as <code class=\"language-text\">fields</code>.</p>\n<p>When building your API request, you'll need to specify the <code class=\"language-text\">part</code> parameter, and you can optionally add the <code class=\"language-text\">fields</code> parameter, so that the response is filtered down to the properties you need. This helps to reduce the latency of the request, and saves you from processing many properties which may not be used in your project.</p>\n<p>Let's try using <code class=\"language-text\">curl</code> to send a simple request to get the video categories.\nThe YouTube API has a <a href=\"https://developers.google.com/youtube/v3/docs/videoCategories/list\"><code class=\"language-text\">videoCategories:List</code></a> endpoint. </p>\n<p>You can use the API Explorer on the same page to help build your <code class=\"language-text\">curl</code> request. 
Click to view it in full screen mode.</p>\n<p><img src=\"https://github.com/user-attachments/assets/0d21a30a-af5f-4d6e-b86b-1d4509d3bae9\" alt=\"aws-yt-10\"></p>\n<p>Use <code class=\"language-text\">snippet</code> for the <code class=\"language-text\">part</code> parameter and <code class=\"language-text\">US</code> as the region code. The API has different region codes for different countries, and we'll use <code class=\"language-text\">US</code> in this example.</p>\n<p><img src=\"https://github.com/user-attachments/assets/3a7972cc-dcf5-4dc9-9652-977741ac85c0\" alt=\"aws-yt-11\"></p>\n<p>Run the curl command with your API Key in a terminal. You can remove the line for the access token as we aren't using one.</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-text line-numbers\"><code class=\"language-text\">curl -X GET &#39;https://youtube.googleapis.com/youtube/v3/videoCategories?part=snippet&amp;regionCode=US&amp;key=YOUR_API_KEY&#39; --header &#39;Accept: application/json&#39; </code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span></span></pre></div>\n<p>The response will contain the list of video categories in JSON format.</p>\n<p><img style=\"max-width: 30%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"12\" src=\"https://github.com/user-attachments/assets/2e9386e1-311d-4343-a9d0-087beefaa2f6\"/></p>\n<h3>3. Using the YouTube API to get data for the project</h3>\n<p>As we now have some idea about using the API, let's try to build the request for the data we want for this project. </p>\n<p>The project aims to track the daily growth rate and audience engagement of popular travel videos, while identifying trending content. To achieve this, we'll focus on gathering popular travel videos published within the last 24 hours. 
To limit the scope, we'll focus on videos about traveling in Japan.</p>\n<p>The YouTube API has a <a href=\"https://developers.google.com/youtube/v3/docs/search/list\">search endpoint</a> which allows us to get a list of videos. It's worth noting that one search request will cost 100 quota units.</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-text line-numbers\"><code class=\"language-text\">curl -X GET \\\n&#39;https://www.googleapis.com/youtube/v3/search?part=snippet&amp;q=japan+travel&amp;type=video&amp;maxResults=50&amp;publishedAfter=2024-12-11T00:00:00Z&amp;publishedBefore=2024-12-12T00:00:00Z&amp;order=viewCount&amp;topicId=/m/07bxq&amp;videoDuration=medium&amp;key=YOUR_API_KEY&#39;</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span></span></pre></div>\n<p>Let's have a look at the parameters in the request.\nThe <code class=\"language-text\">part</code> parameter is set to return the <code class=\"language-text\">snippet</code>, so each result carries basic metadata alongside its video id.\nThe <code class=\"language-text\">q</code> parameter allows querying for travel videos.\nAs we want the videos published in the last 24 hours, <code class=\"language-text\">publishedAfter</code> is set to midnight on one day and <code class=\"language-text\">publishedBefore</code> to midnight on the following day, which limits the results to a 24-hour window.\nAs we're looking for the most popular videos, we're ordering the result by <code class=\"language-text\">viewCount</code>.\nThe topic Id filters the results to the travel topic. 
Topic Ids are defined <a href=\"https://developers.google.com/youtube/v3/docs/search/list#:~:text=See%20topic%20IDs%20supported%20as%20of%20February%2015%2C%202017\">here</a>.\nThe <code class=\"language-text\">videoDuration</code> parameter has 4 possible values: short (&#x3C;4 minutes), medium (between 4 and 20 minutes), long (>20 minutes) and any (default value). We'll execute two requests, one with <code class=\"language-text\">medium</code> and one with <code class=\"language-text\">long</code>, to exclude Shorts content on YouTube. </p>\n<p>The response contains a list of videos with their ids and other metadata.</p>\n<p><img style=\"max-width: 30%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"13\" src=\"https://github.com/user-attachments/assets/4b64682a-ed4d-405c-8f42-d13b88ca9e7c\"/></p>\n<h3>4. Use Python to build the API request</h3>\n<p>Now, let's use Python to build the above request. </p>\n<p>I've created a simple Python project in PyCharm, but you can use Jupyter Notebook as well.</p>\n<p><img style=\"max-width: 30%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"14\" src=\"https://github.com/user-attachments/assets/010e8521-0041-4391-8f08-50426dc3745f\"/></p>\n<p>Log in or sign up to the AWS Console. If you're creating a new account, you'll be able to use the free tier services.</p>\n<p><img style=\"max-width: 25%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"15\" src=\"https://github.com/user-attachments/assets/b6bf975c-00e4-4eb0-af56-0cd24cede5a1\"/></p>\n<p>In the AWS Console, search for the Lambda service and click on <code class=\"language-text\">Create a Function</code>.\nAdd a function name and set the Runtime option to use Python. 
In addition to the function, an IAM role will be created with permission to write logs to Amazon CloudWatch.</p>\n<p><img style=\"max-width: 50%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"16\" src=\"https://github.com/user-attachments/assets/f8ff2fb7-1c7b-4741-ac63-ef17ce387514\"/></p>\n<p>When you create the function, you'll see the code editor window where you can put in your code for the Lambda function. The Runtime Settings section shows the handler defined as <code class=\"language-text\">lambda_function.lambda_handler</code>, which is the function that will be executed when the Lambda function is run.</p>\n<p><img src=\"https://github.com/user-attachments/assets/0a5802ab-5601-411a-a949-56a041a94c9b\" alt=\"aws-yt-17\"></p>\n<p>Let's try running the sample function that was generated. Head to the <code class=\"language-text\">Test</code> tab and click on the Test button to execute the function.\nThe function will execute and create CloudWatch logs. Expand to view the output and logs.</p>\n<p><img src=\"https://github.com/user-attachments/assets/caea34a7-e9ae-4189-9994-2e1648eae0fc\" alt=\"aws-yt-18\"></p>\n<p>Let's head back to the IDE/Jupyter notebook and write the Python code to call the API.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token keyword\">import</span> os\n<span class=\"token keyword\">import</span> json\n<span class=\"token keyword\">from</span> googleapiclient<span class=\"token punctuation\">.</span>discovery <span class=\"token keyword\">import</span> build\n\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">lambda_handler</span><span class=\"token punctuation\">(</span>event<span class=\"token punctuation\">,</span> context<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    <span class=\"token comment\"># get the API key from Lambda 
environment variables</span>\n    youtube_api_key <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>environ<span class=\"token punctuation\">[</span><span class=\"token string\">'YOUTUBE_API_KEY'</span><span class=\"token punctuation\">]</span>\n\n    <span class=\"token comment\"># define api variables</span>\n    api_name <span class=\"token operator\">=</span> <span class=\"token string\">'youtube'</span>\n    api_version <span class=\"token operator\">=</span> <span class=\"token string\">'v3'</span>\n\n    <span class=\"token comment\"># initialise the API client</span>\n    youtube <span class=\"token operator\">=</span> build<span class=\"token punctuation\">(</span>api_name<span class=\"token punctuation\">,</span> api_version<span class=\"token punctuation\">,</span> developerKey<span class=\"token operator\">=</span>youtube_api_key<span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># execute search request for 50 videos with the highest view count</span>\n    search_response <span class=\"token operator\">=</span> youtube<span class=\"token punctuation\">.</span>search<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span><span class=\"token builtin\">list</span><span class=\"token punctuation\">(</span>\n        part<span class=\"token operator\">=</span><span class=\"token string\">'id'</span><span class=\"token punctuation\">,</span>\n        q<span class=\"token operator\">=</span><span class=\"token string\">'japan + travel'</span><span class=\"token punctuation\">,</span>\n        <span class=\"token builtin\">type</span><span class=\"token operator\">=</span><span class=\"token string\">'video'</span><span class=\"token punctuation\">,</span>\n        maxResults<span class=\"token operator\">=</span><span class=\"token number\">50</span><span class=\"token punctuation\">,</span>\n        publishedAfter<span 
class=\"token operator\">=</span><span class=\"token string\">'2024-12-11T00:00:00Z'</span><span class=\"token punctuation\">,</span>\n        publishedBefore<span class=\"token operator\">=</span><span class=\"token string\">'2024-12-12T00:00:00Z'</span><span class=\"token punctuation\">,</span>\n        order<span class=\"token operator\">=</span><span class=\"token string\">'viewCount'</span><span class=\"token punctuation\">,</span>\n        topicId<span class=\"token operator\">=</span><span class=\"token string\">'/m/07bxq'</span><span class=\"token punctuation\">,</span>\n        videoDuration<span class=\"token operator\">=</span><span class=\"token string\">'medium'</span>\n    <span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>execute<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># get video ids</span>\n    video_ids <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span>item<span class=\"token punctuation\">[</span><span class=\"token string\">'id'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">[</span><span class=\"token string\">'videoId'</span><span class=\"token punctuation\">]</span> <span class=\"token keyword\">for</span> item <span class=\"token keyword\">in</span> search_response<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'items'</span><span class=\"token punctuation\">,</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span>\n\n    <span class=\"token comment\"># get data for each video</span>\n    video_data_response <span class=\"token operator\">=</span> youtube<span class=\"token punctuation\">.</span>videos<span class=\"token punctuation\">(</span><span class=\"token 
punctuation\">)</span><span class=\"token punctuation\">.</span><span class=\"token builtin\">list</span><span class=\"token punctuation\">(</span>\n        part<span class=\"token operator\">=</span><span class=\"token string\">'snippet,contentDetails,statistics'</span><span class=\"token punctuation\">,</span>\n        <span class=\"token builtin\">id</span><span class=\"token operator\">=</span><span class=\"token string\">','</span><span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>video_ids<span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>execute<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># extract video data</span>\n    videos <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n    <span class=\"token keyword\">for</span> item <span class=\"token keyword\">in</span> video_data_response<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'items'</span><span class=\"token punctuation\">,</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        <span class=\"token comment\"># get() allows a fallback value in case the key is not found in the response item</span>\n        snippet <span class=\"token operator\">=</span> item<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'snippet'</span><span class=\"token punctuation\">,</span> <span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span><span class=\"token punctuation\">)</span>\n        content_details <span class=\"token operator\">=</span> 
item<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'contentDetails'</span><span class=\"token punctuation\">,</span> <span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span><span class=\"token punctuation\">)</span>\n        statistics <span class=\"token operator\">=</span> item<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'statistics'</span><span class=\"token punctuation\">,</span> <span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span><span class=\"token punctuation\">)</span>\n\n        video <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"id\"</span><span class=\"token punctuation\">:</span> item<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'id'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">''</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"title\"</span><span class=\"token punctuation\">:</span> snippet<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'title'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">''</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"description\"</span><span class=\"token punctuation\">:</span> snippet<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'description'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">''</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n  
          <span class=\"token string\">\"publishedAt\"</span><span class=\"token punctuation\">:</span> snippet<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'publishedAt'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"channelId\"</span><span class=\"token punctuation\">:</span> snippet<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'channelId'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">''</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"channelTitle\"</span><span class=\"token punctuation\">:</span> snippet<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'channelTitle'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">''</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"videoCategoryId\"</span><span class=\"token punctuation\">:</span> snippet<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'categoryId'</span><span class=\"token punctuation\">,</span> <span class=\"token number\">0</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"tags\"</span><span class=\"token punctuation\">:</span> snippet<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'tags'</span><span class=\"token punctuation\">,</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span 
class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"videoDuration\"</span><span class=\"token punctuation\">:</span> content_details<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'duration'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">''</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"videoDefinition\"</span><span class=\"token punctuation\">:</span> content_details<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'definition'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">''</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"initialViewCount\"</span><span class=\"token punctuation\">:</span> statistics<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'viewCount'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'0'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"initialLikeCount\"</span><span class=\"token punctuation\">:</span> statistics<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'likeCount'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'0'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"initialFavoriteCount\"</span><span class=\"token punctuation\">:</span> statistics<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token 
string\">'favoriteCount'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'0'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"initialCommentCount\"</span><span class=\"token punctuation\">:</span> statistics<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'commentCount'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'0'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"collectionDate\"</span><span class=\"token punctuation\">:</span> search_date\n        <span class=\"token punctuation\">}</span>\n        videos<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>video<span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># return list</span>\n    <span class=\"token keyword\">return</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token string\">'statusCode'</span><span class=\"token punctuation\">:</span> <span class=\"token number\">200</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">'body'</span><span class=\"token punctuation\">:</span> json<span class=\"token punctuation\">.</span>dumps<span class=\"token punctuation\">(</span>videos<span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">}</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 
0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Replace the code in Lambda function with the above.</p>\n<p>Head to 'Environment Variables' in the 'Configuration' section and add a new variable to store the API key.</p>\n<p><img src=\"https://github.com/user-attachments/assets/b7c1e7e0-dbed-43c3-9730-8c2fdc9d6d7a\" alt=\"aws-yt-19\"></p>\n<p>Since AWS Lambda doesn't include <code class=\"language-text\">googleapicleint</code> by default, we need to run the below command in local python environment to package it into a folder, zip and upload it in the Lambda environment. </p>\n<p>We'll use Lambda Layers to package this library separately from our code. 
Using a layer also allows the package to be shared across multiple Lambda functions.\nNote that all Python packages in a Lambda layer need to reside inside a <code class=\"language-text\">python</code> folder.</p>\n<div class=\"gatsby-highlight\" data-language=\"bash\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-bash line-numbers\"><code class=\"language-bash\"><span class=\"token function\">mkdir</span> -p layer/python\npip <span class=\"token function\">install</span> google-api-python-client -t layer/python\n<span class=\"token builtin class-name\">cd</span> layer\n<span class=\"token function\">zip</span> -r googleapiclient_layer.zip python</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Scroll towards the bottom of the page and add a new Layer. Select 'Create a new layer'.</p>\n<p><img src=\"https://github.com/user-attachments/assets/ee23f2cd-72ed-4a0e-abde-9e5699b0d934\" alt=\"aws-yt-20\"></p>\n<p>Add a layer name such as <code class=\"language-text\">googleapiclient</code>, choose the Python runtime, and upload the zip file created in the above steps.</p>\n<p>Back on the Lambda function page, scroll towards the end and select 'Add a layer' again. 
Select the Layer we created to attach it to the function.</p>\n<p><img src=\"https://github.com/user-attachments/assets/09a7d5ad-97d2-4340-bd2d-1a179f651813\" alt=\"aws-yt-21\"></p>\n<p>Now let's create a test event so that the function can be executed for testing.\nHead to 'Test' tab and create a test event with an empty json input.</p>\n<p><img style=\"max-width: 30%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"22\" src=\"https://github.com/user-attachments/assets/0fe0fcf8-5da7-4c5a-9670-53e61f10a442\"/></p>\n<p>Back in the 'Code' tab, deploy the code, and run the test.</p>\n<p><img style=\"max-width: 50%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"23\" src=\"https://github.com/user-attachments/assets/7a7cdaae-a103-4223-aa5c-9ae0f8e381e0\"/></p>\n<p>The output will be displayed in the window.</p>\n<p><img src=\"https://github.com/user-attachments/assets/cf1ec7da-3e96-434a-bdf4-7794fa3a0fbd\" alt=\"aws-yt-24\"></p>\n<h3>5. Store the raw data in S3</h3>\n<p>Head to S3 in the AWS Console and create a new S3 bucket to store data.</p>\n<p><img style=\"max-width: 30%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"25\" src=\"https://github.com/user-attachments/assets/6e974aad-060e-41ec-9401-be08d5e52f40\"/></p>\n<p>In the Python code in the Lambda function, replace the json output with the below.\nWe're using partitions to store the data in S3 to make data reads more efficient. In this instance, we're using the <code class=\"language-text\">collection_date</code> to partition the\ndata into separate folders. 
Each file is written under a key prefix of the form <code class=\"language-text\">collection_date=date</code>, which S3 presents as a folder, so that each day's data can be easily queried.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\">    <span class=\"token comment\"># add new imports</span>\n    <span class=\"token keyword\">import</span> os\n    <span class=\"token keyword\">from</span> io <span class=\"token keyword\">import</span> BytesIO\n    <span class=\"token keyword\">import</span> pyarrow <span class=\"token keyword\">as</span> pa\n    <span class=\"token keyword\">import</span> pyarrow<span class=\"token punctuation\">.</span>parquet <span class=\"token keyword\">as</span> pq\n    <span class=\"token keyword\">import</span> boto3\n    <span class=\"token keyword\">from</span> datetime <span class=\"token keyword\">import</span> datetime\n    <span class=\"token keyword\">from</span> zoneinfo <span class=\"token keyword\">import</span> ZoneInfo\n    <span class=\"token keyword\">from</span> googleapiclient<span class=\"token punctuation\">.</span>discovery <span class=\"token keyword\">import</span> build\n\n    <span class=\"token comment\"># ------------------------------------------</span>\n\n    <span class=\"token comment\"># get column names </span>\n    columns <span class=\"token operator\">=</span> videos<span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>keys<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># convert list of dicts to a dict of lists (one list per column)</span>\n    videos_list <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>key<span class=\"token punctuation\">:</span><span class=\"token punctuation\">[</span>item<span class=\"token punctuation\">[</span>key<span 
class=\"token punctuation\">]</span> <span class=\"token keyword\">for</span> item <span class=\"token keyword\">in</span> videos<span class=\"token punctuation\">]</span> <span class=\"token keyword\">for</span> key <span class=\"token keyword\">in</span> columns<span class=\"token punctuation\">}</span>\n\n    <span class=\"token comment\"># convert list to pyarrow table</span>\n    videos_tb <span class=\"token operator\">=</span> pa<span class=\"token punctuation\">.</span>table<span class=\"token punctuation\">(</span>videos_list<span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># init s3 client</span>\n    s3 <span class=\"token operator\">=</span> boto3<span class=\"token punctuation\">.</span>client<span class=\"token punctuation\">(</span><span class=\"token string\">'s3'</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># use collection_date to partition</span>\n    bucket_name <span class=\"token operator\">=</span> <span class=\"token string\">'&lt;S3 Bucket name>'</span>\n    parquet_file_key <span class=\"token operator\">=</span> <span class=\"token string-interpolation\"><span class=\"token string\">f'raw/videos/collection_date=</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>search_date<span class=\"token punctuation\">}</span></span><span class=\"token string\">/</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>search_date<span class=\"token punctuation\">}</span></span><span class=\"token string\">.parquet'</span></span>\n\n    <span class=\"token comment\"># write to parquet file in memory</span>\n    parquet_buffer <span class=\"token operator\">=</span> BytesIO<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n    pq<span class=\"token punctuation\">.</span>write_table<span class=\"token punctuation\">(</span>videos_tb<span class=\"token punctuation\">,</span> 
parquet_buffer<span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># upload to s3</span>\n    s3<span class=\"token punctuation\">.</span>put_object<span class=\"token punctuation\">(</span>Bucket<span class=\"token operator\">=</span>bucket_name<span class=\"token punctuation\">,</span> Key<span class=\"token operator\">=</span>parquet_file_key<span class=\"token punctuation\">,</span> Body<span class=\"token operator\">=</span>parquet_buffer<span class=\"token punctuation\">.</span>getvalue<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string-interpolation\"><span class=\"token string\">f'Parquet file uploaded to s3://</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>bucket_name<span class=\"token punctuation\">}</span></span><span class=\"token string\">/</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>parquet_file_key<span class=\"token punctuation\">}</span></span><span class=\"token string\">'</span></span><span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Before running the function, we need to add the pyarrow package into Lambda as well.\nWe could try the previous method we used for <code 
class=\"language-text\">googleapiclient</code>, but for <code class=\"language-text\">pyarrow</code> this will result in errors\nbecause the package contains compiled files that depend on the OS (Mac, Windows or Linux) it was built on. For the pyarrow package to be usable on Lambda,\nit should be built on a Linux machine. Alternatively, you could make use of a public pre-built layer resource available <a href=\"https://github.com/keithrozario/Klayers/tree/master/deployments\">here</a>.\nMake sure to pick the correct Python version and AWS region and obtain the ARN value for the resource.</p>\n<p>In the Lambda function, select 'Add a Layer' and use the ARN to attach the layer.\nI've also updated the Python version of the Lambda function to match the Python version I picked for pyarrow.</p>\n<p><img src=\"https://github.com/user-attachments/assets/87821e01-5a28-4cad-85d7-4bbf71583c7f\" alt=\"26\"></p>\n<p>Now, if you deploy changes and test the function, it will throw an error saying that the function has timed out after 3 seconds, which is the default Lambda timeout.\nSince writing Parquet files needs more time, let's update the timeout value to 1 minute.</p>\n<p><img src=\"https://github.com/user-attachments/assets/dc7df723-a24a-44a8-9c54-d50dde25feb1\" alt=\"26-2\"></p>\n<p>When you execute the function now, it'll throw an error mentioning that permissions are missing.</p>\n<div class=\"gatsby-highlight\" data-language=\"bash\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-bash line-numbers\"><code class=\"language-bash\">An error occurred <span class=\"token punctuation\">(</span>AccessDenied<span class=\"token punctuation\">)</span> when calling the PutObject operation</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span></span></pre></div>\n<p>To resolve this error, we need to allow the IAM role associated with the Lambda function to access the S3 bucket.</p>\n<p>Go to the <code class=\"language-text\">Configuration</code> tab and 
select <code class=\"language-text\">Permissions</code>. Click on the role name to open it in the IAM console.</p>\n<p><img src=\"https://github.com/user-attachments/assets/6021c396-3fbc-42d8-afda-922f7e58f52b\" alt=\"27\"></p>\n<p>Choose to edit the existing policy.</p>\n<p><img src=\"https://github.com/user-attachments/assets/3931fce3-fa04-4bad-bf83-979de1ab7c79\" alt=\"28\"></p>\n<p>Click outside the existing statement to reveal the <code class=\"language-text\">Add new statement</code> option on the left side.</p>\n<p><img src=\"https://github.com/user-attachments/assets/5f601958-c759-47be-8aa8-87e56ab9e94c\" alt=\"29\"></p>\n<p>Then select <code class=\"language-text\">s3</code> as the service to add, and <code class=\"language-text\">s3:GetObject</code> and <code class=\"language-text\">s3:PutObject</code> as the allowed actions.\nIn the resource selection, select the S3 bucket created earlier. This should add a snippet similar to the one below.</p>\n<div class=\"gatsby-highlight\" data-language=\"json\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-json line-numbers\"><code class=\"language-json\">        <span class=\"token punctuation\">{</span>\n\t\t\t<span class=\"token property\">\"Effect\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Allow\"</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Action\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n\t\t\t\t<span class=\"token string\">\"s3:PutObject\"</span><span class=\"token punctuation\">,</span>\n\t\t\t\t<span class=\"token string\">\"s3:GetObject\"</span>\n\t\t\t<span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n\t\t\t\t<span class=\"token string\">\"arn:aws:s3:::&lt;bucket_name>/*\"</span>\n\t\t\t<span class=\"token 
punctuation\">]</span>\n\t\t<span class=\"token punctuation\">}</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Rerun the Lambda function. If you followed all the steps, the Parquet file should be created in the S3 bucket.</p>\n<p><img style=\"max-width: 30%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"30\" src=\"https://github.com/user-attachments/assets/00a1a873-06c9-483b-8f71-72d6d34d7621\"/></p>\n<p>Congratulations on reaching this point! You've done a great job!!</p>\n<h3>6. Enable the job to run daily using Amazon EventBridge</h3>\n<p>Before setting up the daily schedule for the Lambda function, we need to make the <code class=\"language-text\">publishedBefore</code> and <code class=\"language-text\">publishedAfter</code> dates in the first API call change dynamically.</p>\n<p>Assume we want to run the daily job at 12.30am AEDT (UTC +11), which is 1.30pm UTC on the previous day.\nEach run can then get the videos published in the preceding 24 hours, from 1.30pm UTC on the day before the run to 1.30pm UTC on the day of the run.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\">    <span class=\"token keyword\">from</span> datetime <span class=\"token keyword\">import</span> datetime<span class=\"token punctuation\">,</span> timedelta\n\n    <span class=\"token comment\"># set datetime for search</span>\n    search_date <span class=\"token operator\">=</span> <span class=\"token punctuation\">(</span>datetime<span class=\"token punctuation\">.</span>now<span class=\"token punctuation\">(</span>ZoneInfo<span class=\"token punctuation\">(</span><span class=\"token string\">'Australia/Melbourne'</span><span class=\"token punctuation\">)</span><span class=\"token 
punctuation\">)</span><span class=\"token operator\">-</span>timedelta<span class=\"token punctuation\">(</span><span class=\"token number\">1</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strftime<span class=\"token punctuation\">(</span><span class=\"token string\">'%Y-%m-%d'</span><span class=\"token punctuation\">)</span>\n    previous_date <span class=\"token operator\">=</span> <span class=\"token punctuation\">(</span>datetime<span class=\"token punctuation\">.</span>now<span class=\"token punctuation\">(</span>ZoneInfo<span class=\"token punctuation\">(</span><span class=\"token string\">'Australia/Melbourne'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token operator\">-</span>timedelta<span class=\"token punctuation\">(</span><span class=\"token number\">2</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strftime<span class=\"token punctuation\">(</span><span class=\"token string\">'%Y-%m-%d'</span><span class=\"token punctuation\">)</span>\n    utc_time <span class=\"token operator\">=</span> <span class=\"token string\">'T13:30:00Z'</span>\n    search_date_utc <span class=\"token operator\">=</span> search_date <span class=\"token operator\">+</span> utc_time\n    previous_date_utc <span class=\"token operator\">=</span> previous_date <span class=\"token operator\">+</span> utc_time</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Once this is updated in the API request, we can set up the daily trigger to execute the function using Amazon EventBridge.</p>\n<p>Head to EventBridge console and create a rule.\nClick on 'Continue to create 
rule'.</p>\n<p><img src=\"https://github.com/user-attachments/assets/290b6cd4-84c9-4362-95b7-a6a445020507\" alt=\"31\"></p>\n<p>Set the cron expression to trigger the Lambda function. I've set it to 1.30pm UTC.</p>\n<p><img src=\"https://github.com/user-attachments/assets/4bf34627-015f-4f45-9b22-aeb8cb6e75af\" alt=\"32\"></p>\n<p>Set the Lambda function as the target.</p>\n<p><img src=\"https://github.com/user-attachments/assets/e833cf03-d1eb-4513-92f8-6e5ad498e4dc\" alt=\"33\"></p>\n<p>Head back to the Lambda function to verify that the trigger has been added.</p>\n<p><img src=\"https://github.com/user-attachments/assets/722e4ba7-3a57-4785-b24d-9f92655f7d7d\" alt=\"34\"></p>\n<h3>7. Read data from S3 to get data from API</h3>\n<p>Let's create a second Lambda function named <code class=\"language-text\">getYoutubeStats</code>.\nIn this function, let's first list all the data files stored in S3:</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token keyword\">import</span> boto3\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">lambda_handler</span><span class=\"token punctuation\">(</span>event<span class=\"token punctuation\">,</span> context<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n\n    <span class=\"token comment\"># init s3 client</span>\n    s3 <span class=\"token operator\">=</span> boto3<span class=\"token punctuation\">.</span>client<span class=\"token punctuation\">(</span><span class=\"token string\">'s3'</span><span class=\"token punctuation\">)</span>\n\n    bucket_name <span class=\"token operator\">=</span> <span class=\"token string\">'&lt;bucket_name>'</span> \n    directory_prefix <span class=\"token operator\">=</span> <span class=\"token string\">'raw/videos/'</span>\n\n    <span class=\"token comment\"># get all files 
in raw/videos/</span>\n    all_files_response <span class=\"token operator\">=</span> s3<span class=\"token punctuation\">.</span>list_objects_v2<span class=\"token punctuation\">(</span>Bucket<span class=\"token operator\">=</span>bucket_name<span class=\"token punctuation\">,</span> Prefix<span class=\"token operator\">=</span>directory_prefix<span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>all_files_response<span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>To run this function, give the IAM role associated with the new Lambda function permission to read and write objects in the S3 bucket, and to list the bucket's contents.</p>\n<div class=\"gatsby-highlight\" data-language=\"json\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-json line-numbers\"><code class=\"language-json\">        <span class=\"token punctuation\">{</span>\n\t\t\t<span class=\"token property\">\"Effect\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Allow\"</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Action\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n\t\t\t\t<span class=\"token string\">\"s3:GetObject\"</span><span class=\"token punctuation\">,</span>\n\t\t\t\t<span class=\"token string\">\"s3:PutObject\"</span>\n\t\t\t<span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n\t\t\t\t<span class=\"token string\">\"arn:aws:s3:::&lt;bucket_name>/*\"</span>\n\t\t\t<span 
class=\"token punctuation\">]</span>\n\t\t<span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n\t\t<span class=\"token punctuation\">{</span>\n\t\t\t<span class=\"token property\">\"Effect\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Allow\"</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Action\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n\t\t\t\t<span class=\"token string\">\"s3:ListBucket\"</span>\n\t\t\t<span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n\t\t\t\t<span class=\"token string\">\"arn:aws:s3:::&lt;bucket_name>\"</span>\n\t\t\t<span class=\"token punctuation\">]</span>\n\t\t<span class=\"token punctuation\">}</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>When the function is run, the response will have a <code class=\"language-text\">Contents</code> field, which contains the file names stored in S3. 
We can use this to get the data in the file.</p>\n<p>The completed function is as below:</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token keyword\">import</span> os\n<span class=\"token keyword\">from</span> io <span class=\"token keyword\">import</span> BytesIO\n<span class=\"token keyword\">import</span> pyarrow <span class=\"token keyword\">as</span> pa\n<span class=\"token keyword\">import</span> pyarrow<span class=\"token punctuation\">.</span>parquet <span class=\"token keyword\">as</span> pq\n<span class=\"token keyword\">import</span> boto3\n<span class=\"token keyword\">from</span> datetime <span class=\"token keyword\">import</span> datetime<span class=\"token punctuation\">,</span> timedelta\n<span class=\"token keyword\">from</span> zoneinfo <span class=\"token keyword\">import</span> ZoneInfo\n<span class=\"token keyword\">from</span> googleapiclient<span class=\"token punctuation\">.</span>discovery <span class=\"token keyword\">import</span> build\n\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">lambda_handler</span><span class=\"token punctuation\">(</span>event<span class=\"token punctuation\">,</span> context<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    <span class=\"token comment\"># get the API key from Lambda environment variables</span>\n    youtube_api_key <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>environ<span class=\"token punctuation\">[</span><span class=\"token string\">'YOUTUBE_API_KEY'</span><span class=\"token punctuation\">]</span>\n\n    <span class=\"token comment\"># define api variables</span>\n    api_name <span class=\"token operator\">=</span> <span class=\"token string\">'youtube'</span>\n    api_version <span class=\"token operator\">=</span> <span class=\"token 
string\">'v3'</span>\n\n    <span class=\"token comment\"># initialise the API client</span>\n    youtube <span class=\"token operator\">=</span> build<span class=\"token punctuation\">(</span>api_name<span class=\"token punctuation\">,</span> api_version<span class=\"token punctuation\">,</span> developerKey<span class=\"token operator\">=</span>youtube_api_key<span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># init s3 client</span>\n    s3 <span class=\"token operator\">=</span> boto3<span class=\"token punctuation\">.</span>client<span class=\"token punctuation\">(</span><span class=\"token string\">'s3'</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># set today's datetime for search</span>\n    search_date <span class=\"token operator\">=</span> <span class=\"token punctuation\">(</span>datetime<span class=\"token punctuation\">.</span>now<span class=\"token punctuation\">(</span>ZoneInfo<span class=\"token punctuation\">(</span><span class=\"token string\">'Australia/Melbourne'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span> timedelta<span class=\"token punctuation\">(</span><span class=\"token number\">1</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strftime<span class=\"token punctuation\">(</span><span class=\"token string\">'%Y-%m-%d'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># get last 6 days for which data should be retrieved</span>\n    today <span class=\"token operator\">=</span> datetime<span class=\"token punctuation\">.</span>now<span class=\"token punctuation\">(</span>ZoneInfo<span class=\"token punctuation\">(</span><span class=\"token string\">'Australia/Melbourne'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n    
search_date_list <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">(</span>today<span class=\"token operator\">-</span>timedelta<span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strftime<span class=\"token punctuation\">(</span><span class=\"token string\">'%Y-%m-%d'</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> i <span class=\"token keyword\">in</span> <span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token number\">2</span><span class=\"token punctuation\">,</span><span class=\"token number\">8</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span>\n\n    <span class=\"token comment\"># get the oldest date for which video data should be retrieved</span>\n    <span class=\"token comment\"># convert string to datetime</span>\n    oldest_collection_date <span class=\"token operator\">=</span> datetime<span class=\"token punctuation\">.</span>strptime<span class=\"token punctuation\">(</span><span class=\"token punctuation\">(</span>today <span class=\"token operator\">-</span> timedelta<span class=\"token punctuation\">(</span><span class=\"token number\">7</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strftime<span class=\"token punctuation\">(</span><span class=\"token string\">'%Y-%m-%d'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'%Y-%m-%d'</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># get data for last 7 days</span>\n    bucket_name <span class=\"token operator\">=</span> <span class=\"token string\">'&lt;bucket_name>'</span>\n    videos <span 
class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n\n    <span class=\"token keyword\">for</span> day <span class=\"token keyword\">in</span> search_date_list<span class=\"token punctuation\">:</span>\n        directory_prefix <span class=\"token operator\">=</span> <span class=\"token string\">'raw/videos'</span>\n        partition_prefix <span class=\"token operator\">=</span> <span class=\"token string-interpolation\"><span class=\"token string\">f'</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>directory_prefix<span class=\"token punctuation\">}</span></span><span class=\"token string\">/collection_date=</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>day<span class=\"token punctuation\">}</span></span><span class=\"token string\">'</span></span>\n\n        <span class=\"token comment\"># check if partition exists in S3</span>\n        file_response <span class=\"token operator\">=</span> s3<span class=\"token punctuation\">.</span>list_objects_v2<span class=\"token punctuation\">(</span>Bucket<span class=\"token operator\">=</span>bucket_name<span class=\"token punctuation\">,</span> Prefix<span class=\"token operator\">=</span>partition_prefix<span class=\"token punctuation\">)</span>\n\n        <span class=\"token comment\"># break loop if latest partition is not available - means previous partitions aren't available as well</span>\n        <span class=\"token keyword\">if</span> <span class=\"token string\">'Contents'</span> <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> file_response<span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">break</span>\n        <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n            <span class=\"token comment\"># file path</span>\n            file_key <span class=\"token 
operator\">=</span> file_response<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'Contents'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'Key'</span><span class=\"token punctuation\">)</span>\n            <span class=\"token comment\"># file date in datetime format</span>\n            file_date <span class=\"token operator\">=</span> datetime<span class=\"token punctuation\">.</span>strptime<span class=\"token punctuation\">(</span>day<span class=\"token punctuation\">,</span> <span class=\"token string\">'%Y-%m-%d'</span><span class=\"token punctuation\">)</span> \n            search_date_dt <span class=\"token operator\">=</span> datetime<span class=\"token punctuation\">.</span>strptime<span class=\"token punctuation\">(</span>search_date<span class=\"token punctuation\">,</span> <span class=\"token string\">'%Y-%m-%d'</span><span class=\"token punctuation\">)</span>\n            <span class=\"token comment\"># get object</span>\n            file_obj <span class=\"token operator\">=</span> s3<span class=\"token punctuation\">.</span>get_object<span class=\"token punctuation\">(</span>Bucket<span class=\"token operator\">=</span>bucket_name<span class=\"token punctuation\">,</span> Key<span class=\"token operator\">=</span>file_key<span class=\"token punctuation\">)</span>\n            file_table <span class=\"token operator\">=</span> pq<span class=\"token punctuation\">.</span>read_table<span class=\"token punctuation\">(</span>BytesIO<span class=\"token punctuation\">(</span>file_obj<span class=\"token punctuation\">[</span><span class=\"token string\">'Body'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>read<span 
class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n            video_ids <span class=\"token operator\">=</span> file_table<span class=\"token punctuation\">.</span>column<span class=\"token punctuation\">(</span><span class=\"token string\">'id'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>to_pylist<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n            <span class=\"token comment\"># get data for each video</span>\n            video_data_response <span class=\"token operator\">=</span> youtube<span class=\"token punctuation\">.</span>videos<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span><span class=\"token builtin\">list</span><span class=\"token punctuation\">(</span>\n                part<span class=\"token operator\">=</span><span class=\"token string\">'statistics'</span><span class=\"token punctuation\">,</span>\n                <span class=\"token builtin\">id</span><span class=\"token operator\">=</span><span class=\"token string\">','</span><span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>video_ids<span class=\"token punctuation\">)</span>\n            <span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>execute<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n            <span class=\"token keyword\">for</span> item <span class=\"token keyword\">in</span> video_data_response<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'items'</span><span class=\"token punctuation\">,</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span 
class=\"token punctuation\">:</span>\n                statistics <span class=\"token operator\">=</span> item<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'statistics'</span><span class=\"token punctuation\">,</span> <span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span><span class=\"token punctuation\">)</span>\n\n                video <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n                    <span class=\"token string\">\"id\"</span><span class=\"token punctuation\">:</span> item<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'id'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">''</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n                    <span class=\"token string\">\"initialCollectionDate\"</span><span class=\"token punctuation\">:</span> file_date<span class=\"token punctuation\">,</span>\n                    <span class=\"token string\">\"collectionDate\"</span><span class=\"token punctuation\">:</span> search_date_dt<span class=\"token punctuation\">,</span>\n                    <span class=\"token string\">\"collectionCount\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">(</span>file_date <span class=\"token operator\">-</span> oldest_collection_date<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>days <span class=\"token operator\">+</span> <span class=\"token number\">2</span><span class=\"token punctuation\">,</span>\n                    <span class=\"token string\">\"viewCount\"</span><span class=\"token punctuation\">:</span> statistics<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'viewCount'</span><span class=\"token 
punctuation\">,</span> <span class=\"token string\">'0'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n                    <span class=\"token string\">\"likeCount\"</span><span class=\"token punctuation\">:</span> statistics<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'likeCount'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'0'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n                    <span class=\"token string\">\"favoriteCount\"</span><span class=\"token punctuation\">:</span> statistics<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'favoriteCount'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'0'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n                    <span class=\"token string\">\"commentCount\"</span><span class=\"token punctuation\">:</span> statistics<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span><span class=\"token string\">'commentCount'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'0'</span><span class=\"token punctuation\">)</span>\n                <span class=\"token punctuation\">}</span>\n                videos<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>video<span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># save data in s3 if data is available</span>\n    <span class=\"token keyword\">if</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>videos<span class=\"token punctuation\">)</span> <span class=\"token operator\">></span> <span class=\"token number\">0</span><span class=\"token punctuation\">:</span>\n        
<span class=\"token comment\"># get column names</span>\n        columns <span class=\"token operator\">=</span> videos<span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>keys<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n        <span class=\"token comment\"># convert list of dicts to a dict of column lists</span>\n        videos_list <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>key<span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span>item<span class=\"token punctuation\">[</span>key<span class=\"token punctuation\">]</span> <span class=\"token keyword\">for</span> item <span class=\"token keyword\">in</span> videos<span class=\"token punctuation\">]</span> <span class=\"token keyword\">for</span> key <span class=\"token keyword\">in</span> columns<span class=\"token punctuation\">}</span>\n\n        <span class=\"token comment\"># convert dict of columns to pyarrow table</span>\n        videos_tb <span class=\"token operator\">=</span> pa<span class=\"token punctuation\">.</span>table<span class=\"token punctuation\">(</span>videos_list<span class=\"token punctuation\">)</span>\n\n        <span class=\"token comment\"># write to parquet file in memory</span>\n        parquet_buffer <span class=\"token operator\">=</span> BytesIO<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n        pq<span class=\"token punctuation\">.</span>write_table<span class=\"token punctuation\">(</span>videos_tb<span class=\"token punctuation\">,</span> parquet_buffer<span class=\"token punctuation\">)</span>\n\n        <span class=\"token comment\"># upload to s3</span>\n        parquet_file_key <span class=\"token operator\">=</span> <span class=\"token string-interpolation\"><span class=\"token string\">f'raw/video_stats/collection_date=</span><span 
class=\"token interpolation\"><span class=\"token punctuation\">{</span>search_date<span class=\"token punctuation\">}</span></span><span class=\"token string\">/</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>search_date<span class=\"token punctuation\">}</span></span><span class=\"token string\">.parquet'</span></span>\n        s3<span class=\"token punctuation\">.</span>put_object<span class=\"token punctuation\">(</span>Bucket<span class=\"token operator\">=</span>bucket_name<span class=\"token punctuation\">,</span> Key<span class=\"token operator\">=</span>parquet_file_key<span class=\"token punctuation\">,</span> Body<span class=\"token operator\">=</span>parquet_buffer<span class=\"token punctuation\">.</span>getvalue<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n        <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string-interpolation\"><span class=\"token string\">f'Parquet file uploaded to s3://</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>bucket_name<span class=\"token punctuation\">}</span></span><span class=\"token string\">/</span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>parquet_file_key<span class=\"token punctuation\">}</span></span><span class=\"token string\">'</span></span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n        <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string-interpolation\"><span class=\"token string\">f'No data available before </span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>search_date_list<span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token 
punctuation\">]</span><span class=\"token punctuation\">}</span></span><span class=\"token string\">'</span></span><span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Before executing the function, make sure you have added environment variables, and attached the required layers similar to what we did for 'getYoutubeData' function.</p>\n<p>Finally, add the new function to be triggered with the same EventBridge rule.</p>\n<p><img src=\"https://github.com/user-attachments/assets/a5d6ccfc-2cd7-4a5d-ae94-b57bc7bb5c2e\" alt=\"35\"></p>\n<p>Now we have setup 
the Lambda functions that extract data using the YouTube API!</p>\n<p>Congratulations on making it this far!</p>\n<p>Read about the next steps of the project, using AWS Glue to transform the data and store it in Redshift, in <a href=\"https://malshal.github.io/bitsploit/youtube-data-transform-aws\">Part 2</a>.</p>","frontmatter":{"date":"December 11, 2024","path":"/get-youtube-data-aws","title":"Extract and load YouTube data in S3 using AWS Lambda","tags":["Project","AWS","ETL","AWS Lambda"]}}},"pageContext":{}},"staticQueryHashes":["3649515864"]}