{"componentChunkName":"component---src-templates-blog-post-js","path":"/youtube-data-transform-aws","result":{"data":{"markdownRemark":{"html":"<p><img src=\"https://github.com/user-attachments/assets/62e136c7-0849-4efc-a885-8e37b365fa84\" alt=\"1\"></p>\n<p>In the previous post <a href=\"https://malshal.github.io/bitsploit/get-youtube-data-aws\">here</a>, we built AWS Lambda functions to get data from the YouTube API and store it in an S3 bucket.</p>\n<p>In this post, let's continue the project by transforming the raw data in S3 using AWS Glue and storing the result in Redshift, so that it can be used for analysis. </p>\n<h3>1. Use AWS Glue Crawler to build the Glue database in Glue Data Catalog</h3>\n<p>AWS Glue Crawler is used to build metadata tables for data tables in a data store. These metadata tables are then stored in a Glue database.\nIt's worth noting that the crawler doesn't move the data itself from its original location, but only creates a pointer to the data so that it can be referenced\nin AWS Glue scripts.</p>\n<p>To get started, let's log in to the AWS Console and head to the AWS Glue Console.\nExpand the left side menu to find 'Crawlers' and create a new crawler.</p>\n<p><img src=\"https://github.com/user-attachments/assets/d98f1974-e7a8-4eb6-8e7a-ee2e40ff0bfa\" alt=\"2\"></p>\n<p>Add the S3 bucket with raw data as a new data source.</p>\n<p><img src=\"https://github.com/user-attachments/assets/0729c87d-46d2-4d79-8152-606fc4acc1e0\" alt=\"3\"></p>\n<p>In the next step, create a new IAM Role to be associated with Glue services.</p>\n<p><img src=\"https://github.com/user-attachments/assets/71735eb6-ae64-43ae-aa2a-a299401c783f\" alt=\"4\"></p>\n<p>In the next step, create a Glue database to store the metadata tables created by the crawler.\nKeep the crawler schedule <code class=\"language-text\">On demand</code> for now.</p>\n<p><img src=\"https://github.com/user-attachments/assets/4b1b544c-1701-4522-a3fb-5e1ae8e54f18\" alt=\"5\"></p>\n<p>If you come 
across an error stating <code class=\"language-text\">One crawler failed to create. Account was denied access</code>, your account\nmay not yet have passed the minimum amount of usage required before it can create a Glue Crawler. This can be resolved by creating an EC2 instance\n(use a free-tier eligible option if you are on the free tier) and running it for a few minutes. You'll get an email from AWS indicating that your account\nwas validated for additional use. Make sure to terminate the EC2 instance if it is no longer required.</p>\n<p>Once the crawler is created, click the crawler name and start a crawler run. Once the run completes, its status is displayed at the bottom of the page.</p>\n<p><img src=\"https://github.com/user-attachments/assets/5aae5b65-85bc-4eb3-948d-4eb76410f90a\" alt=\"6\"></p>\n<p>Head to Databases and select the database created earlier to view the tables added by the crawler.</p>\n<p><img src=\"https://github.com/user-attachments/assets/c59328a9-ba7a-43ee-b295-2a03051ba85e\" alt=\"7\"></p>\n<p>Each table will have the schema automatically inferred by the crawler.</p>\n<p><img src=\"https://github.com/user-attachments/assets/7b090971-4de5-4805-b92c-e7ccd2508b71\" alt=\"8\"></p>\n<h3>2. 
Build the ETL job using AWS Glue</h3>\n<p>Now that we have tables set up in the Glue Data Catalog, let's build the ETL job to transform the raw data into the format required for analysis.</p>\n<p>On the left side pane, select 'Data Integration and ETL' and head to 'ETL Jobs'.\nYou'll notice that there are three ways to create a job: using the visual tool, using a notebook, or using a script.</p>\n<p><img src=\"https://github.com/user-attachments/assets/64fd768e-2177-4922-8556-0e24867291ac\" alt=\"9\"></p>\n<p>In this project, we'll use the script option to build the job.</p>\n<p><img src=\"https://github.com/user-attachments/assets/a68e7187-ee0d-47e1-8054-73cd6df38a67\" alt=\"10\"></p>\n<p>Because AWS Glue incurs charges when using interactive sessions, I've written the code locally in VSCode and will only be testing it on Glue itself.</p>\n<p>As the Lambda functions run daily, the Glue job will also run daily, after the Lambda functions, to store the new data in the data store.</p>\n<p>Initially, the job will read data from the newest partition in the S3 bucket, apply some transformations to the data, and finally store the results back in S3 as CSV files.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"></span></pre></div>\n<p>Head to AWS Glue, create a script and add the code. </p>\n<p><img src=\"https://github.com/user-attachments/assets/e116e2eb-eab5-4507-b57c-e061052343d3\" alt=\"11\"></p>\n<p>Click on the <code class=\"language-text\">Job Details</code> tab to create the IAM role for Glue.</p>\n<p><img src=\"https://github.com/user-attachments/assets/08a4d824-2946-4edf-9fc1-e5c31f02c4b6\" alt=\"12\"></p>\n<p>Set the Glue settings as below. 
</p>\n<p><img src=\"https://github.com/user-attachments/assets/e852332f-9f1c-4aa8-b95d-3aa949eadc0f\" alt=\"13\"></p>\n<p>Next, head to IAM to edit the role we just created for Glue.\nIn the role, make sure the <code class=\"language-text\">AWSGlueServiceRole</code> policy is attached to the role.</p>\n<p><img src=\"https://github.com/user-attachments/assets/a53ada57-60e0-456c-a827-372f845d13f4\" alt=\"14\"></p>\n<p>Create a new policy with below to give permission to Glue to access the S3 bucket.</p>\n<div class=\"gatsby-highlight\" data-language=\"json\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-json line-numbers\"><code class=\"language-json\"><span class=\"token punctuation\">{</span>\n    <span class=\"token property\">\"Version\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"2012-10-17\"</span><span class=\"token punctuation\">,</span>\n    <span class=\"token property\">\"Statement\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n        <span class=\"token punctuation\">{</span>\n            <span class=\"token property\">\"Effect\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Allow\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Action\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n                <span class=\"token string\">\"s3:GetObject\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"s3:PutObject\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"s3:ListBucket\"</span>\n            <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n                <span 
class=\"token string\">\"arn:aws:s3:::&lt;bucket_name>\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"arn:aws:s3:::&lt;bucket_name>/*\"</span>\n            <span class=\"token punctuation\">]</span>\n        <span class=\"token punctuation\">}</span>\n    <span class=\"token punctuation\">]</span>\n<span class=\"token punctuation\">}</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Head back to AWS Glue and run the job we created above. </p>\n<p><img src=\"https://github.com/user-attachments/assets/ceb65a85-f728-4310-80f2-c0e3a03cb010\" alt=\"15\"></p>\n<p>If the job run succeeds, the <code class=\"language-text\">output</code> folder in S3 will contain CSV files with the result data.</p>\n<p><img src=\"https://github.com/user-attachments/assets/3b23bf5a-df84-4aa6-ad75-8aa34e9a0025\" alt=\"16\"></p>\n<h3>3. Set up Redshift Serverless to store data</h3>\n<p>First, create a <code class=\"language-text\">temp</code> folder in S3 to be used as a temporary storage location when Glue jobs interact with Amazon Redshift. 
The Glue job will use this folder for data staging and intermediate processing during data transfers between Glue and Redshift.</p>\n<p><img src=\"https://github.com/user-attachments/assets/30f40c0b-0a26-45bc-b194-6fc9ae6eb837\" alt=\"17\"></p>\n<p>We'll be using the Redshift Serverless free trial instead of a provisioned Redshift cluster, primarily to limit the costs for this project.</p>\n<p>Let's set up a Redshift Serverless <code class=\"language-text\">namespace</code> and <code class=\"language-text\">workgroup</code>.</p>\n<p>Add a <code class=\"language-text\">namespace</code> name.</p>\n<p><img src=\"https://github.com/user-attachments/assets/c561597f-ea3e-4929-8da1-4c6e68e60a38\" alt=\"18\"></p>\n<p>The database name will be <code class=\"language-text\">dev</code> by default. I'll be manually setting up the username and password to access the database.</p>\n<p><img src=\"https://github.com/user-attachments/assets/4b4fae85-2633-4738-bbf4-66e7acd82b8e\" alt=\"19\"></p>\n<p>In the next step, specify the S3 bucket for the IAM role to access.</p>\n<p><img src=\"https://github.com/user-attachments/assets/88cc6cd4-ff94-48a1-9da3-27ee1f3589b5\" alt=\"20\"></p>\n<p>Add the workgroup name and the base capacity in RPUs (Redshift Processing Units). 
I've kept the base capacity at the lowest value as the project will only be processing small datasets.</p>\n<p><img src=\"https://github.com/user-attachments/assets/0ea96774-3de9-46cb-b13f-290f4d2b589c\" alt=\"21\"></p>\n<p>Once creation is finished, you can use the Query Editor link on the left sidebar to access the SQL querying platform.</p>\n<p><img src=\"https://github.com/user-attachments/assets/a00fbd6d-2450-4900-b66b-a970642abc20\" alt=\"22\"></p>\n<p>Connect to the database using the credentials provided earlier.</p>\n<p><img src=\"https://github.com/user-attachments/assets/296c9df5-9ef4-4610-a60c-16f48e5cc988\" alt=\"23\"></p>\n<p>Use the below SQL query to create the result tables.</p>\n<div class=\"gatsby-highlight\" data-language=\"sql\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-sql line-numbers\"><code class=\"language-sql\"><span class=\"token keyword\">CREATE</span> <span class=\"token keyword\">TABLE</span> videos <span class=\"token punctuation\">(</span>\n    video_id                        <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">255</span><span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>video_title                    <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">500</span><span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>video_description_truncated    <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">500</span><span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>video_description_length       <span class=\"token keyword\">INT</span>\n    <span class=\"token punctuation\">,</span>video_published_datetime       <span class=\"token keyword\">TIMESTAMP</span>\n    <span class=\"token 
punctuation\">,</span>channel_id                     <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">255</span><span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>channel_title                  <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">500</span><span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>video_category_id              <span class=\"token keyword\">INT</span>\n    <span class=\"token punctuation\">,</span>video_tags_truncated           <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">500</span><span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>video_duration                 <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">20</span><span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>video_definition               <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">20</span><span class=\"token punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>collection_date                <span class=\"token keyword\">DATE</span>\n    <span class=\"token punctuation\">,</span>ingested_datetime              <span class=\"token keyword\">TIMESTAMP</span>\n<span class=\"token punctuation\">)</span><span class=\"token punctuation\">;</span>\n\n<span class=\"token keyword\">CREATE</span> <span class=\"token keyword\">TABLE</span> video_stats <span class=\"token punctuation\">(</span>\n    video_id                    <span class=\"token keyword\">VARCHAR</span><span class=\"token punctuation\">(</span><span class=\"token number\">255</span><span class=\"token 
punctuation\">)</span>\n    <span class=\"token punctuation\">,</span>initial_collection_date    <span class=\"token keyword\">DATE</span>\n    <span class=\"token punctuation\">,</span>collection_date            <span class=\"token keyword\">DATE</span>\n    <span class=\"token punctuation\">,</span>collection_count           <span class=\"token keyword\">INT</span>\n    <span class=\"token punctuation\">,</span>view_count                 <span class=\"token keyword\">INT</span>\n    <span class=\"token punctuation\">,</span>like_count                 <span class=\"token keyword\">INT</span>\n    <span class=\"token punctuation\">,</span>favorite_count             <span class=\"token keyword\">INT</span>\n    <span class=\"token punctuation\">,</span>comment_count              <span class=\"token keyword\">INT</span>\n    <span class=\"token punctuation\">,</span>ingested_datetime          <span class=\"token keyword\">TIMESTAMP</span>\n<span class=\"token punctuation\">)</span><span class=\"token punctuation\">;</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Right-click to Refresh and view the tables created in the database.</p>\n<p><img src=\"https://github.com/user-attachments/assets/f7608435-a4e7-403e-b1be-4737123c0fc1\" alt=\"24\"></p>\n<h3>4. Create connection from AWS Glue to Redshift to write data</h3>\n<p>To connect Glue with Redshift, we need to setup a Glue connection that will connect using JDBC.</p>\n<p>Head to AWS GLue and create a new connection. 
Use the menu on left side to find connections.</p>\n<p>Search for <code class=\"language-text\">JDBC</code> in the data sources section.</p>\n<p><img src=\"https://github.com/user-attachments/assets/6a3893e1-aff4-4226-8d04-43fffaecb29e\" alt=\"25\"></p>\n<p>Add a name for the connection, and provide the username and password for the connection to access the database.</p>\n<p><img src=\"https://github.com/user-attachments/assets/bbf98550-f4b5-46b5-bef8-1db7c1e594bf\" alt=\"26\"></p>\n<p>Next, expand the <code class=\"language-text\">Network options</code> to add in the VPC name, subnet and security group of Redshift. Make sure these values match with corresponding values from the Redshift cluster you created above.</p>\n<p><img src=\"https://github.com/user-attachments/assets/bbe8c43d-0daf-4655-89e8-3820ef95ec51\" alt=\"27\"></p>\n<p>Save changes and make sure the connection status updates to <code class=\"language-text\">Ready</code>.</p>\n<p><img src=\"https://github.com/user-attachments/assets/a3c33b3a-05e6-46f1-990c-f970bd20eda6\" alt=\"28\"></p>\n<p>To use this JDBC connection in our Glue job, head back to the Glue job and add the connection in <code class=\"language-text\">Job Details</code> section.</p>\n<p><img src=\"https://github.com/user-attachments/assets/9ffe471a-a2ba-4d7e-a335-542136e741cd\" alt=\"29\"></p>\n<p>In order for the Glue job to connect with Redshift using the JDBC connection, the IAM role associated with Glue need to have necessary policies attached.\nHead to IAM and select the role that is associated with Glue and attach the following poliices to the role.</p>\n<p><img src=\"https://github.com/user-attachments/assets/75823630-5665-4cd5-8c7f-16a63e31997a\" alt=\"30\"></p>\n<p>In addition to the above, the role should already have the policy we added to provide access to S3 in step 2.</p>\n<p>Next, create a new policy with below to give permission to Glue to access AWS Secret Manager and to be able to create and pass the role.</p>\n<div 
class=\"gatsby-highlight\" data-language=\"json\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-json line-numbers\"><code class=\"language-json\"><span class=\"token punctuation\">{</span>\n    <span class=\"token property\">\"Version\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"2012-10-17\"</span><span class=\"token punctuation\">,</span>\n    <span class=\"token property\">\"Statement\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n        <span class=\"token punctuation\">{</span>\n            <span class=\"token property\">\"Sid\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"VisualEditor0\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Effect\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Allow\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Action\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n                <span class=\"token string\">\"secretsmanager:GetSecretValue\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"secretsmanager:PutResourcePolicy\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"secretsmanager:DescribeSecret\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"secretsmanager:PutSecretValue\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"secretsmanager:CreateSecret\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"redshift-serverless:GetCredentials\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"secretsmanager:TagResource\"</span><span class=\"token 
punctuation\">,</span>\n                <span class=\"token string\">\"secretsmanager:UpdateSecret\"</span>\n            <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n                <span class=\"token string\">\"&lt;Redshift workgroup ARN>\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"arn:aws:secretsmanager:*:&lt;account_id>:secret:*\"</span>\n            <span class=\"token punctuation\">]</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token punctuation\">{</span>\n            <span class=\"token property\">\"Sid\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Statement1\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Effect\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Allow\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Action\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n                <span class=\"token string\">\"iam:CreateServiceLinkedRole\"</span>\n            <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n                <span class=\"token string\">\"arn:aws:iam::&lt;account_id>:role/service-role/&lt;Glue_role_name>\"</span>\n            <span class=\"token punctuation\">]</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token punctuation\">{</span>\n            <span class=\"token 
property\">\"Sid\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Statement2\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Effect\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Allow\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Action\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n                <span class=\"token string\">\"iam:PassRole\"</span>\n            <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n            <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n                <span class=\"token string\">\"arn:aws:iam::&lt;account_id>:role/service-role/&lt;Glue_role_name>\"</span><span class=\"token punctuation\">,</span>\n                <span class=\"token string\">\"arn:aws:iam::&lt;account_id>:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift\"</span>\n            <span class=\"token punctuation\">]</span>\n        <span class=\"token punctuation\">}</span>\n    <span class=\"token punctuation\">]</span>\n<span class=\"token punctuation\">}</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 
0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Once the role is configured, head back to Glue > Connections and Test the connection.</p>\n<p><img src=\"https://github.com/user-attachments/assets/e3084661-2824-45c1-8a00-c57b9f320039\" alt=\"31\"></p>\n<p>If the test is successfule, head back to your Glue job and run the job. If every step was followed, the job run will succeed!</p>\n<p>Head to the Redshift Query Editor to view the data in your tables.</p>\n<p><img src=\"https://github.com/user-attachments/assets/0c9e0f71-809c-43f6-a54a-191704872303\" alt=\"32\"></p>\n<p><img src=\"https://github.com/user-attachments/assets/663e0c24-7486-4247-ab54-9fd06d5ccf57\" alt=\"33\"></p>\n<h3>5. Automate ETL pipeline using AWS Step Functions</h3>\n<p>In this section, let's use AWS Step Functions to build and do a daily run of the complete ETL pipeline, from data ingestion using Lambda and storing in S3, to transforming in Glue and storing in Redshift. We'll use the existing AWS Eventbridge rule to trigger the Step function, so that it runs daily at 12.30 am AEST.</p>\n<p>Head to AWS Step Functions and start creating a state machine. 
</p>\n<p>In the <code class=\"language-text\">Config</code> tab, add a name to the state machine and select the type as standard.</p>\n<p><img src=\"https://github.com/user-attachments/assets/4a292d81-d23c-42b6-aa6d-86bb3aa2f574\" alt=\"34\"></p>\n<p>Go to the <code class=\"language-text\">Design</code> tab and add Lambda as the first node. Select the Lambda function to get video data as the API argument.</p>\n<p><img src=\"https://github.com/user-attachments/assets/84427787-5815-4982-a85c-af3d21c06463\" alt=\"35\"></p>\n<p>Similarly, create another Lambda node to run the function to get video statistics.</p>\n<p>Next, add a Glue <code class=\"language-text\">StartJobRun</code> node and add the Glue job name in API parameters.</p>\n<p><img src=\"https://github.com/user-attachments/assets/ad8fdf0a-c84c-4d09-a8a4-0a655e200210\" alt=\"36\"></p>\n<p>Click to create the state machine. You'll be informed about the IAM role that will be created to access the necessary services from the state machine.</p>\n<p><img src=\"https://github.com/user-attachments/assets/edde453d-77c7-4517-b58e-7e8dd3cd2295\" alt=\"37\"></p>\n<p>Next, head to EventBridge and select the rule we created previously to run the Lambda functions. Edit the rule to add the state machine as the new target, and remove the Lambda functions from the rule targets.</p>\n<p><img src=\"https://github.com/user-attachments/assets/87360b9d-caf5-434d-81c1-dc88d6066598\" alt=\"38\"></p>\n<p>Now, the ETL process should run every day and store the new data in the Redshift tables.</p>\n<h3>6. Set up email updates for the ETL process</h3>\n<p>Head to the SNS service and create a Standard topic.</p>\n<p><img src=\"https://github.com/user-attachments/assets/9e380270-cbf5-4c3b-97a8-42b7d853020c\" alt=\"39\"></p>\n<p>In the created topic, create a subscription using your preferred email. 
Make sure to check your email inbox and confirm the subscription by clicking the confirmation link.</p>\n<p><img src=\"https://github.com/user-attachments/assets/2c309c04-f62d-4ad2-ac7e-25272ef879ef\" alt=\"40\"></p>\n<p>Head back to the Step Function created previously, and add in steps to check the Glue job status and publish status to the SNS topic.</p>\n<p><img src=\"https://github.com/user-attachments/assets/43335570-ba54-4868-aece-e413c0a62df4\" alt=\"41\"></p>\n<p>When the steps are added, the state machine definition will be as below.</p>\n<div class=\"gatsby-highlight\" data-language=\"json\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-json line-numbers\"><code class=\"language-json\"><span class=\"token punctuation\">{</span>\n  <span class=\"token property\">\"QueryLanguage\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"JSONata\"</span><span class=\"token punctuation\">,</span>\n  <span class=\"token property\">\"Comment\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"A description of my state machine\"</span><span class=\"token punctuation\">,</span>\n  <span class=\"token property\">\"StartAt\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"getData\"</span><span class=\"token punctuation\">,</span>\n  <span class=\"token property\">\"States\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n    <span class=\"token property\">\"getData\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n      <span class=\"token property\">\"Type\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Task\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"arn:aws:states:::lambda:invoke\"</span><span class=\"token 
punctuation\">,</span>\n      <span class=\"token property\">\"Output\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"{% $states.result.Payload %}\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Arguments\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token property\">\"FunctionName\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"&lt;Lambda function ARN>\"</span>\n      <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Retry\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n        <span class=\"token punctuation\">{</span>\n          <span class=\"token property\">\"ErrorEquals\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n            <span class=\"token string\">\"Lambda.ServiceException\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"Lambda.AWSLambdaException\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"Lambda.SdkClientException\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"Lambda.TooManyRequestsException\"</span>\n          <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"IntervalSeconds\"</span><span class=\"token operator\">:</span> <span class=\"token number\">1</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"MaxAttempts\"</span><span class=\"token operator\">:</span> <span class=\"token number\">3</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"BackoffRate\"</span><span class=\"token operator\">:</span> <span 
class=\"token number\">2</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"JitterStrategy\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"FULL\"</span>\n        <span class=\"token punctuation\">}</span>\n      <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Next\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"getStats\"</span>\n    <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n    <span class=\"token property\">\"getStats\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n      <span class=\"token property\">\"Type\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Task\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"arn:aws:states:::lambda:invoke\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Output\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"{% $states.result.Payload %}\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Arguments\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token property\">\"FunctionName\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"&lt;Lambda function ARN>\"</span>\n      <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Retry\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n        <span class=\"token punctuation\">{</span>\n          <span class=\"token property\">\"ErrorEquals\"</span><span 
class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n            <span class=\"token string\">\"Lambda.ServiceException\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"Lambda.AWSLambdaException\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"Lambda.SdkClientException\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"Lambda.TooManyRequestsException\"</span>\n          <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"IntervalSeconds\"</span><span class=\"token operator\">:</span> <span class=\"token number\">1</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"MaxAttempts\"</span><span class=\"token operator\">:</span> <span class=\"token number\">3</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"BackoffRate\"</span><span class=\"token operator\">:</span> <span class=\"token number\">2</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"JitterStrategy\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"FULL\"</span>\n        <span class=\"token punctuation\">}</span>\n      <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Next\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Glue StartJobRun\"</span>\n    <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n    <span class=\"token property\">\"Glue StartJobRun\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n      <span class=\"token property\">\"Type\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Task\"</span><span 
class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"arn:aws:states:::glue:startJobRun.sync\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Arguments\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token property\">\"JobName\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"&lt;Glue job name>\"</span>\n      <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Assign\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token property\">\"JobRunState\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"{% $states.result.JobRunState %}\"</span>\n      <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Next\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Check Job Status\"</span>\n    <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n    <span class=\"token property\">\"Check Job Status\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n      <span class=\"token property\">\"Type\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Choice\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Choices\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n        <span class=\"token punctuation\">{</span>\n          <span class=\"token property\">\"Condition\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"{% $JobRunState = 'SUCCEEDED' %}\"</span><span 
class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"Next\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Send Success Notification\"</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token punctuation\">{</span>\n          <span class=\"token property\">\"Condition\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"{% $JobRunState = 'FAILED' %}\"</span><span class=\"token punctuation\">,</span>\n          <span class=\"token property\">\"Next\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Send Failure Notification\"</span>\n        <span class=\"token punctuation\">}</span>\n      <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Default\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Handle Unknown State\"</span>\n    <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n    <span class=\"token property\">\"Send Success Notification\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n      <span class=\"token property\">\"Type\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Task\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"arn:aws:states:::sns:publish\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Arguments\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token property\">\"TopicArn\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"&lt;SNS topic ARN>\"</span><span class=\"token punctuation\">,</span>\n 
       <span class=\"token property\">\"Message\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Glue job completed successfully!\"</span><span class=\"token punctuation\">,</span>\n        <span class=\"token property\">\"Subject\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Glue Job Success\"</span>\n      <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"End\"</span><span class=\"token operator\">:</span> <span class=\"token boolean\">true</span>\n    <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n    <span class=\"token property\">\"Send Failure Notification\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n      <span class=\"token property\">\"Type\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Task\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"arn:aws:states:::sns:publish\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Arguments\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token property\">\"TopicArn\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"&lt;SNS topic ARN>\"</span><span class=\"token punctuation\">,</span>\n        <span class=\"token property\">\"Message\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Glue job failed. 
Please check logs for details.\"</span><span class=\"token punctuation\">,</span>\n        <span class=\"token property\">\"Subject\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Glue Job Failure\"</span>\n      <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"End\"</span><span class=\"token operator\">:</span> <span class=\"token boolean\">true</span>\n    <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n    <span class=\"token property\">\"Handle Unknown State\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">{</span>\n      <span class=\"token property\">\"Type\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Fail\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Error\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"UnknownJobState\"</span><span class=\"token punctuation\">,</span>\n      <span class=\"token property\">\"Cause\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"The Glue job state was neither SUCCEEDED nor FAILED.\"</span>\n    <span class=\"token punctuation\">}</span>\n  <span class=\"token punctuation\">}</span>\n<span class=\"token punctuation\">}</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 
0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Update the related IAM role with below to enable publishing messages to the SNS topic.</p>\n<div class=\"gatsby-highlight\" data-language=\"json\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-json line-numbers\"><code class=\"language-json\">        <span class=\"token punctuation\">{</span>\n\t\t\t<span class=\"token property\">\"Sid\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Statement1\"</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Effect\"</span><span class=\"token operator\">:</span> <span class=\"token 
string\">\"Allow\"</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Action\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n\t\t\t\t<span class=\"token string\">\"sns:Publish\"</span>\n\t\t\t<span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n\t\t\t<span class=\"token property\">\"Resource\"</span><span class=\"token operator\">:</span> <span class=\"token punctuation\">[</span>\n\t\t\t\t<span class=\"token string\">\"&lt;SNS topic ARN>\"</span>\n\t\t\t<span class=\"token punctuation\">]</span>\n\t\t<span class=\"token punctuation\">}</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Execute the state machine to test the ETL process. If every step succeeds, you'll receive an email confirming that the Glue job completed successfully.</p>","frontmatter":{"date":"January 12, 2025","path":"/youtube-data-transform-aws","title":"Transform and Load YouTube data in S3 using AWS Glue","tags":["Project","AWS","ETL","AWS Glue"]}}},"pageContext":{}},"staticQueryHashes":["3649515864"]}