{"componentChunkName":"component---src-templates-blog-post-js","path":"/generate-your-own-dataset","result":{"data":{"markdownRemark":{"html":"<p><img src=\"https://user-images.githubusercontent.com/10103699/202613539-3e9bcbd9-644e-4dd9-9722-05213ead5eea.jpg\" alt=\"data-gen1\"></p>\n<h6><em>Photo by <a href=\"https://unsplash.com/@techdailyca?utm_source=unsplash&#x26;utm_medium=referral&#x26;utm_content=creditCopyText\">Tech Daily</a> on <a href=\"https://unsplash.com\">Unsplash</a></em></h6>\n<p>Hey there!</p>\n<p>Have you ever had an idea for a personal project but couldn't find the right dataset?</p>\n<p>I was in your shoes when I wanted to create a Power BI report for HR (Human Resources) data with a monthly breakdown.\n<br>\nI also wanted to update it every month with a new data file, while keeping the old data files in place. </p>\n<p>To get started, I needed a dataset with employee data for several months, with a separate file for each month. </p>\n<p>Fortunately, I came across the HR dataset created by <a href=\"http://www.obvience.com/\">obviEnce</a>, which is also the\ndataset used in the <a href=\"https://learn.microsoft.com/en-us/power-bi/create-reports/sample-human-resources\">Microsoft Power BI HR sample</a>.\nIt has employee data for each month in a separate CSV file, which I could use to build a number of HR\ndata metrics like headcount and number of new hires. </p>\n<p>However, to make the project more interesting, I thought of creating another set of data files to replicate a\nsecond data source. This dataset would contain data from the hiring process and help create other metrics such as the number of\nvacancies and filled roles in the company over time. </p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/204405214-838c965b-49fa-41a2-8950-1cbd51092d3d.png\" alt=\"data-gen2\"></p>\n<h6><em>The Project Idea</em></h6>\n<p>In this post, I'll walk through how I used Python to generate the vacancy dataset. 
This may be useful if you\never want to build something similar and do not have access to an open dataset with multiple files.</p>\n<h3>Before we get started,</h3>\n<p>It may be useful to look at the pros and cons of this approach. If you're able to find one, it's always better to use a\nreal-world dataset over a generated one. It brings in the 'real-world' aspect, because it'll have actual patterns\nthat you can use to build a story around your data. </p>\n<p>However, because my main goal for this project is simply visualising\ndata using Power BI, a generated dataset is sufficient. I tried to bring in the real-world aspect by using\nrandom sampling, a hiring timeline and predefined proportions to add variety to the data. I'll explain how to\nincorporate these ideas in the next sections of this post.</p>\n<h3>A brief look at the first data source - Employee data</h3>\n<p>Before we dive into vacancy data, let's first look at the data from obviEnce. Because both data sources should\ncome from the same company, the vacancy dataset will need to share some information with this data. 
</p>\n<p>These are the columns from our first data source:</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/205765871-e0ddcf10-ff3c-4f7b-a312-6740aa5605e6.png\" alt=\"data-gen3\"></p>\n<h6><em>Data Description for the first dataset</em></h6>\n<p>From these variables, I've chosen to use the 'FP' and 'BU Region' columns to indicate vacancies for\nfull/part-time positions in different BU regions.</p>\n<p>Let's also have a look at the count of new hires (those who started working, not just hired) from the Employee dataset.</p>\n<p>In my Python notebook, I first import the packages.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token keyword\">import</span> pandas <span class=\"token keyword\">as</span> pd\n<span class=\"token keyword\">import</span> numpy <span class=\"token keyword\">as</span> np\n<span class=\"token keyword\">import</span> random\n<span class=\"token keyword\">import</span> calendar\n<span class=\"token keyword\">import</span> glob\n<span class=\"token keyword\">import</span> os\n\n<span class=\"token keyword\">import</span> matplotlib<span class=\"token punctuation\">.</span>pyplot <span class=\"token keyword\">as</span> plt\n<span class=\"token operator\">%</span>matplotlib inline</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>And plot the number of new hires:</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token comment\"># plot new hire count in employee data</span>\n\n<span class=\"token comment\"># get current working 
directory</span>\ncwd <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>abspath<span class=\"token punctuation\">(</span>os<span class=\"token punctuation\">.</span>getcwd<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n<span class=\"token comment\"># get all files in directory</span>\nfiles <span class=\"token operator\">=</span> glob<span class=\"token punctuation\">.</span>glob<span class=\"token punctuation\">(</span>cwd <span class=\"token operator\">+</span> <span class=\"token string\">\"/Employee data/*\"</span><span class=\"token punctuation\">)</span>\n\nmonths <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\nnewHires <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n\n<span class=\"token keyword\">for</span> <span class=\"token builtin\">file</span> <span class=\"token keyword\">in</span> files<span class=\"token punctuation\">:</span>\n    df <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>read_csv<span class=\"token punctuation\">(</span><span class=\"token builtin\">file</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># get the count of new hires</span>\n    newHires<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>df<span class=\"token punctuation\">[</span><span class=\"token string\">'isNewHire'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span><span class=\"token builtin\">sum</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># get the month for the file</span>\n    
months<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>pd<span class=\"token punctuation\">.</span>to_datetime<span class=\"token punctuation\">(</span>df<span class=\"token punctuation\">[</span><span class=\"token string\">'date'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> dayfirst<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>month<span class=\"token punctuation\">)</span>\n                 \n<span class=\"token comment\"># sort new hires based on month</span>\nnewHires <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span>x <span class=\"token keyword\">for</span> _<span class=\"token punctuation\">,</span>x <span class=\"token keyword\">in</span> <span class=\"token builtin\">sorted</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">zip</span><span class=\"token punctuation\">(</span>months<span class=\"token punctuation\">,</span> newHires<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span>\nmonths <span class=\"token operator\">=</span> <span class=\"token builtin\">sorted</span><span class=\"token punctuation\">(</span>months<span class=\"token punctuation\">)</span>\n<span class=\"token comment\"># replace month numbers with names</span>\nmonths <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span>calendar<span class=\"token punctuation\">.</span>month_name<span class=\"token punctuation\">[</span>i<span class=\"token punctuation\">]</span> <span class=\"token keyword\">for</span> i <span class=\"token keyword\">in</span> months<span class=\"token punctuation\">]</span>\n\n<span 
class=\"token comment\"># plot new hires</span>\nfig<span class=\"token punctuation\">,</span> ax <span class=\"token operator\">=</span> plt<span class=\"token punctuation\">.</span>subplots<span class=\"token punctuation\">(</span>figsize<span class=\"token operator\">=</span><span class=\"token punctuation\">(</span><span class=\"token number\">15</span><span class=\"token punctuation\">,</span><span class=\"token number\">8</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\nbars <span class=\"token operator\">=</span> ax<span class=\"token punctuation\">.</span>bar<span class=\"token punctuation\">(</span>months<span class=\"token punctuation\">,</span> newHires<span class=\"token punctuation\">)</span>\n<span class=\"token comment\"># add data labels</span>\n<span class=\"token keyword\">for</span> i<span class=\"token punctuation\">,</span>v <span class=\"token keyword\">in</span> <span class=\"token builtin\">enumerate</span><span class=\"token punctuation\">(</span>newHires<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    ax<span class=\"token punctuation\">.</span>text<span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">,</span>v<span class=\"token operator\">+</span><span class=\"token number\">20</span><span class=\"token punctuation\">,</span> <span class=\"token builtin\">int</span><span class=\"token punctuation\">(</span>v<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> ha<span class=\"token operator\">=</span><span class=\"token string\">'center'</span><span class=\"token punctuation\">)</span>\nplt<span class=\"token punctuation\">.</span>title<span class=\"token punctuation\">(</span><span class=\"token string\">'New Hires'</span><span class=\"token punctuation\">)</span>\nplt<span class=\"token punctuation\">.</span>show<span class=\"token punctuation\">(</span><span class=\"token 
punctuation\">)</span>\n\n<span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string\">'All new starters for the year: '</span><span class=\"token punctuation\">,</span> <span class=\"token builtin\">sum</span><span class=\"token punctuation\">(</span>newHires<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>which gives us the graph below:</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/205767315-0f06ff37-d228-477b-ad58-be72a56ef611.png\" alt=\"data-gen31\"></p>\n<h6><em>New hire count from the first dataset</em></h6>\n<p>As you can see, there is some variation in the total of new hires over the year, and the total new hire count is at 19073.\nTo keep the counts similar, we'll create 20000 vacancies in our vacancy dataset.</p>\n<h3>Timeframe for the dataset</h3>\n<p>Next, let's look at the time duration for which we will be generating data. 
As I'll be using the data from January\n2014 to December 2014 from the first data source, I've defined the same time period for the vacancy dataset as well.</p>\n<p>Let's define the number of vacancies for the year, and the start and end dates for the timeframe.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\">n_vacancies <span class=\"token operator\">=</span> <span class=\"token number\">20000</span>\nyear <span class=\"token operator\">=</span> <span class=\"token string\">'2014'</span>\nmin_date <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>to_datetime<span class=\"token punctuation\">(</span>year <span class=\"token operator\">+</span> <span class=\"token string\">'/01/01'</span><span class=\"token punctuation\">)</span>\nmax_date <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>to_datetime<span class=\"token punctuation\">(</span>year <span class=\"token operator\">+</span> <span class=\"token string\">'/12/31'</span><span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span></span></pre></div>\n<h3>Data Generation</h3>\n<p>I'll be following a two-step process in data generation. First, I'll create the complete list of vacancies for the\nfull year. </p>\n<p>In the next step, I'll be slicing and storing the data in separate files for each month.</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/204405237-33d7d7b3-23e6-42ce-9807-ac9469dc4620.png\" alt=\"data-gen4\"></p>\n<h6><em>Overview of the Data Generation Process</em></h6>\n<h3>1. 
Generate all vacancy data</h3>\n<h4>Generate IDs and add categories from first dataset</h4>\n<p>We'll get started by creating random, unique IDs for the vacancies,\n<br> and then randomly assigning categories from the <code class=\"language-text\">FP</code> and <code class=\"language-text\">BU Region</code> columns of the first dataset. </p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/204242537-cffc47d5-ec3c-4850-9040-2f3391072db4.png\" alt=\"data-gen5\"></p>\n<h6><em>First three columns in the vacancy dataset</em></h6>\n<p>In my function to generate all vacancy data, I'm adding the ID, FP and BU Region columns.\nTo generate IDs, I'm using the <code class=\"language-text\">random.sample()</code> function to do random sampling without replacement. To assign <code class=\"language-text\">FP</code> and <code class=\"language-text\">BU Region</code>,\nI'm using <code class=\"language-text\">np.random.choice()</code> from NumPy to pick items with replacement. </p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token comment\"># generate all vacancy data</span>\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">generate_vacancy_data</span><span class=\"token punctuation\">(</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    vacancy_df <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>DataFrame<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># create random ids for each row</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'ID'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> random<span class=\"token punctuation\">.</span>sample<span class=\"token 
punctuation\">(</span><span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span>n_vacancies<span class=\"token operator\">+</span><span class=\"token number\">1</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> n_vacancies<span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># randomly add values for FP and BU Region columns</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'FP'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>choice<span class=\"token punctuation\">(</span><span class=\"token punctuation\">[</span><span class=\"token string\">'F'</span><span class=\"token punctuation\">,</span><span class=\"token string\">'P'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> n_vacancies<span class=\"token punctuation\">,</span> replace<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n    regions <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span> <span class=\"token number\">2</span><span class=\"token punctuation\">,</span> <span class=\"token number\">3</span><span class=\"token punctuation\">,</span> <span class=\"token number\">4</span><span class=\"token punctuation\">,</span> <span class=\"token number\">5</span><span class=\"token punctuation\">,</span> <span class=\"token number\">6</span><span class=\"token punctuation\">,</span> <span class=\"token number\">7</span><span class=\"token punctuation\">,</span> <span class=\"token number\">8</span><span class=\"token punctuation\">,</span> <span 
class=\"token number\">9</span><span class=\"token punctuation\">,</span> <span class=\"token number\">10</span><span class=\"token punctuation\">,</span> <span class=\"token number\">11</span><span class=\"token punctuation\">,</span> <span class=\"token number\">12</span><span class=\"token punctuation\">,</span> <span class=\"token number\">13</span><span class=\"token punctuation\">,</span> <span class=\"token number\">14</span><span class=\"token punctuation\">,</span> <span class=\"token number\">15</span><span class=\"token punctuation\">,</span> <span class=\"token number\">16</span><span class=\"token punctuation\">,</span> <span class=\"token number\">17</span><span class=\"token punctuation\">,</span> <span class=\"token number\">18</span><span class=\"token punctuation\">,</span> \n                              <span class=\"token number\">19</span><span class=\"token punctuation\">,</span> <span class=\"token number\">20</span><span class=\"token punctuation\">,</span> <span class=\"token number\">21</span><span class=\"token punctuation\">,</span> <span class=\"token number\">22</span><span class=\"token punctuation\">,</span> <span class=\"token number\">23</span><span class=\"token punctuation\">,</span> <span class=\"token number\">24</span><span class=\"token punctuation\">,</span> <span class=\"token number\">94</span><span class=\"token punctuation\">,</span> <span class=\"token number\">95</span><span class=\"token punctuation\">,</span> <span class=\"token number\">96</span><span class=\"token punctuation\">,</span> <span class=\"token number\">97</span><span class=\"token punctuation\">,</span> <span class=\"token number\">98</span><span class=\"token punctuation\">,</span> <span class=\"token number\">99</span><span class=\"token punctuation\">]</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'BU Region'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> 
np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>choice<span class=\"token punctuation\">(</span>regions<span class=\"token punctuation\">,</span> n_vacancies<span class=\"token punctuation\">,</span> replace<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<h4>Generate dates based on the hiring timeline</h4>\n<p>Next, I'll be adding more columns to describe the status of vacancies over time. <br>\nFor this purpose, I've defined a hypothetical hiring timeline that we can assume the company follows in its hiring process.\nThe image below shows the major milestones for a single vacancy.  </p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/204225994-565c8661-d271-4f9a-b5dc-c6118608d6c3.png\" alt=\"data-gen6\"></p>\n<h6><em>Hiring timeline</em></h6>\n<p>I had fun creating this using my own imagination. The main purpose was to bring in some aspects of a\nreal-world hiring process. </p>\n<p>Each vacancy has an <code class=\"language-text\">Approved</code> date, which indicates when the vacancy was created. 
To generate random approved dates\nspread across the year, I'm using the <code class=\"language-text\">pd.to_timedelta()</code> function to add a random number of days to the first date of the year.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\">    <span class=\"token comment\"># get the number of days in the time period</span>\n    n_days <span class=\"token operator\">=</span> <span class=\"token punctuation\">(</span>max_date <span class=\"token operator\">-</span> min_date<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>days <span class=\"token operator\">+</span> <span class=\"token number\">1</span>\n    <span class=\"token comment\"># generate random approved dates</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Approved'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> min_date <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>n_days<span class=\"token punctuation\">,</span> size<span class=\"token operator\">=</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Companies usually have some of their vacancies put on hold. To replicate this, I put a fixed 1% of all\nvacancies on hold. 
The <code class=\"language-text\">On hold</code> date would fall within 10-30 days from the <code class=\"language-text\">Approved</code> date.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\">    <span class=\"token comment\"># add empty column filled with NaT</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>NaT\n    <span class=\"token comment\"># generate 1% on hold vacancies by selecting 1% of rows</span>\n    a <span class=\"token operator\">=</span> vacancy_df<span class=\"token punctuation\">.</span>sample<span class=\"token punctuation\">(</span>frac<span class=\"token operator\">=</span><span class=\"token number\">0.01</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># generate random on hold dates for the selected rows in 10-30 days after approved date</span>\n    a<span class=\"token punctuation\">[</span><span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> a<span class=\"token punctuation\">[</span><span class=\"token string\">'Approved'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>low<span class=\"token operator\">=</span><span class=\"token number\">10</span><span class=\"token punctuation\">,</span> high<span class=\"token operator\">=</span><span class=\"token number\">30</span><span class=\"token punctuation\">,</span> size<span class=\"token 
operator\">=</span>a<span class=\"token punctuation\">.</span>shape<span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># replace modified rows in original dataset</span>\n    vacancy_df<span class=\"token punctuation\">.</span>loc<span class=\"token punctuation\">[</span>a<span class=\"token punctuation\">.</span>index<span class=\"token punctuation\">,</span> <span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> a<span class=\"token punctuation\">[</span><span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>dt<span class=\"token punctuation\">.</span>date</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>Then, I'm using the same approach to add the <code class=\"language-text\">Sourcing start</code>, <code class=\"language-text\">Interview start</code>, <code class=\"language-text\">Interview end</code>, <code class=\"language-text\">Offered</code> and <code class=\"language-text\">Filled</code>\ndates based on the hiring timeline. 
Finally, I've removed the date values for <code class=\"language-text\">On hold</code> vacancies.</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/204243025-97b91432-1cb4-4620-b5df-d6f8f854b306.png\" alt=\"data-gen7\"></p>\n<h6><em>Generated columns in the Vacancy dataset</em></h6>\n<p>With the addition of this section, we now have the complete function to generate all\nvacancy data. </p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token comment\"># generate all vacancy data</span>\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">generate_vacancy_data</span><span class=\"token punctuation\">(</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    vacancy_df <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>DataFrame<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># create random ids for each row</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'ID'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> random<span class=\"token punctuation\">.</span>sample<span class=\"token punctuation\">(</span><span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span>n_vacancies<span class=\"token operator\">+</span><span class=\"token number\">1</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> n_vacancies<span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># randomly add values for FP and BU Region columns</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token 
string\">'FP'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>choice<span class=\"token punctuation\">(</span><span class=\"token punctuation\">[</span><span class=\"token string\">'F'</span><span class=\"token punctuation\">,</span><span class=\"token string\">'P'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> n_vacancies<span class=\"token punctuation\">,</span> replace<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n    regions <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span> <span class=\"token number\">2</span><span class=\"token punctuation\">,</span> <span class=\"token number\">3</span><span class=\"token punctuation\">,</span> <span class=\"token number\">4</span><span class=\"token punctuation\">,</span> <span class=\"token number\">5</span><span class=\"token punctuation\">,</span> <span class=\"token number\">6</span><span class=\"token punctuation\">,</span> <span class=\"token number\">7</span><span class=\"token punctuation\">,</span> <span class=\"token number\">8</span><span class=\"token punctuation\">,</span> <span class=\"token number\">9</span><span class=\"token punctuation\">,</span> <span class=\"token number\">10</span><span class=\"token punctuation\">,</span> <span class=\"token number\">11</span><span class=\"token punctuation\">,</span> <span class=\"token number\">12</span><span class=\"token punctuation\">,</span> <span class=\"token number\">13</span><span class=\"token punctuation\">,</span> <span class=\"token number\">14</span><span class=\"token punctuation\">,</span> <span class=\"token number\">15</span><span class=\"token punctuation\">,</span> <span class=\"token 
number\">16</span><span class=\"token punctuation\">,</span> <span class=\"token number\">17</span><span class=\"token punctuation\">,</span> <span class=\"token number\">18</span><span class=\"token punctuation\">,</span> \n                              <span class=\"token number\">19</span><span class=\"token punctuation\">,</span> <span class=\"token number\">20</span><span class=\"token punctuation\">,</span> <span class=\"token number\">21</span><span class=\"token punctuation\">,</span> <span class=\"token number\">22</span><span class=\"token punctuation\">,</span> <span class=\"token number\">23</span><span class=\"token punctuation\">,</span> <span class=\"token number\">24</span><span class=\"token punctuation\">,</span> <span class=\"token number\">94</span><span class=\"token punctuation\">,</span> <span class=\"token number\">95</span><span class=\"token punctuation\">,</span> <span class=\"token number\">96</span><span class=\"token punctuation\">,</span> <span class=\"token number\">97</span><span class=\"token punctuation\">,</span> <span class=\"token number\">98</span><span class=\"token punctuation\">,</span> <span class=\"token number\">99</span><span class=\"token punctuation\">]</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'BU Region'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>choice<span class=\"token punctuation\">(</span>regions<span class=\"token punctuation\">,</span> n_vacancies<span class=\"token punctuation\">,</span> replace<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n    \n    <span class=\"token comment\"># get the number of days in the time period</span>\n    n_days <span class=\"token operator\">=</span> <span class=\"token punctuation\">(</span>max_date <span 
class=\"token operator\">-</span> min_date<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>days <span class=\"token operator\">+</span> <span class=\"token number\">1</span>\n    <span class=\"token comment\"># generate random approved dates</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Approved'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> min_date <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>n_days<span class=\"token punctuation\">,</span> size<span class=\"token operator\">=</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># add empty column filled with NaT</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>NaT\n    <span class=\"token comment\"># generate 1% on hold vacancies by selecting 1% of rows</span>\n    a <span class=\"token operator\">=</span> vacancy_df<span class=\"token punctuation\">.</span>sample<span class=\"token punctuation\">(</span>frac<span class=\"token operator\">=</span><span class=\"token number\">0.01</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># generate random on hold dates for the selected rows in 10-30 days after approved date</span>\n    a<span class=\"token punctuation\">[</span><span class=\"token string\">'On hold'</span><span class=\"token 
punctuation\">]</span> <span class=\"token operator\">=</span> a<span class=\"token punctuation\">[</span><span class=\"token string\">'Approved'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>low<span class=\"token operator\">=</span><span class=\"token number\">10</span><span class=\"token punctuation\">,</span> high<span class=\"token operator\">=</span><span class=\"token number\">30</span><span class=\"token punctuation\">,</span> size<span class=\"token operator\">=</span>a<span class=\"token punctuation\">.</span>shape<span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># replace modified rows in original dataset</span>\n    vacancy_df<span class=\"token punctuation\">.</span>loc<span class=\"token punctuation\">[</span>a<span class=\"token punctuation\">.</span>index<span class=\"token punctuation\">,</span> <span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> a<span class=\"token punctuation\">[</span><span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>dt<span class=\"token punctuation\">.</span>date\n    \n    <span class=\"token comment\"># generate sourcing start date to be within 5-10 days after approved date</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Sourcing 
start'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Approved'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>low<span class=\"token operator\">=</span><span class=\"token number\">5</span><span class=\"token punctuation\">,</span> high<span class=\"token operator\">=</span><span class=\"token number\">10</span><span class=\"token punctuation\">,</span> size<span class=\"token operator\">=</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># generate Interview start date to be within 10-20 days after sourcing start date</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Interview start'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Sourcing start'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>low<span class=\"token operator\">=</span><span class=\"token number\">10</span><span class=\"token punctuation\">,</span> high<span class=\"token operator\">=</span><span class=\"token number\">20</span><span class=\"token 
punctuation\">,</span> size<span class=\"token operator\">=</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># generate Interview end date to be within 15-30 days after interview start date</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Interview end'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Interview start'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>low<span class=\"token operator\">=</span><span class=\"token number\">15</span><span class=\"token punctuation\">,</span> high<span class=\"token operator\">=</span><span class=\"token number\">30</span><span class=\"token punctuation\">,</span> size<span class=\"token operator\">=</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># generate Offered date to be within 5-10 days after interview end date</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Offered'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Interview end'</span><span class=\"token punctuation\">]</span> <span 
class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>low<span class=\"token operator\">=</span><span class=\"token number\">5</span><span class=\"token punctuation\">,</span> high<span class=\"token operator\">=</span><span class=\"token number\">10</span><span class=\"token punctuation\">,</span> size<span class=\"token operator\">=</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># generate Filled date to be within 5-10 days after offered date</span>\n    vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Filled'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> vacancy_df<span class=\"token punctuation\">[</span><span class=\"token string\">'Offered'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>randint<span class=\"token punctuation\">(</span>low<span class=\"token operator\">=</span><span class=\"token number\">5</span><span class=\"token punctuation\">,</span> high<span class=\"token operator\">=</span><span class=\"token number\">10</span><span class=\"token punctuation\">,</span> size<span class=\"token operator\">=</span>n_vacancies<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token 
punctuation\">)</span>\n\n    <span class=\"token comment\"># remove values for on hold vacancies</span>\n    vacancy_df<span class=\"token punctuation\">.</span>loc<span class=\"token punctuation\">[</span>a<span class=\"token punctuation\">.</span>index<span class=\"token punctuation\">,</span> <span class=\"token punctuation\">[</span><span class=\"token string\">'Sourcing start'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'Interview start'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'Interview end'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'Offered'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'Filled'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>NaT\n\n    <span class=\"token keyword\">return</span> vacancy_df</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>I'm sure you're curious to see how it would turn out after running this function. 
</p>\n<p>And we've got the answer here:</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token comment\"># generate complete dataset</span>\nvacancy_df <span class=\"token operator\">=</span> generate_vacancy_data<span class=\"token punctuation\">(</span>n_vacancies<span class=\"token punctuation\">)</span>\nvacancy_df</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span></span></pre></div>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/205769073-e9d210b6-bf83-4790-9e8e-d35e537392e0.png\" alt=\"data-gen8\"></p>\n<h6><em>Vacancy data generation output</em></h6>\n<h4>Why am I getting a different output?</h4>\n<p>If you try this out, you may notice that your output differs from what's shown here. This is because we use random\nnumbers to generate the data. If you prefer to get the same output every time, set the seed for random numbers before\nexecuting the above function. This needs to be done for both the Python and NumPy random number generators, because each keeps its own state. 
</p>\n<p><a href=\"https://builtin.com/data-science/numpy-random-seed\">This post</a> is a helpful guide to using random seeds in NumPy.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token comment\"># set seed for Python random generator</span>\nrandom<span class=\"token punctuation\">.</span>seed<span class=\"token punctuation\">(</span><span class=\"token number\">42</span><span class=\"token punctuation\">)</span>\n\n<span class=\"token comment\"># create NumPy random number generator</span>\nrand_gen <span class=\"token operator\">=</span> np<span class=\"token punctuation\">.</span>random<span class=\"token punctuation\">.</span>RandomState<span class=\"token punctuation\">(</span><span class=\"token number\">42</span><span class=\"token punctuation\">)</span>\n\n<span class=\"token comment\"># replace np.random with rand_gen in the function, and pass random_state=rand_gen to sample()</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<h3>2. Split data into separate files</h3>\n<p>Now that we have the complete vacancy data for the year, we can move to the next step and separate the data for each month. </p>\n<p>For each month, I'm going to keep the dates on or before the last day of that month, and replace any later\ndates with NaT.</p>\n<p>Next, I'll remove the roles whose <code class=\"language-text\">Filled</code> date falls in a previous month.</p>\n<p>Finally, I'll add a new column called <code class=\"language-text\">Status</code> to indicate the status of each vacancy at the end of the month. 
</p>\n<p>I'll create a new function for this purpose, and run it for each month in our time period.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token comment\"># create data files for each month</span>\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">create_monthly_df</span><span class=\"token punctuation\">(</span>month<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    <span class=\"token comment\"># get start date for the month</span>\n    month_start <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>to_datetime<span class=\"token punctuation\">(</span>year <span class=\"token operator\">+</span> <span class=\"token string\">'/'</span><span class=\"token operator\">+</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>month<span class=\"token punctuation\">)</span><span class=\"token operator\">+</span><span class=\"token string\">'/01'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># get end date for the month</span>\n    month_end <span class=\"token operator\">=</span> month_start <span class=\"token operator\">+</span> pd<span class=\"token punctuation\">.</span>to_timedelta<span class=\"token punctuation\">(</span>calendar<span class=\"token punctuation\">.</span>monthrange<span class=\"token punctuation\">(</span><span class=\"token builtin\">int</span><span class=\"token punctuation\">(</span>year<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> month<span class=\"token punctuation\">)</span><span class=\"token punctuation\">[</span><span class=\"token number\">1</span><span class=\"token punctuation\">]</span><span class=\"token operator\">-</span><span class=\"token number\">1</span><span class=\"token 
punctuation\">,</span> unit<span class=\"token operator\">=</span><span class=\"token string\">'d'</span><span class=\"token punctuation\">)</span>\n    \n    <span class=\"token comment\"># create monthly data</span>\n    monthly_df <span class=\"token operator\">=</span> vacancy_df<span class=\"token punctuation\">.</span>copy<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n    \n    <span class=\"token comment\"># replace dates after month end date with NaT</span>\n    monthly_df<span class=\"token punctuation\">[</span>monthly_df<span class=\"token punctuation\">.</span>columns<span class=\"token punctuation\">[</span><span class=\"token number\">3</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> monthly_df<span class=\"token punctuation\">.</span>iloc<span class=\"token punctuation\">[</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">,</span><span class=\"token number\">3</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>where<span class=\"token punctuation\">(</span>monthly_df<span class=\"token punctuation\">.</span>iloc<span class=\"token punctuation\">[</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">,</span><span class=\"token number\">3</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">]</span><span class=\"token operator\">&lt;=</span>month_end<span class=\"token punctuation\">,</span> pd<span class=\"token punctuation\">.</span>NaT<span class=\"token punctuation\">)</span>\n    \n    <span class=\"token comment\"># replace rows with NaT if Filled date is in a previous month</span>\n    monthly_df<span class=\"token punctuation\">[</span>monthly_df<span class=\"token punctuation\">.</span>columns<span class=\"token 
punctuation\">[</span><span class=\"token number\">3</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> monthly_df<span class=\"token punctuation\">.</span>iloc<span class=\"token punctuation\">[</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">,</span><span class=\"token number\">3</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>where<span class=\"token punctuation\">(</span>\n        <span class=\"token punctuation\">(</span>pd<span class=\"token punctuation\">.</span>isnull<span class=\"token punctuation\">(</span>monthly_df<span class=\"token punctuation\">.</span>iloc<span class=\"token punctuation\">[</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">,</span><span class=\"token operator\">-</span><span class=\"token number\">1</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">|</span> <span class=\"token punctuation\">(</span>monthly_df<span class=\"token punctuation\">.</span>iloc<span class=\"token punctuation\">[</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">,</span><span class=\"token operator\">-</span><span class=\"token number\">1</span><span class=\"token punctuation\">]</span><span class=\"token operator\">>=</span>month_start<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> pd<span class=\"token punctuation\">.</span>NaT<span class=\"token punctuation\">)</span>\n    \n    <span class=\"token comment\"># remove rows with all blank values using index</span>\n    empty_index <span class=\"token operator\">=</span> monthly_df<span class=\"token punctuation\">[</span>monthly_df<span class=\"token 
punctuation\">.</span>iloc<span class=\"token punctuation\">[</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">,</span><span class=\"token number\">3</span><span class=\"token punctuation\">:</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>isnull<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span><span class=\"token builtin\">all</span><span class=\"token punctuation\">(</span>axis<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>index\n    monthly_df <span class=\"token operator\">=</span> monthly_df<span class=\"token punctuation\">.</span>drop<span class=\"token punctuation\">(</span>index<span class=\"token operator\">=</span>empty_index<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>reset_index<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># convert On hold column to datetime to be able to compare in the next step with idxmax</span>\n    monthly_df<span class=\"token punctuation\">[</span><span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>to_datetime<span class=\"token punctuation\">(</span>monthly_df<span class=\"token punctuation\">[</span><span class=\"token string\">'On hold'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> errors<span class=\"token operator\">=</span><span class=\"token string\">'coerce'</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># add status column by using the max date value from the specified columns</span>\n    monthly_df<span 
class=\"token punctuation\">[</span><span class=\"token string\">'Status'</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> monthly_df<span class=\"token punctuation\">[</span><span class=\"token punctuation\">[</span><span class=\"token string\">'Approved'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'On hold'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'Sourcing start'</span><span class=\"token punctuation\">,</span> \n                                       <span class=\"token string\">'Interview start'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'Interview end'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'Offered'</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'Filled'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>idxmax<span class=\"token punctuation\">(</span>axis<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># remove previous index column </span>\n    monthly_df <span class=\"token operator\">=</span> monthly_df<span class=\"token punctuation\">.</span>drop<span class=\"token punctuation\">(</span><span class=\"token punctuation\">[</span><span class=\"token string\">'index'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> axis<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">)</span>\n    \n    <span class=\"token comment\"># remove NaT values</span>\n    monthly_df <span class=\"token operator\">=</span> monthly_df<span class=\"token punctuation\">.</span>fillna<span class=\"token punctuation\">(</span><span class=\"token string\">''</span><span class=\"token 
punctuation\">)</span>\n    \n    <span class=\"token comment\"># write to csv</span>\n    file_name <span class=\"token operator\">=</span> <span class=\"token string\">'Vacancy data/'</span> <span class=\"token operator\">+</span>calendar<span class=\"token punctuation\">.</span>month_name<span class=\"token punctuation\">[</span>month<span class=\"token punctuation\">]</span> <span class=\"token operator\">+</span> <span class=\"token string\">'-'</span> <span class=\"token operator\">+</span> year <span class=\"token operator\">+</span> <span class=\"token string\">'.csv'</span>\n    monthly_df<span class=\"token punctuation\">.</span>to_csv<span class=\"token punctuation\">(</span>file_name<span class=\"token punctuation\">,</span> index<span class=\"token operator\">=</span><span class=\"token boolean\">False</span><span class=\"token punctuation\">)</span>\n\n\n<span class=\"token comment\"># generate monthly files</span>\n<span class=\"token keyword\">for</span> i <span class=\"token keyword\">in</span> <span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token number\">12</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    create_monthly_df<span class=\"token punctuation\">(</span>i<span class=\"token operator\">+</span><span class=\"token number\">1</span><span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 
0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>As you can see, we have a new <code class=\"language-text\">Status</code> column in the output for 2014 April.</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/205770015-add3cd6b-7214-45c9-8f28-4cc01ba618e7.png\" alt=\"data-gen9\"></p>\n<h6><em>First six rows of the final vacancy dataset output for 2014 April</em></h6>\n<p>If you tried this out, you will have 12 csv files created for vacancy data in each month.\n<br> Each file shows the status of the vacancies at the end of the month.</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/205769604-a6489b54-7b06-4535-ae82-ea91986e4402.png\" alt=\"data-gen10\"></p>\n<h6><em>Generated data files</em></h6>\n<p>I had a brief look at the count of filled roles over the year using the generated data.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-python line-numbers\"><code class=\"language-python\"><span class=\"token comment\"># plot filled roles in vacancy data</span>\n\n<span class=\"token comment\"># get all files in directory</span>\nvac_files <span class=\"token operator\">=</span> glob<span class=\"token punctuation\">.</span>glob<span class=\"token punctuation\">(</span>cwd <span class=\"token operator\">+</span> <span class=\"token string\">\"/Vacancy data/*\"</span><span class=\"token punctuation\">)</span>\n\nmonths <span class=\"token 
operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\nfilledRoles <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n\n<span class=\"token keyword\">for</span> <span class=\"token builtin\">file</span> <span class=\"token keyword\">in</span> vac_files<span class=\"token punctuation\">:</span>\n    df <span class=\"token operator\">=</span> pd<span class=\"token punctuation\">.</span>read_csv<span class=\"token punctuation\">(</span><span class=\"token builtin\">file</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># get the count of filled roles</span>\n    filledRoles<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>df<span class=\"token punctuation\">[</span>df<span class=\"token punctuation\">[</span><span class=\"token string\">'Status'</span><span class=\"token punctuation\">]</span><span class=\"token operator\">==</span><span class=\"token string\">'Filled'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">[</span><span class=\"token string\">'ID'</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>count<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n    <span class=\"token comment\"># get the month for the file</span>\n    months<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>pd<span class=\"token punctuation\">.</span>to_datetime<span class=\"token punctuation\">(</span><span class=\"token string\">'1-'</span><span class=\"token operator\">+</span><span class=\"token builtin\">file</span><span class=\"token punctuation\">.</span>split<span class=\"token punctuation\">(</span><span class=\"token string\">'/'</span><span class=\"token punctuation\">)</span><span 
class=\"token punctuation\">[</span><span class=\"token operator\">-</span><span class=\"token number\">1</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>split<span class=\"token punctuation\">(</span><span class=\"token string\">'.'</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> dayfirst<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>month<span class=\"token punctuation\">)</span>\n                 \n<span class=\"token comment\"># sort filled roles based on month</span>\nfilledRoles <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span>x <span class=\"token keyword\">for</span> _<span class=\"token punctuation\">,</span>x <span class=\"token keyword\">in</span> <span class=\"token builtin\">sorted</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">zip</span><span class=\"token punctuation\">(</span>months<span class=\"token punctuation\">,</span> filledRoles<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span>\nmonths <span class=\"token operator\">=</span> <span class=\"token builtin\">sorted</span><span class=\"token punctuation\">(</span>months<span class=\"token punctuation\">)</span>\n<span class=\"token comment\"># replace month numbers with names</span>\nmonths <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span>calendar<span class=\"token punctuation\">.</span>month_name<span class=\"token punctuation\">[</span>i<span class=\"token punctuation\">]</span> <span class=\"token keyword\">for</span> i <span class=\"token keyword\">in</span> months<span class=\"token 
]</span>">
punctuation\">]</span>\n\n<span class=\"token comment\"># plot filled roles</span>\nfig<span class=\"token punctuation\">,</span> ax <span class=\"token operator\">=</span> plt<span class=\"token punctuation\">.</span>subplots<span class=\"token punctuation\">(</span>figsize<span class=\"token operator\">=</span><span class=\"token punctuation\">(</span><span class=\"token number\">15</span><span class=\"token punctuation\">,</span><span class=\"token number\">8</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\nbars <span class=\"token operator\">=</span> ax<span class=\"token punctuation\">.</span>bar<span class=\"token punctuation\">(</span>months<span class=\"token punctuation\">,</span> filledRoles<span class=\"token punctuation\">)</span>\n<span class=\"token comment\"># add data labels</span>\n<span class=\"token keyword\">for</span> i<span class=\"token punctuation\">,</span>v <span class=\"token keyword\">in</span> <span class=\"token builtin\">enumerate</span><span class=\"token punctuation\">(</span>filledRoles<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    ax<span class=\"token punctuation\">.</span>text<span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">,</span>v<span class=\"token operator\">+</span><span class=\"token number\">20</span><span class=\"token punctuation\">,</span> <span class=\"token builtin\">int</span><span class=\"token punctuation\">(</span>v<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> ha<span class=\"token operator\">=</span><span class=\"token string\">'center'</span><span class=\"token punctuation\">)</span>\nplt<span class=\"token punctuation\">.</span>title<span class=\"token punctuation\">(</span><span class=\"token string\">'Filled Roles'</span><span class=\"token punctuation\">)</span>\nplt<span class=\"token punctuation\">.</span>show<span class=\"token punctuation\">(</span><span 
class=\"token punctuation\">)</span>\n\n<span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string\">'All filled roles for the year: '</span><span class=\"token punctuation\">,</span> <span class=\"token builtin\">sum</span><span class=\"token punctuation\">(</span>filledRoles<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>The output shows that we have a fairly similar distribution over the year as I used random sampling to generate the data.</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/205770665-3ccfede3-f062-4006-8224-bb73c3010e36.png\" alt=\"data-gen11\"></p>\n<h6><em>Total number of filled roles using generated data</em></h6>\n<p>Although the distribution is different, we can use the generated data to explore more aspects related to the hiring\nprocess - such as time taken to source candidates and time taken to fill roles.</p>\n<p>You may be wondering about the different numbers for total hires from the first dataset, and the total filled\nroles from generated data. Usually in companies, these two values tend to be different because although a role\nwas filled in one month, the employee may start working on a later date.</p>\n<p>And that's it! That's how I generated my own dataset so that I can use it for my personal project. 
</p>\n<p>Check out my <a href=\"https://github.com/MalshaL/HR-data-visualisation\">GitHub repository</a> to see the complete code for\nthis project, and view the generated dataset on <a href=\"https://www.kaggle.com/datasets/malsha/employee-hiring-data\">Kaggle</a>.</p>\n<p>Have fun with Python!</p>","frontmatter":{"date":"November 18, 2022","path":"/generate-your-own-dataset","title":"Build your own dataset for personal projects","tags":["Datasets","Project","Python","Data Visualisation"]}}},"pageContext":{}},"staticQueryHashes":["3649515864"]}