<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The Data Science Swiss Army Knife]]></title><description><![CDATA[Super passionate up and coming data scientist documenting my journey!
I dedicate my time to learning and creating ML content (data science projects and blog pos]]></description><link>https://www.kamwithk.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 21 Apr 2026 09:34:43 GMT</lastBuildDate><atom:link href="https://www.kamwithk.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[2 Years of Machine Learning and Data Science]]></title><description><![CDATA[Over two years ago I started off as a high school student unsure of what to do in life; all I knew was that I'd always been enthusiastic about technology and learning how things work. Back then everything felt so overwhelming; the prospect of spending...]]></description><link>https://www.kamwithk.com/2-years-of-machine-learning-and-data-science</link><guid isPermaLink="true">https://www.kamwithk.com/2-years-of-machine-learning-and-data-science</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Programming Blogs]]></category><category><![CDATA[portfolio]]></category><category><![CDATA[coding]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Tue, 22 Mar 2022 23:19:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/hND1OG3q67k/upload/v1647954614663/R8rAcjUbc.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over two years ago I started off as a high school student unsure of what to do in life; all I knew was that I'd always been enthusiastic about technology and learning how things work. Back then everything felt so overwhelming; the prospect of spending the next three plus years at Uni (and potentially the rest of my life aha) on something which I barely knew anything about really did frighten me 😨.</p>
<p>This article/portfolio is a retrospective on my experience so far becoming a machine learning engineer/data scientist! I hope reading this leaves you feeling as inspired as I was when I randomly heard about data science for the first time!
I want to share some of my experiences so far, the cool things I've built and the knowledge I've gathered.</p>
<p><em>Through a large portion of my journey as a beginner I documented my learning process with detailed blogs and <a target="_blank" href="https://github.com/KamWithK">GitHub</a> repositories, which I link throughout this post; if you're interested in more detail, please check them out!</em></p>
<p>This post is structured in several sections:</p>
<ul>
<li>How I got into data science/machine learning</li>
<li>The passion projects I've worked on</li>
<li>Experience working/interning</li>
</ul>
<h1 id="heading-my-road-towards-data-science">My Road towards Data Science</h1>
<p>Back in 2019 when I was at a Monash open day I randomly came across a student engineering team called Monash DeepNeuron.
At the stall one of their founders showed me a small number of AI demos. I was blown away. I kind of knew AI existed before, but I'd never considered that a student with minimal technical knowledge could work on something as big as colourising black and white pictures or teaching an AI to play video games.
Straight after this I researched online as much as I could about how I could get started working on similar projects myself.
Watching videos like the <a target="_blank" href="https://www.youtube.com/watch?v=WXuK6gekU1Y">Alpha Go documentary</a> inspired me to dig deeper and see whether this was a realistic career to pursue (TLDR ~ I ended up joining and becoming an active member of Monash DeepNeuron as soon as I got into University)!</p>
<p>To try and reassure myself that this was 100% the right path for me, I made it my mission to reach out to as many professional data scientists, machine learning engineers and software engineers as I humanly could. ~2 years later, according to LinkedIn, I had successfully connected with 376 experts in the field! Although I can't say I've talked to every single person, I made a solid attempt every time and got to know a few people pretty well. The discussions I had early on were invaluable in giving me direction, motivation and several opportunities like internships.</p>
<p>A good month or so after talking to dozens of data scientists I decided to use my newfound knowledge/skills to <a target="_blank" href="https://www.kamwithk.com/snake-classification-report">create an app to detect snake species</a> 🐍... Projects, projects and more exciting applications of deep learning awaited me!</p>
<h1 id="heading-getting-experience-working-on-cool-projects">Getting Experience working on cool Projects</h1>
<p>Right from the beginning I knew I wanted to start working on projects so I could collaborate with others to create interesting solutions which showcase how AI can be used to tackle real problems (or at the least build small prototypes to show what that would look and feel like).
Here's a small overview of some of the opportunities I'm grateful to have worked on.</p>
<h2 id="heading-detecting-venomous-snakes">Detecting Venomous Snakes 🐍</h2>
<p>5.4 million people are bitten by snakes every year, with 81,000-138,000 recorded deaths due to snake bites. Maybe an app which could tell you a snake's species, and from that whether or not it's venomous, would be a useful aid to professional medical examinations? With this in mind I found a dataset of labeled snake images and created machine learning models to classify each image's species!</p>
<p>I had a chance to experiment with image scraping (which I didn't end up using here but utilised heavily in many future projects), various techniques to modify/enhance/increase image data and varying types of models and training techniques. At the beginning of this project I had absolutely no idea about how any of these things worked or what to do but slowly testing out different approaches to solving this one problem was incredibly engaging and helped build out my foundations.</p>
<p>I have several blog posts discussing my journey through this first project. If you're interested, please feel free to take a read 😉:</p>
<ul>
<li><a target="_blank" href="https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too">Progress and Journey Report</a></li>
<li><a target="_blank" href="https://www.kamwithk.com/snake-classification-report">Analysis of Results</a></li>
<li><a target="_blank" href="https://github.com/KamWithK/Snaked">GitHub Repository</a></li>
<li><a target="_blank" href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">Google Play Published App</a></li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647945812938/DfbqDdYsP.png" alt="image.png" /></p>
<h2 id="heading-predicting-energy-demand">Predicting Energy Demand 🤖</h2>
<p>Energy is used daily for phones, computers, washing machines, heaters and a vast array of appliances. I definitely can't imagine what it'd be like without it! Our dependence on electricity makes it critical to accurately predict how much energy we will need to generate on any given day.</p>
<p>At University I worked with a few fellow students to try and model energy usage via temperature data. This was a great opportunity to expand what I knew as "data science/machine learning". At first I thought it was all neural networks but during this project I had a chance to learn how time series forecasting usually works using "classical machine learning" (the old school algorithms, from when things were simpler 🥸).</p>
<p>I started by learning to <a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning">clean and structure the data</a>, <a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs">visualise it</a> and use the insights/knowledge I acquired from my analysis to <a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-3-modelling-with-decission-trees">build a model</a>.
During the process I realised that in reality, how well you do isn't defined by how smart or complicated your model is, but rather by how well you're able to break down the problem and transform your understanding of the domain into code.
I had several long discussions about our usage of energy, renewables and more with lecturers and a relative who researches renewable energy at Bloomberg.
These discussions helped tremendously in interpreting the numbers, tables and graphs I extracted (knowing what to look for, what types of graphs/visualisations/models to use, whether you're picking up on the right factors, etc).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647947506818/te9NJSdyi.png" alt="image.png" />
<a target="_blank" href="https://unsplash.com/photos/yETqkLnhsUI">Photo by Henry</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647947603485/INnE1OGjF.png" alt="image.png" /></p>
<p>Here are some links to articles I've written which present the exploratory analysis and modelling in the form of beginner tutorials:</p>
<ul>
<li><a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning">Data Cleaning</a></li>
<li><a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs">Story Telling through Visualisation</a></li>
<li><a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-3-modelling-with-decission-trees">Modelling with Decision Trees</a></li>
<li><a target="_blank" href="https://github.com/KamWithK/Temp2Enrgy">GitHub Repository</a></li>
</ul>
<h2 id="heading-generating-faces-and-comic-characters">Generating Faces and Comic Characters 🤡</h2>
<p>Straight after I got into Monash University I applied to and joined the student engineering AI research team Monash DeepNeuron, and was soon selected to join a team of 6 working on a research paper using Generative Adversarial Networks to transform CT scans into images of people's faces.
Up until this point, I'd worked on a small variety of projects but I didn't know about any of the big "subfields" of deep learning, nor had I worked in an actual team to develop anything "realistic". Hence I was super psyched when I found out I'd be able to contribute to a <a target="_blank" href="https://ieeexplore.ieee.org/document/9647290">research paper</a> which had real life applications for forensic facial reconstruction.</p>
<p>Since I was new I started by picking up small "tickets" or "issues" off the "kanban board" to work on. These were initially small things like "find and fix this bug" or "test xyz feature out" but eventually the project refactors and features I worked on broadened in scope. I ended up implementing our model training framework and setting it all up to work with a project tracking software called Weights and Biases.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647948372360/kD9BdIrm8.png" alt="image.png" /></p>
<p><a target="_blank" href="https://ieeexplore.ieee.org/document/9647290">Research Paper</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647953444540/7BlNgu5Z1.png" alt="image.png" /></p>
<p>Working on the project was absolutely amazing, as I learned an absolute tonne from my team lead and fellow teammates (who were all in their final year, graduating before me).</p>
<p>After the project finished up I was eager to learn more about how the model we were using worked and so decided to start my own small project to generate images of comic characters with another friend.
During this time I read up and <a target="_blank" href="https://github.com/KamWithK/Comic-Character-Generation">implemented my own GANs from scratch for the problem</a> 🤓 and very soon afterwards ended up running a workshop explaining to other Monash students how GANs can be used to generate images like faces and artwork!</p>
<p><a target="_blank" href="https://www.facebook.com/MonashDeepNeuron/posts/1126733647797240"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647949423351/h9wwX9VUm.png" alt="image.png" /></a></p>
<p>Here are some of the resources I created:</p>
<ul>
<li><a target="_blank" href="https://github.com/KamWithK/Comic-Character-Generation">Comic Character Generation GitHub Repository</a></li>
<li><a target="_blank" href="https://docs.google.com/presentation/d/1KqfEXFLlkrk8-bdMZPBU5L2fe8J8_crYBfAUkHW4WS4/edit?usp=sharing">Workshop Slides</a></li>
<li><a target="_blank" href="https://colab.research.google.com/github/DeepNeuron-AI/Training/blob/master/Workshops/Generative%20Modelling/GAN.ipynb">Workshop Code</a></li>
</ul>
<h2 id="heading-other-projects">Other Projects</h2>
<p>Quick mention that I was fortunate enough to work on several other small personal projects which you can find on my <a target="_blank" href="https://github.com/KamWithK">GitHub Profile</a>.
Here are a few examples:</p>
<ul>
<li><a target="_blank" href="https://github.com/KamWithK/JDRL">Creating a Racing Car Bot for Jelly Drift Video Game</a> - <a target="_blank" href="https://drive.google.com/file/d/1rb20GnL1WYJA3BtnmK89RQN5tMjdUyy0/view?usp=sharing">Video Recording of Progress</a></li>
<li><a target="_blank" href="https://www.kamwithk.com/big-data-from-public-apis-for-data-science-the-github-popularity-project">Scraping Huge Amounts of Data of GitHub</a></li>
<li><a target="_blank" href="https://github.com/KamWithK/AnkiconnectAndroid">Implementation of Desktop HTTP Rest Dictionary/Anki API within Android App for Language Learning</a> - Note this is a real project which several people I know use 🇯🇵 (I would talk more about it but doesn't involve ML)!</li>
<li><a target="_blank" href="https://github.com/KamWithK/JPVocabBuilder">Japanese Vocabulary Frequency Analysis with Rust and Web Assembly</a></li>
<li><a target="_blank" href="https://github.com/KamWithK/AllocatePlusPlus">Enhanced Dashboard to Allocate to Monash Classes</a></li>
</ul>
<h1 id="heading-internships-and-real-world-work-experience">Internships and Real World Work Experience</h1>
<p>Throughout University I've had internships at various small companies (here's a small summary of what I can talk about; due to NDAs I can't show results or discuss my work in detail).</p>
<p>I first started working with a small company called Penta Global, where I worked on a project to analyse the quality of coffee beans. I extracted data from sensors, analysed its statistics and ended up creating a series of map visualisations which illustrate how various sensor metrics (like heat and density) impact the quality of the coffee as it travels across the globe.</p>
<p><a target="_blank" href="https://penta.solutions/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952802134/XytQ0NVDd.png" alt="image.png" /></a></p>
<p>My second internship was at a data-centric real estate company called <a target="_blank" href="https://milkchoc.com.au/">Milk Chocolate Property</a>. I worked with their relational databases, APIs and geographic Python libraries to find and filter what particular locations their customers would be likely to want to live in.</p>
<p><a target="_blank" href="https://milkchoc.com.au/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952556835/gfIzuq9vI.png" alt="image.png" /></a></p>
<p>Afterwards I interned with the Monash Data Science and AI Platform, working on an NLP project to identify potential patients for medical trials based on their submitted health records.
<a target="_blank" href="https://www.monash.edu/researchinfrastructure/datascienceandai"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952855836/GLJlULMTc.png" alt="image.png" /></a></p>
<p>I currently work as a Junior Machine Learning Engineer in the AI division of a small search and recommendation company called <a target="_blank" href="https://systema.ai/">Systema AI</a>. My day-to-day work involves analysing our systems' performance with respect to client shop purchases, and helping maintain and improve various models and services.</p>
<p><a target="_blank" href="https://systema.ai/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952397556/YONJSMfpE.png" alt="image.png" /></a></p>
<p>To top it all off, for the bulk of my time at University (until very recently) I was the "Deep Learning Training Team Lead" at a <a target="_blank" href="https://www.deepneuron.org/">student engineering team Monash DeepNeuron</a>.
I led/managed a team of ~6 people creating and demoing a wide variety of deep learning workshops, where I helped teach students how it all works! Together we ran workshops and <a target="_blank" href="https://www.deepneuron.org/dl-blogs">blogs</a> on topics ranging from the foundational basics of "how neural networks work" and "how to start a project", to complex applications like "reinforcement learning" and "generative adversarial networks".</p>
<p><a target="_blank" href="https://www.deepneuron.org/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952459131/5qLulJ53A.png" alt="image.png" /></a></p>
<h1 id="heading-conclusions">Conclusions</h1>
<p>Over the last few years I've had a massive amount of fun working on heaps of projects with heaps of different people!
I've learnt so much from all of it: from my first project detecting images of snakes, to generating images of faces and comic characters, teaching other students how deep learning works, creating random applications and scripts to aid my language learning and personal life, and working at Systema on our smart search and recommendation models.</p>
<p>I hope this both illustrates my progress throughout my data science journey and motivates anyone who happens upon it: if you just give it a shot, you can gain a lot of skills whilst working on cool projects you enjoy 🤩!</p>
<p><a target="_blank" href="https://unsplash.com/photos/hND1OG3q67k">Cover Photo by Lucas</a></p>
]]></content:encoded></item><item><title><![CDATA[Gathering and Using Big Data from Public APIs for Data Science - The GitHub Popularity Project]]></title><description><![CDATA[Cool unique data makes for intriguing projects, so let's go find some on the web!
Today we'll get what we need to tell a story about the magic that makes GitHub projects popular ⭐🌟⭐.
Readmes, descriptions, languages... we'll collect it all.
So, let the ...]]></description><link>https://www.kamwithk.com/big-data-from-public-apis-for-data-science-the-github-popularity-project</link><guid isPermaLink="true">https://www.kamwithk.com/big-data-from-public-apis-for-data-science-the-github-popularity-project</guid><category><![CDATA[big data]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[APIs]]></category><category><![CDATA[GraphQL]]></category><category><![CDATA[scalability]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 16 Sep 2020 13:17:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1600260957533/2I2tKTOkn.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cool unique data makes for intriguing projects, so let's go find some on the web!
Today we'll get what we need to tell a story about the magic that makes GitHub projects popular ⭐🌟⭐.
Readmes, descriptions, languages... we'll collect it all.</p>
<p>So, let the public API and big data sorcery begin.</p>
<blockquote>
<p>WARNING: Collecting, storing and using mass data from public APIs won't be quick, easy or clean. Prepare to dial up the madness (you have been warned)...</p>
</blockquote>
<h1 id="the-story">The Story</h1>
<p>GitHub, such a beautiful place filled with amazing creative projects.
Some get popular, others stagnate and die.
It truly is <em>the circle of life, the wheel of fortune</em> (eloquently stated by Elton John).</p>
<p>Now we could use a massive (terabytes large 🤯) archive like <a target="_blank" href="https://www.gharchive.org/">GH Archive</a> or <a target="_blank" href="https://ghtorrent.org/">GHTorrent</a>, but I'm not looking to fry a computer (haha).
We could use <a target="_blank" href="https://cloud.google.com/bigquery/docs">Google BigQuery</a> to filter through this, but why not take a journey closer to the source 😏 by using the official API (side-note, it's cheaper).</p>
<p>With a good bit of <a target="_blank" href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">web scraping</a> experience under our belt, it surely can't be <em>that hard</em> to use a public API...
Nope, it doesn't work like that.
You get a little data, you see a few repositories... and then that's all for poor you 🥺.</p>
<p>But what if you want more 🥺😒?
<strong>What if you want a LOT MORE</strong>?
Well, I'd welcome you to the <em>slightly dodgy but still legitimate public API user club</em>.
Beginner lingo - If you play your cards right you can <strong>get what you need without waiting a hundred years</strong>.
Well, it's smooth sailing if you know what you're doing 🤯🤔.</p>
<p>Luckily, we can avoid the time-consuming pain of switching from technology to technology, by considering what's out there.
To start off, we can consider our familiar cozy tried and tested data science tools (Python with Pandas and Requests).
We'll consider what it does well, but also its major drawbacks (lots exist).
To come to our rescue, we'll discuss a few unique tools and techniques to bridge the gap between gathering and using data in a scalable way.</p>
<p>After we decide on a technology stack we can start looking at our API and (with thoughtful research) figure out ways to <em>optimise our search queries</em>:</p>
<ul>
<li>Break down large searches into multiple parts</li>
<li>Send multiple queries at once (asynchronously)</li>
<li>Get extra API time through multiple API access tokens (get your friends together, and maybe use a rotating proxy)</li>
</ul>
<p>After all the work, we can finally sit back and relax, knowing that we've got virtually every last drop of data 🤪.
Yes, 50 glorious gigabytes of GitHub readmes and stats 😱🥳!</p>
<h1 id="technology-stack">Technology Stack</h1>
<p>Before we get into the nitty-gritty details on how to collect data it's important to decide what technologies to use.
Normally this wouldn't be a big deal (you'd want to get started asap), but when collecting a large amount of data, we can't afford to have to rewrite everything with a different library (i.e. because of an overhead or general slow speed).</p>
<blockquote>
<p>With small datasets, library/framework choice doesn't matter much. With ~50 GB, your technology choices make or break the project!</p>
</blockquote>
<p>This section is pretty long, so here's a summary of the technologies I used along with alternate options I would use if I started over:</p>
<table>
<thead>
<tr>
<td>Task</td><td>Technology Used</td><td>Potentially Better Options</td></tr>
</thead>
<tbody>
<tr>
<td>Querying GitHub API</td><td><a target="_blank" href="https://github.com/aio-libs/aiohttp">AIOHttp</a> and <a target="_blank" href="https://docs.python.org/3/library/asyncio.html">AsyncIO</a></td><td><a target="_blank" href="https://github.com/apollographql/apollo-client">Apollo Client</a></td></tr>
<tr>
<td>Saving Data</td><td><a target="_blank" href="https://www.postgresql.org/">PostgreSQL</a></td><td><a target="_blank" href="https://parquet.apache.org/">Parquet</a> using <a target="_blank" href="https://arrow.apache.org/docs/python/parquet.html">PyArrow</a></td></tr>
<tr>
<td>Processing Data (Data Pipeline)</td><td><a target="_blank" href="https://dask.org/">Dask</a></td><td><a target="_blank" href="https://spark.apache.org/">Spark</a></td></tr>
<tr>
<td>Machine Learning Modelling (Ensembles)</td><td><a target="_blank" href="https://www.h2o.ai/">H2O</a></td><td><a target="_blank" href="https://xgboost.readthedocs.io/">XGBoost</a> or <a target="_blank" href="https://github.com/microsoft/LightGBM">LightGBM</a></td></tr>
<tr>
<td>Deep Learning and NLP Modelling</td><td><a target="_blank" href="https://pytorch.org/">PyTorch</a>/<a target="_blank" href="https://pytorch-lightning.readthedocs.io">PyTorch Lightning</a> with <a target="_blank" href="https://github.com/uber/petastorm">Petastorm</a> and <a target="_blank" href="https://huggingface.co/transformers">Hugging Face Transformers</a></td><td><a target="_blank" href="https://pytorch.org/">PyTorch</a> with <a target="_blank" href="https://github.com/uber/petastorm">Petastorm</a></td></tr>
</tbody>
</table>
<blockquote>
<p>Always create a working environment.yml or requirements.txt file listing all dependencies (like <a target="_blank" href="https://github.com/KamWithK/GitStarred/blob/master/environment.yml">ours</a>)</p>
</blockquote>
<p>There are multiple types of APIs; the most popular are REST and GraphQL.
GitHub has both; GraphQL is newer and allows us to carefully choose what information we want.
Normally you'd use an HTTP requests library for GraphQL, but dedicated clients (like <a target="_blank" href="https://github.com/apollographql/apollo-client">Apollo</a> or <a target="_blank" href="https://github.com/graphql-python/gql">GQL</a>) do exist.
It might not seem important to use a dedicated library, but they can do a lot for you (like handling pagination).
Without one, you'll need to handle asynchronous requests yourself (using threading or <a target="_blank" href="https://realpython.com/async-io-python/">async await</a>).
It isn't impossible but it is a burden to deal with (I, unfortunately, went this route).</p>
<p><em>Note that <a target="_blank" href="https://github.com/graphql-python/gql">GQL</a> is still under heavy development, so <a target="_blank" href="https://github.com/apollographql/apollo-client">Apollo Client</a> may be better (if you use JS).</em></p>
<p>The next step is to decide where to store your data.
Normally <a target="_blank" href="https://pandas.pydata.org/">Pandas</a> would be fine, but here... it's slow and unreliable (as it completely overwrites the data each time it saves).
The best option may seem like a database (they were designed to overcome these problems) like <a target="_blank" href="https://www.postgresql.org/">PostgreSQL</a>.
However, let me warn you right beforehand, <strong>databases... are horribly supported by machine and deep learning frameworks</strong>!
But... <a target="_blank" href="http://www.h5py.org/">HDF5</a>, <a target="_blank" href="https://parquet.apache.org/">Parquet</a> and <a target="_blank" href="https://lmdb.readthedocs.io/">LMDB</a> can work quite well.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1600261793384/u7le__2IL.jpeg" alt="database.jpg" /></p>
<p>Now that we've got all our data, it's time to consider how we will process and analyse it.
When using databases it's best to stick to <a target="_blank" href="https://spark.apache.org/">Apache Spark</a>.
<a target="_blank" href="https://spark.apache.org/">Spark</a> is nice to work with, supports reading/writing to nearly ANY format, works at scale and has support for <a target="_blank" href="https://www.h2o.ai/">H2O</a> (useful if you want to try out AutoML, but it is quite buggy).
The downside is that we ironically don't have enough data to make <a target="_blank" href="https://spark.apache.org/">Spark</a>'s overhead worth it (best for ~300+ GB datasets).
As long as you used <a target="_blank" href="http://www.h5py.org/">HDF5</a> or (better yet) <a target="_blank" href="https://parquet.apache.org/">Parquet</a> though, both <a target="_blank" href="https://dask.org/">Dask</a> and <a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a> should work like a charm.
<a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a> is a highly efficient dataframe library which allows us to process our data for an ML model (similarly to <a target="_blank" href="https://pandas.pydata.org/">Pandas</a>).
Although <a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a> is efficient, you might run into memory problems.
When you do, <a target="_blank" href="https://dask.org/">Dask</a>'s out-of-core functionality springs to life 😲!
We can also train classical <a target="_blank" href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning models</a> (ensembles like random forests and gradient boosted trees) through <a target="_blank" href="https://dask.org/">Dask</a> and <a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a>.
<a target="_blank" href="https://dask.org/">Dask</a> and <a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a> provide wrappers for <a target="_blank" href="https://scikit-learn.org/">Scikit Learn</a>, <a target="_blank" href="https://xgboost.readthedocs.io/">XGBoost</a> and <a target="_blank" href="https://github.com/microsoft/LightGBM">LightGBM</a>.</p>
<p>When it comes to deep learning, <a target="_blank" href="https://pytorch.org/">PyTorch</a> is a natural go-to!
It can use <a target="_blank" href="http://www.h5py.org/">HDF5</a> or <a target="_blank" href="https://lmdb.readthedocs.io/">LMDB</a> quite easily (with custom data loaders like <a target="_blank" href="https://github.com/Lyken17/Efficient-PyTorch">this one</a>, which you can either find or create yourself).
For anything else use <a target="_blank" href="https://github.com/uber/petastorm">Petastorm</a> (from Uber) to get data into PyTorch (by itself for <a target="_blank" href="https://parquet.apache.org/">Parquet</a> and otherwise with <a target="_blank" href="https://spark.apache.org/">Spark</a>).
A neat trick if you use <a target="_blank" href="https://spark.apache.org/">Spark</a> is to <strong>save the processed data into Parquet files</strong> so you can easily and quickly import them through <a target="_blank" href="https://github.com/uber/petastorm">Petastorm</a>.</p>
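<p><em>Here's a minimal sketch of that Parquet-to-PyTorch path (the dataset URL and column name are hypothetical) using Petastorm's batch reader, which reads vanilla Parquet without a special Petastorm schema:</em></p>
<pre><code class="lang-python">from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Stream batches straight out of a Parquet dataset
with make_batch_reader("file:///data/repos.parquet") as reader:
    with DataLoader(reader, batch_size=64) as loader:
        for batch in loader:
            # Each batch maps column names to tensors
            stars = batch["stars"]
</code></pre>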
<blockquote>
<p>Be very, very careful with the libraries you decide to use. It's easy for conflicts and errors to arise 😱</p>
</blockquote>
<p>With a solid technology stack, you and I are ready to get going!
<em>Quick side note - you'll quickly realise that <a target="_blank" href="https://www.apache.org/">Apache</a> is a HUGE player in the big-data world!</em></p>
<h1 id="assembling-a-query">Assembling a Query</h1>
<p>GraphQL is really finicky.
It complains about the simplest mistake, and so it can be difficult to figure out how to construct your search query.
The process to come up with an appropriate API request is as follows:</p>
<ul>
<li>Assemble a list of all the information you want/need (the number of stars and forks, readmes, descriptions, etc)</li>
<li>Read through the <a target="_blank" href="https://docs.github.com/en/graphql/reference/">official documentation</a> to find how to gather the basic elements</li>
<li>Google for anything else</li>
<li>Run your queries as you build/add to them, to ensure they work and to catch problems early</li>
</ul>
<p>It's a surprisingly long process, but it does pay off in the end.
Here's the final query for GitHub.</p>
<pre><code>query (<span class="hljs-variable">$after</span>: String, <span class="hljs-variable">$first</span>: Int, <span class="hljs-variable">$conditions</span>: String=<span class="hljs-string">"is:public sort:created"</span>) {
    search(query: <span class="hljs-variable">$conditions</span>, <span class="hljs-built_in">type</span>: REPOSITORY, first: <span class="hljs-variable">$first</span>, after: <span class="hljs-variable">$after</span>) {
        edges {
            node {
                ... on Repository {
                    name
                    id
                    description
                    forkCount
                    isFork
                    isArchived
                    isLocked
                    createdAt
                    pushedAt

                    primaryLanguage {
                        name
                    }

                    assignableUsers {
                        totalCount
                    }

                    stargazers {
                        totalCount
                    }

                    watchers {
                        totalCount
                    }

                    issues {
                        totalCount
                    }

                    pullRequests {
                        totalCount
                    }

                    repositoryTopics(first: 5) {
                        edges {
                            node {
                                topic {
                                    name
                                }
                            }
                        }
                    }

                    licenseInfo {
                        key
                    }

                    commits: object(expression: <span class="hljs-string">"master"</span>) {
                        ... on Commit {
                            <span class="hljs-built_in">history</span> {
                                totalCount
                            }
                        }
                    }

                    readme: object(expression: <span class="hljs-string">"master:README.md"</span>) {
                        ... on Blob {
                            text
                        }
                    }
                }
            }
        }
        pageInfo {
            hasNextPage
            endCursor
        }
    }
}
</code></pre><p><em>You can pass in the arguments/variables <code>after</code>, <code>first</code> and <code>conditions</code> through a separate JSON dictionary.</em></p>
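<p>For example, here's a hedged sketch (the token is a placeholder you fill in yourself) of sending the query above, with its variables, to GitHub's GraphQL endpoint using Requests:</p>
<pre><code class="lang-python">import requests

query = "..."  # the GraphQL document shown above
variables = {"first": 100, "after": None, "conditions": "is:public sort:created"}

response = requests.post(
    "https://api.github.com/graphql",
    json={"query": query, "variables": variables},
    headers={"Authorization": "bearer YOUR_TOKEN"},
)
repositories = response.json()["data"]["search"]["edges"]
</code></pre>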
<h1 id="challenging-your-query">Challenging Your Query</h1>
<h2 id="divide-and-conquer">Divide and Conquer</h2>
<blockquote>
<p>You wrote one nice simple query to find all your data? You Fool 🤡🥱</p>
</blockquote>
<p>Big companies are (mostly) smart.
They know that if they allow us to do anything with their API, we will use and abuse it frequently 👌.
So the easiest thing for them to do is to set strict restrictions!
The thing about these restrictions though is, that <strong>they don't completely stop you from gathering data, they just make it a lot harder</strong>.</p>
<p>The most fundamental limitation is the <em>amount you can query at once</em>.
On GitHub, it is:</p>
<ul>
<li>2000 queries per hour</li>
<li>1000 items per query</li>
<li>100 items per page</li>
</ul>
<p>To stay within these bounds whilst still being able to collect data we need to bundle lots of small queries together, whilst splitting apart single large queries.
Combining smaller queries is easy enough (string concatenation), but to split a large query apart requires some clever coding.
We can start by finding out how many items appear for a search (through the following GraphQL):</p>
<pre><code>query (<span class="hljs-variable">$conditions</span>: String=<span class="hljs-string">"is:public sort:created"</span>) {
    search(query: <span class="hljs-variable">$conditions</span>, <span class="hljs-built_in">type</span>: REPOSITORY) {
        repositoryCount
    }
}
</code></pre><p>By modifying the <code>conditions</code> variable, we can limit the search's scope (for example to just 2017-2018).
You can test this (or another) query with the official <a target="_blank" href="https://developer.github.com/v4/explorer/">online interactive GraphiQL explorer</a>.
In essence, we can create two smaller searches by dividing the original time period into two halves.</p>
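<p>For instance, halving 2017 just means narrowing GitHub's <code>created:</code> search qualifier inside the <code>conditions</code> string:</p>
<pre><code class="lang-python"># Splitting 2017 into two half-year searches
first_half = {"conditions": "is:public sort:created created:2017-01-01..2017-06-30"}
second_half = {"conditions": "is:public sort:created created:2017-07-01..2017-12-31"}
</code></pre>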
<p>If the returned count is greater than 1000, we'll need to create two independent queries which gather half the data each!
We can break a search in half by dividing the original period of time into two periods half as long.
By continuously repeating this division process, we'll <em>eventually</em> end up with a long list of valid searches!</p>
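<p>Here's a minimal sketch of that halving step (the real <code>half_period</code> lives in the repo linked below; this hypothetical version assumes plain datetime bounds):</p>
<pre><code class="lang-python">from datetime import datetime

def half_period(start: datetime, end: datetime):
    """Split one search period into two equally long halves."""
    middle = start + (end - start) / 2
    return [(start, middle), (middle, end)]

# e.g. one year becomes two ~6 month searches
halves = half_period(datetime(2017, 1, 1), datetime(2018, 1, 1))
</code></pre>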
<p>So in essence, here's what happens (continuously repeated through a <code>while True</code> loop):</p>
<pre><code class="lang-python">trial_periods = []

<span class="hljs-comment"># Handle one period at a time</span>
<span class="hljs-keyword">for</span> period, num_repos, is_valid <span class="hljs-keyword">in</span> valid_periods:
    <span class="hljs-keyword">if</span> is_valid == <span class="hljs-keyword">True</span> <span class="hljs-keyword">and</span> num_repos &gt; <span class="hljs-number">0</span>:
        <span class="hljs-comment"># Save the data</span>
        ...
    <span class="hljs-keyword">else</span>:
        <span class="hljs-comment"># Add to new list of still unfinished periods</span>
        trial_periods.extend(self.half_period(*period))

<span class="hljs-keyword">if</span> trial_periods == []: <span class="hljs-keyword">break</span>
...
</code></pre>
<p><em>Please see the <a target="_blank" href="https://github.com/KamWithK/GitStarred/">GitHub repo</a> for the full working code. This is just a sample to illustrate how it works on its own, without databases, async-await and other extraneous bits...</em></p>
<p>Do remember that if (like me) you're using the API through an HTTP client instead of a dedicated GraphQL one, you'll need to manage pagination yourself!
To do so you'll need to include in your query (after the huge <code>edges</code> part):</p>
<pre><code><span class="hljs-section">pageInfo</span> {
    <span class="hljs-attribute">hasNextPage</span>
    endCursor
}
</code></pre><p>Then pass the cursor as a variable for where to start the next query.</p>
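<p>As a hedged sketch (where <code>run_query</code> is a hypothetical stand-in for whatever function posts your request), the pagination loop could look like:</p>
<pre><code class="lang-python">after = None  # the first page starts with no cursor
edges = []

while True:
    # run_query is a hypothetical helper which POSTs the query and variables
    result = run_query(query, {"first": 100, "after": after})["data"]["search"]
    edges.extend(result["edges"])

    page_info = result["pageInfo"]
    if not page_info["hasNextPage"]:
        break
    after = page_info["endCursor"]  # resume where this page ended
</code></pre>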
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1600261982295/UmOHyIytB.jpeg" alt="message.jpg" /></p>
<h2 id="asynchronous-code">Asynchronous Code</h2>
<p>If you're using an HTTP client, it is important to know how to write code that runs blazing fast, ideally so multiple requests can be made at once.
This matters because the GitHub server can take a long time (from around 1 to 10 seconds) to respond to our queries!
One great way to do this is to use asynchronous libraries.
With asynchronous code, the Python interpreter (like our human brain) switches between tasks extremely fast.
So whilst we're waiting for our first query to return a response, our second one can be sent off straight away!
It might not seem like much, but it definitely is.</p>
<p>There are three ways to achieve this.
The easiest way (especially for beginners) is to use threads.
We can create 100 threads (arbitrary example), and launch one query on each.
The operating system will handle switching between them itself!
Once an operation finishes, the thread can be recycled and used for a separate query (or other operation)!</p>
<p>The second method is to utilize your computer's processes.
When we do this, we get multiple tasks to perform in parallel!
This is useful for high-CPU tasks (like data processing), but we only have a few cores (not everyone has a high-end i7 or Threadripper 🤣).</p>
<p>The third and final method is async-await.
It is similar in philosophy to the first (quickly switch between tasks), but instead of the OS handling it... we do!
The idea here is that the OS has a lot to do, but we don't, so it's better Python handles it itself.
<em>In theory, async-await is easier and quicker than writing thread-safe code</em> (but <strong>much, much harder in practice</strong>).
The primary reason for this is that <strong>asynchronous code can behave synchronously</strong>.
Simply put: if your design pattern is slightly off, you may have 0 performance gain!</p>
<p>I rewrote, redesigned and refactored my code a million times, and here's what I figured out:</p>
<ul>
<li>Use a <em>breadth-first approach</em> (i.e. maximise the number of queries you can run near-simultaneously, independent of anything else)</li>
<li>Avoid the <em>consumer-producer pattern</em> where one function produces items and the other consumes them (there aren't many guides or explanations of how to use it in practice, and it seems to arbitrarily limit itself)</li>
<li>Whenever you're trying to run multiple things simultaneously use <code>asyncio.gather(...)</code></li>
<li>Avoid <code>async</code> loops; in practice they run synchronously, since loops maintain their order of elements (i.e. one by one: first, second, third...)</li>
<li>CPU intensive tasks must run within a separate process</li>
<li>Find non-CPU intensive alternative (ideally asynchronous) technologies where possible (i.e. don't write to a single JSON/CSV file, as they need to be completely overwritten each time you append another item)</li>
<li>Automatically restart queries (<a target="_blank" href="https://github.com/jd/tenacity">Tenacity</a> does this with a simple function decorator)</li>
</ul>
<blockquote>
<p>When you stick to the rules, development stays light, fast and fluid 😏😌!</p>
</blockquote>
<p>Note that proper asynchronous code can easily bombard a server with requests.
Please don't let this happen or you'll be blocked 🥶.
<strong>Simply rate limit requests with a <a target="_blank" href="https://tutorialedge.net/python/concurrency/python-asyncio-semaphores-tutorial/">Semaphore</a></strong> (other fancy methods exist, but this is enough)!</p>
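<p>Here's a minimal sketch of that pattern with AIOHttp (the token is a placeholder, and <code>query</code> stands in for the GraphQL document from earlier): <code>asyncio.gather</code> launches everything breadth-first, whilst the Semaphore caps how many requests are in flight at once:</p>
<pre><code class="lang-python">import asyncio
import aiohttp

query = "..."  # the GraphQL search document from earlier
semaphore = asyncio.Semaphore(10)  # at most 10 requests in flight

async def fetch(session, variables):
    async with semaphore:  # wait here if 10 requests are already running
        async with session.post(
            "https://api.github.com/graphql",
            json={"query": query, "variables": variables},
        ) as response:
            return await response.json()

async def main(all_variables):
    async with aiohttp.ClientSession(
        headers={"Authorization": "bearer YOUR_TOKEN"}
    ) as session:
        # Breadth-first: every query is scheduled up front
        return await asyncio.gather(*(fetch(session, v) for v in all_variables))

results = asyncio.run(main([{"first": 100, "after": None}]))
</code></pre>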
<h1 id="thanks-for-reading">THANKS FOR READING!</h1>
<p>I know all we currently have is raw data, saved as a file (or database server), but this is the first big step to any unique project.
Soon (in part 2) we'll be able to process and view our data with <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.html">PySpark</a>, and look at results from a few basic <a target="_blank" href="https://www.kamwithk.com/machine-learning-field-guide">ML</a> models (created with <a target="_blank" href="https://www.h2o.ai/products/h2o-sparkling-water/">H2O PySparkling AI</a>)!
After that (part 3) we can take a look at using <a target="_blank" href="https://pytorch.org/">PyTorch</a> with <a target="_blank" href="https://allennlp.org/">AllenNLP</a> or <a target="_blank" href="https://huggingface.co/transformers">Hugging Face Transformers</a>.</p>
<p>If you enjoyed this and you’re interested in coding or data science feel free to check out my other posts like tutorials on <a target="_blank" href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook?guid=363cfb0b-4bda-4d10-a8fd-00ef9f412ab5&amp;deviceId=bed01825-b655-427b-8d69-52b30b0a2d78">practical coding skills</a> or <a target="_blank" href="https://www.kamwithk.com/the-state-of-data-websites-and-portfolios-in-2020-develop-a-dashboard-in-a-day-dash-vs-streamlit-and-is-javascript-still-king">easy portfolio dashboards/websites</a>!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1600262184183/FsLB_B84g.jpeg" alt="continued.jpg" /></p>
<p>Images by <a target="_blank" href="https://pixabay.com/users/geralt-9301">Gerd Altmann</a> on <a target="_blank" href="https://pixabay.com/illustrations/social-media-media-board-networking-1989152/">PixaBay</a>, <a target="_blank" href="https://unsplash.com/@campaign_creators">Campaign Creators</a>, <a target="_blank" href="https://unsplash.com/@florianolv">Florian Olivo</a> and <a target="_blank" href="https://unsplash.com/photos/C4sxVxcXEQg">Reuben Juarez</a> on Unsplash</p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Energy Demand Prediction Project - Part 3 Modelling with Decision Trees]]></title><description><![CDATA[Let's see how our machine learning, project planning and essential coding tools can be brought to life in a real-world project!
Today we're going through how we can predict how much energy we use daily using temperature data.
We start here with impor...]]></description><link>https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-3-modelling-with-decission-trees</link><guid isPermaLink="true">https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-3-modelling-with-decission-trees</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[ML]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data]]></category><category><![CDATA[projects]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 12 Aug 2020 02:39:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1597199949695/nBr8IKzi-.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#39;s see how our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a>, <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls-ckbjfohz201ujzqs1lwu5l7xd">project planning</a> and <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">essential coding tools</a> can be brought to life in a real-world project!
Today we&#39;re going through how we can predict how much energy we use daily using temperature data.
We start here with <strong>importing and cleaning data, before graphing and depicting the story of our energy usage and finally modelling it</strong>.</p>
<p>This is the last section, where we take our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4?guid=14f9ef0e-cd44-4f28-b588-fec4b33b41cf">cleaned data</a> and our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs-ckcecai0i006xbrs1hcase6f6">understanding of temperature and energy</a> to develop a predictive model. Feel free to code along, the full project is on <a target='_blank' rel='noopener'  href="https://github.com/KamWithK/Temp2Enrgy">GitHub</a>.</p>
<h1 id="the-story">The story</h1>
<p>We wake up in the mornings, turn on the heater/air conditioner, find some yogurt from the fridge for breakfast, shave, turn on a computer, get the music rolling and finally get to work.
These tasks all have one thing in common - they use power!
Our heavy reliance on electricity makes it crucial to estimate how much energy we&#39;ll need to generate each day.</p>
<p>But, fear not if this seems challenging.
We will take it one step at a time.
At each stage linking back to how it relates to our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">ML field guide</a>.</p>
<p>We already <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4?guid=14f9ef0e-cd44-4f28-b588-fec4b33b41cf">found, cleaned</a>, <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs-ckcecai0i006xbrs1hcase6f6">visualised and interpreted our energy usage</a> 😊.
Now we can translate this into a model able to predict how much energy we use!
We start our journey off right where we left off, taking a deep look into how we can remove annual increases/decreases in energy usage caused by economic and population growth.
This is a hard problem, so to simplify we break down the energy demand time series into separate parts (which we graph).
The three important <em>components</em> of the time series are <em>overall increasing, decreasing and stable trends</em>, <em>seasonal repetitive changes</em> and other <em>random residual noise</em>.
Research into time series indicates that we can <em>detrend</em> our data through a process called <em>differencing</em>, where we subtract the value from <em>n</em> steps earlier at each point.</p>
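<p>As a two-line illustration of differencing (toy numbers, not our dataset), pandas&#39; <code>diff</code> subtracts the value <em>n</em> steps back:</p>
<pre><code class="lang-python">import pandas as pd

series = pd.Series([10, 12, 15, 13, 16, 20])
# Each value minus the value 2 steps earlier: NaN, NaN, 5, 1, 1, 7
print(series.diff(2))
</code></pre>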
<p>With data in a clean and predictable state, our minds are at ease and we can move onto quickly dividing up the dataset and creating a <em>decision tree model</em> for each state.
We&#39;ll find out what variables (specifically hyperparameters) affect it and then use grid search to find the best values.</p>
<p>We can then judge how well they fare, and contemplate how they could improve.
Then, finally, we can honour the project by showing it off to everyone we know 🥳!</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-keyword">from</span> statsmodels.tsa.seasonal <span class="hljs-keyword">import</span> seasonal_decompose
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> cross_val_score, GridSearchCV
<span class="hljs-keyword">from</span> sklearn.tree <span class="hljs-keyword">import</span> DecisionTreeRegressor
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> IPython.core.interactiveshell <span class="hljs-keyword">import</span> InteractiveShell

InteractiveShell.ast_node_interactivity = <span class="hljs-string">"all"</span>
pd.options.display.max_columns = <span class="hljs-keyword">None</span>
np.random.seed(<span class="hljs-number">0</span>)
</code></pre>
<pre><code class="lang-python">data = pd.read_pickle(<span class="hljs-string">"../Data/Data.pickle"</span>)

data[<span class="hljs-string">"Month"</span>] = data.index.month
data[<span class="hljs-string">"Week"</span>] = data.index.week
data[<span class="hljs-string">"Day"</span>] = data.index.dayofweek
</code></pre>
<h1 id="the-epochs">The Epochs</h1>
<h2 id="removing-trends-seasonality">Removing trends/seasonality</h2>
<p>Let&#39;s begin by clearly explaining our situation.
As mentioned before, this is achieved through graphs which break down our time series!
They help us visually interpret precisely what needs to be removed/modified.</p>
<pre><code class="lang-python">seasonal_decompose(data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].resample(<span class="hljs-string">"M"</span>).mean(), model=<span class="hljs-string">"additive"</span>).plot()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198480113/MerwcT0xL.png" alt="output_3_0.png"></p>
<p>At the top, we see the full graphs, and below it <em>trends</em> (the increase/decrease of values with time), <em>seasonality</em> (the repeating pattern of increasing and then decreasing values) and <em>residuals</em> (everything else which is present, but more so random since it doesn&#39;t seem to repeat itself).</p>
<p>We just decomposed Victoria&#39;s energy demand here, so it isn&#39;t completely representative of what we&#39;ll remove, but close enough to remind us of the problem.
What we want is to eliminate that gradual long-term increasing/decreasing trend.
This is normally done through <em>diffing</em> our data (subtracting the value from <em>n</em> entries earlier at each entry).
We&#39;ll subtract the value from half a year earlier, since our trends play out over the long run.
Since we do it separately on each state, we first have to order the data by region and time (we can reverse this after).</p>
<pre><code class="lang-python"><span class="hljs-comment"># Sort dataframe by region so groupby's output can be combined and used for another column</span>
data.sort_values(by=[<span class="hljs-string">"Region"</span>, <span class="hljs-string">"Date"</span>], inplace=<span class="hljs-keyword">True</span>)
data[<span class="hljs-string">"AdjustedDemand"</span>] = data.groupby(<span class="hljs-string">"Region"</span>)[<span class="hljs-string">"TotalDemand"</span>].diff(<span class="hljs-number">8544</span>)
all([region[<span class="hljs-number">1</span>].sort_index().index.equals(region[<span class="hljs-number">1</span>].index) <span class="hljs-keyword">for</span> region <span class="hljs-keyword">in</span> data.groupby(<span class="hljs-string">"Region"</span>)])
data.sort_index(inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<pre><code><span class="hljs-keyword">True</span>
</code></pre><p>When we graph the original total demand we find that it was not stationary; however, when we overlay the new adjusted version, it varies up and down around one straight line (implying that the trend has been removed).</p>
<pre><code class="lang-python">data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].resample(<span class="hljs-string">"M"</span>).mean().plot()
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"AdjustedDemand"</span>].resample(<span class="hljs-string">"M"</span>).mean().plot(secondary_y=<span class="hljs-keyword">True</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198532977/VN7X-MumL.png" alt="output_7_0.png"></p>
<p>To ensure we&#39;re right, we can graph the distribution of temperature against energy, and see how it changes with time.
We can see that the graphs become tighter, showing that the spread of energy values for a given temperature has been reduced (good)!</p>
<pre><code class="lang-python">fix, axes = plt.subplots(ncols=<span class="hljs-number">2</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>))
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>).resample(<span class="hljs-string">"W"</span>).mean().plot(x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, kind=<span class="hljs-string">"scatter"</span>, ax=axes[<span class="hljs-number">0</span>], title=<span class="hljs-string">"Before adjustments"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>).resample(<span class="hljs-string">"W"</span>).mean().plot(x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"AdjustedDemand"</span>, kind=<span class="hljs-string">"scatter"</span>, ax=axes[<span class="hljs-number">1</span>], title=<span class="hljs-string">"After adjustments"</span>, color=<span class="hljs-string">"red"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198552312/BxT93PDJt.png" alt="output_9_0.png"></p>
<h2 id="divide-up-data">Divide up data</h2>
<p>There&#39;s one thing everyone knows by now - we have a lot of data.
So now we have to split it up and be 100% confident that it&#39;s what we&#39;re looking for.
The default distribution of 75% of data for training and 25% for testing is good enough for our purpose.</p>
<p>Not everyone has a monstrously powerful computer, so to ensure it&#39;s easy and fast to train our model (useful to quickly see how our model fares, try changing a few things and retrain) we&#39;ll only predict the overall energy usage every day (instead of per 30 minutes).
This should decrease the effect of any present outliers too!</p>
<p>Whilst testing out changes, one thing which will become immediately obvious is that including complete information on time causes overfitting.
This is because there are <em>only 20 years present</em>, meaning any information encoded in the year is likely not generalisable.
To fix this, we can just use integers for the day, week and month number.
Anomalies/outliers should now be relatively rare (hopefully 😲).</p>
<pre><code class="lang-python">resampled_data = data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"D"</span>).mean().reset_index(<span class="hljs-string">"Region"</span>).sort_index()
train_data, test_data = train_test_split(resampled_data, shuffle=<span class="hljs-keyword">False</span>)

input_columns = [<span class="hljs-string">"WetBulbTemperature"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Week"</span>, <span class="hljs-string">"Day"</span>]
output_columns = [<span class="hljs-string">"AdjustedDemand"</span>]

train_input_data, train_output_data = train_data[input_columns + [<span class="hljs-string">"Region"</span>]], train_data[output_columns + [<span class="hljs-string">"Region"</span>]]
test_input_data, test_output_data = test_data[input_columns + [<span class="hljs-string">"Region"</span>]], test_data[output_columns + [<span class="hljs-string">"Region"</span>]]
</code></pre>
<h2 id="create-and-train-a-model">Create and train a model</h2>
<p>We have a large selection of models we can use.
We can try each and see which ones work (a good idea for beginners), but after long trials it becomes obvious that simple and fast models like decision trees work just as well here as more complex ensemble models like random forests.
Of course, linear models won&#39;t work (our data is shaped like a parabola in most cases).</p>
<p>As <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#hyperparameter-tuning">mentioned before</a>, we need to find the optimal hyperparameters (how deep and complex our decision tree can become), and grid search is the standard way to do it.
We run <a target='_blank' rel='noopener'  href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">grid search</a> over 5 subsets of our dataset (K-fold cross-validation).
The max depth (how <em>deep</em> the tree can grow), along with min sample leaves (minimum end-nodes the tree must-have, which produces shallow trees) and max-leaf nodes (maximum end-nodes the tree is allowed to have, which produces deeper trees) are our hyperparameters (put into a <code>paramaters</code> array).
We&#39;ve found these through <a target='_blank' rel='noopener'  href="https://scikit-learn.org/stable/modules/tree.html">reading the decision-tree Scikit Learn documentation</a> which has a lot of details on how to use each model (with a few examples)!
Knowing what specific values to try out for each variable comes down to manual testing (try and see what happens).</p>
<p>The <code>get_best_model</code> function here is responsible for hyperparameter tuning, whilst <code>get_predictions</code> formats the predictions for the test (and optionally training) data; the loop further below then trains a model per state.
Separating different concerns into their own functions/classes is a good way to ensure code clarity, reproducibility and ease of modification.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_best_model</span><span class="hljs-params">(test_input: pd.DataFrame, test_output: pd.DataFrame)</span>:</span>
    paramaters = {<span class="hljs-string">"max_depth"</span>: [*range(<span class="hljs-number">1</span>, <span class="hljs-number">20</span>), <span class="hljs-keyword">None</span>], <span class="hljs-string">"min_samples_leaf"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">10</span>, <span class="hljs-number">15</span>], <span class="hljs-string">"max_leaf_nodes"</span>: [<span class="hljs-number">5</span>, <span class="hljs-number">10</span>, <span class="hljs-number">20</span>, <span class="hljs-keyword">None</span>]}
    regressor = DecisionTreeRegressor()
    grid = GridSearchCV(regressor, param_grid=paramaters, n_jobs=<span class="hljs-number">1</span>)
    grid.fit(test_input, test_output.values.ravel())
    best_score, best_depth = grid.best_score_, grid.best_params_

    <span class="hljs-keyword">return</span> grid, best_score, best_depth
</code></pre>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_predictions</span><span class="hljs-params">(regressor, test_input, test_output, train_input=None, train_output=None)</span>:</span>
    test_predictions = regressor.predict(test_input)
    test_results = pd.DataFrame(test_predictions, columns=output_columns, index=test_input.index)
    test_results = test_data[input_columns].join(test_results)

    <span class="hljs-keyword">if</span> type(train_input) != <span class="hljs-keyword">None</span> <span class="hljs-keyword">and</span> type(train_output) != <span class="hljs-keyword">None</span>:
        train_predictions = regressor.predict(train_input)
        train_results = pd.DataFrame(train_predictions, columns=output_columns, index=train_input.index)
        train_results = train_data[input_columns].join(train_results)

        <span class="hljs-keyword">return</span> test_results, train_results
    <span class="hljs-keyword">return</span> test_results
</code></pre>
<pre><code class="lang-python">models, regressors = [], []

test_predictions, train_predictions = [], []

<span class="hljs-keyword">for</span> region, dataframe <span class="hljs-keyword">in</span> train_data.groupby(<span class="hljs-string">"Region"</span>):
    <span class="hljs-comment"># Cross validate to find the best model</span>
    model_input, model_output = dataframe.dropna()[input_columns], dataframe.dropna()[output_columns]
    grid, score, params = get_best_model(model_input, model_output)
    regressors.append(grid)
    models.append(regressors[<span class="hljs-number">-1</span>].fit(model_input, model_output.values.ravel()))

    print(f<span class="hljs-string">"Best {region} model has a score of {score} and best params {params}"</span>)

    <span class="hljs-comment"># Get the test data for this specific region</span>
    test_input = test_data.groupby(<span class="hljs-string">"Region"</span>).get_group(region)[input_columns].dropna()
    test_output = test_data.groupby(<span class="hljs-string">"Region"</span>).get_group(region)[output_columns].dropna()

    <span class="hljs-comment"># Generate predictions, obtain and log the final formatted data</span>
    test_results, train_results = get_predictions(regressors[<span class="hljs-number">-1</span>], test_input, test_output, model_input, model_output)
    test_predictions.append(test_results)
    train_predictions.append(train_results)
</code></pre>
<pre><code>Best NSW model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.6673092603240149</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">11</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">15</span>}
Best QLD model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.679797201035001</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">11</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">15</span>}
Best SA model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.4171236821322447</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">9</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">10</span>}
Best TAS model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.7609030948185131</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">15</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">15</span>}
Best VIC model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.6325583799486684</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">10</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">15</span>}
</code></pre><p>We can see that Tasmania fares quite well, with a score just under 80%.
This makes sense, as Tasmania started with very little trend, meaning there should be a higher correlation between temperature and energy.
Queensland, New South Wales and Victoria aren&#39;t all too bad either, with scores between 60%-70%!</p>
<p>If we look at the chosen <code>max_depth</code> values (between 9 and 15, below the ceiling we allowed), we can also tell that our models aren&#39;t as complex as they could be.</p>
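<p>If you want to dig further, a quick (hypothetical, not in the original notebook) way to peek inside a tuned model is to print its feature importances - for example for the last region fitted:</p>
<pre><code class="lang-python"># Sketch: inspect which features the tuned tree actually relies on
best_tree = regressors[-1].best_estimator_
for name, importance in zip(input_columns, best_tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
</code></pre>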
<h2 id="visualise-performance">Visualise performance</h2>
<p>To judge how well our model fares, we create and analyse plots of energy and temperature.
We start by seeing the correlation of energy and temperature for each state (the predictions are blue, and the real values are red).
We can see that the model isn&#39;t perfect and doesn&#39;t always predict the right values, but is pretty decent given the small number of features we are using.</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)
counter = [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>]

<span class="hljs-keyword">for</span> region, region_data <span class="hljs-keyword">in</span> test_data.groupby(<span class="hljs-string">"Region"</span>):
    region_data.plot(ax=axes[counter[<span class="hljs-number">0</span>], counter[<span class="hljs-number">1</span>]], x=<span class="hljs-string">"WetBulbTemperature"</span>, y=output_columns, kind=<span class="hljs-string">"scatter"</span>, color=<span class="hljs-string">"red"</span>, title=region)

    <span class="hljs-keyword">if</span> counter[<span class="hljs-number">1</span>] &lt; <span class="hljs-number">2</span>: counter[<span class="hljs-number">1</span>] += <span class="hljs-number">1</span>
    <span class="hljs-keyword">elif</span> counter[<span class="hljs-number">1</span>] == <span class="hljs-number">2</span>: counter[<span class="hljs-number">1</span>] = <span class="hljs-number">0</span>; counter[<span class="hljs-number">0</span>] += <span class="hljs-number">1</span>

counter = [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>]

<span class="hljs-keyword">for</span> region_data <span class="hljs-keyword">in</span> test_predictions:
    region_data.plot(ax=axes[counter[<span class="hljs-number">0</span>], counter[<span class="hljs-number">1</span>]], x=<span class="hljs-string">"WetBulbTemperature"</span>, y=output_columns, kind=<span class="hljs-string">"scatter"</span>)

    <span class="hljs-keyword">if</span> counter[<span class="hljs-number">1</span>] &lt; <span class="hljs-number">2</span>: counter[<span class="hljs-number">1</span>] += <span class="hljs-number">1</span>
    <span class="hljs-keyword">elif</span> counter[<span class="hljs-number">1</span>] == <span class="hljs-number">2</span>: counter[<span class="hljs-number">1</span>] = <span class="hljs-number">0</span>; counter[<span class="hljs-number">0</span>] += <span class="hljs-number">1</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198574018/sQHLx3Woo.png" alt="output_17_0.png"></p>
<p>The majority of the time the predictions align with the actual data, but they can be off at times.
We can see that Tasmania and Queensland are modelled quite well (explaining their high scores).
It looks like their graphs are more linear and less curved, which may be why they outperform other states.</p>
<p>We can move onto looking at the energy time series now.</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">5</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

<span class="hljs-keyword">for</span> i, (region, region_data) <span class="hljs-keyword">in</span> enumerate(test_data.groupby(<span class="hljs-string">"Region"</span>)):
    region_data[output_columns].plot(ax=axes[i], title=region)
    test_predictions[i][output_columns].plot(ax=axes[i])
    axes[i].set_ylabel(<span class="hljs-string">"Adjusted Demand"</span>)
    axes[i].legend([<span class="hljs-string">"Demand"</span>, <span class="hljs-string">"Demand Prediction"</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198588923/7-fk52wWv.png" alt="output_19_0.png"></p>
<p>We can see that the general ups and downs are <em>mostly</em> found and predicted across time.
The magnitude of the energy demand is also forecasted relatively well!
This is incredible given that we <em>only need temperature and dates/times</em> to generate our predictions!</p>
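<p>As a toy illustration of how little input is needed, here&#39;s a hypothetical one-off prediction. The feature values (and the assumption that <code>Day</code> means day-of-year) are made up for the example, and <code>regressors[-1]</code> is simply the last model trained in the loop (Victoria, given the alphabetical region order):</p>
<pre><code class="lang-python"># Sketch: predict demand for one day from temperature and date features alone
example = pd.DataFrame([[18.5, 7, 28, 190]], columns=input_columns)
print(regressors[-1].predict(example))
</code></pre>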
<p>To be more critical we can see that around halfway through most years, our energy predictions derail.
The model for Tasmania also seems to particularly struggle in 2017.
The temporary half-yearly blunders seem more severe in New South Wales and South Australia.
To get a feel for how bad these problems may be, we can just plot 2019.
It&#39;ll show that our predictions are pretty decent!</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">5</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

<span class="hljs-keyword">for</span> i, (region, region_data) <span class="hljs-keyword">in</span> enumerate(test_data.groupby(<span class="hljs-string">"Region"</span>)):
    region_data[output_columns][<span class="hljs-string">"2019"</span>].plot(ax=axes[i], title=region)
    test_predictions[i][output_columns][<span class="hljs-string">"2019"</span>].plot(ax=axes[i])
    axes[i].set_ylabel(<span class="hljs-string">"Adjusted Demand"</span>)
    axes[i].legend([<span class="hljs-string">"Demand"</span>, <span class="hljs-string">"Demand Prediction"</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198603056/97fb9I9va.png" alt="output_21_0.png"></p>
<h2 id="conclusions">Conclusions</h2>
<p>We&#39;ve come a long way 🤠!
We started off knowing little to nothing about energy, and now have a solid understanding of what affects it.
To start off we cleaned the data (an arduous process, but worth it), then created a bunch of graphs which illustrated patterns and trends in temperature and energy demand.
Energy demand is greatest when the temperature is extremely high or low (too cool or too warm and we freak out 🤣).
We now know that 9 am to 6 pm is active high energy time too!</p>
<p>Being able to create and train a working model of these changes, though, has pushed us to a new level.
Everything we saw and labelled before, our model can predict.
We even know how to improve the model, by collecting data on population, technological advancements and economic activity.
All in all, if there&#39;s one thing to take from this, it is that <strong>graphs and conceptual knowledge are the building blocks for creating and understanding a model</strong>!</p>
]]></content:encoded></item><item><title><![CDATA[The State of Data Websites and Portfolios in 2020 - Develop a Dashboard in a Day, Dash vs Streamlit and is JavaScript still king?]]></title><description><![CDATA[Having a visual product, website or dashboard to show what your arduous efforts on a coding/machine learning project amounted to is something truly spectacular!
Yet, it's often extremely difficult as numerous tools and technologies are usually requir...]]></description><link>https://www.kamwithk.com/the-state-of-data-websites-and-portfolios-in-2020-develop-a-dashboard-in-a-day-dash-vs-streamlit-and-is-javascript-still-king</link><guid isPermaLink="true">https://www.kamwithk.com/the-state-of-data-websites-and-portfolios-in-2020-develop-a-dashboard-in-a-day-dash-vs-streamlit-and-is-javascript-still-king</guid><category><![CDATA[Web Development]]></category><category><![CDATA[General Programming]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[portfolio]]></category><category><![CDATA[projects]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Mon, 13 Jul 2020 15:05:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1594652950469/v1XYP84zz.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Having a visual product, website or dashboard to show what your arduous efforts on a coding/machine learning project amounted to is something truly spectacular!
Yet, it&#39;s often extremely difficult as numerous tools and technologies are usually required.
Don&#39;t stress though, we&#39;ll discuss and compare <strong>two frameworks</strong> (Dash and Streamlit) which make it <strong>simple and easy to create an impressive portfolio</strong> (without a steep learning curve)!</p>
<h1 id="the-story">The Story</h1>
<p>You just created a machine learning model.
It took a long time, but  <em>it&#39;s finally done</em>, and you want to take in your victory for a second.
You deserve a break... but wise old you knows the importance of creating a monument to show off your work.</p>
<p>You take the <em>natural next step</em>, looking up how to build a website.
The tutorials begin with Python frameworks like Flask and Django, then proceed to JavaScript, and before long you&#39;re stuck contemplating which front-end framework to use and how you&#39;ll pass data between the Python back-end (model) and JavaScript front-end (the actual website).
Oh, boy... this is a long and dark rabbit hole to scurry through.
But then out of the blue, you hear that there is an <em>easy solution for simple websites</em>.
You look up this new shiny framework (Streamlit), and it sure is easy 😊 and quick to use.
Before long you&#39;ve forgotten all your troubles and insecurities!
But then you suddenly realise Streamlit&#39;s catch... it only works for <em>simple Jupyter notebook-esque websites</em>.
It&#39;s all aboard the web dev train again for you.
Requests and JavaScript, here you come 😰.</p>
<p><strong>It doesn&#39;t have to be that way though</strong>... you <em>can find middle ground</em>.
Something simple enough to be understood in a few days, but complex enough... well, for nearly anything 🤓!
Welcome to Dash.
You still need to know a few web fundamentals (HTML and CSS), but at least your development journey has a clearly defined path ahead.
Even if it feels slightly clunky, it does get the job done well enough!</p>
<p><strong>The whole process can be dumbed down to three decisions</strong>:</p>
<ol>
<li>What you want on the page (text, graphs, tables, images, etc)</li>
<li>How to arrange and style the page (using CSS)</li>
<li>How you want the user to interact with the page</li>
</ol>
<p>No JavaScript, HTTP requests or even multiple separate frameworks for the front and back end any more!</p>
<h1 id="get-started-with-dash">Get Started with Dash</h1>
<p>To get started, make sure you have Dash installed.
With plain vanilla Python use <code>pip install dash</code> and for Anaconda <code>conda install -c conda-forge dash</code>.
Next, create a new Python file and import the relevant libraries:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> dash
<span class="hljs-keyword">import</span> dash_core_components <span class="hljs-keyword">as</span> dcc
<span class="hljs-keyword">import</span> dash_html_components <span class="hljs-keyword">as</span> html
</code></pre>
<p>If you try and run the app so far, you&#39;ll notice one thing - nothing happens.
That&#39;s because we actually have to create a Dash app object and tell it to start.</p>
<pre><code class="lang-python">app = dash.Dash(__name__, external_stylesheets=[<span class="hljs-string">"https://codepen.io/chriddyp/pen/bWLwgP.css"</span>])
app.title = <span class="hljs-string">"Allocate++"</span>

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    app.run_server(debug=<span class="hljs-keyword">True</span>)
</code></pre>
<p>We can include a style sheet (CSS using <code>external_stylesheets</code>) and set our website&#39;s title (<code>app.title</code>) to make things look better.
Checking that <code>__name__ == &quot;__main__&quot;</code> just ensures that the website only launches when directly started (not when imported in another file).</p>
<p>If we try to run this code, in the terminal we&#39;ll get a message like:</p>
<pre><code>Running <span class="hljs-literal">on</span> http:<span class="hljs-regexp">//</span><span class="hljs-number">127.0</span><span class="hljs-number">.0</span><span class="hljs-number">.1</span>:<span class="hljs-number">8050</span>/
Debugger PIN: <span class="hljs-number">409</span><span class="hljs-number">-929</span><span class="hljs-number">-250</span>
 * Serving Flask app <span class="hljs-string">"Main"</span> (lazy loading)
 * Environment: production
   WARNING: This <span class="hljs-keyword">is</span> a development server. Do <span class="hljs-keyword">not</span> use it <span class="hljs-keyword">in</span> a production deployment.
   Use a production WSGI server instead.
 * Debug mode: <span class="hljs-literal">on</span>
Running <span class="hljs-literal">on</span> http:<span class="hljs-regexp">//</span><span class="hljs-number">127.0</span><span class="hljs-number">.0</span><span class="hljs-number">.1</span>:<span class="hljs-number">8050</span>/
Debugger PIN: <span class="hljs-number">791</span><span class="hljs-number">-028</span><span class="hljs-number">-264</span>
</code></pre><p>It indicates that your app has started and can be found using the URL <code>http://127.0.0.1:8050/</code>.
Although it&#39;s currently just a blank page (real <em>fancy-schmancy</em>), it does indicate that everything is working fine.</p>
<p>Once you&#39;re ready to progress, try adding in a heading:</p>
<pre><code class="lang-python">app.layout = html.H1(children=<span class="hljs-string">"Fancy-Schmancy Website"</span>)
</code></pre>
<p>After you save the file, that website should automatically reload.
If it hasn&#39;t reloaded, or there are popups on the screen, you probably have an error in the source code.
Just check the actual terminal/debugger for more information.</p>
<p>Now that you&#39;re familiar with how to get a basic website, let&#39;s move onto transitioning your concept into code.
It starts with what&#39;s called a layout, which is composed of components.
Dash provides core (<code>dash_core_components</code>) and HTML (<code>dash_html_components</code>) components.
You always start using the HTML elements, since they provide the basic building blocks for text and grouping components together, before moving onto the core components.
Core components offer more interactivity (graphs, tables, checkboxes...).
It&#39;s now natural to ask how to style the web page.
In short, you use CSS (cascading style sheets) for this.
Dash themselves provide concrete overviews of <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/dash-core-components">core components</a> and trusty Mozilla have an amazing <a target='_blank' rel='noopener noreferrer'  href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics">HTML</a> and <a target='_blank' rel='noopener noreferrer'  href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics">CSS</a> intro.
Several examples of how to use the elements are <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/layout">here</a>.</p>
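<p>To make that concrete, here&#39;s a minimal layout sketch (the component ids and the slider are illustrative choices, not from the original app):</p>
<pre><code class="lang-python"># Sketch: HTML components group and label the page, core components add interactivity
app.layout = html.Div(children=[
    html.H1(children="Fancy-Schmancy Website"),
    html.P(children="Pick a number to display below:"),
    dcc.Slider(id="number-slider", min=0, max=10, step=1, value=5),
    html.Div(id="number-display")
])
</code></pre>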
<p>The last part of any Dash app is making it responsive.
Getting the buttons you click, text you enter and images you upload... do something!
This is where things would normally get difficult, but here it really <em>isn&#39;t too bad</em>.
With Dash, all you&#39;ve got to define is a function which receives and controls specific elements&#39; properties.
The function is wired up with the <code>@app.callback</code> <em>decorator</em> (the &quot;@&quot; symbol).</p>
<pre><code class="lang-python"><span class="hljs-meta">@app.callback(</span>
    [dash.dependencies.Output(<span class="hljs-string">"output element id"</span>, <span class="hljs-string">"property to set value of"</span>)],
    [dash.dependencies.Input(<span class="hljs-string">"input element id"</span>, <span class="hljs-string">"input property"</span>)]
)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">update_output</span><span class="hljs-params">(value)</span>:</span>
       <span class="hljs-keyword">return</span> value
</code></pre>
<p>We can do this for multiple elements by adding more <code>Input</code> and <code>Output</code> objects to those lists!
One thing to watch out for here though - more <code>Input</code> objects means more inputs to the function are required, and more <code>Output</code> objects mean more values to return (sounds obvious, but it can easily slip your mind). 
Also, note that you <em>shouldn&#39;t modify global variables</em> within these functions (callbacks can run concurrently or across multiple server workers, so this is an antipattern).
Further documentation is provided on these <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/basic-callbacks">callbacks</a>.</p>
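<p>Filling in that template for the hypothetical slider and div from the layout sketch above (again, an illustrative example rather than the original app&#39;s code):</p>
<pre><code class="lang-python"># Sketch: whenever the slider moves, Dash calls this function and places the
# returned value into the div's "children" property
@app.callback(
    dash.dependencies.Output("number-display", "children"),
    [dash.dependencies.Input("number-slider", "value")]
)
def update_output(value):
    return f"You picked {value}"
</code></pre>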
<h1 id="going-forwards-and-javascript">Going Forwards and JavaScript</h1>
<p>There it is, everything you&#39;ll need to know to start creating an interactive and impressive web application!
It&#39;ll likely still be difficult to create one, but the official documentation and tutorials for <a target='_blank' rel='noopener noreferrer'  href="https://docs.streamlit.io/en/stable/getting_started.html">Streamlit</a> and <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/">Dash</a> are amazing.
There are also cool galleries of sample apps using <a target='_blank' rel='noopener noreferrer'  href="https://dash-gallery.plotly.host/Portal/">Dash</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.streamlit.io/gallery">Streamlit</a> (so you can learn from others&#39; examples).</p>
<p>Of course, there are use cases for JavaScript.
In fact, you can build plugins for Dash with <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/plugins">JavaScript/React</a> and <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/d3-react-components">D3.js</a>.
Hell, if you are already comfortable with web technologies it may even be easier for you to use them.
However, using JavaScript <strong>isn&#39;t 100% necessary to build websites</strong> any more (it&#39;s more so optional).
It may be useful to know about web technologies, but if your aim isn&#39;t to become a full-stack web developer, you don&#39;t need to become an expert to put together a flashy portfolio 🥳!</p>
<p>I hope this has helped you out!
Dash helped me hack together <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/AllocatePlusPlus">my first dashboard</a> in a day.
If you&#39;ve made a cool website, app or portfolio make sure to comment and tell me about them.
Feel free to check out my other posts - some highlights are <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z?guid=34cbed9b-13ac-43c7-94a3-dbfe4ac247a9&amp;deviceId=a348da4b-4d6e-44a9-80b2-3456c05bf4d0">practical coding tools</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">web scraping</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a> (with the <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4">practical project</a>).
You can follow my <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/">newsletter</a> and <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_">Twitter</a> for updates 😉.</p>
<p><em>Photo by Luke Peters on <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/B6JINerWMz0">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Energy Demand Prediction Project - Part 2 Storytelling using Graphs]]></title><description><![CDATA[Let's see how our machine learning, project planning and essential coding tools can be brought to life in a real-world project!
Today we're going through how we can predict how much energy we use daily using temperature data.
We previously imported a...]]></description><link>https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs</link><guid isPermaLink="true">https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[projects]]></category><category><![CDATA[data analysis]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Thu, 09 Jul 2020 05:17:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271772519/Vh4MHgtHw.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#39;s see how our <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls-ckbjfohz201ujzqs1lwu5l7xd">project planning</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">essential coding tools</a> can be brought to life in a real-world project!
Today we&#39;re going through how we can predict how much energy we use daily using temperature data.
We previously imported and cleaned our data, so we&#39;ll now <strong>graph and depict the story behind our energy usage</strong>!</p>
<p>This is the second part of three (<a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4">first here</a>). Feel free to code along, the full project is on <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Temp2Enrgy">GitHub</a>.</p>
<h1 id="the-story">The story</h1>
<p>We wake up in the mornings, turn on the heater/air conditioner, find some yogurt from the fridge for breakfast, shave, turn on a computer, get the music rolling and finally get to work.
These tasks all have one thing in common - they use power!
Our heavy reliance on electricity makes it crucial to estimate how much energy we&#39;ll need to generate each day.</p>
<p>We <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4">already found, imported and cleaned our data</a> (good work guys), so we can move onto telling a story about our power usage.
But, fear not if this seems challenging.
We will take it one step at a time.
At each stage linking back to how it relates to our <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">ML field guide</a>.</p>
<p>We start with the <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls-ckbjfohz201ujzqs1lwu5l7xd">difficult but necessary</a> task of interpreting our data.
Our first thought is to plot the whole time series at once, but damn, a graph with 4 features, each with around five measurements every 30 minutes over 20 years, isn&#39;t pretty, meaningful or fast to draw.
After banging our head against a brick wall for a while, we, of course, realise that we can plot specific features and relationships instead of <em>everything at once</em>.
With little to lose we start using simple summary statistics to find the maximum, minimum and average values.
These give us a rough overview of each column, but to push ourselves one step further we take a look at <em>how correlated our features are</em>.</p>
<p>Once we understand that temperature relates highly to energy demand (intuitive enough), we&#39;re ready to get going with some graphs 😉!
Although we can&#39;t graph <em>everything at once</em>, we still want to get a grasp of the overall picture - how our data changes with time.
We begin by identifying our problem - when we look for changes over 20 years, movement every 30 minutes is really meaningless and just blotches the picture.
Lucky for us our <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#chapter-3-visualisation">field guide</a> explains that we can plot each week&#39;s average value through <em>resampling</em>!
Now we know the general increasing and decreasing trends between states.</p>
<p>After looking at individual data for energy and temperature we move onto finding where the correlation between the two occurs.
The graphs for each state are different.
The states which had larger long-term trends have more complex-looking graphs.
We don&#39;t have the data to account for these trends, so we&#39;ll need to remove them later on.</p>
<p>Now there&#39;s only one thing left for us - to find out how energy demand changes during a day and week.
Then... in no time, we&#39;ve managed to depict the story of our energy usage through each invigorating day, month and year 😎.
At this point, we&#39;d have successfully made it through the majority of our project!
After a brief celebration, we can move onto modelling... Let&#39;s not jump the gun though, this will be in the next (final) tutorial.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-keyword">from</span> IPython.core.interactiveshell <span class="hljs-keyword">import</span> InteractiveShell

InteractiveShell.ast_node_interactivity = <span class="hljs-string">"all"</span>
pd.options.display.max_columns = <span class="hljs-keyword">None</span>

data = pd.read_pickle(<span class="hljs-string">"../Data/Data.pickle"</span>)
</code></pre>
<h1 id="the-epochs">The Epochs</h1>
<h2 id="chapter-1-descriptive-statistics">Chapter 1 - Descriptive Statistics</h2>
<p>Since we can&#39;t view everything at once, we want to get a rough gauge on what our data looks like.
The natural first step is to look at each column&#39;s mean, minimum and maximum value.
These are called descriptive statistics, and Pandas calculates them for us using the <code>describe</code> function.</p>
<p>We also want to see which features are related to energy demand (as we&#39;re trying to predict it later on), so we&#39;ll find the <em>correlations</em>.
To find the correlations between features Pandas provides the <code>corr</code> function.</p>
<p>The stats show:</p>
<ul>
<li><code>TotalDemand</code> has an average of 4619 MW, with a minimum of 22 MW and a maximum of 14580 MW.</li>
<li><code>WetBulbTemperature</code> ranges from a minimum of -0.9 °C to a maximum of 41 °C.</li>
<li><code>TotalDemand</code> is most correlated to <code>WetBulbTemperature</code></li>
</ul>
<p>Although the correlation function only accounts for linear relationships (straight lines), it is still useful in knowing which features are worth graphing and including in our model.
Here primarily <code>WetBulbTemperature</code>, but <code>StationPressure</code> may also be useful.</p>
<pre><code class="lang-python">data.describe()
</code></pre>
<table>
<thead>
<tr>
<td></td><td>TotalDemand</td><td>RRP</td><td>WetBulbTemperature</td><td>SeaPressure</td><td>StationPressure</td></tr>
</thead>
<tbody>
<tr>
<td>count</td><td>1.656254e+06</td><td>1.656254e+06</td><td>1.656254e+06</td><td>1.656254e+06</td><td>1.656254e+06</td></tr>
<tr>
<td>mean</td><td>4.619521e+03</td><td>5.143376e+01</td><td>1.346589e+01</td><td>1.016535e+03</td><td>1.012486e+03</td></tr>
<tr>
<td>std</td><td>2.848791e+03</td><td>1.910091e+02</td><td>4.668981e+00</td><td>7.543408e+00</td><td>7.798352e+00</td></tr>
<tr>
<td>min</td><td>2.189000e+01</td><td>-1.000000e+03</td><td>-9.000000e-01</td><td>9.772000e+02</td><td>9.693000e+02</td></tr>
<tr>
<td>25%</td><td>1.413990e+03</td><td>2.336000e+01</td><td>9.900000e+00</td><td>1.011900e+03</td><td>1.007800e+03</td></tr>
<tr>
<td>50%</td><td>5.131249e+03</td><td>3.443000e+01</td><td>1.310000e+01</td><td>1.016800e+03</td><td>1.013100e+03</td></tr>
<tr>
<td>75%</td><td>6.591798e+03</td><td>5.490000e+01</td><td>1.700000e+01</td><td>1.021600e+03</td><td>1.017900e+03</td></tr>
<tr>
<td>max</td><td>1.457986e+04</td><td>1.470000e+04</td><td>4.100000e+01</td><td>1.041800e+03</td><td>1.037600e+03</td></tr>
</tbody>
</table>
<pre><code class="lang-python">data.corr()
</code></pre>
<table>
<thead>
<tr>
<td></td><td>TotalDemand</td><td>RRP</td><td>WetBulbTemperature</td><td>SeaPressure</td><td>StationPressure</td></tr>
</thead>
<tbody>
<tr>
<td>TotalDemand</td><td>1.000000</td><td>0.014473</td><td>0.357300</td><td>0.044859</td><td>0.188955</td></tr>
<tr>
<td>RRP</td><td>0.014473</td><td>1.000000</td><td>0.032914</td><td>-0.019025</td><td>-0.017678</td></tr>
<tr>
<td>WetBulbTemperature</td><td>0.357300</td><td>0.032914</td><td>1.000000</td><td>-0.249321</td><td>-0.125920</td></tr>
<tr>
<td>SeaPressure</td><td>0.044859</td><td>-0.019025</td><td>-0.249321</td><td>1.000000</td><td>0.887758</td></tr>
<tr>
<td>StationPressure</td><td>0.188955</td><td>-0.017678</td><td>-0.125920</td><td>0.887758</td><td>1.000000</td></tr>
</tbody>
</table>
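<p>Since the correlation function above only accounts for linear relationships, a quick optional sanity check (an aside, not in the original notebook) is a rank-based correlation such as Spearman&#39;s, which can pick up monotonic but non-linear links:</p>
<pre><code class="lang-python"># Sketch: Spearman rank correlation as a complement to the default (Pearson)
data.corr(method="spearman")
</code></pre>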
<h1 id="chapter-2-finding-long-term-trends">Chapter 2 - Finding Long-Term Trends</h1>
<h2 id="energy-over-20-years">Energy over 20 Years</h2>
<p>We want to know the story of how we use energy.
There&#39;s one simple way to do that - graphs 🤓.
We can start by looking at what happens on a large scale, and then slowly zoom in.</p>
<p>We&#39;ll view each state separately, since their trends may not be the same.</p>
<pre><code class="lang-python">fig, axes  = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"TAS"</span>].plot(color=<span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"VIC"</span>].plot(color=<span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"NSW"</span>].plot(color=<span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"QLD"</span>].plot(color=<span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Energy Demand"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"SA"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Energy Demand"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
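<p>As an aside, the repeated <code>groupby(...).resample(...)</code> calls can be collapsed by computing the resampled means once and looping over the regions - a behaviour-equivalent sketch (the colour mapping mirrors the block above; the shorter titles are an illustrative choice):</p>
<pre><code class="lang-python"># Sketch: resample once, then plot each region on its own subplot
resampled = data.groupby("Region").resample("3W").mean()["TotalDemand"]
colours = {"TAS": "red", "VIC": "green", "NSW": "purple", "QLD": "orange", "SA": "blue"}

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(20, 12), constrained_layout=True)
for ax, (region, colour) in zip(axes.flat, colours.items()):
    resampled[region].plot(color=colour, title=f"{region} Energy Demand", ax=ax)
</code></pre>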
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594270980198/_scTdQRS7.png" alt="output_7_5.png"></p>
<p>It can still be difficult to interpret graphs after they&#39;re resampled.
So let&#39;s take it slowly, one step at a time.</p>
<p>The first noticeable pattern is that energy always fluctuates between a high and low point.
The high and low points aren&#39;t always the same.</p>
<ul>
<li>Tasmania and South Australia range from around 900 to 1400</li>
<li>Victoria from 4500 to 6500</li>
<li>New South Wales from 6000 to 10000</li>
<li>Queensland from 4500 to 7500</li>
</ul>
<p>We can tell though that the trends aren&#39;t constant.
There can be a rapid increase in energy usage (Queensland until ~2010), a steep fall (Victoria after ~2010) or even continuous stability (Tasmania)!
The patterns are clearly not regular or caused directly by temperature (and so not predictable using historic temperature and energy data).</p>
<p>Although we don&#39;t have data on these trends, we can give an educated guess on what causes them.
We know that the population isn&#39;t stable, and grows at different rates for different states.
There has also been a massive increase in technology&#39;s power efficiency, and economic conditions affect people&#39;s willingness to use power.
On top of this, global warming pushes more and more people to install solar panels (which produce power which isn&#39;t accounted for).
Since we don&#39;t have data on any of these features, we&#39;ll try to remove the trends before we begin our modelling.</p>
<h2 id="energy-over-single-years">Energy over Single Years</h2>
<p>Let&#39;s now zoom in!
We&#39;ll look at trends which occur during a single year.
Since we&#39;re graphing 5 years instead of 20 we&#39;ll, of course, <em>need less resampling</em>.</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"TAS"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"VIC"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"NSW"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"QLD"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Energy Demand"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"SA"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Energy Demand"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271025727/590RucBfB.png" alt="output_9_5.png"></p>
<p>We can tell that the energy demand is usually lowest in spring and autumn, whilst highest during winter and/or summer.
Tasmania tends to have a higher demand in winter than summer.
Victoria&#39;s similar, but with more frequent peaks in energy demand during summer.
On the other hand, Queensland uses the most energy during summer.
New South Wales and South Australia both have their maximum energy demand in summer and winter!</p>
<p>Tasmania is consistently cooler (being the small island) unlike hot and sweaty New South Wales and South Australia.
This would explain the relative differences in where max&#39;s/min&#39;s occur.</p>
<h2 id="temperature-over-20-years">Temperature over 20 Years</h2>
<p>Temperature is just as important as energy though.
So we&#39;ll take a look at it as well!</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">6</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"TAS"</span>].plot(color= <span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Temperature"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"VIC"</span>].plot(color= <span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Temperature"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"NSW"</span>].plot(color= <span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Temperature"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"QLD"</span>].plot(color= <span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Temperature"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"SA"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Temperature"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271061927/-aAek8bIa.png" alt="output_11_5.png"></p>
<p>Unlike the energy graphs, the temperature graphs don&#39;t have any large immediately noticeable trends.
However, we can see that the temperature varies from a minimum of around 8° to a maximum of around 22°.
Although this plot doesn&#39;t show any significant variations of temperature between states, they do exist.
Tasmania is consistently cooler (being a small island), unlike hot and sweaty New South Wales and South Australia.</p>
<h2 id="temperature-and-energy-correlations">Temperature and Energy Correlations</h2>
<p>We know temperature and energy are highly correlated, but we don&#39;t yet know how.
Well, let&#39;s find out!</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"red"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>], title=<span class="hljs-string">"Tasmania"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"green"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>], title=<span class="hljs-string">"Victoria"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"purple"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>], title=<span class="hljs-string">"New South Wales"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"orange"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>], title=<span class="hljs-string">"Queensland"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"blue"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>], title=<span class="hljs-string">"South Australia"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271092857/_V5zPgnct.png" alt="output_13_5.png"></p>
<p>These charts show us one major thing, that the greater the trend, the more confusing (and complicated) the relationship between temperature and energy demand becomes.
This is why the graph of temperature vs energy demand for Tasmania is almost a straight line (albeit a thick one), whereas the rest curved.
In other words, the greater the trend, the wider and thicker the curve!</p>
<p>Since we don&#39;t have any population or economic data, the trend must be removed (in the next tutorial).</p>
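<p>To give a flavour of what &quot;removing the trend&quot; can look like (a rough, hypothetical sketch only - the next tutorial covers the actual approach, and this assumes each region&#39;s rows are sorted by time), one option is to subtract a slow-moving rolling average from each state&#39;s demand:</p>
<pre><code class="lang-python"># Sketch: per region, subtract a year-long rolling mean so that only the
# short-term (temperature-driven) fluctuations remain
trend = data.groupby("Region")["TotalDemand"].transform(lambda s: s.rolling("365D").mean())
detrended = data["TotalDemand"] - trend
</code></pre>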
<h2 id="chapter-3-analysing-small-timeframes">Chapter 3 - Analysing Small Timeframes</h2>
<p>The graphs below show the comparison of energy demand between regions for a single day and a week during winter and summer.
We can start with a week (11/06/2017 to 17/06/2017 here) to see how energy demand fluctuates during the week.
We&#39;re <em>only testing one small timeframe</em> for brevity (the same patterns below can be seen elsewhere too).</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">4</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">10</span>), tight_layout=<span class="hljs-keyword">True</span>)

<span class="hljs-comment"># Winter</span>
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Winter"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Winter"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Winter"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Winter"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Winter"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])

<span class="hljs-comment"># Summer</span>
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Summer"</span>, ax=axes[<span class="hljs-number">2</span>,<span class="hljs-number">0</span>])
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Summer"</span>, ax=axes[<span class="hljs-number">2</span>,<span class="hljs-number">1</span>])
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Summer"</span>, ax=axes[<span class="hljs-number">2</span>,<span class="hljs-number">2</span>])
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Summer"</span>, ax=axes[<span class="hljs-number">3</span>,<span class="hljs-number">0</span>])
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Summer"</span>, ax=axes[<span class="hljs-number">3</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271126554/q2mf7wdC0.png" alt="output_15_10.png"></p>
<p>Daily energy usage follows just about the same shape in every state.
There are two peaks each day, in both summer and winter.
The first is smaller and falls in the morning (5-9 am), whilst the second is larger and falls in the evening (4-7 pm).
These occur at the times when people are most active at home (before and after work).
Although only a few graphs can be shown here, these patterns do persist (swapping in different days will show this).</p>
<p>The energy demand across a week in summer tends to follow a similar shape to winter, but the demand climbs far more steeply as the week goes on!</p>
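<p>If you&#39;d like to verify this yourself, here&#39;s a minimal sketch (reusing the same <code>data</code> dataframe and the imports from earlier in the series) which averages demand by day of the week for each sample week:</p>
<pre><code class="lang-python"># Quick sanity check: mean demand per weekday for each seasonal sample week
winter = data["2017-06-11":"2017-06-17"]
summer = data["2017-01-14":"2017-01-20"]

fig, axes = plt.subplots(ncols=2, figsize=(12, 4), tight_layout=True)
winter.groupby([winter.index.dayofweek, "Region"])["TotalDemand"].mean().unstack().plot(title="Winter week", ax=axes[0])
summer.groupby([summer.index.dayofweek, "Region"])["TotalDemand"].mean().unstack().plot(title="Summer week", ax=axes[1])
</code></pre>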
<p>We can now move on to looking at a single day (11/06/2017 here).</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">10</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"Tasmania"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>], color=<span class="hljs-string">"red"</span>)
data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"Victoria"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>], color=<span class="hljs-string">"green"</span>)
data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"New South Wales"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>], color=<span class="hljs-string">"purple"</span>)
data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"Queensland"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>], color=<span class="hljs-string">"orange"</span>)
data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"South Australia"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>], color=<span class="hljs-string">"blue"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271153028/fp2UZICmL.png" alt="output_17_5.png"></p>
<p>From these charts, we can see that energy usage ramps up from 6 am to 9 am and again from 3 pm to 6 pm.
From 12 pm to 3 pm our energy usage remains stable.
It drops off at the very start and end of the day (likely when most people are asleep).
The demand for summer and winter days is mostly similar.</p>
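<p>To check these ramp times against the whole dataset rather than a single day, a small sketch like this (again assuming the same <code>data</code> dataframe) averages demand by hour of day:</p>
<pre><code class="lang-python"># Mean demand by hour of day, per region, across the full dataset
hourly = data.groupby([data.index.hour, "Region"])["TotalDemand"].mean().unstack()
hourly.plot(figsize=(12, 6), title="Mean demand by hour of day")
</code></pre>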
<p><em>Photo by Scott Graham on <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/5fNmWej4tAA">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Energy Demand Prediction Project - Part 1 Data Cleaning]]></title><description><![CDATA[Let's see how our machine learning, project planning and essential coding tools can be brought to life in a real-world project!
Today we're going through how we can predict how much energy we use daily using temperature data.
We start here with impor...]]></description><link>https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning</link><guid isPermaLink="true">https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[ML]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[pandas]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Fri, 03 Jul 2020 03:25:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1593746060281/DlGrb_OpA.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#39;s see how our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a>, <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls-ckbjfohz201ujzqs1lwu5l7xd">project planning</a> and <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">essential coding tools</a> can be brought to life in a real-world project!
Today we&#39;re going through how to predict our daily energy usage from temperature data.
We start here with <strong>importing and cleaning data, before graphing and depicting the story of our energy usage and finally modelling it</strong>.</p>
<p>This is the first part of three. Feel free to code along, the full project is on <a target='_blank' rel='noopener'  href="https://github.com/KamWithK/Temp2Enrgy">GitHub</a>.</p>
<h1 id="the-story">The story</h1>
<p>We wake up in the mornings, turn on the heater/air conditioner, find some yogurt from the fridge for breakfast, shave, turn on a computer, get the music rolling and finally get to work.
These tasks all have one thing in common - they use power!
Our heavy reliance on electricity makes it crucial to estimate how much energy we&#39;ll need to generate each day.</p>
<p>But fear not if this seems challenging.
We will take it one step at a time,
at each stage linking back to how it relates to our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">ML field guide</a>.</p>
<p>We start with finding energy and temperature data (can&#39;t do much without it 😊).
Ours is from the Bureau of Meteorology and Australian Energy Market Operator, but please do replicate the process for another country (e.g. America).
After a quick and painless download (lucky us), we can briefly review our spreadsheets.
But a look at the data highlights a horrifying truth - there&#39;s simply... far too much to deal with!
The merciless cells of numbers, more numbers and categories, are really overwhelming.
It&#39;s not really apparent how we&#39;ll combine the array of spreadsheets together, nor how we&#39;ll be able to analyse, learn from or model them.</p>
<p>We, as optimistic folk, start by noting down how the data is organised:
the folders and files, where they sit and what each contains.
Combining our understanding of the data&#39;s structure with the <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#chapter-1-importing-data">importing techniques</a> naturally leads us to overcome our first fear - providing easy access to the data with code.</p>
<p>Next, we seek to eliminate the clumsy mess.
We need to <em>clean the temperature &amp; energy data</em> by identifying what information has the greatest impact on our net energy usage!
It once again starts with simple observations of the spreadsheets to get a rough grip on the type of data present.
We are specifically interested in finding weird quirks/recurring patterns which <strong><em>could indicate</em> that something is wrong</strong>.
Once we follow up on each hunch, we can become more confident about the origin of our problems.
This allows us to confidently <em>decide what to remove outright, what to keep and what to quickly fix</em> 🤔 (we don&#39;t want to go Rambo 👹 on everything).
Simple stats and graphs form the cornerstone of this analysis!</p>
<p>At this point, we&#39;d have successfully made it through the first and most important part of our project!
After a brief celebration, we can move onto combining our two separate datasets (one for energy and one for temperature).
This allows us to correlate the two.
Finally, we&#39;re able to depict our story of how we use energy through each invigorating day, month and year... with the help of the trends and patterns we see in graphs!
What on Earth would be more satisfying?
Well, a few things... but let&#39;s not forget to create a model (it&#39;ll be fun) to show off to all our friends!
Let&#39;s not jump the gun though... this will all be in the next two tutorials.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os

<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-keyword">from</span> IPython.core.interactiveshell <span class="hljs-keyword">import</span> InteractiveShell

InteractiveShell.ast_node_interactivity = <span class="hljs-string">"all"</span>
pd.options.display.max_columns = <span class="hljs-keyword">None</span>
</code></pre>
<h1 id="the-epochs">The Epochs</h1>
<h2 id="chapter-1-importing-data">Chapter 1 - Importing Data</h2>
<blockquote>
<p>Data comes in all kinds of shapes and sizes and so the process we use to get everything into code often varies.</p>
</blockquote>
<p>Through analysing the files available we have found out <strong>how our data is structured</strong>.
We start on a high level, noticing that there are many temperature and energy spreadsheets formatted as CSVs.
Although there&#39;s an incredible number of them, it is just because the data was divided into small chunks.
Each CSV is a continuation of the last one.
The actual temperature spreadsheets contain dates, along with a variety of temperature, humidity and precipitation measurements.
Our energy files are far more basic, containing just dates, energy demand history, prices (RRP) and whether the data was manually or automatically logged.
The measurements have been made on a 30-minute basis.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1593739286496/5LcA_l5Bn.png" alt="file structure.png"></p>
<blockquote>
<p>Divide and conquer!</p>
</blockquote>
<p>As we can see, all of this information comes together to form an intuitive understanding of the raw data.
Of course, we <em>don&#39;t yet understand everything we&#39;ll need to perform our analysis, but we have enough to transition from having raw data to useable code</em> 🥳!</p>
<p>To transition into code, we compare our findings to our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#chapter-1-importing-data">importing techniques</a>.
We know that we have a list of spreadsheets to be combined, so we can first form lists and then use Pandas <code>concat</code> to stack them together.</p>
<pre><code class="lang-python">energy_locations = os.listdir(<span class="hljs-string">"../Data/Energy"</span>)
temperature_locations = os.listdir(<span class="hljs-string">"../Data/Temperature"</span>)

energy_CSVs = [pd.read_csv(<span class="hljs-string">"../Data/Energy/"</span> + location) <span class="hljs-keyword">for</span> location <span class="hljs-keyword">in</span> energy_locations]
temperature_CSVs = [pd.read_csv(<span class="hljs-string">"../Data/Temperature/"</span> + location) <span class="hljs-keyword">for</span> location <span class="hljs-keyword">in</span> temperature_locations <span class="hljs-keyword">if</span> <span class="hljs-string">"Data"</span> <span class="hljs-keyword">in</span> location]
</code></pre>
<pre><code class="lang-python">energy_data = pd.concat(energy_CSVs, ignore_index=<span class="hljs-keyword">True</span>)
temperature_data = pd.concat(temperature_CSVs, ignore_index=<span class="hljs-keyword">True</span>)
</code></pre>
<p>Now, believe it or not, we&#39;ve done 90% of the importing; the only thing left is to ensure our features (columns) are named succinctly and consistently.
Through renaming our columns (like below), we make it clear what is in each column.
Future us will definitely be grateful!</p>
<pre><code class="lang-python">energy_data.columns
temperature_data.columns
</code></pre>
<pre><code>Index([<span class="hljs-string">'REGION'</span>, <span class="hljs-string">'SETTLEMENTDATE'</span>, <span class="hljs-string">'TOTALDEMAND'</span>, <span class="hljs-string">'RRP'</span>, <span class="hljs-string">'PERIODTYPE'</span>], dtype=<span class="hljs-string">'object'</span>)
Index([<span class="hljs-string">'hm'</span>, <span class="hljs-string">'Station Number'</span>, <span class="hljs-string">'Year Month Day Hour Minutes in YYYY'</span>, <span class="hljs-string">'MM'</span>,
       <span class="hljs-string">'DD'</span>, <span class="hljs-string">'HH24'</span>, <span class="hljs-string">'MI format in Local time'</span>,
       <span class="hljs-string">'Year Month Day Hour Minutes in YYYY.1'</span>, <span class="hljs-string">'MM.1'</span>, <span class="hljs-string">'DD.1'</span>, <span class="hljs-string">'HH24.1'</span>,
       <span class="hljs-string">'MI format in Local standard time'</span>,
       <span class="hljs-string">'Precipitation since 9am local time in mm'</span>,
       <span class="hljs-string">'Quality of precipitation since 9am local time'</span>,
       <span class="hljs-string">'Air Temperature in degrees C'</span>, <span class="hljs-string">'Quality of air temperature'</span>,
       <span class="hljs-string">'Wet bulb temperature in degrees C'</span>, <span class="hljs-string">'Quality of Wet bulb temperature'</span>,
       <span class="hljs-string">'Dew point temperature in degrees C'</span>,
       <span class="hljs-string">'Quality of dew point temperature'</span>, <span class="hljs-string">'Relative humidity in percentage %'</span>,
       <span class="hljs-string">'Quality of relative humidity'</span>, <span class="hljs-string">'Wind speed in km/h'</span>,
       <span class="hljs-string">'Wind speed quality'</span>, <span class="hljs-string">'Wind direction in degrees true'</span>,
       <span class="hljs-string">'Wind direction quality'</span>,
       <span class="hljs-string">'Speed of maximum windgust in last 10 minutes in  km/h'</span>,
       <span class="hljs-string">'Quality of speed of maximum windgust in last 10 minutes'</span>,
       <span class="hljs-string">'Mean sea level pressure in hPa'</span>, <span class="hljs-string">'Quality of mean sea level pressure'</span>,
       <span class="hljs-string">'Station level pressure in hPa'</span>, <span class="hljs-string">'Quality of station level pressure'</span>,
       <span class="hljs-string">'AWS Flag'</span>, <span class="hljs-string">'#'</span>],
      dtype=<span class="hljs-string">'object'</span>)
</code></pre><pre><code class="lang-python">energy_data.columns = [<span class="hljs-string">"Region"</span>, <span class="hljs-string">"Date"</span>, <span class="hljs-string">"TotalDemand"</span>, <span class="hljs-string">"RRP"</span>, <span class="hljs-string">"PeriodType"</span>]
temperature_data.columns = [
    <span class="hljs-string">"HM"</span>, <span class="hljs-string">"StationNumber"</span>, <span class="hljs-string">"Year1"</span>, <span class="hljs-string">"Month1"</span>, <span class="hljs-string">"Day1"</span>, <span class="hljs-string">"Hour1"</span>, <span class="hljs-string">"Minute1"</span>, <span class="hljs-string">"Year"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Day"</span>, <span class="hljs-string">"Hour"</span>, <span class="hljs-string">"Minute"</span>, <span class="hljs-string">"Precipitation"</span>, <span class="hljs-string">"PrecipitationQuality"</span>,
    <span class="hljs-string">"AirTemperature"</span>, <span class="hljs-string">"AirTemperatureQuality"</span>, <span class="hljs-string">"WetBulbTemperature"</span>, <span class="hljs-string">"WetBulbTemperatureQuality"</span>, <span class="hljs-string">"DewTemperature"</span>, <span class="hljs-string">"DewTemperatureQuality"</span>, <span class="hljs-string">"RelativeHumidity"</span>,
    <span class="hljs-string">"RelativeHumidityQuality"</span>, <span class="hljs-string">"WindSpeed"</span>, <span class="hljs-string">"WindSpeedQuality"</span>, <span class="hljs-string">"WindDirection"</span>, <span class="hljs-string">"WindDirectionQuality"</span>, <span class="hljs-string">"WindgustSpeed"</span>, <span class="hljs-string">"WindgustSpeedQuality"</span>, <span class="hljs-string">"SeaPressure"</span>,
    <span class="hljs-string">"SeaPressureQuality"</span>, <span class="hljs-string">"StationPressure"</span>, <span class="hljs-string">"StationPressureQuality"</span>, <span class="hljs-string">"AWSFlag"</span>, <span class="hljs-string">"#"</span>
]
</code></pre>
<p>Now be proud, because we just finished the first part of our journey!
With the ball rolling, things will be smoother sailing from here on out.</p>
<h2 id="chapter-2-data-cleaning">Chapter 2 - Data Cleaning</h2>
<h3 id="formatting-the-data">Formatting the data</h3>
<blockquote>
<p>Everyone is driven insane by missing data, but there&#39;s always a light at the end of the tunnel.</p>
</blockquote>
<p>There&#39;s good and bad news, so I&#39;ll start with the good news first.
We&#39;ve gone through our initial phase of getting everything together and so we now have a bare-bones understanding of what&#39;s available to us/how to access it.
We can view our data using the <code>energy_data</code> and <code>temperature_data</code> dataframes!</p>
<p>Now for the bad news.
Although we likely haven&#39;t noticed it yet, our data is far from perfect.
We have loads of missing (empty) cells, along with duplicated and badly formatted data.
But don&#39;t be disheartened, because this isn&#39;t a rare cataclysmic disaster:
It happens all the time 😎 (what&#39;s not to like about it?) 😎.</p>
<p>This process can appear threatening since everything just seems... messed up.
Now insight and experience do help a lot, BUT but but... that doesn&#39;t mean that it&#39;s impossible for us mortals!
There&#39;s one thing we can do to overcome this - work like mad scientists!
We can identify our dataset&#39;s quirks/issues, and then test every technique we can think of 🤯.
Our techniques come from the <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#chapter-2-data-cleaning">field guide</a> (NEVER reinvent the wheel)!</p>
<p>Just to doubly make sure we&#39;re not running stray, here are the problems we&#39;re looking for:</p>
<ul>
<li>Completely empty columns/rows</li>
<li>Duplicate values</li>
<li>Inaccurate/generic datatypes</li>
</ul>
<p>Yes, there are only three right now, but... don&#39;t forget that we want robust analysis!
So actually dealing with these problems in a concrete fashion does take a little bit of effort (don&#39;t be too dodgy, that right&#39;s reserved for politicians - no offence).</p>
<p><em>Final disclaimer - there&#39;s a lot to take in, so please take a deep breath, drink some coffee and slowly look for patterns.</em></p>
<pre><code class="lang-python">energy_data
temperature_data
</code></pre>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" data-card-width="600px" data-card-key="2e4d628b39a64b99917c73956a16b477" href="<iframe class="airtable-embed" src="https://airtable.com/embed/shrAdtqxd30xD7SSd" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>" data-card-controls="0" data-card-theme="light"><iframe class="airtable-embed" src="https://airtable.com/embed/shrAdtqxd30xD7SSd" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe></a></div>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" data-card-width="600px" data-card-key="2e4d628b39a64b99917c73956a16b477" href="<iframe class="airtable-embed" src="https://airtable.com/embed/shrjQzcOu1FyVzKbd" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>" data-card-controls="0" data-card-theme="light"><iframe class="airtable-embed" src="https://airtable.com/embed/shrjQzcOu1FyVzKbd" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe></a></div>
<p>We can see that columns like <code>PrecipitationQuality</code> and <code>HM</code> seem to have the same value throughout.
To amend this we can remove columns with two or fewer unique elements (two rather than one, since an otherwise-constant column may also contain missing values).</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">remove_non_uniques</span><span class="hljs-params">(dataframe: pd.DataFrame, filter = [])</span>:</span>
    remove = [name <span class="hljs-keyword">for</span> name, series <span class="hljs-keyword">in</span> dataframe.items() <span class="hljs-keyword">if</span> len(series.unique()) &lt;= <span class="hljs-number">2</span> <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span> name <span class="hljs-keyword">in</span> filter]
    dataframe.drop(remove, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)
    <span class="hljs-keyword">return</span> remove

print(<span class="hljs-string">"Removed:"</span>)
remove_non_uniques(energy_data)
remove_non_uniques(temperature_data)
</code></pre>
<pre><code>Removed:
[<span class="hljs-string">'PeriodType'</span>]

[<span class="hljs-string">'HM'</span>,
 <span class="hljs-string">'PrecipitationQuality'</span>,
 <span class="hljs-string">'AirTemperatureQuality'</span>,
 <span class="hljs-string">'WetBulbTemperatureQuality'</span>,
 <span class="hljs-string">'DewTemperatureQuality'</span>,
 <span class="hljs-string">'RelativeHumidityQuality'</span>,
 <span class="hljs-string">'WindSpeedQuality'</span>,
 <span class="hljs-string">'WindDirectionQuality'</span>,
 <span class="hljs-string">'WindgustSpeedQuality'</span>,
 <span class="hljs-string">'SeaPressureQuality'</span>,
 <span class="hljs-string">'StationPressureQuality'</span>,
 <span class="hljs-string">'#'</span>]
</code></pre><p>Duplicate rows can also be removed.
This is far easier!</p>
<pre><code class="lang-python">energy_data.drop_duplicates(inplace=<span class="hljs-keyword">True</span>)
temperature_data.drop_duplicates(inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<p>The last thing is to check our datatypes.
This seems unnecessary here, yet modelling and graphing libraries are quite touchy about datatypes.</p>
<p>The process is quite straightforward: look at the column/what it contains and then compare that to the actual datatype.
With a large number of columns, it can be best to start by looking at dates and categories since they&#39;re almost always misinterpreted (as objects, floats or integers).
In general <code>object</code> should only be used for strings.</p>
<pre><code class="lang-python">energy_data.dtypes
temperature_data.dtypes
</code></pre>
<pre><code>Region          <span class="hljs-keyword">object</span>
Date            <span class="hljs-keyword">object</span>
TotalDemand    float64
RRP            float64
dtype: <span class="hljs-keyword">object</span>

StationNumber          int64
Year1                  int64
Month1                 int64
Day1                   int64
Hour1                  int64
Minute1                int64
Year                   int64
Month                  int64
Day                    int64
Hour                   int64
Minute                 int64
Precipitation         <span class="hljs-keyword">object</span>
AirTemperature        <span class="hljs-keyword">object</span>
WetBulbTemperature    <span class="hljs-keyword">object</span>
DewTemperature        <span class="hljs-keyword">object</span>
RelativeHumidity      <span class="hljs-keyword">object</span>
WindSpeed             <span class="hljs-keyword">object</span>
WindDirection         <span class="hljs-keyword">object</span>
WindgustSpeed         <span class="hljs-keyword">object</span>
SeaPressure           <span class="hljs-keyword">object</span>
StationPressure       <span class="hljs-keyword">object</span>
AWSFlag               <span class="hljs-keyword">object</span>
dtype: <span class="hljs-keyword">object</span>
</code></pre><p>In our case, we have not just one set of dates, but two (damn, the BOM data collection team needs to chill) 🥴.
As we predicted, the dates are integers and spread out across multiple columns (one for the year, one for the month, day, hour and minute).</p>
<p>We can start by getting rid of the duplicated set of dates (the second was due to daylight saving), and then we can parse the remaining date columns.
This formats our data in the nice orderly way we desired!</p>
<pre><code class="lang-python"><span class="hljs-comment"># Remove extra dates</span>
temperature_data.drop([<span class="hljs-string">"Year1"</span>, <span class="hljs-string">"Month1"</span>, <span class="hljs-string">"Day1"</span>, <span class="hljs-string">"Hour1"</span>, <span class="hljs-string">"Minute1"</span>], axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)

<span class="hljs-comment"># Reformat dates into Pandas' datatime64 objects</span>
<span class="hljs-comment"># Replacing old format</span>
temperature_data[<span class="hljs-string">"Date"</span>] = pd.to_datetime(temperature_data[[<span class="hljs-string">"Year"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Day"</span>, <span class="hljs-string">"Hour"</span>, <span class="hljs-string">"Minute"</span>]])
energy_data[<span class="hljs-string">"Date"</span>] = pd.to_datetime(energy_data[<span class="hljs-string">"Date"</span>])

temperature_data.drop([<span class="hljs-string">"Year"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Day"</span>, <span class="hljs-string">"Hour"</span>, <span class="hljs-string">"Minute"</span>], axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<p>Now we can also see a few problems with station numbers (where measurements were made), <code>AWSFlag</code>&#39;s (whether data was manually collected), temperature, humidity, pressure and precipitation measurements.
We do need to change these datatypes, but to do so we&#39;ll need to go slightly off the books as converting datatypes using the standard <code>.astype(&quot;category&quot;)</code> throws a few errors.
We can overcome these by noting down what the complaint is about, accounting for it and then trying to run the above function once again.</p>
<p>Just to be sure we&#39;re all on the same page, here&#39;s a short summary of the errors we&#39;re dealing with:</p>
<ul>
<li>Leading/trailing spaces (so &quot;12&quot; becomes &quot;       12     &quot;)</li>
<li>Random hashtags occasionally present (so 99.99% of cells will contain numbers, but then one will contain &quot;###&quot;)</li>
<li>There&#39;s a small amount of missing categorical data</li>
</ul>
<p>We can remove the leading and trailing spaces by using <code>.str.strip()</code>.
Next, to remove the rogue hashtags, we can use Pandas&#39; <code>replace</code> function to overwrite them with <code>np.NaN</code> (the value Pandas uses to mark missing data).
To finish off, we can just assume that any missing data was manually collected (worst case scenario).
The <code>fillna</code> and <code>replace</code> functions are both needed, as Pandas treats <code>np.NaN</code> and empty strings (&quot;&quot;) differently.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">to_object_columns</span><span class="hljs-params">(lambda_function)</span>:</span>
    string_columns = temperature_data.select_dtypes(<span class="hljs-string">"object"</span>).columns
    temperature_data[string_columns] = temperature_data[string_columns].apply(lambda_function)
</code></pre>
<pre><code class="lang-python">to_object_columns(<span class="hljs-keyword">lambda</span> column: column.str.strip())

temperature_data[<span class="hljs-string">"AWSFlag"</span>] = temperature_data[<span class="hljs-string">"AWSFlag"</span>].replace(<span class="hljs-string">""</span>, <span class="hljs-number">0</span>).astype(<span class="hljs-string">"category"</span>)
temperature_data[<span class="hljs-string">"AWSFlag"</span>].fillna(<span class="hljs-number">0</span>, inplace=<span class="hljs-keyword">True</span>)
temperature_data[<span class="hljs-string">"RelativeHumidity"</span>] = temperature_data[<span class="hljs-string">"RelativeHumidity"</span>].replace(<span class="hljs-string">"###"</span>, np.NaN)

to_object_columns(<span class="hljs-keyword">lambda</span> column: pd.to_numeric(column))
</code></pre>
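<p>A quick aside: if more stray strings like &quot;###&quot; ever turn up in other columns, a slightly more defensive variant (an alternative to the call above, not what we run here) lets <code>to_numeric</code> coerce anything unparseable into <code>NaN</code>:</p>
<pre><code class="lang-python"># Alternative: turn any remaining non-numeric stragglers into NaN instead of erroring
to_object_columns(lambda column: pd.to_numeric(column, errors="coerce"))
</code></pre>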
<pre><code class="lang-python">temperature_data.dtypes
</code></pre>
<pre><code>StationNumber                  <span class="hljs-keyword">int64</span>
Precipitation                <span class="hljs-keyword">float64</span>
AirTemperature               <span class="hljs-keyword">float64</span>
WetBulbTemperature           <span class="hljs-keyword">float64</span>
DewTemperature               <span class="hljs-keyword">float64</span>
RelativeHumidity             <span class="hljs-keyword">float64</span>
WindSpeed                    <span class="hljs-keyword">float64</span>
WindDirection                <span class="hljs-keyword">float64</span>
WindgustSpeed                <span class="hljs-keyword">float64</span>
SeaPressure                  <span class="hljs-keyword">float64</span>
StationPressure              <span class="hljs-keyword">float64</span>
AWSFlag                     category
Date                  datetime64[ns]
dtype: object
</code></pre><p>There is one final thing we can do to improve how our data is formatted.
That is to ensure that the columns identifying where our temperature and energy measurements were made both use the same categories.</p>
<p>Since each station maps to a single region, we can replace the separate station and region codes with the regions&#39; short forms.
Note that this information was provided in the dataset notes (don&#39;t worry, we&#39;re not expected to remember that 94029 means Tasmania).
To do these conversions we just create two dictionaries.
Each key-value pair represents the old code to map to the new one (so map &quot;SA1&quot; to &quot;SA&quot; and 23090 to &quot;SA&quot;).
The Pandas <code>map</code> function does the rest of the work.</p>
<pre><code class="lang-python">energy_data[<span class="hljs-string">"Region"</span>].unique()
temperature_data[<span class="hljs-string">"StationNumber"</span>].unique()
</code></pre>
<pre><code><span class="hljs-keyword">array</span>([<span class="hljs-string">'VIC1'</span>, <span class="hljs-string">'SA1'</span>, <span class="hljs-string">'TAS1'</span>, <span class="hljs-string">'QLD1'</span>, <span class="hljs-string">'NSW1'</span>], dtype=object)
<span class="hljs-keyword">array</span>([<span class="hljs-number">94029</span>, <span class="hljs-number">86071</span>, <span class="hljs-number">66062</span>, <span class="hljs-number">40913</span>, <span class="hljs-number">86338</span>, <span class="hljs-number">23090</span>])
</code></pre><pre><code class="lang-python">region_remove_number_map = {<span class="hljs-string">"SA1"</span>: <span class="hljs-string">"SA"</span>, <span class="hljs-string">"QLD1"</span>: <span class="hljs-string">"QLD"</span>, <span class="hljs-string">"NSW1"</span>: <span class="hljs-string">"NSW"</span>, <span class="hljs-string">"VIC1"</span>: <span class="hljs-string">"VIC"</span>, <span class="hljs-string">"TAS1"</span>: <span class="hljs-string">"TAS"</span>}
station_to_region_map = {<span class="hljs-number">23090</span>: <span class="hljs-string">"SA"</span>, <span class="hljs-number">40913</span>: <span class="hljs-string">"QLD"</span>, <span class="hljs-number">66062</span>: <span class="hljs-string">"NSW"</span>, <span class="hljs-number">86071</span>: <span class="hljs-string">"VIC"</span>, <span class="hljs-number">94029</span>: <span class="hljs-string">"TAS"</span>, <span class="hljs-number">86338</span>: <span class="hljs-string">"VIC"</span>}

temperature_data[<span class="hljs-string">"Region"</span>] = temperature_data[<span class="hljs-string">"StationNumber"</span>].map(station_to_region_map)
energy_data[<span class="hljs-string">"Region"</span>] = energy_data[<span class="hljs-string">"Region"</span>].map(region_remove_number_map)

temperature_data.drop(<span class="hljs-string">"StationNumber"</span>, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)
</code></pre>
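<p>A quick check (the same pattern as before) confirms both dataframes now share one region vocabulary:</p>
<pre><code class="lang-python">energy_data["Region"].unique()
temperature_data["Region"].unique()
</code></pre>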
<p>One last thing to note about the way our data is formatted (promise).
We currently don&#39;t index/sort our data in any specific way, even though it is a time series.
So we can use <code>set_index</code> to change that.</p>
<pre><code class="lang-python">energy_data.set_index(<span class="hljs-string">"Date"</span>, inplace=<span class="hljs-keyword">True</span>)
temperature_data.set_index(<span class="hljs-string">"Date"</span>, inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<h2 id="dealing-with-missing-data">Dealing with missing data</h2>
<p>So far we&#39;ve made sure that all our data can be easily accessed without any troubles.
We&#39;ve made sure everything is formatted right, and now we can use it... well kind of.
Although our data is correctly formatted, it doesn&#39;t quite mean that it&#39;s meaningful, useful or even present!</p>
<p>We can get through this though, we just need to be strategic.
The key thing to remember here:</p>
<blockquote>
<p>Don&#39;t do more work than necessary!</p>
</blockquote>
<p><strong>Our ultimate goal isn&#39;t to fix everything, but to remove what definitely is useless and enhance the quality of what could be especially interesting/useful</strong>.
This process aids us in knowing we&#39;re making solid, generalisable and reasonable predictions or interpretations (there&#39;s little point in the whole process otherwise).</p>
<p>One nice way to do this is to use graphs.
Through visualising our data we can easily spot where it&#39;s missing, where outliers exist and where two features are correlated.
We, of course, can&#39;t do <em>all of this on one plot</em>, and so we&#39;ll start by just looking for missing data.
Sections of large or frequent gaps are the potentially problematic regions we&#39;re looking for.
If these don&#39;t exist (i.e. there&#39;s little to no missing data), then our work is reduced.</p>
<p>Keep in mind that we have two datasets (not one), categorised by states!
As the data is recorded in separate states, grouping it all together will not correctly represent it.
Hence, we will have a series of plots (one per state) for each feature we want to analyse.
We are slightly lucky though because there&#39;s only one meaningful energy feature (<code>TotalDemand</code>), which we will see has little to no missing data.</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), tight_layout=<span class="hljs-keyword">True</span>)

energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"red"</span>,title=<span class="hljs-string">"Tasmania Energy Demand"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"green"</span>,title=<span class="hljs-string">"Victoria Energy Demand"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"purple"</span>,title=<span class="hljs-string">"New South Wales Energy Demand"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"orange"</span>,title=<span class="hljs-string">"Queensland Energy Demand"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color=<span class="hljs-string">"blue"</span>,title=<span class="hljs-string">"South Australia Energy Demand"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1593739057463/2yxDKwnrW.png" alt="output_30_5.png"></p>
<p>As we can see, the plots are all continuous; this is how we confirm that there is no major source of missing data.
There are a variety of other trends here, but we&#39;ll leave those for later!</p>
<p>Now to move onto weather data.
This is where we&#39;ll see the usefulness of graphs!
Although it&#39;s possible to simply find the percent of missing data, the graphs easily show the nature of the null values.
We immediately see where it&#39;s missing, which itself suggests what method should be used (i.e. removing the data, resampling, etc).</p>
<p>We start by looking at <code>WetBulbTemperature</code>.
We will see that it is largely intact just like our energy data.
We will then see <code>AirTemperature</code>, and it&#39;ll be... rough and tattered.</p>
<p>For brevity, only a few key graphs are included here.
However, loads more can be graphed (please do play around with the code to see what else can be done)!
The problems with <code>AirTemperature</code> are similar to those in the following features:</p>
<ul>
<li>Precipitation</li>
<li>AirTemperature</li>
<li>DewTemperature</li>
<li>RelativeHumidity</li>
<li>WindSpeed</li>
<li>WindDirection</li>
<li>WindgustSpeed</li>
</ul>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), tight_layout=<span class="hljs-keyword">True</span>)

temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"red"</span>,title=<span class="hljs-string">"Tasmania Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"green"</span>,title=<span class="hljs-string">"Victoria Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"purple"</span>,title=<span class="hljs-string">"New South Wales Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"orange"</span>,title=<span class="hljs-string">"Queensland Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"blue"</span>,title=<span class="hljs-string">"South Australia Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1593739151024/Oc8PP-Uyl.png" alt="output_32_5.png"></p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), tight_layout=<span class="hljs-keyword">True</span>)

temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"red"</span>,title=<span class="hljs-string">"Tasmania Air Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"green"</span>,title=<span class="hljs-string">"Victoria Air Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"purple"</span>,title=<span class="hljs-string">"New South Wales Air Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"orange"</span>,title=<span class="hljs-string">"Queensland Air Temperature"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"blue"</span>,title=<span class="hljs-string">"South Australia Air Temperature"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1593739182026/Sa1jPitGK.png" alt="output_33_5.png"></p>
<p>The months to years of missing air temperature data (the blank sections) scattered at random places in the graph indicate that it&#39;s not worth looking into further.
This actually <strong>isn&#39;t a bad thing, it allows us to focus more on what is present</strong>: energy demand and wet bulb temperature.</p>
<p>These graphs show large or regular sections of missing data; however, they don&#39;t show the small amounts randomly distributed throughout.
To be safe, we can quickly use Pandas&#39; <code>DataFrame.isnull</code> to find which values are null.
It immediately shows that our energy data is in perfect condition (nothing missing), whilst most temperature columns have a very large proportion missing!</p>
<p>We&#39;ll remove most features since they&#39;d require us to sacrifice large numbers of rows.
What we want to keep (i.e. <code>WetBulbTemperature</code>) can have its missing values interpolated (deduce what the value should be based on its surrounding values).</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_null_counts</span><span class="hljs-params">(dataframe: pd.DataFrame)</span>:</span>
    <span class="hljs-keyword">return</span> dataframe.isnull().mean()[dataframe.isnull().mean() &gt; <span class="hljs-number">0</span>]
</code></pre>
<pre><code class="lang-python">get_null_counts(energy_data)
get_null_counts(temperature_data)
</code></pre>
<pre><code><span class="hljs-selector-tag">Series</span>(<span class="hljs-selector-attr">[]</span>, <span class="hljs-selector-tag">dtype</span>: <span class="hljs-selector-tag">float64</span>)

<span class="hljs-selector-tag">Precipitation</span>         0<span class="hljs-selector-class">.229916</span>
<span class="hljs-selector-tag">AirTemperature</span>        0<span class="hljs-selector-class">.444437</span>
<span class="hljs-selector-tag">WetBulbTemperature</span>    0<span class="hljs-selector-class">.011324</span>
<span class="hljs-selector-tag">DewTemperature</span>        0<span class="hljs-selector-class">.375311</span>
<span class="hljs-selector-tag">RelativeHumidity</span>      0<span class="hljs-selector-class">.375312</span>
<span class="hljs-selector-tag">WindSpeed</span>             0<span class="hljs-selector-class">.532966</span>
<span class="hljs-selector-tag">WindDirection</span>         0<span class="hljs-selector-class">.432305</span>
<span class="hljs-selector-tag">WindgustSpeed</span>         0<span class="hljs-selector-class">.403183</span>
<span class="hljs-selector-tag">SeaPressure</span>           0<span class="hljs-selector-class">.137730</span>
<span class="hljs-selector-tag">StationPressure</span>       0<span class="hljs-selector-class">.011135</span>
<span class="hljs-selector-tag">dtype</span>: <span class="hljs-selector-tag">float64</span>
</code></pre><pre><code class="lang-python">remove_columns = [<span class="hljs-string">"Precipitation"</span>, <span class="hljs-string">"AirTemperature"</span>, <span class="hljs-string">"DewTemperature"</span>, <span class="hljs-string">"RelativeHumidity"</span>, <span class="hljs-string">"WindSpeed"</span>, <span class="hljs-string">"WindDirection"</span>, <span class="hljs-string">"WindgustSpeed"</span>]
temperature_data.drop(remove_columns, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)

<span class="hljs-comment"># Note that using inplace currently throws an error</span>
<span class="hljs-comment"># So interpolated columns must be manually overridden</span>
missing_columns = list(get_null_counts(temperature_data).keys())
temperature_data[missing_columns] = temperature_data[missing_columns].interpolate(method=<span class="hljs-string">"time"</span>)
</code></pre>
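<p>As a sanity check, re-running our <code>get_null_counts</code> helper over the temperature data should now come back empty, or very nearly so (interpolation can&#39;t fill values that sit before a station&#39;s first reading):</p>
<pre><code class="lang-python"># Nothing (or almost nothing) should be missing after interpolating
get_null_counts(temperature_data)
</code></pre>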
<h2 id="combining-energy-and-temperature-data">Combining energy and temperature data</h2>
<p>Now, for the very last step.
Combining together the two dataframes into one, so we can associate our temperature data with energy demand.</p>
<p>We can use the <code>merge_asof</code> function to merge the two datasets.
This function merges the <em>closest values</em> together.
Since we have data grouped by region, we specify that with the <code>by</code> parameter.
We can choose to only merge energy and temperature entries which are 30 minutes or less apart.</p>
<pre><code class="lang-python">energy_data.sort_index(inplace=<span class="hljs-keyword">True</span>)
temperature_data.sort_index(inplace=<span class="hljs-keyword">True</span>)

data = pd.merge_asof(energy_data, temperature_data, left_index=<span class="hljs-keyword">True</span>, right_index=<span class="hljs-keyword">True</span>, by=<span class="hljs-string">"Region"</span>, tolerance=pd.Timedelta(<span class="hljs-string">"30 min"</span>))
</code></pre>
<p>To check whether the merge has happened successfully, we can check how many null values are present.
This works since unpaired rows cause null values.</p>
<pre><code class="lang-python">get_null_counts(data)
data.dropna(inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<pre><code><span class="hljs-selector-tag">WetBulbTemperature</span>    0<span class="hljs-selector-class">.001634</span>
<span class="hljs-selector-tag">SeaPressure</span>           0<span class="hljs-selector-class">.001634</span>
<span class="hljs-selector-tag">StationPressure</span>       0<span class="hljs-selector-class">.001634</span>
<span class="hljs-selector-tag">AWSFlag</span>               0<span class="hljs-selector-class">.001634</span>
<span class="hljs-selector-tag">dtype</span>: <span class="hljs-selector-tag">float64</span>
</code></pre><p>Now we can finally see some clean and sensible data!
This is the first table we see which does not pose a massive health and safety hazard.
Now that we&#39;ve got to this stage we should celebrate... it only gets better from here 👊.</p>
<pre><code class="lang-python">data
</code></pre>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" data-card-width="600px" data-card-key="2e4d628b39a64b99917c73956a16b477" href="<iframe class="airtable-embed" src="https://airtable.com/embed/shrQWLQqpH7XvnTGw" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>" data-card-controls="0" data-card-theme="light"><iframe class="airtable-embed" src="https://airtable.com/embed/shrQWLQqpH7XvnTGw" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe></a></div>
<h2 id="saving-final-data">Saving final data</h2>
<pre><code class="lang-python">pd.to_pickle(data, <span class="hljs-string">"../Data/Data.pickle"</span>)
</code></pre>
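<p>When we pick this back up in the next part, a single line (pointing at the same file path) loads everything again:</p>
<pre><code class="lang-python">data = pd.read_pickle("../Data/Data.pickle")
</code></pre>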
<p>Photo by Matthew Henry on <a target='_blank' rel='noopener'  href="https://unsplash.com/photos/yETqkLnhsUI">Unsplash</a></p>
]]></content:encoded></item><item><title><![CDATA[Insight is KING - How to Get it and AVOID PITFALLS]]></title><description><![CDATA[It is so hard to find an intuitive understanding of how your dataset functions!
Yet, coherently interpreting how your system works is crucial to finding a way to model or analyse any feature.
Stick on till the end to find out why initial insight is k...]]></description><link>https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls</link><guid isPermaLink="true">https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[ML]]></category><category><![CDATA[projects]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[advice]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 17 Jun 2020 14:11:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1592402604108/N7fGNYgFW.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It is <em>so hard</em> to find an intuitive understanding of how your dataset functions!
Yet, coherently interpreting how your system works is crucial to finding a way to model or analyse any feature.
Stick on till the end to find out why initial insight is key to good analysis, what you have to do to formulate your understanding and finally how to easily avoid and overcome problems.</p>
<h1 id="the-story">The story</h1>
<p>I&#39;ve been working on a <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Temp2Enrgy">project to predict energy demand using Australian weather data</a> in a team of 5.
It spanned half a year and mimicked a classic data science project.
<em>Start with data collection, move onto cleaning, modelling and then formulate a report.</em>
Halfway through I discovered one thing - I was <em>still cleaning the data</em>, and it was taking forever, despite having four others working on it!
It seemed like there were a million features we had to deal with, and most of them were nowhere near <em>ready to use</em>.
I had seen it as an opportunity to learn more about the different (fancy) techniques which could be used to interpolate (fill in) missing data.
But... <strong>I completely misunderstood my project</strong>.</p>
<blockquote>
<p>We didn&#39;t understand how our data worked and so had to account for 100 features instead of 3...</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592401734115/-Yj3sMJWB.jpeg" alt="complex.jpg"></p>
<p>It was a naive mistake but one that taught me several valuable lessons:</p>
<ol>
<li>Low-quality/irrelevant data does more harm than good</li>
<li>My team&#39;s work can only be completed as well as it is understood</li>
</ol>
<p>Now don&#39;t worry, we did end up finishing on time, with a model and complete report!
But... only after I accounted for these flaws could my team act like a well-oiled machine.
So here we&#39;ll explore how to find and profit from insight, along with how to avoid ever-present flaws and potential pitfalls (which everyone at some point comes across).</p>
<h1 id="how-to-find-insight">How to find insight</h1>
<p>It&#39;s easy to emphasise the importance of understanding how data functions, but hard to discover it.
With tutorials, Kaggle competitions and simple beginner exercises it was easy... the understanding and interpreting part was handed to you on a silver platter.
Just get out your golden fork (<em>code</em>) and knife (<em>ensemble model</em>), start cutting (<em>testing</em>) and voilà, you can eat (<em>a high-performance model</em>)!</p>
<p>Then you progress and begin a few real projects... oh no.
There may not be a one size fits all strategy, but we can make the process smooth and less painful by setting ourselves up properly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592401817467/bpHYDVisj.jpeg" alt="insight car.jpg"></p>
<h2 id="where-to-begin">Where to begin</h2>
<blockquote>
<p>Just COOOODE... NO PLEASE WAIT!</p>
</blockquote>
<p>Code is important, but I&#39;ll let you in on a little secret - <em>it takes less time when you know the process</em>.
Bbubbbut... how to know the process, before starting the first time?
Easy, interpret the mission objectives.
Mission objectives are what you want to get out of the project.
Mission objectives include the model and report itself, but also what you&#39;ll need to learn, what stages you&#39;ll go through and what challenges need to be accounted for.</p>
<blockquote>
<p>I thought my goal was to create the best model I could to predict energy demand, hahaha.
I was completely wrong!</p>
</blockquote>
<p>My goal wasn&#39;t to predict energy demand... because that would be nearly impossible.
What I actually wanted was to identify and model the energy demand trends and seasonal patterns which occurred in the short-term, and how temperature fit into this equation.
It involved importing data, researching to find which variables were useful, creating graphs to intuitively show how the data looked/worked and THEN finally creating a model to concretely measure the relationship between temperature and energy demand.
The main difficulties would lie in learning about how the energy time series worked and learning to guide my team through each stage of the process.
Painfully verbose yet?
The end-goal may have been a report, showing how everything worked its way till we got a model, but in reality, the model was just 10% of the work!</p>
<p>It is easy, though, to put this aside... to say that it is soft, unnecessary planning which is unlikely to directly impact the project right now.
In fact... yes, it is extra work, and fair enough if you&#39;re not compelled to plan everything out like this.
If you&#39;ve got a better alternative - let me know.
If not, give it a try.
It may not <em>impact you right now</em>, but it will aid in mitigating large problems, and illustrating how everything ties together!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592402036350/ZCtKxoC0S.jpeg" alt="road.jpg"></p>
<h2 id="finding-out-where-to-go-next">Finding out where to go next</h2>
<p><em>But I have no idea how to do any of this...</em>
Don&#39;t worry, with time, you&#39;ll figure it all out.
Just remember:</p>
<blockquote>
<p>The path seems a little less bumpy once you get started!</p>
</blockquote>
<p>If you have no idea where to start, though, find out what you&#39;ll need to learn and find ways to do so.
Simple tutorials and videos are a great way to start off.
Then, once you&#39;ve got a vague idea of what things should look like and what to do, just get started.
Simply follow the trail and see where it leads!</p>
<p>For data science projects, know the practical steps and processes.
I explain these in my <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning field guide article</a> which goes through every single step of the process in detail!
If you want to learn more, books like <a target='_blank' rel='noopener noreferrer'  href="https://amzn.to/3fua1k8">Hands-On Machine Learning</a> and <a target='_blank' rel='noopener noreferrer'  href="https://amzn.to/2UVmHZE">The Hundred-Page Machine Learning Book</a> are extremely useful.</p>
<p>To work better in a team make sure you understand collaborative coding tools (all explained <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">here</a>) and how to lead.
The book <a target='_blank' rel='noopener noreferrer'  href="https://amzn.to/30NT8g6">Extreme Ownership</a> is an amazing guide to teamwork and leadership (not data science specific, but Jocko Willink&#39;s advice applies nonetheless).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592402182523/I34tixWVm.jpeg" alt="collaboration.jpg"></p>
<h2 id="avoiding-collapse">Avoiding collapse</h2>
<p>Everything was going so well... until I realised we were still cleaning the data halfway through the project.
Everything seemed fine, progress seemed alright, not perfect... but fine.</p>
<p>Even when you&#39;ve set yourself up to succeed, and everything has progressed fine... things can go south!
But... I was lucky because a teacher told me to regularly do one thing:</p>
<blockquote>
<p>Keep a simple journal of progress, specifically commenting on what you&#39;ve done, how it panned out and what can be done to improve.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592401337556/RJ_lNTO77.jpeg" alt="coffee time.jpg"></p>
<p>It <em>worked wonders</em>!
Instead of stressing out after coming to grips with how much I needed to finish, I was able to prioritise and execute, because I knew where I could go wrong and I could account for it.
I knew my team could get distracted and lose focus, so I made sure to stick to the point and emphasise what we were trying to accomplish instead of writing down narrow tasks.
I knew it was difficult to pace ourselves, so I kept a count on how many weeks were left and made sure everyone understood.
I knew the coding was particularly challenging and threatening to most people, so I did a brief rundown on what it would involve/what it should look like with sample code.
In short, I accounted for my weaknesses and managed to turn around a bad situation.</p>
<p>The process <em>only took ~5 minutes</em> each week and drastically boosted progress.</p>
<blockquote>
<p>All you&#39;ve got to do is reflect on how your actions unfold and consider what could help you out further.</p>
</blockquote>
<p>This <strong>leads to simple actionable steps</strong>.</p>
<h1 id="thanks-for-reading">THANKS FOR READING</h1>
<p>I hope you&#39;ve enjoyed this, and that you&#39;ve found it helpful!
Please feel free to share this with anyone it may help.</p>
<p>My other articles on <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">practical coding skills</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-nlp-project-edition-ck6zsqtbo05srdfs135o8blcf">starting projects</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">web scraping</a> may be interesting.</p>
<p>Follow me on <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_">Twitter</a> for updates.</p>
<p>Photos by <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/QnUywvDdI1o">Toa Heftiba</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/l090uFWoPaI">John Barkiple</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/lypqQBIRXpo">Yang Jing</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/1n6jYq40syA">Kyle Glenn</a> and <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/Ev1XqeVL2wI">Josh Calabrese</a> on Unsplash</p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Field Guide]]></title><description><![CDATA[We all have to deal with data, and we try to learn about and implement machine learning into our projects.
But everyone seems to forget one thing... it's far from perfect, and there is so much to go through!
Don't worry, we'll discuss every little st...]]></description><link>https://www.kamwithk.com/machine-learning-field-guide</link><guid isPermaLink="true">https://www.kamwithk.com/machine-learning-field-guide</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[projects]]></category><category><![CDATA[side project]]></category><category><![CDATA[programming]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Fri, 12 Jun 2020 05:00:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1591892207311/PEqetlMtX.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We all have to deal with data, and we try to learn about and implement machine learning into our projects.
But everyone seems to forget one thing... it&#39;s far from perfect, and there is <em>so much to go through</em>!
Don&#39;t worry, we&#39;ll discuss every little step, from start to finish 👀.</p>
<p>All you&#39;ll need are <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">these fundamentals</a>!</p>
<h1 id="the-story-behind-it-all">The Story Behind it All</h1>
<p>We all start with either a dataset or a goal in mind.
Once <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">we&#39;ve found, collected or scraped our data</a>, we pull it up, and witness the overwhelming sight of merciless cells of numbers, more numbers, categories, and maybe some words 😨!
A naive thought crosses our mind, to use our machine learning prowess to deal with this tangled mess... but a quick search reveals the host of tasks we&#39;ll need to consider before <em>training a model</em> 😱!</p>
<p>Once we overcome the shock of our unruly data we look for ways to battle our formidable nemesis 🤔.
We start with trying to get our data into Python.
It is relatively simple on paper, but the process can be slightly... <em>involved</em>.
Nonetheless, a little effort is usually all that&#39;s needed (lucky us).
<p>Without wasting any time we begin <em>data cleaning</em> to get rid of the bogus and expose the beautiful.
Our methods start simple - observe and remove.
It works a few times, but then we realise... it really doesn&#39;t do us justice!
To deal with the mess though, we find a powerful tool to add to our arsenal: charts!
With our graphs, we can get a feel for our data, the patterns within it and where things are missing.
We can then <em>interpolate</em> (fill in) or remove missing data.</p>
<p>Finally, we approach our highly anticipated 😎 challenge, data modelling!
With a little research, we find out which tactics and models are commonly used.
It is a little difficult to decipher which one we should use, but we still manage to get through it and figure it all out!</p>
<p>We can&#39;t finish a project without doing something impressive though.
So, a final product, website, app or even a report will take us far!
We know first impressions are important so we fix up the GitHub repository and make sure everything&#39;s well documented and explained.
Now we are <em>finally able to show off our hard work to the rest of the world</em> 😎!</p>
<h1 id="the-epochs">The epochs</h1>
<h2 id="chapter-1-importing-data">Chapter 1 - Importing Data</h2>
<p>Data comes in all kinds of shapes and sizes and so the process we use to get everything into code often varies.</p>
<blockquote>
<p>Let&#39;s be real, importing data seems easy, but sometimes... it&#39;s a little pesky.</p>
</blockquote>
<p>The hard part about importing data isn&#39;t the coding or theory, but instead our preparation!
When we first start a new project and download our dataset, it can be tempting to open up a code editor and start typing... but this won&#39;t do us any good.
If we want to get a head start we need to prepare ourselves for the best and worst parts of our data.
To do this we&#39;ll need to start basic, by manually inspecting our spreadsheet/s.
Once we understand the basic format of the data (filetype along with any particularities) we can move onto getting it all into Python.</p>
<p>When we&#39;re lucky and just have one spreadsheet we can use the Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html"><code>read_csv</code></a> function (letting it know where our data lies):</p>
<pre><code class="lang-python">pd.read_csv(<span class="hljs-string">"file_path.csv"</span>)
</code></pre>
<p>In reality, we run into way more complex situations, so look out for:</p>
<ul>
<li>File starts with unneeded information (which we need to skip)</li>
<li>We only want to import a few columns</li>
<li>We want to rename our columns</li>
<li>Data includes dates</li>
<li>We want to combine data from multiple sources into one place</li>
<li>Data can be grouped together</li>
</ul>
<blockquote>
<p>Although we&#39;re discussing a range of scenarios, we normally only deal with a few at a time.</p>
</blockquote>
<p>Our first few problems (importing specific parts of our data/renaming columns) are easy enough to deal with using a few parameters, like the number of rows to skip, the specific columns to import and our column names:</p>
<pre><code class="lang-python">pd.read_csv(<span class="hljs-string">"file_path.csv"</span>, skiprows=<span class="hljs-number">5</span>, usecols=[<span class="hljs-number">0</span>, <span class="hljs-number">1</span>], names=[<span class="hljs-string">"Column1"</span>, <span class="hljs-string">"Column2"</span>])
</code></pre>
<p>Whenever our data is spread across multiple files, we can combine them using Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html"><code>concat</code></a> function.
The <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html"><code>concat</code></a> function combines a list of <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html"><code>DataFrame</code></a>&#39;s together:</p>
<pre><code class="lang-python">my_spreadsheets = [pd.read_csv(<span class="hljs-string">"first_spreadsheet.csv"</span>), pd.read_csv(<span class="hljs-string">"second_spreadsheet.csv"</span>)]
pd.concat(my_spreadsheets, ignore_index=<span class="hljs-keyword">True</span>)
</code></pre>
<p>We pass <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html"><code>concat</code></a> a list of spreadsheets (which we import just like before).
The list can, of course, be attained in any way (so a fancy list comprehension or a casual list of every file both work just as well), but just remember that <strong>we need dataframes, not filenames/paths</strong>!</p>
<p>If we don&#39;t have a CSV file Pandas still works!
We can just <em>swap out</em> <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html"><code>read_csv</code></a> for <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html"><code>read_excel</code></a>, <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html"><code>read_sql</code></a> or another <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">option</a>.</p>
<p>After all the data is inside a Pandas dataframe, we need to double-check that our data is <em>formatted correctly</em>.
In practice, this means checking each series&#39; datatype, and making sure none are generic objects.
We do this to ensure that we can utilize Pandas inbuilt functionality for numeric, categorical and date/time values.
To look at this just run <code>DataFrame.dtypes</code>.
If the output seems reasonable (i.e. numbers are numeric, categories are categorical, etc.), then we should be fine to move on.
However, this is often not the case, and we need to change our datatypes!
This can be done with Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html"><code>DataFrame.astype</code></a>.
If this doesn&#39;t work, there should be another, more specific Pandas function for that particular conversion:</p>
<pre><code class="lang-python">data[<span class="hljs-string">"Rating"</span>] = data[<span class="hljs-string">"Rating"</span>].as_type(<span class="hljs-string">"category"</span>)
data[<span class="hljs-string">"Number"</span>] = pd.to_numeric(data[<span class="hljs-string">"Number"</span>])
data[<span class="hljs-string">"Date"</span>] = pd.to_datetime(data[<span class="hljs-string">"Date"</span>])
data[<span class="hljs-string">"Date"</span>] = pd.to_datetime(data[[<span class="hljs-string">"Year"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Day"</span>, <span class="hljs-string">"Hour"</span>, <span class="hljs-string">"Minute"</span>]])
</code></pre>
<p>If we need to analyse separate groups of data (i.e. maybe our data is divided by country), we can use Pandas <code>groupby</code>.
We can use <code>groupby</code> to select particular data, and to run functions on each group separately:</p>
<pre><code class="lang-python">data.groupby(<span class="hljs-string">"Country"</span>).get_group(<span class="hljs-string">"Australia"</span>)
data.groupby(<span class="hljs-string">"Country"</span>).mean()
</code></pre>
<p><em>Other more niche tricks like multi/hierarchical indices can also be helpful in specific scenarios, however, they are trickier to understand and use.</em></p>
<h2 id="chapter-2-data-cleaning">Chapter 2 - Data Cleaning</h2>
<p>Data is useful, data is necessary, however, it <em>needs to be clean and to the point</em>!
If our data is everywhere, it simply won&#39;t be of any use to our machine learning model.</p>
<blockquote>
<p>Everyone is driven insane by missing data, but there&#39;s always a light at the end of the tunnel.</p>
</blockquote>
<p>The easiest and quickest way to go through data cleaning is to ask ourselves:</p>
<blockquote>
<p>What features within our data will impact our end-goal?</p>
</blockquote>
<p>By end-goal, we mean whatever variable we are working towards predicting, categorising or analysing.
The point of this is to narrow our scope and not get bogged down in useless information.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591933470258/ZETFabnur.jpeg" alt="data analysis.jpg"></p>
<p>Once we know what our primary objective features are, we can try to find patterns, relations, missing data and more.
An easy and intuitive way to do this is graphing!
Quickly use Pandas to sketch out each variable in the dataset, and try to see where everything fits into place.</p>
<p>Once we have identified potential problems, or trends in the data we can try and fix them.
In general, we have the following options:</p>
<ul>
<li>Remove missing entries</li>
<li>Remove full columns of data</li>
<li>Fill in missing data entries</li>
<li>Resample data (i.e. change the resolution)</li>
<li>Gather more information</li>
</ul>
<p>To go from identifying missing data to choosing what to do with it we need to consider how it affects our end-goal.
With missing data we remove anything which doesn&#39;t seem to have a major influence on the end result (i.e. we couldn&#39;t find a meaningful pattern), or where there just seems <em>too much missing to derive value</em>.
Sometimes we also decide to remove very small amounts of missing data (since it&#39;s easier than filling it in).</p>
<p>If we&#39;ve decided to get rid of information, Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html"><code>DataFrame.drop</code></a> can be used.
It removes columns or rows from a dataframe.
It is quite easy to use, but remember that <strong>Pandas does not modify/remove data from the source dataframe by default</strong>, so <code>inplace=True</code> must be specified.
It may be useful to note that the <code>axis</code> parameter specifies whether rows or columns are being removed.</p>
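<p>As a minimal sketch (the column name and row label here are just placeholders):</p>
<pre><code class="lang-python"># Drop a whole column, then drop a row by its index label
data.drop("Unneeded Column", axis=1, inplace=True)
data.drop(0, axis=0, inplace=True)
</code></pre>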
<p>When not removing a full column, or particularly targeting missing data, it can often be useful to rely on a few nifty Pandas functions.
For removing null values, <code>DataFrame.dropna</code> can be utilized.
Do keep in mind though that by default, <code>dropna</code> removes every row containing even a single missing value.
However, setting the parameter <code>how</code> to <code>all</code> (only drop rows where every value is missing) or setting a threshold (<code>thresh</code>, the minimum number of non-null values a row needs to be kept) can compensate for this.</p>
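<p>For instance (the threshold of 3 is just an illustrative value):</p>
<pre><code class="lang-python"># Only drop rows where every single value is missing
data.dropna(how="all", inplace=True)
# Keep only rows with at least 3 non-null values
data.dropna(thresh=3, inplace=True)
</code></pre>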
<p>If we&#39;ve got small amounts of irregular missing values, we can fill them in several ways.
The simplest is <code>DataFrame.fillna</code> which sets the missing values to some preset value.
The more complex, but flexible option is interpolation using <code>DataFrame.interpolate</code>.
Interpolation essentially allows anyone to simply set the <em>method</em> they would like to replace each null value with.
These include the previous/next value, linear and time (the last two infer values from the data).
When working with time series, time is a natural choice; otherwise make a reasonable choice based on how much data is being interpolated and how complex it is.</p>
<pre><code class="lang-python">data[<span class="hljs-string">"Column"</span>].fillna(<span class="hljs-number">0</span>, inplace=<span class="hljs-keyword">True</span>)
data[[<span class="hljs-string">"Column"</span>]] = data[[<span class="hljs-string">"Column"</span>]].interpolate(method=<span class="hljs-string">"linear"</span>)
</code></pre>
<p><em>As seen above, interpolate needs to be passed in a dataframe purely containing the columns with missing data</em> (otherwise an error will be thrown).</p>
<p>Resampling can be useful whenever we see regularly missing data or have multiple sources of data using different timescales (like ensuring measurements in minutes and hours can be combined).
It can be slightly difficult to understand intuitively, but it essentially aggregates (for example, averages) measurements over a certain timeframe.
For example, we can get monthly values by specifying that we want to get the mean of each month&#39;s values:</p>
<pre><code class="lang-python">data.resample(<span class="hljs-string">"M"</span>).mean()
</code></pre>
<p>The <code>&quot;M&quot;</code> stands for month and can be replaced with <code>&quot;Y&quot;</code> for year and other options.</p>
<p>Although the data cleaning process can be quite challenging, if we remember our initial intent, it becomes a far more logical and straightforward task!
If we still don&#39;t have the needed data, we may need to go back to phase one and collect some more.
<em>Note that missing data indicates a problem with data collection, so it&#39;s useful to carefully consider, and note down, where it occurs.</em></p>
<p><em>For completion, the Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html"><code>unique</code></a> and <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html"><code>value_counts</code></a> functions are useful to decide which features to straight-up remove and which to graph/research.</em></p>
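<p>A quick sketch of how those look (reusing the "Country" column from the grouping example earlier):</p>
<pre><code class="lang-python"># The distinct values in a column, then how often each occurs
data["Country"].unique()
data["Country"].value_counts()
</code></pre>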
<h2 id="chapter-3-visualisation">Chapter 3 - Visualisation</h2>
<p>Visualisation sounds simple and it is, but it&#39;s hard to... <em>not overcomplicate</em>.
It&#39;s far too easy for us to consider plots as a chore to create.
Yet, these bad boys do one thing very, very well - intuitively demonstrate the inner workings of our data!
Just remember:</p>
<blockquote>
<p>We graph data to find and explain how everything works.</p>
</blockquote>
<p>Hence, when stuck for ideas, or not quite sure what to do, we can always fall back on the basics: <strong>identifying useful patterns and meaningful relationships</strong>.
It may seem iffy 🥶, but it is really useful.</p>
<blockquote>
<p>Our goal isn&#39;t to draw fancy hexagon plots, but instead to picture what is going on, so <em>absolutely anyone</em> can simply interpret a complex system!</p>
</blockquote>
<p>A few techniques are undeniably useful:</p>
<ul>
<li>Resampling when we <em>have too much data</em></li>
<li>Secondary axis when plots have different scales</li>
<li>Grouping when our data can be split categorically</li>
</ul>
<p>To get started graphing, simply use Pandas <code>.plot()</code> on any series or dataframe!
When we need more, we can delve into MatPlotLib, Seaborn or an interactive plotting library.</p>
<pre><code><span class="hljs-keyword">data</span>.plot(x=<span class="hljs-string">"column 1 name"</span>, y=<span class="hljs-string">"column 2 name"</span>, kind=<span class="hljs-string">"bar"</span>, figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">10</span>))
<span class="hljs-keyword">data</span>.plot(x=<span class="hljs-string">"column 1 name"</span>, y=<span class="hljs-string">"column 3 name"</span>, secondary_y=True)
<span class="hljs-keyword">data</span>.hist()
<span class="hljs-keyword">data</span>.groupby(<span class="hljs-string">"group"</span>).boxplot()
</code></pre><p>90% of the time, this basic functionality will suffice (<a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#plot-formatting">more info here</a>), and where it doesn&#39;t a search should reveal how to <em>draw particularly exotic graphs</em> 😏.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591932777077/oKrkIQYS9.jpeg" alt="dashboard.jpg"></p>
<h2 id="chapter-4-modelling">Chapter 4 - Modelling</h2>
<h3 id="a-brief-overview">A Brief Overview</h3>
<p>Now finally for the fun stuff - deriving results.
It seems <em>so simple to train a scikit learn model, but no one goes into the details</em>!
So, let&#39;s be honest here: not all datasets, nor all models, are equal.</p>
<p>Our approach to modelling will vary widely based on our data.
There are three especially important factors:</p>
<ul>
<li><strong>Type</strong> of problem</li>
<li><strong>Amount</strong> of data</li>
<li><strong>Complexity</strong> of data</li>
</ul>
<p>Our type of problem comes down to whether we are trying to predict a class/label (called <em>classification</em>), a value (called <em>regression</em>), or to group data (called <em>clustering</em>).
If we are trying to train a model on a dataset where we already have examples of what we&#39;re trying to predict we call our model <em>supervised</em>, if not, <em>unsupervised</em>.
The amount of available data, and how complex it is foreshadows how simple a model will suffice.
<em>Data with more features (i.e. columns) tends to be more complex</em>.</p>
<blockquote>
<p>The point of interpreting complexity is to understand which models are <em>too good or too bad for our data</em></p>
</blockquote>
<p>A model&#39;s <em>goodness of fit</em> informs us of this!
If a model struggles to interpret our data (too simple) we can say it <em>underfits</em>, and if it is completely overkill (too complex) we say it <em>overfits</em>.
We can think of it as a spectrum from learning nothing to memorising everything.
We need to <em>strike balance</em>, to ensure our model is <strong>able to <em>generalise</em> our conclusions</strong> to new information.
This is typically known as the bias-variance tradeoff.
<em>Note that complexity also affects model interpretability.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591931791416/qtb6eievP.png" alt="Goodness of Fit.png"></p>
<p><strong>Complex models take substantially more time to train</strong>, especially with large datasets.
So, upgrade that computer, run the model overnight, and chill for a while 😁!</p>
<h3 id="preparation">Preparation</h3>
<h4 id="splitting-up-data">Splitting up data</h4>
<p>Before training a model it is important to note that we will need some dataset to test it on (so we know how well it performs).
Hence, we often divide our dataset into <strong>separate training and testing sets</strong>.
This allows us to test <em>how well our model can generalise to new unseen data</em>.
This normally works because we know our data is decently representative of the real world.</p>
<p>The actual amount of test data doesn&#39;t matter too much, but 80% train and 20% test is often used.</p>
<p>In Python with Scikit learn the <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html"><code>train_test_split</code></a> function does this:</p>
<pre><code class="lang-python">train_data, test_data = train_test_split(data)
</code></pre>
<p>Cross-validation is where a dataset is split into several folds (i.e. subsets or portions of the original dataset).
This tends to be more robust and <em>resistant to overfitting</em> than using a single test/validation set!
Several Sklearn functions help with <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/cross_validation.html">cross-validation</a>, however, it&#39;s normally done straight through a grid or random search (discussed below).</p>
<pre><code class="lang-python">cross_val_score(model, input_data, output_data, cv=<span class="hljs-number">5</span>)
</code></pre>
<h4 id="hyperparameter-tuning">Hyperparameter tuning</h4>
<p>There are some factors our model cannot account for, and so we <em>set certain hyperparameters</em>.
These vary model to model, but we can either find optimal values through manual trial and error or a simple algorithm like grid or random search.
With grid search, we try all possible values (brute force 😇) and with random search random values from within some distribution/selection.
Both approaches typically use cross-validation.</p>
<p><a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">Grid search</a> in Sklearn works through a <code>parameters</code> dictionary.
Each entry&#39;s key represents the hyperparameter to tune, and the value (a list or tuple) is the selection of values to choose from:</p>
<pre><code class="lang-python">parameters = {<span class="hljs-string">'kernel'</span>:(<span class="hljs-string">'linear'</span>, <span class="hljs-string">'rbf'</span>), <span class="hljs-string">'C'</span>:[<span class="hljs-number">1</span>, <span class="hljs-number">10</span>]}
model = = SVC()
grid = GridSearchCV(model, param_grid=parameters)
</code></pre>
<p>After we&#39;ve created the grid, we can use it to train the models, and extract the scores:</p>
<pre><code class="lang-python">grid.fit(train_input, train_output)
best_score, best_params = grid.best_score_, grid.best_params_
</code></pre>
<p>The important thing here is to remember that we need to <strong>train on the training and not testing data</strong>.
Even though cross-validation is used to test the models, we&#39;re ultimately trying to get the best fit on the training data and will proceed to test each model on the testing set afterwards:</p>
<pre><code><span class="hljs-built_in">test</span>_predictions = grid.predict(<span class="hljs-built_in">test</span>_input)
</code></pre><p><a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html">Random search</a> in Sklearn works similarly but is slightly more complex, as we need to know what type of distribution each hyperparameter takes in.
Although it, in theory, <em>can yield the same or better results faster</em>, that changes from situation to situation.
<em>For simplicity it is likely best to stick to a grid search.</em></p>
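<p>For reference, a minimal random search sketch mirroring the grid search above (the distribution for <code>C</code> and the iteration count are illustrative choices):</p>
<pre><code class="lang-python">from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# C is sampled from a continuous range rather than a fixed grid
distributions = {"kernel": ["linear", "rbf"], "C": uniform(1, 10)}
search = RandomizedSearchCV(SVC(), param_distributions=distributions, n_iter=10)
search.fit(train_input, train_output)
</code></pre>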
<h3 id="model-choices">Model Choices</h3>
<h4 id="using-a-model">Using a model</h4>
<p>With Sklearn, it&#39;s as simple as finding our desired model name and then just creating a variable for it.
Check the links to the documentation for further details!
For example:</p>
<pre><code class="lang-python">support_vector_regressor = SVR()
</code></pre>
<h4 id="basic-choices">Basic Choices</h4>
<h5 id="linear-logistic-regression">Linear/Logistic Regression</h5>
<p><a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model">Linear regression</a> is trying to <em>fit a straight line</em> to our data.
It is the most basic and fundamental model.
There are several variants of linear regression, like lasso and ridge regression (which are regularisation methods to prevent overfitting).
Polynomial regression can be used to fit curves of higher degrees (like parabolas and other curves).
Logistic regression is another variant which can be used for classification.</p>
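<p>A quick sketch of what these look like in Sklearn (assuming the training inputs/outputs from the earlier split; the <code>alpha</code> value is illustrative):</p>
<pre><code class="lang-python">from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge

model = LinearRegression()
regularised = Ridge(alpha=1.0)  # or Lasso(alpha=1.0) for the other regularised variant
classifier = LogisticRegression()  # the classification variant
model.fit(train_input, train_output)
</code></pre>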
<h5 id="support-vector-machines">Support Vector Machines</h5>
<p>Just like with linear/logistic regression, <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/svm.html">support vector machines (SVM&#39;s)</a> try to fit a line or curve to data points.
However, with SVM&#39;s the aim is to maximise the distance between a boundary and each point (instead of getting the line/curve to go through each point).</p>
<p>The main advantage of support vector machines is their ability to <em>use different kernels</em>.
A kernel is a function which calculates similarity.
These kernels allow for both linear and non-linear data, whilst staying decently efficient.
The kernels map the input into a higher dimensional space so a boundary becomes present.
This process is typically not feasible for large numbers of features.
A neural network or another model will then likely be a better choice!</p>
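<p>A minimal sketch, again assuming the earlier train/test split:</p>
<pre><code class="lang-python">from sklearn.svm import SVC, SVR

# The kernel determines how similarity between points is measured
classifier = SVC(kernel="rbf")
regressor = SVR(kernel="linear")
classifier.fit(train_input, train_output)
</code></pre>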
<h5 id="neural-networks">Neural Networks</h5>
<p>All the buzz is always about deep learning and <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/neural_networks_supervised.html">neural networks</a>.
They are complex, slow and resource-intensive models which can be used for complex data.
Yet, they are extremely useful when encountering large unstructured datasets.</p>
<p>When using a neural net, make sure to watch out for overfitting.
An easy way is through tracking changes in error with time (known as learning curves).</p>
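<p>As a rough sketch of this in Sklearn (the layer sizes are illustrative; <code>loss_curve_</code> records the training loss per iteration, which we can plot as a crude learning curve):</p>
<pre><code class="lang-python">import pandas as pd
from sklearn.neural_network import MLPClassifier

# A small feedforward network
network = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500)
network.fit(train_input, train_output)
pd.Series(network.loss_curve_).plot()  # watch for the loss flattening out
</code></pre>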
<p>Deep learning is an extremely rich field, so there is far too much to discuss here.
In fact, Scikit learn is a machine learning library, with little deep learning ability (compared to <a target='_blank' rel='noopener noreferrer'  href="https://pytorch.org/">PyTorch</a> or <a target='_blank' rel='noopener noreferrer'  href="https://www.tensorflow.org/">TensorFlow</a>).</p>
<h5 id="decision-trees">Decision Trees</h5>
<p><a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/tree.html">Decision trees</a> are simple and quick ways to model relationships.
They are basically a <em>tree of decisions</em> which helps to decide on what class or label a datapoint belongs to.
Decision trees can be used for regression problems too.
Although simple, to avoid overfitting, several hyperparameters must be chosen.
These all, in general, relate to how deep the tree is and how many decisions are to be made.</p>
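<p>A minimal sketch (the depth and split values are illustrative):</p>
<pre><code class="lang-python">from sklearn.tree import DecisionTreeClassifier

# max_depth and min_samples_split limit how many decisions the tree can make
tree = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
tree.fit(train_input, train_output)
</code></pre>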
<h5 id="k-means">K-Means</h5>
<p>We can group unlabeled data into several <em>clusters</em> using <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/clustering.html#k-means">k-means</a>.
Normally the number of clusters present is a chosen hyperparameter.</p>
<p>K-means works by trying to optimize (reduce) some criterion (i.e. function) called inertia.
It can be thought of like trying to minimize the distance from a set of <em>centroids</em> to each data point.</p>
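<p>A minimal sketch (the cluster count is an illustrative hyperparameter choice):</p>
<pre><code class="lang-python">from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=3)
clusters.fit(input_data)
clusters.labels_  # which cluster each data point was assigned to
</code></pre>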
<h4 id="ensembles">Ensembles</h4>
<h5 id="random-forests">Random Forests</h5>
<p>Random forests are combinations of multiple decision trees trained on random subsets of the data (bootstrapping).
This process is called bagging and allows random forests to obtain a good fit (low bias and low variance) with complex data.</p>
<p>The rationale behind this can be likened to democracy.</p>
<blockquote>
<p>One voter may vote for a bad candidate, but we&#39;d hope that the majority of voters make informed, positive decisions</p>
</blockquote>
<p>For <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">regression</a> problems, we average each decision tree&#39;s outputs, and for <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">classification</a>, we choose the most popular one.
This <em>might not always work, but we generally assume it will</em> (especially with large datasets with multiple columns).</p>
<p>Another advantage of random forests is that insignificant features shouldn&#39;t negatively impact performance, because of the democratic-esque bootstrapping process!</p>
<p>Hyperparameter choices are the same as those for decision trees but with the number of decision trees as well.
For the aforementioned reasons, more trees equal less overfitting!</p>
<p><em>Note that random forests sample random subsets of both rows and columns, with replacement!</em></p>
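<p>A minimal sketch (the hyperparameter values are illustrative):</p>
<pre><code class="lang-python">from sklearn.ensemble import RandomForestRegressor

# n_estimators is the number of decision trees in the forest
forest = RandomForestRegressor(n_estimators=100, max_depth=5)
forest.fit(train_input, train_output)
</code></pre>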
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591933623422/ih9FpjpMj.jpeg" alt="forest.jpg"></p>
<h5 id="gradient-boosting">Gradient Boosting</h5>
<p>Ensemble models like AdaBoost or <a target='_blank' rel='noopener noreferrer'  href="https://xgboost.readthedocs.io/">XGBoost</a> work by stacking one model on top of another.
The assumption here is that each successive weak learner will correct for the flaws of the previous one (hence called boosting).
Hence, the combination of models should provide the advantages of each model without its potential pitfalls.</p>
<p>The iterative approach means each previous model&#39;s performance affects the current one, and better models are given a higher priority.
Boosted models perform slightly better than bagging models (a.k.a random forests), but are also slightly more likely to overfit.
Sklearn provides AdaBoost for <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html">classification</a> and <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html">regression</a>.</p>
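<p>A minimal sketch (again with an illustrative number of weak learners):</p>
<pre><code class="lang-python">from sklearn.ensemble import AdaBoostRegressor

# Each successive weak learner tries to correct the previous one's mistakes
boosted = AdaBoostRegressor(n_estimators=50)
boosted.fit(train_input, train_output)
</code></pre>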
<h2 id="chapter-5-production">Chapter 5 - Production</h2>
<p>This is the last but potentially most important part of the process 🧐.
We&#39;ve put in all this work, and so we need to go the distance and <strong>create something impressive</strong>!</p>
<p>There are a variety of options.
<a target='_blank' rel='noopener noreferrer'  href="https://www.streamlit.io/">Streamlit</a> is an exciting option for data-oriented websites, and tools like Kotlin, Swift and Dart can be used for Android/iOS development.
JavaScript with frameworks like VueJS can also be used for extra flexibility.</p>
<p><em>After trying most of these I honestly would recommend sticking to <a target='_blank' rel='noopener noreferrer'  href="https://www.streamlit.io/">Streamlit</a>, since it is so much easier than the others!</em></p>
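<p>To give a feel for why, here is a sketch of a tiny Streamlit app (the title and CSV path are hypothetical; save it as a script and launch it with <code>streamlit run app.py</code>):</p>
<pre><code class="lang-python">import pandas as pd
import streamlit as st

data = pd.read_csv("file_path.csv")  # placeholder path, as in the earlier examples
st.title("Energy Demand Explorer")  # hypothetical app title
st.line_chart(data)  # an interactive chart straight from a dataframe
</code></pre>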
<p>Here it is important to start with a vision (the simpler the better) and try to find out which parts are most important.
Then try and specifically work on those.
Continue till completion!
For websites, a hosting service like <a target='_blank' rel='noopener noreferrer'  href="https://www.heroku.com/">Heroku</a> will be needed, so the rest of the world can see the amazing end-product of all our hard work 🤯😱.</p>
<p>Even if none of the above options suit the scenario, a report or article covering what we&#39;ve done, what we&#39;ve learnt and any suggestions/lessons learnt, along with a well documented GitHub repository, are indispensable!
<em>Make sure that readme file is up to date.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591933241648/LMVyc0w61.jpeg" alt="presentation.jpg"></p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>I really hope this article has helped you out!
For updates <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_">follow me on Twitter</a>.</p>
<p>If you enjoyed this, you may also like <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">The Complete Coding Practitioners Handbook</a> which goes through each and every practical coding tool you&#39;ll need to know.
If you&#39;re lost considering what project to take on, consider checking out my zero-to-hero guides on <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-nlp-project-edition-ck6zsqtbo05srdfs135o8blcf">choosing a project</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">collecting your own dataset through webscraping</a>.</p>
<p><em>Photos by <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/pCqzMe04s8g">National Cancer Institute</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/JKUTrJ4vK00">Dane Deaner</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/z3cMjI6kP_I">ThisisEngineering RAEng</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/D4LDw5eXhgg">Adam Nowakowski</a> and <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/3ap0EoGXXGk">Guilherme Caetano</a> on Unsplash.</em>
The goodness of fit graph is a modified version of the <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html">Sklearn documentation</a></p>
]]></content:encoded></item><item><title><![CDATA[The Complete Coding Practitioners Handbook]]></title><description><![CDATA[Git, debugging, testing, the terminal, Linux, the cloud, networking,  patterns/antipatterns - what even is this mess?
Don't worry we'll go through from beginning to end (all the way, I promise) everything you need to know to collaborate proficiency w...]]></description><link>https://www.kamwithk.com/the-complete-coding-practitioners-handbook</link><guid isPermaLink="true">https://www.kamwithk.com/the-complete-coding-practitioners-handbook</guid><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[programmer]]></category><category><![CDATA[learning]]></category><category><![CDATA[code]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Tue, 05 May 2020 15:11:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1588690873051/OVgGvewK7.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Git, debugging, testing, the terminal, Linux, the cloud, networking,  patterns/antipatterns - what even is this mess?
Don&#39;t worry, we&#39;ll go through from beginning to end (all the way, I promise) everything you need to know to collaborate proficiently with others.</p>
<h1 id="why-so-many-tools-">Why so many tools?</h1>
<p>We&#39;re flooded with tools which are all touted as <em>essential to boost productivity</em>, but... why so many of them?
To answer this let&#39;s start at the very beginning and slowly work our way through our coding journey!
We all started on a small solo project working to build an app, create a simple model, or just to finish an assignment.
As we begin to code we notice that it just... doesn&#39;t run 😢 and so we sigh, take a deep breath in and begin to look for what went wrong.
The first bug is just a small innocent typo, but with time we start running into more and more silly pesky bugs 🐞, each one a slight bit harder to deal with than the last!
Once we read our code, find the typo and fix it (a little golden <em>debugging</em>) our coding journey continues, and we work on creating something slightly more impressive.</p>
<p>We soon get to a crossroad, we finish working on our small little program and want to work on something slightly more ambitious (yay)!
Although we&#39;re ambitious, we notice one small thing - we make a good few mistakes.
Like any good student, we get a few books, read a few articles, watch a few videos, and before long we&#39;ve learned several <em>design patterns</em> which make for a nice, smooth coding experience and <em>antipatterns</em>... to avoid like the plague.</p>
<p>Now with a few sophisticated patterns/antipatterns in mind, we feel like we&#39;re ready to show the world our coding prowess!
We start naive and nervous but with passion, and so through gathering a few friends together, we begin a new chapter of our lives 😅.
The work is fun and everyone wants to play their part, but soon one question arises - <strong>how can we work together</strong>?
At first, emailing/messaging code from one person to another works fine... but then a few more people pitch in, and combining every line of code becomes - unmanageable!
In a moment of chaos, one man did the impossible though, Linus Torvalds extended his olive branch and gave us Git - the perfect system to collaborate with others.</p>
<p>Eventually, we approach another challenge, although we&#39;re writing the code just fine... we feel bogged down by our workflow.
To our surprise, there&#39;s an easy and elegant solution - Linux and the terminal.
Linus Torvalds proposes Linux as an <em>alternative to Windows</em> (the ugly behemoth) and with it a terminal to write code in a fashion which completely <em>bash</em>&#39;s Windows.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1588752185776/4r4lV7IeZ.jpeg" alt="programmer.jpg"></p>
<p>Now with our workflow smoothened out, there are just a few questions left - how can we run this code anywhere and what if we need... more?
Luckily for us, the dot com boom unfolds and the internet is ablaze!
What we once had to run <em>on our machines, can now be run on the cloud</em> (other people&#39;s/companies&#39; servers).
Now we can run and distribute progressively larger (and more heavyweight) code right from the comfort of our houses!</p>
<h1 id="the-epochs">The epochs</h1>
<h2 id="chapter-1-debugging">Chapter 1 - Debugging</h2>
<p>Our code is bound to have problems... even if we&#39;re geniuses, they&#39;ll still crop up!
We can&#39;t <em>completely</em> avoid them, but we can approach each problem in just the right way, so we&#39;re able to smoothly eliminate it.
There&#39;s a simple technique to help with this:</p>
<ol>
<li>SIMPLIFY - Keep it simple stupid, the simpler it is the easier it is to find the problem!</li>
<li>EXPLORE - It&#39;s fine when we don&#39;t know what&#39;s wrong, relax and start exploring, use a few print statements, read a few errors and try to figure things out 😌</li>
<li>ISOLATE - Try to find where your code goes south (focused effort reveals bugs quickest)</li>
</ol>
<p>Now I know it&#39;s <em>easier said than done</em>, but <strong>just try this out</strong>... it makes a big difference!
Just remember to keep calm, take a deep breath 🫁 and continue, if it&#39;s a bug you&#39;ll find and destroy it with time and effort 😌!</p>
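<p>As a tiny sketch of EXPLORE/ISOLATE in practice (the function and its usage are hypothetical):</p>
<pre><code class="lang-python">def average(numbers):
    total = sum(numbers)
    # EXPLORE: peek at intermediate values to see where things go south
    print(f"total={total}, count={len(numbers)}")
    # ISOLATE: uncomment to pause here in Python's built-in debugger
    # breakpoint()
    return total / len(numbers)
</code></pre>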
<h2 id="chapter-2-testing">Chapter 2 - Testing</h2>
<p>Our code works... or does it?
Testing is all about finding whether something which <em>seems to work fine</em> actually <em>works fine</em>.
It&#39;s about finding whether your changes break how things work (likely in a subtle way).</p>
<p>Testing can be simple, or complex.
At its simplest, it&#39;s about looking at what we think our code does and double-checking just that, in a more complex light it&#39;s about writing <strong>small pieces of code (unit or integration tests) to test the code</strong> (yes, code to test the code).
Unit tests are for small isolated tests/scenarios and integration tests for larger/more realistic ones.
Although this sounds simple (so far), testing is extremely nuanced as the way we write code has an extremely large impact on our ability to test it (hence knowledge of patterns/antipatterns may be useful)!</p>
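<p>A minimal sketch of a unit test (the function is hypothetical; a runner like <a target='_blank' rel='noopener noreferrer' href="https://docs.pytest.org/">pytest</a> picks up functions named <code>test_*</code> automatically):</p>
<pre><code class="lang-python">def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

def test_celsius_to_fahrenheit():
    # One small, isolated check of one behaviour
    assert celsius_to_fahrenheit(0) == 32
    assert celsius_to_fahrenheit(100) == 212
</code></pre>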
<p><em>There&#39;s a lot to testing and I&#39;m not an expert, but I hope that this is enough to get you going/give you some sense of direction...</em></p>
<h2 id="chapter-3-design-patterns-antipatterns">Chapter 3 - Design Patterns/Antipatterns</h2>
<p>Patterns and antipatterns are just good and bad coding practices we should try and use more/less respectively.
Although at their heart <strong>design patterns/antipatterns are simple, they tend to be sorely overcomplicated</strong>!
In essence, we see good and bad code all the time, so learning these comes naturally, however lots of books/articles go into fine detail by naming and shaming.</p>
<p>All <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/Software_design_pattern#Classification_and_list">design patterns</a> have three basic purposes, to help <em>create</em>, organise (<em>structural</em>) or communicate (<em>behavioural</em>) between classes and objects.</p>
<p>A few examples:</p>
<ul>
<li>Singleton - creating classes which are only initialised (used) once</li>
<li>Strategy - when we abstract (group) multiple algorithms (or models) into one class so they can easily be swapped out (see the sketch after this list)</li>
<li>Observer - when multiple objects need to know about when an event is triggered we can distinguish between observers and callers</li>
</ul>
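<p>As a tiny illustration of the strategy pattern (a hypothetical wrapper, assuming any model with Sklearn-style <code>fit</code>/<code>predict</code> methods):</p>
<pre><code class="lang-python">from sklearn.svm import SVR

class ModelStrategy:
    """Groups interchangeable models behind one interface."""
    def __init__(self, model):
        self.model = model

    def train(self, inputs, outputs):
        self.model.fit(inputs, outputs)

    def predict(self, inputs):
        return self.model.predict(inputs)

# Swapping the underlying algorithm is now a one-line change
strategy = ModelStrategy(SVR())
</code></pre>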
<p>Since <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/Anti-pattern#Software_design">antipatterns</a> are just mistakes, there are a good few that exist:</p>
<ul>
<li>Analysis paralysis - when we&#39;re stuck planning and never start coding</li>
<li>Cargo cult programming - when we use code without understanding it</li>
<li>Rule of credibility - the last 10% of our work takes 90% of our time</li>
<li>Big ball of mud - when all our code is in one large clump</li>
<li>Spaghetti code - where our code isn&#39;t cleanly separated</li>
<li>Poltergeist - creating excess classes/code for no reason</li>
<li>Repeated logic/redundant code - can just use classes/functions when code is used in multiple places</li>
<li>Ambiguous naming of variables and functions - names should be short but still express meaning</li>
<li>Magic strings - fixed values with an unknown purpose</li>
</ul>
<p><em>Note it&#39;s more practical to pick these all up through carefully inspecting code (especially off Stack Overflow)!</em></p>
<h2 id="chapter-4-git">Chapter 4 - Git</h2>
<p>Git is the collaboration one-stop-shop!
It is elegant and beautiful once we learn to use it... but seemingly not before that 😧.
Don&#39;t worry though, it&#39;s quite simple, Git works through tracking what changes we make (hence it&#39;s called <em>version control</em>), and it does this by breaking up our timeline into chunks that we&#39;ve <strong>committed to using</strong> (commits).</p>
<p>We may now ask though - how does this help to combine our changes?
Luckily for us, it&#39;s not too difficult to interpret: Git stores our work in <em>repositories</em> which can be shared and <em>forked/cloned</em>.
Whenever we make changes we can <em>commit</em> these and then <em>push</em> them out to our online repositories (technically called remote repositories).
Then once we&#39;re ready to share our brilliant code we can <strong><em>pull</em> others over to see/confirm what we&#39;ve done</strong> (with a pull request)!
Although this all just sounds weirdly social right now, it gets useful when Git provides us with overviews of our changes, so we&#39;re certain that our team&#39;s outstanding work won&#39;t collide/conflict with our work.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1588752091542/GpiA9RXHz.jpeg" alt="git.jpg"></p>
<p>Now there are a few more technical ways we can use Git, primarily through segmenting work/progress into <em>branches</em> and providing special ways to <em>combine our changes</em>.
<em>Branches</em> allow us to highlight particular parts of our codebase which we&#39;d like to share, whilst also allowing us to isolate certain <em>features</em> which may be unstable/not quite ready yet!
The first way to combine branches is to <strong><em>merge</em> changes</strong> by adding the changes made into a new commit.
The second is to <em>replay one branch&#39;s changes on another</em> (which we call a <em>rebase</em>).
Which one we use depends on our situation:</p>
<ul>
<li>When we try to make our commit history as simple as possible, a rebase is an amazing and flexible option</li>
<li>If we need to remove, modify, combine or change the order of commits, to keep a simple and clean history, only a rebase will suffice</li>
<li>However, just like time travel, <strong>a rebase is dangerous whenever we do it on anything others are using</strong><ul>
<li>In practice <strong>only rebase non-published/unused code</strong> (this is often referred to as the <em>golden rule</em>)</li>
</ul>
</li>
</ul>
<p>Now that we&#39;ve discussed the difficult concepts, let us take a look at the terminal (explained further below) commands we can use:</p>
<p>To clone a repository</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> git_website_url
</code></pre>
<p>To add a file/folder to be tracked in the next commit (stores changes at the time the command&#39;s run)</p>
<pre><code class="lang-bash">git add new_file_or_folder_location
</code></pre>
<p>To commit</p>
<pre><code class="lang-bash">git commit -m <span class="hljs-string">"added amazing new features"</span>
</code></pre>
<p>To change branches</p>
<pre><code class="lang-bash">git checkout my_branch
</code></pre>
<p>To create and switch to a new branch</p>
<pre><code class="lang-bash">git checkout -b my_new_branch
</code></pre>
<p>To push</p>
<pre><code class="lang-bash">git push
</code></pre>
<p>To merge branches</p>
<pre><code class="lang-bash">git merge my_feature_branch
</code></pre>
<p>To rebase a branch (n is the number of commits to consider)</p>
<pre><code class="lang-bash">git rebase -i HEAD~n
</code></pre>
<p>To add an upstream branch</p>
<pre><code class="lang-bash">git remote add upstream original_repo_url
</code></pre>
<p>To sync a local repository (to its remote) </p>
<pre><code class="lang-bash">git fetch upstream
git merge upstream/master
</code></pre>
<p>A few mistakes to avoid:</p>
<ul>
<li>The URL to a Git repository doesn&#39;t include any specific file/folder</li>
<li>We fork repositories to keep an isolated version to work with ourselves before we&#39;re ready to pull together our work (so our changes don&#39;t affect each other in the middle of things)<ul>
<li>So the URL to enter when <em>cloning</em> a repo to work with is <em>your forked version</em>, and then the original repository&#39;s main branch becomes the forked repository&#39;s <em>upstream branch</em> (as it&#39;s likely newer)</li>
<li>Be <strong>careful when copy-pasting their URLs as they&#39;re quite easy to mix the wrong way round</strong></li>
<li>Note the <strong>upstream branch only has to be set once</strong></li>
</ul>
</li>
<li>Pull requests happen through an online UI (i.e. the GitHub website) not the terminal (normally)</li>
<li>Once we start an interactive rebase, <strong>carefully read the provided options</strong></li>
</ul>
<p><em><a target='_blank' rel='noopener noreferrer'  href="https://www.atlassian.com/git/tutorials/">Atlassian documentation</a> provides further details and examples of how to use Git.</em></p>
<h2 id="chapter-5-linux-and-the-terminal">Chapter 5 - Linux and the Terminal</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1588752126703/m6Oates_2.jpeg" alt="mobo.jpg">
As explained above, Linux is an amazing replacement for Windows (it&#39;s free by the way) which is far more flexible and lightweight!
One distinct feature is the inbuilt powerful terminal (called bash) which allows us to perform complex tasks easily.</p>
<p>Here are the essential commands:</p>
<p>List files</p>
<pre><code class="lang-bash">ls
ls my_folder
</code></pre>
<p>Check current location (i.e. current folder/directory)</p>
<pre><code class="lang-bash"><span class="hljs-built_in">pwd</span>
</code></pre>
<p>Change directory (into another folder)</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> folder_path
</code></pre>
<p>Move a file/folder</p>
<pre><code class="lang-bash">mv old_location new_location
</code></pre>
<p>Copy a file</p>
<pre><code class="lang-bash">cp file_location copy_location
</code></pre>
<p>Copy a folder</p>
<pre><code class="lang-bash">cp -r folder_location copy_location
</code></pre>
<p>Run another program (like a text editor, normally vi, vim or nano)</p>
<pre><code class="lang-bash">program_location
</code></pre>
<p>Although these commands don&#39;t seem like anything out of the ordinary, the terminal provides a solid way to do a wide variety of tasks!</p>
<p><em>Note if you ever enter a text editor you can&#39;t seem to close (likely vi/a variant of vi) hit escape and then :q!</em></p>
<h1 id="going-further">Going further</h1>
<p>For more information, <a target='_blank' rel='noopener noreferrer'  href="https://missing.csail.mit.edu/">The Missing Semester of Your CS Education</a> is a useful guide.
Thanks for reading and I really hope that this has helped you out!</p>
<p><em><a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/w7ZyuGYNpRQ">Photo by Kevin Ku on Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Zero to Hero... Data Collection through Web Scraping]]></title><description><![CDATA[Why?
Machine learning is cool, but we can't really do much without data.
So let's kick off our journey the right way through web scraping!
Now I'm going to preface this post by stating there are two options:

Popular and easy to manage data
Unique an...]]></description><link>https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping</link><guid isPermaLink="true">https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping</guid><category><![CDATA[Data Science]]></category><category><![CDATA[web scraping]]></category><category><![CDATA[nlp]]></category><category><![CDATA[projects]]></category><category><![CDATA[side project]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Sun, 01 Mar 2020 06:40:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1583044805187/wJ-8-AzXl.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>Machine learning is cool, but we can&#39;t really do much without data.
So let&#39;s kick off our journey the right way through web scraping!</p>
<p>Now I&#39;m going to preface this post by stating there are two options:</p>
<ul>
<li>Popular and easy to manage data</li>
<li>Unique and niche data</li>
</ul>
<p>For this mini-series we&#39;re aiming to create a <strong>unique, exciting and impressive</strong> project, so we are skipping over the overused and cliche projects!
Instead, we&#39;ve chosen to create a video game recommendation system.</p>
<h1 id="how-we-got-here">How we got here</h1>
<p>Before diving into code and pulling logic apart I feel like it&#39;s important to give a brief overview of how we can collect data:</p>
<ul>
<li>Decide on what data we&#39;re looking for (game titles, summaries, and reviews)</li>
<li>Research and find potential data sources (Wikipedia and Metacritic)</li>
<li>Scrape and export final input data</li>
</ul>
<h1 id="primary-precedence">Primary precedence</h1>
<p>Let me start with a little insight on this project: I&#39;m actually pairing up with a friend who has just started their machine learning journey.
Now I understand how it feels to work with a minimal amount of knowledge, as I started my first projects this way!</p>
<p>I&#39;ve never mentioned this before, but I started programming as a naive kid.
I always wanted to program, and I thought I could when I was ~13.
So I joined a game-creating competition where I completely bombed out!
Know why?
I thought I knew it all, I thought tutorials would explain all the details for any project I chose, and that they&#39;d explain how everything should come together magically.
Thus I was completely unprepared for the challenge ahead of me!</p>
<p>My friend is at the same stage I was back then, and it&#39;s reflected by his order of precedence (what he considered most to least important).
Now let me explain why this is important with an anecdote.</p>
<p>My friend just started working on his first task, to find box art for all our games.
This seemed easy to him, but alas, <strong>it was quite deceptive</strong>!
You see Metacritic and Wikipedia provide lists of games with small icons displayed next to each title, but they&#39;re <em>just <strong>small thumbnail images</strong></em>.
This detail was easy to miss, so he breezed over it and tried to collect images without a second thought (a costly mistake).</p>
<p>Our mistakes are similar to wandering through a jungle without reading a map!
We may hike halfway through a jungle, but only after thoroughly examining a map can we possibly hope to find which direction we need to travel in!
These wild bets which have extremely low chances of paying off (i.e. getting us to our desired location) are what I call naive assumptions!</p>
<h1 id="the-challenge">The challenge</h1>
<p>I hear people asking how naive assumptions relate to <em>our amazing project</em>.</p>
<blockquote>
<p>Data collection is when you start writing code for your project, and so it&#39;s when you&#39;re most vulnerable to the naive tendency</p>
</blockquote>
<p>As your guide I need to explain how we can avoid naive assumptions:</p>
<ol>
<li>Start with finding context about the problem you&#39;re solving</li>
<li>Proceed to decompose your problem into smaller pieces</li>
<li>Strategise by planning how to overcome the problem</li>
<li>Fill in the blanks</li>
</ol>
<p>The main takeaway here is to always research the mechanisms at play <strong>before working</strong>.
For us, this means <strong>closely inspecting <em>where</em> your data will come from and <em>how</em> it will be used</strong>.
Don&#39;t worry if you&#39;re naturally not thinking this way (yet), as it&#39;ll come with time and practice!</p>
<h1 id="writing-the-code">Writing the code</h1>
<p>Finally some code!</p>
<p>Let&#39;s start off by deciding our output file format and location.
Since we&#39;re using Scrapy it&#39;s relatively easy!
Just set the <code>FEED_FORMAT</code> (file format) and <code>FEED_URI</code> (file path) settings and we&#39;re good to go.</p>
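<p>As a rough sketch, one place these settings can live is on the spider itself via <code>custom_settings</code> (the spider name and output path below are just placeholders):</p>
<pre><code class="lang-python">import scrapy

class GamesSpider(scrapy.Spider):
    name = "games"  # placeholder spider name
    start_urls = ["https://en.wikipedia.org/wiki/List_of_PC_games"]

    # Export scraped items as JSON to a local file
    custom_settings = {
        "FEED_FORMAT": "json",
        "FEED_URI": "games.json",
    }
</code></pre>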
<p>Our first task will be to collect a list of PC video game titles and summaries.
For simplicity, we&#39;re going to use our primary data source Wikipedia!
Although it&#39;s possible to easily download all of Wikipedia, we only really need articles on PC games, so we&#39;re going to find the URL for each article.
There&#39;s a <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/List_of_PC_games">pre-created list on Wikipedia</a>, which we&#39;ll be scraping for titles and URLs.</p>
<p>We start by isolating the parts of the page with relevant data.
To do this we can use CSS or XPath selectors.
The first step to finding the right selector is understanding a page&#39;s HTML structure.
So we can take a look at our browser&#39;s inspector, which provides a small interactive view of the HTML code (in Firefox press <code>ctrl+shift+i</code>).
To easily find an HTML element on the page use the select element feature (press <code>ctrl+shift+c</code> in Firefox).</p>
<p>After playing around it&#39;s apparent that all our game titles are <code>a</code> elements.
We can try and use a CSS selector of <code>a</code>, however, you&#39;ll notice that its output is composed of more than just game titles.
We can become more specific, <code>i &gt; a</code>, and we&#39;ll get slightly better results!
The trick for CSS selectors is to try your luck by starting fairly generic and progressively making them more specific.
In the end, our experimentation revealed that <code>td &gt; i &gt; a</code> selects our desired game elements.
Now though we actually want two things:</p>
<ul>
<li>The name of each game</li>
<li>The URL to each Wikipedia article on a game</li>
</ul>
<p>To find specific parts of our element we can use <em>attributes</em> via <code>::</code>!
For URLs use <code>::attr(href)</code> and for text <code>::text</code>.
Note that we don&#39;t want the HTML elements themselves, so we can use the <code>get</code> or <code>getall</code> functions to extract our data!
Here are the CSS selectors for collecting our two pieces of information from the page (see the sketch after the list):</p>
<ul>
<li>The name of each game: <code>td &gt; i &gt; a::text</code></li>
<li>The URL to each Wikipedia article on a game: <code>td &gt; i &gt; a::attr(href)</code></li>
</ul>
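<p>As a rough sketch, inside a Scrapy spider&#39;s <code>parse</code> method (where <code>response</code> is the downloaded page), the extraction could look like this:</p>
<pre><code class="lang-python">def parse(self, response):
    # Game titles are the text of the td &gt; i &gt; a elements
    titles = response.css("td &gt; i &gt; a::text").getall()
    # Wikipedia article URLs come from the same elements' href attribute
    urls = response.css("td &gt; i &gt; a::attr(href)").getall()

    for title, url in zip(titles, urls):
        # urljoin converts relative paths into absolute URLs
        yield {"title": title.strip(), "url": response.urljoin(url)}
</code></pre>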
<p>If you&#39;ve looked at our Wikipedia webpage you should realise that we need to manually switch between pages 😑.
We luckily have links to each page present at the top of the Wikipedia list!
Now that means we&#39;ll need to find a CSS selector, and then loop through each page of Wikipedia games.</p>
<p>If you followed my advice (general -&gt; specific) you&#39;d eventually realise that some elements specify <em>classes</em>.
We can easily use CSS <em>classes</em> by writing <code>.</code> followed by the class name!
These classes are another amazing way to make selectors more specific.
Using our new knowledge of <em>classes</em> we can create a final CSS selector for each page: <code>div.toc &gt; div &gt; ul &gt; li &gt; a::attr(href)</code>.</p>
<p>All we have to do now is to loop through and scrape each page.
With Scrapy all you have to do is <code>yield</code> another response object!
To do this easily for relative paths use the <code>response.follow</code> function.
Note here that we would scrape the first page twice if we go through each link on the page.</p>
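<p>Putting that together, here&#39;s a hedged sketch of the pagination logic (<code>parse_page</code> is a hypothetical callback which applies our earlier title/URL selectors; Scrapy&#39;s built-in duplicate filter also helps drop repeated requests):</p>
<pre><code class="lang-python">def parse(self, response):
    # Queue up every page of the games list via the table of contents links
    for href in response.css("div.toc &gt; div &gt; ul &gt; li &gt; a::attr(href)").getall():
        # response.follow resolves relative URLs against the current page
        yield response.follow(href, callback=self.parse_page)

    # Also scrape the games on the current (first) page
    yield from self.parse_page(response)
</code></pre>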
<p>Now these are just game titles and article URLs, but we can use Wikipedia&#39;s <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/Special:Export">special export articles page</a> to download the articles themselves!
However, this isn&#39;t automated, and so a small amount of manual work will have to be done to replicate the results.</p>
<h1 id="technical-summary">Technical summary</h1>
<p>CSS selector tips and tricks:</p>
<ul>
<li>Elements can be specified like <code>div</code></li>
<li>Using <code>.class_name</code> allows you to specify an element&#39;s class</li>
<li>Attributes can be specified after <code>::</code><ul>
<li>URLs are found with the <code>attr(href)</code> attribute</li>
</ul>
</li>
</ul>
<p>For more take a look at <a target='_blank' rel='noopener noreferrer'  href="https://www.w3schools.com/cssref/css_selectors.asp">w3schools complete list</a>.</p>
<p>To extract data from HTML elements use the <code>get</code> or <code>getall</code> function and to remove extraneous characters use <code>strip</code>.</p>
<h1 id="the-trouble-with-web-scraping">The trouble with web scraping</h1>
<p>I hope you appreciate how well designed Scrapy is!
It makes the process of web scraping relatively easy to do.</p>
<p>But why am I raving about Scrapy when I previously emphasised how <em>troublesome</em> web scraping can be?
Well, unfortunately for us, whilst Scrapy makes <em>how</em> we web scrape easy, it can&#39;t remove inconsistencies in web pages 😥.
We&#39;ve discussed together how to web scrape a Wikipedia page, and I sincerely hope that web scraping with me has provided you with some transferable value (i.e. you become capable of replicating this yourself).</p>
<h1 id="my-code">My code</h1>
<p>To understand more about how to create a Scrapy <em>spider</em> to extract information from our website see <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/GameRec">my GitHub repository</a>!</p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>I know data collection (especially web scraping) can be time-consuming, and often feels like a pain.
However, being persistent and fighting through the data collection process will definitely give us a strong start!</p>
<p>If you haven&#39;t already, check out the <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-nlp-project-edition-ck6zsqtbo05srdfs135o8blcf">first intro to building end-to-end machine learning projects</a>.
Make sure to stay tuned for the next post where I actively go through the next (and potentially most important) step of our journey (data preprocessing)!
Last but not least, make sure to <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_?ref_src=twsrc%5Etfw">follow me on Twitter</a> for updates!</p>
]]></content:encoded></item><item><title><![CDATA[Zero to Hero... NLP project edition]]></title><description><![CDATA[Why?
So you just went through another tutorial, another MOOC.
Your guilty gut instinct knows another one just won't help, but you have no idea what else to do.

You've been told a project can go a long way to show initiative, motivation and even skil...]]></description><link>https://www.kamwithk.com/zero-to-hero-nlp-project-edition</link><guid isPermaLink="true">https://www.kamwithk.com/zero-to-hero-nlp-project-edition</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[nlp]]></category><category><![CDATA[projects]]></category><category><![CDATA[side project]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Mon, 24 Feb 2020 01:42:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1582519877890/zJGDPW4qv.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>So you just went through another tutorial, another MOOC.
Your guilty gut instinct knows another one just <em>won&#39;t help</em>, but you have no idea what else to do.</p>
<blockquote>
<p>You&#39;ve been told a project can go a long way to show initiative, motivation and even skill, but... you&#39;ve got no idea what to do, where to go or even how to start!</p>
</blockquote>
<p>Apparently, it should &quot;genuinely motivate you to work&quot;, but... how?
What should your project even be about?</p>
<p>Your first instinct for starting a project may be to go with the flow, and see where it leads.
You can try, be my guest, but that&#39;s like learning to navigate around a jungle yourself (you <em>might</em> find your target location... <em>eventually</em>)!
Instead, you could try conducting deep research into the surrounding landscape and wildlife.
However, when you&#39;re exposed to real, aggressive animals you&#39;ll notice the <strong>major difference between theoretical knowledge and real practical skills</strong>.</p>
<p>What you really need is a guide!
Someone who knows the place well enough to give you a brief tour around.
Someone able to point out general points of interest and significant events to watch out for.
This way when you&#39;re left alone, you&#39;ll roughly know what to do and how to live/navigate around the jungle yourself.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1582519903805/ugPgvIcbQ.png" alt="todo.png"></p>
<blockquote>
<p>It&#39;s easy to get stuck without any sense of direction during a project (like in a jungle)</p>
</blockquote>
<p>Please don&#39;t get caught alone in the jungle!
Instead, allow me to be your guide.
In this mini-series, we together will go from the ground up building a unique (and therefore impressive) Natural Language Processing project!
I hope this mini-series inspires you to start your own project whilst also offering a solid foundation to replicate the process yourself!</p>
<h1 id="a-light-bulb">A light bulb</h1>
<p>It&#39;s great to start off intrinsically motivated to work, but it&#39;s just... unrealistic.
How many times have you been so blown away by a random perfect idea that was so aligned to what you were <em>about</em> to do, that you could <strong>take immediate action</strong> and bring it to life?
If your answer is daily, you&#39;re lucky, kudos to you.</p>
<p>However, if you&#39;ve got no idea what to do, I&#39;ve got your back:</p>
<blockquote>
<p>With time and effort, learning and absorbing information, you&#39;ll eventually encounter an <em>impressive</em> and <em>worthy</em> idea.</p>
</blockquote>
<p>This means if you&#39;ve been ruminating for a while, take a break and instead learn.
You can learn through articles, books, videos, anything you like... just bask in information!
The trick here is to <strong>continuously question how these ideas could be used</strong> in the real world.
It doesn&#39;t matter whether you completely understand yet either (with time you&#39;ll learn...), just make sure to replicate this process until you come across a gem!</p>
<p>If you&#39;re unsure about your ability to finish a project, <em>that&#39;s fine</em>!
What&#39;s the worst thing that will happen?
The worst thing is that you&#39;ll have learnt more about what you can and can&#39;t do next time!
Just remember that dedication pays off in the long run.
If you research each idea, eventually after five or so you&#39;ll find something golden!</p>
<h1 id="breakdown">Breakdown</h1>
<p>Finding an idea was damn hard, but <strong>following through... now that&#39;s something entirely different</strong>!
Lucky for confused basic simpletons like us, there&#39;s an easy way to break down the entire project into a few key stages:</p>
<ol>
<li><p>Data collection</p>
<p>Machine learning is cool, but we can&#39;t really do much without data.
So let&#39;s kick off our journey the right way by finding quality data!
There are two options:</p>
<ul>
<li>Popular and easy to manage data</li>
<li>Unique and niche data</li>
</ul>
<p><em>What could possibly warrant going through the trouble of creating a special dataset just for a single project</em>?
<strong>Simple</strong>, you want to be a <strong>problem solver</strong>.</p>
<blockquote>
<p>You want to show your ability to <strong>solve new, unique and challenging problems</strong>, not simple tutorials!</p>
</blockquote>
<p>I know finding data from unique sources will bring about numerous seemingly unnecessary hurdles, but they&#39;re part of the fun.</p>
</li>
<li><p>Process data (make sure it&#39;s formatted correctly and cleaned)</p>
<p>Processing data could be the most important part of your project.</p>
<blockquote>
<p>High-quality data yields high-quality results.</p>
</blockquote>
<p>I know you&#39;ll be tempted to fast track your progress by simplifying your preprocessing pipeline.
But just remember the saying <em>&quot;garbage in == garbage out&quot;</em>.
It means your lazy unprocessed data manifests itself within your model.
Hence a lazy mediocre model will generate sub-optimal output (despite attempts to algorithmically improve results).</p>
</li>
<li><p>Modelling</p>
<p>The highly anticipated part of any data science project is creating a model.
There are loads of complex models (and modifications to them) you can make, however, <strong>start simple and incrementally improve afterwards</strong>.</p>
</li>
<li><p>Application</p>
<blockquote>
<p>You thought you&#39;d <em>finished</em>?
Hahaha... the model itself isn&#39;t nearly as impressive as a tangible application!</p>
</blockquote>
<p>You have a variety of options, a website, mobile app, browser extension...
Choose whatever application makes sense!</p>
<p>Creating a final application may take a little time and require you to broaden your skillset further, but it pays itself off extremely fast.
Remember that one well-thought-out project is far better than a dozen small and careless mediocre ones!</p>
</li>
</ol>
<p>Cover image (modified) sourced from <a target='_blank' rel='noopener noreferrer'  href="https://commons.wikimedia.org/wiki/File:To_Do_List_Scene_Vector.svg">here</a></p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>I know that creating our first NLP project won&#39;t be quick nor easy.
But I think it&#39;s important to find <strong>why you&#39;re doing a project</strong>.
Is it to demonstrate <em>how fast</em> you can work or <em>how <strong>able</strong> you are to do <strong>meaningful and realistic</strong> work</em>?
I hope this mini-series helps you!</p>
<p>If you&#39;ve liked this, make sure to stay tuned for the next post where I actively go through the first step of our journey (data collection).
Make sure to <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_?ref_src=twsrc%5Etfw">follow me on Twitter</a> for updates!</p>
]]></content:encoded></item><item><title><![CDATA[Snake classification report]]></title><description><![CDATA[Why?
5.4 million people are bitten by snakes every year and 81,000-138,000 people die due to snake bites each year.
Preventing snake bites is clearly a major issue, in need of preventative measures to save lives.
The project Snaked demonstrates a pot...]]></description><link>https://www.kamwithk.com/snake-classification-report</link><guid isPermaLink="true">https://www.kamwithk.com/snake-classification-report</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Sun, 16 Feb 2020 04:28:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581827280771/gadkuVXEa.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>5.4 million people are bitten by snakes every year, and 81,000-138,000 people die due to snake bites each year.
Snake bites are clearly a major issue, in need of preventative measures to save lives.
The project Snaked demonstrates a potential solution to the problem of identifying the snake which has bitten a person.
A proof of concept app is available on the <a target='_blank' rel='noopener noreferrer'  href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">Google Play Store</a> which uses this model.</p>
<h1 id="challenge">Challenge</h1>
<ul>
<li>Each snake species varies in shape, colour, size, texture and more.</li>
<li>Over 3000 snake species have already been discovered worldwide!</li>
<li>Different snake species may look nearly identical, yet vary significantly</li>
</ul>
<p>So it is clearly not an easy job to identify one snake from another, even though it&#39;s essential that we do.</p>
<h1 id="frameworks-and-methods-utilized">Frameworks and Methods Utilized</h1>
<p>PyTorch is used for ALL deep learning code and NumPy for numerical computations.
The code is separated into the main file outlining the chosen algorithms (can be swapped/modified easily) and abstracted code to help train, evaluate and create an executable for any model easily.</p>
<p>There are several novel ideas/techniques used here which help to create neater/more readable pythonic code.
The three primary examples of new PyTorch techniques are:</p>
<ul>
<li>The Item tuple class which allows modular and further extensible code (when extra data needs to be processed)</li>
<li>Use of super-convergence/the one-cycle policy in pure PyTorch instead of a highly abstracted library (i.e. Fast.AI)/bare python/numpy</li>
<li>Use of a dictionary to dictate how different data sets should be split up to allow easy modifications of data proportions (i.e. switching between full dataset for training and small batches for ensuring all code runs without errors)</li>
</ul>
<p>Although several models were trialed out on the dataset, in the end, a MobileNetV2 model provided the best results, whilst also remaining relatively lightweight and so able to run on low-power devices like phones (essential for the app).
The final model was trained using LDAM loss instead of cross-entropy loss due to the dataset being imbalanced (some classes having far more samples than others).
Note that the codebase has support for classical rebalancing, however, experimental trials show that this method causes overfitting extremely early on.
The model manages to achieve around a 70% accuracy and F1 score.
More details about the <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/modern-algorithms-choosing-a-model-ck6hwiovf004ndfs1529fcppy">choice of model</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/improving-your-computer-vision-model-ck6k3em3b0113dfs16bray6ee">how it was improved</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too-ck6jb8er400r0dfs1agw7d0y4">lessons learnt from this project</a> can be found on <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/">The Data Science Swiss Army Knife blog</a>.</p>
<h1 id="android-application">Android Application</h1>
<p>The android application created for this project was written with Kotlin, using Fotoapparat (for easy camera support) and PyTorch (for utilizing the chosen PyTorch model) libraries.
Due to the lightweight MobileNetV2 model no network is required to connect to a server (which would normally run the computations itself).
This is intentionally done to facilitate use within remote locations!
Please note that this is only a sample proof of concept app and if you&#39;re bitten consult a medical expert immediately.</p>
<h1 id="sources">Sources</h1>
<p>All data currently used for the project comes from <a target='_blank' rel='noopener noreferrer'  href="https://www.aicrowd.com/challenges/snake-species-identification-challenge">AIcrowd&#39;s Snake Species Identification Challenge</a>.
The images and labels have been used, however, geographic locations have been ignored to allow easy usage for any image, even if it hasn&#39;t been tagged.
The dataset allows 85 species to be labelled.
A Jupyter Notebook is also provided which demonstrates how to collect further data using Google Image searches!
All statistics used here, or within the repository are from the World Health Organisation (unless otherwise stated) or pertaining specifically to the Snaked source code.</p>
<p>For further information please see the following:</p>
<ul>
<li><a target='_blank' rel='noopener noreferrer'  href="https://www.who.int/news-room/fact-sheets/detail/snakebite-envenoming">Snakebite envenoming</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="http://apps.who.int/bloodproducts/snakeantivenoms/database/">Venomous Snakes Distribution and Species Risk Categories</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://www.aicrowd.com/challenges/snake-species-identification-challenge">AIcrowd Snake Species Identification Challenge</a></li>
</ul>
<h1 id="metrics">Metrics</h1>
<p>Throughout this report, I&#39;ll refer to F1 scores as my primary metric.
This is because of the class imbalance.
Please note that there is a major difference between the support number and the number of samples present for a class.
The former refers to the count in the validation dataset, whereas the latter refers to the count in the training dataset.
The number present in the training set will be primarily used to judge the effect of a skewed dataset on model predictions.</p>
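<p>For reference, a table like the one further below can be generated with scikit-learn (a sketch, where <code>y_true</code> and <code>y_pred</code> would hold the validation labels and the model&#39;s predictions):</p>
<pre><code class="lang-python">from sklearn.metrics import classification_report

# Toy labels purely for illustration
y_true = ["cobra", "viper", "cobra", "viper"]
y_pred = ["cobra", "cobra", "cobra", "viper"]

# Prints per-class precision, recall, F1 score and support
print(classification_report(y_true, y_pred))
</code></pre>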
<h1 id="training-and-validation-graphs">Training and Validation Graphs</h1>
<h2 id="train-epoch-vs-accuracy">Train epoch vs accuracy</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581743456233/FZRKWh-vQ.png" alt="train_acc.png"></p>
<h2 id="train-epoch-vs-loss">Train epoch vs loss</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581743461828/-C68AWthQ.png" alt="train_loss.png"></p>
<h2 id="validation-epoch-vs-accuracy">Validation epoch vs accuracy</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581743468939/u2pb8qLhd.png" alt="validation_acc.png"></p>
<h2 id="validation-epoch-vs-loss">Validation epoch vs loss</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581743644871/p-ZBR3Lk_.png" alt="validation_loss.png"></p>
<p><em>Model trained for 51 epochs</em></p>
<h1 id="classification-report">Classification Report</h1>
<table>
<thead>
<tr>
<td>species</td><td>precision</td><td>recall</td><td>f1-score</td><td>support</td></tr>
</thead>
<tbody>
<tr>
<td>agkistrodon-contortrix</td><td>0.8452380952380952</td><td>0.8765432098765432</td><td>0.8606060606060606</td><td>81.0</td></tr>
<tr>
<td>agkistrodon-piscivorus</td><td>0.676923076923077</td><td>0.6027397260273972</td><td>0.6376811594202899</td><td>73.0</td></tr>
<tr>
<td>ahaetulla-prasina</td><td>0.8</td><td>0.6666666666666666</td><td>0.7272727272727272</td><td>6.0</td></tr>
<tr>
<td>arizona-elegans</td><td>0.631578947368421</td><td>0.5714285714285714</td><td>0.6</td><td>21.0</td></tr>
<tr>
<td>boa-imperator</td><td>0.6</td><td>0.5294117647058824</td><td>0.5625</td><td>17.0</td></tr>
<tr>
<td>bothriechis-schlegelii</td><td>0.8888888888888888</td><td>0.7619047619047619</td><td>0.8205128205128205</td><td>21.0</td></tr>
<tr>
<td>bothrops-asper</td><td>0.875</td><td>0.5384615384615384</td><td>0.6666666666666667</td><td>13.0</td></tr>
<tr>
<td>carphophis-amoenus</td><td>0.6551724137931034</td><td>0.6551724137931034</td><td>0.6551724137931034</td><td>29.0</td></tr>
<tr>
<td>charina-bottae</td><td>0.8947368421052632</td><td>0.7391304347826086</td><td>0.8095238095238095</td><td>23.0</td></tr>
<tr>
<td>coluber-constrictor</td><td>0.5689655172413793</td><td>0.559322033898305</td><td>0.5641025641025641</td><td>59.0</td></tr>
<tr>
<td>contia-tenuis</td><td>0.52</td><td>0.6190476190476191</td><td>0.5652173913043478</td><td>21.0</td></tr>
<tr>
<td>coronella-austriaca</td><td>0.3333333333333333</td><td>0.1</td><td>0.15384615384615383</td><td>10.0</td></tr>
<tr>
<td>crotalus-adamanteus</td><td>0.6153846153846154</td><td>0.7272727272727273</td><td>0.6666666666666667</td><td>11.0</td></tr>
<tr>
<td>crotalus-atrox</td><td>0.7747252747252747</td><td>0.8392857142857143</td><td>0.8057142857142857</td><td>168.0</td></tr>
<tr>
<td>crotalus-cerastes</td><td>0.7857142857142857</td><td>0.8461538461538461</td><td>0.8148148148148148</td><td>13.0</td></tr>
<tr>
<td>crotalus-horridus</td><td>0.7741935483870968</td><td>0.8135593220338984</td><td>0.7933884297520662</td><td>59.0</td></tr>
<tr>
<td>crotalus-molossus</td><td>0.6363636363636364</td><td>0.7</td><td>0.6666666666666666</td><td>10.0</td></tr>
<tr>
<td>crotalus-oreganus</td><td>0.5333333333333333</td><td>0.6666666666666666</td><td>0.5925925925925926</td><td>12.0</td></tr>
<tr>
<td>crotalus-ornatus</td><td>1.0</td><td>0.8181818181818182</td><td>0.9</td><td>11.0</td></tr>
<tr>
<td>crotalus-pyrrhus</td><td>0.8571428571428571</td><td>0.5454545454545454</td><td>0.6666666666666665</td><td>22.0</td></tr>
<tr>
<td>crotalus-ruber</td><td>0.7894736842105263</td><td>0.7142857142857143</td><td>0.7500000000000001</td><td>21.0</td></tr>
<tr>
<td>crotalus-scutulatus</td><td>0.8620689655172413</td><td>0.7575757575757576</td><td>0.8064516129032258</td><td>33.0</td></tr>
<tr>
<td>crotalus-viridis</td><td>0.5384615384615384</td><td>0.6363636363636364</td><td>0.5833333333333334</td><td>22.0</td></tr>
<tr>
<td>diadophis-punctatus</td><td>0.8266666666666667</td><td>0.7948717948717948</td><td>0.8104575163398693</td><td>78.0</td></tr>
<tr>
<td>epicrates-cenchria</td><td>1.0</td><td>0.5</td><td>0.6666666666666666</td><td>2.0</td></tr>
<tr>
<td>haldea-striatula</td><td>0.5454545454545454</td><td>0.47058823529411764</td><td>0.5052631578947367</td><td>51.0</td></tr>
<tr>
<td>heterodon-nasicus</td><td>0.7</td><td>0.5384615384615384</td><td>0.608695652173913</td><td>13.0</td></tr>
<tr>
<td>heterodon-platirhinos</td><td>0.7058823529411765</td><td>0.75</td><td>0.7272727272727272</td><td>48.0</td></tr>
<tr>
<td>hierophis-viridiflavus</td><td>0.5</td><td>0.4375</td><td>0.4666666666666667</td><td>16.0</td></tr>
<tr>
<td>hypsiglena-jani</td><td>0.5238095238095238</td><td>0.6111111111111112</td><td>0.5641025641025642</td><td>18.0</td></tr>
<tr>
<td>lampropeltis-californiae</td><td>0.8701298701298701</td><td>0.8170731707317073</td><td>0.8427672955974842</td><td>82.0</td></tr>
<tr>
<td>lampropeltis-getula</td><td>0.8333333333333334</td><td>0.625</td><td>0.7142857142857143</td><td>16.0</td></tr>
<tr>
<td>lampropeltis-holbrooki</td><td>0.6923076923076923</td><td>0.6428571428571429</td><td>0.6666666666666666</td><td>14.0</td></tr>
<tr>
<td>lampropeltis-triangulum</td><td>0.8166666666666667</td><td>0.8305084745762712</td><td>0.8235294117647058</td><td>59.0</td></tr>
<tr>
<td>lichanura-trivirgata</td><td>0.9333333333333333</td><td>0.8235294117647058</td><td>0.8749999999999999</td><td>17.0</td></tr>
<tr>
<td>masticophis-flagellum</td><td>0.5294117647058824</td><td>0.6428571428571429</td><td>0.5806451612903226</td><td>42.0</td></tr>
<tr>
<td>micrurus-tener</td><td>1.0</td><td>0.95</td><td>0.9743589743589743</td><td>20.0</td></tr>
<tr>
<td>morelia-spilota</td><td>0.4</td><td>0.2222222222222222</td><td>0.2857142857142857</td><td>9.0</td></tr>
<tr>
<td>naja-naja</td><td>0.8333333333333334</td><td>0.7142857142857143</td><td>0.7692307692307692</td><td>7.0</td></tr>
<tr>
<td>natrix-maura</td><td>0.25</td><td>0.125</td><td>0.16666666666666666</td><td>8.0</td></tr>
<tr>
<td>natrix-natrix</td><td>0.7857142857142857</td><td>0.4074074074074074</td><td>0.5365853658536585</td><td>27.0</td></tr>
<tr>
<td>natrix-tessellata</td><td>0.5</td><td>0.36363636363636365</td><td>0.4210526315789474</td><td>11.0</td></tr>
<tr>
<td>nerodia-cyclopion</td><td>0.5833333333333334</td><td>0.4666666666666667</td><td>0.5185185185185186</td><td>15.0</td></tr>
<tr>
<td>nerodia-erythrogaster</td><td>0.53125</td><td>0.4358974358974359</td><td>0.47887323943661975</td><td>78.0</td></tr>
<tr>
<td>nerodia-fasciata</td><td>0.6190476190476191</td><td>0.43333333333333335</td><td>0.5098039215686274</td><td>30.0</td></tr>
<tr>
<td>nerodia-rhombifer</td><td>0.7213114754098361</td><td>0.6567164179104478</td><td>0.6875</td><td>67.0</td></tr>
<tr>
<td>nerodia-sipedon</td><td>0.5084745762711864</td><td>0.5309734513274337</td><td>0.5194805194805195</td><td>113.0</td></tr>
<tr>
<td>nerodia-taxispilota</td><td>0.8461538461538461</td><td>0.55</td><td>0.6666666666666667</td><td>20.0</td></tr>
<tr>
<td>opheodrys-aestivus</td><td>0.9042553191489362</td><td>0.9444444444444444</td><td>0.9239130434782609</td><td>90.0</td></tr>
<tr>
<td>opheodrys-vernalis</td><td>0.8235294117647058</td><td>0.6666666666666666</td><td>0.7368421052631577</td><td>21.0</td></tr>
<tr>
<td>pantherophis-alleghaniensis</td><td>0.36585365853658536</td><td>0.4</td><td>0.38216560509554137</td><td>75.0</td></tr>
<tr>
<td>pantherophis-emoryi</td><td>0.5714285714285714</td><td>0.6666666666666666</td><td>0.6153846153846153</td><td>36.0</td></tr>
<tr>
<td>pantherophis-guttatus</td><td>0.9183673469387755</td><td>0.8035714285714286</td><td>0.8571428571428571</td><td>56.0</td></tr>
<tr>
<td>pantherophis-obsoletus</td><td>0.5395348837209303</td><td>0.6041666666666666</td><td>0.5700245700245701</td><td>192.0</td></tr>
<tr>
<td>pantherophis-spiloides</td><td>0.43478260869565216</td><td>0.25</td><td>0.3174603174603175</td><td>40.0</td></tr>
<tr>
<td>pantherophis-vulpinus</td><td>0.5957446808510638</td><td>0.8484848484848485</td><td>0.7</td><td>33.0</td></tr>
<tr>
<td>phyllorhynchus-decurtatus</td><td>0.6956521739130435</td><td>0.9411764705882353</td><td>0.7999999999999999</td><td>17.0</td></tr>
<tr>
<td>pituophis-catenifer</td><td>0.7073170731707317</td><td>0.7016129032258065</td><td>0.7044534412955465</td><td>124.0</td></tr>
<tr>
<td>pseudechis-porphyriacus</td><td>0.6666666666666666</td><td>0.2857142857142857</td><td>0.4</td><td>7.0</td></tr>
<tr>
<td>python-bivittatus</td><td>0.875</td><td>0.7777777777777778</td><td>0.823529411764706</td><td>9.0</td></tr>
<tr>
<td>python-regius</td><td>0.0</td><td>0.0</td><td>0.0</td><td>3.0</td></tr>
<tr>
<td>regina-septemvittata</td><td>0.56</td><td>0.6666666666666666</td><td>0.6086956521739131</td><td>21.0</td></tr>
<tr>
<td>rena-dulcis</td><td>0.6666666666666666</td><td>0.8</td><td>0.7272727272727272</td><td>10.0</td></tr>
<tr>
<td>rhinocheilus-lecontei</td><td>0.8292682926829268</td><td>0.85</td><td>0.8395061728395061</td><td>40.0</td></tr>
<tr>
<td>sistrurus-catenatus</td><td>1.0</td><td>0.8</td><td>0.888888888888889</td><td>5.0</td></tr>
<tr>
<td>sistrurus-miliarius</td><td>1.0</td><td>0.6666666666666666</td><td>0.8</td><td>6.0</td></tr>
<tr>
<td>sonora-semiannulata</td><td>0.3333333333333333</td><td>0.4</td><td>0.3636363636363636</td><td>5.0</td></tr>
<tr>
<td>storeria-dekayi</td><td>0.76</td><td>0.8465346534653465</td><td>0.8009367681498829</td><td>202.0</td></tr>
<tr>
<td>storeria-occipitomaculata</td><td>0.631578947368421</td><td>0.6486486486486487</td><td>0.64</td><td>37.0</td></tr>
<tr>
<td>tantilla-gracilis</td><td>0.5454545454545454</td><td>0.6</td><td>0.5714285714285713</td><td>10.0</td></tr>
<tr>
<td>thamnophis-cyrtopsis</td><td>0.5714285714285714</td><td>0.4444444444444444</td><td>0.5</td><td>9.0</td></tr>
<tr>
<td>thamnophis-elegans</td><td>0.47058823529411764</td><td>0.2962962962962963</td><td>0.3636363636363636</td><td>27.0</td></tr>
<tr>
<td>thamnophis-hammondii</td><td>0.5555555555555556</td><td>0.35714285714285715</td><td>0.43478260869565216</td><td>14.0</td></tr>
<tr>
<td>thamnophis-marcianus</td><td>0.8787878787878788</td><td>0.8285714285714286</td><td>0.8529411764705883</td><td>35.0</td></tr>
<tr>
<td>thamnophis-ordinoides</td><td>0.375</td><td>0.16666666666666666</td><td>0.23076923076923078</td><td>18.0</td></tr>
<tr>
<td>thamnophis-proximus</td><td>0.7543859649122807</td><td>0.7413793103448276</td><td>0.7478260869565219</td><td>58.0</td></tr>
<tr>
<td>thamnophis-radix</td><td>0.8181818181818182</td><td>0.47368421052631576</td><td>0.6</td><td>38.0</td></tr>
<tr>
<td>thamnophis-sirtalis</td><td>0.7167070217917676</td><td>0.896969696969697</td><td>0.7967698519515478</td><td>330.0</td></tr>
<tr>
<td>tropidoclonion-lineatum</td><td>0.7777777777777778</td><td>0.6363636363636364</td><td>0.7000000000000001</td><td>11.0</td></tr>
<tr>
<td>vermicella-annulata</td><td>0.0</td><td>0.0</td><td>0.0</td><td>3.0</td></tr>
<tr>
<td>vipera-aspis</td><td>0.4</td><td>0.5714285714285714</td><td>0.47058823529411764</td><td>7.0</td></tr>
<tr>
<td>vipera-berus</td><td>0.75</td><td>0.75</td><td>0.75</td><td>12.0</td></tr>
<tr>
<td>virginia-valeriae</td><td>0.2</td><td>0.14285714285714285</td><td>0.16666666666666666</td><td>7.0</td></tr>
<tr>
<td>xenodon-rabdocephalus</td><td>1.0</td><td>1.0</td><td>1.0</td><td>8.0</td></tr>
<tr>
<td>zamenis-longissimus</td><td>0.4</td><td>0.3333333333333333</td><td>0.3636363636363636</td><td>12.0</td></tr>
<tr>
<td>accuracy</td><td>0.6930662557781202</td><td>0.6930662557781202</td><td>0.6930662557781202</td><td>0.6930662557781202</td></tr>
<tr>
<td>macro avg</td><td>0.6659430597272402</td><td>0.6050948460385794</td><td>0.6247619446039016</td><td>3245.0</td></tr>
<tr>
<td>weighted avg</td><td>0.6925634979734435</td><td>0.6930662557781202</td><td>0.6866874083865039</td><td>3245.0</td></tr>
</tbody>
</table>
<p>Despite a skewed dataset, the majority of snakes had a similar precision and recall score.</p>
<h1 id="number-of-samples-vs-f1-score">Number of Samples vs F1 score</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581823824791/-F0a54VnR.png" alt="Samples vs F1.png">
The above graph of the number of samples the neural network was trained over and F1 score per class indicates two significant points:</p>
<ul>
<li>Having more samples of a class will increase the probability of a model robustly identifying the snake species correctly</li>
<li>However, an inability to collect a large number of images for every class does not necessarily create a major imbalance in a model&#39;s predictions</li>
</ul>
<p>The latter takeaway shows that LDAM loss has been largely successful!
On a side note though, when a class has few samples, the F1 score may not be completely representative of how the model will generalise.
This is primarily because, in a small number of images, only a small number of conditions can be shown and tested.
Yet, in reality, an image can be taken in any environment and transformed in a very large number of ways.</p>
<h1 id="confusion-matrix">Confusion Matrix</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581741953427/cKk37rtOR.png" alt="Confusion Matrix.png"></p>
<p>The test dataset had too few images to clearly judge anything from the confusion matrix.
This is due to a combination of having 85 classes and the rows for classes with few samples appearing quite dark.</p>
<h1 id="improvements">Improvements</h1>
<p>If this model were to be retrained, a good idea would be to use a 90-5-5, or better an 80-10-10, split.
This would ensure that there would be enough data in the testing dataset to judge where exactly the model&#39;s confusion stems from (using a confusion matrix).</p>
<p>Combining the current classification model with an additional preprocessing pipeline which includes segmentation may also allow far higher F1 scores.
This is because segmentation is able to remove extraneous noise around a snake (subtle environmental cues which prevent the model from generalising).
Additionally, this would completely avoid the chance of a snake being cropped out of an image (currently unlikely but still possible for small snakes).</p>
]]></content:encoded></item><item><title><![CDATA[Super-Convergence with JUST PyTorch]]></title><description><![CDATA[Why?
When creating Snaked, my snake classification model I needed to find a way to improve results.
Super-Convergence was just that, a way to train a model faster whilst getting better results!
HOWEVER, I found no guides on how to do it with the buil...]]></description><link>https://www.kamwithk.com/super-convergence-with-just-pytorch</link><guid isPermaLink="true">https://www.kamwithk.com/super-convergence-with-just-pytorch</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[pytorch]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Thu, 13 Feb 2020 04:56:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581569707308/6ucCpsvsf.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>When creating Snaked, my snake classification model, I needed to find a way to improve results.
Super-Convergence was just that, a way to train a model faster whilst getting better results!
HOWEVER, I found <strong>no guides</strong> on how to do it with the built-in PyTorch scheduler.</p>
<h1 id="learn-the-theory">Learn the theory</h1>
<p>Before you go through this you&#39;d probably like to know <em>what</em> super-convergence is and <em>how</em> it works.
The general gist is to increase the learning rate as much as possible at the beginning and then progressively decrease it in a cyclical fashion.
This is because larger learning rates train faster, but cause the loss to diverge.
My focus here is on PyTorch though, so I won&#39;t explain the theory any further.</p>
<p>Here&#39;s a list of resources to delve deeper:</p>
<ul>
<li><a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/abs/1708.07120">Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://medium.com/vitalify-asia/whats-up-with-deep-learning-optimizers-since-adam-5c1d862b9db0">What’s up with Deep Learning optimizers since Adam?</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://towardsdatascience.com/https-medium-com-super-convergence-very-fast-training-of-neural-networks-using-large-learning-rates-decb689b9eb0">Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://www.fast.ai/2018/07/02/adam-weight-decay/">AdamW and Super-convergence is now the fastest way to train neural nets</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://sgugger.github.io/the-1cycle-policy.html">The 1cycle policy</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html">How Do You Find A Good Learning Rate</a></li>
</ul>
<h1 id="imports">Imports</h1>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> torchvision <span class="hljs-keyword">import</span> datasets, models, transforms
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader

<span class="hljs-keyword">from</span> torch <span class="hljs-keyword">import</span> nn, optim
<span class="hljs-keyword">from</span> torch_lr_finder <span class="hljs-keyword">import</span> LRFinder
</code></pre>
<h1 id="setting-hyperparameters">Setting Hyperparameters</h1>
<h2 id="set-transforms">Set transforms</h2>
<pre><code class="lang-python">transforms = transforms.Compose([
transforms.RandomResizedCrop(size=<span class="hljs-number">256</span>, scale=(<span class="hljs-number">0.8</span>, <span class="hljs-number">1</span>)),
    transforms.RandomRotation(<span class="hljs-number">90</span>),
    transforms.ColorJitter(),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.CenterCrop(size=<span class="hljs-number">224</span>), <span class="hljs-comment">#ImgNet standards</span>
    transforms.ToTensor(),
    transforms.Normalize((<span class="hljs-number">0.485</span>, <span class="hljs-number">0.456</span>, <span class="hljs-number">0.406</span>), (<span class="hljs-number">0.229</span>, <span class="hljs-number">0.224</span>, <span class="hljs-number">0.225</span>)), <span class="hljs-comment"># ImgNet standards</span>
])
</code></pre>
<h2 id="load-the-data-model-and-basic-hyper-parameters">Load the data, model and basic hyper parameters</h2>
<pre><code class="lang-python">train_loader = DataLoader(datasets.CIFAR10(root=<span class="hljs-string">"train_data"</span>, train=<span class="hljs-keyword">True</span>, download=<span class="hljs-keyword">True</span>, transform=transforms))
test_loader = DataLoader(datasets.CIFAR10(root=<span class="hljs-string">"test_data"</span>, train=<span class="hljs-keyword">False</span>, download=<span class="hljs-keyword">True</span>, transform=transforms))

model = models.mobilenet_v2(pretrained=<span class="hljs-keyword">True</span>)

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters())

<span class="hljs-comment"># Set the device in use to GPU (when it's available)</span>
device = torch.device(<span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>)
model = model.to(device)

<span class="hljs-comment">## Find the perfect learning rate</span>
Note that doing this requires a separate library <span class="hljs-keyword">from</span> [here](https://github.com/davidtvs/pytorch-lr-finder).

```python
lr_finder = LRFinder(model, optimizer, criterion, device)
lr_finder.range_test(train_loader, end_lr=<span class="hljs-number">10</span>, num_iter=<span class="hljs-number">1000</span>)
lr_finder.plot()
plt.savefig(<span class="hljs-string">"LRvsLoss.png"</span>)
plt.close()
</code></pre>
<pre><code>Stopping early, the loss has diverged
Learning rate search finished. See the graph with {finder_name}.plot()
</code></pre><p><img src="Super-Convergence_files/Super-Convergence_8_2.svg" alt="svg"></p>
<h2 id="create-a-scheduler">Create a scheduler</h2>
<p>Use the one cycle learning rate scheduler (for super-convergence).</p>
<p>Note that the scheduler uses the maximum learning rate from the graph.
To choose it, look for the point with the steepest downward gradient (slope).</p>
<p>The number of epochs to train for and the steps per epoch must be entered in.
The steps per epoch is the number of batches per epoch, i.e. <code>len(train_loader)</code>.</p>
<pre><code class="lang-python">scheduler = optim.lr_scheduler.OneCycleLR(optimizer, <span class="hljs-number">2e-3</span>, epochs=<span class="hljs-number">50</span>, steps_per_epoch=len(train_loader))
</code></pre>
<h1 id="train-model">Train model</h1>
<p>Train the model for 50 epochs.
Print stats after every epoch (loss and accuracy).</p>
<p>Different schedulers should be called at different points within the code.
Placing the scheduler&#39;s step in the wrong place will cause bugs, so with the one-cycle policy ensure that the step method is called straight after each batch.</p>
<pre><code class="lang-python">best_acc = <span class="hljs-number">0</span>
epoch_no_change = <span class="hljs-number">0</span>

<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, <span class="hljs-number">50</span>):
    print(f<span class="hljs-string">"Epoch {epoch}/49"</span>.format())

    <span class="hljs-keyword">for</span> phase <span class="hljs-keyword">in</span> [<span class="hljs-string">"train"</span>, <span class="hljs-string">"validation"</span>]:
        running_loss = <span class="hljs-number">0.0</span>
        running_corrects = <span class="hljs-number">0</span>

        <span class="hljs-comment"># PyTorch model's state must be changend</span>
        <span class="hljs-comment"># As layers like dropout work differently depending on state</span>
        <span class="hljs-keyword">if</span> phase == <span class="hljs-string">"train"</span>:
            model.train()
        <span class="hljs-keyword">else</span>: model.eval()

        <span class="hljs-comment"># Loop through the dataset</span>
        <span class="hljs-keyword">for</span> (inputs, labels) <span class="hljs-keyword">in</span> train_loader:
            <span class="hljs-comment"># Transfer data to the GPU</span>
            inputs, labels = inputs.to(device), labels.to(device)

            <span class="hljs-comment"># Reset the gradient (so the gradient doesn't accumilate)</span>
            optimizer.zero_grad()

            <span class="hljs-keyword">with</span> torch.set_grad_enabled(phase == <span class="hljs-string">"train"</span>):
                <span class="hljs-comment"># Predict the label which the model gives the max probability (of being true)</span>
                outputs = model(inputs)
                _, preds = torch.max(outputs, <span class="hljs-number">1</span>)
                loss = criterion(outputs, labels)

                <span class="hljs-keyword">if</span> phase == <span class="hljs-string">"train"</span>:
                    <span class="hljs-comment"># Backprop</span>
                    loss.backward()
                    optimizer.step()

                    <span class="hljs-comment"># Super convergence changes the learning rate</span>
                    scheduler.step()

            running_loss += loss.item() * inputs.size(<span class="hljs-number">0</span>)
            running_corrects += torch.sum(preds == labels.data)

        <span class="hljs-comment"># Calculate and output metrics</span>
        epoch_loss = running_loss / len(self.data_loaders[phase].sampler)
        epoch_acc = running_corrects.double() / len(self.data_loaders[phase].sampler)
        print(<span class="hljs-string">"\nPhase: {}, Loss: {:.4f}, Acc: {:.4f}"</span>.format(phase, epoch_loss, epoch_acc))

        <span class="hljs-comment"># Stop the model from training further if it hasn't improved for 5 consecutive epochs</span>
        <span class="hljs-keyword">if</span> phase == <span class="hljs-string">"validation"</span> <span class="hljs-keyword">and</span> epoch_acc &gt; best_acc:
            epoch_no_change += <span class="hljs-number">1</span>

            <span class="hljs-keyword">if</span> epoch_no_change &gt; <span class="hljs-number">5</span>:
                <span class="hljs-keyword">break</span>
</code></pre>
<h1 id="thanks-for-reading-">Thanks for READING!</h1>
<p>I hope this is easy enough to understand relatively quickly.
When I first implemented super-convergence it took me a long time to figure out how to use the scheduler (I couldn&#39;t find any code which utilized it).
If you liked this blog post consider checking out other <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/improving-your-computer-vision-model-ck6k3em3b0113dfs16bray6ee">ways to improve your model</a>.
If you&#39;d like to see how super-convergence is used in a real project, look no further than <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Snaked">my snake classification project</a>.</p>
<p>Cover image sourced <a target='_blank' rel='noopener noreferrer'  href="https://pixnio.com/objects/mechanism-metal-gears-steel-iron#">here</a></p>
]]></content:encoded></item><item><title><![CDATA[Improving your computer vision model]]></title><description><![CDATA[Why?
So you've cleaned your data, written some basic code to train a model, but now don't know where to go next.
Don't worry, I've got your back.
I'm going to explain in as much detail as I can the tricks I've learnt about which can help improve any ...]]></description><link>https://www.kamwithk.com/improving-your-computer-vision-model</link><guid isPermaLink="true">https://www.kamwithk.com/improving-your-computer-vision-model</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[neural networks]]></category><category><![CDATA[Computer Vision]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Thu, 13 Feb 2020 01:57:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581558940825/qeU9eeUo2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>So you&#39;ve cleaned your data, written some basic code to train a model, but now don&#39;t know where to go next.
Don&#39;t worry, I&#39;ve got your back.
I&#39;m going to explain in as much detail as I can the tricks I&#39;ve learnt about which can help improve any model.</p>
<h1 id="data-augmentation">Data Augmentation</h1>
<p>Data augmentations are modifications you can make to your input data.
They can help make your model more robust, and able to generalise better.
There is a wide variety available, but I&#39;ll just describe some of the ones I&#39;ve tried:</p>
<ul>
<li>Randomly crop 80%-100% of each image</li>
<li>Adjust the aspect ratio</li>
<li>Random colour jitter</li>
<li>Resize images to have 256-pixel height</li>
<li>Centre crop the image to 224x224 pixels</li>
</ul>
<h1 id="super-convergence">Super-Convergence</h1>
<p>The point of super-convergence is to speed up training, whilst also improving performance (a win-win situation).
It works based on the idea that higher learning rates train models fast and can act as regularizers.
It decreases the learning rate with time in a cyclic fashion.
Note that you can try out the AdamW optimizer as well, as it&#39;s supposed to give better results.
I&#39;m writing a whole article on how to use super-convergence in pure PyTorch, so if interested make sure to check that out!
For the nitty-gritty details take a look at the <a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/abs/1708.07120">original paper</a>.</p>
<h1 id="learning-rate-finder-tuning-hyperparameters">Learning rate finder/tuning hyperparameters</h1>
<p>This one is usually done in combination with super-convergence, but can also be used itself.
The idea is to plot a graph of learning rate vs loss.
In this way, you can find the maximum learning rate you can safely use (without the gradient diverging and loss increasing).</p>
<p>Note that whilst training, the learning rate does need to be decreased, or else the loss will increase again (super-convergence does this for you).</p>
<h1 id="test-time-augmentation">Test time augmentation</h1>
<p>Test time augmentation involves averaging the results of a model over several <em>augmented images</em>.
This can yield higher results, however, I decided against it due to unusually high VRAM usage with the PyTorch library I found (easier with <a target='_blank' rel='noopener noreferrer'  href="https://www.fast.ai/">Fast.AI</a> though).</p>
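<p>For intuition, here&#39;s a minimal sketch of the idea in plain PyTorch (assuming the augmentation transform works directly on tensors):</p>
<pre><code class="lang-python">import torch

def tta_predict(model, image, augment, n_augments=5):
    """Average softmax outputs over several augmented copies of one image."""
    model.eval()
    with torch.no_grad():
        outputs = [model(augment(image).unsqueeze(0)).softmax(dim=1)
                   for _ in range(n_augments)]
    return torch.stack(outputs).mean(dim=0)
</code></pre>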
<h1 id="balance-dataset">Balance dataset</h1>
<p>This one is REALLY important.
I originally didn&#39;t notice this, but someone who looked at my original code saw that I hadn&#39;t balanced the number of images present per class.
This meant that classes with fewer images would be significantly less likely to be predicted.
The reason for this is that classes with more images have a much higher contribution to loss functions (standard ones at the least).</p>
<p>The best way to deal with this is to <em>get more data</em>, however in many problems, this just <em>can&#39;t be done</em>.</p>
<p>The first, simplest alternative is oversampling the minority classes, but this can lead to overfitting.
Another way is undersampling the majority classes, but this may discard important samples!
Instead, try out a newer loss function like <a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/pdf/1906.07413.pdf">LDAM Loss</a>.</p>
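<p>Another common remedy (separate from the LDAM loss above) is weighted sampling: draw images from rare classes more often so each batch is roughly balanced. A sketch, assuming a dataset and a matching list dataset_labels exist:</p>
<pre><code class="lang-python">import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = torch.tensor(dataset_labels)          # one class index per image (assumed)
class_counts = torch.bincount(labels)
weights = 1.0 / class_counts[labels].float()   # rarer classes get picked more often

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
</code></pre>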
<h1 id="confusion-matrices">Confusion Matrices</h1>
<p>I&#39;m not an expert here, but in essence a confusion matrix can be used to find out which classes aren&#39;t being classified well.
From there you can analyse those classes, try to see why the model isn&#39;t doing so well on them, and then try to improve it (i.e. maybe more images are needed of one class).</p>
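<p>Scikit-learn makes this a one-liner; y_true and y_pred below stand for your validation labels and predictions:</p>
<pre><code class="lang-python">from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
# Row i shows where images whose true class is i ended up;
# large off-diagonal entries reveal which classes get confused with each other
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
</code></pre>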
<h1 id="progressive-resizing">Progressive resizing</h1>
<p>This one&#39;s simple and effective.
You start by training your model on small, low-resolution images, and then progressively increase the resolution.
The reason this increases model robustness is that the model is forced to look for <strong>simple patterns before complex ones</strong>.</p>
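<p>One way this might be set up with torchvision (the stage sizes here are arbitrary picks):</p>
<pre><code class="lang-python">from torchvision import transforms

# Train for a few epochs at each resolution, smallest first
for size in [64, 128, 224]:
    stage_tfms = transforms.Compose([
        transforms.RandomResizedCrop(size, scale=(0.8, 1.0)),
        transforms.ToTensor(),
    ])
    # rebuild the dataset/loader with stage_tfms and run the training loop here
</code></pre>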
<h1 id="consider-metrics-other-than-accuracy">Consider metrics other than accuracy</h1>
<p>Accuracy may not be the best metric for your problem.
Metrics like the F1 score can be equally useful, if not more so!</p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).
If this has helped you out consider checking out <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too-ck6jb8er400r0dfs1agw7d0y4">my article on problems I encountered whilst building my first project</a>!</p>
<p>Cover image sourced <a target='_blank' rel='noopener noreferrer'  href="https://pxhere.com/en/photo/1449707">here</a></p>
]]></content:encoded></item><item><title><![CDATA[How I published an app and model to classify 85 snake species (and how you can too)]]></title><description><![CDATA[Why?
I had just finished my last MOOC and couldn't stop wondering whether I was ready to start a project.
I was frightened, scared, and lacked confidence.
However, after weeks of contemplation, I bit the bullet and announced I'd create a simple image...]]></description><link>https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too</link><guid isPermaLink="true">https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[learning]]></category><category><![CDATA[projects]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 12 Feb 2020 12:48:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581474478774/CG9sEyV3P.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>I had just finished my last MOOC and couldn&#39;t stop wondering whether I was ready to start a project.
I was scared and lacked confidence.
However, after weeks of contemplation, I bit the bullet and announced I&#39;d create a simple image classification model.</p>
<p>Now here I am with an app <a target='_blank' rel='noopener noreferrer'  href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">officially available on the Play Store</a> and a  <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Snaked">GitHub repo</a> with its open-sourced code.
I&#39;d like to explain the hurdles I&#39;ve faced and the lessons I&#39;ve learnt overcoming them (in hopes it&#39;ll help you out too)!</p>
<h1 id="my-journey">My Journey</h1>
<ul>
<li>Created my own dataset from Google Image search results</li>
<li>Started with the simplest linear regression model possible</li>
<li>Tried to use my own custom <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/convolutional-neural-networks-basic-theory-in-a-nutshell-ck6hl9wdl000qi5s1venh4nk1">CNN</a></li>
<li>Switched to a <a target='_blank' rel='noopener noreferrer'  href="https://www.aicrowd.com/challenges/snake-species-identification-challenge"><strong>FAR larger dataset</strong></a></li>
<li>Took <a target='_blank' rel='noopener noreferrer'  href="http://cs231n.stanford.edu/">Stanford&#39;s CS231n</a> to strengthen my foundations (theoretical knowledge)</li>
<li>Created the basic code to train a <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/modern-algorithms-choosing-a-model-ck6hwiovf004ndfs1529fcppy">MobileNetV2 model</a></li>
<li>Learnt about Super Convergence (article about Super Convergence with PURE PyTorch coming soon...)</li>
<li>Created an <a target='_blank' rel='noopener noreferrer'  href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">Android app for Snaked</a> which could take and import photos, before outputting the snake species</li>
</ul>
<p>In short, I made a <strong>LOT of mistakes</strong>, and that&#39;s precisely how I learnt.
Scraping Google for images allowed me to appreciate the effort involved in creating a dataset with <strong>120,000 images</strong>.
Starting with linear regression was a grand, stupid and hilarious blunder, but it taught me the value of <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/convolutional-neural-networks-basic-theory-in-a-nutshell-ck6hl9wdl000qi5s1venh4nk1">CNN&#39;s</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/modern-algorithms-choosing-a-model-ck6hwiovf004ndfs1529fcppy">pre-trained models</a> first-hand.
Trying linear regression for such a task also forced me to find out how and <strong>why</strong> neural networks work!
The long training times and mediocre results from a plain pre-trained model caused me to find Super Convergence!</p>
<blockquote>
<p><em>My mistakes were like stages</em>, without <strong>each and every one</strong> of them I wouldn&#39;t have learnt anywhere near the amount I did</p>
</blockquote>
<h1 id="the-benefits">The benefits</h1>
<p>An obvious question is why bother overcoming hurdle after hurdle when a free MOOC can <em>teach</em> the same content (maybe even taking less time and effort).
I&#39;ve already <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/how-to-learn-data-science-and-eventually-become-an-expert-ck6j0fd7o00lodfs1bbhafek9">answered the question</a>, but in short, it comes down to:</p>
<blockquote>
<p>You remember information which you repeatedly use, and progressively forget all else</p>
</blockquote>
<p>Now, this doesn&#39;t mean you don&#39;t go through any MOOC&#39;s/tutorials, but just <strong>make sure you don&#39;t get &quot;stuck in tutorial hell&quot;</strong>!
Instead, if you know the basics, apply what you&#39;ve learnt right now to create a cool, interesting project you can show off!</p>
<h1 id="challenges-i-overcame">Challenges I overcame</h1>
<h2 id="should-i-start-now-">Should I start now?</h2>
<p>The fact you&#39;re questioning your current skill, knowledge and theoretical foundation level indicates that you&#39;re aware of the limits of your understanding!
This doesn&#39;t mean that you&#39;re stupid, or not ready; instead, you&#39;ve learnt enough to know that there&#39;s way more ahead of you.</p>
<blockquote>
<p>Just understand that there will always be more to learn, so may as well start using what you already know</p>
</blockquote>
<h2 id="is-my-idea-good-enough-">Is my idea good enough?</h2>
<p>Two things you can do to judge:</p>
<ul>
<li>Is my idea too simple/complex?</li>
<li>Ask someone else</li>
</ul>
<p>If you&#39;re unsure how complex the project will be, consider how others have fared on similar tasks.
One way to find out is to search for online articles or research papers.
If you find hundreds then the problem is probably too simple, but if you only find a few it&#39;s probably unrealistic (or you&#39;ve got a genius idea).
Knowledge on the topic at hand is a must, so just research the topic and see what you find out!</p>
<p>If you don&#39;t already have any connections to data scientists, then you&#39;re going to have to reach out (like I did)!
I&#39;d do this ASAP no matter what, as it&#39;s <strong>always incredibly useful to have a variety of opinions</strong> on any situation.
My personal method is to reach out to data scientists near me saying I&#39;m studying machine learning and looking for some advice.
Most people decline, but if you reach out to enough of them it still works!</p>
<h2 id="what-if-it-fails-">What if it fails?</h2>
<blockquote>
<p>Find out <strong>WHY</strong> your project can&#39;t work!</p>
</blockquote>
<p>If finding out <em>why</em> an idea can&#39;t work doesn&#39;t unravel another solution, then you&#39;ve found out something new... and that&#39;s an accomplishment!</p>
<h2 id="i-don-t-know-what-to-do-">I don&#39;t know what to do!</h2>
<p>Find similar problems to the one you&#39;re trying to solve.
For me, I started with tutorials on how to use PyTorch to classify digits (MNIST) and more complex objects (CIFAR100).
I followed the tutorials and figured out how they achieved their task.
I then used <strong>transfer learning</strong>, replicating what each tutorial had done, but this time for my own problem.</p>
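<p>To give a flavour of what that replication looked like, here&#39;s a minimal transfer learning sketch in PyTorch (the ResNet backbone is a stand-in for whichever pre-trained model you pick):</p>
<pre><code class="lang-python">import torch.nn as nn
from torchvision import models

num_classes = 85  # e.g. one per snake species

# Load an ImageNet pre-trained backbone and freeze its feature extractor
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Swap the classifier head for one matching your own problem
model.fc = nn.Linear(model.fc.in_features, num_classes)
</code></pre>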
<blockquote>
<p>Of course, I was far from ready to tackle the full challenge but <strong>with time I figured out more and more</strong>.</p>
</blockquote>
<p>If you&#39;re still stuck though, you may actually need to go back to the books (or courses).</p>
<h2 id="nothing-is-working-">Nothing is working!</h2>
<p>Just stick at it and after a while, something will click!
At the beginning of creating my first model, none of my code ran, but eventually (after a few days), I managed to find the bug and fix it.
Now I know how to run training and evaluation loops with PyTorch!
Note that it&#39;s often a minor change which finally revives the code (so play around, debug a lot and you&#39;ll figure it out).</p>
<h2 id="i-don-t-understand-how-everything-works-">I don&#39;t understand <em>how</em> everything works?</h2>
<p>Theory can be difficult.
You can have a working model, but not know anything about how the transfer learning model, optimizer, loss function... or something else you&#39;ve used works.
But think about how long it takes to train a model... hours, days, weeks?
If you can write the code, do that, run it and just learn the theory whilst it&#39;s training.
You&#39;re both <em>training</em> then (pun intended)!</p>
<h2 id="it-works-but-how-can-i-improve-it-">It works, but how can I improve it?</h2>
<p>I&#39;ve got a blog post specifically on <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/improving-your-computer-vision-model-ck6k3em3b0113dfs16bray6ee">how to improve your model</a>.
Take a look once you&#39;ve got a working model!</p>
<h2 id="what-do-i-do-after-creating-a-decent-model-">What do I do after creating a decent model?</h2>
<p>This is the cycle:</p>
<blockquote>
<p>Learn, create, improve, show off, rinse and repeat!</p>
</blockquote>
<p>Just create blogs, create projects and continue through that cycle.</p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).
If this has helped you out consider checking out <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/modern-algorithms-choosing-a-model-ck6hwiovf004ndfs1529fcppy">how to choose a model</a>!</p>
<p>Cover image sourced <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/Snake">here</a></p>
]]></content:encoded></item><item><title><![CDATA[How to learn data science and (eventually) become an expert]]></title><description><![CDATA[Why?
When I first found out I wanted to become a data scientist I was completely overwhelmed by the vast breadth and depth of the field of study.
The overwhelming complexity blinded me on how to learn.
I began with a rough outline of the different re...]]></description><link>https://www.kamwithk.com/how-to-learn-data-science-and-eventually-become-an-expert</link><guid isPermaLink="true">https://www.kamwithk.com/how-to-learn-data-science-and-eventually-become-an-expert</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 12 Feb 2020 07:45:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581493541329/t44KhcrBR.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>When I first realised I wanted to become a data scientist I was completely overwhelmed by the vast breadth and depth of the field.
The overwhelming complexity blinded me to how I should learn.
I began with a <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-machine-learning-data-science-path-ck6frs5ws01773cs171xj1ydi">rough outline</a> of the different resources I&#39;d found available and planned to take a few MOOC&#39;s before <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/moving-towards-the-real-deal-hint-projects-ck6fz696h019v3ns17ks2jyij">moving on to projects</a>.</p>
<p>Well, I did that, but made a few costly mistakes along the way which slowed down my progress.
I&#39;m going to go through the problems I encountered and how to overcome them.</p>
<h1 id="mooc-s-vs-projects">MOOC&#39;s vs Projects</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581490933923/Kj556QfD5.png" alt="MOOC&#39;s vs Projects.png"></p>
<p>The diagram above (I admit I went overboard) explains why you should start by learning from a MOOC and then immediately progress to projects.
It makes 3 points:</p>
<ul>
<li>MOOC&#39;s teach core knowledge, but in a way where you&#39;re likely to forget</li>
<li>You can&#39;t start a project if you don&#39;t already have a rough understanding of the problem you&#39;re trying to solve</li>
<li>So you can learn from a MOOC to get the core knowledge and then immediately do projects to consolidate knowledge</li>
</ul>
<p>The key takeaway is to ALWAYS start your learning off with theory (MOOC&#39;s or books) and then immediately follow up with projects!</p>
<h1 id="the-ideal-first-mooc">The ideal first MOOC</h1>
<p>How can you decide which MOOC is right for you?
That comes down to which MOOC&#39;s give you enough foundational knowledge that, afterwards, you can understand problems well enough to know roughly where to look and how to piece different bits of knowledge together.</p>
<p>From what I&#39;ve seen and learnt so far, I&#39;d recommend <a target='_blank' rel='noopener noreferrer'  href="https://www.fast.ai/">Fast.AI</a> or <a target='_blank' rel='noopener noreferrer'  href="https://www.deeplearning.ai/">DeepLearning.AI</a> for this.
The main difference between the two courses is their approach to teaching.
<a target='_blank' rel='noopener noreferrer'  href="https://www.fast.ai/">Fast.AI</a> is top-bottom (starts with applied high-level stuff before going into the nitty-gritty details) whereas <a target='_blank' rel='noopener noreferrer'  href="https://www.deeplearning.ai/">DeepLearning.AI</a> is bottom-top (starts with the basic maths and then builds up into modern cutting edge content).
Do note though that there are a plethora of other courses I&#39;ve <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-machine-learning-data-science-path-ck6frs5ws01773cs171xj1ydi">already categorised/broken down before</a>.</p>
<h1 id="thanks-for-reading">THANKS FOR READING</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).</p>
]]></content:encoded></item><item><title><![CDATA[Modern Algorithms: Choosing a Model]]></title><description><![CDATA[Why?
There are so many models to trial out, but training a neural network is a slow process when you have a substantial amount of data.
Here I'll step you through the full model life cycle, starting with a list of potential models, explaining their d...]]></description><link>https://www.kamwithk.com/modern-algorithms-choosing-a-model</link><guid isPermaLink="true">https://www.kamwithk.com/modern-algorithms-choosing-a-model</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[neural networks]]></category><category><![CDATA[Mobile Development]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Fri, 07 Feb 2020 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581426358473/fW3ve3rnl.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>There are so many models to trial out, but training a neural network is a slow process when you have a substantial amount of data.
Here I&#39;ll step you through the full model life cycle, starting with a list of potential models, explaining their differentiating factors and then how one final model works.</p>
<h1 id="the-contenders">The contenders</h1>
<ul>
<li>ResNext152</li>
<li>ResNext50_32x4d</li>
<li>Inception V2</li>
<li>VGG16</li>
<li>SqueezeNet 1_1</li>
<li>MobileNetV2</li>
</ul>
<h1 id="choosing-a-model">Choosing a Model</h1>
<p>When choosing a neural network model it&#39;s important to consider each model&#39;s <a target='_blank' rel='noopener noreferrer'  href="https://paperswithcode.com/sota/image-classification-on-imagenet">performance</a> as well as its computational cost.</p>
<p>Consider where your model will be used (e.g. mobile), as this may impact your choice of models (e.g. a lightweight model may be required).
The second consideration is the training time.
If you have a powerful graphics card like the RTX 2080Ti with a large amount of VRAM (i.e. 11 GB+) then training time likely wouldn&#39;t be a large concern.
However, if you have a large dataset and a GPU with less VRAM, a model designed to run quickly will train far quicker!
MobileNet balances both factors quite well, and so is the focus of this blog post.</p>
<p>For completeness, note that there&#39;s a classic pattern where greater resource usage, on average, leads to only small performance improvements.
This is because adding more layers will cause <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/vanishing-gradients-ck6hu7fmp003wd9s1hg776af0">vanishing/exploding gradients</a>, so older networks like Inception tended to perform worse than modern models.</p>
<h1 id="mobilenetv2">MobileNetV2</h1>
<h2 id="a-drop-in-replacement-for-standard-cnn-s">A drop-in replacement for standard CNN&#39;s</h2>
<p>A &quot;factorized&quot; version of regular CNN&#39;s can be created by dividing CNN&#39;s into two parts:</p>
<ul>
<li>Filtration (depth wise convolution)</li>
<li>Finding new linear combinations of features (pointwise convolution)</li>
</ul>
<p>The first, depthwise convolution applies a single convolutional kernel to each input channel, before aggregating the results.
The second, pointwise convolution is a 1x1 convolution which finds new linear combinations of those features.</p>
<p>Rationally, the model reduces computation as follows:</p>
<ul>
<li>Standard convolutions cost <img src="https://latex.codecogs.com/gif.download?height_%7Binput%7D%20*%20width_%7Binput%7D%20*%20depth_%7Binput%7D%20*%20%7Bno%5C_kernels%7D%20*%20%7Bkernel%5C_size%7D%5E2" alt="standard convolution cost"></li>
<li>Depthwise separable convolutions cost <img src="https://latex.codecogs.com/gif.download?height_%7Binput%7D%20*%20width_%7Binput%7D%20*%20depth_%7Binput%7D%20*%20%28%7Bkernel%5C_size%7D%5E2%20+%20%7Bno%5C_kernels%7D%29" alt="depthwise separable convolution cost"></li>
</ul>
<p>The ratio between the two costs works out to 1/no_kernels + 1/kernel_size<sup>2</sup>, so with a (standard) kernel size of 3 these convolutions will be 8-9x faster!</p>
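<p>In PyTorch the depthwise part is just a grouped convolution, so the factorized pair is tiny. A minimal sketch:</p>
<pre><code class="lang-python">import torch.nn as nn

def depthwise_separable(in_channels, out_channels):
    # Filtration: one 3x3 kernel per input channel (groups=in_channels)
    # Recombination: a pointwise 1x1 convolution mixing the channels
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, groups=in_channels),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
    )
</code></pre>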
<h2 id="linear-bottleneck-layers">Linear bottleneck layers</h2>
<p>Let me preface this by asserting that <strong>channels</strong> mentioned here are <strong>layers</strong> (like RGB).</p>
<p>There are two assumptions at play:</p>
<ul>
<li>When ReLU transforms maintain a non-zero volume, the transformation is linear</li>
<li>ReLU can preserve input information, but only if it originated in a low-dimensional subspace</li>
</ul>
<p>Using <strong>linear bottleneck layers</strong> (convolutional layers without ReLU&#39;s) allows greater preservation of information.
This is because linear functions <strong>don&#39;t collapse any channel</strong>, unlike ReLU&#39;s.
Note that the paper describing this emphasizes that collapsing a channel is fine when that information is likely stored elsewhere in another channel.
This is explained further later on.</p>
<h2 id="inverted-residuals">Inverted residuals</h2>
<p>Very much like ResNet&#39;s, MobileNetV2 uses <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/vanishing-gradients-ck6hu7fmp003wd9s1hg776af0">residual blocks</a> to improve gradient flow!
Here though, the links are between the linear bottleneck layers which reduce the output dimensions.
This design choice is more efficient, as the computations (matrix multiplications) occur between smaller matrices.</p>
<h2 id="relu6">ReLU6</h2>
<p>Throughout the MobileNetV2 implementation, ReLU6 is always used instead of a regular ReLU.
ReLU6 is a modified version of the original Rectified Linear Unit which stops activations from growing beyond 6, i.e. ReLU6(x) = min(max(0, x), 6).
Capping the activations allows ReLU6 to be more robust.
However, the <strong>6</strong> itself is an arbitrary choice of value.</p>
<h2 id="layers">Layers</h2>
<ul>
<li>1x1 convolution with ReLU</li>
<li>Depthwise convolution (with 3x3 kernel) as a residual bottleneck layer</li>
<li>Pointwise 1x1 convolution (finding linear combinations between features)</li>
</ul>
<p>The first stage with 1x1 convolutions effectively increases the number of channels.
An <strong>expansion ratio</strong> is used to represent the desired increase in channels (the size of the input bottleneck vs inner size).
As there are now more channels present, it is fine to use ReLU after the bottleneck layer.
The idea is that with a large number of channels if one channel is collapsed, that same info is likely still within another channel as well.
Hence, ReLU can be used.</p>
<p>The final pointwise 1x1 convolution does the opposite (decreases the number of channels).
No ReLU is used, as reducing dimensionality can itself destroy information.</p>
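<p>Putting the three stages together, here&#39;s a stripped-down sketch of the block (batch norm, strides and the extra hyperparameters noted below are left out, and the residual link assumes matching channel counts):</p>
<pre><code class="lang-python">import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion_ratio=6):
        super().__init__()
        hidden = channels * expansion_ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU6(),  # expand channels
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden), nn.ReLU6(),                    # depthwise filter
            nn.Conv2d(hidden, channels, kernel_size=1),              # linear bottleneck, no ReLU
        )

    def forward(self, x):
        return x + self.block(x)  # residual link between the bottlenecks
</code></pre>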
<p>Note more hyperparameters are present in the actual model, but for simplicity, I&#39;m leaving them out.
To learn about MobileNetV2 in more detail check out its <a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/pdf/1801.04381.pdf">paper</a>.</p>
<h1 id="a-final-note">A final note</h1>
<p>Now that you know <em>how</em> a <em>basic lightweight model</em> works, you may be interested in where it could come in handy. I previously mentioned reduced training time, but a model which can run on any device without an internet connection or powerful hardware can be useful in several scenarios. One such scenario is <a target='_blank' rel='noopener noreferrer'  href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">my snake species classification app</a> which tells you the snake species in a photo (e.g. if bitten). Please feel free to use the <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Snaked">GitHub repo</a> with its code as a reference for how you can do the same for your model!</p>
<h1 id="resources-">Resources?</h1>
<ul>
<li>Original <a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/pdf/1801.04381.pdf">paper</a> on MobileNetV2</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://medium.com/@luis_gonzales/a-look-at-mobilenetv2-inverted-residuals-and-linear-bottlenecks-d49f85c12423">A Look at MobileNetV2: Inverted Residuals and Linear Bottlenecks</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://machinethink.net/blog/mobilenet-v2">Breakdown of MobileNetV2</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://medium.com/zylapp/review-of-deep-learning-algorithms-for-image-classification-5fdbca4a05e2">Review of Deep Learning Algorithms for Image Classification</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://towardsdatascience.com/review-mobilenetv2-light-weight-model-image-classification-8febb490e61c">Review: MobileNetV2 — Light Weight Model (Image Classification)</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215">A Comprehensive Introduction to Different Types of Convolutions in Deep Learning</a></li>
</ul>
<p>Cover image sourced from <a target='_blank' rel='noopener noreferrer'  href="https://www.piqsels.com/en/public-domain-photo-fvbay">here</a></p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).
If you liked this article, then check out <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/improving-your-computer-vision-model-ck6k3em3b0113dfs16bray6ee">how to improve your model</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too-ck6jb8er400r0dfs1agw7d0y4">how to overcome several hurdles during creating a project of your own</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Vanishing Gradients]]></title><description><![CDATA[Why?
Deep learning is massive right now and there are few innovations which really made this possible.
Residual blocks are a quintessential modern discovery to solve the problem of exploding and vanishing gradients!
More layers == Better
Modern deep ...]]></description><link>https://www.kamwithk.com/vanishing-gradients</link><guid isPermaLink="true">https://www.kamwithk.com/vanishing-gradients</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[neural networks]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Tue, 07 Jan 2020 09:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581422443832/7jTiUsqbB.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>Deep learning is massive right now and there are a few innovations which really made this possible.
Residual blocks are a quintessential modern discovery to solve the problem of exploding and vanishing gradients!</p>
<h1 id="more-layers-better">More layers == Better</h1>
<p>Modern deep learning revolves around adding more layers, however, this raises the question: at what point will our model no longer improve?
Residual blocks provide one way to increase the number of useful layers a neural network may have (before it overfits).</p>
<h1 id="linking-forwards">Linking forwards</h1>
<p>Residual blocks work by linking the current layer not only to the next layer (like normal), but also to a layer two or three ahead of it.
This allows larger gradients to flow back to the earlier layers!
By doing so we offset the major impact of vanishing gradients, allowing us to delve deeper once again.</p>
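<p>A minimal sketch of one such block in PyTorch (the layer sizes are placeholders):</p>
<pre><code class="lang-python">import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # the skip link lets gradients flow straight through the addition
</code></pre>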
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581422675249/O0RY5jmoh.png" alt="Resnet.png"></p>
<p>The image was sourced <a target='_blank' rel='noopener noreferrer'  href="https://commons.wikimedia.org/wiki/File:Resnet.png">here</a></p>
<h1 id="resources">Resources</h1>
<ul>
<li><a target='_blank' rel='noopener noreferrer'  href="https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec">Residual blocks — Building blocks of ResNet</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Convolutional Neural Networks: Basic Theory in a Nutshell]]></title><description><![CDATA[Why?
Majority of the tutorials I've seen on convolutional neural networks either focus on providing a basic analogy or going straight into describing terminology.
Therefore, I aim to start with an overview of the stages involved in CNN's (Convolution...]]></description><link>https://www.kamwithk.com/convolutional-neural-networks-basic-theory-in-a-nutshell</link><guid isPermaLink="true">https://www.kamwithk.com/convolutional-neural-networks-basic-theory-in-a-nutshell</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[neural networks]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Thu, 17 Oct 2019 09:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581407554752/yRvKaUe8a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>The majority of tutorials I&#39;ve seen on convolutional neural networks either focus on providing a basic analogy or go straight into describing terminology.
Therefore, I aim to start with an overview of the stages involved in CNN&#39;s (Convolutional Neural Networks) and then provide an analogy, as well as a small glossary of key terms and external resources for further assistance.
Make sure to utilize the glossary for the key terms used throughout the blog post, and then continue onwards to a few of the articles and videos mentioned in my resources section!
By the way, don&#39;t expect to completely understand CNN&#39;s straight away, as they ain&#39;t all too simple!</p>
<p>Note that I&#39;ll be providing tangible/practical code in another one of my problem-solution blog posts (where I take a problem I&#39;ve had and explain my final derived solution, along with how I&#39;ve overcome some major hurdles).</p>
<h1 id="key-stages-">Key Stages?</h1>
<ul>
<li>Convolutional Layers (extract features from the input)<ul>
<li>Filters (matrices of weights) convolve over the input to produce feature maps<ul>
<li>When the filter and input are similar, a high number is produced</li>
</ul>
</li>
<li>Applying ReLU functions to increase non-linearity</li>
</ul>
</li>
<li>Pooling/Down Sampling (combine clusters of neurons together) to reduce dimensionality<ul>
<li>Summarises each cluster with a single value (e.g. max pooling keeps the largest activation)</li>
</ul>
</li>
<li>Fully Connected Layers (connect a neuron to all those in the next layer)</li>
</ul>
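<p>To make the stages concrete, here&#39;s a tiny network in PyTorch showing them in order (it assumes 32x32 RGB inputs and 10 classes):</p>
<pre><code class="lang-python">import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: filters extract features
    nn.ReLU(),                                   # increase non-linearity
    nn.MaxPool2d(2),                             # pooling/down sampling: 32x32 becomes 16x16
    nn.Flatten(),                                # one long vector for the classifier
    nn.Linear(16 * 16 * 16, 10),                 # fully connected layer producing class scores
)
</code></pre>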
<h1 id="analogy-">Analogy?</h1>
<p>You probably won&#39;t understand the above descriptions straight away, however a worded example feels more intuitive (to ease the confusion)!</p>
<p>So humor me and imagine the following scenario:</p>
<ul>
<li>You&#39;re given a few hundred paintings and need to identify which picture corresponds to which shape</li>
<li>None of the paintings are too precise, and the grid they were painted onto is HUGE (so there&#39;s no use just trying every single combination in a fully connected artificial neural network)</li>
<li>You notice that each picture/painting is composed of smaller, <em>more subtle</em>, strokes (lines), which form curves, which themselves create each shape&#39;s final outline</li>
</ul>
<p>This might sound insane (are we like 2?), but more complicated and <strong>meaningful</strong> problems can be solved in the exact <strong>same way</strong>!</p>
<p>The process of segregating/labelling images begins with realizing that you can&#39;t comprehend a full picture at once, so you must break each down into smaller 2x2 squares.
You can move across, with a <em>stride</em> of 2 pixels at a time and compare these squares against a few <em>filters</em>.
The <em>filters</em> themselves are just another grid, which resembles unique <em>features</em> which may be present in the original input image.
Example basic features are like lines in different directions:</p>
<ul>
<li>Horizontal</li>
<li>Vertical</li>
<li>Diagonal from left to right (upwards)</li>
<li>Diagonal from right to left (downwards)</li>
</ul>
<p>Through <em>convolving</em> from one mini-image (<em>receptive field</em>) to the next, documenting how similar each <em>receptive</em> and <em>filter</em> grids are, a smaller <em>down sampled</em> image can form (this is <em>pooling</em>).
The aim of this gradual comparative process is to form a more abstract, higher level image composed of lines instead of individual pixels!</p>
<p>Now the enlightening idea is that you can apply the same <em>down sampling</em> process used to extract lines from pixels, to find curves in lines, and then shapes formed from the curves!
Each of these stages forms a separate layer of your <em>neural network</em>, separated by activation functions (for accentuating <em>non-linearity</em>) and <em>fully connected layers</em> (to join together the different patterns and eventually produce a final output)!</p>
<h1 id="glossary-">Glossary?</h1>
<table>
<thead>
<tr>
<td>Term</td><td>Definition</td></tr>
</thead>
<tbody>
<tr>
<td>CNN</td><td>Convolutional Neural Network</td></tr>
<tr>
<td>Kernel/Channel</td><td>A matrix of weights used to produce a feature map by convolving over the input (note that multiple can be used to preserve spatial depth to a higher extent)</td></tr>
<tr>
<td>Filter</td><td>A set of kernels/channels</td></tr>
<tr>
<td>Convolving</td><td>Moving through a broken down version of the input, summing the input values in each section and multiplying them by the filter/kernel</td></tr>
<tr>
<td>Activation/Feature Map</td><td>Original input processed by filters to accentuate certain features (effectively performs operations like edge detection, blur, etc.)</td></tr>
<tr>
<td>Stride</td><td>The number of divisions of the input to scroll across at a time</td></tr>
<tr>
<td>Receptive Field</td><td>The part of the input which the filter is currently scrolling over</td></tr>
<tr>
<td>Padding</td><td>Adding zeroes (i.e. zero-padding) to the input or dropping part of it (valid-padding), to mitigate the impact of the stride not perfectly dividing the input (e.g. there is a remainder)</td></tr>
</tbody>
</table>
<h1 id="places-to-learn-more-">Places to LEARN MORE?</h1>
<p>I&#39;ve come across several amazing blogs and videos describing how convolutional neural networks work, so here&#39;s a rough list of the ones I feel are the most valuable!</p>
<ul>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/yyr7cqcg">MIT 6.S191: Convolutional Neural Networks</a> is probably the most wholesome and complete (to a <em>small</em> extent) video on CNN&#39;s</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/y7zb6bcw">A Beginner&#39;s Guide To Understanding Convolutional Neural Networks</a> is just amazing, I wish I read this one first<ul>
<li>Goes through a few details about how the filters actually work which no other guide did</li>
</ul>
</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/y7tlv6lg">A friendly introduction to Convolutional Neural Networks and Image Recognition</a> has the most easily interpretable example situations<ul>
<li>The example scenarios build from extremely simple to slightly more complex</li>
</ul>
</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/y7bz8czn">The Complete Beginner’s Guide to Deep Learning: Convolutional Neural Networks and Image Classification</a> has amazing visuals</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/ybatmchm">Understanding of Convolutional Neural Network (CNN) — Deep Learning</a> is well broken down into sections</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/y5ro8rph">Intuitively Understanding Convolutions for Deep Learning</a> is good for consolidation after the rough ideas behind CNN&#39;s are understood<ul>
<li>However, it&#39;s quite confusing at times (especially for a first read)</li>
</ul>
</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/jges5k5">Neural Network that Changes Everything</a> is from <a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/h3a8slm">Computerphile</a> and is a great first deep dive into the ideas behind CNN&#39;s<ul>
<li>Do note that they have other videos on these topics, however this seems like their best introduction</li>
</ul>
</li>
</ul>
<p>Cover image sourced from <a target='_blank' rel='noopener noreferrer'  href="https://commons.wikimedia.org/wiki/File:Typical_cnn.png">here</a></p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).</p>
]]></content:encoded></item><item><title><![CDATA[Life Hack Web Scraping]]></title><description><![CDATA[Why?
Web scraping has made my life SO MUCH EASIER.
Yet, the process for actually extracting content from websites which lock their content down using proprietary systems is never really mentioned.
This makes it extremely difficult if not impossible ...]]></description><link>https://www.kamwithk.com/life-hack-web-scrapping</link><guid isPermaLink="true">https://www.kamwithk.com/life-hack-web-scrapping</guid><category><![CDATA[Python]]></category><category><![CDATA[web scraping]]></category><category><![CDATA[caching]]></category><category><![CDATA[data]]></category><category><![CDATA[selenium]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Sat, 05 Oct 2019 11:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581322449513/jsI8VPt3U.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>Web scraping has made my life SO MUCH EASIER.
Yet, the process for actually extracting content from websites which lock their content down using proprietary systems is never really mentioned.
This makes it extremely difficult if not impossible to reformat information into a desirable format.
Over a few years, I&#39;ve found several (nearly) failproof techniques to help me out, and now I&#39;d like to pass them on.</p>
<p>I&#39;m going to walk you through the process of converting a web-only book to a pdf.
Feel free to replicate/modify this for use in other circumstances!
If you have any other tricks (or even useful scripts) for tasks like these, make sure to let me know, as creating these life-hack scripts is an interesting hobby!</p>
<h1 id="reproducibility-applicability-">Reproducibility/Applicability?</h1>
<p>The example I&#39;m outlining is from a website which provides study guides for a monthly fee (to protect their security I&#39;m excluding specific URL&#39;s).
Despite the potential lack of reproducibility, this guide should stay quite useful, as I&#39;m outlining several flaws/hiccups that come up in any similar project along the way!</p>
<h1 id="mistakes-to-make-">Mistakes to Make?</h1>
<p>I&#39;ve made several mistakes when trying to web scrape for limited access information.
Each mistake consumed <strong>large amounts</strong> of <strong>time</strong> and <strong>energy</strong>, so here they are:</p>
<ul>
<li>Using AutoHotKey or similar to directly affect the mouse/keyboard<ul>
<li>This seems effective, however, it is <strong>extremely dodgy</strong> and <strong>isn&#39;t reproducible</strong></li>
</ul>
</li>
<li>Load all pages and then export a HAR file<ul>
<li>HAR files don&#39;t contain any actual data</li>
<li>HAR files take ages to load in any text editor</li>
</ul>
</li>
<li>Attempt to use GET/HEAD requests<ul>
<li>Majority of pages will randomly assign tokens and other <strong>authorization</strong> approaches which are incredibly hard to reverse engineer</li>
<li>Code can never be reproduced/a different approach will be needed</li>
</ul>
</li>
</ul>
<h1 id="slow-progress">Slow Progress</h1>
<p>It seems quick and easy to write a short 300-line script to scrape these websites, but it&#39;s always more difficult than that.
Here are potential hurdles with solutions:</p>
<ul>
<li>Browser profile used by Selenium changing<ul>
<li>Programmatically find the profile</li>
</ul>
</li>
<li>Not knowing how long to wait for a link to load<ul>
<li>Detect when the link <strong>isn&#39;t equal</strong> to the <strong>current one</strong></li>
<li>Or use browser JavaScript (where possible, described more below)</li>
</ul>
</li>
<li>Needing to find information about the current web page&#39;s content<ul>
<li>Look at potential JavaScript functions and URL&#39;s</li>
</ul>
</li>
<li>Restarting a long script when it fails<ul>
<li><strong>Reduce</strong> the number of <strong>lookups</strong> for files</li>
<li>Copy files to <strong>predictable locations</strong></li>
<li>Before beginning doing anything complex check these files</li>
</ul>
</li>
<li>Not knowing what a long script is up to<ul>
<li>Print any necessary output (only for that which takes considerable time and doesn&#39;t have another metric)</li>
</ul>
</li>
</ul>
<h1 id="code">Code</h1>
<h2 id="preperation">Preperation</h2>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver
<span class="hljs-keyword">from</span> selenium.webdriver.common.by <span class="hljs-keyword">import</span> By
<span class="hljs-keyword">from</span> selenium.webdriver.support.ui <span class="hljs-keyword">import</span> WebDriverWait
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image
<span class="hljs-keyword">from</span> natsort <span class="hljs-keyword">import</span> natsorted

<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> shutil
<span class="hljs-keyword">import</span> img2pdf
<span class="hljs-keyword">import</span> hashlib
</code></pre>
<pre><code class="lang-python">driver = webdriver.Firefox()
cacheLocation = driver.capabilities[<span class="hljs-string">'moz:profile'</span>] + <span class="hljs-string">'/cache2/entries/'</span>
originalPath =  os.getcwd()
baseURL = <span class="hljs-string">'https://edunlimited.com'</span>
</code></pre>
<h2 id="loading-book">Loading Book</h2>
<pre><code class="lang-python">driver.get(loginURL)
driver.get(bookURL)

wait.until(<span class="hljs-keyword">lambda</span> driver: driver.current_url != loginURL)
</code></pre>
<h2 id="get-metadata">Get Metadata</h2>
<p>Quite often it is possible to find JavaScript functions which are used to provide useful information.
There are a few ways you may go about doing this:</p>
<ul>
<li>View the page&#39;s HTML source (right-click &#39;View Page Source&#39;)</li>
<li>Use the web console</li>
</ul>
<pre><code class="lang-python">bookTitle = driver.execute_script(<span class="hljs-string">'return app.book'</span>)
bookPages = driver.execute_script(<span class="hljs-string">'return app.pageTotal'</span>)
bookID = driver.execute_script(<span class="hljs-string">'return app.book_id'</span>)
</code></pre>
<h2 id="organize-files">Organize Files</h2>
<p>Scripts often don&#39;t perform as expected, and can sometimes take long periods of time to complete.
Therefore it&#39;s quite liberating to preserve progress throughout the script&#39;s iterations.
One good method to achieve this is keeping organized!</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(bookTitle):
        os.mkdir(bookTitle)
    <span class="hljs-keyword">if</span> len(os.listdir(bookTitle)) == <span class="hljs-number">0</span>:
        start = <span class="hljs-number">0</span>
    <span class="hljs-keyword">else</span>:
        start = int(natsorted(os.listdir(bookTitle), reverse=<span class="hljs-keyword">True</span>)[<span class="hljs-number">0</span>].replace(<span class="hljs-string">'.jpg'</span>, <span class="hljs-string">''</span>))
        driver.execute_script(<span class="hljs-string">'app.gotoPage('</span> + str(start) + <span class="hljs-string">')'</span>)

os.chdir(bookTitle)
</code></pre>
<h2 id="loop-through-the-book">Loop Through the Book</h2>
<p>Images are always stored in the cache, so when all else fails, just use this to your advantage!</p>
<p>This isn&#39;t easy though: first off we need to load the page, and then we need to somehow recover it from the cache!</p>
<p>To make sure we <strong>always load the entire page</strong>, there are two safety measures in place:</p>
<ul>
<li>Waiting for the current page to load before moving to the next</li>
<li>Reloading the page if it <em>fails to load</em></li>
</ul>
<p>Getting these two to work requires functions which guarantee completion (JavaScript or browser responses), and fail-safe waiting times.
Safe waiting times come down to trial and error, but values between 0.5 and 5 seconds usually work best.</p>
<p>Recovering specific data directly from the hard drive&#39;s cache is a relatively obscure topic.
The key is to first locate a download link (normally easy as it <strong>doesn&#39;t have to work</strong>).
Then run <strong>SHA1</strong>, <strong>Hex Digest</strong> and a <strong>capitalizing function</strong> on the <strong>URL</strong>, which produces the final filename (it <strong>isn&#39;t just one</strong> of these steps, as older sources lead you to believe, but all of them together).</p>
<p>On a final note, make sure to clean your data (here, removing the alpha channel from PNG images) now rather than afterwards, as it saves looping over all the files a second time!</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> currentPage <span class="hljs-keyword">in</span> range(start, bookPages - <span class="hljs-number">1</span>):
        <span class="hljs-comment"># Books take variable amounts of loading time</span>
        <span class="hljs-keyword">while</span> driver.execute_script(<span class="hljs-string">'return app.loading'</span>) == <span class="hljs-keyword">True</span>:
            time.sleep(<span class="hljs-number">0.5</span>)

        <span class="hljs-comment"># The service is sometimes unpredictable</span>
        <span class="hljs-comment"># So pages may fail to load</span>
        <span class="hljs-keyword">while</span> (driver.execute_script(<span class="hljs-string">'return app.pageImg'</span>) == <span class="hljs-string">'/pagetemp.jpg'</span>):
            driver.execute_script(<span class="hljs-string">'app.loadPage()'</span>)
            time.sleep(<span class="hljs-number">4</span>)

        location = driver.execute_script(<span class="hljs-string">'return app.pageImg'</span>)

        <span class="hljs-comment"># Cache is temporary</span>
        pageURL = baseURL + location
        fileName = hashlib.sha1((<span class="hljs-string">":"</span> + pageURL).encode(<span class="hljs-string">'utf-8'</span>)).hexdigest().upper()
        Image.open(cacheLocation + fileName).convert(<span class="hljs-string">'RGB'</span>).save(str(currentPage) + <span class="hljs-string">'.jpg'</span>)

        driver.execute_script(<span class="hljs-string">'app.nextPage()'</span>)
</code></pre>
<h2 id="convert-to-pdf">Convert to PDF</h2>
<p>We can finally get that one convenient PDF file!</p>
<pre><code class="lang-python">finalPath =  originalPath + <span class="hljs-string">'/'</span> + bookTitle + <span class="hljs-string">'.pdf'</span>

<span class="hljs-comment"># Combine into a single PDF</span>
<span class="hljs-keyword">with</span> open(finalPath, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
    f.write(img2pdf.convert([i <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> natsorted(os.listdir(<span class="hljs-string">'.'</span>)) <span class="hljs-keyword">if</span> i.endswith(<span class="hljs-string">".jpg"</span>)]))
</code></pre>
<h2 id="remove-excess-images">Remove Excess Images</h2>
<pre><code class="lang-python">os.chdir(originalPath)
shutil.rmtree(bookTitle)
</code></pre>
<p>Cover image sourced from <a target='_blank' rel='noopener noreferrer'  href="https://www.pxfuel.com/en/free-photo-oinlw">here</a></p>
<h1 id="thanks-for-reading-">Thanks for READING!</h1>
<p>This is basically the first code-centric post I&#39;ve made on my blog, so I hope it has been useful!</p>
<p>--- Until next time, I&#39;m signing out</p>
]]></content:encoded></item></channel></rss>