<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The Data Science Swiss Army Knife]]></title><description><![CDATA[Super passionate up and coming data scientist documenting my journey!
I dedicate my time to learning and creating ML content (data science projects and blog pos]]></description><link>https://www.kamwithk.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 21 Apr 2026 09:34:43 GMT</lastBuildDate><atom:link href="https://www.kamwithk.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[2 Years of Machine Learning and Data Science]]></title><description><![CDATA[Over two years ago I started off as a high school student unsure of what to do in life; all I knew was that I'd always been enthusiastic about technology and learning how things work. Back then everything felt so overwhelming; the prospect of spending...]]></description><link>https://www.kamwithk.com/2-years-of-machine-learning-and-data-science</link><guid isPermaLink="true">https://www.kamwithk.com/2-years-of-machine-learning-and-data-science</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Programming Blogs]]></category><category><![CDATA[portfolio]]></category><category><![CDATA[coding]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Tue, 22 Mar 2022 23:19:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/hND1OG3q67k/upload/v1647954614663/R8rAcjUbc.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over two years ago I started off as a high school student unsure of what to do in life; all I knew was that I'd always been enthusiastic about technology and learning how things work. Back then everything felt so overwhelming; the prospect of spending the next three plus years at Uni (and potentially the rest of my life aha) on something which I barely knew anything about really did frighten me 😨.</p>
<p>This article/portfolio is a retrospective on my experience so far becoming a machine learning engineer/data scientist! I hope reading this leaves you feeling as inspired as I was when I randomly heard about data science for the first time!
I want to share some of my experiences so far, the cool things I've built and the knowledge I've gathered.</p>
<p><em>Through a large portion of my journey as a beginner I documented my learning process with detailed blogs and <a target="_blank" href="https://github.com/KamWithK">GitHub</a> repositories, which I link throughout this post; if you're interested in more detail, please check them out!</em></p>
<p>This post is structured in several sections:</p>
<ul>
<li>How I got into data science/machine learning</li>
<li>The passion projects I've worked on</li>
<li>Experience working/interning</li>
</ul>
<h1 id="heading-my-road-towards-data-science">My Road towards Data Science</h1>
<p>Back in 2019 when I was at a Monash open day I randomly came across a student engineering team called Monash DeepNeuron.
At the stall one of their founders showed me a small number of AI demos. I was blown away. I kind of knew AI existed before, but I'd never considered that a student with minimal technical knowledge could work on something as big as colourising black and white pictures or teaching an AI to play video games.
Straight after this I researched online as much as I could about how I could get started working on similar projects myself.
Watching videos like the <a target="_blank" href="https://www.youtube.com/watch?v=WXuK6gekU1Y">Alpha Go documentary</a> inspired me to dig deeper and see whether this was a realistic career to pursue (TLDR ~ I ended up joining and becoming an active member of Monash DeepNeuron as soon as I got into University)!</p>
<p>To try and reassure myself that this was 100% the right path for me, I made it my mission to reach out to as many professional data scientists, machine learning engineers and software engineers as I humanly could. ~2 years later, according to LinkedIn, I had successfully connected with 376 experts in the field! Although I can't say I've talked to every single person, I made a solid attempt every time and got to know a few people pretty well. The discussions I had early on were invaluable in giving me direction, motivation and several opportunities like internships.</p>
<p>A good month or so after talking to dozens of data scientists I decided to use my newfound knowledge/skills to <a target="_blank" href="https://www.kamwithk.com/snake-classification-report">create an app to detect snake species</a> 🐍... Projects, projects and more exciting applications of deep learning awaited me!</p>
<h1 id="heading-getting-experience-working-on-cool-projects">Getting Experience working on cool Projects</h1>
<p>Right from the beginning I knew I wanted to start working on projects so I could collaborate with others to create interesting solutions which showcase how AI can be used to tackle real problems (or at the least build small prototypes to show what that would look and feel like).
Here's a small overview of some of the opportunities I'm grateful to have worked on.</p>
<h2 id="heading-detecting-venomous-snakes">Detecting Venomous Snakes 🐍</h2>
<p>5.4 million people are bitten by snakes every year, with 81,000-138,000 recorded deaths due to snake bites. Maybe an app which could tell you a snake's species, and from that whether or not it's venomous, would be a useful aid to professional medical examinations? With this in mind I found a dataset of labeled snake images and created machine learning models to classify each image's species!</p>
<p>I had a chance to experiment with image scraping (which I didn't end up using here but utilised heavily in many future projects), various techniques to modify/enhance/increase image data and varying types of models and training techniques. At the beginning of this project I had absolutely no idea about how any of these things worked or what to do but slowly testing out different approaches to solving this one problem was incredibly engaging and helped build out my foundations.</p>
<p>I have several blog posts discussing my journey through this first project. If you're interested, please feel free to take a read 😉:</p>
<ul>
<li><a target="_blank" href="https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too">Progress and Journey Report</a></li>
<li><a target="_blank" href="https://www.kamwithk.com/snake-classification-report">Analysis of Results</a></li>
<li><a target="_blank" href="https://github.com/KamWithK/Snaked">GitHub Repository</a></li>
<li><a target="_blank" href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">Google Play Published App</a></li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647945812938/DfbqDdYsP.png" alt="image.png" /></p>
<h2 id="heading-predicting-energy-demand">Predicting Energy Demand 🤖</h2>
<p>Energy is used daily for phones, computers, washing machines, heaters and a vast array of appliances. I definitely can't imagine what it'd be like without it! Our dependence on electricity makes it critical to accurately predict how much energy we will need to generate on any given day.</p>
<p>At University I worked with a few fellow students to try and model energy usage via temperature data. This was a great opportunity to expand what I knew as "data science/machine learning". At first I thought it was all neural networks but during this project I had a chance to learn how time series forecasting usually works using "classical machine learning" (the old school algorithms, from when things were simpler 🥸).</p>
<p>I started by learning to <a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning">clean and structure the data</a>, <a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs">visualise it</a> and use the insights/knowledge I acquired from my analysis to <a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-3-modelling-with-decission-trees">build a model</a>.
During the process I realised that in reality, how well you do isn't defined by how smart or complicated your model is, but rather by how well you're able to break down the problem and transform your understanding of the domain into code.
I had several long discussions about our usage of energy, renewables and more with lecturers and a relative who researches renewable energy at Bloomberg.
These discussions helped tremendously in interpreting the numbers, tables and graphs I extracted (knowing what to look for, what types of graphs/visualisations/models to use, whether you're picking up on the right factors, etc).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647947506818/te9NJSdyi.png" alt="image.png" />
<a target="_blank" href="https://unsplash.com/photos/yETqkLnhsUI">Photo by Henry</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647947603485/INnE1OGjF.png" alt="image.png" /></p>
<p>Here are some links to articles I've written which present the exploratory analysis and modelling in the form of beginner tutorials:</p>
<ul>
<li><a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning">Data Cleaning</a></li>
<li><a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs">Story Telling through Visualisation</a></li>
<li><a target="_blank" href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-3-modelling-with-decission-trees">Modelling with Decision Trees</a></li>
<li><a target="_blank" href="https://github.com/KamWithK/Temp2Enrgy">GitHub Repository</a></li>
</ul>
<h2 id="heading-generating-faces-and-comic-characters">Generating Faces and Comic Characters 🤡</h2>
<p>Straight after I got into Monash University I applied to and joined the student engineering AI research team Monash DeepNeuron, and was soon selected to join a team of 6 working on a research paper using Generative Adversarial Networks to transform CT scans into images of people's faces.
Up until this point, I'd worked on a small variety of projects but I didn't know about any of the big "subfields" of deep learning, nor had I worked in an actual team to develop anything "realistic". Hence I was super psyched when I found out I'd be able to contribute to a <a target="_blank" href="https://ieeexplore.ieee.org/document/9647290">research paper</a> which had real life applications for forensic facial reconstruction.</p>
<p>Since I was new I started by picking up small "tickets" or "issues" off the "kanban board" to work on. These were initially small things like "find and fix this bug" or "test xyz feature out" but eventually the project refactors and features I worked on broadened in scope. I ended up implementing our model training framework and setting it all up to work with a project tracking software called Weights and Biases.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647948372360/kD9BdIrm8.png" alt="image.png" /></p>
<p><a target="_blank" href="https://ieeexplore.ieee.org/document/9647290">Research Paper</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647953444540/7BlNgu5Z1.png" alt="image.png" /></p>
<p>Working on the project was absolutely amazing, as I learned an absolute tonne from my team lead and fellow teammates (who were all in their final year, graduating before me).</p>
<p>After the project finished up I was eager to learn more about how the model we were using worked and so decided to start my own small project to generate images of comic characters with another friend.
During this time I read up and <a target="_blank" href="https://github.com/KamWithK/Comic-Character-Generation">implemented my own GANs from scratch for the problem</a> 🤓 and very soon afterwards ended up running a workshop explaining to other Monash students how GANs can be used to generate images like faces and artwork!</p>
<p><a target="_blank" href="https://www.facebook.com/MonashDeepNeuron/posts/1126733647797240"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647949423351/h9wwX9VUm.png" alt="image.png" /></a></p>
<p>Here are some of the resources I created:</p>
<ul>
<li><a target="_blank" href="https://github.com/KamWithK/Comic-Character-Generation">Comic Character Generation GitHub Repository</a></li>
<li><a target="_blank" href="https://docs.google.com/presentation/d/1KqfEXFLlkrk8-bdMZPBU5L2fe8J8_crYBfAUkHW4WS4/edit?usp=sharing">Workshop Slides</a></li>
<li><a target="_blank" href="https://colab.research.google.com/github/DeepNeuron-AI/Training/blob/master/Workshops/Generative%20Modelling/GAN.ipynb">Workshop Code</a></li>
</ul>
<h2 id="heading-other-projects">Other Projects</h2>
<p>Quick mention that I was fortunate enough to work on several other small personal projects which you can find on my <a target="_blank" href="https://github.com/KamWithK">GitHub Profile</a>.
Here are a few examples:</p>
<ul>
<li><a target="_blank" href="https://github.com/KamWithK/JDRL">Creating a Racing Car Bot for Jelly Drift Video Game</a> - <a target="_blank" href="https://drive.google.com/file/d/1rb20GnL1WYJA3BtnmK89RQN5tMjdUyy0/view?usp=sharing">Video Recording of Progress</a></li>
<li><a target="_blank" href="https://www.kamwithk.com/big-data-from-public-apis-for-data-science-the-github-popularity-project">Scraping Huge Amounts of Data of GitHub</a></li>
<li><a target="_blank" href="https://github.com/KamWithK/AnkiconnectAndroid">Implementation of Desktop HTTP Rest Dictionary/Anki API within Android App for Language Learning</a> - Note this is a real project which several people I know use 🇯🇵 (I would talk more about it but doesn't involve ML)!</li>
<li><a target="_blank" href="https://github.com/KamWithK/JPVocabBuilder">Japanese Vocabulary Frequency Analysis with Rust and Web Assembly</a></li>
<li><a target="_blank" href="https://github.com/KamWithK/AllocatePlusPlus">Enhanced Dashboard to Allocate to Monash Classes</a></li>
</ul>
<h1 id="heading-internships-and-real-world-work-experience">Internships and Real World Work Experience</h1>
<p>Throughout University I've had internships at various small companies (here's a small summary of what I can talk about; due to NDAs I can't show results or discuss my work in detail).</p>
<p>I first started working with a small company called Penta Global, where I worked on a project to analyse the quality of coffee beans. I extracted data from sensors, analysed its statistics and ended up creating a series of map visualisations which illustrate how various sensor metrics (like heat and density) impact the quality of the coffee as it travels across the globe.</p>
<p><a target="_blank" href="https://penta.solutions/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952802134/XytQ0NVDd.png" alt="image.png" /></a></p>
<p>My second internship was at a data-centric real estate company called <a target="_blank" href="https://milkchoc.com.au/">Milk Chocolate Property</a>. I worked with their relational databases, APIs and geographic Python libraries to find and filter what particular locations their customers would be likely to want to live in.</p>
<p><a target="_blank" href="https://milkchoc.com.au/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952556835/gfIzuq9vI.png" alt="image.png" /></a></p>
<p>Afterwards I interned with the Monash Data Science and AI Platform, working on an NLP project to identify potential patients for medical trials based on their submitted health records.
<a target="_blank" href="https://www.monash.edu/researchinfrastructure/datascienceandai"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952855836/GLJlULMTc.png" alt="image.png" /></a></p>
<p>I currently work as a Junior Machine Learning Engineer in the AI division of a small search and recommendation company called <a target="_blank" href="https://systema.ai/">Systema AI</a>. My day-to-day work involves analysing our systems' performance with respect to client shop purchases, and helping maintain and improve various models and services.</p>
<p><a target="_blank" href="https://systema.ai/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952397556/YONJSMfpE.png" alt="image.png" /></a></p>
<p>To top it all off, for the bulk of my time at University (until very recently) I was the "Deep Learning Training Team Lead" at a <a target="_blank" href="https://www.deepneuron.org/">student engineering team Monash DeepNeuron</a>.
I led/managed a team of ~6 people creating and demoing a wide variety of deep learning workshops, where I helped teach students how it all works! Together we ran workshops and <a target="_blank" href="https://www.deepneuron.org/dl-blogs">blogs</a> on topics ranging from the foundational basics of "how neural networks work" and "how to start a project", to complex applications like "reinforcement learning" and "generative adversarial networks".</p>
<p><a target="_blank" href="https://www.deepneuron.org/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647952459131/5qLulJ53A.png" alt="image.png" /></a></p>
<h1 id="heading-conclusions">Conclusions</h1>
<p>Over the last few years I've had a massive amount of fun working on heaps of projects with heaps of different people!
I've learnt so much from all of it: from my first project detecting images of snakes, to generating images of faces and comic characters, teaching other students how deep learning works, creating random applications and scripts to aid my language learning and personal life, and working at Systema on our smart search and recommendation models.</p>
<p>I hope this both illustrates my progress throughout my data science journey and motivates anyone who happens upon it: if you just give it a shot, you can gain a lot of skills whilst working on cool projects you enjoy 🤩!</p>
<p><a target="_blank" href="https://unsplash.com/photos/hND1OG3q67k">Cover Photo by Lucas</a></p>
]]></content:encoded></item><item><title><![CDATA[Gathering and Using Big Data from Public APIs for Data Science - The GitHub Popularity Project]]></title><description><![CDATA[Cool unique data makes for intriguing projects, so let's go find some on the web!
Today we'll get what we need to tell a story about the magic that makes GitHub projects popular ⭐🌟⭐.
Readmes, descriptions, languages... we'll collect it all.
So, let the ...]]></description><link>https://www.kamwithk.com/big-data-from-public-apis-for-data-science-the-github-popularity-project</link><guid isPermaLink="true">https://www.kamwithk.com/big-data-from-public-apis-for-data-science-the-github-popularity-project</guid><category><![CDATA[big data]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[APIs]]></category><category><![CDATA[GraphQL]]></category><category><![CDATA[scalability]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 16 Sep 2020 13:17:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1600260957533/2I2tKTOkn.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cool unique data makes for intriguing projects, so let's go find some on the web!
Today we'll get what we need to tell a story about the magic that makes GitHub projects popular ⭐🌟⭐.
Readmes, descriptions, languages... we'll collect it all.</p>
<p>So, let the public API and big data sorcery begin.</p>
<blockquote>
<p>WARNING: Collecting, storing and using mass data from public APIs won't be quick, easy or clean. Prepare to dial up the madness (you have been warned)...</p>
</blockquote>
<h1 id="the-story">The Story</h1>
<p>GitHub, such a beautiful place filled with amazing creative projects.
Some get popular, others stagnate and die.
It truly is <em>the circle of life, the wheel of fortune</em> (eloquently stated by Elton John).</p>
<p>Now we could use a massive (terabytes large 🤯) archive like <a target="_blank" href="https://www.gharchive.org/">GH Archive</a> or <a target="_blank" href="https://ghtorrent.org/">GHTorrent</a>, but I'm not looking to fry a computer (haha).
We could use <a target="_blank" href="https://cloud.google.com/bigquery/docs">Google BigQuery</a> to filter through this, but why not take a journey closer to the source 😏 by using the official API (side-note, it's cheaper).</p>
<p>With a good bit of <a target="_blank" href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">web scraping</a> experience under our belt, it surely can't be <em>that hard</em> to use a public API...
Nope, it doesn't work like that.
You get a little data, you see a few repositories... and then that's all for poor you 🥺.</p>
<p>But what if you want more 🥺😒?
<strong>What if you want a LOT MORE</strong>?
Well, I'd welcome you to the <em>slightly dodgy but still legitimate public API user club</em>.
Beginner lingo - If you play your cards right you can <strong>get what you need without waiting a hundred years</strong>.
Well, it's smooth sailing if you know what you're doing 🤯🤔.</p>
<p>Luckily, we can avoid the time-consuming pain of switching from technology to technology, by considering what's out there.
To start off, we can consider our familiar cozy tried and tested data science tools (Python with Pandas and Requests).
We'll consider what it does well, but also its major drawbacks (lots exist).
To come to our rescue, we'll discuss a few unique tools and techniques to bridge the gap between gathering and using data in a scalable way.</p>
<p>After we decide on a technology stack we can start looking at our API and (with thoughtful research) figure out ways to <em>optimise our search queries</em>:</p>
<ul>
<li>Break down large searches into multiple parts</li>
<li>Send multiple queries at once (asynchronously)</li>
<li>Get extra API time through multiple API access tokens (get your friends together, and maybe use a rotating proxy)</li>
</ul>
<p>After all the work, we can finally sit back and relax, knowing that we've got virtually every last drop of data 🤪.
Yes, 50 glorious gigabytes of GitHub readmes and stats 😱🥳!</p>
<h1 id="technology-stack">Technology Stack</h1>
<p>Before we get into the nitty-gritty details on how to collect data it's important to decide what technologies to use.
Normally this wouldn't be a big deal (you'd want to get started asap), but when collecting a large amount of data, we can't afford to have to rewrite everything with a different library (i.e. because of an overhead or general slow speed).</p>
<blockquote>
<p>With small datasets, library/framework choice doesn't matter much. With ~50 GB, your technology choices make or break the project!</p>
</blockquote>
<p>This section is pretty long, so here's a summary of the technologies I used along with alternate options I would use if I started over:</p>
<table>
<thead>
<tr>
<td>Task</td><td>Technology Used</td><td>Potentially Better Options</td></tr>
</thead>
<tbody>
<tr>
<td>Querying GitHub API</td><td><a target="_blank" href="https://github.com/aio-libs/aiohttp">AIOHttp</a> and <a target="_blank" href="https://docs.python.org/3/library/asyncio.html">AsyncIO</a></td><td><a target="_blank" href="https://github.com/apollographql/apollo-client">Apollo Client</a></td></tr>
<tr>
<td>Saving Data</td><td><a target="_blank" href="https://www.postgresql.org/">PostgreSQL</a></td><td><a target="_blank" href="https://parquet.apache.org/">Parquet</a> using <a target="_blank" href="https://arrow.apache.org/docs/python/parquet.html">PyArrow</a></td></tr>
<tr>
<td>Processing Data (Data Pipeline)</td><td><a target="_blank" href="https://dask.org/">Dask</a></td><td><a target="_blank" href="https://spark.apache.org/">Spark</a></td></tr>
<tr>
<td>Machine Learning Modelling (Ensembles)</td><td><a target="_blank" href="https://www.h2o.ai/">H2O</a></td><td><a target="_blank" href="https://xgboost.readthedocs.io/">XGBoost</a> or <a target="_blank" href="https://github.com/microsoft/LightGBM">LightGBM</a></td></tr>
<tr>
<td>Deep Learning and NLP Modelling</td><td><a target="_blank" href="https://pytorch.org/">PyTorch</a>/<a target="_blank" href="https://pytorch-lightning.readthedocs.io">PyTorch Lightning</a> with <a target="_blank" href="https://github.com/uber/petastorm">Petastorm</a> and <a target="_blank" href="https://huggingface.co/transformers">Hugging Face Transformers</a></td><td><a target="_blank" href="https://pytorch.org/">PyTorch</a> with <a target="_blank" href="https://github.com/uber/petastorm">Petastorm</a></td></tr>
</tbody>
</table>
<blockquote>
<p>Always create a working environment.yml or requirements.txt file listing all dependencies (like <a target="_blank" href="https://github.com/KamWithK/GitStarred/blob/master/environment.yml">ours</a>)</p>
</blockquote>
<p>There are multiple types of APIs; the most popular are REST and GraphQL.
GitHub has both; GraphQL is newer and allows us to carefully choose what information we want.
Normally you'd use an HTTP requests library for GraphQL, but dedicated clients (like <a target="_blank" href="https://github.com/apollographql/apollo-client">Apollo</a> or <a target="_blank" href="https://github.com/graphql-python/gql">GQL</a>) do exist.
It might not seem important to use a dedicated library, but they can do a lot for you (like handling pagination).
Without one, you'll need to handle asynchronous requests yourself (using threading or <a target="_blank" href="https://realpython.com/async-io-python/">async await</a>).
It isn't impossible but it is a burden to deal with (I, unfortunately, went this route).</p>
<p><em>Note that <a target="_blank" href="https://github.com/graphql-python/gql">GQL</a> is still under heavy development, so <a target="_blank" href="https://github.com/apollographql/apollo-client">Apollo Client</a> may be better (if you use JS).</em></p>
<p>The next step is to decide where to store your data.
Normally <a target="_blank" href="https://pandas.pydata.org/">Pandas</a> would be fine, but here... it's slow and unreliable (as it completely overwrites the data each time it saves).
The best option may seem like a database (they were designed to overcome these problems) like <a target="_blank" href="https://www.postgresql.org/">PostgreSQL</a>.
However, let me warn you right beforehand, <strong>databases... are horribly supported by machine and deep learning frameworks</strong>!
But... <a target="_blank" href="http://www.h5py.org/">HDF5</a>, <a target="_blank" href="https://parquet.apache.org/">Parquet</a> and <a target="_blank" href="https://lmdb.readthedocs.io/">LMDB</a> can work quite well.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1600261793384/u7le__2IL.jpeg" alt="database.jpg" /></p>
<p>Now that we've got all our data, it's time to consider how we will process and analyse it.
When using databases it's best to stick to <a target="_blank" href="https://spark.apache.org/">Apache Spark</a>.
<a target="_blank" href="https://spark.apache.org/">Spark</a> is nice to work with, supports reading/writing to nearly ANY format, works at scale and has support for <a target="_blank" href="https://www.h2o.ai/">H2O</a> (useful if you want to try out AutoML, but it is quite buggy).
The downside is that we ironically don't have enough data to make <a target="_blank" href="https://spark.apache.org/">Spark</a>'s overhead worth it (best for ~300+ GB datasets).
As long as you used <a target="_blank" href="http://www.h5py.org/">HDF5</a> or (better yet) <a target="_blank" href="https://parquet.apache.org/">Parquet</a> though, both <a target="_blank" href="https://dask.org/">Dask</a> and <a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a> should work like a charm.
<a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a> is a highly efficient dataframe library which allows us to process our data for an ML model (similarly to <a target="_blank" href="https://pandas.pydata.org/">Pandas</a>).
Although <a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a> is efficient, you might run into memory problems.
When you do, <a target="_blank" href="https://dask.org/">Dask</a>'s out-of-core functionality springs to life 😲!
We can also train classical <a target="_blank" href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning models</a> (ensembles like random forests and gradient boosted trees) through <a target="_blank" href="https://dask.org/">Dask</a> and <a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a>.
<a target="_blank" href="https://dask.org/">Dask</a> and <a target="_blank" href="https://github.com/vaexio/vaex">Vaex</a> provide wrappers for <a target="_blank" href="https://scikit-learn.org/">Scikit Learn</a>, <a target="_blank" href="https://xgboost.readthedocs.io/">XGBoost</a> and <a target="_blank" href="https://github.com/microsoft/LightGBM">LightGBM</a>.</p>
<p>When it comes to deep learning, <a target="_blank" href="https://pytorch.org/">PyTorch</a> is a natural go-to!
It can use <a target="_blank" href="http://www.h5py.org/">HDF5</a> or <a target="_blank" href="https://lmdb.readthedocs.io/">LMDB</a> quite easily (with custom data loaders like <a target="_blank" href="https://github.com/Lyken17/Efficient-PyTorch">this one</a>, which you can either find or create yourself).
For anything else use <a target="_blank" href="https://github.com/uber/petastorm">Petastorm</a> (from Uber) to get data into PyTorch (by itself for <a target="_blank" href="https://parquet.apache.org/">Parquet</a> and otherwise with <a target="_blank" href="https://spark.apache.org/">Spark</a>).
A neat trick if you use <a target="_blank" href="https://spark.apache.org/">Spark</a> is to <strong>save the processed data into Parquet files</strong> so you can easily and quickly import them through <a target="_blank" href="https://github.com/uber/petastorm">Petastorm</a>.</p>
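<p><em>Here's a minimal sketch of that Parquet-to-PyTorch path (the dataset URL and column name are hypothetical) using Petastorm's batch reader, which reads vanilla Parquet without a special Petastorm schema:</em></p>
<pre><code class="lang-python">from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Stream batches straight out of a Parquet dataset
with make_batch_reader("file:///data/repos.parquet") as reader:
    with DataLoader(reader, batch_size=64) as loader:
        for batch in loader:
            # Each batch maps column names to tensors
            stars = batch["stars"]
</code></pre>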
<blockquote>
<p>Be very, very careful with the libraries you decide to use. It's easy for conflicts and errors to arise 😱</p>
</blockquote>
<p>With a solid technology stack, you and I are ready to get going!
<em>Quick side note - you'll quickly realise that <a target="_blank" href="https://www.apache.org/">Apache</a> is a HUGE player in the big-data world!</em></p>
<h1 id="assembling-a-query">Assembling a Query</h1>
<p>GraphQL is really finicky.
It complains about the simplest mistake, and so it can be difficult to figure out how to construct your search query.
The process to come up with an appropriate API request is as follows:</p>
<ul>
<li>Assemble a list of all the information you want/need (the number of stars and forks, readmes, descriptions, etc)</li>
<li>Read through the <a target="_blank" href="https://docs.github.com/en/graphql/reference/">official documentation</a> to find how to gather the basic elements</li>
<li>Google for anything else</li>
<li>Run your queries as you build/add to them, to ensure they work and to catch problems early</li>
</ul>
<p>It's a surprisingly long process, but it does pay off in the end.
Here's the final query for GitHub.</p>
<pre><code>query (<span class="hljs-variable">$after</span>: String, <span class="hljs-variable">$first</span>: Int, <span class="hljs-variable">$conditions</span>: String=<span class="hljs-string">"is:public sort:created"</span>) {
    search(query: <span class="hljs-variable">$conditions</span>, <span class="hljs-built_in">type</span>: REPOSITORY, first: <span class="hljs-variable">$first</span>, after: <span class="hljs-variable">$after</span>) {
        edges {
            node {
                ... on Repository {
                    name
                    id
                    description
                    forkCount
                    isFork
                    isArchived
                    isLocked
                    createdAt
                    pushedAt

                    primaryLanguage {
                        name
                    }

                    assignableUsers {
                        totalCount
                    }

                    stargazers {
                        totalCount
                    }

                    watchers {
                        totalCount
                    }

                    issues {
                        totalCount
                    }

                    pullRequests {
                        totalCount
                    }

                    repositoryTopics(first: 5) {
                        edges {
                            node {
                                topic {
                                    name
                                }
                            }
                        }
                    }

                    licenseInfo {
                        key
                    }

                    commits: object(expression: <span class="hljs-string">"master"</span>) {
                        ... on Commit {
                            <span class="hljs-built_in">history</span> {
                                totalCount
                            }
                        }
                    }

                    readme: object(expression: <span class="hljs-string">"master:README.md"</span>) {
                        ... on Blob {
                            text
                        }
                    }
                }
            }
        }
        pageInfo {
            hasNextPage
            endCursor
        }
    }
}
</code></pre><p><em>You can pass in the arguments/variables <code>after</code>, <code>first</code> and <code>conditions</code> through a separate JSON dictionary.</em></p>
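<p>For example, here's a hedged sketch (the token is a placeholder you fill in yourself) of sending the query above, with its variables, to GitHub's GraphQL endpoint using Requests:</p>
<pre><code class="lang-python">import requests

query = "..."  # the GraphQL document shown above
variables = {"first": 100, "after": None, "conditions": "is:public sort:created"}

response = requests.post(
    "https://api.github.com/graphql",
    json={"query": query, "variables": variables},
    headers={"Authorization": "bearer YOUR_TOKEN"},
)
repositories = response.json()["data"]["search"]["edges"]
</code></pre>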
<h1 id="challenging-your-query">Challenging Your Query</h1>
<h2 id="divide-and-conquer">Divide and Conquer</h2>
<blockquote>
<p>You wrote one nice simple query to find all your data? You Fool 🤡🥱</p>
</blockquote>
<p>Big companies are (mostly) smart.
They know that if they allow us to do anything with their API, we will use and abuse it frequently 👌.
So the easiest thing for them to do is to set strict restrictions!
The thing about these restrictions though is, that <strong>they don't completely stop you from gathering data, they just make it a lot harder</strong>.</p>
<p>The most fundamental limitation is the <em>amount you can query at once</em>.
On GitHub, it is:</p>
<ul>
<li>2000 queries per hour</li>
<li>1000 items per query</li>
<li>100 items per page</li>
</ul>
<p>To stay within these bounds whilst still being able to collect data we need to bundle lots of small queries together, whilst splitting apart single large queries.
Combining smaller queries is easy enough (string concatenation), but to split a large query apart requires some clever coding.
We can start by finding out how many items appear for a search (through the following GraphQL):</p>
<pre><code>query (<span class="hljs-variable">$conditions</span>: String=<span class="hljs-string">"is:public sort:created"</span>) {
    search(query: <span class="hljs-variable">$conditions</span>, <span class="hljs-built_in">type</span>: REPOSITORY) {
        repositoryCount
    }
}
</code></pre><p>By modifying the <code>conditions</code> variable, we can limit the search's scope (for example to just 2017-2018).
You can test this (or another) query with the official <a target="_blank" href="https://developer.github.com/v4/explorer/">online interactive GraphiQL explorer</a>.
In essence, we can create two smaller searches by dividing the original time period into two halves.</p>
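<p>For instance, halving 2017 just means narrowing GitHub's <code>created:</code> search qualifier inside the <code>conditions</code> string:</p>
<pre><code class="lang-python"># Splitting 2017 into two half-year searches
first_half = {"conditions": "is:public sort:created created:2017-01-01..2017-06-30"}
second_half = {"conditions": "is:public sort:created created:2017-07-01..2017-12-31"}
</code></pre>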
<p>If the returned count is greater than 1000, we'll need to create two independent queries which gather half the data each!
We can break a search in half by dividing the original period of time into two periods half as long.
By continuously repeating this division process, we'll <em>eventually</em> end up with a long list of valid searches!</p>
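<p>Here's a minimal sketch of that halving step (the real <code>half_period</code> lives in the repo linked below; this hypothetical version assumes plain datetime bounds):</p>
<pre><code class="lang-python">from datetime import datetime

def half_period(start: datetime, end: datetime):
    """Split one search period into two equally long halves."""
    middle = start + (end - start) / 2
    return [(start, middle), (middle, end)]

# e.g. one year becomes two ~6 month searches
halves = half_period(datetime(2017, 1, 1), datetime(2018, 1, 1))
</code></pre>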
<p>So in essence, here's what happens (continuously repeated through a <code>while True</code> loop):</p>
<pre><code class="lang-python">trial_periods = []

<span class="hljs-comment"># Handle one period at a time</span>
<span class="hljs-keyword">for</span> period, num_repos, is_valid <span class="hljs-keyword">in</span> valid_periods:
    <span class="hljs-keyword">if</span> is_valid == <span class="hljs-keyword">True</span> <span class="hljs-keyword">and</span> num_repos &gt; <span class="hljs-number">0</span>:
        <span class="hljs-comment"># Save the data</span>
        ...
    <span class="hljs-keyword">else</span>:
        <span class="hljs-comment"># Add to new list of still unfinished periods</span>
        trial_periods.extend(self.half_period(*period))

<span class="hljs-keyword">if</span> trial_periods == []: <span class="hljs-keyword">break</span>
...
</code></pre>
<p><em>Please see the <a target="_blank" href="https://github.com/KamWithK/GitStarred/">GitHub repo</a> for the full working code. This is just a sample to illustrate how it works on its own, without databases, async-await and other extraneous bits...</em></p>
<p>Do remember that if (like me) you're using the API through an HTTP client instead of a dedicated GraphQL one, you'll need to manage pagination yourself!
To do so you'll need to include in your query (after the huge <code>edges</code> part):</p>
<pre><code><span class="hljs-section">pageInfo</span> {
    <span class="hljs-attribute">hasNextPage</span>
    endCursor
}
</code></pre><p>Then pass the cursor as a variable for where to start the next query.</p>
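<p>As a hedged sketch (where <code>run_query</code> is a hypothetical stand-in for whatever function posts your request), the pagination loop could look like:</p>
<pre><code class="lang-python">after = None  # the first page starts with no cursor
edges = []

while True:
    # run_query is a hypothetical helper which POSTs the query and variables
    result = run_query(query, {"first": 100, "after": after})["data"]["search"]
    edges.extend(result["edges"])

    page_info = result["pageInfo"]
    if not page_info["hasNextPage"]:
        break
    after = page_info["endCursor"]  # resume where this page ended
</code></pre>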
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1600261982295/UmOHyIytB.jpeg" alt="message.jpg" /></p>
<h2 id="asynchronous-code">Asynchronous Code</h2>
<p>If you're using an HTTP client, it is important to know how to write code that runs blazing fast, ideally so multiple requests can be made at once.
This matters because the GitHub server can take a long time (from around 1 to 10 seconds) to respond to our queries!
One great way to do this is to use asynchronous libraries.
With asynchronous code, the Python interpreter (like our human brain) switches between tasks extremely fast.
So whilst we're waiting for our first query to return a response, our second one can be sent off straight away!
It might not seem like much, but it definitely is.</p>
<p>There are three ways to achieve this.
The easiest way (especially for beginners) is to use threads.
We can create 100 threads (arbitrary example), and launch one query on each.
The operating system will handle switching between them itself!
Once an operation finishes, the thread can be recycled and used for a separate query (or other operation)!</p>
<p>The second method is to utilize your computer's processes.
When we do this, we get multiple tasks to perform in parallel!
This is useful for high-CPU tasks (like data processing), but we only have a few cores (not everyone has a high-end i7 or Threadripper 🤣).</p>
<p>The third and final method is async-await.
It is similar in philosophy to the first (quickly switch between tasks), but instead of the OS handling it... we do!
The idea here is that the OS has a lot to do, but we don't, so it's better Python handles it itself.
<em>In theory, async-await is easier and quicker than writing thread-safe code</em> (but <strong>much, much harder in practice</strong>).
The primary reason for this is that <strong>asynchronous code can behave synchronously</strong>.
Simply put: if your design pattern is slightly off, you may have 0 performance gain!</p>
<p>I rewrote, redesigned and refactored my code a million times, and here's what I figured out:</p>
<ul>
<li>Use a <em>breadth-first approach</em> (i.e. maximise the number of queries you can run near-simultaneously, independent of anything else)</li>
<li>Avoid the <em>consumer-producer pattern</em> where one function produces items and the other consumes them (there aren't many guides or explanations of how to use it in practice, and it seems to arbitrarily limit itself)</li>
<li>Whenever you're trying to run multiple things simultaneously use <code>asyncio.gather(...)</code></li>
<li>Avoid <code>async</code> loops; in practice they run synchronously, since loops maintain their order of elements (i.e. one by one: first, second, third...)</li>
<li>CPU intensive tasks must run within a separate process</li>
<li>Find non-CPU intensive alternative (ideally asynchronous) technologies where possible (i.e. don't write to a single JSON/CSV file, as they need to be completely overwritten each time you append another item)</li>
<li>Automatically restart queries (<a target="_blank" href="https://github.com/jd/tenacity">Tenacity</a> does this with a simple function decorator)</li>
</ul>
<blockquote>
<p>When you stick to the rules, development stays light, fast and fluid 😏😌!</p>
</blockquote>
<p>Note that proper asynchronous code can easily bombard a server with requests.
Please don't let this happen or you'll be blocked 🥶.
<strong>Simply rate limit requests with a <a target="_blank" href="https://tutorialedge.net/python/concurrency/python-asyncio-semaphores-tutorial/">Semaphore</a></strong> (other fancy methods exist, but this is enough)!</p>
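<p>Here's a minimal sketch of that pattern with AIOHttp (the token is a placeholder, and <code>query</code> stands in for the GraphQL document from earlier): <code>asyncio.gather</code> launches everything breadth-first, whilst the Semaphore caps how many requests are in flight at once:</p>
<pre><code class="lang-python">import asyncio
import aiohttp

query = "..."  # the GraphQL search document from earlier
semaphore = asyncio.Semaphore(10)  # at most 10 requests in flight

async def fetch(session, variables):
    async with semaphore:  # wait here if 10 requests are already running
        async with session.post(
            "https://api.github.com/graphql",
            json={"query": query, "variables": variables},
        ) as response:
            return await response.json()

async def main(all_variables):
    async with aiohttp.ClientSession(
        headers={"Authorization": "bearer YOUR_TOKEN"}
    ) as session:
        # Breadth-first: every query is scheduled up front
        return await asyncio.gather(*(fetch(session, v) for v in all_variables))

results = asyncio.run(main([{"first": 100, "after": None}]))
</code></pre>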
<h1 id="thanks-for-reading">THANKS FOR READING!</h1>
<p>I know all we currently have is raw data, saved as a file (or database server), but this is the first big step to any unique project.
Soon (in part 2) we'll be able to process and view our data with <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.html">PySpark</a>, and look at results from a few basic <a target="_blank" href="https://www.kamwithk.com/machine-learning-field-guide">ML</a> models (created with <a target="_blank" href="https://www.h2o.ai/products/h2o-sparkling-water/">H2O PySparkling AI</a>)!
After that (part 3) we can take a look at using <a target="_blank" href="https://pytorch.org/">PyTorch</a> with <a target="_blank" href="https://allennlp.org/">AllenNLP</a> or <a target="_blank" href="https://huggingface.co/transformers">Hugging Face Transformers</a>.</p>
<p>If you enjoyed this and you’re interested in coding or data science feel free to check out my other posts like tutorials on <a target="_blank" href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook?guid=363cfb0b-4bda-4d10-a8fd-00ef9f412ab5&amp;deviceId=bed01825-b655-427b-8d69-52b30b0a2d78">practical coding skills</a> or <a target="_blank" href="https://www.kamwithk.com/the-state-of-data-websites-and-portfolios-in-2020-develop-a-dashboard-in-a-day-dash-vs-streamlit-and-is-javascript-still-king">easy portfolio dashboards/websites</a>!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1600262184183/FsLB_B84g.jpeg" alt="continued.jpg" /></p>
<p>Images by <a target="_blank" href="https://pixabay.com/users/geralt-9301">Gerd Altmann</a> on <a target="_blank" href="https://pixabay.com/illustrations/social-media-media-board-networking-1989152/">PixaBay</a>, <a target="_blank" href="https://unsplash.com/@campaign_creators">Campaign Creators</a>, <a target="_blank" href="https://unsplash.com/@florianolv">Florian Olivo</a> and <a target="_blank" href="https://unsplash.com/photos/C4sxVxcXEQg">Reuben Juarez</a> on Unsplash</p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Energy Demand Prediction Project - Part 3 Modelling with Decision Trees]]></title><description><![CDATA[Let's see how our machine learning, project planning and essential coding tools can be brought to life in a real-world project!
Today we're going through how we can predict how much energy we use daily using temperature data.
We start here with impor...]]></description><link>https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-3-modelling-with-decission-trees</link><guid isPermaLink="true">https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-3-modelling-with-decission-trees</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[ML]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data]]></category><category><![CDATA[projects]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 12 Aug 2020 02:39:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1597199949695/nBr8IKzi-.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#39;s see how our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a>, <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls-ckbjfohz201ujzqs1lwu5l7xd">project planning</a> and <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">essential coding tools</a> can be brought to life in a real-world project!
Today we&#39;re going through how we can predict how much energy we use daily using temperature data.
We start here with <strong>importing and cleaning data, before graphing and depicting the story of our energy usage and finally modelling it</strong>.</p>
<p>This is the last section, where we take our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4?guid=14f9ef0e-cd44-4f28-b588-fec4b33b41cf">cleaned data</a> and our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs-ckcecai0i006xbrs1hcase6f6">understanding of temperature and energy</a> to develop a predictive model. Feel free to code along, the full project is on <a target='_blank' rel='noopener'  href="https://github.com/KamWithK/Temp2Enrgy">GitHub</a>.</p>
<h1 id="the-story">The story</h1>
<p>We wake up in the mornings, turn on the heater/air conditioner, find some yogurt from the fridge for breakfast, shave, turn on a computer, get the music rolling and finally get to work.
These tasks all have one thing in common - they use power!
Our heavy reliance on electricity makes it crucial to estimate how much energy we&#39;ll need to generate each day.</p>
<p>But, fear not if this seems challenging.
We will take it one step at a time.
At each stage linking back to how it relates to our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">ML field guide</a>.</p>
<p>We already <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4?guid=14f9ef0e-cd44-4f28-b588-fec4b33b41cf">found, cleaned</a>, <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs-ckcecai0i006xbrs1hcase6f6">visualised and interpreted our energy usage</a> 😊.
Now we can translate this into a model able to predict how much energy we use!
We start our journey off right where we left off, taking a deep look into how we can remove annual increases/decreases in energy usage caused by economic and population growth.
This is a hard problem, so to simplify we break down the energy demand time series into separate parts (which we graph).
The three important <em>components</em> of the time series are <em>overall increasing, decreasing and stable trends</em>, <em>seasonal repetitive changes</em> and other <em>random residual noise</em>.
Research into time series indicates that we can <em>detrend</em> our data through a process called <em>differencing</em>, where we subtract the value from <em>n</em> steps earlier at each point.</p>
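<p>As a two-line illustration of differencing (toy numbers, not our dataset), pandas&#39; <code>diff</code> subtracts the value <em>n</em> steps back:</p>
<pre><code class="lang-python">import pandas as pd

series = pd.Series([10, 12, 15, 13, 16, 20])
# Each value minus the value 2 steps earlier: NaN, NaN, 5, 1, 1, 7
print(series.diff(2))
</code></pre>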
<p>With data in a clean and predictable state, our minds are at ease and we can move onto quickly dividing up the dataset and creating a <em>decision tree model</em> for each state.
We&#39;ll find out what variables (specifically hyperparameters) affect it and then use grid search to find the best values.</p>
<p>We can then judge how well they fare, and contemplate how they could improve.
Then, finally, we can honour the project by showing it off to everyone we know 🥳!</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-keyword">from</span> statsmodels.tsa.seasonal <span class="hljs-keyword">import</span> seasonal_decompose
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> cross_val_score, GridSearchCV
<span class="hljs-keyword">from</span> sklearn.tree <span class="hljs-keyword">import</span> DecisionTreeRegressor
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> IPython.core.interactiveshell <span class="hljs-keyword">import</span> InteractiveShell

InteractiveShell.ast_node_interactivity = <span class="hljs-string">"all"</span>
pd.options.display.max_columns = <span class="hljs-keyword">None</span>
np.random.seed(<span class="hljs-number">0</span>)
</code></pre>
<pre><code class="lang-python">data = pd.read_pickle(<span class="hljs-string">"../Data/Data.pickle"</span>)

data[<span class="hljs-string">"Month"</span>] = data.index.month
data[<span class="hljs-string">"Week"</span>] = data.index.week
data[<span class="hljs-string">"Day"</span>] = data.index.dayofweek
</code></pre>
<h1 id="the-epochs">The Epochs</h1>
<h2 id="removing-trends-seasonality">Removing trends/seasonality</h2>
<p>Let&#39;s begin by clearly explaining our situation.
As mentioned before, this is achieved through graphs which break down our time series!
They help us visually interpret precisely what needs to be removed/modified.</p>
<pre><code class="lang-python">seasonal_decompose(data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].resample(<span class="hljs-string">"M"</span>).mean(), model=<span class="hljs-string">"additive"</span>).plot()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198480113/MerwcT0xL.png" alt="output_3_0.png"></p>
<p>At the top, we see the full graphs, and below it <em>trends</em> (the increase/decrease of values with time), <em>seasonality</em> (the repeating pattern of increasing and then decreasing values) and <em>residuals</em> (everything else which is present, but more so random since it doesn&#39;t seem to repeat itself).</p>
<p>We just decomposed Victoria&#39;s energy demand here, so it isn&#39;t completely representative of what we&#39;ll remove, but close enough to remind us of the problem.
What we want is to eliminate that gradual long-term increasing/decreasing trend.
This is normally done through <em>diffing</em> our data (subtracting the value from <em>n</em> entries earlier at each entry).
We&#39;ll subtract the value from half a year earlier, since our trends play out over the long run.
Since we do it separately on each state, we first have to order the data by region and time (we can reverse this after).</p>
<pre><code class="lang-python"><span class="hljs-comment"># Sort dataframe by region so groupby's output can be combined and used for another column</span>
data.sort_values(by=[<span class="hljs-string">"Region"</span>, <span class="hljs-string">"Date"</span>], inplace=<span class="hljs-keyword">True</span>)
data[<span class="hljs-string">"AdjustedDemand"</span>] = data.groupby(<span class="hljs-string">"Region"</span>)[<span class="hljs-string">"TotalDemand"</span>].diff(<span class="hljs-number">8544</span>)
all([region[<span class="hljs-number">1</span>].sort_index().index.equals(region[<span class="hljs-number">1</span>].index) <span class="hljs-keyword">for</span> region <span class="hljs-keyword">in</span> data.groupby(<span class="hljs-string">"Region"</span>)])
data.sort_index(inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<pre><code><span class="hljs-keyword">True</span>
</code></pre><p>When we graph the original total demand we find that it was not stationary; however, when we overlay the new adjusted version, it varies up and down around one straight line (implying that the trend has been removed).</p>
<pre><code class="lang-python">data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].resample(<span class="hljs-string">"M"</span>).mean().plot()
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"AdjustedDemand"</span>].resample(<span class="hljs-string">"M"</span>).mean().plot(secondary_y=<span class="hljs-keyword">True</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198532977/VN7X-MumL.png" alt="output_7_0.png"></p>
<p>To ensure we&#39;re right, we can graph the distribution of temperature against energy, and see how it changes with time.
We can see that the graphs become tighter, showing that the spread of energy values for a given temperature has been reduced (good)!</p>
<pre><code class="lang-python">fix, axes = plt.subplots(ncols=<span class="hljs-number">2</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>))
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>).resample(<span class="hljs-string">"W"</span>).mean().plot(x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, kind=<span class="hljs-string">"scatter"</span>, ax=axes[<span class="hljs-number">0</span>], title=<span class="hljs-string">"Before adjustments"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>).resample(<span class="hljs-string">"W"</span>).mean().plot(x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"AdjustedDemand"</span>, kind=<span class="hljs-string">"scatter"</span>, ax=axes[<span class="hljs-number">1</span>], title=<span class="hljs-string">"After adjustments"</span>, color=<span class="hljs-string">"red"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198552312/BxT93PDJt.png" alt="output_9_0.png"></p>
<h2 id="divide-up-data">Divide up data</h2>
<p>There&#39;s one thing everyone knows by now - we have a lot of data.
So now we have to split it up and be 100% confident that it&#39;s what we&#39;re looking for.
The default distribution of 75% of data for training and 25% for testing is good enough for our purpose.</p>
<p>Not everyone has a monstrously powerful computer, so to ensure it&#39;s easy and fast to train our model (useful to quickly see how our model fares, try changing a few things and retrain) we&#39;ll only predict the overall energy usage every day (instead of per 30 minutes).
This should decrease the effect of any present outliers too!</p>
<p>Whilst testing out changes, one thing which will become immediately obvious is that including complete information on time causes overfitting.
This is because there are <em>only 20 years present</em>, meaning any information encoded in the year is likely not generalisable.
To fix this, we can just use integers for the day, week and month number.
Anomalies/outliers should now be relatively rare (hopefully 😲).</p>
<pre><code class="lang-python">resampled_data = data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"D"</span>).mean().reset_index(<span class="hljs-string">"Region"</span>).sort_index()
train_data, test_data = train_test_split(resampled_data, shuffle=<span class="hljs-keyword">False</span>)

input_columns = [<span class="hljs-string">"WetBulbTemperature"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Week"</span>, <span class="hljs-string">"Day"</span>]
output_columns = [<span class="hljs-string">"AdjustedDemand"</span>]

train_input_data, train_output_data = train_data[input_columns + [<span class="hljs-string">"Region"</span>]], train_data[output_columns + [<span class="hljs-string">"Region"</span>]]
test_input_data, test_output_data = test_data[input_columns + [<span class="hljs-string">"Region"</span>]], test_data[output_columns + [<span class="hljs-string">"Region"</span>]]
</code></pre>
<h2 id="create-and-train-a-model">Create and train a model</h2>
<p>We have a large selection of models we can use.
We can try each and see which ones work (a good idea for beginners), but after long trials it becomes obvious that simple and fast models like decision trees work just as well here as more complex ensemble models like random forests.
Of course, linear models won&#39;t work (our data is shaped like a parabola in most cases).</p>
<p>As <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#hyperparameter-tuning">mentioned before</a>, we need to find the optimal hyperparameters (how deep and complex our decision tree can become), and grid search is the standard way to do it.
We run <a target='_blank' rel='noopener'  href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">grid search</a> over 5 subsets of our dataset (K-fold cross-validation).
The max depth (how <em>deep</em> the tree can grow), along with min sample leaves (minimum end-nodes the tree must-have, which produces shallow trees) and max-leaf nodes (maximum end-nodes the tree is allowed to have, which produces deeper trees) are our hyperparameters (put into a <code>paramaters</code> array).
We&#39;ve found these through <a target='_blank' rel='noopener'  href="https://scikit-learn.org/stable/modules/tree.html">reading the decision-tree Scikit Learn documentation</a> which has a lot of details on how to use each model (with a few examples)!
Knowing what specific values to try out for each variable comes down to manual testing (try and see what happens).</p>
<p>The <code>get_best_model</code> function here is responsible for hyperparameter tuning, whilst <code>get_predictions</code> formats the predictions for the test (and optionally training) data; the loop further below then trains a model per state.
Separating different concerns into their own functions/classes is a good way to ensure code clarity, reproducibility and ease of modification.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_best_model</span><span class="hljs-params">(test_input: pd.DataFrame, test_output: pd.DataFrame)</span>:</span>
    paramaters = {<span class="hljs-string">"max_depth"</span>: [*range(<span class="hljs-number">1</span>, <span class="hljs-number">20</span>), <span class="hljs-keyword">None</span>], <span class="hljs-string">"min_samples_leaf"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">10</span>, <span class="hljs-number">15</span>], <span class="hljs-string">"max_leaf_nodes"</span>: [<span class="hljs-number">5</span>, <span class="hljs-number">10</span>, <span class="hljs-number">20</span>, <span class="hljs-keyword">None</span>]}
    regressor = DecisionTreeRegressor()
    grid = GridSearchCV(regressor, param_grid=paramaters, n_jobs=<span class="hljs-number">1</span>)
    grid.fit(test_input, test_output.values.ravel())
    best_score, best_depth = grid.best_score_, grid.best_params_

    <span class="hljs-keyword">return</span> grid, best_score, best_depth
</code></pre>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_predictions</span><span class="hljs-params">(regressor, test_input, test_output, train_input=None, train_output=None)</span>:</span>
    test_predictions = regressor.predict(test_input)
    test_results = pd.DataFrame(test_predictions, columns=output_columns, index=test_input.index)
    test_results = test_data[input_columns].join(test_results)

    <span class="hljs-keyword">if</span> type(train_input) != <span class="hljs-keyword">None</span> <span class="hljs-keyword">and</span> type(train_output) != <span class="hljs-keyword">None</span>:
        train_predictions = regressor.predict(train_input)
        train_results = pd.DataFrame(train_predictions, columns=output_columns, index=train_input.index)
        train_results = train_data[input_columns].join(train_results)

        <span class="hljs-keyword">return</span> test_results, train_results
    <span class="hljs-keyword">return</span> test_results
</code></pre>
<pre><code class="lang-python">models, regressors = [], []

test_predictions, train_predictions = [], []

<span class="hljs-keyword">for</span> region, dataframe <span class="hljs-keyword">in</span> train_data.groupby(<span class="hljs-string">"Region"</span>):
    <span class="hljs-comment"># Cross validate to find the best model</span>
    model_input, model_output = dataframe.dropna()[input_columns], dataframe.dropna()[output_columns]
    grid, score, params = get_best_model(model_input, model_output)
    regressors.append(grid)
    models.append(regressors[<span class="hljs-number">-1</span>].fit(model_input, model_output.values.ravel()))

    print(f<span class="hljs-string">"Best {region} model has a score of {score} and best params {params}"</span>)

    <span class="hljs-comment"># Get the test data for this specific region</span>
    test_input = test_data.groupby(<span class="hljs-string">"Region"</span>).get_group(region)[input_columns].dropna()
    test_output = test_data.groupby(<span class="hljs-string">"Region"</span>).get_group(region)[output_columns].dropna()

    <span class="hljs-comment"># Generate predictions, obtain and log the final formatted data</span>
    test_results, train_results = get_predictions(regressors[<span class="hljs-number">-1</span>], test_input, test_output, model_input, model_output)
    test_predictions.append(test_results)
    train_predictions.append(train_results)
</code></pre>
<pre><code>Best NSW model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.6673092603240149</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">11</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">15</span>}
Best QLD model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.679797201035001</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">11</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">15</span>}
Best SA model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.4171236821322447</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">9</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">10</span>}
Best TAS model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.7609030948185131</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">15</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">15</span>}
Best VIC model has a score <span class="hljs-keyword">of</span> <span class="hljs-number">0.6325583799486684</span> <span class="hljs-keyword">and</span> best params {<span class="hljs-string">'max_depth'</span>: <span class="hljs-number">10</span>, <span class="hljs-string">'max_leaf_nodes'</span>: None, <span class="hljs-string">'min_samples_leaf'</span>: <span class="hljs-number">15</span>}
</code></pre><p>We can see that Tasmania fares quite well, with a score just under 80%.
This makes sense, as Tasmania started with very little trend, meaning there should be a higher correlation between temperature and energy.
Queensland, New South Wales and Victoria aren&#39;t all too bad either, with scores between 60%-70%!</p>
<p>If we look at the chosen <code>max_depth</code> values (between 9 and 15, below the ceiling we allowed), we can also tell that our models aren&#39;t as complex as they could be.</p>
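<p>If you want to dig further, a quick (hypothetical, not in the original notebook) way to peek inside a tuned model is to print its feature importances - for example for the last region fitted:</p>
<pre><code class="lang-python"># Sketch: inspect which features the tuned tree actually relies on
best_tree = regressors[-1].best_estimator_
for name, importance in zip(input_columns, best_tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
</code></pre>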
<h2 id="visualise-performance">Visualise performance</h2>
<p>To judge how well our model fares, we create and analyse plots of energy and temperature.
We start by seeing the correlation of energy and temperature for each state (the predictions are blue, and the real values are red).
We can see that the model isn&#39;t perfect and doesn&#39;t always predict the right values, but is pretty decent given the small number of features we are using.</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)
counter = [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>]

<span class="hljs-keyword">for</span> region, region_data <span class="hljs-keyword">in</span> test_data.groupby(<span class="hljs-string">"Region"</span>):
    region_data.plot(ax=axes[counter[<span class="hljs-number">0</span>], counter[<span class="hljs-number">1</span>]], x=<span class="hljs-string">"WetBulbTemperature"</span>, y=output_columns, kind=<span class="hljs-string">"scatter"</span>, color=<span class="hljs-string">"red"</span>, title=region)

    <span class="hljs-keyword">if</span> counter[<span class="hljs-number">1</span>] &lt; <span class="hljs-number">2</span>: counter[<span class="hljs-number">1</span>] += <span class="hljs-number">1</span>
    <span class="hljs-keyword">elif</span> counter[<span class="hljs-number">1</span>] == <span class="hljs-number">2</span>: counter[<span class="hljs-number">1</span>] = <span class="hljs-number">0</span>; counter[<span class="hljs-number">0</span>] += <span class="hljs-number">1</span>

counter = [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>]

<span class="hljs-keyword">for</span> region_data <span class="hljs-keyword">in</span> test_predictions:
    region_data.plot(ax=axes[counter[<span class="hljs-number">0</span>], counter[<span class="hljs-number">1</span>]], x=<span class="hljs-string">"WetBulbTemperature"</span>, y=output_columns, kind=<span class="hljs-string">"scatter"</span>)

    <span class="hljs-keyword">if</span> counter[<span class="hljs-number">1</span>] &lt; <span class="hljs-number">2</span>: counter[<span class="hljs-number">1</span>] += <span class="hljs-number">1</span>
    <span class="hljs-keyword">elif</span> counter[<span class="hljs-number">1</span>] == <span class="hljs-number">2</span>: counter[<span class="hljs-number">1</span>] = <span class="hljs-number">0</span>; counter[<span class="hljs-number">0</span>] += <span class="hljs-number">1</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198574018/sQHLx3Woo.png" alt="output_17_0.png"></p>
<p>The majority of the time the predictions align with the actual data, but they can be off at times.
We can see that Tasmania and Queensland are modelled quite well (explaining their high scores).
It looks like their graphs are more linear and less curved, which may be why they outperform other states.</p>
<p>We can move onto looking at the energy time series now.</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">5</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

<span class="hljs-keyword">for</span> i, (region, region_data) <span class="hljs-keyword">in</span> enumerate(test_data.groupby(<span class="hljs-string">"Region"</span>)):
    region_data[output_columns].plot(ax=axes[i], title=region)
    test_predictions[i][output_columns].plot(ax=axes[i])
    axes[i].set_ylabel(<span class="hljs-string">"Adjusted Demand"</span>)
    axes[i].legend([<span class="hljs-string">"Demand"</span>, <span class="hljs-string">"Demand Prediction"</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198588923/7-fk52wWv.png" alt="output_19_0.png"></p>
<p>We can see that the general ups and downs are <em>mostly</em> found and predicted across time.
The magnitude of the energy demand is also forecasted relatively well!
This is incredible given that we <em>only need temperature and dates/times</em> to generate our predictions!</p>
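<p>As a toy illustration of how little input is needed, here&#39;s a hypothetical one-off prediction. The feature values (and the assumption that <code>Day</code> means day-of-year) are made up for the example, and <code>regressors[-1]</code> is simply the last model trained in the loop (Victoria, given the alphabetical region order):</p>
<pre><code class="lang-python"># Sketch: predict demand for one day from temperature and date features alone
example = pd.DataFrame([[18.5, 7, 28, 190]], columns=input_columns)
print(regressors[-1].predict(example))
</code></pre>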
<p>To be more critical we can see that around halfway through most years, our energy predictions derail.
The model for Tasmania also seems to particularly struggle in 2017.
The temporary half-yearly blunders seem more severe in New South Wales and South Australia.
To get a feel for how bad these problems may be, we can just plot 2019.
It&#39;ll show that our predictions are pretty decent!</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">5</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

<span class="hljs-keyword">for</span> i, (region, region_data) <span class="hljs-keyword">in</span> enumerate(test_data.groupby(<span class="hljs-string">"Region"</span>)):
    region_data[output_columns][<span class="hljs-string">"2019"</span>].plot(ax=axes[i], title=region)
    test_predictions[i][output_columns][<span class="hljs-string">"2019"</span>].plot(ax=axes[i])
    axes[i].set_ylabel(<span class="hljs-string">"Adjusted Demand"</span>)
    axes[i].legend([<span class="hljs-string">"Demand"</span>, <span class="hljs-string">"Demand Prediction"</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1597198603056/97fb9I9va.png" alt="output_21_0.png"></p>
<h2 id="conclusions">Conclusions</h2>
<p>We&#39;ve come a long way 🤠!
We started off knowing little to nothing about energy, and now have a solid understanding of what affects it.
To start off we cleaned the data (an arduous process, but worth it), then created a bunch of graphs which illustrated patterns and trends in temperature and energy demand.
Energy demand is greatest when the temperature is extremely high or low (too cool or too warm and we freak out 🤣).
We now know that 9 am to 6 pm is active high energy time too!</p>
<p>Being able to create and train a working model of these changes, though, has pushed us to a new level.
Everything we saw and labelled before, our model can predict.
We even know how to improve the model, by collecting data on population, technological advancements and economic activity.
All in all, if there&#39;s one thing to take from this, it is that <strong>graphs and conceptual knowledge are the building blocks for creating and understanding a model</strong>!</p>
]]></content:encoded></item><item><title><![CDATA[The State of Data Websites and Portfolios in 2020 - Develop a Dashboard in a Day, Dash vs Streamlit and is JavaScript still king?]]></title><description><![CDATA[Having a visual product, website or dashboard to show what your arduous efforts on a coding/machine learning project amounted to is something truly spectacular!
Yet, it's often extremely difficult as numerous tools and technologies are usually requir...]]></description><link>https://www.kamwithk.com/the-state-of-data-websites-and-portfolios-in-2020-develop-a-dashboard-in-a-day-dash-vs-streamlit-and-is-javascript-still-king</link><guid isPermaLink="true">https://www.kamwithk.com/the-state-of-data-websites-and-portfolios-in-2020-develop-a-dashboard-in-a-day-dash-vs-streamlit-and-is-javascript-still-king</guid><category><![CDATA[Web Development]]></category><category><![CDATA[General Programming]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[portfolio]]></category><category><![CDATA[projects]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Mon, 13 Jul 2020 15:05:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1594652950469/v1XYP84zz.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Having a visual product, website or dashboard to show what your arduous efforts on a coding/machine learning project amounted to is something truly spectacular!
Yet, it&#39;s often extremely difficult as numerous tools and technologies are usually required.
Don&#39;t stress though, we&#39;ll discuss and compare <strong>two frameworks</strong> (Dash and Streamlit) which make it <strong>simple and easy to create an impressive portfolio</strong> (without a steep learning curve)!</p>
<h1 id="the-story">The Story</h1>
<p>You just created a machine learning model.
It took a long time, but  <em>it&#39;s finally done</em>, and you want to take in your victory for a second.
You deserve a break... but wise old you knows the importance of creating a monument to show off your work.</p>
<p>You take the <em>natural next step</em>, looking up how to build a website.
The tutorials begin with Python frameworks like Flask and Django, then proceed to JavaScript, and before long you&#39;re stuck contemplating which front-end framework to use and how you&#39;ll pass data between the Python back-end (model) and JavaScript front-end (the actual website).
Oh, boy... this is a long and dark rabbit hole to scurry through.
But then out of the blue, you hear that there is an <em>easy solution for simple websites</em>.
You look up this new shiny framework (Streamlit), and it sure is easy 😊 and quick to use.
Before long you&#39;ve forgotten all your troubles and insecurities!
But then you suddenly realise Streamlit&#39;s catch... it only works for <em>simple Jupyter notebook-esque websites</em>.
It&#39;s all aboard the web dev train again for you.
Requests and JavaScript, here you come 😰.</p>
<p><strong>It doesn&#39;t have to be that way though</strong>... you <em>can find middle ground</em>.
Something simple enough to be understood in a few days, but complex enough... well, for nearly anything 🤓!
Welcome to Dash.
You still need to know a few web fundamentals (HTML and CSS), but at least your development journey has a clearly defined path ahead.
Even if it feels slightly clunky, it does get the job done well enough!</p>
<p><strong>The whole process can be dumbed down to three decisions</strong>:</p>
<ol>
<li>What you want on the page (text, graphs, tables, images, etc)</li>
<li>How to arrange and style the page (using CSS)</li>
<li>How you want the user to interact with the page</li>
</ol>
<p>No JavaScript, HTTP requests or even multiple separate frameworks for the front and back end any more!</p>
<h1 id="get-started-with-dash">Get Started with Dash</h1>
<p>To get started, make sure you have Dash installed.
With plain vanilla Python use <code>pip install dash</code> and for Anaconda <code>conda install -c conda-forge dash</code>.
Next, create a new Python file and import the relevant libraries:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> dash
<span class="hljs-keyword">import</span> dash_core_components <span class="hljs-keyword">as</span> dcc
<span class="hljs-keyword">import</span> dash_html_components <span class="hljs-keyword">as</span> html
</code></pre>
<p>If you try and run the app so far, you&#39;ll notice one thing - nothing happens.
That&#39;s because we actually have to create a Dash app object and tell it to start.</p>
<pre><code class="lang-python">app = dash.Dash(__name__, external_stylesheets=[<span class="hljs-string">"https://codepen.io/chriddyp/pen/bWLwgP.css"</span>])
app.title = <span class="hljs-string">"Allocate++"</span>

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    app.run_server(debug=<span class="hljs-keyword">True</span>)
</code></pre>
<p>We can include a style sheet (CSS using <code>external_stylesheets</code>) and set our website&#39;s title (<code>app.title</code>) to make things look better.
Checking that <code>__name__ == &quot;__main__&quot;</code> just ensures that the website only launches when directly started (not when imported in another file).</p>
<p>If we try to run this code, in the terminal we&#39;ll get a message like:</p>
<pre><code>Running <span class="hljs-literal">on</span> http:<span class="hljs-regexp">//</span><span class="hljs-number">127.0</span><span class="hljs-number">.0</span><span class="hljs-number">.1</span>:<span class="hljs-number">8050</span>/
Debugger PIN: <span class="hljs-number">409</span><span class="hljs-number">-929</span><span class="hljs-number">-250</span>
 * Serving Flask app <span class="hljs-string">"Main"</span> (lazy loading)
 * Environment: production
   WARNING: This <span class="hljs-keyword">is</span> a development server. Do <span class="hljs-keyword">not</span> use it <span class="hljs-keyword">in</span> a production deployment.
   Use a production WSGI server instead.
 * Debug mode: <span class="hljs-literal">on</span>
Running <span class="hljs-literal">on</span> http:<span class="hljs-regexp">//</span><span class="hljs-number">127.0</span><span class="hljs-number">.0</span><span class="hljs-number">.1</span>:<span class="hljs-number">8050</span>/
Debugger PIN: <span class="hljs-number">791</span><span class="hljs-number">-028</span><span class="hljs-number">-264</span>
</code></pre><p>It indicates that your app has started and can be found using the URL <code>http://127.0.0.1:8050/</code>.
Although it&#39;s currently just a blank page (real <em>fancy-schmancy</em>), it does indicate that everything is working fine.</p>
<p>Once you&#39;re ready to progress, try adding in a heading:</p>
<pre><code class="lang-python">app.layout = html.H1(children=<span class="hljs-string">"Fancy-Schmancy Website"</span>)
</code></pre>
<p>After you save the file, that website should automatically reload.
If it hasn&#39;t reloaded, or there are popups on the screen, you probably have an error in the source code.
Just check the actual terminal/debugger for more information.</p>
<p>Now that you&#39;re familiar with how to get a basic website, let&#39;s move onto transitioning your concept into code.
It starts with what&#39;s called a layout, which is composed of components.
Dash provides core (<code>dash_core_components</code>) and HTML (<code>dash_html_components</code>) components.
You always start using the HTML elements, since they provide the basic building blocks for text and grouping components together, before moving onto the core components.
Core components offer more interactivity (graphs, tables, checkboxes...).
It&#39;s now natural to ask how to style the web page.
In short, you use CSS (cascading style sheets) for this.
Dash themselves provide concrete overviews of <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/dash-core-components">core components</a> and trusty Mozilla have an amazing <a target='_blank' rel='noopener noreferrer'  href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics">HTML</a> and <a target='_blank' rel='noopener noreferrer'  href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics">CSS</a> intro.
Several examples of how to use the elements are <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/layout">here</a>.</p>
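<p>To make that concrete, here&#39;s a minimal layout sketch (the component ids and the slider are illustrative choices, not from the original app):</p>
<pre><code class="lang-python"># Sketch: HTML components group and label the page, core components add interactivity
app.layout = html.Div(children=[
    html.H1(children="Fancy-Schmancy Website"),
    html.P(children="Pick a number to display below:"),
    dcc.Slider(id="number-slider", min=0, max=10, step=1, value=5),
    html.Div(id="number-display")
])
</code></pre>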
<p>The last part of any Dash app is making it responsive.
Getting the buttons you click, text you enter and images you upload... do something!
This is where things would normally get difficult, but here it really <em>isn&#39;t too bad</em>.
With Dash, all you&#39;ve got to define is a function which receives and controls specific elements&#39; properties.
The function is wired up with the <code>@app.callback</code> <em>decorator</em> (the &quot;@&quot; symbol).</p>
<pre><code class="lang-python"><span class="hljs-meta">@app.callback(</span>
    [dash.dependencies.Output(<span class="hljs-string">"output element id"</span>, <span class="hljs-string">"property to set value of"</span>)],
    [dash.dependencies.Input(<span class="hljs-string">"input element id"</span>, <span class="hljs-string">"input property"</span>)]
)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">update_output</span><span class="hljs-params">(value)</span>:</span>
       <span class="hljs-keyword">return</span> value
</code></pre>
<p>We can do this for multiple elements by adding more <code>Input</code> and <code>Output</code> objects to those lists!
One thing to watch out for here though - more <code>Input</code> objects means more inputs to the function are required, and more <code>Output</code> objects mean more values to return (sounds obvious, but it can easily slip your mind). 
Also, note that you <em>shouldn&#39;t modify global variables</em> within these functions (callbacks can run concurrently or across multiple server workers, so this is an antipattern).
Further documentation is provided on these <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/basic-callbacks">callbacks</a>.</p>
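<p>Filling in that template for the hypothetical slider and div from the layout sketch above (again, an illustrative example rather than the original app&#39;s code):</p>
<pre><code class="lang-python"># Sketch: whenever the slider moves, Dash calls this function and places the
# returned value into the div's "children" property
@app.callback(
    dash.dependencies.Output("number-display", "children"),
    [dash.dependencies.Input("number-slider", "value")]
)
def update_output(value):
    return f"You picked {value}"
</code></pre>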
<h1 id="going-forwards-and-javascript">Going Forwards and JavaScript</h1>
<p>There it is, everything you&#39;ll need to know to start creating an interactive and impressive web application!
It&#39;ll likely still be difficult to create one, but the official documentation and tutorials for <a target='_blank' rel='noopener noreferrer'  href="https://docs.streamlit.io/en/stable/getting_started.html">Streamlit</a> and <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/">Dash</a> are amazing.
There are also cool galleries of sample apps using <a target='_blank' rel='noopener noreferrer'  href="https://dash-gallery.plotly.host/Portal/">Dash</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.streamlit.io/gallery">Streamlit</a> (so you can learn from others&#39; examples).</p>
<p>Of course, there are use cases for JavaScript.
In fact, you can build plugins for Dash with <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/plugins">JavaScript/React</a> and <a target='_blank' rel='noopener noreferrer'  href="https://dash.plotly.com/d3-react-components">D3.js</a>.
Hell, if you are already comfortable with web technologies it may even be easier for you to use them.
However, using JavaScript <strong>isn&#39;t 100% necessary to build websites</strong> any more (it&#39;s more so optional).
It may be useful to know about web technologies, but if your aim isn&#39;t to become a full-stack web developer, you don&#39;t need to become an expert to put together a flashy portfolio 🥳!</p>
<p>I hope this has helped you out!
Dash helped me hack together <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/AllocatePlusPlus">my first dashboard</a> in a day.
If you&#39;ve made a cool website, app or portfolio make sure to comment and tell me about them.
Feel free to check out my other posts - some highlights are <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z?guid=34cbed9b-13ac-43c7-94a3-dbfe4ac247a9&amp;deviceId=a348da4b-4d6e-44a9-80b2-3456c05bf4d0">practical coding tools</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">web scraping</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a> (with the <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4">practical project</a>).
You can follow my <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/">newsletter</a> and <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_">Twitter</a> for updates 😉.</p>
<p><em>Photo by Luke Peters on <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/B6JINerWMz0">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Energy Demand Prediction Project - Part 2 Storytelling using Graphs]]></title><description><![CDATA[Let's see how our machine learning, project planning and essential coding tools can be brought to life in a real-world project!
Today we're going through how we can predict how much energy we use daily using temperature data.
We previously imported a...]]></description><link>https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs</link><guid isPermaLink="true">https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-2-storytelling-using-graphs</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[projects]]></category><category><![CDATA[data analysis]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Thu, 09 Jul 2020 05:17:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271772519/Vh4MHgtHw.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#39;s see how our <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls-ckbjfohz201ujzqs1lwu5l7xd">project planning</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">essential coding tools</a> can be brought to life in a real-world project!
Today we&#39;re going through how we can predict how much energy we use daily using temperature data.
We previously imported and cleaned our data, so we&#39;ll now <strong>graph and depict the story behind our energy usage</strong>!</p>
<p>This is the second part of three (<a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4">first here</a>). Feel free to code along, the full project is on <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Temp2Enrgy">GitHub</a>.</p>
<h1 id="the-story">The story</h1>
<p>We wake up in the mornings, turn on the heater/air conditioner, find some yogurt from the fridge for breakfast, shave, turn on a computer, get the music rolling and finally get to work.
These tasks all have one thing in common - they use power!
Our heavy reliance on electricity makes it crucial to estimate how much energy we&#39;ll need to generate each day.</p>
<p>We <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning-ckc5nni0j00edkss13rgm75h4">already found, imported and cleaned our data</a> (good work guys), so we can move onto telling a story about our power usage.
But, fear not if this seems challenging.
We will take it one step at a time.
At each stage linking back to how it relates to our <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">ML field guide</a>.</p>
<p>We start with the <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls-ckbjfohz201ujzqs1lwu5l7xd">difficult but necessary</a> task of interpreting our data.
Our first thought is to plot the whole time series at once, but damn, a graph with 4 features, each with around five measurements every 30 minutes over 20 years, isn&#39;t pretty, meaningful or fast to draw.
After banging our head against a brick wall for a while, we, of course, realise that we can plot specific features and relationships instead of <em>everything at once</em>.
With little to lose we start using simple summary statistics to find the maximum, minimum and average values.
These give us a rough overview of each column, but to push ourselves one step further we take a look at <em>how correlated our features are</em>.</p>
<p>Once we understand that temperature relates highly to energy demand (intuitive enough), we&#39;re ready to get going with some graphs 😉!
Although we can&#39;t graph <em>everything at once</em>, we still want to get a grasp of the overall picture - how our data changes with time.
We begin by identifying our problem - when we look for changes over 20 years, movement every 30 minutes is really meaningless and just blotches the picture.
Lucky for us our <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#chapter-3-visualisation">field guide</a> explains that we can plot each week&#39;s average value through <em>resampling</em>!
Now we know the general increasing and decreasing trends between states.</p>
<p>After looking at individual data for energy and temperature we move onto finding where the correlation between the two occurs.
The graphs for each state are different.
The states which had larger long-term trends have more complex-looking graphs.
We don&#39;t have the data to account for these trends, so we&#39;ll need to remove them later on.</p>
<p>Now there&#39;s only one thing left for us - to find out how energy demand changes during a day and week.
Then... in no time, we&#39;ve managed to depict the story of our energy usage through each invigorating day, month and year 😎.
At this point, we&#39;d have successfully made it through the majority of our project!
After a brief celebration, we can move onto modelling... Let&#39;s not jump the gun though, this will be in the next (final) tutorial.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-keyword">from</span> IPython.core.interactiveshell <span class="hljs-keyword">import</span> InteractiveShell

InteractiveShell.ast_node_interactivity = <span class="hljs-string">"all"</span>
pd.options.display.max_columns = <span class="hljs-keyword">None</span>

data = pd.read_pickle(<span class="hljs-string">"../Data/Data.pickle"</span>)
</code></pre>
<h1 id="the-epochs">The Epochs</h1>
<h2 id="chapter-1-descriptive-statistics">Chapter 1 - Descriptive Statistics</h2>
<p>Since we can&#39;t view everything at once, we want to get a rough gauge on what our data looks like.
The natural first step is to look at each column&#39;s mean, minimum and maximum value.
These are called descriptive statistics, and Pandas calculates them for us using the <code>describe</code> function.</p>
<p>We also want to see which features are related to energy demand (as we&#39;re trying to predict it later on), so we&#39;ll find the <em>correlations</em>.
To find the correlations between features Pandas provides the <code>corr</code> function.</p>
<p>The stats show:</p>
<ul>
<li><code>TotalDemand</code> has an average of 4619 MW, with a minimum of 22 MW and a maximum of 14580 MW.</li>
<li><code>WetBulbTemperature</code> ranges from a minimum of -0.9 °C to a maximum of 41 °C.</li>
<li><code>TotalDemand</code> is most correlated to <code>WetBulbTemperature</code></li>
</ul>
<p>Although the correlation function only accounts for linear relationships (straight lines), it is still useful in knowing which features are worth graphing and including in our model.
Here primarily <code>WetBulbTemperature</code>, but <code>StationPressure</code> may also be useful.</p>
<pre><code class="lang-python">data.describe()
</code></pre>
<table>
<thead>
<tr>
<td></td><td>TotalDemand</td><td>RRP</td><td>WetBulbTemperature</td><td>SeaPressure</td><td>StationPressure</td></tr>
</thead>
<tbody>
<tr>
<td>count</td><td>1.656254e+06</td><td>1.656254e+06</td><td>1.656254e+06</td><td>1.656254e+06</td><td>1.656254e+06</td></tr>
<tr>
<td>mean</td><td>4.619521e+03</td><td>5.143376e+01</td><td>1.346589e+01</td><td>1.016535e+03</td><td>1.012486e+03</td></tr>
<tr>
<td>std</td><td>2.848791e+03</td><td>1.910091e+02</td><td>4.668981e+00</td><td>7.543408e+00</td><td>7.798352e+00</td></tr>
<tr>
<td>min</td><td>2.189000e+01</td><td>-1.000000e+03</td><td>-9.000000e-01</td><td>9.772000e+02</td><td>9.693000e+02</td></tr>
<tr>
<td>25%</td><td>1.413990e+03</td><td>2.336000e+01</td><td>9.900000e+00</td><td>1.011900e+03</td><td>1.007800e+03</td></tr>
<tr>
<td>50%</td><td>5.131249e+03</td><td>3.443000e+01</td><td>1.310000e+01</td><td>1.016800e+03</td><td>1.013100e+03</td></tr>
<tr>
<td>75%</td><td>6.591798e+03</td><td>5.490000e+01</td><td>1.700000e+01</td><td>1.021600e+03</td><td>1.017900e+03</td></tr>
<tr>
<td>max</td><td>1.457986e+04</td><td>1.470000e+04</td><td>4.100000e+01</td><td>1.041800e+03</td><td>1.037600e+03</td></tr>
</tbody>
</table>
<pre><code class="lang-python">data.corr()
</code></pre>
<table>
<thead>
<tr>
<td></td><td>TotalDemand</td><td>RRP</td><td>WetBulbTemperature</td><td>SeaPressure</td><td>StationPressure</td></tr>
</thead>
<tbody>
<tr>
<td>TotalDemand</td><td>1.000000</td><td>0.014473</td><td>0.357300</td><td>0.044859</td><td>0.188955</td></tr>
<tr>
<td>RRP</td><td>0.014473</td><td>1.000000</td><td>0.032914</td><td>-0.019025</td><td>-0.017678</td></tr>
<tr>
<td>WetBulbTemperature</td><td>0.357300</td><td>0.032914</td><td>1.000000</td><td>-0.249321</td><td>-0.125920</td></tr>
<tr>
<td>SeaPressure</td><td>0.044859</td><td>-0.019025</td><td>-0.249321</td><td>1.000000</td><td>0.887758</td></tr>
<tr>
<td>StationPressure</td><td>0.188955</td><td>-0.017678</td><td>-0.125920</td><td>0.887758</td><td>1.000000</td></tr>
</tbody>
</table>
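<p>Since the correlation function above only accounts for linear relationships, a quick optional sanity check (an aside, not in the original notebook) is a rank-based correlation such as Spearman&#39;s, which can pick up monotonic but non-linear links:</p>
<pre><code class="lang-python"># Sketch: Spearman rank correlation as a complement to the default (Pearson)
data.corr(method="spearman")
</code></pre>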
<h1 id="chapter-2-finding-long-term-trends">Chapter 2 - Finding Long-Term Trends</h1>
<h2 id="energy-over-20-years">Energy over 20 Years</h2>
<p>We want to know the story of how we use energy.
There&#39;s one simple way to do that - graphs 🤓.
We can start by looking at what happens on a large scale, and then slowly zoom in.</p>
<p>We&#39;ll view each state separately, since their trends may not be the same.</p>
<pre><code class="lang-python">fig, axes  = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"TAS"</span>].plot(color=<span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"VIC"</span>].plot(color=<span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"NSW"</span>].plot(color=<span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"QLD"</span>].plot(color=<span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Energy Demand"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"SA"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Energy Demand"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
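<p>As an aside, the repeated <code>groupby(...).resample(...)</code> calls can be collapsed by computing the resampled means once and looping over the regions - a behaviour-equivalent sketch (the colour mapping mirrors the block above; the shorter titles are an illustrative choice):</p>
<pre><code class="lang-python"># Sketch: resample once, then plot each region on its own subplot
resampled = data.groupby("Region").resample("3W").mean()["TotalDemand"]
colours = {"TAS": "red", "VIC": "green", "NSW": "purple", "QLD": "orange", "SA": "blue"}

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(20, 12), constrained_layout=True)
for ax, (region, colour) in zip(axes.flat, colours.items()):
    resampled[region].plot(color=colour, title=f"{region} Energy Demand", ax=ax)
</code></pre>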
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594270980198/_scTdQRS7.png" alt="output_7_5.png"></p>
<p>It can still be difficult to interpret graphs after they&#39;re resampled.
So let&#39;s take it slowly, one step at a time.</p>
<p>The first noticeable pattern is that energy always fluctuates between a high and low point.
The high and low points aren&#39;t always the same.</p>
<ul>
<li>Tasmania and South Australia range from around 900 to 1400</li>
<li>Victoria from 4500 to 6500</li>
<li>New South Wales from 6000 to 10000</li>
<li>Queensland from 4500 to 7500</li>
</ul>
<p>We can tell though that the trends aren&#39;t constant.
There can be a rapid increase in energy usage (Queensland until ~2010), a steep fall (Victoria after ~2010) or even continuous stability (Tasmania)!
The patterns are clearly not regular or caused directly by temperature (and so not predictable using historic temperature and energy data).</p>
<p>Although we don&#39;t have data on these trends, we can give an educated guess on what causes them.
We know that the population isn&#39;t stable, and grows at different rates for different states.
There has also been a massive increase in technology&#39;s power efficiency, and economic conditions affect people&#39;s willingness to use power.
On top of this, global warming pushes more and more people to install solar panels (which produce power which isn&#39;t accounted for).
Since we don&#39;t have data on any of these features, we&#39;ll try to remove the trends before we begin our modelling.</p>
<h2 id="energy-over-single-years">Energy over Single Years</h2>
<p>Let&#39;s now zoom in!
We&#39;ll look at trends which occur during a single year.
Since we&#39;re graphing 5 years instead of 20 we&#39;ll, of course, <em>need less resampling</em>.</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"TAS"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"VIC"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"NSW"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Energy Demand"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"QLD"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Energy Demand"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"W"</span>).mean()[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"SA"</span>][<span class="hljs-string">"2015"</span>:<span class="hljs-string">"2020"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Energy Demand"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271025727/590RucBfB.png" alt="output_9_5.png"></p>
<p>We can tell that the energy demand is usually lowest in spring and autumn, whilst highest during winter and/or summer.
Tasmania tends to have a higher demand in winter than summer.
Victoria&#39;s similar, but with more frequent peaks in energy demand during summer.
On the other hand, Queensland uses the most energy during summer.
New South Wales and South Australia both have their maximum energy demand in summer and winter!</p>
<p>Tasmania is consistently cooler (being the small island) unlike hot and sweaty New South Wales and South Australia.
This would explain the relative differences in where max&#39;s/min&#39;s occur.</p>
<h2 id="temperature-over-20-years">Temperature over 20 Years</h2>
<p>Temperature is just as important as energy though.
So we&#39;ll take a look at it as well!</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">6</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"TAS"</span>].plot(color= <span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Temperature"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"VIC"</span>].plot(color= <span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Temperature"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"NSW"</span>].plot(color= <span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Temperature"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"QLD"</span>].plot(color= <span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Temperature"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
data.groupby(<span class="hljs-string">"Region"</span>).resample(<span class="hljs-string">"3W"</span>).mean()[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"SA"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Temperature"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271061927/-aAek8bIa.png" alt="output_11_5.png"></p>
<p>Unlike the energy graphs, the temperature graphs don&#39;t have any large immediately noticeable trends.
However, we can see that the temperature varies from a minimum of around 8° to a maximum of around 22°.
Although this plot doesn&#39;t show any significant variations of temperature between states, they do exist.
Tasmania is consistently cooler (being a small island), unlike hot and sweaty New South Wales and South Australia.</p>
<h2 id="temperature-and-energy-correlations">Temperature and Energy Correlations</h2>
<p>We know temperature and energy are highly correlated, but we don&#39;t yet know how.
Well, let&#39;s find out!</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"red"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>], title=<span class="hljs-string">"Tasmania"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"green"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>], title=<span class="hljs-string">"Victoria"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"purple"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>], title=<span class="hljs-string">"New South Wales"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"orange"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>], title=<span class="hljs-string">"Queensland"</span>)
data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>).resample(<span class="hljs-string">"D"</span>).mean().plot(kind=<span class="hljs-string">"scatter"</span>,x=<span class="hljs-string">"WetBulbTemperature"</span>, y=<span class="hljs-string">"TotalDemand"</span>, s=<span class="hljs-number">10</span>, color= <span class="hljs-string">"blue"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>], title=<span class="hljs-string">"South Australia"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271092857/_V5zPgnct.png" alt="output_13_5.png"></p>
<p>These charts show us one major thing, that the greater the trend, the more confusing (and complicated) the relationship between temperature and energy demand becomes.
This is why the graph of temperature vs energy demand for Tasmania is almost a straight line (albeit a thick one), whereas the rest curved.
In other words, the greater the trend, the wider and thicker the curve!</p>
<p>Since we don&#39;t have any population or economic data, the trend must be removed (in the next tutorial).</p>
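<p>To give a flavour of what &quot;removing the trend&quot; can look like (a rough, hypothetical sketch only - the next tutorial covers the actual approach, and this assumes each region&#39;s rows are sorted by time), one option is to subtract a slow-moving rolling average from each state&#39;s demand:</p>
<pre><code class="lang-python"># Sketch: per region, subtract a year-long rolling mean so that only the
# short-term (temperature-driven) fluctuations remain
trend = data.groupby("Region")["TotalDemand"].transform(lambda s: s.rolling("365D").mean())
detrended = data["TotalDemand"] - trend
</code></pre>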
<h2 id="chapter-3-analysing-small-timeframes">Chapter 3 - Analysing Small Timeframes</h2>
<p>The graphs below show the comparison of energy demand between regions for a single day and a week during winter and summer.
We can start with a week (11/06/2017 to 17/06/2017 here) to see how energy demand fluctuates during the week.
We&#39;re <em>only testing one small timeframe</em> for brevity (the same patterns below can be seen elsewhere too).</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">4</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">10</span>), tight_layout=<span class="hljs-keyword">True</span>)

<span class="hljs-comment"># Winter</span>
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Winter"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Winter"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Winter"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Winter"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
data[<span class="hljs-string">"2017-06-11"</span>:<span class="hljs-string">"2017-06-17"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Winter"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])

<span class="hljs-comment"># Summer</span>
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"red"</span>, title=<span class="hljs-string">"Tasmania Summer"</span>, ax=axes[<span class="hljs-number">2</span>,<span class="hljs-number">0</span>])
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"green"</span>, title=<span class="hljs-string">"Victoria Summer"</span>, ax=axes[<span class="hljs-number">2</span>,<span class="hljs-number">1</span>])
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"purple"</span>, title=<span class="hljs-string">"New South Wales Summer"</span>, ax=axes[<span class="hljs-number">2</span>,<span class="hljs-number">2</span>])
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"orange"</span>, title=<span class="hljs-string">"Queensland Summer"</span>, ax=axes[<span class="hljs-number">3</span>,<span class="hljs-number">0</span>])
data[<span class="hljs-string">"2017-1-14"</span>:<span class="hljs-string">"2017-1-20"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(color=<span class="hljs-string">"blue"</span>, title=<span class="hljs-string">"South Australia Summer"</span>, ax=axes[<span class="hljs-number">3</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271126554/q2mf7wdC0.png" alt="output_15_10.png"></p>
<p>Daily energy usage follows just about the same shape in every state.
There are two peaks each day, in both summer and winter.
The first is smaller and falls in the morning (5-9 am), whilst the second is larger and falls in the evening (4-7 pm).
These occur at the times when people are most active at home (before and after work).
Although only a few graphs can be shown here, these patterns do persist (swapping in different days will show this).</p>
<p>The energy demand across a week in summer tends to follow a similar shape to winter, but the demand climbs far more steeply as the week goes on!</p>
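<p>If you&#39;d like to verify this yourself, here&#39;s a minimal sketch (reusing the same <code>data</code> dataframe and the imports from earlier in the series) which averages demand by day of the week for each sample week:</p>
<pre><code class="lang-python"># Quick sanity check: mean demand per weekday for each seasonal sample week
winter = data["2017-06-11":"2017-06-17"]
summer = data["2017-01-14":"2017-01-20"]

fig, axes = plt.subplots(ncols=2, figsize=(12, 4), tight_layout=True)
winter.groupby([winter.index.dayofweek, "Region"])["TotalDemand"].mean().unstack().plot(title="Winter week", ax=axes[0])
summer.groupby([summer.index.dayofweek, "Region"])["TotalDemand"].mean().unstack().plot(title="Summer week", ax=axes[1])
</code></pre>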
<p>We can now move on to looking at a single day (11/06/2017 here).</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">10</span>), constrained_layout=<span class="hljs-keyword">True</span>)

data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"Tasmania"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>], color=<span class="hljs-string">"red"</span>)
data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"Victoria"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>], color=<span class="hljs-string">"green"</span>)
data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"New South Wales"</span>, ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>], color=<span class="hljs-string">"purple"</span>)
data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"Queensland"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>], color=<span class="hljs-string">"orange"</span>)
data[<span class="hljs-string">"2017-06-11"</span>].groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"TotalDemand"</span>].plot(title=<span class="hljs-string">"South Australia"</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>], color=<span class="hljs-string">"blue"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1594271153028/fp2UZICmL.png" alt="output_17_5.png"></p>
<p>From these charts, we can see that energy usage ramps up from 6 am to 9 am and again from 3 pm to 6 pm.
From 12 pm to 3 pm our energy usage remains stable.
It drops off at the very start and end of the day (likely when most people are asleep).
The demand for summer and winter days is mostly similar.</p>
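<p>To check these ramp times against the whole dataset rather than a single day, a small sketch like this (again assuming the same <code>data</code> dataframe) averages demand by hour of day:</p>
<pre><code class="lang-python"># Mean demand by hour of day, per region, across the full dataset
hourly = data.groupby([data.index.hour, "Region"])["TotalDemand"].mean().unstack()
hourly.plot(figsize=(12, 6), title="Mean demand by hour of day")
</code></pre>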
<p><em>Photo by Scott Graham on <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/5fNmWej4tAA">Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Energy Demand Prediction Project - Part 1 Data Cleaning]]></title><description><![CDATA[Let's see how our machine learning, project planning and essential coding tools can be brought to life in a real-world project!
Today we're going through how we can predict how much energy we use daily using temperature data.
We start here with impor...]]></description><link>https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning</link><guid isPermaLink="true">https://www.kamwithk.com/machine-learning-energy-demand-prediction-project-part-1-data-cleaning</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[ML]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[pandas]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Fri, 03 Jul 2020 03:25:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1593746060281/DlGrb_OpA.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#39;s see how our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a>, <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls-ckbjfohz201ujzqs1lwu5l7xd">project planning</a> and <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">essential coding tools</a> can be brought to life in a real-world project!
Today we&#39;re going through how to predict our daily energy usage from temperature data.
We start here with <strong>importing and cleaning data, before graphing and depicting the story of our energy usage and finally modelling it</strong>.</p>
<p>This is the first part of three. Feel free to code along, the full project is on <a target='_blank' rel='noopener'  href="https://github.com/KamWithK/Temp2Enrgy">GitHub</a>.</p>
<h1 id="the-story">The story</h1>
<p>We wake up in the mornings, turn on the heater/air conditioner, find some yogurt from the fridge for breakfast, shave, turn on a computer, get the music rolling and finally get to work.
These tasks all have one thing in common - they use power!
Our heavy reliance on electricity makes it crucial to estimate how much energy we&#39;ll need to generate each day.</p>
<p>But fear not if this seems challenging.
We will take it one step at a time,
at each stage linking back to how it relates to our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">ML field guide</a>.</p>
<p>We start with finding energy and temperature data (can&#39;t do much without it 😊).
Ours is from the Bureau of Meteorology and Australian Energy Market Operator, but please do replicate the process for another country (e.g. America).
After a quick and painless download (lucky us), we can briefly review our spreadsheets.
But a look at the data highlights a horrifying truth - there&#39;s simply... far too much to deal with!
The merciless cells of numbers, more numbers and categories, are really overwhelming.
It&#39;s not really apparent how we&#39;ll combine the array of spreadsheets together, nor how we&#39;ll be able to analyse, learn from or model them.</p>
<p>We, as optimistic folk, start by noting down how the data is organised:
the folders and files, where they sit and what each contains.
Combining our understanding of the data&#39;s structure with the <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#chapter-1-importing-data">importing techniques</a> naturally leads us to overcome our first fear - providing easy access to the data with code.</p>
<p>Next, we seek to eliminate the clumsy mess.
We need to <em>clean the temperature &amp; energy data</em> by identifying what information has the greatest impact on our net energy usage!
It once again starts with simple observations of the spreadsheets to get a rough grip on the type of data present.
We are specifically interested in finding weird quirks/recurring patterns which <strong><em>could indicate</em> that something is wrong</strong>.
Once we follow up on each hunch, we can become more confident about the origin of our problems.
This allows us to confidently <em>decide what to remove outright, what to keep and what to quickly fix</em> 🤔 (we don&#39;t want to go Rambo 👹 on everything).
Simple stats and graphs form the cornerstone of this analysis!</p>
<p>At this point, we&#39;d have successfully made it through the first and most important part of our project!
After a brief celebration, we can move onto combining our two separate datasets (one for energy and one for temperature).
This allows us to correlate the two.
Finally, we&#39;re able to depict our story of how we use energy through each invigorating day, month and year... with the help of the trends and patterns we see in graphs!
What on Earth would be more satisfying?
Well, a few things... but let&#39;s not forget to create a model (it&#39;ll be fun) to show off to all our friends!
Let&#39;s not jump the gun though... this will all be in the next two tutorials.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os

<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-keyword">from</span> IPython.core.interactiveshell <span class="hljs-keyword">import</span> InteractiveShell

InteractiveShell.ast_node_interactivity = <span class="hljs-string">"all"</span>
pd.options.display.max_columns = <span class="hljs-keyword">None</span>
</code></pre>
<h1 id="the-epochs">The Epochs</h1>
<h2 id="chapter-1-importing-data">Chapter 1 - Importing Data</h2>
<blockquote>
<p>Data comes in all kinds of shapes and sizes and so the process we use to get everything into code often varies.</p>
</blockquote>
<p>Through analysing the files available we have found out <strong>how our data is structured</strong>.
We start on a high level, noticing that there are many temperature and energy spreadsheets formatted as CSVs.
Although there&#39;s an incredible number of them, it is just because the data was divided into small chunks.
Each CSV is a continuation of the last one.
The actual temperature spreadsheets contain dates, along with a variety of temperature, humidity and precipitation measurements.
Our energy files are far more basic, containing just dates, energy demand history, prices (RRP) and whether the data was manually or automatically logged.
The measurements have been made on a 30-minute basis.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1593739286496/5LcA_l5Bn.png" alt="file structure.png"></p>
<blockquote>
<p>Divide and conquer!</p>
</blockquote>
<p>As we can see, all of this information comes together to form an intuitive understanding of the raw data.
Of course, we <em>don&#39;t yet understand everything we&#39;ll need to perform our analysis, but we have enough to transition from having raw data to useable code</em> 🥳!</p>
<p>To transition into code, we compare our findings to our <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#chapter-1-importing-data">importing techniques</a>.
We know that we have a list of spreadsheets to be combined, so we can first form lists and then use Pandas <code>concat</code> to stack them together.</p>
<pre><code class="lang-python">energy_locations = os.listdir(<span class="hljs-string">"../Data/Energy"</span>)
temperature_locations = os.listdir(<span class="hljs-string">"../Data/Temperature"</span>)

energy_CSVs = [pd.read_csv(<span class="hljs-string">"../Data/Energy/"</span> + location) <span class="hljs-keyword">for</span> location <span class="hljs-keyword">in</span> energy_locations]
temperature_CSVs = [pd.read_csv(<span class="hljs-string">"../Data/Temperature/"</span> + location) <span class="hljs-keyword">for</span> location <span class="hljs-keyword">in</span> temperature_locations <span class="hljs-keyword">if</span> <span class="hljs-string">"Data"</span> <span class="hljs-keyword">in</span> location]
</code></pre>
<pre><code class="lang-python">energy_data = pd.concat(energy_CSVs, ignore_index=<span class="hljs-keyword">True</span>)
temperature_data = pd.concat(temperature_CSVs, ignore_index=<span class="hljs-keyword">True</span>)
</code></pre>
<p>Now, believe it or not, we&#39;ve done 90% of the importing; the only thing left is to ensure our features (columns) are named succinctly and consistently.
Through renaming our columns (like below), we make it clear what is in each column.
Future us will definitely be grateful!</p>
<pre><code class="lang-python">energy_data.columns
temperature_data.columns
</code></pre>
<pre><code>Index([<span class="hljs-string">'REGION'</span>, <span class="hljs-string">'SETTLEMENTDATE'</span>, <span class="hljs-string">'TOTALDEMAND'</span>, <span class="hljs-string">'RRP'</span>, <span class="hljs-string">'PERIODTYPE'</span>], dtype=<span class="hljs-string">'object'</span>)
Index([<span class="hljs-string">'hm'</span>, <span class="hljs-string">'Station Number'</span>, <span class="hljs-string">'Year Month Day Hour Minutes in YYYY'</span>, <span class="hljs-string">'MM'</span>,
       <span class="hljs-string">'DD'</span>, <span class="hljs-string">'HH24'</span>, <span class="hljs-string">'MI format in Local time'</span>,
       <span class="hljs-string">'Year Month Day Hour Minutes in YYYY.1'</span>, <span class="hljs-string">'MM.1'</span>, <span class="hljs-string">'DD.1'</span>, <span class="hljs-string">'HH24.1'</span>,
       <span class="hljs-string">'MI format in Local standard time'</span>,
       <span class="hljs-string">'Precipitation since 9am local time in mm'</span>,
       <span class="hljs-string">'Quality of precipitation since 9am local time'</span>,
       <span class="hljs-string">'Air Temperature in degrees C'</span>, <span class="hljs-string">'Quality of air temperature'</span>,
       <span class="hljs-string">'Wet bulb temperature in degrees C'</span>, <span class="hljs-string">'Quality of Wet bulb temperature'</span>,
       <span class="hljs-string">'Dew point temperature in degrees C'</span>,
       <span class="hljs-string">'Quality of dew point temperature'</span>, <span class="hljs-string">'Relative humidity in percentage %'</span>,
       <span class="hljs-string">'Quality of relative humidity'</span>, <span class="hljs-string">'Wind speed in km/h'</span>,
       <span class="hljs-string">'Wind speed quality'</span>, <span class="hljs-string">'Wind direction in degrees true'</span>,
       <span class="hljs-string">'Wind direction quality'</span>,
       <span class="hljs-string">'Speed of maximum windgust in last 10 minutes in  km/h'</span>,
       <span class="hljs-string">'Quality of speed of maximum windgust in last 10 minutes'</span>,
       <span class="hljs-string">'Mean sea level pressure in hPa'</span>, <span class="hljs-string">'Quality of mean sea level pressure'</span>,
       <span class="hljs-string">'Station level pressure in hPa'</span>, <span class="hljs-string">'Quality of station level pressure'</span>,
       <span class="hljs-string">'AWS Flag'</span>, <span class="hljs-string">'#'</span>],
      dtype=<span class="hljs-string">'object'</span>)
</code></pre><pre><code class="lang-python">energy_data.columns = [<span class="hljs-string">"Region"</span>, <span class="hljs-string">"Date"</span>, <span class="hljs-string">"TotalDemand"</span>, <span class="hljs-string">"RRP"</span>, <span class="hljs-string">"PeriodType"</span>]
temperature_data.columns = [
    <span class="hljs-string">"HM"</span>, <span class="hljs-string">"StationNumber"</span>, <span class="hljs-string">"Year1"</span>, <span class="hljs-string">"Month1"</span>, <span class="hljs-string">"Day1"</span>, <span class="hljs-string">"Hour1"</span>, <span class="hljs-string">"Minute1"</span>, <span class="hljs-string">"Year"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Day"</span>, <span class="hljs-string">"Hour"</span>, <span class="hljs-string">"Minute"</span>, <span class="hljs-string">"Precipitation"</span>, <span class="hljs-string">"PrecipitationQuality"</span>,
    <span class="hljs-string">"AirTemperature"</span>, <span class="hljs-string">"AirTemperatureQuality"</span>, <span class="hljs-string">"WetBulbTemperature"</span>, <span class="hljs-string">"WetBulbTemperatureQuality"</span>, <span class="hljs-string">"DewTemperature"</span>, <span class="hljs-string">"DewTemperatureQuality"</span>, <span class="hljs-string">"RelativeHumidity"</span>,
    <span class="hljs-string">"RelativeHumidityQuality"</span>, <span class="hljs-string">"WindSpeed"</span>, <span class="hljs-string">"WindSpeedQuality"</span>, <span class="hljs-string">"WindDirection"</span>, <span class="hljs-string">"WindDirectionQuality"</span>, <span class="hljs-string">"WindgustSpeed"</span>, <span class="hljs-string">"WindgustSpeedQuality"</span>, <span class="hljs-string">"SeaPressure"</span>,
    <span class="hljs-string">"SeaPressureQuality"</span>, <span class="hljs-string">"StationPressure"</span>, <span class="hljs-string">"StationPressureQuality"</span>, <span class="hljs-string">"AWSFlag"</span>, <span class="hljs-string">"#"</span>
]
</code></pre>
<p>Now be proud, because we just finished the first part of our journey!
With the ball rolling, things will be smoother sailing from here on out.</p>
<h2 id="chapter-2-data-cleaning">Chapter 2 - Data Cleaning</h2>
<h3 id="formatting-the-data">Formatting the data</h3>
<blockquote>
<p>Everyone is driven insane by missing data, but there&#39;s always a light at the end of the tunnel.</p>
</blockquote>
<p>There&#39;s good and bad news, so I&#39;ll start with the good news first.
We&#39;ve gone through our initial phase of getting everything together and so we now have a bare-bones understanding of what&#39;s available to us/how to access it.
We can view our data using the <code>energy_data</code> and <code>temperature_data</code> dataframes!</p>
<p>Now for the bad news.
Although we likely haven&#39;t noticed it yet, our data is far from perfect.
We have loads of missing (empty) cells, along with duplicated and badly formatted data.
But don&#39;t be disheartened, because this isn&#39;t a rare cataclysmic disaster:
It happens all the time 😎 (what&#39;s not to like about it?) 😎.</p>
<p>This process can appear threatening since everything just seems... messed up.
Now insight and experience do help a lot, BUT but but... that doesn&#39;t mean that it&#39;s impossible for us mortals!
There&#39;s one thing we can do to overcome this - work like mad scientists!
We can identify our dataset&#39;s quirks/issues, and then test every technique we can think of 🤯.
Our techniques come from the <a target='_blank' rel='noopener'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx#chapter-2-data-cleaning">field guide</a> (NEVER reinvent the wheel)!</p>
<p>Just to doubly make sure we&#39;re not running stray, here are the problems we&#39;re looking for:</p>
<ul>
<li>Completely empty columns/rows</li>
<li>Duplicate values</li>
<li>Inaccurate/generic datatypes</li>
</ul>
<p>Yes, there are only three right now, but... don&#39;t forget that we want robust analysis!
So actually dealing with these problems in a concrete fashion does take a little bit of effort (don&#39;t be too dodgy, that right&#39;s reserved for politicians - no offence).</p>
<p><em>Final disclaimer - there&#39;s a lot to take in, so please take a deep breath, drink some coffee and slowly look for patterns.</em></p>
<pre><code class="lang-python">energy_data
temperature_data
</code></pre>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" data-card-width="600px" data-card-key="2e4d628b39a64b99917c73956a16b477" href="<iframe class="airtable-embed" src="https://airtable.com/embed/shrAdtqxd30xD7SSd" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>" data-card-controls="0" data-card-theme="light"><iframe class="airtable-embed" src="https://airtable.com/embed/shrAdtqxd30xD7SSd" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe></a></div>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" data-card-width="600px" data-card-key="2e4d628b39a64b99917c73956a16b477" href="<iframe class="airtable-embed" src="https://airtable.com/embed/shrjQzcOu1FyVzKbd" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>" data-card-controls="0" data-card-theme="light"><iframe class="airtable-embed" src="https://airtable.com/embed/shrjQzcOu1FyVzKbd" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe></a></div>
<p>We can see that columns like <code>PrecipitationQuality</code> and <code>HM</code> seem to have the same value throughout.
To amend this we can remove columns with two or fewer unique elements (two rather than one, since an otherwise-constant column may also contain missing values).</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">remove_non_uniques</span><span class="hljs-params">(dataframe: pd.DataFrame, filter = [])</span>:</span>
    remove = [name <span class="hljs-keyword">for</span> name, series <span class="hljs-keyword">in</span> dataframe.items() <span class="hljs-keyword">if</span> len(series.unique()) &lt;= <span class="hljs-number">2</span> <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span> name <span class="hljs-keyword">in</span> filter]
    dataframe.drop(remove, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)
    <span class="hljs-keyword">return</span> remove

print(<span class="hljs-string">"Removed:"</span>)
remove_non_uniques(energy_data)
remove_non_uniques(temperature_data)
</code></pre>
<pre><code>Removed:
[<span class="hljs-string">'PeriodType'</span>]

[<span class="hljs-string">'HM'</span>,
 <span class="hljs-string">'PrecipitationQuality'</span>,
 <span class="hljs-string">'AirTemperatureQuality'</span>,
 <span class="hljs-string">'WetBulbTemperatureQuality'</span>,
 <span class="hljs-string">'DewTemperatureQuality'</span>,
 <span class="hljs-string">'RelativeHumidityQuality'</span>,
 <span class="hljs-string">'WindSpeedQuality'</span>,
 <span class="hljs-string">'WindDirectionQuality'</span>,
 <span class="hljs-string">'WindgustSpeedQuality'</span>,
 <span class="hljs-string">'SeaPressureQuality'</span>,
 <span class="hljs-string">'StationPressureQuality'</span>,
 <span class="hljs-string">'#'</span>]
</code></pre><p>Duplicate rows can also be removed.
This is far easier!</p>
<pre><code class="lang-python">energy_data.drop_duplicates(inplace=<span class="hljs-keyword">True</span>)
temperature_data.drop_duplicates(inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<p>The last thing is to check our datatypes.
This seems unnecessary here, yet modelling and graphing libraries are quite touchy about datatypes.</p>
<p>The process is quite straightforward: look at the column/what it contains and then compare that to the actual datatype.
With a large number of columns, it can be best to start by looking at dates and categories since they&#39;re almost always misinterpreted (as objects, floats or integers).
In general <code>object</code> should only be used for strings.</p>
<pre><code class="lang-python">energy_data.dtypes
temperature_data.dtypes
</code></pre>
<pre><code>Region          <span class="hljs-keyword">object</span>
Date            <span class="hljs-keyword">object</span>
TotalDemand    float64
RRP            float64
dtype: <span class="hljs-keyword">object</span>

StationNumber          int64
Year1                  int64
Month1                 int64
Day1                   int64
Hour1                  int64
Minute1                int64
Year                   int64
Month                  int64
Day                    int64
Hour                   int64
Minute                 int64
Precipitation         <span class="hljs-keyword">object</span>
AirTemperature        <span class="hljs-keyword">object</span>
WetBulbTemperature    <span class="hljs-keyword">object</span>
DewTemperature        <span class="hljs-keyword">object</span>
RelativeHumidity      <span class="hljs-keyword">object</span>
WindSpeed             <span class="hljs-keyword">object</span>
WindDirection         <span class="hljs-keyword">object</span>
WindgustSpeed         <span class="hljs-keyword">object</span>
SeaPressure           <span class="hljs-keyword">object</span>
StationPressure       <span class="hljs-keyword">object</span>
AWSFlag               <span class="hljs-keyword">object</span>
dtype: <span class="hljs-keyword">object</span>
</code></pre><p>In our case, we have not just one set of dates, but two (damn, the BOM data collection team needs to chill) 🥴.
As we predicted, the dates are integers and spread out across multiple columns (one for the year, one for the month, day, hour and minute).</p>
<p>We can start by getting rid of the duplicated set of dates (the second was due to daylight saving), and then we can parse the remaining date columns.
This formats our data in the nice orderly way we desired!</p>
<pre><code class="lang-python"><span class="hljs-comment"># Remove extra dates</span>
temperature_data.drop([<span class="hljs-string">"Year1"</span>, <span class="hljs-string">"Month1"</span>, <span class="hljs-string">"Day1"</span>, <span class="hljs-string">"Hour1"</span>, <span class="hljs-string">"Minute1"</span>], axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)

<span class="hljs-comment"># Reformat dates into Pandas' datatime64 objects</span>
<span class="hljs-comment"># Replacing old format</span>
temperature_data[<span class="hljs-string">"Date"</span>] = pd.to_datetime(temperature_data[[<span class="hljs-string">"Year"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Day"</span>, <span class="hljs-string">"Hour"</span>, <span class="hljs-string">"Minute"</span>]])
energy_data[<span class="hljs-string">"Date"</span>] = pd.to_datetime(energy_data[<span class="hljs-string">"Date"</span>])

temperature_data.drop([<span class="hljs-string">"Year"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Day"</span>, <span class="hljs-string">"Hour"</span>, <span class="hljs-string">"Minute"</span>], axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<p>Now we can also see a few problems with station numbers (where measurements were made), <code>AWSFlag</code>&#39;s (whether data was manually collected), temperature, humidity, pressure and precipitation measurements.
We do need to change these datatypes, but to do so we&#39;ll need to go slightly off the books as converting datatypes using the standard <code>.astype(&quot;category&quot;)</code> throws a few errors.
We can overcome these by noting down what the complaint is about, accounting for it and then trying to run the above function once again.</p>
<p>Just to be sure we&#39;re all on the same page, here&#39;s a short summary of the errors we&#39;re dealing with:</p>
<ul>
<li>Leading/trailing spaces (so &quot;12&quot; becomes &quot;       12     &quot;)</li>
<li>Random hashtags occasionally present (so 99.99% of cells will contain numbers, but then one will contain &quot;###&quot;)</li>
<li>There&#39;s a small amount of missing categorical data</li>
</ul>
<p>We can remove the leading and trailing spaces by using <code>.str.strip()</code>.
Next, to remove the rogue hashtags, we can use Pandas&#39; <code>replace</code> function to overwrite them with <code>np.NaN</code> (the value Pandas uses to mark missing data).
To finish off, we can just assume that any missing data was manually collected (worst case scenario).
The <code>fillna</code> and <code>replace</code> functions are both needed, as Pandas treats <code>np.NaN</code> and empty strings (&quot;&quot;) differently.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">to_object_columns</span><span class="hljs-params">(lambda_function)</span>:</span>
    string_columns = temperature_data.select_dtypes(<span class="hljs-string">"object"</span>).columns
    temperature_data[string_columns] = temperature_data[string_columns].apply(lambda_function)
</code></pre>
<pre><code class="lang-python">to_object_columns(<span class="hljs-keyword">lambda</span> column: column.str.strip())

temperature_data[<span class="hljs-string">"AWSFlag"</span>] = temperature_data[<span class="hljs-string">"AWSFlag"</span>].replace(<span class="hljs-string">""</span>, <span class="hljs-number">0</span>).astype(<span class="hljs-string">"category"</span>)
temperature_data[<span class="hljs-string">"AWSFlag"</span>].fillna(<span class="hljs-number">0</span>, inplace=<span class="hljs-keyword">True</span>)
temperature_data[<span class="hljs-string">"RelativeHumidity"</span>] = temperature_data[<span class="hljs-string">"RelativeHumidity"</span>].replace(<span class="hljs-string">"###"</span>, np.NaN)

to_object_columns(<span class="hljs-keyword">lambda</span> column: pd.to_numeric(column))
</code></pre>
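<p>A quick aside: if more stray strings like &quot;###&quot; ever turn up in other columns, a slightly more defensive variant (an alternative to the call above, not what we run here) lets <code>to_numeric</code> coerce anything unparseable into <code>NaN</code>:</p>
<pre><code class="lang-python"># Alternative: turn any remaining non-numeric stragglers into NaN instead of erroring
to_object_columns(lambda column: pd.to_numeric(column, errors="coerce"))
</code></pre>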
<pre><code class="lang-python">temperature_data.dtypes
</code></pre>
<pre><code>StationNumber                  <span class="hljs-keyword">int64</span>
Precipitation                <span class="hljs-keyword">float64</span>
AirTemperature               <span class="hljs-keyword">float64</span>
WetBulbTemperature           <span class="hljs-keyword">float64</span>
DewTemperature               <span class="hljs-keyword">float64</span>
RelativeHumidity             <span class="hljs-keyword">float64</span>
WindSpeed                    <span class="hljs-keyword">float64</span>
WindDirection                <span class="hljs-keyword">float64</span>
WindgustSpeed                <span class="hljs-keyword">float64</span>
SeaPressure                  <span class="hljs-keyword">float64</span>
StationPressure              <span class="hljs-keyword">float64</span>
AWSFlag                     category
Date                  datetime64[ns]
dtype: object
</code></pre><p>There is one final thing we can do to improve how our data is formatted.
That is to ensure that the columns identifying where our temperature and energy measurements were made both use the same categories.</p>
<p>Since each station maps to a single region, we can replace the separate station and region codes with the regions&#39; short forms.
Note that this information was provided in the dataset notes (don&#39;t worry, we&#39;re not expected to remember that 94029 means Tasmania).
To do these conversions we just create two dictionaries.
Each key-value pair represents the old code to map to the new one (so map &quot;SA1&quot; to &quot;SA&quot; and 23090 to &quot;SA&quot;).
The Pandas <code>map</code> function does the rest of the work.</p>
<pre><code class="lang-python">energy_data[<span class="hljs-string">"Region"</span>].unique()
temperature_data[<span class="hljs-string">"StationNumber"</span>].unique()
</code></pre>
<pre><code><span class="hljs-keyword">array</span>([<span class="hljs-string">'VIC1'</span>, <span class="hljs-string">'SA1'</span>, <span class="hljs-string">'TAS1'</span>, <span class="hljs-string">'QLD1'</span>, <span class="hljs-string">'NSW1'</span>], dtype=object)
<span class="hljs-keyword">array</span>([<span class="hljs-number">94029</span>, <span class="hljs-number">86071</span>, <span class="hljs-number">66062</span>, <span class="hljs-number">40913</span>, <span class="hljs-number">86338</span>, <span class="hljs-number">23090</span>])
</code></pre><pre><code class="lang-python">region_remove_number_map = {<span class="hljs-string">"SA1"</span>: <span class="hljs-string">"SA"</span>, <span class="hljs-string">"QLD1"</span>: <span class="hljs-string">"QLD"</span>, <span class="hljs-string">"NSW1"</span>: <span class="hljs-string">"NSW"</span>, <span class="hljs-string">"VIC1"</span>: <span class="hljs-string">"VIC"</span>, <span class="hljs-string">"TAS1"</span>: <span class="hljs-string">"TAS"</span>}
station_to_region_map = {<span class="hljs-number">23090</span>: <span class="hljs-string">"SA"</span>, <span class="hljs-number">40913</span>: <span class="hljs-string">"QLD"</span>, <span class="hljs-number">66062</span>: <span class="hljs-string">"NSW"</span>, <span class="hljs-number">86071</span>: <span class="hljs-string">"VIC"</span>, <span class="hljs-number">94029</span>: <span class="hljs-string">"TAS"</span>, <span class="hljs-number">86338</span>: <span class="hljs-string">"VIC"</span>}

temperature_data[<span class="hljs-string">"Region"</span>] = temperature_data[<span class="hljs-string">"StationNumber"</span>].map(station_to_region_map)
energy_data[<span class="hljs-string">"Region"</span>] = energy_data[<span class="hljs-string">"Region"</span>].map(region_remove_number_map)

temperature_data.drop(<span class="hljs-string">"StationNumber"</span>, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)
</code></pre>
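<p>A quick check (the same pattern as before) confirms both dataframes now share one region vocabulary:</p>
<pre><code class="lang-python">energy_data["Region"].unique()
temperature_data["Region"].unique()
</code></pre>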
<p>One last thing to note about the way our data is formatted (promise).
We currently don&#39;t index/sort our data in any specific way, even though it is a time series.
So we can use <code>set_index</code> to change that.</p>
<pre><code class="lang-python">energy_data.set_index(<span class="hljs-string">"Date"</span>, inplace=<span class="hljs-keyword">True</span>)
temperature_data.set_index(<span class="hljs-string">"Date"</span>, inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<h2 id="dealing-with-missing-data">Dealing with missing data</h2>
<p>So far we&#39;ve made sure that all our data can be easily accessed without any troubles.
We&#39;ve made sure everything is formatted right, and now we can use it... well kind of.
Although our data is correctly formatted, it doesn&#39;t quite mean that it&#39;s meaningful, useful or even present!</p>
<p>We can get through this though, we just need to be strategic.
The key thing to remember here:</p>
<blockquote>
<p>Don&#39;t do more work than necessary!</p>
</blockquote>
<p><strong>Our ultimate goal isn&#39;t to fix everything, but to remove what definitely is useless and enhance the quality of what could be especially interesting/useful</strong>.
This process aids us in knowing we&#39;re making solid, generalisable and reasonable predictions or interpretations (there&#39;s little point in the whole process otherwise).</p>
<p>One nice way to do this is to use graphs.
Through visualising our data we can easily spot where it&#39;s missing, where outliers exist and where two features are correlated.
We, of course, can&#39;t do <em>all of this on one plot</em>, and so we&#39;ll start by just looking for missing data.
Sections of large or frequent gaps are the potentially problematic regions we&#39;re looking for.
If these don&#39;t exist (i.e. there&#39;s little to no missing data), then our work is reduced.</p>
<p>Keep in mind that we have two datasets (not one), categorised by states!
As the data is recorded in separate states, grouping it all together will not correctly represent it.
Hence, we will have a series of plots (one per state) for each feature we want to analyse.
We are slightly lucky though because there&#39;s only one meaningful energy feature (<code>TotalDemand</code>), which we will see has little to no missing data.</p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), tight_layout=<span class="hljs-keyword">True</span>)

energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"red"</span>,title=<span class="hljs-string">"Tasmania Energy Demand"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"green"</span>,title=<span class="hljs-string">"Victoria Energy Demand"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"purple"</span>,title=<span class="hljs-string">"New South Wales Energy Demand"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"orange"</span>,title=<span class="hljs-string">"Queensland Energy Demand"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
energy_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"TotalDemand"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color=<span class="hljs-string">"blue"</span>,title=<span class="hljs-string">"South Australia Energy Demand"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1593739057463/2yxDKwnrW.png" alt="output_30_5.png"></p>
<p>As we can see, the plots are all continuous; this is how we confirm that there is no major source of missing data.
There are a variety of other trends here, but we&#39;ll leave those for later!</p>
<p>Now to move onto weather data.
This is where we&#39;ll see the usefulness of graphs!
Although it&#39;s possible to simply find the percent of missing data, the graphs easily show the nature of the null values.
We immediately see where it&#39;s missing, which itself suggests what method should be used (i.e. removing the data, resampling, etc).</p>
<p>We start by looking at <code>WetBulbTemperature</code>.
We will see that it is largely intact just like our energy data.
We will then see <code>AirTemperature</code>, and it&#39;ll be... rough and tattered.</p>
<p>For brevity, only a few key graphs are included here.
However, loads more can be graphed (please do play around with the code to see what else can be done)!
The problems with <code>AirTemperature</code> are similar to those in the following features:</p>
<ul>
<li>Precipitation</li>
<li>AirTemperature</li>
<li>DewTemperature</li>
<li>RelativeHumidity</li>
<li>WindSpeed</li>
<li>WindDirection</li>
<li>WindgustSpeed</li>
</ul>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), tight_layout=<span class="hljs-keyword">True</span>)

temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"red"</span>,title=<span class="hljs-string">"Tasmania Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"green"</span>,title=<span class="hljs-string">"Victoria Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"purple"</span>,title=<span class="hljs-string">"New South Wales Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"orange"</span>,title=<span class="hljs-string">"Queensland Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"WetBulbTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"blue"</span>,title=<span class="hljs-string">"South Australia Wet Bulb Temperature"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1593739151024/Oc8PP-Uyl.png" alt="output_32_5.png"></p>
<pre><code class="lang-python">fig, axes = plt.subplots(nrows=<span class="hljs-number">2</span>, ncols=<span class="hljs-number">3</span>, figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>), tight_layout=<span class="hljs-keyword">True</span>)

temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"TAS"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"red"</span>,title=<span class="hljs-string">"Tasmania Air Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"VIC"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"green"</span>,title=<span class="hljs-string">"Victoria Air Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"NSW"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"purple"</span>,title=<span class="hljs-string">"New South Wales Air Temperature"</span>,ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"QLD"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"orange"</span>,title=<span class="hljs-string">"Queensland Air Temperature"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>])
temperature_data.groupby(<span class="hljs-string">"Region"</span>).get_group(<span class="hljs-string">"SA"</span>)[<span class="hljs-string">"AirTemperature"</span>][<span class="hljs-string">"2000"</span>:<span class="hljs-string">"2019"</span>].plot(color= <span class="hljs-string">"blue"</span>,title=<span class="hljs-string">"South Australia Air Temperature"</span>,ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1593739182026/Sa1jPitGK.png" alt="output_33_5.png"></p>
<p>The months to years of missing air temperature data (the blank sections) scattered at random places in the graph indicate that it&#39;s not worth looking into further.
This actually <strong>isn&#39;t a bad thing, it allows us to focus more on what is present</strong>: energy demand and wet bulb temperature.</p>
<p>These graphs show large or regular sections of missing data; however, they don&#39;t show the small amounts randomly distributed throughout.
To be safe, we can quickly use Pandas&#39; <code>DataFrame.isnull</code> to find which values are null.
It immediately shows that our energy data is in perfect condition (nothing missing), whilst most temperature columns have a very large proportion missing!</p>
<p>We&#39;ll remove most features since they&#39;d require us to sacrifice large numbers of rows.
What we want to keep (i.e. <code>WetBulbTemperature</code>) can have its missing values interpolated (deduce what the value should be based on its surrounding values).</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_null_counts</span><span class="hljs-params">(dataframe: pd.DataFrame)</span>:</span>
    <span class="hljs-keyword">return</span> dataframe.isnull().mean()[dataframe.isnull().mean() &gt; <span class="hljs-number">0</span>]
</code></pre>
<pre><code class="lang-python">get_null_counts(energy_data)
get_null_counts(temperature_data)
</code></pre>
<pre><code><span class="hljs-selector-tag">Series</span>(<span class="hljs-selector-attr">[]</span>, <span class="hljs-selector-tag">dtype</span>: <span class="hljs-selector-tag">float64</span>)

<span class="hljs-selector-tag">Precipitation</span>         0<span class="hljs-selector-class">.229916</span>
<span class="hljs-selector-tag">AirTemperature</span>        0<span class="hljs-selector-class">.444437</span>
<span class="hljs-selector-tag">WetBulbTemperature</span>    0<span class="hljs-selector-class">.011324</span>
<span class="hljs-selector-tag">DewTemperature</span>        0<span class="hljs-selector-class">.375311</span>
<span class="hljs-selector-tag">RelativeHumidity</span>      0<span class="hljs-selector-class">.375312</span>
<span class="hljs-selector-tag">WindSpeed</span>             0<span class="hljs-selector-class">.532966</span>
<span class="hljs-selector-tag">WindDirection</span>         0<span class="hljs-selector-class">.432305</span>
<span class="hljs-selector-tag">WindgustSpeed</span>         0<span class="hljs-selector-class">.403183</span>
<span class="hljs-selector-tag">SeaPressure</span>           0<span class="hljs-selector-class">.137730</span>
<span class="hljs-selector-tag">StationPressure</span>       0<span class="hljs-selector-class">.011135</span>
<span class="hljs-selector-tag">dtype</span>: <span class="hljs-selector-tag">float64</span>
</code></pre><pre><code class="lang-python">remove_columns = [<span class="hljs-string">"Precipitation"</span>, <span class="hljs-string">"AirTemperature"</span>, <span class="hljs-string">"DewTemperature"</span>, <span class="hljs-string">"RelativeHumidity"</span>, <span class="hljs-string">"WindSpeed"</span>, <span class="hljs-string">"WindDirection"</span>, <span class="hljs-string">"WindgustSpeed"</span>]
temperature_data.drop(remove_columns, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-keyword">True</span>)

<span class="hljs-comment"># Note that using inplace currently throws an error</span>
<span class="hljs-comment"># So interpolated columns must be manually overridden</span>
missing_columns = list(get_null_counts(temperature_data).keys())
temperature_data[missing_columns] = temperature_data[missing_columns].interpolate(method=<span class="hljs-string">"time"</span>)
</code></pre>
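<p>As a sanity check, re-running our <code>get_null_counts</code> helper over the temperature data should now come back empty, or very nearly so (interpolation can&#39;t fill values that sit before a station&#39;s first reading):</p>
<pre><code class="lang-python"># Nothing (or almost nothing) should be missing after interpolating
get_null_counts(temperature_data)
</code></pre>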
<h2 id="combining-energy-and-temperature-data">Combining energy and temperature data</h2>
<p>Now, for the very last step.
Combining together the two dataframes into one, so we can associate our temperature data with energy demand.</p>
<p>We can use the <code>merge_asof</code> function to merge the two datasets.
This function merges the <em>closest values</em> together.
Since we have data grouped by region, we specify that with the <code>by</code> parameter.
We can choose to only merge energy and temperature entries which are 30 minutes or less apart.</p>
<pre><code class="lang-python">energy_data.sort_index(inplace=<span class="hljs-keyword">True</span>)
temperature_data.sort_index(inplace=<span class="hljs-keyword">True</span>)

data = pd.merge_asof(energy_data, temperature_data, left_index=<span class="hljs-keyword">True</span>, right_index=<span class="hljs-keyword">True</span>, by=<span class="hljs-string">"Region"</span>, tolerance=pd.Timedelta(<span class="hljs-string">"30 min"</span>))
</code></pre>
<p>To check whether the merge has happened successfully, we can check how many null values are present.
This works since unpaired rows cause null values.</p>
<pre><code class="lang-python">get_null_counts(data)
data.dropna(inplace=<span class="hljs-keyword">True</span>)
</code></pre>
<pre><code><span class="hljs-selector-tag">WetBulbTemperature</span>    0<span class="hljs-selector-class">.001634</span>
<span class="hljs-selector-tag">SeaPressure</span>           0<span class="hljs-selector-class">.001634</span>
<span class="hljs-selector-tag">StationPressure</span>       0<span class="hljs-selector-class">.001634</span>
<span class="hljs-selector-tag">AWSFlag</span>               0<span class="hljs-selector-class">.001634</span>
<span class="hljs-selector-tag">dtype</span>: <span class="hljs-selector-tag">float64</span>
</code></pre><p>Now we can finally see some clean and sensible data!
This is the first table we see which does not pose a massive health and safety hazard.
Now that we&#39;ve got to this stage we should celebrate... it only gets better from here 👊.</p>
<pre><code class="lang-python">data
</code></pre>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" data-card-width="600px" data-card-key="2e4d628b39a64b99917c73956a16b477" href="<iframe class="airtable-embed" src="https://airtable.com/embed/shrQWLQqpH7XvnTGw" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>" data-card-controls="0" data-card-theme="light"><iframe class="airtable-embed" src="https://airtable.com/embed/shrQWLQqpH7XvnTGw" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe></a></div>
<h2 id="saving-final-data">Saving final data</h2>
<pre><code class="lang-python">pd.to_pickle(data, <span class="hljs-string">"../Data/Data.pickle"</span>)
</code></pre>
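<p>When we pick this back up in the next part, a single line (pointing at the same file path) loads everything again:</p>
<pre><code class="lang-python">data = pd.read_pickle("../Data/Data.pickle")
</code></pre>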
<p>Photo by Matthew Henry on <a target='_blank' rel='noopener'  href="https://unsplash.com/photos/yETqkLnhsUI">Unsplash</a></p>
]]></content:encoded></item><item><title><![CDATA[Insight is KING - How to Get it and AVOID PITFALLS]]></title><description><![CDATA[It is so hard to find an intuitive understanding of how your dataset functions!
Yet, coherently interpreting how your system works is crucial to finding a way to model or analyse any feature.
Stick on till the end to find out why initial insight is k...]]></description><link>https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls</link><guid isPermaLink="true">https://www.kamwithk.com/insight-is-king-how-to-get-it-and-avoid-pitfalls</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[ML]]></category><category><![CDATA[projects]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[advice]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 17 Jun 2020 14:11:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1592402604108/N7fGNYgFW.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It is <em>so hard</em> to find an intuitive understanding of how your dataset functions!
Yet, coherently interpreting how your system works is crucial to finding a way to model or analyse any feature.
Stick on till the end to find out why initial insight is key to good analysis, what you have to do to formulate your understanding and finally how to easily avoid and overcome problems.</p>
<h1 id="the-story">The story</h1>
<p>I&#39;ve been working on a <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Temp2Enrgy">project to predict energy demand using Australian weather data</a> in a team of 5.
It spanned half a year and mimicked a classic data science project.
<em>Start with data collection, move onto cleaning, modelling and then formulate a report.</em>
Halfway through I discovered one thing - I was <em>still cleaning the data</em>, and it was taking forever, despite having four others working on it!
It seemed like there were a million features we had to deal with, and most of them were nowhere near <em>ready to use</em>.
I had seen it as an opportunity to learn more about the different (fancy) techniques which could be used to interpolate (fill in) missing data.
But... <strong>I completely misunderstood my project</strong>.</p>
<blockquote>
<p>We didn&#39;t understand how our data worked and so had to account for 100 features instead of 3...</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592401734115/-Yj3sMJWB.jpeg" alt="complex.jpg"></p>
<p>It was a naive mistake but one that taught me several valuable lessons:</p>
<ol>
<li>Low-quality/irrelevant data does more harm than good</li>
<li>My team&#39;s work can only be completed as well as it is understood</li>
</ol>
<p>Now don&#39;t worry, we did end up finishing on time, with a model and complete report!
But... only after I accounted for these flaws could my team act like a well-oiled machine.
So here we&#39;ll explore how to find and profit from insight, along with how to avoid ever-present flaws and potential pitfalls (which everyone at some point comes across).</p>
<h1 id="how-to-find-insight">How to find insight</h1>
<p>It&#39;s easy to emphasise the importance of understanding how data functions, but hard to discover it.
With tutorials, Kaggle competitions and simple beginner exercises it was easy... the understanding and interpreting part was handed to you on a silver platter.
Just get out your golden fork (<em>code</em>) and knife (<em>ensemble model</em>), start cutting (<em>testing</em>) and voilà, you can eat (<em>a high-performance model</em>)!</p>
<p>Then you progress and begin a few real projects... oh no.
There may not be a one size fits all strategy, but we can make the process smooth and less painful by setting ourselves up properly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592401817467/bpHYDVisj.jpeg" alt="insight car.jpg"></p>
<h2 id="where-to-begin">Where to begin</h2>
<blockquote>
<p>Just COOOODE... NO PLEASE WAIT!</p>
</blockquote>
<p>Code is important, but I&#39;ll let you in on a little secret - <em>it takes less time when you know the process</em>.
Bbubbbut... how to know the process, before starting the first time?
Easy, interpret the mission objectives.
Mission objectives are what you want to get out of the project.
Mission objectives include the model and report itself, but also what you&#39;ll need to learn, what stages you&#39;ll go through and what challenges need to be accounted for.</p>
<blockquote>
<p>I thought my goal was to create the best model I could to predict energy demand, hahaha.
I was completely wrong!</p>
</blockquote>
<p>My goal wasn&#39;t to predict energy demand... because that would be nearly impossible.
What I actually wanted was to identify and model the energy demand trends and seasonal patterns which occurred in the short-term, and how temperature fit into this equation.
It involved importing data, researching to find which variables were useful, creating graphs to intuitively show how the data looked/worked and THEN finally creating a model to concretely measure the relationship between temperature and energy demand.
The main difficulties would lie in learning about how the energy time series worked and learning to guide my team through each stage of the process.
Painfully verbose yet?
The end-goal may have been a report, showing how everything worked its way till we got a model, but in reality, the model was just 10% of the work!</p>
<p>It is easy, though, to put this aside... to say that it is soft, unnecessary planning which is unlikely to directly impact the project right now.
In fact... yes, it is extra work, and fair enough if you&#39;re not compelled to plan everything out like this.
If you&#39;ve got a better alternative - let me know.
If not, give it a try.
It may not <em>impact you right now</em>, but it will aid in mitigating large problems, and illustrating how everything ties together!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592402036350/ZCtKxoC0S.jpeg" alt="road.jpg"></p>
<h2 id="finding-out-where-to-go-next">Finding out where to go next</h2>
<p><em>But I have no idea how to do any of this...</em>
Don&#39;t worry, with time, you&#39;ll figure it all out.
Just remember:</p>
<blockquote>
<p>The path seems a little less bumpy once you get started!</p>
</blockquote>
<p>If you have no idea where to start, though, find out what you&#39;ll need to learn and find ways to do so.
Simple tutorials and videos are a great way to start off.
Then, once you&#39;ve got a vague idea of what things should look like and what to do, just get started.
Simply follow the trail and see where it leads!</p>
<p>For data science projects, know the practical steps and processes.
I explain these in my <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning field guide article</a> which goes through every single step of the process in detail!
If you want to learn more, books like <a target='_blank' rel='noopener noreferrer'  href="https://amzn.to/3fua1k8">Hands-On Machine Learning</a> and <a target='_blank' rel='noopener noreferrer'  href="https://amzn.to/2UVmHZE">The Hundred-Page Machine Learning Book</a> are extremely useful.</p>
<p>To work better in a team make sure you understand collaborative coding tools (all explained <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">here</a>) and how to lead.
The book <a target='_blank' rel='noopener noreferrer'  href="https://amzn.to/30NT8g6">Extreme Ownership</a> is an amazing guide to teamwork and leadership (not data science specific, but Jocko Willink&#39;s advice applies nonetheless).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592402182523/I34tixWVm.jpeg" alt="collaboration.jpg"></p>
<h2 id="avoiding-collapse">Avoiding collapse</h2>
<p>Everything was going so well... until I realised we were still cleaning the data halfway through the project.
Everything seemed fine, progress seemed alright, not perfect... but fine.</p>
<p>Even when you&#39;ve set yourself up to succeed, and everything has progressed fine... things can go south!
But... I was lucky because a teacher told me to regularly do one thing:</p>
<blockquote>
<p>Keep a simple journal of progress, specifically commenting on what you&#39;ve done, how it panned out and what can be done to improve.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1592401337556/RJ_lNTO77.jpeg" alt="coffee time.jpg"></p>
<p>It <em>worked wonders</em>!
Instead of stressing out after coming to grips with how much I needed to finish, I was able to prioritise and execute, because I knew where I could go wrong and I could account for it.
I knew my team could get distracted and lose focus, so I made sure to stick to the point and emphasise what we were trying to accomplish instead of writing down narrow tasks.
I knew it was difficult to pace ourselves, so I kept a count on how many weeks were left and made sure everyone understood.
I knew the coding was particularly challenging and threatening to most people, so I did a brief rundown on what it would involve/what it should look like with sample code.
In short, I accounted for my weaknesses and managed to turn around a bad situation.</p>
<p>The process <em>only took ~5 minutes</em> each week and drastically boosted progress.</p>
<blockquote>
<p>All you&#39;ve got to do is reflect on how your actions unfold and consider what could help you out further.</p>
</blockquote>
<p>This <strong>leads to simple actionable steps</strong>.</p>
<h1 id="thanks-for-reading">THANKS FOR READING</h1>
<p>I hope you&#39;ve enjoyed this, and that you&#39;ve found it helpful!
Please feel free to share this with anyone it may help.</p>
<p>My other articles on <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">practical coding skills</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx">machine learning</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-nlp-project-edition-ck6zsqtbo05srdfs135o8blcf">starting projects</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">web scraping</a> may be interesting.</p>
<p>Follow me on <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_">Twitter</a> for updates.</p>
<p>Photos by <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/QnUywvDdI1o">Toa Heftiba</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/l090uFWoPaI">John Barkiple</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/lypqQBIRXpo">Yang Jing</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/1n6jYq40syA">Kyle Glenn</a> and <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/Ev1XqeVL2wI">Josh Calabrese</a> on Unsplash</p>
]]></content:encoded></item><item><title><![CDATA[Machine Learning Field Guide]]></title><description><![CDATA[We all have to deal with data, and we try to learn about and implement machine learning into our projects.
But everyone seems to forget one thing... it's far from perfect, and there is so much to go through!
Don't worry, we'll discuss every little st...]]></description><link>https://www.kamwithk.com/machine-learning-field-guide</link><guid isPermaLink="true">https://www.kamwithk.com/machine-learning-field-guide</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[projects]]></category><category><![CDATA[side project]]></category><category><![CDATA[programming]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Fri, 12 Jun 2020 05:00:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1591892207311/PEqetlMtX.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We all have to deal with data, and we try to learn about and implement machine learning into our projects.
But everyone seems to forget one thing... it&#39;s far from perfect, and there is <em>so much to go through</em>!
Don&#39;t worry, we&#39;ll discuss every little step, from start to finish 👀.</p>
<p>All you&#39;ll need are <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">these fundamentals</a>!</p>
<h1 id="the-story-behind-it-all">The Story Behind it All</h1>
<p>We all start with either a dataset or a goal in mind.
Once <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">we&#39;ve found, collected or scraped our data</a>, we pull it up, and witness the overwhelming sight of merciless cells of numbers, more numbers, categories, and maybe some words 😨!
A naive thought crosses our mind, to use our machine learning prowess to deal with this tangled mess... but a quick search reveals the host of tasks we&#39;ll need to consider before <em>training a model</em> 😱!</p>
<p>Once we overcome the shock of our unruly data we look for ways to battle our formidable nemesis 🤔.
We start with trying to get our data into Python.
It is relatively simple on paper, but the process can be slightly... <em>involved</em>.
Nonetheless, a little effort is usually all that&#39;s needed (lucky us).
<p>Without wasting any time we begin <em>data cleaning</em> to get rid of the bogus and expose the beautiful.
Our methods start simple - observe and remove.
It works a few times, but then we realise... it really doesn&#39;t do us justice!
To deal with the mess though, we find a powerful tool to add to our arsenal: charts!
With our graphs, we can get a feel for our data, the patterns within it and where things are missing.
We can then <em>interpolate</em> (fill in) or remove missing data.</p>
<p>Finally, we approach our highly anticipated 😎 challenge, data modelling!
With a little research, we find out which tactics and models are commonly used.
It is a little difficult to decipher which one we should use, but we still manage to get through it and figure it all out!</p>
<p>We can&#39;t finish a project without doing something impressive though.
So, a final product, website, app or even a report will take us far!
We know first impressions are important so we fix up the GitHub repository and make sure everything&#39;s well documented and explained.
Now we are <em>finally able to show off our hard work to the rest of the world</em> 😎!</p>
<h1 id="the-epochs">The epochs</h1>
<h2 id="chapter-1-importing-data">Chapter 1 - Importing Data</h2>
<p>Data comes in all kinds of shapes and sizes and so the process we use to get everything into code often varies.</p>
<blockquote>
<p>Let&#39;s be real, importing data seems easy, but sometimes... it&#39;s a little pesky.</p>
</blockquote>
<p>The hard part about importing data isn&#39;t the coding or theory, but instead our preparation!
When we first start a new project and download our dataset, it can be tempting to open up a code editor and start typing... but this won&#39;t do us any good.
If we want to get a head start we need to prepare ourselves for the best and worst parts of our data.
To do this we&#39;ll need to start basic, by manually inspecting our spreadsheet/s.
Once we understand the basic format of the data (filetype along with any particularities) we can move onto getting it all into Python.</p>
<p>When we&#39;re lucky and just have one spreadsheet we can use the Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html"><code>read_csv</code></a> function (letting it know where our data lies):</p>
<pre><code class="lang-python">pd.read_csv(<span class="hljs-string">"file_path.csv"</span>)
</code></pre>
<p>In reality, we run into way more complex situations, so look out for:</p>
<ul>
<li>File starts with unneeded information (which we need to skip)</li>
<li>We only want to import a few columns</li>
<li>We want to rename our columns</li>
<li>Data includes dates</li>
<li>We want to combine data from multiple sources into one place</li>
<li>Data can be grouped together</li>
</ul>
<blockquote>
<p>Although we&#39;re discussing a range of scenarios, we normally only deal with a few at a time.</p>
</blockquote>
<p>Our first few problems (importing specific parts of our data/renaming columns) are easy enough to deal with using a few parameters, like the number of rows to skip, the specific columns to import and our column names:</p>
<pre><code class="lang-python">pd.read_csv(<span class="hljs-string">"file_path.csv"</span>, skiprows=<span class="hljs-number">5</span>, usecols=[<span class="hljs-number">0</span>, <span class="hljs-number">1</span>], names=[<span class="hljs-string">"Column1"</span>, <span class="hljs-string">"Column2"</span>])
</code></pre>
<p>Whenever our data is spread across multiple files, we can combine them using Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html"><code>concat</code></a> function.
The <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html"><code>concat</code></a> function combines a list of <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html"><code>DataFrame</code></a>&#39;s together:</p>
<pre><code class="lang-python">my_spreadsheets = [pd.read_csv(<span class="hljs-string">"first_spreadsheet.csv"</span>), pd.read_csv(<span class="hljs-string">"second_spreadsheet.csv"</span>)]
pd.concat(my_spreadsheets, ignore_index=<span class="hljs-keyword">True</span>)
</code></pre>
<p>We pass <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html"><code>concat</code></a> a list of spreadsheets (which we import just like before).
The list can, of course, be attained in any way (so a fancy list comprehension or a casual list of every file both work just as well), but just remember that <strong>we need dataframes, not filenames/paths</strong>!</p>
<p>If we don&#39;t have a CSV file Pandas still works!
We can just <em>swap out</em> <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html"><code>read_csv</code></a> for <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html"><code>read_excel</code></a>, <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html"><code>read_sql</code></a> or another <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">option</a>.</p>
<p>After all the data is inside a Pandas dataframe, we need to double-check that our data is <em>formatted correctly</em>.
In practice, this means checking each series&#39; datatype, and making sure none are generic objects.
We do this to ensure that we can utilize Pandas inbuilt functionality for numeric, categorical and date/time values.
To look at this just run <code>DataFrame.dtypes</code>.
If the output seems reasonable (i.e. numbers are numeric, categories are categorical, etc.), then we should be fine to move on.
However, this is often not the case, and we need to change our datatypes!
This can be done with Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html"><code>DataFrame.astype</code></a>.
If this doesn&#39;t work, there should be another, more specific Pandas function for that particular conversion:</p>
<pre><code class="lang-python">data[<span class="hljs-string">"Rating"</span>] = data[<span class="hljs-string">"Rating"</span>].as_type(<span class="hljs-string">"category"</span>)
data[<span class="hljs-string">"Number"</span>] = pd.to_numeric(data[<span class="hljs-string">"Number"</span>])
data[<span class="hljs-string">"Date"</span>] = pd.to_datetime(data[<span class="hljs-string">"Date"</span>])
data[<span class="hljs-string">"Date"</span>] = pd.to_datetime(data[[<span class="hljs-string">"Year"</span>, <span class="hljs-string">"Month"</span>, <span class="hljs-string">"Day"</span>, <span class="hljs-string">"Hour"</span>, <span class="hljs-string">"Minute"</span>]])
</code></pre>
<p>If we need to analyse separate groups of data (i.e. maybe our data is divided by country), we can use Pandas <code>groupby</code>.
We can use <code>groupby</code> to select particular data, and to run functions on each group separately:</p>
<pre><code class="lang-python">data.groupby(<span class="hljs-string">"Country"</span>).get_group(<span class="hljs-string">"Australia"</span>)
data.groupby(<span class="hljs-string">"Country"</span>).mean()
</code></pre>
<p><em>Other more niche tricks like multi/hierarchical indices can also be helpful in specific scenarios, however, they are trickier to understand and use.</em></p>
<h2 id="chapter-2-data-cleaning">Chapter 2 - Data Cleaning</h2>
<p>Data is useful, data is necessary, however, it <em>needs to be clean and to the point</em>!
If our data is everywhere, it simply won&#39;t be of any use to our machine learning model.</p>
<blockquote>
<p>Everyone is driven insane by missing data, but there&#39;s always a light at the end of the tunnel.</p>
</blockquote>
<p>The easiest and quickest way to go through data cleaning is to ask ourselves:</p>
<blockquote>
<p>What features within our data will impact our end-goal?</p>
</blockquote>
<p>By end-goal, we mean whatever variable we are working towards predicting, categorising or analysing.
The point of this is to narrow our scope and not get bogged down in useless information.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591933470258/ZETFabnur.jpeg" alt="data analysis.jpg"></p>
<p>Once we know what our primary objective features are, we can try to find patterns, relations, missing data and more.
An easy and intuitive way to do this is graphing!
Quickly use Pandas to sketch out each variable in the dataset, and try to see where everything fits into place.</p>
<p>Once we have identified potential problems, or trends in the data we can try and fix them.
In general, we have the following options:</p>
<ul>
<li>Remove missing entries</li>
<li>Remove full columns of data</li>
<li>Fill in missing data entries</li>
<li>Resample data (i.e. change the resolution)</li>
<li>Gather more information</li>
</ul>
<p>To go from identifying missing data to choosing what to do with it we need to consider how it affects our end-goal.
With missing data we remove anything which doesn&#39;t seem to have a major influence on the end result (i.e. we couldn&#39;t find a meaningful pattern), or where there just seems <em>too much missing to derive value</em>.
Sometimes we also decide to remove very small amounts of missing data (since it&#39;s easier than filling it in).</p>
<p>If we&#39;ve decided to get rid of information, Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html"><code>DataFrame.drop</code></a> can be used.
It removes columns or rows from a dataframe.
It is quite easy to use, but remember that <strong>Pandas does not modify/remove data from the source dataframe by default</strong>, so <code>inplace=True</code> must be specified.
It may be useful to note that the <code>axis</code> parameter specifies whether rows or columns are being removed.</p>
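<p>As a minimal sketch (the column name and row label here are just placeholders):</p>
<pre><code class="lang-python"># Drop a whole column, then drop a row by its index label
data.drop("Unneeded Column", axis=1, inplace=True)
data.drop(0, axis=0, inplace=True)
</code></pre>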
<p>When not removing a full column, or particularly targeting missing data, it can often be useful to rely on a few nifty Pandas functions.
For removing null values, <code>DataFrame.dropna</code> can be utilized.
Do keep in mind though that by default, <code>dropna</code> removes every row containing even a single missing value.
However, setting the parameter <code>how</code> to <code>all</code> (only drop rows where every value is missing) or setting a threshold (<code>thresh</code>, the minimum number of non-null values a row needs to be kept) can compensate for this.</p>
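<p>For instance (the threshold of 3 is just an illustrative value):</p>
<pre><code class="lang-python"># Only drop rows where every single value is missing
data.dropna(how="all", inplace=True)
# Keep only rows with at least 3 non-null values
data.dropna(thresh=3, inplace=True)
</code></pre>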
<p>If we&#39;ve got small amounts of irregular missing values, we can fill them in several ways.
The simplest is <code>DataFrame.fillna</code> which sets the missing values to some preset value.
The more complex, but flexible option is interpolation using <code>DataFrame.interpolate</code>.
Interpolation essentially allows anyone to simply set the <em>method</em> they would like to replace each null value with.
These include the previous/next value, linear and time (the last two infer values from the data).
When working with time series, time is a natural choice; otherwise make a reasonable choice based on how much data is being interpolated and how complex it is.</p>
<pre><code class="lang-python">data[<span class="hljs-string">"Column"</span>].fillna(<span class="hljs-number">0</span>, inplace=<span class="hljs-keyword">True</span>)
data[[<span class="hljs-string">"Column"</span>]] = data[[<span class="hljs-string">"Column"</span>]].interpolate(method=<span class="hljs-string">"linear"</span>)
</code></pre>
<p><em>As seen above, interpolate needs to be passed in a dataframe purely containing the columns with missing data</em> (otherwise an error will be thrown).</p>
<p>Resampling can be useful whenever we see regularly missing data or have multiple sources of data using different timescales (like ensuring measurements in minutes and hours can be combined).
It can be slightly difficult to understand intuitively, but it essentially aggregates (for example, averages) measurements over a certain timeframe.
For example, we can get monthly values by specifying that we want to get the mean of each month&#39;s values:</p>
<pre><code class="lang-python">data.resample(<span class="hljs-string">"M"</span>).mean()
</code></pre>
<p>The <code>&quot;M&quot;</code> stands for month and can be replaced with <code>&quot;Y&quot;</code> for year and other options.</p>
<p>Although the data cleaning process can be quite challenging, if we remember our initial intent, it becomes a far more logical and straightforward task!
If we still don&#39;t have the needed data, we may need to go back to phase one and collect some more.
<em>Note that missing data indicates a problem with data collection, so it&#39;s useful to carefully consider, and note down, where it occurs.</em></p>
<p><em>For completion, the Pandas <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html"><code>unique</code></a> and <a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html"><code>value_counts</code></a> functions are useful to decide which features to straight-up remove and which to graph/research.</em></p>
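<p>A quick sketch of how those look (reusing the "Country" column from the grouping example earlier):</p>
<pre><code class="lang-python"># The distinct values in a column, then how often each occurs
data["Country"].unique()
data["Country"].value_counts()
</code></pre>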
<h2 id="chapter-3-visualisation">Chapter 3 - Visualisation</h2>
<p>Visualisation sounds simple and it is, but it&#39;s hard to... <em>not overcomplicate</em>.
It&#39;s far too easy for us to consider plots as a chore to create.
Yet, these bad boys do one thing very, very well - intuitively demonstrate the inner workings of our data!
Just remember:</p>
<blockquote>
<p>We graph data to find and explain how everything works.</p>
</blockquote>
<p>Hence, when stuck for ideas, or not quite sure what to do, we can always fall back on the basics: <strong>identifying useful patterns and meaningful relationships</strong>.
It may seem iffy 🥶, but it is really useful.</p>
<blockquote>
<p>Our goal isn&#39;t to draw fancy hexagon plots, but instead to picture what is going on, so <em>absolutely anyone</em> can simply interpret a complex system!</p>
</blockquote>
<p>A few techniques are undeniably useful:</p>
<ul>
<li>Resampling when we <em>have too much data</em></li>
<li>Secondary axis when plots have different scales</li>
<li>Grouping when our data can be split categorically</li>
</ul>
<p>To get started graphing, simply use Pandas <code>.plot()</code> on any series or dataframe!
When we need more, we can delve into MatPlotLib, Seaborn or an interactive plotting library.</p>
<pre><code><span class="hljs-keyword">data</span>.plot(x=<span class="hljs-string">"column 1 name"</span>, y=<span class="hljs-string">"column 2 name"</span>, kind=<span class="hljs-string">"bar"</span>, figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">10</span>))
<span class="hljs-keyword">data</span>.plot(x=<span class="hljs-string">"column 1 name"</span>, y=<span class="hljs-string">"column 3 name"</span>, secondary_y=True)
<span class="hljs-keyword">data</span>.hist()
<span class="hljs-keyword">data</span>.groupby(<span class="hljs-string">"group"</span>).boxplot()
</code></pre><p>90% of the time, this basic functionality will suffice (<a target='_blank' rel='noopener noreferrer'  href="https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#plot-formatting">more info here</a>), and where it doesn&#39;t a search should reveal how to <em>draw particularly exotic graphs</em> 😏.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591932777077/oKrkIQYS9.jpeg" alt="dashboard.jpg"></p>
<h2 id="chapter-4-modelling">Chapter 4 - Modelling</h2>
<h3 id="a-brief-overview">A Brief Overview</h3>
<p>Now finally for the fun stuff - deriving results.
It seems <em>so simple to train a scikit learn model, but no one goes into the details</em>!
So, let&#39;s be honest here: not all datasets, nor all models, are equal.</p>
<p>Our approach to modelling will vary widely based on our data.
There are three especially important factors:</p>
<ul>
<li><strong>Type</strong> of problem</li>
<li><strong>Amount</strong> of data</li>
<li><strong>Complexity</strong> of data</li>
</ul>
<p>Our type of problem comes down to whether we are trying to predict a class/label (called <em>classification</em>), a value (called <em>regression</em>), or to group data (called <em>clustering</em>).
If we are trying to train a model on a dataset where we already have examples of what we&#39;re trying to predict we call our model <em>supervised</em>, if not, <em>unsupervised</em>.
The amount of available data, and how complex it is foreshadows how simple a model will suffice.
<em>Data with more features (i.e. columns) tends to be more complex</em>.</p>
<blockquote>
<p>The point of interpreting complexity is to understand which models are <em>too good or too bad for our data</em></p>
</blockquote>
<p>A model&#39;s <em>goodness of fit</em> informs us of this!
If a model struggles to interpret our data (too simple) we can say it <em>underfits</em>, and if it is completely overkill (too complex) we say it <em>overfits</em>.
We can think of it as a spectrum from learning nothing to memorising everything.
We need to <em>strike balance</em>, to ensure our model is <strong>able to <em>generalise</em> our conclusions</strong> to new information.
This is typically known as the bias-variance tradeoff.
<em>Note that complexity also affects model interpretability.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591931791416/qtb6eievP.png" alt="Goodness of Fit.png"></p>
<p><strong>Complex models take substantially more time to train</strong>, especially with large datasets.
So, upgrade that computer, run the model overnight, and chill for a while 😁!</p>
<h3 id="preparation">Preparation</h3>
<h4 id="splitting-up-data">Splitting up data</h4>
<p>Before training a model it is important to note that we will need some dataset to test it on (so we know how well it performs).
Hence, we often divide our dataset into <strong>separate training and testing sets</strong>.
This allows us to test <em>how well our model can generalise to new unseen data</em>.
This normally works because we know our data is decently representative of the real world.</p>
<p>The actual amount of test data doesn&#39;t matter too much, but 80% train and 20% test is often used.</p>
<p>In Python with Scikit learn the <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html"><code>train_test_split</code></a> function does this:</p>
<pre><code class="lang-python">train_data, test_data = train_test_split(data)
</code></pre>
<p>Cross-validation is where a dataset is split into several folds (i.e. subsets or portions of the original dataset).
This tends to be more robust and <em>resistant to overfitting</em> than using a single test/validation set!
Several Sklearn functions help with <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/cross_validation.html">cross-validation</a>, however, it&#39;s normally done straight through a grid or random search (discussed below).</p>
<pre><code class="lang-python">cross_val_score(model, input_data, output_data, cv=<span class="hljs-number">5</span>)
</code></pre>
<h4 id="hyperparameter-tuning">Hyperparameter tuning</h4>
<p>There are some factors our model cannot account for, and so we <em>set certain hyperparameters</em>.
These vary model to model, but we can either find optimal values through manual trial and error or a simple algorithm like grid or random search.
With grid search, we try all possible values (brute force 😇) and with random search random values from within some distribution/selection.
Both approaches typically use cross-validation.</p>
<p><a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">Grid search</a> in Sklearn works through a <code>parameters</code> dictionary.
Each entry&#39;s key represents the hyperparameter to tune, and the value (a list or tuple) is the selection of values to choose from:</p>
<pre><code class="lang-python">parameters = {<span class="hljs-string">'kernel'</span>:(<span class="hljs-string">'linear'</span>, <span class="hljs-string">'rbf'</span>), <span class="hljs-string">'C'</span>:[<span class="hljs-number">1</span>, <span class="hljs-number">10</span>]}
model = = SVC()
grid = GridSearchCV(model, param_grid=parameters)
</code></pre>
<p>After we&#39;ve created the grid, we can use it to train the models, and extract the scores:</p>
<pre><code class="lang-python">grid.fit(train_input, train_output)
best_score, best_params = grid.best_score_, grid.best_params_
</code></pre>
<p>The important thing here is to remember that we need to <strong>train on the training and not testing data</strong>.
Even though cross-validation is used to test the models, we&#39;re ultimately trying to get the best fit on the training data and will proceed to test each model on the testing set afterwards:</p>
<pre><code><span class="hljs-built_in">test</span>_predictions = grid.predict(<span class="hljs-built_in">test</span>_input)
</code></pre><p><a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html">Random search</a> in Sklearn works similarly but is slightly more complex, as we need to know what type of distribution each hyperparameter takes in.
Although it, in theory, <em>can yield the same or better results faster</em>, that changes from situation to situation.
<em>For simplicity it is likely best to stick to a grid search.</em></p>
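<p>For reference, a minimal random search sketch mirroring the grid search above (the distribution for <code>C</code> and the iteration count are illustrative choices):</p>
<pre><code class="lang-python">from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# C is sampled from a continuous range rather than a fixed grid
distributions = {"kernel": ["linear", "rbf"], "C": uniform(1, 10)}
search = RandomizedSearchCV(SVC(), param_distributions=distributions, n_iter=10)
search.fit(train_input, train_output)
</code></pre>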
<h3 id="model-choices">Model Choices</h3>
<h4 id="using-a-model">Using a model</h4>
<p>With Sklearn, it&#39;s as simple as finding our desired model name and then just creating a variable for it.
Check the links to the documentation for further details!
For example:</p>
<pre><code class="lang-python">support_vector_regressor = SVR()
</code></pre>
<h4 id="basic-choices">Basic Choices</h4>
<h5 id="linear-logistic-regression">Linear/Logistic Regression</h5>
<p><a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model">Linear regression</a> is trying to <em>fit a straight line</em> to our data.
It is the most basic and fundamental model.
There are several variants of linear regression, like lasso and ridge regression (which are regularisation methods to prevent overfitting).
Polynomial regression can be used to fit curves of higher degrees (like parabolas and other curves).
Logistic regression is another variant which can be used for classification.</p>
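<p>A quick sketch of what these look like in Sklearn (assuming the training inputs/outputs from the earlier split; the <code>alpha</code> value is illustrative):</p>
<pre><code class="lang-python">from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge

model = LinearRegression()
regularised = Ridge(alpha=1.0)  # or Lasso(alpha=1.0) for the other regularised variant
classifier = LogisticRegression()  # the classification variant
model.fit(train_input, train_output)
</code></pre>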
<h5 id="support-vector-machines">Support Vector Machines</h5>
<p>Just like with linear/logistic regression, <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/svm.html">support vector machines (SVM&#39;s)</a> try to fit a line or curve to data points.
However, with SVM&#39;s the aim is to maximise the distance between a boundary and each point (instead of getting the line/curve to go through each point).</p>
<p>The main advantage of support vector machines is their ability to <em>use different kernels</em>.
A kernel is a function which calculates similarity.
These kernels allow for both linear and non-linear data, whilst staying decently efficient.
The kernels map the input into a higher dimensional space so a boundary becomes present.
This process is typically not feasible for large numbers of features.
A neural network or another model will then likely be a better choice!</p>
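<p>A minimal sketch, again assuming the earlier train/test split:</p>
<pre><code class="lang-python">from sklearn.svm import SVC, SVR

# The kernel determines how similarity between points is measured
classifier = SVC(kernel="rbf")
regressor = SVR(kernel="linear")
classifier.fit(train_input, train_output)
</code></pre>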
<h5 id="neural-networks">Neural Networks</h5>
<p>All the buzz is always about deep learning and <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/neural_networks_supervised.html">neural networks</a>.
They are complex, slow and resource-intensive models which can be used for complex data.
Yet, they are extremely useful when encountering large unstructured datasets.</p>
<p>When using a neural net, make sure to watch out for overfitting.
An easy way is through tracking changes in error with time (known as learning curves).</p>
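<p>As a rough sketch of this in Sklearn (the layer sizes are illustrative; <code>loss_curve_</code> records the training loss per iteration, which we can plot as a crude learning curve):</p>
<pre><code class="lang-python">import pandas as pd
from sklearn.neural_network import MLPClassifier

# A small feedforward network
network = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500)
network.fit(train_input, train_output)
pd.Series(network.loss_curve_).plot()  # watch for the loss flattening out
</code></pre>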
<p>Deep learning is an extremely rich field, so there is far too much to discuss here.
In fact, Scikit learn is a machine learning library, with little deep learning ability (compared to <a target='_blank' rel='noopener noreferrer'  href="https://pytorch.org/">PyTorch</a> or <a target='_blank' rel='noopener noreferrer'  href="https://www.tensorflow.org/">TensorFlow</a>).</p>
<h5 id="decision-trees">Decision Trees</h5>
<p><a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/tree.html">Decision trees</a> are simple and quick ways to model relationships.
They are basically a <em>tree of decisions</em> which helps to decide on what class or label a datapoint belongs to.
Decision trees can be used for regression problems too.
Although simple, to avoid overfitting, several hyperparameters must be chosen.
These all, in general, relate to how deep the tree is and how many decisions are to be made.</p>
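<p>A minimal sketch (the depth and split values are illustrative):</p>
<pre><code class="lang-python">from sklearn.tree import DecisionTreeClassifier

# max_depth and min_samples_split limit how many decisions the tree can make
tree = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
tree.fit(train_input, train_output)
</code></pre>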
<h5 id="k-means">K-Means</h5>
<p>We can group unlabeled data into several <em>clusters</em> using <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/clustering.html#k-means">k-means</a>.
Normally the number of clusters present is a chosen hyperparameter.</p>
<p>K-means works by trying to optimize (reduce) some criterion (i.e. function) called inertia.
It can be thought of like trying to minimize the distance from a set of <em>centroids</em> to each data point.</p>
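<p>A minimal sketch (the cluster count is an illustrative hyperparameter choice):</p>
<pre><code class="lang-python">from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=3)
clusters.fit(input_data)
clusters.labels_  # which cluster each data point was assigned to
</code></pre>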
<h4 id="ensembles">Ensembles</h4>
<h5 id="random-forests">Random Forests</h5>
<p>Random forests are combinations of multiple decision trees trained on random subsets of the data (bootstrapping).
This process is called bagging and allows random forests to obtain a good fit (low bias and low variance) with complex data.</p>
<p>The rationale behind this can be likened to democracy.</p>
<blockquote>
<p>One voter may vote for a bad candidate, but we&#39;d hope that the majority of voters make informed, positive decisions</p>
</blockquote>
<p>For <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">regression</a> problems, we average each decision tree&#39;s outputs, and for <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">classification</a>, we choose the most popular one.
This <em>might not always work, but we generally assume it will</em> (especially with large datasets with multiple columns).</p>
<p>Another advantage of random forests is that insignificant features shouldn&#39;t negatively impact performance, because of the democratic-esque bootstrapping process!</p>
<p>Hyperparameter choices are the same as those for decision trees but with the number of decision trees as well.
For the aforementioned reasons, more trees equal less overfitting!</p>
<p><em>Note that random forests sample random subsets of both rows and columns, with replacement!</em></p>
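<p>A minimal sketch (the hyperparameter values are illustrative):</p>
<pre><code class="lang-python">from sklearn.ensemble import RandomForestRegressor

# n_estimators is the number of decision trees in the forest
forest = RandomForestRegressor(n_estimators=100, max_depth=5)
forest.fit(train_input, train_output)
</code></pre>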
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591933623422/ih9FpjpMj.jpeg" alt="forest.jpg"></p>
<h5 id="gradient-boosting">Gradient Boosting</h5>
<p>Ensemble models like AdaBoost or <a target='_blank' rel='noopener noreferrer'  href="https://xgboost.readthedocs.io/">XGBoost</a> work by stacking one model on top of another.
The assumption here is that each successive weak learner will correct for the flaws of the previous one (hence called boosting).
Hence, the combination of models should provide the advantages of each model without its potential pitfalls.</p>
<p>The iterative approach means each previous model&#39;s performance affects the current one, and better models are given a higher priority.
Boosted models perform slightly better than bagging models (a.k.a random forests), but are also slightly more likely to overfit.
Sklearn provides AdaBoost for <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html">classification</a> and <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html">regression</a>.</p>
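<p>A minimal sketch (again with an illustrative number of weak learners):</p>
<pre><code class="lang-python">from sklearn.ensemble import AdaBoostRegressor

# Each successive weak learner tries to correct the previous one's mistakes
boosted = AdaBoostRegressor(n_estimators=50)
boosted.fit(train_input, train_output)
</code></pre>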
<h2 id="chapter-5-production">Chapter 5 - Production</h2>
<p>This is the last but potentially most important part of the process 🧐.
We&#39;ve put in all this work, and so we need to go the distance and <strong>create something impressive</strong>!</p>
<p>There are a variety of options.
<a target='_blank' rel='noopener noreferrer'  href="https://www.streamlit.io/">Streamlit</a> is an exciting option for data-oriented websites, and tools like Kotlin, Swift and Dart can be used for Android/iOS development.
JavaScript with frameworks like VueJS can also be used for extra flexibility.</p>
<p><em>After trying most of these I honestly would recommend sticking to <a target='_blank' rel='noopener noreferrer'  href="https://www.streamlit.io/">Streamlit</a>, since it is so much easier than the others!</em></p>
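<p>To give a feel for why, here is a sketch of a tiny Streamlit app (the title and CSV path are hypothetical; save it as a script and launch it with <code>streamlit run app.py</code>):</p>
<pre><code class="lang-python">import pandas as pd
import streamlit as st

data = pd.read_csv("file_path.csv")  # placeholder path, as in the earlier examples
st.title("Energy Demand Explorer")  # hypothetical app title
st.line_chart(data)  # an interactive chart straight from a dataframe
</code></pre>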
<p>Here it is important to start with a vision (the simpler the better) and try to find out which parts are most important.
Then try and specifically work on those.
Continue till completion!
For websites, a hosting service like <a target='_blank' rel='noopener noreferrer'  href="https://www.heroku.com/">Heroku</a> will be needed, so the rest of the world can see the amazing end-product of all our hard work 🤯😱.</p>
<p>Even if none of the above options suit the scenario, a report or article covering what we&#39;ve done, what we&#39;ve learnt and any suggestions/lessons learnt, along with a well documented GitHub repository, are indispensable!
<em>Make sure that readme file is up to date.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1591933241648/LMVyc0w61.jpeg" alt="presentation.jpg"></p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>I really hope this article has helped you out!
For updates <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_">follow me on Twitter</a>.</p>
<p>If you enjoyed this, you may also like <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-complete-coding-practitioners-handbook-ck9u1vmgv03kg7bs1e5zwit2z">The Complete Coding Practitioners Handbook</a> which goes through each and every practical coding tool you&#39;ll need to know.
If you&#39;re lost considering what project to take on, consider checking out my zero-to-hero guides on <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-nlp-project-edition-ck6zsqtbo05srdfs135o8blcf">choosing a project</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19">collecting your own dataset through webscraping</a>.</p>
<p><em>Photos by <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/pCqzMe04s8g">National Cancer Institute</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/JKUTrJ4vK00">Dane Deaner</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/z3cMjI6kP_I">ThisisEngineering RAEng</a>, <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/D4LDw5eXhgg">Adam Nowakowski</a> and <a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/3ap0EoGXXGk">Guilherme Caetano</a> on Unsplash.</em>
The goodness of fit graph is a modified version of the <a target='_blank' rel='noopener noreferrer'  href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html">Sklearn documentation</a></p>
]]></content:encoded></item><item><title><![CDATA[The Complete Coding Practitioners Handbook]]></title><description><![CDATA[Git, debugging, testing, the terminal, Linux, the cloud, networking,  patterns/antipatterns - what even is this mess?
Don't worry we'll go through from beginning to end (all the way, I promise) everything you need to know to collaborate proficiency w...]]></description><link>https://www.kamwithk.com/the-complete-coding-practitioners-handbook</link><guid isPermaLink="true">https://www.kamwithk.com/the-complete-coding-practitioners-handbook</guid><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[programmer]]></category><category><![CDATA[learning]]></category><category><![CDATA[code]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Tue, 05 May 2020 15:11:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1588690873051/OVgGvewK7.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Git, debugging, testing, the terminal, Linux, the cloud, networking,  patterns/antipatterns - what even is this mess?
Don&#39;t worry, we&#39;ll go through from beginning to end (all the way, I promise) everything you need to know to collaborate proficiently with others.</p>
<h1 id="why-so-many-tools-">Why so many tools?</h1>
<p>We&#39;re flooded with tools which are all touted as <em>essential to boost productivity</em>, but... why so many of them?
To answer this let&#39;s start at the very beginning and slowly work our way through our coding journey!
We all started on a small solo project working to build an app, create a simple model, or just to finish an assignment.
As we begin to code we notice that it just... doesn&#39;t run 😢 and so we sigh, take a deep breath in and begin to look for what went wrong.
The first bug is just a small innocent typo, but with time we start running into more and more silly pesky bugs 🐞, each one a slight bit harder to deal with than the last!
Once we read our code, find the typo and fix it (a little golden <em>debugging</em>) our coding journey continues, and we work on creating something slightly more impressive.</p>
<p>We soon get to a crossroad, we finish working on our small little program and want to work on something slightly more ambitious (yay)!
Although we&#39;re ambitious, we notice one small thing - we make a good few mistakes.
Like any good student, we get a few books, read a few articles, watch a few videos, and before long we&#39;ve learned several <em>design patterns</em> which make for a nice, smooth coding experience and <em>antipatterns</em>... to avoid like the plague.</p>
<p>Now with a few sophisticated patterns/antipatterns in mind, we feel like we&#39;re ready to show the world our coding prowess!
We start naive and nervous but with passion, and so through gathering a few friends together, we begin a new chapter of our lives 😅.
The work is fun and everyone wants to play their part, but soon one question arises - <strong>how can we work together</strong>?
At first, emailing/messaging code from one person to another works fine... but then a few more people pitch in, and combining every line of code becomes - unmanageable!
In a moment of chaos, one man did the impossible though, Linus Torvalds extended his olive branch and gave us Git - the perfect system to collaborate with others.</p>
<p>Eventually, we approach another challenge, although we&#39;re writing the code just fine... we feel bogged down by our workflow.
To our surprise, there&#39;s an easy and elegant solution - Linux and the terminal.
Linus Torvalds proposes Linux as an <em>alternative to Windows</em> (the ugly behemoth) and with it a terminal to write code in a fashion which completely <em>bash</em>&#39;s Windows.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1588752185776/4r4lV7IeZ.jpeg" alt="programmer.jpg"></p>
<p>Now with our workflow smoothened out, there are just a few questions left - how can we run this code anywhere and what if we need... more?
Luckily for us, the dot com boom unfolds and the internet is ablaze!
What we once had to run <em>on our machines, can now be run on the cloud</em> (other people&#39;s/companies&#39; servers).
Now we can run and distribute progressively larger (and more heavyweight) code right from the comfort of our houses!</p>
<h1 id="the-epochs">The epochs</h1>
<h2 id="chapter-1-debugging">Chapter 1 - Debugging</h2>
<p>Our code is bound to have problems... even if we&#39;re geniuses, they&#39;ll still crop up!
We can&#39;t <em>completely</em> avoid them, but we can approach each problem in just the right way, so we&#39;re able to smoothly eliminate it.
There&#39;s a simple technique to help with this:</p>
<ol>
<li>SIMPLIFY - Keep it simple stupid, the simpler it is the easier it is to find the problem!</li>
<li>EXPLORE - It&#39;s fine when we don&#39;t know what&#39;s wrong, relax and start exploring, use a few print statements, read a few errors and try to figure things out 😌</li>
<li>ISOLATE - Try to find where your code goes south (focused effort reveals bugs quickest)</li>
</ol>
<p>Now I know it&#39;s <em>easier said than done</em>, but <strong>just try this out</strong>... it makes a big difference!
Just remember to keep calm, take a deep breath 🫁 and continue, if it&#39;s a bug you&#39;ll find and destroy it with time and effort 😌!</p>
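<p>As a tiny sketch of EXPLORE/ISOLATE in practice (the function and its usage are hypothetical):</p>
<pre><code class="lang-python">def average(numbers):
    total = sum(numbers)
    # EXPLORE: peek at intermediate values to see where things go south
    print(f"total={total}, count={len(numbers)}")
    # ISOLATE: uncomment to pause here in Python's built-in debugger
    # breakpoint()
    return total / len(numbers)
</code></pre>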
<h2 id="chapter-2-testing">Chapter 2 - Testing</h2>
<p>Our code works... or does it?
Testing is all about finding whether something which <em>seems to work fine</em> actually <em>works fine</em>.
It&#39;s about finding whether your changes break how things work (likely in a subtle way).</p>
<p>Testing can be simple, or complex.
At its simplest, it&#39;s about looking at what we think our code does and double-checking just that, in a more complex light it&#39;s about writing <strong>small pieces of code (unit or integration tests) to test the code</strong> (yes, code to test the code).
Unit tests are for small isolated tests/scenarios and integration tests for larger/more realistic ones.
Although this sounds simple (so far), testing is extremely nuanced as the way we write code has an extremely large impact on our ability to test it (hence knowledge of patterns/antipatterns may be useful)!</p>
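<p>A minimal sketch of a unit test (the function is hypothetical; a runner like <a target='_blank' rel='noopener noreferrer' href="https://docs.pytest.org/">pytest</a> picks up functions named <code>test_*</code> automatically):</p>
<pre><code class="lang-python">def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

def test_celsius_to_fahrenheit():
    # One small, isolated check of one behaviour
    assert celsius_to_fahrenheit(0) == 32
    assert celsius_to_fahrenheit(100) == 212
</code></pre>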
<p><em>There&#39;s a lot to testing and I&#39;m not an expert, but I hope that this is enough to get you going/give you some sense of direction...</em></p>
<h2 id="chapter-3-design-patterns-antipatterns">Chapter 3 - Design Patterns/Antipatterns</h2>
<p>Patterns and antipatterns are just good and bad coding practices we should try and use more/less respectively.
Although at their heart <strong>design patterns/antipatterns are simple, they tend to be sorely overcomplicated</strong>!
In essence, we see good and bad code all the time, so learning these comes naturally, however lots of books/articles go into fine detail by naming and shaming.</p>
<p>All <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/Software_design_pattern#Classification_and_list">design patterns</a> have three basic purposes, to help <em>create</em>, organise (<em>structural</em>) or communicate (<em>behavioural</em>) between classes and objects.</p>
<p>A few examples:</p>
<ul>
<li>Singleton - creating classes which are only initialised (used) once</li>
<li>Strategy - when we abstract (group) multiple algorithms (or models) into one class so they can easily be swapped out (see the sketch after this list)</li>
<li>Observer - when multiple objects need to know about when an event is triggered we can distinguish between observers and callers</li>
</ul>
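<p>As a tiny illustration of the strategy pattern (a hypothetical wrapper, assuming any model with Sklearn-style <code>fit</code>/<code>predict</code> methods):</p>
<pre><code class="lang-python">from sklearn.svm import SVR

class ModelStrategy:
    """Groups interchangeable models behind one interface."""
    def __init__(self, model):
        self.model = model

    def train(self, inputs, outputs):
        self.model.fit(inputs, outputs)

    def predict(self, inputs):
        return self.model.predict(inputs)

# Swapping the underlying algorithm is now a one-line change
strategy = ModelStrategy(SVR())
</code></pre>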
<p>Since <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/Anti-pattern#Software_design">antipatterns</a> are just mistakes, there are a good few that exist:</p>
<ul>
<li>Analysis paralysis - when we&#39;re stuck planning and never start coding</li>
<li>Cargo cult programming - when we use code without understanding it</li>
<li>Rule of credibility - the last 10% of our work takes 90% of our time</li>
<li>Big ball of mud - when all our code is in one large clump</li>
<li>Spaghetti code - where our code isn&#39;t cleanly separated</li>
<li>Poltergeist - creating excess classes/code for no reason</li>
<li>Repeated logic/redundant code - can just use classes/functions when code is used in multiple places</li>
<li>Ambiguous naming of variables and functions - names should be short but still express meaning</li>
<li>Magic strings - fixed values with an unknown purpose</li>
</ul>
<p><em>Note it&#39;s more practical to pick these all up through carefully inspecting code (especially off Stack Overflow)!</em></p>
<h2 id="chapter-4-git">Chapter 4 - Git</h2>
<p>Git is the collaboration one-stop-shop!
It is elegant and beautiful once we learn to use it... but seemingly not before that 😧.
Don&#39;t worry though, it&#39;s quite simple, Git works through tracking what changes we make (hence it&#39;s called <em>version control</em>), and it does this by breaking up our timeline into chunks that we&#39;ve <strong>committed to using</strong> (commits).</p>
<p>We may now ask though - how does this help to combine our changes?
Luckily for us, it&#39;s not too difficult to interpret: Git stores our work in <em>repositories</em> which can be shared and <em>forked/cloned</em>.
Whenever we make changes we can <em>commit</em> these and then <em>push</em> them out to our online repositories (technically called remote repositories).
Then once we&#39;re ready to share our brilliant code we can <strong><em>pull</em> others over to see/confirm what we&#39;ve done</strong> (with a pull request)!
Although this all just sounds weirdly social right now, it gets useful when Git provides us with overviews of our changes, so we&#39;re certain that our team&#39;s outstanding work won&#39;t collide/conflict with our work.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1588752091542/GpiA9RXHz.jpeg" alt="git.jpg"></p>
<p>Now there are a few more technical ways we can use Git, primarily through segmenting work/progress into <em>branches</em> and providing special ways to <em>combine our changes</em>.
<em>Branches</em> allow us to highlight particular parts of our codebase which we&#39;d like to share, whilst also allowing us to isolate certain <em>features</em> which may be unstable/not quite ready yet!
The first way to combine branches is to <strong><em>merge</em> changes</strong> by adding the changes made into a new commit.
The second is to <em>replay one branch&#39;s changes on another</em> (which we call a <em>rebase</em>).
Which one we use depends on our situation:</p>
<ul>
<li>When we try to make our commit history as simple as possible, a rebase is an amazing and flexible option</li>
<li>If we need to remove, modify, combine or change the order of commits, to keep a simple and clean history, only a rebase will suffice</li>
<li>However, just like time travel, <strong>a rebase is dangerous whenever we do it on anything others are using</strong><ul>
<li>In practice <strong>only rebase non-published/unused code</strong> (this is often referred to as the <em>golden rule</em>)</li>
</ul>
</li>
</ul>
<p>Now that we&#39;ve discussed the difficult concepts, let us take a look at the terminal (explained further below) commands we can use:</p>
<p>To clone a repository</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> git_website_url
</code></pre>
<p>To add a file/folder to be tracked in the next commit (stores changes at the time the command&#39;s run)</p>
<pre><code class="lang-bash">git add new_file_or_folder_location
</code></pre>
<p>To commit</p>
<pre><code class="lang-bash">git commit -m <span class="hljs-string">"added amazing new features"</span>
</code></pre>
<p>To change branches</p>
<pre><code class="lang-bash">git checkout my_branch
</code></pre>
<p>To create and switch to a new branch</p>
<pre><code class="lang-bash">git checkout -b my_new_branch
</code></pre>
<p>To push</p>
<pre><code class="lang-bash">git push
</code></pre>
<p>To merge branches</p>
<pre><code class="lang-bash">git merge my_feature_branch
</code></pre>
<p>To rebase a branch (n is the number of commits to consider)</p>
<pre><code class="lang-bash">git rebase -i HEAD~n
</code></pre>
<p>To add an upstream branch</p>
<pre><code class="lang-bash">git remote add upstream original_repo_url
</code></pre>
<p>To sync a local repository (to its remote) </p>
<pre><code class="lang-bash">git fetch upstream
git merge upstream/master
</code></pre>
<p>A few mistakes to avoid:</p>
<ul>
<li>The URL to a Git repository doesn&#39;t include any specific file/folder</li>
<li>We fork repositories to keep an isolated version to work with ourselves before we&#39;re ready to pull together our work (so our changes don&#39;t affect each other in the middle of things)<ul>
<li>So the URL to enter when <em>cloning</em> a repo to work with is <em>your forked version</em>, and then the original repository&#39;s main branch becomes the forked repository&#39;s <em>upstream branch</em> (as it&#39;s likely newer)</li>
<li>Be <strong>careful when copy-pasting their URLs as they&#39;re quite easy to mix the wrong way round</strong></li>
<li>Note the <strong>upstream branch only has to be set once</strong></li>
</ul>
</li>
<li>Pull requests happen through an online UI (i.e. the GitHub website) not the terminal (normally)</li>
<li>Once we start an interactive rebase, <strong>carefully read the provided options</strong></li>
</ul>
<p><em><a target='_blank' rel='noopener noreferrer'  href="https://www.atlassian.com/git/tutorials/">Atlassian documentation</a> provides further details and examples of how to use Git.</em></p>
<h2 id="chapter-5-linux-and-the-terminal">Chapter 5 - Linux and the Terminal</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1588752126703/m6Oates_2.jpeg" alt="mobo.jpg">
As explained above, Linux is an amazing replacement for Windows (it&#39;s free by the way) which is far more flexible and lightweight!
One distinct feature is the inbuilt powerful terminal (called bash) which allows us to perform complex tasks easily.</p>
<p>Here are the essential commands:</p>
<p>List files</p>
<pre><code class="lang-bash">ls
ls my_folder
</code></pre>
<p>Check current location (i.e. current folder/directory)</p>
<pre><code class="lang-bash"><span class="hljs-built_in">pwd</span>
</code></pre>
<p>Change directory (into another folder)</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> folder_path
</code></pre>
<p>Move a file/folder</p>
<pre><code class="lang-bash">mv old_location new_location
</code></pre>
<p>Copy a file</p>
<pre><code class="lang-bash">cp file_location copy_location
</code></pre>
<p>Copy a folder</p>
<pre><code class="lang-bash">cp -r folder_location copy_location
</code></pre>
<p>Run another program (like a text editor, normally vi, vim or nano)</p>
<pre><code class="lang-bash">program_location
</code></pre>
<p>Although these commands don&#39;t seem like anything out of the ordinary, the terminal provides a solid way to do a wide variety of tasks!</p>
<p><em>Note if you ever enter a text editor you can&#39;t seem to close (likely vi/a variant of vi) hit escape and then :q!</em></p>
<h1 id="going-further">Going further</h1>
<p>For more information, <a target='_blank' rel='noopener noreferrer'  href="https://missing.csail.mit.edu/">The Missing Semester of Your CS Education</a> is a useful guide.
Thanks for reading and I really hope that this has helped you out!</p>
<p><em><a target='_blank' rel='noopener noreferrer'  href="https://unsplash.com/photos/w7ZyuGYNpRQ">Photo by Kevin Ku on Unsplash</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Zero to Hero... Data Collection through Web Scraping]]></title><description><![CDATA[Why?
Machine learning is cool, but we can't really do much without data.
So let's kick off our journey the right way through web scraping!
Now I'm going to preface this post by stating there are two options:

Popular and easy to manage data
Unique an...]]></description><link>https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping</link><guid isPermaLink="true">https://www.kamwithk.com/zero-to-hero-data-collection-through-web-scraping</guid><category><![CDATA[Data Science]]></category><category><![CDATA[web scraping]]></category><category><![CDATA[nlp]]></category><category><![CDATA[projects]]></category><category><![CDATA[side project]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Sun, 01 Mar 2020 06:40:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1583044805187/wJ-8-AzXl.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>Machine learning is cool, but we can&#39;t really do much without data.
So let&#39;s kick off our journey the right way through web scraping!</p>
<p>Now I&#39;m going to preface this post by stating there are two options:</p>
<ul>
<li>Popular and easy to manage data</li>
<li>Unique and niche data</li>
</ul>
<p>For this mini-series we&#39;re aiming to create a <strong>unique, exciting and impressive</strong> project, so we are skipping over the overused and cliche projects!
Instead, we&#39;ve chosen to create a video game recommendation system.</p>
<h1 id="how-we-got-here">How we got here</h1>
<p>Before diving into code and pulling logic apart I feel like it&#39;s important to give a brief overview of how we can collect data:</p>
<ul>
<li>Decide on what data we&#39;re looking for (game titles, summaries, and reviews)</li>
<li>Research and find potential data sources (Wikipedia and Metacritic)</li>
<li>Scrape and export final input data</li>
</ul>
<h1 id="primary-precedence">Primary precedence</h1>
<p>Let me start with a little insight on this project: I&#39;m actually pairing up with a friend who has just started their machine learning journey.
Now I understand how it feels to work with a minimal amount of knowledge, as I started my first projects this way!</p>
<p>I&#39;ve never mentioned this before, but I started programming as a naive kid.
I always wanted to program, and I thought I could when I was ~13.
So I joined a game-creating competition where I completely bombed out!
Know why?
I thought I knew it all, I thought tutorials would explain all the details for any project I chose, and that they&#39;d explain how everything should come together magically.
Thus I was completely unprepared for the challenge ahead of me!</p>
<p>My friend is at the same stage I was back then, and it&#39;s reflected by his order of precedence (what he considered most to least important).
Now let me explain why this is important with an anecdote.</p>
<p>My friend just started working on his first task, to find box art for all our games.
This seemed easy to him, but alas, <strong>it was quite deceptive</strong>!
You see Metacritic and Wikipedia provide lists of games with small icons displayed next to each title, but they&#39;re <em>just <strong>small thumbnail images</strong></em>.
This detail was easy to miss, so he breezed over it and tried to collect images without a second thought (a costly mistake).</p>
<p>Our mistakes are similar to wandering through a jungle without reading a map!
We may hike halfway through a jungle, but only after thoroughly examining a map can we possibly hope to find which direction we need to travel in!
These wild bets which have extremely low chances of paying off (i.e. getting us to our desired location) are what I call naive assumptions!</p>
<h1 id="the-challenge">The challenge</h1>
<p>I hear people asking how naive assumptions relate to <em>our amazing project</em>.</p>
<blockquote>
<p>Data collection is when you start writing code for your project, and so it&#39;s when you&#39;re most vulnerable to the naive tendency</p>
</blockquote>
<p>As your guide I need to explain how we can avoid naive assumptions:</p>
<ol>
<li>Start with finding context about the problem you&#39;re solving</li>
<li>Proceed to decompose your problem into smaller pieces</li>
<li>Strategise by planning how to overcome the problem</li>
<li>Fill in the blanks</li>
</ol>
<p>The main takeaway here is to always research the mechanisms at play <strong>before working</strong>.
For us, this means <strong>closely inspecting <em>where</em> your data will come from and <em>how</em> it will be used</strong>.
Don&#39;t worry if you&#39;re naturally not thinking this way (yet), as it&#39;ll come with time and practice!</p>
<h1 id="writing-the-code">Writing the code</h1>
<p>Finally some code!</p>
<p>Let&#39;s start off by deciding our output file format and location.
Since we&#39;re using Scrapy it&#39;s relatively easy!
Just set the <code>FEED_FORMAT</code> (file format) and <code>FEED_URI</code> (file path) settings and we&#39;re good to go.</p>
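<p>As a rough sketch, one place these settings can live is on the spider itself via <code>custom_settings</code> (the spider name and output path below are just placeholders):</p>
<pre><code class="lang-python">import scrapy

class GamesSpider(scrapy.Spider):
    name = "games"  # placeholder spider name
    start_urls = ["https://en.wikipedia.org/wiki/List_of_PC_games"]

    # Export scraped items as JSON to a local file
    custom_settings = {
        "FEED_FORMAT": "json",
        "FEED_URI": "games.json",
    }
</code></pre>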
<p>Our first task will be to collect a list of PC video game titles and summaries.
For simplicity, we&#39;re going to use our primary data source Wikipedia!
Although it&#39;s possible to easily download all of Wikipedia, we only really need articles on PC games, so we&#39;re going to find the URL for each article.
There&#39;s a <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/List_of_PC_games">pre-created list on Wikipedia</a>, which we&#39;ll be scraping for titles and URLs.</p>
<p>We start by isolating the parts of the page with relevant data.
To do this we can use CSS or XPath selectors.
The first step to finding the right selector is understanding a page&#39;s HTML structure.
So we can take a look at our browser&#39;s inspector, which provides a small interactive view of the HTML code (in Firefox press <code>ctrl+shift+i</code>).
To easily find an HTML element on the page use the select element feature (press <code>ctrl+shift+c</code> in Firefox).</p>
<p>After playing around it&#39;s apparent that all our game titles are <code>a</code> elements.
We can try and use a CSS selector of <code>a</code>, however, you&#39;ll notice that its output is composed of more than just game titles.
We can become more specific, <code>i &gt; a</code>, and we&#39;ll get slightly better results!
The trick for CSS selectors is to try your luck by starting fairly generic and progressively making them more specific.
In the end, our experimentation revealed that <code>td &gt; i &gt; a</code> selects our desired game elements.
Now though we actually want two things:</p>
<ul>
<li>The name of each game</li>
<li>The URL to each Wikipedia article on a game</li>
</ul>
<p>To find specific parts of our element we can use <em>attributes</em> via <code>::</code>!
For URLs use <code>::attr(href)</code> and for text <code>::text</code>.
Note that we don&#39;t want the HTML elements themselves, so we can use the <code>get</code> or <code>getall</code> functions to extract our data!
Here are the CSS selectors for collecting our two pieces of information from the page (see the sketch after the list):</p>
<ul>
<li>The name of each game: <code>td &gt; i &gt; a::text</code></li>
<li>The URL to each Wikipedia article on a game: <code>td &gt; i &gt; a::attr(href)</code></li>
</ul>
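<p>As a rough sketch, inside a Scrapy spider&#39;s <code>parse</code> method (where <code>response</code> is the downloaded page), the extraction could look like this:</p>
<pre><code class="lang-python">def parse(self, response):
    # Game titles are the text of the td &gt; i &gt; a elements
    titles = response.css("td &gt; i &gt; a::text").getall()
    # Wikipedia article URLs come from the same elements' href attribute
    urls = response.css("td &gt; i &gt; a::attr(href)").getall()

    for title, url in zip(titles, urls):
        # urljoin converts relative paths into absolute URLs
        yield {"title": title.strip(), "url": response.urljoin(url)}
</code></pre>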
<p>If you&#39;ve looked at our Wikipedia webpage you should realise that we need to manually switch between pages 😑.
We luckily have links to each page present at the top of the Wikipedia list!
Now that means we&#39;ll need to find a CSS selector, and then loop through each page of Wikipedia games.</p>
<p>If you followed my advice (general -&gt; specific) you&#39;d eventually realise that some elements specify <em>classes</em>.
We can easily use CSS <em>classes</em> by writing <code>.</code> followed by the class name!
These classes are another amazing way to make selectors more specific.
Using our new knowledge of <em>classes</em> we can create a final CSS selector for each page: <code>div.toc &gt; div &gt; ul &gt; li &gt; a::attr(href)</code>.</p>
<p>All we have to do now is to loop through and scrape each page.
With Scrapy all you have to do is <code>yield</code> another response object!
To do this easily for relative paths use the <code>response.follow</code> function.
Note here that we would scrape the first page twice if we go through each link on the page.</p>
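<p>Putting that together, here&#39;s a hedged sketch of the pagination logic (<code>parse_page</code> is a hypothetical callback which applies our earlier title/URL selectors; Scrapy&#39;s built-in duplicate filter also helps drop repeated requests):</p>
<pre><code class="lang-python">def parse(self, response):
    # Queue up every page of the games list via the table of contents links
    for href in response.css("div.toc &gt; div &gt; ul &gt; li &gt; a::attr(href)").getall():
        # response.follow resolves relative URLs against the current page
        yield response.follow(href, callback=self.parse_page)

    # Also scrape the games on the current (first) page
    yield from self.parse_page(response)
</code></pre>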
<p>Now these are just game titles and article URLs, but we can use Wikipedia&#39;s <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/Special:Export">special export articles page</a> to download the articles themselves!
However, this isn&#39;t automated, and so a small amount of manual work will have to be done to replicate the results.</p>
<h1 id="technical-summary">Technical summary</h1>
<p>CSS selector tips and tricks:</p>
<ul>
<li>Elements can be specified like <code>div</code></li>
<li>Using <code>.class_name</code> allows you to specify an element&#39;s class</li>
<li>Attributes can be specified after <code>::</code><ul>
<li>URLs are found with the <code>attr(href)</code> attribute</li>
</ul>
</li>
</ul>
<p>For more take a look at <a target='_blank' rel='noopener noreferrer'  href="https://www.w3schools.com/cssref/css_selectors.asp">w3schools complete list</a>.</p>
<p>To extract data from HTML elements use the <code>get</code> or <code>getall</code> function and to remove extraneous characters use <code>strip</code>.</p>
<h1 id="the-trouble-with-web-scraping">The trouble with web scraping</h1>
<p>I hope you appreciate how well designed Scrapy is!
It makes the process of web scraping relatively easy to do.</p>
<p>But why am I raving about Scrapy when I previously emphasised how <em>troublesome</em> web scraping can be?
Well, unfortunately for us, whilst Scrapy makes <em>how</em> we web scrape easy, it can&#39;t remove inconsistencies in web pages 😥.
We&#39;ve discussed together how to web scrape a Wikipedia page, and I sincerely hope that web scraping with me has provided you with some transferable value (i.e. you become capable of replicating this yourself).</p>
<h1 id="my-code">My code</h1>
<p>To understand more about how to create a Scrapy <em>spider</em> to extract information from our website see <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/GameRec">my GitHub repository</a>!</p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>I know data collection (especially web scraping) can be time-consuming, and often feels like a pain.
However, being persistent and fighting through the data collection process will definitely give us a strong start!</p>
<p>If you haven&#39;t already, check out the <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/zero-to-hero-nlp-project-edition-ck6zsqtbo05srdfs135o8blcf">first intro to building end-to-end machine learning projects</a>.
Make sure to stay tuned for the next post where I actively go through the next (and potentially most important) step of our journey (data preprocessing)!
Last but not least, make sure to <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_?ref_src=twsrc%5Etfw">follow me on Twitter</a> for updates!</p>
]]></content:encoded></item><item><title><![CDATA[Zero to Hero... NLP project edition]]></title><description><![CDATA[Why?
So you just went through another tutorial, another MOOC.
Your guilty gut instinct knows another one just won't help, but you have no idea what else to do.

You've been told a project can go a long way to show initiative, motivation and even skil...]]></description><link>https://www.kamwithk.com/zero-to-hero-nlp-project-edition</link><guid isPermaLink="true">https://www.kamwithk.com/zero-to-hero-nlp-project-edition</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[nlp]]></category><category><![CDATA[projects]]></category><category><![CDATA[side project]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Mon, 24 Feb 2020 01:42:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1582519877890/zJGDPW4qv.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>So you just went through another tutorial, another MOOC.
Your guilty gut instinct knows another one just <em>won&#39;t help</em>, but you have no idea what else to do.</p>
<blockquote>
<p>You&#39;ve been told a project can go a long way to show initiative, motivation and even skill, but... you&#39;ve got no idea what to do, where to go or even how to start!</p>
</blockquote>
<p>Apparently, it should &quot;genuinely motivate you to work&quot;, but... how?
What should your project even be about?</p>
<p>Your first instinct for starting a project may be to go with the flow, and see where it leads.
You can try, be my guest, but that&#39;s like learning to navigate around a jungle yourself (you <em>might</em> find your target location... <em>eventually</em>)!
Instead, you could try conducting deep research into the surrounding landscape and wildlife.
However, when you&#39;re exposed to real, aggressive animals you&#39;ll notice the <strong>major difference between theoretical knowledge and real practical skills</strong>.</p>
<p>What you really need is a guide!
Someone who knows the place well enough to give you a brief tour around.
Someone able to point out general points of interest and significant events to watch out for.
This way when you&#39;re left alone, you&#39;ll roughly know what to do and how to live/navigate around the jungle yourself.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1582519903805/ugPgvIcbQ.png" alt="todo.png"></p>
<blockquote>
<p>It&#39;s easy to get stuck without any sense of direction during a project (like in a jungle)</p>
</blockquote>
<p>Please don&#39;t get caught alone in the jungle!
Instead, allow me to be your guide.
In this mini-series, we together will go from the ground up building a unique (and therefore impressive) Natural Language Processing project!
I hope this mini-series inspires you to start your own project whilst also offering a solid foundation to replicate the process yourself!</p>
<h1 id="a-light-bulb">A light bulb</h1>
<p>It&#39;s great to start off intrinsically motivated to work, but it&#39;s just... unrealistic.
How many times have you been so blown away by a random perfect idea that was so aligned to what you were <em>about</em> to do, that you could <strong>take immediate action</strong> and bring it to life?
If your answer is daily, you&#39;re lucky, kudos to you.</p>
<p>However, if you&#39;ve got no idea what to do, I&#39;ve got your back:</p>
<blockquote>
<p>With time and effort, learning and absorbing information, you&#39;ll eventually encounter an <em>impressive</em> and <em>worthy</em> idea.</p>
</blockquote>
<p>This means if you&#39;ve been ruminating for a while, take a break and instead learn.
You can learn through articles, books, videos, anything you like... just bask in information!
The trick here is to <strong>continuously question how these ideas could be used</strong> in the real world.
It doesn&#39;t matter whether you completely understand yet either (with time you&#39;ll learn...), just make sure to replicate this process until you come across a gem!</p>
<p>If you&#39;re unsure about your ability to finish a project, <em>that&#39;s fine</em>!
What&#39;s the worst thing that will happen?
The worst thing is that you&#39;ll have learnt more about what you can and can&#39;t do next time!
Just remember that dedication pays off in the long run.
If you research each idea, eventually after five or so you&#39;ll find something golden!</p>
<h1 id="breakdown">Breakdown</h1>
<p>Finding an idea was damn hard, but <strong>following through... now that&#39;s something entirely different</strong>!
Lucky for confused basic simpletons like us, there&#39;s an easy way to break down the entire project into a few key stages:</p>
<ol>
<li><p>Data collection</p>
<p>Machine learning is cool, but we can&#39;t really do much without data.
So let&#39;s kick off our journey the right way by finding quality data!
There are two options:</p>
<ul>
<li>Popular and easy to manage data</li>
<li>Unique and niche data</li>
</ul>
<p><em>What could possibly warrant going through the trouble of creating a special dataset just for a single project</em>?
<strong>Simple</strong>, you want to be a <strong>problem solver</strong>.</p>
<blockquote>
<p>You want to show your ability to <strong>solve new, unique and challenging problems</strong>, not simple tutorials!</p>
</blockquote>
<p>I know finding data from unique sources will bring about numerous seemingly unnecessary hurdles, but they&#39;re part of the fun.</p>
</li>
<li><p>Process data (make sure it&#39;s formatted correctly and cleaned)</p>
<p>Processing data could be the most important part of your project.</p>
<blockquote>
<p>High-quality data yields high-quality results.</p>
</blockquote>
<p>I know you&#39;ll be tempted to fast track your progress by simplifying your preprocessing pipeline.
But just remember the saying <em>&quot;garbage in == garbage out&quot;</em>.
It means your lazy unprocessed data manifests itself within your model.
Hence a lazy mediocre model will generate sub-optimal output (despite attempts to algorithmically improve results).</p>
</li>
<li><p>Modelling</p>
<p>The highly anticipated part of any data science project is creating a model.
There are loads of complex models (and modifications to them) you can make, however, <strong>start simple and incrementally improve afterwards</strong>.</p>
</li>
<li><p>Application</p>
<blockquote>
<p>You thought you&#39;d <em>finished</em>?
Hahaha... the model itself isn&#39;t nearly as impressive as a tangible application!</p>
</blockquote>
<p>You have a variety of options, a website, mobile app, browser extension...
Choose whatever application makes sense!</p>
<p>Creating a final application may take a little time and require you to broaden your skillset further, but it pays itself off extremely fast.
Remember that one well-thought-out project is far better than a dozen small and careless mediocre ones!</p>
</li>
</ol>
<p>Cover image (modified) sourced from <a target='_blank' rel='noopener noreferrer'  href="https://commons.wikimedia.org/wiki/File:To_Do_List_Scene_Vector.svg">here</a></p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>I know that creating our first NLP project won&#39;t be quick nor easy.
But I think it&#39;s important to find <strong>why you&#39;re doing a project</strong>.
Is it to demonstrate <em>how fast</em> you can work or <em>how <strong>able</strong> you are to do <strong>meaningful and realistic</strong> work</em>?
I hope this mini-series helps you!</p>
<p>If you&#39;ve liked this, make sure to stay tuned for the next post where I actively go through the first step of our journey (data collection).
Make sure to <a target='_blank' rel='noopener noreferrer'  href="https://twitter.com/kamwithk_?ref_src=twsrc%5Etfw">follow me on Twitter</a> for updates!</p>
]]></content:encoded></item><item><title><![CDATA[Snake classification report]]></title><description><![CDATA[Why?
5.4 million people are bitten by snakes every year and 81,000-138,000 people die due to snake bites each year.
Preventing snake bites is clearly a major issue, in need of preventative measures to save lives.
The project Snaked demonstrates a pot...]]></description><link>https://www.kamwithk.com/snake-classification-report</link><guid isPermaLink="true">https://www.kamwithk.com/snake-classification-report</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Sun, 16 Feb 2020 04:28:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581827280771/gadkuVXEa.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>5.4 million people are bitten by snakes every year, and 81,000-138,000 people die due to snake bites each year.
Snake bites are clearly a major issue, in need of preventative measures to save lives.
The project Snaked demonstrates a potential solution to the problem of identifying the snake which has bitten a person.
A proof of concept app is available on the <a target='_blank' rel='noopener noreferrer'  href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">Google Play Store</a> which uses this model.</p>
<h1 id="challenge">Challenge</h1>
<ul>
<li>Each snake species varies in shape, colour, size, texture and more.</li>
<li>Over 3000 snake species have already been discovered worldwide!</li>
<li>Different snake species may look nearly identical, yet vary significantly</li>
</ul>
<p>So it is clearly not an easy job to identify one snake from another, even though it&#39;s essential that we do.</p>
<h1 id="frameworks-and-methods-utilized">Frameworks and Methods Utilized</h1>
<p>PyTorch is used for ALL deep learning code and NumPy for numerical computations.
The code is separated into the main file outlining the chosen algorithms (can be swapped/modified easily) and abstracted code to help train, evaluate and create an executable for any model easily.</p>
<p>There are several novel ideas/techniques used here which help to create neater/more readable pythonic code.
The three primary examples of new PyTorch techniques are:</p>
<ul>
<li>The Item tuple class which allows modular and further extensible code (when extra data needs to be processed)</li>
<li>Use of super-convergence/the one-cycle policy in pure PyTorch instead of a highly abstracted library (i.e. Fast.AI)/bare python/numpy</li>
<li>Use of a dictionary to dictate how different data sets should be split up to allow easy modifications of data proportions (i.e. switching between full dataset for training and small batches for ensuring all code runs without errors)</li>
</ul>
<p>Although several models were trialed out on the dataset, in the end, a MobileNetV2 model provided the best results, whilst also remaining relatively lightweight and so able to run on low-power devices like phones (essential for the app).
The final model was trained using LDAM loss instead of cross-entropy loss due to the dataset being imbalanced (some classes having far more samples than others).
Note that the codebase has support for classical rebalancing, however, experimental trials show that this method causes overfitting extremely early on.
The model manages to achieve around a 70% accuracy and F1 score.
More details about the <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/modern-algorithms-choosing-a-model-ck6hwiovf004ndfs1529fcppy">choice of model</a>, <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/improving-your-computer-vision-model-ck6k3em3b0113dfs16bray6ee">how it was improved</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too-ck6jb8er400r0dfs1agw7d0y4">lessons learnt from this project</a> can be found on <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/">The Data Science Swiss Army Knife blog</a>.</p>
<h1 id="android-application">Android Application</h1>
<p>The android application created for this project was written with Kotlin, using Fotoapparat (for easy camera support) and PyTorch (for utilizing the chosen PyTorch model) libraries.
Due to the lightweight MobileNetV2 model no network is required to connect to a server (which would normally run the computations itself).
This is intentionally done to facilitate use within remote locations!
Please note that this is only a sample proof of concept app and if you&#39;re bitten consult a medical expert immediately.</p>
<h1 id="sources">Sources</h1>
<p>All data currently used for the project comes from <a target='_blank' rel='noopener noreferrer'  href="https://www.aicrowd.com/challenges/snake-species-identification-challenge">AIcrowd&#39;s Snake Species Identification Challenge</a>.
The images and labels have been used, however, geographic locations have been ignored to allow easy usage for any image, even if it hasn&#39;t been tagged.
The dataset allows 85 species to be labelled.
A Jupyter Notebook is also provided which demonstrates how to collect further data using Google Image searches!
All statistics used here, or within the repository are from the World Health Organisation (unless otherwise stated) or pertaining specifically to the Snaked source code.</p>
<p>For further information please see the following:</p>
<ul>
<li><a target='_blank' rel='noopener noreferrer'  href="https://www.who.int/news-room/fact-sheets/detail/snakebite-envenoming">Snakebite envenoming</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="http://apps.who.int/bloodproducts/snakeantivenoms/database/">Venomous Snakes Distribution and Species Risk Categories</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://www.aicrowd.com/challenges/snake-species-identification-challenge">AIcrowd Snake Species Identification Challenge</a></li>
</ul>
<h1 id="metrics">Metrics</h1>
<p>Throughout this report, I&#39;ll refer to F1 scores as my primary metric.
This is because of the class imbalance.
Please note that there is a major difference between the support number and the number of samples present for a class.
The former refers to the count in the validation dataset, whereas the latter refers to the count in the training dataset.
The number present in the training set will be primarily used to judge the effect of a skewed dataset on model predictions.</p>
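<p>For reference, a table like the one further below can be generated with scikit-learn (a sketch, where <code>y_true</code> and <code>y_pred</code> would hold the validation labels and the model&#39;s predictions):</p>
<pre><code class="lang-python">from sklearn.metrics import classification_report

# Toy labels purely for illustration
y_true = ["cobra", "viper", "cobra", "viper"]
y_pred = ["cobra", "cobra", "cobra", "viper"]

# Prints per-class precision, recall, F1 score and support
print(classification_report(y_true, y_pred))
</code></pre>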
<h1 id="training-and-validation-graphs">Training and Validation Graphs</h1>
<h2 id="train-epoch-vs-accuracy">Train epoch vs accuracy</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581743456233/FZRKWh-vQ.png" alt="train_acc.png"></p>
<h2 id="train-epoch-vs-loss">Train epoch vs loss</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581743461828/-C68AWthQ.png" alt="train_loss.png"></p>
<h2 id="validation-epoch-vs-accuracy">Validation epoch vs accuracy</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581743468939/u2pb8qLhd.png" alt="validation_acc.png"></p>
<h2 id="validation-epoch-vs-loss">Validation epoch vs loss</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581743644871/p-ZBR3Lk_.png" alt="validation_loss.png"></p>
<p><em>Model trained for 51 epochs</em></p>
<h1 id="classification-report">Classification Report</h1>
<table>
<thead>
<tr>
<td>species</td><td>precision</td><td>recall</td><td>f1-score</td><td>support</td></tr>
</thead>
<tbody>
<tr>
<td>agkistrodon-contortrix</td><td>0.8452380952380952</td><td>0.8765432098765432</td><td>0.8606060606060606</td><td>81.0</td></tr>
<tr>
<td>agkistrodon-piscivorus</td><td>0.676923076923077</td><td>0.6027397260273972</td><td>0.6376811594202899</td><td>73.0</td></tr>
<tr>
<td>ahaetulla-prasina</td><td>0.8</td><td>0.6666666666666666</td><td>0.7272727272727272</td><td>6.0</td></tr>
<tr>
<td>arizona-elegans</td><td>0.631578947368421</td><td>0.5714285714285714</td><td>0.6</td><td>21.0</td></tr>
<tr>
<td>boa-imperator</td><td>0.6</td><td>0.5294117647058824</td><td>0.5625</td><td>17.0</td></tr>
<tr>
<td>bothriechis-schlegelii</td><td>0.8888888888888888</td><td>0.7619047619047619</td><td>0.8205128205128205</td><td>21.0</td></tr>
<tr>
<td>bothrops-asper</td><td>0.875</td><td>0.5384615384615384</td><td>0.6666666666666667</td><td>13.0</td></tr>
<tr>
<td>carphophis-amoenus</td><td>0.6551724137931034</td><td>0.6551724137931034</td><td>0.6551724137931034</td><td>29.0</td></tr>
<tr>
<td>charina-bottae</td><td>0.8947368421052632</td><td>0.7391304347826086</td><td>0.8095238095238095</td><td>23.0</td></tr>
<tr>
<td>coluber-constrictor</td><td>0.5689655172413793</td><td>0.559322033898305</td><td>0.5641025641025641</td><td>59.0</td></tr>
<tr>
<td>contia-tenuis</td><td>0.52</td><td>0.6190476190476191</td><td>0.5652173913043478</td><td>21.0</td></tr>
<tr>
<td>coronella-austriaca</td><td>0.3333333333333333</td><td>0.1</td><td>0.15384615384615383</td><td>10.0</td></tr>
<tr>
<td>crotalus-adamanteus</td><td>0.6153846153846154</td><td>0.7272727272727273</td><td>0.6666666666666667</td><td>11.0</td></tr>
<tr>
<td>crotalus-atrox</td><td>0.7747252747252747</td><td>0.8392857142857143</td><td>0.8057142857142857</td><td>168.0</td></tr>
<tr>
<td>crotalus-cerastes</td><td>0.7857142857142857</td><td>0.8461538461538461</td><td>0.8148148148148148</td><td>13.0</td></tr>
<tr>
<td>crotalus-horridus</td><td>0.7741935483870968</td><td>0.8135593220338984</td><td>0.7933884297520662</td><td>59.0</td></tr>
<tr>
<td>crotalus-molossus</td><td>0.6363636363636364</td><td>0.7</td><td>0.6666666666666666</td><td>10.0</td></tr>
<tr>
<td>crotalus-oreganus</td><td>0.5333333333333333</td><td>0.6666666666666666</td><td>0.5925925925925926</td><td>12.0</td></tr>
<tr>
<td>crotalus-ornatus</td><td>1.0</td><td>0.8181818181818182</td><td>0.9</td><td>11.0</td></tr>
<tr>
<td>crotalus-pyrrhus</td><td>0.8571428571428571</td><td>0.5454545454545454</td><td>0.6666666666666665</td><td>22.0</td></tr>
<tr>
<td>crotalus-ruber</td><td>0.7894736842105263</td><td>0.7142857142857143</td><td>0.7500000000000001</td><td>21.0</td></tr>
<tr>
<td>crotalus-scutulatus</td><td>0.8620689655172413</td><td>0.7575757575757576</td><td>0.8064516129032258</td><td>33.0</td></tr>
<tr>
<td>crotalus-viridis</td><td>0.5384615384615384</td><td>0.6363636363636364</td><td>0.5833333333333334</td><td>22.0</td></tr>
<tr>
<td>diadophis-punctatus</td><td>0.8266666666666667</td><td>0.7948717948717948</td><td>0.8104575163398693</td><td>78.0</td></tr>
<tr>
<td>epicrates-cenchria</td><td>1.0</td><td>0.5</td><td>0.6666666666666666</td><td>2.0</td></tr>
<tr>
<td>haldea-striatula</td><td>0.5454545454545454</td><td>0.47058823529411764</td><td>0.5052631578947367</td><td>51.0</td></tr>
<tr>
<td>heterodon-nasicus</td><td>0.7</td><td>0.5384615384615384</td><td>0.608695652173913</td><td>13.0</td></tr>
<tr>
<td>heterodon-platirhinos</td><td>0.7058823529411765</td><td>0.75</td><td>0.7272727272727272</td><td>48.0</td></tr>
<tr>
<td>hierophis-viridiflavus</td><td>0.5</td><td>0.4375</td><td>0.4666666666666667</td><td>16.0</td></tr>
<tr>
<td>hypsiglena-jani</td><td>0.5238095238095238</td><td>0.6111111111111112</td><td>0.5641025641025642</td><td>18.0</td></tr>
<tr>
<td>lampropeltis-californiae</td><td>0.8701298701298701</td><td>0.8170731707317073</td><td>0.8427672955974842</td><td>82.0</td></tr>
<tr>
<td>lampropeltis-getula</td><td>0.8333333333333334</td><td>0.625</td><td>0.7142857142857143</td><td>16.0</td></tr>
<tr>
<td>lampropeltis-holbrooki</td><td>0.6923076923076923</td><td>0.6428571428571429</td><td>0.6666666666666666</td><td>14.0</td></tr>
<tr>
<td>lampropeltis-triangulum</td><td>0.8166666666666667</td><td>0.8305084745762712</td><td>0.8235294117647058</td><td>59.0</td></tr>
<tr>
<td>lichanura-trivirgata</td><td>0.9333333333333333</td><td>0.8235294117647058</td><td>0.8749999999999999</td><td>17.0</td></tr>
<tr>
<td>masticophis-flagellum</td><td>0.5294117647058824</td><td>0.6428571428571429</td><td>0.5806451612903226</td><td>42.0</td></tr>
<tr>
<td>micrurus-tener</td><td>1.0</td><td>0.95</td><td>0.9743589743589743</td><td>20.0</td></tr>
<tr>
<td>morelia-spilota</td><td>0.4</td><td>0.2222222222222222</td><td>0.2857142857142857</td><td>9.0</td></tr>
<tr>
<td>naja-naja</td><td>0.8333333333333334</td><td>0.7142857142857143</td><td>0.7692307692307692</td><td>7.0</td></tr>
<tr>
<td>natrix-maura</td><td>0.25</td><td>0.125</td><td>0.16666666666666666</td><td>8.0</td></tr>
<tr>
<td>natrix-natrix</td><td>0.7857142857142857</td><td>0.4074074074074074</td><td>0.5365853658536585</td><td>27.0</td></tr>
<tr>
<td>natrix-tessellata</td><td>0.5</td><td>0.36363636363636365</td><td>0.4210526315789474</td><td>11.0</td></tr>
<tr>
<td>nerodia-cyclopion</td><td>0.5833333333333334</td><td>0.4666666666666667</td><td>0.5185185185185186</td><td>15.0</td></tr>
<tr>
<td>nerodia-erythrogaster</td><td>0.53125</td><td>0.4358974358974359</td><td>0.47887323943661975</td><td>78.0</td></tr>
<tr>
<td>nerodia-fasciata</td><td>0.6190476190476191</td><td>0.43333333333333335</td><td>0.5098039215686274</td><td>30.0</td></tr>
<tr>
<td>nerodia-rhombifer</td><td>0.7213114754098361</td><td>0.6567164179104478</td><td>0.6875</td><td>67.0</td></tr>
<tr>
<td>nerodia-sipedon</td><td>0.5084745762711864</td><td>0.5309734513274337</td><td>0.5194805194805195</td><td>113.0</td></tr>
<tr>
<td>nerodia-taxispilota</td><td>0.8461538461538461</td><td>0.55</td><td>0.6666666666666667</td><td>20.0</td></tr>
<tr>
<td>opheodrys-aestivus</td><td>0.9042553191489362</td><td>0.9444444444444444</td><td>0.9239130434782609</td><td>90.0</td></tr>
<tr>
<td>opheodrys-vernalis</td><td>0.8235294117647058</td><td>0.6666666666666666</td><td>0.7368421052631577</td><td>21.0</td></tr>
<tr>
<td>pantherophis-alleghaniensis</td><td>0.36585365853658536</td><td>0.4</td><td>0.38216560509554137</td><td>75.0</td></tr>
<tr>
<td>pantherophis-emoryi</td><td>0.5714285714285714</td><td>0.6666666666666666</td><td>0.6153846153846153</td><td>36.0</td></tr>
<tr>
<td>pantherophis-guttatus</td><td>0.9183673469387755</td><td>0.8035714285714286</td><td>0.8571428571428571</td><td>56.0</td></tr>
<tr>
<td>pantherophis-obsoletus</td><td>0.5395348837209303</td><td>0.6041666666666666</td><td>0.5700245700245701</td><td>192.0</td></tr>
<tr>
<td>pantherophis-spiloides</td><td>0.43478260869565216</td><td>0.25</td><td>0.3174603174603175</td><td>40.0</td></tr>
<tr>
<td>pantherophis-vulpinus</td><td>0.5957446808510638</td><td>0.8484848484848485</td><td>0.7</td><td>33.0</td></tr>
<tr>
<td>phyllorhynchus-decurtatus</td><td>0.6956521739130435</td><td>0.9411764705882353</td><td>0.7999999999999999</td><td>17.0</td></tr>
<tr>
<td>pituophis-catenifer</td><td>0.7073170731707317</td><td>0.7016129032258065</td><td>0.7044534412955465</td><td>124.0</td></tr>
<tr>
<td>pseudechis-porphyriacus</td><td>0.6666666666666666</td><td>0.2857142857142857</td><td>0.4</td><td>7.0</td></tr>
<tr>
<td>python-bivittatus</td><td>0.875</td><td>0.7777777777777778</td><td>0.823529411764706</td><td>9.0</td></tr>
<tr>
<td>python-regius</td><td>0.0</td><td>0.0</td><td>0.0</td><td>3.0</td></tr>
<tr>
<td>regina-septemvittata</td><td>0.56</td><td>0.6666666666666666</td><td>0.6086956521739131</td><td>21.0</td></tr>
<tr>
<td>rena-dulcis</td><td>0.6666666666666666</td><td>0.8</td><td>0.7272727272727272</td><td>10.0</td></tr>
<tr>
<td>rhinocheilus-lecontei</td><td>0.8292682926829268</td><td>0.85</td><td>0.8395061728395061</td><td>40.0</td></tr>
<tr>
<td>sistrurus-catenatus</td><td>1.0</td><td>0.8</td><td>0.888888888888889</td><td>5.0</td></tr>
<tr>
<td>sistrurus-miliarius</td><td>1.0</td><td>0.6666666666666666</td><td>0.8</td><td>6.0</td></tr>
<tr>
<td>sonora-semiannulata</td><td>0.3333333333333333</td><td>0.4</td><td>0.3636363636363636</td><td>5.0</td></tr>
<tr>
<td>storeria-dekayi</td><td>0.76</td><td>0.8465346534653465</td><td>0.8009367681498829</td><td>202.0</td></tr>
<tr>
<td>storeria-occipitomaculata</td><td>0.631578947368421</td><td>0.6486486486486487</td><td>0.64</td><td>37.0</td></tr>
<tr>
<td>tantilla-gracilis</td><td>0.5454545454545454</td><td>0.6</td><td>0.5714285714285713</td><td>10.0</td></tr>
<tr>
<td>thamnophis-cyrtopsis</td><td>0.5714285714285714</td><td>0.4444444444444444</td><td>0.5</td><td>9.0</td></tr>
<tr>
<td>thamnophis-elegans</td><td>0.47058823529411764</td><td>0.2962962962962963</td><td>0.3636363636363636</td><td>27.0</td></tr>
<tr>
<td>thamnophis-hammondii</td><td>0.5555555555555556</td><td>0.35714285714285715</td><td>0.43478260869565216</td><td>14.0</td></tr>
<tr>
<td>thamnophis-marcianus</td><td>0.8787878787878788</td><td>0.8285714285714286</td><td>0.8529411764705883</td><td>35.0</td></tr>
<tr>
<td>thamnophis-ordinoides</td><td>0.375</td><td>0.16666666666666666</td><td>0.23076923076923078</td><td>18.0</td></tr>
<tr>
<td>thamnophis-proximus</td><td>0.7543859649122807</td><td>0.7413793103448276</td><td>0.7478260869565219</td><td>58.0</td></tr>
<tr>
<td>thamnophis-radix</td><td>0.8181818181818182</td><td>0.47368421052631576</td><td>0.6</td><td>38.0</td></tr>
<tr>
<td>thamnophis-sirtalis</td><td>0.7167070217917676</td><td>0.896969696969697</td><td>0.7967698519515478</td><td>330.0</td></tr>
<tr>
<td>tropidoclonion-lineatum</td><td>0.7777777777777778</td><td>0.6363636363636364</td><td>0.7000000000000001</td><td>11.0</td></tr>
<tr>
<td>vermicella-annulata</td><td>0.0</td><td>0.0</td><td>0.0</td><td>3.0</td></tr>
<tr>
<td>vipera-aspis</td><td>0.4</td><td>0.5714285714285714</td><td>0.47058823529411764</td><td>7.0</td></tr>
<tr>
<td>vipera-berus</td><td>0.75</td><td>0.75</td><td>0.75</td><td>12.0</td></tr>
<tr>
<td>virginia-valeriae</td><td>0.2</td><td>0.14285714285714285</td><td>0.16666666666666666</td><td>7.0</td></tr>
<tr>
<td>xenodon-rabdocephalus</td><td>1.0</td><td>1.0</td><td>1.0</td><td>8.0</td></tr>
<tr>
<td>zamenis-longissimus</td><td>0.4</td><td>0.3333333333333333</td><td>0.3636363636363636</td><td>12.0</td></tr>
<tr>
<td>accuracy</td><td>0.6930662557781202</td><td>0.6930662557781202</td><td>0.6930662557781202</td><td>0.6930662557781202</td></tr>
<tr>
<td>macro avg</td><td>0.6659430597272402</td><td>0.6050948460385794</td><td>0.6247619446039016</td><td>3245.0</td></tr>
<tr>
<td>weighted avg</td><td>0.6925634979734435</td><td>0.6930662557781202</td><td>0.6866874083865039</td><td>3245.0</td></tr>
</tbody>
</table>
<p>Despite a skewed dataset, the majority of snakes had a similar precision and recall score.</p>
<h1 id="number-of-samples-vs-f1-score">Number of Samples vs F1 score</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581823824791/-F0a54VnR.png" alt="Samples vs F1.png">
The above graph of the number of samples the neural network was trained over and F1 score per class indicates two significant points:</p>
<ul>
<li>Having more samples of a class will increase the probability of a model robustly identifying the snake species correctly</li>
<li>However, an inability to collect a large number of images for every class does not necessarily create a major imbalance in a model&#39;s predictions</li>
</ul>
<p>The latter takeaway shows that LDAM loss has been largely successful!
On a side note though, when a class has few samples, the F1 score may not be completely representative of how the model will generalise.
This is primarily because, in a small number of images, only a small number of conditions can be shown and tested.
Yet, in reality, an image can be taken in any environment and transformed in a very large number of ways.</p>
<h1 id="confusion-matrix">Confusion Matrix</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581741953427/cKk37rtOR.png" alt="Confusion Matrix.png"></p>
<p>The test dataset had too few images to clearly judge anything from the confusion matrix.
This is due to a combination of having 85 classes and the rows for classes with few samples appearing quite dark.</p>
<h1 id="improvements">Improvements</h1>
<p>If this model were to be retrained, a good idea would be to use a 90-5-5, or better an 80-10-10, split.
This would ensure that there would be enough data in the testing dataset to judge where exactly the model&#39;s confusion stems from (using a confusion matrix).</p>
<p>Combining the current classification model with an additional preprocessing pipeline which includes segmentation may also allow far higher F1 scores.
This is because segmentation is able to remove extraneous noise around a snake (subtle environmental cues which prevent the model from generalising).
Additionally, this would completely avoid the chance of a snake being cropped out of an image (currently unlikely but still possible for small snakes).</p>
]]></content:encoded></item><item><title><![CDATA[Super-Convergence with JUST PyTorch]]></title><description><![CDATA[Why?
When creating Snaked, my snake classification model I needed to find a way to improve results.
Super-Convergence was just that, a way to train a model faster whilst getting better results!
HOWEVER, I found no guides on how to do it with the buil...]]></description><link>https://www.kamwithk.com/super-convergence-with-just-pytorch</link><guid isPermaLink="true">https://www.kamwithk.com/super-convergence-with-just-pytorch</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[pytorch]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Thu, 13 Feb 2020 04:56:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581569707308/6ucCpsvsf.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>When creating Snaked, my snake classification model, I needed to find a way to improve results.
Super-Convergence was just that, a way to train a model faster whilst getting better results!
HOWEVER, I found <strong>no guides</strong> on how to do it with the built-in PyTorch scheduler.</p>
<h1 id="learn-the-theory">Learn the theory</h1>
<p>Before you go through this you&#39;d probably like to know <em>what</em> super-convergence is and <em>how</em> it works.
The general gist is to increase the learning rate as much as possible at the beginning and then progressively decrease it in a cyclical fashion.
This is because larger learning rates train faster, but cause the loss to diverge.
My focus here is on PyTorch though, so I won&#39;t explain the theory any further.</p>
<p>Here&#39;s a list of resources to delve deeper:</p>
<ul>
<li><a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/abs/1708.07120">Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://medium.com/vitalify-asia/whats-up-with-deep-learning-optimizers-since-adam-5c1d862b9db0">What’s up with Deep Learning optimizers since Adam?</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://towardsdatascience.com/https-medium-com-super-convergence-very-fast-training-of-neural-networks-using-large-learning-rates-decb689b9eb0">Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://www.fast.ai/2018/07/02/adam-weight-decay/">AdamW and Super-convergence is now the fastest way to train neural nets</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://sgugger.github.io/the-1cycle-policy.html">The 1cycle policy</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html">How Do You Find A Good Learning Rate</a></li>
</ul>
<h1 id="imports">Imports</h1>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> torchvision <span class="hljs-keyword">import</span> datasets, models, transforms
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader

<span class="hljs-keyword">from</span> torch <span class="hljs-keyword">import</span> nn, optim
<span class="hljs-keyword">from</span> torch_lr_finder <span class="hljs-keyword">import</span> LRFinder
</code></pre>
<h1 id="setting-hyperparameters">Setting Hyperparameters</h1>
<h2 id="set-transforms">Set transforms</h2>
<pre><code class="lang-python">transforms = transforms.Compose([
transforms.RandomResizedCrop(size=<span class="hljs-number">256</span>, scale=(<span class="hljs-number">0.8</span>, <span class="hljs-number">1</span>)),
    transforms.RandomRotation(<span class="hljs-number">90</span>),
    transforms.ColorJitter(),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.CenterCrop(size=<span class="hljs-number">224</span>), <span class="hljs-comment">#ImgNet standards</span>
    transforms.ToTensor(),
    transforms.Normalize((<span class="hljs-number">0.485</span>, <span class="hljs-number">0.456</span>, <span class="hljs-number">0.406</span>), (<span class="hljs-number">0.229</span>, <span class="hljs-number">0.224</span>, <span class="hljs-number">0.225</span>)), <span class="hljs-comment"># ImgNet standards</span>
])
</code></pre>
<h2 id="load-the-data-model-and-basic-hyper-parameters">Load the data, model and basic hyper parameters</h2>
<pre><code class="lang-python">train_loader = DataLoader(datasets.CIFAR10(root=<span class="hljs-string">"train_data"</span>, train=<span class="hljs-keyword">True</span>, download=<span class="hljs-keyword">True</span>, transform=transforms))
test_loader = DataLoader(datasets.CIFAR10(root=<span class="hljs-string">"test_data"</span>, train=<span class="hljs-keyword">False</span>, download=<span class="hljs-keyword">True</span>, transform=transforms))

model = models.mobilenet_v2(pretrained=<span class="hljs-keyword">True</span>)

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters())

<span class="hljs-comment"># Set the device in use to GPU (when it's available)</span>
device = torch.device(<span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>)
model = model.to(device)

<span class="hljs-comment">## Find the perfect learning rate</span>
Note that doing this requires a separate library <span class="hljs-keyword">from</span> [here](https://github.com/davidtvs/pytorch-lr-finder).

```python
lr_finder = LRFinder(model, optimizer, criterion, device)
lr_finder.range_test(train_loader, end_lr=<span class="hljs-number">10</span>, num_iter=<span class="hljs-number">1000</span>)
lr_finder.plot()
plt.savefig(<span class="hljs-string">"LRvsLoss.png"</span>)
plt.close()
</code></pre>
<pre><code>Stopping early, the loss has diverged
Learning rate search finished. See the graph with {finder_name}.plot()
</code></pre><p><img src="Super-Convergence_files/Super-Convergence_8_2.svg" alt="svg"></p>
<h2 id="create-a-scheduler">Create a scheduler</h2>
<p>Use the one cycle learning rate scheduler (for super-convergence).</p>
<p>Note that the scheduler uses the maximum learning rate from the graph.
To choose it, look for the point with the steepest downward gradient (slope).</p>
<p>The number of epochs to train for and the steps per epoch must be entered in.
The steps per epoch is the number of batches per epoch, i.e. <code>len(train_loader)</code>.</p>
<pre><code class="lang-python">scheduler = optim.lr_scheduler.OneCycleLR(optimizer, <span class="hljs-number">2e-3</span>, epochs=<span class="hljs-number">50</span>, steps_per_epoch=len(train_loader))
</code></pre>
<h1 id="train-model">Train model</h1>
<p>Train the model for 50 epochs.
Print stats after every epoch (loss and accuracy).</p>
<p>Different schedulers should be called at different points within the code.
Placing the scheduler&#39;s step in the wrong place will cause bugs, so with the one-cycle policy ensure that the step method is called straight after each batch.</p>
<pre><code class="lang-python">best_acc = <span class="hljs-number">0</span>
epoch_no_change = <span class="hljs-number">0</span>

<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, <span class="hljs-number">50</span>):
    print(f<span class="hljs-string">"Epoch {epoch}/49"</span>.format())

    <span class="hljs-keyword">for</span> phase <span class="hljs-keyword">in</span> [<span class="hljs-string">"train"</span>, <span class="hljs-string">"validation"</span>]:
        running_loss = <span class="hljs-number">0.0</span>
        running_corrects = <span class="hljs-number">0</span>

        <span class="hljs-comment"># PyTorch model's state must be changend</span>
        <span class="hljs-comment"># As layers like dropout work differently depending on state</span>
        <span class="hljs-keyword">if</span> phase == <span class="hljs-string">"train"</span>:
            model.train()
        <span class="hljs-keyword">else</span>: model.eval()

        <span class="hljs-comment"># Loop through the dataset</span>
        <span class="hljs-keyword">for</span> (inputs, labels) <span class="hljs-keyword">in</span> train_loader:
            <span class="hljs-comment"># Transfer data to the GPU</span>
            inputs, labels = inputs.to(device), labels.to(device)

            <span class="hljs-comment"># Reset the gradient (so the gradient doesn't accumilate)</span>
            optimizer.zero_grad()

            <span class="hljs-keyword">with</span> torch.set_grad_enabled(phase == <span class="hljs-string">"train"</span>):
                <span class="hljs-comment"># Predict the label which the model gives the max probability (of being true)</span>
                outputs = model(inputs)
                _, preds = torch.max(outputs, <span class="hljs-number">1</span>)
                loss = criterion(outputs, labels)

                <span class="hljs-keyword">if</span> phase == <span class="hljs-string">"train"</span>:
                    <span class="hljs-comment"># Backprop</span>
                    loss.backward()
                    optimizer.step()

                    <span class="hljs-comment"># Super convergence changes the learning rate</span>
                    scheduler.step()

            running_loss += loss.item() * inputs.size(<span class="hljs-number">0</span>)
            running_corrects += torch.sum(preds == labels.data)

        <span class="hljs-comment"># Calculate and output metrics</span>
        epoch_loss = running_loss / len(self.data_loaders[phase].sampler)
        epoch_acc = running_corrects.double() / len(self.data_loaders[phase].sampler)
        print(<span class="hljs-string">"\nPhase: {}, Loss: {:.4f}, Acc: {:.4f}"</span>.format(phase, epoch_loss, epoch_acc))

        <span class="hljs-comment"># Stop the model from training further if it hasn't improved for 5 consecutive epochs</span>
        <span class="hljs-keyword">if</span> phase == <span class="hljs-string">"validation"</span> <span class="hljs-keyword">and</span> epoch_acc &gt; best_acc:
            epoch_no_change += <span class="hljs-number">1</span>

            <span class="hljs-keyword">if</span> epoch_no_change &gt; <span class="hljs-number">5</span>:
                <span class="hljs-keyword">break</span>
</code></pre>
<h1 id="thanks-for-reading-">Thanks for READING!</h1>
<p>I hope this is easy enough to understand relatively quickly.
When I first implemented super-convergence it took me a long time to figure out how to use the scheduler (I couldn&#39;t find any code which utilized it).
If you liked this blog post consider checking out other <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/improving-your-computer-vision-model-ck6k3em3b0113dfs16bray6ee">ways to improve your model</a>.
If you&#39;d like to see how super-convergence is used in a real project, look no further than <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Snaked">my snake classification project</a>.</p>
<p>Cover image sourced <a target='_blank' rel='noopener noreferrer'  href="https://pixnio.com/objects/mechanism-metal-gears-steel-iron#">here</a></p>
]]></content:encoded></item><item><title><![CDATA[Improving your computer vision model]]></title><description><![CDATA[Why?
So you've cleaned your data, written some basic code to train a model, but now don't know where to go next.
Don't worry, I've got your back.
I'm going to explain in as much detail as I can the tricks I've learnt about which can help improve any ...]]></description><link>https://www.kamwithk.com/improving-your-computer-vision-model</link><guid isPermaLink="true">https://www.kamwithk.com/improving-your-computer-vision-model</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[neural networks]]></category><category><![CDATA[Computer Vision]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Thu, 13 Feb 2020 01:57:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581558940825/qeU9eeUo2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>So you&#39;ve cleaned your data, written some basic code to train a model, but now don&#39;t know where to go next.
Don&#39;t worry, I&#39;ve got your back.
I&#39;m going to explain in as much detail as I can the tricks I&#39;ve learnt about which can help improve any model.</p>
<h1 id="data-augmentation">Data Augmentation</h1>
<p>Data augmentations are modifications you can make to your input data.
They can help make your model more robust, and able to generalise better.
There is a wide variety available, but I&#39;ll just describe some of the ones I&#39;ve tried:</p>
<ul>
<li>Randomly crop 80%-100% of each image</li>
<li>Adjust the aspect ratio</li>
<li>Random colour jitter</li>
<li>Resize images to have 256-pixel height</li>
<li>Centre crop the image to 224x224 pixels</li>
</ul>
<h1 id="super-convergence">Super-Convergence</h1>
<p>The point of super-convergence is to speed up training, whilst also improving performance (a win-win situation).
It works based on the idea that higher learning rates train models fast and can act as regularizers.
It decreases the learning rate with time in a cyclic fashion.
Note that you can try out the AdamW optimizer as well, as it&#39;s supposed to give better results.
I&#39;m writing a whole article on how to use super-convergence in pure PyTorch, so if interested make sure to check that out!
For the nitty-gritty details take a look at the <a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/abs/1708.07120">original paper</a>.</p>
<h1 id="learning-rate-finder-tuning-hyperparameters">Learning rate finder/tuning hyperparameters</h1>
<p>This one is usually done in combination with super-convergence, but can also be used itself.
The idea is to plot a graph of learning rate vs loss.
In this way, you can find the maximum learning rate you can safely use (without the gradient diverging and loss increasing).</p>
<p>Note that whilst training, the learning rate does need to be decreased, or else the loss will increase again (super-convergence does this for you).</p>
<h1 id="test-time-augmentation">Test time augmentation</h1>
<p>Test time augmentation involves averaging the results of a model over several <em>augmented images</em>.
This can yield higher results, however, I decided against it due to unusually high VRAM usage with the PyTorch library I found (easier with <a target='_blank' rel='noopener noreferrer'  href="https://www.fast.ai/">Fast.AI</a> though).</p>
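<p>For intuition, here&#39;s a minimal sketch of the idea in plain PyTorch (assuming the augmentation transform works directly on tensors):</p>
<pre><code class="lang-python">import torch

def tta_predict(model, image, augment, n_augments=5):
    """Average softmax outputs over several augmented copies of one image."""
    model.eval()
    with torch.no_grad():
        outputs = [model(augment(image).unsqueeze(0)).softmax(dim=1)
                   for _ in range(n_augments)]
    return torch.stack(outputs).mean(dim=0)
</code></pre>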
<h1 id="balance-dataset">Balance dataset</h1>
<p>This one is REALLY important.
I originally didn&#39;t notice this, but someone who looked at my original code saw that I hadn&#39;t balanced the number of images present per class.
This meant that classes with fewer images would be significantly less likely to be predicted.
The reason for this is that classes with more images have a much higher contribution to loss functions (standard ones at the least).</p>
<p>The best way to deal with this is to <em>get more data</em>, however in many problems, this just <em>can&#39;t be done</em>.</p>
<p>The first, simplest alternative is oversampling the minority classes, but this can lead to overfitting.
Another way is undersampling the majority classes, but this may discard important samples!
Instead, try out a newer loss function like <a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/pdf/1906.07413.pdf">LDAM Loss</a>.</p>
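<p>Another common remedy (separate from the LDAM loss above) is weighted sampling: draw images from rare classes more often so each batch is roughly balanced. A sketch, assuming a dataset and a matching list dataset_labels exist:</p>
<pre><code class="lang-python">import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = torch.tensor(dataset_labels)          # one class index per image (assumed)
class_counts = torch.bincount(labels)
weights = 1.0 / class_counts[labels].float()   # rarer classes get picked more often

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
</code></pre>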
<h1 id="confusion-matrices">Confusion Matrices</h1>
<p>I&#39;m not an expert here, but in essence a confusion matrix can be used to find out which classes aren&#39;t being classified well.
From there you can analyse those classes, try to see why the model isn&#39;t doing so well on them, and then try to improve it (i.e. maybe more images are needed of one class).</p>
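<p>Scikit-learn makes this a one-liner; y_true and y_pred below stand for your validation labels and predictions:</p>
<pre><code class="lang-python">from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
# Row i shows where images whose true class is i ended up;
# large off-diagonal entries reveal which classes get confused with each other
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
</code></pre>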
<h1 id="progressive-resizing">Progressive resizing</h1>
<p>This one&#39;s simple and effective.
You start by training your model on small, low-resolution images, and then progressively increase the resolution.
The reason this increases model robustness is that the model is forced to look for <strong>simple patterns before complex ones</strong>.</p>
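<p>One way this might be set up with torchvision (the stage sizes here are arbitrary picks):</p>
<pre><code class="lang-python">from torchvision import transforms

# Train for a few epochs at each resolution, smallest first
for size in [64, 128, 224]:
    stage_tfms = transforms.Compose([
        transforms.RandomResizedCrop(size, scale=(0.8, 1.0)),
        transforms.ToTensor(),
    ])
    # rebuild the dataset/loader with stage_tfms and run the training loop here
</code></pre>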
<h1 id="consider-metrics-other-than-accuracy">Consider metrics other than accuracy</h1>
<p>Accuracy may not be the best metric for your problem.
Metrics like the F1 score can be equally useful, if not more so!</p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).
If this has helped you out consider checking out <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too-ck6jb8er400r0dfs1agw7d0y4">my article on problems I encountered whilst building my first project</a>!</p>
<p>Cover image sourced <a target='_blank' rel='noopener noreferrer'  href="https://pxhere.com/en/photo/1449707">here</a></p>
]]></content:encoded></item><item><title><![CDATA[How I published an app and model to classify 85 snake species (and how you can too)]]></title><description><![CDATA[Why?
I had just finished my last MOOC and couldn't stop wondering whether I was ready to start a project.
I was frightened, scared, and lacked confidence.
However, after weeks of contemplation, I bit the bullet and announced I'd create a simple image...]]></description><link>https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too</link><guid isPermaLink="true">https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[learning]]></category><category><![CDATA[projects]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 12 Feb 2020 12:48:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581474478774/CG9sEyV3P.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>I had just finished my last MOOC and couldn&#39;t stop wondering whether I was ready to start a project.
I was scared and lacked confidence.
However, after weeks of contemplation, I bit the bullet and announced I&#39;d create a simple image classification model.</p>
<p>Now here I am with an app <a target='_blank' rel='noopener noreferrer'  href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">officially available on the Play Store</a> and a  <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Snaked">GitHub repo</a> with its open-sourced code.
I&#39;d like to explain the hurdles I&#39;ve faced and the lessons I&#39;ve learnt overcoming them (in hopes it&#39;ll help you out too)!</p>
<h1 id="my-journey">My Journey</h1>
<ul>
<li>Created my own dataset from Google Image search results</li>
<li>Started with the simplest linear regression model possible</li>
<li>Tried to use my own custom <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/convolutional-neural-networks-basic-theory-in-a-nutshell-ck6hl9wdl000qi5s1venh4nk1">CNN</a></li>
<li>Switched to a <a target='_blank' rel='noopener noreferrer'  href="https://www.aicrowd.com/challenges/snake-species-identification-challenge"><strong>FAR larger dataset</strong></a></li>
<li>Took <a target='_blank' rel='noopener noreferrer'  href="http://cs231n.stanford.edu/">Stanford&#39;s CS231n</a> to strengthen my foundations (theoretical knowledge)</li>
<li>Created the basic code to train a <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/modern-algorithms-choosing-a-model-ck6hwiovf004ndfs1529fcppy">MobileNetV2 model</a></li>
<li>Learnt about Super Convergence (article about Super Convergence with PURE PyTorch coming soon...)</li>
<li>Created an <a target='_blank' rel='noopener noreferrer'  href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">Android app for Snaked</a> which could take and import photos, before outputting the snake species</li>
</ul>
<p>In short, I made a <strong>LOT of mistakes</strong>, and that&#39;s precisely how I learnt.
Scraping Google for images allowed me to appreciate the effort involved in creating a dataset with <strong>120,000 images</strong>.
Starting with linear regression was a grand, stupid and hilarious blunder, but it taught me the value of <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/convolutional-neural-networks-basic-theory-in-a-nutshell-ck6hl9wdl000qi5s1venh4nk1">CNN&#39;s</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/modern-algorithms-choosing-a-model-ck6hwiovf004ndfs1529fcppy">pre-trained models</a> first-hand.
Trying linear regression for such a task also forced me to find out how and <strong>why</strong> neural networks work!
The long training times and mediocre results from a plain pre-trained model caused me to find Super Convergence!</p>
<blockquote>
<p><em>My mistakes were like stages</em>, without <strong>each and every one</strong> of them I wouldn&#39;t have learnt anywhere near the amount I did</p>
</blockquote>
<h1 id="the-benefits">The benefits</h1>
<p>An obvious question is why bother overcoming hurdle after hurdle when a free MOOC can <em>teach</em> the same content (maybe even taking less time and effort).
I&#39;ve already <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/how-to-learn-data-science-and-eventually-become-an-expert-ck6j0fd7o00lodfs1bbhafek9">answered the question</a>, but in short, it comes down to:</p>
<blockquote>
<p>You remember information which you repeatedly use, and progressively forget all else</p>
</blockquote>
<p>Now, this doesn&#39;t mean you don&#39;t go through any MOOC&#39;s/tutorials, but just <strong>make sure you don&#39;t get &quot;stuck in tutorial hell&quot;</strong>!
Instead, if you know the basics, apply what you&#39;ve learnt right now to create a cool, interesting project you can show off!</p>
<h1 id="challenges-i-overcame">Challenges I overcame</h1>
<h2 id="should-i-start-now-">Should I start now?</h2>
<p>The fact you&#39;re questioning your current skill, knowledge and theoretical foundation level indicates that you&#39;re aware of the limits of your understanding!
This doesn&#39;t mean that you&#39;re stupid, or not ready; instead, you&#39;ve learnt enough to know that there&#39;s way more ahead of you.</p>
<blockquote>
<p>Just understand that there will always be more to learn, so may as well start using what you already know</p>
</blockquote>
<h2 id="is-my-idea-good-enough-">Is my idea good enough?</h2>
<p>Two things you can do to judge:</p>
<ul>
<li>Is my idea too simple/complex?</li>
<li>Ask someone else</li>
</ul>
<p>If you&#39;re unsure how complex the project will be, consider how others have fared on similar tasks.
One way to find out is to search for online articles or research papers.
If you find hundreds then the problem is probably too simple, but if you only find a few it&#39;s probably unrealistic (or you&#39;ve got a genius idea).
Knowledge on the topic at hand is a must, so just research the topic and see what you find out!</p>
<p>If you don&#39;t already have any connections to data scientists, then you&#39;re going to have to reach out (like I did)!
I&#39;d do this ASAP no matter what, as it&#39;s <strong>always incredibly useful to have a variety of opinions</strong> on any situation.
My personal method is to reach out to data scientists near me saying I&#39;m studying machine learning and looking for some advice.
Most people decline, but if you reach out to enough of them it still works!</p>
<h2 id="what-if-it-fails-">What if it fails?</h2>
<blockquote>
<p>Find out <strong>WHY</strong> your project can&#39;t work!</p>
</blockquote>
<p>If finding out <em>why</em> an idea can&#39;t work doesn&#39;t unravel another solution, then you&#39;ve found out something new... and that&#39;s an accomplishment!</p>
<h2 id="i-don-t-know-what-to-do-">I don&#39;t know what to do!</h2>
<p>Find similar problems to the one you&#39;re trying to solve.
For me, I started with tutorials on how to use PyTorch to classify digits (MNIST) and more complex objects (CIFAR100).
I followed the tutorials and figured out how they achieved their task.
I then used <strong>transfer learning</strong>, replicating what each tutorial had done, but this time for my own problem.</p>
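<p>To give a flavour of what that replication looked like, here&#39;s a minimal transfer learning sketch in PyTorch (the ResNet backbone is a stand-in for whichever pre-trained model you pick):</p>
<pre><code class="lang-python">import torch.nn as nn
from torchvision import models

num_classes = 85  # e.g. one per snake species

# Load an ImageNet pre-trained backbone and freeze its feature extractor
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Swap the classifier head for one matching your own problem
model.fc = nn.Linear(model.fc.in_features, num_classes)
</code></pre>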
<blockquote>
<p>Of course, I was far from ready to tackle the full challenge but <strong>with time I figured out more and more</strong>.</p>
</blockquote>
<p>If you&#39;re still stuck though, you may actually need to go back to the books (or courses).</p>
<h2 id="nothing-is-working-">Nothing is working!</h2>
<p>Just stick at it and after a while, something will click!
At the beginning of creating my first model, none of my code ran, but eventually (after a few days), I managed to find the bug and fix it.
Now I know how to run training and evaluation loops with PyTorch!
Note that it&#39;s often a minor change which finally revives the code (so play around, debug a lot and you&#39;ll figure it out).</p>
<h2 id="i-don-t-understand-how-everything-works-">I don&#39;t understand <em>how</em> everything works?</h2>
<p>Theory can be difficult.
You can have a working model, but not know anything about how the transfer learning model, optimizer, loss function... or something else you&#39;ve used works.
But think about how long it takes to train a model... hours, days, weeks?
If you can write the code, do that, run it and just learn the theory whilst it&#39;s training.
You&#39;re both <em>training</em> then (pun intended)!</p>
<h2 id="it-works-but-how-can-i-improve-it-">It works, but how can I improve it?</h2>
<p>I&#39;ve got a blog post specifically on <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/improving-your-computer-vision-model-ck6k3em3b0113dfs16bray6ee">how to improve your model</a>.
Take a look once you&#39;ve got a working model!</p>
<h2 id="what-do-i-do-after-creating-a-decent-model-">What do I do after creating a decent model?</h2>
<p>This is the cycle:</p>
<blockquote>
<p>Learn, create, improve, show off, rinse and repeat!</p>
</blockquote>
<p>Just create blogs, create projects and continue through that cycle.</p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).
If this has helped you out consider checking out <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/modern-algorithms-choosing-a-model-ck6hwiovf004ndfs1529fcppy">how to choose a model</a>!</p>
<p>Cover image sourced <a target='_blank' rel='noopener noreferrer'  href="https://en.wikipedia.org/wiki/Snake">here</a></p>
]]></content:encoded></item><item><title><![CDATA[How to learn data science and (eventually) become an expert]]></title><description><![CDATA[Why?
When I first found out I wanted to become a data scientist I was completely overwhelmed by the vast breadth and depth of the field of study.
The overwhelming complexity blinded me on how to learn.
I began with a rough outline of the different re...]]></description><link>https://www.kamwithk.com/how-to-learn-data-science-and-eventually-become-an-expert</link><guid isPermaLink="true">https://www.kamwithk.com/how-to-learn-data-science-and-eventually-become-an-expert</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Wed, 12 Feb 2020 07:45:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581493541329/t44KhcrBR.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>When I first realised I wanted to become a data scientist I was completely overwhelmed by the vast breadth and depth of the field.
The overwhelming complexity blinded me to how I should learn.
I began with a <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-machine-learning-data-science-path-ck6frs5ws01773cs171xj1ydi">rough outline</a> of the different resources I&#39;d found available and planned to take a few MOOC&#39;s before <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/moving-towards-the-real-deal-hint-projects-ck6fz696h019v3ns17ks2jyij">moving on to projects</a>.</p>
<p>Well, I did that, but made a few costly mistakes along the way which slowed down my progress.
I&#39;m going to go through the problems I encountered and how to overcome them.</p>
<h1 id="mooc-s-vs-projects">MOOC&#39;s vs Projects</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581490933923/Kj556QfD5.png" alt="MOOC&#39;s vs Projects.png"></p>
<p>The diagram above (I admit I went overboard) explains why you should start by learning from a MOOC and then immediately progress to projects.
It makes 3 points:</p>
<ul>
<li>MOOC&#39;s teach core knowledge, but in a way where you&#39;re likely to forget</li>
<li>You can&#39;t start a project if you don&#39;t already have a rough understanding of the problem you&#39;re trying to solve</li>
<li>So you can learn from a MOOC to get the core knowledge and then immediately do projects to consolidate knowledge</li>
</ul>
<p>The key takeaway is to ALWAYS start your learning off with theory (MOOC&#39;s or books) and then immediately follow up with projects!</p>
<h1 id="the-ideal-first-mooc">The ideal first MOOC</h1>
<p>How can you decide which MOOC is right for you?
That comes down to which MOOC&#39;s give you enough foundational knowledge that, afterwards, you can understand problems well enough to know roughly where to look and how to piece different bits of knowledge together.</p>
<p>From what I&#39;ve seen and learnt so far, I&#39;d recommend <a target='_blank' rel='noopener noreferrer'  href="https://www.fast.ai/">Fast.AI</a> or <a target='_blank' rel='noopener noreferrer'  href="https://www.deeplearning.ai/">DeepLearning.AI</a> for this.
The main difference between the two courses is their approach to teaching.
<a target='_blank' rel='noopener noreferrer'  href="https://www.fast.ai/">Fast.AI</a> is top-bottom (starts with applied high-level stuff before going into the nitty-gritty details) whereas <a target='_blank' rel='noopener noreferrer'  href="https://www.deeplearning.ai/">DeepLearning.AI</a> is bottom-top (starts with the basic maths and then builds up into modern cutting edge content).
Do note though that there are a plethora of other courses I&#39;ve <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/the-machine-learning-data-science-path-ck6frs5ws01773cs171xj1ydi">already categorised/broken down before</a>.</p>
<h1 id="thanks-for-reading">THANKS FOR READING</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).</p>
]]></content:encoded></item><item><title><![CDATA[Modern Algorithms: Choosing a Model]]></title><description><![CDATA[Why?
There are so many models to trial out, but training a neural network is a slow process when you have a substantial amount of data.
Here I'll step you through the full model life cycle, starting with a list of potential models, explaining their d...]]></description><link>https://www.kamwithk.com/modern-algorithms-choosing-a-model</link><guid isPermaLink="true">https://www.kamwithk.com/modern-algorithms-choosing-a-model</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[neural networks]]></category><category><![CDATA[Mobile Development]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Fri, 07 Feb 2020 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581426358473/fW3ve3rnl.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>There are so many models to trial out, but training a neural network is a slow process when you have a substantial amount of data.
Here I&#39;ll step you through the full model life cycle, starting with a list of potential models, explaining their differentiating factors and then how one final model works.</p>
<h1 id="the-contenders">The contenders</h1>
<ul>
<li>ResNext152</li>
<li>ResNext50_32x4d</li>
<li>Inception V2</li>
<li>VGG16</li>
<li>SqueezeNet 1_1</li>
<li>MobileNetV2</li>
</ul>
<h1 id="choosing-a-model">Choosing a Model</h1>
<p>When choosing a neural network model it&#39;s important to consider each model&#39;s <a target='_blank' rel='noopener noreferrer'  href="https://paperswithcode.com/sota/image-classification-on-imagenet">performance</a> as well as its computational cost.</p>
<p>Consider where your model will be used (e.g. mobile), as this may impact your choice of models (e.g. a lightweight model may be required).
The second consideration is the training time.
If you have a powerful graphics card like the RTX 2080Ti with a large amount of VRAM (i.e. 11 GB+) then training time likely wouldn&#39;t be a large concern.
However, if you have a large dataset and a GPU with less VRAM, a model designed to run quickly will train far quicker!
MobileNet balances both factors quite well, and so is the focus of this blog post.</p>
<p>For completeness, note that there&#39;s a classic pattern where greater resource usage, on average, leads to only small performance improvements.
This is because adding more layers will cause <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/vanishing-gradients-ck6hu7fmp003wd9s1hg776af0">vanishing/exploding gradients</a>, so older networks like Inception tended to perform worse than modern models.</p>
<h1 id="mobilenetv2">MobileNetV2</h1>
<h2 id="a-drop-in-replacement-for-standard-cnn-s">A drop-in replacement for standard CNN&#39;s</h2>
<p>A &quot;factorized&quot; version of regular CNN&#39;s can be created by dividing CNN&#39;s into two parts:</p>
<ul>
<li>Filtration (depth wise convolution)</li>
<li>Finding new linear combinations of features (pointwise convolution)</li>
</ul>
<p>The first, depthwise convolution applies a single convolutional kernel to each input channel, before aggregating the results.
The second, pointwise convolution is a 1x1 convolution which finds new linear combinations of those features.</p>
<p>Rationally, the model reduces computation as follows:</p>
<ul>
<li>Standard convolutions cost <img src="https://latex.codecogs.com/gif.download?height_%7Binput%7D%20*%20width_%7Binput%7D%20*%20depth_%7Binput%7D%20*%20%7Bno%5C_kernels%7D%20*%20%7Bkernel%5C_size%7D%5E2" alt="standard convolution cost"></li>
<li>Depthwise separable convolutions cost <img src="https://latex.codecogs.com/gif.download?height_%7Binput%7D%20*%20width_%7Binput%7D%20*%20depth_%7Binput%7D%20*%20%28%7Bkernel%5C_size%7D%5E2%20+%20%7Bno%5C_kernels%7D%29" alt="depthwise separable convolution cost"></li>
</ul>
<p>The ratio between the two costs works out to 1/no_kernels + 1/kernel_size<sup>2</sup>, so with a (standard) kernel size of 3 these convolutions will be 8-9x faster!</p>
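<p>In PyTorch the depthwise part is just a grouped convolution, so the factorized pair is tiny. A minimal sketch:</p>
<pre><code class="lang-python">import torch.nn as nn

def depthwise_separable(in_channels, out_channels):
    # Filtration: one 3x3 kernel per input channel (groups=in_channels)
    # Recombination: a pointwise 1x1 convolution mixing the channels
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, groups=in_channels),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
    )
</code></pre>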
<h2 id="linear-bottleneck-layers">Linear bottleneck layers</h2>
<p>Let me preface this by asserting that <strong>channels</strong> mentioned here are <strong>layers</strong> (like RGB).</p>
<p>There are two assumptions at play:</p>
<ul>
<li>When ReLU transforms maintain a non-zero volume, the transformation is linear</li>
<li>ReLU can preserve input information, but only if it originated in a low-dimensional subspace</li>
</ul>
<p>Using <strong>linear bottleneck layers</strong> (convolutional layers without ReLU&#39;s) allows greater preservation of information.
This is because linear functions <strong>don&#39;t collapse any channel</strong>, unlike ReLU&#39;s.
Note that the paper describing this emphasizes that collapsing a channel is fine when that information is likely stored elsewhere in another channel.
This is explained further later on.</p>
<h2 id="inverted-residuals">Inverted residuals</h2>
<p>Very much like ResNet&#39;s, MobileNetV2 uses <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/vanishing-gradients-ck6hu7fmp003wd9s1hg776af0">residual blocks</a> to improve gradient flow!
Here though, the links are between the linear bottleneck layers which reduce the output dimensions.
This design choice is more efficient, as the computations (matrix multiplications) occur between smaller matrices.</p>
<h2 id="relu6">ReLU6</h2>
<p>Throughout the MobileNetV2 implementation, ReLU6 is always used instead of a regular ReLU.
ReLU6 is a modified version of the original Rectified Linear Unit which stops activations from growing beyond 6, i.e. ReLU6(x) = min(max(0, x), 6).
Capping the activations allows ReLU6 to be more robust.
However, the <strong>6</strong> itself is an arbitrary choice of value.</p>
<h2 id="layers">Layers</h2>
<ul>
<li>1x1 convolution with ReLU</li>
<li>Depthwise convolution (with 3x3 kernel) as a residual bottleneck layer</li>
<li>Pointwise 1x1 convolution (finding linear combinations between features)</li>
</ul>
<p>The first stage with 1x1 convolutions effectively increases the number of channels.
An <strong>expansion ratio</strong> is used to represent the desired increase in channels (the size of the input bottleneck vs inner size).
As there are now more channels present, it is fine to use ReLU after the bottleneck layer.
The idea is that with a large number of channels if one channel is collapsed, that same info is likely still within another channel as well.
Hence, ReLU can be used.</p>
<p>The final pointwise 1x1 convolution does the opposite (decreases the number of channels).
No ReLU is used, as reducing dimensionality can itself destroy information.</p>
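<p>Putting the three stages together, here&#39;s a stripped-down sketch of the block (batch norm, strides and the extra hyperparameters noted below are left out, and the residual link assumes matching channel counts):</p>
<pre><code class="lang-python">import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion_ratio=6):
        super().__init__()
        hidden = channels * expansion_ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU6(),  # expand channels
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden), nn.ReLU6(),                    # depthwise filter
            nn.Conv2d(hidden, channels, kernel_size=1),              # linear bottleneck, no ReLU
        )

    def forward(self, x):
        return x + self.block(x)  # residual link between the bottlenecks
</code></pre>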
<p>Note more hyperparameters are present in the actual model, but for simplicity, I&#39;m leaving them out.
To learn about MobileNetV2 in more detail check out its <a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/pdf/1801.04381.pdf">paper</a>.</p>
<h1 id="a-final-note">A final note</h1>
<p>Now that you know <em>how</em> a <em>basic lightweight model</em> works, you may be interested in where it could come in handy. I previously mentioned reduced training time, but a model which can run on any device without an internet connection or powerful hardware can be useful in several scenarios. One such scenario is <a target='_blank' rel='noopener noreferrer'  href="https://play.google.com/store/apps/details?id=com.kamwithk.snaked">my snake species classification app</a> which tells you the snake species in a photo (e.g. if bitten). Please feel free to use the <a target='_blank' rel='noopener noreferrer'  href="https://github.com/KamWithK/Snaked">GitHub repo</a> with its code as a reference for how you can do the same for your model!</p>
<h1 id="resources-">Resources?</h1>
<ul>
<li>Original <a target='_blank' rel='noopener noreferrer'  href="https://arxiv.org/pdf/1801.04381.pdf">paper</a> on MobileNetV2</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://medium.com/@luis_gonzales/a-look-at-mobilenetv2-inverted-residuals-and-linear-bottlenecks-d49f85c12423">A Look at MobileNetV2: Inverted Residuals and Linear Bottlenecks</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://machinethink.net/blog/mobilenet-v2">Breakdown of MobileNetV2</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://medium.com/zylapp/review-of-deep-learning-algorithms-for-image-classification-5fdbca4a05e2">Review of Deep Learning Algorithms for Image Classification</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://towardsdatascience.com/review-mobilenetv2-light-weight-model-image-classification-8febb490e61c">Review: MobileNetV2 — Light Weight Model (Image Classification)</a></li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215">A Comprehensive Introduction to Different Types of Convolutions in Deep Learning</a></li>
</ul>
<p>Cover image sourced from <a target='_blank' rel='noopener noreferrer'  href="https://www.piqsels.com/en/public-domain-photo-fvbay">here</a></p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).
If you liked this article, then check out <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/improving-your-computer-vision-model-ck6k3em3b0113dfs16bray6ee">how to improve your model</a> and <a target='_blank' rel='noopener noreferrer'  href="https://www.kamwithk.com/how-i-published-an-app-and-model-to-classify-85-snake-species-and-how-you-can-too-ck6jb8er400r0dfs1agw7d0y4">how to overcome several hurdles during creating a project of your own</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Vanishing Gradients]]></title><description><![CDATA[Why?
Deep learning is massive right now and there are few innovations which really made this possible.
Residual blocks are a quintessential modern discovery to solve the problem of exploding and vanishing gradients!
More layers == Better
Modern deep ...]]></description><link>https://www.kamwithk.com/vanishing-gradients</link><guid isPermaLink="true">https://www.kamwithk.com/vanishing-gradients</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[neural networks]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Tue, 07 Jan 2020 09:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581422443832/7jTiUsqbB.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>Deep learning is massive right now and there are a few innovations which really made this possible.
Residual blocks are a quintessential modern discovery to solve the problem of exploding and vanishing gradients!</p>
<h1 id="more-layers-better">More layers == Better</h1>
<p>Modern deep learning revolves around adding more layers, however, this raises the question: at what point will our model no longer improve?
Residual blocks provide one way to increase the number of useful layers a neural network may have (before it overfits).</p>
<h1 id="linking-forwards">Linking forwards</h1>
<p>Residual blocks work by linking the current layer not only to the next layer (like normal), but also to a layer two or three ahead of it.
This allows larger gradients to flow back to the earlier layers!
By doing so we offset the major impact of vanishing gradients, allowing us to delve deeper once again.</p>
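<p>A minimal sketch of one such block in PyTorch (the layer sizes are placeholders):</p>
<pre><code class="lang-python">import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # the skip link lets gradients flow straight through the addition
</code></pre>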
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1581422675249/O0RY5jmoh.png" alt="Resnet.png"></p>
<p>The image was sourced <a target='_blank' rel='noopener noreferrer'  href="https://commons.wikimedia.org/wiki/File:Resnet.png">here</a></p>
<h1 id="resources">Resources</h1>
<ul>
<li><a target='_blank' rel='noopener noreferrer'  href="https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec">Residual blocks — Building blocks of ResNet</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Convolutional Neural Networks: Basic Theory in a Nutshell]]></title><description><![CDATA[Why?
Majority of the tutorials I've seen on convolutional neural networks either focus on providing a basic analogy or going straight into describing terminology.
Therefore, I aim to start with an overview of the stages involved in CNN's (Convolution...]]></description><link>https://www.kamwithk.com/convolutional-neural-networks-basic-theory-in-a-nutshell</link><guid isPermaLink="true">https://www.kamwithk.com/convolutional-neural-networks-basic-theory-in-a-nutshell</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[neural networks]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Thu, 17 Oct 2019 09:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581407554752/yRvKaUe8a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>The majority of tutorials I&#39;ve seen on convolutional neural networks either focus on providing a basic analogy or go straight into describing terminology.
Therefore, I aim to start with an overview of the stages involved in CNN&#39;s (Convolutional Neural Networks) and then provide an analogy, as well as a small glossary of key terms and external resources for further assistance.
Make sure to utilize the glossary for the key terms used throughout the blog post, and then continue onwards to a few of the articles and videos mentioned in my resources section!
By the way, don&#39;t expect to completely understand CNN&#39;s straight away, as they ain&#39;t all too simple!</p>
<p>Note that I&#39;ll be providing tangible/practical code in another one of my problem-solution blog posts (where I take a problem I&#39;ve had and explain my final derived solution, along with how I&#39;ve overcome some major hurdles).</p>
<h1 id="key-stages-">Key Stages?</h1>
<ul>
<li>Convolutional Layers (extract features from the input)<ul>
<li>Filters (matrices of weights) convolve over the input to produce feature maps<ul>
<li>When the filter and input are similar, a high number is produced</li>
</ul>
</li>
<li>Applying ReLU functions to increase non-linearity</li>
</ul>
</li>
<li>Pooling/Down Sampling (combine clusters of neurons together) to reduce dimensionality<ul>
<li>Summarises each cluster with a single value (e.g. max pooling keeps the largest activation)</li>
</ul>
</li>
<li>Fully Connected Layers (connect a neuron to all those in the next layer)</li>
</ul>
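<p>To make the stages concrete, here&#39;s a tiny network in PyTorch showing them in order (it assumes 32x32 RGB inputs and 10 classes):</p>
<pre><code class="lang-python">import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: filters extract features
    nn.ReLU(),                                   # increase non-linearity
    nn.MaxPool2d(2),                             # pooling/down sampling: 32x32 becomes 16x16
    nn.Flatten(),                                # one long vector for the classifier
    nn.Linear(16 * 16 * 16, 10),                 # fully connected layer producing class scores
)
</code></pre>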
<h1 id="analogy-">Analogy?</h1>
<p>You probably won&#39;t understand the above descriptions straight away, however a worded example feels more intuitive (to ease the confusion)!</p>
<p>So humor me and imagine the following scenario:</p>
<ul>
<li>You&#39;re given a few hundred paintings and need to identify which picture corresponds to which shape</li>
<li>None of the paintings are too precise, and the grid they were painted onto is HUGE (so there&#39;s no use just trying every single combination in a fully connected artificial neural network)</li>
<li>You notice that each picture/painting is composed of smaller, <em>more subtle</em>, strokes (lines), which form curves, which themselves create each shape&#39;s final outline</li>
</ul>
<p>This might sound insane (are we like 2?), but more complicated and <strong>meaningful</strong> problems can be solved in the exact <strong>same way</strong>!</p>
<p>The process of segregating/labelling images begins with realizing that you can&#39;t comprehend a full picture at once, so you must break each down into smaller 2x2 squares.
You can move across, with a <em>stride</em> of 2 pixels at a time and compare these squares against a few <em>filters</em>.
The <em>filters</em> themselves are just another grid, which resembles unique <em>features</em> which may be present in the original input image.
Example basic features are like lines in different directions:</p>
<ul>
<li>Horizontal</li>
<li>Vertical</li>
<li>Diagonal from left to right (upwards)</li>
<li>Diagonal from right to left (downwards)</li>
</ul>
<p>Through <em>convolving</em> from one mini-image (<em>receptive field</em>) to the next, documenting how similar each <em>receptive</em> and <em>filter</em> grids are, a smaller <em>down sampled</em> image can form (this is <em>pooling</em>).
The aim of this gradual comparative process is to form a more abstract, higher level image composed of lines instead of individual pixels!</p>
<p>Now the enlightening idea is that you can apply the same <em>down sampling</em> process used to extract lines from pixels, to find curves in lines, and then shapes formed from the curves!
Each of these stages forms a separate layer of your <em>neural network</em>, separated by activation functions (for accentuating <em>non-linearity</em>) and <em>fully connected layers</em> (to join together the different patterns and eventually produce a final output)!</p>
<h1 id="glossary-">Glossary?</h1>
<table>
<thead>
<tr>
<td>Term</td><td>Definition</td></tr>
</thead>
<tbody>
<tr>
<td>CNN</td><td>Convolutional Neural Network</td></tr>
<tr>
<td>Kernel/Channel</td><td>A matrix of weights used to produce a feature map by convolving over the input (note that multiple can be used to preserve spatial depth to a higher extent)</td></tr>
<tr>
<td>Filter</td><td>A set of kernels/channels</td></tr>
<tr>
<td>Convolving</td><td>Moving through a broken down version of the input, summing the input values in each section and multiplying them by the filter/kernel</td></tr>
<tr>
<td>Activation/Feature Map</td><td>Original input processed by filters to accentuate certain features (effectively performs operations like edge detection, blur, etc.)</td></tr>
<tr>
<td>Stride</td><td>The number of divisions of the input to scroll across at a time</td></tr>
<tr>
<td>Receptive Field</td><td>The part of the input which the filter is currently scrolling over</td></tr>
<tr>
<td>Padding</td><td>Adding zeroes (i.e. zero-padding) to the input or dropping part of it (valid-padding), to mitigate the impact of the stride not perfectly dividing the input (e.g. there is a remainder)</td></tr>
</tbody>
</table>
<h1 id="places-to-learn-more-">Places to LEARN MORE?</h1>
<p>I&#39;ve come across several amazing blogs and videos describing how convolutional neural networks work, so here&#39;s a rough list of the ones I feel are the most valuable!</p>
<ul>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/yyr7cqcg">MIT 6.S191: Convolutional Neural Networks</a> is probably the most wholesome and complete (to a <em>small</em> extent) video on CNN&#39;s</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/y7zb6bcw">A Beginner&#39;s Guide To Understanding Convolutional Neural Networks</a> is just amazing, I wish I read this one first<ul>
<li>Goes through a few details about how the filters actually work which no other guide did</li>
</ul>
</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/y7tlv6lg">A friendly introduction to Convolutional Neural Networks and Image Recognition</a> has the most easily interpretable example situations<ul>
<li>The example scenarios build from extremely simple to slightly more complex</li>
</ul>
</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/y7bz8czn">The Complete Beginner’s Guide to Deep Learning: Convolutional Neural Networks and Image Classification</a> has amazing visuals</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/ybatmchm">Understanding of Convolutional Neural Network (CNN) — Deep Learning</a> is well broken down into sections</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/y5ro8rph">Intuitively Understanding Convolutions for Deep Learning</a> is good for consolidation after the rough ideas behind CNN&#39;s are understood<ul>
<li>However, it&#39;s quite confusing at times (especially for a first read)</li>
</ul>
</li>
<li><a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/jges5k5">Neural Network that Changes Everything</a> is from <a target='_blank' rel='noopener noreferrer'  href="https://tinyurl.com/h3a8slm">Computerphile</a> and is a great first deep dive into the ideas behind CNN&#39;s<ul>
<li>Do note that they have other videos on these topics, however this seems like their best introduction</li>
</ul>
</li>
</ul>
<p>Cover image sourced from <a target='_blank' rel='noopener noreferrer'  href="https://commons.wikimedia.org/wiki/File:Typical_cnn.png">here</a></p>
<h1 id="thanks-for-reading-">THANKS FOR READING!</h1>
<p>Now that you&#39;ve heard me ramble, I&#39;d like to thank you for taking the time to read through my blog (or skipping to the end).</p>
]]></content:encoded></item><item><title><![CDATA[Life Hack Web Scraping]]></title><description><![CDATA[Why?
Web scraping has made my life SO MUCH EASIER.
Yet, the process for actually extracting content from websites which lock their content down using proprietary systems is never really mentioned.
This makes it extremely difficult if not impossible ...]]></description><link>https://www.kamwithk.com/life-hack-web-scrapping</link><guid isPermaLink="true">https://www.kamwithk.com/life-hack-web-scrapping</guid><category><![CDATA[Python]]></category><category><![CDATA[web scraping]]></category><category><![CDATA[caching]]></category><category><![CDATA[data]]></category><category><![CDATA[selenium]]></category><dc:creator><![CDATA[Kamron Bhavnagri]]></dc:creator><pubDate>Sat, 05 Oct 2019 11:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1581322449513/jsI8VPt3U.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="why-">Why?</h1>
<p>Web scraping has made my life SO MUCH EASIER.
Yet, the process for actually extracting content from websites which lock their content down using proprietary systems is never really mentioned.
This makes it extremely difficult if not impossible to reformat information into a desirable format.
Over a few years, I&#39;ve found several (nearly) failproof techniques to help me out, and now I&#39;d like to pass them on.</p>
<p>I&#39;m going to walk you through the process of converting a web-only book to a pdf.
Feel free to replicate/modify this for use in other circumstances!
If you have any other tricks (or even useful scripts) for tasks like these, make sure to let me know, as creating these life-hack scripts is an interesting hobby!</p>
<h1 id="reproducibility-applicability-">Reproducibility/Applicability?</h1>
<p>The example I&#39;m outlining is from a website which provides study guides for a monthly fee (to protect their security I&#39;m excluding specific URL&#39;s).
Despite the potential lack of reproducibility, this guide should stay quite useful, as I&#39;m outlining several flaws/hiccups that come up in any similar project along the way!</p>
<h1 id="mistakes-to-make-">Mistakes to Make?</h1>
<p>I&#39;ve made several mistakes when trying to web scrape for limited access information.
Each mistake consumed <strong>large amounts</strong> of <strong>time</strong> and <strong>energy</strong>, so here they are:</p>
<ul>
<li>Using AutoHotKey or similar to directly affect the mouse/keyboard<ul>
<li>This seems effective, however, it is <strong>extremely dodgy</strong> and <strong>isn&#39;t reproducible</strong></li>
</ul>
</li>
<li>Load all pages and then export a HAR file<ul>
<li>HAR files don&#39;t contain any actual data</li>
<li>HAR files take ages to load in any text editor</li>
</ul>
</li>
<li>Attempt to use GET/HEAD requests<ul>
<li>Majority of pages will randomly assign tokens and other <strong>authorization</strong> approaches which are incredibly hard to reverse engineer</li>
<li>Code can never be reproduced/a different approach will be needed</li>
</ul>
</li>
</ul>
<h1 id="slow-progress">Slow Progress</h1>
<p>It seems quick and easy to write a short 300-line script to scrape these websites, but it&#39;s always more difficult than that.
Here are potential hurdles with solutions:</p>
<ul>
<li>Browser profile used by Selenium changing<ul>
<li>Programmatically find the profile</li>
</ul>
</li>
<li>Not knowing how long to wait for a link to load<ul>
<li>Detect when the link <strong>isn&#39;t equal</strong> to the <strong>current one</strong></li>
<li>Or use browser JavaScript (where possible, described more below)</li>
</ul>
</li>
<li>Needing to find information about the current web page&#39;s content<ul>
<li>Look at potential JavaScript functions and URL&#39;s</li>
</ul>
</li>
<li>Restarting a long script when it fails<ul>
<li><strong>Reduce</strong> the number of <strong>lookups</strong> for files</li>
<li>Copy files to <strong>predictable locations</strong></li>
<li>Before beginning doing anything complex check these files</li>
</ul>
</li>
<li>Not knowing what a long script is up to<ul>
<li>Print any necessary output (only for that which takes considerable time and doesn&#39;t have another metric)</li>
</ul>
</li>
</ul>
<h1 id="code">Code</h1>
<h2 id="preperation">Preperation</h2>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver
<span class="hljs-keyword">from</span> selenium.webdriver.common.by <span class="hljs-keyword">import</span> By
<span class="hljs-keyword">from</span> selenium.webdriver.support.ui <span class="hljs-keyword">import</span> WebDriverWait
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image
<span class="hljs-keyword">from</span> natsort <span class="hljs-keyword">import</span> natsorted

<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> shutil
<span class="hljs-keyword">import</span> img2pdf
<span class="hljs-keyword">import</span> hashlib
</code></pre>
<pre><code class="lang-python">driver = webdriver.Firefox()
cacheLocation = driver.capabilities[<span class="hljs-string">'moz:profile'</span>] + <span class="hljs-string">'/cache2/entries/'</span>
originalPath =  os.getcwd()
baseURL = <span class="hljs-string">'https://edunlimited.com'</span>
</code></pre>
<h2 id="loading-book">Loading Book</h2>
<pre><code class="lang-python">driver.get(loginURL)
driver.get(bookURL)

wait.until(<span class="hljs-keyword">lambda</span> driver: driver.current_url != loginURL)
</code></pre>
<h2 id="get-metadata">Get Metadata</h2>
<p>Quite often it is possible to find JavaScript functions which are used to provide useful information.
There are a few ways you may go about doing this:</p>
<ul>
<li>View the page&#39;s HTML source (right-click &#39;View Page Source&#39;)</li>
<li>Use the web console</li>
</ul>
<pre><code class="lang-python">bookTitle = driver.execute_script(<span class="hljs-string">'return app.book'</span>)
bookPages = driver.execute_script(<span class="hljs-string">'return app.pageTotal'</span>)
bookID = driver.execute_script(<span class="hljs-string">'return app.book_id'</span>)
</code></pre>
<h2 id="organize-files">Organize Files</h2>
<p>Scripts often don&#39;t perform as expected, and can sometimes take long periods of time to complete.
Therefore it&#39;s quite liberating to preserve progress throughout the script&#39;s iterations.
One good method to achieve this is keeping organized!</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(bookTitle):
        os.mkdir(bookTitle)
    <span class="hljs-keyword">if</span> len(os.listdir(bookTitle)) == <span class="hljs-number">0</span>:
        start = <span class="hljs-number">0</span>
    <span class="hljs-keyword">else</span>:
        start = int(natsorted(os.listdir(bookTitle), reverse=<span class="hljs-keyword">True</span>)[<span class="hljs-number">0</span>].replace(<span class="hljs-string">'.jpg'</span>, <span class="hljs-string">''</span>))
        driver.execute_script(<span class="hljs-string">'app.gotoPage('</span> + str(start) + <span class="hljs-string">')'</span>)

os.chdir(bookTitle)
</code></pre>
<h2 id="loop-through-the-book">Loop Through the Book</h2>
<p>Images are always stored in the cache, so when all else fails, just use this to your advantage!</p>
<p>This isn&#39;t easy though: first off we need to load the page, and then we need to somehow recover it from the cache!</p>
<p>To make sure we <strong>always load the entire page</strong>, there are two safety measures in place:</p>
<ul>
<li>Waiting for the current page to load before moving to the next</li>
<li>Reloading the page if it <em>fails to load</em></li>
</ul>
<p>Getting these two to work requires functions which guarantee completion (JavaScript or browser responses), and fail-safe waiting times.
Safe waiting times come down to trial and error, but values between 0.5 and 5 seconds usually work best.</p>
<p>Recovering specific data directly from the hard drive&#39;s cache is a relatively obscure topic.
The key is to first locate a download link (normally easy as it <strong>doesn&#39;t have to work</strong>).
Then run <strong>SHA1</strong>, <strong>Hex Digest</strong> and a <strong>capitalizing function</strong> on the <strong>URL</strong>, which produces the final filename (it <strong>isn&#39;t just one</strong> of these steps, as older sources lead you to believe, but all of them together).</p>
<p>On a final note, make sure to clean your data (here, removing the alpha channel from PNG images) now rather than afterwards, as it saves looping over all the files a second time!</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> currentPage <span class="hljs-keyword">in</span> range(start, bookPages - <span class="hljs-number">1</span>):
        <span class="hljs-comment"># Books take variable amounts of loading time</span>
        <span class="hljs-keyword">while</span> driver.execute_script(<span class="hljs-string">'return app.loading'</span>) == <span class="hljs-keyword">True</span>:
            time.sleep(<span class="hljs-number">0.5</span>)

        <span class="hljs-comment"># The service is sometimes unpredictable</span>
        <span class="hljs-comment"># So pages may fail to load</span>
        <span class="hljs-keyword">while</span> (driver.execute_script(<span class="hljs-string">'return app.pageImg'</span>) == <span class="hljs-string">'/pagetemp.jpg'</span>):
            driver.execute_script(<span class="hljs-string">'app.loadPage()'</span>)
            time.sleep(<span class="hljs-number">4</span>)

        location = driver.execute_script(<span class="hljs-string">'return app.pageImg'</span>)

        <span class="hljs-comment"># Cache is temporary</span>
        pageURL = baseURL + location
        fileName = hashlib.sha1((<span class="hljs-string">":"</span> + pageURL).encode(<span class="hljs-string">'utf-8'</span>)).hexdigest().upper()
        Image.open(cacheLocation + fileName).convert(<span class="hljs-string">'RGB'</span>).save(str(currentPage) + <span class="hljs-string">'.jpg'</span>)

        driver.execute_script(<span class="hljs-string">'app.nextPage()'</span>)
</code></pre>
<h2 id="convert-to-pdf">Convert to PDF</h2>
<p>We can finally get that one convenient PDF file!</p>
<pre><code class="lang-python">finalPath =  originalPath + <span class="hljs-string">'/'</span> + bookTitle + <span class="hljs-string">'.pdf'</span>

<span class="hljs-comment"># Combine into a single PDF</span>
<span class="hljs-keyword">with</span> open(finalPath, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
    f.write(img2pdf.convert([i <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> natsorted(os.listdir(<span class="hljs-string">'.'</span>)) <span class="hljs-keyword">if</span> i.endswith(<span class="hljs-string">".jpg"</span>)]))
</code></pre>
<h2 id="remove-excess-images">Remove Excess Images</h2>
<pre><code class="lang-python">os.chdir(originalPath)
shutil.rmtree(bookTitle)
</code></pre>
<p>Cover image sourced from <a target='_blank' rel='noopener noreferrer'  href="https://www.pxfuel.com/en/free-photo-oinlw">here</a></p>
<h1 id="thanks-for-reading-">Thanks for READING!</h1>
<p>This is basically the first code-centric post I&#39;ve made on my blog, so I hope it has been useful!</p>
<p>--- Until next time, I&#39;m signing out</p>
]]></content:encoded></item></channel></rss>