It is so hard to find an intuitive understanding of how your dataset functions! Yet coherently interpreting how your system works is crucial to finding a way to model or analyse any feature. Stick around till the end to find out why initial insight is key to good analysis, what you have to do to formulate your understanding, and finally how to avoid and overcome problems.
The story
I've been working on a project to predict energy demand using Australian weather data in a team of 5. It spanned half a year and mimicked a classic data science project: start with data collection, move on to cleaning, then modelling, and finally write a report. Halfway through I discovered one thing - I was still cleaning the data, and it was taking forever, despite having four others working on it! It seemed like there were a million features we had to deal with, and most of them were nowhere near ready to use. I had seen it as an opportunity to learn more about the different (fancy) techniques which could be used to interpolate (fill in) missing data. But... I had completely misunderstood my project.
We didn't understand how our data worked and so had to account for 100 features instead of 3...
It was a naive mistake but one that taught me several valuable lessons:
- Low-quality/irrelevant data does more harm than good
- My team's work can only be completed as well as it is understood
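To make the first lesson concrete, here is a minimal sketch of the kind of quick screen that could have trimmed our feature list early. It is not what my team actually ran: the function names, the thresholds (`max_missing`, `min_abs_corr`) and the sample values are all made up for illustration, and a simple correlation check is only one of many ways to judge relevance.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(values):
    """Share of entries in a column that are missing."""
    return sum(is_missing(v) for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation using only rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not (is_missing(x) or is_missing(y))]
    if len(pairs) < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information about the target
    return cov / math.sqrt(var_x * var_y)


def screen_features(columns, target, max_missing=0.3, min_abs_corr=0.3):
    """Keep columns that are mostly complete and at least loosely related to the target.

    Both thresholds are hypothetical defaults, not recommendations.
    """
    kept = []
    for name, values in columns.items():
        if missing_fraction(values) > max_missing:
            continue
        if abs(pearson(values, target)) < min_abs_corr:
            continue
        kept.append(name)
    return kept


# Made-up illustrative values, not the project's data.
demand = [5.1, 5.9, 7.2, 8.0, 9.1, 9.8]
columns = {
    "temperature": [18.0, 21.0, 26.0, 29.0, 33.0, 35.0],
    "wind_gust": [None, None, 40.0, None, None, 35.0],  # mostly missing
    "station_id": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],  # constant, so uninformative
}
print(screen_features(columns, demand))  # ['temperature']
```

A screen like this is deliberately crude. Its job is to start the conversation about which features deserve cleaning effort, not to replace actually understanding the data.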
Now don't worry, we did end up finishing on time, with a model and complete report! But... only after I accounted for these flaws could my team act like a well-oiled machine. So here we'll explore how to find and profit from insight, along with how to avoid ever-present flaws and potential pitfalls (which everyone at some point comes across).
How to find insight
It's easy to emphasise the importance of understanding how data functions, but hard to discover it. With tutorials, Kaggle competitions and simple beginner exercises it was easy... the understanding and interpreting part was handed to you on a silver platter. Just get out your golden fork (code) and knife (ensemble model), start cutting (testing) and voilà, you can eat (a high-performance model)!
Then you progress and begin a few real projects... oh no. There may not be a one-size-fits-all strategy, but we can make the process smoother and less painful by setting ourselves up properly.
Where to begin
Just COOOODE... NO PLEASE WAIT!
Code is important, but I'll let you in on a little secret - it takes less time when you know the process. B-b-but... how do you know the process before starting for the first time? Easy, interpret the mission objectives. Mission objectives are what you want to get out of the project. They include the model and report itself, but also what you'll need to learn, what stages you'll go through and what challenges need to be accounted for.
I thought my goal was to create the best model I could to predict energy demand, hahaha. I was completely wrong!
My goal wasn't to predict energy demand... because that would be nearly impossible. What I actually wanted was to identify and model the energy demand trends and seasonal patterns which occurred in the short term, and how temperature fit into this equation. It involved importing data, researching to find which variables were useful, creating graphs to intuitively show how the data looked/worked and THEN finally creating a model to concretely measure the relationship between temperature and energy demand. The main difficulties would lie in learning how the energy time series worked and learning to guide my team through each stage of the process. Painfully verbose yet? The end goal may have been a report showing how everything led up to a model, but in reality, the model was just 10% of the work!
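If it helps to see what "seasonal patterns" and "the relationship between temperature and energy demand" could look like in code, here is a minimal sketch. It is an illustration under my own assumptions rather than the model from the project: the readings are made up, `hourly_profile` and `fit_line` are hypothetical helper names, and a straight line is only a first approximation of how demand responds to temperature.

```python
from collections import defaultdict


def hourly_profile(readings):
    """Average demand for each hour of the day (a simple daily seasonal pattern)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hour, _temperature, demand in readings:
        totals[hour] += demand
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up readings: (hour of day, temperature in Celsius, demand in arbitrary units).
readings = [
    (6, 15.0, 5.0),
    (12, 25.0, 7.0),
    (18, 20.0, 6.0),
    (6, 17.0, 5.4),
    (12, 30.0, 8.0),
    (18, 22.0, 6.4),
]

profile = hourly_profile(readings)
temperatures = [temperature for _, temperature, _ in readings]
demands = [demand for _, _, demand in readings]
slope, intercept = fit_line(temperatures, demands)

for hour, average in profile.items():
    print(f"{hour:02d}:00 average demand {average:.2f}")
print(f"demand changes by about {slope:.2f} units per degree")
```

Plotting the hourly profile and the fitted line before building anything more elaborate is a cheap way to check whether the objective you wrote down matches what the data can actually show.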
It is easy, though, to put this aside... to say that it is soft, unnecessary planning which is unlikely to directly impact the project right now. In fact... yes, it is extra work, and fair enough if you're not compelled to plan everything out like this. If you've got a better alternative - let me know. If not, give it a try. It may not impact you right now, but it will help mitigate large problems and illustrate how everything ties together!
Finding out where to go next
But I have no idea how to do any of this... Don't worry, with time, you'll figure it all out. Just remember:
The path seems a little less bumpy once you get started!
If, though, you have no idea where to start, find out what you'll need to learn and find ways to do so. Simple tutorials and videos are a great way to start off. Then, once you've got a vague idea of what things should look like and what to do, just get started. Simply follow the trail and see where it leads!
For data science projects, know the practical steps and processes. I explain these in my machine learning field guide article which goes through every single step of the process in detail! If you want to learn more, books like Hands-On Machine Learning and The Hundred-Page Machine Learning Book are extremely useful.
To work better in a team, make sure you understand collaborative coding tools (all explained here) and how to lead. The book Extreme Ownership is an amazing guide to teamwork and leadership (not data science specific, but Jocko Willink's advice applies nonetheless).
Avoiding collapse
Everything was going so well... until I realised we were still cleaning the data halfway through the project. Everything seemed fine, progress seemed alright, not perfect... but fine.
Even when you've set yourself up to succeed, and everything has progressed fine... things can go south! But... I was lucky because a teacher told me to regularly do one thing:
Keep a simple journal of progress, specifically commenting on what you've done, how it panned out and what can be done to improve.
It worked wonders! Instead of stressing out after coming to grips with how much I needed to finish, I was able to prioritise and execute, because I knew where I could go wrong and I could account for it. I knew my team could get distracted and lose focus, so I made sure to stick to the point and emphasise what we were trying to accomplish instead of writing down narrow tasks. I knew it was difficult to pace ourselves, so I kept count of how many weeks were left and made sure everyone understood. I knew the coding was particularly challenging and intimidating to most people, so I did a brief rundown of what it would involve/what it should look like, with sample code. In short, I accounted for my weaknesses and managed to turn around a bad situation.
The process only took ~5 minutes each week and drastically boosted progress.
All you've got to do is reflect on how your actions unfold and consider what could help you out further.
This leads to simple actionable steps.
THANKS FOR READING
I hope you've enjoyed this, and that you've found it helpful! Please feel free to share this with anyone it may help.
My other articles on practical coding skills, machine learning, starting projects and web scraping may be interesting.
Follow me on Twitter for updates.
Photos by Toa Heftiba, John Barkiple, Yang Jing, Kyle Glenn and Josh Calabrese on Unsplash