Getting your hands dirty with Kaggle – Data Column | Institute for Advanced Analytics

The image above is a visualization of Reddit that I built with Gephi from a public Kaggle data set.

One lesson I’ve learned over the past seven months is that the best way to hone your skills within the field of analytics (or data science, or whatever term you wish to use) is to get your hands dirty. You can read a tutorial on how to complete a task and it may linger in the back of your head for a couple of days. However, when you actually start from scratch and complete the task yourself, it sticks with you. It’s an experience you can recall months later when you encounter a similar problem. Furthermore, you may even have records of your methodology, which you can easily reference in case you draw a blank.
The effectiveness and importance of hands-on learning are well recognized by the Institute, hence the strong emphasis placed on the practicum experience. However, what if you wish to pursue your own personal endeavor of learning by doing? Assuming you’re looking for a ‘data problem’, there are many options!
First and foremost, you must have data. Collecting data can be an arduous process that presents its own list of complications. This isn’t to say that data collection isn’t important; it’s just not as fun as the actual analysis. We all know that it’s much easier to stay engaged with a project that’s enjoyable, especially when working independently. Hence, I would recommend jumping ahead and using previously assembled data.

Alright, I’m ready! Any recommendations on where to start?

Ah, I’m glad you asked! There are countless options to choose from; for example, here’s a pretty comprehensive list. For many projects, this may be enough, especially if you are already familiar with the software you plan to use. That being said, I’d like to present another option…

Kaggle, self-described as “The Home of Data Science,” plays host to modeling competitions, which are open to anyone wishing to participate. These competitions are sponsored by real companies, such as Airbnb and Walmart, that often provide a monetary reward for the best submissions. If you are relatively new to the field, my approach for getting started would be as follows:

Walk through some of the introductory tutorials. For example, this one deals with Titanic data and offers step-by-step instructions using the tool of your choice.
Once you’re comfortable with the basics, move on to whatever excites you!
Check out other competitions and see what other users have done. Oftentimes there will be public scripts that can give you ideas, or forum posts that answer questions you may have.
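
To give a flavor of what a first pass at the Titanic data might look like, here is a minimal sketch in Python. It is not taken from the tutorial itself. The rows are made up for illustration (they are not real passenger records), although the column names mirror those in Kaggle's Titanic train.csv. The helper names (load_rows, survival_rate_by, predict_by_sex, accuracy) are hypothetical, and the "predict survival for women" rule is just one simple baseline to compare later models against.

```python
import csv
import io

# Made-up rows for illustration only; these are NOT real passenger records.
# The column names mirror those in Kaggle's Titanic train.csv.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex
1,0,3,male
2,1,1,female
3,1,3,female
4,0,1,male
5,1,2,male
6,0,3,female
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per passenger."""
    return list(csv.DictReader(io.StringIO(text)))


def survival_rate_by(rows, column):
    """Return {group value: fraction survived} for the given column."""
    totals, survived = {}, {}
    for row in rows:
        key = row[column]
        totals[key] = totals.get(key, 0) + 1
        survived[key] = survived.get(key, 0) + int(row["Survived"])
    return {key: survived[key] / totals[key] for key in totals}


def predict_by_sex(row):
    """Baseline rule: predict survival (1) for women, 0 otherwise."""
    return 1 if row["Sex"] == "female" else 0


def accuracy(rows, predict):
    """Fraction of rows where the prediction matches the Survived label."""
    hits = sum(predict(row) == int(row["Survived"]) for row in rows)
    return hits / len(rows)


rows = load_rows(SAMPLE_CSV)
print(survival_rate_by(rows, "Sex"))   # survival rate per group
print(accuracy(rows, predict_by_sex))  # how often the baseline is right
```

To try this on the real competition data, you could read the downloaded file's contents and pass them to load_rows in place of SAMPLE_CSV. Starting from a rule this simple gives you a score to beat before you reach for a machine learning library.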

Finally, some important notes:

Kaggle requires that any solution you submit be reproducible using open-source software.
Most competitions are focused on machine learning algorithms. If predictive modeling sounds fun, you’re at the right place.
Placing high is extremely difficult! Don’t be discouraged if you don’t place well. Remember, it’s a learning process.
Unless you are participating in a recruiting competition, team up with your friends! It’s more enjoyable and you’ll learn more along the way.

Regardless of the route you choose, go out and get your hands dirty. Your future self will thank you.
Columnist: Alex Spancake