Kamron Bhavnagri
The Data Science Swiss Army Knife


Gathering and Using Big Data from Public APIs for Data Science - The GitHub Popularity Project


Sep 16, 2020

18 min read


Cool, unique data makes for intriguing projects, so let's go find some on the web! Today we'll get what we need to tell a story about the magic making GitHub projects popular ⭐🌟⭐. Readmes, descriptions, languages... we'll collect it all.

So, let the public API and big data sorcery begin.

WARNING: Collecting, storing and using mass data from public APIs won't be quick, easy or clean. Prepare to dial up the madness (you have been warned)...

The Story

GitHub, such a beautiful place filled with amazing creative projects. Some get popular, others stagnate and die. It truly is the circle of life, the wheel of fortune (eloquently stated by Elton John).

Now we could use a massive (terabytes-large 🤯) archive like GH Archive or GHTorrent, but I'm not looking to fry a computer (haha). We could use Google BigQuery to filter through this, but why not take a journey closer to the source 😏 by using the official API (side note: it's also cheaper).

With a good bit of web scraping experience under our belt, it surely can't be that hard to use a public API... Nope, it doesn't work like that. You get a little data, you see a few repositories... and then that's all for poor you 🥺.

But what if you want more 🥺😒? What if you want a LOT MORE? Well, I'd welcome you to the slightly dodgy but still legitimate public API user club. Beginner lingo aside - if you play your cards right, you can get what you need without waiting a hundred years. It's smooth sailing once you know what you're doing 🤯🤔.

Luckily, we can avoid the time-consuming pain of switching from technology to technology by considering what's out there first. To start off, we can consider our familiar, cozy, tried-and-tested data science tools (Python with Pandas and Requests). We'll consider what this stack does well, but also its major drawbacks (lots exist). To come to our rescue, we'll discuss a few unique tools and techniques to bridge the gap between gathering and using data in a scalable way.

After we decide on a technology stack we can start looking at our API and (with thoughtful research) figure out ways to optimise our search queries:

  • Break down large searches into multiple parts
  • Send multiple queries at once (asynchronously)
  • Get extra API time through multiple API access tokens (get your friends together, and maybe use a rotating proxy)
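As a taste of that last point, rotating through a pool of tokens can be as simple as itertools.cycle (the token names here are obviously made up, and the header format is GitHub's bearer-token scheme):

```python
from itertools import cycle

# Hypothetical token pool (gathered from consenting friends!)
TOKENS = cycle(["token_alice", "token_bob", "token_carol"])

def auth_header():
    """Rotate through the token pool: each request uses the next token."""
    return {"Authorization": f"bearer {next(TOKENS)}"}

# Four requests cycle back around to the first token
headers = [auth_header() for _ in range(4)]
```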

After all the work, we can finally sit back and relax, knowing that we've got virtually every last drop of data 🤪. Yes, 50 glorious gigabytes of GitHub readmes and stats 😱🥳!

Technology Stack

Before we get into the nitty-gritty details on how to collect data it's important to decide what technologies to use. Normally this wouldn't be a big deal (you'd want to get started asap), but when collecting a large amount of data, we can't afford to have to rewrite everything with a different library (i.e. because of an overhead or general slow speed).

With small datasets, library/framework choice doesn't matter much. With ~50 GB, your technology choices make or break the project!

This section is pretty long, so here's a summary of the technologies I used along with alternate options I would use if I started over:

| Task | Technology Used | Potentially Better Options |
| --- | --- | --- |
| Querying GitHub API | AIOHttp and AsyncIO | Apollo Client |
| Saving Data | PostgreSQL | Parquet using PyArrow |
| Processing Data (Data Pipeline) | Dask | Spark |
| Machine Learning Modelling (Ensembles) | H2O | XGBoost or LightGBM |
| Deep Learning and NLP Modelling | PyTorch/PyTorch Lightning with Petastorm and Hugging Face Transformers | PyTorch with Petastorm |

Always create a working environment.yaml or requirements.txt file listing all dependencies (like ours).

There are multiple types of APIs; the most popular are REST and GraphQL. GitHub has both, but GraphQL is newer and lets us choose exactly what information we want. Normally you'd use an HTTP requests library for GraphQL, but dedicated clients (like Apollo or GQL) do exist. Using a dedicated library might not seem important, but they can do a lot for you (like handling pagination). Without one, you'll need to handle asynchronous requests yourself (using threading or async-await). It isn't impossible, but it is a burden to deal with (I, unfortunately, went this route).

Note that GQL is still under heavy development, so Apollo Client may be better (if you use JS).
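If you do go the plain-HTTP route, the request itself is small: a POST to GitHub's GraphQL endpoint with the query and variables bundled as JSON, plus a bearer-token header. Here's a minimal sketch (build_request and the token placeholder are illustrative names of mine; the commented-out line shows the actual call with Requests):

```python
import json

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

def build_request(query: str, variables: dict, token: str) -> dict:
    """Bundle everything an HTTP client needs to call GitHub's GraphQL API."""
    return {
        "url": GITHUB_GRAPHQL_URL,
        "headers": {"Authorization": f"bearer {token}"},
        "data": json.dumps({"query": query, "variables": variables}),
    }

# Example: count all public repositories created in 2017
count_query = """
query ($conditions: String!) {
    search(query: $conditions, type: REPOSITORY) { repositoryCount }
}
"""
req = build_request(count_query, {"conditions": "is:public created:2017"}, "<YOUR_TOKEN>")
# response = requests.post(req["url"], headers=req["headers"], data=req["data"])
```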

The next step is to decide where to store your data. Normally Pandas would be fine, but here it's slow and unreliable (saving from Pandas completely overwrites the file each time). The obvious alternative is a database (they were designed to overcome these problems) like PostgreSQL. However, let me warn you in advance: databases are horribly supported by machine and deep learning frameworks! File formats like HDF5, Parquet and LMDB, on the other hand, can work quite well.


Now that we've got all our data, it's time to consider how we'll process and analyse it. When using databases, it's best to stick to Apache Spark. Spark is nice to work with, supports reading/writing to nearly ANY format, works at scale and has support for H2O (useful if you want to try out AutoML, though it is quite buggy). The downside is that we ironically don't have enough data to make Spark's overhead worth it (it's best for ~300+ GB datasets).

As long as you used HDF5 or (better yet) Parquet, though, both Dask and Vaex should work like a charm. Vaex is a highly efficient dataframe library which lets us process our data for an ML model (similarly to Pandas). Although Vaex is efficient, you might still run into memory problems. When you do, Dask's out-of-core functionality springs to life 😲! We can also train classical machine learning models (ensembles like random forests and gradient boosted trees) through Dask and Vaex, which provide wrappers for Scikit-Learn, XGBoost and LightGBM.

When it comes to deep learning, PyTorch is a natural go-to! It can use HDF5 or LMDB quite easily (with custom data loaders like this one, which you can either find or create yourself). For anything else use Petastorm (from Uber) to get data into PyTorch (by itself for Parquet and otherwise with Spark). A neat trick if you use Spark is to save the processed data into Parquet files so you can easily and quickly import them through Petastorm.

Be very, very careful with the libraries you decide to use. It's easy for conflicts and errors to arise 😱

With a solid technology stack, you and I are ready to get going! Quick side note - you'll quickly realise that Apache is a HUGE player in the big-data world!

Assembling a Query

GraphQL is really finicky. It complains about the simplest mistake, and so it can be difficult to figure out how to construct your search query. The process to come up with an appropriate API request is as follows:

  • Assemble a list of all the information you want/need (the number of stars and forks, readmes, descriptions, etc)
  • Read through the official documentation to find how to gather the basic elements
  • Google for anything else
  • Run your queries as you build/add to them to ensure they work/figure out the problem

It's a surprisingly long process, but it does pay off in the end. Here's the final query for GitHub.

query ($after: String, $first: Int, $conditions: String="is:public sort:created") {
    search(query: $conditions, type: REPOSITORY, first: $first, after: $after) {
        edges {
            node {
                ... on Repository {
                    name
                    id
                    description
                    forkCount
                    isFork
                    isArchived
                    isLocked
                    createdAt
                    pushedAt

                    primaryLanguage {
                        name
                    }

                    assignableUsers {
                        totalCount
                    }

                    stargazers {
                        totalCount
                    }

                    watchers {
                        totalCount
                    }

                    issues {
                        totalCount
                    }

                    pullRequests {
                        totalCount
                    }

                    repositoryTopics(first: 5) {
                        edges {
                            node {
                                topic {
                                    name
                                }
                            }
                        }
                    }

                    licenseInfo {
                        key
                    }

                    commits: object(expression: "master") {
                        ... on Commit {
                            history {
                                totalCount
                            }
                        }
                    }

                    readme: object(expression: "master:README.md") {
                        ... on Blob {
                            text
                        }
                    }
                }
            }
        }
        pageInfo {
            hasNextPage
            endCursor
        }
    }
}

You can pass in the arguments/variables after, first and conditions through a separate JSON dictionary.

Challenging Your Query

Divide and Conquer

You wrote one nice simple query to find all your data? You Fool 🤡🥱

Big companies are (mostly) smart. They know that if they allow us to do anything with their API, we will use and abuse it frequently 👌. So the easiest thing for them to do is to set strict restrictions! The thing about these restrictions, though, is that they don't completely stop you from gathering data; they just make it a lot harder.

The most fundamental limitation is the amount you can query at once. On GitHub, it is:

  • 2000 queries per hour
  • 1000 items per query
  • 100 items per page

To stay within these bounds whilst still being able to collect data we need to bundle lots of small queries together, whilst splitting apart single large queries. Combining smaller queries is easy enough (string concatenation), but to split a large query apart requires some clever coding. We can start by finding out how many items appear for a search (through the following GraphQL):

query ($conditions: String="is:public sort:created") {
    search(query: $conditions, type: REPOSITORY) {
        repositoryCount
    }
}

By modifying the conditions variable, we can limit the search's scope (for example, to just 2017-2018). You can test this (or any other) query with GitHub's official interactive GraphiQL explorer.

If the resulting count is greater than 1000, we'll need to create two independent queries which each gather half the data! We can break a search in half by dividing the original time period into two half-as-long periods. By repeating this division process, we'll eventually end up with a long list of valid searches!
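The halving helper (a hedged sketch of my own; the real code in the repo does this as a method) might look roughly like:

```python
from datetime import datetime, timedelta

def half_period(start: datetime, end: datetime) -> list:
    """Split one search period into two back-to-back halves."""
    midpoint = start + (end - start) / 2
    return [(start, midpoint), (midpoint + timedelta(seconds=1), end)]

# All of 2017 becomes two roughly six-month searches
first_half, second_half = half_period(datetime(2017, 1, 1), datetime(2017, 12, 31))
```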

So in essence, here's what happens (continuously repeated through a while True loop):

trial_periods = []

# Handle one period at a time
for period, num_repos, is_valid in valid_periods:
    if is_valid and num_repos > 0:
        # Period fits within the limits: save the data
        ...
    else:
        # Split the period in half and retry both halves next round
        trial_periods.extend(self.half_period(*period))

# Every period handled? Then we're done
if not trial_periods:
    break
...

Please see the GitHub repo for the full working code. This is just a sample to illustrate how it works on its own, without databases, async-await and other extraneous bits...

Do remember that if (like me) you're using the API through an HTTP client instead of a dedicated GraphQL one, you'll need to manage pagination yourself! To do so you'll need to include in your query (after the huge edges part):

pageInfo {
    hasNextPage
    endCursor
}

Then pass the cursor as a variable for where to start the next query.
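Handled manually, that cursor-following loop might look something like this sketch (collect_all_pages and run_query are illustrative names of mine; run_query stands in for whatever function actually sends the request and returns the decoded JSON):

```python
def collect_all_pages(run_query, variables):
    """Drain a paginated GraphQL search by following endCursor."""
    edges, after = [], None
    while True:
        page = run_query({**variables, "after": after})
        search = page["data"]["search"]
        edges.extend(search["edges"])
        if not search["pageInfo"]["hasNextPage"]:
            return edges
        after = search["pageInfo"]["endCursor"]

# A fake two-page response, just to show the flow
pages = {
    None: {"data": {"search": {"edges": [1, 2],
           "pageInfo": {"hasNextPage": True, "endCursor": "C1"}}}},
    "C1": {"data": {"search": {"edges": [3],
           "pageInfo": {"hasNextPage": False, "endCursor": None}}}},
}
edges = collect_all_pages(lambda v: pages[v["after"]], {"first": 100})
```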


Asynchronous Code

If you're using an HTTP client it is important to know how to write code to run blazing fast, and ideally in a way that multiple requests can be made at once. This helps because the GitHub server will take a long time (from around 1 to 10 seconds) to respond to our queries! One great way to do this is to use asynchronous libraries. With asynchronous code, the Python interpreter (like our human brain) switches between tasks extremely fast. So whilst we're waiting for our first query to return a response, our second one can be sent off straight away! It might not seem like much, but it definitely is.

There are three ways to achieve this. The easiest way (especially for beginners) is to use threads. We can create 100 threads (arbitrary example), and launch one query on each. The operating system will handle switching between them itself! Once an operation finishes, the thread can be recycled and used for a separate query (or other operation)!
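With Python's standard library, that thread-pool idea takes only a few lines (fetch here is a stand-in for a real blocking API request):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(query_id):
    """Stand-in for one blocking API request."""
    return query_id * 2

# A pool of threads, each running one query; the OS handles the switching.
# map returns results in submission order, even though work overlaps.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(fetch, range(10)))
```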

The second method is to utilize multiple processes. When we do this, tasks genuinely run in parallel! This is useful for CPU-heavy tasks (like data processing), but we only have a handful of cores to run processes on (not everyone has a high-end i7 or Threadripper 🤣).

The third and final method is async-await. It is similar in philosophy to the first (quickly switch between tasks), but instead of the OS handling it... we do! The idea here is that the OS has a lot to do, but we don't, so it's better Python handles it itself. In theory, async-await is easier and quicker than writing thread-safe code (but much, much harder in practice). The primary reason for this is that asynchronous code can behave synchronously. Simply put: if your design pattern is slightly off, you may have 0 performance gain!

I rewrote, redesigned and refactored my code a million times, and here's what I figured out:

  • Use a breadth-first approach (i.e. maximise the number of queries you can run near-simultaneously, independent of anything else)
  • Avoid the producer-consumer pattern, where one function produces items and another consumes them (there aren't many guides or explanations of how to use it in practice, and it seems to arbitrarily limit itself)
  • Whenever you're trying to run multiple things simultaneously, use asyncio.gather(...)
  • Avoid async loops; in practice they run synchronously, since loops process their elements in order (first, second, third...)
  • CPU-intensive tasks must run within a separate process
  • Find non-CPU-intensive, ideally asynchronous, alternative technologies where possible (e.g. don't append to a single JSON/CSV file, as it needs to be completely rewritten each time)
  • Automatically restart queries (Tenacity does this with a simple function decorator)

When you stick to the rules, development stays light, fast and fluid 😏😌!

Note that proper asynchronous code can easily bombard a server with requests. Please don't let this happen, or you'll be blocked 🥶. Simply rate-limit requests with a Semaphore (fancier methods exist, but this is enough)!
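Here's a minimal sketch of asyncio.gather plus a Semaphore doing the throttling (asyncio.sleep stands in for the real HTTP round-trip, and the cap of 10 is arbitrary):

```python
import asyncio

MAX_IN_FLIGHT = 10  # arbitrary cap on simultaneous requests

async def run_query(semaphore, i):
    async with semaphore:          # at most MAX_IN_FLIGHT queries at once
        await asyncio.sleep(0.01)  # stand-in for the real HTTP round-trip
        return i

async def main():
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    # gather launches everything "at once"; the semaphore does the throttling
    return await asyncio.gather(*(run_query(semaphore, i) for i in range(50)))

results = asyncio.run(main())
```

Note that gather returns results in the order the coroutines were passed in, regardless of which finished first.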

THANKS FOR READING!

I know all we currently have is raw data, saved to a file (or database server), but this is the first big step of any unique project. Soon (in part 2) we'll process and view our data with PySpark and look at results from a few basic ML models (created with H2O PySparkling AI)! After that (part 3), we'll take a look at using PyTorch with AllenNLP or Hugging Face Transformers.

If you enjoyed this and you’re interested in coding or data science feel free to check out my other posts like tutorials on practical coding skills or easy portfolio dashboards/websites!


Images by Gerd Altmann on PixaBay, Campaign Creators, Florian Olivo and Reuben Juarez on Unsplash
