Chapter 2: Where to Get Your Data
When we use social media, our actions leave a trace. A like on Facebook, a retweet on Twitter, a heart on Instagram: each of these actions represents a data point that is recorded on the internet somewhere. By agreeing to these companies’ terms of service, we allow them to store this data, which they in turn may make available to the public.
Companies allow third parties to tap into these data troves via an application programming interface (API). An API is like a middleman between the social media platform and the developers who wish to access information from it.
In this chapter, I’ll explain in more detail what APIs are and what kind of data you can harvest through them, using YouTube as a practical example.
What Is an API?
On the most basic level, an API is an interface that allows programmers to access other developers’ code. Some programmers use APIs to access data on an online platform so they can make their own apps. For example, a third-party developer might use the official Instagram app’s API to allow users to post images to their Instagram feed outside of the app. In other words, the Instagram API allows the developer to connect their own code to a user whose account is managed by Instagram.
As we’ll discuss shortly, an API also enables developers to make requests for data using a script, a program that communicates with servers and databases on the web.
To better understand how an API works, let’s consider an analogy. Imagine you’re a customer in a restaurant. An API is like the waiter, who gives you a set of options, takes your order, fetches your meal, and brings it to you. The restaurant owner determines what is available on the menu. She also controls how the dish is presented and portioned. The menu details what you can order, the name for each dish, and how it’s generally prepared.
In our context, the social media company represents the restaurant owner, developers (or clients) represent the customers, and the dishes on the menu represent the data we’re trying to gather. A client is any technology that we use to surf the web, like a browser or another application on a phone or desktop.
Whether or not a company like Facebook or Twitter offers an API is entirely up to it: it can make data available through one or multiple APIs, or it may opt to not give out any data at all. And even if the company does allow third parties to access its data stores, it often limits the available data and can change what data it shares at any point. Public outrage over privacy concerns, new laws and regulations aimed at protecting people’s data, and news events involving social media companies all play into the decisions companies make about their data offerings. Some companies may even charge for access to their data.
To find out what kind of information each company offers developers through its APIs, you usually have to read through its documentation, which is a fancy word for an instruction manual. Unfortunately, documentation isn’t standardized and can make even some of the most experienced researchers feel disoriented, especially if they are beginning coders. This is in part because the text is often aimed at application developers rather than researchers, marketers, or other nondevelopers.
The quickest way to find out what information a company makes available is usually to search for the company’s API documentation on Google.
Using an API to Get Data
Now that you have a rough idea of how an API works, we’ll look into how to use one to access data.
As mentioned earlier, third parties request data from APIs using a script. These scripts are often text files that are executed or run by a machine like your computer. Think of a script like a little robot that performs tasks for you. The robot can communicate with an API, request data from it, read the data it receives, and create a spreadsheet from the data.
Scripts often communicate with APIs through URLs like the ones you use to access websites. Using a URL to communicate with an API is known as a URL-based API call (which I’ll shorten to API call). As with most other URLs, you can paste an API call into any browser, and the browser will return the text-based data that you requested. When you use a script to make an API call, your script receives the information that would have been shown in your browser.
Let’s look at Google’s YouTube API as an example. You can use this API to access a plethora of data, including the description of a YouTube channel or number of views it has received over time. You tell the API what data you want by including it in the URL. For this exercise, we’ll request a data feed of posts published by the BuzzFeed Tasty YouTube page. To do so, we’ll use this API call: https://www.googleapis.com/youtube/v3/search?channelId=UCJFp8uSYCjXOMnkUyb3CQ3Q&part=snippet.
Each part of the URL serves a different purpose. Two types of strings make up an API call: a base, which indicates the API you’re using, and various parameters, which tell the call what data you want to harvest and convey information about you (that is, the party requesting the information). In our earlier analogy, the base represents the restaurant where we’re dining, and the parameters are the individual menu items we can pick.
Note The structure of the API call is dependent on the individual API. To find out how to structure your calls to retrieve the information you need, you should consult the API’s documentation. You’ll see how to do that in more detail in “Refining the Data That Your API Returns” on page 41.
In this example, the API base is https://www.googleapis.com/youtube/. It directs the browser or the Python script to Google’s YouTube API. The next part of the call tells Google what version of the API we want to use. Social media websites update from time to time, and when they do, they also need to update their APIs. The version is separated from the base by a forward slash. In this example, we want to use version v3 of the API. (Since versions can change, you should consult the documentation to make sure you’re using the latest one.) The next part is search, which specifies that we’ll be searching for YouTube videos. It is followed by a question mark (?), which marks the start of the parameters.
Then, we specify what we’re searching for. In this case, we’re looking for videos from the BuzzFeed Tasty channel, which has the YouTube channel ID UCJFp8uSYCjXOMnkUyb3CQ3Q. You can often find a channel ID at the end of the URL for a YouTube channel. The Tasty channel URL, for instance, is https://www.youtube.com/channel/UCJFp8uSYCjXOMnkUyb3CQ3Q. To use the channel ID parameter, we type out the name of the parameter, channelId, followed by an equal sign (=) and the long channel ID. The entire parameter looks like this (note that there are no spaces): channelId=UCJFp8uSYCjXOMnkUyb3CQ3Q.
Next we need to specify what kind of data we want to access through the API. To add another parameter, we insert an ampersand (&) followed by the parameter part, which indicates that we’re about to specify which part of the YouTube video data we want to retrieve. In this case that’s snippet, which refers to information about channels and videos (such as a video’s description or a channel’s title) that Google’s YouTube API provides.
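The pieces of this API call can also be put together in code. The following is a minimal Python sketch that uses only the standard library; the variable names are my own choices for illustration and are not part of the YouTube API.

```python
from urllib.parse import urlencode

# The base points to Google's YouTube API.
API_BASE = "https://www.googleapis.com/youtube/"
API_VERSION = "v3"    # which version of the API to use
RESOURCE = "search"   # we want to search for videos

# Parameters are written as name=value pairs and joined with ampersands.
parameters = {
    "channelId": "UCJFp8uSYCjXOMnkUyb3CQ3Q",  # the BuzzFeed Tasty channel
    "part": "snippet",
}

api_call = API_BASE + API_VERSION + "/" + RESOURCE + "?" + urlencode(parameters)
print(api_call)
```

The urlencode function inserts the equal signs and ampersands for us, which helps avoid typos such as stray spaces in the URL.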
Now that we have a URL, we’re ready to make our first call to the API! In the next chapter we’ll make this call through a Python script, but for now, just paste the API call into a browser. This allows you to see the API response instantaneously. Once you do that, your browser should return the message in Listing 2-1.
{
  "error": {
    "errors": [
      {
        "domain": "usageLimits",
        "reason": "dailyLimitExceededUnreg",
        "message": "Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup.",
        "extendedHelp": "https://code.google.com/apis/console"
      }
    ],
    "code": 403,
    "message": "Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup."
  }
}
Listing 2-1: The code that the API call returns in the browser. Your response may look slightly different.
Listing 2-1 is your first API response! The response is structured in JavaScript Object Notation (JSON), a format that APIs use to deliver data. We’ll discuss JSON in more detail in the coming pages.
If you take a closer look at the response, you’ll see the word error, indicating that something went wrong and the API couldn’t fetch the requested posts from the BuzzFeed Tasty channel.
Working with code often involves reading and understanding error messages. As a beginner, you might feel like you’re spending the vast majority of your time testing and fixing code. You’ll often make mistakes before finding the right approach, but the more experience you gain, the easier it will be to fix those mistakes.
In most cases, error messages will give you clues about what went wrong. If you inspect the API error response more closely, you can see that it sent a “message” and an error notification: “Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup.”
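A script can look for the same clues. The sketch below, written in Python with the standard json module, stores the response from Listing 2-1 as a string and pulls out the error code, the message, and the reason. It assumes that an error response always has the shape shown in Listing 2-1, which may not hold for every API.

```python
import json

# The error response from Listing 2-1, stored as a string.
response_text = """
{
  "error": {
    "errors": [
      {
        "domain": "usageLimits",
        "reason": "dailyLimitExceededUnreg",
        "message": "Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup.",
        "extendedHelp": "https://code.google.com/apis/console"
      }
    ],
    "code": 403,
    "message": "Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup."
  }
}
"""

# Convert the JSON text into Python dictionaries and lists.
response = json.loads(response_text)

if "error" in response:
    error = response["error"]
    print("Status code:", error["code"])
    print("Message:", error["message"])
    for detail in error["errors"]:
        print("Reason:", detail["reason"])
```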
To fix this error, we need an API key, which is a method of identifying yourself to an API. YouTube and other websites with APIs want to know who’s using their API, so they sometimes require you to sign up for developer credentials, a form of identification that developers use to gain API access. Credentials are similar to a username and password. In exchange for access to their APIs, social media companies keep track of users in case someone abuses the API.
Getting a YouTube API Key
For social media networks like YouTube, you’ll usually get credentials on the platform’s website. Let’s try getting credentials from YouTube now. To sign up for Google’s YouTube developer credentials, you’ll first need to have a Google account. If you don’t have one already, create one at https://www.google.com. Once you’ve done that, sign in and navigate to a separate page that Google has set up for developers: https://console.developers.google.com/apis/credentials.
Follow the instructions from Google to create credentials and, in particular, an API key.
Note For some APIs, you may encounter the term app or application, which refers to a software or phone app. This is because many developers signing up for credentials will use the API to create third-party apps. In our case, we’re using the API to gather data, but we still need to sign up the same way as an app developer.
This should create a generic API key for you. The default name for your key is “API key,” but you can rename it by double-clicking the key name. I named mine data gathering credentials.
Once you have the key, navigate to the Library page. Google offers a variety of APIs, so you’ll need to enable access to the specific API you want. To do this, navigate to YouTube Data API v3 and click Enable. Now you’re ready to access YouTube through your API key!
The key is connected to your Google account and acts much like a username and password that will allow you to access the Google API. Thus, you should treat this information with the same care and caution you would treat any other username and password. In the past, developers who write and publish scripts to code-sharing platforms like GitHub have accidentally published their credentials online. If you make the same mistake, it can have serious consequences. For instance, someone may abuse your credentials, and that may bar you from accessing services in the future.
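One common way to keep a key out of a published script is to store it outside the script, for example in an environment variable. The following Python sketch shows the idea; the variable name YOUTUBE_API_KEY and the function load_api_key are hypothetical names chosen for this example, not something Google requires.

```python
import os

# Hypothetical name for the environment variable that holds the key.
KEY_VARIABLE = "YOUTUBE_API_KEY"


def load_api_key(variable=KEY_VARIABLE):
    """Return the API key stored in an environment variable, or None."""
    return os.environ.get(variable)


api_key = load_api_key()
if api_key is None:
    print("No key found. Set the", KEY_VARIABLE, "environment variable first.")
else:
    # Never print the whole key; show only enough to confirm it loaded.
    print("Loaded a key ending in", api_key[-4:])
```

Because the key never appears in the script itself, sharing the script on a platform like GitHub does not reveal it.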
Retrieving JSON Objects Using Your Credentials
Now that you have user credentials, let’s access the API again using a URL. Though every API is different, APIs commonly have you enter your API credentials directly into your API call. If you find that isn’t the case for any APIs you use in the future, check the documentation to see where you should enter your credentials. For Google’s YouTube API, you add your API key into the URL as its own parameter after the parameters you specified for your original API call. Enter the following URL, replacing YOUR_API_KEY with your own API key: https://www.googleapis.com/youtube/v3/search?channelId=UCJFp8uSYCjXOMnkUyb3CQ3Q&part=snippet&key=YOUR_API_KEY.
Now you should receive an API response that contains data! Listing 2-2 is a sample of the data that the API call returned in my browser. Since the BuzzFeed page is constantly changing, your data will probably look slightly different, but it should be structured in the same way.
{
  "kind": "youtube#searchListResponse",
  "etag": "\"XI7nbFXulYBIpL0ayR_gDh3eu1k/WDIU6XWo6uKQ6aM2v7pYkRa4xxs\"",
  "nextPageToken": "CAUQAA",
  "regionCode": "US",
  "pageInfo": {
    "totalResults": 2498,
    "resultsPerPage": 5
  },
  "items": [
    {
      "kind": "youtube#searchResult",
      "etag": "\"XI7nbFXulYBIpL0ayR_gDh3eu1k/wiczu7uNikHDvDfTYeIGvLJQbBg\"",
      "id": {
        "kind": "youtube#video",
        "videoId": "P-Kq9edwyDs"
      },
      "snippet": {
        "publishedAt": "2016-12-10T17:00:01.000Z",
        "channelId": "UCJFp8uSYCjXOMnkUyb3CQ3Q",
        "title": "Chocolate Crepe Cake",
        "description": "Customize & buy the Tasty Cookbook here: http://bzfd.it/2fpfeu5 Here is what you'll need! MILLE CREPE CAKE Servings: 8 INGREDIENTS Crepes 6 ...",
        "thumbnails": {
          "default": {
            "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/default.jpg",
            "width": 120,
            "height": 90
          },
          "medium": {
            "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/mqdefault.jpg",
            "width": 320,
            "height": 180
          },
          "high": {
            "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/hqdefault.jpg",
            "width": 480,
            "height": 360
          }
        },
        "channelTitle": "Tasty",
        "liveBroadcastContent": "none"
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "\"XI7nbFXulYBIpL0ayR_gDh3eu1k/Fe41OtBUjCV35t68y-E21BCpmsw\"",
      "id": {
        "kind": "youtube#video",
        "videoId": "_eOA-zawYEA"
      },
      "snippet": {
        "publishedAt": "2016-02-25T22:23:40.000Z",
        "channelId": "UCJFp8uSYCjXOMnkUyb3CQ3Q",
        "title": "Chicken Pot Pie (As Made By Wolfgang Puck)",
        "description": "Read more! - http://bzfd.it/1XPgzLN Recipe! 2 pounds cooked boneless, skinless chicken, shredded Salt Freshly ground black pepper 4 tablespoons vegetable ...",
        "thumbnails": {
          "default": {
            "url": "https://i.ytimg.com/vi/_eOA-zawYEA/default.jpg",
            "width": 120,
            "height": 90
          },
          "medium": {
            "url": "https://i.ytimg.com/vi/_eOA-zawYEA/mqdefault.jpg",
            "width": 320,
            "height": 180
          },
          "high": {
            "url": "https://i.ytimg.com/vi/_eOA-zawYEA/hqdefault.jpg",
            "width": 480,
            "height": 360
          }
        },
        "channelTitle": "Tasty",
        "liveBroadcastContent": "none"
      }
    },
    --snip--
Listing 2-2: Sample data returned by the YouTube API
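The API call that includes a key can also be assembled and sent from a script rather than a browser. The following Python sketch uses only the standard library; build_search_url and fetch_json are hypothetical helper names chosen for this example, and YOUR_API_KEY is a placeholder that you would need to replace with your own key before the request can succeed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(channel_id, api_key):
    """Assemble the API call, with the key as the last parameter."""
    parameters = {"channelId": channel_id, "part": "snippet", "key": api_key}
    return SEARCH_URL + "?" + urlencode(parameters)


def fetch_json(url):
    """Request the URL and convert the JSON response into Python data.

    urlopen raises urllib.error.HTTPError if the API answers with an
    error status such as 403.
    """
    with urlopen(url) as response:
        return json.load(response)


# YOUR_API_KEY is a placeholder; the call only succeeds with a real key.
url = build_search_url("UCJFp8uSYCjXOMnkUyb3CQ3Q", "YOUR_API_KEY")
print(url)
# data = fetch_json(url)  # uncomment once you have inserted your key
```

The request itself is left commented out so that the sketch runs without a key or an internet connection.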
You’ll notice that the API response is still JSON. It might look overwhelming at first, but if we convert the JSON data into a more familiar form—a spreadsheet—it looks like Figure 2-1. If you read through some of the strings in the API response, you can see that the data is a snapshot of five videos from the BuzzFeed Tasty YouTube channel.
Figure 2-1: The JSON data as a spreadsheet (image shows only part of the spreadsheet)
Data formatted in JSON can look very confusing and complex at first, so let’s break each part down to get a better sense of our data’s structure. A JSON object is always stored between two braces ({}). Each post is stored as a JSON object, and the data points that are part of each object are stored as key-value pairs. For example, the first post contains the following data point:
"publishedAt"①: "2016-12-10T17:00:01.000Z"②
The string before the colon is referred to as a key ①, and whatever follows the colon is the value ② associated with that key. The key is the category of our data; you can think of it like the header of a spreadsheet column. The value represents the actual data, like a string, an integer, or a float. To understand how each data point is formatted, you’ll need to look it up in the API documentation. In this example, the key is called “publishedAt”, which, according to YouTube’s documentation, describes the time and date a post or comment was created. The value in our example, “2016-12-10T17:00:01.000Z”, is a timestamp. This timestamp is formatted according to ISO 8601, a standardized way of storing date and time information in one string; the Z at the end indicates that the time is given in UTC.
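To work with a timestamp like this in a script, we can convert the string into a date object. The following Python sketch uses the standard datetime module and assumes that every timestamp has exactly the shape shown above, including the fractional seconds and the trailing Z, which marks the time as UTC.

```python
from datetime import datetime, timezone

published_at = "2016-12-10T17:00:01.000Z"

# The trailing Z marks the time as UTC, so we match it literally and
# then attach the UTC time zone to the result.
timestamp = datetime.strptime(published_at, "%Y-%m-%dT%H:%M:%S.%fZ")
timestamp = timestamp.replace(tzinfo=timezone.utc)

print(timestamp.year, timestamp.month, timestamp.day)
print(timestamp.isoformat())
```

Once the value is a date object rather than a string, it can be sorted, compared with other dates, or grouped by month or year.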
How an API serves data is determined by the social media platform offering the data. That also means that Google determines the keys for our data. For instance, Google decided to call the date and time when a post was published publishedAt instead of date or published_on. Those idiosyncrasies are specific to Google and its API.
Next, let’s look at one whole JSON object in our data set (Listing 2-3).
{①
  "publishedAt": "2016-12-10T17:00:01.000Z",②
  "channelId": "UCJFp8uSYCjXOMnkUyb3CQ3Q",
  "title": "Chocolate Crepe Cake",
  "description": "Customize & buy the Tasty Cookbook here: http://bzfd.it/2fpfeu5 Here is what you'll need! MILLE CREPE CAKE Servings: 8 INGREDIENTS Crepes 6 ...",
  --snip--
},③
Listing 2-3: A snippet with information about a YouTube video titled Chocolate Crepe Cake
As you can see, information for each YouTube video is stored between a set of braces ①, and each JSON object is separated by a comma ③. Within these braces are four keys (“publishedAt”, “channelId”, “title”, “description”) and their associated values (“2016-12-10T17:00:01.000Z”, “UCJFp8uSYCjXOMnkUyb3CQ3Q”, “Chocolate Crepe Cake”, “Customize & buy the Tasty Cookbook here: http://bzfd.it/2fpfeu5 Here is what you’ll need! MILLE CREPE CAKE Servings: 8 INGREDIENTS Crepes 6 …”, respectively), which are displayed in pairs as they were in the previous example. Each key-value pair is also separated from the other pairs by a comma ②.
Let’s zoom out even more and look at the original code snippet from Listing 2-2 again (see Listing 2-4).
{
  "kind": "youtube#searchListResponse",
  "etag": "\"XI7nbFXulYBIpL0ayR_gDh3eu1k/WDIU6XWo6uKQ6aM2v7pYkRa4xxs\"",
  "nextPageToken": "CAUQAA",
  "regionCode": "US",
  "pageInfo": {
    "totalResults": 2498,
    "resultsPerPage": 5
  },
  "items": [
    {
      "kind": "youtube#searchResult",
      "etag": "\"XI7nbFXulYBIpL0ayR_gDh3eu1k/wiczu7uNikHDvDfTYeIGvLJQbBg\"",
      "id": {
        "kind": "youtube#video",
        "videoId": "P-Kq9edwyDs"
      },
      "snippet": {
        "publishedAt": "2016-12-10T17:00:01.000Z",
        "channelId": "UCJFp8uSYCjXOMnkUyb3CQ3Q",
        "title": "Chocolate Crepe Cake",
        "description": "Customize & buy the Tasty Cookbook here: http://bzfd.it/2fpfeu5 Here is what you'll need! MILLE CREPE CAKE Servings: 8 INGREDIENTS Crepes 6 ...",
        "thumbnails": {
          "default": {
            "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/default.jpg",
            "width": 120,
            "height": 90
          },
          "medium": {
            "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/mqdefault.jpg",
            "width": 320,
            "height": 180
          },
          "high": {
            "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/hqdefault.jpg",
            "width": 480,
            "height": 360
          }
        },
        "channelTitle": "Tasty",
        "liveBroadcastContent": "none"
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "\"XI7nbFXulYBIpL0ayR_gDh3eu1k/Fe41OtBUjCV35t68y-E21BCpmsw\"",
      "id": {
        "kind": "youtube#video",
        "videoId": "_eOA-zawYEA"
      },
      "snippet": {
        "publishedAt": "2016-02-25T22:23:40.000Z",
        "channelId": "UCJFp8uSYCjXOMnkUyb3CQ3Q",
        "title": "Chicken Pot Pie (As Made By Wolfgang Puck)",
        "description": "Read more! - http://bzfd.it/1XPgzLN Recipe! 2 pounds cooked boneless, skinless chicken, shredded Salt Freshly ground black pepper 4 tablespoons vegetable ...",
        "thumbnails": {
          "default": {
            "url": "https://i.ytimg.com/vi/_eOA-zawYEA/default.jpg",
            "width": 120,
            "height": 90
          },
          "medium": {
            "url": "https://i.ytimg.com/vi/_eOA-zawYEA/mqdefault.jpg",
            "width": 320,
            "height": 180
          },
          "high": {
            "url": "https://i.ytimg.com/vi/_eOA-zawYEA/hqdefault.jpg",
            "width": 480,
            "height": 360
          }
        },
        "channelTitle": "Tasty",
        "liveBroadcastContent": "none"
      }
    },
    --snip--
Listing 2-4: Sample data returned by the YouTube API, revisited
Now you should be able to see that all the posts are nested inside a pair of brackets ([]). (Note that Listing 2-4 above is truncated in this book, and while you should be able to see the opening and closing brackets in the results of your API call, you can only see the opening bracket in these pages.) All of this data, in turn, is preceded by the string “items” and then followed by a colon. This signifies that the key “items” contains a list of data points, in this case the videos from BuzzFeed’s Tasty YouTube channel. The “items” key-value pair is stored between one more set of braces ({}), which makes up the entire JSON object.
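This nesting is also what a script follows when it turns the response into rows for a spreadsheet. The following Python sketch works on a shortened, hand-trimmed version of the response in Listing 2-4 and picks three keys as columns; which keys to keep is my own choice for illustration.

```python
import csv
import io
import json

# A shortened version of the response in Listing 2-4.
response_text = """
{
  "kind": "youtube#searchListResponse",
  "items": [
    {
      "id": {"kind": "youtube#video", "videoId": "P-Kq9edwyDs"},
      "snippet": {
        "publishedAt": "2016-12-10T17:00:01.000Z",
        "title": "Chocolate Crepe Cake"
      }
    },
    {
      "id": {"kind": "youtube#video", "videoId": "_eOA-zawYEA"},
      "snippet": {
        "publishedAt": "2016-02-25T22:23:40.000Z",
        "title": "Chicken Pot Pie (As Made By Wolfgang Puck)"
      }
    }
  ]
}
"""

response = json.loads(response_text)

# Each video becomes one row; each key we pick becomes one column.
rows = []
for item in response["items"]:
    rows.append({
        "videoId": item["id"]["videoId"],
        "publishedAt": item["snippet"]["publishedAt"],
        "title": item["snippet"]["title"],
    })

# Write the rows as CSV text, which any spreadsheet program can open.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["videoId", "publishedAt", "title"])
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

The script reaches each value by following the keys from the outside in: first “items”, then one video in the list, then “snippet”, and finally the key we want.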
Now you know how to request data from an API and how to read the JSON response it returns, so let’s see how we can tailor the returned data to fit our needs.
Answering a Research Question Using Data
You may have noticed that the data our API call returns is fairly sparse. If we don’t specify what kind of data we’re asking for, the API assumes that we want only basic information and gives us default data points, but this doesn’t mean that we’re limited to that data. For example, look at Listing 2-5, which contains information from one of the videos returned from the API call in Listing 2-2.
{
  "kind": "youtube#searchResult",
  "etag": "\"XI7nbFXulYBIpL0ayR_gDh3eu1k/wiczu7uNikHDvDfTYeIGvLJQbBg\"",
  "id": {
    "kind": "youtube#video",
    "videoId": "P-Kq9edwyDs"
  },
  "snippet": {
    "publishedAt": "2016-12-10T17:00:01.000Z",
    "channelId": "UCJFp8uSYCjXOMnkUyb3CQ3Q",
    "title": "Chocolate Crepe Cake",
    "description": "Customize & buy the Tasty Cookbook here: http://bzfd.it/2fpfeu5 Here is what you'll need! MILLE CREPE CAKE Servings: 8 INGREDIENTS Crepes 6 ...",
    "thumbnails": {
      "default": {
        "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/default.jpg",
        "width": 120,
        "height": 90
      },
      "medium": {
        "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/mqdefault.jpg",
        "width": 320,
        "height": 180
      },
      "high": {
        "url": "https://i.ytimg.com/vi/P-Kq9edwyDs/hqdefault.jpg",
        "width": 480,
        "height": 360
      }
    },
    "channelTitle": "Tasty",
    "liveBroadcastContent": "none"
  }
},
Listing 2-5: A sample response to an API call that contains only basic information about a video
This data item corresponds to the video on BuzzFeed’s Tasty channel in Figure 2-2.
Figure 2-2: A screenshot of the video represented by Listing 2-5
There’s more data on display in the online video—for instance, the number of views and comments—than what we received through the API. This information and much more is available through the API, but we need to think about the kind of data we want to get and what questions we want to answer with it. Specifically, we need to do two things. First, we need to set the goal for our research. This is possibly one of the most important but least considered steps. Having a clear set of questions or a hypothesis for your research will inform how you collect your data. Second, we should consult the API documentation to see if the data we need to meet our research goal is available.
A good example of this approach is the BuzzFeed news story “Inside the Partisan Fight for Your News Feed” (https://buzzfeed.com/craigsilverman/inside-the-partisan-fight-for-your-news-feed), a project for which Craig Silverman, Jane Lytvynenko, Jeremy Singer-Vine, and I gathered 4 million posts from 452 different Facebook pages through Facebook’s Graph API. With millions of data points, we couldn’t simply analyze all the data. We’d end up overwhelmed and wouldn’t be able to find any meaningful patterns or trends. To start our analysis, we first needed to narrow down the information we wanted to use.
Since more and more news organizations are relying on third parties like Facebook to reach their audiences, this project took a deep look into how these organizations—both new and old—compare to one another on Facebook. We decided to analyze the popularity of left- and right-leaning news organizations based on their number of followers and the engagements (reactions and comments) each page garnered. Once we narrowed the data down into two categories, we graphed the information over time, as shown in Figure 2-3.
Figure 2-3: A chart showing the performance of left- and right-leaning Facebook pages analyzed by BuzzFeed News
We can see that engagement of left-leaning Facebook pages increased over time. In order to find answers to your research questions, then, you need to be able to not only access information but also filter it.
If we wanted to do a similar analysis of, say, the popularity of BuzzFeed’s Tasty channel content over time, we would start by thinking about the categories of data that may help us answer that question. For example, we have multiple ways to measure the popularity of a video, such as the number of views, likes, dislikes, and comments. We’d need to decide which measure we want to use.
In some cases, the visual layout of a social media platform’s posts is a good way to determine how to answer your research questions. For example, Figure 2-4 can give us an idea of what kind of data to look for in our popularity analysis of BuzzFeed’s Tasty channel.
Figure 2-4: An annotated screenshot of the post that we saw represented as data in Listing 2-5
The best way to understand the nature of BuzzFeed’s content may be by looking at its video properties (like the headline and the description that accompanies each video). For example, we could use the number of views and upvotes or downvotes as a way to measure a video’s popularity. Last but not least, we could use the video timestamps to determine what content is performing well over time.
Now, how do we access some of this data? This is when we look through the API’s documentation to find out whether the information is available. As mentioned earlier, a behemoth company like Google offers a number of APIs with various sets of documentation. We’re interested in YouTube’s Data API, which has documentation at https://developers.google.com/youtube/. Each API is organized differently, so you’ll want to read the introduction or overview of any API you start using. Let’s review some of the basics of Google’s YouTube API before covering how to use it.
Refining the Data That Your API Returns
There are various parameters we can use to further narrow down or specify the kind of information we want to collect. Go to YouTube’s API documentation at https://developers.google.com/youtube/v3/docs/search/list and scroll down to the Parameters table. The left column lists the parameters by name, and the right column provides a description and usage guidelines. When you look for data, read the description of each parameter and find one that matches the type of information you want to access. Let’s say we wanted to narrow the results of our API to only videos that mention the word cake. To refine our API call, we would use the parameter q (short for query) and then enter the term we are searching for. This means you would enter https://www.googleapis.com/youtube/v3/search?channelId=UCJFp8uSYCjXOMnkUyb3CQ3Q&part=snippet&key=YOUR_API_KEY&q=cake into your browser.
Let’s break this call down. The first part is similar to the first API call we made earlier in the chapter. We’re accessing the search part of the API (search?) and specifying through the parameter channelId that we want to restrict our search to videos from the BuzzFeed Tasty channel. As before, the parameter part asks for the snippet of each video. Next, we enter our API key like before, followed by the ampersand (&) indicating that we’re about to add another parameter, q. Then we use an equal sign (=) and specify the search term cake for the API to return. When you enter this API call into your browser, you should get a JSON response containing only videos with the word cake in the description or title.
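The same breakdown can be done in code, which is a handy way to check an API call before sending it. The following Python sketch uses the standard urllib.parse module to split the call into its path and its parameters; YOUR_API_KEY remains a placeholder.

```python
from urllib.parse import parse_qs, urlsplit

api_call = (
    "https://www.googleapis.com/youtube/v3/search"
    "?channelId=UCJFp8uSYCjXOMnkUyb3CQ3Q&part=snippet&key=YOUR_API_KEY&q=cake"
)

# Separate the part before the question mark from the parameters after it.
parts = urlsplit(api_call)
print("Path:", parts.path)

# parse_qs returns each parameter name with a list of its values.
parameters = parse_qs(parts.query)
for name, values in parameters.items():
    print(name, "=", values[0])
```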
Great! You’ve now learned how to use parameters to tailor your API data requests.
Summary
Knowing how to request specific data through an API call is an important skill, both technically and conceptually. While every API has its own parameters, limitations, and authentication processes, I hope this chapter has equipped you with the tools to successfully navigate the various approaches each API requires. In the next chapter, I’ll show you how to access and refine data using a Python script.