Introduction
We experience the social web in brief moments that flash by, often without ever coming back to them. Liking a photo on Instagram, sharing a post that someone published on Facebook, or messaging a friend on WhatsApp—whatever the specific interaction, we do it once and likely don’t think about it after.
But from swipes to clicks to status updates, our online lives are being captured by social media companies and used to fill some of the largest data servers in the world. We are producing more data than ever before. By looking at these data points as a whole, we can gain tremendous insight into human behavior. We can also investigate the harm done by these systems, from detecting false online actors (for example, automated bot accounts or fake profiles that seed misinformation) to understanding how algorithms surface questionable content to viewers over time.
If we look at these data points collectively, we can find patterns, trends, or anomalies and, hopefully, better understand the ways in which we consume and shape the human experience online. This book aims to help those who want to go from simply observing the social web one post or tweet at a time to understanding it on a larger, more meaningful scale.
What Is Data Analysis?
The main goal for any data analyst is to gain useful insights from large quantities of information. We can think of data analysis as a way to interview a vast number of records: we may ask about unusual single events, or we may be looking into long-term trends. Interviewing a data set can be a lengthy process with various twists and turns: it might take a few different approaches to find the answers to our questions, the same way it might take a few different meetings to get a good sense of an interviewee.
Even if our questions are simple and focused, getting to the answers can still require us to make several logical and philosophical decisions. What data set may be useful to examine our own behavior, and how would we get that data? If we wanted to determine the popularity of a Facebook post, would we measure that in number of reactions (likes, hahas, wows, and so forth), the number of comments it received, or a combination of both metrics? If we wanted to better understand how people discuss a specific topic on Twitter, what would be the best way to categorize tweets about it?
So while analyzing data takes a certain amount of technical know-how, it’s also a creative process that requires us to use our judgment in an intentional and informed way. In other words, data analysis is both science and art.
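To make the choice of metric concrete, here is a minimal sketch in Python. The post and its counts are invented for illustration, the function names are hypothetical, and the weight given to comments is an arbitrary assumption rather than a recommended formula. The point is only that each definition of “popularity” produces a different number for the same post.

```python
# A made-up Facebook post; the field names and counts are hypothetical.
post = {
    "reactions": {"like": 120, "haha": 15, "wow": 5},
    "comments": 40,
}


def total_reactions(post):
    """Popularity measured as the sum of all reaction types."""
    return sum(post["reactions"].values())


def total_comments(post):
    """Popularity measured as the number of comments."""
    return post["comments"]


def combined_score(post, comment_weight=2):
    """Popularity measured as reactions plus weighted comments.

    The weight is an arbitrary choice made for this illustration.
    """
    return total_reactions(post) + comment_weight * total_comments(post)


print(total_reactions(post))  # 140
print(total_comments(post))   # 40
print(combined_score(post))   # 220
```

None of these three numbers is the “right” one; which to use depends on the question you are asking of the data.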
Who Is This Book For?
This book is written for people who have little to no previous programming experience. Given the huge role of social media, the internet, and technology in all of our lives, this book aims to explore them in an accessible and straightforward way. Through practical exercises, you’ll learn the foundational concepts of programming, data analysis, and the social web.
On some level, this book is targeted to someone just like my former self—a person who was fiercely curious about the world but also intimidated by jargon-filled forums, conferences, and online tutorials. We’ll take a macro and micro approach, looking at the ecosystem of the social web as well as the minutiae of writing code.
Coding is more than just a way to build a bot or an app: it’s a way to satisfy your curiosity in a world that is increasingly dependent on technology.
Conventions Used in This Book
To access and understand data from social media, we need to learn where that data is stored, how to access it, and how we can make sense of it. In other words, analyzing data from the web involves multiple steps: gathering the data, researching and exploring it, and analyzing it. In the final step, we’ll also draw conclusions from the data and answer our questions about the human behavior and actions that produced it in the first place.
With all that in mind, it’s important to note that this book is not just a compilation of code snippets, ready to be plugged in and used. While it contains scripts that may help you gather and analyze data from the social web, it was first and foremost designed to teach the fundamental concepts and tools of the data analysis process. Think of the chapters as a step-by-step guide for aspiring researchers who are eager to investigate a specific topic or question. My hope is that you’ll come out with the basics you need to start learning and exploring on your own in this field. After all, the landscape of social media is in constant flux, which means that you’ll need to be flexible and continually adapt your analytical approach to understanding the data it produces.
Similarly, conventions in this book were chosen and designed to prioritize your learning rather than the elegance of the code. For instance, this code uses a lot of global variables. (Don’t panic! We’ll cover what variables are in the coming chapters.) While this may not be the most efficient way to code, it’s one that’s friendly to people who might be new to Python.
As for the tools covered, I had two main criteria. I tried to choose tools that are available for free on the web, and that have a relatively low barrier to entry, allowing beginners to get started with simple projects.
What This Book Covers
The chapters of this book are structured to follow the journey of a data sleuth. We’ll begin by covering how and where to find data from the social web. After all, we need data before we can go about analyzing it! Then, in the later chapters, you’ll learn about the tools necessary to process, explore, and analyze the data we’ve mined.
Part I: Data Mining
Chapter 1: The Programming Languages You’ll Need to Know. Introduces frontend languages (HTML, CSS, and JavaScript) and why they’re important within the context of social media data mining. You’ll also learn the basics of Python through practical exercises in the interactive shell.
Chapter 2: Where to Get Your Data. Explains what APIs are and what kind of data you can access through them, and walks you through accessing data in JSON format. This chapter also covers the process of formulating a research question for data analysis.
Chapter 3: Getting Data with Code. Shows you how to gather the data returned from the YouTube API and use Python to restructure it from JSON to a spreadsheet, specifically a .csv file.
Chapter 4: Scraping Your Own Facebook Data. Defines scraping and describes how to inspect HTML to structure content from web pages into data. It also covers the data archives that social media companies provide to users, which contain the users’ own data, and shows you how to extract that data into .csv files.
Chapter 5: Scraping a Live Site. Explains the ethical considerations of scraping websites and walks you through the process of writing a scraper for a Wikipedia page.
Part II: Data Analysis
Chapter 6: Introduction to Data Analysis. Covers the various processes involved in data analyses and introduces Google Sheets by analyzing data from an automated account, or bot.
Chapter 7: Visualizing Your Data. Explores how visualization tools, like making charts within Google Sheets and using conditional formatting to highlight data variations, can help us better understand our data.
Chapter 8: Advanced Tools for Data Analysis. Transfers concepts you learned from analyzing data in Google Sheets into the realm of programmatic analysis. You’ll see how to set up virtual environments in Python 3, navigate Jupyter Notebooks (a web application that is capable of reading and running Python code), and use the Python library pandas. You’ll also explore the structure and breadth of your data sets.
Chapter 9: Finding Trends in Reddit Data. Builds on the previous chapter to show you how to modify data, filter data, and run basic aggregation using functions in pandas.
Chapter 10: Measuring the Twitter Activity of Political Actors. Explains how to format data as timestamps, modify it more efficiently with lambda functions, and resample it temporally in pandas.
Chapter 11: Where to Go from Here. Lists resources for becoming a better Python coder, learning more about statistical analyses, and analyzing text using natural language processing and machine learning.
Downloading and Installing Python
To work through the exercises in this book, you’ll need to set up a number of tools on your computer. I’ll help you with most of these—including signing up for a Google account and installing Python libraries—in the relevant chapters. But there’s one setup we need to do now, before we dive into the content of this book: getting Python installed on your machine.
While there are various ways to set up Python, one of the most straightforward is to go to http://python.org/downloads/ and download the latest version of Python for your Windows or macOS machine. There you should be able to find various installers for 64-bit and 32-bit computers (keep reading for more on the difference between these two).
Warning: Like all programming languages, Python has undergone quite a few changes. Python 2 is an older version of the language that will not work with the exercises in this book, so make sure you download the latest version of Python 3 from the website. There may be other numbers attached to the name of the latest Python version (for example, in Python 3.7.3, the 7 and the second 3 identify a specific release of Python 3), but the important thing is that the first number after the word Python is 3.
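Once Python is installed, one way to confirm which version you have is to ask Python itself. The sketch below relies on sys.version_info from Python’s standard library; the helper name is_python3 is made up for this example and is not something the book’s exercises depend on.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True if the first (major) version number is 3."""
    return version_info[0] == 3


# sys.version_info holds the parts of the version number. For
# Python 3.7.3, the first three parts would be 3, 7, and 3.
major, minor, micro = sys.version_info[:3]
print("Running Python {}.{}.{}".format(major, minor, micro))

if not is_python3():
    print("This is not Python 3; the exercises in this book will not work.")
```

If the first number printed is 3, you have the right version for this book.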
Installing on Windows
To find out whether your Windows machine is 64-bit or 32-bit, select Start > Control Panel > System and see whether it says 64-bit or 32-bit under System Type.
Once you’ve done that, download the Python installer (which ends with the file extension .exe) and then double-click it. On the installer display screen that opens, follow these instructions:
- Select Install for All Users and click Next.
- Click Next again to install Python to C:\Python34.
- You should be prompted to “Customize Python.” You won’t need to do that for the exercises in this book, so click Next again.
Installing on macOS
To find out whether your Mac is 64-bit or 32-bit, click the Apple menu and select About This Mac. This should open a window with some basic information about your Mac. Next to the word Processor, if it says Intel Core Solo or Intel Core Duo, you have a 32-bit machine. If it says something else, like Intel Core 2 Duo or Intel Core i5, you have a 64-bit machine.
Then download the appropriate installer file (which ends in .dmg) for your computer. Once you’ve downloaded the file, double-click it to start the installation. This should pop up an installation window (sometimes it may ask you to enter the admin password for your computer). Follow these instructions to complete the installation:
- You should see a number of screens describing the software (an introduction and a Read Me section). Click Continue to go past them and agree to the software license.
- Select your main hard drive name (for example, Macintosh HD) and click Install.
Getting Help When You’re Stuck
Learning how to code is an incremental journey, one where you’ll fail continually but learn from your mistakes. A missing colon, a minor spelling mistake, a misplaced comma—these are all small things that may throw off beginners and be a source of discouragement. But don’t despair! Every coder goes through this process; learning how to spot and eliminate errors is just part of programming.
As mentioned, social media’s ever-changing nature means we have to continuously adapt: one day we might have to analyze text-based reactions, and another day we might be looking at thousands of images. To be a good coder, in other words, means being a resourceful coder, one who knows how to look for and ask for help with any problem they encounter.
First, there are certain ways to search Google for coding solutions that may yield better results. For instance, it can be helpful to construct your searches following this formula: coding language + verb + specific keywords. One example might be “Python open .csv file.”
Note: For more information about optimizing your search terms, check out the article “Googling for Code Solutions Can Be Tricky—Here’s How to Get Started” by Suyeon Son.
This formula is a good starting point to find related coding examples or similar questions other people have asked, many of which might be posted on Stack Overflow, a forum where coders exchange helpful solutions. As you read through the search results, you may end up seeing more keywords and can refine your query.
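The search formula can also be written out as a tiny piece of Python. This is only a sketch to show the three parts side by side; the function name build_search_query is invented for this example.

```python
def build_search_query(language, verb, keywords):
    """Join the three parts of the formula into one search string."""
    return " ".join([language, verb, keywords])


# coding language + verb + specific keywords
query = build_search_query("Python", "open", ".csv file")
print(query)  # Python open .csv file
```

Swapping out any one part, such as the verb, gives you a new query to try when the first results are not helpful.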
Then there are error messages. One of the most frustrating things for new coders to get used to is the fact that the smallest mistakes may result in broken code and cryptic error messages. When you write a line of code that contains a mistake, Python usually displays an error instead of running your code. Here’s an example of an error I’d get if I tried to add two different data types—a text and a number (we’ll cover data types and mathematical operations in the first chapter!):
>>> "R2D" + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
With Python errors, the gist of the problem is usually at the end of the error message, on its last line. In this case, that’s TypeError: can only concatenate str (not "int") to str. To find a solution to this error, you can copy this error and paste it into a search engine, as shown in Figure 1.
A lot of the results for this query are links to platforms like the aforementioned Stack Overflow. It’s worth reading through the answers and trying the various solutions suggested by the forum participants. Answers with more upvotes may be most useful, and it can also be helpful to review answers for any dialogue that took place between the person posting the question and the respondents. If that doesn’t yield the solution to your coding issue, consider creating an account on Stack Overflow or a similar online platform and actively reaching out to the community of developers for help.
Figure 1: Google search results for the error TypeError: can only concatenate str (not "int") to str
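For this particular error, one fix you might find in the search results is to convert the number into text before joining the two values. The sketch below is one possible solution, not the only one. It uses try and except, which catch an error so the script can keep running, and the built-in str function. The exact wording of the error message can differ between Python versions.

```python
# Reproduce the error from above, but catch it so the script keeps going.
try:
    "R2D" + 2
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# One possible fix: turn the number into text before joining the two.
droid = "R2D" + str(2)
print(droid)  # R2D2
```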
The following list describes the most effective ways to get users of Stack Overflow and similar forums to help you, modeled on tips provided in another No Starch book, Automate the Boring Stuff with Python by Al Sweigart:
- State your objective, not just what you did. This allows developers to see whether there may be alternative methods to achieve what you’re hoping to do.
- Explain what you did, what you’ve tried already to solve your problem, and any other information that may be relevant for developers to know to be able to help you (for instance, make sure you mention what Python version you’re using and what computer you’re working with).
- If you encountered errors, copy and paste the entire error message into your forum post (though be sure to protect your privacy and the privacy of others by redacting any identifying information, like your full name if it’s also the name of your computer, or passwords and other login details). To embed really long snippets of code into your post in space-efficient ways, you can use helpful platforms like Pastebin (http://pastebin.com/).
Last but not least, consult Stack Overflow’s handy guide for other best practices for asking questions on the forum: https://stackoverflow.com/help/how-to-ask/.
Asking strangers on the internet for help may seem daunting, but it can yield wonderful results if you make it a point to ask politely and with respect for people’s time.
Summary
So many of our interactions and our behavior are now captured on social media platforms. While companies like Facebook or Twitter have certainly found ways to leverage this data in aggregate, I firmly believe that researchers and users themselves should be enabled and empowered to glean their own insights from some of these vast data sets. This book offers a beginner-friendly introduction to this kind of data analysis.
I’ve been an instructor for more than a decade and truly love seeing students and peers succeed. While the scope of this book is limited, I hope that it sparks enough curiosity in beginners to compel them to continue learning. For that purpose, feel free to explore the majority of my teaching materials on my website at https://lamivo.com/tips.html.
Now, without further ado, let’s get started!