Chapter 11: Where to Go from Here

Over the past 10 chapters, you’ve learned a plethora of new tools to help you investigate the social web. You’ve gotten an overview of the social media ecosystem and built a solid technical foundation you can use to collect data and analyze it. In this final chapter, you’ll learn about a few resources that can help you deepen that knowledge, strengthen your coding abilities, and become a better data scientist.

Coding Styles

As with prose, everyone has their own style of writing code. In this book, for example, we wrote code intended to convey concepts and help you navigate Python and the libraries we needed. This meant that our code, while functional, was also verbose: we broke down our analyses into steps and spelled out each one, like using variables to store data frames at different points in time or creating new columns to filter our data. That kind of code is great because it can help you (and other collaborators) understand every stage of the process. But as you start writing longer and more complex scripts or Jupyter notebooks, you may want to use more compact syntax than what you’ve seen here.
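To make the contrast between verbose and compact code concrete, here is a minimal sketch in plain Python. It is not taken from earlier chapters: the sample rows, the engagement column, and both function names are invented for illustration, and a list of dictionaries stands in for a data frame so the example runs without any libraries.

```python
# Hypothetical sample data: each dict stands in for one row of a data frame.
posts = [
    {"user": "a", "likes": 12, "shares": 3},
    {"user": "b", "likes": 250, "shares": 40},
    {"user": "c", "likes": 7, "shares": 0},
    {"user": "d", "likes": 98, "shares": 22},
]


def popular_users_verbose(rows, threshold):
    """Step-by-step version: every intermediate result gets a name."""
    # Step 1: add a new "engagement" column to a copy of each row.
    rows_with_engagement = []
    for row in rows:
        new_row = dict(row)
        new_row["engagement"] = row["likes"] + row["shares"]
        rows_with_engagement.append(new_row)

    # Step 2: keep only the rows at or above the threshold.
    popular_rows = []
    for row in rows_with_engagement:
        if row["engagement"] >= threshold:
            popular_rows.append(row)

    # Step 3: pull out just the user names.
    users = []
    for row in popular_rows:
        users.append(row["user"])
    return users


def popular_users_compact(rows, threshold):
    """Compact version: the same result in a single list comprehension."""
    return [r["user"] for r in rows if r["likes"] + r["shares"] >= threshold]


print(popular_users_verbose(posts, 100))
print(popular_users_compact(posts, 100))
```

Both functions return the same list. The verbose one is easier to inspect halfway through, while the compact one is quicker to read once you are comfortable with comprehensions.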

Likewise, as you move on to more complicated analyses and start to use more and more libraries, you’ll have to learn how to read a wide variety of coding styles. We briefly touched on the idea of writing reusable code when we rewrote our Wikipedia scraping script in Chapter 5. But there’s much more to writing clean, smart, and effective code. Coding is a much more collaborative process than media representations of isolated, antisocial coders would have you believe. Library authors, for example, benefit from the feedback of hundreds (sometimes thousands) of other coders who used their libraries and encountered issues. Coders are always using the work of others to enhance their own programs—and the first step to using someone else’s code is being able to understand it!

To that end, what follows are a few helpful resources on writing clean code with Python and pandas, as well as producing reproducible data analysis in Jupyter Notebook. While by no means a comprehensive guide, they’re a good starting point:

The libraries and tools we’ve used in this book have stood the test of time among Python users, but new libraries pop up all the time and may do certain things better than what is already available. It’s the nature of an open source programming language to evolve with the needs of its users. As you continue your journey, look for blogs and forums maintained by people in your industry to stay informed about the latest trends in Python. As someone who learned Python on the fly, I can personally attest to the value of seeking out resources about Python and specific libraries like pandas.

Statistical Analysis

Throughout this book, we’ve used some basic concepts from the field of statistical analysis. But concepts like mean and median analysis, and aggregating and resampling raw social data, represent only a sliver of the sorts of statistical analyses you can run on the data sets available from the social web.
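As a quick reminder of why even basic measures deserve care, the sketch below compares the mean and the median of a skewed set of numbers using Python’s built-in statistics module. The retweet counts are invented for illustration and do not come from any data set in this book.

```python
from statistics import mean, median

# Invented retweet counts for ten posts; one of them went viral.
retweets = [2, 0, 5, 3, 1, 4, 2, 3, 0, 980]

average = mean(retweets)   # pulled far upward by the single viral post
middle = median(retweets)  # the midpoint of the sorted values

print("mean:", average)
print("median:", middle)
```

Here a single outlier drags the mean far above what a typical post received, while the median barely moves. Social media data is often skewed in this way, so it can be worth looking at both numbers before drawing conclusions.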

Here are a few resources that’ll help you develop your statistical analysis skills:

  • Statistics Done Wrong, a book by Alex Reinhart (No Starch Press, 2015) that covers some of the major missteps in statistical analyses and how to learn from them (https://nostarch.com/statsdonewrong/)
  • Naked Statistics, a book by Charles Wheelan (W. W. Norton, 2014) that provides a great introduction to statistics with relatable—and often amusing—examples (https://books.wwnorton.com/books/Naked-Statistics/)
  • “Tidy Data,” an academic paper by Hadley Wickham that lays out helpful approaches to “tidying” data, or restructuring it for more efficient data analyses (https://www.jstatsoft.org/article/view/v059i10)

Other Kinds of Analyses

Finally, there are some more-advanced kinds of analysis, particularly suited to social web data, that have resulted in some fantastic research over the past few years.

One example is natural language processing (NLP), the process of turning text into data for analysis. Many NLP methods are available through Python libraries, including the Natural Language Toolkit, or NLTK (https://www.nltk.org/), and spaCy (https://spacy.io/). These libraries allow us to break text down into smaller parts—like words, word stems, sentences, or phrases—for further analysis. You might count the occurrences of specific words in a given data set, for instance, and study their relationship to other key phrases to understand how people discuss specific topics on the social web, where speech evolves within specific communities and around every news event. What words are affiliated with a specific news phenomenon? How does this vocabulary differ from group to group? Does each community use different language to discuss the same thing? More and more groups are forming online—based on identity categories, shared politics, and other cultural factors—that eventually form a common vocabulary, cadence, and ideology. NLP can help us better understand how the members of these groups come together and form a new information universe with its own language.
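The word-counting idea can be sketched with nothing but the standard library. This is a simplified stand-in, not the NLTK or spaCy approach: a regular expression replaces a proper tokenizer, and the group names, posts, and the word_counts helper are all invented for illustration.

```python
import re
from collections import Counter

# Invented example posts from two hypothetical communities.
posts_by_group = {
    "group_a": ["The vote is rigged!", "Rigged vote, rigged count."],
    "group_b": ["Go vote today.", "Every vote counts today!"],
}


def word_counts(texts):
    """Lowercase each text, split it into words, and tally them."""
    counts = Counter()
    for text in texts:
        # A crude tokenizer: runs of letters and apostrophes count as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


counts = {group: word_counts(texts) for group, texts in posts_by_group.items()}

# Show the three most frequent words in each community.
for group, counter in counts.items():
    print(group, counter.most_common(3))
```

Comparing the two counters shows which words one community uses and the other does not. A dedicated NLP library would go further by handling word stems, stop words, and sentence boundaries.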

Another exploding field is machine learning, a subfield of artificial intelligence that’s been deployed in everything from the Google search bar’s autocomplete text to insurance estimates. In research, machine learning can also be a powerful tool to “classify” the social web. Put simply, machine learning works by feeding a bunch of data into a program and having it find patterns in that data. For instance, if we fed some of the data from Chapter 10 about false agents on Twitter to a machine learning algorithm, we could then feed it new data and see if it could classify those accounts as false agents based on patterns in the first data set. While this isn’t a surefire way to detect false political actors, it may help narrow down a larger pool of tweets and Twitter accounts for further scrutiny.
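To give a feel for the “learn from labeled data, then classify new data” workflow, here is a deliberately tiny sketch of a nearest-centroid classifier in plain Python. It is one simple technique among many, not the method used on the Chapter 10 data. The two features, the labels, and every number below are invented for illustration.

```python
from math import dist

# Invented training data: (tweets per day, share of tweets that are retweets).
# The labels are hypothetical; real labels would come from a vetted data set.
training = [
    ((120.0, 0.95), "suspect"),
    ((90.0, 0.90), "suspect"),
    ((150.0, 0.85), "suspect"),
    ((4.0, 0.20), "ordinary"),
    ((10.0, 0.35), "ordinary"),
    ((1.0, 0.05), "ordinary"),
]


def fit_centroids(examples):
    """Average the feature vectors of each label: the 'learning' step."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid lies closest to the new account."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(training)

# Two new, unlabeled (and equally invented) accounts.
print(classify(centroids, (110.0, 0.80)))
print(classify(centroids, (6.0, 0.30)))
```

In this toy version the features are not rescaled, so tweets per day dominates the distance calculation. Libraries such as scikit-learn provide tools for scaling features and for checking a model’s accuracy against data it has not seen.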

Here are some suggested resources on NLP and machine learning:

  • Natural Language Processing with Python, a wonderful online primer for NLP by Steven Bird, Ewan Klein, and Edward Loper that clearly explains some of the most important concepts for learners while also teaching technical skills (https://www.nltk.org/book/)
  • “spaCy 101: Everything You Need to Know,” a helpful introductory tutorial to spaCy, a Python library that allows for linguistic analyses (https://spacy.io/usage/spacy-101)
  • “An Introduction to Machine Learning with scikit-learn,” which offers helpful tutorials on the scikit-learn library for people who are just getting started with machine learning (https://scikit-learn.org/stable/tutorial/basic/tutorial.html)

Conclusion

There’s only so much you can learn in the course of 10 chapters. As with any craft, there’s always room to grow and hone our skills, adapting them to the specific fields we work in. What I aimed to do in this book was provide you with a solid foundation on which to build future analyses of the social media ecosystem. Above all, I hope that I’ve sparked in you the kind of curiosity it takes to interrogate the social media world and better understand human behavior online. We’ve only been able to observe social media’s impact for a short time. Hopefully, this book has equipped you to examine its influence for years to come.