Web Scraping Text Python



Monday, January 25, 2021

Web Scraping with requests and Beautiful Soup; Simple Text Analysis with NLTK. For this exercise you will need the following packages: nltk, requests, and bs4 (Beautiful Soup). Beautiful Soup is a Python library used to pull data out of HTML and XML files. It is mainly designed for web scraping. It works with a parser to provide a natural way of navigating, searching, and modifying the parse tree. The latest version of Beautiful Soup is 4.8.1.


This is a use case of scraping Twitter for sentiment analysis. Let's start with... Donald Trump.


I am not a big fan of Donald Trump. Technically, I don’t like him at all. However, he has a certain charismatic effect: his name occupies most newspapers and social media all the time, and people’s attitudes toward him are dramatic and polarized. The words used to describe him are either highly positive or highly negative, which makes them perfect material for web scraping and sentiment analysis.

The goal of this workshop is to use a web scraping tool to read and scrape tweets about Donald Trump with a web crawler. Then we conduct a sentiment analysis using Python to find out the public's voice about the President. Finally, we visualize the data using Tableau Public.

You Should Continue to Read:

  1. You don’t know how to scrape content or comments on social media.
  2. You know Python but don’t know how to use it for sentiment analysis.

Let’s start with scraping using Octoparse. Download the newest version from the official website and finish registration by following the instructions. After you log in, open the built-in Twitter template.

Tweet Data Extracted in the Scraper

  1. Name
  2. Publish time
  3. Content
  4. Image URL
  5. Tweet URL
  6. Numbers of comments, retweets, and likes

Enter “Donald Trump” in the Parameter field to give the crawler its keyword. As simple as it seems, I got about 10k tweets; you can scrape as many tweets as you like. After getting the tweets, export the data as a text file and name it “data.txt”.

Sentiment Analysis using Python

Before getting started, make sure you have Python and a text editor installed on your computer. I use Python 2.7 and Notepad++.

Then we use two opinion word lists to analyze the scraped tweets. You can download them from here. These two lists contain the positive and negative words (sentiment words) compiled by Minqing Hu and Bing Liu in their research on opinion words in social media.

The idea here is to take each opinion word from the lists, return to the tweets, and count the frequency of each opinion word in the tweets. As a result, we collect the opinion words that appear in the tweets along with their counts.


First, create a positive list and a negative list from the two downloaded word lists. These lists store all the words parsed from the text files.
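The loading code isn't included in the scraped copy; a minimal sketch might look like this, assuming the downloaded files keep their original names and the usual Hu & Liu file layout (a commented header with lines starting with ";", then one word per line):

```python
def load_words(path):
    """Read one opinion word per line, skipping blanks and ';' comment lines."""
    words = []
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith(";"):
                words.append(line)
    return words

# Hypothetical usage, assuming the files were saved under these names:
# positive_list = load_words("positive-words.txt")
# negative_list = load_words("negative-words.txt")
```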

Then, preprocess the text and massage the data by taking out all the punctuation, signs, and numbers with the following code.
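The original snippet was stripped from this copy; a simple regex-based version of the same idea (everything except letters is dropped, then the text is lowercased and split into tokens) could be:

```python
import re

def tokenize(text):
    # Replace anything that is not a letter or whitespace with a space,
    # then lowercase and split into word tokens.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
    return cleaned.lower().split()

# Hypothetical usage on the exported file:
# with open("data.txt") as f:
#     tokens = tokenize(f.read())
```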

As a result, the data consists only of tokenized words, which makes it easier to analyze. Afterward, create three dictionaries: word_count_dict, word_count_positive, and word_count_negative.

Next, fill in each dictionary. If an opinion word appears in the data, count it by increasing its word_count_dict value by 1.

After counting, decide whether the word is positive or negative. If it is a positive word, word_count_positive increases its value by 1; otherwise the positive dictionary keeps the same value. word_count_negative behaves the same way for negative words. If a word is in neither the positive nor the negative list, it is skipped.
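The counting logic described above can be sketched as one function over the token list; the three dictionary names come from the article, while the function name and signature are my own:

```python
def count_opinion_words(tokens, positive_list, negative_list):
    """Count opinion words in the token stream, split by polarity."""
    word_count_dict = {}       # every opinion word -> frequency
    word_count_positive = {}   # positive opinion words only
    word_count_negative = {}   # negative opinion words only
    positive_set, negative_set = set(positive_list), set(negative_list)
    for word in tokens:
        if word in positive_set:
            word_count_dict[word] = word_count_dict.get(word, 0) + 1
            word_count_positive[word] = word_count_positive.get(word, 0) + 1
        elif word in negative_set:
            word_count_dict[word] = word_count_dict.get(word, 0) + 1
            word_count_negative[word] = word_count_negative.get(word, 0) + 1
        # words in neither list are skipped
    return word_count_dict, word_count_positive, word_count_negative
```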


Polarity: Positive vs. Negative

As a result, I got 5,352 negative words and 3,894 positive words. Save the list under a name of your choice, open it with Tableau Public, and build a bubble chart. If you don't know how to use Tableau Public to create a bubble chart, click here.
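The export step isn't shown in this copy; one plausible way, assuming the two polarity dictionaries from the previous step, is to flatten them into a CSV that Tableau Public can read and size bubbles by:

```python
import csv

def export_counts(word_count_positive, word_count_negative, path):
    """Write (word, polarity, count) rows for charting in Tableau."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "polarity", "count"])
        for word, count in word_count_positive.items():
            writer.writerow([word, "positive", count])
        for word, count in word_count_negative.items():
            writer.writerow([word, "negative", count])
```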

The use of positive words is narrow. There are only 404 distinct positive words used, the most frequent being, for example, “like,” “great,” and “right.” Most of the choices are basic and colloquial, like “wow” and “cool.” The use of negative words, by contrast, is much more varied: there are 809 distinct negative words, most of them formal and advanced. The most frequently used are “illegal,” “lies,” and “racist,” and other advanced words such as “delinquent,” “inflammatory,” and “hypocrites” also appear.

The choice of words suggests that the education level of those who are supportive is lower than that of those who disapprove. Apparently, Donald Trump is not so welcome among Twitter users.


Summary

In this article, we talked about how to scrape tweets on Twitter using Octoparse. We also discussed how to preprocess the text data and analyze the positive/negative opinion words expressed on Twitter using Python. For a complete version of the code, you can download it here (https://gist.github.com/octoparse/fd9e0006794754edfbdaea86de5b1a51).

Ashley is a data enthusiast and passionate blogger with hands-on experience in web scraping. She focuses on capturing web data and analyzing it in a way that empowers companies and businesses with actionable insights. Read her blog here to discover practical tips and applications of web data extraction.

Article in Spanish: Scraping de Twitter y Análisis de Sentimientos Utilizando Python
You can also read web scraping articles on the official website.


I’ve recently had to perform some web scraping from a site that required login. It wasn’t as straightforward as I expected, so I’ve decided to write a tutorial about it.

For this tutorial we will scrape a list of projects from our bitbucket account.

The code from this tutorial can be found on my Github.

We will perform the following steps:

  1. Extract the details that we need for the login
  2. Perform login to the site
  3. Scrape the required data

For this tutorial, I’ve used the following packages (can be found in the requirements.txt):
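The requirements.txt itself isn't included in this copy; judging by the steps below (a session-based POST plus xpath parsing), it most likely lists just these two packages:

```
requests
lxml
```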

Open the login page

Go to the following page: “bitbucket.org/account/signin”. You will see the login page (perform a logout in case you’re already logged in).

Check the details that we need to extract in order to login

In this section we will build a dictionary that will hold our details for performing login:


  1. Right click on the “Username or email” field and select “inspect element”. We will use the value of the “name” attribute for this input, which is “username”. “username” will be the key and our user name / email will be the value (on other sites this might be “email”, “user_name”, “login”, etc.).
  2. Right click on the “Password” field and select “inspect element”. In the script we will need to use the value of the “name” attribute for this input, which is “password”. “password” will be the key in the dictionary and our password will be the value (on other sites this might be “user_password”, “login_password”, “pwd”, etc.).
  3. In the page source, search for a hidden input tag called “csrfmiddlewaretoken”. “csrfmiddlewaretoken” will be the key and value will be the hidden input value (on other sites this might be a hidden input with the name “csrf_token”, “authentication_token”, etc.). For example “Vy00PE3Ra6aISwKBrPn72SFml00IcUV8”.


We will end up with a dict that will look like this:
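Based on the three keys identified above, the payload would look like this; the values shown are placeholders (the csrf token is extracted from the page at runtime):

```python
# Placeholder values -- substitute your own credentials and the token
# pulled from the login page's hidden input.
payload = {
    "username": "<USERNAME OR EMAIL>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": "<CSRF_TOKEN>",
}
```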

Keep in mind that this is the specific case for this site. While this login form is simple, other sites might require us to check the request log of the browser and find the relevant keys and values that we should use for the login step.

For this script we will only need to import the following:
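The import block isn't shown in this copy; given that the script uses requests for the session and lxml for xpath, it is presumably just:

```python
# requests drives the HTTP session; lxml.html parses pages for xpath queries.
from lxml import html
import requests
```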


First, we would like to create our session object. This object will allow us to persist the login session across all our requests.
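Creating the session is a one-liner; the variable name here is my own choice:

```python
import requests

# The Session persists cookies (including the login cookie set by the
# POST later on) across every request made through it.
session_requests = requests.Session()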

Second, we would like to extract the csrf token from the web page; this token is used during login. For this example we are using lxml and xpath, but we could have used regular expressions or any other method to extract this data.
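A sketch of the extraction step, written as a function so the xpath part is separate from the network call; the URL in the usage comment comes from the signin page mentioned earlier:

```python
from lxml import html

def extract_csrf(page_html):
    """Pull the hidden csrfmiddlewaretoken value out of the login page."""
    tree = html.fromstring(page_html)
    return tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")[0]

# Hypothetical usage with the session created above:
# result = session_requests.get("https://bitbucket.org/account/signin/")
# token = extract_csrf(result.text)
```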

** More about xpath and lxml can be found here.


Next, we would like to perform the login phase. In this phase, we send a POST request to the login url. We use the payload that we created in the previous step as the data. We also use a header for the request and add a referer key to it, pointing at the same url.
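The login step described above, sketched as a small helper (the function name is mine; the referer header and payload shape follow the text):

```python
def do_login(session, login_url, payload):
    # POST the login form; the referer header mirrors the login url,
    # since csrf-protected forms often reject posts without it.
    return session.post(login_url, data=payload, headers={"referer": login_url})

# Hypothetical usage with the session and payload built earlier:
# result = do_login(session_requests, "https://bitbucket.org/account/signin/", payload)
```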

Now that we have successfully logged in, we will perform the actual scraping from the bitbucket dashboard page.

In order to test this, let’s scrape the list of projects from the bitbucket dashboard page. Again, we will use xpath to find the target elements and print out the results. If everything went OK, the output should be the list of buckets/projects in your bitbucket account.
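The exact xpath isn't preserved in this copy; this sketch assumes, hypothetically, that the dashboard markup listed repository names in `span.repo-name` elements, so the query is illustrative rather than the author's original:

```python
from lxml import html

def extract_project_names(dashboard_html):
    # Assumed selector: repository names inside <span class="repo-name">.
    tree = html.fromstring(dashboard_html)
    return [name.strip() for name in tree.xpath("//span[@class='repo-name']/text()")]

# Hypothetical usage after logging in:
# result = session_requests.get("https://bitbucket.org/dashboard/overview")
# print(extract_project_names(result.text))
```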

You can also validate the request results by checking the status code returned from each request. It won’t always tell you that the login phase was successful, but it can be used as an indicator.

for example:
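A minimal check on a requests response object (the helper name is mine; `result.ok` is True for any status code below 400):

```python
def login_succeeded(result):
    # A 200 on the post-login request suggests, but does not prove,
    # that the login worked.
    return result.ok and result.status_code == 200
```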


That’s it.

Full code sample can be found on Github.