How to scrape websites that require login using Python libraries


I recently came across a website that I wanted to scrape data from, but it required a login. I tried different ways of logging in to the site and getting the information I needed, using well-known libraries such as mechanize, cookielib, and bs4. I managed to reach my goal with five different approaches, but here I am going to share the easiest one (only about 30 lines of code).

We are going to use only two libraries

  • requests
  • bs4 or BeautifulSoup 4

For this tutorial we will scrape data from lumosity.com, a simple website offering games that train memory, focus and so on. I chose it because it uses modern protections like captchas and auth tokens, yet it is still scrapeable.

[Screenshot: the lumosity.com login page]

The captcha doesn't show up until two or three unsuccessful attempts, as seen in the image below.

[Screenshot: the captcha appearing after failed login attempts]

Let's begin coding :)

Let's start by importing the libraries:

from bs4 import BeautifulSoup
import requests

The requests library is a special one, as it has many uses; one of its best features is Session objects. Without the Session class I would have had to write much longer code and manage cookies by hand with the cookielib library.

Reading documentation is one of the core skills of a developer, so be patient enough to read at least the Session Objects part of the requests documentation. Once you have read it (or at least opened it), let's move on to the website we want to scrape.
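To get a feel for why a Session matters, here is a minimal sketch (the URL is just a placeholder): cookies set by one request on the session are automatically sent with the next one, which is exactly what keeps us logged in later.

import requests

# A Session keeps cookies (and other settings) between requests,
# so a login cookie set by one request is sent along with the next one.
session = requests.Session()

# example.com is only a placeholder; most real sites set a session
# cookie on the first visit.
session.get("https://example.com/")
print(session.cookies.get_dict())  # cookies stored on the session so far

# A plain requests.get() call starts from scratch every time,
# which is why you would otherwise have to juggle cookies by hand.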

Examine the website closely

For this we need developer tools. Chrome, Firefox and Opera all have built-in developer tools (in their desktop versions): press F12 or Ctrl + Shift + I, or right-click and choose Inspect Element.

For Android there are a few apps you can try, such as Packet Capture and HttpCanary (recommended).

[Screenshot: developer tools opened on the login page]

Now type in any fake email and password and hit login. Do this only once, because the captcha will start showing up after the second attempt (in my case).

Now check the Network tab and you will see something like the image below.

[Screenshot: the Network tab showing the authentication request]

Don't forget to note down the Request URL (1), the Request Method (2) and the Status Code (3). In my case:

  1. Request URL = https://www.lumosity.com/authentication
  2. Request Method: POST

    For the different types of requests and their definitions, see this blog. For now, think of a POST request as something we post to the site's server which returns some data: we post our email and password, the server checks whether they are correct, and it returns a response (a small sketch follows this list).

  3. Status Code: 401

    The different status codes and their definitions can be found on this site. In this case 401 stands for Unauthorized, since I supplied fake credentials.
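If you want to see a POST request and a status code in action before touching the real site, here is a tiny sketch against httpbin.org, a public request-testing service (the field names are invented for illustration):

import requests

# POST some made-up form data to a test endpoint and inspect the
# status code it returns.
response = requests.post(
    "https://httpbin.org/post",
    data={"user[login]": "test@example.com", "user[password]": "not-real"},
)
print(response.status_code)      # 200 - httpbin accepts any POST
print(response.json()["form"])   # httpbin echoes the form data back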

Now for the most important part: scroll down and check the Form Data (the payload).

[Screenshot: the Form Data section of the authentication request]

  1. user[login]: pseudomonk@gmail.com
  2. user[password]: pseudomonk

As you can see, the 1st and 2nd entries are just the email and password I typed in. Nothing special about them, but the 3rd one is a game changer:

  3. authenticity_token: +8VBXIulnZK9wq1Es1FYEx71c+u0rLIbbs6imk3WVYmZpqOQzQVLTMCqInwCpJ9mfLdb0DwnO+V9oGBCyhwcag==

Whoa... where on earth did this come from? A string of random letters and numbers that changes every time you make a request; big deal, right? Where was it generated? Do we need to decode it? Many questions come to mind, but there is no need to worry, it's pretty easy.

Switch from the Network tab to the Elements tab.

On Android or iOS, just prefix the URL with view-source: in Chrome or Firefox. For example, in my case: view-source:https://www.lumosity.com/authentication

[Screenshot: the page source showing the login form's input tags]

It's good to know some HTML basics, at least the different tags and attributes. For now, let's continue with what we see in the image above:

  1. <input maxlength="60" value="pseudomonk@gmail.com" placeholder="Email" size="60" type="text" name="user[login]" id="user_login">

    and

    <input maxlength="40" placeholder="Password" size="40" type="password" name="user[password]" id="user_password">

The name attributes here, name="user[login]" and name="user[password]", are what we will use as the keys when we send the username and password in the payload.

  2. <input type="hidden" name="authenticity_token" value="t8HDsVcnJVd6B0URepp+J7ZuYRfYPR1FUSde/v/PxRTVoiF9EYfziQdvyinLb7lS1CxJLFC2lLtCSZwmeAWM9w==">

We have seen this kind of value somewhere before. Oh yeah, it's the authenticity_token we found in the form data above. But when we compare the two, they don't match. Why? Because a new authenticity_token is generated every time the page loads. As soon as we click login, the current token is sent with the request; if the login was correct you are redirected inside, otherwise the page reloads with yet another authenticity_token. So we just need to grab the value from the input tag above.

Real Coding Starts From Here

Now that we have imported the required libraries, let's continue by initiating the session:

session_requests = requests.session()

Next, set a variable for the login URL, which is the Request URL we noted above:

login_url = "https://www.lumosity.com/authentication"

Next we have to extract the authenticity_token value from the input tag; for that we need the Beautiful Soup library. First we make a GET request to the login page:

result = session_requests.get(login_url)
soup = BeautifulSoup(result.content, 'html.parser')

If we print the soup variable, the whole HTML document gets printed; here we only need the value of authenticity_token. If you have read the Beautiful Soup documentation, you will find a find('html tag', {name: name of the element}) method.

authenticity_token = soup.find('input', {'name': 'authenticity_token'}).get('value')

Now print the authenticity_token just to be sure that it gets captured:

print(authenticity_token)

[Screenshot: terminal output showing the printed authenticity_token]

Yay! It successfully printed out the authenticity_token. Now let's make the POST request to log in to the site. How to make a POST request is detailed in the requests documentation, but the payload (form data) has to be sent too. So let's build it as a dictionary, using the field names we found in the form data:

payload = {
    "username": "<Your email>", 
    "password": "<Your Password>", 
    "authenticity_token": authenticity_token
}

The authenticity_token captured by the lines above is passed into the payload. Now make the POST request:

result = session_requests.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)

When we run the script, it doesn't print anything, so how do we confirm we got logged in? The easiest check is the status code of the request. To see it, just add this at the end of your code:

print(result.status_code)

If it prints 200, you're logged in. You might wonder whether it also prints 200 for a fake email and password; check that yourself.
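In fact, a 200 alone is not proof of success: many sites re-render the login form with an error message and still return 200. A safer check is to look for something that only exists once you are logged in. Here is a rough sketch, continuing from the POST above and assuming the page returned after a successful login contains the display-name element we scrape later in this tutorial (you may instead need to fetch a logged-in page first):

# Status code alone can be misleading: a failed login often still returns 200.
# Look for a marker that only appears when logged in.
soup = BeautifulSoup(result.content, 'html.parser')
if soup.find('span', class_='display-name') is not None:
    print("Logged in")
else:
    print("Login probably failed - double-check the payload and credentials")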

Now that we are logged in, let's scrape some data. For this tutorial I will only scrape the name of the logged-in person.

[Screenshot: the logged-in Lumosity page showing the display name]

Now let's send the second request:

url = 'https://www.lumosity.com/train/turbo/odp/1/start'
result = session_requests.get(
    url, 
    headers = dict(referer = url)
)

We want to scrape the username, so we go back to the Elements tab in developer tools and inspect the element's HTML:

[Screenshot: the Elements tab showing the display-name span]

<span class="display-name">Heidrun</span>

Almost done; the Beautiful Soup library can easily grab it for us:

soup = BeautifulSoup(result.content, 'html.parser')
name = soup.find('span', class_ = 'display-name').text

Done, just print the name:

print(name)

[Screenshot: terminal output showing the two status codes and the name]

I printed the status codes of both requests, which is why 200 appears twice. At the end it printed the name (Heidrun).
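One thing worth guarding against: if the login did not actually succeed (or Lumosity changes its markup), find() returns None and calling .text on it raises an AttributeError. A small defensive variant of the last step, with name_tag as a hypothetical helper variable:

name_tag = soup.find('span', class_='display-name')
if name_tag is not None:
    print(name_tag.text)
else:
    print("Could not find the display-name element - are you really logged in?")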

So we have successfully scraped data from a website that requires login. You can try other websites; many of them use this kind of token hidden in the page elements. The full code used in this tutorial is given below.

from bs4 import BeautifulSoup
import requests


# Start a session so cookies persist across requests
session_requests = requests.session()
login_url = "https://www.lumosity.com/authentication"

# GET the login page and pull out the hidden authenticity_token
result = session_requests.get(login_url)
soup = BeautifulSoup(result.content, 'html.parser')
authenticity_token = soup.find('input', {'name': 'authenticity_token'}).get('value')
print(authenticity_token)

# Build the payload with the same field names the login form uses
payload = {
    "user[login]": "<Your email>", 
    "user[password]": "<Your Password>", 
    "authenticity_token": authenticity_token
}

# POST the credentials to log in
result = session_requests.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)
print(result.status_code)

# Fetch a logged-in page and scrape the display name
url = 'https://www.lumosity.com/train/turbo/odp/1/start'
result = session_requests.get(
    url, 
    headers = dict(referer = url)
)
soup = BeautifulSoup(result.content, 'html.parser')
name = soup.find('span', class_ = 'display-name').text
print(name)

So this article comes to an end. Thank you all for reading.
