Python Web Scraping

1. What is it?

1.1. Web Crawling involves programs that automatically scan web page content, normally for the purpose of classification

1.1.1. such programs are often referred to as spiders

1.1.2. spiders are used by search engines like Google

1.1.3. web searches are based on classification data produced by spiders

1.1.4. spiders are automated programs, often categorised as bots, which automatically run and follow links from page to page across millions of web sites

1.2. Web Scraping can be used to describe part of the activity of Web Crawling, but often Web Scraping can describe a program that targets a specific website

1.2.1. Web Scraping involves an automated program known as a scraper

1.2.2. The purpose of a scraper is to extract relevant data from the targeted web page(s) and aggregate the data into a usable, structured format

2. Web Scraping Ethics

2.1. Legally ambiguous

2.1.1. Intellectual Property (IP) Rights

2.1.1.1. includes Copyright, which is an automatically granted right assigned to creator of data

2.1.1.2. copyright applies not to acquiring the data but to how it is used, so the act of scraping itself is legal

2.2. Must consult the robots.txt document that should be stored in the root directory of the website

2.2.1. robots.txt specifies what is and what is not allowed by the site in terms of bots

2.2.2. you should respect the information in the robots.txt file

2.3. You should generally respect the website you are scraping and avoid putting excessive loads on the server via high volumes of automated requests that might overload the server and degrade its performance

3. Jupyter Notebook

3.1. Server-client application that allows editing and execution of notebook documents via web browser

3.2. notebook documents are rich text, human readable documents that combine executable code, text, pictures, graphs, etc.

3.3. the application can be installed on a local machine and run without an Internet connection, or installed on a remote server and accessed via an Internet connection

3.4. Notebook documents attach to a code kernel, which enables the embedded code to be executed when notebook is run

3.4.1. Jupyter notebooks attached to Python code kernel are known as IPython notebooks

3.4.1.1. Notebook files in IPython format take the *.ipynb file extension

3.5. Notebook editing tips

3.5.1. Ctrl + Enter executes cell and keeps focus on that cell

3.5.2. Shift + Enter executes cell and creates new cell below, shifting focus to the new cell

3.5.3. The print function can be used, but the value of the last expression in a cell is also displayed implicitly

3.5.3.1. example cell input:

3.5.3.1.1. x = [1,2,3,4,5]
x

3.5.4. Press A to insert new cell above active one, press B to insert new cell below

3.5.5. Press D twice (i.e. D + D) to delete active cell

3.5.6. Press M to convert cell to Markdown

3.5.6.1. Markdown cell formatting

3.5.7. Press Y to convert cell to Code

4. Anaconda

4.1. A distribution that bundles Python language, Jupyter Notebook application and numerous packages for data science

4.2. Installing packages

4.2.1. Use Anaconda Prompt or Anaconda Powershell Prompt

4.2.1.1. Windows provides older cmd command prompt and newer Powershell prompt - you can run the same commands for tasks like package installation via either of these

4.2.2. pip install <package_name>

4.2.2.1. example

4.2.2.1.1. pip install requests-html

5. APIs

5.1. Application Programming Interface

5.2. Not web scraping, but should always be used in preference to web scraping where available

5.3. APIs can be free or paid

5.4. For getting data from websites, we use web APIs, based on HTTP

5.5. API documentation should specify how to use the API and the format of the response

5.5.1. common response format is JSON

6. HTTP

6.1. HyperText Transfer Protocol

6.2. Websites consist of a collection of HTML code files, image files, video files and various other files, such as style sheets, etc.

6.3. Client makes request to download file and server responds with requested file

6.3.1. Web browser (client) interprets downloaded files, displaying website content within the browser window

6.4. HTTP Request Methods

6.4.1. GET

6.4.1.1. Most popular request method

6.4.1.2. fetches data from server

6.4.1.3. can be bookmarked

6.4.1.4. parameters are added to URL, but in plain text

6.4.1.4.1. not to be used for exchanging sensitive information

6.4.2. POST

6.4.2.1. 2nd most popular HTTP method invoked

6.4.2.2. Alters state of some object held server side

6.4.2.2.1. example is a shopping basket

6.4.2.3. Used to send confidential information

6.4.2.3.1. Login credentials always passed via POST request

6.4.2.4. Parameters added in separate body

6.5. HTTP Response Codes

6.5.1. 200

6.5.1.1. Request was processed successfully

6.5.2. 404

6.5.2.1. Error: page not found

6.6. HTTP Response format

6.6.1. HTML for web pages

6.6.2. For web APIs, most common response format is JSON

7. JSON

7.1. Fundamentally you can think of JSON as Python dictionaries, where the keys are always strings and the values can be any of the following (see the sketch after this list):

7.1.1. string

7.1.2. number

7.1.3. object

7.1.3.1. i.e. a nested dictionary { }

7.1.4. array

7.1.4.1. i.e. a list [ ]

7.1.5. null

7.1.6. Boolean
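
A minimal sketch of how these JSON value types map onto Python types via the json module (the document below is invented purely for illustration):

import json
doc = """
{
  "name": "Kraftwerk",
  "formed": 1970,
  "active": true,
  "label": null,
  "members": ["Ralf", "Florian"],
  "origin": {"city": "Duesseldorf", "country": "Germany"}
}
"""
data = json.loads(doc)        # parse the JSON document into a Python dict
print(type(data["members"]))  # <class 'list'> - JSON array
print(type(data["origin"]))   # <class 'dict'> - JSON object (nested dictionary)
print(data["label"])          # None - JSON null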

7.2. Many web APIs give their response payloads for both GET and POST in JSON format

8. Web API with JSON Response

8.1. import requests

8.1.1. the requests module gives us the ability to send HTTP GET requests to a web server and capture the response

8.2. base_url = "<url>"

8.2.1. capture the URL for the GET request in a variable

8.3. response = requests.get(base_url)

8.3.1. invoke get function from imported requests module, passing in the URL via the variable

8.3.2. requests.get()

8.3.2.1. returns

8.3.2.1.1. requests.Response object

8.4. print(response.ok)
print(response.status_code)

8.4.1. check the response status (looking for True and 200)

8.5. response.json()

8.5.1. returns response payload as Python dictionary

8.5.1.1. note: print response.text first to verify that format is JSON

8.5.2. note: when status_code is 400 (Bad Request) you can normally expect to see details of the error returned in the JSON response

8.5.3. note: response is a variable that references an object instance of class Response, which is returned by requests.get()

8.6. import json
results = json.dumps(response.json(), indent=4)
print(results)

8.6.1. use dumps() function from json module to print a pretty string representation of dictionary

8.6.2. json.dumps()

8.7. Adding parameters

8.7.1. after base URL, add "?" followed by <param>=<val>

8.7.2. multiple <val> may be comma separated

8.7.3. to specify more than one parameter, separate <param>=<val> sequences by &

8.7.4. example

8.7.4.1. https://api.exchangeratesapi.io/latest?base=GBP&symbols=USD,EUR,CAD

8.7.5. requests.get() with params

8.7.5.1. a better way to pass parameters in a GET request because it automatically encodes symbols and whitespace that are not legal in a URL

8.7.5.2. params takes a dictionary for its value, where every key is a valid API parameter name and the value is the paired parameter value

8.7.5.3. example:

8.7.5.3.1. import requests
base_url = "https://itunes.apple.com/search"
r = requests.get(base_url, params={"term": "kraftwerk", "country": "GB"})

8.8. Pagination

8.8.1. Search APIs sometimes deliver results in pages - e.g. Google search sends results one page at a time to the client browser

8.8.2. API documentation should describe if pagination is used and how to retrieve particular pages

8.8.3. example

8.8.3.1. import requests
get_url = "https://jobs.github.com/positions.json"
r = requests.get(get_url, params={"page": 1, "description": "sql"})

8.8.3.1.1. this API has a page parameter to specify which page you want

8.8.3.1.2. to loop through multiple pages, we can use a for loop and append results to a list
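
A minimal sketch of such a loop, building on the example above (the endpoint and its page/description parameters are taken from that example; the number of pages requested here is illustrative):

import requests
get_url = "https://jobs.github.com/positions.json"
results = []
for page in range(1, 6):  # first five pages, as an illustration
    r = requests.get(get_url, params={"page": page, "description": "sql"})
    if r.status_code != 200:  # stop on a bad response
        break
    payload = r.json()
    if not payload:  # an empty page means there are no more results
        break
    results.extend(payload)  # append this page's results to the master list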

8.9. APIs with authentication

8.9.1. Requires registration and sometimes a paid subscription

8.9.2. Once registered, you get an application ID and key, which are the equivalent of a username + password but dedicated to API use

8.9.3. HTTP POST method used instead of GET

8.9.3.1. With POST requests, the data is sent in the message body and not as part of the request URL itself (as with GET)

8.9.3.1.1. more secure

8.9.3.1.2. the amount of data that can be sent as part of the request is effectively unlimited, unlike GET, which is limited by URL length restrictions

8.9.4. requests.post()

8.9.4.1. url is first and required parameter

8.9.4.1.1. like requests.get()

8.9.4.2. params is optional parameter

8.9.4.2.1. this is where you will typically put the application ID and key that you get as a registered user of the API

8.9.4.2.2. the value will be a dictionary and the keys/values driven by the API documentation

8.9.4.3. headers is optional parameter

8.9.4.3.1. dictionary argument

8.9.4.4. json is optional parameter

8.9.4.4.1. the value is a Python dictionary, which requests serialises as JSON in the request body

8.9.5. example (see my "Python Web API with POST (Authentication).ipynb" notebook for more info)

8.9.5.1. import requests
import json
app_id = "8548bf9b"
app_key = "df38bd7b9b3a6283ea6b1f5dca7ed85f"
api_endpoint = "https://api.edamam.com/api/nutrition-details"
header = {"Content-Type": "application/json"}
recipe = {"title":"Cappuccino","ingr":["18g ground espresso","150ml milk"]}
r = requests.post(api_endpoint, params={"app_id": app_id, "app_key": app_key}, headers=header, json=recipe)

9. Structure data with pandas

9.1. pandas library has data frame object

9.1.1. data frame is a structured data type, a lot like a table

9.2. import pandas as pd

9.2.1. pd alias is a commonly used convention but not required

9.3. passing a dictionary of shallow dictionaries into a DataFrame constructor is a very quick and easy method for creating a new data frame

9.3.1. a shallow dictionary is one that does not consist of any complex values such as nested lists or nested dictionaries

9.3.2. example

9.3.2.1. searchResDF = pd.DataFrame(r.json()["results"])

9.3.2.2. note that r represents a response object that was returned from a Web API GET request and the json() method is returning a dictionary object from the json formatted response payload. "results" is the key that references the dictionary of dictionaries

9.4. data frame objects can be easily exported to csv format

9.4.1. example

9.4.1.1. searchResDF.to_csv("itunes_search_results.csv")

9.5. data frame objects can be easily exported to Excel format

9.5.1. example

9.5.1.1. searchResDF.to_excel("itunes_search_results.xlsx")

10. HTTP File Downloads

10.1. Any response can be captured and then written to a file, but it can be inefficient to download the content of some request wholly into RAM and then open up a file stream to write that content to file

10.2. Leveraging the stream parameter of the requests.get() function is the key to implementing a smarter, more efficient process for HTTP downloads; combine it with the Python with statement for more elegant code

10.2.1. Python with statement

10.2.1.1. Used commonly with file streams because it guarantees the file is closed even if an exception arises, and removes the need to explicitly call the .close() method on the file stream

10.2.1.2. Also used for other processes with dependencies on things external to the Python environment, including locks, sockets, telnet connections, etc. (the recurring theme being automated assurance of closing down the connection when the work is done, including when the work is interrupted by an exception)

10.2.1.3. syntax:

10.2.1.3.1. with <open connection to create object> as <connection object>:
    <statement(s) involving connection object>

10.2.1.4. example:

10.2.1.4.1. with open('file_path', 'w') as file:
    file.write('hello world !')

10.2.2. when we create a GET request using requests.get() and specify stream = True, this allows us to iterate the response (for the purpose of writing) in chunks, the size of which we can specify in bytes

10.2.2.1. in order to iterate the response content we need to invoke the .iter_content() method

10.2.2.1.1. we pass the chunk_size keyword argument to iter_content()

10.2.2.1.2. example
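
A minimal sketch of a streamed download, assuming a placeholder URL, file name and chunk size (none of these come from the original):

import requests
file_url = "https://example.com/somefile.zip"  # hypothetical download URL
with requests.get(file_url, stream=True) as r:
    r.raise_for_status()
    with open("somefile.zip", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):  # write the response in 8 KB chunks
            f.write(chunk)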

11. HTML

11.1. HyperText Markup Language is code that describes structure and content of web pages, and is used by web browsers to render web pages

11.1.1. In addition to HTML, web pages are often supplemented by CSS and JavaScript

11.2. HTML document is composed of nested elements

11.2.1. every document has a Head and a Body element

11.2.1.1. Head contains metadata describing such things as Title, Language and Style (all of which are nested elements in the Head element)

11.2.1.1.1. No data in Head is used to display content in the browser - it is primarily used by search engines to understand what kind of page they are looking at

11.2.1.2. Body contains all the content that will become visible in the browser

11.2.1.2.1. Includes elements such as Link, Image, Paragraph, Table

11.2.2. To be valid HTML, every element (other than the root html element) must be wholly nested inside another element

11.2.3. The syntax for every element is:

11.2.3.1. <tag_name>content</tag_name>

11.2.3.1.1. tag names identify the element and they can be any of the predefined names for HTML, or in some cases they can be custom

11.2.3.1.2. content can be text, other elements, or a combination of both

11.2.3.1.3. note: some elements are content-less and consist solely of attributes - these are written as a single tag with no closing tag, e.g. <img src="..." alt="...">

11.2.4. First tag in an HTML document should be: <!DOCTYPE html>

11.2.4.1. Below the opening <!DOCTYPE html> tag comes the root <html>...</html> element and all content is placed between the opening and closing html tags

11.2.5. Tag attributes are specified as name="value" pairs and they are always placed inside of opening tags

11.2.5.1. different elements support different tag attributes

11.2.5.2. tag attributes are separated from the tag name by a space, and multiple attributes (name="value") can be specified by using a space separator

11.2.5.3. The element for a link is <a>..</a>, which stems from an old term "anchor"

11.2.5.3.1. The content of the <a> link element is text that serves as a hot-link label

11.2.5.3.2. to re-direct the browser (on click) to an external URL, we add the href attribute

11.2.5.3.3. to make the browser open the external link in a new browser tab, we use the target attribute with the "_blank" value

11.2.5.3.4. example: <a href="https://www.example.com" target="_blank">Visit Example</a>

11.3. Most important tag attributes we need to understand for web scraping are Class and ID

11.3.1. HTML class attribute

11.3.1.1. used to group elements that are of the same category so that all of them can be manipulated at once

11.3.1.2. example

11.3.1.2.1. class="menu"

11.3.1.3. elements can have more than one class as they can belong to more than one category

11.3.1.3.1. multiple class assignments are specified inside a single class="value" attribute, where multiple values are separated by a space

11.3.1.4. Web developers often use very descriptive names for class attribute values because this helps improve their search engine rankings

11.3.2. HTML id attribute

11.3.2.1. value must be unique across the HTML document (web page)

11.3.2.2. every element can only have one value for its id attribute

11.4. Popular tags

11.4.1. HTML head tag

11.4.1.1. HTML title tag

11.4.1.1.1. mandatory for all HTML docs

11.4.1.1.2. can only be one such element

11.4.1.1.3. used by search engines like Google to categorise

11.4.1.2. HTML meta tag

11.4.1.2.1. content-less tags featuring attributes only

11.4.1.2.2. charset tag attribute used to specify the character encoding - e.g. UTF-8

11.4.1.2.3. name and content attributes pair together for various metadata, including author, description, keywords, etc.

11.4.1.3. HTML style tag

11.4.1.3.1. used with CSS content to define style of HTML document

11.4.1.3.2. works in combination with class and id attributes in the body elements

11.4.1.4. HTML script tag

11.4.1.4.1. used with Javascript content

11.4.2. HTML body tag

11.4.2.1. HTML div tag

11.4.2.1.1. defines division or section in HTML doc

11.4.2.1.2. just a container for other elements, a way to group elements

11.4.2.1.3. used almost exclusively with class and id attributes

11.4.2.2. HTML span tag

11.4.2.2.1. embedded within content, typically to apply some styling to part of the content whilst keeping the content together

11.4.2.2.2. unlike div tags, span tag content never starts on a new line

11.4.2.3. HTML iframe tag

11.4.2.3.1. used to embed another HTML document

11.4.2.3.2. src attribute specifies link to embedded document, often a URL external to site

11.4.2.4. HTML img tag

11.4.2.4.1. specifies image to display

11.4.2.4.2. src and alt attributes both required

11.5. HTML Lists

11.5.1. ordered lists

11.5.1.1. HTML ol tag

11.5.1.2. numbered by default, but alternatives can be specified

11.5.2. unordered lists

11.5.2.1. HTML ul tag

11.5.2.2. bullet point list

11.5.3. HTML li tag

11.5.3.1. list item, carries content for both ordered and unordered lists

11.6. HTML table tag

11.6.1. defines HTML table

11.6.2. consists of nested table row elements

11.6.2.1. HTML tr tag

11.6.2.1.1. consists of nested table data or table header elements

11.6.3. example

11.6.3.1. <table>
  <tr>
    <th>Month</th>
    <th>Savings</th>
  </tr>
  <tr>
    <td>January</td>
    <td>$100</td>
  </tr>
</table>

11.7. Handling reserved characters or symbols not on our keyboard

11.7.1. Reserved symbols include < and >

11.7.2. 3 methods

11.7.2.1. specify name of symbol with & prefix and ; suffix

11.7.2.1.1. e.g. &lt; for < and &gt; for >

11.7.2.1.2. note: not every symbol can be represented by a name and not every browser will recognise it

11.7.2.2. specify the decimal code of the Unicode codepoint for the symbol, prefixed with &# and suffixed with ;

11.7.2.2.1. e.g. &#60; for <

11.7.2.3. specify the hex code of the Unicode codepoint for the symbol, again prefixed with &# and suffixed with ;, but with an x in between the # and the hex number

11.7.2.3.1. e.g. &#x3C; for <

11.8. Watch out for the non breaking space!

11.8.1. the non breaking space looks like a regular space on the screen but it has a different Unicode codepoint value of 160 (vs 32 for a regular space)

11.8.1.1. in hex, the nbsp is A0

11.8.2. referred to as the nbsp character, written in HTML as &nbsp;

11.8.3. an nbsp is used in HTML to ensure that two words are kept together and not allowed to break apart over two lines

11.9. XHTML

11.9.1. HTML rules are specified as guidelines, which means that poorly written HTML code that ignores certain rules is allowed

11.9.1.1. web browsers automatically handle things like opening tags with no closing tags, attribute values specified without double quotes, etc.

11.9.1.1.1. XHTML is a strict standard that insists on valid HTML

11.9.1.1.2. there are websites out there written in XHTML but it never took off in a big way, which means most websites are based on HTML

12. CSS

12.1. Cascading Style Sheets

12.2. language used to describe presentation and style of HTML documents

12.3. 3 ways that style can be applied to HTML element

12.3.1. inline

12.3.1.1. style attribute

12.3.1.1.1. example: <p style="color:blue; text-align:center">Some styled text</p>

12.3.1.1.2. note the syntax for the style attribute value: CSS property:value pairs separated by semicolons

12.3.2. internal

12.3.2.1. style element embedded inside the head element

12.3.2.1.1. style element content is composed of CSS selectors followed by CSS properties and values wrapped in curly braces { }

12.3.2.1.2. example: <style> table, th, td { border: 1px solid black; } </style>

12.3.2.1.3. note that you can specify multiple CSS selectors (e.g. table, th, td) with a common set of CSS properties, separating each by comma

12.3.3. external

12.3.3.1. separate file that uses same syntax as the internal style element

12.3.3.2. browser downloads the CSS file and applies styles to every page in the site based on that file

12.3.3.2.1. this approach allows the entire look and feel of a website to be changed by altering this single CSS file

12.3.3.2.2. it's also faster for browsers to apply styling this way

12.4. CSS Ref for Properties

13. Beautiful Soup

13.1. Python package for parsing HTML and XML documents - ideal for web scraping

13.2. Web scraping workflow

13.2.1. 1. Inspect the page

13.2.1.1. use browser developer tool to inspect, and get a feel for the page structure

13.2.1.1.1. be aware that the browser's Inspect view shows the DOM after JavaScript has run, so it can differ from the raw HTML returned by the server

13.2.2. 2. Obtain HTML

13.2.2.1. requests.get()

13.2.3. 3. Choose Parser

13.2.3.1. Parsing is the process of decomposing the HTML page and reconstructing it into a parse tree (think element hierarchy)

13.2.3.2. Beautiful Soup does not have its own parser, and currently supports 3 external parsers

13.2.3.2.1. html.parser

13.2.3.2.2. lxml

13.2.3.2.3. html5lib

13.2.4. 4. Create a Beautiful Soup object

13.2.4.1. Input parameters for the BeautifulSoup constructor are the HTML document and the name of the chosen parser, which produces the parse tree

13.2.5. 5. Export the HTML to a file (optional)

13.2.5.1. Recommended because different parsers can produce different parse trees for the same source HTML document, and it's useful to store the parsed HTML for reference

13.3. Basics of the web scraping workflow in Python

13.3.1. import requests
from bs4 import BeautifulSoup

13.3.2. get the HTML using requests.get()

13.3.2.1. e.g.

13.3.2.1.1. url = "https://en.wikipedia.org/wiki/Music"
r = requests.get(url)

13.3.3. Peek at the content to verify that response looks like an HTML document

13.3.3.1. html = r.content
html[:100]

13.3.4. Make the "soup" by invoking the BeautifulSoup constructor, passing in the html response as 1st arg and the HTML parser name as 2nd arg

13.3.4.1. e.g.

13.3.4.1.1. soup = BeautifulSoup(html, "html.parser")

13.3.5. Write the parsed HTML to file, which involves opening a binary file stream in write mode and passing the soup object, with its prettify() method invoked, into the file's write() method; prettify() produces a nicely formatted representation of the HTML for writing to the file

13.3.5.1. e.g.(noting that "soup" is instance of BeautifulSoup)

13.3.5.1.1. with open("Wiki_response.html", "wb") as file:
    file.write(soup.prettify("utf-8"))

13.3.6. Use the BeautifulSoup find() method to find first instance of a given element, where the tag name is passed as string argument

13.3.6.1. e.g. (noting that "soup" is instance of BeautifulSoup)

13.3.6.1.1. soup.find('head')

13.3.6.2. the result of find() is bs4.element.Tag, which is an object you can also invoke the find_all() method on

13.3.6.2.1. but if no such element is found then result is None

13.3.6.2.2. example of finding a tbody (table body) tag and then invoking find_all() to get all td (table data) tags contained within it
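
A minimal sketch of that pattern, assuming "soup" was built from a page that contains a table:

tbody = soup.find("tbody")
if tbody is not None:  # find() returns None when there is no match
    cells = tbody.find_all("td")
    print(len(cells), "table data cells found")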

13.3.7. Use the BeautifulSoup find_all() method to find all instances of a given element, where the tag name is passed as string argument

13.3.7.1. e.g. (noting that "soup" is instance of BeautifulSoup)

13.3.7.1.1. links = soup.find_all('a')

13.3.7.2. the result of find_all() is bs4.element.ResultSet, which is a subclass of list

13.3.7.2.1. if no elements are found, the result is still bs4.element.ResultSet but akin to an empty list

13.3.8. Every element in the parse tree can have multiple children but only one parent

13.3.8.1. navigate to children by invoking the contents property of a soup element object

13.3.8.1.1. e.g. (noting "table" is instance of bs4.element.Tag)

13.3.8.2. navigate to parent by invoking the parent property of soup element object

13.3.8.2.1. e.g. (noting "table" is instance of bs4.element.Tag)

13.3.8.2.2. for navigating up multiple levels, use dot notation
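
A minimal sketch of navigating the tree, assuming "soup" contains a table element:

table = soup.find("table")
children = table.contents          # list of the table's direct children
parent = table.parent              # the element that contains the table
grandparent = table.parent.parent  # dot notation to climb multiple levels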

13.4. Searching by Attribute

13.4.1. both find() and find_all() methods support attribute searches in same way

13.4.2. HTML standard attributes can be specified as additional arguments followed by equals = and the value enclosed in quotes " "

13.4.2.1. e.g. (noting that "soup" is instance of BeautifulSoup)

13.4.2.1.1. soup.find("div", id = "siteSub")

13.4.2.2. note that user-defined attributes cannot be searched in this manner because the find() and find_all() methods will not recognise them as a keyword argument

13.4.2.2.1. this limitation can be overcome by using the attrs argument for find() and find_all()
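
A minimal sketch of the attrs argument (the attribute name "data-category" and its value are invented for illustration):

soup.find_all("div", attrs={"data-category": "news"})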

13.4.2.3. because class is a Python reserved keyword, it will raise exception if you try to pass it as argument to find() or find_all()

13.4.2.3.1. fix is to append an underscore to class

13.4.3. we can also search based on multiple attribute values - just pass them as 3rd, 4th, etc. arguments in find() or find_all()

13.4.3.1. e.g. (noting that "soup" is instance of BeautifulSoup)

13.4.3.1.1. soup.find("a",class_ = "mw-jump-link",href = "#p-search")

13.5. Extracting attribute data

13.5.1. we can extract attribute data from a soup tag object (bs4.element.Tag) using two approaches

13.5.1.1. 1st approach is to reference the attribute name as a dictionary key on the tag object

13.5.1.1.1. e.g. (noting that "a" is an instance of bs4.element.Tag)

13.5.1.1.2. if attribute does not exist, this approach causes exception

13.5.1.2. 2nd approach is to invoke the get() method on the tag object

13.5.1.2.1. e.g. (noting that "a" is an instance of bs4.element.Tag)

13.5.1.2.2. behaves same as approach 1 but for non existent attributes, it returns None and does not raise exception
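
A minimal sketch of both approaches, assuming "a" is a bs4.element.Tag for an anchor element:

a = soup.find("a")
href = a["href"]      # approach 1: dictionary-style access (raises KeyError if the attribute is missing)
href = a.get("href")  # approach 2: get() returns None if the attribute is missing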

13.5.2. to get a dictionary containing all attributes and assigned values for a soup tag object, just use the attrs property

13.5.2.1. e.g. (noting that "a" is an instance of bs4.element.Tag)

13.5.2.1.1. a.attrs

13.6. Extracting tag string content data

13.6.1. The text and string properties on the soup tag object both have the same effect on tags with a single string for content, but behave differently when a tag includes nested elements

13.6.1.1. text property strips away all nested tags to provide all text content as a single string

13.6.1.1.1. however, the text property on the top-level soup object will also return the contents of script tags as text, because it only strips HTML tags

13.6.1.2. string property returns None if content of tag does not consist of a single string (unbroken by other tags)
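
A minimal sketch contrasting the two properties, using an invented HTML fragment:

from bs4 import BeautifulSoup
frag = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = frag.find("p")
print(p.text)    # "Hello world" - nested tags stripped away
print(p.string)  # None - the content is broken up by the <b> tag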

13.6.2. strings is a generator available for the tag object and it enables us to iterate over every string fragment in a for loop, processing string by string

13.6.2.1. e.g. (noting that "p" is an instance of bs4.element.Tag)

13.6.2.1.1. for s in p.strings:
    print(repr(s))

13.6.3. stripped_strings is another generator available for the tag object, behaving like strings but eliminating all leading/trailing whitespace, including newline characters

13.6.3.1. e.g. (noting that "p" is an instance of bs4.element.Tag)

13.6.3.1.1. for s in p.stripped_strings:
    print(repr(s))

13.7. Scraping links

13.7.1. you can capture a list of all links from the Beautiful Soup object using the find_all() method and then you can pull out the URL via the href attribute

13.7.2. it is common to encounter relative URLs, which are just folder/file references relative to page base URL

13.7.2.1. we can use the urljoin() function from the parse module of the urllib package to combine the page base URL with the relative URL in order to form the absolute URL

13.7.2.1.1. e.g. (noting that "l" is an instance of bs4.element.Tag associated with an "a" tag, and url is a string that holds the base URL for the page being scraped)

13.7.2.1.2. Python urllib.parse.urljoin

13.7.2.1.3. to process multiple links, we can use list comprehension
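
A minimal sketch combining find_all() with urljoin() in a list comprehension, assuming "soup" is the BeautifulSoup object and "url" holds the base URL of the scraped page:

from urllib.parse import urljoin
links = soup.find_all("a")
absolute_urls = [urljoin(url, l.get("href")) for l in links if l.get("href")]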

13.8. Scraping nested elements

13.8.1. sometimes you need to perform nested searches

13.8.1.1. for example, you might identify sections of a page that are commonly identifiable via a div tag with role attribute set to "note", and you want to scrape every link from within these particular div tags

13.8.1.1.1. method 1: nested for loop with list append() method

13.8.1.1.2. method 2: for loop with list extend() method
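
A minimal sketch of both methods, assuming the div/role="note" structure described above and a "soup" object for the page:

note_divs = soup.find_all("div", attrs={"role": "note"})
# method 1: nested for loop with append()
note_links = []
for div in note_divs:
    for a in div.find_all("a"):
        note_links.append(a.get("href"))
# method 2: for loop with extend()
note_links = []
for div in note_divs:
    note_links.extend([a.get("href") for a in div.find_all("a")])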

13.9. Scraping multiple pages automatically

13.9.1. this builds from scraping links from a single page - we can press on from a scraped list of links (in this example, captured in a variable named url_list)

13.9.1.1. we start by iterating our list of scraped URLs and using our core techniques to scrape all <p> tag text from each page: 1. Get request 2. Capture content of response (html) 3. Create BeautifulSoup object 4. Capture list of <p> content strings 5. Append <p> string list to master list

13.9.1.1.1. para_list = []
i = 0
for l in url_list:
    para_resp = requests.get(l)
    i += 1
    if para_resp.status_code == 200:
        print(i, ": good response :", l)
    else:
        print(i, ":", para_resp.status_code, " response (skipped):", l)
        continue
    para_html = para_resp.content
    para_soup = BeautifulSoup(para_html, "lxml")
    paras = [p.text for p in para_soup.find_all("p")]
    para_list.append(paras)

13.10. Using pandas to structure results and capture them to file

13.10.1. Having scraped related data into various list objects in Python, it's really easy to add these lists as columns to a Pandas dataframe object

13.10.1.1. e.g. (noting that titles, years_cleaned, scores_cleaned, critics_consensus, synopsis, directors and cast are all Python variables referencing list objects related to scraped data from the Rotten Tomatoes site)

13.10.1.1.1. import pandas as pd
movie_list = pd.DataFrame()  # Create empty dataframe
movie_list["Title"] = titles
movie_list["Year"] = years_cleaned
movie_list["Score"] = scores_cleaned
movie_list["Critic's Consensus"] = critics_consensus
movie_list["Synopsis"] = synopsis
movie_list["Director"] = directors
movie_list["Cast"] = cast

13.11. Handling None ('NoneType' AttributeError) when scraping

13.11.1. When scraping a list of elements using a list comprehension, it is quite common for the code to fail on the following error:

13.11.1.1. AttributeError: 'NoneType' object has no attribute <name_of_attribute>

13.11.2. Common example happens when scraping a list of elements (tags) for their string content by invoking the string property

13.11.2.1. when the comprehension processes an element where the nested find() returns None, invoking .string on that None raises the exception

13.11.2.2. duration_list = [t.find("span",{"class":"accessible-description"}).string for t in related_vids]

13.11.3. We can handle this by implementing conditional logic in the main expression of the list comprehension

13.11.3.1. syntax is:

13.11.3.1.1. [ <val1> if <boolean_expression> else <val2> for x in y ]

13.11.3.2. e.g.

13.11.3.2.1. duration_list = [None if t.find("span",{"class":"accessible-description"}) == None else t.find("span",{"class":"accessible-description"}).string for t in related_vids]

14. Browser Developer Tools

14.1. Useful for web scraping because you can inspect the underlying HTML of any element on the page

14.2. Chrome

14.2.1. right-click any part of web page and choose Inspect

14.2.1.1. under Elements pane, right-click and choose Copy | Copy Element

14.2.1.1.1. paste into Notepad++

14.3. Edge

14.3.1. to access Developer Tools, click ellipses ... in top right corner and choose More Tools | Developer Tools

15. Using Pandas to Scrape HTML Tables

15.1. We can scrape html tables using Beautiful Soup but it has to be done column by column and can be a bit of a tedious process

15.1.1. Pandas provides the read_html() function, which takes an html document as its argument and returns a list of all table elements converted into dataframes

15.1.1.1. Note: in the background, the pandas.read_html() function leverages BeautifulSoup but it provides a much faster way to capture table content

15.1.1.2. e.g. noting that html is a variable captured from the content property of a request response object

15.1.1.2.1. import pandas as pd
tables = pd.read_html(html)
tables[1]  # returns 2nd dataframe

16. Common Roadblocks for Web Scraping

16.1. Request headers

16.1.1. sent as part of the request and contain metadata about the request

16.1.2. content of header can vary and may include information such as application type, operating system, software vendor, software version, etc.

16.1.3. much of this header content is combined into the user agent string

16.1.3.1. think of user agent string as an ID card for the application making the request

16.1.3.2. all web browsers have their own unique user agent string

16.1.3.3. well-known bots like the Google web crawler also have their own unique user agent strings

16.1.4. many servers send different responses based on user agent string

16.1.4.1. when user agent string is missing or cannot be interpreted many sites will return a default response

16.1.4.2. this can lead to differences between html we can inspect via browser developer tools and actual html captured via our web scraping response

16.1.4.2.1. fix is to always write our html response to file and use this as our reference for scraping

16.1.5. some sites block all anonymous requests (i.e. requests that do not include a recognised user agent string)

16.1.5.1. fix is to use user agent string of one of the main web browser applications, as these are publicly available

16.1.5.1.1. Chrome user agent

16.1.5.1.2. requests supports headers parameter with value passed as dictionary

16.1.5.1.3. e.g.
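
A minimal sketch of sending a browser-style user agent string (the exact string shown is illustrative and varies by browser version):

import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"}
r = requests.get("https://en.wikipedia.org/wiki/Music", headers=headers)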

16.2. Cookies

16.2.1. small piece of data that a server sends to the user's web browser

16.2.2. browser may store it and send it back with later requests to the same server

16.2.3. Typically, it's used to tell if two requests came from the same browser

16.2.3.1. e.g. keeping a user logged-in

16.2.4. Cookies are mainly used for three purposes:

16.2.4.1. Session management

16.2.4.1.1. Logins, shopping carts, game scores, or anything else the server should remember

16.2.4.2. Personalization

16.2.4.2.1. User preferences, themes, and other settings

16.2.4.3. Tracking

16.2.4.3.1. Recording and analyzing user behaviour

16.2.5. Sites that require registration and login in order to access site pages will refuse GET requests with a 403 Forbidden response

16.2.5.1. fix is to create a stateful session that allows the session cookies to be received and used by our Python program, and to use an appropriate post request to get the session cookie

16.2.5.1.1. the requests module has a Session class, which we can use to create session objects; subsequent post/get requests are then invoked via the session object

16.2.5.2. Sites that require login often redirect to a login page that includes a form tag

16.2.5.2.1. the action attribute of the form tag holds the (often relative) URL that the login form submits to

16.2.5.2.2. form tag includes a number of input tags

16.2.5.2.3. use Chrome Developer tools to trace login request via the Network tab
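
A minimal sketch of a stateful login session; the login URL and the form field names ("username", "password") are invented placeholders that must be taken from the site's actual form tag:

import requests
with requests.Session() as s:
    login_url = "https://example.com/login"  # hypothetical login form action URL
    s.post(login_url, data={"username": "me", "password": "secret"})  # obtain the session cookie
    r = s.get("https://example.com/members-only")  # the session cookie is sent automatically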

16.3. Denial of Service

16.3.1. When making multiple requests of a web server, we must be mindful of the risk of the server being overwhelmed by too many requests

16.3.1.1. many websites have protection against denial of service attacks and will refuse multiple requests from a single client that are made too rapidly

16.3.1.1.1. fix is to import the time module and use the sleep() function to create wait duration in between multiple requests
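
A minimal sketch of throttling requests with time.sleep(); the 2-second pause is an arbitrary illustrative value, and url_list is assumed to be a list of scraped URLs as in the earlier examples:

import time
import requests
for link in url_list:
    r = requests.get(link)
    # ... process the response ...
    time.sleep(2)  # pause between requests so the server is not overloaded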

16.4. Captchas

16.4.1. Sites protected by captchas are deliberately hard for bots to scrape, so you should avoid attempting to do so

16.5. Dynamically generated content with Javascript

16.5.1. For this problem we will turn to the requests-html package, to be used in place of requests + BeautifulSoup

17. requests-html package

17.1. created by the author of the requests library to combine requests + BeautifulSoup functionality

17.2. Full JavaScript support

17.3. Get page and parse html

17.3.1. from requests_html import HTMLSession

17.3.2. session = HTMLSession()

17.3.3. r = session.get("url_goes_here")

17.3.4. r.html

17.3.4.1. The HTMLSession.get() method automatically parses the html response and encapsulates it in the html property of the response object

17.3.4.2. the html property of the response becomes the basis for the scraping operations

17.4. Scrape links

17.4.1. relative links

17.4.1.1. urls = r.html.links

17.4.2. absolute links

17.4.2.1. full_path_urls = r.html.absolute_links

17.4.3. both links and absolute_links return a set rather than a list

17.5. Element search

17.5.1. html.find() method returns a list by default, so it behaves like the find_all() method in BeautifulSoup

17.5.1.1. r.html.find("a")

17.5.1.2. if we use the first parameter, we can make it return a single element, not a list

17.5.1.2.1. r.html.find("a", first=True)

17.5.2. the individual elements of the list returned by html.find() are typed as requests_html.Element

17.5.3. We can get a dictionary of an element's attributes by referencing the element's attrs property

17.5.3.1. element = r.html.find("a")[0]
element.attrs

17.5.4. We can get the html string representation of an element using the requests_html.Element.html property

17.5.4.1. element = r.html.find("a")[0]
element.html

17.5.5. We can get an element's string content using the requests_html.Element.text property

17.5.5.1. element = r.html.find("a")[0]
element.text

17.5.6. We can filter element search by using containing parameter of html.find() method

17.5.6.1. r.html.find("a", containing="wikipedia")

17.5.6.2. note that the search is made on the text of the element and is not case sensitive

17.6. Text pattern search

17.6.1. the html.search() method searches the raw html and returns the first result that matches the search() argument

17.6.1.1. We can find all text that falls in between two strings by passing argument as "string1{}string2"

17.6.1.1.1. the result will be all the html found in between an occurrence of string1 and string2, where the curly braces {} represent the result to be found and returned

17.6.1.2. e.g. noting that r represents a response object

17.6.1.2.1. r.html.search("known{}soccer")

17.6.1.2.2. r.html.search("known {} soccer")[0]

17.6.2. html.search_all() works the same as search() but returns all results that match

17.6.2.1. e.g. noting that r represents a response object

17.6.2.1.1. r.html.search_all("known{}soccer")

17.7. CSS Selectors

17.7.1. used to "find" (or select) the HTML elements you want to style

17.7.2. CSS Selector Reference

17.7.3. we need to understand CSS Selectors and the notation used because this is the notation used when we pass arguments to the html.find() method

17.7.3.1. element selector

17.7.3.1.1. html.find("element")

17.7.3.2. #id selector

17.7.3.2.1. html.find("#id")

17.7.3.3. .class_name selector

17.7.3.3.1. html.find(".class_name")

17.7.3.4. general attribute selectors (there are multiple forms - below are 3 of the most common)

17.7.3.4.1. [attribute] selector

17.7.3.4.2. [attribute=value] selector

17.7.3.4.3. [attribute*=value] selector

17.7.3.5. combining selectors is done by concatenation with spaces

17.7.3.5.1. remember that when tag names are included in a combined selector, they must come first

17.7.3.5.2. e.g. r.html.find("a[href*=wikipedia]") returns list of <a> tag elements with href attributes that include "wikipedia" substring

17.7.3.5.3. e.g. r.html.find("a.internal") returns list of <a> tag elements with class="internal"

17.7.3.6. specifying context in tag hierarchy for element search

17.7.3.6.1. parent_element child_element selector

17.7.3.6.2. parent_element > child_element selector
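
A minimal sketch of both hierarchy selectors (the tag and class names are illustrative):

r.html.find("div.toc a")  # all <a> elements anywhere inside a div with class "toc"
r.html.find("ul > li")    # only <li> elements that are direct children of a <ul>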

17.8. Scraping pages with Javascript content

17.8.1. from requests_html import AsyncHTMLSession
session = AsyncHTMLSession()

17.8.1.1. When scraping pages that use JavaScript to render dynamic content into a regular html document, one difference is that we need to use an asynchronous session, which requires the await keyword before requests

17.8.2. site_url = "https://angular.io/"
r = await session.get(site_url)
r.status_code

17.8.2.1. with an asynchronous session, we prefix the get request with the await keyword

17.8.3. await r.html.arender()

17.8.3.1. we use the asynchronous version of the render() method, arender(), combined with the await keyword

17.8.3.2. this uses the Chromium browser to render the page content and converts this into regular html that we can scrape

17.8.3.3. the first time this is run on a host, it will attempt to automatically download and install Chromium

17.8.4. session.close()

17.8.4.1. once the session is closed, we can proceed with scraping the html object in the normal way

17.8.5. tips for timeout errors

17.8.5.1. try render() method wait parameter

17.8.5.2. try render() method retries parameter

17.8.5.3. print(r.html.render.__doc__)
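
A minimal sketch of using those parameters with the asynchronous session above (the wait and retries values are illustrative):

await r.html.arender(wait=3, retries=5)  # wait 3 seconds before rendering, retry up to 5 times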