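You need to extract the HTML tables from a web page.
The internet, and the World Wide Web (WWW) in particular, is the most important source of information today. There is so much of it that picking out the relevant content is hard. Most of this information can be retrieved over HTTP.
These retrievals can also be performed programmatically, so that information is fetched and processed automatically.
Python's standard library includes an HTTP client for this purpose, but the requests module makes fetching web pages much easier.
In this article, we will see how to parse an HTML page and extract the HTML tables embedded in it.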
1. We will use the requests, pandas, beautifulsoup4 and tabulate packages. If any of them is missing, install it on your system. If you are not sure, check with pip freeze.
    import requests
    import pandas as pd
    from tabulate import tabulate
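As an alternative to reading through the pip freeze output, the check can be scripted. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED list are illustrative and not part of the original article.

```python
from importlib import metadata

# Distribution names as published on PyPI (an assumption of this sketch).
# Note that beautifulsoup4 is imported in code as bs4.
REQUIRED = ["requests", "pandas", "beautifulsoup4", "tabulate"]


def missing_packages(names):
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```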
2. We will fetch the page at https://www.nhooo.com/python/python_basic_operators.htm and print out all the HTML tables embedded in it.
    # Set the site URL
    site_url = "https://www.nhooo.com/python/python_basic_operators.htm"
3. We will send a request to the server and look at the response.
    # Send the request to the server
    response = requests.get(site_url)
    # Check the response
    print(f"*** The response for {site_url} is {response.status_code}")
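4. A response code of 200 means the server returned a successful response. So we will now inspect the request headers, the response headers, and the first 100 characters of the text returned by the server.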
    # Inspect the request headers
    print(f"*** Printing the request headers - \n {response.request.headers} ")
    # Inspect the response headers
    print(f"*** Printing the response headers - \n {response.headers} ")
    # Inspect the content of the result
    print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

Output

    *** Printing the request headers -
     {'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
    *** Printing the response headers -
     {'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '213246', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 20 Oct 2020 09:45:18 GMT', 'Expires': 'Thu, 19 Nov 2020 09:45:18 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
    *** Accessing the first 100/37624 characters -

     <!DOCTYPE html>
    <html lang="en-US">
    <head>
    <title>Python - Basic Operators - Nhooo</title>
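The status check in steps 3 and 4 is done by eye. A small sketch of how a status code could be turned into a readable description and a success flag is shown below, using only the standard library. The helper names describe_status and is_success are hypothetical and do not come from the article or from requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK' for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes.
        phrase = "Unknown"
    return f"{code} {phrase}"


def is_success(code):
    """True for 2xx codes, which indicate a successful response."""
    return 200 <= code < 300


print(describe_status(200), is_success(200))
print(describe_status(404), is_success(404))
```

With requests itself, calling response.raise_for_status() raises an exception for 4xx and 5xx responses, which is often enough when no custom message is needed.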
5. We will now use BeautifulSoup to parse the HTML.
    # Parse the HTML page
    from bs4 import BeautifulSoup
    tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
    print(f"*** The title of the page is - {tutorialpoints_page.title}")
    # You can also extract the page title as a string
    print(f"*** The title of the page is - {tutorialpoints_page.title.string}")
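The difference between title and title.string can be seen without a network request. The sketch below parses a small inline HTML string (made up for illustration) instead of the downloaded page.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document; the real page is much larger.
html = "<html><head><title>Python - Basic Operators</title></head><body></body></html>"
page = BeautifulSoup(html, "html.parser")

title_tag = page.title          # a Tag object, printed together with its markup
title_text = page.title.string  # only the text inside the tag

print(title_tag)
print(title_text)
```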
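6. Most tables have a title defined in an h2, h3, h4, h5 or h6 tag. We first locate these tags and then pick up the HTML tables that follow the located tag. For this logic we will use find, sibling and find_next_siblings as shown below.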
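    from io import StringIO

    # Find all the h2 elements
    print(f"{tutorialpoints_page.find_all('h2')}")
    # Locate the first heading tag (h2 to h6)
    tags = tutorialpoints_page.find(lambda elm: elm.name == "h2" or elm.name == "h3" or elm.name == "h4" or elm.name == "h5" or elm.name == "h6")
    # Walk the siblings that follow the heading and print every table found
    for sibling in tags.find_next_siblings():
        if sibling.name == "table":
            my_table = sibling
            # Wrapping the markup in StringIO avoids the deprecation warning
            # that newer pandas versions raise for literal HTML strings
            df = pd.read_html(StringIO(str(my_table)))
            print(tabulate(df[0], headers='keys', tablefmt='psql'))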
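The snippet above starts from the first heading only and prints every table that follows it. A variation that pairs each heading with the first table in its own section could look like the sketch below. The sample HTML, the HEADING_TAGS list and the helper tables_by_heading are all made up for illustration, and the rows are collected as plain lists rather than DataFrames to keep the example free of pandas.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h2", "h3", "h4", "h5", "h6"]

# Illustrative markup, not taken from the real page.
sample_html = """
<h2>Arithmetic</h2>
<p>Intro text.</p>
<table>
  <tr><th>Operator</th><th>Description</th></tr>
  <tr><td>+</td><td>Addition</td></tr>
</table>
<h2>No table here</h2>
<p>Only prose.</p>
<h3>Comparison</h3>
<table>
  <tr><th>Operator</th><th>Description</th></tr>
  <tr><td>==</td><td>Equal</td></tr>
</table>
"""


def tables_by_heading(soup):
    """Map heading text to the rows of the first table before the next heading."""
    result = {}
    for heading in soup.find_all(HEADING_TAGS):
        for sibling in heading.find_next_siblings():
            if sibling.name in HEADING_TAGS:
                break  # reached the next section without finding a table
            if sibling.name == "table":
                rows = [
                    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                    for row in sibling.find_all("tr")
                ]
                result[heading.get_text(strip=True)] = rows
                break
    return result


soup = BeautifulSoup(sample_html, "html.parser")
for title, rows in tables_by_heading(soup).items():
    print(title, rows)
```

Stopping at the next heading keeps a table from being attributed to an earlier section that has no table of its own.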
7. Now let us put it all together.
    # STEP 1: Download the required page
    from io import StringIO

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Set the site URL
    site_url = "https://www.nhooo.com/python/python_basic_operators.htm"
    # Send the request to the server
    response = requests.get(site_url)
    # Check the response
    print(f"*** The response for {site_url} is {response.status_code}")
    # Inspect the request headers
    print(f"*** Printing the request headers - \n {response.request.headers} ")
    # Inspect the response headers
    print(f"*** Printing the response headers - \n {response.headers} ")
    # Inspect the content of the result
    print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")
    # Parse the HTML page
    tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
    print(f"*** The title of the page is - {tutorialpoints_page.title}")
    # You can also extract the page title as a string
    print(f"*** The title of the page is - {tutorialpoints_page.title.string}")
    # Find all the h2 elements
    print(f"{tutorialpoints_page.find_all('h2')}")
    # Locate the first heading tag (h2 to h6)
    tags = tutorialpoints_page.find(lambda elm: elm.name == "h2" or elm.name == "h3" or elm.name == "h4" or elm.name == "h5" or elm.name == "h6")
    # Walk the siblings that follow the heading and print every table found
    for sibling in tags.find_next_siblings():
        if sibling.name == "table":
            my_table = sibling
            # StringIO avoids the literal-HTML deprecation warning in newer pandas
            df = pd.read_html(StringIO(str(my_table)))
            print(df)

Output

    *** The response for https://www.nhooo.com/python/python_basic_operators.htm is 200
    *** Printing the request headers -
     {'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
    *** Printing the response headers -
     {'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '558841', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sat, 24 Oct 2020 09:45:13 GMT', 'Expires': 'Mon, 23 Nov 2020 09:45:13 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
    *** Accessing the first 100/37624 characters -

     <!DOCTYPE html>
    <html lang="en-US">
    <head>
    <title>Python - Basic Operators - Nhooo</title>
    *** The title of the page is - <title>Python - Basic Operators - Nhooo</title>
    *** The title of the page is - Python - Basic Operators - Nhooo
    [<h2>Types of Operator</h2>, <h2>Python Arithmetic Operators</h2>, <h2>Python Comparison Operators</h2>, <h2>Python Assignment Operators</h2>, <h2>Python Bitwise Operators</h2>, <h2>Python Logical Operators</h2>, <h2>Python Membership Operators</h2>, <h2>Python Identity Operators</h2>, <h2>Python Operators Precedence</h2>]
    [  Operator          Description  \
    0  + Addition        Adds values on either side of the operator.
    1  - Subtraction     Subtracts right hand operand from left hand op...
    2  * Multiplication  Multiplies values on either side of the operator
    3  / Division        Divides left hand operand by right hand operand
    4  % Modulus         Divides left hand operand by right hand operan...
    5  ** Exponent       Performs exponential (power) calculation on op...
    6  //                Floor Division - The division of operands wher...

       Example
    0  a + b = 30
    1  a – b = -10
    2  a * b = 200
    3  b / a = 2
    4  b % a = 0
    5  a**b =10 to the power 20
    6  9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11....  ]
    [  Operator  Description  \
    0  ==        If the values of two operands are equal, then ...
    1  !=        If values of two operands are not equal, then ...
    2  <>        If values of two operands are not equal, then ...
    3  >         If the value of left operand is greater than t...
    4  <         If the value of left operand is less than the ...
    5  >=        If the value of left operand is greater than o...
    6  <=        If the value of left operand is less than or e...

       Example
    0  (a == b) is not true.
    1  (a != b) is true.
    2  (a <> b) is true. This is similar to != operator.
    3  (a > b) is not true.
    4  (a < b) is true.
    5  (a >= b) is not true.
    6  (a <= b) is true.  ]
    [  Operator            Description  \
    0  =                   Assigns values from right side operands to lef...
    1  += Add AND          It adds right operand to the left operand and ...
    2  -= Subtract AND     It subtracts right operand from the left opera...
    3  *= Multiply AND     It multiplies right operand with the left oper...
    4  /= Divide AND       It divides left operand with the right operand...
    5  %= Modulus AND      It takes modulus using two operands and assign...
    6  **= Exponent AND    Performs exponential (power) calculation on op...
    7  //= Floor Division  It performs floor division on operators and as...

       Example
    0  c = a + b assigns value of a + b into c
    1  c += a is equivalent to c = c + a
    2  c -= a is equivalent to c = c - a
    3  c *= a is equivalent to c = c * a
    4  c /= a is equivalent to c = c / a
    5  c %= a is equivalent to c = c % a
    6  c **= a is equivalent to c = c ** a
    7  c //= a is equivalent to c = c // a  ]
    [  Operator  \
    0  & Binary AND
    1  | Binary OR
    2  ^ Binary XOR
    3  ~ Binary Ones Complement
    4  << Binary Left Shift
    5  >> Binary Right Shift

       Description  \
    0  Operator copies a bit to the result if it exis...
    1  It copies a bit if it exists in either operand.
    2  It copies the bit if it is set in one operand ...
    3  It is unary and has the effect of 'flipping' b...
    4  The left operands value is moved left by the n...
    5  The left operands value is moved right by the ...

       Example
    0  (a & b) (means 0000 1100)
    1  (a | b) = 61 (means 0011 1101)
    2  (a ^ b) = 49 (means 0011 0001)
    3  (~a ) = -61 (means 1100 0011 in 2's complement...
    4  a << 2 = 240 (means 1111 0000)
    5  a >> 2 = 15 (means 0000 1111)  ]
    [  Operator         Description  \
    0  and Logical AND  If both the operands are true then condition b...
    1  or Logical OR    If any of the two operands are non-zero then c...
    2  not Logical NOT  Used to reverse the logical state of its operand.

       Example
    0  (a and b) is true.
    1  (a or b) is true.
    2  Not(a and b) is false.  ]
    [  Operator  Description  \
    0  in        Evaluates to true if it finds a variable in th...
    1  not in    Evaluates to true if it does not finds a varia...

       Example
    0  x in y, here in results in a 1 if x is a membe...
    1  x not in y, here not in results in a 1 if x is...  ]
    [  Operator  Description  \
    0  is        Evaluates to true if the variables on either s...
    1  is not    Evaluates to false if the variables on either ...

       Example
    0  x is y, here is results in 1 if id(x) equals i...
    1  x is not y, here is not results in 1 if id(x) ...  ]
    [   Sr.No.  Operator & Description
    0   1       ** Exponentiation (raise to the power)
    1   2       ~ + - Complement, unary plus and minus (method...
    2   3       * / % // Multiply, divide, modulo and floor di...
    3   4       + - Addition and subtraction
    4   5       >> << Right and left bitwise shift
    5   6       & Bitwise 'AND'
    6   7       ^ | Bitwise exclusive `OR' and regular `OR'
    7   8       <= < > >= Comparison operators
    8   9       <> == != Equality operators
    9   10      = %= /= //= -= += *= **= Assignment operators
    10  11      is is not Identity operators
    11  12      in not in]