Extracting Links from a Web Page in Python

Prerequisites:

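urllib3: a powerful, user-friendly HTTP client for Python with many features, including thread safety, client-side SSL/TLS verification, connection pooling, and file uploads with multipart encoding. Note that the program in this article fetches the page with urllib.request from the standard library, which ships with Python and needs no installation; urllib3 is a separate third-party package.
Install urllib3:
```
    $ pip install urllib3
```
BeautifulSoup: a Python library for extracting data from HTML and XML files, such as downloaded web pages.
Install BeautifulSoup:
```
    $ pip install beautifulsoup4
```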
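Since urllib3 is listed as a prerequisite, here is a minimal sketch of how a page could be fetched with it instead of the standard library. The function name `fetch_html` and the choice to raise `RuntimeError` on a non-200 status are assumptions made for illustration; they are not part of the original program.

```python
import urllib3

# One PoolManager can be shared by all requests; it handles connection pooling.
http = urllib3.PoolManager()


def fetch_html(url):
    """Return the body of the page at url as bytes (hypothetical helper)."""
    response = http.request("GET", url)
    if response.status != 200:
        # Assumption: treat any status other than 200 as a failure.
        raise RuntimeError(f"Unexpected HTTP status {response.status} for {url}")
    return response.data

# Example call (needs network access):
# html = fetch_html("https://www.google.com/")
```

The returned bytes can be passed to BeautifulSoup in the same way as the result of urllib.request.urlopen(url).read().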

Commands used:

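html = urllib.request.urlopen(url).read(): opens the URL and reads the entire response body, newlines included, into a single bytes object.

soup = BeautifulSoup(html, 'html.parser'): parses the downloaded page with Python's built-in HTML parser and returns a BeautifulSoup object representing the document.

tags = soup('a'): returns a list of all anchor tags. Calling the soup object directly is a shortcut for soup.find_all('a').

tag.get('href', None): returns the value of the tag's href attribute, or None if the tag has no href.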
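To see what each of these commands returns without touching the network, the sketch below runs them on a small hand-written page. The HTML snippet is hypothetical and stands in for the bytes that urllib.request.urlopen(url).read() would return.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; stands in for urllib.request.urlopen(url).read(),
# which returns the response body as bytes.
html = b"""
<html><body>
  <a href="https://example.com/">Absolute link</a>
  <a href="/about">Relative link</a>
  <a name="top">Anchor without href</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # parse the bytes into a document tree
tags = soup("a")                           # shortcut for soup.find_all("a")

# tag.get returns None for anchors that have no href attribute
hrefs = [tag.get("href", None) for tag in tags]
print(hrefs)  # ['https://example.com/', '/about', None]
```

The third anchor has no href, so it shows up as None in the list. This is why a real run can print None among the links.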

Python program to extract links from a web page

import urllib.request
from bs4 import BeautifulSoup

# URL of the page to extract links from
url = input("Enter URL: ")

# Open the URL and read the whole page as bytes
html = urllib.request.urlopen(url).read()

# Parse the page with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')

# Retrieve a list of all anchor tags
tags = soup('a')

# Print the href value of every anchor tag (None if it has no href)
for tag in tags:
    print(tag.get('href', None))

Output:

Enter URL: https://www.google.com/
https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=US&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/
/advanced_search?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
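
The output above mixes absolute URLs with relative ones such as /preferences?hl=en. If full URLs are needed, one possible approach is to resolve each href against the page URL with urllib.parse.urljoin from the standard library. The helper name `absolutize` and the sample list are assumptions made for illustration; the hrefs are taken from the output above.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url, skipping missing (None) values."""
    return [urljoin(base_url, href) for href in hrefs if href is not None]


base_url = "https://www.google.com/"
hrefs = [
    "/preferences?hl=en",                         # relative: resolved against base_url
    "/intl/en/ads/",                              # relative: resolved against base_url
    "https://maps.google.com/maps?hl=en&tab=wl",  # absolute: returned unchanged
    None,                                         # anchor without href: skipped
]

links = absolutize(base_url, hrefs)
for link in links:
    print(link)
# https://www.google.com/preferences?hl=en
# https://www.google.com/intl/en/ads/
# https://maps.google.com/maps?hl=en&tab=wl
```

urljoin leaves absolute URLs untouched, so the same call is safe for every href the program collects.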
