Getting Started with BeautifulSoup in Python

Pastore Antonio, September 11, 2020


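In an earlier example I used BeautifulSoup to scrape phone repair shop listings from 58同城, and the library really is convenient to use. This article is a more detailed introduction to BeautifulSoup, meant as a starting point. Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/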

What is BeautifulSoup?

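Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save you a lot of programming time.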

Let's go straight to an example:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)             # the <title> tag

print(soup.title.name)        # the tag's name

print(soup.title.string)      # the text inside <title>

print(soup.p)                 # the first <p> tag

print(soup.a)                 # the first <a> tag

print(soup.find_all('a'))     # every <a> tag

print(soup.find(id='link3'))  # the tag whose id is "link3"

print(soup.get_text())        # all text in the document

The output is:

<title>The Dormouse's story</title>
title
The Dormouse's story
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

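As the output shows, soup is the parsed document object that BeautifulSoup builds from the input string, not a plain string. soup.title returns the title tag, and soup.p returns only the first p tag in the document; to get every matching tag, use the find_all function. find_all returns a list-like sequence, so you can loop over it and pull out whatever you need from each item.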
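A minimal sketch of looping over a find_all result. It assumes a shortened, hypothetical two-link snippet rather than the full sample document, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, shortened from the sample document.
html = (
    '<p>'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result, so an ordinary for loop works.
links = []
for a in soup.find_all("a"):
    links.append((a.get_text(), a["href"]))

print(links)
```

Each item in the result is a tag object, so everything that works on soup.a also works inside the loop.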

get_text() returns the text content. It works on the soup object and on any tag obtained from it. Try print(soup.p.get_text()).
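A small sketch of get_text() on individual tags, assuming a hypothetical one-paragraph snippet. The separator and strip arguments are optional parameters of get_text() that the text above does not cover:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration.
html = (
    '<p class="story">Names: '
    '<a href="http://example.com/elsie">Elsie</a> and '
    '<a href="http://example.com/lacie">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

text_of_p = soup.p.get_text()   # text of the tag and everything inside it
text_of_a = soup.a.get_text()   # text of the first <a> only

# Strip each piece of text and join the pieces with a separator.
joined = soup.p.get_text("|", strip=True)

print(text_of_p)
print(text_of_a)
print(joined)
```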

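You can also read a tag's attributes. For example, to get the href value of the a tag, use print(soup.a['href']). Other attributes work the same way, such as class (soup.a['class']); because class can hold several values, it comes back as a list, here ['sister'].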
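The attribute access described above can be sketched as follows, assuming a single hypothetical a tag. The .get() call and the .attrs dictionary are additions not mentioned in the text; they are handy when an attribute may be absent:

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

href = a["href"]          # dictionary-style access to one attribute
classes = a["class"]      # class is multi-valued, so this is a list
missing = a.get("title")  # None here; a["title"] would raise KeyError
all_attrs = a.attrs       # every attribute as a dictionary

print(href, classes, missing)
```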

Structural tags such as head are reached the same way, via soup.head, as already mentioned above.

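How do you get a tag's children as a list? Use the contents attribute. For example, print(soup.head.contents) returns all direct children of head as a list. Pick one child with an index, [num], and read its tag name with .name.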

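children also gives you a tag's children, but print(soup.head.children) does not print a list; it prints an iterator object such as <listiterator object at 0x108e6d150>. Wrap it in list() to get a list, or simply loop over it with a for statement.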
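A short sketch contrasting contents and children, assuming a hypothetical document written without whitespace between tags, so that every child is a tag rather than a text node:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>one</p><p>two</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

head_children = soup.head.contents       # a real list
first_name = head_children[0].name       # index it, then read .name

# children is an iterator: loop over it or wrap it in list().
body_names = [child.name for child in soup.body.children]
same_items = list(soup.body.children) == soup.body.contents

print(first_name, body_names, same_items)
```

In real pages, line breaks between tags show up as text children as well, so the lists are usually longer than the tag count suggests.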

The string attribute returns None when a tag contains more than one child; otherwise it returns the tag's text. print(soup.title.string) prints The Dormouse's story.

When a tag contains more than one child, use strings instead, which yields each piece of text in turn.
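The difference between string and strings can be sketched like this, assuming a hypothetical paragraph that mixes text with a nested b tag. stripped_strings is a related attribute not mentioned above:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # one child, so the text comes back
p_text = soup.p.string           # two children ("Hello " and <b>), so None

pieces = list(soup.p.strings)             # every piece of text, in order
clean = list(soup.p.stripped_strings)     # same, with whitespace stripped

print(title_text, p_text, pieces, clean)
```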

To move up the tree, use the parent attribute; to walk through all ancestors, use parents.

next_sibling gives the next sibling and previous_sibling the previous one. To iterate over all of them, add an s to the name: next_siblings and previous_siblings.
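A sketch of moving up and sideways through the tree. It assumes a hypothetical snippet with no whitespace between the a tags; in ordinary HTML the text between tags counts as a sibling too:

```python
from bs4 import BeautifulSoup

html = (
    '<p id="story">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="link1")
second = soup.find(id="link2")

parent_name = second.parent.name                 # the enclosing <p>
ancestors = [t.name for t in second.parents]     # up to the document itself

next_id = second.next_sibling["id"]
prev_id = second.previous_sibling["id"]
following = [t["id"] for t in first.next_siblings]
no_previous = first.previous_sibling             # None: nothing before it

print(parent_name, ancestors, next_id, prev_id, following, no_previous)
```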

How do you search the tree?

Use the find_all function:

find_all(name, attrs, recursive, text, limit, **kwargs)
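Two arguments in this signature are easy to overlook, recursive and limit. A minimal sketch of their effect, assuming a small hypothetical document:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>one</p><p>two</p><p>three</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Default: search every descendant of <html>.
deep = soup.html.find_all("title")

# recursive=False: look only at the direct children of <html>
# (head and body), so the nested <title> is not found.
shallow = soup.html.find_all("title", recursive=False)

# limit stops the search after that many matches.
first_two = [p.get_text() for p in soup.find_all("p", limit=2)]

print(len(deep), len(shallow), first_two)
```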

Examples:

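print(soup.find_all('title'))       # by tag name
print(soup.find_all('p', 'title'))  # <p> tags with CSS class "title"
print(soup.find_all('a'))
print(soup.find_all(id="link2"))    # by attribute value
print(soup.find_all(id=True))       # any tag that has an id attribute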

Returns:

[<title>The Dormouse's story</title>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Searching by CSS class or selector, again by example:

print(soup.find_all("a", class_="sister"))
print(soup.select("p.title"))

# Search by attribute
print(soup.find_all("a", attrs={"class": "sister"}))

# Search by text (newer Beautiful Soup releases name this argument string)
print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))

# Limit the number of results
print(soup.find_all("a", limit=2))

The output is:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
['Elsie']
['Elsie', 'Lacie', 'Tillie']
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In short, these functions let you find whatever you are looking for in a document.
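As a closing sketch, here is how find() and select_one() relate to the calls used above, and what each search returns when nothing matches. The snippet is a hypothetical cut-down version of the sample document, and the selector strings are only illustrations:

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, like find_all(..., limit=1)[0].
first_link_id = soup.find("a")["id"]
same_as_limit = soup.find_all("a", limit=1)[0]["id"]

# select() takes a CSS selector and returns every match.
ids_by_selector = [a["id"] for a in soup.select("p.story > a.sister")]

# select_one() returns only the first match for the selector.
lacie = soup.select_one("#link2").get_text()

# No match: find() gives None, find_all() gives an empty result.
no_tag = soup.find("table")
no_tags = soup.find_all("table")

print(first_link_id, ids_by_selector, lacie, no_tag, len(no_tags))
```

Checking for None before using the result of find() avoids an AttributeError on pages where the tag is absent.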

—end—
