How do I search for text content with BS4 and then get the tag that contains it?

omg21 · 2016-10-02T14:29:48Z

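Take markup like this, for example:

<p id="a1">新闻</p>
<p id="a2">娱乐</p>

I need to search for "娱乐" (entertainment) and, if it is found, get the tag <p id="a2">娱乐</p>. I tried soup.find_all(), but that seems to search only for tags, not for content.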

27 replies • 2016-10-03 20:05:27 +08:00

omg21

Oct 2, 2016

With soup.find_all(text="娱乐") I get [].
With soup.find(text="娱乐") I get None.
So right now I don't know how to get at the tag.
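The thread never establishes why these two calls came back empty (the original poster later says the mistake was on their side). One possible cause, purely an assumption here, is whitespace around the text in the real page: a plain string argument has to equal the whole text node. The sketch below uses made-up markup with padded text to show the exact match failing and a regular expression succeeding.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: same tags as in the question, but with padded text.
html = '<p id="a1"> 新闻 </p>\n<p id="a2"> 娱乐 </p>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the entire text node, so the padding breaks it.
exact = soup.find(string="娱乐")

# A compiled pattern is applied with search(), so a substring is enough.
loose = soup.find(string=re.compile("娱乐"))

# The match is a text node; its parent is the enclosing tag.
tag = loose.parent
print(exact)
print(tag)
```

On Beautiful Soup versions older than 4.4.0 the keyword is text rather than string, as noted further down the thread.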

practicer

Oct 2, 2016

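from bs4 import BeautifulSoup

html = '''<p id="a1">新闻</p>
<p id="a2">娱乐</p>'''

bs = BeautifulSoup(html, 'html.parser')
# find(text=...) returns the matching text node; find_parent() returns the tag that contains it
theTag = bs.find(text='娱乐').find_parent()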

omg21

Oct 2, 2016

@practicer Bro, this doesn't work; it throws an error.

gyh

Oct 2, 2016 via iPhone

Loop over find_all("p"); if the text == "娱乐", assign the tag to a variable and break out of the loop.
Something like that?
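A minimal sketch of that loop, assuming the markup from the question and a fixed <p> tag name. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<p id="a1">新闻</p>
<p id="a2">娱乐</p>'''
soup = BeautifulSoup(html, "html.parser")

found = None
for tag in soup.find_all("p"):
    # tag.string is the tag's single text child, or None if there isn't one.
    if tag.string == "娱乐":
        found = tag
        break  # stop at the first match

print(found)
```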

finalspeed

Oct 2, 2016 via Android

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the <a> tags whose .string is “Elsie”:

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text:

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

zqcolor

Oct 3, 2016

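import bs4

# Response is assumed to hold the page's HTML
soup = bs4.BeautifulSoup(Response, 'html.parser')

a = soup.find('a', text = '下一页')  # '下一页' is the "next page" link text

url = a['href']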

dsg001

Oct 3, 2016

Why not use lxml?
//p[text()="娱乐"]
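For reference, a sketch of how that XPath could be run with lxml. The markup is made up for illustration, and the expression uses * instead of p on the assumption that the tag name is not known in advance.

```python
from lxml import html as lxml_html

# Hypothetical markup with the same text inside two different tags.
doc = '<div><p id="a1">新闻</p><p id="a2">娱乐</p><span id="a3">娱乐</span></div>'
tree = lxml_html.fromstring(doc)

# Any element that has a direct text node equal to "娱乐".
matches = tree.xpath('//*[text()="娱乐"]')

for el in matches:
    print(el.tag, el.get("id"))
```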

omg21

Oct 3, 2016

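@finalspeed That works: find_all("a", string="Elsie"). But it has "a" in it. Without the "a", can I still get the whole <a> tag?
In my case I want <p id="a2">娱乐</p>, and the <p> tag here is variable; it could also be <div> or <td>. So I can only express it as something like find_all(text="娱乐"), but right now I can't get the tag out that way.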

omg21

Oct 3, 2016

@gyh Sorry, my question wasn't detailed enough. The <p> tag here is variable; it could also be <div> or <td>, so your approach isn't practical.

aristotll

Oct 3, 2016

I'd suggest CSS selectors. You'll have to learn them sooner or later, and they are more broadly useful than BS4's own functions.

popu111

Oct 3, 2016 via Android

@finalspeed This is why reading the English docs matters _(:з」∠)_ The Chinese translation hasn't caught up to 4.4.0 yet.

dsg001

Oct 3, 2016

@aristotll Do CSS selectors support selecting by text?

panyanyany

Oct 3, 2016

OP could first search for "娱乐" directly in the outerHTML, then walk back up to the nearest tag.

sherlocktheplant

Oct 3, 2016

@omg21

I haven't tried it, but it should be something like this:

result = soup.find_all(string=u"娱乐")

That should give you a list of NavigableString objects:

for item in result:
    element = item.parent

sherlocktheplant

Oct 3, 2016

I tested it; it should be
result = soup.find_all(string=u"娱乐")

sherlocktheplant

Oct 3, 2016

I tested it; it should be
result = soup.find_all(text=u"娱乐")

sherlocktheplant

Oct 3, 2016

https://gist.github.com/CarlLee/dd26bb9fcbd3b24667cb4a0497a26ad2

sherlocktheplant

Oct 3, 2016

Both string and text work.

shoaly

Oct 3, 2016

Give up on soup; it isn't that nourishing. Let me plug pyquery instead: its syntax is the same as jQuery's.

omg21

Oct 3, 2016

@sherlocktheplant Your method gets the text "娱乐", but it doesn't get the <p>. Also, the tag isn't fixed; it isn't necessarily <p>, there are others too.

xucuncicero

Oct 3, 2016

```python3
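from bs4 import BeautifulSoup

# html is the markup string from the question
soup = BeautifulSoup(html, 'html.parser')
theTag = soup.find_all(text='娱乐')
for tag in theTag:
    # each match is a text node; find_parent("p") returns its enclosing <p>
    print(tag.find_parent("p"))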
```

xucuncicero

Oct 3, 2016

The code in reply #2 is fine. You said it throws an error; what exactly is the error?

Kisesy

Oct 3, 2016

from bs4 import BeautifulSoup

b = BeautifulSoup("""<p id="a1">新闻</p><p id="a2">娱乐</p><div id="a3">娱乐</div>""", 'html.parser')
print(b.find_all(True, text='娱乐'))

Uh, if the tag isn't fixed, can't you just pass True?
Output: [<p id="a2">娱乐</p>, <div id="a3">娱乐</div>]

sherlocktheplant

Oct 3, 2016

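@omg21 You can. The strings you get back are not plain strings; they are NavigableString objects, so you can reach the tag directly through the parent attribute. Run my gist and you'll see.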
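A short sketch of that point, independent of the gist (whose contents are not reproduced here). The markup is made up and mixes two tag names to show that no tag name needs to be given.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = '<p id="a1">新闻</p><p id="a2">娱乐</p><div id="a3">娱乐</div>'
soup = BeautifulSoup(html, "html.parser")

# Searching by string alone returns text nodes, not tags.
hits = soup.find_all(string="娱乐")

# Each text node knows which tag contains it.
parents = [s.parent for s in hits]

for s, tag in zip(hits, parents):
    print(type(s).__name__, tag)
```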

omg21

Oct 3, 2016

@practicer Solved it, using your method. Thanks.

omg21

Oct 3, 2016

@xucuncicero Yes, it was my own mistake.

practicer

Oct 3, 2016

@omg21 You're welcome.