Regular expressions and the re module
Published: 2019-06-25


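I. Regular expressions

  A regular expression is a kind of logical formula for operating on strings. We mostly use regular expressions to match and filter strings. Their main advantages and disadvantages are listed below, and patterns can be tested online at http://tool.chinaz.com/regex/.

  Advantages: flexible, powerful, and highly logical.

  Disadvantage: hard to get started with, although very convenient once you have picked it up.

  A regular expression is made up of ordinary characters and metacharacters. Ordinary characters include uppercase and lowercase letters and digits. To match ordinary characters we simply write them as they are; for example, 'abc' matches 'abc'. Metacharacters are the real soul of regular expressions.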

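  1. Character classes: a character class is simple. It is written inside [], and any single character listed inside the [] is matched. For example, [abc] matches a, b, or c. When a class would contain too many characters, use - to express a range: [a-z] matches every lowercase letter from a to z, and [0-9] matches every Arabic digit.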
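
The character classes described above can be tried with Python's re.findall. This is a minimal sketch; the sample string and the variable names are assumptions made for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "abc123xyz"

# [abc] matches any single one of the characters a, b, or c.
abc_hits = re.findall(r"[abc]", text)

# [0-9] matches any single digit; [a-z] matches any single lowercase letter.
digit_hits = re.findall(r"[0-9]", text)
letter_hits = re.findall(r"[a-z]", text)

print(abc_hits)     # ['a', 'b', 'c']
print(digit_hits)   # ['1', '2', '3']
print(letter_hits)  # ['a', 'b', 'c', 'x', 'y', 'z']
```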

  2. Simple metacharacters
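
A minimal sketch of a few commonly used metacharacters, using Python's re module. The sample string and variable names are assumptions made for illustration, and the selection shown here is a common subset rather than a complete list.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Tel: 555 0199"

any_char = re.findall(r"T.l", text)      # .  any character except a newline
digits = re.findall(r"\d", text)         # \d any digit
words = re.findall(r"\w+", text)         # \w letter, digit, or underscore
spaces = re.findall(r"\s", text)         # \s any whitespace character
at_start = re.findall(r"^Tel", text)     # ^  start of the string
at_end = re.findall(r"0199$", text)      # $  end of the string
either = re.findall(r"555|0199", text)   # |  alternation: left or right side

print(any_char)  # ['Tel']
print(digits)    # ['5', '5', '5', '0', '1', '9', '9']
print(words)     # ['Tel', '555', '0199']
print(spaces)    # [' ', ' ']
print(at_start)  # ['Tel']
print(at_end)    # ['0199']
print(either)    # ['555', '0199']
```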

  

  

  3. Quantifiers
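
A quantifier says how many times the preceding item may repeat. The sketch below shows the usual ones (*, +, ?, {n}, {n,m}) against a made-up sample string; the string and variable names are assumptions for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "a ab abb abbb"

zero_or_more = re.findall(r"ab*", text)    # * : b repeated 0 or more times
one_or_more = re.findall(r"ab+", text)     # + : b repeated 1 or more times
zero_or_one = re.findall(r"ab?", text)     # ? : b repeated 0 or 1 time
exactly_two = re.findall(r"ab{2}", text)   # {2} : b repeated exactly twice
two_to_three = re.findall(r"ab{2,3}", text)  # {2,3} : b repeated 2 to 3 times

print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb']
print(one_or_more)   # ['ab', 'abb', 'abbb']
print(zero_or_one)   # ['a', 'ab', 'ab', 'ab']
print(exactly_two)   # ['abb', 'abb']
print(two_to_three)  # ['abb', 'abbb']
```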

  

  4. Lazy and greedy matching
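
By default a quantifier is greedy and matches as much as it can; adding ? after it makes it lazy, so it stops at the first point where the rest of the pattern can match. The sketch below contrasts .* with .*?, the form used in the scraping examples later in this post. The HTML snippet is a made-up assumption.

```python
import re

# Hypothetical HTML fragment, chosen only for illustration.
html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last </b>, so both tags end up in one match.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first </b>, so each tag is matched separately.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```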

  

  

  5. Groups
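
Parentheses group part of a pattern so it can be captured. In Python, (?P<name>...) gives the group a name that can be read back with .group('name'), and (?:...) groups without capturing. A minimal sketch follows; the sample strings and the group names path and title are assumptions for illustration.

```python
import re

# Hypothetical HTML fragment, chosen only for illustration.
line = '<a href="/i/100.html" title="Some Movie">'

# Named groups: each (?P<name>...) part can be fetched by its name.
m = re.search(r'href="/(?P<path>.*?)".*?title="(?P<title>.*?)"', line)
path = m.group("path")
title = m.group("title")

# With capturing groups, findall returns only the captured parts.
sites = "www.baidu.com www.example.com"
captured = re.findall(r"www\.(baidu|example)\.com", sites)

# (?:...) groups without capturing, so findall returns the whole match.
whole = re.findall(r"www\.(?:baidu|example)\.com", sites)

print(path)      # i/100.html
print(title)     # Some Movie
print(captured)  # ['baidu', 'example']
print(whole)     # ['www.baidu.com', 'www.example.com']
```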

  

  6. Escaping
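
To match a metacharacter literally, put a backslash in front of it. In Python it is easiest to write patterns as raw strings (r'...') so the backslashes do not need to be doubled again. A minimal sketch, with a made-up sample string:

```python
import re

# Hypothetical sample text (a raw string, so \n here is a backslash followed by n).
text = r"path C:\new 3.14 3x14"

# An unescaped . matches any character, so '3x14' is matched as well.
loose = re.findall(r"3.14", text)

# \. matches only a literal dot.
strict = re.findall(r"3\.14", text)

# A literal backslash is written \\ inside the pattern.
backslash = re.findall(r"\\new", text)

# re.escape adds the backslashes for you when the text comes from elsewhere.
escaped = re.escape("3.14")

print(loose)      # ['3.14', '3x14']
print(strict)     # ['3.14']
print(backslash)  # ['\\new']
print(escaped)    # 3\.14
```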

  

II. The re module
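
A minimal sketch of the most commonly used functions in Python's re module: findall, search, match, sub, split, compile, and finditer. The sample string, the group names name and age, and the variable names are assumptions made for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "eva 18, alex 22, jack 30"

# findall returns every match as a list of strings.
all_numbers = re.findall(r"\d+", text)

# search returns the first match anywhere in the string (or None).
first_number = re.search(r"\d+", text).group()

# match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r"\d+", text)

# sub replaces every match; split cuts the string at every match.
masked = re.sub(r"\d+", "N", text)
parts = re.split(r",\s*", text)

# compile builds a reusable pattern object; finditer yields match objects lazily.
pattern = re.compile(r"(?P<name>[a-z]+) (?P<age>\d+)")
people = {m.group("name"): int(m.group("age")) for m in pattern.finditer(text)}

print(all_numbers)   # ['18', '22', '30']
print(first_number)  # 18
print(at_start)      # None
print(masked)        # eva N, alex N, jack N
print(parts)         # ['eva 18', 'alex 22', 'jack 30']
print(people)        # {'eva': 18, 'alex': 22, 'jack': 30}
```

Compiling the pattern once and iterating with finditer is the same combination the scraping examples in this post rely on.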

  

  

  

  

  

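III. Example 1: scraping movie download links with re and urllib. I scraped the movies listed on 电影天堂 (dy2018.com). The long if/elif block in the middle only skips movies whose pages have encoding problems.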

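from urllib.request import urlopen
import re

for i in range(1, 302):
    # Page 1 has no numeric suffix; later pages are index_2.html, index_3.html, ...
    if i == 1:
        url = 'https://www.dy2018.com/html/gndy/dyzz/index.html'
    else:
        url = 'https://www.dy2018.com/html/gndy/dyzz/index_%s.html' % i
    print(url)
    # Record the list-page URL in the output file.
    with open('move.txt', mode='a', encoding='utf-8') as f1:
        f1.write(url + '\n')
    # The site serves GBK-encoded pages.
    content = urlopen(url).read().decode('gbk')
    # The HTML tags and group names inside these patterns did not survive; "…" marks each gap.
    if i == 51:
        obj = re.compile(r' .*?.*?a href="/(?P<…>.*?)".*?">(?P<…>.*?)', re.S)
    else:
        obj = re.compile(r'.*?a href="/(?P<…>.*?)".*?">(?P<…>.*?)…', re.S)
    # … (missing here: the loop over the matches, the if/elif chain that skips
    # movies with encoding problems, the fetch of each detail page into content1,
    # and the start of the obj1 pattern)
    obj1 = re.compile(r'…(?P<adress1>.*?)">', re.S)
    ss1 = obj1.search(content1)
    d1['adress'] = ss1.group('adress1')
    # Append the collected record for this movie to the output file.
    with open('move.txt', mode='a', encoding='utf-8') as f1:
        f1.write(str(d1) + '\n')

IV. Scraping Victoria's Secret photos with re and urllib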
import os
import re
from urllib.request import urlopen, urlretrieve

# Image IDs to skip (the same IDs the original if/elif chain skipped).
skip_ids = {
    '525558981780570122', '525569719200382978',
    '525589673617915918', '525563551629443081',
    '525571411419332618', '525585163906711562',
    '525554334627135503', '525554274497593347',
    '525555872228048910', '525555889405689865',
    '525566472207728651', '525569650482741255',
    '525571420010053640', '525571196669132802',
    '525571239618281484', '525575964084666381',
    '525577347064528914', '525580594059804675',
    '525592946387451916', '525602292234190873',
    '525551036092645412', '525552685360087057',
    '525554300266217477', '525555657477062662',
    '525557349694570506', '525557418413522955',
    '525575757924401164', '525585146722910211',
    '525591194036862980', '525594604244828160',
    '525619248767172615', '525546397526392859',
    '525547952305733639', '525549618751864836',
    '525552539333034011', '525560631049584650',
    '525561910952460294', '525577252573020167',
    '525577390015905802', '525578919019806730',
    '525589845421064206', '525600746043736066',
    '525600746046750729', '525600539888320533',
    '525636059267465227', '525551070454611970',
    '525551173533827166', '525552693952643113',
    '525557461366341633', '525557169308041219',
    '525558818575613969', '525559024730243078',
    '525560596692074509', '525563345468391437',
    '525575809462435842', '525578901843083269',
    '525580405077442575', '525582148835344441',
    '525585146721337349', '525549532855926790',
    '525552711133822979', '525560338990235661',
    '525565097818193935', '525566489386811399',
    '525569684840644622', '525571239620640780',
    '525571316927692806', '525572725675917316',
    '525573017733693460', '525574254687682562',
    '525574538153689102', '525575869593550849',
    '525594295007182858', '525596004400234497',
    '525608305186177028', '525650120992096265',
    '525564900245504013',
}

# Make sure the output directory exists before downloading into it.
os.makedirs('维密', exist_ok=True)

# Compile once; the named group captures each image ID.
obj = re.compile(r'"imageId":"(?P<bianhao>.*?)"', re.S)

for n in range(1, 5):
    # The first page has no paging parameters; later pages do.
    if n == 1:
        url2 = 'https://stock.tuchong.com/topic?topicId=49282'
    else:
        url2 = 'https://stock.tuchong.com/topic?topicId=49282&page=%s&count=100' % n
    content = urlopen(url2).read().decode('utf-8')
    # File numbering starts at 1, 101, 201, ... depending on the page.
    i = (n - 1) * 100 + 1
    for el in obj.finditer(content):
        num = el.group('bianhao')
        if num in skip_ids:
            continue
        s = 'http://p3.pstatp.com/weili/ms/%s.webp' % num
        print(s)
        # Saved under a .jpg name as in the original, although the URL ends in .webp.
        urlretrieve(s, os.path.join('维密', '%s.jpg' % i))
        i += 1
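
The extract-and-skip step of the script above can be tried without any network access. The sketch below is a simplified stand-in, not the original script: the page content and the short IDs are made up, and the IDs to skip are held in a set instead of an if/elif chain.

```python
import re

# Hypothetical stand-in for the downloaded page content.
content = '{"imageId":"1001"},{"imageId":"1002"},{"imageId":"1003"}'

# Hypothetical IDs to skip.
skip_ids = {"1002"}

# Same pattern shape as in the script: a lazy named group between the quotes.
pattern = re.compile(r'"imageId":"(?P<bianhao>.*?)"', re.S)

urls = []
for m in pattern.finditer(content):
    num = m.group("bianhao")
    if num in skip_ids:
        continue  # skip unwanted images
    urls.append("http://p3.pstatp.com/weili/ms/%s.webp" % num)

print(urls)
```

A set lookup keeps the skip list in one place and avoids one elif branch per pair of IDs.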
 

 

Reposted from: https://www.cnblogs.com/12345huangchun/p/9971441.html
