Background
The other day I logged into Baidu Webmaster Tools and found that not a single link had been indexed for a whole month; the sitemap submission method had simply stopped working. I later wrote a script to submit links automatically, but the crawler that collected the article links never worked well (it was essentially a brute-force recursive implementation) and was very inefficient. Yesterday, while working on a visualization, I finally found a much better way to collect the article links.
Getting the article links
I gave up on the old approach. My site runs on WordPress, which already has a good function library for this, so I put a PHP page in the site root to do the job:
<?php
// Drop this file (e.g. all.php) into the WordPress root so the require below resolves.
require('wp-blog-header.php');
header('Content-type: text/xml');
header('HTTP/1.1 200 OK');
$posts_to_show = 2000; // how many recent posts to list
echo '<?xml version="1.0" encoding="UTF-8"?>';
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:mobile="http://www.baidu.com/schemas/sitemap-mobile/1/">';
$myposts = get_posts( 'numberposts=' . $posts_to_show );
foreach ( $myposts as $post ) { ?>
<url>
<loc><?php echo esc_url( get_permalink( $post ) ); ?></loc>
</url>
<?php } ?>
</urlset>
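Before feeding the page to the push script, it helps to check that the XML it emits actually parses. A minimal sketch using only the Python standard library; the sample string stands in for the output of the PHP page above, since the real URL depends on where you placed the file:

```python
import xml.etree.ElementTree as ET

def fetch_locs(sitemap_xml):
    """Return the list of <loc> URLs from sitemap XML text."""
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall('sm:url/sm:loc', ns)]

# Stand-in for the output of the PHP page; in practice fetch the page
# with urllib.request.urlopen(...) and pass its body here instead.
sample = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
          '<url><loc>http://example.com/post-1</loc></url>'
          '</urlset>')
print(fetch_locs(sample))  # → ['http://example.com/post-1']
```

If this prints your post URLs rather than raising a parse error, the page is good to go.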
Python active-push code
Getting the push token
- First go to Baidu Webmaster Tools and get your token (in the place shown in the figure below).
Code
On my site all.php is rewritten to all, so replace that path as your own setup requires; also replace the push token.
import requests as req
from bs4 import BeautifulSoup

def push_urls(url, urls):
    # Baidu expects the URL list as plain text in the POST body
    headers = {
        'User-Agent': 'curl/7.12.1',
        'Host': 'data.zz.baidu.com',
        'Content-Type': 'text/plain',
    }
    try:
        return req.post(url, headers=headers, data=urls, timeout=5).text
    except req.RequestException as e:
        return 'push failed: %s' % e

all_link = []
origin_url = 'yourSite/all'  # or all.php, depending on your setup
r = req.get(origin_url)
bs = BeautifulSoup(r.content, 'html.parser')  # parse the sitemap page
for h in bs.find_all(name='loc'):
    all_link.append(h.string)

count = 0  # pushed successfully
error = 0  # failed to push
url = 'http://data.zz.baidu.com/urls?site=yourSite&token=yourToken'
for m in all_link:
    try:
        print(push_urls(url, m))
        count += 1
    except Exception:
        print('\nError: %s' % m)
        error += 1
print('\nErrors: %d' % error)
print('\nLinks: %d' % count)
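The loop above POSTs one link per request. Baidu's push endpoint also accepts several URLs in a single body, one per line, so batching cuts the number of requests considerably. A sketch along those lines; the batch size of 500 is an arbitrary choice of mine, not a documented limit:

```python
import requests as req

def chunk(seq, size):
    """Split seq into consecutive slices of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def push_batch(api_url, links, batch_size=500):
    """POST the links in newline-separated batches; return the raw responses."""
    headers = {'User-Agent': 'curl/7.12.1', 'Content-Type': 'text/plain'}
    replies = []
    for batch in chunk(links, batch_size):
        body = '\n'.join(batch)  # one URL per line, as the endpoint expects
        replies.append(req.post(api_url, headers=headers, data=body, timeout=10).text)
    return replies
```

Calling `push_batch(url, all_link)` would then replace the per-link loop above.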
Scheduled pushing
- Taking CentOS 7 as an example, open the crontab:
crontab -e
- For example, to push once a night at 3:00, with the push script push.py under the /home directory:
0 3 * * * python3 /home/push.py