
Submit a Hexo Sitemap to Baidu with GitHub Actions

When I looked at this blog’s rather modest page views, I started paying more attention to SEO. That was also the first time I logged into Baidu Search Resource Platform, only to discover that Baidu had indexed just 8 pages from my site.

No wonder almost all of my traffic was coming from Google and Bing. Baidu had barely indexed anything. Since Baidu does not provide a sitemap submission API like Google does, the only practical option is to submit URLs directly. So I put together a small workflow that lets this Hexo blog submit sitemap URLs to Baidu automatically through GitHub Actions.

Prepare the script

First, write a small Python script that downloads a sitemap.xml, extracts the URLs, and submits them to Baidu:

```python
#!/usr/bin/env python3

"""
Script for submitting sitemap URLs to Baidu

visit: https://knktc.com

@author:knktc
@contact:me@knktc.com
@create:2022-02-12 22:49
"""

import time
import argparse
from urllib import request
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def chunker(seq, size):
    """ iterate list by chunk """
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))


class BaiduSubmitter:
    def __init__(self, site: str, token: str, sitemap: str):
        self.submit_url = self.gen_submit_url(site, token)
        self.sitemap_url = self.gen_sitemap_url(site, sitemap)

    @staticmethod
    def gen_submit_url(site: str, token: str) -> str:
        """ generate url to submit to """
        return f'http://data.zz.baidu.com/urls?site={site}&token={token}'

    @staticmethod
    def gen_sitemap_url(site: str, sitemap: str) -> str:
        """ generate url path to get sitemap """
        return urljoin(site, sitemap)

    @staticmethod
    def get_links_from_sitemap(sitemap_url) -> list:
        """ download sitemap, parse and get urls """
        with request.urlopen(sitemap_url) as resp:
            data = resp.read()

        root = ET.fromstring(data)
        ns = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
        return [loc.text for loc in root.findall(f'./{ns}url/{ns}loc')]

    @staticmethod
    def submit(submit_url: str, links: list):
        """ submit to baidu """
        data = '\n'.join(links).encode('utf8')
        req = request.Request(submit_url, data=data)
        return request.urlopen(req).read().decode()

    def run(self, chunk_size=20, sleep_time=0.1):
        """ submit process """
        links = self.get_links_from_sitemap(self.sitemap_url)
        print(f'Get {len(links)} links from sitemap: [{self.sitemap_url}]')

        for chunk in chunker(links, chunk_size):
            resp = self.submit(self.submit_url, chunk)
            print(resp)
            if sleep_time:
                time.sleep(sleep_time)

        time.sleep(1)


def get_args():
    """ get cli args """
    parser = argparse.ArgumentParser(description='Submit sitemap to Baidu')
    parser.add_argument('--site', '-s', type=str, dest='site', required=True,
                        help='your site, eg: https://knktc.com')
    parser.add_argument('--token', '-t', type=str, dest='token', required=True,
                        help='baidu ziyuan token, you may find your token in https://ziyuan.baidu.com/linksubmit')
    parser.add_argument('--sitemap', '-p', type=str, dest='sitemap', default='sitemap.xml',
                        help='url path to get sitemap.xml file, default: sitemap.xml')
    parser.add_argument('--chunk', '-c', type=int, dest='chunk_size', default=100,
                        help='how many urls should be submitted each time')

    return parser.parse_args()


def main():
    """ main process """
    args = get_args()
    submitter = BaiduSubmitter(args.site, args.token, args.sitemap)
    submitter.run(chunk_size=args.chunk_size)


if __name__ == '__main__':
    main()
```

This script uses only Python’s standard library, so there are no extra dependencies to install.
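To see what the parser expects, here is a minimal sketch of the same namespaced ElementTree lookup the script performs, run against a hypothetical two-URL sitemap (the `example.com` URLs are placeholders, not from the real blog):

```python
import xml.etree.ElementTree as ET

# A hypothetical minimal sitemap, shaped like what Hexo's sitemap plugin emits
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-1/</loc></url>
  <url><loc>https://example.com/post-2/</loc></url>
</urlset>"""

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

root = ET.fromstring(SAMPLE)
# same namespaced path the script uses: urlset -> url -> loc
links = [loc.text for loc in root.findall(f'./{NS}url/{NS}loc')]
print(links)  # ['https://example.com/post-1/', 'https://example.com/post-2/']
```

Because the sitemap namespace must appear in every path segment, forgetting the `{…}` prefix is the most common reason `findall` silently returns an empty list.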

Save it locally, get your Baidu submission token from Baidu Search Resource Platform, and run it like this:

```shell
python3 baidu_submit.py --site https://knktc.com --token AABBCCDD --sitemap sitemap.xml --chunk 100
```

A quick explanation of the arguments:

  • --site or -s: your blog URL
  • --token or -t: the submission token from Baidu
  • --sitemap or -p: the sitemap path; for example, if your sitemap is at https://knktc.com/sitemap.xml, then sitemap.xml is enough here
  • --chunk or -c: how many URLs to send in each request; the default is 100
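The chunking behavior is easy to verify on its own. Assuming a sitemap with 235 URLs (hypothetical placeholders below) and the default chunk size of 100, the script's `chunker` generator yields batches of 100, 100, and 35:

```python
def chunker(seq, size):
    """Iterate over a list in fixed-size chunks; the last chunk may be smaller."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# hypothetical URL list standing in for a parsed sitemap
urls = [f'https://example.com/post-{i}/' for i in range(235)]
sizes = [len(chunk) for chunk in chunker(urls, 100)]
print(sizes)  # [100, 100, 35]
```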

The output looks like this:

```
Get 235 links from sitemap: [https://knktc.com/sitemap.xml]
{"remain":2585,"success":100}
{"remain":2485,"success":100}
{"remain":2450,"success":35}
```

For convenience, I also published the script as a GitHub Gist:

https://gist.github.com/knktc/846950067e60a92612c1befbe4213a32

That way, the GitHub Actions workflow can just fetch the script directly.

GitHub Actions

There is a small trick when adding GitHub Actions files to a Hexo repository. See my earlier post: Use GitHub Actions to Submit a Sitemap for a Hexo Blog.

Then create a workflow file named baidu_sitemap.yml:

```yaml
# workflow to submit urls from sitemap to baidu

name: Submit baidu Sitemap

on:
  schedule:
    - cron: '15 2 * * *'

jobs:
  submit:
    runs-on: ubuntu-latest

    steps:
      - name: get gist
        uses: andymckay/get-gist-action@0.1
        with:
          gistURL: https://gist.github.com/knktc/846950067e60a92612c1befbe4213a32

      - name: run script
        env:
          BAIDU_TOKEN: ${{ secrets.BAIDU_TOKEN }}
        run: python3 /tmp/baidu_submit.py --site https://knktc.com --token $BAIDU_TOKEN
```

There are a few important points in this workflow:

  • It runs every day at 02:15 UTC, which is 10:15 Beijing time.
  • The script is fetched from a Gist instead of being committed into the Hexo repository.
  • The Baidu token is passed through GitHub Secrets, so you need to configure BAIDU_TOKEN in the repository settings first.
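As a quick sanity check on the schedule, the cron expression fires at 02:15 UTC; converting that instant to Asia/Shanghai (UTC+8, no daylight saving) confirms the 10:15 local time (the date below is arbitrary):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# an arbitrary date at the cron trigger time, 02:15 UTC
utc_run = datetime(2022, 2, 12, 2, 15, tzinfo=timezone.utc)
beijing = utc_run.astimezone(ZoneInfo('Asia/Shanghai'))
print(beijing.strftime('%H:%M'))  # 10:15
```

Note that `zoneinfo` requires Python 3.9 or later; GitHub Actions cron schedules are always interpreted in UTC, so no timezone setting in the workflow changes this.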

After that, GitHub Actions can submit your latest sitemap URLs to Baidu automatically every day.

If my writing helped you, would you buy me a can of cola?