An AI-Assisted Web Crawler for Counting Blog Post Views

I had long wanted to build a page that tracks the view counts of the posts published on this blog. Recently, with the help of AI (ChatGPT), I generated a web crawler that accurately scrapes the view counts of all posts and automatically creates a ranking table in Markdown format. The whole process took a little over an hour; writing it from scratch myself would have taken at least 2-3 hours, so AI really did multiply my productivity. Here I record the key prompts and responses from the interaction with the AI, for future reference.

Generating the Web Crawler

Prompt: Can you write a crawler to get view counts for the posts of this blog site: www.packetmania.net

Response: To retrieve the view counts for posts on www.packetmania.net, you can develop a web crawler that performs the following steps:

  1. Access the Sitemap: Begin by fetching the site's sitemap, located at https://www.packetmania.net/sitemap.xml. This XML file lists all the URLs of the site's posts, facilitating efficient crawling.

  2. Parse the Sitemap: Extract all post URLs from the <loc> tags within the sitemap.

  3. Fetch Each Post: For each URL obtained, send an HTTP GET request to retrieve the HTML content of the post.

  4. Extract View Counts: Analyze the HTML content to locate the view count. On PacketMania, the view count is typically found near the post's metadata. For example, in the post titled "IPv4 and IPv6 Header Checksum Algorithm Explained," the view count is displayed as follows: Views: 78

    Your crawler should search for this pattern or similar indicators within the HTML to extract the view count.

  5. Handle JavaScript Rendering: If the view counts are loaded dynamically via JavaScript, consider using tools like Selenium or Puppeteer, which can render JavaScript and allow you to extract the dynamically loaded content.

  6. Respect Robots.txt and Terms of Service: Before initiating the crawling process, review the site's robots.txt file and terms of service to ensure compliance with their crawling policies.

  7. Implement Rate Limiting: To avoid overwhelming the server, incorporate delays between your requests.

By following these steps, your crawler can systematically retrieve view counts for each post on PacketMania.

ChatGPT immediately laid out the crawler's workflow and even reminded me to implement rate limiting. Nice! No more small talk; let's have it write the code.
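It also suggested checking robots.txt (step 6). Here is a minimal sketch of such a check, my own addition rather than part of the generated crawler, using Python's standard urllib.robotparser; the post URL is simply one of this blog's articles used as an example:

from urllib import robotparser

# Fetch and parse the site's robots.txt, then ask whether a given post may be crawled.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.packetmania.net/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.packetmania.net/2020/12/01/IPv6-Addressing/"))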

Prompt: Please write a crawler for me.

Response: Here’s a Python crawler using requests, BeautifulSoup, and re to extract view counts from blog posts on www.packetmania.net. This version assumes the view counts are rendered as plain text and not dynamically via JavaScript.

Requirements: Install the required packages first (if not already installed):

pip install requests beautifulsoup4 lxml

🕷️ Crawler Code

import requests
from bs4 import BeautifulSoup
...

# Step 1: Fetch sitemap
SITEMAP_URL = "https://www.packetmania.net/sitemap.xml"
...

def get_post_urls(sitemap_url):
    response = requests.get(sitemap_url, headers=headers)
    soup = BeautifulSoup(response.content, 'xml')
    urls = [loc.text for loc in soup.find_all('loc') if '/en/' in loc.text]
    return urls

# Step 2: Crawl each post and extract view count
def extract_view_count(post_url):
    ...

# Step 3: Main function
def crawl_packetmania():
    post_urls = get_post_urls(SITEMAP_URL)
    ...
    for i, url in enumerate(post_urls):
        print(f"[{i+1}/{len(post_urls)}] Crawling {url}")
        views = extract_view_count(url)
        ...
        time.sleep(1)  # be polite

if __name__ == "__main__":
    view_data = crawl_packetmania()
    ...

⚠️ Notes

  • If view counts are rendered via JavaScript, you’ll need to use Selenium or Playwright.
  • Respect robots.txt: https://www.packetmania.net/robots.txt
  • Add error handling and throttling for production use.

Would you like a JavaScript-rendered version using Selenium as well?

ChatGPT thoughtfully begins its output with the command line for installing the required Python modules. The generated program has a clear structure: the first function, get_post_urls, parses the given sitemap and returns the URLs of all posts; the second function, extract_view_count, scrapes the view count from each post; finally, crawl_packetmania calls the two of them to collect the view-count data for all blog posts. At the end, the AI also notes that for view counts rendered dynamically with JavaScript, the Selenium module (a package for automating web-browser interaction) is needed, and it asks whether I want a version that supports JavaScript-rendered pages.
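For completeness, here is a rough sketch of what the elided parts (the headers dict and the requests-based extract_view_count) might look like. This is my own reconstruction following the "Views: <number>" pattern described in step 4 of the workflow, not the exact code ChatGPT produced:

import re
import requests

# A browser-like User-Agent (assumed; the generated script defines its own headers)
headers = {"User-Agent": "Mozilla/5.0 (compatible; view-count-crawler)"}

def extract_view_count(post_url):
    response = requests.get(post_url, headers=headers, timeout=10)
    # Search the raw HTML for a plain-text "Views: <number>" pattern
    match = re.search(r'Views:\s*(\d+)', response.text)
    return int(match.group(1)) if match else None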

Of course I did, so I continued prompting ChatGPT.

Prompt: Yes

Response: Here’s a JavaScript-rendered version of the crawler using Selenium with Python to extract view counts from www.packetmania.net. This handles sites where view counts are rendered dynamically via JavaScript.

Prerequisites: Install the required Python packages:

pip install selenium

Also install ChromeDriver:

  • Download from: https://sites.google.com/chromium.org/driver/
  • Make sure it matches your version of Google Chrome and is in your system PATH.

🕷️ Crawler Code (with Selenium)

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import requests
from bs4 import BeautifulSoup
import re

# Step 1: Get post URLs from sitemap
def get_post_urls(sitemap_url):
    ...

# Step 2: Use Selenium to render the page and extract view count
def extract_view_count(driver, url):
    try:
        driver.get(url)
        time.sleep(2)  # wait for page to render
        page_source = driver.page_source
        match = re.search(r'Views:\s*(\d+)', page_source)
        if match:
            return int(match.group(1))
    ...

# Step 3: Main function
def crawl_with_selenium():
    ...
    # Setup headless Chrome
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")

    driver = webdriver.Chrome(options=chrome_options)
    results = {}

    for i, url in enumerate(urls):
        print(f"[{i+1}/{len(urls)}] Crawling {url}")
        views = extract_view_count(driver, url)
        ...
    ...
...

🛠️ Tips

  • You can switch to Firefox by using webdriver.Firefox() if preferred.
  • To make it faster, reduce the wait time or optimize view count detection with XPath if you know the exact tag.
  • If you're running it in a server environment without GUI, ensure headless mode is enabled (as it is above).

Would you like this version wrapped into a Docker container or saved as a ready-to-run script package?

The new program instead uses the tools provided by the Selenium module to scrape the view counts, and the output again includes the steps for installing the new module. Now the program can be tested.
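As an aside, the "reduce the wait time" tip above can be implemented more robustly with Selenium's explicit waits instead of the fixed time.sleep(2). A minimal sketch, my own addition and not part of the generated script; the label argument stands for whatever text precedes the counter on the page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_label(driver, url, label, timeout=10):
    # Load the page and block until an element containing `label` is present,
    # raising TimeoutException after `timeout` seconds instead of sleeping blindly.
    driver.get(url)
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, f"//*[contains(text(), '{label}')]"))
    )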

Unfortunately, the program didn't work!

Prompt: Both python scripts returned "Found 0 posts.". Something wrong.

Response: It looks like I can't directly access the sitemap URL from my side. However, here's how you can troubleshoot it:

✅ Step-by-Step Fix

  1. Open the Sitemap Manually: Visit this URL in your browser: https://www.packetmania.net/sitemap.xml
  • Do you see a list of blog post URLs?
  • If it returns a 404 or is empty, the site might not use a standard XML sitemap.
  2. Alternative: Crawl the Blog Index Page: If the sitemap is missing or not useful, you can extract post URLs by crawling from the main blog page, like https://www.packetmania.net/en/.

...

Let me know if the sitemap actually has content when you open it, or if you'd prefer the Selenium version of this updated crawler!

ChatGPT correctly recognized that the problem is in the sitemap-parsing part: the program cannot find the URL of any post. But it cannot access the sitemap directly, so it suggested checking the sitemap content manually. Fine, I fed it the content of https://www.packetmania.net/sitemap.xml.

Prompt: It does return sitemap content:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.packetmania.net/2025/03/09/PGITVW-4-twoptrs/</loc>
    <lastmod>2025-03-29</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.6</priority>
  </url>
  ...

Response: Thanks! That helps a lot.

Your original sitemap does not contain /en/ in the URLs, so the filter if '/en/' in loc.text is incorrectly filtering everything out. Let's fix that. ... Here are the diff lines to fix the issue in your original script:

-    urls = [loc.text for loc in soup.find_all('loc') if '/en/' in loc.text]
+    urls = [loc.text for loc in soup.find_all('loc')]

That’s it! This change removes the unnecessary /en/ filter and allows the crawler to process all blog posts listed in the sitemap.

Given the sitemap content, the AI immediately realized its mistake: it had added an /en/ filter to the parsing code, so every URL was being filtered out. Presumably, because I had been prompting in English and the site does have an English sub-site at www.packetmania.net/en/, the AI simply assumed it should match only the English sub-site's URLs. It then provided a patch (diff) to fix the error.

python crawler-javascript.py
Found 61 posts.

Extracting the View Counts

Although the crawler generated by ChatGPT now returns the correct number of posts, it still cannot extract the view counts, so further interaction is needed.

Prompt: Now it can find the pages, but the view count still missing:

[1/61] Crawling https://www.packetmania.net/2025/03/09/PGITVW-4-twoptrs/
-> View count not found.
[2/61] Crawling https://www.packetmania.net/2025/02/16/Harvard-CS50-AI-intro/
-> View count not found.
...

Response: Thanks — good progress! Now that URLs are loading, but no view count is found, it’s very likely that:

  1. The view count is rendered dynamically by JavaScript, or
  2. It's inside a specific HTML tag, not just plain text like Views: 123

🔍 Let’s Debug with Selenium

Update your Selenium version to inspect specific elements, rather than relying on regex over raw HTML. Here’s a diff-style fix for the extract_view_count() function in your crawler-javascript.py:

...

At this point, from item 2 of the AI's response, I noticed that the program it gave still contained the same kind of mistake: the regular expression matching the view count uses the English text

match = re.search(r'Views:\s*(\d+)', page_source)

while the actual string on the page is 阅读次数: 1448. Let's have it correct this error.

Prompt: Actually the view count is shown as "阅读次数: 1448"

Response: Ah, perfect! The view count is shown in Chinese as: 阅读次数:1448

Let’s update your Selenium-based crawler to search for this instead.

Just update this part in extract_view_count():

-        elements = driver.find_elements(By.XPATH, "//*[contains(text(), 'Views')]")
+        elements = driver.find_elements(By.XPATH, "//*[contains(text(), '阅读次数')]")

And update the regex:

-            match = re.search(r'Views:\s*(\d+)', el.text)
+            match = re.search(r'阅读次数[::]?\s*(\d+)', el.text)

... That should now correctly extract the Chinese-format view count from each blog post.

After applying these two diffs, the program still did not print any view counts, so I kept debugging with it.

Prompt: Still not found

Response: Got it — since it’s still not found, let’s do a precise visual + DOM-based inspection. We’ll now:

  1. Use Selenium to print out all elements containing "阅读次数".
  2. Confirm the exact tag/structure where the view count lives.

Debugging Version of extract_view_count()

Replace your current function with this debug version:

def extract_view_count(driver, url):
    try:
        driver.get(url)
        time.sleep(2)  # Let JS render

        # Print out all elements that contain "阅读次数"
        elements = driver.find_elements(By.XPATH, "//*[contains(text(), '阅读次数')]")
        for el in elements:
            print("Found text:", el.text)  # Debug output
            match = re.search(r'阅读次数[::]?\s*(\d+)', el.text)
            if match:
                return int(match.group(1))

        # Optional: print page source if nothing is found
        if not elements:
            print("No element with '阅读次数' found.")
            with open("debug_output.html", "w", encoding="utf-8") as f:
                f.write(driver.page_source)

    except Exception as e:
        print(f"Error processing {url}: {e}")
    return None

What This Does:

  • ✅ Shows every el.text containing "阅读次数"
  • 📄 Dumps HTML to debug_output.html if nothing is found
  • Helps us see how it’s rendered and where

🧠 Next Step

  1. Run the script.
  2. Look at Found text: ... output — does it show the view count?
  3. If not, open debug_output.html and search for 阅读次数 — is it inside a <script>, inside Shadow DOM, or rendered with JavaScript framework?

Once you show me the actual line or structure from the HTML, I can give you the exact XPath to extract it. Want to do that now?

Running the program again with the modified extract_view_count function, this time it printed 阅读次数:, but it still failed to capture the number. Keep prompting.

Prompt: Here is the output:

python crawler-javascript.py
Found 61 posts.

[1/61] Crawling https://www.packetmania.net/2025/03/09/PGITVW-4-twoptrs/
Found text: 阅读次数:
-> View count not found.
[2/61] Crawling https://www.packetmania.net/2025/02/16/Harvard-CS50-AI-intro/
Found text: 阅读次数:
-> View count not found.
...

Response: Perfect, now we know:

  • The element does exist
  • But it only contains: 阅读次数: (no number inside the same element)

That strongly suggests the number is in a separate span or sibling tag next to the "阅读次数:" label.

Fix with More Precise XPath

Update the XPath to capture the label + number in one go:

elements = driver.find_elements(By.XPATH, "//*[contains(text(), '阅读次数')]/..")

...

The AI now knew where the problem was and pointed out the fix. After changing the program, the view counts came out! I quickly told the AI.
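To see why the extra /.. step makes the difference, here is a small self-contained illustration. The HTML fragment below is made up to mimic the label/number layout; the blog's real markup may differ:

import re
from lxml import html

# Hypothetical fragment: the label and the number live in sibling elements
fragment = ("<div class='post-meta'>"
            "<span class='label'>阅读次数:</span>"
            "<span class='count'>1448</span>"
            "</div>")
doc = html.fromstring(fragment)

label = doc.xpath("//*[contains(text(), '阅读次数')]")[0]
print(label.text_content())    # '阅读次数:' only; the digits are in a sibling element

parent = doc.xpath("//*[contains(text(), '阅读次数')]/..")[0]
print(parent.text_content())   # '阅读次数:1448'; the parent's text includes the sibling
print(re.search(r'阅读次数[::]?\s*(\d+)', parent.text_content()).group(1))  # '1448'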

Prompt: Now it works:

python crawler-javascript.py
Found 61 posts.

[1/61] Crawling https://www.packetmania.net/2025/03/09/PGITVW-4-twoptrs/
Found text: 阅读次数: 41
-> Views: 41
[2/61] Crawling https://www.packetmania.net/2025/02/16/Harvard-CS50-AI-intro/
Found text: 阅读次数: 276
-> Views: 276
...

Response: 🎉 Awesome! Glad it’s working now — great teamwork!

Now that you're successfully extracting view counts, here are a few bonus suggestions you might want next:

📤 Export Options

  1. To CSV:
  2. To JSON: ...

🛠️ Optional Enhancements

  • Add --headful mode for debugging (show browser window)
  • Randomized delay between requests (to avoid rate limits)
  • Log failed pages for retrying
  • Schedule regular crawling with cron or APScheduler
  • Wrap it into a Docker container or Flask API

Want help with any of these next steps?

ChatGPT was happy too 😊, and it went on to offer suggestions for data export and further enhancements, but I had my own plans.

Creating the Post Ranking

I prompted the AI to continue with a few improvements:

  • "Now please enhance this script to skip URLs that are not including <year>/<month>/<day>."
  • "Can you also do more enhancements:
    1. Grab the title of each blog post
    2. Sort with the decreasing order of view count with titles list to the right."

ChatGPT quickly generated the corresponding code, with no errors.

Response: Great! Let’s enhance the script so it only crawls URLs that match the pattern:

https://www.packetmania.net/YYYY/MM/DD/some-slug/

Code Diff

Apply this small filter inside your get_post_urls() function:

def get_post_urls(sitemap_url):
    response = requests.get(sitemap_url, headers=headers)
    soup = BeautifulSoup(response.content, 'xml')
-    urls = [loc.text for loc in soup.find_all('loc')]
+    urls = [
+        loc.text for loc in soup.find_all('loc')
+        if re.search(r'/\d{4}/\d{2}/\d{2}/', loc.text)
+    ]
    return urls

Explanation

  • The regex r'/\d{4}/\d{2}/\d{2}/' ensures the URL includes a valid date path (YYYY/MM/DD)
  • This skips entries like index.html or archive pages

New Features:

  • Grab the blog post title
  • Sort results by descending view count
  • Output CSV with: URL, Views, Title (sorted)

🔁 Code Changes

📌 1. Update extract_view_count() to also return title

def extract_view_count(driver, url):
    try:
        driver.get(url)
        time.sleep(2)

        title = driver.title.strip()

        # Find parent element that includes view count
        elements = driver.find_elements(By.XPATH, "//*[contains(text(), '阅读次数')]/..")
        for el in elements:
            full_text = el.text
            match = re.search(r'阅读次数[::]?\s*(\d+)', full_text)
            if match:
                views = int(match.group(1))
                return views, title

    except Exception as e:
        print(f"Error processing {url}: {e}")
    return None, None

📌 2. Main crawl function with sorting and CSV output

import csv

def crawl_packetmania():
    post_urls = get_post_urls(SITEMAP_URL)
    print(f"Found {len(post_urls)} posts.\n")

    results = []
    for i, url in enumerate(post_urls):
        print(f"[{i+1}/{len(post_urls)}] Crawling {url}")
        # driver is the Selenium WebDriver set up earlier (setup omitted here)
        views, title = extract_view_count(driver, url)
        if views is not None:
            print(f" -> Views: {views} | Title: {title}")
            results.append((url, views, title))
        else:
            print(" -> View count not found.")
        time.sleep(1)

    # Sort by views descending
    results.sort(key=lambda x: x[1], reverse=True)

    # Save to CSV
    with open("packetmania_sorted_views.csv", "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Views", "Title"])
        writer.writerows(results)

    return results

Running the program confirmed that these code changes all work and meet the requirements in the prompts. Note that ChatGPT took the liberty of exporting a CSV file, presumably because that is a common output format. What I actually wanted, though, was a Markdown file that could be used directly to generate a post on the blog site.

Saving as Markdown

Next, I prompted the AI to export a Markdown-format file:

  • “Please generate a markdown table format with the sorted list”
  • “Please add the link to the Title in the table.”
  • “Can you add a far left column with the order from 1 to the last number of posts.”
  • “For the title, please strip out the ending "<space>*|*".”

Response:

Sure! Here's how to output the sorted results as a Markdown table (| URL | Views | Title |) directly from your Python script.

Step: Add This to the End of crawl_packetmania()

# Generate Markdown table
with open("packetmania_views.md", "w", encoding="utf-8") as f:
    f.write("| Views | Title |\n")
    f.write("|-------:|-------|\n")
    for url, views, title in results:
        safe_title = title.replace("|", "-")  # prevent Markdown formatting issues
        f.write(f"| {views} | [{safe_title}]({url}) |\n")

📌 Sample Output

This will produce something like:

| Views | Title |
|-------:|-------|
| 276 | [Harvard CS50 AI Intro](https://www.packetmania.net/2025/02/16/Harvard-CS50-AI-intro/) |
| 41 | [PGITVW Part 4 - Two Pointers](https://www.packetmania.net/2025/03/09/PGITVW-4-twoptrs/) |

Final Markdown Export Code (with Rank Column)

# Generate Markdown table with rank, views, and linked titles
with open("packetmania_views.md", "w", encoding="utf-8") as f:
    f.write("| # | Views | Title |\n")
    f.write("|--:|------:|-------|\n")
    for idx, (url, views, title) in enumerate(results, 1):
        safe_title = title.replace("|", "-")  # avoid breaking the table
        f.write(f"| {idx} | {views} | [{safe_title}]({url}) |\n")

📌 Output Example

| # | Views | Title |
|--:|------:|-------|
| 1 | 276 | [Harvard CS50 AI Intro](https://www.packetmania.net/2025/02/16/Harvard-CS50-AI-intro/) |
| 2 | 41 | [PGITVW Part 4 - Two Pointers](https://www.packetmania.net/2025/03/09/PGITVW-4-twoptrs/) |
| 3 | 28 | [Cisco WiFi6 MU-MIMO](https://www.packetmania.net/2021/06/05/Cisco-WiFi6-MuMIMO/) |

Fix: Strip trailing " | ..." from the title

Update the title cleaning line inside extract_view_count() like this:

- title = driver.title.strip()
+        title = re.sub(r'\s*\|\s*.*$', '', driver.title.strip())

ChatGPT handled these tasks without any trouble, and the generated code worked fine.
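As a quick sanity check (my own addition), the title-stripping regex from the last diff behaves as expected on a made-up title:

import re

title = "IPv4 and IPv6 Header Checksum Algorithm Explained | PacketMania"  # sample title, suffix assumed
print(re.sub(r'\s*\|\s*.*$', '', title))
# -> IPv4 and IPv6 Header Checksum Algorithm Explained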

Supporting the Bilingual Blog

Finally, I asked the AI to support both the Chinese and English sites of this blog in one program, selectable with the --lang command-line option. ChatGPT got this done quickly as well.

  • "Add a command line option to generate Chinese output of the table md."
  • “please add timestamp to the name of generated md file. Also for command line option --lang, use different URL for English or Chinese: Default: Chinese (zh) URL: www.packetmania.net, --lang en URL: www.packetmania.net/en"
  • "The matching keyword for English would be "Views: "."
  • "Please also add en or zh to the name of generated md file."
  • "for the keyword, note the difference of ":" for English and “:” for Chinese."
  • For the markdown file name timestamp part, please add the time to the seconds.

Response:

Step-by-Step Enhancements

  1. Use argparse to add a --lang option
import argparse

parser = argparse.ArgumentParser(description="PacketMania view count crawler")
parser.add_argument("--lang", choices=["en", "zh"], default="en", help="Language for Markdown output (en or zh)")
args = parser.parse_args()
  2. Use language-specific headers

Update your Markdown export section:

1
2
3
4
5
6
7
8
9
10
11
12
# Generate Markdown table with rank, views, and linked titles
with open("packetmania_views.md", "w", encoding="utf-8") as f:
    if args.lang == "zh":
        f.write("| 排名 | 阅读次数 | 标题 |\n")
        f.write("|-----:|----------:|------|\n")
    else:
        f.write("| # | Views | Title |\n")
        f.write("|--:|------:|-------|\n")

    for idx, (url, views, title) in enumerate(results, 1):
        safe_title = title.replace("|", "-")  # prevent MD issues
        f.write(f"| {idx} | {views} | [{safe_title}]({url}) |\n")

How to Use It

python crawler-javascript.py --lang zh   # Chinese headers
python crawler-javascript.py --lang en # English headers (default)

🕷️ Final Enhanced crawler-javascript.py

import time
import re
import requests
import argparse
import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
...

def crawl_packetmania(lang="zh"):
    base_url = "https://www.packetmania.net" if lang == "zh" else "https://www.packetmania.net/en"
    sitemap_url = f"{base_url}/sitemap.xml"
    keyword = "阅读次数" if lang == "zh" else "Views"

    post_urls = get_post_urls(sitemap_url)
    print(f"Found {len(post_urls)} posts.\n")

    # Setup headless Chrome
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=chrome_options)

    results = []
    for i, url in enumerate(post_urls):
        print(f"[{i+1}/{len(post_urls)}] Crawling {url}")
        views, title = extract_view_count(driver, url, keyword)
        if views is not None:
            print(f" -> Views: {views} | Title: {title}")
            results.append((url, views, title))
        else:
            print(" -> View count not found.")
        time.sleep(1)

    driver.quit()

    # Sort by view count descending
    results.sort(key=lambda x: x[1], reverse=True)

    # Generate timestamped Markdown filename
    timestamp = datetime.datetime.now().strftime("%Y%m%d")
    md_filename = f"packetmania_views_{timestamp}_{lang}.md"

    # Output Markdown table
    with open(md_filename, "w", encoding="utf-8") as f:
        if lang == "zh":
            f.write("| 排名 | 阅读次数 | 标题 |\n")
            f.write("|-----:|----------:|------|\n")
        else:
            f.write("| # | Views | Title |\n")
            f.write("|--:|------:|-------|\n")

        for idx, (url, views, title) in enumerate(results, 1):
            safe_title = title.replace("|", "-")
            f.write(f"| {idx} | {views} | [{safe_title}]({url}) |\n")

    print(f"\n✅ Markdown table written to `{md_filename}`")
    return results

Excellent observation! The colon character differs between English and Chinese:

  • English: : (ASCII colon)
  • Chinese: : (full-width colon, U+FF1A)

To ensure precise matching, we should:

  • Use Views: (with ASCII colon) for English
  • Use 阅读次数: (with the Chinese full-width colon) for Chinese

Updated Keyword Handling

Update the keyword logic in crawl_packetmania():

- keyword = "阅读次数" if lang == "zh" else "Views"
+ keyword = "阅读次数:" if lang == "zh" else "Views:"

And remove the regex fallback colon match ([::]?) — it's no longer needed.

🧪 Usage

python crawler-javascript.py             # Chinese output, default site
python crawler-javascript.py --lang en   # English output, "/en" URLs

This will generate a file like packetmania_views_20250409_zh.md.

Great idea! Adding timestamp down to seconds will ensure uniqueness and precision.

Change the Timestamp Format

Update this line:

- timestamp = datetime.datetime.now().strftime("%Y%m%d")
+ timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

🧩 Updated Snippet for Markdown Filename

# Generate filename with timestamp to the second
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
md_filename = f"packetmania_views_{timestamp}_{lang}.md"

This will produce filenames like packetmania_views_20250409_153244_zh.md.

Summary

The complete Python crawler program generated by the AI is crawler-javascript.py. Finally, I asked ChatGPT to summarize the work itself:

Response: Here’s the complete updated crawler script with the following features:

✅ Crawls blog posts from PacketMania
✅ Uses Selenium to extract dynamically rendered view counts
✅ Extracts post titles
✅ Filters URLs to only those matching /YYYY/MM/DD/
✅ Sorts by view count in descending order
✅ Outputs a Markdown table with optional --lang zh for Chinese headers
✅ Markdown file named with timestamp to second
✅ Command line option for dual languages with language-specific base URL and view count match:

  • zh (default): https://www.packetmania.net, 阅读次数:
  • en: https://www.packetmania.net/en, Views:

A sample of the crawler's output is shown below:

❯ cat packetmania_views_20250415_165701_zh.md
| 排名 | 阅读次数 | 标题 |
|-----:|----------:|------|
| 1 | 11464 | [iTerm2 + Oh-My-Zsh + Powerlevel10k 打造酷炫macOS终端](https://www.packetmania.net/2021/11/13/iTerm2-OMZ-Powerlevel10k/) |
| 2 | 8220 | [自己动手使用树莓派搭建家用NAS和流媒体服务器](https://www.packetmania.net/2021/12/19/RPi-NAS-Plex/) |
| 3 | 6631 | [IPv6动态地址分配机制详解](https://www.packetmania.net/2020/12/01/IPv6-Addressing/) |
...

The exported Markdown file can be converted directly into a post-ranking page by the Hexo blog framework with the NexT theme, and a link to that page has been added to the blog's left-side menu.

Although I don't know much about the technical details of web crawlers and web services, with a foundation in Python programming and through the interaction with ChatGPT I reached the goal smoothly. This was a successful experience of building a simple web crawler with the help of AI!