A PHP Program for Extracting Plain URLs from XML Sitemaps

Posted by James Qi on August 30, 2017 - 18:24

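  When submitting MIP (Mobile Instant Page, Baidu's mobile page accelerator) URLs to the Baidu Webmaster Platform, we used the method from the post "百度MIP版本鍊接的批量提交" (Batch Submission of Baidu MIP Version Links). That method does submit URLs automatically on a schedule, but preparing the text file of URLs to submit is time-consuming. Some of our sites have a huge number of URLs, and opening the sitemap pages one by one in a browser, then saving, merging, replacing, and uploading, all had to be done by hand, with a long wait at every step.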

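  So this afternoon I spent some time writing a PHP program. Once a few parameters are set, it reads the preset sitemap URLs, downloads the data, performs the replacements, merges the results, and saves them under the specified file name, all without manual intervention. Fetching the sitemap URLs is still fairly slow, but the process is now much simpler and far more efficient.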

  The source code follows. There is a MediaWiki version and a Drupal version, because the two sitemap formats differ slightly:

  Source file for sitemaps generated by MediaWiki: mediawiki_url_from_xml_to_txt.php

<?php
/*
 * convert mediawiki url from xmlsitemap to text format
 * jamesqi 2017-8-30
 *
*/

// please set below:
$input_xmlsitemap_url = '
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_0-0.xml
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_0-1.xml
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_0-2.xml
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_0-3.xml
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_0-4.xml
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_0-5.xml
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_0-6.xml
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_0-7.xml
https://tw.18dao.net/sitemap-tw18daonet-jingle-NS_14-0.xml
';
$output_txt_file_name = 'tw.mip.18dao.net.url.txt';

$domain_input = 'tw.18dao.net';
$domain_output = 'tw.mip.18dao.net';

// please set above
// do not change below

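function xmlsitemap_to_text($input,$domain_input,$domain_output) {
	// assumes the MediaWiki sitemap layout: one tab-indented tag per line
	$output = $input;

	// remove the XML declaration and the <urlset> / <url> wrapper lines
	$output = str_replace("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n","",$output);
	$output = preg_replace("/<urlset[^>]*>\n/","",$output);
	$output = str_replace("\t<url>\n","",$output);
	$output = str_replace("\t</url>\n","",$output);
	$output = str_replace("</urlset>\n","",$output);

	// remove the <lastmod> lines
	$pattern = "/\t\t<lastmod>([^<]*)<\/lastmod>\n/";
	$replace = "";
	$output = preg_replace($pattern,$replace,$output);

	// remove the <priority> lines
	$pattern = "/\t\t<priority>([^<]*)<\/priority>\n/";
	$replace = "";
	$output = preg_replace($pattern,$replace,$output);

	// keep only the URL of each <loc> line and switch it to the output domain
	$output = str_replace("\t\t<loc>https://$domain_input","https://$domain_output",$output);
	$output = str_replace("</loc>\n","\n",$output);

	return $output;
}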

print "program start\n";
print "input_xmlsitemap_url = $input_xmlsitemap_url\n";
print "output_txt_file_name = $output_txt_file_name\n";
print "domain_input = $domain_input\n";
print "domain_output = $domain_output\n";

if (substr($input_xmlsitemap_url,0,1) == "\n") $input_xmlsitemap_url = substr($input_xmlsitemap_url,1);
if (substr($input_xmlsitemap_url,-1) == "\n") $input_xmlsitemap_url = substr($input_xmlsitemap_url,0,-1);

$input_xmlsitemap_url_array = explode("\n",$input_xmlsitemap_url);
$input_xmlsitemap_url_array_count = count($input_xmlsitemap_url_array);
print "input_xmlsitemap_url_array_count = $input_xmlsitemap_url_array_count lines\n";

$input_xmlsitemap_url_count = 0;
$output_txt = '';
$output_txt_length = 0;
$output_txt_count = 0;
print_r($input_xmlsitemap_url_array);

foreach ($input_xmlsitemap_url_array as $input_xmlsitemap_url_key=>$input_xmlsitemap_url_value) {
	print "\n=======================\n\n";
	print "input_xmlsitemap_url_key = $input_xmlsitemap_url_key\n";
	print "input_xmlsitemap_url_value = $input_xmlsitemap_url_value\n";
	if ($input_xmlsitemap_url_value == "" || $input_xmlsitemap_url_value == "\n") {
		print "skip this null line\n";
	} else {
		$input_xmlsitemap_url_count++;
		print "input_xmlsitemap_url_count = $input_xmlsitemap_url_count\n";

		$input_xmlsitemap_content = file_get_contents($input_xmlsitemap_url_value);
		$input_xmlsitemap_content_length = strlen($input_xmlsitemap_content);
		print "input_xmlsitemap_content_length = $input_xmlsitemap_content_length bytes\n";

		$output_text_content = xmlsitemap_to_text($input_xmlsitemap_content,$domain_input,$domain_output);
		$output_text_content_length = strlen($output_text_content);
		print "output_text_content_length = $output_text_content_length bytes\n";

		$output_text_content_array = explode("\n",$output_text_content);
		$output_text_content_array_count = count($output_text_content_array);
		print "output_text_content_array_count = $output_text_content_array_count lines\n";

		$output_txt .= $output_text_content;
		$output_txt_length = $output_txt_length + $output_text_content_length;
		$output_txt_count = $output_txt_count + $output_text_content_array_count;
	}
}
print "\n=======================\n\n";
print "output_txt_length = $output_txt_length bytes\n";
print "output_txt_count = $output_txt_count lines\n";
$output_txt_file = fopen("$output_txt_file_name", "w") or die("Unable to open file!");
fwrite($output_txt_file, $output_txt);
fclose($output_txt_file);
print "program end\n";
?>
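
  The script above calls file_get_contents() directly, so a slow or failed download goes unnoticed and an empty result is merged into the output. The following is only a minimal sketch of a more defensive fetch. The function name fetch_sitemap and its default timeout and retry count are assumptions for illustration and do not come from the original program.

```php
<?php
// Hypothetical helper (not part of the original script): fetch one sitemap
// with a timeout and a few retries. The default values are illustrative only.
function fetch_sitemap($url, $timeout_seconds = 60, $max_attempts = 3) {
	// the "http" context options are also used for https:// URLs
	$context = stream_context_create(array(
		'http' => array('timeout' => $timeout_seconds),
	));
	for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
		// @ hides the PHP warning; failure is detected through the false return value
		$content = @file_get_contents($url, false, $context);
		if ($content !== false) {
			return $content;
		}
		print "attempt $attempt failed for $url\n";
	}
	return null; // every attempt failed
}
```

  Inside the foreach loop, the result could be checked with `if ($input_xmlsitemap_content === null)` so that a failed sitemap is reported and skipped rather than silently contributing nothing.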

  Source file for sitemaps generated by Drupal: drupal_url_from_xml_to_txt.php

<?php
/*
 * convert drupal url from xmlsitemap to text format
 * jamesqi 2017-8-30
 *
*/

// please set below:
$input_xmlsitemap_url = '
https://114.mingluji.com/sitemap.xml?page=1
https://114.mingluji.com/sitemap.xml?page=2
https://114.mingluji.com/sitemap.xml?page=3
https://114.mingluji.com/sitemap.xml?page=4
';
$output_txt_file_name = '114.mingluji.com.url.txt';

//$domain_input = 'tw.18dao.net';
//$domain_output = 'tw.mip.18dao.net';

// please set above
// do not change below

function xmlsitemap_to_text($input) {//,$domain_input,$domain_output
	// assumes the Drupal sitemap layout: each <url>...</url> entry on a single line
	$output = $input;

	// remove the XML declaration and the <urlset> wrapper lines
	$output = str_replace("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n","",$output);
	$output = preg_replace("/<urlset[^>]*>\n/","",$output);
	$output = str_replace("</urlset>\n","",$output);

	// remove the xml-stylesheet processing instruction
	$pattern = "/<\?xml-stylesheet([^>]*)\?>\n/";
	$replace = "";
	$output = preg_replace($pattern,$replace,$output);

	// remove tab-indented self-closing elements, whatever their name
	$pattern = "/\t<([^>]*)\/>\n/";
	$replace = "";
	$output = preg_replace($pattern,$replace,$output);

	$pattern = "/<lastmod>([^<]*)<\/lastmod>/";
	$replace = "";
	$output = preg_replace($pattern,$replace,$output);

	$pattern = "/<changefreq>([^<]*)<\/changefreq>/";
	$replace = "";
	$output = preg_replace($pattern,$replace,$output);

	$pattern = "/<priority>([^<]*)<\/priority>/";
	$replace = "";
	$output = preg_replace($pattern,$replace,$output);

	// strip the remaining wrapper tags, leaving one URL per line
	$output = str_replace("<url>","",$output);
	$output = str_replace("<loc>","",$output);
	$output = str_replace("</loc>","",$output);
	$output = str_replace("</url>","",$output);
	return $output;
}

print "program start\n";
print "input_xmlsitemap_url = $input_xmlsitemap_url\n";
print "output_txt_file_name = $output_txt_file_name\n";
//print "domain_input = $domain_input\n";
//print "domain_output = $domain_output\n";

if (substr($input_xmlsitemap_url,0,1) == "\n") $input_xmlsitemap_url = substr($input_xmlsitemap_url,1);
if (substr($input_xmlsitemap_url,-1) == "\n") $input_xmlsitemap_url = substr($input_xmlsitemap_url,0,-1);

$input_xmlsitemap_url_array = explode("\n",$input_xmlsitemap_url);
$input_xmlsitemap_url_array_count = count($input_xmlsitemap_url_array);
print "input_xmlsitemap_url_array_count = $input_xmlsitemap_url_array_count lines\n";

$input_xmlsitemap_url_count = 0;
$output_txt = '';
$output_txt_length = 0;
$output_txt_count = 0;
print_r($input_xmlsitemap_url_array);

foreach ($input_xmlsitemap_url_array as $input_xmlsitemap_url_key=>$input_xmlsitemap_url_value) {
	print "\n=======================\n\n";
	print "input_xmlsitemap_url_key = $input_xmlsitemap_url_key\n";
	print "input_xmlsitemap_url_value = $input_xmlsitemap_url_value\n";
	if ($input_xmlsitemap_url_value == "" || $input_xmlsitemap_url_value == "\n") {
		print "skip this null line\n";
	} else {
		$input_xmlsitemap_url_count++;
		print "input_xmlsitemap_url_count = $input_xmlsitemap_url_count\n";

		$input_xmlsitemap_content = file_get_contents($input_xmlsitemap_url_value);
		$input_xmlsitemap_content_length = strlen($input_xmlsitemap_content);
		print "input_xmlsitemap_content_length = $input_xmlsitemap_content_length bytes\n";

		$output_text_content = xmlsitemap_to_text($input_xmlsitemap_content);
		$output_text_content_length = strlen($output_text_content);
		print "output_text_content_length = $output_text_content_length bytes\n";

		$output_text_content_array = explode("\n",$output_text_content);
		$output_text_content_array_count = count($output_text_content_array);
		print "output_text_content_array_count = $output_text_content_array_count lines\n";

		$output_txt .= $output_text_content;
		$output_txt_length = $output_txt_length + $output_text_content_length;
		$output_txt_count = $output_txt_count + $output_text_content_array_count;
	}
}
print "\n=======================\n\n";
print "output_txt_length = $output_txt_length bytes\n";
print "output_txt_count = $output_txt_count lines\n";
$output_txt_file = fopen("$output_txt_file_name", "w") or die("Unable to open file!");
fwrite($output_txt_file, $output_txt);
fclose($output_txt_file);
print "program end\n";
?>
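
  Both scripts remove tags by exact string matching, so they depend on the whitespace layout of each sitemap. As an alternative, here is a minimal sketch that reads the loc elements with PHP's DOMDocument class and should therefore work for either layout. The function name sitemap_locs_to_text is hypothetical, and this is not the method used in the scripts above.

```php
<?php
// Hypothetical alternative (not the original approach): collect the text of
// every <loc> element with an XML parser instead of string replacement.
function sitemap_locs_to_text($xml, $domain_input = '', $domain_output = '') {
	$doc = new DOMDocument();
	// an empty or malformed document yields an empty result
	if ($xml === '' || !@$doc->loadXML($xml)) {
		return '';
	}
	$output = '';
	foreach ($doc->getElementsByTagName('loc') as $loc) {
		// textContent is already entity-decoded (&amp; becomes &)
		$url = trim($loc->textContent);
		if ($url === '') {
			continue;
		}
		// optional domain switch, as in the MediaWiki version
		if ($domain_input !== '') {
			$url = str_replace("https://$domain_input", "https://$domain_output", $url);
		}
		$output .= $url . "\n";
	}
	return $output;
}
```

  Called with only the XML string, it behaves like the Drupal version. Passing the two domain names as well reproduces the domain replacement of the MediaWiki version.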

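  Both programs print step-by-step progress messages as they run, so you can follow what they are doing. When a run finishes, open the generated plain-URL text file and check it. If something looks wrong, the program may need minor adjustments to match a slightly different sitemap format.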
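
  Checking the generated file can also be partly automated. The sketch below flags any line that does not look like a plain http(s) URL, for example a line that still carries a leftover tag. The function name find_suspect_lines is hypothetical. Note that FILTER_VALIDATE_URL only accepts ASCII URLs, so unencoded non-ASCII paths would also be flagged.

```php
<?php
// Hypothetical helper: return the lines of a URL list that do not look like
// valid http(s) URLs, keyed by their 1-based line number.
function find_suspect_lines($text) {
	$suspect = array();
	foreach (explode("\n", $text) as $index => $line) {
		if ($line === '') {
			continue; // blank lines, such as the trailing one, are ignored
		}
		$is_url = filter_var($line, FILTER_VALIDATE_URL) !== false;
		if (!$is_url || !preg_match('/^https?:\/\//', $line)) {
			$suspect[$index + 1] = $line;
		}
	}
	return $suspect;
}
```

  For example, `print_r(find_suspect_lines(file_get_contents($output_txt_file_name)));` at the end of a run would list the lines that need a closer look.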

  The cron setup for scheduled runs is the same as described in the earlier post, and log files are generated as well.

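  Whenever a repetitive manual task can be replaced by a program, it is worth coding the solution. Writing and debugging it takes some effort up front, but it saves a great deal of time and energy afterward and avoids the mistakes that creep into manual work. ✌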

 
