Batch-Submitting Sitemaps with the Google Webmaster Data API

Submitted by James Qi on October 25, 2012 - 17:18

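  A few days ago, while submitting sitemaps for our family of sites, I found that submitting more than a hundred of them (multiple languages, plus mobile versions) was sheer misery. Endless clicking, copying, and pasting soon had me nodding off, and by my estimate this mind-numbing job would take at least several dozen hours.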

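  I learned that the Google Webmaster Data API can be used to submit sitemaps in bulk, but I did not know how to use it, and after reading some material without finding a way in, I shelved the idea. Later I tried submitting through robots.txt, but in my tests that approach only picked up a sitemap in the site's root directory (for example http://che.postcodebase.com/sitemap.xml) and did not recognize sitemaps under a subdirectory, or sub-path (for example http://che.postcodebase.com/m/ar/sitemap.xml).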

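  Two days ago I found the article "Google WebMaster API (PHP) for submitting dynamic sitemaps", which was the closest thing to the solution we needed, but the download link it mentions was dead, so I did not take it further. Today I resolved to get it working no matter what. I asked a programmer colleague for advice, then debugged and modified the code line by line. When I ran into a PHP configuration problem I switched to another server to continue debugging, and in the end it finally worked!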

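  I have a love-hate relationship with that article. I am grateful that its example showed how to do this in PHP, but some of the configuration is not explained clearly and the program contains erroneous code, which cost me a lot of extra time. Here are the key points for the record: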

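  • The Zend_GData download has moved; the address given in that article is no longer correct;
  • Once Zend_GData is downloaded, there is no need to download the full Zend Framework separately;
  • Zend must be on include_path to work; either edit php.ini or add ini_set("include_path", ".:/var/www/html/drupal7.bizdirlib.com/sites/all/will-delete/ZendGdata-1.12.0/library"); to the program;
  • $sitemap-location should be $sitemap_location, and it must be assigned first, set to the URL of the sitemap being submitted;
  • The program contains two stray, erroneous <entry...> and </entry> fragments;
  • In $result=$fdata->post($xml,"https://www.google.com/webmasters/tools/feeds/http://yoursite the second http:// has not been urlencoded;
  • The PHP runtime must support SSL so that it can reach the Google account service over https to obtain authorization.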
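
  The include_path, urlencode, and SSL points above can be sketched as follows. This is a minimal sketch rather than the original article's code: the helper names `prepend_include_path` and `sitemaps_feed_url` are hypothetical, and the library directory passed to the first helper would be wherever ZendGdata is unpacked.

```php
<?php
// Hypothetical helper: put the ZendGdata library directory on include_path.
// PATH_SEPARATOR keeps this portable (":" on Linux, ";" on Windows).
function prepend_include_path($library_dir) {
    $path = $library_dir . PATH_SEPARATOR . get_include_path();
    set_include_path($path);
    return $path;
}

// Hypothetical helper: build the sitemaps feed URL for one site.
// The site URL is embedded in the feed path, so it must be urlencode()d.
function sitemaps_feed_url($site_url) {
    return 'https://www.google.com/webmasters/tools/feeds/'
        . urlencode($site_url) . '/sitemaps/';
}

// HTTPS access needs the openssl extension; report if it is missing.
if (!extension_loaded('openssl')) {
    echo "Warning: openssl extension is not loaded, HTTPS requests will fail\n";
}

echo sitemaps_feed_url('http://cyp.postcodebase.com/') . "\n";
```

  Note that urlencode() leaves dots unescaped, whereas the feed URL in my script writes them as %2E. Both forms decode to the same site URL.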

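At least it works now. In a short while I had successfully submitted several hundred sitemaps for many sites, so those few hours were not wasted! Finally, here is the code I used: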

<?php
ini_set("include_path", ".:/var/www/html/drupal7.bizdirlib.com/sites/all/will-delete/ZendGdata-1.12.0/library");
echo "load start<br />\n";
require_once 'Zend/Loader.php';
Zend_Loader::loadClass('Zend_Gdata');
Zend_Loader::loadClass('Zend_Gdata_ClientLogin');
Zend_Loader::loadClass('Zend_Gdata_Gapps');
echo "load end<br />\n";

// Provide Google Account Information
$email = 'abc@gmail.com';
$passwd = '123';
$service = 'sitemaps';

// Try to connect
echo "try start<br />\n";
try {
$client = Zend_Gdata_ClientLogin::getHttpClient($email, $passwd, $service);
} catch (Zend_Gdata_App_CaptchaRequiredException $cre) {
// Google wants a CAPTCHA answered before it will accept this login
echo 'URL of CAPTCHA image: ' . $cre->getCaptchaUrl() . "\n";
echo 'Token ID: ' . $cre->getCaptchaToken() . "\n";
exit(1); // without a client there is nothing to post with
} catch (Zend_Gdata_App_AuthException $ae) {
echo 'Problem authenticating: ' . $ae->getMessage() . "\n";
exit(1);
}
echo "try end<br />\n";

$sitemap_location = 'http://cyp.postcodebase.com/sitemap.xml';
add_sitemap($sitemap_location,$client);

function add_sitemap($sitemap_location,$client){
$xml ='<entry xmlns="http://www.w3.org/2005/Atom" xmlns:wt="http://schemas.google.com/webmasters/tools/2007"><id>'.$sitemap_location.'</id>';
$xml.="<category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/webmasters/tools/2007#sitemap-regular'/><wt:sitemap-type>WEB</wt:sitemap-type></entry>";
$fdata = new Zend_Gdata($client);
echo "post start<br />\n";
$result=$fdata->post($xml,"https://www.google.com/webmasters/tools/feeds/http%3A%2F%2Fcyp%2Epostcodebase%2Ecom%2F/sitemaps/",null,"application/atom+xml");
echo "$sitemap_location<br />\n";
echo "post end<br />\n";
}
?>

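  I left a few echo debugging statements in to make it easier to follow what is happening. This PHP program (/var/www/html/drupal7.bizdirlib.com/sites/all/will-delete/ZendGdata-1.12.0/demos/Zend/Gdata/submit.php) can be run from the command line, or by opening its URL on the server hawk726 in a browser; the result is the same.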
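
  Since the goal is to submit hundreds of sitemaps, the single call in the script would in practice be wrapped in a loop. Here is a minimal sketch, assuming every site is a subdomain of one base domain and every language or mobile edition lives under its own path prefix. The function name `build_sitemap_urls`, the subdomain list, and the prefix list are illustrative assumptions, not part of the script above.

```php
<?php
// Hypothetical helper: list every sitemap URL for the given subdomains and
// path prefixes ('' is the site root, 'm/ar' a mobile Arabic edition, etc.).
function build_sitemap_urls(array $subdomains, $base_domain, array $prefixes) {
    $urls = array();
    foreach ($subdomains as $subdomain) {
        $site = "http://$subdomain.$base_domain/";
        foreach ($prefixes as $prefix) {
            $path = ($prefix === '') ? '' : trim($prefix, '/') . '/';
            // Group by site so each site's feed URL is built only once.
            $urls[$site][] = $site . $path . 'sitemap.xml';
        }
    }
    return $urls;
}

$batch = build_sitemap_urls(array('che', 'cyp'), 'postcodebase.com', array('', 'm/ar'));
foreach ($batch as $site => $sitemaps) {
    foreach ($sitemaps as $sitemap_location) {
        // In the real script this is where add_sitemap() would be called,
        // with the feed URL derived from $site instead of hard-coded.
        echo "$site => $sitemap_location\n";
    }
}
```

  As written, add_sitemap() hard-codes the site inside the feed URL, so for a real batch run it would need the site passed in as an extra parameter.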


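  April 2013 update: new sites can also be added and verified through the API. I only managed to add sites; verification did not succeed. The program for adding a site: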

<?php
//echo phpinfo();
//include_path='/usr/local/apache2/htdocs/drupal7.postcodebase.com/sites/all/will-delete/ZendGdata-1.12.0/library';
//include_path='';
ini_set("include_path", ".:/var/www/html/drupal7.bizdirlib.com/sites/all/will-delete/ZendGdata-1.12.0/library");
echo "load start<br />\n";
require_once 'Zend/Loader.php';
//$dirs="/usr/local/apache2/htdocs/drupal7.postcodebase.com/sites/all/will-delete/ZendGdata-1.12.0/demos/Zend/Gdata";
Zend_Loader::loadClass('Zend_Gdata');
Zend_Loader::loadClass('Zend_Gdata_ClientLogin');
Zend_Loader::loadClass('Zend_Gdata_Gapps');
echo "load end<br />\n";

//$subdomain="jpn";
$subdomain_array=array("afg");

// Provide Google Account Information
$email = 'email@gmail.com';
$passwd = 'password';
$service = 'sitemaps';

// Try to connect
echo "try start<br />\n";
try {
$client = Zend_Gdata_ClientLogin::getHttpClient($email, $passwd, $service);
} catch (Zend_Gdata_App_CaptchaRequiredException $cre) {
// Google wants a CAPTCHA answered before it will accept this login
echo 'URL of CAPTCHA image: ' . $cre->getCaptchaUrl() . "\n";
echo 'Token ID: ' . $cre->getCaptchaToken() . "\n";
exit(1); // without a client there is nothing to post with
} catch (Zend_Gdata_App_AuthException $ae) {
echo 'Problem authenticating: ' . $ae->getMessage() . "\n";
exit(1);
}
echo "try end<br />\n";
foreach ($subdomain_array as $subdomain) {
$post_address="https://www.google.com/webmasters/tools/feeds/sites/";
echo "post_address: $post_address<br />\n";
$site_url = "http://$subdomain.bizdirlib.com/";
add_site($site_url,$client,$post_address);
}
function add_site($site_url,$client,$post_address){
$xml ="<atom:entry xmlns:atom='http://www.w3.org/2005/Atom'>";
$xml.='<atom:content src="'.$site_url.'" />';
$xml.="</atom:entry>";
$fdata = new Zend_Gdata($client);
//echo "post start<br />\n";
echo "site_url: $site_url<br />\n";
$result=$fdata->post($xml,$post_address,null,"application/atom+xml");
//echo "post end<br />\n";
}
?>

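  Adding a site uses the same POST method as submitting a sitemap. Verification requires PUT, and fetching the site list requires GET. I have not yet found examples of PUT or GET and am not familiar with them, so this will have to wait.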
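
  The three kinds of request mentioned here differ only in HTTP method and target URL. As a rough sketch, the mapping can be written as a lookup. The function name `wmt_request_for` and the operation labels are hypothetical, and the URLs simply follow the feeds/sites/ pattern used by the scripts in this post rather than a verified reference.

```php
<?php
// Hypothetical lookup: which HTTP method and URL each site operation uses.
function wmt_request_for($operation, $site_url = '') {
    $feed = 'https://www.google.com/webmasters/tools/feeds/sites/';
    switch ($operation) {
        case 'add':    // create a site entry: POST to the sites feed
            return array('POST', $feed);
        case 'list':   // fetch all site entries: GET the sites feed
            return array('GET', $feed);
        case 'verify': // update one site entry: PUT to its own entry URL
            return array('PUT', $feed . urlencode($site_url));
        default:
            throw new InvalidArgumentException("Unknown operation: $operation");
    }
}

list($method, $url) = wmt_request_for('verify', 'http://afg.bizdirlib.com/');
echo "$method $url\n";
```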


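  September 22, 2013 update: to deal with verification, changing geolocation, and similar tasks, I struggled with PUT for a long time without success. After searching everywhere I finally found a very good piece of code (Webmaster Tools, or WebmasterTools.php). WebmasterTools.php now lives on hawk726 under /var/www/html/drupal7.bizdirlib.com/sites/all/will-delete/ZendGdata-1.12.0/demos/Zend/Gdata/: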

<?php

class WebmasterTools {

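    // __construct replaces the PHP 4 style same-name constructor, which newer PHP versions no longer treat as a constructor
    function __construct($username, $password) {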
        $this->_Login($username, $password);
    }

    function _Http($method, $url, $contentType, $content='') {
        $method = strtoupper($method);
        $opts = array('http' =>
            array(
                'method'  => $method,
                'protocol_version' => 1.0,
                'header'  => 'Content-type: ' . $contentType .
                             (isset($this->auth) && isset($this->auth['Auth']) ? "\nAuthorization: GoogleLogin auth=" . $this->auth['Auth']  : '' ) .
                             "\nContent-Length: " . strlen($content),
                'content' => $content
            )
        );
        $context  = stream_context_create($opts);
        $result = @file_get_contents($url, false, $context);
        return $result;
    }

    function _Login($username, $password, $service='sitemaps') {
        $postdata = http_build_query(
            array('accountType' => 'GOOGLE',
                  'Email'  => $username,
                  'Passwd' => $password,
                  'source' => 'WebmasterTools-Class',
                  'service'=> $service)
            );

        $login = $this->_Http('POST', 'https://www.google.com/accounts/ClientLogin','application/x-www-form-urlencoded', $postdata);
        $lines = explode("\n", $login);
        $data = array();
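        foreach ($lines as $line) {
          // Skip blank lines; split into 2 parts only so '=' inside a token survives
          if (strpos($line, '=') === false) continue;
          list($var,$value) = explode('=', trim($line), 2);
          $data[$var] = $value;
        }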
        $this->auth=$data;
    }

    function _GetText($node) {
        $text = '';
        for ($i=0; $i < $node->childNodes->length; $i++) {
            $child = $node->childNodes->item($i);
            if ($child->nodeType==XML_TEXT_NODE)
                $text .= $child->wholeText;
        }
        return $text;
    }

    // $array_elements_in lists the tags that should be collected as arrays
    // because they may repeat.
    function _ElementToArray($node, $array_elements_in = array()) {
        $row = array();

        $array_elements = array();
        foreach ($array_elements_in as $array_element)
           $array_elements[$array_element] = true;

        for ($i=0; $i < $node->childNodes->length; $i++) {
            $item = $node->childNodes->item($i);
            if (!isset($item->tagName)) continue;
            $children = $this->_ElementToArray($item, $array_elements_in);
            if (count($children) > 0) {
                $value = $children;
            } else {
                $value = $this->_GetText($item);
            }
            if (isset($array_elements[$item->tagName])) {
                if (!isset($row[$item->tagName])) $row[$item->tagName] = array();
                $row[$item->tagName][] = $value;
            } else
                $row[$item->tagName] = $value;
        }
        return $row;
    }

    function _callWMT($method, $url, $site='', $params = array(), $array_elements_in = array()) {

      $method = strtolower($method);
      $site = "http://$site/";
      $url = str_replace('{site}', urlencode($site), $url);
      $xml = '';

      if ($method=='post' || $method=='put') {

          $doc = new DOMDocument('1.0', 'utf-8');
          $root = $doc->createElementNS("http://www.w3.org/2005/Atom", 'atom:entry' );

          if (count($params) > 0) {
              $root->setAttributeNS('http://www.w3.org/2000/xmlns/','xmlns:wt','http://schemas.google.com/webmasters/tools/2007');
          }
          $doc->appendChild($root);

          $element = $doc->createElement('atom:id', $site);
          $root->appendChild($element);

          if (count($params) > 0) {
              $element = $doc->createElement('atom:category');
              $element->setAttribute('scheme','http://schemas.google.com/g/2005#kind');
              $element->setAttribute('term','http://schemas.google.com/webmasters/tools/2007#site-info');
              $root->appendChild($element);
          } else {
              $element = $doc->createElement('atom:content');
              $element->setAttribute('src',$site);
              $root->appendChild($element);
          }

          foreach ($params as $tag => $value) {

             if (is_array($value)) {
                 $element = $doc->createElement("wt:$tag", $value['_value']);
                 // Every key other than '_value' becomes an attribute of the element
                 foreach($value as $att => $attValue) {
                    if($att=='_value') continue;
                    $element->setAttribute($att, $attValue);
                 }
             } else {
                 $element = $doc->createElement("wt:$tag", $value);
             }
             // Append in both branches; the array case was never attached before
             $root->appendChild($element);
          }

          $xml = $doc->saveXML();
      }

      $body = $this->_Http($method, $url, "application/atom+xml", $xml);

      if ($body!='') {
          $doc = new DOMDocument();
          $success = $doc->loadXML($body);
          return $this->_ElementToArray($doc, $array_elements_in);
      } else {
          return false;
      }

    }

    function createSite($site) {
      $this->_callWMT('post', 'https://www.google.com/webmasters/tools/feeds/sites/', $site);
      // Google does send Content-Length back and get_contents fails so we get the site again !
      return $this->getSite($site);
    }

    function deleteSite($site) {
      return $this->_callWMT('delete', 'https://www.google.com/webmasters/tools/feeds/sites/{site}', $site);
    }

    function setGeoLocation($site, $location) {
      return $this->_callWMT('put',"https://www.google.com/webmasters/tools/feeds/sites/{site}", $site, array('geolocation' => $location));
    }

    function setPreferredDomain($site, $domain='') {
      if ($domain=='') $domain = $site;
      return $this->_callWMT('put',"https://www.google.com/webmasters/tools/feeds/sites/{site}", $site, array('preferred-domain' => $domain));
    }

    function getSite($site) {
      $entries = $this->_callWMT('get','https://www.google.com/webmasters/tools/feeds/sites/{site}', $site);
      return $entries;
    }

    function getSites() {
      $rawSites = $this->_callWMT('get','https://www.google.com/webmasters/tools/feeds/sites','',array(),array('entry'));
      $sites = array();
      foreach ($rawSites['feed']['entry'] as $entry) {
        $site = explode('/', $entry['title']);
        $site = $site[2];
        $sites[$site] = $entry;
      }
      return $sites;
    }

    function verifySite($site, $location = '') {

      $entry = $this->getSite($site);

      $vm = $entry['entry']['wt:verification-method'];

      if ($location!='')
          file_put_contents("$location/$vm", $vm);

      return $this->_callWMT('put',"https://www.google.com/webmasters/tools/feeds/sites/{site}", $site,
                       array('verification-method' =>
                          array('_value' => $vm,
                                'type'   => 'htmlpage',
                                'in-use' => 'true',
                                'file-content' => "google-site-verification: $vm"
                               )
                            ));

    }
}

function ut_WebmasterTools ($username, $password, $website,$location) {
    $wt = new WebmasterTools($username, $password);

    echo "Get Site\n";
    print_r($wt->getSite($website));

    echo "Delete Site\n";
    print_r($wt->deleteSite($website));

    echo "Create Site\n";
    print_r($wt->createSite($website));

    echo "Verify Site\n";
    print_r($wt->verifySite($website));

    echo "Set Location\n";
    print_r($wt->setGeoLocation($website,$location));
}

?>

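  This code is very convenient to use, and it does not seem to need anything as involved as Zend GData, so no special environment setup is required. I have already used it to change the geolocation of more than 200 sites.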
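
  Changing the geolocation of a couple of hundred sites then comes down to a loop over setGeoLocation(). The following is a minimal sketch: the helper `batch_set_geolocation` and the `FakeWebmasterTools` stand-in are hypothetical and exist only so the loop can be run without credentials, and the country codes are assumed example values. In real use the first argument would be an instance of the WebmasterTools class above.

```php
<?php
// Hypothetical stand-in with the same setGeoLocation() signature as the
// WebmasterTools class, so the loop can be tried without a Google account.
class FakeWebmasterTools {
    public function setGeoLocation($site, $location) {
        return array('site' => $site, 'geolocation' => $location);
    }
}

// Hypothetical helper: apply a geolocation to each site and collect the
// sites whose call returned false (the class returns false when the
// response body is empty).
function batch_set_geolocation($wt, array $site_locations) {
    $failed = array();
    foreach ($site_locations as $site => $location) {
        $result = $wt->setGeoLocation($site, $location);
        if ($result === false) {
            $failed[] = $site;
        }
    }
    return $failed;
}

// Sites are given without "http://" because _callWMT() adds it itself.
$site_locations = array(
    'afg.bizdirlib.com' => 'AF', // assumed example values
    'jpn.bizdirlib.com' => 'JP',
);
$failed = batch_set_geolocation(new FakeWebmasterTools(), $site_locations);
echo count($failed) . " site(s) failed\n";
```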

  Also, the Google Webmaster API limits the number of sites that can be added to 1000. The limit applies only to the API, though; sites added manually can go beyond 1000.


  August 21, 2014 update: moved to this directory on the usloft4065 server: /var/www/html/yellowpage.bizdirlib.com/sites/all/will-delete/ZendGdata-1.12.0/demos/Zend/Gdata/

Comments

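We have some bad URLs that now return 404, yet Google Webmaster Tools still lists them as crawl errors. I would like to submit them for removal in bulk, but I have not found a way to delete URLs in the Webmaster Data API, so they can only be submitted manually one by one.
That said, this does not seem to be a big problem: I expect Google will gradually drop these bad URLs on its own and stop crawling them and reporting errors.
Update: I later more or less solved this with 301 redirects, sending the bad URLs to the correct ones. That is probably the most suitable approach.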

James Qi / 祁劲松
