Sarah Reynolds

Matching a URL

Jan 27 2012 4:15 PM
Hi Everyone,

I'm trying to find a website's position in Google using C# and regex. The match works when the domain to check is a simple one, e.g. 'mywebsite.com', but it fails when the URL to check is 'mywebsite.com/a-product-name-p-23.html'.

I assume the problem is the regex I am using, but for the life of me I cannot work out what it should be. Basically I want to run the script to check the position of my website's pages in the SERPs, but the URLs could all be very different.

My code at the moment is:


public int GetPosition(Uri url, string searchTerm)
{
    string raw = "http://www.google.co.uk/search?q={0}&num=100&hl=en&lr=&ie=UTF-8&safe=off&output=search#q={0}&hl=en&lr=&safe=off&prmd=imvns&ei=avgiT6HOGILPsgaRwfHpCA&start=0&sa=N&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=c7b3e04f0f892e66&biw=1366&bih=624";
    string search = string.Format(raw, HttpUtility.UrlEncode(searchTerm));

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(search);
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.ASCII))
        {
            string html = reader.ReadToEnd();
            return FindPosition(html, url);
        }
    }
}


private static int FindPosition(string html, Uri url)
{
    string lookup = "(<h3 class=\"r\"><a href=\")(\\w+[a-zA-Z0-9.-?=/]*)";
    MatchCollection matches = Regex.Matches(html, lookup);

    for (int i = 0; i < matches.Count; i++)
    {
        string match = matches[i].Groups[2].Value;
        if (match.Contains(url.AbsoluteUri))
            return i + 1;
    }

    return 0;
}


An example of the data passed into 'string html' is:

<h3 class="r"><a href="http://www.bbc.co.uk/news/" class=l onmousedown="return rwt(this,'','','','2','AFQjCNE9TwMbS0bHLzYb5kDoTlS2JM66mw','','0CFUQFjAB',null,event)">BBC <em>News</em> - Home</a>

If I use the keyword 'World News' and the URL 'bbc.co.uk', a SERP position is retrieved, but if I use the URL 'http://news.sky.com/home/world-news' or 'www.msnbc.msn.com/id/3032507/', no position is retrieved, despite both URLs being on page 1 of the SERPs.

I guess the regex can't match the more complex URLs?
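
A quick way to test that guess is to run the pattern on one href at a time and look at what group 2 actually captures. The sketch below is only a check under assumptions: `CharClassCheck` and `Capture` are hypothetical names, and the markup is trimmed down from the sample line above.

```csharp
using System.Text.RegularExpressions;

static class CharClassCheck
{
    // The pattern from FindPosition, copied unchanged.
    const string Lookup = "(<h3 class=\"r\"><a href=\")(\\w+[a-zA-Z0-9.-?=/]*)";

    // Hypothetical helper: wraps one href in the result markup and
    // returns whatever group 2 of the pattern captures from it.
    public static string Capture(string href)
    {
        string html = "<h3 class=\"r\"><a href=\"" + href + "\" class=l>";
        return Regex.Match(html, Lookup).Groups[2].Value;
    }
}
```

Calling `CharClassCheck.Capture("http://www.bbc.co.uk/news/")` gives back the whole URL, but `CharClassCheck.Capture("http://news.sky.com/home/world-news")` stops at `http://news.sky.com/home/world`. Inside the character class, `.-?` is read as a range from '.' to '?'. That range happens to cover '/', ':' and the digits, but not a literal '-', and `\w` does not match a hyphen either. The truncated capture can then never contain the full URL. This would explain the hyphenated URLs only; the msnbc URL contains no hyphen, so its failure probably has a different cause.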

Are there any regex experts here who might be able to point me in the right direction, or is there an easier way to do this in C#?
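
One simpler approach that might work, as a rough sketch only: capture everything up to the closing quote of the href instead of listing the allowed characters, then compare normalised strings. `SerpPosition` and `Normalize` are hypothetical names, the markup is assumed to look like the sample line above, and the target is taken as a plain string because a value like 'bbc.co.uk' has no scheme.

```csharp
using System;
using System.Text.RegularExpressions;

static class SerpPosition
{
    // Capture everything between the quotes of the href, rather than
    // listing the characters a URL is allowed to contain.
    static readonly Regex ResultLink =
        new Regex("<h3 class=\"r\"><a href=\"([^\"]+)\"");

    // Hypothetical normaliser (an assumption, not part of the original code):
    // strips the scheme, a leading "www." and any trailing slash.
    static string Normalize(string url)
    {
        string s = Regex.Replace(url.Trim(), "^https?://", "", RegexOptions.IgnoreCase);
        if (s.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
            s = s.Substring(4);
        return s.TrimEnd('/');
    }

    // Returns the 1-based position of the first result whose link contains
    // the target, or 0 when no result matches.
    public static int FindPosition(string html, string target)
    {
        string wanted = Normalize(target);
        MatchCollection matches = ResultLink.Matches(html);
        for (int i = 0; i < matches.Count; i++)
        {
            string found = Normalize(matches[i].Groups[1].Value);
            if (found.IndexOf(wanted, StringComparison.OrdinalIgnoreCase) >= 0)
                return i + 1;
        }
        return 0;
    }
}
```

With two results in the html, the bbc link first and the sky link second, `SerpPosition.FindPosition(html, "bbc.co.uk")` would return 1 and `SerpPosition.FindPosition(html, "http://news.sky.com/home/world-news")` would return 2. The case-insensitive comparison is a simplification, since URL paths can be case-sensitive, and the whole thing still depends on the result markup staying the same.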

Many thanks, Sarah


