Help Improve Accuracy In My Contextual Ad Targeting Script



the-dream
08-28-2009, 12:16 PM
Hey Guys...

I run a small untargeted ad network, called Branchr Advertising...

I am currently coding a contextual advertising system (code below), and I am wondering how I can make it more accurate, i.e. generate a more relevant set of keywords for a given website/page.


<?php

function analyze($url, $output='array') {
// PAGE FUNCTIONS //
function strip_html_tags( $text )
{
$text = preg_replace(
array(
// REMOVE INVISIBLE CONTENT
'@<head[^>]*?>.*?</head>@siu',
'@<style[^>]*?>.*?</style>@siu',
'@<script[^>]*?.*?</script>@siu',
'@<object[^>]*?.*?</object>@siu',
'@<embed[^>]*?.*?</embed>@siu',
'@<applet[^>]*?.*?</applet>@siu',
'@<noframes[^>]*?.*?</noframes>@siu',
'@<noscript[^>]*?.*?</noscript>@siu',
'@<noembed[^>]*?.*?</noembed>@siu',
// ADD LINE BREAKS AFTER BLOCKS
'@</?((address)|(blockquote)|(center)|(del))@iu',
'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
'@</?((table)|(th)|(td)|(caption))@iu',
'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
'@</?((frameset)|(frame)|(iframe))@iu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
"\n\$0", "\n\$0",
),
$text );
return strip_tags( $text );
}

function removeCommonWords($input){

// CREATE AN ARRAY OF COMMON/BANNED WORDS
$commonWords = array(
'a','able','about','above','abroad','according','accordingly','across','actually','adj','after','afterwards','again','against','ago','ahead','aint','all','allow','allows','almost','alone','along','alongside','already','also','although','always','am','amid','amidst','among','amongst','an','and','another','any','anybody','anyhow','anyone','anything','anyway','anyways','anywhere','apart','appear','appreciate','appropriate','are','arent','around','as','as','aside','ask','asking','associated','at','available','away','awfully',
'b','back','backward','backwards','be','became','because','become','becomes','becoming','been','before','beforehand','begin','behind','being','believe','below','beside','besides','best','better','between','beyond','both','brief','but','by',
'c','came','can','cannot','cant','cant','caption','cause','causes','certain','certainly','changes','clearly','cmon','co','co.','com','come','comes','concerning','consequently','consider','considering','contain','containing','contains','corresponding','could','couldnt','course','cs','currently',
'd','dare','darent','definitely','described','despite','did','didnt','different','directly','do','does','doesnt','doing','done','dont','down','downwards','during',
'e','each','edu','eg','eight','eighty','either','else','elsewhere','end','ending','enough','entirely','especially','et','etc','even','ever','evermore','every','everybody','everyone','everything','everywhere','ex','exactly','example','except',
'f','fairly','far','farther','few','fewer','fifth','first','five','followed','following','follows','for','forever','former','formerly','forth','forward','found','four','from','further','furthermore',
'g','get','gets','getting','given','gives','go','goes','going','gone','got','gotten','greetings',
'h','had','hadnt','half','happens','hardly','has','hasnt','have','havent','having','he','hed','hell','hello','help','hence','her','here','hereafter','hereby','herein','heres','hereupon','hers','herself','hes','hi','him','himself','his','hither','hopefully','how','howbeit','however','hundred',
'i','id','ie','if','ignored','ill','im','immediate','in','inasmuch','inc','inc.','indeed','indicate','indicated','indicates','inner','inside','insofar','instead','into','inward','is','isnt','it','itd','itll','its','its','itself','ive',
'j','just','k','keep','keeps','kept','know','known','knows',
'l','last','lately','later','latter','latterly','least','less','lest','let','lets','like','liked','likely','likewise','little','look','looking','looks','low','lower','ltd',
'm','made','mainly','make','makes','many','may','maybe','maynt','me','mean','meantime','meanwhile','merely','might','mightnt','mine','minus','miss','more','moreover','most','mostly','mr','mrs','much','must','mustnt','my','myself',
'n','name','namely','nd','near','nearly','necessary','need','neednt','needs','neither','never','neverf','neverless','nevertheless','new','next','nine','ninety','no','nobody','non','none','nonetheless','noone','noone','nor','normally','not','nothing','notwithstanding','novel','now','nowhere',
'o','obviously','of','off','often','oh','ok','okay','old','on','once','one','ones','ones','only','onto','opposite','or','other','others','otherwise','ought','oughtnt','our','ours','ourselves','out','outside','over','overall','own',
'p','particular','particularly','past','per','perhaps','placed','please','plus','possible','presumably','probably','provided','provides',
'q','que','quite','qv',
'r','rather','rd','re','really','reasonably','recent','recently','regarding','regardless','regards','relatively','respectively','right','round',
's','said','same','saw','say','saying','says','second','secondly','see','seeing','seem','seemed','seeming','seems','seen','self','selves','sensible','sent','serious','seriously','seven','several','shall','shant','she','shed','shell','shes','should','shouldnt','since','six','so','some','somebody','someday','somehow','someone','something','sometime','sometimes','somewhat','somewhere','soon','sorry','specified','specify','specifying','still','sub','such','sup','sure',
't','take','taken','taking','tell','tends','th','than','thank','thanks','thanx','that','thatll','thats','thats','thatve','the','their','theirs','them','themselves','then','thence','there','thereafter','thereby','thered','therefore','therein','therell','therere','theres','theres','thereupon','thereve','these','they','theyd','theyll','theyre','theyve','thing','things','think','third','thirty','this','thorough','thoroughly','those','though','three','through','throughout','thru','thus','till','to','together','too','took','toward','towards','tried','tries','truly','try','trying','ts','twice','two',
'u','un','under','underneath','undoing','unfortunately','unless','unlike','unlikely','until','unto','up','upon','upwards','us','use','used','useful','uses','using','usually',
'v','value','various','versus','very','via','viz','vs',
'w','want','wants','was','wasnt','way','we','wed','welcome','well','well','went','were','were','werent','weve','what','whatever','whatll','whats','whatve','when','whence','whenever','where','whereafter','whereas','whereby','wherein','wheres','whereupon','wherever','whether','which','whichever','while','whilst','whither','who','whod','whoever','whole','wholl','whom','whomever','whos','whose','why','will','willing','wish','with','within','without','wonder','wont','would','wouldnt',
'x','y','yes','yet','you','youd','youll','your','youre','yours','yourself','yourselves','youve','z','zero',
'january','february','march','april','may','june','july','august','september','october','november','december',
'dont','weve','theyre','comments','opinions','week','day','month','year','hour','min','minute','second','nbsp','newsnbsp','new','old',
'monday','tuesday','wednesday','thursday','friday','saturday','sunday','new','web','mark','michael','christian','bao',
'jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec','gain','loss','move','isnt','good','bad','ok','okey','okay',
'jim','john','smith','bill','today','tomorrow','lot','lost','lots','companie','put','high','low','top','bottom');

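// TRIM AND ESCAPE EACH WORD FIRST: AN UNESCAPED '.' (AS IN 'co.' OR 'inc.') MATCHES ANY CHARACTER, SO REAL WORDS LIKE 'cow' OR 'inch' WOULD BE REMOVED
return preg_replace('/\b('.implode('|', array_map('preg_quote', array_map('trim', $commonWords))).')\b/', '', $input);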
}
// END FUNCTIONS //


// GET THE 'TARGET' PAGE SOURCE AND PUT INTO THE $source VARIABLE.
$source = file_get_contents($url);

// USE OUR FUNCTION AND A STANDARD FUNCTION TO STRIP PAGE HTML.
$source = strip_tags(strip_html_tags($source));

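// REMOVE APOSTROPHES, THEN TURN EVERY OTHER RUN OF NON-LETTERS (INCLUDING LINE BREAKS) INTO A SPACE
// SO THAT WORDS ON ADJACENT LINES ARE NOT GLUED TOGETHER
$source = preg_replace("/'/", "", $source);
$source = preg_replace("/[^A-Za-z]+/", " ", $source);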

// MAKE ALL WORDS LOWER CASE
$source = strtolower($source);

// REMOVE DISALLOWED WORDS
$source = removeCommonWords($source);

// SPLIT THE GROUP OF WORDS UP WHERE THERE IS A SPACE, PUT RESULT INTO AN ARRAY
$keywords = preg_split("/ /", $source);

foreach($keywords as $wordnum=>$keyword) {
// ENSURE THE KEYWORD ISN'T NOTHING
if($keyword != '') {
// ENSURE THE KEYWORD IS OVER 2 CHARS.
if(strlen($keyword) > 2) {
// ENSURE THE KEYWORD IS LESS THAN OR EQUAL TO 10 CHARS
if(strlen($keyword) <= 10) {
// DE-PLURALIZE WORDS (NAIVE: STRIPS ANY TRAILING 's')
if(substr($keyword, -1) == 's') {
$keyword = substr($keyword, 0, -1);
}
// PUT FILTERED WORDS INTO A NEW ARRAY
$words[] = $keyword;
}
}
}
}

// COUNT THE TOTAL AMOUNT OF KEYWORDS
$tw = count($keywords);

// COUNT HOW MANY TIMES EACH WORD OCCURS AND PUT INTO ARRAY (word => count)
foreach($words as $wn=>$kw) {
$key[$kw] = (isset($key[$kw]) ? $key[$kw] : 0) + 1;
}

foreach($key as $keywd=>$occurances) {
// ONLY INCLUDE WORDS THAT HAVE BEEN USED 3 TIMES OR MORE
if($occurances >= 3) {
// GIVE EACH WORD A SCORE BASED ON ITS OCCURRENCES AS A PERCENTAGE OF THE TOTAL NUMBER OF WORDS ON THE PAGE
$keyrank[$keywd] = round(100/$tw*$occurances, 3);
}
}

// SORT THE KEYWORDS ARRAY SO THE HIGHEST RANKED WORDS ARE AT THE TOP
arsort($keyrank, SORT_NUMERIC);

// CREATE AN ARRAY OF RESULTS
if($output == 'array') {
return $keyrank;
}

// CREATE A TABLE OF RESULTS
if($output == 'table') {
$results = '<table>';
$results .= '<tr><td><b><u>Keyword:</u></b></td><td><b><u>Rank:</u></b></td></tr>';
foreach($keyrank as $keyword=>$rank) {
$results .= '<tr><td><b>'.$keyword.'</b></td><td>'.$rank.'</td></tr>';
}
$results .= '</table>';
return $results;
}
}


$url = $_GET['url'];
$siteKeywords = analyze($url);

// Top 50 Keywords
$count = 1;
foreach($siteKeywords as $keyword=>$rank) {
if($count <= 50) {
$kw[$keyword] = $rank;
$kwds .= "$keyword ";
} else {
break;
}
$count++;
}



$query = "SELECT * ,
match(text_1,text_2,keywords)
against ('$kwds')
as relevance
FROM
text_ads_test
WHERE
match(text_1,text_2,keywords) against ('$kwds')
ORDER BY relevance DESC";

$ads = mysql_query($query) or die(mysql_error());
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<style type="text/css">

body {
font-size: 13px;
font-family: "Helvetica Neue", Helvetica, Verdana, Arial, sans-serif;
}

.green {
color: #009f04;
}

.red {
color: #a90000;
}

</style>
<title>Branchr - Contextual Ad Targeting</title>
</head>

<body>
<?php
// ESCAPE THE USER-SUPPLIED URL BEFORE ECHOING IT INTO THE PAGE
echo '<h2 style="padding: 0px; margin: 0px;">Branchr Contextual Ad Targeting</h2><hr /><b>The following ads are targeted to the content of:</b><br /><i>'.htmlspecialchars($url, ENT_QUOTES).'</i><br /><hr />';

$foundAds = mysql_num_rows($ads);

if($foundAds > 0) {
echo 'Status: <b class="green">Targeting Successful ('.$foundAds.' Relevant Ads)</b><br />Generated Keyword Set: <b>'.$kwds.'</b><hr />';
while($ad = mysql_fetch_array($ads)) {
echo "<b>".$ad['title']."</b> (Relevance: ".$ad['relevance'].")<br />".$ad['text_1']."<br />".$ad['text_2']."<br /><br />";
}
} else {
echo 'Status: <b class="red">Targeting Unsuccessful</b><br />Generated Keyword Set: <b>'.$kwds.'</b><hr />';
echo ':(';

}

?>
</body>
</html>


Thanks for your help!

P.S. If you run this code, you might want to remove the DB code, and have it just echo out a keyword set...

hinch
08-28-2009, 12:27 PM
It would be worthwhile reading this:

http://www10.org/cdrom/papers/519/

It's about recommendation algorithms. In your case the "buyers" would be the categories selected by the advertiser to display the ad in, and the heap to search through would be your publishers' websites.

Run the formulae and the top 5 matches would, for example, be your "recommended" (i.e. ideal) sites to put the adverts on when matching against the advertiser's preferred categories.
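One way to picture that matching: treat the advertiser's preferred categories and each publisher site as keyword => weight lists, score every site against the advertiser, and keep the best few. The sketch below uses cosine similarity, which is a common choice and an assumption here, not necessarily the formula from the linked paper. The function names and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: cosine similarity between two keyword => weight maps.
function cosineSimilarity(array $a, array $b)
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $dot += $weight * $b[$term];
        }
    }
    $square = function ($w) { return $w * $w; };
    $normA = sqrt(array_sum(array_map($square, $a)));
    $normB = sqrt(array_sum(array_map($square, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty profile matches nothing
    }
    return $dot / ($normA * $normB);
}

// Rank publisher sites against an advertiser profile and keep the best $limit.
function topMatches(array $advertiser, array $sites, $limit = 5)
{
    $scores = array();
    foreach ($sites as $siteId => $siteKeywords) {
        $scores[$siteId] = cosineSimilarity($advertiser, $siteKeywords);
    }
    arsort($scores, SORT_NUMERIC);
    return array_slice($scores, 0, $limit, true);
}

// Invented sample data.
$advertiser = array('garden' => 3, 'flower' => 2, 'seed' => 1);
$sites = array(
    'site-a' => array('garden' => 5, 'flower' => 1),
    'site-b' => array('car' => 4, 'engine' => 2),
    'site-c' => array('flower' => 2, 'seed' => 2),
);
print_r(topMatches($advertiser, $sites, 2));
```

The site keyword lists could come straight from the `analyze()` output in the first post; the advertiser profile would still have to be built from whatever categories or keywords the advertiser picks.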

the-dream
08-28-2009, 12:34 PM
Looks like a pretty painful thing to read... :/

But I'll be sure to chug on with it (may take a while). In the meantime, are there any small (or big) changes I can make to my current script to improve accuracy?

hinch
08-28-2009, 12:39 PM
Not really. You're doing a simple occurrence count and sort; it's about as simple as it gets.

I suppose you could add in weighted keywords.
Adverts in the "gardening" category, for example, may have a set of keywords that are weighted more heavily than standard occurrence words, so a site where the weighted keyword "garden" appears is preferred over a site that has, say, "flowers" as a keyword but no occurrence of "garden".

Though you're treading on murky ground doing that, as it means thinking up as many weighted keywords as you can for each of your ad categories.
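A minimal sketch of the weighted-keyword idea, assuming each ad category has a hand-written list of weighted words. The category names, words and weights below are all invented, and `scoreCategories()` is a hypothetical helper, not part of the script in the first post.

```php
<?php
// Hypothetical per-category weights; every word and number here is invented.
$categoryWeights = array(
    'gardening' => array('garden' => 5.0, 'flower' => 2.0, 'lawn' => 2.0),
    'motoring'  => array('car' => 5.0, 'engine' => 2.0),
);

// Score a page's keyword => rank map against each category.
// A page word only counts if the category lists it, scaled by its weight.
function scoreCategories(array $pageKeywords, array $categoryWeights)
{
    $scores = array();
    foreach ($categoryWeights as $category => $weights) {
        $score = 0.0;
        foreach ($weights as $word => $weight) {
            if (isset($pageKeywords[$word])) {
                $score += $pageKeywords[$word] * $weight;
            }
        }
        $scores[$category] = $score;
    }
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

// Two invented pages: one mentions "garden", the other only "flower".
$siteWithGarden  = array('garden' => 1.0, 'shed' => 2.0);
$siteWithFlowers = array('flower' => 1.0, 'vase' => 2.0);

print_r(scoreCategories($siteWithGarden, $categoryWeights));
print_r(scoreCategories($siteWithFlowers, $categoryWeights));
```

With these made-up weights the "garden" page scores higher in the gardening category than the "flower" page, which is the preference described above. The hard part is still writing the weight lists.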

As for recommendation-system matching, I've done it a couple of times before on e-commerce sites and it's a swine; you'll be tearing your hair out before you're finished.

the-dream
08-28-2009, 12:42 PM
Other than occurrence scoring for keywords, what would be a different option? Not weighted keywords, though: I would love to do that, but it would take a number of months (if not years) to create an accurate set and subset of keywords and phrases.

hinch
08-28-2009, 12:48 PM
Perhaps the easiest way would be to have your publishers pick 5 categories that their site falls into when they sign up.

Then you can just display ads by category by default. It kinda does your targeted ads but without all the work involved :)

You could also add background click-through tracking to your current system. Say your current system picks 10 sites but 2 of those sites get bugger all click-throughs. You could then say: OK, even though they match, they go into a disallow list of sorts for this ad, as it's not seeing the returns you'd hope to see.

Of course, the downfall is that small sites with, say, 100 visits a month may only get 1 click-through anyway, but that's just the way it goes :(
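The disallow-list check could be sketched roughly like this. To soften the small-site problem, the sketch refuses to judge a placement until it has seen a minimum number of impressions. The function name and both thresholds are assumptions picked for illustration, not tuned values.

```php
<?php
// Decide whether an ad/site pairing should go on the disallow list.
// $minImpressions and $minCtr are illustrative defaults only.
function shouldBlockPlacement($impressions, $clicks, $minImpressions = 1000, $minCtr = 0.001)
{
    // Too little traffic to judge: never block on thin data.
    if ($impressions < $minImpressions) {
        return false;
    }
    // Block only when the click-through rate falls below the floor.
    return ($clicks / $impressions) < $minCtr;
}

var_dump(shouldBlockPlacement(100, 0));   // small site, not enough data
var_dump(shouldBlockPlacement(5000, 1));  // plenty of data, very few clicks
var_dump(shouldBlockPlacement(5000, 50)); // plenty of data, healthy clicks
```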

the-dream
08-28-2009, 12:54 PM
perhaps the easiest way would be to have your publishers pick 5 categories that their site falls into when they sign up.


That probably would've been easier had I known I would eventually make it targeted, but there are 2,300 websites on the network now... so that could be a slight issue.

I'm just trying to think of an automated way to generate a keyword set for specific pages, then store the generated keyword set in a database, to use when an ad request is called.
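A rough sketch of that store-then-reuse flow. To stay self-contained it keeps the rows in a PHP array as a stand-in for a database table; the class name, the one-day refresh window and the analyzer callback are all assumptions. In the real script the callback would be `analyze()` and the rows would live in MySQL.

```php
<?php
// In-memory stand-in for a database table keyed by page URL.
class KeywordStore
{
    private $rows = array();
    private $maxAge;

    public function __construct($maxAgeSeconds = 86400)
    {
        $this->maxAge = $maxAgeSeconds;
    }

    // Return the stored keyword set if it is fresh enough;
    // otherwise run $analyzer on the URL and store the result.
    public function get($url, $analyzer, $now = null)
    {
        $now = ($now === null) ? time() : $now;
        $key = md5($url);
        if (isset($this->rows[$key]) && ($now - $this->rows[$key]['stored_at']) < $this->maxAge) {
            return $this->rows[$key]['keywords'];
        }
        $keywords = call_user_func($analyzer, $url);
        $this->rows[$key] = array('keywords' => $keywords, 'stored_at' => $now);
        return $keywords;
    }
}

// Fake analyzer that counts how often it is called.
$calls = 0;
$fakeAnalyzer = function ($url) use (&$calls) {
    $calls++;
    return array('restoration' => 4.2, 'car' => 1.5);
};

$store = new KeywordStore(86400);
$store->get('http://example.com/post', $fakeAnalyzer, 1000); // analyzed and stored
$store->get('http://example.com/post', $fakeAnalyzer, 2000); // served from the store
echo $calls, "\n";
```

This way the expensive fetch-and-parse step runs once per page per refresh window, and the ad request only does a lookup.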

I mean, what I have at the minute could probably be edited in about half an hour to achieve what I want, but I want to get the best keyword sets I can from code similar to what I have now...

hinch
08-28-2009, 12:55 PM
You could read the meta keywords from the page instead of just dumping the header. Most sites that have been SEO'd have some form of keywords in them.

Also, a bulk email to all your publishers asking them to categorise their site, while in the meantime they're dumped into a generic category, would solve that issue.

the-dream
08-28-2009, 01:02 PM
Yeah, I could read meta keywords, but there is an issue...

For example, there is a site in our network about cars, so if I analyzed the meta keywords for the home page, I would get general ads about cars.

But say he puts a post about car restoration on a page of his site. If I analyze the meta keywords of that page and they're the same as the homepage's, I will just get generic results about cars, whereas if I analyzed the content, I would get a keyword set based on car restoration.

Do you see how this will give a different result set than analyzing the content? A less targeted result set...

--

I suppose I could bulk email everyone in the network, but then again, there is no guarantee that they'll all reply. And it only does some of the work: just generic ads are found, not ones targeted to the page's content (as opposed to the website's content as a whole).

hinch
08-28-2009, 02:35 PM
I didn't mean use the metas as the only method. I meant combine your current method of content weighting with meta keywords, so you have the more generic metas for general ad selection, then do a quick content parse on each page as it's loaded to narrow down the result set further.

Also remember most forum and blog software creates different metas per page, dependent on the content and any "tags" set/selected.
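The two-stage idea (metas for the general pick, content to narrow it down) might look something like this sketch. It assumes the meta keywords and the content ranks have already been extracted, and that each ad carries a plain list of keywords. `selectAds()` and all the sample data are hypothetical.

```php
<?php
// Stage 1: keep only ads that share at least one keyword with the page's meta keywords.
// Stage 2: order the survivors by how strongly they overlap the content keyword ranks.
function selectAds(array $ads, array $metaKeywords, array $contentRanks)
{
    $scored = array();
    foreach ($ads as $adId => $adKeywords) {
        if (count(array_intersect($adKeywords, $metaKeywords)) === 0) {
            continue; // not even a generic match
        }
        $score = 0.0;
        foreach ($adKeywords as $word) {
            if (isset($contentRanks[$word])) {
                $score += $contentRanks[$word];
            }
        }
        $scored[$adId] = $score;
    }
    arsort($scored, SORT_NUMERIC);
    return $scored;
}

// Invented example: a car site with a post about restoration.
$ads = array(
    'generic-cars' => array('car', 'dealer'),
    'restoration'  => array('car', 'restoration', 'paint'),
    'gardening'    => array('garden', 'lawn'),
);
$meta    = array('car', 'auto');
$content = array('restoration' => 4.2, 'paint' => 2.1, 'car' => 1.5);

print_r(selectAds($ads, $meta, $content));
```

In this made-up case the gardening ad is dropped by the meta stage, and the restoration ad ends up ahead of the generic car ad because the page content mentions restoration and paint.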

Zangeel
08-28-2009, 10:39 PM
Looks like a pretty painful thing to read... :/

But I'll be sure to chug on with it (may take a while). In the meantime, are there any small (or big) changes I can make to my current script to improve accuracy?

It made my eyes bleed :D

But what do you reckon are the weak points in your function?

AdNetMart.com
10-14-2009, 03:27 PM
I fixed a couple of bugs in your script for you.


// GET THE 'TARGET' PAGE SOURCE AND PUT INTO THE $source VARIABLE.
$source = @file_get_contents($url) or die ("Could not connect to ".$url);

This is needed; otherwise, a URL that doesn't work will make the script return an error.
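A slightly more defensive variant, as a sketch: since `$url` comes straight from `$_GET`, it may be worth accepting only http/https URLs and setting a timeout before fetching. `fetchPage()` is a hypothetical wrapper and the 5-second timeout is an arbitrary choice; it returns null on any failure instead of calling die().

```php
<?php
// Fetch a page, returning its source or null on failure.
// Only http and https URLs are accepted, so local paths and
// other stream wrappers (file://, php://, ...) are refused.
function fetchPage($url, $timeoutSeconds = 5)
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), array('http', 'https'), true)) {
        return null;
    }
    // The 'http' context options also apply to https requests.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));
    $source = @file_get_contents($url, false, $context);
    return ($source === false) ? null : $source;
}

var_dump(fetchPage('file:///etc/passwd')); // refused: not http(s)
var_dump(fetchPage('/etc/passwd'));        // refused: no scheme at all
```

The caller can then decide what to do with a null result, for example fall back to default ads rather than stopping the whole page.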



if ($keyrank) {

// SORT THE KEYWORDS ARRAY SO THE HIGHEST RANKED WORDS ARE AT THE TOP
arsort($keyrank, SORT_NUMERIC);

} else {

echo "<br>No keywords were found on your given URL.<br>";
echo "<br>Your url: ".$url."<br>";
die;

}
// CREATE AN ARRAY OF RESULTS

This is also needed; otherwise, for URLs with no keywords, the script will return an error. You could also record a URL with no keywords in, for example, a MySQL database, set your advertising interface either to show no sponsored results for it or to show only default or RON (run-of-network) ads, and have a cron job re-check such URLs every day or so to see whether keywords are now available.

Another improvement would be to check <meta name="keywords">, "description" and the "title" for keywords, and add them to the keyword list with more weight. For example, you could make every keyword found in the title, description or keywords count as 2, 3 or more occurrences, whichever you find most accurate.
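A sketch of that weighting, under a few assumptions: the page source is already in a string, each title/description/keywords word counts as 3 occurrences (an arbitrary pick), and the meta tags are written with `name` before `content`. The regex extraction is deliberately crude, stop-word removal is left out for brevity, and `addWeighted()` and `weightedCounts()` are hypothetical helper names.

```php
<?php
// Add every word of $text to $counts, worth $weight occurrences each.
function addWeighted(array $counts, $text, $weight)
{
    $text = strtolower(preg_replace('/[^A-Za-z]+/', ' ', $text));
    foreach (preg_split('/ +/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (strlen($word) > 2) {
            $counts[$word] = (isset($counts[$word]) ? $counts[$word] : 0) + $weight;
        }
    }
    return $counts;
}

// Count body words once and title/meta words $fieldWeight times.
function weightedCounts($html, $fieldWeight = 3)
{
    $counts = array();
    // <title> text
    if (preg_match('@<title[^>]*>(.*?)</title>@si', $html, $m)) {
        $counts = addWeighted($counts, html_entity_decode($m[1]), $fieldWeight);
    }
    // <meta name="keywords|description" content="..."> (name before content only)
    $metaPattern = '@<meta\s+name=(["\'])(keywords|description)\1\s+content=(["\'])(.*?)\3@si';
    if (preg_match_all($metaPattern, $html, $mm)) {
        foreach ($mm[4] as $content) {
            $counts = addWeighted($counts, html_entity_decode($content), $fieldWeight);
        }
    }
    // Visible text: drop head, script and style blocks, then the remaining tags.
    $body = preg_replace('@<(head|script|style)\b[^>]*>.*?</\1>@si', ' ', $html);
    $counts = addWeighted($counts, strip_tags($body), 1);
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

$html = '<html><head><title>Car Restoration</title>'
      . '<meta name="keywords" content="restoration, classic"></head>'
      . '<body><p>Paint and restoration tips. Paint first.</p></body></html>';
print_r(weightedCounts($html));
```

The resulting counts could be fed into the same threshold and ranking steps the original script applies to plain occurrence counts.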


