Archive for the ‘Uncategorized’ Category

(some) bots welcome

Saturday, May 3rd, 2008

Recently, I started playing with sitemaps. I even got a nice cron to help out. Today, I thought I would take a look and see if anyone had crawled one of the domains, say, earl.holaservers.com, and some of the results (in a friendlier format) are below:

+----------+----------------------+----------------------------------------+
| count(*) | domain               | user_agent                             |
+----------+----------------------+----------------------------------------+
|        3 | earl.holaservers.com | . . . Baiduspider . . . spider_jp.html |
|        3 | earl.holaservers.com | ia_archiver                            |
|        4 | earl.holaservers.com | . . . Yahoo! Slurp                     |
|       14 | earl.holaservers.com | . . . Googlebot                        |
+----------+----------------------+----------------------------------------+
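For the curious, that summary came from a query along these lines. This is just a sketch: the connection details are placeholders, and the real user_agent values got trimmed down to the ". . ." bits you see above.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder connection details; the table is my traffic_log.
my $dbh = DBI->connect( "DBI:mysql:database=hola", "user", "password",
    { RaiseError => 1 } );

my $rows = $dbh->selectall_arrayref(
    q{SELECT count(*), domain, user_agent
        FROM traffic_log
       WHERE domain = ?
       GROUP BY domain, user_agent
       ORDER BY count(*)},
    undef, 'earl.holaservers.com'
);

# Print the same three columns as the table above.
printf "%8d  %-22s  %s\n", @$_ for @$rows;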

So it looks like I got crawled by

  1. the good old baidu spider, who only hit my robots.txt file. Guess the Japanese folks aren’t really interested in my test (and rather random) ftp’d files.
  2. ia_archiver, who looks to be alexa, though I am not sure where they came from
  3. our dear friend googlebot (welcome!) – who actually got a few pages
  4. yahoo! – just got robots.txt and sitemap.xml
  5. a couple other folks (not listed) just getting sitemap.xml

Looks like the bots came 1-3 days after the pings got sent. Also looks like no pings were sent after the first day. Hmm, very strange since I have been uploading files here and there. At least one, I think. Anyway, this helped me track down a bug: if you re-put a file, the timestamp wouldn’t change, and so the pinger wouldn’t find your latest changes. I think I fixed that, though I will be able to tell you better in a couple of days.
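The fix, roughly, is to bump the timestamp on every put, even when a row for that path already exists. Something like this sketch (the file_modification table and its columns are stand-ins for my real schema):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "DBI:mysql:database=hola", "user", "password",
    { RaiseError => 1 } );

# Hypothetical helper, called whenever a file gets put via ftp.  The point
# is that last_modified is set to NOW() even when the path already has a
# row, so a re-put shows up as a fresh modification for the pinger.
sub record_put {
    my ( $domain_id, $path ) = @_;
    $dbh->do(
        q{INSERT INTO file_modification (domain_id, path, last_modified)
               VALUES (?, ?, NOW())
          ON DUPLICATE KEY UPDATE last_modified = NOW()},
        undef, $domain_id, $path
    );
}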

I think the above results are rather telling. I pinged several search engines all about some random site that they had never heard of, and got rather different results from each, including crickets. Makes me wonder where the other guys are. Like ask.com? Seems like they were trying to make a run at google? Guess more like a run at baidu!

And yahoo!? Wonder why you can’t catch up? Well, might want to start by following the whole sitemap.

Moreover.com? Who are you and why don’t you come and visit?

Did I mention that it is pretty cool to have traffic show up in mysql? I am getting a bit reliant on mysql, but hey, it stays up, and when my one server can’t handle the traffic, I think that will be a good problem to have.

Enjoy!

Earl

ajax smooths the edges

Wednesday, April 30th, 2008

Some time ago I decided that I wanted to do some web stuff that was user-based, where you could sign up with just an email and password, and if you were new to the site, you would have to agree to terms. This would make it so that the signup form and the login form could be nearly identical. All the same except that darn agree-to-terms checkbox.

Tonight, I made the form on the front page of hs a little smarter, and ajax, as above, smoothed the edges. I used to do some javascript validation for the iagree checkbox, but I got rid of that, and via ajax I now submit the username and password. There are three cases (rough sketch after the list):

  1. username is taken, password is right – ding!  let’s send back an auth, and who cares if they clicked iagree
  2. username is taken, password is wrong – get a wrong password error
  3. username is available, but did not agree to terms – get a javascript alert saying you have to agree
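Here is a rough sketch of the server side of that ajax call. It is not my actual handler: the users table, the crypt-style password check, and the make_auth / create_account helpers are all placeholders.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "DBI:mysql:database=hola", "user", "password",
    { RaiseError => 1 } );

# Covers the three cases above, plus the implied fourth (new user who did
# agree).  make_auth() and create_account() are placeholder helpers.
sub handle_front_page_form {
    my ( $username, $password, $iagree ) = @_;

    my ($crypted) = $dbh->selectrow_array(
        q{SELECT password FROM users WHERE username = ?},
        undef, $username );

    if ( defined $crypted ) {    # username is taken
        return { ok => 1, auth => make_auth($username) }    # case 1: ding!
            if crypt( $password, $crypted ) eq $crypted;
        return { ok => 0, error => 'wrong password' };       # case 2
    }

    # username is available
    return { ok => 0, error => 'you have to agree to the terms' }    # case 3
        unless $iagree;

    create_account( $username, $password );    # the implied happy signup
    return { ok => 1, auth => make_auth($username) };
}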

It is all pretty well seamless, and right now it is sure darn quick, so there you go. New or old users can log in / register via the front page and would likely not realize all the coolness that just happened.

all cron’d up . . .

Wednesday, April 30th, 2008

Well, here’s the long and short of my work tonight

mysql> select * from sitemap_ping;
+-----------+---------------+--------+---------------------+
| domain_id | search_engine | status | timestamp           |
+-----------+---------------+--------+---------------------+
|         3 | Ask.com       |    200 | 2008-04-30 00:15:01 |
|         3 | Google        |    200 | 2008-04-30 00:15:01 |
|         3 | Live Search   |    200 | 2008-04-30 00:15:01 |
|         3 | Moreover.com  |    200 | 2008-04-30 00:15:02 |
|         3 | Yahoo!        |    200 | 2008-04-30 00:15:02 |
. . .

Views helped quite a bit. I made two:

  1. latest_modification – gets the domain_id and the latest modification timestamp for each domain
  2. latest_sitemap_ping – gets the domain_id and the timestamp of the latest sitemap_ping for each domain

Using those two views, I ping for

  1. every domain that has a latest_modification but that hasn’t ever been pinged
  2. every domain where the last modified is more recent than the last ping

I am thinking I could combine the two queries, but for now, I am satisfied. I just wiped my traffic_log, so we’ll have to see if, say, moreover.com starts hitting me.
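If I ever do combine them, the two views make it a one-query job. A sketch (the last_modified / last_ping column names are my guess at what I called them, and ping_search_engines is a stand-in for the actual pinger):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "DBI:mysql:database=hola", "user", "password",
    { RaiseError => 1 } );

# Domains that have never been pinged come out as p.domain_id IS NULL;
# domains modified since their last ping come out of the comparison.
my $domains = $dbh->selectcol_arrayref(q{
    SELECT m.domain_id
      FROM latest_modification m
      LEFT JOIN latest_sitemap_ping p ON p.domain_id = m.domain_id
     WHERE p.domain_id IS NULL
        OR m.last_modified > p.last_ping
});

ping_search_engines($_) for @$domains;    # placeholder for the pinger sub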

Not sure how expensive those queries get with like a million rows, but other than that, I am pretty well done with sitemap stuff. I am still unsure if it will do anything, i.e., will google (WLOG) come and spider as per the sitemap if I haven’t registered the domain with their webmaster tools? Guess we shall soon see.

Enjoy!

Earl

sitemaps

Tuesday, April 29th, 2008

Not sure if it will really work, like help pagerank and the like, but I am going to try doing some nice automated sitemaps. Kind of nice having stuff in a database. Tonight I got stuff working (without any caching yet), like

http://earl.holaservers.com/sitemap.xml

I haven’t tested it, but if you uploaded your own sitemap, it would win. Also,

http://earl.holaservers.com/robots.txt

Gotta love wikipedia, and I plan on getting some ping stuff working later this week.

Perhaps it goes without saying, but those two files are both completely virtual.  So the plan would be to query daily, ping things that have changed, then sit back and watch the page rank go up.  Will be especially nice if folks start transferring domains in, and I add a nice link back.
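The sitemap.xml half boils down to a sketch like this. It is not the real handler: the files table and its columns are stand-ins, the actual file contents live on S3, and lastmod really ought to be W3C datetime rather than a raw mysql timestamp.

use strict;
use warnings;
use DBI;

sub sitemap_xml_for {
    my ( $dbh, $domain ) = @_;

    # If the user uploaded their own sitemap.xml, it wins; the real handler
    # would serve that copy (which lives on S3) instead of building one.
    my ($has_own) = $dbh->selectrow_array(
        q{SELECT 1 FROM files WHERE domain = ? AND path = '/sitemap.xml'},
        undef, $domain );
    return undef if $has_own;

    my $rows = $dbh->selectall_arrayref(
        q{SELECT path, last_modified FROM files WHERE domain = ?},
        undef, $domain );

    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
        . qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
    for my $row (@$rows) {
        my ( $path, $mod ) = @$row;
        $xml .= qq{  <url><loc>http://$domain$path</loc>}
              . qq{<lastmod>$mod</lastmod></url>\n};
    }
    return $xml . qq{</urlset>\n};
}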

So cool! (I think)

How the heck did it get to be like 12:30?

enjoy!

Earl

selenium rulz!, also blog != wiki

Saturday, April 26th, 2008

<Rant>

I am going to take the second part first. For me, right now, today, blogs (at least this one) are not nearly as cool as wikis. I keep a nice password protected wiki, where I track spack++ stuff. The fact that I am writing a blog now makes me think that I can take the password protection off, but there you go. I started my wiki back when a friend and I were kind of in stealth mode on some shopping stuff, but I think that nearly everything we were working on is now live. I think holaservers is gonna be awesome, but the linking from the points that I described before could make shopthar stuff explode.

Yesterday, I spent a good hour writing the post I am about to rewrite. WordPress has drafts and saves them as you go along, but so far as I can tell, once you publish, your drafts go away. I was all done with a, for me, pretty rocking post when I noticed the formatting was off. I went in, changed the formatting, clicked (I think) Save, and only my last paragraph remained. After some searching, including getting google desktop going on my computer, I couldn’t find my post. Gone. Such things used to happen in a non-AJAX world, but now? Why? Why not take snapshots as folks are editing, and keep anything with a different md5 as a different entry, pretty well forever? Oh, already “published”? Who cares? Keep it around. Generally, we’re talking about text. Yeah, text. You can store a gig of it for fifteen cents a month at amazon. That’s a lot of blog posts. Or if my Google Apps request goes through, I could store 500 megs of it for free. Text! Who cares? Keep a snapshot per minute. If I run out of space, let me delete some drafts. How much time do folks spend editing their blog posts? Actually making changes? I used to write emails with something like EditPadPro open and paste my email in there every so often. I guess those days are back.

PbWiki, on the other hand, I think does just what I would like: keep versions as I go, let me compare them and roll back to them.

</Rant>

Turns out selenium rulz! Did I mention that I pretty well hate retyping stuff? I feel like I am being a little insincere since I have already written the stuff once. Occasionally I have to do it with code and it just kind of irks me. I will now try to really end my rant.

Some few years ago, I took a “vacation” from work to go in to work and work on HTTP::Slap. HTTP::Slap aimed to walk web pages, helping me log into espn and the like. I was pretty into espn’s baseball challenge and needed some help with automated team changes. I got HTTP::Slap to the point where it would accept and post back cookies, submit hidden tags from the forms it was posting, post via https, etc. A year or two later, I discovered WWW::Mechanize and don’t think I ever worked on HTTP::Slap again. Guess I should have looked around a bit. Live and learn.
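For anyone in the same boat: the stuff I hand rolled in HTTP::Slap is just a few lines with WWW::Mechanize. A made-up example (the URL and field names are not any real site):

use strict;
use warnings;
use WWW::Mechanize;

# Cookies, hidden form fields and https all come for free.
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('https://example.com/login');
$mech->submit_form(
    form_number => 1,
    fields      => { username => 'earl', password => 'secret' },
);
print $mech->content;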

Just a couple years ago, I went to OSCON and heard a talk about selenium. For whatever reason, I didn’t think it was worth pursuing. Yesterday, reviewing the slides, I couldn’t really see what feature I was waiting for, but there you go.

Just last week, a coworker, who I think likes to remain off the grid, was trying out some holaservers stuff and it didn’t work in ie. “That’s weird”, I thought. Especially since I had made nearly no effort to get it all to work in ie, and I had had ie compatibility problems in the past. I do want holaservers to work in ie, but I don’t have much desire to test every javascripty thing in both firefox (where I develop) and ie (what people use). I remembered selenium and decided to have a look. At the end of the first night, I had some basic crawling working. I started by exporting an html page from the fine selenium IDE, would parse the html in perl and then pass the commands over to WWW::Selenium (which I think should be version 1.15, as there is a 1.14). I liked walking through a path with the IDE, but figured that I would want to do unit tests in perl. Eventually I just did everything in perl. I started to use EPIC, a perl plugin for eclipse, and could just debug through, which was pretty well just as good. Little shout out to EPIC. One of my favorite features of eclipse is being able to hit <ctrl>-<shift>-f and apply some formatting rules. If you have perltidy installed, EPIC will do that for you as well.
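The perl side ends up pretty compact. A sketch of driving the browser with WWW::Selenium, assuming a selenium server on localhost:4444; the element locators here are placeholders, not my real page:

use strict;
use warnings;
use WWW::Selenium;

my $sel = WWW::Selenium->new(
    host        => 'localhost',
    port        => 4444,
    browser     => '*firefox',                # or *iexplore, *opera, *safari
    browser_url => 'http://holaservers.com/',
);

$sel->start;
$sel->open('/');
$sel->type( 'username', 'selenium-test@example.com' );    # placeholder locators
$sel->type( 'password', 'secret' );
$sel->click('iagree');
$sel->click('submit');
$sel->wait_for_page_to_load(30000);
$sel->stop;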

After a bit of work, I can walk through my site using selenium, and I got things working in ie, firefox, opera and safari. Just had a few issues.

  1. Cross site scripting strikes again. Had I been blogging back in the day, I could have written about using mod_rewrite to overcome some cross site problems. See, my stuff starts on holaservers.com and goes to my.holaservers.com; cross site scripting, shall we say. I divided my script into pre-login and post-login chunks and launched separate browsers for each.
  2. During login, I bounce, via javascript, with an auth in the query string. To have my perl script mimic this behavior, I need that bounce url. I set a cookie (selenium=1) and if I have that cookie, I alert the url, which selenium then grabs and bounces to. Works like a champ. Hopefully no users will ever see it.
  3. I think the main cross browser problem I had came from using my own naive implementation of ajax. If you look, I don’t really do much, but for some reason, ie wouldn’t get my callbacks. I switched over to the YUI implementation and everything started working. I think I wrote mine before yui was around, or at least before I had heard of it, but again, live and learn.
  4. Safari for windows is slow. The other browsers would run the tests (which, as above, launch a browser twice) in about 30 seconds. Safari, over three runs, did 108 seconds, a timeout exceeded, and 111 seconds. Pretty good for the world’s fastest browser.

I posted my test script here, and for some reason, the formatting got hosed, darn it. I even went through and fixed it, though to no avail.

Enjoy!

Earl

enjoy!

Saturday, April 12th, 2008

So, I decided to start writing a blog about a couple of my side projects. Today’s entry is about my hosting project, which is currently called / hosted at holaservers.com. It is a beta (perhaps alpha) version of a free hosting site that I have wanted to do for ages. It is what I spend much of my should-be-sleeping time working on. Little by little, I have been getting stuff going, and as of right now, I have seen (at least once) the following work on the site:

  1. signup and get verified via a nice little email
  2. create subdomains of holaservers.com or “transfer” domains on in
  3. use ftp to manage files
  4. use TinyMCE to edit files
  5. get credit for referring a friend
  6. enjoy!

May not sound like much, but heck, I think it is pretty cool. Probably the coolest thing is that I use Amazon’s S3 for my storage. I currently have none of the hosting stored locally. I used to think I would keep the S3 part kind of secret, cause, I don’t know, I thought people could easily rip off the idea, but the more I have done, the more I think I will just be open about stuff. Well, pretty open.

A little Q&A.

Q: Why the heck would you start another web hosting site? Aren’t there like thousands or tens of thousands of them out there?

A: Yeah, sure. So, that’s a pretty good argument for me that one more might be ok 🙂 I once heard that for every taco bell in the US, there is also a hedge fund. No idea where they sit in some of them, but there you go.

Q: Why else?

A: Well, I worked at a hosting company for over seven years, generally liked it, and would like to try some things that I wasn’t really ever able to do there. Kind of hard to move a big ship. We used a NetApp for our storage and that (I think) limited some possibilities for us. Some decisions that were made early on made later changes near impossible. Simple things like your domain name being your login led to oh so many resulting troubles. So, on holaservers, your email is your login. See? You’re getting the picture already.

There are also at least a couple of viral things I would like to try out around a little points system I have pondered for ages.

I would also like to start to grow a user base to help out with other projects, like shopthar. Along similar lines, if I can get some of the stuff described below going, I think I could get some pretty good page rank on a few sites, which would help all the way around.

Q: How do you earn points?

A: For now, the plan is you can earn points by

  1. helping a friend signup and verify via your referral link
  2. helping a friend upgrade to a hostmonster account via your referral link
  3. registering one of your non-holaservers sites as pointing to a spack approved site and having my system find your link
  4. having your above site / link show up in a google link: query via their api

Q: Points? What are you going to do with that (those)?

A: The vision is that with points you may be able to

  1. get more space for your account
  2. get the first year free for your hostmonster account
  3. earn a chunk of holaservers’ hostmonster revenue, perhaps 1 or 5%

Q: What was that last one?

A: For a given month, let’s say holaservers earned $100,000 from hostmonster referrals, you had earned 10% of the total points, and it was decided to share 5% of the revenue. Well then, you would take home $500 (10% of the $5,000 being shared). That would be pretty cool for the both of us! What if holaservers had earned $1,000,000,000 from hostmonster referrals in a month? That’s right, $5,000,000! Even cooler!
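In code, that math is just:

use strict;
use warnings;

# Numbers straight from the example above.
my $revenue     = 100_000;    # monthly hostmonster referral revenue
my $share       = 0.05;       # fraction of revenue being shared
my $your_points = 0.10;       # your fraction of the total points
printf "You take home \$%.2f\n", $revenue * $share * $your_points;    # $500.00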

Q: Wow, great ideas! Why not go for some venture capital and quit your day job?

A: Kind of have this dream of running my own company, hiring folks I want to hire, perhaps not retaining folks that don’t work out, working on cool things, trying out cool ideas I have or hear, treating folks right, making some good money, the whole time trying to be a good, fair, honest, hard working man; pretty well the best boss and owner ever. Kind of have this fear that if I go for VC the above dream could just drift away and I end up sitting in some meeting with my former folks and perhaps other folks yelling at each other about who knows what, while telling me that they aren’t going to implement whatever I had just been talking about. Course I also have this memory of working on pretty cool stuff the last like ten years that has hardly seen the light of day because I can generally only work on it at night, so pick your poison 🙂

Q: What about your day job?

A: Yeah, gonna keep that. I mean, hey, it is totally different. For my day job I write a Java Eclipse Rich Client application that will help missionaries and the like take more / better pictures at archives. This project is Perl on the server. How much more different could they be?

Q: Isn’t it all just programming?

A: What? Are you even a programmer?

Q: Don’t you have a beautiful wife and family, including a little sub-year-old baby? When do you find the time to get anything done?

A: Yes, and pretty much after they go to bed 🙂

On this blog, I want to kind of keep track of the challenges we face and hopefully overcome here at holaservers. Join us! Just sign up for an account over at holaservers.com.

Enjoy!

Earl