Timestamps are in GMT/BST.
[4:05] * DuraLogBot (n=PircBot@fedcommsrv1.nsdlib.org) has joined #duraspace
[4:05] * Topic is 'Welcome to DuraSpace - This channel is logged - http://duraspace.org/irclogs/'
[4:05] * Set by cwilper on Tue Jun 30 16:32:05 EDT 2009
[9:37] * bradmc (firstname.lastname@example.org) Quit ()
[9:38] * bradmc (email@example.com) has joined #duraspace
[14:15] * stuartlewis (firstname.lastname@example.org) has joined #duraspace
[14:15] <stuartlewis> Is there a DSpace development meeting today?
[14:29] <kshepherd> afaik
[14:30] <kshepherd> i may have to leave early, a few family members are leaving town this morning and i haven't seen them yet (this holiday stuff is hard work)
[14:37] <stuartlewis> Seems quiet here - probably won't happen I suspect.
[14:58] <kshepherd> stuartlewis: that's what we thought last time, then there was a last-minute rush ;)
[14:58] <stuartlewis> 2 minutes to go... :)
[14:59] <kshepherd> enjoying the break? when do you start back at work?
[15:01] <stuartlewis> Yeah - having a good break. Taking the kids out on their new (2nd hand trademe) kiddie kayak
[15:01] <stuartlewis> Back on the 5th.
[15:01] <stuartlewis> How about you?
[15:02] <kshepherd> cool ;) my family are kayak enthusiasts, haven't taken them out for a spin yet on this trip though
[15:02] <kshepherd> i'm back in the office on the 12th, but i suspect i'll be "working from home" a lot from 5th onwards
[15:03] <kshepherd> i have a test server down but i can't get a console to it from here :P
[15:04] <stuartlewis> How come?
[15:05] <kshepherd> rdp over socks is a no-go, apparently (it's stuck in the middle of a reboot, so i need to RDP to a windows machine at Waikato and use VMware Infrastructure Client to get a console up, but i only have SSH tunnels to use, no VPN access)
[15:06] <stuartlewis> Oh dear!
[15:06] <stuartlewis> Does Waikato not run VPN?
[15:06] <kshepherd> heh yeah, i'm not too worried since it's a test server but i'll definitely need to investigate VPN when i return
[15:07] <kshepherd> it's available, but only case-by-case i think... but i think i just found my 'case' ;)
[15:08] <kshepherd> well, while we're here... i'm just going to search email threads, but what is the strategy as of right now for managing spiders.txt? looks like the file is still empty in trunk
[15:09] <stuartlewis> I think we're supposed to run a script that trawls the dspace.log for any IP that requests robots.txt, and adds that to spiders.txt.
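The log-trawling idea stuartlewis describes could be sketched roughly as below. This is only an illustration under stated assumptions: the sample log line layout is hypothetical (the real dspace.log format may differ), the function names are invented here, and spiders.txt is treated as a plain list of IP addresses.

```python
import re

# Assumption: the client IPv4 address appears somewhere in each log line,
# and the requested path appears as plain text. Not the real dspace.log layout.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def robots_txt_ips(log_lines):
    """Return the sorted, de-duplicated IPs on lines that mention robots.txt."""
    ips = set()
    for line in log_lines:
        if "robots.txt" in line:
            match = IPV4.search(line)
            if match:
                ips.add(match.group(0))
    return sorted(ips)


def merge_into_spiders(existing_ips, new_ips):
    """Merge newly found IPs into the existing spider list without duplicates."""
    return sorted(set(existing_ips) | set(new_ips))


# Hypothetical sample lines using documentation-only IP ranges.
sample_log = [
    "2009-12-30 10:00:01 INFO ip_addr=192.0.2.10 GET /robots.txt",
    "2009-12-30 10:00:05 INFO ip_addr=203.0.113.5 GET /handle/123/456",
    "2009-12-30 10:01:12 INFO ip_addr=198.51.100.7 GET /robots.txt",
    "2009-12-30 10:02:30 INFO ip_addr=192.0.2.10 GET /robots.txt",
]

found = robots_txt_ips(sample_log)
spiders = merge_into_spiders(["192.0.2.10"], found)
```

Hand-maintained entries, such as the Southampton crawlers mentioned later in the conversation, could be passed in as `existing_ips` so they survive each merge.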
[15:10] <stuartlewis> I think if we ship it with a default file from one of the online lists of IPs, that will cover 95% for a reasonable time.
[15:10] <kshepherd> hm
[15:10] <stuartlewis> 1.6.1 (or 1.6 if we have time) can come with an update script for it to pull a copy from DSpace servers maybe.
[15:11] <kshepherd> yeah agreed. i was just going to remind about the southampton crawlers, which won't appear on any typical spider/bot lists
[15:12] <stuartlewis> Yup - we can add them.
[15:12] <stuartlewis> So maybe best we manage the list ourselves with an update mechanism?
[15:13] <stuartlewis> And I think I've mentioned my idea before of having the update mechanism include a phone-home function so we can tell from the DSpace.org web server logs who / which version of dspace etc is pulling the data.
[15:17] <kshepherd> yeah, it's an interesting idea.. would be good to get some community feedback on that one
[15:17] <stuartlewis> But I suppose for 1.6, the main thing is to get a half decent spiders.txt file. We can think about an update mechanism in 1.6.1
[15:18] <kshepherd> well, 18min in... guess there really is no meeting
[15:18] <stuartlewis> My dspace.log -> solr converter just removes googlebot, yahoo slurp and msn bot, and that removes over 80% of all hits!
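A filter of the kind stuartlewis mentions might look something like the sketch below. It is not his converter: the record shape (a dict with a `user_agent` key) and the marker substrings are assumptions chosen for illustration, based only on the three bots named above.

```python
# Assumed lowercase substrings for the three crawlers named in the chat.
BOT_MARKERS = ("googlebot", "slurp", "msnbot")


def is_bot(user_agent):
    """True if the user-agent string contains any known crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def drop_bot_hits(hits):
    """Keep only the hits whose user agent does not look like a crawler."""
    return [hit for hit in hits if not is_bot(hit["user_agent"])]


# Hypothetical parsed log records.
sample_hits = [
    {"path": "/handle/1/2", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"path": "/handle/1/3", "user_agent": "Mozilla/5.0 (compatible; Yahoo! Slurp)"},
    {"path": "/handle/1/4", "user_agent": "msnbot/1.1"},
    {"path": "/handle/1/5", "user_agent": "Mozilla/5.0 (Windows; U) Firefox/3.5"},
]

human_hits = drop_bot_hits(sample_hits)
```

Matching on user agent and matching on IP (as spiders.txt does) are complementary: the first catches crawlers that identify themselves, the second catches known crawlers that do not.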
[15:18] <stuartlewis> Yeah - might as well give up :)
[15:18] <kshepherd> heh yeah that's about the same ratio as we get
[15:18] * kshepherd switches to #dspace
[15:19] <stuartlewis> Scary isn't it!
[15:19] <stuartlewis> And of the 20% "true" users, 80% of them will have come from Google!
[15:20] <stuartlewis> So if you do the math, Google hits us 5 times, for every 1 visitor they direct to us.
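The 5-to-1 figure can be checked with the rough numbers quoted above. The sketch assumes, as the chat does implicitly, that all of the filtered bot traffic is attributed to Google; the variable names are illustrative only.

```python
from fractions import Fraction

# Figures quoted in the conversation (approximate).
bot_share = Fraction(80, 100)           # hits removed as crawler traffic
human_share = 1 - bot_share             # remaining "true" user hits
google_referral_rate = Fraction(80, 100)  # share of true users arriving via Google

# Share of all hits that are human visitors referred by Google: 0.2 * 0.8 = 0.16
google_referred = human_share * google_referral_rate

# Crawler hits per referred visitor: 0.80 / 0.16
crawl_hits_per_visitor = bot_share / google_referred
```

Exact fractions are used rather than floats so the ratio comes out as a whole number instead of a rounding artefact.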
[15:20] <kshepherd> hehe true
[15:20] <kshepherd> seems inefficient, but then a single indexing hit can refer multiple visitors from all ends of the earth, ultimately
[15:21] <stuartlewis> True - and for a site like amazon, the stats will be a lot different!
[15:23] * bradmc (email@example.com) Quit (Read error: 54 (Connection reset by peer))
[15:24] * bradmc (firstname.lastname@example.org) has joined #duraspace
[15:25] * stuartlewis (email@example.com) Quit ()
These logs were automatically created by DuraLogBot on irc.freenode.net using the Java IRC LogBot.