#duraspace IRC Log


IRC Log for 2012-11-14

Timestamps are in GMT/BST.

[0:11] * joaomelo (~DSpace@bl23-41-52.dsl.telepac.pt) Quit (Quit: joaomelo)
[6:44] -sturgeon.freenode.net- *** Looking up your hostname...
[6:44] -sturgeon.freenode.net- *** Checking Ident
[6:44] -sturgeon.freenode.net- *** Found your hostname
[6:44] -sturgeon.freenode.net- *** No Ident response
[6:44] * DuraLogBot (~PircBot@atlas.duraspace.org) has joined #duraspace
[6:44] * Topic is '[Welcome to DuraSpace - This channel is logged - http://irclogs.duraspace.org/]'
[6:44] * Set by cwilper!ad579d86@gateway/web/freenode/ip. on Fri Oct 22 01:19:41 UTC 2010
[12:03] * Asger (~abr@eduroam-nat.statsbiblioteket.dk) has joined #duraspace
[13:18] * Asger (~abr@eduroam-nat.statsbiblioteket.dk) Quit (Ping timeout: 264 seconds)
[13:21] * Asger (~abr@ has joined #duraspace
[13:23] * mhwood (mwood@mhw.ulib.iupui.edu) has joined #duraspace
[14:05] * tdonohue (~tdonohue@ has joined #duraspace
[15:20] * tdonohue (~tdonohue@ Quit (Read error: Connection reset by peer)
[15:24] * tdonohue (~tdonohue@c-50-129-94-92.hsd1.il.comcast.net) has joined #duraspace
[15:42] * Asger (~abr@ Quit (Ping timeout: 245 seconds)
[17:43] * helix84 (a@ has joined #duraspace
[19:52] <tdonohue> Hi all, reminder that our DSpace Developers Mtg is in ~8mins https://wiki.duraspace.org/display/DSPACE/DevMtg+2012-11-14
[19:52] <kompewter> [ DevMtg 2012-11-14 - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DevMtg+2012-11-14
[19:56] * KevinVdV (~KevinVdV@d54C154B1.access.telenet.be) has joined #duraspace
[19:56] <KevinVdV> Hi everybody
[19:57] <tdonohue> Hi KevinVdV
[20:00] * hpottinger (~hpottinge@mu-162198.dhcp.missouri.edu) has joined #duraspace
[20:00] <tdonohue> Hi all, it's time for our weekly DSpace Developers Mtg. Today's agenda is up at: https://wiki.duraspace.org/display/DSPACE/DevMtg+2012-11-14
[20:00] <kompewter> [ DevMtg 2012-11-14 - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DevMtg+2012-11-14
[20:00] * cbeer (cbeer@2600:3c03::f03c:91ff:fedf:a498) has left #duraspace
[20:01] * aschweer (~schweer@schweer.its.waikato.ac.nz) has joined #duraspace
[20:01] <tdonohue> First off, I wanted to highlight that announcement near the top.... You should have already seen the email, but we are having a few "DSpace Future Directions" joint meetings with DCAT & Committers to go over some of the feedback that DuraSpace has heard from its Sponsors & Registered Service Providers
[20:02] <tdonohue> So again, we'd love to have you all join one of those meetings (no need to join both, unless you really really wanted to)
[20:02] <tdonohue> If there are any questions about these meetings though, I'd be glad to answer them
[20:02] <mhwood> Got one on my calendar, now I just need to send Skype some money.
[20:04] <tdonohue> Also, it's worth mentioning, the discussion will be higher level, if it isn't clear...we won't be digging deep into tech stuff and will try to avoid having others do so. Essentially the goal is to get all the feedback out there and start to analyze whether there are common concerns we can start to form a project (or multiple) around
[20:04] * robint (52292725@gateway/web/freenode/ip. has joined #duraspace
[20:05] <tdonohue> so, if anyone has any other questions on those DSpace Future Discussions meetings, just ping me...glad to answer them as need be. But, we can move along on our agenda, for now
[20:05] * sands (~sands@ has joined #duraspace
[20:05] * bollini (~chatzilla@host252-210-dynamic.8-79-r.retail.telecomitalia.it) has joined #duraspace
[20:05] <tdonohue> So, the main topic for today obviously (as many folks start filtering in) is 3.0 release status & updates
[20:05] <hpottinger> so, it's OK for a bit of the technical to color our approach to the high level discussion, yes?
[20:06] <tdonohue> hpottinger -- for that DSpace Futures Discussions, you are more than welcome to have tech stuff "color" the high level discussion. But we just want to keep things at a higher level, and avoid digging deep on one single topic
[20:06] <tdonohue> (as we only have one hour for the meeting, and want to make the most of our time)
[20:07] <tdonohue> ok...moving along to 3.0 stuff & next steps...we've completed the 2nd Testathon...there seemed to be less reported bugs overall, but we still have quite a few that seem to be open.
[20:07] <sands> hi all. i can unfortunately only stay until 3:30EST so apologies for dropping at that point.
[20:08] <tdonohue> List of still unresolved bugs in 3.0: https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+fixVersion+%3D+%223.0%22+AND+resolution%3DUnresolved+ORDER+BY+priority+DESC
[20:08] <kompewter> [ Issue Navigator - DuraSpace JIRA ] - https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+fixVersion+%3D+%223.0%22+AND+resolution%3DUnresolved+ORDER+BY+priority+DESC
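The Issue Navigator link above is a JQL query URL-encoded into the jqlQuery parameter. As a minimal sketch (the function and constant names here are hypothetical, not part of JIRA or DSpace), the same link can be rebuilt from the readable query with Python's standard library:

```python
from urllib.parse import urlencode

# Base URL of the Issue Navigator, copied from the link in the log.
ISSUE_NAVIGATOR = "https://jira.duraspace.org/secure/IssueNavigator.jspa"


def issue_navigator_url(jql):
    """Return an Issue Navigator URL for the given JQL string (hypothetical helper)."""
    # urlencode() applies quote_plus(), so spaces become "+" and "=" becomes "%3D".
    return ISSUE_NAVIGATOR + "?" + urlencode({"reset": "true", "jqlQuery": jql})


# The readable form of the query behind the "unresolved in 3.0" link.
UNRESOLVED_3_0 = (
    'project = DS AND fixVersion = "3.0" '
    "AND resolution=Unresolved ORDER BY priority DESC"
)

if __name__ == "__main__":
    print(issue_navigator_url(UNRESOLVED_3_0))
```

Changing the ORDER BY clause to "assignee DESC, priority DESC" gives the variant that sorts unassigned tickets to the top.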
[20:08] <tdonohue> we're down to 29! :) (it is decreasing...just not so rapidly)
[20:08] <sands> tdonohue: still seems (because of the frequency and open list) that we are not ready to leap toward a final release. anybody else have feelings on this?
[20:08] <robint> The biggest group seem to be relted to versioning
[20:08] <robint> related
[20:09] <tdonohue> correct, some of the biggest seem related to versioning, which is also a good reason why I wanted to bring forth a recommendation (suggested by mdiggory initially) to *DISABLE* versioning by default in 3.0
[20:09] <hpottinger> I'm working on DS-1174 today
[20:09] <kompewter> [ https://jira.duraspace.org/browse/DS-1174 ] - [#DS-1174] exception handling for itemUpdate hides the stack trace, which obstructs troubleshooting efforts - DuraSpace JIRA
[20:10] <robint> tdonohue: Could we then proceed without reolving the related Jira issues?
[20:10] <tdonohue> (As much as I like versioning as a feature...and still want to promote it heavily in 3.0, I think we need to warn folks about some of the issues related to it)
[20:10] <robint> resolving
[20:11] <robint> Stick a big warning on it - this feature is beta and buggy
[20:11] <tdonohue> robint -- yes, if we decided to disable versioning in 3.0, then I'd suggest both DS-1374 and DS-1382 become "known issues" and get rescheduled for 3.1 or 4.0
[20:11] <kompewter> [ https://jira.duraspace.org/browse/DS-1374 ] - [#DS-1374] AIP Backup &amp; Restore functionality does NOT backup/restore past versions of Items - DuraSpace JIRA
[20:11] <kompewter> [ https://jira.duraspace.org/browse/DS-1382 ] - [#DS-1382] AIP Backup &amp; Restore functionality should not duplicate unchanged files across Item Versions - DuraSpace JIRA
[20:11] <mhwood> Ooh, 3.0 even includes "technology preview" features.
[20:11] <KevinVdV> But I do really need to fix DS-1363
[20:11] <kompewter> [ https://jira.duraspace.org/browse/DS-1363 ] - [#DS-1363] Unable to create new version when previous version was deleted in workspace - DuraSpace JIRA
[20:11] <helix84> mhwood: yep, we already have the mobile theme with the nice beta sticker
[20:11] <tdonohue> mhwood -- yea, that's a way to put it... "Versioning" is in beta...it works, but here's some of the "known issues/complications"
[20:12] <tdonohue> KevinVdV -- yea, I agree. Even if we slap beta on Versioning, we still need to fix Ds-1363 and any other bugs in the versioning system itself
[20:13] <KevinVdV> I will attempt to fix them in coming week ..... Just been so busy lately
[20:13] <robint> KevinVdV: let us know if you are struggling for time and maybe others can help out
[20:14] <tdonohue> The biggest versioning issue here is that DS-1382 seems REALLY COMPLEX....it's essentially rewriting a good chunk of the AIP Backup & Restore (and then making sure it's backwards compatible). It's not doable in the near term, and I'm not sure we want to hold up 3.0 for it. But, it is a very important issue to make users aware of.
[20:14] <helix84> so while sands is still here, let me ask the RT - are we still going for the planned release date?
[20:15] <KevinVdV> Might I ask what the date is ?
[20:15] * tdonohue notes the planned release date is currently set at FRIDAY :)
[20:15] <hpottinger> yee haw!
[20:15] <sands> hm
[20:15] <robint> Then I guess not
[20:15] <KevinVdV> Yeah I'm not going to get there
[20:15] <sands> i would vote not.
[20:16] <helix84> okay
[20:16] <mhwood> How much time is needed for things that *must be fixed*, then?
[20:16] <tdonohue> So...what is a more reasonable date? We really should try and keep pushing to get this wrapped up
[20:16] <helix84> considering the bugs we have left, does a week's extension sound enough?
[20:17] <mhwood> We have two blockers, one unassigned.
[20:17] <helix84> i wouldn't recommend extending it more at once, we may reconsider again in a week
[20:17] * tdonohue notes that next week (specifically Thurs & Fri) many folks in the USA will be off work...for Thanksgiving Holiday
[20:17] <KevinVdV> I will give it my everything to attempt to fix.... Might need testers
[20:17] <helix84> tdonohue: so what about the following monday/tuesday?
[20:18] <robint> mhwood: the unassigned blocker is the caching issue, which I think we agreed could be rescheduled if we had to
[20:18] <tdonohue> The holiday in the USA only extends through Thurs & Fri. Most folks (self included) will return to work on Mon, Nov 19
[20:18] <hpottinger> KevinVdV when you're ready for testers, just ask on one of the mail lists (dspace-release would get my attention, but -commit would probably be better)
[20:18] <tdonohue> s/19/26/
[20:18] <kompewter> tdonohue meant to say: The holiday in the USA only extends through Thurs & Fri. Most folks (self included) will return to work on Mon, Nov 26
[20:19] <hpottinger> how does Tuesday 11/27 sound?
[20:19] <hpottinger> and who wants to do the honors?
[20:19] <helix84> i think we should really try to keep that one
[20:20] <tdonohue> Tues, Nov 27 sounds reasonable to me. As long as everyone feels they can complete their assigned tickets
[20:20] <tdonohue> again, list of open tickets: https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+fixVersion+%3D+%223.0%22+AND+resolution%3DUnresolved+ORDER+BY+priority+DESC
[20:20] <kompewter> [ Issue Navigator - DuraSpace JIRA ] - https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+fixVersion+%3D+%223.0%22+AND+resolution%3DUnresolved+ORDER+BY+priority+DESC
[20:20] <tdonohue> we still have 3 which are unassigned
[20:21] <tdonohue> here's the unassigned sorted to the top of the list: https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+fixVersion+%3D+%223.0%22+AND+resolution+%3D+Unresolved+ORDER+BY+assignee+DESC%2C+priority+DESC
[20:21] <kompewter> [ Issue Navigator - DuraSpace JIRA ] - https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+fixVersion+%3D+%223.0%22+AND+resolution+%3D+Unresolved+ORDER+BY+assignee+DESC%2C+priority+DESC
[20:21] <hpottinger> I'll have a pull request for 1174 later today, as soon as I sort out a bit of a commit mess, the patch is very tiny
[20:21] <sands> The 27th sounds reasonable.
[20:22] <sands> I will be on the same US schedule (off Thurs/Fri, back Mon)
[20:22] <helix84> i have 1 + 2 duplicates to fix, i'll fix them. the validation issues are probably as much solved (all thanks to joao) as they can be.
[20:22] <robint> I will grab DS-1373, looks trivial
[20:22] <kompewter> [ https://jira.duraspace.org/browse/DS-1373 ] - [#DS-1373] When logged in on item page, missing XMLUI key &quot;xmlui.statistics.Navigation.view&quot; - DuraSpace JIRA
[20:22] <tdonohue> DS-1205 is still the big "question mark" --- it's the unassigned blocker that we've all grown to love (or hate)
[20:22] <kompewter> [ https://jira.duraspace.org/browse/DS-1205 ] - [#DS-1205] DSpace org.dspace.core.Context caching problem - DuraSpace JIRA
[20:23] <hpottinger> am planning to test 1205 this afternoon/evening
[20:23] <helix84> that's the one that had pull request ready before the last RC, but we didn't push the button, so it didn't go through the testathon
[20:23] <tdonohue> With the new deadline, I'd recommend though that we reschedule Ds-1205 for 3.1 or 4.0. Honestly, if someone finds they are hitting this issue, they could just install the small patch from GitHub.
[20:24] <robint> tdonohue: +1
[20:24] <tdonohue> I'm just way too worried about the small fix in Ds-1205 though...as it is in the Context class, which is used by EVERYTHING
[20:24] <mhwood> Yes, it's too late for 3.0.
[20:25] <helix84> i was thinking of heretically suggesting to put it into 3.0 and hurry with 3.1 O:-)
[20:25] <tdonohue> I'd rather do the opposite...schedule for 3.1, and if any "brave" folks can start running it in production 3.0, they can let us know by 3.1 if it has proven to be stable
[20:25] <helix84> i'm not sure how else we're going to get it tested enough
[20:26] <hpottinger> I'll take that bet, will shoehorn 1205 into our 3.0 upgrade
[20:26] <helix84> i'll do that too unless i forget
[20:26] <helix84> it's a deal, then
[20:27] <tdonohue> Sounds good...we'll reschedule Ds-1205 for 3.1 then
[20:27] <tdonohue> The only other unassigned is DS-1357
[20:27] <kompewter> [ https://jira.duraspace.org/browse/DS-1357 ] - [#DS-1357] Mobile XMLUI theme fails to load when reloading item view page - DuraSpace JIRA
[20:27] <mhwood> Suggestion: as soon as 3.1 development opens, pull this, so that everyone working on 3.1 has it. That should get some testing done.
[20:27] <tdonohue> mhwood -- true, that would be one way to get testing in as well
[20:27] <mhwood> *sigh* "this" == 1205
[20:28] <robint> I don't have a smart phone so I am ducking 1357 :)
[20:28] <helix84> uh, i thought 1357 was fixed...
[20:28] <helix84> let me recheck
[20:29] <helix84> i'll assign it for now
[20:29] <tdonohue> if it is fixed, that'd be great...just need to get its ticket closed then :)
[20:29] <tdonohue> ok, sounds good, helix84
[20:29] <mhwood> There's a suggested fix.
[20:30] <tdonohue> So, back to the question of "Do we disable Item Versioning by default?" I want to verify that others are OK with this action, before we move forward. Then we also need to create a ticket for it & get it implemented ASAP
[20:30] <sands> Sorry folks, have to jet. Talk to you on list. Cheers.
[20:30] * sands (~sands@ Quit (Quit: sands)
[20:30] <helix84> like I wrote, it sounds reasonable to me
[20:30] <helix84> disabling it, that is
[20:30] <KevinVdV> +1 for disable !
[20:31] <mhwood> I agree, disable by default.
[20:31] <helix84> well kevin and mdiggory are the authorities for this and they both agree
[20:31] <hpottinger> +1 disable
[20:31] <bollini> +1 disable
[20:31] <tdonohue> ok...cool, sounds like we all agree then. That means I'll reschedule the AIP Backup tickets also for 3.1 & work on some docs to describe the AIP Backup + Versioning issues in 3.0
[20:32] <robint> Re. the issues assigned to Peter, I think they can all be left until the last moment and either rescheduled or turned off by default
[20:32] <tdonohue> I'll also create a ticket for setting it to disabled & get that implemented ASAP (shouldn't be hard)
[20:32] <helix84> tdonohue: we'll need to document it in the release notes, too. basically say "you can enable versioning but it doesn't play nice with aip"
[20:32] <tdonohue> helix84 -- yep, agreed
[20:33] * hpottinger notes no one raised a hand when asked who wanted to do the honors with the final release...
[20:33] * tdonohue notes what robint has said, regarding PeterDietz tickets
[20:33] <robint> sorry, I was off on a wee tangent :)
[20:33] <tdonohue> actually robint, I was going to bring it up as well.
[20:34] * helix84 keeps his head down thinking how he's been over-volunteering recently
[20:34] <tdonohue> (hpottinger, don't worry, we'll get to that big question)
[20:34] * joaomelo (~joaomelo@bl23-41-52.dsl.telepac.pt) has joined #duraspace
[20:34] <tdonohue> But, first, with the PeterDietz tickets...these all have to do with Elastic Search, which is currently 100% undocumented
[20:34] <helix84> welcome joao
[20:35] <joaomelo> hi!
[20:35] <robint> hi
[20:35] <tdonohue> So, my question is...what do we do with ElasticSearch? I'm assuming it'd be too hard to remove now...so, that means it's more than likely in 3.0...but, it's also 100% undocumented as to how to use it or even enable/disable it.
[20:36] * joaomelo_ (~joaomelo@bl23-41-52.dsl.telepac.pt) has joined #duraspace
[20:36] <helix84> joaomelo: 3.0 release was just rescheduled to Nov 27. There was DS-1386 filed today, should I schedule it for 3.0 or 3.1?
[20:36] <kompewter> [ https://jira.duraspace.org/browse/DS-1386 ] - [#DS-1386] Unable to customize OAI 2.0 description in Identify responses - DuraSpace JIRA
[20:37] <hpottinger> I'm sure Peter knows he's on the hook for supporting Elastic Search stats if it ships without documentation :-)
[20:37] <helix84> tdonohue: since we extended the deadline, let's give peter some more time. writing the docs should be simple.
[20:37] <mhwood> Without doco., ElasticSearch is not a feature. It's another technology preview. Unless its status changes between now and the 27th.
[20:37] <helix84> mhwood: it's been TP all along :)
[20:37] <robint> mhwood: agreed
[20:38] <helix84> i actually wanted to try it, but without the docs I didn't even know how to enable it
[20:38] * hpottinger intends to use it, even if it means pestering PeterDietz 24/7 :o)
[20:38] <mhwood> UTSL
[20:38] <tdonohue> ok. that all sounds good. We'll hope Peter is able to get to it before 3.0. If not, it's really not gonna get advertised that it's even there to begin with...so, it's almost a "hidden" technology preview
[20:38] <helix84> hpottinger: think of the children!
[20:39] <robint> :)
[20:39] <tdonohue> That also means, I'm going to take the honors in attempting to *disable* the ElasticSearch XMLUI Aspect (DS-1341). It's "on" by default, and that seems odd to me
[20:39] <kompewter> [ https://jira.duraspace.org/browse/DS-1341 ] - [#DS-1341] ElasticSearch should not be enabled by default &amp; no docs exist. - DuraSpace JIRA
[20:39] * helix84 explains: Peter just had another baby, that's why he's MIA
[20:40] * joaomelo (~joaomelo@bl23-41-52.dsl.telepac.pt) Quit (Ping timeout: 246 seconds)
[20:40] <helix84> tdonohue: +1, sounds like a good idea anyway
[20:40] <robint> tdonohue: thanks, I wonder where it is currently visible
[20:40] <tdonohue> yep, I recall that Peter does have a new family member...so, hopefully he will get around to this stuff when he returns
[20:40] <tdonohue> robint -- no idea where it is visible. I just happened to notice one day that the Aspect is enabled by default. So, I'll disable it
[20:41] <helix84> i mean it's a good idea because it doesn't make much sense to have both the Solr and ES listener enabled at the same time
[20:41] <helix84> by default, that is
[20:41] <tdonohue> yep, my thoughts exactly, helix84
[20:42] <tdonohue> Ok..I think that's everything I had for 3.0 questions. The last one is...who would like to do the honors of cutting the release?
[20:43] <tdonohue> (If we don't decide this today, we could email about it on dspace-release as well...there are several extra steps with the "final release", including ensuring the wiki docs get exported to PDF, etc.)
[20:43] <helix84> ... and you could hear a pinhead drop ...
[20:43] <hpottinger> it's fun, it's easy, you'll be famous... :-)
[20:43] <robint> sands is not here, hmm
[20:44] <helix84> we still have another DevMtg to plan that
[20:44] <hpottinger> aw, nuts, I'll do it
[20:44] <robint> hpottinger: You are the man !
[20:44] <helix84> you're it! :)
[20:44] <mhwood> Kermit: Yaaaaaayyyyy!!!
[20:44] <tdonohue> hpottinger sounds good. Just make sure to do it from your linux box (which worked great) and not your Mac (which we never figured out) ;)
[20:45] <hpottinger> you all are missing out on a good time, is all I'm sayin' :-)
[20:45] <helix84> ok, time for that google scholar issue?
[20:46] <helix84> http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg19303.html
[20:46] <kompewter> [ [Dspace-tech] google scholar linking to extracted pdf text? ] - http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg19303.html
[20:46] <tdonohue> Sure, on to the Google Scholar issue (There's a few other tasks that need to be done before the release around docs -> PDF...but, I can send an email to dspace-release about them...)
[20:46] <joaomelo_> DS-1386 seems like a feature, not a bug
[20:46] <kompewter> [ https://jira.duraspace.org/browse/DS-1386 ] - [#DS-1386] Unable to customize OAI 2.0 description in Identify responses - DuraSpace JIRA
[20:47] <helix84> joaomelo_: it's up to you whether you want it in 3.0 or 3.1 because there's still time
[20:47] <joaomelo_> hmm
[20:48] <tdonohue> So, with regards to the Google Scholar issue (recent threads on dspace-tech, including today). Has anyone else seen this? I'm wondering if this is a widespread thing or not
[20:48] * robint (52292725@gateway/web/freenode/ip. Quit (Ping timeout: 245 seconds)
[20:49] * tdonohue notes that Google Scholar now supports "site:[your-site-url]" searches...so you should be able to quickly see how Google Scholar is indexing your PDFs and whether it is grabbing extracted text for some
[20:49] <joaomelo_> helix: ok currently i'm working in the documentation, then i'll take a look
[20:49] <aschweer> I see some .pdf.txt results in Google Scholar, but no ORE harvesting in October/November at least
[20:50] <helix84> all PDFs for me
[20:50] <helix84> joaomelo_: great, thanks
[20:50] <tdonohue> yea, I'm not sure if this is entirely related to ORE or OAI-PMH harvesting....the latest email from Reinhard said that's what he's seeing, but I'm not sure if that's the cause or not
[20:50] * joaomelo_ is now known as joaomelo
[20:50] <tdonohue> The other older thread on this is here: http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg18831.html
[20:50] <kompewter> [ Re: [Dspace-tech] Extracted text showing in Google scholar search result ] - http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg18831.html
[20:51] <helix84> well i was thinking - is the issue whether TXT are accessible or just whether Google Scholar should give priority to the PDFs? Because I'd answer the first question no, it's how we designed it.
[20:51] <tdonohue> In this previous thread, PeterDietz said he thought it had to do with the HTTP 301 Redirect that the XMLUI does by default...but, I'm also not sure if that's the full issue here
[20:52] <helix84> also notice the second email in the thread from Reinhard
[20:53] <tdonohue> helix84 -- my first question is "how is Google Scholar even finding the extracted TXT files?" They are not linked to off the main Item page. So, that's the one thing that implies maybe it is indexing some other interface (like OAI-PMH/ORE)
[20:53] <joaomelo> i think that OAI is currently exporting those TXT files
[20:54] <aschweer> It could be getting them from the mets.xml files, that's what I always thought. I guess the question is -- does this happen in JSPUI as well? (my repos are all XMLUI)
[20:54] <joaomelo> but only for a specific schema (XOAI schema)
[20:54] <tdonohue> In general though, I'm just trying to gather more information here... I'd be glad to reach out to Anurag again if needed. The one issue here is that I don't have a production DSpace instance being indexed by Google Scholar, so I cannot answer any questions he may have about logs, etc.
[20:54] <helix84> just an idea - but remember that there's still the link to METS in the HTML comments. maybe Google just got that aggressive about indexing. the OAI interface isn't really "hidden" from Google, either.
[20:54] <tdonohue> aschweer -- I think so far, all examples that have been found are XMLUI
[20:54] <aschweer> which would then point away from OAI
[20:55] <tdonohue> aschweer -- so, you could be right...maybe it's somehow indexing the mets.xml
[20:55] <aschweer> as helix84 said, the link is there -- so if they really want to, they can get the links from there
[20:55] <tdonohue> hmm...yea, I forgot that METS link is in a comment
[20:56] <helix84> i have dspace in scholar, I have XMLUI but also OAI
[20:56] <aschweer> I forgot, what's the status on allowing access to the mets.xml files only from localhost?
[20:56] <helix84> anyone here in scholar without OAI enabled?
[20:56] <KevinVdV> Need to run until next week
[20:56] * KevinVdV (~KevinVdV@d54C154B1.access.telenet.be) Quit (Quit: Leaving)
[20:56] <helix84> aschweer: nobody volunteered
[20:56] <aschweer> helix84: ah :)
[20:56] <tdonohue> yep, that ticket is still open
[20:57] <hpottinger> I'm looking through my archive of dspace-tech, and nothing is jumping out at me, I'm not really following the thread here...
[20:57] <tdonohue> hpottinger..which thread, this conversation we're having? we're talking about http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg19303.html
[20:57] <kompewter> [ [Dspace-tech] google scholar linking to extracted pdf text? ] - http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg19303.html
[20:58] <hpottinger> oh, aha, tdonohue posted a link up there...
[20:58] <tdonohue> and this earlier thread too http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg18831.html
[20:58] <kompewter> [ Re: [Dspace-tech] Extracted text showing in Google scholar search result ] - http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg18831.html
[20:58] <aschweer> I definitely see Googlebot accessing lots and lots of .pdf.txt files. Of course it isn't helpful enough to give a referer...
[20:58] <tdonohue> (i.e. there were 2 separate threads on dspace-tech describing this issue)
[20:58] <helix84> might be 2 separate issues, though. not sure yet.
[20:59] <tdonohue> could be separate issues, but they are eerily similar that they sound related to me
[20:59] <tdonohue> (which is why I wanted to get a sense of what others may have seen in Google Scholar)
[21:00] <helix84> let me ask again - is the issue whether TXT are accessible or just whether Google Scholar should give priority to the PDFs? Because I'd answer the first question no, it's how we designed it.
[21:00] <hpottinger> hmm... found a link to a /statistics page... which mentions pdf.txt
[21:01] <aschweer> I'd say it's the priority thing but I'm tempted to just make all my TEXT bundles private, I see no reason for them to be open
[21:01] <tdonohue> helix84 -- we could make the TXT inaccessible..that's an option. But, the root issue here is that Google Scholar seems to be ignoring the "easy to find PDF"
[21:02] <helix84> ok, i just wanted to clear that up
[21:02] <aschweer> I see no external access to mets.txt files in the last 3 weeks or so, even though there was Googlebot activity in that time
[21:04] <aschweer> gah and maybe if I look for mets.xml it'd be more useful...
[21:04] <hpottinger> is it ignoring it, or honoring its understanding of what a 301 redirect means?
[21:04] <tdonohue> we could just "fix" this by making the extracted TXT inaccessible to anonymous. I'm not really certain why it would ever need to be accessible publicly anyways..
[21:04] <tdonohue> hpottinger -- that's a good question actually
[21:05] <hpottinger> maybe 302/found would be better?
[21:05] <tdonohue> it is worth noting that the 301 Redirect that PeterDietz noted is a *default setting* in the XMLUI only...it's there because the bitstream URL structure of XMLUI is *different* from the JSPUI
[21:06] <aschweer> No external access to mets.xml in the last 4 weeks either. My customised stats don't expose .pdf.txt links, so I'm still curious how Google Scholar even finds them
[21:07] <tdonohue> In case people don't understand my statement about that 301 Redirect... It happens here: https://github.com/DSpace/DSpace/blob/master/dspace-xmlui/src/main/webapp/sitemap.xmap#L355
[21:07] <kompewter> [ DSpace/dspace-xmlui/src/main/webapp/sitemap.xmap at master · DSpace/DSpace · GitHub ] - https://github.com/DSpace/DSpace/blob/master/dspace-xmlui/src/main/webapp/sitemap.xmap#L355
[21:08] * hpottinger reaches for his ancient Cocoon book...
[21:09] <tdonohue> So, if the 301 Redirect is part of the problem, we can fix it in the DSpace codebase....either by changing it so it's not a 301 Redirect, or seeing if we can get the XMLUI to display the actual final URL in the <meta> tags that Google Scholar uses, *instead of* putting the 301 Redirect URL there
[21:10] <helix84> The fact remains that this would be a problem for a lot of dspace installations and it's in the interest of Google to provide links to PDFs, not TXTs. So we should probably let them know about that.
[21:11] <tdonohue> hpottinger...yea, this is a bit of Cocoon magic...but, the comment describes what it is doing..it's taking a URL of the structure "/bitstream/[handlePrefix]/[handlePostfix]/[sequence]/[name]" (JSPUI URL) and redirecting it to "/bitstream/handle/[handlePrefix]/[handlePostfix]/[name]?sequence=[sequence]" (which is the XMLUI URL structure)
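The mapping tdonohue describes can be sketched outside Cocoon as a plain path rewrite. This is only an illustration of the two URL structures quoted above, not the sitemap.xmap matcher itself; the function name and the sample handle values are hypothetical.

```python
import re

# JSPUI-style path: /bitstream/[handlePrefix]/[handlePostfix]/[sequence]/[name]
# The lookahead skips paths that already use the XMLUI "/bitstream/handle/..." form.
_JSPUI_BITSTREAM = re.compile(
    r"^/bitstream/(?!handle/)(?P<prefix>[^/]+)/(?P<postfix>[^/]+)"
    r"/(?P<sequence>\d+)/(?P<name>.+)$"
)


def jspui_to_xmlui_path(path):
    """Rewrite a JSPUI bitstream path to the XMLUI structure, or return None."""
    match = _JSPUI_BITSTREAM.match(path)
    if match is None:
        return None
    # XMLUI-style path: /bitstream/handle/[prefix]/[postfix]/[name]?sequence=[sequence]
    return "/bitstream/handle/{prefix}/{postfix}/{name}?sequence={sequence}".format(
        **match.groupdict()
    )


if __name__ == "__main__":
    # Hypothetical handle 10289/1234, bitstream sequence 3.
    print(jspui_to_xmlui_path("/bitstream/10289/1234/3/thesis.pdf.txt"))
```

Both forms address the same bitstream, which is why a crawler can end up holding two URLs for one file.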
[21:12] <hpottinger> http://cocoon.apache.org/2.2/core-modules/core/2.2/843_1_1.html
[21:12] <kompewter> [ Cocoon Core - map:redirect-to ] - http://cocoon.apache.org/2.2/core-modules/core/2.2/843_1_1.html
[21:12] <helix84> what i'm saying is even if changing 301 to 302 would fix this, there still will be a lot of time until the fix is deployed on all the affected installations
[21:12] <aschweer> GoogleMetadata constructs the citation_pdf_url piecemeal anyway, so it shouldn't be a problem to change that to the ?sequence format
[21:12] <aschweer> (but I guess that'd break compatibility with JSPUI)
[21:12] <tdonohue> helix84 -- yea, I agree we should let Google Scholar know about it... I'm just trying to first determine the extent of this issue, so we can describe it better to Google Scholar
[21:13] <hpottinger> permanent="yes" = 301
[21:15] <aschweer> Hm, I see bingbot and Baiduspider crawling .pdf.txt as well. I'd really like to know where they get the links for those.
[21:16] <tdonohue> huh..interesting, aschweer
[21:16] <tdonohue> can you grep your logs and see what that same bot IP address indexed just *before* the .pdf.txt? I know that's not always the referrer (some bots jump around a bit), but it might help
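One way to do the check tdonohue asks for is to walk the access log once and remember, per client IP, the last path requested before each .pdf.txt hit. The sketch below assumes Apache common/combined log format; the function name is hypothetical and the regular expression only covers the leading fields of each line.

```python
import re
from collections import defaultdict

# Leading fields of an Apache common/combined log line:
#   IP ident user [timestamp] "METHOD path protocol" ...
_LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*"'
)


def requests_before_extracted_text(lines, suffix=".pdf.txt"):
    """Map each client IP to a list of (previous_path, txt_path) pairs.

    previous_path is the request the same IP made immediately before it
    fetched a path ending in `suffix` (None if that was its first request).
    """
    last_path = {}
    found = defaultdict(list)
    for line in lines:
        match = _LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ip, path = match.group("ip"), match.group("path")
        # Ignore the query string (e.g. "?sequence=3") when checking the suffix.
        if path.split("?", 1)[0].endswith(suffix):
            found[ip].append((last_path.get(ip), path))
        last_path[ip] = path
    return dict(found)
```

As noted in the discussion, the previous request is not necessarily the referrer, since bots jump around, so the output is a hint rather than proof of where the link came from.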
[21:17] <tdonohue> In general here, it sounds like we've hit upon some various "rough theories" but nothing beyond that. I'd suggest maybe we all just keep an eye out for more Google Scholar oddities (and report any you find to dspace-tech) in our own servers.
[21:18] <tdonohue> I can also send off an email to Anurag @ Google Scholar to let him know that we've now had several reports of these issues. If any of you have examples from your own DSpace instance, please send them along as well. It'd be good to give Google Scholar lots of examples of the issue.
[21:18] <hpottinger> for 3.0, do we drop permanent="yes"?
[21:19] <helix84> hpottinger: i wouldn't hurry with that, we don't even know if that would help with anything
[21:19] <tdonohue> hpottinger -- to be clear, if we remove "permanent='yes'" from those redirects, does that turn it into a 302 redirect?
[21:19] <hpottinger> I mean, clearly the crawlers are re-crawling anyway, so the benefit of the 301 isn't working out
[21:20] <hpottinger> oh, I'd need to test it, just wanted to get a read on whether it was worth it to try
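Whether dropping permanent="yes" really turns the redirect into a 302 can be checked from outside by requesting a JSPUI-style bitstream URL without following the redirect and looking at the status code. A minimal sketch, assuming only the Python standard library; the function name is hypothetical and the URL to probe has to come from your own instance.

```python
import http.client
from urllib.parse import urlsplit


def redirect_info(url, timeout=10):
    """Return (status, Location header) for url without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        # http.client never follows redirects, so a 301/302 is reported as-is.
        conn.request("GET", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

A status of 301 means the redirect is still permanent; 302 would confirm the change took effect. Location is None when the server answers without redirecting.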
[21:21] <tdonohue> it could be worth trying I guess. I'm hesitant to also implement it immediately though, as we still haven't determined if that's really the issue at hand
[21:22] <aschweer> From bingbot, I see several that jump straight to one particular .pdf.txt, but usually after it's seen the item page first. It does go both to the ?sequence and the other link version for the .pdf.txt link, with 24+ hours in between
[21:22] <tdonohue> So, I think the next steps are just to find as many examples as we can...and then start to try and narrow down the issue as best we can
[21:23] <aschweer> I'm happy to share examples from my logs -- just to be honest, I'm not even sure exactly what would make a good example
[21:23] <tdonohue> As I said, I can also email Anurag about this oddity... Maybe we should start up a JIRA ticket to capture all this stuff together in one place?
[21:23] <hpottinger> it would be helpful in discussions if we had a Jira ticket...
[21:24] <hpottinger> jinx
[21:24] <tdonohue> I think a "good example" is any item you notice that Google Scholar is linking to the ".pdf.txt" instead of the ".pdf"...for example, Reinhard noted that there are 3 such examples on the first page of results from this Google Scholar search: http://scholar.google.com/scholar?hl=en&q=shieber+violently+broccoli&btnG=&as_sdt=1%2C22&as_sdtp=
[21:25] <tdonohue> In those search results, scan for the items that have a link to "[TXT] from [url]" -- those are all DSpace sites...and I've checked...all are running XMLUI
[21:25] <aschweer> tdonohue: ok, I'll dig some up. Do you think some apache log snippets for bot activity on these items would help too?
[21:25] <tdonohue> it could..not sure, aschweer, to be honest. If you can actually determine what the bot looks to be doing, it may be very helpful.
[21:26] <aschweer> I've got to run now, but I'll do some digging around in the Scholar search results and the logs later today
[21:26] <tdonohue> I'm gonna close up this meeting now -- I feel like we don't have much more to discuss. But, I'll go start up a ticket for this & encourage everyone to add whatever you find
[21:27] <aschweer> thanks tdonohue
[21:27] <aschweer> bye all
[21:27] * aschweer (~schweer@schweer.its.waikato.ac.nz) Quit (Quit: leaving)
[21:27] * bollini (~chatzilla@host252-210-dynamic.8-79-r.retail.telecomitalia.it) Quit (Quit: ChatZilla 0.9.89 [Firefox 16.0.2/20121024073032])
[21:48] * hpottinger (~hpottinge@mu-162198.dhcp.missouri.edu) Quit (Quit: Later, taterz!)
[22:04] * mhwood (mwood@mhw.ulib.iupui.edu) Quit (Remote host closed the connection)
[22:26] * helix84 (a@ has left #duraspace
[22:51] * tdonohue (~tdonohue@c-50-129-94-92.hsd1.il.comcast.net) Quit (Read error: Connection reset by peer)
[22:55] * joaomelo__ (~joaomelo@bl23-41-52.dsl.telepac.pt) has joined #duraspace
[22:58] * joaomelo (~joaomelo@bl23-41-52.dsl.telepac.pt) Quit (Ping timeout: 256 seconds)

These logs were automatically created by DuraLogBot on irc.freenode.net using the Java IRC LogBot.