#duraspace IRC Log


IRC Log for 2015-03-04

Timestamps are in GMT/BST.

[6:42] * DuraLogBot (~PircBot@ec2-107-22-210-74.compute-1.amazonaws.com) has joined #duraspace
[6:42] * Topic is '[Welcome to DuraSpace - This channel is logged - http://irclogs.duraspace.org/]'
[6:42] * Set by cwilper!ad579d86@gateway/web/freenode/ip. on Fri Oct 22 01:19:41 UTC 2010
[9:49] * pbecker (~pbecker@ubwstmapc098.ub.tu-berlin.de) has joined #duraspace
[13:53] * tdonohue (~tdonohue@c-98-215-0-161.hsd1.il.comcast.net) has joined #duraspace
[14:23] * awoods (~awoods@c-67-165-245-76.hsd1.co.comcast.net) has joined #duraspace
[14:56] * hpottinger (~hpottinge@mu-162188.dhcp.missouri.edu) has joined #duraspace
[15:33] * peterdietz (uid52203@gateway/web/irccloud.com/x-vovzoysayfagdkon) has joined #duraspace
[17:50] * pbecker (~pbecker@ubwstmapc098.ub.tu-berlin.de) Quit (Quit: Leaving)
[18:29] * hpottinger (~hpottinge@mu-162188.dhcp.missouri.edu) Quit (Quit: Leaving, later taterz!)
[18:46] * srobbins (~Adium@mobile-130-126-255-245.near.illinois.edu) has joined #duraspace
[19:15] * srobbins (~Adium@mobile-130-126-255-245.near.illinois.edu) Quit (Quit: Leaving.)
[19:30] * hpottinger (~hpottinge@mu-162188.dhcp.missouri.edu) has joined #duraspace
[20:01] <tdonohue> Hey all, it's time for our weekly DSpace Developers Meeting : https://wiki.duraspace.org/display/DSPACE/DevMtg+2015-03-04
[20:01] <kompewter> [ DevMtg 2015-03-04 - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DevMtg+2015-03-04
[20:02] <tdonohue> I'm going to admit, today's agenda is a tad light...so, there's plenty of time for "open discussion" to occur
[20:02] <tdonohue> It also looks like we have a small attendee list (but maybe others are "playing along" via our IRC logs)
[20:03] <tdonohue> First up, as you all surely know & saw. DSpace 5.1, 4.3 and 3.4 were all released last week (and announced on lists). Go forth and upgrade (and encourage others to do so!)
[20:03] <hpottinger> yay!!!!!!! :-)
[20:04] <peterdietz> Good to have that fix out there. Having a "dot" release out, should help adoption
[20:05] <tdonohue> Second, I'm going to be out-of-the-office (but still on email) for much of next week (Tues Mar 10 through Thurs, Mar 12). If you haven't heard, the yearly "DuraSpace Member Summit" is next week in Washington, DC. I'll be attending and reporting back on what goes on.
[20:05] * srobbins (~Adium@mobile-130-126-255-245.near.illinois.edu) has joined #duraspace
[20:05] <tdonohue> But, that does mean that I won't be able to attend this DevMtg next week (Weds, Mar 11). Is there anyone who'd be willing to "lead the meeting" (or at least call the meeting to order)?
[20:06] <hpottinger> I'm sure we can find a volunteer on that day, mhwood often leads the meetings if you have to be absent
[20:07] <tdonohue> Because of the small attendance today, I think it basically means that hpottinger & peterdietz fight over who leads next week's meeting (or assign it to someone else not attending)
[20:07] <hpottinger> peterdietz, you want the gavel?
[20:07] <peterdietz> You can have it if you want it. I can take it otherwise
[20:08] * hpottinger puts a large yellow post-it note, "HOLD FOR MHWOOD" on the gavel, hands it back to tdonohue.
[20:08] <tdonohue> Or, as you suggested, you all can "self organize" next week. :) Honestly, there's not a ton pressing to discuss, but you can always just have open discussion (on whatever is on your minds) or do some more JIRA reviews, etc
[20:10] <tdonohue> So, I think we have a "good enough" solution for next week's meeting
[20:11] <hpottinger> Whenever we have more people in the room, we can discuss when we want to bump all our dependency versions
[20:11] <tdonohue> Beyond that, honestly, I'm leaving today mostly for "open discussion". Anything DSpace related you all want to talk about? Recent issues/questions? Possible 6.0 features/brainstorms? Other stuff?
[20:12] <peterdietz> There ought to be the eventual discussion on Author Profiles / CRIS, plenty of time before OR15
[20:12] <hpottinger> DS-2473
[20:12] <kompewter> [ https://jira.duraspace.org/browse/DS-2473 ] - [DS-2473] Bump dependency versions in DSpace:master - DuraSpace JIRA
[20:13] <hpottinger> right, now's the time to talk about big architecture/design stuff
[20:13] <tdonohue> peterdietz: good point & reminder. We do need to resolve the "Author Profiles & CRIS" discussions
[20:14] <tdonohue> I'm pretty sure both CINECA and @mire staff (higher ups though, I think) will be at the DuraSpace Summit next week. I could always bring up the Author Profiles & CRIS stuff with them again as a reminder too
[20:14] <hpottinger> in searching for 2473, I also found DS-2264 which has an embedded TODO
[20:14] <kompewter> [ https://jira.duraspace.org/browse/DS-2264 ] - [DS-2264] Lock DSpace 5.x dependency versions for Mirage 2 - DuraSpace JIRA
[20:16] <tdonohue> Do we want to talk DS-2473 then?
[20:16] <kompewter> [ https://jira.duraspace.org/browse/DS-2473 ] - [DS-2473] Bump dependency versions in DSpace:master - DuraSpace JIRA
[20:16] * srobbins (~Adium@mobile-130-126-255-245.near.illinois.edu) Quit (Quit: Leaving.)
[20:17] <tdonohue> Honestly, with 2473, my opinion is we should just start upgrading things that "look reasonable" (but perhaps avoid major version upgrades, as those tend to require code changes) and see if anything breaks
[20:18] <hpottinger> DS-2266 is the "TODO" they were good
[20:18] <tdonohue> Anything that is a major version upgrade may require more care...as it's more likely we'll need code changes/tweaks
[20:18] <kompewter> [ https://jira.duraspace.org/browse/DS-2266 ] - [DS-2266] Rely on latest Mirage 2 dependency versions for DSpace 6 development - DuraSpace JIRA
[20:18] <peterdietz> sounds fair. Things like bumping javax servlet from 2.5 to 3 or 3.1 ought to be handled with care
[20:19] <tdonohue> hpottinger: 2266 doesn't sound "safe" to me. We should analyze it similar to 2473....and we shouldn't just let DSpace (even if it is "master") just download the latest version of any dependencies
[20:20] <tdonohue> Since we always release major versions from "master" we'd constantly need to "rollback" any Mirage 2 latest dependencies prior to release
[20:21] <hpottinger> Hmm... I dunno, DS-2264 I don't think was intended to be permanent...
[20:21] <kompewter> [ https://jira.duraspace.org/browse/DS-2264 ] - [DS-2264] Lock DSpace 5.x dependency versions for Mirage 2 - DuraSpace JIRA
[20:22] <hpottinger> anyway, it's dependency update season
[20:22] <hpottinger> and big changes should step on up :-)
[20:23] <tdonohue> hpottinger: not saying that 2264 was intended to be "permanent". I'm just saying that having "bootstrap-sass-official" version set to "latest" is potentially dangerous...we might forget to remove it and "lock it down" again for 6.0
[20:23] * aschweer (~schweer@schweer.its.waikato.ac.nz) has joined #duraspace
[20:23] <hpottinger> OK, so, possibly we want to skip DS-2266?
[20:23] <kompewter> [ https://jira.duraspace.org/browse/DS-2266 ] - [DS-2266] Rely on latest Mirage 2 dependency versions for DSpace 6 development - DuraSpace JIRA
[20:25] <aschweer> can't we use the same mechanism for Mirage 2 as for the maven dependencies? as in, use fixed version numbers and periodically someone updates those and runs a ton of testing?
[20:25] <aschweer> that round for Mirage 2 should probably just be closer to testathon, while the maven dependencies should come early in the new version (around now, like you just discussed)
[20:25] <hpottinger> I think that may be the best choice, I've added a comment to DS-2266
[20:25] <kompewter> [ https://jira.duraspace.org/browse/DS-2266 ] - [DS-2266] Rely on latest Mirage 2 dependency versions for DSpace 6 development - DuraSpace JIRA
[20:26] <tdonohue> +1 aschweer: that's what I just commented on DS-2266, that we keep versions as fixed numbers
[20:26] <kompewter> [ https://jira.duraspace.org/browse/DS-2266 ] - [DS-2266] Rely on latest Mirage 2 dependency versions for DSpace 6 development - DuraSpace JIRA
[20:27] <aschweer> yup that looks great, thanks tdonohue
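Editor's sketch: the "fixed version numbers" approach agreed on above could be backed by a small check that flags any dependency whose version spec floats. This is only an illustration, not part of DSpace or its build. The function name `unpinned`, the regular expression, and the version numbers in `deps` are all hypothetical; only the package name "bootstrap-sass-official" and the "latest" spec come from the discussion.

```python
import re

# Accept a single exact version only, e.g. "3.3.1" or "1.0.0-beta.2".
# Anything else ("latest", "*", "~1.10.2", "^2.0.0", ranges) counts as floating.
EXACT_VERSION = re.compile(r"^\d+(\.\d+){0,2}([-+][0-9A-Za-z.-]+)?$")


def unpinned(dependencies):
    """Return the sorted names whose version spec is not one exact version."""
    return sorted(
        name
        for name, spec in dependencies.items()
        if not EXACT_VERSION.match(spec.strip())
    )


# Hypothetical dependency map in the style of a bower.json "dependencies" section.
deps = {
    "bootstrap-sass-official": "latest",
    "jquery": "~1.10.2",
    "handlebars": "1.3.0",
}

print(unpinned(deps))  # ['bootstrap-sass-official', 'jquery']
```

A check like this could run before a release so that a "latest" left over from development does not go unnoticed.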
[20:28] <tdonohue> In general though, I'm perfectly fine with anyone starting to "claim" DS-2473 (or this 2266 Mirage2 ticket) and upgrading dependencies. Now is the time to start trying this out, and discover which are easy and which may need more work (i.e. a separate ticket even)
[20:28] <kompewter> [ https://jira.duraspace.org/browse/DS-2473 ] - [DS-2473] Bump dependency versions in DSpace:master - DuraSpace JIRA
[20:28] <hpottinger> oh, hey, somewhat-related question: who wants to build some tests? DS-2288, DS-2397
[20:29] <kompewter> [ https://jira.duraspace.org/browse/DS-2288 ] - [DS-2288] Acceptance test suite - DuraSpace JIRA
[20:29] <kompewter> [ https://jira.duraspace.org/browse/DS-2397 ] - [DS-2397] Sort out Unit vs. Integration tests, and run them separately - DuraSpace JIRA
[20:30] <hpottinger> unit tests will probably catch *some* problems from bumping dependencies, acceptance and integration tests would catch more
[20:31] <tdonohue> I wish I had time right now...but I'd be glad to try to chip in where I can & help review the work (if someone else kicks things off). I think more tests could help in a *lot* of areas
[20:32] <tdonohue> Any other topics on folks' minds?
[20:33] <tdonohue> Oh, and as a sidenote...just a reminder that the OR15 "Developer Track" proposals are due next week (March 13): http://www.or2015.net/developer-track/
[20:33] <kompewter> [ Developer Track at OR2015 | OPEN REPOSITORIES 2015 ] - http://www.or2015.net/developer-track/
[20:33] <aschweer> DS-2212 made me think, we've always had upgrade scripts for the database schema, do we need to make sure there is something like that for the usage stats data?
[20:33] <kompewter> [ https://jira.duraspace.org/browse/DS-2212 ] - [DS-2212] Statistics Shard not working on old records without a uid & cannot recover from error - DuraSpace JIRA
[20:34] <tdonohue> aschweer: we have auto-Solr-index upgrades (in DSpace 5). But, it sounds like 2212 needs more than that
[20:35] <tdonohue> aschweer: have we figured out the proper "fix"? If so, it might be possible to automate in some way
[20:35] <aschweer> tdonohue: and the auto-upgrade is much much better than forcing people to manually upgrade the indexes as we had for people who skipped 3 and went straight to 4
[20:35] <aschweer> there are two issues, one is adding the uid and one is the _version_ mismatch
[20:35] <aschweer> Terry's code works well for adding the uid; the process is a bit awkward and I don't know whether it'd be easy to automate this
[20:36] <aschweer> essentially you need to create a new solr core with the same config as the statistics core, copy over all the solr documents in batches of 100,000 or so, then swap over and use the new core for the stats
[20:37] <aschweer> the _version_ mismatch during the sharding, Terry's code addresses that by simply dropping the _version_ field during the sharding, which to me looks like it won't cause problems since the sharding creates new cores anyway and the _version_ is just supposed to help when existing docs are updated
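Editor's sketch: the per-document work described above (copy in batches, add a uid where one is missing, drop _version_) can be illustrated without a running Solr. This is not Terry's code. The helper names `batches` and `prepare`, the use of a random UUID as the uid value, and the sample documents are assumptions made for illustration; only the field names uid and _version_ come from the discussion.

```python
import uuid
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def prepare(doc):
    """Copy a stats document, dropping _version_ and adding a uid if missing."""
    out = {k: v for k, v in doc.items() if k != "_version_"}
    # Assumption: any unique string is acceptable as a uid; a random UUID is used here.
    out.setdefault("uid", str(uuid.uuid4()))
    return out


# Hypothetical documents standing in for records read from the old core.
old_docs = [
    {"type": 2, "id": 10, "_version_": 1489},
    {"type": 0, "id": 11, "uid": "existing-uid"},
    {"type": 2, "id": 12},
]

for chunk in batches(old_docs, 2):  # the discussion suggests batches of about 100,000
    new_docs = [prepare(d) for d in chunk]
    # A real run would send new_docs to the new core's update handler at this point.
    print(len(new_docs), all("uid" in d and "_version_" not in d for d in new_docs))
```

Existing uids are kept as they are, so re-running the preparation on already migrated documents does not change their keys.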
[20:38] <aschweer> but I'd be a lot more comfortable for someone with more solr knowledge to look this over (preferably @mire since they enabled the _version_ field in the first place)
[20:38] <tdonohue> yuck. So, do we know the exact "cause" here? Is this actually a DSpace problem? Is it more a Solr data upgrade issue? (It's sounding like the latter, but I need to dig in more here too)
[20:38] <aschweer> I guess what I would like to see is a commitment from us that schema changes in the solr stats won't happen unless we also provide support for people to upgrade their data
[20:39] <aschweer> tdonohue: the root cause of the uid issue is that there was a schema change in DSpace 3 that introduced the uid field. before then, stats docs had no unique keys.
[20:39] * awoods (~awoods@c-67-165-245-76.hsd1.co.comcast.net) Quit (Ping timeout: 245 seconds)
[20:40] <aschweer> however, when the uid field was added to the schema, no attempt was made to retrospectively add uids to the existing docs (and the delete and re-index approach is the only one that works, since you can't uniquely identify a document to change if it doesn't have a unique key)
[20:40] <tdonohue> Oh, I see. Yea, that should have come with a way to "fix your data" in the DSpace 3 -> 4 upgrade
[20:40] <aschweer> in the 1.8 -> 3 upgrade, but yes
[20:40] <tdonohue> oh, right
[20:40] <aschweer> then the _version_ field was added later, I'm not sure why
[20:41] <aschweer> which is why it'd be great for someone who was involved in that decision to figure out what is going on
[20:41] <aschweer> for some reason, when Terry's code is run to retrospectively add the uids, the sharding process fails with a _version_ mismatch error
[20:41] <aschweer> see http://heliosearch.org/solr/optimistic-concurrency/ for info on _version_
[20:41] <kompewter> [ Optimistic Concurrency Solr Tutorial ] - http://heliosearch.org/solr/optimistic-concurrency/
[20:41] <tdonohue> Could you add those "analysis" details to the ticket? I don't see any mention of when these fields were added (1.8 -> 3 upgrade), and therefore what DSpace instances may encounter this
[20:42] <aschweer> oh right, I think it's just in my e-mail to -tech/-dev that is referenced from the ticket
[20:43] <tdonohue> i.e. the ticket is a bit "confusing" to me, as it was unclear (until now) the exact *root cause* even though Terry has a fix. Admittedly though, I don't know this area of the code very well
[20:43] <aschweer> so there may be a third issue, which is the docValues one. That's the third change to the solr schema, made recently. For me, it meant after the 4->5 upgrade, geographic information in the usage stats only covered the time since the upgrade
[20:44] <aschweer> running Terry's code fixed that for me, but that is to be expected since the re-index created the docValues
[20:44] <aschweer> and I have tons of local customisations in the usage stats
[20:45] <aschweer> so it would be good to know if anyone else has upgraded to 5 and is using the solr usage stats and can check whether the geographic information (downloads by city/country) shows data from all time or just since the upgrade
[20:45] <tdonohue> Sounds like we need to clarify here where the "problem points" are...then we might want to make @mire (and others) more aware of these "problem points" to see if they can help us resolve them in an easier manner.
[20:45] <aschweer> tdonohue agree
[20:45] <aschweer> from my point of view, there is one overarching problem - making changes to the solr schema without big fat warnings / providing help for people to migrate their existing data
[20:46] <hpottinger> I wouldn't mind testing.... but I have no idea how to copy a live solr core to another box
[20:46] <aschweer> then there are 3 examples of the problems this causes - the uids, the _version_, the docValues. Though it might be that there is really only the uid problem and the other two are flow-on
[20:46] <tdonohue> aschweer: would you be willing to potentially split these "problem areas" out into separate tickets? Or clean up the existing one (or potentially add subtickets)
[20:46] <aschweer> hpottinger: I just e-mailed you :)
[20:47] <tdonohue> aschweer: I completely agree that we need to look *much closer* at any changes to Solr Schemas. It's likely we all "assumed" things would be OK, but now we've learned otherwise
[20:47] <aschweer> tdonohue: I can try. I agree that this isn't clear at all in Jira at the moment. My issue is that my understanding of this is still evolving too
[20:47] <aschweer> for copying a solr stats core to a dev machine: copy the solr/statistics/data directory contents and delete the write.lock file that is somewhere in that directory (unless you stop tomcat before copying) -- you might lose the last few hits, so if you're worried, just force a commit on the stats core first
[20:48] <aschweer> to force a commit, run eg curl --globoff 'http://localhost:8080/solr/statistics/update?commit=true'
[20:48] <tdonohue> aschweer: makes perfect sense. I guess I just want to be sure anything you "discover" / learn gets tracked (at some point) ;) But, if you need more time to dig, feel free
[20:48] <aschweer> you might want to synch the db content first so that the item IDs etc resolve properly
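Editor's sketch: the copy steps described above (force a commit, copy the data directory, get rid of write.lock) might look roughly like this. It is an illustration under assumptions, not a tested DSpace procedure. The helper names `commit_url` and `copy_core_data` and the demo directory layout are hypothetical, and skipping write.lock during the copy stands in for deleting it afterwards.

```python
import shutil
import tempfile
from pathlib import Path


def commit_url(base="http://localhost:8080/solr", core="statistics"):
    """Build the URL that forces a commit on a core, as in the curl example above."""
    return f"{base}/{core}/update?commit=true"


def copy_core_data(src, dst):
    """Copy a core's data directory, leaving out any write.lock file."""
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("write.lock"))
    return sorted(
        p.relative_to(dst).as_posix() for p in Path(dst).rglob("*") if p.is_file()
    )


# Demo on a throwaway directory tree that only mimics a core's data directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "live" / "data"
    (src / "index").mkdir(parents=True)
    (src / "index" / "segments_1").write_text("fake segment")
    (src / "index" / "write.lock").write_text("")
    copied = copy_core_data(src, Path(tmp) / "dev" / "data")

print(commit_url())  # http://localhost:8080/solr/statistics/update?commit=true
print(copied)        # ['index/segments_1']
```

On a real system the source would be the solr/statistics/data directory of the live install, and the commit URL would be requested before the copy starts.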
[20:49] <aschweer> I'll give it a go on Jira, but I will definitely need help with testing -- as I said, our stats are so customised that I can't be sure how much of this is just "our problem" and how much affects everyone with stats data going back to DSpace 1.8
[20:50] <tdonohue> Thanks aschweer. I'm sure we can find some testers to help out. It sounds like (from 2212 and the stackoverflow issue) others are having this issue too: http://stackoverflow.com/questions/26941260/normalizing-solr-records-for-sharding-version-issues
[20:50] <kompewter> [ dspace - Normalizing SOLR records for sharding: _version_ issues - Stack Overflow ] - http://stackoverflow.com/questions/26941260/normalizing-solr-records-for-sharding-version-issues
[20:51] <aschweer> yes, the uid / _version_ one is definitely something others have run into too
[20:51] <aschweer> that one should affect everyone with solr stats data from 1.8 and now running 3, 4 or 5
[20:52] <aschweer> the docValues one as described in my e-mail, I'm not sure if that affects everyone with solr stats data from pre-5 now running 5, or just everyone also affected by the uid issue, or just us
[20:52] <aschweer> http://dspace.2283337.n4.nabble.com/DSpace-5-and-solr-usage-statistics-data-tt4676837.html
[20:52] <kompewter> [ DSpace - Tech - DSpace 5 and solr usage statistics data | Threaded View ] - http://dspace.2283337.n4.nabble.com/DSpace-5-and-solr-usage-statistics-data-tt4676837.html
[20:53] <tdonohue> good question on the "docValues". I don't know the answer, but it is worth keeping an eye out for
[20:55] * awoods (~awoods@c-67-165-245-76.hsd1.co.comcast.net) has joined #duraspace
[20:56] <tdonohue> Any other final notes / thoughts that anyone has to mention? I notice we are down to <5mins
[20:56] <tdonohue> And aschweer, thanks for sharing what you've discovered. Hopefully we can get some sort of resolution to this problem soon. I definitely didn't realize the extent of the issue for older upgrades
[20:57] <aschweer> thanks tdonohue, I have some very frustrated repo managers wondering whether none of the devs care about usage statistics and I guess it is a fair point to ask whether we're committed to keeping that data safe across upgrades
[20:58] <tdonohue> I think we *are* committed to keeping the data safe. I think we just never realized it wasn't happening "automatically" already (and we need to be much more careful about changing the Solr Schemas going forward)
[20:59] <hpottinger> hands up who tests with a copy of a live stats core?
[20:59] * aschweer puts hand up
[21:00] <aschweer> (looks like tumbleweeds otherwise, which may explain why nobody else has run into this before -- though I guess there aren't that many folks around today)
[21:01] <tdonohue> With 5.0, I admit to testing that the *Solr index version* upgraded successfully.... I didn't think to test that old data would auto-migrate (as we didn't make any Solr schema changes in 5.0, I had assumed that was already "taken care of")
[21:01] <aschweer> the docValues change was in 5.0, wasn't it?
[21:02] <tdonohue> hmmm...and it's possible I overlooked that :/
[21:02] <aschweer> https://github.com/DSpace/DSpace/commit/8e2f87e75548b48ad44c6257b47bf45af3e5b4ef
[21:02] <kompewter> [ add docvalues to docvalue capable fields for better facet performance · 8e2f87e · DSpace/DSpace · GitHub ] - https://github.com/DSpace/DSpace/commit/8e2f87e75548b48ad44c6257b47bf45af3e5b4ef
[21:02] <aschweer> it changes the way these fields are indexed, which means facets will only include those documents that have docValues. Now I don't know whether it would automatically create docValues for old docs if they had a uid.
[21:03] <aschweer> I suppose I could try adding the uids to a DSpace 4, then do the upgrade to 5 including the solr auto-upgrade and see whether the facets look better
[21:04] <tdonohue> I'll readily admit, my attention was much more on "making sure the database upgraded, and the Solr index version upgraded". My (wrong) assumption was that anything that changed the Stats Solr Schema was taking care of updating existing data.
[21:05] <tdonohue> So, apologies if I overlooked it myself. I definitely think we need to find a better way to "refresh"/upgrade the Stats data...with Discovery we can just kick off a full reindex (which actually happens automatically now with the 5 upgrade). But, that's not possible with Stats (obviously)
[21:06] <aschweer> yes, I agree. OAI is in the same boat as Discovery, so no problem there either. The authority core might have similar issues to statistics, in that it is the authoritative data source, not just an index into data that exists elsewhere
[21:07] <hpottinger> good catch
[21:07] <aschweer> and I guess another question is, might the Elasticsearch stats be affected by something like this one day too (I don't know anything about Elasticsearch, but again it is the authoritative source of usage stats data when enabled)
[21:09] <aschweer> sorry folks, I need to be elsewhere. I've started on sorting out the Jira issues around this, hopefully I'll get that finished today. Thanks for the discussion tdonohue and hpottinger, it's good to see that we're on the same page re keeping the data safe :)
[21:10] <tdonohue> While it's not *ideal*, I recall that in DSpace 5, it's now possible to *export* Solr Statistics (stats-util -e). I wonder if a complete export & reimport could help "refresh" things (but that could take a long time for a lot of data, I know)
[21:10] <aschweer> no, the export / re-import discards some information, if I saw correctly
[21:10] <tdonohue> ok, darn
[21:10] <tdonohue> (I was hoping maybe it was something "easy" we could recommend, even if it's not ideal yet)
[21:11] <aschweer> I believe it discards eg the user agent, geographical information (can be re-looked-up but if it's 2010 data, maybe not so awesome)
[21:11] <tdonohue> ok, good to know
[21:12] <hpottinger> I know I've tried the "just copy and delete lock files" approach before, but maybe if I just try again, I'll have luck.
[21:12] <aschweer> We could change the export code to export all fields, then delete all data , then re-import.
[21:12] <tdonohue> Well, since we are "over time" here, we may as well close out the meeting for today. Definitely sounds like DS-2212 needs more attention from anyone who can help!
[21:12] <kompewter> [ https://jira.duraspace.org/browse/DS-2212 ] - [DS-2212] Statistics Shard not working on old records without a uid & cannot recover from error - DuraSpace JIRA
[21:12] <aschweer> the copy / delete lock file thing works for me -- tomcat on the dev machine needs to be down when you copy it, but nothing else I can think of
[21:12] <aschweer> oh yes, and I really do need to be elsewhere. see you!
[21:12] * aschweer (~schweer@schweer.its.waikato.ac.nz) Quit (Quit: leaving)
[21:13] <hpottinger> oh, duh, that makes sense, I bet that was the problem
[22:30] * hpottinger (~hpottinge@mu-162188.dhcp.missouri.edu) Quit (Quit: Leaving, later taterz!)
[22:45] * tdonohue (~tdonohue@c-98-215-0-161.hsd1.il.comcast.net) has left #duraspace

These logs were automatically created by DuraLogBot on irc.freenode.net using the Java IRC LogBot.