#duraspace IRC Log


IRC Log for 2015-03-25

Timestamps are in GMT/BST.

[6:55] * DuraLogBot (~PircBot@ec2-107-22-210-74.compute-1.amazonaws.com) has joined #duraspace
[6:55] * Topic is '[Welcome to DuraSpace - This channel is logged - http://irclogs.duraspace.org/]'
[6:55] * Set by cwilper!ad579d86@gateway/web/freenode/ip. on Fri Oct 22 01:19:41 UTC 2010
[12:02] * mhwood (mwood@mhw.ulib.iupui.edu) has joined #duraspace
[12:02] * mhwood (mwood@mhw.ulib.iupui.edu) Quit (Remote host closed the connection)
[12:08] * mhwood (mwood@mhw.ulib.iupui.edu) has joined #duraspace
[13:01] * tdonohue (~tdonohue@c-98-215-0-161.hsd1.il.comcast.net) has joined #duraspace
[13:22] * robint (81d7fa56@gateway/web/freenode/ip. has joined #duraspace
[14:48] * srobbins_ (~srobbins@libstfsdg02.library.illinois.edu) has joined #duraspace
[14:58] <tdonohue> :REMINDER: At the top of the hour (in a few minutes), our weekly DSpace Developers meeting starts here. https://wiki.duraspace.org/display/DSPACE/DevMtg+2015-03-25
[14:58] <kompewter> [ DevMtg 2015-03-25 - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DevMtg+2015-03-25
[15:01] <tdonohue> Hi all, welcome. Our weekly DSpace Developers Meeting starts now: https://wiki.duraspace.org/display/DSPACE/DevMtg+2015-03-25
[15:01] <kompewter> [ DevMtg 2015-03-25 - DSpace - DuraSpace Wiki ] - https://wiki.duraspace.org/display/DSPACE/DevMtg+2015-03-25
[15:01] <tdonohue> The main topic for today is planning for DSpace 5.2 (which looks to be needed to help resolve some Solr Stats issues)
[15:02] <tdonohue> primarily this ticket seems high-priority enough for a DSpace 5.2 release in the nearish future: DS-2486
[15:02] <kompewter> [ https://jira.duraspace.org/browse/DS-2486 ] - [DS-2486] Missing fields in solr statistics data from previous DSpace versions - DuraSpace JIRA
[15:03] * KevinVdV (~kevin@ has joined #duraspace
[15:03] <tdonohue> Since we are on an "early meeting" today, unfortunately, I just realized that aschweer cannot give us a direct update on DS-2486 work (as it's the middle of the night for her)...
[15:04] <tdonohue> but, based on the recent comments on that ticket, aschweer has a PR which is now in a testable state: DSPR#894 (and hpottinger seems to have already tested it some)
[15:04] <kompewter> [ https://github.com/DSpace/DSpace/pull/894 ] - Ds 2486 reindex solr by aschweer
[15:05] <tdonohue> So, I guess the first question for all of you is whether anyone else can help spend some time to test this resolution to DS-2486? It seems like a fix we really should get out there, as currently our Solr Statistics data is NOT auto-upgrading itself
[15:05] <kompewter> [ https://jira.duraspace.org/browse/DS-2486 ] - [DS-2486] Missing fields in solr statistics data from previous DSpace versions - DuraSpace JIRA
[15:06] <tdonohue> (or rather Solr Statistics data is not able to be auto-reindexed, when the schema changes)
[15:08] <mhwood> I will try to find some time to test this. We're going to need it fixed soon, as we have an instance soon to be upgraded to 5.x.
[15:08] <tdonohue> For those unfamiliar with these tickets, the main "pain point" is that geographic information in Statistics is not preserved when upgrading to 5.1
[15:08] <tdonohue> 5.x
[15:08] <tdonohue> Thanks, mhwood!
[15:09] * hpottinger (~hpottinge@mu-162188.dhcp.missouri.edu) has joined #duraspace
[15:09] <tdonohue> and there's hpottinger ;) (We're discussing DS-2486 and getting testers for its PR)
[15:09] <kompewter> [ https://jira.duraspace.org/browse/DS-2486 ] - [DS-2486] Missing fields in solr statistics data from previous DSpace versions - DuraSpace JIRA
[15:10] <hpottinger> oh, hey
[15:10] <hpottinger> DSPR#894
[15:10] <kompewter> [ https://github.com/DSpace/DSpace/pull/894 ] - Ds 2486 reindex solr by aschweer
[15:11] <tdonohue> yep, we had linked to that PR already, hpottinger. And, mhwood had volunteered to help test as well
[15:12] <hpottinger> cool, I just gave it a +1
[15:13] <hpottinger> One thing, if the commands that come with this PR become part of our upgrade process, we need to provide some guidance on what will probably be a 12-14-hour maintenance window
[15:14] <tdonohue> It looks reasonable to me, but admittedly I have not tested it yet. One thing I noticed in it is that the SolrImportExport class doesn't seem to be "registered" in our launcher.xml (which we may want). I'll add a comment to that PR
[15:14] <hpottinger> tdonohue: good point
[15:14] <hpottinger> in the past we've asked people to run dsrun commands, but, I don't think we do that any more
[15:15] <mhwood> Well, we *can* do that, but why not make it a little easier?
[15:15] <tdonohue> yea, I'd rather we didn't do that anymore. It's so much easier to remember the command if it's just "./dspace [command]"
[15:16] <tdonohue> I added a comment to the PR about that
[15:17] <tdonohue> Does anyone else have comments/questions/thoughts on this ticket or its PR? (The only other thing that stands out is the very large maintenance window here...I wish it was much smaller, but I don't know of any way to solve that easily)
[15:18] <hpottinger> as far as maintenance windows go, I can't see a way that doesn't involve losing a 12 hour slice of stats, so I figure we recommend upgrading a copy of a live stats core, and then replace the core, unless someone smarter than me can figure out a way to run those commands without downtime or without losing stats
[15:21] <tdonohue> This large maintenance window actually may be a scenario that points to *not* using Solr as the persistent store for this data (as the maintenance of the index becomes very time consuming)... which brings us to the next topic on the agenda: "Re-examine our use of Solr as a statistics/authority store"
[15:22] <mhwood> Take DSpace down momentarily. Copy the core. DSpace back up. Reindex the copy. DSpace down. Switch to the updated copy. Extract records written since you started. Reload those to the new core. This may require some additions to the patch.
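The procedure mhwood outlines (copy the core, reindex the copy while the live core keeps recording, then reload whatever arrived in the meantime) can be sketched as a toy model. This is only an illustration under stated assumptions: `Core`, `upgrade`, `reindex`, and `catch_up` are hypothetical names, the "core" is an in-memory list rather than Solr, and `countryCode` stands in for whichever fields the new schema needs.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy stand-in for a statistics core: a list of (timestamp, record) pairs."""
    records: list = field(default_factory=list)

    def add(self, timestamp, record):
        self.records.append((timestamp, dict(record)))

    def copy(self):
        return Core([(ts, dict(rec)) for ts, rec in self.records])


def upgrade(record):
    """Hypothetical schema upgrade: fill a field that older records lack."""
    upgraded = dict(record)
    upgraded.setdefault("countryCode", "unknown")
    return upgraded


def reindex(core):
    """Rebuild every record under the new schema (the slow, hours-long step)."""
    return Core([(ts, upgrade(rec)) for ts, rec in core.records])


def catch_up(live, rebuilt, copied_at):
    """Reload records written to the live core after the copy was taken."""
    for ts, rec in live.records:
        if ts > copied_at:
            rebuilt.add(ts, upgrade(rec))
    return rebuilt


# 1. Brief downtime: copy the core, then bring DSpace back up.
live = Core()
live.add(1, {"id": "a"})
live.add(2, {"id": "b"})
snapshot, copied_at = live.copy(), 2

# 2. DSpace keeps recording while the copy is reindexed offline.
live.add(3, {"id": "c"})
rebuilt = reindex(snapshot)

# 3. Brief downtime: reload the newer records, then switch to the rebuilt core.
live = catch_up(live, rebuilt, copied_at)
```

The point of the model is that only steps 1 and 3 need DSpace to be down; the long reindex in step 2 runs against the copy. The "extract records written since you started" step is the part that, as mhwood says, may need additions to the patch.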
[15:22] <tdonohue> Though, I guess, before we jump entirely to this deeper discussion...it might be best to "wrap up" the 5.2 thread
[15:23] <mhwood> Yes. What else should be in 5.2? We have nearly thirty tickets marked for it. About nine are in "Code Review Needed" status.
[15:23] <tdonohue> mhwood: yea, that process is a bit complex but seems like it would work
[15:24] <tdonohue> Is there anything else folks know of that is high priority for 5.2?
[15:25] <tdonohue> (mhwood do you have a link to the 5.2 tickets you can share? Trying to find that myself right now)
[15:25] <mhwood> project = DSpace AND fixVersion = 5.2 AND status != Closed ORDER BY status
[15:26] <mhwood> Sorry, my saved favorite queries don't seem to be shareable.
[15:26] <mhwood> You could try https://jira.duraspace.org/issues/?filter=13033 but I think it won't work.
[15:26] <kompewter> [ Issue Navigator - DuraSpace JIRA ] - https://jira.duraspace.org/issues/?filter=13033
[15:27] <tdonohue> Here we go...here's tickets scheduled for 5.2 that are "unresolved": https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+fixVersion+%3D+5.2+AND+resolution+%3D+Unresolved+ORDER+BY+due+ASC%2C+priority+DESC%2C+created+ASC&mode=hide
[15:27] <kompewter> [ Issue Navigator - DuraSpace JIRA ] - https://jira.duraspace.org/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+DS+AND+fixVersion+%3D+5.2+AND+resolution+%3D+Unresolved+ORDER+BY+due+ASC%2C+priority+DESC%2C+created+ASC&mode=hide
[15:28] <tdonohue> And here's those 9 that are under "code review needed" status: https://jira.duraspace.org/issues/?jql=project%20%3D%20DS%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%205.2%20AND%20status%20%3D%20%22Code%20Review%20Needed%22%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
[15:28] <kompewter> [ Issue Navigator - DuraSpace JIRA ] - https://jira.duraspace.org/issues/?jql=project%20%3D%20DS%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%205.2%20AND%20status%20%3D%20%22Code%20Review%20Needed%22%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
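Since saved filters were not shareable, the links above carry the JQL text itself, percent-encoded, in the `jql` parameter. A minimal sketch of building such a link, assuming only the URL shape visible in the link above (`jql_link` is a hypothetical helper, not part of any JIRA client):

```python
from urllib.parse import quote

JIRA_SEARCH = "https://jira.duraspace.org/issues/?jql="


def jql_link(jql):
    """Return a shareable issue-navigator link for a JQL query."""
    # safe="" percent-encodes every reserved character, including "," and '"'.
    return JIRA_SEARCH + quote(jql, safe="")


code_review = (
    "project = DS AND resolution = Unresolved AND fixVersion = 5.2 "
    'AND status = "Code Review Needed" '
    "ORDER BY due ASC, priority DESC, created ASC"
)
link = jql_link(code_review)
```

With this query, `link` reproduces the "Code Review Needed" URL tdonohue pasted, so any other JQL string (such as mhwood's) can be shared the same way.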
[15:29] <tdonohue> It seems like, the ones under "Code Review Needed" could be quick wins for 5.2. So it seems obvious that we should try to find testers for those so we can get them in as well
[15:29] * KevinVdV (~kevin@ Quit (Ping timeout: 248 seconds)
[15:31] <tdonohue> We also likely need to figure out who amongst us would be willing to help "lead" / coordinate the 5.2 release. I'm going to have to admit that my time right now is limited...while I'm glad to support as needed, I don't know that I can coordinate this release (as I did with 5.1)
[15:31] <hpottinger> surely there is a SWORD expert reading this transcript, yes?
[15:33] <hpottinger> I also don't have a whole lot of time to devote to this, but since we have a substantial blocker in our upgrade to 5_x, if no one else wants to coordinate 5.2, I'll do it, just so we can proceed with our upgrade
[15:33] <hpottinger> SWORD expert of the future, I ask you to test DSPR#850, thanks
[15:34] <kompewter> [ https://github.com/DSpace/DSpace/pull/850 ] - [DS-2131] SWORDv2 ingestion fails with NullPointerException when replacing a non archived item by KevinVdV
[15:36] <tdonohue> quiet group here today :) hpottinger, I thank you for being willing to help coordinate. It sounds like it might be worth getting a helper though, if your time is also limited.
[15:37] <tdonohue> I'm glad to act in a supporting role. But, maybe we can also ask aschweer (or others) to see if they can chip in on getting 5.2 out the door
[15:38] <hpottinger> it would be great if we could get someone new running the release, just so we have more people with release experience
[15:38] <tdonohue> Ideally here, if *ANYONE* (in the room or reading these logs later) knows of a ticket that they'd like to see get into 5.2, we are going to need your help in getting it in (by helping testing, etc)
[15:39] <tdonohue> +1 to getting more folks familiar with the release experience. It seems like we still keep having the same (small) group manage most releases (especially these bug fix ones).
[15:40] <tdonohue> That all being said, if 5.2 *only* ends up including the fix for Solr Statistics, I think that's still a worthwhile release. But, obviously it'd be good to get in a few other fixes (especially "low hanging fruit") if we can get folks to help
[15:41] <hpottinger> I'm wondering if we have enough data to evaluate our variable meeting schedule?
[15:43] <tdonohue> hpottinger: good question. I honestly haven't looked into it. But, my suspicion is the early meetings have fewer attendees than the later ones...but, I know the early meetings are still more convenient for most of Europe
[15:44] <tdonohue> So, because we're already running short on time, and we don't seem to have much to say about 5.2, we might as well move along for now. For the time being though, regarding 5.2, we need more help testing Ds-2846 and we need more help getting other bug fixes in. If *anyone* is willing to chip in, we'd be glad to have you!
[15:45] <tdonohue> Ds-2486 that is (the Solr Stats ticket)...not 2846
[15:45] <hpottinger> I might as well mention DS-2506
[15:45] <kompewter> [ https://jira.duraspace.org/browse/DS-2506 ] - [DS-2506] Case-insensitive browse configuration does not work with discovery browse - DuraSpace JIRA
[15:46] <mhwood> So then, you mentioned that the dump/load time argues against using text indexing engines as long-term storage.
[15:46] <tdonohue> Ok, moving along to the highly-related Solr issue (but not something we can easily solve in 5.2): "Re-examine our use of Solr as a statistics/authority store"
[15:47] <tdonohue> mhwood: yes. Before hearing about that long maintenance window for a dump & load, I wasn't sure if we needed much more than a way to easily backup & restore Solr data. But, if the dump of that Solr data takes 12 hours, that implies we may want a better way to manage this data "persistently"
[15:48] <hpottinger> I am of the opinion that we probably need to get out of the problem space of usage statistics, there are other tools for that
[15:49] <tdonohue> Also worth linking in this same discussion in this dspace-devel thread: http://dspace.2283337.n4.nabble.com/We-need-to-think-a-bit-more-about-how-we-use-the-statistics-Solr-core-td4676995.html
[15:49] <kompewter> [ DSpace - Devel - We need to think a bit more about how we use the 'statistics' Solr core ] - http://dspace.2283337.n4.nabble.com/We-need-to-think-a-bit-more-about-how-we-use-the-statistics-Solr-core-td4676995.html
[15:50] <tdonohue> hpottinger: While I agree to some extent (I'd rather find ways to use third-party tools), I'm not sure if we'll have widespread approval for "Just use Google Analytics".
[15:50] <mhwood> So, it may be unavoidable to take the stats core offline for maintenance, but we need to keep capturing events and load them in afterward. That suggests to me a separate stream of event records that can be reloaded anytime.
[15:50] <mhwood> Without interrupting its capture.
[15:51] <hpottinger> Perhaps Logstash?
[15:51] <mhwood> Well, if we do a good job of capturing the events, a site can use anything to process them.
[15:51] <mhwood> We can get out of the statistics business but we still need to capture the data.
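The "separate stream of event records that can be reloaded anytime" could look something like the sketch below: an append-only file with one JSON object per line. This is an assumption-laden illustration, not an existing DSpace feature; the file name, the field names (`time`, `type`, `item`), and the helpers `record_event` and `replay` are all hypothetical.

```python
import json
import os
import tempfile


def record_event(path, event):
    """Append one usage event to the stream as a single JSON line."""
    with open(path, "a", encoding="utf-8") as stream:
        stream.write(json.dumps(event, sort_keys=True) + "\n")


def replay(path, since=None):
    """Yield stored events, optionally only those at or after `since`."""
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            event = json.loads(line)
            # ISO 8601 UTC timestamps in one format compare correctly as strings.
            if since is None or event["time"] >= since:
                yield event


stream_path = os.path.join(tempfile.mkdtemp(), "usage-events.jsonl")
record_event(stream_path, {"time": "2015-03-25T15:50:00Z", "type": "view", "item": "123456789/1"})
record_event(stream_path, {"time": "2015-03-25T16:10:00Z", "type": "download", "item": "123456789/1"})

# Reload everything into a rebuilt index, or only what arrived during maintenance.
all_events = list(replay(stream_path))
missed = list(replay(stream_path, since="2015-03-25T16:00:00Z"))
```

Because capture only ever appends to the file, the stats core can be taken offline and rebuilt from `replay` without interrupting capture, and a site could feed the same stream to any other tool.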
[15:51] <tdonohue> hpottinger: logstash or other third party tools might be an option
[15:52] <tdonohue> Part of this also may be a DCAT (or "Use Cases") question.... what do folks *need* out of a statistics engine? Is Google Analytics "enough"? Is there a need for something more directly "integrated" with DSpace?
[15:53] <mhwood> Time to let a thousand flowers bloom? Various sites try various approaches and see what works well in what cases?
[15:53] <hpottinger> I think GA is out of the picture, since you can't retroactively add usage data to it
[15:54] <tdonohue> +1 mhwood: we cannot get out of the business of "capturing the stats data", but hopefully we can find a way to get out of the business of processing/analyzing that data
[15:55] <mhwood> I think we will never come up with built-in stat.s that make everyone happy. We may want to provide something simple that can be unplugged and replaced (or not replaced).
[15:55] <tdonohue> I do think we'd need someone(s) to do some research / trials of various approaches. There's a ton of stats tools out there, we could do a dump of options (to a wiki page) and a dump of some basic "known needs" (.. ideally free, ideally easy to integrate with, etc)
[15:55] <tdonohue> And then look for folks to help us analyze those options
[15:55] <hpottinger> yes, we clearly have a responsibility to log usage data, but I don't think we need to accept responsibility for analyzing or visualizing it
[15:56] <mhwood> hpottinger +1
[15:56] <mhwood> So the question is: are we capturing the right raw data?
[15:57] <mhwood> Event consumers are easy to write. We can have a bunch of them.
[15:57] <helix84> hi, jumping in late - I wanted to note that there are standards in this area
[15:57] <tdonohue> we may have to "visualize" stats data (in the UI) to some extent though... I suspect that one of the "use cases" is the ability to actually *see* DSpace statistics in the user interface....so, as a basic example, ideally you'd want a basic "download count" to appear as part of the UI, and not in something completely separate.
[15:57] <helix84> I wanted to mention COUNTER and the SUSHI protocol
[15:58] <hpottinger> I wonder if we could "ape" the format of an apache log file, and then use any of the plethora of Apache log file analysis tools?
[15:58] <helix84> Joao was willing to work in his free time on making DSpace a SUSHI provider
[15:58] <mhwood> We might, however, be able to leverage tools meant for such processing and visualization, drop their output in a convenient place and let DSpace just include it.
[15:59] <helix84> there is some code, but it's very early
[15:59] <tdonohue> +1 to consider moving towards SUSHI / COUNTER.
[16:00] <hpottinger> The nice thing about moving towards SUSHI/COUNTER is that format is an agreed-upon standard, no need to do any further use case work
[16:00] <helix84> DS-626
[16:00] <kompewter> [ https://jira.duraspace.org/browse/DS-626 ] - [DS-626] API for exchange of Usage Data over OAI-PMH or SUSHI - DuraSpace JIRA
[16:01] <tdonohue> I'd be supportive of 626... though I admit I don't have much experience with SUSHI yet
[16:01] <mhwood> standards: good
[16:02] <helix84> I'm trying to find the code but I forgot where it was, probably under lyncode/DSpace
[16:02] * tdonohue realizes we are now at one hour...but, I want to wrap up this discussion first
[16:04] <tdonohue> So, bringing this discussion back around to *what is actionable*. What do we want as next steps? It seems like there are several questions still here.... (1) Should we rework the existing Solr Stats to have a more persistent layer it works with, (2) Should we consider migrating to a different third-party system, SUSHI, etc.
[16:04] <mhwood> We should ask ourselves how timely the stat.s have to be. Would anyone seriously complain if we gave figures current through the previous midnight? That would make it easier to leverage external tools.
[16:05] <helix84> SUSHI is not a system, it's a protocol, we can work towards that in the 6 timeframe. COUNTER is a format - something to consider right now, the easy alternative being CSV.
[16:05] <mhwood> 1. We should not rely on Solr for long-term storage. We need a way to reload it without seriously disrupting operations other than statistical reporting.
[16:06] <tdonohue> regarding my #1, I'm now starting to wonder if we do need a more persistent layer here (since the dump & reindex can take 12+ hours). So, we may need a ticket to work towards that
[16:06] <hpottinger> well... you know, it wouldn't take much to slap together some kind of logstash feed, and then it really becomes a matter of "let's see what we can make of this pile of data"
[16:06] <tdonohue> regarding #2, we already have a ticket for SUSHI (yea, I realize it's a protocol, just was running out of space in my text). But, we may want a wiki page or ticket to discuss whether to integrate with other third-party systems
[16:07] <mhwood> 2. I need to read up on COUNTER and SUSHI to understand what they provide. There is a long path between event capture and finished statistical data products.
[16:07] <helix84> hpottinger: what about logstash? it's just the same situation we're currently in with Solr/ES
[16:07] <tdonohue> (other third party systems besides Google Analytics)
[16:08] <mhwood> That was also my take on logstash. It looks quite good, but I am not yet convinced that it's so good for what we need right now.
[16:09] <tdonohue> logstash is more about analyzing logs...not exactly the same as usage stats. But, there may be other third party systems which may be more oriented towards usage stats
[16:10] <hpottinger> the difference between what we do now and feeding logstash data is this: we are completely responsible for all aspects of our current stats solution, whereas if we utilize some other stats solution, we get to leverage the work of another community
[16:11] <tdonohue> Some better examples may be looking at Piwik (http://piwik.org/) or some of their competitors
[16:11] <helix84> hpottinger: here's what I'm asking - if logstash is your answer, what is the question?
[16:11] <hpottinger> Piwik is pretty much the same thing as GA, though (drop a JS file somewhere, let it collect data for you)
[16:12] <helix84> hpottinger: there's currently zero barrier in DSpace to feeding dspace logs to logstash. What are you proposing?
[16:12] <hpottinger> helix84: I agree, kind of, though zero barrier isn't quite true :-) there's the same barrier that prevents any new work
[16:13] <helix84> hpottinger: no no, on the contrary, there's nothing on the DSpace side we need to add to feed Logstash. It's completely separate.
[16:13] * pbecker (~pbecker@ubwstmapc098.ub.tu-berlin.de) has joined #duraspace
[16:13] <tdonohue> Piwik is similar to GA, yes...but your data is local & you can manage it yourself. Plus, Piwik can also do log analytics (never used it though) http://piwik.org/log-analytics/
[16:13] <kompewter> [ Log Analytics - Analytics Platform - Piwik ] - http://piwik.org/log-analytics/
[16:14] <helix84> I've used piwik before. Just noting that it stores data in RDBMS
[16:14] <pbecker> we are using piwik here.
[16:14] <tdonohue> So, I'm just noting that we don't necessarily need to "do our own version of usage stats/analytics". We can potentially just use a third party solution and send it the data it needs (like we do with Google Analytics)
[16:14] <hpottinger> Hmm... interesting...
[16:15] <hpottinger> yeah, if we could just produce a log that looks like an Apache log...
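A log that "looks like an Apache log" would most likely mean the Combined Log Format, which the common Apache log analysis tools already parse. A rough sketch of formatting one usage event that way follows; `combined_log_line` is a hypothetical helper, the request path and client address are made-up examples, and the protocol is assumed to be HTTP/1.1.

```python
from datetime import datetime, timezone

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def combined_log_line(host, when, method, path, status, size,
                      referer="-", agent="-", user="-"):
    """Format one request in Apache's Combined Log Format."""
    # Month names are spelled out here so the result does not depend on locale.
    stamp = f"{when:%d}/{MONTHS[when.month - 1]}/{when:%Y:%H:%M:%S %z}"
    return (f'{host} - {user} [{stamp}] "{method} {path} HTTP/1.1" '
            f'{status} {size} "{referer}" "{agent}"')


line = combined_log_line(
    "192.0.2.10",
    datetime(2015, 3, 25, 15, 58, 0, tzinfo=timezone.utc),
    "GET", "/handle/123456789/1", 200, 5120,
    agent="Mozilla/5.0",
)
```

Note that this format has no slot for DSpace-specific details such as the item or bitstream identifier beyond what the URL carries, which is one reason an extended format of DSpace's own also comes up later in the discussion.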
[16:15] <mhwood> But it would probably be nice to capture the data in something a little less horrible than the morass that is dspace.log.
[16:16] <hpottinger> right, it would be really lovely to capture the breadcrumb trail of the page being visited
[16:17] * peterdietz (uid52203@gateway/web/irccloud.com/x-yqysdedxrxgpfxpf) has joined #duraspace
[16:17] <tdonohue> hpottinger: right, and for that (capturing breadcrumb trail), you would want something like Google Analytics or Piwik (or other various competitors out there)
[16:17] <helix84> mhwood: we have a log format for statistics (see stats-log-converter which converts dspace.log to this format), the format just doesn't currently record user agent and geo information, IIRC
[16:18] <mhwood> helix84: interesting point.
[16:18] <pbecker> as tdonohue mentioned in last week's meeting: with the privacy laws we have here, Piwik would harmonize much better than GA, at least as far as I know.
[16:18] <helix84> hpottinger: I still don't understand why you mentioned logstash. What problem would it solve?
[16:18] <tdonohue> IMHO, this is yet another area of DSpace where I start to question whether "doing it our own way" is the right decision. Piwik has integrations with tons of other major platforms (http://piwik.org/integrate/). Why not just have DSpace have a Google Analytics integration, a Piwik integration, (maybe a few others)...and do away with Solr Stats altogether over the long term
[16:19] <pbecker> +1
[16:20] <hpottinger> helix84: it's a purpose-built tool for managing the kind of data we are trying to manage with custom tools
[16:20] <helix84> tdonohue: perhaps the answer would be that DSpace is an out-of-the-box solution, not that I necessarily disagree with you
[16:21] <helix84> hpottinger: but since it has the same disadvantage (persistent storage) that we're trying to solve, what does it give us?
[16:21] <mhwood> The question is how much should be in the box?
[16:21] <tdonohue> helix84: there has to be some limits on the "box" though ;) We cannot be an "out-of-the-box" solution for everything. The box could include plugins to say.. "you probably want usage statistics...here's a few free ones we integrate with automatically"
[16:22] <hpottinger> helix84: I don't know that logstash has that disadvantage
[16:23] <tdonohue> hpottinger: people who use logstash tend to use it alongside something like Elasticsearch to actually create reports. So, you'd either need to keep around the logs it's analyzing, or dump from ES
[16:23] <helix84> hpottinger: how so? it uses ES for storage
[16:23] <pbecker> tdonohue: the integration of such tools has two parts. One question is how to provide usage statistics to such tools. The other question is how we can integrate the results in, e.g., the item view.
[16:23] <tdonohue> hpottinger: we tried logstash locally for some time. It literally just analyzes logs and sends them to something like ES
[16:24] <pbecker> We currently have (configurable) a button on the item view that allows anyone to see the usage stats and that is a highly requested feature...
[16:25] <tdonohue> pbecker: yep, I agree. We'd need to build plugins to just capture the data (and send to GA or Piwik), and also plugins to pull the analytics data back in (and display on the item view, etc)
[16:26] <tdonohue> So, as we are overtime here...it sounds like we might want to try to wrap this up a bit.
[16:27] <helix84> stepping back, the reason why usage statistics are so important in the Open Access movement is that they are a possible alternative (or complement) to other metrics, so they are important if you want to make your processes independent of e.g. impact factors. Just trying to explain why keeping logs is so highly requested in the repository community.
[16:28] <hpottinger> I think we probably need a usable usage log, and it's possible the dspace.log is not up to that task.
[16:29] <mhwood> It's littered with other stuff that is significant in other situations, and wasn't quite designed for capture of usage observations.
[16:29] <hpottinger> we probably need to figure out what would comprise a "usable" usage log
[16:29] <tdonohue> It seems like we definitely need more discussion here on the next steps (and a decision at some point)... There are still opportunities here to brainstorm possible solutions (be it integrating with Piwik, or other products, or continuing to improve on our current Solr Stats). I wonder if it'd be good to start up a Wiki discussion page for this?
[16:29] <helix84> hpottinger: see COUNTER
[16:29] <hpottinger> and see if we can deliver that log for dspace 6
[16:29] <mhwood> helix84 pointed out that we have a format already, perhaps needing some slight extension.
[16:30] <hpottinger> oh, yes, the pre-solr stats solution
[16:31] <helix84> The tool was written in order to convert from dspace.log to something to be ingested by Solr (and now ES). All the tools are there. We might want to add a few fields, though.
[16:32] <helix84> Again, though, this is just a format, which doesn't address the problem of having a consumer which actually stores persistently (in whichever format)
[16:32] <hpottinger> a difficult task for data which is missing :-)
[16:32] <mhwood> We might split the usage log records out of dspace.log, making both easier to use.
[16:33] <tdonohue> It seems like (with the stats data log) we're jumping to a solution before we know what we need. The format of such a log (and the data it contains) probably depends on what we are integrating with (and what that external thing needs in terms of data)
[16:33] <helix84> To sum up the format situation, the contenders seem to be COUNTER, our statistics.log format (extended) or Solr CSV format
[16:33] <mhwood> I mean: don't put usage observations *into* dspace.log in the first place. Put them elsewhere, in some more useful format (such as what we already have as the output of the converters).
[16:34] <tdonohue> +1 helix84...or just a CSV format in general (which may not even be Solr's, but it could be if we plan to continue using Solr)
[16:34] <hpottinger> mhwood++
[16:35] <tdonohue> mhwood: that would require a definition of what is a "usage observation". Wouldn't you want to *know* if a user visited a particular Item View page and then an error occurred (as that's the breadcrumb of the error)? Putting an Item view "usage" log elsewhere makes it harder to debug issues
[16:36] <helix84> tdonohue: now _this_ is where logstash could help - correlating two separate streams of data
[16:36] <tdonohue> yes, I agree..but we don't want to require everyone to use logstash to debug errors ;)
[16:37] <mhwood> Well, I need to organize my thoughts a bit. I just remember all the times I had to dive into dspace.log and thought: what a swamp this is!
[16:38] <tdonohue> I'm just pointing out that "usage observation" is hard to define...and therefore, it's a bit hard to define the contents of a "usage log" without an understanding of where/how this "usage log" will be analyzed (i.e. by what tools, analytics engines)
[16:38] <helix84> large dspace.log is a problem on its own - dspace logs filling disk space is common. Now we have a reason to want to keep access logs forever. Why would we want to keep error logs forever? It makes sense to separate them. Apache HTTPD lets you do that, as an example.
[16:39] <mhwood> Here I'm gathering up log files more than N days old into yearly ZIP archives. DSpace logs compress about 90-95%.
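The housekeeping mhwood describes (logs older than N days gathered into yearly ZIP archives) might be sketched as follows. This is not mhwood's actual script; it assumes log files named like `dspace.log.YYYY-MM-DD`, and the function name and archive naming (`dspace-logs-YYYY.zip`) are hypothetical.

```python
import os
import re
import zipfile
from datetime import date, timedelta

# Assumed naming scheme for rotated logs: dspace.log.YYYY-MM-DD
LOG_NAME = re.compile(r"^dspace\.log\.(\d{4})-(\d{2})-(\d{2})$")


def archive_old_logs(log_dir, keep_days, today):
    """Move logs older than `keep_days` into per-year ZIPs; return names archived."""
    cutoff = today - timedelta(days=keep_days)
    archived = []
    for name in sorted(os.listdir(log_dir)):
        match = LOG_NAME.match(name)
        if not match:
            continue
        year, month, day = map(int, match.groups())
        if date(year, month, day) >= cutoff:
            continue  # still recent enough to keep uncompressed
        path = os.path.join(log_dir, name)
        archive = os.path.join(log_dir, f"dspace-logs-{year}.zip")
        # Mode "a" creates the archive if needed and appends otherwise.
        with zipfile.ZipFile(archive, "a", compression=zipfile.ZIP_DEFLATED) as bundle:
            bundle.write(path, arcname=name)
        os.remove(path)
        archived.append(name)
    return archived
```

A call such as `archive_old_logs("/dspace/log", 30, date.today())` (the directory is an example path) could run from a nightly scheduled job; the age test uses the date in the file name rather than the modification time, so copied or restored files are still handled predictably.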
[16:40] <tdonohue> I fully admit, I don't keep logs forever. I don't even keep Apache HTTPD usage logs forever... The stuff I *do* keep is the analysis of the log data. So, for example, if DSpace integrated with Piwik, I'd backup Piwik and dump the usage logs (after a reasonable amount of time)
[16:40] <mhwood> I agree that error logs have little long-term value and could be separated out.
[16:41] <mhwood> If DSpace stores the observations simply enough, it integrates with everything. :-)
[16:41] <tdonohue> But, yes, I can see the reasoning here of potentially supporting a "usage log" (like Apache HTTPD) for those who want to keep those around for very long periods of time
[16:42] <helix84> well you _currently_ want to keep them because we don't provide persistent storage for these events
[16:43] <helix84> at the same time, it currently is very unwieldy to work with
[16:43] <tdonohue> So, we aren't going to solve all these problems today. It sounds like we need to wrap this up :)
[16:44] <tdonohue> Should we (a) try and create some JIRA tickets here? OR (b) start up a wiki discussion page on this?....or a bit of both?
[16:44] <hpottinger> both
[16:44] <pbecker> yes, both
[16:44] <mhwood> What would be the division?
[16:44] <hpottinger> and if we don't have a direction picked out by the time OR15 rolls around, we should discuss then
[16:45] <tdonohue> It sounds like we've identified several different possible needs: (1) Usage logs (likely a JIRA ticket), (2) Do we want to continue using our own custom Stats or do we move towards plugins for others (e.g. Piwik) (Wiki page?), (3) If we continue to use Solr Stats, do we need a better persistence layer? (may be related to #1).
[16:46] <tdonohue> Anyone want to grab one (or more) of those to help create JIRA tickets / Wiki discussion?
[16:46] <helix84> I can create the wiki page
[16:47] <tdonohue> thanks helix84!
[16:47] <mhwood> I'll write up a ticket for (1).
[16:47] <tdonohue> thanks mhwood
[16:47] <pbecker> and (4) how do we integrate the results in DSpace (e.g. item view)
[16:47] <hpottinger> the nice thing about usage logs is, if we have them, anyone is free to use whatever tool they wish to analyze and visualize them
[16:47] <tdonohue> #4 is likely semi-related to #2 (and should be on the wiki page likely), but I agree, pbecker
[16:47] <mhwood> hpottinger++
[16:48] <pbecker> yes
[16:48] <tdonohue> To me, it also sounds like, if we have #1, then #3 may or may not be as high of a priority (depending on our long term goals around a statistics engine)
[16:49] <mhwood> I think #1 will produce most of #3.
[16:49] <robint> Hi all, just catching up on this conversation
[16:49] <robint> Just thought I would throw in that in the UK there is a project called IRUS
[16:49] <hpottinger> woah, hey, it's robint!
[16:49] <tdonohue> All in all though, this has been a great discussion...lots to think about and brainstorm on. It also seems like it could feed into the upcoming DSpace Product Roadmap (which will be drafted for discussion at OR15 by myself & others)
[16:49] <robint> run by JISC
[16:49] <mhwood> IRUS is mentioned in a ticket on this issue.
[16:50] <tdonohue> http://www.irus.mimas.ac.uk/
[16:50] <kompewter> [ IRUS-UK ] - http://www.irus.mimas.ac.uk/
[16:50] <robint> @Mire produced a stats plugin that effectively sends them stats each time an event occurs
[16:50] <mhwood> DS-626 mentions PIRUS/PIRUS2/IRUS
[16:50] <robint> Same idea as Google, but not Google
[16:50] <kompewter> [ https://jira.duraspace.org/browse/DS-626 ] - [DS-626] API for exchange of Usage Data over OAI-PMH or SUSHI - DuraSpace JIRA
[16:51] <robint> The advantage is that unlike Google they have some commitment to their users
[16:51] <robint> And the institutions can influence what happens
[16:51] <robint> So DSpace gets to offload the stats recording...
[16:52] * hpottinger likes the sound of that.
[16:52] <robint> But the users still have some guarantee of longjevity
[16:52] <tdonohue> Is IRUS only for UK institutions though?
[16:52] <robint> I'm sure that's not how you spell that word :)
[16:52] <mhwood> :-)
[16:53] <robint> tdonohue: I suspect so, so it's not a global solution
[16:53] <hpottinger> it's how *you* spelled it ;-)
[16:53] <robint> But it may be a model that we could encourage
[16:53] <helix84> robint: could you please get back to me and point me to why the decision to use OAI-PMH was made within this project? As Joao (who has extensive experience with both repo statistics and OAI-PMH) pointed out, they're not necessarily a good match.
[16:53] <mhwood> "If you are a UK repository wishing to participate in IRUS-UK, please contact irus@mimas.ac.uk"
[16:54] <tdonohue> It also looks like, from the IRUS page, that it's COUNTER-compliant...which implies if we can better support COUNTER, we should be able to send data to IRUS?
[16:54] <hpottinger> or, you know, adapt the plugin to send data to *something else*
[16:54] <robint> helix84: Actually I didn't know it made use of OAI-PMH, not sure why that would be
[16:55] <helix84> robint: maybe I'm completely misguided here, sorry
[16:55] <robint> tdonohue: The @Mire patch for IRUS is freely available and it's small, so we could investigate putting it into master
[16:55] <hpottinger> I've got to run soon, so, I'm just going to blurt out: DS-2506
[16:56] <kompewter> [ https://jira.duraspace.org/browse/DS-2506 ] - [DS-2506] Case-insensitive browse configuration does not work with discovery browse - DuraSpace JIRA
[16:56] <mhwood> I think the PMH mention was referring to SCEUR. Read up the ticket a ways.
[16:56] <robint> helix84: You could be right, I'm not that familiar with how it all hangs together
[16:57] <tdonohue> robint: yea, it might be good to see what this IRUS integration looks like, and whether there are any parallels with resolving DS-626 (SUSHI / COUNTER integration)
[16:57] <kompewter> [ https://jira.duraspace.org/browse/DS-626 ] - [DS-626] API for exchange of Usage Data over OAI-PMH or SUSHI - DuraSpace JIRA
[16:57] <robint> tdonohue: Maybe we need another Duraspace hosted project - World Repo Stats!
[16:58] <helix84> robint: I'd be happy if you could just get back to me about that later. I did some searching on that but didn't find out because it's a large project and I got lost.
[16:58] <robint> helix84: will do
[16:58] <hpottinger> robint: do not add to tdonohue's pile
[16:58] <tdonohue> robint: um..probably not gonna happen, unless someone wanted to fund us to build World Repo Stats ;)
[16:58] <helix84> thanks. btw the ticket is, again, DS-626
[16:58] <kompewter> [ https://jira.duraspace.org/browse/DS-626 ] - [DS-626] API for exchange of Usage Data over OAI-PMH or SUSHI - DuraSpace JIRA
[16:59] <hpottinger> http://www.cranfieldlibrary.cranfield.ac.uk/pirus2/tiki-index.php?page=Project+Plan+and+Progress source code links there
[16:59] <kompewter> [ Home : Project Plan and Progress ] - http://www.cranfieldlibrary.cranfield.ac.uk/pirus2/tiki-index.php?page=Project+Plan+and+Progress
[17:00] <hpottinger> re 2506, I'm thinking I'm going to make a PR to remove and/or "deprecate" the configuration which no longer works
[17:00] <robint> I'll have a general look at the IRUS stuff tomorrow and send something out
[17:01] <robint> Got to run now. Cheers all.
[17:01] <mhwood> Thanks!
[17:03] <mhwood> Are we done, then?
[17:03] <hpottinger> think so?
[17:03] <tdonohue> Yes, we're done at this point... it was a very eventful 2-hour meeting :)
[17:03] <mhwood> Heh.
[17:04] <tdonohue> Thanks all for the good discussion on the various Stats issues...it was a lot of good brainstorming & ideas for how to possibly move forward
[17:04] <tdonohue> The meeting is officially closed. We'll skip any JIRA backlog review for today, since this meeting ate up that time
[17:04] <mhwood> Always happy to inject controversy. :-)
[17:05] <tdonohue> Good controversy though... it really was good discussion, and it should help inform the upcoming DSpace Product Roadmap around our Usage Stats options, etc
[17:05] * robint (81d7fa56@gateway/web/freenode/ip. Quit (Ping timeout: 246 seconds)
[18:17] * tdonohue (~tdonohue@c-98-215-0-161.hsd1.il.comcast.net) has left #duraspace
[18:21] * tdonohue (~tdonohue@c-98-215-0-161.hsd1.il.comcast.net) has joined #duraspace
[20:04] * mhwood (mwood@mhw.ulib.iupui.edu) has left #duraspace
[21:06] * hpottinger (~hpottinge@mu-162188.dhcp.missouri.edu) Quit (Quit: Leaving, later taterz!)
[21:42] * tdonohue (~tdonohue@c-98-215-0-161.hsd1.il.comcast.net) has left #duraspace

These logs were automatically created by DuraLogBot on irc.freenode.net using the Java IRC LogBot.