
We've added the System Log in INP-811 and it's been a huge success so far: it gives developers invaluable information from production websites that we weren't able to gather before. However, soon enough we faced a problem that leads to exponential growth of the system log record count, making it completely unmanageable.

Consider the following scenario:

  1. a developer missed one notice message on a page while developing it (or maybe that notice only happens on the production server)
  2. the page has 1000 visits per day

Since each notice is logged, 1000 new records are added to the system log every day. System log records for errors, warnings and notices include backtrace information to ease debugging, which adds extra to the overall database table size.

On one of our projects the system log table had grown to 370MB with 12490 records in it. After a simplified duplicate removal (matching by "error message", "page url" and "error backtrace") the table had only 2480 records left. The number could be even smaller if we processed the URL more intelligently (e.g. first removed irrelevant parameters from the URL, like "page number") and compared the resulting URLs.

I'm proposing to:

  1. add "LogLastRecurrenceOn" (date/time) and "LogRecurrenceCount" (int) columns
  2. from time to time (e.g. once a day) perform system log record grouping (aggregation)

This way we still have all the info that was in each error, but we also save disk space and the search time needed to find the right error among many similar ones.
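To make the proposal concrete, here is a minimal sketch of the daily aggregation step, assuming MySQL and PDO; apart from the two proposed columns, all table/column names (SystemLog, LogId, LogMessage, LogUrl, LogBacktrace, LogTimestamp) are assumptions for illustration:

  <?php
  // Hypothetical sketch: all names except the two proposed columns are assumptions.
  $db = new PDO('mysql:host=localhost;dbname=inportal', 'user', 'pass');

  // One-time schema change: the two proposed columns.
  $db->exec("ALTER TABLE SystemLog
             ADD COLUMN LogLastRecurrenceOn INT NOT NULL DEFAULT 0, -- unix timestamp
             ADD COLUMN LogRecurrenceCount INT NOT NULL DEFAULT 1");

  // Daily step: fold each group of duplicates into its earliest record.
  $groups = $db->query("SELECT MIN(LogId) AS KeepId, COUNT(*) AS Cnt,
                               MAX(LogTimestamp) AS LastSeen,
                               LogMessage, LogUrl, LogBacktrace
                        FROM SystemLog
                        GROUP BY LogMessage, LogUrl, LogBacktrace
                        HAVING Cnt > 1")->fetchAll(PDO::FETCH_ASSOC);

  foreach ($groups as $g) {
      // Record how many duplicates were folded in and when the last one happened.
      $db->prepare("UPDATE SystemLog
                    SET LogRecurrenceCount = LogRecurrenceCount + ?,
                        LogLastRecurrenceOn = ?
                    WHERE LogId = ?")
         ->execute([$g['Cnt'] - 1, $g['LastSeen'], $g['KeepId']]);

      // Delete the duplicates that were just counted.
      $db->prepare("DELETE FROM SystemLog
                    WHERE LogMessage = ? AND LogUrl = ? AND LogBacktrace = ?
                      AND LogId <> ?")
         ->execute([$g['LogMessage'], $g['LogUrl'], $g['LogBacktrace'], $g['KeepId']]);
  }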

 

6 Comments

  1. I find this very useful to have since, as you mentioned, the SystemLog can grow really fast and has lots of repetitions.

    One other thing to consider: what if we added "LogRecurrenceCountToday", "LogRecurrenceCountLast7days" and "LogRecurrenceCountLast30days" columns, so we could know how often it happens? Note that we should still have LogRecurrenceCount as the total count.

    What do you think?

    1. I like the idea of having fine-grained counts, since it might give developers information about the error recurrence trend. However, the proposed approach (creating separate time-based counters) has several downsides:

      1. We would need to update each counter column every day for every system log record. Each counter update operation would:
        1. lock the system log database table for a WRITE operation
        2. put any other incoming write operation on hold (e.g. a new error that happened and needs to be written to the system log)
        3. make a visitor see a loading indicator (for a long time) on a page that is trying to insert a system log record
        4. the visitor might go away thinking the website is slow to respond
        5. we don't want to lose a visitor just because a page has an error on it
      2. the proposed time periods aren't fine-grained enough to give information about the error flow in, for example, the last hour

       

      Here is how I see its implementation:

      1. create "LogRecurrencesData" column of "text" type
      2. process system log in descending order (from last to first record)
      3. each time we find a recurrence we add it's time to "LogRecurrencesData" column (comma separated list of timestamps) to most recent system log record
      4. on system log record detail page we can draw nice Graph (e.g. http://www.highcharts.com/demo/line-time-series), since we have all the recurrence data
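      A minimal sketch of step 3, assuming MySQL and the hypothetical LogId column; CONCAT_WS takes care of the very first timestamp, when the column is still empty:

        <?php
        // Hypothetical helper for step 3: append one recurrence timestamp to the
        // comma-separated list stored on the record we keep.
        function addRecurrence(PDO $db, int $keepId, int $timestamp): void {
            $db->prepare("UPDATE SystemLog
                          SET LogRecurrencesData = CONCAT_WS(',', NULLIF(LogRecurrencesData, ''), ?)
                          WHERE LogId = ?")
               ->execute([$timestamp, $keepId]);
        }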

       

      I see potential in the proposed implementation, however we might think a bit more about the actual data being stored in the LogRecurrencesData column:

      • an average timestamp looks like 1365438606 (10 digits)
      • storing 1000 recurrences would require 10000 (digits only) + 999 (commas to separate the numbers) = 10999 bytes (~10KB)

      Here is how we can improve storage:

      1. Divide each timestamp by 60 (converting seconds to minutes). This saves 2 digits from each stored number.
      2. Store only the difference against the previous number (the first number would be the difference against the aggregated record's own timestamp). For example:
        1. 43800 - minute count in 1 month (5 digits), which is half the digit count initially needed to store a recurrence timestamp
        2. 1440 - minute count in 1 day (4 digits)
        3. 60 - minute count in 1 hour (2 digits)
      3. On top of the resulting string we can apply Bzip2 compression and reduce it to 11% of its original size (see http://stackoverflow.com/questions/283299/best-compression-algorithm-for-a-sequence-of-integers)

      In the end, the proposed solution would not only be very informative (a detailed graph of recurrences), but also space-effective (it wouldn't take too much disk space).
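      A standalone sketch of the encoding described above (minute resolution, delta against the previous value, Bzip2), assuming unix timestamps and PHP's bz2 extension; the function names are hypothetical:

        <?php
        // Encode recurrence timestamps relative to the aggregated record's own
        // timestamp: minute resolution (step 1), deltas (step 2), Bzip2 (step 3).
        function encodeRecurrences(int $baseTimestamp, array $timestamps): string {
            sort($timestamps);
            $prev = intdiv($baseTimestamp, 60);   // seconds -> minutes
            $deltas = [];
            foreach ($timestamps as $ts) {
                $minute = intdiv($ts, 60);        // note: second precision is lost
                $deltas[] = $minute - $prev;      // difference against previous value
                $prev = $minute;
            }
            return bzcompress(implode(',', $deltas));   // requires ext-bz2
        }

        // Decode back to (minute-accurate) unix timestamps, e.g. for the graph.
        function decodeRecurrences(int $baseTimestamp, string $blob): array {
            $minute = intdiv($baseTimestamp, 60);
            $result = [];
            foreach (explode(',', bzdecompress($blob)) as $delta) {
                $minute += (int)$delta;
                $result[] = $minute * 60;
            }
            return $result;
        }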

  2. While trying to implement a prototype I've faced several problems. I'll explain them below along with the solutions I was able to find.

    Problems

    1. The aggregation process will be performed through CRON. To prevent excessive server load while the aggregation is running, we'll only process N (from configuration) system log records per run. When CRON runs the next time, we must not process the same record twice. The usual approach of marking processed records with a flag won't fit here, because we're processing from recent to old and might well find new records that would aggregate records which in turn were aggregated during the previous CRON run.
    2. We have a lot of records in the SystemLog table (before or after aggregation) and we need to remove them all at once. In-Portal has a nice "Delete All" toolbar button in the "System Log" section, but it produces an "Out of memory" error, because it tries to delete each record through the kTempTablesHandler class.
    3. The ID of the last processed record goes out of sync with the SystemLog table after it is truncated (because of the solution to the previous problem).
    4. Remembering the ID of the last processed record per table isn't the best solution, because we need to compare ALL records to the one we consider to be the original.

    Solutions

    1. We need to process records from old to recent and aggregate all new duplicates into the first found occurrence. Since new records are always added on top, we only need to remember the last processed record ID (store it in the SystemCache table). On the next CRON run we'll only process records whose ID is greater than the one we remembered before.
    2. Use a "TRUNCATE TABLE SystemLog" database query instead.
    3. Before relying on the remembered ID of the last processed record, we must first check whether a record with that ID still exists. If it doesn't, just reset the ID to 0.
    4. Store the last processed ID in each record that we're searching duplicates for.

    It seems that I've now solved each problem I've found so far.

    It's also important to note that:

    Aggregation of a system log record should only be performed if the last remembered ID in that record is smaller than the last system log record ID. This way we only need to process newly inserted system log records (compared to the ID of the last processed record).
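    Putting the solutions together, here is a rough sketch of one CRON run, assuming the hypothetical names from the earlier sketches (LogId, plus a SystemCache table with VarName/Data columns and a unique key on VarName); mergeIntoOriginal() stands in for the actual duplicate matching:

      <?php
      // Rough sketch combining solutions 1-3 and the note above.
      function aggregateSystemLogStep(PDO $db, int $limit): void {
          // Solution 1: the last processed record ID is remembered in SystemCache.
          // fetchColumn() returns false when no row exists yet, which casts to 0.
          $lastId = (int)$db->query("SELECT Data FROM SystemCache
                                     WHERE VarName = 'LastAggregatedLogId'")->fetchColumn();

          // Solution 3: if that record no longer exists (e.g. the table was
          // truncated), reset the remembered ID to 0.
          $check = $db->prepare("SELECT 1 FROM SystemLog WHERE LogId = ?");
          $check->execute([$lastId]);
          if (!$check->fetchColumn()) {
              $lastId = 0;
          }

          // Process at most $limit newly inserted records, from old to recent.
          $rows = $db->prepare("SELECT * FROM SystemLog
                                WHERE LogId > ?
                                ORDER BY LogId ASC
                                LIMIT " . $limit);
          $rows->execute([$lastId]);

          foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $row) {
              // Hypothetical helper: find the earliest matching record, bump its
              // LogRecurrenceCount, record the recurrence time, then delete $row.
              mergeIntoOriginal($db, $row);
              $lastId = $row['LogId'];
          }

          $db->prepare("REPLACE INTO SystemCache (VarName, Data)
                        VALUES ('LastAggregatedLogId', ?)")->execute([$lastId]);
      }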

  3. I've discussed all of the above with Dmitry over chat and he expressed a need to prevent records younger than 1 week (from configuration) from being aggregated. This could be useful if you need to analyze raw data before it is aggregated.

    In any case, the aggregation code must be smart enough not to erase any important data due to incorrect grouping of multiple records into one.

  4. Great finding, Alex!

    I can only add that I think the 1 week shouldn't be hard-coded and should be a number of days or a similar selection that the webmaster can pick.

    For example, aggregate "System Log" records older than:

    • 1 day
    • 2 days
    • 1 week
    • 1 month
    • 1 quarter
    • 1 year
    • never 

    Then in the System Log we should have ways of looking at:

    1. non-aggregated records
    2. aggregated records
    3. both aggregated and non-aggregated records at the same time

    Let me know what you think.


    1. I can only add that I think the 1 week shouldn't be hard-coded and should be a number of days or a similar selection that the webmaster can pick.

      This what "1 week (from configuration)" text from my previous comment means - it will be configurable (smile)

      Then in the System Log we should have ways of looking at:

      1. non-aggregated records
      2. aggregated records
      3. both aggregated and non-aggregated records at the same time

      This depends only on your configuration:

      1. non-aggregated:
        1. aggregation is disabled
        2. aggregation hasn't happened yet
        3. aggregation happened, but all records were already unique and nothing was merged
      2. aggregated:
        1. aggregation happened and all records were aggregated
      3. both aggregated and non-aggregated at the same time:
        1. aggregation happened and some records were aggregated

      Since records that are being aggregated are immediately deleted from the database, you can't flip a switch and see all the thousands of system log records as they were before aggregation. This whole aggregation idea only exists because we don't want duplicate info to be stored in the database, but at the same time we want to keep aggregated info about events that happened in the past.

      If you have a mixture of aggregated and non-aggregated records in the system log, you can easily find the aggregated ones by sorting/filtering by the "LogRecurrenceCount" column.
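      For instance, a minimal query sketch (again assuming PDO and the column names used in the earlier sketches):

        <?php
        // Records that were aggregated at least once have LogRecurrenceCount > 1;
        // untouched records keep the default value of 1.
        $aggregated = $db->query("SELECT * FROM SystemLog
                                  WHERE LogRecurrenceCount > 1
                                  ORDER BY LogRecurrenceCount DESC");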