[uploader] File storage in In-Portal

Imported From: http://groups.google.com/group/in-portal-dev/browse_thread/thread/d4ca2b893d2d8c4f#

Hi. I remember many projects where we had to store lots of user files. Sometimes there were so many that our team was forced to find a workaround, because when too many files are stored in a single folder, the OS begins to process that folder too slowly. Usually the solution was to put each file into a folder named after the first letter of the file's name. I find that quite simple and effective.


Recently our team made something new in project TBA. It works this way: all files of all units are stored under a single directory, user_files/filedump. There are approx. 400 folders at each level and a customizable number of levels (right now it's 3). Each time a file gets stored or recalled (e.g. shown on the front-end), the destination path is calculated from the md5 hash of the string "<path>/<file>". The path is what we set for "upload_dir" in configs, so it is actually virtual. It is only needed to get different results for files with the same name in different units. Files are only stored on the last level. So if the number of levels is 3, it would be something like 234/123/25/test.jpg. I think that this system is far from ideal. It has many serious disadvantages; I have mentioned it only as an example. But I really think that In-Portal needs some automated system that would help to deal with a large number of files without any additional custom code.


I invite you to discuss this.

Related Tasks

INP-824

24 Comments

  1. Hi to you too. I think that the ENTER key on your keyboard was broken
    while writing your post :)

    I didn't quite understand what "234", "123" and "25" are in the path to
    the file.

    Can you provide a more detailed example, e.g. the input file is
    "test_file.jpg" - what will be its location on disk, and why?

  2. Hi guys,

    Nikita, great point! Also, I just wanted to make sure you've understood
    Alex's point about the "Enter" key. It's quite hard to read and comprehend
    the text when it's in one big paragraph. To make your ideas simpler to
    understand, just start breaking them into smaller pieces (paragraphs) and
    we'll all be happy! ;)

    Please note that we do appreciate that you post your opinions and start or
    participate in discussions. Let's make sure it's easy for all of us to read
    and understand.

    Now back to your original idea. Yes, I do support your point, and I have
    personally come across at least 2 projects where the number of files in
    a folder became extremely high and the folder got close to unusable. I was
    on Linux, but I am sure with Windows things will become even worse. As a
    matter of fact, Linux can't even delete the files when there are 2,000 or so
    of them, since the "rm" command won't accept that many parameters.

    Let's start with setting our *ultimate goal* by listening to each and every
    idea, and then come up with a plan for reaching it?

    *Goal:*

    Have the ability to store a large number (2,000+) of User Uploaded files
    (system/user_files) in a special folder structure so these files can be
    easily deleted, moved and accessed.

    *Possible Solution:*

    I think we should create our folders as follows - YEARMONTH /
    FirstLetterOfFile / Filename. Example: 201101/T/test_file.jpg.

    This way we can see the date the file was saved, and the additional
    folder level makes sure there are NOT too many files in one folder.

    The real question is what we do with the Resized part of the files. I think
    there is NO need to move them under this type of structure and they should
    stay in the root, since we have a Clean Event which deletes Resized files
    anyway (once in a while).

    Nikita, Alex and other guys - please join in and post your ideas on the goal
    and solution.

    DA

  3. MediaWiki also has code that structures uploaded images into sub-folders,
    e.g. "a", "a1" and so on. I don't know what logic is being used, but the
    purpose reminds me of what Nikita proposed.

    The resized file clean event is not a problem, since it can easily be
    changed to:

       1. delete files in subfolders
       2. delete whole subfolder structure

    Maybe we could even come up with a universal solution (for uploads into any
    folders and of any files) that should speed up work where a lot of files
    are uploaded.
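
The two clean-up steps listed in comment 3 can be sketched as follows. This is a minimal Python illustration, not In-Portal code (In-Portal itself is PHP); the function name `clean_resized_tree` is hypothetical, and it assumes the resized folder contains only regular files and real subfolders (no symlinks).

```python
import os


def clean_resized_tree(root: str) -> int:
    """Delete every file under root, then every (now empty) subfolder.

    Returns the number of files removed. The root folder itself is kept.
    """
    removed = 0
    # topdown=False visits the deepest folders first, so by the time a
    # folder is reached, all of its subfolders have already been removed.
    for current, _dirs, files in os.walk(root, topdown=False):
        for name in files:
            os.unlink(os.path.join(current, name))  # step 1: delete files
            removed += 1
        if current != root:
            os.rmdir(current)  # step 2: drop the emptied subfolder
    return removed
```

The cost is one directory scan per subfolder, which is the extra work that is weighed against a flat resized folder later in the thread.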

  4. Hello!

    OK, I will clarify my previous message and answer these questions:

    1) What is "234" and "123" and "25" in the path to the file?
    The formula itself, i.e. how the path is converted to the multi-level
    structure, isn't so important actually - I don't like the existing
    solution. First of all it takes an MD5 hash of "<path>/<file>" (e.g.
    "/system/user_files/avatars/example.jpg"). Then it sums together all the
    hexadecimal digits (e.g. a5f23=10+5+15+2+3).
    Since an md5 hash is a 32-character hexadecimal number, the result
    is between 0 and 32*15=480. This number is the name of the first-level
    folder. If the settings require more levels, the system takes the md5 hash
    of the previous HASH and the scenario repeats until the required number of
    levels is reached. As a result, a defined number of folder names is
    generated, e.g. 234, 322, 123...

    2) What do we do with the Resized part of the files?
    In TBA, resized files are stored together with the original ones. As
    long as the clean event uses the proper methods to get the file path
    ($object->GetField('File', 'full_path')), there shouldn't be any problem
    with removing them.

    3) What pros and cons do I see in the system that we already have in TBA?
    - too many folders at each level (theoretically up to 481, since the sum
    ranges from 0 to 32*15=480), which makes it very hard to move such storages;
    - user-unfriendly: you can browse this storage easily, but you will
    never find anything unless you know the exact location;
    - too much computing in the algorithm that generates paths;
    + all files are automatically stored and uniformly distributed, so you
    will never be bothered by the problem of a large number of files.
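
The path calculation described in answer 1) can be sketched roughly as below. This is a Python illustration written from the description only, not the TBA code; the function names are hypothetical, and it assumes that "md5 hash of the previous hash" means hashing the previous hash's hexadecimal string.

```python
import hashlib
from typing import List


def hex_digit_sum(hex_string: str) -> int:
    """Add up the hexadecimal digits, e.g. 'a5f23' -> 10+5+15+2+3 = 35."""
    return sum(int(ch, 16) for ch in hex_string)


def tba_style_folders(virtual_path: str, levels: int = 3) -> List[str]:
    """Return one folder name per level for a '<path>/<file>' string."""
    folders = []
    digest = hashlib.md5(virtual_path.encode("utf-8")).hexdigest()
    for _ in range(levels):
        # 32 hex digits of at most 15 each: the name is a number in 0..480.
        folders.append(str(hex_digit_sum(digest)))
        # Assumption: the next level hashes the previous hash's hex string.
        digest = hashlib.md5(digest.encode("utf-8")).hexdigest()
    return folders


# Folder names for a file; the file itself is stored on the last level.
print("/".join(tba_style_folders("/system/user_files/avatars/example.jpg")))
```

Because each name is a sum of 32 digits, the values cluster around 240 rather than spreading evenly over the whole range, so some folders fill up much faster than others.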

  5. Hi everyone,

    Nikita, thanks for explaining how it works in TBA and listing your PROS and
    CONS - it all makes sense.

    Based on Nikita's original request I am proposing a global change in the
    file and folder structure for storing all User Files:

    *1. Filenames*

    Append a timestamp to the end of the filename. This makes more sense for
    ANY clean-up system - human or automatic.

    Example: my-custom-filename-whatever_20110131.jpg - this way we always know
    when the file was created. We do some minor filename processing anyway, so
    it's not like we are adding much more work for the website here.

    *2. Folders*

    *a.* Create a sub-folder structure of TIMESTAMP /
    UPPERCaseFirstCharacterOfFileName / Filename.jpg

    Example (taking into account that the Filename change is done too):
    user_files / books / 201101 / M / my-custom-filename-whatever_20110131.jpg

    *b.* I would leave all Resized files in a single folder.

    Example:
    user_files / books / resized
    / my-custom-filename-whatever_20110131_[CRC32].jpg
    user_files / books / resized
    / new-custom-filename-whatever_20110131_[CRC32].jpg

    In other words, I see NO point in creating an additional structure for
    storing Resized images and then scanning that structure just to remove
    them. Bottom line, no one will be looking through those Resized images
    anyway.

    *NOTE:* I believe this folders change should be optional and enabled in the
    Fields definition of Unit Config.

    Please share your thoughts on the above.

    DA
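
The naming and folder rules from comment 5 can be sketched as below. This is a Python illustration only; the function names are hypothetical, and the underscore before the timestamp and the YYYYMM folder name are taken from the examples in the comment.

```python
import posixpath
from datetime import date


def dated_filename(filename: str, day: date) -> str:
    """Append _YYYYMMDD to the end of the name, before the extension."""
    stem, ext = posixpath.splitext(filename)
    return f"{stem}_{day:%Y%m%d}{ext}"


def chronological_path(base: str, filename: str, day: date) -> str:
    """Build base / YEARMONTH / UPPERCASE first character / dated filename."""
    new_name = dated_filename(filename, day)
    return posixpath.join(base, f"{day:%Y%m}", new_name[0].upper(), new_name)


print(chronological_path("user_files/books",
                         "my-custom-filename-whatever.jpg",
                         date(2011, 1, 31)))
# user_files/books/201101/M/my-custom-filename-whatever_20110131.jpg
```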

  6. *Bottom line, no one will be looking through those Resized images anyway.*

    Do you think that someone will look at the original files? PHP accesses
    resized images a lot more often than original files. So why not structure
    them too?

    For example, MediaWiki structures both kinds of files, e.g.:
    /aaa/bbb/cc/original_file.jpg
    /aaa/bbb/cc/resized/resized_file.jpg

    or

    /resized/aaa/bbb/cc/resized_file.jpg

  7. Maybe a dumb question: why are we burdening the file name with a timestamp
    when we can use the PHP function filemtime to retrieve the timestamp and
    organize files accordingly?
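
The idea in comment 7 could look like the sketch below, which reads the modification time from the file system instead of the file name. It is a Python illustration (the PHP counterpart mentioned above is filemtime); the function name is hypothetical and UTC is an arbitrary choice.

```python
import os
from datetime import datetime, timezone


def modified_stamp(path: str) -> str:
    """Return the file's last-modification date as YYYYMMDD (UTC)."""
    mtime = os.path.getmtime(path)
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y%m%d")
```

One caveat: the modification time changes whenever the file is rewritten, and a copy made without preserving timestamps gets a new one, so it is not always the upload date.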

    2011/1/31 Alexander Obuhovich <aik.b...@gmail.com>

  8. Hi Alex,

    The answer to your question is - yes. There are cases when you need to
    access original images - to move or review them - maybe not on a daily
    basis, but sometimes you do.

    As to the resized versions: I don't think there is a need to MIX them up
    with the originals, and I strongly believe it's far more productive to keep
    them in a completely separate structure. Now, as to whether to use a
    separate sub-folder structure for them - I say no, but I don't think it's
    that critical if we decide that we need it.

    Just think about the Clean up process for the Resized images (currently
    set up to run every 30 days) - you can't just delete the top-level folder.
    You need to scan through all folders (our long structure) and actually do
    it. I am just concerned about the additional server resources being used
    when we can just skip it, but again I am ready to discuss if you have more
    evidence that we need this.

    Cheers!

    DA

  9. Let's get back to the original post. Why do we need all that?

    Nikita said that when the file count in a folder grows over 2000 (for
    example), then *accessing those files becomes slow*.

    What exactly does the term "accessing" mean? Is it when a web browser
    requests a file (e.g. an image) from that folder, or is it something else?

    If accessing is actually reading, then structuring resized images must be
    done too, since they are accessed the same way as the original (non-resized)
    images (as Dmitry didn't see in my previous post).

    Recursive folder deletion is not an issue here; we can do it, of course.

  10. Okay, I have pointed out the ultimate goal before, here it is:

    *GOAL (as I see it):*

    Have the ability to store a large number (2,000+) of User Uploaded files
    (system/user_files) in a special folder structure so these files can be
    easily deleted, moved and accessed.

    1. Try deleting a folder with a very large number of files on Linux. I
    personally have come across this on 2 separate projects - I had to come up
    with a workaround to actually DELETE the files from the disk.
    2. By accessing, I don't mean PHP accessing the files; it's more about the
    user being able to see/browse the structure and perform actions (move,
    delete, copy).

    Phil, the answer to your question is simple. In some cases you still might
    want/need to look at the file name and see the date right away. It's not
    critical (same as the folder structure), but it's a convenience to have,
    since we still check the file names and rename them if needed.
    Additionally, this might reduce the number of _1_1_1.jpg endings when the
    same image/file is copied over again.

    DA
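
The deletion problem from item 1 of comment 10 typically appears when a shell expands something like `rm *` into more arguments than the system allows ("Argument list too long"). One possible workaround, assuming that was the limit being hit, is to remove the files one at a time. The sketch below is a Python illustration with a hypothetical function name.

```python
import os


def delete_files_in_folder(folder: str) -> int:
    """Remove the regular files directly inside folder, one by one.

    Subfolders are left untouched. Returns the number of files removed.
    """
    removed = 0
    # os.scandir yields entries lazily, so the full listing never has to
    # be passed around as one long argument list.
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                os.unlink(entry.path)
                removed += 1
    return removed
```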

  11. OK Dmitry.

    I agree with Alex that the solution would be different depending on where
    the slowdown is encountered: through Apache access, or when using PHP
    commands.

    2011/1/31 Dmitry A. <dandre...@gmail.com>

  12. I guess there are no problems with PHP/Apache accessing the files. It's
    only a problem when you are accessing files manually, e.g. by ftp.

    Also, I haven't pointed this out yet, but any uploaded file should be placed
    in that nice directory structure, not only files under "user_files". Files
    under the images folder should be processed too.

  13. Yes, I agree we should introduce this on all levels...

    DA

  14. Alex,

    Looks like it's up to us now to decide what to do here.

    I propose we finalize the tasks for the following:

    *1. Changes in Filename structure*

    Append a timestamp to the end of the filename. This makes more sense for
    ANY clean-up system - human or automatic.

    Example: my-custom-filename-whatever_20110131.jpg - this way we always know
    when the file was created. We do some minor filename processing anyway, so
    it's not like we are adding much more work for the website here.

    *2. Changes in Folder structure*

    *a.* Create a sub-folder structure of TIMESTAMP /
    UPPERCaseFirstCharacterOfFileName / Filename.jpg

    Example (taking into account that the Filename change is done too):
    user_files / books / 201101 / M / my-custom-filename-whatever_20110131.jpg

    *b.* I would leave all Resized files in a single folder OR within some
    sub-folders (we can discuss).

    Examples:
    user_files / books / resized
    / my-custom-filename-whatever_20110131_[CRC32].jpg (*I prefer this method*)
    user_files / books / resized / 201101 / N /
    new-custom-filename-whatever_20110131_[CRC32].jpg

    What do you think?

    DA

  15. As I've mentioned before in 2 posts in this discussion, we should inspect
    other websites that may be using a similar system to store files.

    For example *MediaWiki*: http://www.mediawiki.org/wiki/1.6_image_storage

    The link is from an older version of the wiki than the current one, but I
    hope you get the idea of the image storing scheme (which may differ in more
    recent MediaWiki versions).

  16. Hi Alex,

    I have read what they say on MediaWiki and I think it's a bit too much. I
    guess it was the OLD scheme + a proposed new one, and I have not found much
    on the web about this.

    Is there anything specific you like from here:

    Processing

    Upload new file:

       1. Generate cropped content hash (eg 123456789abcdef0)
       2. Check for hash collisions in upload table
          - Collision? Already have this file; can discard the uploaded one.
          - Otherwise, move the file into place:
          $wgUploadDirectory/1/2/3/456789abcdef0
       3. Check file table for existing record with the given name ('Puppy.jpg')
          - None? Insert a new null record for the filename
       4. Insert a new upload record for filename 'Puppy.jpg', file
       123456789abcdef0
       5. Update the file record for the filename to point at this latest upload
       6. Purge affected page caches

    Revert file:

       1. Insert a new upload record referring to the prior file

    DA

  17. Here is what MediaWiki does in version 1.13.2:

    The filename (without path, since all uploads are stored under the
    "/images/" folder) is given as input to this function:

    static function getHashPathForLevel( $name, $levels ) {
        if ( $levels == 0 ) {
            return '';
        } else {
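
The PHP excerpt in comment 17 is cut off in the imported thread, and the sketch below does not try to restore it. It is a Python illustration of the general technique, assuming (in line with the "a", "a1" folders mentioned in comment 3) that each level's folder name is a leading slice of the md5 hash of the file name. The function name `hash_path` is hypothetical.

```python
import hashlib


def hash_path(name: str, levels: int) -> str:
    """Return a relative folder prefix such as 'a/a1/' for the file name."""
    if levels == 0:
        return ""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Level i uses the first i hex digits of the hash: 'a/', 'a/a1/', ...
    return "".join(digest[:i] + "/" for i in range(1, levels + 1))


print(hash_path("Puppy.jpg", 2) + "Puppy.jpg")
```

With two levels this gives at most 16 folders on the first level and 16 subfolders in each of them, and the path can always be recomputed from the file name alone.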

  18. Hi Alex,

     Thanks for your input, Alex, it is very interesting.

     One of the things I worry about: if we are going to run a hash function to
     work out the image folder every time we need to access a file, it will
     take much more time to execute.

     I would store the full path of the image to speed up processing.

    DA

  19. Yes, storing the hash path along with the image name could have its
    benefits from a performance viewpoint.

    In that case we need to check that the "/" characters in that path are not
    converted to %2F at some point during url generation for that file.
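
The concern in comment 19 can be illustrated as follows: a "/" that gets percent-encoded becomes %2F, so the stored relative path has to be escaped in a way that leaves the separators alone. This is a Python sketch; the function name and the example base URL are hypothetical.

```python
from urllib.parse import quote


def file_url(base_url: str, stored_path: str) -> str:
    """Join the base URL and the stored relative path (folders + file name)."""
    # safe="/" keeps the folder separators; other unsafe characters,
    # such as spaces, are still percent-encoded.
    return base_url.rstrip("/") + "/" + quote(stored_path, safe="/")


print(file_url("http://example.com/system/user_files", "a/a1/my file.jpg"))
# http://example.com/system/user_files/a/a1/my%20file.jpg
```

Storing the relative path next to the file name also means the URL can be built without recomputing the hash on every request.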

    On Fri, Feb 4, 2011 at 4:02 PM, Dmitry A. <dandre...@gmail.com> wrote:

  20. One more thing here: what about resized images? If you are suggesting to
    store the hashed path along with the original files, then where do we store
    the hashed path for resized images? The name of a resized file is
    different, so its hash path will differ too.

  21. Hi guys,

    After further thinking I believe there is a major CON for maintaining
    files/folders when a Hash is used to auto-generate Folder paths. This makes
    me believe we need to provide options to the developer and let him decide
    which method is better for a particular upload field.

    To summarize, I propose the following:

    *1. Filenames*

    *a.* Ability to have a Prefix (defined in the Field definition; default is
    empty; DATE and TIME are reserved words, but any other symbols can be
    specified)

    *b.* Ability to have an Ending (defined in the Field definition; default is
    DATE-TIME, which is reserved, but any other symbols can be specified)

    DATE = 20110131 (2011/01/31)
    TIME = 123401 (12:34:01)

    *Examples:*

    - 20110131-123401_my-custom-filename-whatever.jpg
    - my-custom-filename-whatever_20110131-123401.jpg

    or other combinations

    This will give us full flexibility in filenames.

    *2. Folders*

    Similarly here, give 3 options for each upload field:

    *a.* As it is now - all files go into a single specified folder.

    *b.* Chronological - YEARMONTH / DAY / (adds 2 sub-levels)

    *c.* Hash - as proposed by Alex (adds 2 sub-levels)

    *NOTE:* I believe all these should be done as settings for UploadFormatter.

    All resized versions will be placed without special structure under the
    /resized/ folder.

    Please review and post your opinions or questions here.

    DA
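
The per-field options summarized in comment 21 could be modelled along these lines. This is a Python sketch, not the UploadFormatter API: the function names and the mode names "flat", "chronological" and "hash" are hypothetical, the "_" separator is taken from the examples above, and the hash mode assumes leading md5 digits as folder names.

```python
import hashlib
import posixpath
from datetime import datetime


def expand(token: str, when: datetime) -> str:
    """Replace the reserved words DATE and TIME; other symbols pass through."""
    return token.replace("DATE", f"{when:%Y%m%d}").replace("TIME", f"{when:%H%M%S}")


def stored_name(filename: str, when: datetime,
                prefix: str = "", ending: str = "DATE-TIME") -> str:
    """Apply the optional prefix and ending around the original name."""
    stem, ext = posixpath.splitext(filename)
    parts = [expand(prefix, when), stem, expand(ending, when)]
    return "_".join(p for p in parts if p) + ext


def sub_folder(name: str, when: datetime, mode: str = "flat") -> str:
    """Return the extra sub-levels for the chosen folder option."""
    if mode == "chronological":          # option b: YEARMONTH / DAY /
        return f"{when:%Y%m}/{when:%d}/"
    if mode == "hash":                   # option c: two hash-based levels
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        return f"{digest[:1]}/{digest[:2]}/"
    return ""                            # option a: single folder, as now


when = datetime(2011, 1, 31, 12, 34, 1)
name = stored_name("my-custom-filename-whatever.jpg", when)
print(sub_folder(name, when, "chronological") + name)
# 201101/31/my-custom-filename-whatever_20110131-123401.jpg
```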

  22. All seems OK, but Dmitry forgot to mention that the hashed path will be
    stored along with the filename in the database (in case the hashing
    algorithm is changed for a given field over time).

    If the others approve it, then we need to determine which option names
    best match the given functionality.

  23. Dmitry, your proposal seems to fit all cases. I understand the harder part
    (dealing with hashes) will belong only to section 2.c; it's nice to have all
    these options for files.

    I'd like to ask here for another option (maybe it already exists, but I
    don't know about it):

    2.d - set up a different folder/server for files: if left blank, we use our
    current /system/Images as the base path, otherwise we can specify things
    like images.website.com/ , 44.25.125.32/images ...

    2.e - I propose a "prevent hotlinking" checkbox that automatically detects
    (or adds?) a line in the .htaccess of our system/Images folder to prevent
    hotlinking (this is easy for us, but not for an average user)

    What do you think? We'd have a nice admin menu to fine-tune the whole
    file storage.

    2011/2/6 Alexander Obuhovich <aik.b...@gmail.com>

  24. 2.d - yes, we have a discussion about that and a task:
    https://groups.google.com/d/topic/in-portal-dev/OVSKAGJx3FU/discussion (this
    is designed for having another webserver, e.g. using nginx for faster image
    loading)

    2.e. What is hotlinking?