I recognize that some of these might be artifacts.
Of course, these questions are probably not worth thinking about for Wakaba's normal use, but I thought I'd throw them out there. (3) kind of stumps me, though.
Yeah, 3 would stump anyone, wouldn't it? Anyone who hasn't had to deal with GODDAMN DUMBASS MYSQL, that is. MySQL apparently treats integers ONLY as 32-bit signed integers, and it clamps positive values that are bigger than 31 bits. Which means I can't store an integer IP address in an integer field, but have to do use retarded hacks like making it a text field.
I'm not quite sure what you're referring to with 1 and 2, though. If by 2 you mean, why is there a field that contains the image instead of just combining the filetype and date stamp, that was originally an artifact from the Futaba design, but as it happens, it's actually required now that you can set the script up to not rename files.
>>2: eh? There are several different types of integers in MySQL, all the way up to 64 bits, and they can all be unsigned if you wish. http://dev.mysql.com/doc/mysql/en/numeric-types.html
Those aren't, as far as I know, really standardized. I was trying to keep my SQL code as standardized and generic as possible and not tied down to a specific database.
I realize very well that this is completely futile, but there you have it.
With 1 I was referring to the 6-8+ SELECTs everytime you post an image. It's getting rather slow, and I was hoping indexing parent would help. I can't comment yet on its effectiveness.
With (2), I see. I hadn't noticed the "preserve name" thing.
(3): From Wikipedia:
"The SQL standard is not freely available." WTF?
Anyway, idealism is overrated. You can discriminate against databases that don't support unsigned integers.
"All integer types can have an optional (non-standard) attribute UNSIGNED." (MySQL docs)
PostgreSQL just has bigint and the inet thing. I think the modern attitude is "You want a bigger number? Use some more bits! Ha ha ha!" They have a point, though.
That doesn't change my opinion from >>4.
Oh, I'm no good at all with SQL speed tricks. If you figure out something that does work, do tell.
Ah, there are several places in the code to be improved, but they have little effect individually, and little on Wakachan boards in general. The basic principle is that every "SELECT * FROM" without the help of an index is O(N), where N is the number of rows(posts). You can't throw it around casually unless your database is small. Which almost every board here is. It gets noticeable in the thousands.
The most glaring ones are where you do the select and then throw it away if a flag is set. Simple to fix. I'll go through it, but do you want them now or after the mythical next release?
Well, the main thing is that this only ever happens when somebody posts. That really means there is very little use in optimizing in the first place, because it happens so seldom. Even on the busiest boards on Futaba, posting only happens every other or third second. iichan never got above more than a couple of percent CPU usage.
The primary contributor to the board is the poster. So I'm mainly doing this for me, but heavy contributors on large storage boards would find it useful. I was running into 5 second lag times between posts.
I take it there's no timeframe on the mythical next version? I don't want to go through and change everything if I have to do it again in a few weeks.
Things I plan to do:
2. Separate the recent post lists into a separate table to shorten flood checks. Removes index dependency on ip and timestamp.
3. Move checks for things such as MAX_THREADS/MAX_SIZE, and so on to before their relevant selects.
4. With those, I can index md5(10), parent, and lasthit only.
5. Somehow shorten build_cache. Most activity is in the first 20-30 threads, so try and avoid rebuilding the back pages. Perhaps creating a thread table for this would help, and let me remove the lasthit index.
6. Figure out how to run a profiler on the thing. Okay, I should be doing this first, but 1-4 aren't that much work. You don't happen to have one set up, do you?
I'll let you know how it goes. Might be a while, though.
Five seconds? I find it hard to belive that just a handful of SELECTS from a database of only a couple of thousand entries would take that long. Are you sure the database doesn't have any other problems to make it that slow?
The Next Version is probably my next project after I get Xee into some sort of releasable form... Might be soon, or it might be some weeks. dmpk2k's been hacking the source quite a bit in the meantime, so working from the current version is probably a bad idea. I should also warn you that if you want any hacks to go into the official version, they'd better work on at least SQLite in addition to MySQL, because that's what I prefer to use. For that matter, how about you try running on SQLite to see if that's any faster?
And no, unfortunately I don't have a profiler set up.
The problem is I don't know what the relative cost of network communication or database connection setup is, thus the need for a profiler. So we'll see. Apparently all you need is "PerlModule Apache::DProf" in httpd.conf, which means I'll need to install all that stuff...ugh. Anyone else willing to do this?
Any changes will still use "basic" SQL. Maybe you can get dmpk2k to do (1)? That one kind of irks me. It is in both SQLite and MySQL.
I will try SQLite. Sometime.
A lot of these are good ideas. Maybe they won't result in a large difference overall, but they're still a net win. Some hosts keep tabs on the number of queries, so less is better.
That said, I have some concerns about 2. While I think getting bogged down by legacy is undesirable, changing table layout also means that current boards can't upgrade. If you can get it to work without needing to alter the table schema, that's great.
This has actually been problem to me for a while. For example, if the current posts didn't autoincrement, I'd be a whole lot happier (because it means I can do something else with it).
Maybe at some point we'll need a 3.0.0.
Existing boards can move painlessly. It is merely an extra table with id/num/ip/timestamp/comment/"is-this-an-image", holding posts from the last RENZOKU3 seconds. Granted, you might need to write a script for the complete novice, but it is fairly simple.
And what do you mean by autoincrement being restrictive? I can't think of anything.
I need more control over the post number for something I want to do. Namely, that whole DistWakaba thing. I'll probably never get around to doing it, but this just adds one more excuse.
When you have multiple boards, there's no guarantee that there won't be clashes between boards. E.g.: board 1 & 2 are at post 1000. Two different posters post to each board at the same time. So now the two boards are in an inconsistent state: different pictures or text for the same post number.
There are quite a few ways to solve this, but many of them rely on using numbers that may be arbitrary increments over the current post number. Using a master-slave architecture solves this, but also adds latency and increases the possibility of failure, so I'd prefer to use something else.
I could just use ALTER TABLE, but my understanding is that it's a crapshoot across different DBMS.
What's wrong with just changing the number on one of them, using a quadratically increasing formula for failed UPDATEs? The autoincrement keeps track and starts from the new higher number.
What is your current idea of an architecture? A set of equal hosts that know about the rest, with additions being administrative? Or a looser model where the list is propagated through the cloud?
> quadratically increasing formula for failed UPDATEs
I was not aware you could make a number autoincrement with a failed UPDATE.
The problem is that communication between hosts is potentially very expensive. You don't want to have several failed UPDATEs, because that means several failed checks (and thus communications). The fastest way is for all boards to have a method of electing a master. When a post occurs on a board, it makes a single request to the master, which provides a new post number.
This is a bit ugly though, because it causes problems when host A is receiving post 1001 from host B and 1002 from host C, yet host B is amazingly slow, so 1002 arrives first. Now what? Update the site with a hole in the middle? Wait for B to finish? Or...?
> What is your current idea of an architecture?
A set of equal hosts, who pass posts and deletions among each other, as well as actual image requests from users. This also includes lists of posts and images if a host has become desynchronized with the rest, so it can catch up. I suppose you could add seeds to the mix as well, since they're trivial to add compared to the actual boards.
Needless to say, this results in all sorts of race cases, security issues, and problems caused by latency hiding. Latency hiding and having no central point of failure causes most of the problems, yet people won't post to a slow or unreliable board, so those problems must be solved.
Well, ideas are a dime a dozen. Actually coding it is quite another story. ;_;
My main problem with point 2 in >>11 is that it adds redundant information to the database, just for a speed hack. I mean, when you start adding tables of stuff that's already in the database, you're getting close to the point where you'd have to ask yourself why you are using a database at all, and not just storing data in optimized forms in flat files or something like that.
The whole point of a database is to lump all your information in there, and then query it in various ways. Starting to second-guess it by building extra tables of the same information is inelegant and creates kind of a mess.
Yes, just insert the data while specifying all information. That way 1001 arrives later and just slides into the hole. What I meant with the quadratically increasing formula was to do it in code, but a master assigning numbers is better.
Latency hiding over the Internet requires temporary inconsistency. As long as it eventually gets there, don't worry about how long it takes. It breaks the "unseen posts are on top" mantra unless you order by arrival time, which I feel is acceptable. I think it would be better to get this working for anything before you try to make it work for Futaba /b/ traffic. How do they do it? A bunch of locally networked computers?
>>19
That's true for most boards, but in this case the other table would have around 0-20 rows as opposed to the hundreds or thousands in the main table.
Anyway, the proposed profiling effort will highlight whether it is necessary, or worthwhile compared to an index. I just don't want to take the hours of computer time necessary to do it.
After thinking about it some more, the most probable source of lag for me is point (5). Your numbering scheme of 1.html for the newest posts is better and simpler for boards with turnover, but it's death for boards that do not delete posts regularly. That's a personal tweak I'll have to make.
The problem with point 5 is that posting a new image will always require re-building all sub-pages. It's only for replies that you might get away with not rebuilding them all, and even then you have to deal with threads being deleted off the end.
Yes, quite correct. For normal boards. For my board, however, I will only need to modify the first two pages and possibly renumber the second page. This will require quite a shift through reversing the number and keeping the second page short, which is unfortunate but I do it for performance. I could hide it from casual view by making the first THREADS_PER_PAGE pages fluid and clipping one thread off from each of them.
I do not delete threads automatically, although I leave the MAX_SIZE check in because I'm paranoid. Like I said, a personal tweak. You could emulate it by deleting a page at a time, but I don't think it's worth it.
To be honest, I should probably just use another script instead of bastardizing this one so much, but I don't know of any others. Wakaba is pretty easy to use, understand, and modify. It also happens to mesh with the "translate and edit every comic we can find" fever that's currently on the board.
>>20
For something like this, I'd prefer to have completely planned the way it will eventually work, but I never said I'd use waterfall for the code itself. %)
Anyway, I think it's now apparent why I don't like the autoincrement. It works well for the current system, but it also constrains what I can do.
Is there something that prevents you from discarding it?