Why Can't Database Tables Index Themselves?

foobar · July 20, 2006, 12:00am

But I ALWAYS put a primary key on a table.

I would certainly hope so, if just so that you satisfy the rules of normalization.

Especially with SQL Server, you will run into major performance problems if you don’t.

mike154 · July 20, 2006, 12:00am

User: Select LastName from Employees

User: Select LastName from Employees Where LastName = “Jones”

User: Select FirstName, LastName From Employees Where LastName = “Smith”

Clippy: I see you’re searching on LastName! Would you like some help in indexing that field?

codinghorror · July 20, 2006, 12:00am

Not really, since the database server has to parse a new query to figure out how to execute it. And that includes determining what indexes to use, etc. This is called an “execution plan”. It’s generally cached so the next time the same query is called, the execution plan is not regenerated.

The database server, in its normal course of operation, has access to ALL the data it needs to make an excellent decision about what tables/columns should be indexed!

Chuck · July 20, 2006, 12:00am

Jeff,

Having done lots of tuning of databases both big and small, the simple answer here is this: how is the database being used? You can’t apply the rules as simply as you’re stating them when it comes to building indexes becuase you simply cannot know how the database is being used when the criteria is as narrow as you’re stating it here.

Example: you’re working with a database that’s 1GB in size. You run a few queries and then, you find that it’s automatically built you an index or two to support what you’ve done. Yippe, you’re getting data back faster now, automaticaly.

Example 2: you’re working with a database that’s 90GB in size. You run a few queries against it, and then you find that the server appears to be frozen while it’s automatically building indexes to support your few queries. After you’re done and everyone’s productivity has stopped because of the “frozen” server, you find that the server is still slow because those two queries you ran didn’t have much to do with the other queries you’re running, and then more indexes are built for you automatically to support those other queries that you just ran. And then you find you’re drive is running out of disk space…

So I much rather opt for the Index Tuning Wizard in SQL 2000 or the new 2005 tool (Database Tuning Wizard or some such name) than to have the server do it for me. It give you the flexibility to give a “workload” that gives a good representation of the typical activity that your database experiences, taking into account both sides of the indexing equation (i.e., queries vs. DML). And at the end, you still have to review the suggestions to make sure they make sense for your actual usage. I’ve found some pretty wacky suggestions from the ITW over the years. And you definitely do NOT want it building indexes for you “right now” - you want to schedule these things as they most definitely have an impact on your experience with the database.

-Chuck

Gabri · July 20, 2006, 12:00am

I’m currently working on a reasonably sized data warehouse (300Gb), which has some tables that run over 80 million records, gaining 4-5 million more each year. Can you imagine what would happen if the database itself decided it might be a cool idea to add a couple of indexes to those tables?

Of course, it would make my job as performance consultant a lot more interesting, I bet I’d see a lot more projects come my way…

WesleyS · July 20, 2006, 12:00am

OK, so the straw man “autobuilding without asking kills production performance” is dead. However, that doesn’t address tools that collect the performance information in real time and can present suggestions and show the impact of such changes (i.e., 40% faster response time on queries that include X, 10% slower insert speed, updates unaffected because the index field isn’t updated commonly).

As many have pointed out, SQL Server has tools that give a (limited) version of this. These tools are not a replacement for a DBA (if you can’t get far enough to bother with a primary key, no tool will save the design disaster from your incompetence) but they sure could make the DBAs job more effective by being able to back up decisions with hard data.

Which brings up the point that tools of this nature should keep performance metrics from before and after the changes are made, so underperforming changes can be undone.

db15 · July 20, 2006, 12:00am

The only thing that you’re missing is the idea that the database somehow knows what the data means and will beable to determine all the things the data will mean in the future.

For example: If users continue only insert unique data into a column does that mean that you can safely optimize the column to use a unique index?

What if that column was a list of employee names and the table held ‘blog’ entries. Each person only makes one ‘blog’ entry a month. For the first month the column has unique values in its index. Let’s say your auto indexer sees this as an opprotunity to put a unique on that column. Now when the next month rolls around your users can’t insert a new ‘blog’ entry because that would violate the uniqueness of the column.

Not only with the case of seemingly unique data not really being unique does it pose a problem to have an automatic system impose indexing. All forms of indexing pre-supposes a semantic constraint on the database that does not exist in its internal representation. It also presupposes a lack of concern over performance issues.

What if your program indexes a “bit” feild?
http://www.aspfaq.com/show.asp?id=2530

Using an INDEX on a database table is to explicitly identify time-space trade-offs for the database engine. Indexes aren’t magical, they live on disk. To hit an index is to hit disk to save CPU for searching. To say “index everything” is to say “I don’t care about I/O” to say “index nothing” is to say “I don’t care about CPU load” and to say “Index this” is to say: “In this instance it is better to use I/O instead of CPU” and if you can write a program that reliably predicts when the trade-off is right that is something very valuable indeed… and it is one step closer to making programmers and DBAs obsolete.

If you buy the idea that you can create a heuristic auto indexer then you aren’t far from believing that computers will (eventually over the eons of time) program themselves and all programmers and DBAs will be utterly obsolete and out of jobs. And robots will rule the earth.

Gokhan2 · July 20, 2006, 12:00am

Query optimizers decides indexes which will be used for a particular query for ages. We dont have to a specify a plan to each query by this way. “What if” type responses to this (Index optimizer) idea are “What if query optimizer sucks; I want to specify a plan for each query I execute”.

I think if Index optimizer is smart enough and you let it work from first day of database, you’ll have good enough indexes for your database.

This may not be suitable for Google’s database servers, but I definitely enjoy this type of feature if it will be available.

HitScan · July 21, 2006, 12:00am

PostgreSQL would react the same way that Danie mentiones Oracle reacting.

Sometimes an Index scan is more expensive than a full table scan. So these indexes that are auto-(generated/suggested) may end up completely ignored. It can be maddeningly more expensive if your data has never been reordered (basically remaking the table from a SELECT * FROM whatever ORDER BY index_field) because for each page of the index, several random pages of the table may need to be loaded.

Accurately determining when an index scan is more expensive depends on table statistics and cache size, and all kinds of things that developers normally never or rarely see. That’s why they have the DBAs do it. (Wish I could afford one. D: )

Danie · July 21, 2006, 12:00am

I work with Oracle are there are multiple types of indexes ie B-tree, FBI, and Bitmap. It takes planning and a considered decision to add an index to a table outside both the primary and foreign key (which should always be indexed if you do not want dead locks).

These requirements also change over time, the idea that databases could automatically index and keep these in line with future requirements is a Nirvana that will never be reached

Mindfulgeek · July 21, 2006, 12:00am

These are all great points. But the assumption here is either it is done completely automatically and all the time or just suggesting what indexes to create.

What about taking the approach of a knowledge-tree. The Clippy Indexer could walk the user through a series of questions, such as Choose how this database is used? (reports, application, transactional, etc). What percentage of the most used queries? What other queries do you wish to include in the optimization? Stuff that we would look for as dba’s and developers. Then let it do the crunching of looking for patterns and such. Then it could ask more questions like what type of index to use, should it be clustered, or maybe give more than one suggestion for each index. Finally it would ask if you wish it to create the index.

This would give control over the process to the user and it would do all the manual crunching for them.

Danie · July 21, 2006, 12:00am

Just following up on another point the original post made.

Does this query result in a table scan?

Full Table Scans (FTS) are not necessarily a 'bad thing’; this is a common misconception. This is true in Oracle at least cannot comment about any of the other Database Vendors.

Have a look around some of the work Tom Kyte has done on FTS.

In conclusion the algorithm (if FTS then add Index) would not work.

willd · July 22, 2006, 12:00am

What is being asked for in index reports and capabilities already exist, excluding the cute user interface, and even that can be had with purchasing 3rd party products.

All your modern databases have the abaility to store/report on indexes and thier use or lack. For instance I use a monthly job that gives me a report of all indexes which are used, amount of usage,space required,etc and I also get a listing of queries against tables that did not use indexes or maybe could of benifited from them, and how many times that occured in addition to the queries that are the most CPU intensive.
This type of report helps to get the ones that fall between the cracks both in creation and in no longer being needed.

As for the DBA position, for the past decade it has been more a position of dealing with the end-user, developers,etc then with the internal secrets of the database. yes the dba will have to work with the internals but time and knowledge wise it has become somthing alot less needed.
For your average database usage Oracle and DB2 both provide tools that a standard system admin, with a little extra training could use to keep the database in excellent running shape and MS-SQL never provide the options,until 2005, that you could do much about it.

cgregjam · July 22, 2006, 12:00am

I believe self indexing is a good idea. However as is said in replies, dynamically creating indeces in a production environment could cause further problems.

I would let the dba create the index they disire. but allow the database to run in a special logging mode where all plans used by the database are recorded and the appropriate index statements are generated.

Run the database and all related client applications over a period of time.

Turn off the logging mode, after which the dba can create all or the most frequently used plan indeces which are not yet created.

JeremyB · July 22, 2006, 12:00am

The simple solution to the whole “don’t do that in production” debate: A maintenance window where these types of operations are performed. Your DBA should have one of these in place already, why shouldn’t your database know and follow the same practices? (Such as not performing a backup OUTSIDE of this window, unless the database is not “hot” or in use)

I don’t know about you but the limited knowledge of databases I have doesn’t make me want to keep the same knowledge going forward until the day I die. Perhaps the automatic part of the discussion is the problem, but to suggest the entire thing is useless is only thinking about TODAY. If you are developing a RDBMS and not thinking about these types of issues you might as well just stop. I don’t want a “modern” database with the mentality of a ENIAC. “Hardcore” DBAs may prefer such a system but as time progresses you’ll have to justify the cash I’d be spending on such turds.

At the very least perhaps you can turn those Access “gurus” into a real DBA in the process, or at least have them thinking about real database problems.

MichaelB · July 23, 2006, 12:00am

Funny, people used to say the same thing about writing machine language. “Use a high-level language like C? And give up all the efficiency of machine language? Are you out of your mind?”

Joshua’s right, the database should self-optimize. Although there are always those people who will study a system deeply and understand its nuances and thus be able to out-perform and out-art any robot, for the vast majority of the human population the robots win.

Kasparov Chessmaster you.

BuggyF · July 24, 2006, 12:00am

been away, so lots o comments. the robot quote was a quote, not me. that said: those of us in the database world have found it amusing (in a macabre sort of way) that coders jump at the chance to let a robot design the datastore (which has more importance than code, of course) but rail loudly at the notion of code generation; MDA or executable XML or Firestorm or Scaffolds. Kind of funny. Guess it boils down to whose ox is to be gored.

A database is not just a bunch of files wrapped up by a SQL parser. Coders still don’t get that part. Phew.

Gokhan3 · July 25, 2006, 12:00am

Arrggghh.

Who wrote database software which you admire?
Hint: Coders!

“Some” DBA’s still dont get that part.

AndrW · July 26, 2006, 12:00am

You could miss if the fault is really of the program, it may not be getting the most of the SGDB.

I think is a matter of design, get some time and verify how the data are gonna be used.

DaveE · July 27, 2006, 12:00am

Interesting idea. I covered this on Log Buffer - http://www.pythian.com/blogs/225/ . The conversation continues there too.