Post

Generate MongoDB Index Utilization Report

When MongoDB 3.2 introduced the $indexStats aggregation pipeline stage accesses details were suddenly accessible to users. As a result, scripts could now be written to better understand how frequently indexes were being accessed by operations.

The following script will cycle through all databases and collections (omitting admin, local and config) to produce a delimited report of index utilization:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
var DELIMITER = '\t';
var IGNORE = ["admin", "local", "config"];
print(["Namespace", "Index Name", "Usage Count", "Last Used", "Index Size (bytes)", "Index Specification"].join(DELIMITER));
db.getMongo().getDBNames().forEach(function (dbname) {
    if (IGNORE.indexOf(dbname) < 0) {
        db.getSiblingDB(dbname).getCollectionNames().forEach(function (cname) {
            if (!cname.includes("system.")) {
                var coll = db.getSiblingDB(dbname).getCollection(cname);
                var stats = coll.stats();
                // make sure stats ran successfully (if it's a view it won't)
                if (stats.ok == 1) {
                    coll.aggregate([{ $indexStats: {} }]).forEach(function (ix) {
                        var ixname = ix.name;
                        var ns = dbname + "." + cname;
                        var ixsize = stats.indexSizes[ixname];
                        var ops = ix.accesses.ops;
                        var since = ix.accesses.since;
                        print([ns, ixname, ops, since, ixsize, JSON.stringify(ix.spec)].join(DELIMITER));
                    });
                }
            }
        });
    }
});

For example, I ran the above against a test cluster I have in MongoDB Atlas with a DELIMITER set as a pipe character (|) to facilitate the generation of a Markdown table such as the following:

NamespaceIndex NameUsage CountLast UsedIndex Size (bytes)Index Specification
data.usersage_1_address.state_1_name_1NumberLong(0)Tue Jan 18 2022 14:05:40 GMT-0500 (Eastern Standard Time)48164864{“v”:2,”key”:{“age”:1,”address.state”:1,”name”:1},”name”:”age_1_address.state_1_name_1”,”ns”:”data.users”}
data.usersidNumberLong(0)Tue Jan 18 2022 14:05:40 GMT-0500 (Eastern Standard Time)34496512{“v”:2,”key”:{“id”:1},”name”:”_id”,”ns”:”data.users”}
data.usersaddress.state_1_name_1_age_1NumberLong(0)Tue Jan 18 2022 14:05:40 GMT-0500 (Eastern Standard Time)43565056{“v”:2,”key”:{“address.state”:1,”name”:1,”age”:1},”name”:”address.state_1_name_1_age_1”,”ns”:”data.users”}
data.usersage_1NumberLong(0)Tue Jan 18 2022 14:05:40 GMT-0500 (Eastern Standard Time)13721600{“v”:2,”key”:{“age”:1},”name”:”age_1”,”ns”:”data.users”}
encryption.__keyVaultidNumberLong(0)Tue Jan 18 2022 14:05:40 GMT-0500 (Eastern Standard Time)36864{“v”:2,”key”:{“id”:1},”name”:”_id”}
medicalRecords.patientsidNumberLong(0)Tue Jan 18 2022 14:05:40 GMT-0500 (Eastern Standard Time)36864{“v”:2,”key”:{“id”:1},”name”:”_id”}

As indexes aren’t free (see “Indexing Strategies”) dropping unused indexes will allow you to reclaim some disk space and potentially improve write throughput. In a replica set the output would be for the current node you’re connected to (likely the PRIMARY). Before dropping indexes ensure you review the output above for all data bearing nodes as some workloads may only target SECONDARY members, which would result in (likely) lower usage statistics on the PRIMARY.

Note that the Last Used values are reset when a mongod is restarted. If the Usage Count is 0, the Last Used value will indicate the time the process was started; not when that index was actually last used. The output of the Last Values above will be in your local timezone. For more information see the MDN Docs for Date.

Let me know if you find this script useful in the comments below ;)

This post is licensed under CC BY 4.0 by the author.

Comments powered by Disqus.