I am building an IPv4/IPv6 geo-IP MongoDB database and I will have 100+ million IPs.

The structure of the database will be:

[
    { _id: ObjectId("58fdbf5c0ef8a50b4cdd9a8e"), ip: '34.53.63.25', ip_hex: '0x22353f19', type: "ipv4", data : [
        {
            country : "CA",
            region : "region1",
            city : "city1",
            blacklisted : "no",
            on_date : ISODate("2022-05-05T00:00:00Z")
        },
        {
            country : "CA",
            region : "region1",
            city : "city1",
            blacklisted : "yes",
            on_date : ISODate("2022-06-05T00:00:00Z")
        },
        {
            country : "US",
            region : "region2",
            city : "city2",
            blacklisted : "no",
            on_date : ISODate("2022-05-05T00:00:00Z")
        },

        ...
    ]},

    { _id: ObjectId("58fdbf5c0ef8a50b4cdd9a8f"), ip: '1.2.3.4', ip_hex: '0x1020304', type: "ipv4", data : [
        {
            country : "CA",
            region : "region1",
            city : "city1",
            blacklisted : "no",
            on_date : ISODate("2022-06-05T00:00:00Z")
        },
    ]},

    { _id: ObjectId("58fdbf5c0ef8a50b4cdd9a90"), ip: '2345:0425:2CA1:0000:0000:0567:5673:23b5', ip_hex: '0x234504252ca1000000000567567323b5', type: "ipv6", data : [
        {
            country : "FR",
            region : "region1",
            city : "city1",
            blacklisted : "no",
            on_date : ISODate("2022-06-05T00:00:00Z")
        },
        {
            country : "FR",
            region : "region1",
            city : "city1",
            blacklisted : "yes",
            on_date : ISODate("2022-07-05T00:00:00Z")
        },

        ...
    ]},
] 

I am converting all IP strings to hex:

1.1.1.1 -> 0x1010101
1.2.3.4 -> 0x1020304
34.53.63.25 -> 0x22353f19
255.255.255.255 -> 0xffffffff

0001:0001:0001:0001:0001:0001:0001:0001 -> 0x10001000100010001000100010001
1111:1111:1111:1111:1111:1111:1111:1111 -> 0x11111111111111111111111111111111
2345:0425:2CA1:0000:0000:0567:5673:23b5 -> 0x234504252ca1000000000567567323b5
2345:0425:2CA1::0567:5673:23b5          -> 0x234504252ca1000000000567567323b5
ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff -> 0xffffffffffffffffffffffffffffffff
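
A minimal Node.js sketch of that conversion (the function names ipv4ToHex and ipv6ToHex are just illustrative):

// IPv4: fold the four octets into one integer, then print it as hex.
function ipv4ToHex(ip) {
    const value = ip.split('.')
        .reduce((acc, octet) => (acc << 8n) | BigInt(parseInt(octet, 10)), 0n);
    return '0x' + value.toString(16);      // 1.2.3.4 -> 0x1020304
}

// IPv6: expand "::" into the missing zero groups, then fold the eight 16-bit groups.
function ipv6ToHex(ip) {
    const halves = ip.split('::');
    let groups;
    if (halves.length === 2) {
        const left  = halves[0] ? halves[0].split(':') : [];
        const right = halves[1] ? halves[1].split(':') : [];
        groups = [...left, ...Array(8 - left.length - right.length).fill('0'), ...right];
    } else {
        groups = ip.split(':');
    }
    const value = groups.reduce((acc, g) => (acc << 16n) | BigInt(parseInt(g, 16)), 0n);
    return '0x' + value.toString(16);      // matches the examples above
}

One caveat if ip_hex is kept as a string: the outputs above have variable length, so padding to a fixed width (for example value.toString(16).padStart(32, '0')) would be needed for lexicographic string order to match numeric order in range queries and sorts.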

There will be a lot of searches by IP, and new data will be added to, updated in, or deleted from every IP's document each day.

I will search IP ranges, sort, update, and delete.
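
As an illustration of that daily write pattern, a mongosh sketch (the collection name ips is a placeholder, and the values are taken from the example documents above):

// Append today's geo record to one IP's data array.
db.ips.updateOne(
    { ip_hex: '0x1020304' },
    { $push: { data: {
        country: 'CA', region: 'region1', city: 'city1',
        blacklisted: 'no', on_date: ISODate('2022-06-05T00:00:00Z')
    } } }
);

// Remove array entries older than some cutoff date for the same IP.
db.ips.updateOne(
    { ip_hex: '0x1020304' },
    { $pull: { data: { on_date: { $lt: ISODate('2022-05-01T00:00:00Z') } } } }
);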

What index is recommended on the "ip_hex" field? I was thinking of a B-tree index on the hex value rather than the original string.
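
For context, MongoDB's regular single-field indexes are already B-tree based. A mongosh sketch of what I have in mind (the collection name and the range bounds are just placeholders):

// One ascending index on ip_hex; unique if each IP is stored only once.
db.ips.createIndex({ ip_hex: 1 }, { unique: true });

// A point lookup and a range scan can both use the same index.
db.ips.findOne({ ip_hex: '0x22353f19' });
db.ips.find({ ip_hex: { $gte: '0x22350000', $lte: '0x2235ffff' } }).sort({ ip_hex: 1 });

If ip_hex stays a string, those range comparisons are lexicographic, which is another reason to zero-pad IPv4 and IPv6 values to the same fixed width (or to store a fixed-length binary/numeric form of the address instead).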

I want the database to be efficient. What other optimizations should I take into consideration?

Thank you.

My thoughts:

  1. Don't reinvent the wheel. Use prior projects like https://github.com/ip2location/ipinfodb-php
  2. Consider starting with NO initial values in the database. Add an entry when a query comes in for an IP that is not found. Otherwise the database is likely 99.999% full of entries that will never be used.
commented: 2. You mean add all possible IPs? For IPv4 that might be possible, for IPv6 ... not a chance. I wonder if I can have them sorted already +0

Why did you choose to use MongoDB for this use case? Why convert the IP strings to hex? How does that make them more searchable? Also, why not use an existing solution, as rproffitt mentioned, such as GeoIP?

commented: it will not be just a simple geo-IP database, it will have other features as well +0

"2." I meant what I wrote. DO NOT FILL THE DATABASE. Start with zero entries until you get a query then go find that via some web API and add that to your database. It's highly likely that in a short amount of use you'll get repeats and not have to go find new data.

commented: if I search for ip_hex: '0x234504252ca1000000000567567323b5' among 102 million documents ... I don't think that is a good idea ... it will end up as a full scan +0

What isn't a good idea?

I propose your database starts with ZERO entries until there is a query. Obviously you will have to go get a location from some other service for that query but now you can put that into your database. After a few hundred queries your database will start to have IPs that your clients use over and over. And it only gets better from there.

What isn't a good idea: going out and getting a copy of an IP location database and hosting it yourself, incurring a lot of storage fees.

What is a good idea: don't store millions of IP locations up front; essentially cache the ones you do get a query for. That's a fraction of the storage, and it grows dynamically only as needed.
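
A rough sketch of that cache-on-miss flow with the Node.js driver (lookupExternalGeo is a hypothetical placeholder for whatever external geo-IP service you use, and the collection name is assumed):

// db is a connected Db handle from the official mongodb Node.js driver.
async function geoForIp(db, ipHex) {
    const ips = db.collection('ips');

    // 1. Try the local cache first; with an index on ip_hex this is an
    //    equality lookup, not a scan over 100+ million documents.
    const cached = await ips.findOne({ ip_hex: ipHex });
    if (cached) return cached;

    // 2. Miss: ask the external service, then store the answer for next time.
    const geo = await lookupExternalGeo(ipHex);   // placeholder, not a real API
    const doc = { ip_hex: ipHex, data: [geo] };
    await ips.insertOne(doc);
    return doc;
}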
