Python program to extract IP Addresses from a log file

Question

RAZ_2 0 Newbie Poster

7 Years Ago

I have a Python script that extracts unique IP addresses from a snort log. How can I modify it, or use a regex, to extract IPs only if they are logged more than 10 times per second? More specifically: using a regex, if the second (i.e. 41 in this scenario) doesn't change for more than 10 lines that have the same IP address, then extract that IP.

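import re

# Read the existing blacklist and the snort alert log, one entry per line
with open("/home/asad/blackdb/blacklist", 'r') as f:
    blacklist = f.read().split('\n')
with open('/home/asad/logdb/snort.alert', 'r') as f:
    logfile = f.read().split('\n')
newip = []
for entry in logfile:
    # Collect every dotted-quad IP address found on the line
    ips = re.findall(r'[0-9]+(?:\.[0-9]+){3}', entry)
    for ip in ips:
        newip.append(ip)
newblist = blacklist + newip
# Write the de-duplicated list back; the with block closes the file
with open("/home/asad/blackdb/blacklist", 'w+') as f:
    f.write('\n'.join(set(newblist)) + '\n\n')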

log example text format:

`12/30-04:09:41.070967 [**] [1:10000001:1] snort alert [1:0000001] [**] [classification ID: 0] [Priority ID: 0] {ICMP} 192.168.232.2:41676 -> 192.168.248.2:21`
`12/30-04:09:41.070967 [**] [1:10000001:1] snort alert [1:0000001] [**] [classification ID: 0] [Priority ID: 0] {ICMP} 192.168.232.2:41673 -> 192.168.248.2:21`

In the log above, both lines have the second 41, and the IPs are 192.168.232.2 and 192.168.248.2. If there are more than 10 records in the same second (i.e. 41), then the script should extract the IP.
Any help, please?

python

Gribouillis 1,391 Programming Explorer

7 Years Ago

You could use itertools.groupby and collections.Counter. Something along the lines of

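import re
from collections import Counter
from itertools import groupby

newip = []
c = Counter()
# Group consecutive entries by the timestamp text before the first dot (one-second resolution)
for key, group in groupby(logfile, key=lambda e: e.split('.',1)[0]):
    for entry in group:
        c.update(re.findall(r'[0-9]+(?:\.[0-9]+){3}', entry))
    # Keep IPs seen more than 10 times within this second, then reset for the next group
    newip.extend(ip for ip, cnt in c.items() if cnt > 10)
    c.clear()
newblist = blacklist + newip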

The groupby() groups consecutive entries that happen in the same second, then the number of occurrences of each ip in this group is calculated.
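
To see the grouping and counting in isolation, here is a minimal self-contained sketch of the same idea. The function name frequent_ips, the threshold parameter and the sample lines are made up for illustration; only the regex and the split on the first dot come from the code above.

```python
import re
from collections import Counter
from itertools import groupby

IP_RE = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def second_of(line):
    # "12/30-04:09:41.070967 ..." -> "12/30-04:09:41"
    return line.split('.', 1)[0]

def frequent_ips(lines, threshold=10):
    """Return IPs seen more than `threshold` times within one second.

    Only consecutive lines are grouped, exactly as groupby() does.
    """
    found = set()
    for _, group in groupby(lines, key=second_of):
        counts = Counter()
        for line in group:
            counts.update(IP_RE.findall(line))
        found.update(ip for ip, n in counts.items() if n > threshold)
    return found

# Made-up sample: 11 alerts in second 41, then 2 alerts in second 42.
template = "12/30-04:09:{sec}.0{i:05d} [**] snort alert {{ICMP}} {src}:41676 -> 192.168.248.2:21"
sample = [template.format(sec=41, i=i, src="192.168.232.2") for i in range(11)]
sample += [template.format(sec=42, i=i, src="10.0.0.7") for i in range(2)]

print(sorted(frequent_ips(sample)))   # ['192.168.232.2', '192.168.248.2']
```

Both the source and the destination address are reported here, because the regex picks up every IP on the line. 10.0.0.7 is left out since it only appears twice in its second.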

Edited 7 Years Ago by Gribouillis

Gribouillis 1,391 Programming Explorer

7 Years Ago

Well, groupby() works by computing a key for each item and by
grouping together items having the same key. I could as well have written

def get_key_of_line(line):
    return line.split('.', 1)[0]

for key, group in groupby(logfile, key=get_key_of_line):
    ...

Here we suppose that every line starts with a pattern such as
12/30-04:09:41.foo-bar-baz. The grouping key will be 12/30-04:09:41.
It can be obtained by splitting the line on the first dot. We could
also have obtained it with a regex

def get_key_of_line(line):
    return re.match(r'^[^.]*', line).group(0)

Note that if some lines don't start with a timestamp, these functions
will return the whole line if there is no dot, or the part of the line
before the first dot as the grouping key. You may need to adapt the
function to ensure that identical keys are only returned for consecutive
lines that need to be grouped.
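
One possible way to adapt the key function is sketched below. It assumes the timestamp always has the layout MM/DD-HH:MM:SS followed by a dot and digits, which is a guess based on the two sample lines in the question. Lines that don't match get the key None and are dropped before grouping; the sample lines are invented.

```python
import re
from itertools import groupby

# Assumed timestamp layout: MM/DD-HH:MM:SS followed by a dot and microseconds.
TIMESTAMP_RE = re.compile(r'^(\d{2}/\d{2}-\d{2}:\d{2}:\d{2})\.\d+')

def get_key_of_line(line):
    """Return the second-resolution timestamp, or None if the line has none."""
    match = TIMESTAMP_RE.match(line)
    return match.group(1) if match else None

lines = [
    "12/30-04:09:41.070967 [**] alert {ICMP} 192.168.232.2:41676 -> 192.168.248.2:21",
    "12/30-04:09:41.070999 [**] alert {ICMP} 192.168.232.2:41673 -> 192.168.248.2:21",
    "",                                        # blank line, e.g. from a trailing newline
    "some line without a timestamp. 1.2.3.4",
    "12/30-04:09:42.000001 [**] alert {ICMP} 192.168.232.2:41670 -> 192.168.248.2:21",
]

# Drop lines with no timestamp before grouping, so they can neither form
# their own groups nor split a run of lines from the same second.
stamped = [line for line in lines if get_key_of_line(line) is not None]
groups = [(key, len(list(group))) for key, group in groupby(stamped, key=get_key_of_line)]
print(groups)   # [('12/30-04:09:41', 2), ('12/30-04:09:42', 1)]
```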

Edited 7 Years Ago by Gribouillis

Gribouillis 1,391 Programming Explorer

7 Years Ago

Thank you for your support. I'm a Python programming enthusiast and I like to share this knowledge!

You could get the IP with a regex, for example

match = re.search(r'->\s*([0-9]+(?:\.[0-9]+){3})', entry)
if match:
    ip = match.group(1)

tdsan commented: I think this is great. I made one slight modification to the code: in "(r'->\s" I removed the -> and it seemed to work great on this log file.
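
As a sketch of how the same regex idea could return both addresses and their ports in a single pass, with no second split(): the pattern below assumes the line contains "src_ip:src_port -> dst_ip:dst_port", as in the sample log lines. The group names src, sport, dst and dport are invented for this example.

```python
import re

# Assumed layout of the tail of a snort alert line: "src_ip:src_port -> dst_ip:dst_port".
FLOW_RE = re.compile(
    r'(?P<src>[0-9]+(?:\.[0-9]+){3})(?::(?P<sport>[0-9]+))?'
    r'\s*->\s*'
    r'(?P<dst>[0-9]+(?:\.[0-9]+){3})(?::(?P<dport>[0-9]+))?'
)

entry = ("12/30-04:09:41.070967 [**] [1:10000001:1] snort alert [1:0000001] [**] "
         "[classification ID: 0] [Priority ID: 0] {ICMP} 192.168.232.2:41676 -> 192.168.248.2:21")

match = FLOW_RE.search(entry)
if match:
    print(match.group('src'), match.group('sport'))   # 192.168.232.2 41676
    print(match.group('dst'), match.group('dport'))   # 192.168.248.2 21

# Without the "->" anchor, re.search() returns the first IP on the line,
# which in this log format is the source address, not the destination.
first_ip = re.search(r'\s*([0-9]+(?:\.[0-9]+){3})', entry).group(1)
print(first_ip)                                       # 192.168.232.2
```

So dropping the -> from the pattern still finds an IP, but it is no longer guaranteed to be the one after the arrow.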

RAZ_2 0 Newbie Poster · Answer 1 · 2017-01-02T08:26:01+00:00

Thank you in advance @Gribouillis. It did exactly what I want. Please explain this part of the code: key=lambda e: e.split('.',1)[0]

RAZ_2 0 Newbie Poster · Answer 2 · 2017-01-02T16:50:17+00:00

@Gribouillis Perfect. Now I have two questions:
1- If the occurrences are not in sequence but occur in the same second, can this program still group and handle them?
2- When I modify the code as below, it returns an error. As I am new, these simple errors are difficult for me to debug. (I want: if cnt > 10 then extract the IP, elif cnt > 100 then the last string (i.e. 21 here) should be extracted and stored in another file.)
I have used if and elif:

#   newip.extend(ip for ip, cnt in c.items() if cnt > 10)
    for ip, cnt in c.items():
         if cnt > 10:
                newip.append(ip)
    elif cnt>100:
        a = logfile.rsplit(':', 1)[-1]
        print (a)

error:
a = logfile.read().rsplit(':', 1)[-1]
AttributeError: 'list' object has no attribute 'rsplit'

What should I do now?

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2017-01-02T20:35:49+00:00

1- If you want to group when the sequence is unordered, you must sort the input before using groupby().
2- The elif needs the same indentation as the corresponding if. Also note that the cnt results come after all the entries in the group have been exhausted. If you need more information from these entries, you need to store them temporarily. If I understand your question well, you want only the last part of the IP. You could use this code:

import re
from collections import Counter
from itertools import groupby

def entry_to_second(entry):
    return entry.split('.', 1)[0]
newip = []
c = Counter()
logfile = sorted(logfile, key=entry_to_second) # <--- sort the logfile by the second
for key, group in groupby(logfile, key=entry_to_second):
    for entry in group:
        c.update(re.findall(r'[0-9]+(?:\.[0-9]+){3}', entry))
    for ip, cnt in c.items():
        if cnt > 10:
            newip.append(ip)
        elif cnt > 100:             # <-- align elif with if. Indentation is critical in python
            # Note: as written this branch never runs, because any cnt > 100 already matched cnt > 10 above
            a = ip.rsplit(':', 1)[-1]   # <-- last part of ip (the :21 IF there is a :)
            print(a)
    c.clear()
newblist = blacklist + newip
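
If sorting the whole log is not wanted, another option (not the groupby() approach above, just an alternative sketch) is to count (second, ip) pairs directly, so the order of the lines no longer matters. The name frequent_ips_unordered and the sample lines are made up for illustration.

```python
import re
from collections import Counter

IP_RE = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def frequent_ips_unordered(lines, threshold=10):
    """Count (second, ip) pairs, so line order does not matter."""
    counts = Counter()
    for line in lines:
        second = line.split('.', 1)[0]
        for ip in IP_RE.findall(line):
            counts[(second, ip)] += 1
    return {ip for (second, ip), n in counts.items() if n > threshold}

line = "12/30-04:09:{sec}.000000 [**] alert {{ICMP}} {src}:41676 -> 192.168.248.2:21"
busy = [line.format(sec=41, src="192.168.232.2") for _ in range(11)]
quiet = [line.format(sec=42, src="10.0.0.7") for _ in range(3)]

# Interleave the two seconds so that same-second lines are not consecutive.
mixed = busy[:4] + quiet + busy[4:]
print(sorted(frequent_ips_unordered(mixed)))   # ['192.168.232.2', '192.168.248.2']
```

The trade-off is memory: the Counter holds one entry per distinct (second, ip) pair for the whole file, whereas the groupby() version only keeps the counts of the current second.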

RAZ_2 0 Newbie Poster · Answer 4 · 2017-01-03T05:42:20+00:00

@Gribouillis I only want this IP address: 192.168.248.2

As there are two IP addresses in the same line, I split it by '-> ', but it returns 192.168.248.2:21 and I want only the IP part, not the 21.
I used split() twice. It works, but splitting twice does not seem efficient.

I hope there are no more questions, and I really appreciate such unique people who try to help others. Wish you all the best, Gribouillis.

RAZ_2 0 Newbie Poster · Answer 5 · 2017-01-03T06:42:54+00:00

@Gribouillis Really done, and it is perfect! I don't know how to thank you, Gribouillis. You have solved my problem and I am really proud.
Thank you in advance.

I have tracked down the error, but it says "c = Counter()" is not defined. Is there a module I need to import?

tdsan 0 Guru · Answer 6 · 2019-07-10T17:30:14+00:00

RAZ_2

Is this the correct code? I want to run it to extract information from my /var/log/audit/audit.log file.

blacklist = list(open('/var/log/audit/black_ip.log', 'r').read().split('\n'))
logfile = list(open('/var/log/audit/audit.log', 'r').read().split('\n'))
newip = []

def entry_to_second(entry):
    return entry.split('.', 1)[0]
newip = []
c = Counter()
logfile = sorted(logfile, key=entry_to_second) # <--- sort the logfile by the second
for key, group in groupby(logfile, key=entry_to_second):
    for entry in group:
    c.update(re.findall(r'[0-9]+(?:\.[0-9]+){3}', entry))
for ip, cnt in c.items():
    if cnt > 10:
        newip.append(item)
    elif cnt > 100:             # <-- align elif with if. Indentation is critical in python
        a = ip.rsplit(':', 1)[-1]   # <-- last part of ip (the :21 IF there is a :)
        print(a)
c.clear()
newblist = blacklist + newip

Please advise.

Todd
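
For reference, a sketch of how the pieces in this thread might fit together with consistent indentation and the imports that define Counter and groupby. It assumes every log line starts with a snort-style timestamp such as 12/30-04:09:41.070967; an audit.log may lay out its timestamps differently, in which case entry_to_second would need to be adapted. The helper names read_lines and update_blacklist are invented here, and the demo runs on throwaway files with made-up data rather than the real paths.

```python
import os
import re
import tempfile
from collections import Counter   # defines Counter()
from itertools import groupby     # defines groupby()

IP_RE = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def entry_to_second(entry):
    return entry.split('.', 1)[0]

def read_lines(path):
    """Return the non-blank lines of a text file."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def update_blacklist(log_path, blacklist_path, threshold=10):
    """Add IPs logged more than `threshold` times in one second to the blacklist file."""
    blacklist = set(read_lines(blacklist_path))
    entries = sorted(read_lines(log_path), key=entry_to_second)
    for _, group in groupby(entries, key=entry_to_second):
        counts = Counter()                      # fresh counts for each second
        for entry in group:
            counts.update(IP_RE.findall(entry))
        blacklist.update(ip for ip, cnt in counts.items() if cnt > threshold)
    with open(blacklist_path, 'w') as f:
        f.write('\n'.join(sorted(blacklist)) + '\n')
    return blacklist

# Demo on temporary files instead of the real log and blacklist.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, 'sample.log')
    blacklist_path = os.path.join(tmp, 'blacklist')
    with open(log_path, 'w') as f:
        for i in range(12):
            f.write(f"12/30-04:09:41.{i:06d} [**] alert {{ICMP}} "
                    f"192.168.232.2:{41000 + i} -> 192.168.248.2:21\n")
    with open(blacklist_path, 'w') as f:
        f.write("203.0.113.9\n")
    result = update_blacklist(log_path, blacklist_path)
    print(sorted(result))
```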