Please help me ASAP.
In this assignment you have to write functions (in Python) that apply various preprocessing
techniques that are popular in Natural Language Processing (NLP). For this purpose, you
are supplied with a text dataset (“corona_data\test_small.tsv”), and you have to process
that dataset using the following techniques:
- Tokenization
- Text Lowercase
- Remove HTML tags
- Convert number words to numeric form
- Remove numbers
- Remove punctuation
- Remove extra whitespaces
- Convert accented characters to ASCII characters
- Expand contractions
- Remove special characters
- Remove default stopwords
- Stemming
- Lemmatization
- Part of Speech (POS) Tagging
- Named Entity Recognition
You will find the details, as well as the code for all of the above preprocessing steps, in the
Important Links section under LAB – 7 in the Google Classroom.
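
A few of the listed steps can be sketched with the Python standard library alone. This is only an illustration: the function names are hypothetical and are not the code from the course links (which is not reproduced here), and the regex-based tag stripper and tokenizer are deliberate simplifications. Steps such as stemming, lemmatization, POS tagging, and named entity recognition usually rely on a third-party library (for example NLTK or spaCy) and are left out.

```python
import html
import re
import string
import unicodedata


def text_lowercase(text):
    """Convert every character to lowercase."""
    return text.lower()


def remove_html_tags(text):
    """Replace anything that looks like an HTML tag with a space,
    then decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def remove_numbers(text):
    """Delete runs of digits."""
    return re.sub(r"\d+", "", text)


def remove_punctuation(text):
    """Delete the ASCII punctuation characters listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_extra_whitespaces(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return " ".join(text.split())


def accented_to_ascii(text):
    """Decompose accented characters and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def tokenize(text):
    """Naive tokenizer: words and single punctuation marks become tokens."""
    return re.findall(r"\w+|[^\w\s]", text)
```

Each function takes one string, so it can be applied to every entry of the “Example” field in turn. Note that `tokenize` returns a list rather than a string, which matters if its output is fed into a later step.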
Important:
Please follow these rules while you do the preprocessing:
- Use separate functions (modular programming) for each of the above preprocessing steps.
- Only preprocess the “Example” field of the dataset. Do NOT process the Labels.
- For each of the preprocessing steps do the following two things –
a. Apply it independently on the dataset and write the output as a text file (name
the text file as – <name_of_the_preprocessing>_out.txt)
E.g., tokenization_out.txt … text_lowercase_out.txt
b. Apply it sequentially to the output of the previous preprocessing step(s), and finally
save the complete output (as “preprocessed_out.txt”) when you are done
with the subsequent preprocessing steps (1 to 11).
c. For outputting the text file: you should write your own function that takes
a list of strings (data) and writes them into a .txt file under a directory named
output automatically.
- Write proper comments inside the code. Failing to do so will reduce your grade. Also,
copying others' comments/code might give you a negative mark.
- Make sure you produce the output text files inside a folder named “output”, which
should be inside the same folder as your code. Rename the project folder containing
your code(s) and output files to your StudentID_Section and make it a .zip before you
upload it as the solution to the Assignment in Google Classroom. In case you use
Google Colab, your code should automatically create the output folder in the
corresponding Google Drive following the aforementioned folder structure.
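
The rules above (read only the “Example” field, write each step's independent output, chain the steps, and save everything under an output folder) could be wired together roughly as follows. This is a sketch under stated assumptions: the names `read_examples`, `write_output`, and `run_steps` are hypothetical, the TSV is assumed to have a header row containing a column called `Example`, fields are assumed to be split on tabs with no quoting, and every step is assumed to take a string and return a string.

```python
import csv
import os


def read_examples(tsv_path, field="Example"):
    """Return one column of a tab-separated file that has a header row.
    Assumption: the column is literally named "Example"; labels are ignored."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row[field] for row in reader]


def write_output(lines, file_name, out_dir="output"):
    """Write one item per line to out_dir/file_name, creating out_dir if needed."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, file_name)
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(str(line) + "\n")
    return path


def run_steps(examples, steps, out_dir="output"):
    """steps is a list of (name, function) pairs, applied in the given order."""
    sequential = list(examples)
    for name, func in steps:
        # (a) independent: always start again from the raw examples
        write_output([func(text) for text in examples], name + "_out.txt", out_dir)
        # (b) sequential: feed the previous step's output into this step
        sequential = [func(text) for text in sequential]
    # final result after all chained steps
    write_output(sequential, "preprocessed_out.txt", out_dir)
    return sequential
```

A call such as `run_steps(read_examples(path), [("text_lowercase", str.lower)])` would then produce `output/text_lowercase_out.txt` and `output/preprocessed_out.txt` relative to the current working directory. A step that returns something other than a string, such as a tokenizer returning a list, would need extra handling before it can be chained.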