Hello,
I hope someone can take the time to help with this.
Here is exactly what I need to do:
- Build the corpus.
- Prepare the project structure.
- Write a Perl script that:
1. Browses the corpus.
2. Cleans the files and makes the substitutions that SYGMART needs.
3. Calls SYGMART and saves the result.
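To make the three script steps concrete, here is a minimal sketch of how I imagine the script, not a finished solution. The sub names (`list_txt_files`, `clean_text`, `run_and_save`, `process_theme`) are my own, the cleaning step is only a whitespace placeholder, and the sub-directory names `Art`, `clean` and `tag` are assumed from my project layout. I do not know the exact SYGMART command line, so the command is passed in by the caller instead of being hard-coded.

```perl
use strict;
use warnings;
use File::Find;
use File::Basename qw(basename);
use File::Path qw(make_path);
use File::Spec;

# Step 1: collect every .txt file found below a directory.
sub list_txt_files {
    my ($dir) = @_;
    my @files;
    find(sub { push @files, $File::Find::name if -f $_ && /\.txt\z/i }, $dir);
    return sort @files;
}

# Step 2: placeholder cleaning (assumption): collapse and trim whitespace.
# The real substitutions required by SYGMART would go here.
sub clean_text {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Step 3: run an external command on a file and save its standard output.
# @cmd is supplied by the caller; the input file is appended as last argument.
sub run_and_save {
    my ($in, $out, @cmd) = @_;
    open(my $pipe, '-|', @cmd, $in) or die "Cannot run @cmd: $!";
    my $result = do { local $/; <$pipe> };
    close($pipe) or die "Command failed: @cmd";
    open(my $fh, '>', $out) or die "Cannot write $out: $!";
    print {$fh} $result;
    close($fh);
}

# Process one theme directory: Art -> clean -> tag.
sub process_theme {
    my ($theme_dir, @cmd) = @_;
    my $art   = File::Spec->catdir($theme_dir, 'Art');
    my $clean = File::Spec->catdir($theme_dir, 'clean');
    my $tag   = File::Spec->catdir($theme_dir, 'tag');
    make_path($clean, $tag);

    for my $file (list_txt_files($art)) {
        my $name = basename($file);

        open(my $in, '<', $file) or die "Cannot read $file: $!";
        my $text = do { local $/; <$in> };
        close($in);

        my $cleaned = File::Spec->catfile($clean, $name);
        open(my $out, '>', $cleaned) or die "Cannot write $cleaned: $!";
        print {$out} clean_text($text), "\n";
        close($out);

        run_and_save($cleaned, File::Spec->catfile($tag, $name), @cmd);
    }
}
```

It would be called as `process_theme($theme_dir, @command)`, where `@command` is whatever invocation the local SYGMART installation expects. Passing the command as a list avoids going through the shell, so file names with spaces should not be a problem.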
The purpose of this project is to implement and evaluate a document classification method programmed in Perl.
**First step: building the corpus**
First, a corpus has to be built. We propose to build a corpus covering five distinct themes (for example: politics, cooking, etc.). This corpus will be normalized (removal of HTML tags, etc.). To do this, you will find ten texts, written in French or English, for each of these five themes.
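For the normalization, this is roughly what I have in mind for removing the HTML tags. It is only a regex-based sketch: `strip_html` is a name I made up, the entity table covers just a handful of common entities, and regexes can fail on malformed pages, so a real parser such as the CPAN module HTML::Parser would probably be more robust.

```perl
use strict;
use warnings;

# Remove HTML markup from a string and return plain text (rough sketch).
sub strip_html {
    my ($html) = @_;
    my $text = $html;

    $text =~ s/<!--.*?-->//gs;                      # drop comments
    $text =~ s/<(script|style)\b.*?<\/\1\s*>//gis;  # drop script/style blocks with their content
    $text =~ s/<[^>]*>/ /g;                         # replace remaining tags with a space

    # Decode a few common entities in a single pass; unknown ones are left as-is.
    my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', apos => "'", nbsp => ' ');
    $text =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;

    $text =~ s/\s+/ /g;        # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;   # trim both ends
    return $text;
}
```

For example, `strip_html('<p>Bonjour &amp; <b>bienvenue</b></p>')` should give `Bonjour & bienvenue`.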
**Second step: implementation of a classification algorithm**
The next task is to implement a classification algorithm. Many learning approaches can be used for text classification:
o K nearest neighbors
o Decision trees
o Naïve Bayes
o Neural networks
o Support vector machines
In this project, we propose to use the well-known K nearest neighbors (KNN) method, covered in class.
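As far as I understand KNN, a new text gets the theme that is most common among its k most similar training texts. Here is a minimal sketch under my own assumptions: texts are turned into raw term-frequency vectors, similarity is the cosine between vectors (the course may use another distance), and the names `tf_vector`, `cosine` and `knn_classify` are mine.

```perl
use strict;
use warnings;

# Build a term-frequency vector (hash ref: word => count) from a text.
sub tf_vector {
    my ($text) = @_;
    my %tf;
    $tf{lc $1}++ while $text =~ /(\w+)/g;
    return \%tf;
}

# Cosine similarity between two term-frequency vectors.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    for my $term (keys %$u) {
        $nu  += $u->{$term} ** 2;
        $dot += $u->{$term} * $v->{$term} if exists $v->{$term};
    }
    $nv += $_ ** 2 for values %$v;
    return 0 if $nu == 0 || $nv == 0;
    return $dot / sqrt($nu * $nv);
}

# Classify a text by majority vote among its k most similar training texts.
# $training is an array ref of [label, text] pairs.
sub knn_classify {
    my ($text, $training, $k) = @_;
    my $query = tf_vector($text);

    # Score every training text, most similar first.
    my @scored;
    for my $doc (@$training) {
        push @scored, [ cosine($query, tf_vector($doc->[1])), $doc->[0] ];
    }
    my @ranked = sort { $b->[0] <=> $a->[0] } @scored;

    $k = @ranked if $k > @ranked;
    my %votes;
    $votes{ $_->[1] }++ for @ranked[0 .. $k - 1];

    # Highest vote count wins; ties are broken alphabetically by label.
    my ($best) = sort { $votes{$b} <=> $votes{$a} || $a cmp $b } keys %votes;
    return $best;
}
```

With ten texts per theme, an odd k such as 3 or 5 seems a reasonable starting point, but that is a guess to be checked in the evaluation. For French texts, `\w` only matches accented letters if the files are read as decoded text, for example with `open(my $fh, '<:encoding(UTF-8)', $file)`.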
**Third step: taking linguistic information into account**
The goal here is to use your texts with different levels of information:
o Raw texts.
o Lemmatised texts.
o Lemmatised texts with syntactic parsing.
**The project structure** as I see it is this:
ROOT
|____REP Article
|    |____REP Donquichote
|    |    |____REP Art
|    |    |    |____Txt files
|    |    |____REP clean
|    |    |    |____Txt files cleaned
|    |    |____REP tag
|    |    |    |____Tagged files in .txt format
|    |    |____REP vect
|    |         |____Txt files
|    |
|    |____REP ParisElection
|    |    |____REP Art
|    |    |    |____Txt files
|    |    |____REP clean
|    |    |    |____Txt files cleaned
|    |    |____REP tag
|    |    |    |____Tagged files in .txt format
|    |    |____REP vect
|    |         |____Txt files
|    |
|    |____REP SarkozyCarla
|    |    |____REP Art
|    |    |    |____Txt files
|    |    |____REP clean
|    |    |    |____Txt files cleaned
|    |    |____REP tag
|    |    |    |____Tagged files in .txt format
|    |    |____REP vect
|    |         |____Txt files
|    |
|    |____REP SkiGrange
|    |    |____REP Art
|    |    |    |____Txt files
|    |    |____REP clean
|    |    |    |____Txt files cleaned
|    |    |____REP tag
|    |    |    |____Tagged files in .txt format
|    |    |____REP vect
|    |         |____Txt files
|    |
|    |____REP Tf1DaylimotionYoutube
|         |____REP Art
|         |    |____Txt files
|         |____REP clean
|         |    |____Txt files cleaned
|         |____REP tag
|         |    |____Tagged files in .txt format
|         |____REP vect
|              |____Txt files
|
|____REP Binary
|    |____Executable files
|
|____REP Data
     |____...
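To prepare this structure automatically, something like the following could work. It is a sketch under assumptions: I take "REP" to be only a label for "directory" and not part of the names, I assume `Binary` and `Data` sit directly under the root, and `create_layout` is a name I chose.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Create the Article/<theme>/{Art,clean,tag,vect} directories plus Binary and Data.
# Returns the list of directory paths that make up the layout.
sub create_layout {
    my ($root, @themes) = @_;
    my @dirs;
    for my $theme (@themes) {
        for my $sub (qw(Art clean tag vect)) {
            push @dirs, File::Spec->catdir($root, 'Article', $theme, $sub);
        }
    }
    for my $top (qw(Binary Data)) {
        push @dirs, File::Spec->catdir($root, $top);
    }
    make_path(@dirs);   # creates missing parents, ignores existing directories
    return @dirs;
}
```

For my corpus the call would be `create_layout('ROOT', qw(Donquichote ParisElection SarkozyCarla SkiGrange Tf1DaylimotionYoutube))`.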
chahinez.abdelo.9 0 Newbie Poster