Humber College Data Mining Programming Worksheet
- Using the 20 newsgroup data, do the following:
  - Do the pre-processing. This step is application dependent, so you want to read till the end of the task description before deciding what pre-processing steps you'll choose to apply.
  - Create plots, using matplotlib, to show the following (for each topic in the data separately, and save the plots to file):
    - Most frequent words, bigrams and trigrams
    - Word cloud plots
    - Histogram of word and sentence length
  - Use both Matrix Factorization (LSA) and the LDA algorithms to do topic modelling. The output is a sequence of 10 words for each topic.
  - Compare your topics between LSA and LDA and prepare yourself for questions about it (and other subjects) during your presentation.
  - Use the labels provided in the dataset to measure the performance of both algorithms based on both accuracy and the F1 score.
  - LSA and LDA are unsupervised algorithms. In this part, try to apply logistic regression to this problem to see if you can predict the topic in a supervised fashion. Note that this problem no longer is a binary classification problem. You have to find a way to convert it to binary classification.
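The last task leaves open how to turn the multi-class problem into binary classification. One common option is a one-vs-rest scheme: build one binary target per topic ("this topic" vs. "everything else"), fit one binary model per target, and predict the topic whose model gives the highest score. The sketch below shows only the label conversion and the final decision step in plain Python. The helper names `one_vs_rest_targets` and `predict_from_scores` are hypothetical, the example labels are just illustrative newsgroup names, and the scores would in practice come from your fitted logistic regression models.

```python
def one_vs_rest_targets(labels):
    """Build one binary target vector per class.

    Returns {class_label: [1 if the document belongs to that class else 0]}.
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def predict_from_scores(scores_by_class):
    """Combine per-class binary scores into a single multi-class prediction.

    scores_by_class maps each class to a list of scores (one per document),
    e.g. the positive-class probability from that class's binary model.
    """
    classes = list(scores_by_class)
    n_docs = len(next(iter(scores_by_class.values())))
    return [
        max(classes, key=lambda c: scores_by_class[c][i])
        for i in range(n_docs)
    ]


# Illustrative labels only; the real ones come from the train file.
labels = ["sci.space", "rec.autos", "sci.space", "talk.politics.misc"]
targets = one_vs_rest_targets(labels)

# Made-up scores standing in for the outputs of two binary models.
example_scores = {"a": [0.9, 0.2], "b": [0.1, 0.8]}
predicted = predict_from_scores(example_scores)
```

Each vector in `targets` can be used as the `y` for a separate binary logistic regression. Other conversions (for example one-vs-one) are equally valid answers to the task.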
NOTE 1: The 20 newsgroup dataset (Kaggle) comes in two parts when you download it: a train file and a test file. All the items in this project should be done on the train dataset. The test dataset should only be used to measure/illustrate the performance of your model. Performance should not be reported on the train dataset.
NOTE 2: You will be required to run your project during the presentation.
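To make the two requested metrics concrete, here is a minimal plain-Python sketch of accuracy and F1 computed from true and predicted labels on the test set. The worksheet does not say how F1 should be averaged across the topics; macro averaging (the unweighted mean of the per-class F1 scores) is an assumption made here, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels standing in for test-set ground truth and model predictions.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In line with Note 1, `y_true` and `y_pred` should come from the test file only. For LSA and LDA, each discovered topic first has to be mapped to a dataset label before these functions apply.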