Use fasttext to classify SMS content

1. Sample description:

  • Total1405506 records, of which 486996 were overdue and 486996 were non-overdue
  • Contains two fields tag (identification is overdue), message (sms content)
  • Actual training sample (non_overdue: 641065, overdue: 340783)
  • Actual test sample (non_overdue: 274660, overdue: 146132)
  • Goal: Based on the content of the SMS, predict whether the category is overdue

2. Data preprocessing:

  • Collect stop vocabulary, here (Https://github.com/goto456/stopwords) contains the Harbin Institute of Technology stop words, Chinese stop words, Sichuan University smart lab stop words, Baidu stop words; but it does not seem to be very full, huh!
  • Generate training collections and test collections: I used pythonrandom.shuffleThe function reorders the data, then selects the first 70% as the training set and the last 30% as the test set;

Third, the specific code implementation:

  • Import related packages
import re
import jieba
import os
import fasttext
import random
  • Generate stop word list
# Generate stop word list
def create_stoplist(dir_path):
         # dir_path is the directory path for the stop word file store
    stoplist=[]
    for ele in os.listdir(dir_path):
        file=dir_path+ele
        with open(file,'rb') as f:
            stoplist.extend(f.readlines())
    return list(set(stoplist))
  • Read sample data and shuffle
file='/Users/hqh/Desktop/savedata/message_2cls_subsample'
data=open(file,'r').readlines()
random.shuffle(data)
  • The input form required for data processing and conversion to fasttext
tranform_data=[] 
message_dict={}
for ele in data:
    message=ele.split('^')[0].strip("\n")
    tag=ele.split('^')[1].strip("\n")
    If len(message)>0: # There may be data with empty message, so add this restriction
                 Message=re.sub('\[.*\]','',message).strip('\n') #Remove platform name, space
                 Message_wc=list(jieba.cut(message)) # jieba participle
                 Message_wc=" ".join(list(set(message_wc)-set(stoplist)))) #)
                 Label='__label__'+tag # Finish the label format of fasttext, followed by __label__
        line=label+' '+message_wc+'\n'
                 Message_dict[message_wc]=ele+'^'+tag # Here save the cut word, stop the word processed message and the original message relationship
        tranform_data.append(line)
  • Test, training set division
n=len(tranform_data)
train_data=tranform_data[1:int(n*0.7)]
test_data=tranform_data[int(n*0.7):]

with open('/Users/hqh/Desktop/train','w') as f:
    for ele in train_data:
        f.write(ele)

with open('/Users/hqh/Desktop/test','w') as f:
    for ele in test_data:
        f.write(ele)
  • Modeling
classifier = fasttext.supervised('/Users/hqh/Desktop/train', 'model',label_prefix='__label__')
result=classifier.test("/Users/hqh/Desktop/test")
print(result.precision)
print(result.recall)
print(result.nexamples)
  • Manually find the case of identifying the error

 Errors=[] # Case for logging errors

with open('/Users/hqh/Desktop/errors','w') as f:
    for ele in test_data:
        info=ele.strip("\n").split(" ")
        try:
            key=" ".join(info[1:])
                         Message=[key] # Starting from 1 because the 0th is label
                         Label=classifier.predict(message)[0][0] # Forecasted tags
                         Tag=ele.strip("\n").split(" ")[0].split("__label__")[1] # Actual tag
            if label != tag:
                f.write(message_dict[key]+"\n")
                errors.append(message_dict[key])
        except IndexError as e:
            print('error')

print(len(errors)/len(test_data))
  • Specific results

Accuracy and recall rate are 0.9997219548711368
 

Intelligent Recommendation

Use fastText for text classification

fastText text classification study notes Download first, then make, get the executable c file Text classification, linux command line: ./fasttext supervised -input train.txt -output model The input fo...

Preliminary use of fasttext

forward from: http://blog.csdn.net/lxg0807/article/details/52960072#comments The training data and test data come from the network disk: https://pan.baidu.com/s/1jH7wyOY https://pan.baidu.com/s/1slGlP...

Use fastText training

Use fastText training classification model...

FastText

FastText Label (space separated): natural language processing FastText FastText paper link Review FastText is not a special kind of institution, but an idea that is to get results faster. FastText(pyt...

[ ] fastText

mark~ from : https://www.jiqizhixin.com/articles/2018-06-05-3 The origin of fastText fastText is a text categorization and vectorization tool launched by FAIR (Facebook AIResearch) in 2016. Its o...

More Recommendation

[Python] Crawler articles: Use TF-IDF algorithm to classify articles through article content (5)

I have a rookie, more than 800 years ago (of course this is an exaggeration), when I wrote the first crawler blog, I dug a hole for myself! In the first article of my crawler article (link is below, t...

Use of FastText Chinese word vector

faxttext in Chinese Word vector download address Call method Official documents...

Use FastText for natural language processing

Introduction FastText is Facebook's open source text classification framework for learning word vectors. It can be quickly trained on the CPU and has very powerful performance. Users only need to ente...

Use broadcast to listen to text messages and get SMS content

Directly on the code in the onRecEive () method, the configuration of the Action in the inventory file is no longer given...

Use TensorFlow to classify flowers

The first step: prepare the required library tensorflow-gpu  1.8.0 opencv-python     3.3.1 numpy skimage os pillow Step 2: Prepare the data set: Link: Password: iym3 This time I us...

Copyright  DMCA © 2018-2026 - All Rights Reserved - www.programmersought.com  User Notice

Top