Use fasttext to classify SMS content

1. Sample description:

Total1405506 records, of which 486996 were overdue and 486996 were non-overdue
Contains two fields tag (identification is overdue), message (sms content)
Actual training sample (non_overdue: 641065, overdue: 340783)
Actual test sample (non_overdue: 274660, overdue: 146132)
Goal: Based on the content of the SMS, predict whether the category is overdue

2. Data preprocessing:

Collect stop vocabulary, here (Https://github.com/goto456/stopwords) contains the Harbin Institute of Technology stop words, Chinese stop words, Sichuan University smart lab stop words, Baidu stop words; but it does not seem to be very full, huh!
Generate training collections and test collections: I used pythonrandom.shuffleThe function reorders the data, then selects the first 70% as the training set and the last 30% as the test set;

Third, the specific code implementation:

Import related packages

import re
import jieba
import os
import fasttext
import random

Generate stop word list

# Generate stop word list
def create_stoplist(dir_path):
         # dir_path is the directory path for the stop word file store
    stoplist=[]
    for ele in os.listdir(dir_path):
        file=dir_path+ele
        with open(file,'rb') as f:
            stoplist.extend(f.readlines())
    return list(set(stoplist))

Read sample data and shuffle

file='/Users/hqh/Desktop/savedata/message_2cls_subsample'
data=open(file,'r').readlines()
random.shuffle(data)

The input form required for data processing and conversion to fasttext

tranform_data=[] 
message_dict={}
for ele in data:
    message=ele.split('^')[0].strip("\n")
    tag=ele.split('^')[1].strip("\n")
    If len(message)>0: # There may be data with empty message, so add this restriction
                 Message=re.sub('\[.*\]','',message).strip('\n') #Remove platform name, space
                 Message_wc=list(jieba.cut(message)) # jieba participle
                 Message_wc=" ".join(list(set(message_wc)-set(stoplist)))) #)
                 Label='__label__'+tag # Finish the label format of fasttext, followed by __label__
        line=label+' '+message_wc+'\n'
                 Message_dict[message_wc]=ele+'^'+tag # Here save the cut word, stop the word processed message and the original message relationship
        tranform_data.append(line)

Test, training set division

n=len(tranform_data)
train_data=tranform_data[1:int(n*0.7)]
test_data=tranform_data[int(n*0.7):]

with open('/Users/hqh/Desktop/train','w') as f:
    for ele in train_data:
        f.write(ele)

with open('/Users/hqh/Desktop/test','w') as f:
    for ele in test_data:
        f.write(ele)

Modeling

classifier = fasttext.supervised('/Users/hqh/Desktop/train', 'model',label_prefix='__label__')
result=classifier.test("/Users/hqh/Desktop/test")
print(result.precision)
print(result.recall)
print(result.nexamples)

Manually find the case of identifying the error


 Errors=[] # Case for logging errors

with open('/Users/hqh/Desktop/errors','w') as f:
    for ele in test_data:
        info=ele.strip("\n").split(" ")
        try:
            key=" ".join(info[1:])
                         Message=[key] # Starting from 1 because the 0th is label
                         Label=classifier.predict(message)[0][0] # Forecasted tags
                         Tag=ele.strip("\n").split(" ")[0].split("__label__")[1] # Actual tag
            if label != tag:
                f.write(message_dict[key]+"\n")
                errors.append(message_dict[key])
        except IndexError as e:
            print('error')

print(len(errors)/len(test_data))

Specific results

Accuracy and recall rate are 0.9997219548711368

Intelligent Recommendation

Use fastText for text classification

fastText text classification study notes Download first, then make, get the executable c file Text classification, linux command line: ./fasttext supervised -input train.txt -output model The input fo...

Preliminary use of fasttext

forward from: http://blog.csdn.net/lxg0807/article/details/52960072#comments The training data and test data come from the network disk: https://pan.baidu.com/s/1jH7wyOY https://pan.baidu.com/s/1slGlP...

Use fastText training

Use fastText training classification model...

FastText

FastText Label (space separated): natural language processing FastText FastText paper link Review FastText is not a special kind of institution, but an idea that is to get results faster. FastText(pyt...

[ ] fastText

mark～ from : https://www.jiqizhixin.com/articles/2018-06-05-3 The origin of fastText fastText is a text categorization and vectorization tool launched by FAIR (Facebook AIResearch) in 2016. Its o...