1. Sample description:
2. Data preprocessing:
Third, the specific code implementation:
import re
import jieba
import os
import fasttext
import random
# Generate stop word list
def create_stoplist(dir_path):
# dir_path is the directory path for the stop word file store
stoplist=[]
for ele in os.listdir(dir_path):
file=dir_path+ele
with open(file,'rb') as f:
stoplist.extend(f.readlines())
return list(set(stoplist))
file='/Users/hqh/Desktop/savedata/message_2cls_subsample'
data=open(file,'r').readlines()
random.shuffle(data)
tranform_data=[]
message_dict={}
for ele in data:
message=ele.split('^')[0].strip("\n")
tag=ele.split('^')[1].strip("\n")
If len(message)>0: # There may be data with empty message, so add this restriction
Message=re.sub('\[.*\]','',message).strip('\n') #Remove platform name, space
Message_wc=list(jieba.cut(message)) # jieba participle
Message_wc=" ".join(list(set(message_wc)-set(stoplist)))) #)
Label='__label__'+tag # Finish the label format of fasttext, followed by __label__
line=label+' '+message_wc+'\n'
Message_dict[message_wc]=ele+'^'+tag # Here save the cut word, stop the word processed message and the original message relationship
tranform_data.append(line)
n=len(tranform_data)
train_data=tranform_data[1:int(n*0.7)]
test_data=tranform_data[int(n*0.7):]
with open('/Users/hqh/Desktop/train','w') as f:
for ele in train_data:
f.write(ele)
with open('/Users/hqh/Desktop/test','w') as f:
for ele in test_data:
f.write(ele)
classifier = fasttext.supervised('/Users/hqh/Desktop/train', 'model',label_prefix='__label__')
result=classifier.test("/Users/hqh/Desktop/test")
print(result.precision)
print(result.recall)
print(result.nexamples)
Errors=[] # Case for logging errors
with open('/Users/hqh/Desktop/errors','w') as f:
for ele in test_data:
info=ele.strip("\n").split(" ")
try:
key=" ".join(info[1:])
Message=[key] # Starting from 1 because the 0th is label
Label=classifier.predict(message)[0][0] # Forecasted tags
Tag=ele.strip("\n").split(" ")[0].split("__label__")[1] # Actual tag
if label != tag:
f.write(message_dict[key]+"\n")
errors.append(message_dict[key])
except IndexError as e:
print('error')
print(len(errors)/len(test_data))
Accuracy and recall rate are 0.9997219548711368
fastText text classification study notes Download first, then make, get the executable c file Text classification, linux command line: ./fasttext supervised -input train.txt -output model The input fo...
forward from: http://blog.csdn.net/lxg0807/article/details/52960072#comments The training data and test data come from the network disk: https://pan.baidu.com/s/1jH7wyOY https://pan.baidu.com/s/1slGlP...
Use fastText training classification model...
FastText Label (space separated): natural language processing FastText FastText paper link Review FastText is not a special kind of institution, but an idea that is to get results faster. FastText(pyt...
mark~ from : https://www.jiqizhixin.com/articles/2018-06-05-3 The origin of fastText fastText is a text categorization and vectorization tool launched by FAIR (Facebook AIResearch) in 2016. Its o...
I have a rookie, more than 800 years ago (of course this is an exaggeration), when I wrote the first crawler blog, I dug a hole for myself! In the first article of my crawler article (link is below, t...
faxttext in Chinese Word vector download address Call method Official documents...
Introduction FastText is Facebook's open source text classification framework for learning word vectors. It can be quickly trained on the CPU and has very powerful performance. Users only need to ente...
Directly on the code in the onRecEive () method, the configuration of the Action in the inventory file is no longer given...
The first step: prepare the required library tensorflow-gpu 1.8.0 opencv-python 3.3.1 numpy skimage os pillow Step 2: Prepare the data set: Link: Password: iym3 This time I us...