Natural Language Processing

小游 2024-09-25

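1. BART-based review generation

Find a pretrained model on huggingface.co and train (fine-tune) it.

Training approach: give the model example data so that it learns to make predictions from those examples.

2. Installing the pyahocorasick package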

pip3 install pyahocorasick -i https://pypi.tuna.tsinghua.edu.cn/simple/

3. Installing Neo4j

Matching the Neo4j and JDK versions (very important!)

JDK install path: C:\Program Files\Java

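Each Neo4j release line requires a specific JDK (Java Development Kit) version and will not run properly on the wrong one. In short, JDK 11 works for the whole 4.x line, which includes the 4.4.25 build used here.

Neo4j 4.0: Neo4j 4.0.x requires JDK 11 (the older 3.x line runs on JDK 8).

Neo4j 4.1 to 4.3: Neo4j 4.1.x, 4.2.x and 4.3.x require JDK 11.

Neo4j 4.4: Neo4j 4.4.x requires JDK 11; the later 5.x line moved to JDK 17.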

3.1 Windows download link: download and unzip

https://dist.neo4j.org/neo4j-community-4.4.25-windows.zip

3.2 Configure the environment variables

3.3 Start-up test

Press Win+R and run cmd.

Enter:

neo4j.bat console

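Error:

'neo4j.bat' is not recognized as an internal or external command, operable program or batch file.

After switching to PowerShell:

neo4j.bat : The term 'neo4j.bat' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.

Hmm... I'm not sure what I touched, but it started working by itself. Both messages mean the shell could not find neo4j.bat, so the likely explanation is that the Neo4j bin directory was not yet on PATH in that terminal, and a terminal opened after the environment variables from 3.2 were saved picked it up.

3.4 Open the URL below and log in to Neo4j. The default username and password are both neo4j, and Neo4j asks for a new password on first login.

Open this URL while the console window running neo4j.bat console is still open.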

http://localhost:7474

That completes the installation.

4. Statements needed to build the graph

The video writes the code in Eclipse.

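Beginning of the file:

import os
import json
from py2neo import Graph, Node

class MedicalGraph:
    def __init__(self):
        # Directory that contains this script (works with both / and \ separators)
        cur_dir = os.path.dirname(os.path.abspath(__file__))
        self.data_path = os.path.join(cur_dir, 'data/medical2.json')
        # Connect to the local Neo4j server; py2neo 4.x and later take the credentials via auth=
        self.g = Graph("http://localhost:7474", auth=("neo4j", "airline-edward-volume-album-bagel-2728"))
        # Read the data file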

End of the file:

if __name__ == '__main__':
    handler = MedicalGraph()     # connect to the database
    handler.create_graphnodes()  # create the nodes
    handler.create_graphrels()   # create the edges

4.1 Extracting the key fields from the training data

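    def read_nodes(self):
        # Node collections
        drugs = []          # drug nodes
        foods = []          # food nodes
        checks = []         # medical-check nodes
        departments = []    # department nodes
        symptoms = []       # symptom nodes
        diseases = []       # disease nodes
        disease_infos = []  # one property dict per disease

        # Relation collections
        rels_department = []    # sub-department -> parent department
        rels_noteat = []        # disease -> food to avoid
        rels_doeat = []         # disease -> suitable food
        rels_recommandeat = []  # disease -> recommended recipe
        rels_symptom = []       # disease -> symptom
        rels_cause = []         # disease -> cause
        rels_category = []      # disease -> department

        count = 0
        with open(self.data_path, encoding='utf-8') as f:
            for data in f:  # one JSON record per line
                disease_dict = {}
                count += 1
                print(count)
                data_json = json.loads(data)  # parse the line as JSON
                disease = data_json['name']
                disease_dict['name'] = disease  # start a new dict for this disease
                diseases.append(disease)
                disease_dict['desc'] = ''  # all properties of the disease go into this dict
                disease_dict['prevent'] = ''

                # For each field, decide whether it is a property of the disease or a relation.
                # Link the disease name to its symptoms
                if 'symptom' in data_json:
                    symptoms += data_json['symptom']
                    for symptom in data_json['symptom']:
                        rels_symptom.append([disease, symptom])  # e.g. [cold, headache]

                # Disease and its causes
                if 'cause' in data_json:
                    for cause in data_json['cause']:
                        rels_cause.append([disease, cause])  # e.g. [cold, catching a chill]

                # Relations between departments
                if 'cure_department' in data_json:
                    cure_department = data_json['cure_department']
                    if len(cure_department) == 1:
                        rels_category.append([disease, cure_department[0]])
                    if len(cure_department) == 2:
                        big = cure_department[0]
                        small = cure_department[1]
                        rels_department.append([small, big])
                        rels_category.append([disease, small])

                # How the disease is treated
                if 'cure_way' in data_json:
                    disease_dict['cure_way'] = data_json['cure_way']

                disease_infos.append(disease_dict)

        # Return the collected nodes and relations
        return (set(drugs), set(foods), set(checks), set(departments), set(symptoms), set(diseases),
                disease_infos, rels_department, rels_noteat, rels_doeat, rels_recommandeat,
                rels_symptom, rels_cause, rels_category)

    #####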
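
To make the field handling in read_nodes easier to follow, here is a small self-contained sketch that applies the same rules to two in-memory records instead of data/medical2.json. The sample records and the helper name extract_relations are made up for illustration; only the field names name, symptom and cure_department come from the code above.

```python
import json

# Two made-up records in the one-JSON-object-per-line layout that read_nodes expects.
SAMPLE_LINES = [
    '{"name": "cold", "symptom": ["headache", "cough"], '
    '"cure_department": ["internal medicine", "respiratory"]}',
    '{"name": "sprain", "symptom": ["swelling"], "cure_department": ["orthopedics"]}',
]


def extract_relations(lines):
    """Hypothetical helper: collect nodes and edges from JSON lines."""
    diseases, symptoms = [], []
    rels_symptom, rels_category, rels_department = [], [], []
    for line in lines:
        record = json.loads(line)
        disease = record['name']
        diseases.append(disease)
        # A list-valued field becomes one edge per element.
        for symptom in record.get('symptom', []):
            symptoms.append(symptom)
            rels_symptom.append([disease, symptom])
        departments = record.get('cure_department', [])
        if len(departments) == 1:
            rels_category.append([disease, departments[0]])
        elif len(departments) == 2:
            big, small = departments
            rels_department.append([small, big])    # sub-department -> parent
            rels_category.append([disease, small])  # disease -> sub-department
    return set(diseases), set(symptoms), rels_symptom, rels_category, rels_department


diseases, symptoms, rels_symptom, rels_category, rels_department = extract_relations(SAMPLE_LINES)
print(rels_symptom)
print(rels_category)
print(rels_department)
```

Fields that hold a list (symptom) turn into edges, while a two-element cure_department also yields a department-to-department edge.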

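    # Finally, create the entity nodes of the knowledge graph, one label per entity type
    def create_graphnodes(self):
        # Unpack everything read_nodes() returns; only the node sets are used here
        (Drugs, Foods, Checks, Departments, Symptoms, Diseases, disease_infos,
         rels_department, rels_noteat, rels_doeat, rels_recommandeat,
         rels_symptom, rels_cause, rels_category) = self.read_nodes()

        self.create_disease_nodes(disease_infos)

        self.create_node('Drug', Drugs)
        print(len(Drugs))

        self.create_node('Food', Foods)
        print(len(Foods))

        self.create_node('Check', Checks)
        print(len(Checks))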

4.2 Creating the edge relations

    # Create the central disease nodes of the knowledge graph
    def create_disease_nodes(self, disease_infos):
        count = 0
        for disease_dict in disease_infos:
            node = Node("Disease", name=disease_dict['name'], desc=disease_dict['desc'],
                        prevent=disease_dict['prevent'])
            self.g.create(node)  # write the node to the connected database
            count += 1
            print(count)
        return

    # Create plain nodes that carry only a label and a name
    def create_node(self, label, nodes):
        count = 0
        for node_name in nodes:
            node = Node(label, name=node_name)
            self.g.create(node)
            count += 1
            print(count, len(nodes))
        return

    # Create the entity edges
    def create_graphrels(self):
        # Unpack everything read_nodes() returns; only rels_recommandeat is used in this excerpt
        (Drugs, Foods, Checks, Departments, Symptoms, Diseases, disease_infos,
         rels_department, rels_noteat, rels_doeat, rels_recommandeat,
         rels_symptom, rels_cause, rels_category) = self.read_nodes()
        # '推荐食谱' means "recommended recipe"; it is stored as the name property of the edge
        self.create_relationship('Disease', 'Food', rels_recommandeat, 'recommand_eat', '推荐食谱')

    # Create the edges between two kinds of entity
    def create_relationship(self, start_node, end_node, edges, rel_type, rel_name):
        count = 0
        # De-duplicate the edges
        set_edge = []
        for edge in edges:
            set_edge.append('###'.join(edge))  # e.g. 'dhcv###jsdb'
        total = len(set(set_edge))
        for edge in set(set_edge):
            edge = edge.split('###')
            p = edge[0]
            q = edge[1]
            query = "match(p:%s),(q:%s) where p.name='%s' and q.name='%s' create (p)-[rel:%s{name:'%s'}]->(q)" % (
                start_node, end_node, p, q, rel_type, rel_name)
            try:
                self.g.run(query)  # run the statement
                count += 1
                print(rel_type, count, total)
            except Exception as e:
                print(e)
        return
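
The de-duplication and the Cypher template in create_relationship can be tried without a running database. The sketch below only builds the query strings; the helper names dedupe_edges and build_query and the sample edges are assumptions made for illustration.

```python
def dedupe_edges(edges, sep='###'):
    """Join each [start, end] pair into one string so the pairs can go in a set."""
    unique = set(sep.join(edge) for edge in edges)
    # Sorting is only there to make the printed order stable.
    return sorted(item.split(sep) for item in unique)


def build_query(start_label, end_label, start_name, end_name, rel_type, rel_name):
    """Fill the same Cypher template that create_relationship uses."""
    return ("match(p:%s),(q:%s) where p.name='%s' and q.name='%s' "
            "create (p)-[rel:%s{name:'%s'}]->(q)"
            % (start_label, end_label, start_name, end_name, rel_type, rel_name))


# Made-up edges; the first and third are duplicates.
edges = [['cold', 'ginger soup'], ['cold', 'porridge'], ['cold', 'ginger soup']]
unique_edges = dedupe_edges(edges)
queries = [build_query('Disease', 'Food', p, q, 'recommand_eat', 'recommended recipe')
           for p, q in unique_edges]
for query in queries:
    print(query)
```

Because the names are pasted straight into the statement, a name that contains a single quote or the separator ### would break this approach, so it is worth checking the data for those characters first.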

4.3 Building the graph model

5. Implementing the dialogue part

5.1 Simple question answering

from question_classifier import *
from question_parser import *
from answer_search import *

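The text typed by the user is matched against the contents of the txt files.

5.2 Classification module

region_words is the union of all the feature words; the words in the user's input are matched against it.

actree builds a tree model over those words to speed up the matching.

ahocorasick is a package that implements the Aho-Corasick algorithm and performs the matching.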
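
To show what the tree model does, here is a minimal pure-Python Aho-Corasick matcher. It is a stand-in written for illustration and is not the pyahocorasick API; the class name KeywordMatcher and the sample words are assumptions.

```python
from collections import deque


class KeywordMatcher:
    """Minimal Aho-Corasick automaton (illustrative stand-in for pyahocorasick)."""

    def __init__(self, words):
        # Per trie node: outgoing edges, failure link, keywords that end here.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for word in words:
            self._add(word)
        self._build_failure_links()

    def _add(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first: a node's failure link points to the longest proper
        # suffix of its path that is also a path in the trie.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit the keywords that end at the suffix node.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find(self, text):
        """Return every keyword that occurs in text, in one left-to-right pass."""
        node, found = 0, []
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            found.extend(self.out[node])
        return found


region_words = ['cold', 'headache', 'ache', 'fever']  # made-up feature words
actree = KeywordMatcher(region_words)
matches = actree.find('i have a headache and a fever')
print(matches)
```

The input is scanned once no matter how many feature words there are, which is where the speed-up over checking each word separately comes from.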

This is the main function.

Question filtering mainly removes duplicate words and unneeded words, and then extracts the keywords.
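
One way to read this filtering step is sketched below: drop repeated matches, then drop any match that is contained in a longer match. Treating "unneeded words" as such shorter overlapping matches is an assumption, and filter_keywords is a hypothetical name.

```python
def filter_keywords(matches):
    """Drop duplicates, then drop any keyword contained in a longer match."""
    unique = list(dict.fromkeys(matches))  # de-duplicate, keep first-seen order
    return [word for word in unique
            if not any(word != other and word in other for other in unique)]


# Made-up matcher output: 'ache' is part of 'headache', 'fever' appears twice.
raw_matches = ['headache', 'ache', 'fever', 'fever']
keywords = filter_keywords(raw_matches)
print(keywords)
```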

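These are the classification statements.

Each keyword is passed to the classification code to check whether it matches, that is, whether the question belongs to that category (line 208 of the code does the classification).

Returning False means the keyword is not present; returning True means it is.

If the result is False, the next check is tried.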
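
The True/False checks described above can be sketched as a chain of tests. The cue-word lists, the category names and the function names check_words and classify are all hypothetical here; the real project keeps its own lists.

```python
# Hypothetical cue-word lists, one per question category.
SYMPTOM_CUES = ['symptom', 'sign of']
CAUSE_CUES = ['cause', 'why']


def check_words(cue_words, sentence):
    """Return True if any cue word occurs in the sentence, else False."""
    for word in cue_words:
        if word in sentence:
            return True
    return False


def classify(sentence):
    """Try each category in turn; a False result falls through to the next check."""
    if check_words(SYMPTOM_CUES, sentence):
        return 'disease_symptom'
    if check_words(CAUSE_CUES, sentence):
        return 'disease_cause'
    return 'disease_desc'  # fallback when no cue word matches


print(classify('what are the symptoms of a cold'))
print(classify('why do people catch a cold'))
```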

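Once the classification is done, a query statement is generated from the result (for Neo4j this is a Cypher statement rather than SQL).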
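
As a rough sketch of that last step, the question type can select a Cypher template that is then filled with the extracted entity. The templates, the relation name has_symptom and the function name build_queries are assumptions for illustration and are not taken from the project.

```python
# Hypothetical Cypher templates, one per question type.
QUERY_TEMPLATES = {
    'disease_symptom': ("MATCH (m:Disease)-[r:has_symptom]->(n:Symptom) "
                        "WHERE m.name = '{0}' RETURN m.name, n.name"),
    'disease_desc': "MATCH (m:Disease) WHERE m.name = '{0}' RETURN m.name, m.desc",
}


def build_queries(question_type, entities):
    """Return one Cypher statement per entity, or [] for an unknown type."""
    template = QUERY_TEMPLATES.get(question_type)
    if template is None:
        return []
    return [template.format(entity) for entity in entities]


for statement in build_queries('disease_symptom', ['cold']):
    print(statement)
```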

5.3 Connecting to the graph

6. Using Neo4j
