Posts in the category "Python Development"

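Compiler Principles Lab Assignment: Lexical Analysis

Notice: this is original code and reposting is welcome. To avoid trouble with the lab plagiarism check, please do not use it directly without permission. Author: 卢XX 20146049 (this notice expires after September 1, 2017)

Note: this algorithm is too brute-force. A new algorithm implemented in C++ will be posted after a while.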

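Lab content:

Implement a lexical analyzer for standard C

Lab objectives:

  1. Master the design method for lexical analysis of a programming language;
  2. Master the design and use of DFAs;
  3. Master the construction of finite automata from regular expressions.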

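The second objective can be made concrete with a small table-driven DFA. The sketch below is not part of the lab code that follows: it assumes a three-state machine that accepts only C identifiers and unsigned integer constants, and the names `char_class`, `TRANSITIONS`, `ACCEPTING`, and `classify` are made up for illustration.

```python
import string

# Input alphabet of the DFA, grouped into character classes (hypothetical names)
LETTERS = set(string.ascii_letters + "_")
DIGITS = set(string.digits)


def char_class(ch):
    """Map a character to the class the transition table is keyed on."""
    if ch in LETTERS:
        return "letter"
    if ch in DIGITS:
        return "digit"
    return "other"


# Transition table: state -> {character class -> next state}
TRANSITIONS = {
    "start": {"letter": "ident", "digit": "number"},
    "ident": {"letter": "ident", "digit": "ident"},
    "number": {"digit": "number"},
}

# Accepting states and the token kind each one stands for
ACCEPTING = {"ident": "identifier", "number": "integer constant"}


def classify(text):
    """Run the DFA over text; return the token kind, or None if rejected."""
    state = "start"
    for ch in text:
        state = TRANSITIONS[state].get(char_class(ch))
        if state is None:       # no transition defined: reject
            return None
    return ACCEPTING.get(state)
```

Keeping the transitions in a table means new token kinds are added by adding states and rows, while the driver loop stays unchanged.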
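Lab requirements:

  1. Token category codes
    (1) keywords, operators, and delimiters: one code per symbol; (2) identifiers: a single shared code; (3) constants: coded by type;
  2. Build a symbol table and a constant table during lexical analysis, and output them as text files;
  3. Output the final result of lexical analysis as a text file;
  4. Test the lexical analyzer's functionality, and provide the test data and results;
  5. Add appropriate comments to the program to improve readability;
  6. Write the lab report following software engineering practice, ending with a summary of lessons learned during the lab;
  7. Complete the lab report carefully and submit it on time.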

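Requirements 1 and 2 can be sketched with a regular-expression tokenizer. This is an illustration under stated assumptions, not the analyzer from this post: the keyword and operator lists are a tiny subset, the numeric codes are arbitrary, and `tokenize`, `CODES`, and the `*_CODE` constants are hypothetical names. Characters that match no pattern (including whitespace) are silently skipped.

```python
import re

# Illustrative subsets only; a real table would list every C symbol
KEYWORDS = ["int", "char", "float", "return"]
OPERATORS = ["==", "=", "+", "<"]           # longer operators come first
SEPARATORS = ["(", ")", "{", "}", ";", ","]

# Keywords, operators, separators: one code per symbol
CODES = {word: code
         for code, word in enumerate(KEYWORDS + OPERATORS + SEPARATORS, start=1)}
IDENTIFIER_CODE = len(CODES) + 1    # every identifier shares this code
INT_CONST_CODE = len(CODES) + 2     # constants are coded by type
CHAR_CONST_CODE = len(CODES) + 3

TOKEN_RE = re.compile(
    r"(?P<char>'(?:\\.|[^'\\])')"
    r"|(?P<int>\d+)"
    r"|(?P<word>[A-Za-z_]\w*)"
    r"|(?P<punct>" + "|".join(re.escape(p) for p in OPERATORS + SEPARATORS) + ")"
)


def tokenize(source):
    """Return (tokens, symbol_table, constant_table) for a C fragment."""
    tokens, symbols, constants = [], [], []
    for match in TOKEN_RE.finditer(source):
        text, kind = match.group(), match.lastgroup
        if kind == "word" and text in CODES:        # keyword
            tokens.append((CODES[text], text))
        elif kind == "word":                        # identifier
            if text not in symbols:
                symbols.append(text)
            tokens.append((IDENTIFIER_CODE, text))
        elif kind == "punct":                       # operator or separator
            tokens.append((CODES[text], text))
        else:                                       # int or char constant
            if text not in constants:
                constants.append(text)
            code = INT_CONST_CODE if kind == "int" else CHAR_CONST_CODE
            tokens.append((code, text))
    return tokens, symbols, constants
```

Calling `tokenize("int a = 10; char c = 'Z';")` yields the token stream plus a symbol table holding `a` and `c` and a constant table holding `10` and `'Z'`, which could then be written to text files as the requirements ask.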

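==Lab Results==

I. Notes

  1. The program uses a B/S (browser/server) architecture;
  2. The front end is written in HTML5, with the Semantic UI front-end library;
  3. The back end serves the site with Flask, a Python web framework, and renders dynamic pages with the lightweight Jinja2 template engine.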

II. Screenshots

[caption id="attachment_1636" align="aligncenter" width="1056"]Initial program interface[/caption]

[caption id="attachment_1637" align="aligncenter" width="1052"]Initial program interface[/caption]

[caption id="attachment_1638" align="aligncenter" width="1070"]Analysis complete[/caption]

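III. Data flow diagram

The data flows roughly as follows:

i. The user uploads a file from the browser;

ii. The web server receives the file, renames it with a timestamp, and saves it;

iii. The lexical analyzer reads the file, analyzes it, and saves the result to a file;

iv. The web server picks up the result file and offers it to the user for download;

v. The user downloads the result through the browser:

[caption id="attachment_1639" align="aligncenter" width="530"]Data flow diagram[/caption]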

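Step ii of the data flow, renaming the upload with a timestamp, can be sketched without any web framework. The helper name `timestamped_name` and its `now` parameter are assumptions made for this example; the allowed extensions mirror the ones used in the Flask code later in this post.

```python
import os
import time

# Extensions accepted for upload (same set as in the Flask app)
ALLOWED_EXTENSIONS = {"c", "cpp", "h"}


def timestamped_name(filename, now=None):
    """Return '<unix time>.<ext>' for an allowed file, or None otherwise.

    `now` is a hypothetical parameter that lets callers (and tests) fix the
    clock; by default the current time is used.
    """
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        return None
    if now is None:
        now = time.time()
    return str(int(now)) + "." + ext
```

One consequence of this naming scheme is that two uploads with the same extension arriving within the same second get the same name, so the later one overwrites the earlier one.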
IV. Program flowchart

[caption id="attachment_1640" align="aligncenter" width="570"]Program flowchart[/caption]

V. Class diagram

[caption id="attachment_1641" align="aligncenter" width="649"]Class diagram[/caption]

VI. Program code

#include <stdio.h>      // careful not to read <> as operators
#include "abc.h"        // this file name should not be recognized as a string
#define A 10

int main(){
    char a = 'Z';      //annotation A  //annotation B
    int i;       char c = getchar();    // really two statements on one line; note the "char" hidden inside getchar
    char o = "/";   // should be recognized as a character, not an operator
    char ch = getchar();
    for (i=0; i<10;i++)     /* annotation D */
    {
        printf("a");        // an "int" is hidden here (inside printf)
    }   /* this line contains both code and a comment

        annotation E
    fff*/

    // annotation F
    printf("Hello world!\n");
}
float add(int argA, double argB)
{
    //todo
    return argA + argB;
}

 

from flask import Flask, render_template, send_from_directory
from werkzeug.utils import secure_filename
import os
import time
import main as analyzer
from flask_wtf import FlaskForm
from wtforms import SubmitField
from flask_wtf.file import FileField, FileRequired, FileAllowed

UPLOAD_FOLDER  = './uploads'
ALLOWED_EXTENSIONS = set(['c', 'cpp', 'h'])


app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
app.config['SECRET_KEY'] = 'hard to guess string'


class UploadForm(FlaskForm):
    file = FileField(validators=[
        FileAllowed(ALLOWED_EXTENSIONS, u'Only C source files can be uploaded!'),
        FileRequired(u'No file selected!')])

    submit = SubmitField(u'Upload')


@app.route('/upload/', methods=['GET', 'POST'])
def upload():
    # Make sure the upload directory exists
    if not os.path.exists(UPLOAD_FOLDER):
        os.makedirs(UPLOAD_FOLDER)

    form = UploadForm()

    # Handle the form after the "Upload" button is clicked
    if form.validate_on_submit():
        filename = secure_filename(form.file.data.filename)
        # File extension. [-1] also works when secure_filename() has stripped
        # a non-ASCII base name and left only the extension.
        ext = filename.rsplit('.', 1)[-1]
        unix_time = int(time.time())
        new_filename = str(unix_time) + '.' + ext  # rename the uploaded file
        new_filename = os.path.join(UPLOAD_FOLDER, new_filename)
        form.file.data.save(new_filename)

        analysis = analyzer.ANALYZER(new_filename)
        analysis.analysis()
        output_file = analysis.save_to_file()
    else:
        output_file = None
    return render_template('index.html', form=form, filename=output_file)


@app.route('/download/<filename>')
def download(filename):
    return send_from_directory('output', filename, as_attachment=True)


def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1] in ALLOWED_EXTENSIONS


@app.route('/')
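def index():
    # upload() renders index.html with a form and a filename, so pass the
    # same variables here
    return render_template('index.html', form=UploadForm(), filename=None)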


if '__main__' == __name__:
    app.run(debug=True)


# main.py: the lexical analyzer, imported by the web app above as `analyzer`
import re
import os

# Operators
sys_operator = [
    "!",
    "~",
    "++",
    "--",
    "+",
    "-",
    "*",
    "&",
    "/",
    "%",
    "<<",
    ">>",
    "<",
    ">",
    "<=",
    ">=",
    "==",
    "!=",
    "^",
    "|",
    "&&",
    "||",
    "=",
    "+=",
    "-=",
    "/=",
    "*=",
    "&=",
    "^=",
    "|=",
    "<<=",
    ">>=",
    ","
]

# Preprocessor directives
sys_pre_compile = [
    "include",
    "define",
    "error",
    "if",
    "else",
    "elif",
    "endif",
    "ifndef",
    "undef",
    "line",
    "pragma",
]

# Keywords
sys_keyword = [
    "int",
    "float",
    "char",
    "long",
    "double",
    "auto",
    "short",
    "signed",
    "unsigned",
    "struct",
    "union",
    "enum",
    "static",
    "switch",
    "case",
    "default",
    "break",
    "register",
    "const",
    "volatile",
    "typedef",
    "extern",
    "return",
    "void",
    "while",
    "if",
    "else",
    "for",
    "goto",
    "sizeof",
    "do",
    "continue"
]

# Keywords related to data types
sys_type_keyword = [
    "int",
    "float",
    "char",
    "long",
    "double",
    "auto",
    "short",
    "signed",
    "unsigned",
    "struct",
    "union",
    "enum",
    "static",
    "default",
    "const",
]

# Separators
sys_separator = [
    "{",
    "}",
    "(",
    ")",
    "[",
    "]",
]


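class ANALYZER:
    def __init__(self, file):
        """
        Read the source file into a list of lines.
        :param file: the C source file to analyze
        """
        self.__file_name = file
        # Result lists are created per instance. As class attributes they would
        # be shared by every ANALYZER, so results would pile up across uploads.
        self.__identifier = []      # identifiers
        self.__keyword = []         # keywords
        self.__operator = []        # operators
        self.__separator = []       # separators
        self.__const_num = []       # numeric constants
        self.__const_char = []      # character constants
        self.__const_string = []    # string constants
        self.__esc_char = []        # escape characters
        self.__pre_compile = []     # preprocessor directives
        self.__header_file = []     # header files
        self.__error_line = []      # numbers of lines found to contain errors

        with open(file, "r") as source:
            self.__source_code = source.readlines()     # source code, one entry per line
        # print("Total lines: " + str(len(self.__source_code)))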

    def analysis(self):
        self.delete_spaces_and_newline_char()
        self.delete_annotation()
        self.analysis_pre_compile()
        self.analysis_const_string()
        self.analysis_const_char()
        self.analysis_operator()
        self.analysis_separator()
        self.analysis_keyword_and_identifier()

    def delete_annotation(self):
        """
        Remove comments, both /* ... */ and // forms.
        Limitation: comment markers inside string literals are not recognized.
        :return: None
        """
        result = []
        in_block = False    # True while inside an unterminated /* ... */
        for line in self.__source_code:
            kept = ""       # the code that survives on this line
            pos = 0
            while pos < len(line):
                if in_block:
                    end = line.find("*/", pos)
                    if end == -1:
                        pos = len(line)     # the rest of the line is comment
                    else:
                        in_block = False
                        pos = end + 2
                elif line.startswith("/*", pos):
                    in_block = True
                    pos += 2
                elif line.startswith("//", pos):
                    break                   # the rest of the line is comment
                else:
                    kept += line[pos]
                    pos += 1
            result.append(kept)
        self.__source_code = result
        # Removing comments can leave blank lines behind; drop them
        self.delete_spaces_and_newline_char()

    def delete_spaces_and_newline_char(self):
        """
        Strip leading whitespace and trailing newlines, and drop blank lines.
        :return: None
        """
        stripped = [line.strip() for line in self.__source_code]
        self.__source_code = [line for line in stripped if line]

    def analysis_separator(self):
        """
        Find separators.
        :return: None
        """
        for line in self.__source_code:
            for item in sys_separator:
                loc = line.find(item)
                if loc != -1:
                    self.__separator.append(item)
        self.__separator = set(self.__separator)

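    def analysis_operator(self):
        """
        Find operators.
        :return: None
        """
        for line in self.__source_code:
            # Not counted as operators:
            #   the <> around a header name
            #   operators inside a string literal
            #   operators inside a character literal
            if not line.startswith("#include"):
                loc1 = loc2 = loc3 = loc4 = -1
                loc1 = line.find("\"")
                if loc1 != -1:
                    loc2 = line.find("\"", loc1 + 1)
                loc3 = line.find("\'")
                if loc3 != -1:
                    # Look for the closing quote after the opening one
                    loc4 = line.find("\'", loc3 + 1)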

                for item in sys_operator:
                    loc = line.find(item)
                    if loc != -1:
                        if loc < loc1 or loc > loc2 or loc1 == -1:
                            if loc < loc3 or loc > loc4 or loc3 == -1:
                                self.__operator.append(item)

    def analysis_keyword_and_identifier(self):
        """
        Find keywords and identifiers.
        :return: None
        """
        for line in self.__source_code:
            for item in sys_keyword:
                # Match whole words only, so "int" is not found inside "printf"
                if re.search(r'\b' + item + r'\b', line):
                    self.__keyword.append(item)

            # Look for identifiers: the word that follows a type keyword.
            # The split keeps the separators, so that word sits two slots on.
            words = re.split(r'(\(|;|,| |\))', line)
            for index, word in enumerate(words):
                if word in sys_type_keyword and index + 2 < len(words):
                    candidate = words[index + 2]
                    if candidate:
                        self.__identifier.append(candidate)

    def analysis_pre_compile(self):
        """
        Analyze preprocessor directives.
        :return: None
        """
        for line_no, line in enumerate(self.__source_code):
            # Only lines that start with # are preprocessor directives
            if not line.startswith("#"):
                continue
            words = line.lstrip("#").split()
            if not words:
                continue
            if words[0] == "define":
                # Macro name and value (when present): a single character is
                # stored as a char constant, anything longer as a string constant
                for word in words[1:3]:
                    if len(word) == 1:
                        self.__const_char.append(word)
                    else:
                        self.__const_string.append(word)
            # Header inclusion
            if words[0] == "include" and len(words) >= 2:
                if len(words) > 2:
                    self.__error_line.append(line_no)
                # Strip the surrounding "" or <> from the header name
                self.__header_file.append(words[1].strip('"<>'))
            for word in words:
                if word in sys_pre_compile:
                    self.__pre_compile.append(word)

    def analysis_const_char(self):
        """
        Find character constants outside preprocessor directives.
        :return: None
        """
        for line_no, line in enumerate(self.__source_code):
            loc_begin = line.find("\'")
            if loc_begin != -1:
                loc_end = line.find("\'", loc_begin + 1)
                if loc_end - loc_begin == 2:
                    self.__const_char.append(line[loc_begin + 1])
                else:
                    # Anything other than exactly one character between the
                    # quotes (or a missing closing quote) is flagged as an
                    # error; note this also flags escapes such as '\n'
                    self.__error_line.append(line_no)

    def analysis_const_string(self):
        """
        Find string constants outside preprocessor directives.
        :return: None
        """
        for line in self.__source_code:
            loc_begin = line.find("\"")
            if loc_begin != -1:
                temp_string = line[loc_begin+1:]
                loc_end = temp_string.find("\"")
                if loc_end == -1:
                    continue    # no closing quote on this line
                temp_string = temp_string[:loc_end]
                if len(temp_string) == 1:
                    # A one-character string such as "/" is recorded as a character constant
                    self.__const_char.append(temp_string)
                else:
                    # Skip header names, as in: #include "abc.h"
                    if not line.startswith("#include"):
                        self.__const_string.append(temp_string)

    def analysis_ternary_operator(self):
        """
        Analyze the ternary operator (not implemented yet).
        :return: None
        """
        pass

    def show_source_code(self):
        for line in self.__source_code:
            print(line)

    def __report_sections(self):
        """
        Remove duplicates and pair each group of tokens with its heading.
        :return: list of (heading, set of tokens)
        """
        return [
            ("Preprocessor directives", set(self.__pre_compile)),
            ("Header files", set(self.__header_file)),
            ("String constants", set(self.__const_string)),
            ("Character constants", set(self.__const_char)),
            ("Operators", set(self.__operator)),
            ("Separators", set(self.__separator)),
            ("Keywords", set(self.__keyword)),
            ("Identifiers", set(self.__identifier)),
        ]

    def display_detail(self):
        for heading, items in self.__report_sections():
            print("=============== " + heading + " ===============")
            for item in items:
                print(item)

    def save_to_file(self):
        # Name the result after the source file: 1480000000.c -> 1480000000.txt
        base_name = os.path.splitext(os.path.basename(self.__file_name))[0]
        save_file_name = base_name + ".txt"
        # Make sure the output directory exists
        if not os.path.exists("./output"):
            os.makedirs("./output")
        save_path = os.path.join("./output", save_file_name)
        print(save_path)

        with open(save_path, "w") as save_file:
            save_file.write("=============== Source code ===============\n")
            for line in self.__source_code:
                save_file.write(line + "\n")
            for heading, items in self.__report_sections():
                save_file.write("=============== " + heading + " ===============\n")
                for item in items:
                    save_file.write(item + "\n")

        # Return the bare file name so the web layer can build a download link
        return save_file_name


if __name__ == "__main__":

    filename = "./test.c"
    analyzer = ANALYZER(filename)
    analyzer.analysis()
    analyzer.display_detail()
    print("=============Formatted source code===========")
    analyzer.show_source_code()

 

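Plotting with python-matplotlib under KDE on openSUSE

Under KDE on openSUSE, python-matplotlib cannot plot out of the box: no window pops up to show the figure.

The fix is as follows.

I use the Qt5 backend, so first install python-matplotlib-qt5.

Then run the following in a Python session:

>>> import matplotlib

>>> matplotlib.matplotlib_fname()
u'/usr/lib64/python2.7/site-packages/matplotlib/mpl-data/matplotlibrc'

This prints the location of the configuration file. Edit that file and change Agg to Qt5Agg:

backend : Qt5Agg

That is all. Every Qt5 above can be replaced with Qt4.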

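The edit to matplotlibrc can also be scripted. The sketch below works on the text of the configuration file only and does not import matplotlib; `set_backend` is a hypothetical helper written for this example, and it assumes the setting uses the `backend : name` form shown above. Commented-out lines are left alone.

```python
import re


def set_backend(rc_text, backend):
    """Return rc_text with its active 'backend' line replaced, or appended
    when the file has no active backend line (hypothetical helper)."""
    # An active setting starts the line; "#backend : ..." does not match
    pattern = re.compile(r"^[ \t]*backend[ \t]*:.*$", re.MULTILINE)
    new_line = "backend : " + backend
    if pattern.search(rc_text):
        return pattern.sub(new_line, rc_text, count=1)
    return rc_text.rstrip("\n") + "\n" + new_line + "\n"
```

To apply it, read the file reported by `matplotlib.matplotlib_fname()`, pass its text through `set_backend`, and write the result back; editing a system-wide file usually needs root privileges.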
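Operating Systems Lab Assignment: Implementing Memory Allocation and Reclamation Algorithms

Lab environment:

  • Operating system: Fedora 25 with Linux 4.8.12-300.fc25.x86_64
  • Programming language: Python 3.5.2
  • Development environment: PyCharm 2016.3

Lab content:

  1. Simulate main memory allocation in an operating system: design allocation and reclamation routines using variable-partition storage management, without actually loading or starting any jobs.
  2. Allocate main memory using the first-fit, best-fit, and worst-fit strategies.
  3. When a new job asks to be loaded into main memory, search the free-block table for a free block that is large enough. If the block found is larger than the job needs, split it into two parts: one becomes the occupied block and the other becomes a new free block.
  4. When a job leaves, if the returned region is adjacent to other free blocks, merge them into one larger free block and record it in the free-block table.
  5. Run the program and output the changes to the relevant data-structure entries and the current state of memory.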

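The three strategies in item 2 differ only in which sufficiently large free block they pick. A minimal sketch, assuming free blocks are plain `(address, size)` tuples sorted by address rather than the `MemoryBlock` objects used in the program below; `choose_block` is a hypothetical name.

```python
def choose_block(free_blocks, size, strategy):
    """Return the index of the free block the strategy picks, or None.

    free_blocks: list of (address, size) tuples sorted by address.
    strategy: "first", "best", or "worst".
    """
    # Only blocks large enough for the request are candidates
    candidates = [index for index, (_, block_size) in enumerate(free_blocks)
                  if block_size >= size]
    if not candidates:
        return None
    if strategy == "first":     # lowest address that fits
        return candidates[0]
    if strategy == "best":      # smallest block that fits
        return min(candidates, key=lambda index: free_blocks[index][1])
    if strategy == "worst":     # largest block
        return max(candidates, key=lambda index: free_blocks[index][1])
    raise ValueError("unknown strategy: " + strategy)
```

With free blocks `[(0, 30), (40, 12), (60, 50)]` and a request for 10 units, first fit picks index 0, best fit index 1, and worst fit index 2.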
Program code:

# coding=utf-8
import random


class MemoryBlock:
    """Class of memory block"""
    def __init__(self, size=0, address=0):
        # One constructor with defaults: Python has no overloading, so a
        # second __init__ would silently replace the first
        self.__size = size
        self.__address = address

    def set_size(self, size):
        self.__size = size

    def set_address(self, address):
        self.__address = address

    def get_size(self):
        return self.__size

    def get_address(self):
        return self.__address

    def add_size(self, size):
        self.__size += size


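class MemoryTable:
    """
    Memory table, contains all memory blocks
    Based on Python List
    """

    def __init__(self):
        # Per-instance list; a class-level list would be shared by all tables
        self.__list = []

    def add(self, obj):
        # Append at end
        self.__list.append(obj)

    def block_at(self, id):
        return self.__list[id]

    def get_count(self):
        return len(self.__list)

    def max(self):
        """Return the largest free block, or None if the table is empty."""
        if not self.__list:
            return None
        max_no = 0
        for cnt in range(1, len(self.__list)):
            if self.__list[cnt].get_size() > self.__list[max_no].get_size():
                max_no = cnt

        return self.__list[max_no]

    def give_back(self, obj):
        """Return a block to the table and merge it with adjacent free blocks."""
        # Find the position that keeps the table sorted by address
        pos = 0
        while pos < len(self.__list) and self.__list[pos].get_address() < obj.get_address():
            pos += 1
        self.__list.insert(pos, obj)
        # Merge with the following block if the two are adjacent
        if pos + 1 < len(self.__list):
            nxt = self.__list[pos + 1]
            if obj.get_address() + obj.get_size() == nxt.get_address():
                obj.add_size(nxt.get_size())
                del self.__list[pos + 1]
        # Merge with the preceding block if the two are adjacent
        if pos > 0:
            prev = self.__list[pos - 1]
            if prev.get_address() + prev.get_size() == obj.get_address():
                prev.add_size(obj.get_size())
                del self.__list[pos]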
    def first_fit(self, size):
        for cnt in range(0, self.get_count()):
            if self.block_at(cnt).get_size() >= size:
                temp_return = self.block_at(cnt)
                del self.__list[cnt]

                if temp_return.get_size() == size:
                    return temp_return
                else:
                    """
                    When the free block found is too big
                    divide it into two blocks
                    one to occupied(occupied_block)
                    and the other(back_block) going back to table
                    """
                    back_block = MemoryBlock(temp_return.get_size()-size, temp_return.get_address())
                    occupied_block = MemoryBlock(size, back_block.get_size()+back_block.get_address())
                    self.__list.insert(cnt, back_block)
                    return occupied_block
        return None

    def best_fit(self, size):
        # Pick the smallest block that is still large enough (an exact fit counts)
        min_no = -1
        for cnt in range(0, self.get_count()):
            block_size = self.__list[cnt].get_size()
            if block_size >= size:
                if min_no == -1 or block_size < self.__list[min_no].get_size():
                    min_no = cnt
        if min_no == -1:
            return None
        temp_return = self.__list[min_no]
        cnt = min_no
        del self.__list[cnt]
        if temp_return.get_size() == size:
            return temp_return
        else:
            back_block = MemoryBlock(temp_return.get_size() - size, temp_return.get_address())
            occupied_block = MemoryBlock(size, back_block.get_size() + back_block.get_address())
            self.__list.insert(cnt, back_block)
            return occupied_block

    def worst_fit(self, size):
        temp_return = self.max()
        # No free block at all, or even the largest one is too small
        if temp_return is None or temp_return.get_size() < size:
            return None
        cnt = self.__list.index(temp_return)
        del self.__list[cnt]
        if temp_return.get_size() == size:
            return temp_return
        else:
            back_block = MemoryBlock(temp_return.get_size()-size, temp_return.get_address())
            occupied_block = MemoryBlock(size, back_block.get_size()+back_block.get_address())
            self.__list.insert(cnt, back_block)
            return occupied_block

    def show_detail(self):
        """Show details about all blocks"""
        for cnt in range(0, len(self.__list)):
            print("No." + str(cnt)+" (addr:" + str(self.__list[cnt].get_address())
                  +", size:" + str(self.__list[cnt].get_size()) + ")")


class PROCESS:
    """ Process class """
    __need_size = 0

    def __init__(self, size):
        self.__need_size = size

    def get_size(self):
        return self.__need_size

    def set_size(self, size):
        self.__need_size = size


"""
The beginning of time, "main" function
"""
if __name__ == "__main__":
    temp_size = 0
    temp_address = 0
    # Free memory block table
    free_table = MemoryTable()
    # Allocated memory block table
    alloc_table = []

    """
    Generate two memory blocks. Every one has a random size.
    """
    temp_size = random.randint(1, 50)
    temp_block = MemoryBlock(temp_size, 0)
    free_table.add(temp_block)

    for i in range(1, 2):
        temp_size = random.randint(1, 50)
        pre_block = free_table.block_at(i-1)
        temp_address = pre_block.get_size() + pre_block.get_address()

        temp_block = MemoryBlock(temp_size, temp_address)
        free_table.add(temp_block)

    free_table.show_detail()

    for i in range(1, 3):
        print("============================================")
        process = PROCESS(random.randint(1, 20))
        print("Process " + str(i) + " need: " + str(process.get_size()))
        alloc_block = free_table.first_fit(process.get_size())
        if alloc_block is None:
            print("Failed: memory size not enough!")
        else:
            print("Allocated size: " + str(alloc_block.get_size()))
            print("Allocated address:" + str(alloc_block.get_address()))
            alloc_table.append(alloc_block)
        free_table.show_detail()

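    # Return every allocated block, most recent first. A failed allocation
    # leaves fewer than two entries, so do not index alloc_table directly.
    for block in reversed(alloc_table):
        free_table.give_back(block)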

    print("============================================")
    print("Finally:")
    free_table.show_detail()

 

 

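Python + Flask + SAE WeChat Development: Connecting

This post covers the connection step of WeChat development with Flask. The example runs on Sina SAE, which currently costs 10 cloud beans (0.1 yuan) per day. How to apply for SAE and for a WeChat official account is not repeated here.

Create three files: config.yaml, index.wsgi, and seahi.py

config.yaml:

name: seahi
version: 2

index.wsgi:

import sae
from seahi import app
application=sae.create_wsgi_app(app)

seahi.py:

# coding=utf-8
from flask import Flask, request, make_response
from hashlib import sha1
import time
import xml.etree.ElementTree as ET

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def check():
    # WeChat server verification
    if request.method == 'GET':
        token = 'ta5dv1'
        signature = request.args.get('signature', '')
        echostr = request.args.get('echostr', '')
        timestamp = request.args.get('timestamp', '')
        nonce = request.args.get('nonce', '')
        # Sort token, timestamp, and nonce, join them, and hash with SHA-1
        tmp = [timestamp, nonce, token]
        tmp.sort()
        tmp = ''.join(tmp)
        if signature == sha1(tmp.encode(encoding='utf-8')).hexdigest():
            return make_response(echostr)
        else:
            return 'Access denied.'
    # Message handling (POST) is outside the scope of this part; reply with
    # an empty body so that the view always returns a response
    return make_response('')

if __name__ == '__main__':
    app.run()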

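The signature check in `check()` can be pulled out into a function that is testable without Flask. This is a sketch using the same rule as the code above (SHA-1 over the sorted, concatenated token, timestamp, and nonce); the name `signature_is_valid` is made up here, and `hmac.compare_digest` is swapped in for `==` as a constant-time comparison, which the original code does not use.

```python
import hashlib
import hmac


def signature_is_valid(token, timestamp, nonce, signature):
    """Check a signature against SHA-1 of the sorted, joined inputs."""
    joined = "".join(sorted([token, timestamp, nonce]))
    expected = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    # Compare as bytes so that non-ASCII input cannot raise a TypeError
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature.encode("utf-8"))
```

Inside the view, the four arguments would come from the configured token and from `request.args`, exactly as in `check()`.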

 

Scraping Baidu Search Results with Python

I wrote this small script to collect the subdomains under tjut.edu.cn. For a complete picture of what actually resolves, the nslookup tool is still the better choice.

import re
import urllib.request
from bs4 import BeautifulSoup


def getlist(offset):
    # pn is the result offset; each page holds 10 results
    url = "http://www.baidu.com/s?q1=site:*.tjut.edu.cn&pn=" + str(offset)
    html = urllib.request.urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find('div', id="content_left")
    if content is None:
        return
    con = content.find_all('div', class_='result c-container ')
    addr_pattern = re.compile('<.*?>(.*?)</a>', re.S)
    detail_pattern = re.compile('target="_blank">(.*?)/.*?</a>', re.S)
    with open("tjut.txt", "a+") as out:
        for entry in con:
            addr_div = entry.find_all('div', class_='f13')
            addr = re.search(addr_pattern, str(addr_div))
            if addr is None:
                continue
            # Keep only the host name in front of the first "/"
            addr_final = re.search(detail_pattern, addr.group(0).strip())
            if addr_final is None:
                continue
            temp = addr_final.group(1).strip()
            print(temp)
            out.write(temp + '\n')


for i in range(1, 200):
    getlist(i * 10)
    print(i)

The code is fairly simple and was written casually, without much attention to structure.

After the script finished, I removed duplicates with the command cat tjut.txt | sort | uniq > tjutFinal.txt

Here are the final results: 67 records found (nslookup may turn up 104):

chinese.tjut.edu.cn
cs.tjut.edu.cn
eie.tjut.edu.cn
ele.tjut.edu.cn
gjjl.tjut.edu.cn
ha.tjut.edu.cn
hqc.tjut.edu.cn
hr.tjut.edu.cn
huagong.tjut.edu.cn
jbg.tjut.edu.cn
jbw.tjut.edu.cn
jgdw.tjut.edu.cn
jgh.tjut.edu.cn
jgz.tjut.edu.cn
jjc.tjut.edu.cn
jj.tjut.edu.cn
jlt.tjut.edu.cn
jsjch.tjut.edu.cn
jtz.tjut.edu.cn
jw.tjut.edu.cn
jy.tjut.edu.cn
jyweb.tjut.edu.cn
jzz.tjut.edu.cn
kcj.tjut.edu.cn
kczx.tjut.edu.cn
kjc.tjut.edu.cn
lib.tjut.edu.cn
lx.tjut.edu.cn
lxy.tjut.edu.cn
lxyz.tjut.edu.cn
mail.tjut.edu.cn
mater.tjut.edu.cn
ms.tjut.edu.cn
my.tjut.edu.cn
nem.tjut.edu.cn
news.tjut.edu.cn
pay.tjut.edu.cn
rsc.tjut.edu.cn
scic.tjut.edu.cn
sclc.tjut.edu.cn
shenbo.org.tjut.edu.cn
ssfw.tjut.edu.cn
study.tjut.edu.cn
www.tjut.edu.cn
xcb.tjut.edu.cn
xgb.tjut.edu.cn
xg.tjut.edu.cn
xiaoban.org.tjut.edu.cn
xiaoban.tjut.edu.cn
xk.tjut.edu.cn
xny.tjut.edu.cn
xss.tjut.edu.cn
xxgk.tjut.edu.cn
xxs.tjut.edu.cn
yda.tjut.edu.cn
yfz.tjut.edu.cn
yhy.tjut.edu.cn
yjs.tjut.edu.cn
yys.tjut.edu.cn
zdh.tjut.edu.cn
zsb.tjut.edu.cn
ztjy.tjut.edu.cn
zxzf.tjut.edu.cn
dns.tjut.edu.cn
dns1.tjut.edu.cn
mail.tjut.edu.cn
postmaster.tjut.edu.cn

 

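Removing Duplicate Lines from a Text File with Python

After the previous post, "Crawling Chinese Library Classification Numbers with Python (with txt and pdf downloads)", I went back and crawled the data, and found a great many duplicate lines. Hence the need to remove them.

This method relies on the set type: an unordered collection of unique elements.

The idea is to read the file line by line into a list and convert it to a set, which drops the duplicates, then convert the set back to a list.

# coding=utf-8
with open("input.txt", 'r') as fin:
    bufferedlines = fin.readlines()

# Converting to a set drops duplicates; sort afterwards because a set is unordered
result = sorted(set(bufferedlines))

# 'a+' appends, so running the script twice writes the lines twice
with open('output.txt', 'a+') as fout:
    for line in result:
        print(line, end='')
        fout.write(line)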

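Because a set is unordered, the approach above has to sort the result and loses the original line order. Where that order matters, a dictionary can do the same job, since dictionaries keep insertion order in Python 3.7 and later. This is an alternative sketch rather than the method used in this post, and `unique_lines` is a hypothetical name.

```python
def unique_lines(lines):
    """Drop repeated lines, keeping the first occurrence of each in order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))
```

Lines are compared exactly, so a final line without a trailing newline is treated as different from the same text followed by one.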
 

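Crawling Chinese Library Classification Numbers with Python (with txt and pdf downloads)

About the CLC

The Chinese Library Classification (originally the Classification for Chinese Library Books) is a large, representative, comprehensive classification scheme compiled and published after the founding of the People's Republic of China. It is the most widely used classification system in Chinese libraries today and is known for short as 《中图法》 (CLC). The first edition appeared in 1975 and the fourth edition in 1999. The revised fourth edition added classes for classifying documents other than books and marked them with "+" to distinguish them from the classes for books, which is why the scheme was formally renamed Chinese Library Classification; the short name stayed the same. The fourth edition comprehensively adds new subjects and expands the class hierarchy so that the scheme keeps pace with developments in science and technology. It also standardizes the classes, improves the cross-reference and annotation systems, adjusts the class hierarchy, and adds or revises the auxiliary tables, which markedly improves the extensibility of the classes and the accuracy of classification.

Crawling the data

These days I am working on a project with classmates, and my main job is to supply some data. The first batch is the CLC classification numbers. After a long search online I only found a fairly complete pdf, which is awkward to extract data from, so in the end I crawled http://ztflh.xhma.com/ instead. I am sharing the crawled data here, along with the source code.

pdf download (detailed version): click to download

txt download (condensed version): click to download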

Python source:

# coding=utf-8
import re
import urllib.request
from bs4 import BeautifulSoup


def spider(url):
    removeLi = re.compile('<li>|</li>')
    removeSpan = re.compile('<span.*?>')
    replaceSpan = re.compile('</span>')
    removeAddrLeft = re.compile('<a.*?>')
    removeAddrRight = re.compile('</a>')
    # tmp.txt records each page as it is crawled, so that progress is easy
    # to check if the run is interrupted
    with open("tmp.txt", "a") as tmpFile:
        tmpFile.write(url + '\n')
    res = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(res, "html.parser")
    ul = soup.ul
    lis = ul.find_all("li")

    # Drop the first and the last <li>
    del lis[0]
    lis.pop()

    # ztflh_zhanghao.txt stores the crawled data
    with open("ztflh_zhanghao.txt", "a", encoding="utf-8") as file:
        for li in lis:
            li = str(li)
            # Turn </span> into a tab, then strip the remaining tags
            li = re.sub(replaceSpan, "\t", li)
            li = re.sub(removeSpan, "", li)
            li = re.sub(removeLi, "", li)
            li = re.sub(removeAddrLeft, "", li)
            li = re.sub(removeAddrRight, "", li)
            print(li)
            file.write(li + '\n')


def main():
    baseURL = "http://ztflh.xhma.com/"
    spider(baseURL)
    for i in range(2, 50000):
        spider(baseURL + 'html/' + str(i) + '.html')


main()

[caption id="attachment_1014" align="aligncenter" width="806"]Screenshot of the crawl in progress[/caption]

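[Repost] UnicodeEncodeError: 'ascii' codec can't encode characters in position xxx ordinal not in range

Python 2 ships with ascii as its default encoding. When non-ASCII data shows up in a program, Python often fails with an error like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x?? in position 1: ordinal not in range(128)

In that case you need to set Python's default encoding yourself, usually to utf8.

To check the current default encoding, enter the following in the interpreter:

import sys
sys.getdefaultencoding()

To change the default encoding:

sys.setdefaultencoding('utf8')

If you get the following error, run reload(sys) first:

AttributeError: 'module' object has no attribute 'setdefaultencoding'

Running sys.getdefaultencoding() again now shows that the encoding has been set to utf8. A change made in the interpreter only lasts for that session, though: after the interpreter restarts, the encoding is reset to the default ascii.

Setting Python's default encoding:

One solution is to add the following code to the program:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

Another is to create a sitecustomize.py in Python's Lib\site-packages folder with the following content:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

Python runs this file automatically at startup and sets the default encoding, so the fix does not have to be added by hand each time. It is a once-and-for-all solution.

A further option is to force utf8 everywhere the program deals with encodings, that is, to add encode("utf8") calls. This is not recommended, because missing a single spot leads to a flood of error reports.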
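
A variation on that last option, which the reposted text does not cover, is to name the encoding only at the points where text enters or leaves the program, such as file reads and writes, so that nothing depends on the interpreter's default. A minimal sketch, assuming UTF-8 files; `roundtrip` is a hypothetical helper, and `io.open` is used because it accepts an `encoding` argument in both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def roundtrip(text):
    """Write text to a temporary UTF-8 file and read it back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf-8") as handle:
            handle.write(text)      # encoded to UTF-8 on the way out
        with io.open(path, "r", encoding="utf-8") as handle:
            return handle.read()    # decoded from UTF-8 on the way in
    finally:
        os.remove(path)
```

The text comes back unchanged whatever `sys.getdefaultencoding()` reports, because both the write and the read state their encoding explicitly.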