What is the best practice for downloading a file and processing it in Python?

I need to implement an endpoint. The input is

{"pdf_file": "http://myserver/somefolder/somefile.pdf"}

and the output is

{"key1": "value1", "key2": "value2"}

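The PDF files are text documents ranging from a few pages to a few dozen pages.

Current test results:
Downloading the file takes about 5 seconds
Processing it takes about 5 seconds

The question is: how can I speed this process up?

Here is an example I found online: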


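~~~python
import asyncio
import io
import os
from concurrent.futures import ProcessPoolExecutor
from http import HTTPStatus
from typing import Dict, Optional
from uuid import UUID, uuid4

import aiohttp
from fastapi import BackgroundTasks, FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

# cpu_bound_func(stream, filename) is the PDF-processing function.
# It is not shown in the post and must be importable at module level
# so that the process pool can pickle a reference to it.


# Request model; not shown in the post, assumed from the interface described above.
class InsurancePoliciesExtractionRequest(BaseModel):
    pdf_file: str  # URL of the PDF to download


class Job(BaseModel):
    # default_factory creates a fresh UUID per job; default=uuid4() is
    # evaluated only once, so every job would share the same uid.
    uid: UUID = Field(default_factory=uuid4)
    status: str = "in_progress"
    result: Optional[dict] = None  # e.g. {"key1": "value1", "key2": "value2"}


jobs: Dict[UUID, Job] = {}


async def run_in_process(fn, *args):
    loop = asyncio.get_running_loop()
    # Run fn in the process pool, wait for it, and return its result.
    return await loop.run_in_executor(app.state.executor, fn, *args)


async def start_cpu_bound_task(uid: UUID, stream: io.BytesIO, filename: str) -> None:
    jobs[uid].result = await run_in_process(cpu_bound_func, stream, filename)
    jobs[uid].status = "complete"


@app.post("/new_cpu_bound_task/", status_code=HTTPStatus.ACCEPTED)
async def task_handler(req: InsurancePoliciesExtractionRequest, background_tasks: BackgroundTasks):
    # --- my changes start here ---
    async with aiohttp.ClientSession() as session:
        async with session.get(req.pdf_file) as resp:
            if resp.status != 200:
                # Without this check, content stays None and filename is undefined.
                raise HTTPException(status_code=HTTPStatus.BAD_GATEWAY,
                                    detail="could not download pdf_file")
            content = io.BytesIO(await resp.read())
    filename = os.path.basename(req.pdf_file)
    # --- my changes end here ---
    new_task = Job()  # register the job only after the download succeeded
    jobs[new_task.uid] = new_task
    background_tasks.add_task(start_cpu_bound_task, new_task.uid, content, filename)
    return new_task


@app.get("/status/{uid}")
async def status_handler(uid: UUID):
    if uid not in jobs:
        raise HTTPException(status_code=HTTPStatus.NOT_FOUND, detail="unknown job")
    return jobs[uid]


@app.on_event("startup")
async def startup_event():
    app.state.executor = ProcessPoolExecutor()


@app.on_event("shutdown")
async def on_shutdown():
    app.state.executor.shutdown()
~~~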
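
For reference, a client of the two endpoints above would submit the job and then poll `/status/{uid}` until the status is no longer `in_progress`. The following is only a minimal sketch of that polling loop: `wait_for_job` and `http_status_getter` are hypothetical names, and the base URL passed to `http_status_getter` is an assumption, not something from the example.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def http_status_getter(base_url: str, uid: str) -> Callable[[], Dict]:
    """Build a callable that fetches GET {base_url}/status/{uid} as a dict.

    base_url is an assumption (wherever the service is deployed).
    """
    def get_status() -> Dict:
        with urllib.request.urlopen(f"{base_url}/status/{uid}", timeout=10) as resp:
            return json.load(resp)
    return get_status


def wait_for_job(get_status: Callable[[], Dict],
                 interval: float = 0.5,
                 timeout: float = 60.0) -> Dict:
    """Poll get_status() until the job leaves "in_progress" or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_status()
        if job.get("status") != "in_progress":
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        time.sleep(interval)
```

Passing the status lookup in as a callable keeps the loop independent of the HTTP library, so it can be exercised without a running server.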

1 reply • 2022-06-24 00:36:26 +08:00

nyxsonsleep

2022-06-24 00:36:26 +08:00

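Use multiple threads for the download.
Processing? I don't follow at all. This is a file download; the downloaded file only has to be identical to the original, so what is there to process?
Even with 4 threads, the overhead of merging the parts is only a few milliseconds.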
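
A minimal sketch of the multi-threaded download suggested in this reply, assuming the server honors HTTP `Range` requests and the file size is already known (for example from a `Content-Length` header). The names `split_ranges`, `http_fetch_range`, and `download_in_parts` are hypothetical helpers, not part of any library. Whether this actually helps depends on where the 5 seconds go: it can help when a single connection is the bottleneck, but probably not when the server or the link itself is slow.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_ranges(total_size: int, parts: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into at most `parts` inclusive (start, end) byte ranges."""
    if total_size <= 0 or parts <= 0:
        return []
    chunk = -(-total_size // parts)  # ceiling division
    return [(start, min(start + chunk, total_size) - 1)
            for start in range(0, total_size, chunk)]


def http_fetch_range(url: str) -> Callable[[int, int], bytes]:
    """Build a fetcher that downloads one byte range with the HTTP Range header."""
    def fetch(start: int, end: int) -> bytes:
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            if resp.status != 206:  # 206 Partial Content means the range was honored
                raise RuntimeError("server did not honor the Range request")
            return resp.read()
    return fetch


def download_in_parts(fetch_range: Callable[[int, int], bytes],
                      total_size: int,
                      parts: int = 4) -> bytes:
    """Fetch all ranges in parallel threads and join them in order."""
    ranges = split_ranges(total_size, parts)
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        # pool.map yields results in input order, so merging is a plain join.
        chunks = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(chunks)
```

With a real URL the call would look like `download_in_parts(http_fetch_range(url), total_size)`. The merge step is a single `bytes` join, which is consistent with the reply's point that merging costs very little compared with the transfer itself.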