This notebook covers how to get started with UpstageLayoutAnalysisLoader.


Install the langchain-upstage package.

pip install -U langchain-upstage

Environment Setup

Make sure to set the following environment variable: UPSTAGE_API_KEY, your Upstage API key.

The previously used UPSTAGE_DOCUMENT_AI_API_KEY variable is deprecated. A key that was stored in UPSTAGE_DOCUMENT_AI_API_KEY still works when set in UPSTAGE_API_KEY.
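Existing environments may still define only the deprecated variable. The following is a minimal sketch of a fallback that copies the old value into the new name; the helper `migrate_api_key` is hypothetical and is not part of langchain-upstage.

```python
import os
from typing import MutableMapping

NEW_VAR = "UPSTAGE_API_KEY"
OLD_VAR = "UPSTAGE_DOCUMENT_AI_API_KEY"  # deprecated name


def migrate_api_key(env: MutableMapping[str, str] = os.environ) -> bool:
    """Copy the deprecated variable into NEW_VAR when NEW_VAR is unset.

    Hypothetical helper for illustration. Returns True if a key is
    available under NEW_VAR afterwards, False otherwise.
    """
    if not env.get(NEW_VAR) and env.get(OLD_VAR):
        # Reuse the old key under the new name; an existing NEW_VAR wins.
        env[NEW_VAR] = env[OLD_VAR]
    return bool(env.get(NEW_VAR))


if __name__ == "__main__":
    if not migrate_api_key():
        print(f"Set {NEW_VAR} before creating the loader.")
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dictionary instead of the real process environment.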


import os

os.environ["UPSTAGE_API_KEY"] = "YOUR_API_KEY"

from langchain_upstage import UpstageLayoutAnalysisLoader

file_path = "/PATH/TO/YOUR/FILE.pdf"
layzer = UpstageLayoutAnalysisLoader(file_path, split="page")

# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

# Print the first three documents (one per page, because split="page").
for doc in docs[:3]:
    print(doc)
page_content='SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective\nDepth Up-Scaling Dahyun Kim* , Chanjun Park*1, Sanghoon Kim*+, Wonsung Lee*†, Wonho Song*\nYunsu Kim* , Hyeonwoo Kim* , Yungi Kim, Hyeonju Lee, Jihoo Kim\nChangbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim\nMikyoung Cha, Hwalsuk Leet , Sunghun Kim+ Upstage AI, South Korea {kdahyun, chan jun · park, limerobot, wonsung · lee, hwalsuk lee, hunkim} @ upstage · ai Abstract We introduce SOLAR 10.7B, a large language\nmodel (LLM) with 10.7 billion parameters,\ndemonstrating superior performance in various\nnatural language processing (NLP) tasks. In-\nspired by recent efforts to efficiently up-scale\nLLMs, we present a method for scaling LLMs\ncalled depth up-scaling (DUS), which encom-\npasses depthwise scaling and continued pre-\ntraining. In contrast to other LLM up-scaling\nmethods that use mixture-of-experts, DUS does\nnot require complex changes to train and infer-\nence efficiently. We show experimentally that\nDUS is simple yet effective in scaling up high-\nperformance LLMs from small ones. Building\non the DUS model, we additionally present SO-\nLAR 10.7B-Instruct, a variant fine-tuned for\ninstruction-following capabilities, surpassing\nMixtral-8x7B-Instruct. SOLAR 10.7B is pub-\nlicly available under the Apache 2.0 license,\npromoting broad access and application in the\nLLM field 1 1 Introduction The field of natural language processing (NLP)\nhas been significantly transformed by the introduc-\ntion of large language models (LLMs), which have\nenhanced our understanding and interaction with\nhuman language (Zhao et al., 2023). 
These ad-\nvancements bring challenges such as the increased\nneed to train ever larger models (Rae et al., 2021;\nWang et al., 2023; Pan et al., 2023; Lian, 2023;\nYao et al., 2023; Gesmundo and Maile, 2023) OW-\ning to the performance scaling law (Kaplan et al.,\n2020; Hernandez et al., 2021; Anil et al., 2023;\nKaddour et al., 2023). To efficiently tackle the\nabove, recent works in scaling language models\nsuch as a mixture of experts (MoE) (Shazeer et al.,\n2017; Komatsuzaki et al., 2022) have been pro-\nposed. While those approaches are able to effi- ciently and effectively scale-up LLMs, they often\nrequire non-trivial changes to the training and infer-\nence framework (Gale et al., 2023), which hinders\nwidespread applicability. Effectively and efficiently\nscaling up LLMs whilst also retaining the simplic-\nity for ease of use is an important problem (Alberts\net al., 2023; Fraiwan and Khasawneh, 2023; Sallam\net al., 2023; Bahrini et al., 2023). Inspired by Komatsuzaki et al. (2022), we\npresent depth up-scaling (DUS), an effective and\nefficient method to up-scale LLMs whilst also re-\nmaining straightforward to use. DUS consists of\nscaling the number of layers in the base model and\ncontinually pretraining the scaled model. Unlike\n(Komatsuzaki et al., 2022), DUS does not scale\nthe model using MoE and rather use a depthwise\nscaling method analogous to Tan and Le (2019)\nwhich is adapted for the LLM architecture. Thus,\nthere are no additional modules or dynamism as\nwith MoE, making DUS immediately compatible\nwith easy-to-use LLM frameworks such as Hug-\ngingFace (Wolf et al., 2019) with no changes to\nthe training or inference framework for maximal\nefficiency. Furthermore, DUS is applicable to all\ntransformer architectures, opening up new gate-\nways to effectively and efficiently scale-up LLMs\nin a simple manner. 
Using DUS, we release SO-\nLAR 10.7B, an LLM with 10.7 billion parameters,\nthat outperforms existing models like Llama 2 (Tou-\nvron et al., 2023) and Mistral 7B (Jiang et al., 2023)\nin various benchmarks. We have also developed SOLAR 10.7B-Instruct,\na variant fine-tuned for tasks requiring strict adher-\nence to complex instructions. It significantly out-\nperforms the Mixtral-8x7B-Instruct model across\nvarious evaluation metrics, evidencing an advanced\nproficiency that exceeds the capabilities of even\nlarger models in terms of benchmark performance. * Equal Contribution 1 Corresponding Author\nhttps : / /\nSOLAR-1 0 · 7B-v1 . 0 By releasing SOLAR 10.7B under the Apache\n2.0 license, we aim to promote collaboration and in-\nnovation in NLP. This open-source approach allows 2024\nApr\n4\n[cs.CL]\narxiv:2...117.7.13' metadata={'page': 1, 'type': 'text', 'split': 'page'}
page_content="Step 1-1 Step 1-2\nOutput Output Output\nOutput Output Output\n24 Layers 24Layers\nMerge\n8Layers\n---- 48 Layers\nCopy\n8 Layers Continued\n32Layers 32Layers\nPretraining\n24Layers\n24 Layers Input\nInput Input Input Input Input\nStep 1. Depthwise Scaling Step2. Continued Pretraining Figure 1: Depth up-scaling for the case with n = 32, s = 48, and m = 8. Depth up-scaling is achieved through a\ndual-stage process of depthwise scaling followed by continued pretraining. for wider access and application of these models\nby researchers and developers globally. 2 Depth Up-Scaling To efficiently scale-up LLMs, we aim to utilize pre-\ntrained weights of base models to scale up to larger\nLLMs (Komatsuzaki et al., 2022). While exist-\ning methods such as Komatsuzaki et al. (2022) use\nMoE (Shazeer et al., 2017) to scale-up the model ar-\nchitecture, we opt for a different depthwise scaling\nstrategy inspired by Tan and Le (2019). We then\ncontinually pretrain the scaled model as just scaling\nthe model without further pretraining degrades the\nperformance. Base model. Any n-layer transformer architec-\nture can be used but we select the 32-layer Llama\n2 architecture as our base model. We initialize the\nLlama 2 architecture with pretrained weights from\nMistral 7B, as it is one of the top performers com-\npatible with the Llama 2 architecture. By adopting\nthe Llama 2 architecture for our base model, we\naim to leverage the vast pool of community re-\nsources while introducing novel modifications to\nfurther enhance its capabilities. Depthwise scaling. From the base model with n\nlayers, we set the target layer count s for the scaled\nmodel, which is largely dictated by the available\nhardware. With the above, the depthwise scaling process\nis as follows. The base model with n layers is\nduplicated for subsequent modification. 
Then, we\nremove the final m layers from the original model\nand the initial m layers from its duplicate, thus\nforming two distinct models with n - m layers.\nThese two models are concatenated to form a scaled\nmodel with s = 2· (n-m) layers. Note that n = 32\nfrom our base model and we set s = 48 considering our hardware constraints and the efficiency of the\nscaled model, i.e., fitting between 7 and 13 billion\nparameters. Naturally, this leads to the removal of\nm = 8 layers. The depthwise scaling process with\nn = 32, s = 48, and m = 8 is depicted in 'Step 1:\nDepthwise Scaling' of Fig. 1. We note that a method in the community that also\n2 'Step 1:\nscale the model in the same manner as\nDepthwise Scaling' of Fig. 1 has been concurrently\ndeveloped. Continued pretraining. The performance of the\ndepthwise scaled model initially drops below that\nof the base LLM. Thus, we additionally apply\nthe continued pretraining step as shown in 'Step\n2: Continued Pretraining' of Fig. 1. Experimen-\ntally, we observe rapid performance recovery of\nthe scaled model during continued pretraining, a\nphenomenon also observed in Komatsuzaki et al.\n(2022). We consider that the particular way of\ndepthwise scaling has isolated the heterogeneity\nin the scaled model which allowed for this fast\nperformance recovery. Delving deeper into the heterogeneity of the\nscaled model, a simpler alternative to depthwise\nscaling could be to just repeat its layers once more,\ni.e., from n to 2n layers. Then, the 'layer distance',\nor the difference in the layer indices in the base\nmodel, is only bigger than 1 where layers n and\nn + 1 are connected, i.e., at the seam. However, this results in maximum layer distance\nat the seam, which may be too significant of a\ndiscrepancy for continued pretraining to quickly\nresolve. 
Instead, depthwise scaling sacrifices the\n2m middle layers, thereby reducing the discrep-\nancy at the seam and making it easier for continued 2https : / /huggingface · co/Undi 95/\nMistral-11B-v0 · 1" metadata={'page': 2, 'type': 'text', 'split': 'page'}
page_content="Properties Instruction Training Datasets Alignment\n Alpaca-GPT4 OpenOrca Synth. Math-Instruct Orca DPO Pairs Ultrafeedback Cleaned Synth. Math-Alignment\n Total # Samples 52K 2.91M 126K 12.9K 60.8K 126K\n Maximum # Samples Used 52K 100K 52K 12.9K 60.8K 20.1K\n Open Source O O X O O Table 1: Training datasets used for the instruction and alignment tuning stages, respectively. For the instruction\ntuning process, we utilized the Alpaca-GPT4 (Peng et al., 2023), OpenOrca (Mukherjee et al., 2023), and Synth.\nMath-Instruct datasets, while for the alignment tuning, we employed the Orca DPO Pairs (Intel, 2023), Ultrafeedback\nCleaned (Cui et al., 2023; Ivison et al., 2023), and Synth. Math-Alignment datasets. The 'Total # Samples indicates\nthe total number of samples in the entire dataset. The 'Maximum # Samples Used' indicates the actual maximum\nnumber of samples that were used in training, which could be lower than the total number of samples in a given\ndataset. 'Open Source' indicates whether the dataset is open-sourced. pretraining to quickly recover performance. We\nattribute the success of DUS to reducing such dis-\ncrepancies in both the depthwise scaling and the\ncontinued pretraining steps. We also hypothesize\nthat other methods of depthwise scaling could also\nwork for DUS, as long as the discrepancy in the\nscaled model is sufficiently contained before the\ncontinued pretraining step. Comparison to other up-scaling methods. Un-\nlike Komatsuzaki et al. (2022), depthwise scaled\nmodels do not require additional modules like gat-\ning networks or dynamic expert selection. Conse-\nquently, scaled models in DUS do not necessitate\na distinct training framework for optimal training\nefficiency, nor do they require specialized CUDA\nkernels for fast inference. A DUS model can seam-\nlessly integrate into existing training and inference\nframeworks while maintaining high efficiency. 
3 Training Details After DUS, including continued pretraining, we\nperform fine-tuning of SOLAR 10.7B in two stages:\n1) instruction tuning and 2) alignment tuning. Instruction tuning. In the instruction tuning\nstage, the model is trained to follow instructions in\na QA format (Zhang et al., 2023). We mostly use\nopen-source datasets but also synthesize a math QA\ndataset to enhance the model's mathematical capa-\nbilities. A rundown of how we crafted the dataset is\nas follows. First, seed math data are collected from\nthe Math (Hendrycks et al., 2021) dataset only, to\navoid contamination with commonly used bench-\nmark datasets such as GSM8K (Cobbe et al., 2021).\nThen, using a process similar to MetaMath (Yu\net al., 2023), we rephrase the questions and an-\nswers of the seed math data. We use the resulting\nrephrased question-answer pairs as a QA dataset and call it 'Synth. Math-Instruct*. Alignment tuning. In the alignment tuning stage,\nthe instruction-tuned model is further fine-tuned\nto be more aligned with human or strong AI\n(e.g., GPT4 (OpenAI, 2023)) preferences using\nsDPO (Kim et al., 2024a), an improved version\nof direct preference optimization (DPO) (Rafailov\net al., 2023). Similar to the instruction tuning stage,\nwe use mostly open-source datasets but also syn-\nthesize a math-focused alignment dataset utilizing\nthe 'Synth. Math-Instruct' dataset mentioned in the\ninstruction tuning stage. The alignment data synthesis process is as\nfollows. We take advantage of the fact that\nthe rephrased question-answer pairs in Synth.\nMath-Instruct data are beneficial in enhancing the\nmodel's mathematical capabilities (see Sec. 4.3.1).\nThus, we speculate that the rephrased answer to the\nrephrased question is a better answer than the orig-\ninal answer, possibly due to the interim rephrasing\nstep. 
Consequently, we set the rephrased question\nas the prompt and use the rephrased answer as the\nchosen response and the original answer as the re-\njected response and create the {prompt, chosen,\nrejected} DPO tuple. We aggregate the tuples from\nthe rephrased question-answer pairs and call the\nresulting dataset 'Synth. Math-Alignment*. 4 Results 4.1 Experimental Details Training datasets. We present details regarding\nour training datasets for the instruction and align-\nment tuning stages in Tab. 1. We do not always\nuse the entire dataset and instead subsample a set\namount. Note that most of our training data is\nopen-source, and the undisclosed datasets can be\nsubstituted for open-source alternatives such as the" metadata={'page': 3, 'type': 'text', 'split': 'page'}
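The comment in the example above suggests lazy_load for better memory efficiency. As a minimal sketch of that pattern, the hypothetical helper `first_n_documents` below pulls only the first few documents from any object that exposes a `lazy_load()` generator, instead of building the full list with `load()`. `FakeLoader` is a stand-in invented for illustration so the sketch runs without an API key; with a real key you would pass the `layzer` object instead.

```python
from itertools import islice
from typing import Any, Iterator, List


def first_n_documents(loader: Any, n: int) -> List[Any]:
    """Return at most n documents, consuming loader.lazy_load() lazily."""
    # islice stops pulling from the generator once n items were produced.
    return list(islice(loader.lazy_load(), n))


class FakeLoader:
    """Illustrative stand-in for a document loader (not a real API)."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.yielded = 0  # counts how many pages were actually produced

    def lazy_load(self) -> Iterator[str]:
        for page in self.pages:
            self.yielded += 1
            yield page


if __name__ == "__main__":
    loader = FakeLoader(["page 1", "page 2", "page 3", "page 4", "page 5"])
    for doc in first_n_documents(loader, 3):
        print(doc)
```

Note that slicing such as `docs[:3]` only works on the list returned by `load()`; an iterator returned by `lazy_load()` has to be consumed with a loop or a tool like `itertools.islice`.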